
CHAPTER 4 Resolutions

Chapter 1

1.

a) Please see text in Section 1.2.

b) Please see text in Section 1.2.

2. Please see text in Section

3. This is a very simple exercise.

a) The net input is just net = 0.5. The corresponding output is $y = \frac{1}{1+e^{-0.5}} = 0.622$.

b) To obtain the input patterns you can use the Matlab function inp=randn(10,2). You can also obtain the values by downloading the file inp.mat. The Matlab function required is in singneur.m.
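A minimal sketch of what such a computation might look like (the weight vector w and bias b below are arbitrary illustrative values, not the ones used in singneur.m):

inp = randn(10,2);          % 10 input patterns with 2 features each
w = [1; -1];                % illustrative weight vector (assumption)
b = 0.5;                    % illustrative bias (assumption)
net = inp*w + b;            % net input of the single neuron for every pattern
y = 1./(1+exp(-net));       % sigmoid output for each pattern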

4. Consider the Gaussian activation function

$f_i(C_i,\sigma_i) = e^{-\frac{\sum_{k=1}^{n}(C_{i,k}-x_k)^2}{2\sigma_i^2}}$

a) The derivative, $\frac{\partial f_i}{\partial\sigma_i}$, of this activation function, with respect to the standard deviation, is:

$\frac{\partial f_i}{\partial\sigma_i} = e^{-\frac{\sum_{k=1}^{n}(C_{i,k}-x_k)^2}{2\sigma_i^2}} \cdot \left(-\frac{\sum_{k=1}^{n}(C_{i,k}-x_k)^2}{2}\right)\cdot\left(\sigma_i^{-2}\right)' = -f_i\cdot\frac{\sum_{k=1}^{n}(C_{i,k}-x_k)^2}{2}\cdot\left(-2\sigma_i^{-3}\right) = f_i\cdot\frac{\sum_{k=1}^{n}(C_{i,k}-x_k)^2}{\sigma_i^3}$

b) The derivative, $\frac{\partial f_i}{\partial C_{i,k}}$, of this activation function, with respect to its kth centre, $C_{i,k}$, is:

$\frac{\partial f_i}{\partial C_{i,k}} = e^{-\frac{\sum_{k=1}^{n}(C_{i,k}-x_k)^2}{2\sigma_i^2}}\cdot\left(-\frac{1}{2\sigma_i^2}\right)\cdot\left(\sum_{k=1}^{n}(C_{i,k}-x_k)^2\right)'_{C_{i,k}} = -f_i\cdot\frac{1}{2\sigma_i^2}\cdot 2(C_{i,k}-x_k) = -f_i\cdot\frac{C_{i,k}-x_k}{\sigma_i^2}$


c) The derivative, $\frac{\partial f_i}{\partial x_k}$, of this activation function, with respect to the kth input, $x_k$, is:

$\frac{\partial f_i}{\partial x_k} = e^{-\frac{\sum_{k=1}^{n}(C_{i,k}-x_k)^2}{2\sigma_i^2}}\cdot\left(-\frac{1}{2\sigma_i^2}\right)\cdot(-2)(C_{i,k}-x_k) = f_i\cdot\frac{C_{i,k}-x_k}{\sigma_i^2}$
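A quick numerical sanity check of these three derivatives, using central finite differences (the centre, spread and input values below are arbitrary illustrative choices):

C = [0.2 -0.3];  x = [0.5 0.1];  sigma = 0.8;   % illustrative centre, input and spread
f  = @(C,x,s) exp(-sum((C-x).^2)/(2*s^2));      % Gaussian activation function
h  = 1e-6;
dfdsigma     = f(C,x,sigma)*sum((C-x).^2)/sigma^3;          % analytical derivative w.r.t. sigma
dfdsigma_num = (f(C,x,sigma+h)-f(C,x,sigma-h))/(2*h);       % finite-difference approximation
dfdC1        = -f(C,x,sigma)*(C(1)-x(1))/sigma^2;           % analytical derivative w.r.t. C(1)
dfdC1_num    = (f(C+[h 0],x,sigma)-f(C-[h 0],x,sigma))/(2*h);
dfdx1        =  f(C,x,sigma)*(C(1)-x(1))/sigma^2;           % analytical derivative w.r.t. x(1)
dfdx1_num    = (f(C,x+[h 0],sigma)-f(C,x-[h 0],sigma))/(2*h);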

5. The recursive definition of a B-spline function is:

$N_k^j(x) = \frac{x-\lambda_{j-k}}{\lambda_{j-1}-\lambda_{j-k}}N_{k-1}^{j-1}(x) + \frac{\lambda_j-x}{\lambda_j-\lambda_{j-k+1}}N_{k-1}^j(x)$

a) By definition, $N_1^j(x) = \begin{cases}1, & x\in I_j\\ 0, & \text{otherwise}\end{cases}$

b) $N_2^j(x) = \frac{x-\lambda_{j-2}}{\lambda_{j-1}-\lambda_{j-2}}N_1^{j-1}(x) + \frac{\lambda_j-x}{\lambda_j-\lambda_{j-1}}N_1^j(x)$

We now have to determine the values of $N_1^{j-1}(x)$ and $N_1^j(x)$.

If $x\in I_j$, then $N_1^{j-1}(x)=0$ and $N_1^j(x)=1$, so $N_2^j(x) = \frac{\lambda_j-x}{\lambda_j-\lambda_{j-1}}$.

If $x\in I_{j-1}$, then $N_1^{j-1}(x)=1$ and $N_1^j(x)=0$, so $N_2^j(x) = \frac{x-\lambda_{j-2}}{\lambda_{j-1}-\lambda_{j-2}}$.

Splines of order 2 can be seen in fig. 1.13 b).

c) $N_3^j(x) = \frac{x-\lambda_{j-3}}{\lambda_{j-1}-\lambda_{j-3}}N_2^{j-1}(x) + \frac{\lambda_j-x}{\lambda_j-\lambda_{j-2}}N_2^j(x)$

We now have to find out the values of $N_2^{j-1}(x)$ and $N_2^j(x)$. We have done that above, and:

$N_2^{j-1}(x) = \begin{cases}\frac{x-\lambda_{j-3}}{\lambda_{j-2}-\lambda_{j-3}}, & x\in I_{j-2}\\[2pt] \frac{\lambda_{j-1}-x}{\lambda_{j-1}-\lambda_{j-2}}, & x\in I_{j-1}\end{cases}$, $\quad N_2^j(x) = \begin{cases}\frac{x-\lambda_{j-2}}{\lambda_{j-1}-\lambda_{j-2}}, & x\in I_{j-1}\\[2pt] \frac{\lambda_j-x}{\lambda_j-\lambda_{j-1}}, & x\in I_j\end{cases}$

Replacing the last two equations, we have:

$N_3^j(x) = \begin{cases}\frac{x-\lambda_{j-3}}{\lambda_{j-1}-\lambda_{j-3}}\cdot\frac{x-\lambda_{j-3}}{\lambda_{j-2}-\lambda_{j-3}}, & x\in I_{j-2}\\[2pt] \frac{x-\lambda_{j-3}}{\lambda_{j-1}-\lambda_{j-3}}\cdot\frac{\lambda_{j-1}-x}{\lambda_{j-1}-\lambda_{j-2}} + \frac{\lambda_j-x}{\lambda_j-\lambda_{j-2}}\cdot\frac{x-\lambda_{j-2}}{\lambda_{j-1}-\lambda_{j-2}}, & x\in I_{j-1}\\[2pt] \frac{\lambda_j-x}{\lambda_j-\lambda_{j-2}}\cdot\frac{\lambda_j-x}{\lambda_j-\lambda_{j-1}}, & x\in I_j\end{cases}$

In a more compact form, we have:

$N_3^j(x) = \begin{cases}\frac{(x-\lambda_{j-3})^2}{(\lambda_{j-1}-\lambda_{j-3})(\lambda_{j-2}-\lambda_{j-3})}, & x\in I_{j-2}\\[2pt] \frac{x-\lambda_{j-3}}{\lambda_{j-1}-\lambda_{j-3}}\cdot\frac{\lambda_{j-1}-x}{\lambda_{j-1}-\lambda_{j-2}} + \frac{\lambda_j-x}{\lambda_j-\lambda_{j-2}}\cdot\frac{x-\lambda_{j-2}}{\lambda_{j-1}-\lambda_{j-2}}, & x\in I_{j-1}\\[2pt] \frac{(\lambda_j-x)^2}{(\lambda_j-\lambda_{j-2})(\lambda_j-\lambda_{j-1})}, & x\in I_j\end{cases}$

Assuming that the knots are equidistant, and that every interval has width $\Delta$, we have:

$N_3^j(x) = \begin{cases}\frac{(x-\lambda_{j-3})^2}{2\Delta^2}, & x\in I_{j-2}\\[2pt] \frac{(x-\lambda_{j-3})(\lambda_{j-1}-x) + (\lambda_j-x)(x-\lambda_{j-2})}{2\Delta^2}, & x\in I_{j-1}\\[2pt] \frac{(\lambda_j-x)^2}{2\Delta^2}, & x\in I_j\end{cases}$

Splines of order 3 can be seen in fig. 1.13 c).
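A small Matlab sketch that evaluates these splines through the recursive definition, for equidistant knots (the knot placement $\lambda_j = j\Delta$ with $\Delta = 1$ is an illustrative choice):

Delta = 1;  lam = @(j) j*Delta;                        % equidistant knots, lambda_j = j*Delta (illustrative)
N1 = @(j,x) double(x >= lam(j-1) & x < lam(j));        % order 1: indicator of I_j = [lambda_{j-1}, lambda_j)
N2 = @(j,x) (x-lam(j-2))./(lam(j-1)-lam(j-2)).*N1(j-1,x) + (lam(j)-x)./(lam(j)-lam(j-1)).*N1(j,x);
N3 = @(j,x) (x-lam(j-3))./(lam(j-1)-lam(j-3)).*N2(j-1,x) + (lam(j)-x)./(lam(j)-lam(j-2)).*N2(j,x);
x = 2.3;  j = 3;                                       % a test point inside I_j = [2,3)
[N3(j,x)  (lam(j)-x).^2/(2*Delta^2)]                   % recursion versus the closed form for x in I_j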

6. Please see text in Section 1.3.3.

7. The input vector is:

$x = [-1, -0.5, 0, 0.5, 1]$

and the desired target vector is:

$t = [1, 0.25, 0, 0.25, 1]$.

a) The training criterion is:

$\Omega = \frac{\sum_{i=1}^{5}e^2[i]}{2}$, where $e = t - y$. The output vector, y, is, in this case:

$y = xw_2 + w_1$, and therefore:

$\Omega = \left[(1-(w_1-w_2))^2 + (0.25-(w_1-0.5w_2))^2 + (0-w_1)^2 + (0.25-(w_1+0.5w_2))^2 + (1-(w_1+w_2))^2\right]/2$.

The gradient vector, in general form, is:

$g = \begin{bmatrix}\frac{\partial\Omega}{\partial w_1}\\ \frac{\partial\Omega}{\partial w_2}\end{bmatrix} = \begin{bmatrix}-e_1-e_2-e_3-e_4-e_5\\ e_1+0.5e_2+0-0.5e_4-e_5\end{bmatrix}$

For the point [0,0], the output is null, so e = t, and it is:

$g(0,0) = \begin{bmatrix}-(1+0.25+0+0.25+1)\\ 1+0.125+0-0.125-1\end{bmatrix} = \begin{bmatrix}-2.5\\ 0\end{bmatrix}$.
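A quick Matlab check of this gradient, at the point [0,0] (straightforward to adapt to any other weight vector):

x = [-1 -0.5 0 0.5 1];  t = [1 0.25 0 0.25 1];
w = [0; 0];                          % [w1; w2]
y = w(1) + w(2)*x;                   % linear model output
e = t - y;                           % error vector
g = [-sum(e); -sum(e.*x)]            % gradient of 0.5*sum(e.^2), gives [-2.5; 0]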

b) For each pattern p, the correlation matrix is:

$W = I_p t_p^T$.

For the 5 input patterns, we have (we have just 1 output, and therefore a weight vector):

$W = \begin{bmatrix}1\\-1\end{bmatrix}\cdot 1 + \begin{bmatrix}1\\-0.5\end{bmatrix}\cdot 0.25 + \begin{bmatrix}1\\0\end{bmatrix}\cdot 0 + \begin{bmatrix}1\\0.5\end{bmatrix}\cdot 0.25 + \begin{bmatrix}1\\1\end{bmatrix}\cdot 1 = \begin{bmatrix}2.5\\0\end{bmatrix}$

Chapter 2

8. A decision surface which is an hyperplane, such as the one represented in the next figure, separates data into two classes:

[Figure: two classes, C1 and C2, separated by the hyperplane $w_1x_1 + w_2x_2 - \theta = 0$]

If it is possible to define an hyperplane that separates the data into two classes (i.e., if it is possible to determine a weight vector w that accomplishes this), then the data is said to be linearly separable.

[Figure: the two classes of the XOR problem, which cannot be separated by the line $w_1x_1 + w_2x_2 - \theta = 0$]

The above figure illustrates the 2 classes of an XOR problem. There is no straight line that can separate the circles from the crosses. Therefore, the XOR problem is not linearly separable.

9. In an Adaline, the input and output variables are bipolar (-1, +1), while in a Perceptron the inputs and outputs are 0 or 1. The major difference, however, lies in the learning algorithm, which in the case of the Adaline is the LMS algorithm, and in the Perceptron is the Perceptron Learning Rule. Also, in an Adaline the error is computed at the net input (neti), and not at the output, as in a Perceptron. Therefore, in an Adaline, the error is not limited to the discrete values -1, 0, 1, as in the normal perceptron, but can take any real value.

10. Consider the figure below:

[Figure: classes C1 and C2 separated by the decision line $w_1x_1 + w_2x_2 - \theta = 0$]

The AND function has the following truth table:

Table 4.172 - AND truth table

I1  I2  AND
0   0   0
0   1   0
1   0   0
1   1   1

This means that if we design a line passing through the points (0,1) and (1,0), and translate this line so that it stays in the middle of these points and the point (1,1), we have a decision boundary that is able to classify data according to the AND function.

The line that passes through (0,1) and (1,0) is given by:

$x_1 + x_2 - 1 = 0$,

which means that $w_1 = w_2 = \theta = 1$. Any value of $\theta$ satisfying $1 < \theta < 2$ will do the job.
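A quick Matlab check of this decision boundary, taking for instance $\theta = 1.5$ (any value in the stated range would do):

I = [0 0; 0 1; 1 0; 1 1];        % the four input patterns
w = [1; 1];  theta = 1.5;        % weights and an illustrative threshold in (1,2)
out = (I*w - theta) > 0          % reproduces the AND column of the truth table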

11. Please see Ex. 2.4.

12. The exclusive OR function can be implemented as:

$X \oplus Y = \bar{X}Y + X\bar{Y}$

Therefore, we need two AND functions and one OR function. To implement the first AND function, if the sign of the 1st weight is changed, and the 3rd weight is changed to $\theta + w_1$, then the original AND function implements the function $\bar{x}y$ (please see Ex. 2.4).

Using the same reasoning, if the sign of weight 2 is changed, and the 3rd weight is changed to $\theta + w_2$, then the original AND function implements the function $x\bar{y}$.

Finally, if the perceptron implementing the OR function is employed, with the outputs of the previous perceptrons as inputs, the XOR problem is solved.

Then, the implementation of the function $f(x_1, x_2, x_3) = x_1 \wedge (x_2 \oplus x_3)$ uses just the Adaline that implements the AND function, with inputs $x_1$ and the output of the XOR function.

13. Please see Section 2.1.2.2.

14. Assume that you have a network with just one hidden layer (the proof can be easily

extended to more than 1 hidden layer).

The output of the first hidden layer, for pattern p, can be given as:

$O_{p,.}^{(2)} = W^{(1)} O_{p,.}^{(1)}$, as the activation functions are linear.

In the same way, the output of the network is given by:

$O_{p,.}^{(3)} = W^{(2)} O_{p,.}^{(2)}$. Combining the two equations, we stay with:

$O_{p,.}^{(3)} = W^{(2)} W^{(1)} O_{p,.}^{(1)} = W O_{p,.}^{(1)}$.

Therefore, a one-hidden-layer network with linear activation functions is equivalent to a neural network with no hidden layers.

15. Let us consider $\Omega_l = \frac{\|t - Aw\|_2^2}{2}$. Let us compute the square:

$\Omega_l = \frac{(t-Aw)^T(t-Aw)}{2} = \frac{t^Tt - t^TAw - w^TA^Tt + w^TA^TAw}{2} = \frac{t^Tt - 2t^TAw + w^TA^TAw}{2}$

Please note that all the terms in the numerator of the two last fractions are scalar.

a) Let us compute $g_l = \frac{d\Omega_l}{dw^T}$. The derivative of the first term in the numerator of the last equation is null, as it does not depend on w. $t^TA$ is a row vector, and so the next term in the numerator is a dot product (if we denote $t^TA$ as $x^T$, the dot product is:

$t^TAw = x_1w_1 + x_2w_2 + \ldots + x_nw_n$).

Therefore

$\frac{d}{dw_1}(t^TAw) = x_1, \;\ldots,\; \frac{d}{dw_n}(t^TAw) = x_n$, i.e. $\frac{d}{dw^T}(t^TAw) = \begin{bmatrix}x_1\\ \vdots\\ x_n\end{bmatrix} = (t^TA)^T = A^Tt$.

Concentrating now on the derivative of the last term, $A^TA$ is a square symmetric matrix. Let us consider a 2×2 matrix denoted as C:

$w^TA^TAw = w^TCw = \begin{bmatrix}w_1 & w_2\end{bmatrix}\begin{bmatrix}C_{1,1} & C_{1,2}\\ C_{2,1} & C_{2,2}\end{bmatrix}\begin{bmatrix}w_1\\ w_2\end{bmatrix} = w_1^2C_{1,1} + w_1w_2C_{2,1} + w_2w_1C_{1,2} + w_2^2C_{2,2}$

Then the derivative is just:

$\frac{d}{dw^T}\left(w^TA^TAw\right) = \begin{bmatrix}2w_1C_{1,1} + w_2C_{2,1} + w_2C_{1,2}\\ w_1C_{2,1} + w_1C_{1,2} + 2w_2C_{2,2}\end{bmatrix}$

As $C_{1,2} = C_{2,1}$, we finally have:

$\frac{d}{dw^T}\left(w^TA^TAw\right) = \begin{bmatrix}2w_1C_{1,1} + 2w_2C_{2,1}\\ 2w_1C_{2,1} + 2w_2C_{2,2}\end{bmatrix} = 2Cw$

Putting it all together, $g_l = \frac{d\Omega_l}{dw^T}$ is:

$g_l = -A^Tt + A^TAw = -A^Tt + A^Ty = -A^Te$.

b) The minimum of $\Omega_l$ is given by $g_l = 0$. Doing that, we stay with:

$0 = -A^Tt + A^TAw \;\Rightarrow\; w = (A^TA)^{-1}A^Tt$.

c) Consider an augmented matrix A, $\tilde{A} = \begin{bmatrix}A\\ \sqrt{\lambda}I\end{bmatrix}$, and an augmented vector t, $\tilde{t} = \begin{bmatrix}t\\ 0\end{bmatrix}$. Then:

$\frac{\|\tilde{t}-\tilde{A}w\|^2}{2} = \frac{t^Tt - 2t^TAw + w^T\tilde{A}^T\tilde{A}w}{2} = \frac{t^Tt - 2t^TAw + w^T(A^TA+\lambda I)w}{2} = \frac{t^Tt - 2t^TAw + w^TA^TAw + \lambda w^Tw}{2} = \frac{\|t-Aw\|^2 + \lambda\|w\|^2}{2} = \varphi_l$

Then, all the results above can be employed, by replacing A with $\tilde{A}$ and t with $\tilde{t}$. Therefore, the gradient is:

$g_{l\varphi} = -\tilde{A}^T\tilde{t} + \tilde{A}^T\tilde{A}w = -A^Tt + (A^TA+\lambda I)w = -A^Tt + A^TAw + \lambda w = g_l + \lambda w$.

Notice that the gradient can also be formulated as the negative of the product of the transpose of the Jacobean and the error vector:

$g_{l\varphi} = -\tilde{A}^T\tilde{t} + \tilde{A}^T\tilde{A}w = -\tilde{A}^T\tilde{t} + \tilde{A}^T\tilde{y} = -\tilde{A}^T\tilde{e}$, where:

$\tilde{e} = \tilde{t} - \tilde{y}$, $\quad\tilde{y} = \tilde{A}w = \begin{bmatrix}A\\ \sqrt{\lambda}I\end{bmatrix}w = \begin{bmatrix}y\\ \sqrt{\lambda}w\end{bmatrix}$.

The optimum is therefore:

$0 = -A^Tt + (A^TA+\lambda I)w \;\Rightarrow\; w = (A^TA+\lambda I)^{-1}A^Tt$.

16.

a) Please see Section 2.1.3.1.

b) The error back-propagation is a computationally efficient algorithm, but, since it implements a steepest descent method, it is unreliable and can have a very slow rate of convergence. It is also difficult to select appropriate values of the learning parameter. For more details please see Section 2.1.3.3 and Section 2.1.3.4.

The problem related with lack of convergence can be solved by incorporating a line-search algorithm, to guarantee that the training criterion does not increase in any iteration. To have a faster convergence rate, second-order methods can be used. It is proved in Section 2.1.3.5 that the Levenberg-Marquardt algorithm is the best technique to use, and it does not employ a learning rate parameter.
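As an illustration of the closed-form solutions derived in 15 b) and c) above, a small Matlab sketch (A and t are random illustrative data):

A = randn(20,3);  t = randn(20,1);      % illustrative data matrix and target
w_ls  = (A'*A)\(A'*t);                  % unregularized least-squares solution
lambda = 0.1;                           % illustrative regularization parameter
w_reg = (A'*A + lambda*eye(3))\(A'*t);  % regularized solution, (A'A + lambda*I)^-1 A't
% the same regularized solution via the augmented system of 15 c)
At = [A; sqrt(lambda)*eye(3)];  tt = [t; zeros(3,1)];
w_aug = (At'*At)\(At'*tt);              % equals w_reg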

17. The sigmoid function ($f_1(x)$) is covered in Section 1.3.1.2.4 and the hyperbolic tangent function ($f_2(x)$) in Section 1.3.1.2.5¹. Notice that these functions are related as $f_2(x) = 2f_1(2x) - 1$. The advantages of using an hyperbolic tangent function over a sigmoid function are:

1. The hyperbolic tangent function generates a better conditioned model. Notice that a MLP with a linear function in the output layer always has a column of ones in the Jacobean matrix (related with the output bias). As the Jacobean columns related with the weights from the last hidden layer to the output layer are a linear function of the outputs of the last hidden layer, and as the mean of an hyperbolic tangent function is 0, while the mean of a sigmoid function is 1/2, in this latter case those Jacobean columns are more correlated with the Jacobean column related with the output bias;

2. The derivative of the sigmoid function lies within [0, 0.25]; its expected value, considering an uniform probability density function at the output of the node, is 1/6. For a hyperbolic tangent function, its derivative lies within [0, 1] and its expected value is 2/3. When we compute the Jacobean matrix, one of the factors involved in the computation is $\frac{\partial O_{i,.}^{(z+1)}}{\partial Net_{i,.}^{(z+1)}}$ (see (2.42)). Therefore, in comparison with weights related with the linear output layer, the columns of the Jacobean matrix related with the nonlinear layers appear "squashed" by a mean factor of 1/6, for the sigmoid function, and a factor of 2/3, for the hyperbolic tangent function. This "squashing" is translated into smaller eigenvalues, which itself is translated into a slow rate of convergence, as the rate of convergence is related with the smaller eigenvalues of the normal equation matrix (see Section 2.1.3.3.2). As this "squashing" is smaller for the case of the hyperbolic tangent function, a network with these activation functions has potentially a faster rate of convergence.

¹ If we consider $f(x) = \tanh(x)$, then $f'(x) = 1 - \tanh^2(x) = 1 - f(x)^2$.

18. We shall start with the pH problem. Using the same topology ([4 4 1]) and the same initial values, the only difference in the code is to change, in the Matlab file ThreeLay.m, the instructions:

Y1=ones(np,NNP(1))./(1+exp(-X1));

Der1=Y1.*(1-Y1);

Y2=ones(np,NNP(2))./(1+exp(-X2));

Der2=Y2.*(1-Y2);

by the following instructions:

Y1=tanh(X1);

Der1=1-Y1.^2;

Y2=tanh(X2);

Der2=1-Y2.^2;

then the following results are obtained using BP.m:

[Figure: error norm versus iteration for the pH problem, for learning rates neta=0.005 and neta=0.001]

Comparing these results with the ones shown in fig. 2.18, it can be seen that a better accuracy has been obtained.

Addressing now the Inverse Coordinate problem, using the same topology ([5 1]) and the same initial values, and changing only the instructions related with layer 1 (see above) in TwoLayer.m, the following results are obtained:

[Figure: error norm versus iteration for the Inverse Coordinate problem, for learning rates neta=0.005 and neta=0.001]

Again, better accuracy results are obtained using the hyperbolic tangent function (compare this figure with fig. 2.23). It should be mentioned that smaller learning rates than the ones used with the sigmoid function had to be applied, as the training process diverged.

19. The error back-propagation is a computationally efficient algorithm, but, since it implements a steepest descent method, it is unreliable and can have a very slow rate of convergence. It is also difficult to select appropriate values of the learning parameter. The Levenberg-Marquardt method is the "state-of-the-art" technique in non-linear least-squares problems. It guarantees convergence to a local minimum, and usually the rate of convergence is second-order. Also, it does not require any user-defined parameter, such as the learning rate. Its disadvantage is that, computationally, it is a more demanding algorithm.

20. Please see text in Section 2.1.3.6.

21. Please see text in Section 2.1.3.4.

22. Use the following Matlab code:

x=randn(10,5); % Matrix with 10*5 random elements following a normal distribution

cond(x); %The condition number of the original matrix

Now use the following code:

alfa(1)=10;

for i=1:3

x1=[x(:,1:4)/alfa(i) x(:,5)]; %The first four columns are divided by alfa

c(i)=cond(x1);

alfa(i+1)=alfa(i)*10; % alfa will have the values of 10, 100 and 1000

end

If we now compare the ratios of the condition numbers obtained (c(2)/c(1) and c(3)/c(2)), we shall see that they are 9.98 and 9.97, very close to the factor of 10 that was used.

23. Use the following Matlab code:

for i=1:100

[W,Ki,Li,Ko,Lo,IPS,TPS,cg,ErrorN,G]=MLP_initial_par([5 1],InpPat,TarPat,2);

E(i)=ErrorN(2);

c(i)=cond(G);

end

This will generate 100 different initializations of the weight vector, with the weights in the linear output layer computed as random values.

Afterwards use the following Matlab code:

for i=1:100

[W,Ki,Li,Ko,Lo,IPS,TPS,cg,ErrorN,G]=MLP_initial_par([5 1],InpPat,TarPat,1);

E1(i)=ErrorN(2);

c1(i)=cond(G);

end

This will generate 100 different initializations of the weight vector, with the weights in the linear output layer computed as the least-squares values. Finally, use the Matlab code:


for i=1:100

W=randn(1,21);

[Y,G,E,c]=TwoLayer(InpPat,TarPat,W,[5 1]);

E2(i)=norm(TarPat-Y);

c2(i)=cond(G);

end

The mean results obtained are summarized in the following table:

Table 4.1 - Mean values of the initial Jacobean condition number and error norm

Method                                         Jacobean Condition Number   Initial Error Norm
MLP_init (random values for linear weights)    1.6 10^6                    15.28
MLP_init (optimal values for linear weights)   3.9 10^6                    1.87
random values                                  2.8 10^10                   24.11

24. Let us consider first the input. Determining the net input of the first hidden layer:

$Net^{(2)} = \begin{bmatrix}IP_s & I_{m\times 1}\end{bmatrix}W^{(1)} = \begin{bmatrix}IP\,k_i + I_{m\times k_1}\,l_i & I_{m\times 1}\end{bmatrix}\begin{bmatrix}W^{(1)}_{1\ldots k_1,.}\\ W^{(1)}_{k_1+1,.}\end{bmatrix} = IP\,k_i\,W^{(1)}_{1\ldots k_1,.} + I_{m\times k_1}\,l_i\,W^{(1)}_{1\ldots k_1,.} + I_{m\times 1}\,W^{(1)}_{k_1+1,.}$

This way, each row within the first k1 lines of W(1) appears multiplied by each element of the diagonal of ki, while to each element of the last row (related with the bias) a quantity is added, which is the dot product of the diagonal elements of li and each column of the first k1 lines of W(1).

Let us address now the output.

$O = \frac{O_s - I\,l_o}{k_o} = \frac{1}{k_o}\left(\begin{bmatrix}Net^{(q-1)} & I\end{bmatrix}\begin{bmatrix}w^{(q-1)}_{1\ldots k_{q-1}}\\ w^{(q-1)}_{k_{q-1}+1}\end{bmatrix} - I\,l_o\right) = Net^{(q-1)}\cdot\frac{w^{(q-1)}_{1\ldots k_{q-1}}}{k_o} + I\cdot\frac{w^{(q-1)}_{k_{q-1}+1} - l_o}{k_o}$

That is, the weights connecting the last hidden neurons with the output neuron appear divided by ko, and the bias is first subtracted of lo, and afterwards divided by ko.

25. The results presented below should take into account that in each iteration of Train_MLPs.m a new set of initial weight values is generated, and therefore no run is equal. Those results were obtained using the Levenberg-Marquardt method, minimizing the new criterion. For the early-stopping method, a percentage of 30% for the validation set was employed.

In terms of the pH problem, employing a termination criterion of 10^-3, the following results were obtained:

Table 4.2 - Results for the pH problem

Regularization Parameter   Error Norm   Linear Weight Norm   Number of Iterations   Error Norm (Validation Set)
0                          0.021        80                   20                     0.003
10^-6                      0.016        5.3                  17                     0.015
10^-4                      0.033        7.3                  15                     0.026
10^-2                      0.034        9.4                  38                     0.018
early-stopping             0.028        21                   24                     0.02

In terms of the Coordinate Transformation problem, a termination criterion of 10^-5 was employed. The following results were obtained:

Table 4.3 - Results for the Coordinate Transformation problem

Regularization Parameter   Error Norm   Linear Weight Norm   Number of Iterations   Error Norm (Validation Set)
0                          0.41         17.5                 49                     0.39
10^-6                      0.99         2.3                  45                     0.91
10^-4                      1.28         2.5                  20                     0.93
10^-2                      0.5          10.6                 141                    0.39
early-stopping             0.38         40                   119                    0.24

The results presented above show that, only in the 2nd case, the early-stopping technique achieves better generalization results than the standard technique, with or without regularization. Again, care should be taken in the interpretation of the results, as in every case different initial values were employed.

26. For both cases we shall use a termination criterion of 10^-3. The Matlab files can be extracted from Const.zip.

The results for the pH problem can be seen in the following figure:

[Figure: error norm versus number of nonlinear neurons, constructive method, pH problem]

There is no noticeable decrease in the error norm after 5 hidden neurons. Networks with more than 5 neurons exhibit the phenomenon of overmodelling. If a MLP with 10 neurons is constructed using the Matlab function Train_MLPs.m, the error norm obtained is 0.086, while with the constructive method we obtain 0.042.

The results for the Inverse Coordinate Problem can be seen in the following figure:

[Figure: error norm versus number of nonlinear neurons, constructive method, Inverse Coordinate problem]

As it can be seen, after the 7th neuron there is no noticeable improvement in the accuracy. For this particular case, models with more than 7 neurons exhibit the phenomenon of overmodelling. If a MLP with 10 neurons is constructed using the Matlab function Train_MLPs.m, the error norm obtained is 0.086, while with the constructive method we obtain 0.097.

It should also be mentioned that the strategy employed in this constructive method leads to bad initial models when the number of neurons is greater than, let us say, 5.

27. The instantaneous autocorrelation matrix is given by $R[k] = a[k]a'[k]$. The eigenvalues and eigenvectors of $R[k]$ satisfy the equation $R[k]e = \lambda e$. Replacing the previous equation in the last one, we have:

$a[k]a'[k]e = \lambda e$

As the product $a'[k]e$ is a scalar, this corresponds to the eigenvalue, and $a[k]$ is the eigenvector.

28. After adaptation with the LMS rule, the a posteriori output of the network, $\hat{y}[k]$, is given by:

$\hat{y}[k] = a^T[k]w[k] = a^T[k]\left(w[k-1] + \eta e[k]a[k]\right) = y[k] + \eta\|a[k]\|^2 e[k]$,

where the a posteriori error, $\hat{e}[k]$, is defined as:

$\hat{e}[k] = t[k] - \hat{y}[k] = \left(1 - \eta\|a[k]\|^2\right)e[k]$.

For a non-null error, the following relations apply:

$|\hat{e}[k]| > |e[k]|$ if $\eta \notin [0,\ 2/\|a[k]\|^2]$

$|\hat{e}[k]| = |e[k]|$ if $\eta = 0$ or $\eta = 2/\|a[k]\|^2$

$|\hat{e}[k]| < |e[k]|$ if $\eta \in (0,\ 2/\|a[k]\|^2)$

$\hat{e}[k] = 0$ if $\eta = 1/\|a[k]\|^2$

29. The following figure illustrates the results obtained with the NLMS rule, for the

Coordinate Inversion problem, when $\eta = \{0.1, 1, 1.9\}$.

[Figure: MSE versus iterations for the NLMS rule with $\eta = \{0.1, 1, 1.9\}$]

Learning is stable in all cases, and the rate of convergence is almost independent of the learning rate employed. If we employ a learning rate (2.001) slightly larger than the stable domain, we obtain unstable learning:

[Figure: MSE versus iterations for the NLMS rule with $\eta = 2.001$, showing divergence]

Using now the standard LMS rule, the following results were obtained with $\eta = \{0.1, 0.5\}$. Values higher than 0.5 result in unstable learning.

[Figure: MSE versus iterations for the standard LMS rule with $\eta = \{0.1, 0.5\}$]

In terms of convergence rate, the methods produce similar results. The NLMS rule enables us to guarantee convergence within the domain $\eta \in [0, 2]$.

30. We shall consider first the pH problem. The average absolute error, after off-line training (using the parameters stored in Initial_on-pH.mat), is:

$\varsigma = E\left[\left|e_n[k]\right|\right] = 0.0014$. Using this value in

$e_d[k] = \begin{cases}0, & \text{if } |e[k]| \leq \varsigma\\ e[k] + \varsigma, & \text{if } e[k] < -\varsigma\\ e[k] - \varsigma, & \text{if } e[k] > \varsigma\end{cases}$,

the next figure shows the MSE value, for the last 10 (out of 20 passes) of adaptation, using the NLMS rule with $\eta = 1$.

[Figure: MSE for the last 10 passes of NLMS adaptation on the pH problem, with and without dead-zone]

The results obtained with the LMS rule, with $\eta = 0.5$, are shown in the next figure.

[Figure: MSE for the last 10 passes of LMS adaptation on the pH problem, with and without dead-zone]

Considering now the Coordinate Transformation problem, the average absolute error, after off-line training, is $\varsigma = E\left[\left|e_n[k]\right|\right] = 0.027$. Using this value, the NLMS rule, with $\eta = 1$, produces the following results:

[Figure: MSE in the last pass of NLMS adaptation on the Coordinate Transformation problem, with and without dead-zone]

The above figure shows the MSE in the last pass (out of 20) of adaptation.

Using now the LMS rule, with $\eta = 0.5$, we obtain:

[Figure: MSE in the last pass of LMS adaptation on the Coordinate Transformation problem, with and without dead-zone]

For all the cases, better results are obtained with the inclusion of an output dead-zone in the adaptation algorithm. The main problem, in real situations, is to determine the dead-zone parameter.
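A minimal sketch of one NLMS update with an output dead-zone, assuming a linear-in-the-parameters model y = a'*w (the values of eta, zeta, the pattern and the target below are illustrative, not the ones of the experiments above):

eta = 1;  zeta = 0.0014;                     % learning rate and dead-zone parameter (illustrative)
a = randn(3,1);  w = zeros(3,1);  t = 0.3;   % one input pattern, weight vector and target (illustrative)
y = a'*w;                                    % model output
e = t - y;                                   % output error
ed = sign(e)*max(abs(e)-zeta, 0);            % dead-zone error: zero inside the dead-zone
w = w + eta*ed*a/(a'*a);                     % normalized LMS update using the dead-zone error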

31. Considering the Coordinate Inverse problem, using the NLMS rule with $\eta = 1$, the results obtained with a worse conditioned model (weights in Initial_on_CT_bc.m), compared with the results obtained with a better conditioned model (weights in Initial_on_CT.m), are represented in the next figure:

[Figure: MSE versus iterations for the worse conditioned and better conditioned models, Coordinate Inverse problem]

Regarding now the pH problem, using the NLMS rule with $\eta = 1$, the results obtained with a worse conditioned model (weights in Initial_on_pH_bc.m), compared with the results obtained with a better conditioned model (weights in Initial_on_pH.m), are represented in the next figure:

[Figure: MSE versus iterations for the worse conditioned and better conditioned models, pH problem]

It is obvious that a better conditioned model achieves a better adaptation rate.

32. We shall use the NLMS rule, with $\eta = 1$, in the conditions of Ex. 2.6. We shall start the adaptation from 4 different initial conditions: $w_i = \pm 10$, $i = 1, 2$. The next figure illustrates the evolution of the adaptation, with 10 passes over the training set. The 4 different adaptations converge to a small area, indicated by the green colour in the

figure.

If we zoom into this small area, we can see that $w_1 \in [0, 0.75]$ and $w_2 \in [-0.07, 1]$. In the first example ($w[1] = [10, 10]$), it enters this area in iteration 203; in the second example ($w[1] = [10, -10]$), it enters this area in iteration 170; in the third case ($w[1] = [-10, -10]$), it enters this area in iteration 204; and in the fourth case ($w[1] = [-10, 10]$), it enters this domain in iteration 170. This is shown in the next figure.

This domain, where, after being entered, the weights never leave and never settle, is called the minimal capture zone.

If we compare the evolution of the weight vector, starting from $w[1] = [10, -10]$, with or without dead-zone, we obtain the following results:

[Figure: evolution of w1 and w2 over the iterations, with and without dead-zone]

The optimal values of the weight parameters, in the least squares sense, are given by:

$w = \begin{bmatrix}x & 1\end{bmatrix}^+ y = [0,\ 0.3367]$, where x and y are the input and target data, obtaining an optimal MSE of 0.09. The dead-zone parameter employed was

$\varsigma = \max\left[\left|e_n[k]\right|\right] = 0.663$.

33. Assuming an interpolation scheme, the number of basis functions is equal to the number of patterns. This way, your network has 100 neurons in the hidden layer. The centres of the network are placed at the input training points. So, if the matrix of the centres is denoted as C, then C = X. With respect to the spreads, as nothing is mentioned, you can employ the most standard scheme, which is equal spreads of value

$\sigma = \frac{d_{max}}{\sqrt{2m_1}}$,

where dmax is the maximum distance between the centres. With respect to the linear output weights, they are the optimal values, in the least squares sense, that is,

$w = G^+ t$, where G is the matrix of the outputs of the hidden neurons.

The main problem with the last scheme is that the network grows as the training set grows. This results in ill-conditioning of matrix G, or even singularity. For this reason, an approximation scheme, with the number of neurons strictly less than the number of patterns, is the option usually taken.
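A minimal Matlab sketch of this interpolation scheme (X and t below are illustrative random data standing in for the 100 training points of the exercise):

X = randn(100,2);  t = randn(100,1);          % illustrative training data (100 patterns)
C = X;                                        % centres placed at the training points
n = size(X,1);
D2 = zeros(n,n);                              % squared distances between patterns and centres
for i = 1:n
  for j = 1:n
    D2(i,j) = sum((X(i,:)-C(j,:)).^2);
  end
end
dmax = sqrt(max(D2(:)));                      % maximum distance between centres
sigma = dmax/sqrt(2*n);                       % equal spreads
G = exp(-D2/(2*sigma^2));                     % matrix of the hidden-neuron outputs
w = pinv(G)*t;                                % least-squares output weights, w = G^+ t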

34. The k-means clustering algorithm places the centres in regions where a significant number of examples is presented. The algorithm is:

1. Initialization - Choose random values for the centres; they must be all different.

2. For j=1 to n

2.1. Sampling - Find a sample vector x(k) from the input matrix.

2.2. Similarity matching - Find the centre closest to x(k). Let its index be k(x):

$k(x) = \arg\min_j \left\|x(k) - c_j[i]\right\|_2, \quad j = 1, \ldots$  (4.4)

2.3. Updating - Adjust the centres of the radial basis functions according to:

$c_j[i+1] = \begin{cases}c_j[i] + \eta\left(x(k) - c_j[i]\right), & j = k(x)\\ c_j[i], & \text{otherwise}\end{cases}$  (4.5)

2.4. j=j+1

end

35. We shall use, as initial values, the data stored in winiph_opt.m and winict_opt.m, for the pH and the Coordinate Inverse problems, respectively. The new criterion and the Levenberg-Marquardt method will be used in all these problems.

a) With respect to the pH problem, the application of the termination criterion ($\tau = 10^{-4}$) is expressed in the next table:

Table 4.6 - Standard Termination

Method              Number of Iterations   Error Norm   Linear Weight Norm   Condition of Basis Functions
LM (New Criterion)  5                      0.0133       7.8 10^4             1.4 10^7

With respect to the Coordinate Inverse problem, the application of the termination criterion ($\tau = 10^{-3}$) is expressed in the next table:

Table 4.7 - Standard Termination

Method              Number of Iterations   Error Norm   Linear Weight Norm   Condition of Basis Functions
LM (New Criterion)  18                     0.19         5.2 10^8             2.5 10^16

b) With respect to the pH problem, the application of the termination criterion ($\tau = 10^{-4}$), to the LM method, minimizing the new criterion, and using an early-stopping method (the Matlab function gen_set.m was applied with a percentage of 30%), gives the following results:

Table 4.8 - Early-stopping Method

Method          Number of Iterations   Error Norm (Est. set)   Error Norm (Val. set)   Linear Weight Norm   Condition of Basis Functions
Early-Stopping  6                      0.0075                  0.0031                  8.7 10^4             1.7 10^7
λ = 10^-6       19                     0.0044                  0.0029                  15.4                 1.6 10^3

The second line represents the results obtained, for the same estimation and validation sets, using the parameters found by the application of the regularization technique (unitary matrix) to all the training set. It can be seen that the same accuracy is obtained for the validation set, with better results in the estimation set.

With respect to the Coordinate Inverse problem, the application of the termination criterion ($\tau = 10^{-3}$), to the LM method, minimizing the new criterion, and using an early-stopping method (the Matlab function gen_set.m was applied with a percentage of 30%), gives the following results:

Table 4.9 - Early-stopping Method

Method          Number of Iterations   Error Norm (Est. set)   Error Norm (Val. set)   Linear Weight Norm   Condition of Basis Functions
Early-Stopping  77                     0.1141                  0.1336                  1.3 10^8             4.6 10^14
λ = 10^-6       19                     0.1693                  0.1047                  168                  3.7 10^14

These results show that, using all the training data with the regularization method, a better result was obtained for the validation set, although a worse result was obtained for the estimation set.

c) With respect to the pH problem, the application of the termination criterion ($\tau = 10^{-4}$), to the LM method, minimizing the new criterion, is expressed in the next table:

Table 4.10 - Explicit Regularization (I)

Method      Number of Iterations   Error Norm   Linear Weight Norm   Condition of Basis Functions
λ = 10^-6   19                     0.0053       15.4                 1.6 10^3
λ = 10^-4   25                     0.0247       9.75                 3.8 10^4
λ = 10^-2   83                     0.044        2.07                 4.5 10^3

With respect to the Coordinate Inverse problem, the application of the termination criterion ($\tau = 10^{-3}$), to the LM method, minimizing the new criterion, is expressed in the next table:

Table 4.11 - Explicit Regularization (I)

Method      Number of Iterations   Error Norm   Linear Weight Norm   Condition of Basis Functions
λ = 10^-6   100                    0.199        168                  3.7 10^14
λ = 10^-4   100                    0.4039       29                   1.2 10^15
λ = 10^-2   100                    0.9913       9.9                  3.8 10^17

d) With respect to the pH problem, the application of the termination criterion ($\tau = 10^{-4}$), to the LM method, minimizing the new criterion, is expressed in the next table:

Table 4.12 - Explicit Regularization (G0)

Method      Number of Iterations   Error Norm   Linear Weight Norm   Condition of Basis Functions
λ = 10^-6   43                     0.0132       3.6 10^4             5.6 10^6
λ = 10^-4   17                     0.0196       633                  1.2 10^5
λ = 10^-2   150                    0.0539       38                   1.3 10^6

With respect to the Coordinate Inverse problem, the application of the termination criterion ($\tau = 10^{-3}$), to the LM method, minimizing the new criterion, is expressed in the next table:

Table 4.13 - Explicit Regularization (G0)

Method      Number of Iterations   Error Norm   Linear Weight Norm   Condition of Basis Functions
λ = 10^-6   100                    0.49         91                   4 10^15
λ = 10^-4   100                    0.3544       27                   5.2 10^15
λ = 10^-2   100                    1.229        11.5                 2.4 10^18

36. The generalization parameter is 2, so there are 2 overlays.

a)

FIGURE 4.66 - Overlay diagram, with $\rho = 2$

[Figure: a 5x5 input lattice with 18 basis functions, a1 to a18, organized in a 1st overlay with displacement d1=(1,1) and a 2nd overlay with displacement d2=(2,2)]

There are $p' = \prod_{i=1}^{n}(r_i+1) = 5^2 = 25$ cells within the lattice. There are 18 basis functions within the network. At any moment, only 2 basis functions are active in the network.

b) Analysing fig. 4.66, we can see that as the input moves along the lattice one cell parallel to an input axis, the number of basis functions dropped from, and introduced to, the output calculation is a constant (1) and does not depend on the input.

c) A CMAC is said to be well defined if the generalization parameter satisfies:

$\rho \leq \max_i(r_i + 1)$  (4.14)

37. The decomposition of the basis functions into overlays demonstrates that the number of basis functions increases exponentially with the input dimension. The total number of basis functions is the sum of the basis functions in each overlay. This number, in turn, is the product of the number of univariate basis functions on each axis. These have a bounded support, and therefore there are at least two defined on each axis. Therefore, a lower bound for the number of basis functions for each overlay, and subsequently for the AMN, is 2^n. These networks therefore suffer from the curse of dimensionality. In B-splines, this problem can be alleviated by decomposing a multidimensional network into a network composed of additive sub-networks of smaller dimensions. An algorithm to perform this task is the ASMOD algorithm.

38. The network has 4 inputs and 1 output.

a) The network can be described as: $f(x) = f_1(x_1) + f_2(x_2) + f_3(x_3) + f_4(x_3, x_4)$. The number of basis functions for each sub-network is given by $p = \prod_{i=1}^{n_i}(r_i + k_i)$. Therefore, we have $(5+2) + (4+2) + (3+2) + (4+3)^2 = 18 + 49 = 67$ basis functions for the overall network. In terms of active basis functions, we have $p'' = \sum_{i=1}^{n}\prod_{j=1}^{n_i}k_{j,i}$, where n is the number of sub-networks, $n_i$ is the number of inputs for sub-network i, and $k_{j,i}$ is the B-spline order for the jth dimension of the ith sub-network. For this case, $p'' = 2 + 2 + 2 + 3\times 3 = 15$.
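A small Matlab computation of these two counts, under the sub-network structure used above (the cell arrays r and k hold the numbers of interior knots and the spline orders of each sub-network dimension):

r = {5, 4, 3, [4 4]};  k = {2, 2, 2, [3 3]};   % interior knots and orders per sub-network
p = 0;  pp = 0;
for i = 1:length(r)
  p  = p  + prod(r{i} + k{i});                 % basis functions of sub-network i
  pp = pp + prod(k{i});                        % active basis functions of sub-network i
end
[p pp]                                         % gives [67 15]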

b) The ASMOD algorithm can be described as:

Algorithm 4.1 - ASMOD algorithm

m_0 = Initial Model;
i = 1;
termination criterion = FALSE;
WHILE NOT(termination criterion)
    Generate a set of candidate networks, M_i;
    Estimate the parameters for each candidate network;
    Determine the best candidate, m_i, according to some criterion J;
    IF J(m_i) >= J(m_{i-1}), termination criterion = TRUE; END;
    i = i + 1;
END

Each main part of the algorithm will be detailed below.

• Candidate models are generated by the application of a refinement step, where the complexity of the current model is increased, and a pruning step, where the current model is simplified, in an attempt to determine a simpler model that performs as well as the current model. Note that, in the majority of the cases, the latter step does not generate candidates that are able to proceed to the next iteration. Because of this, this step is often applied only after a certain number of refinement steps, or just applied to the optimal model resulting from the ASMOD algorithm with refinement steps only.

Three methods are considered for model growing:

1. For every input variable not present in the current network, a new univariate sub-model is introduced in the network. The spline order and the number of interior knots are specified by the user, and usually 0 or 1 interior knots are applied;

2. For every combination of sub-models present in the current model, combine them in a multivariate network with the same knot vector and spline order. Care must be taken in this step to ensure that the complexity (in terms of weights) of the final model does not exceed the size of the training set;

3. For every sub-model in the current network, for every dimension in each sub-model, split every interval in two, creating therefore candidate models with a complexity higher by 1.

For network pruning, three possibilities are also considered:

1. For all univariate sub-models with no interior knots, replace them by a spline of order k-1, also with no interior knots. If k-1 = 1, remove this sub-model from the network, as it is just a constant;

2. For every multivariate (n-input) sub-model in the current network, split it into n sub-models with n-1 inputs;

3. For every sub-model in the current network, for every dimension in each sub-model, remove each interior knot, creating therefore candidate models with a complexity smaller by 1.

39. Recall Exercise 1.3. Consider that no interior knots are employed. Therefore, a B-spline of order 1 is given by:

$N_1^1(x) = \begin{cases}1, & x\in I_1\\ 0, & x\notin I_1\end{cases}$  (4.15)

The output corresponding to this basis function is therefore:

$y(N_1^1(x)) = \begin{cases}w_1, & x\in I_1\\ 0, & x\notin I_1\end{cases}$,  (4.16)

which means that with a sub-model which is a B-spline of order 1, any constant term can be obtained.

Consider now a spline of order 2. It is defined as:

$N_2^j(x) = \frac{x-\lambda_{j-2}}{\lambda_{j-1}-\lambda_{j-2}}N_1^{j-1}(x) + \frac{\lambda_j-x}{\lambda_j-\lambda_{j-1}}N_1^j(x), \quad j = 1, 2$  (4.17)

It is easy to see that

$N_2^1(x) = \frac{\lambda_1-x}{\lambda_1-\lambda_0}, \quad N_2^2(x) = \frac{x-\lambda_0}{\lambda_1-\lambda_0}, \quad x\in I_1$  (4.18)

For our case,

$N_2^1(x_1) = 1-x_1, \quad N_2^2(x_1) = x_1, \quad x_1\in I_1$
$N_2^1(x_2) = \frac{1-x_2}{2}, \quad N_2^2(x_2) = \frac{x_2+1}{2}, \quad x_2\in I_1$  (4.19)

The outputs corresponding to these basis functions are simply:

$y(N_2^1(x_1)) = w_2(1-x_1), \quad y(N_2^2(x_1)) = w_3x_1, \quad x_1\in I_1$
$y(N_2^1(x_2)) = w_4\frac{1-x_2}{2}, \quad y(N_2^2(x_2)) = w_5\frac{x_2+1}{2}, \quad x_2\in I_1$  (4.20)

Therefore, we can construct the functions 4x1 and -2x2 just by setting w2=0, w3=4, and w4=4, w5=0. Note that this is not the only solution. Using this solution, note that $y(N_2^1(x_2)) = 2-2x_2$, which means that we must subtract 2 in order to get -2x2.

Consider now a bivariate sub-model, of order 2. As we know, bivariate B-splines are constructed from univariate B-splines using:

$N_k^j(x) = \prod_{i=1}^{n}N_{k_i,i}^j(x_i)$  (4.21)

We have now 4 basis functions:

$N_{2,2}^1(x_1,x_2) = (1-x_1)\frac{1-x_2}{2}, \quad N_{2,2}^2(x_1,x_2) = x_1\frac{x_2+1}{2}$
$N_{2,2}^3(x_1,x_2) = x_1\frac{1-x_2}{2}, \quad N_{2,2}^4(x_1,x_2) = (1-x_1)\frac{x_2+1}{2}$  (4.22)

These are equal to:

$N_{2,2}^1(x_1,x_2) = \frac{1-x_1-x_2+x_1x_2}{2}, \quad N_{2,2}^2(x_1,x_2) = \frac{x_1x_2+x_1}{2}$
$N_{2,2}^3(x_1,x_2) = \frac{x_1-x_1x_2}{2}, \quad N_{2,2}^4(x_1,x_2) = \frac{1-x_1+x_2-x_1x_2}{2}$  (4.23)

Therefore, the corresponding output is:

$y(N_{2,2}^1(x_1,x_2)) = w_6\frac{1-x_1-x_2+x_1x_2}{2}, \quad y(N_{2,2}^2(x_1,x_2)) = w_7\frac{x_1x_2+x_1}{2}$
$y(N_{2,2}^3(x_1,x_2)) = w_8\frac{x_1-x_1x_2}{2}, \quad y(N_{2,2}^4(x_1,x_2)) = w_9\frac{1-x_1+x_2-x_1x_2}{2}$  (4.24)

The function 0.5x1x2 can be constructed in many ways. Consider w6=w8=w9=0 and w7=1. Therefore $y(N_{2,2}^2(x_1,x_2)) = \frac{x_1x_2+x_1}{2}$, which means that we must subtract x1/2 from the output to get 0.5x1x2. This means that we should not design 4x1, but 7/2x1, therefore setting w3=7/2.

To summarize, we can design a network implementing the function $f(x_1,x_2) = 3 + 4x_1 - 2x_2 + 0.5x_1x_2$ by employing 4 sub-networks, all with zero interior knots (a numerical check of this construction is sketched below):

1. A univariate sub-network (input x1 or x2, it does not matter) of order 1, with w1=1;

2. A univariate sub-network with input x1, order 2, with w2=0 and w3=7/2;

3. A univariate sub-network with input x2, order 2, with w4=4 and w5=0;

4. A bivariate sub-network with inputs x1 and x2, order 2, with w6=w8=w9=0 and w7=1.
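A minimal numerical check of this construction, assuming the domains used in (4.19), i.e. $x_1 \in [0,1]$ and $x_2 \in [-1,1]$:

[X1,X2] = meshgrid(0:0.1:1, -1:0.2:1);        % grid over the assumed domains of x1 and x2
w = [1 0 7/2 4 0 0 1 0 0];                    % weights of the four sub-networks
y = w(1) ...                                  % order-1 sub-network (constant)
  + w(2)*(1-X1) + w(3)*X1 ...                 % univariate sub-network in x1
  + w(4)*(1-X2)/2 + w(5)*(X2+1)/2 ...         % univariate sub-network in x2
  + w(6)*(1-X1).*(1-X2)/2 + w(7)*X1.*(X2+1)/2 ...
  + w(8)*X1.*(1-X2)/2 + w(9)*(1-X1).*(X2+1)/2;  % bivariate sub-network
f = 3 + 4*X1 - 2*X2 + 0.5*X1.*X2;             % target function
max(abs(y(:)-f(:)))                           % should be (numerically) zero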

40.

a) The Matlab functions in Asmod.zip were employed to solve this problem. First, gen_set.m was employed to split the training sets between estimation and validation sets, with a percentage of 30% for the latter. Then Asmod was employed, with the termination criterion formulated as: the training stopped if the MSE for the validation set increased constantly in the last 4 iterations, or the standard ASMOD termination was found. In the following tables, the first row illustrates the results obtained with this approach. The second row illustrates the application of the model obtained with the standard ASMOD, trained using all the training set, to the estimation and validation sets used in the other approach.

Concerning the pH problem, the following results were obtained:

Table 4.25 - ASMOD Results - Early-Stopping versus complete training (pH problem)

MSEe        MSREe       MSEv        MSREv       Compl.   Wei. N.
8.6 10^-9   5.9 10^-7   4.3 10^-6   0.034       42       3.7
1.4 10^-31  8.5 10^-31  1.5 10^-31  2.2 10^-30  101      5.8

Concerning the Coordinate Transformation problem, the following results were obtained:

Table 4.26 - ASMOD Results - Early-Stopping versus complete training (CT problem)

MSEe        MSREe      MSEv        MSREv      Compl.   Wei. N.
2.7 10^-4   7.7 10^9   15 10^-3    9.7 10^12  36       5.5
1.4 10^-5   4.9 10^5   1.6 10^-5   2.3 10^5   65       9.6

For both cases, the MSE for the validation set is much lower if the training is performed using all the data.

b) The Matlab functions in Asmod.zip were employed to solve this problem. Different values of the regularization parameter were employed.

Concerning the pH problem, the following table summarizes the results obtained:

Table 4.27 - ASMOD Results - Different regularization values (pH problem)

Reg. factor   MSE          Criterion   Complexity   Weight Norm   N. Candidates   N. Iterations
λ = 0         1.4 10^-31   -6,705      101          5.82          9945            101
λ = 10^-2     3.2 10^-6    -1,190      17           2.25          341             19
λ = 10^-4     3.9 10^-9    -1,673      61           4.36          3569            61
λ = 10^-6     3.6 10^-13   -2,440      98           5.71          11056           107

Concerning the Coordinate Transformation problem, the following table summarizes the results obtained:

Table 4.28 - ASMOD Results - Different regularization values (CT problem)

Reg. factor   MSE          Criterion   Complexity   Weight Norm   N. Candidates   N. Iterations
λ = 0         1.5 10^-5    -916        65           9.6           1043            30
λ = 10^-2     3.7 10^-5    -831.7      64           4.18          921             26
λ = 10^-4     1.7 10^-5    -918.5      61           5.1           1264            33
λ = 10^-6     1.5 10^-5    -915.7      65           9.1           1067            30

For both cases, an increase in the regularization parameter increases the MSE, and decreases the complexity and the linear weight norm.

c) To minimize the MSRE, we can apply the following strategy: the training criterion can be changed to:

$\sum_{i=1}^{n}\left(\frac{t_i-y_i}{t_i}\right)^2, \quad t_i\neq 0$.

This is equivalent to:

$\sum_{i=1}^{n}\left(1-\frac{y_i}{t_i}\right)^2, \quad t_i\neq 0$, or, in matrix form: $\left\|\mathbf{1} - T^{-1}y\right\|^2$, where T is a diagonal matrix with the values of the target vector in the diagonal, and $\mathbf{1}$ is a vector of ones. As y is a linear combination of the outputs of the basis functions, A, we can employ, to determine the optimal weights: $w = (T^{-1}A)^+\mathbf{1}$. Using this strategy, we compare the results obtained by the ASMOD algorithm, in terms of the MSE and MSRE, using regularization or not, with the standard criterion. The first 4 rows show the results obtained by the ASMOD algorithm in terms of the MSE criterion, and the last four rows in terms of the MSRE. The Matlab functions in Asmod.zip were employed to solve this problem.

Concerning the pH problem, the following table summarizes the results obtained:

Table 4.29 - ASMOD Results - MSE versus MSRE (pH problem)

Reg. factor   MSE          MSRE         Criterion   Complexity   Weight Norm   N. Cand.   N. Iterations
λ = 0         1.4 10^-31   1.2 10^-30   -6,705      101          5.82          9945       101
λ = 10^-2     3.2 10^-6    2.7 10^-5    -1,190      17           2.25          341        19
λ = 10^-4     3.9 10^-9    4.2 10^-7    -1,673      61           4.36          3569       61
λ = 10^-6     3.6 10^-13   1.5 10^-12   -2,440      98           5.71          11056      107
λ = 0         1.9 10^-10   1.1 10^-30   -6,427      101          5.81          9699       99
λ = 10^-2     9.5 10^-7    2.1 10^-6    -1,173      29           2.54          989        34
λ = 10^-4     1.3 10^-9    1.7 10^-9    -1,652      79           4.75          6723       84
λ = 10^-6     2.2 10^-10   2.2 10^-13   -2,462      98           5.71          11222      108

Concerning the Coordinate Transformation problem, the following table summarizes the results obtained:

Table 4.30 - ASMOD Results - MSE versus MSRE (CT problem)

Reg. factor   MSE          MSRE         Criterion   Complexity   Weight Norm   N. Cand.   N. Iterations
λ = 0         1.5 10^-5    6.2 10^5     -916        65           9.6           1043       30
λ = 10^-2     3.7 10^-5    5.1 10^9     -831.7      64           4.2           921        26
λ = 10^-4     1.7 10^-5    9.9 10^7     -918.5      61           5.1           1264       33
λ = 10^-6     1.5 10^-5    6.7 10^5     -915.7      65           9.1           1067       30
λ = 0         2.5 10^-5    3.2 10^-6    -869        111          43.14         1273       38
λ = 10^-2     4.2 10^-5    9.9 10^-6    -924        65           3.7           495        19
λ = 10^-4     2.4 10^-6    3.6 10^-7    -1,114      110          4             594        22
λ = 10^-6     1.4 10^-6    1.8 10^-7    -1,182      110          4.4           829        27

We can observe that, as expected, the use of the MSRE criterion achieves better results in terms of the final MSRE, and often better results also in terms of the MSE. The difference in terms of MSRE is more significant for the Coordinate Inverse problem, as it has significantly smaller values of the target data than the pH problem.
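A minimal sketch of this weighted least-squares computation (A and t below are illustrative data; T is built from the target vector as described above, and the targets are assumed nonzero):

A = randn(50,6);  w_true = randn(6,1);       % illustrative basis-function outputs and weights
t = A*w_true + 0.01*randn(50,1);             % illustrative target vector (nonzero entries assumed)
T = diag(t);                                 % diagonal matrix with the targets
w = pinv(T\A)*ones(50,1);                    % optimal weights for the MSRE criterion, w = (T^-1 A)^+ 1
msre = mean(((t - A*w)./t).^2)               % resulting mean squared relative error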

d) We shall compare here the results of early-stopping methods, with the two criteria, with no regularization or with different values of the regularization parameter.

First we shall use the MSE criterion. The first four rows were obtained using an early-stopping method, where 30% of the data were used for validation. The last four rows illustrate the results obtained, for the same estimation and validation data, but with the model trained on all the data. The Matlab function gen_set.m and the files in Asmod.zip were used for this problem. The termination criterion for the early-stopping method was formulated as: the training stopped if the MSE for the validation set increased constantly in the last 4 iterations, or the standard ASMOD termination was found. This can be inspected by comparing the column It. Min with N. It.: if they are equal, the standard termination criterion was found first.

Concerning the pH problem, the results obtained are in the table below:

Table 4.31 - ASMOD Results - Early Stopping versus complete training; MSE (pH problem)

Reg. factor   MSEe         MSREe        MSEv         MSREv        It Min   Crit.    Comp   W. N.   N. C.   N It
λ = 0         8.6 10^-9    5.9 10^-7    4.3 10^-6    0.034        41       -1139    42     3.7     1853    45
λ = 10^-2     4.3 10^-6    2.8 10^-4    7.5 10^-6    0.034        19       -808     16     2.2     285     19
λ = 10^-4     7.6 10^-9    5.6 10^-7    4.3 10^-6    0.034        47       -1139    44     3.7     2098    47
λ = 10^-6     9.9 10^-9    6 10^-7      4.4 10^-6    0.034        41       -1133    41     3.7     1853    45
λ = 0         1.4 10^-31   8.5 10^-31   1.5 10^-31   2.2 10^-30   ---      -6705    101    5.82    9945    101
λ = 10^-2     3.4 10^-6    2.8 10^-5    2.7 10^-6    2.8 10^-5    ---      -1190    17     2.25    341     19
λ = 10^-4     4.3 10^-9    9.5 10^-8    3 10^-9      5.6 10^-7    ---      -1673    61     4.36    3569    61
λ = 10^-6     3.4 10^-13   1.5 10^-12   3.3 10^-9    1.0 10^-6    ---      -2440    98     5.71    11056   107

Concerning the Coordinate Transformation problem, the initial model consisted of 2 univariate sub-models and 1 bivariate sub-model, all with 0 interior knots. The following table summarizes the results obtained:

Table 4.32 - ASMOD Results - Early Stopping versus complete training; MSE (CT problem)

Reg. factor   MSEe        MSREe      MSEv        MSREv       It Min   Crit.    Comp   W. N.   N. C.   N It
λ = 0         2.7 10^-4   7.7 10^9   15 10^-3    9.7 10^12   10       -476     36     5.5     499     101
λ = 10^-2     4.5 10^-5   7.0 10^9   12 10^-3    10^13       31       -553     50     3.9     1103    31
λ = 10^-4     9 10^-6     4.3 10^7   13 10^-3    10^13       31       -680     49     5.3     1135    32
λ = 10^-6     7.5 10^-7   412        11 10^-3    3.7 10^12   40       -751     77     5.2     1670    40
λ = 0         1.4 10^-5   4.9 10^5   1.6 10^-5   2.4 10^5    ---      -916     65     9.6     1043    30
λ = 10^-2     3.7 10^-5   5.5 10^9   2.6 10^-6   4.2 10^9    ---      -831.7   64     4.18    921     26
λ = 10^-4     1.6 10^-5   1.4 10^8   2.2 10^-5   1.1 10^7    ---      -918.5   61     5.1     1264    33
λ = 10^-6     1.4 10^-5   7.3 10^5   1.6 10^-5   5.9 10^3    ---      -915.7   65     9.1     1067    30

Then we shall employ the MSRE criterion. Concerning the pH problem, the results obtained are in the table below:

Table 4.33 - ASMOD Results - Early Stopping versus complete training; MSRE (pH problem)

Reg. factor   MSEe         MSREe        MSEv         MSREv        It Min   Crit.    Comp   W. N.   N. C.   N It
λ = 0         1.3 10^-8    2.3 10^-8    4.3 10^-6    0.034        49       -1038    49     3.6     2605    53
λ = 10^-2     1.2 10^-6    2.5 10^-6    4.8 10^-6    0.034        31       -805     26     2.4     821     31
λ = 10^-4     9.7 10^-9    1.9 10^-8    4.4 10^-6    0.034        49       -1047    50     3.6     2605    53
λ = 10^-6     9.2 10^-9    1.8 10^-8    4.3 10^-6    0.034        49       -1050    50     3.6     2605    53
λ = 0         5.8 10^-31   1.4 10^-30   6.5 10^-10   6.3 10^-31   ---      -6,427   101    5.81    9699    99
λ = 10^-2     1 10^-6      2 10^-6      7.9 10^-7    2.2 10^-6    ---      -1,173   29     2.54    989     34
λ = 10^-4     1.2 10^-9    1.7 10^-9    1.8 10^-9    1.7 10^-9    ---      -1,652   79     4.75    6723    84
λ = 10^-6     1.5 10^-13   2 10^-13     7.5 10^-10   2.6 10^-13   ---      -2,462   98     5.71    11222   108

Concerning the Coordinate Inverse problem, the initial model consisted of 2 univariate sub-models and 1 bivariate sub-model, all with 2 interior knots. The following table summarizes the results obtained:

Table 4.34 - ASMOD Results - Early Stopping versus complete training; MSRE (CT problem)

Reg. factor   MSEe        MSREe       MSEv        MSREv       It Min   Crit.    Comp   W. N.   N. C.   N It
λ = 0         4.9 10^-6   8.2 10^-7   0.014       1.2 10^13   15       -787     67     4.9     624     19
λ = 10^-2     5.7 10^-5   1.3 10^-5   0.014       1 10^13     18       -631     54     3.7     519     18
λ = 10^-4     1.9 10^-6   4.2 10^-7   0.014       1.1 10^13   24       -805     74     5.4     1037    24
λ = 10^-6     5 10^-6     8.3 10^-7   0.017       1.2 10^13   15       -787     67     4.9     626     19
λ = 0         2.8 10^-5   3.6 10^-6   1.9 10^-5   2.3 10^-6   ---      -869     111    43.14   1273    38
λ = 10^-2     4.7 10^-5   1.1 10^-5   3.2 10^-5   7.4 10^-6   ---      -924     65     3.7     495     19
λ = 10^-4     2.5 10^-6   3.6 10^-7   2.2 10^-6   3.6 10^-7   ---      -1,114   110    4       594     22
λ = 10^-6     1.5 10^-6   1.9 10^-7   1.2 10^-6   1.7 10^-7   ---      -1,182   110    4.4     829     27

For all cases, the MSE or the MSRE for the validation set is much lower if the training is performed using all the data. This is more evident for the CT problem, employing the MSRE criterion.