Analysis and Applications of Artificial Neural Networks


Analysis and Applications of Artificial Neural Networks, by L. P. Veelenturf

To Rutger, Wendy, Irene and Gerrie

First published 1995 by Prentice Hall International (UK) Limited, Campus 400, Maylands Avenue, Hemel Hempstead, Hertfordshire HP2 7EZ. A division of Simon & Schuster International Group. (C) Prentice Hall International (UK) Ltd, 1995. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior permission, in writing, from the publisher. Typeset in 10 on 12 pt Times by P&R Typesetters Ltd, Salisbury, Wiltshire, UK. Printed and bound in Great Britain.

The variable w_i is called the weight of input line i and represents the synaptic transmission efficiency of the synapse between the final filament of one neuron and the dendrite (or the soma) of a particular neuron. The threshold T = w_0, the weights w_i and the delay t are real valued. If there is no feedback in the neural network we may take t = 0, and the time dependency of x_i and y can be ignored. So the previous formulation of the input-output behaviour can be replaced by:

  y = 1 if Σ_i w_i x_i > w_0
  y = 0 otherwise

An artificial neuron can easily be implemented in a simple electronic circuit (see Figure 2.3, 'Electronic implementation of an artificial neuron'). Those acquainted with electronics will understand that the transistor will be open if:

  E_1/R_1 + E_2/R_2 > U_s/R_s

If the voltage E_1 represents x_1, E_2 represents x_2, and U_s/R_s represents the threshold w_0, then we obtain, with 1/R_1 = w_1 and 1/R_2 = w_2, that the transistor is open if w_1 x_1 + w_2 x_2 > w_0.
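The zero-or-one input-output rule above is easy to simulate. A minimal sketch (the function name and the illustrative weights are mine, not the book's):

```python
def binary_neuron(weights, threshold, inputs):
    """Fires (output 1) iff the weighted input sum exceeds the threshold w0."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > threshold else 0

# With weights 1, 1 and threshold 1.5 the neuron computes logical AND:
print(binary_neuron([1.0, 1.0], 1.5, [1, 1]))  # -> 1
print(binary_neuron([1.0, 1.0], 1.5, [1, 0]))  # -> 0
```

Lowering the threshold below every single weight turns the same unit into logical OR, which already hints at how much behaviour is controlled by the weights alone.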
Networks composed of layers of interconnected artificial neurons have been studied extensively by many authors. The analysis of neural networks is attractive because all the building units, the neurons, are the same and the transfer function of such a unit is quite simple. More important, however, is that we can alter the behaviour of a neuron by changing, in a learning process, the weights w_i in the input lines. Changing weights is the artificial counterpart of the adaptation of the synaptic efficiency in real organic neural networks. Before examining this learning behaviour of a neural network, we consider the zero-or-one behaviour of just one single artificial binary neuron.

With a single neuron, for example, we can realize some restricted class of predicate logic. Consider the statement: 'John is going out for a walk if and only if the sun is shining or if it is cold and the wind is blowing west.' The predicate 'John is going out for a walk' is only TRUE if the conditions mentioned are TRUE. Now we can represent the truth value TRUE by the number 1 and FALSE by the number 0. We represent the truth value of the predicate 'The sun is shining' by x_1, the truth value of 'It is cold' by x_2, the truth value of 'The wind is blowing west' by x_3, and the truth value of 'John is going out for a walk' by y. With this notation we can enumerate all possible situations in a simple truth table, as shown in Table 2.1.

Table 2.1
  x1 x2 x3 | y
  0  0  0  | 0
  0  0  1  | 0
  0  1  0  | 0
  0  1  1  | 1
  1  0  0  | 1
  1  0  1  | 1
  1  1  0  | 1
  1  1  1  | 1

If we now consider the values of x_1, x_2 and x_3 as the inputs of a single neuron and y as the output, we can select the weights w_0, w_1, w_2 and w_3 in such a way that the output behaviour of that neuron yields the truth value of the predicate 'John is going out for a walk' (Figure 2.4, 'Neuron illustrating Table 2.1'). Methods for finding the appropriate weights, analytically or by a learning process, will be discussed later.
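One concrete weight choice realizes Table 2.1. The particular values below (w_1 = 2, w_2 = w_3 = 1, T = 1.5) are my illustrative choice, not taken from the book's Figure 2.4; any weights satisfying the eight inequalities of the table would do:

```python
def neuron(w, T, x):
    """Binary threshold neuron: y = 1 iff sum_i w_i * x_i > T."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > T else 0

# Illustrative weights: the neuron fires iff x1 = 1 or (x2 = 1 and x3 = 1),
# i.e. the sun shines, or it is cold and the wind blows west.
w, T = (2, 1, 1), 1.5
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            target = 1 if (x1 or (x2 and x3)) else 0
            assert neuron(w, T, (x1, x2, x3)) == target
print("neuron reproduces Table 2.1")
```

The loop checks all eight rows of the truth table, so the assertion is exactly the statement that this weight vector realizes the predicate.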
Pioneers in this field of research, like Rosenblatt (1962) and Minsky and Papert (1969), investigated neural networks with the aim of using such networks mainly for pattern recognition problems. For this reason they called those networks Perceptrons (from 'perception'). In honour of Rosenblatt, who used the term first, we will call the networks discussed in Chapters 2 and 3 Perceptrons.

As an example we consider a pattern recognition problem. Some pattern is projected onto a grid of small squares. A variable x_i is assigned to each square. The variable x_i will have the value 1 if the pattern is covering that particular square, and 0 if the pattern is not covering that square (see Figure 2.5). The values of the variables x_i constitute the inputs of a binary neural network. The output of the neural network will classify patterns as belonging to some predefined class (y = 1) or not (y = 0).

Table 2.2
  x1 x2 x3 x4 | y      x1 x2 x3 x4 | y
  0  0  0  0  | 0      1  0  0  0  | 0
  0  0  0  1  | 0      1  0  0  1  | 0
  0  0  1  0  | 0      1  0  1  0  | 1
  0  0  1  1  | 1      1  0  1  1  | 1
  0  1  0  0  | 0      1  1  0  0  | 1
  0  1  0  1  | 1      1  1  0  1  | 1
  0  1  1  0  | 0      1  1  1  0  | 1
  0  1  1  1  | 1      1  1  1  1  | 1

For example, if we have a very small grid of four squares and one single neuron (see Figure 2.6, 'Classification of sixteen patterns by a neural network'), one may wish to classify the sixteen different artificial patterns listed in Table 2.2 according to whether at least two black squares are connected (i.e.
the black squares are adjacent), y = 1, or not, y = 0. Although there are many pattern classification problems that can be solved with a single neuron, we will demonstrate in the next chapter that there exists no set of weights w_0, w_1, w_2, w_3 and w_4 such that Table 2.2 is realized by a single-neuron classifier. It turns out that we need a two-layer neural network with at least two neurons in the first layer. We can see directly that the problem can also be solved with four neurons in the first layer (one neuron for detecting that x_1 and x_2 are both 1, one neuron for detecting that x_1 and x_3 are both 1, one neuron for detecting that x_2 and x_4 are both 1, and one neuron for detecting that x_3 and x_4 are both 1), and one neuron in a second layer to detect that at least one neuron in the first layer has an output of 1.

One might suspect that a single neuron is unable to solve a classification problem only when there is a great number of input variables. There are, however, problems with only two input variables that are also not solvable with a single neuron. Consider, for example, the Boolean exclusive-or function y = x_1 XOR x_2, i.e. the output of the single neuron must be 1 if and only if x_1 = 1 or x_2 = 1 but not both. We will see in the next chapter that we cannot solve this problem with a single neuron, but we will demonstrate, on the other hand, that any Boolean function can be realized with a two-layer Perceptron. This example indicates at the same time a third application area for the use of binary Perceptrons: the realization of Boolean functions. Because we can realize any Boolean function with a binary Perceptron, and because every neuron can be implemented with an electronic circuit, we have a fourth application area: switching circuits.

In a subsequent section of this chapter we will study two-layer binary neural networks composed of interconnected artificial neurons without feedback connections between neurons. Different Boolean functions can be realized in parallel, e.g. with the two-layer neural network given in Figure 2.7.
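The four-plus-one neuron construction for Table 2.2 can be sketched directly. I assume the four squares are numbered row-wise, so the adjacent pairs are (x1,x2), (x1,x3), (x2,x4) and (x3,x4); with a different numbering the pair list changes but the construction does not:

```python
def S(z):  # step function: 1 iff z > 0
    return 1 if z > 0 else 0

# Adjacent pairs in a 2x2 grid, 0-based indices (assumed row-wise numbering):
ADJACENT = [(0, 1), (0, 2), (1, 3), (2, 3)]

def connected(x):
    """Two-layer Perceptron: each first-layer neuron detects that one
    adjacent pair is both 1 (AND, threshold 1); the output neuron detects
    that at least one first-layer neuron fired (OR, threshold 0)."""
    hidden = [S(x[i] + x[j] - 1) for i, j in ADJACENT]
    return S(sum(hidden))

# Check all sixteen patterns of Table 2.2:
for n in range(16):
    x = [(n >> k) & 1 for k in (3, 2, 1, 0)]          # (x1, x2, x3, x4)
    target = 1 if any(x[i] and x[j] for i, j in ADJACENT) else 0
    assert connected(x) == target
print("two-layer net reproduces Table 2.2")
```

Each hidden neuron is itself a linear threshold unit, which is exactly why a second layer suffices here while a single neuron does not.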
The neural net of Figure 2.7 classifies simple patterns consisting of three pixel points into three classes, as specified by Table 2.3. If a pattern p = (x1 x2 x3) is a member of the class K1 = {(000), (001), (100), (111)}, then y1 = 1 and y2 = 0. If the pattern is a member of the class K2 = {(010), (011)}, then y1 = 0 and y2 = 1. If the pattern is not a member of K1 or K2, then y1 = 0 and y2 = 0.

2.2 The performance of a single-neuron binary Perceptron

In the previous section we saw that a single neuron performs a kind of weighted voting on the variables x_i: the output y of the neuron will be 1 if and only if w_1 x_1 + w_2 x_2 + ... + w_n x_n is greater than some threshold T.

Example 2.1
Consider the balance of Figure 2.8 ('The mechanical equivalent of a threshold function'). At equally spaced points there might be objects with some weight g_k at the balance pole. At the left-hand side there is one fixed weight g_0 attached to the balance at unit distance from the suspension point. We use the variable x_k to indicate whether (x_k = 1) or not (x_k = 0) there is an object placed at distance k from the point of suspension. Now the balance will tip to the right if and only if:

  Σ_k k·g_k·x_k > g_0

With k·g_k replaced by w_k, g_0 = T, and the predicate 'The balance will tip to the right' replaced by the binary variable y, we obtain:

  y = 1 if and only if Σ_k w_k x_k > T

We will now investigate the properties of threshold functions like the one used above. We first define: a function y = f(x_1, x_2, ..., x_n) is called a binary linear threshold function with respect to the binary-valued variables x_1, x_2, ..., x_n if there exist a number T and a set of numbers {w_1, w_2, ..., w_n} such that y = 1 if and only if Σ_i w_i x_i > T. We will usually drop the adjective 'binary' and, although important, we will frequently drop the phrase 'with respect to the binary-valued variables x_1, x_2, ..., x_n'.
With the use of the step function S(z), defined by S(z) = 1 if z > 0 and S(z) = 0 if z ≤ 0, a linear threshold function can be written as y = S(w_0 + w_1 x_1 + ... + w_n x_n), with w_0 = -T.

Consider the exclusive-or (parity) function of two variables. For the first argument [0, 0] we must have w_0 ≤ 0. For the second argument [0, 1] we must have w_0 + w_1 x_1 + w_2 x_2 > 0, thus w_0 + w_2 > 0. For the third argument [1, 0] we must have w_0 + w_1 x_1 + w_2 x_2 > 0, thus w_0 + w_1 > 0. For the fourth argument [1, 1] we must have w_0 + w_1 x_1 + w_2 x_2 ≤ 0, thus w_0 + w_1 + w_2 ≤ 0. For the first class of arguments we must have w_0 + w_1 x_1 + w_2 x_2 > 0, and for the second class we must have w_0 + w_1 x_1 + w_2 x_2 ≤ 0. The set of points for which w_0 + w_1 x_1 + w_2 x_2 = 0 represents a separating line in the two-dimensional input space. On one side of this line we will have for every point (x_1, x_2) that w_0 + w_1 x_1 + w_2 x_2 > 0, and thus the output of the neuron will be equal to 1. On the other side of the line we will have w_0 + w_1 x_1 + w_2 x_2 ≤ 0, and the output will be 0. Collecting the four conditions:

  1. w_0 ≤ 0
  2. w_0 + w_2 > 0
  3. w_0 + w_1 > 0
  4. w_0 + w_1 + w_2 ≤ 0

In this case we can see immediately that the parity problem cannot be solved with a linear threshold function: from (2) and (3) we conclude that 2w_0 + w_1 + w_2 > 0, while from (1) and (4) we conclude that 2w_0 + w_1 + w_2 ≤ 0, a contradiction.

When manipulating such sets of inequalities we may use elementary rules like:

  1. a > 0 and b > 0 implies a + b > 0
  2. a ≤ 0 and b ≤ 0 implies a + b ≤ 0
  3. a ≤ 0 and a + b > 0 implies b > 0
  4. a + b > 0 and a + c ≤ 0 implies b > c

From rules (1) and (2) we can derive a property that we can sometimes use to check whether a logical function can be a linear threshold function or not, without writing down the set of all inequalities. Let a in rule (1) represent the sum of the weights associated with some input vector x_1 with f(x_1) = 1; thus the inequality a = w·x̂_1 > 0 must hold. Let b in rule (1) be the sum of the weights associated with some vector x_2 with f(x_2) = 1; thus b = w·x̂_2 > 0 must hold. According to rule (1) we must have w·x̂_1 + w·x̂_2 > 0. Assume we have for the zero vector 0: f(0) = 0. This implies w_0 ≤ 0 and thus (-w_0) + w·x̂_1 + w·x̂_2 > 0. Let x_1 and x_2 be vectors with no 1s in the same position; we write x_1 ∩ x_2 = 0. Let z be a vector obtained from the two input vectors x_1 and x_2 such that z_i = 1 if x_{1i} = 1 or x_{2i} = 1, and z_i = 0 otherwise; we write z = x_1 ∪ x_2. Vector z has the corresponding sum w·ẑ = (-w_0) + w·x̂_1 + w·x̂_2. Because (-w_0) + w·x̂_1 + w·x̂_2 > 0, we must have f(z) = 1.
The same kind of reasoning holds if f(x_1) = 0, f(x_2) = 0 and f(0) = 1; in that case f(z) = 0 must hold. Thus we can write the consistency property of the binary linear threshold function: if the logical function y = f(x) is a binary linear threshold function with respect to x_1, x_2, ..., x_n, and f(x_1) = u, f(x_2) = u, f(0) = ū, then f(z) = u must hold, with z = x_1 ∪ x_2 and x_1 ∩ x_2 = 0, and u = 1 (ū = 0) or u = 0 (ū = 1).

Example 2.4
For the identity (equivalence) function we have Table 2.6:

Table 2.6
  x1 x2 | y
  0  0  | 1
  0  1  | 0
  1  0  | 0
  1  1  | 1

From Table 2.6 we find:

  f(0,0) = 1 and thus w_0 > 0
  f(0,1) = 0 and thus w_0 + w_2 ≤ 0
  f(1,0) = 0 and thus w_0 + w_1 ≤ 0

Adding the last two inequalities gives w_0 + w_1 + w_2 ≤ -w_0 < 0, so for f to be a linear threshold function we would need f(1,1) = 0. However, f(1,1) = 1, and thus f cannot be a linear threshold function with respect to x_1 and x_2.

Many logical functions of n arguments cannot be realized by a single-neuron Perceptron. This also becomes clear when we consider the determination of a logical function as a classification problem. We have two classes of points in an n-dimensional input space (see Figure 2.10) for the parity problem as presented in Example 2.3. For one class of points A = {(0,0,1), (0,1,0), (1,0,0), (1,1,1)} the number of 1s is odd and the output of the neuron must be 1, and for the other class B = {(0,0,0), (0,1,1), (1,1,0), (1,0,1)} the number of 1s is even and the output must be 0. Thus for the first class we must have w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 > 0, and for the second class we must have w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 ≤ 0. The set of points for which w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 = 0 represents a separating plane H in the three-dimensional input space. On one side of the plane H we want w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 > 0, and thus the output of the neuron will be equal to 1; on the other side of the plane H we want w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 ≤ 0. No such plane exists for the parity problem.
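The consistency property gives a quick mechanical test. A small sketch (function and helper names are mine; patterns are encoded as integers whose bits are the variables x_i):

```python
def violates_consistency(f, n):
    """Search for a witness that f cannot be a linear threshold function,
    using the consistency property: if f(a) = f(b) = u, a and b share no 1s,
    and f(0) = 1 - u, then f(a OR b) must also equal u."""
    f0 = f(0)
    for a in range(1 << n):
        for b in range(1 << n):
            if a & b == 0 and f(a) == f(b) == 1 - f0:
                if f(a | b) != 1 - f0:
                    return (a, b)          # witness pair
    return None

xor = lambda p: (p & 1) ^ ((p >> 1) & 1)   # bit 0 is x1, bit 1 is x2
print(violates_consistency(xor, 2))        # -> (1, 2): f(01) = f(10) = 1, f(11) = 0
```

For a genuine threshold function such as AND the search returns None; note that the test is necessary but not sufficient, so a None result alone does not prove linear separability.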
w(3) = (-1, 0.25); the next weight vector w(4) = (-1, 1.75) is in the solution space.

The question arises of whether we will always enter, in a finite number of adaptation steps, the solution space. One can prove that for any constant learning rate ε (called fixed increment learning) this will be the case, and even for a time-varying learning rate like ε(k) = 1/k or ε(k) = k. These statements are consequences of the Perceptron convergence theorem, which will be discussed in the next section.

2.5 The Perceptron convergence theorem

This section mainly deals with the formal statement of the Perceptron convergence theorem and its proof. Because we have already outlined the theorem in the previous section in an informal way, this section can be omitted on a first reading of the book without loss of continuity.

The Perceptron convergence theorem concerns the convergence of the learning procedure to find, from examples of correct behaviour, the linear threshold function y = S(w_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n) realized by a single binary Perceptron, if the function f(x_1, x_2, ..., x_n) to be identified is a linear threshold function (the symbol S represents the step function). In the previous section the variables y and x_i were binary valued; the Perceptron convergence theorem is, however, also applicable if the variables x_i are real valued.

The reinforcement learning rule is given by: let w(0) = (w_0(0), w_1(0), ..., w_n(0)) be any initial weight vector, let w(k) = (w_0(k), w_1(k), ..., w_n(k)) be the weight vector at step k, and let ε(k) be a variable learning rate.

Local learning: if at step k we have S(w(k)·x̂_j) = 0, whereas it is given that y = f(x_j) = 1 for some input vector x̂_j = (1, x_{j1}, x_{j2}, ..., x_{jn}), then change w(k) to:

  w(k+1) = w(k) + ε(k)·x̂_j

If at step k we have S(w(k)·x̂_j) = 1, whereas it is given that y = f(x_j) = 0 for some input vector x̂_j, then change w(k) to:

  w(k+1) = w(k) - ε(k)·x̂_j

The learning rate ε(k) must satisfy the conditions:

  1. ε(k) ≥ 0
  2. lim_{m→∞} Σ_{k=1}^{m} ε(k) = ∞
  3. lim_{m→∞} [Σ_{k=1}^{m} ε(k)²] / [Σ_{k=1}^{m} ε(k)]² = 0
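The local learning rule just stated is easy to implement. A small sketch with a constant learning rate (the data set and all names are illustrative, not from the book):

```python
def S(z):
    return 1 if z > 0 else 0

def train_perceptron(samples, epochs=100, rate=1.0):
    """Reinforcement rule from the text: on a wrongly rejected example add
    rate * x_hat to the weight vector, on a wrongly accepted one subtract it
    (x_hat is the input extended with a constant 1 for the bias w0)."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        changed = False
        for x, y in samples:
            xh = [1.0] + list(x)               # extended input vector
            out = S(sum(wi * xi for wi, xi in zip(w, xh)))
            if out != y:
                sign = rate if y == 1 else -rate
                w = [wi + sign * xi for wi, xi in zip(w, xh)]
                changed = True
        if not changed:                        # a full pass without errors
            break
    return w

# Linearly separable example: logical OR
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(data)
assert all(S(w[0] + w[1]*x1 + w[2]*x2) == y for (x1, x2), y in data)
print("converged:", w)
```

Because OR is linearly separable, the convergence theorem guarantees that the inner loop stops changing the weights after finitely many corrections.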
The conditions imply that convergence occurs for any positive constant learning rate ε, or if ε(k) = 1/k, or even if it increases like ε(k) = k. If the set of samples is not linearly separable, then the separating hyperplane, defined by the dot product w·x̂ = 0, will oscillate between several positions if the learning rate is constant or increasing. The same occurs when the data set contains contradicting samples. So we can make the following practical statement:

Practical statement 2.1
A decreasing value of the learning rate ε(k) is particularly advisable if the set of samples may not be linearly separable or contains contradictions, because in that case the effect of disruptive samples will be reduced.

(Note: for a formal proof of the theorem it is required that during learning, for correct classification, the inner product satisfies w·x̂ > δ for x ∈ T_1 and w·x̂ ≤ 0 for x ∈ T_0, with δ a certain small positive constant.)

We will not give a general proof of the theorem but restrict it to the case where ε(k) = 1/|x̂(k)|. Let ŵ be a weight vector that classifies the whole training set correctly, and let δ = min_k ŵ·x̂(k)/|x̂(k)|. So at each adaptation step the inner product ŵ·w(k) is increased by at least the value δ, and thus ŵ·w(k) ≥ kδ after k adaptation steps. (Note that if we take ε(k) > 1/|x̂(k)|, the increments at each step will be larger.) Now consider:

  |w(k+1)|² = |w(k) ± x̂(k)/|x̂(k)||² = |w(k)|² ± 2·w(k)·x̂(k)/|x̂(k)| + 1

Because an adaptation is only made when the sample is misclassified, the middle term is never positive, so |w(k+1)|² ≤ |w(k)|² + 1 and hence |w(k)|² ≤ |w(0)|² + k. Combining the two results, kδ ≤ ŵ·w(k) ≤ |ŵ|·|w(k)| ≤ |ŵ|·√(|w(0)|² + k), which can only hold for a finite number of adaptation steps k; so after a finite number of adaptations no misclassified samples remain.

A mask function χ_q: P → {0, 1}, with P = {0, 1}^n, is defined by:

  χ_q(p_h) = u_1(p_h)·u_2(p_h)·...·u_n(p_h)

with u_i(p_h) = x_i(p_h) if q_i = 1, and u_i(p_h) = 1 if q_i = 0. We will call q in χ_q(p_h) the mask of the mask function χ_q.

Example 2.13
For n = 3, the mask q = (101) gives χ_q(p) = x_1·x_3.

We define the substratum S_p of a pattern p as the set of masks S_p = {q | χ_q(p) = 1}.

Example 2.14
S_0011 = {0000, 0001, 0010, 0011}.

We define the cover C_p of a pattern p as the set of patterns C_p = {q_k | χ_p(q_k) = 1}.

Example 2.15
C_0011 = {0011, 0111, 1011, 1111}.
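These three notions are easy to make concrete. A minimal sketch (patterns and masks as bit tuples; the function names are mine, not the book's):

```python
from itertools import product

def chi(q, p):
    """Mask function: chi_q(p) = 1 iff p has a 1 wherever the mask q has a 1."""
    return 1 if all(qi <= pi for qi, pi in zip(q, p)) else 0

def substratum(p):
    """S_p: all masks q with chi_q(p) = 1, i.e. all 'sub-patterns' of p."""
    return {q for q in product((0, 1), repeat=len(p)) if chi(q, p)}

def cover(p):
    """C_p: all patterns q with chi_p(q) = 1, i.e. all patterns extending p."""
    return {q for q in product((0, 1), repeat=len(p)) if chi(p, q)}

# Examples 2.14 and 2.15 for the pattern 0011:
print(sorted(substratum((0, 0, 1, 1))))  # 0000, 0001, 0010, 0011
print(sorted(cover((0, 0, 1, 1))))       # 0011, 0111, 1011, 1111
```

Note the duality: q is in the substratum of p exactly when p is in the cover of q.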
If a logical function y: {0,1}^n → {0,1} is written as an arithmetical linear combination of the mask functions:

  y(p_h) = Σ_q w_q·χ_q(p_h)   with w_q ∈ ℝ

then we will call such a form an arithmetical conjunctive normal form. In the case where the logical function is written as:

  y(p_h) = S(Σ_q w_q·χ_q(p_h))   with w_q ∈ ℝ and S the step function

then we call this form an indirect arithmetical conjunctive normal form. Now we can state the following theorem:

Theorem 2.5
Any logical function y: {0,1}^n → {0,1} can be written in an arithmetical conjunctive normal form:

  y = Σ_{q ∈ {0,1}^n} w_q·χ_q

with w_q an integer such that for each pattern p ∈ P:

  Σ_{q ∈ S_p} w_q = y(p)

with S_p the substratum of p. Before proving the theorem we will give an example.

Example 2.16
The exclusive-or function is defined by Table 2.11:

Table 2.11
  x1 x2 | y
  0  0  | 0
  1  0  | 1
  0  1  | 1
  1  1  | 0

It can be written as:

  y = w_00·χ_00 + w_10·χ_10 + w_01·χ_01 + w_11·χ_11

It turns out that w_00 = 0, w_10 = 1, w_01 = 1 and w_11 = -2. Thus:

  y = x_1 + x_2 - 2·x_1·x_2

Proof of Theorem 2.5
We have to prove that for any logical function y: {0,1}^n → {0,1} we can find a unique set of coefficients w_q such that:

  Σ_q w_q·χ_q(p_i) = y(p_i)   for each p_i ∈ P = {0,1}^n

Because χ_q(p_i) = 1 if q ∈ S_{p_i} and 0 otherwise, we have for every p_i:

  Σ_q w_q·χ_q(p_i) = Σ_{q ∈ S_{p_i}} w_q

Thus if Σ_{q ∈ S_{p_i}} w_q = y(p_i) for each p_i ∈ P, we obtain Σ_q w_q·χ_q(p_i) = y(p_i). Moreover, for a total function we have 2^n patterns and thus 2^n independent linear equations of this form. Because the number of coefficients w_q is also 2^n, the solution is unique. QED

Example 2.17
Given the function y: {0,1}^3 → {0,1} specified by Table 2.12,
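The equations Σ_{q ∈ S_p} w_q = y(p) can be solved without any matrix machinery by processing the patterns in order of their number of ones, because each new equation introduces exactly one new unknown. A sketch (function names are mine):

```python
from itertools import product

def acnf_weights(y, n):
    """Solve sum_{q in S_p} w_q = y(p) for the mask weights w_q, taking the
    patterns by increasing number of ones; at pattern p every weight of a
    strict sub-pattern of p is already known, so w_p follows directly."""
    pats = sorted(product((0, 1), repeat=n), key=sum)
    w = {}
    for p in pats:
        sub = [q for q in w if all(qi <= pi for qi, pi in zip(q, p))]
        w[p] = y(p) - sum(w[q] for q in sub)
    return w

xor = lambda p: p[0] ^ p[1]
print(acnf_weights(xor, 2))
# -> {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): -2}, i.e. y = x1 + x2 - 2*x1*x2
```

This reproduces the weights of Example 2.16, and the same call with n = 3 and the function of Table 2.12 reproduces Example 2.17.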
with the conditions Σ_{q ∈ S_p} w_q = y(p) given in the last column:

Table 2.12
  x1 x2 x3 | y |
  p0: 0 0 0 | 1 | = w_000
  p1: 0 0 1 | 1 | = w_000 + w_001
  p2: 0 1 0 | 0 | = w_000 + w_010
  p3: 0 1 1 | 0 | = w_000 + w_001 + w_010 + w_011
  p4: 1 0 0 | 1 | = w_000 + w_100
  p5: 1 0 1 | 0 | = w_000 + w_001 + w_100 + w_101
  p6: 1 1 0 | 0 | = w_000 + w_010 + w_100 + w_110
  p7: 1 1 1 | 1 | = w_000 + w_001 + w_010 + w_011 + w_100 + w_101 + w_110 + w_111

The solution for this set of equations is:

  w_000 = 1    w_100 = 0
  w_001 = 0    w_101 = -1
  w_010 = -1   w_110 = 0
  w_011 = 0    w_111 = 2

Thus the function y can be written as a linear combination of mask functions:

  y = 1 - x_2 - x_1·x_3 + 2·x_1·x_2·x_3

In Section 2.2 we defined a linear threshold function with respect to the binary variables x_1, x_2, ..., x_n as a function that can be written as y = S(Σ_i w_i x_i - T), with S the step function. From Theorem 2.5 the next theorem follows immediately.

Theorem 2.6
Any logical function y: {0,1}^n → {0,1} is a linear threshold function with respect to the set of binary mask functions, i.e.:

  y = Σ_q w_q·χ_q   or   y = S(Σ_{q ≠ 00...0} w_q·χ_q - T)   with threshold T = -w_{00...0}

Example 2.18
The exclusive-or function of Example 2.16 is a linear threshold function with respect to the binary mask functions χ_00 = 1, χ_10 = x_1, χ_01 = x_2 and χ_11 = x_1·x_2:

  y = S(x_1 + x_2 - 2·x_1·x_2 - T)   with T = -w_00 = 0

A central theorem is as follows:

Theorem 2.7
Any logical function f: {0,1}^n → {0,1} can be realized by a simple two-layer binary Perceptron.

Proof
Any logical function y: {0,1}^n → {0,1} is a linear threshold function with respect to the binary-valued mask functions:

  y = S(Σ_q w_q·χ_q - T)   with threshold T = -w_{00...0}

Given the binary-valued mask functions χ_q, one single binary output neuron can realize the linear threshold function with respect to the mask functions χ_q. The threshold of the second-layer neuron is equal to -w_{00...0} of the linear threshold function y. Any mask function χ_q = x_i·x_j·...·x_k is itself a linear threshold function with respect to x_1, x_2, ..., x_n, because χ_q = S(x_i + x_j + ... + x_k - T), with T equal to the number of variables in the product x_i·x_j·...·x_k minus 1, and thus it can be realized by a single
binary neuron in the first layer. Such a first-layer neuron has an input x_i for every x_i occurring in the product χ_q, with a corresponding weight w_i = 1. The output of such a first-layer neuron equals 1 if χ_q(p_h) = 1, and 0 otherwise. All neurons realizing mask functions constitute the first layer of the Perceptron. The output of a first-layer neuron realizing a mask function χ_q is multiplied by the synaptic weight w_q of the connection to the output neuron. Thus the output of the second-layer neuron equals 1 if and only if:

  Σ_q w_q·χ_q > -w_{00...0}

QED

Figure 2.19 gives the two-layer binary Perceptron which realizes the logical function described in Example 2.17. Figure 2.20 gives an alternative realization of the same function.

In Theorem 2.5 we found that any logical function can be written in an arithmetical conjunctive normal form, and the proof of the theorem also revealed a method of finding that form. There is, however, another method to find the arithmetical conjunctive normal form. If we have written a logical function as a Boolean function, then we can convert it in a systematic way into an equivalent arithmetical function using the following rules:

1. If x_i is a Boolean function (a single variable), then x_i is an equivalent arithmetical function.
2. If x̄_i is a Boolean function (the inverse of x_i), then 1 - x_i is an equivalent arithmetical function.
3. If the Boolean function f_1 is equivalent to the arithmetical function f_1' and the Boolean function f_2 is equivalent to the arithmetical function f_2', then the Boolean function f_1 AND f_2 is equivalent to the arithmetical function f_1'·f_2'.
4. If the Boolean function f_1 is equivalent to the arithmetical function f_1' and the Boolean function f_2 is equivalent to the arithmetical function f_2',
then the Boolean function f_1 OR f_2 (also written as f_1 + f_2) is equivalent to the arithmetical function S(f_1' + f_2'), with S the step function.
5. If the Boolean function f_1 is equivalent to the arithmetical function f_1', the Boolean function f_2 is equivalent to the arithmetical function f_2', and the Boolean functions f_1 and f_2 are not true for the same arguments, then the Boolean function f_1 OR f_2 is equivalent to the arithmetical function f_1' + f_2'.

Example 2.19
Consider the logical function specified by Table 2.13 (the function of Example 2.17). This function can be specified by the Boolean function with the four minterms of its true patterns 000, 001, 100 and 111:

  y = x̄_1·x̄_2·x̄_3 + x̄_1·x̄_2·x_3 + x_1·x̄_2·x̄_3 + x_1·x_2·x_3

Because one and only one term of y can be true for any argument, we can use rule (5) above and obtain the equivalent arithmetical conjunctive normal form:

  y = (1-x_1)(1-x_2)(1-x_3) + (1-x_1)(1-x_2)x_3 + x_1(1-x_2)(1-x_3) + x_1·x_2·x_3

or, after elimination of the brackets:

  y = 1 - x_2 - x_1·x_3 + 2·x_1·x_2·x_3

as found before in Example 2.17.

A logical function can also be considered as a characteristic function of some set Q ⊆ P. A characteristic function of a set Q ⊆ P is a logical function f_Q such that f_Q(p_i) = 1 if p_i ∈ Q and f_Q(p_i) = 0 otherwise. As a consequence of the conversion rules stated above we have the next theorem. In Theorem 2.8 we will use the complement of p_i, denoted by p̄_i (e.g. if p_i = (010) then p̄_i = (101)).

Theorem 2.8
A characteristic function f: P → {0,1} for the singleton K = {p_i} can always be written in the arithmetical conjunctive normal form:

  f_{p_i} = Σ_{q_m ∈ C_{p_i}} (-1)^{|q_m ∩ p̄_i|}·χ_{q_m}

with C_{p_i} the cover of pattern p_i and |q_m ∩ p̄_i| the order of q_m ∩ p̄_i (i.e. the number of ones occurring in q_m ∩ p̄_i).

Proof
Consider x_i as a Boolean function; then any characteristic function of the singleton {p_i} can be written as a Boolean product f_{p_i} = u_1·u_2·...·u_n, with u_j = x_j if p_{ij} = 1 and u_j = x̄_j (the negation of x_j) if p_{ij} = 0. (For example, for n = 3 and pattern p_i = (101): f_{p_i} = x_1·x̄_2·x_3.) The Boolean function x_j is equivalent to the arithmetical pixel function x_j.
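Theorem 2.8 can be checked mechanically. A sketch verifying Example 2.20 below (function names are mine; patterns are bit tuples):

```python
from itertools import product

def singleton_acnf(p):
    """Theorem 2.8: the characteristic function of the singleton {p} has,
    for every mask q in the cover of p, the weight (-1)^|q AND not-p|."""
    cov = [q for q in product((0, 1), repeat=len(p))
           if all(pi <= qi for pi, qi in zip(p, q))]
    return {q: (-1) ** sum(qi * (1 - pi) for qi, pi in zip(q, p)) for q in cov}

def evaluate(w, x):
    """Evaluate the arithmetical form sum_q w_q * chi_q(x)."""
    return sum(wq for q, wq in w.items()
               if all(qi <= xi for qi, xi in zip(q, x)))

# Example 2.20: p = 0011 gives f = x3*x4 - x1*x3*x4 - x2*x3*x4 + x1*x2*x3*x4
w = singleton_acnf((0, 0, 1, 1))
for x in product((0, 1), repeat=4):
    assert evaluate(w, x) == (1 if x == (0, 0, 1, 1) else 0)
print("characteristic function of {0011} verified")
```

The alternating signs are exactly the inclusion-exclusion terms produced by expanding the factors (1 - x_j) of the complemented positions.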
The Boolean function x̄_j is equivalent to the arithmetical function (1 - x_j). Thus if we replace the Boolean variables in the Boolean product by the corresponding arithmetical functions, we obtain an equivalent arithmetical product. The elimination of the brackets in this product (and some thinking) gives the desired result. QED

Example 2.20
Let f_{p_i} = x̄_1·x̄_2·x_3·x_4 be the Boolean form of the characteristic function of the singleton {p_i} = {(0011)}. The complement of p_i is p̄_i = (1100). The cover of p_i is C_{p_i} = {0011, 1011, 0111, 1111}. Thus we can replace the Boolean function by the equivalent arithmetical function:

  f_{p_i} = (-1)^{|0011 ∩ 1100|}·χ_0011 + (-1)^{|1011 ∩ 1100|}·χ_1011 + (-1)^{|0111 ∩ 1100|}·χ_0111 + (-1)^{|1111 ∩ 1100|}·χ_1111

or simply:

  f_{p_i} = x_3·x_4 - x_1·x_3·x_4 - x_2·x_3·x_4 + x_1·x_2·x_3·x_4

(note: x̄_1·x̄_2·x_3·x_4 = (1-x_1)(1-x_2)x_3·x_4).

Theorem 2.8 can be extended to sets of patterns as follows:

Theorem 2.9
A characteristic function f_K: P → {0,1} for a class of patterns K ⊆ P can always be written in the arithmetical conjunctive normal form:

  f_K = Σ_{p_i ∈ K} f_{p_i}

with f_{p_i} the characteristic function of the singleton {p_i}.

A two-layer binary Perceptron realizing the threshold function y = S(Σ_q w_q·χ_q - T) must have, for each q ∈ {0,1}^n for which w_q ≠ 0, a first-layer neuron realizing χ_q; moreover, the weight of the connection from the output of a first-layer neuron realizing χ_q to the input of the output neuron must be equal to w_q.

Assuming we do not have an explicit description of the threshold function but only a finite set D, the training set, of examples consisting of pairs (p_i, y(p_i)), we investigate whether we can develop a learning rule leading to the recruitment of the required first-layer neurons and correct values of the weights w_q, such that at the end of the learning process the two-layer binary Perceptron will at least give the correct response to all patterns of D.

A logical function y: P → {0,1} with P = {0,1}^n can be considered as a pattern classification function, i.e. y(p_i) = 1 if p_i belongs to some subset K of P, and y(p_i) = 0 if p_i belongs to the complement K̄ = P - K.
The subset of patterns in the given data set D that are elements of K will be called the set of examples E, and the subset of patterns of D that are elements of K̄ will be called the set of counterexamples F.

The learning rule we will present gives the correct response to the training set D after a finite learning time, and generalizes in some sense on the basis of the training set to other inputs not present in the training set. In contradistinction to learning rules for other artificial neural nets, we have to present every element of the training set D just once. This implies that the learning time is proportional to the number of elements in D. As far as we know, the learning rule presented here is new and has not been published before.

Assume we want to learn a logical function y: {0,1}^n → {0,1} and we have a finite set of examples (set E) and counterexamples (set F).

The adaptive recruitment learning rule
1. Given initially an arbitrary two-layer binary Perceptron with, in the first layer, an arbitrary number (it might be zero) of neurons realizing mask functions χ_q with q ∈ {0,1}^n. The outputs of the first-layer neurons are multiplied by arbitrary weights w_q and connected to one single output neuron with arbitrary valued threshold T.
2. Present all examples and counterexamples of the set D = E ∪ F in the order of the number of ones occurring in them.
3. If an example or counterexample is correctly classified, go to the next element of the ordered set D.
4. If a pattern p is presented and incorrectly classified, and there exists no first-layer neuron that realizes the mask function χ_p, introduce such a neuron. Change the weight w_p to w_p + Δ, with Δ such that the output of the output neuron becomes correct. (Δ is positive if p belongs to E and the output was 0; Δ is negative if p belongs to F and the output was 1.)
5. Go to the next element of the training set D.

Before proving that after learning the set D is correctly classified, we will give a simple example.
Example 2.24
Assume we want to identify the logical function such that y = 1 for the elements of K = {0100, 1001, 0101, 0110, 1101, 1011, 0111, 1111}, and thus y = 0 for the elements of K̄ = {0000, 0001, 0010, 0011, 1000, 1010, 1100, 1110}. Assume we do not know K but only the set of examples E = {0100, 1001} and one counterexample F = {1100}. Assume we start the learning process with a neural net without any first-layer neurons and only an output neuron with threshold T = 0.

We start the learning process with the example 0100. The output is incorrect, so we have to introduce a first-layer neuron that realizes the mask function χ_0100. For the weight we obtain w_0100 = 1 (see Figure 2.21).

In the next learning step we take the counterexample 1100. We observe that for the neural net obtained after the first step the output for input 1100 is wrong: y = 1. We have to introduce a second first-layer neuron that realizes the mask function χ_1100, with a weight w_1100 = -1 (see Figure 2.22, 'The neural net after the second learning step').

In the third step we take the example 1001. For the neural net of Figure 2.22 we obtain for the input 1001 the output y = 0. Thus we have to add an additional first-layer neuron that realizes the mask function χ_1001, with a weight w_1001 = 1 (see Figure 2.23).

Because we have presented all examples and counterexamples, we are at the end of the learning process. One easily verifies that now the output y of the final neural net is equal to 1 for all elements of K, and y = 0 for all elements of K̄.
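The three learning steps above can be sketched and checked as follows. The particular choice of Δ as the smallest integer making the output correct is mine; the book only requires 'Δ such that the output becomes correct':

```python
from itertools import product

def S(z):
    return 1 if z > 0 else 0

def mask(q, p):
    """chi_q(p): 1 iff pattern p has a 1 at every position where mask q does."""
    return 1 if all(qi <= pi for qi, pi in zip(q, p)) else 0

def adaptive_recruitment(examples, counterexamples, T=0):
    """Adaptive recruitment rule: present E and F ordered by number of ones;
    on a misclassified pattern p, recruit a first-layer neuron realizing
    chi_p and adjust its weight until the output is correct."""
    D = sorted([(p, 1) for p in examples] + [(p, 0) for p in counterexamples],
               key=lambda s: sum(s[0]))
    w = {}
    for p, target in D:
        act = sum(wq * mask(q, p) for q, wq in w.items()) - T
        if S(act) != target:
            w[p] = w.get(p, 0) + (1 - act if target == 1 else -act)
    return w

E = [(0, 1, 0, 0), (1, 0, 0, 1)]
F = [(1, 1, 0, 0)]
w = adaptive_recruitment(E, F)
print(w)  # masks 0100 and 1001 get weight 1, mask 1100 gets weight -1

# The learned net classifies the whole of K from Example 2.24 correctly:
K = {(0,1,0,0), (1,0,0,1), (0,1,0,1), (0,1,1,0),
     (1,1,0,1), (1,0,1,1), (0,1,1,1), (1,1,1,1)}
for p in product((0, 1), repeat=4):
    y = S(sum(wq * mask(q, p) for q, wq in w.items()))
    assert y == (1 if p in K else 0)
```

Note that the final loop checking all sixteen inputs succeeds only because K happens to agree with the generalization the rule produces; the next paragraph of the text shows why this cannot be taken for granted.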
At first glance one might be surprised that in the previous example we could identify the logical function y with just two examples and one counterexample. But the example was not fair, because the unknown function could equally well have been defined as y = 1 for the set:

  K' = E ∪ (K̄ - F) = {0100, 1001} ∪ {0000, 0001, 0010, 0011, 1000, 1010, 1100, 1110} - {1100}

and y = 0 for the set:

  K̄' = F ∪ (K - E) = {1100} ∪ {0100, 1001, 0101, 0110, 1101, 1011, 0111, 1111} - {0100, 1001}

Learning with the same set of examples E and counterexamples F would then result in the same neural net, but with a wrong response for all inputs except the elements of E and F.

Although in an ideal learning situation one wishes to generalize from a restricted set of examples and counterexamples, the previous example gives ground to the following general hypothesis: generalization by learning from examples and counterexamples is in general impossible without utilizing a priori knowledge about the properties of the function to be identified. We will come back to this subject later. We will first present a proof of correctness of the adaptive recruitment learning rule.

Proof of correctness of the adaptive recruitment learning rule
We have to prove that after learning, the set E ∪ F of examples E and counterexamples F is correctly classified. This implies that if E = K and F = K̄ we can identify any logical function exactly.
Let R(k) be the subset of D which is correctly classified after step k. Assume we present at step k+1 an element p_i ∈ D. After step k+1 the linear threshold function realized by the neural net will have the form:

  y(k+1) = S(Σ_{q ∈ Q} w_q·χ_q - T)

for some set Q of recruited masks. Due to the ordered presentation of examples and counterexamples during the learning process, we have for every p_j ∈ R(k) that the number of 1s occurring in p_j is not larger than the number of 1s in p_i. Consequently, for every p_j ∈ R(k) with p_j ≠ p_i we have χ_{p_i}(p_j) = 0, so a change of the weight w_{p_i} does not alter the output for the elements of R(k): they remain correctly classified. Because p_i itself is correctly classified after the adaptation, we have R(k+1) ⊇ R(k) ∪ {p_i}, and after the presentation of all elements of D the whole set D is correctly classified. QED

We can now give a theorem concerning the set of patterns accepted by the binary Perceptron after learning.

Theorem 2.10
If E is the set of examples and F the set of counterexamples, and we use the adaptive recruitment learning rule, then after learning the output of the Perceptron will be equal to 1 for the elements of some set L, and will be zero for the set P - L, with:

  L = σ( Σ_{q_i ∈ E'} α_i·C_{q_i} - Σ_{q_j ∈ F'} β_j·C_{q_j} )

with σ the set step function, C_q the cover of pattern q, α_i > 0 and β_j > 0, E' a subset of E and F' a subset of F. Before presenting the proof of Theorem 2.10 we will illustrate it with two examples.

Example 2.26
Let E = {010} and F = {110}; then after learning the Perceptron will realize the following linear threshold function:

  y = S(x_2 - x_1·x_2)

According to the theorem we obtain for the set L:

  L = σ(C_010 - C_110) = ({010, 011, 110, 111} - {110, 111}) = {010, 011}

Example 2.27
Let the set of patterns to be learned be K = {(010), (011)}. Assume we start learning with an initial neural network containing in the first layer a neuron that realizes the mask function χ_010, connected to the output neuron with a weight w_010 = 2. Let the set of examples be E = {(011)} and the set of counterexamples F = {(110), (111)}. After learning we obtain a neural net realizing the linear threshold function:

  y = S(2·x_2 - 2·x_1·x_2 + x_2·x_3 - x_1·x_2·x_3)
For the class L we obtain:L= S(2Cm, 2C, ,+Co, ,-Cm)= (2{0l0,0ll,110, lll)2{ll0, lll}+(0ll,1ll){lll))= (2(0l0},3{0ll})= (oio, oii) IProof of Theorem 2.10By inspection of the adaptive recruitment learning rule we see that for each example q, .eE,a first-layer neuron that realizes the mask function X,with a corresponding weight w, ,_>0 will be introduced if the example does not already give the correct response.if the example has already been accepted,no rst-layer neuron will be introduced and w, ,=0. The same holds for a counterexample q, . but now with w_, 0 and thus petE).In case that pattern [1 is not accepted we obtain 34. 52 The binary Perceplronsimilarly Tg(x_) and in the opposite direction of it;if Itx, l.'[xl=tl.5 0015 t) >1;(x)=0=ol(xl1[(x)=0.l -0: +0.| Learning with a single-neuron continuous Puceplron 79 The adaptation of weights is prescribed by:d .Aw=22 tr(x, )glR that can be realized by a single-neuron Perceptron with transfer function f can go exactly (the MSE is zero) through the samples of the data set [xh t(x, -)]eD,i for every element [xh t(x, )] from D the vector: :2 = {ft, , f[1(x, )]} with )'t, =[1, x, ,, : :,2.. ..,V, -,, ] is a linear combination of some unique set of n+ I linear independent vectors:>1. f"[I(X, )]l obtained from n+1 elements [x, -. I(xj)] from the data set D. 49. 82 The continuous multi-layer PerceptronProofLet the data set D contain a_. subset S with n+l pairs [x, , t(x, )] such that we can form n+l independent equqtiogts: Z w, -xi-J, =f"[t(), l] for each element in S (NB x, _., = lj kill From these equations all n+1 weights and thus g. .. which exactly go through the samples of the subset S.can be determined.if in addition for every element [x, , t(x, |] of the data set D we can write: (t, ,ft[, x, t3) = a, t,. .1/-{mtg} with [x, . t(x, )]esand not all at,=0. then for every element (x, -, ltx, ))eD the following relation should hold; ! lwlXil= f( i Wttv"t. 
t)i Wit nil f7/ .k) = f1 + (V Amt= u dxWith f (l)=i[ln(1t)"]"" and df/ ds=2s[l f(. v)] we obtain for each ofthe elements [xh t(x, ]]eD a linear equation: 2 WkxtIt; ./ 'lttt0.5 and for xeX, , we have g, ,(x)$0.5. In the learning phase,however for an inputi. ., {- 56. so The continuous multi-layer PerceplrunXED,the target value is Itxtzu with 0.5 0.5 and for xeX, , we have . tI. .tx)0.5, and for an input xeD. , the target value is I(x)s0.5. We will see that in the last case we have to modify the learning rule as discussed in previous sections.In the next scctiott we will discuss the different classication methods. 3.8 Hyperplane boundary classification by one-zero labellingIn case of hyperplane boundary classification by one-zero labelling with a single-neuron Perceptron with sigmoid transfer function,the Ii-dimensional input space X = lR" is divided by a hyperplane dened by the dot product vv-)'t=0 into the regions X,and X.ln the Iwrning phase for XEDA the target value I(x)=l.and for xeD. , the target value t[x)=0. After learning.we use ,1/. .(x) as tl label for the class to be identitied:xe/if _: /,, (xl>0.5 and XEB it g, ,(x)$0.5 if p(A]= p(Bi. We observe that during learning we have to learn for each X the target value Itx),thus the learning goal is identical with the approximate tilting ofa data set D = DAuD, , as discussed in Section 3.5. Therefore we can use the same learning rule as discussed in the previous sections. If the data set is linear thle by a hyperplane dened by the dot product w-i=0, then the tinal we ht vector will be such that the functions _t; ,,(x) becomes (altnost) a threshold function and tlte nal MSE becomes zero.Figure 37" shows the one-dimensional ca.The hypcrpIane w. 
,+ w, x, =0 will be in this 0: 1 point 1: w, ,/w, .One may note that if the data set linear separable we also have the same problem as discussed in Section 2.4 on eh I lyAl[ion with a single-neuron binary Pcrccptron.The desired output is also zero or one.and only the input vectors are new real valued instead of binary valued,but the convergence theorem is independent of the type of input vector.Thus to solve the given classication problem we can also use the reinforcement learning rule of Section 2.4.We say that the data set I) is , w/mruhic if the data sets DA and I) do not have input vectors in common.Most In sets are separable but are not linear separable,i. e we cannot divide the input sp X by one single hyperplane into regions XA and X such that I),', .', . and I), ,r: X,, . Although most data sets are not linear trahle we cart frequently obtain very good classication results witlt one single-neuron Ctitltittttotls Pcrceptron. se Hyperplane boundary classication by oltezero labelling 979(x)= !tstx)l I1 Figure 3.22 The sigmoid lunctinn becoming a threshold functionExample 3.8Assume we have a two-dimensional data set D= DAuD, , depicted by the points in Figure 3.23.The data set is DA generated by a Gaussian distribution function with the following probability density function:l .'[t. l2 x,v =_ ex + ex fA( }) / in P 26; _~2My P 205with ; z,=0. as =0.Z,ti.=0.465 and a, .=0.4. Similarly we have the data set D, , generatedby the same type of Gaussian distribution function but now with ; r,=0, t7,=0.4. ; t,=-0.465 and (7,:.2.For an optimal minimal risk classier one can prove that the boundary between the two regions if XA and X3 is dened by the condition that fA(x,y)= f,, (x,y),if pm):pm) and c(lAiB) = c(lBlA).The optimal classication boundary,or discrimination line is given in Figure 3.22 by a curved line.One can calculate that the probability of error in that case is 5.14 per cent. 
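A probability of error of the kind quoted above can be approximated numerically. The sketch below estimates, by Monte Carlo sampling, the minimal (Bayes) probability of error for two equiprobable Gaussian classes, classifying each sample according to the larger class-conditional density. The distribution parameters are illustrative stand-ins in the spirit of Example 3.8, not the book's exact values, so the resulting figure will differ from the 5.14 per cent quoted above.

```python
import math
import random

def gauss2d(x, y, mx, my, sx, sy):
    """Density of an axis-aligned two-dimensional Gaussian."""
    return (math.exp(-((x - mx) ** 2) / (2 * sx ** 2)
                     - ((y - my) ** 2) / (2 * sy ** 2))
            / (2 * math.pi * sx * sy))

# Illustrative parameters: two mirrored classes, as in Example 3.8.
A = dict(mx=0.0, my=0.465, sx=0.2, sy=0.4)
B = dict(mx=0.0, my=-0.465, sx=0.4, sy=0.2)

def bayes_error_estimate(n=20000, seed=1):
    """Monte Carlo estimate of the minimal error for p(A) = p(B) = 1/2:
    draw points from each class and classify by the larger class density."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n):
        xa = rng.gauss(A["mx"], A["sx"]); ya = rng.gauss(A["my"], A["sy"])
        xb = rng.gauss(B["mx"], B["sx"]); yb = rng.gauss(B["my"], B["sy"])
        if gauss2d(xa, ya, **A) < gauss2d(xa, ya, **B):
            wrong += 1  # a point of class A falls where B's density is larger
        if gauss2d(xb, yb, **B) < gauss2d(xb, yb, **A):
            wrong += 1
    return wrong / (2 * n)

print(bayes_error_estimate())
```

The same routine, with the true parameters of Example 3.8 substituted, reproduces the kind of error figure discussed in the text.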
We can also divide the input space by one straight line (the boundary hyperplane is in this case a straight line given in Figure 3.22);the probability of error will then be slightly larger.The optimal position of the line is y=0.|I.ll turns out that the probability of error in that case is 5.15 per cent.ll a single-neuron Perceptron could nd this boundary we have a very good result.In Section 3.11 we will show that with a single-neuron Pereeptron we can learn to nd this boundary.I 57. as The continuous multi-layer Perteplrnn Figure 3.23 Two-dimensional classication problem with Gaussian distributionsThe preceding example illustrates the following practical statement: Practical statement 3.7Many two-class classication problems (in which the optimal classication boundary is an open,non-linear.convex boundary and the intersection area of both classes isnot too large) can be solved reasonably well with one singlancuron continuous Perceptron. Classiers that divide the n-dimensional input space by a nl~dimcnsional hyperplane will be called Ily/ wrplutltl bourulury clam; /icrs.The single-neuronHyperplane boundary classification by onuo labelling 99Perceptron is an optimal hyperplane boundary classier if a certain condition is satised: Theorem 3.6lithe data sets DA and D,are representative for the underlying distribution functipns of class A and class B.then the single-neuron Perceptmn will divide after learning with onezcro labelling the rt-dimensional input space X = lR" by an optimally located Il l-dimensional hyperplane if for the nal weight vector:|wloo. ProofIn the previous section we found that for an optimal classier the risk [for equal costs of misclassication): R= PlAlI'llblA)+PlalIiI-l3) = l plAlfl(xldx+l ptfilfutxldx X. Xmust be minimal.For the one-dimensional case with p(A)= plB) the shaded area in Figure 3.24 must be minimal for a minimal risk. 
By training a single-neuron Perceptron to lit the data set D= D,uD,by one~zero labelling,the MSE will be minimized.We have to prove that by minimizing the MSE,the risk will also be minimized. in Section 3.2 we have dened the MSE as: 1E - Z H(X. l[I(X. vJ-anlxtlllN asglxlf_(x)4> xFigute3.24 The risk (shaded area) for a one-dimensional classication problemI :4 . ..,.5 , 58. 100 the continuous multi-layer Ferceplronwith N the number of samples in the data set D,and n(x, ) the number of times input x,occurs in the data set. For input vectors x, eDA the target is t(x, )= l,and for input vectors of D the target is ttx, -)=0. Thus::, Z n, i(xr)[ly. .(x. )]+% Z nu(x. )[g. ix. )]urn yen,with N the number of samples in the data set D= DAuD. , and nA(x, .) (respectively n. ,) the number oftimes input it,occurs in DA (respectively Dal We can rewrite the above equation as: _5 mix. ) _ _ 5 n. ./ istxi)]i%will be called the internal learning rare for x, . V _ I b For r(x, ) =l the internal learning rate as a function ofthe weighted input Six. ) Will 9-vtsl=[1 ftsl]f(s)[l f(Si] and is given in Figure 3.27 by curve A. 59. 102 The continuous multi-layer Perceptron0.250.20.150.1Derivative of ftsl0.05 1Weighted input,s Figure 3.26 The derivative df/ dx of the sigmoid transfer functionFor l(x; )=0 the internal learning rate is: / (Sl=- f(s)f(sJ[|f(-3|]represented by curve B in Figure 3.27. The learning rule can now be written as follows: Aw =Z t. ,lx, 2,The learning rule implies: Awe =Z MK. )I Aw =E 5'/ Ix, -)x; We now consider the simple situation that we have to classify two points:D, ={x1} and D, ,=(xh) (see Figure 328).Let the initial weight vector of at single-neuron Perceptron with two external inputs x,and x,be such that the separating hyperplllne w9+w, x, +w, x,=0 is perpendicular to x, x, , with X,on the positive side {wu +w, x, + w, x2>0) of the hyperplane and x, , on the negative side (w. ,+w, x, +wzxz 0.5 and _z, r,, (x, ,)0 and . s(x, ,]0. 
From Section 2.2 we recall the relation between the distance citx) (in the direction of w) from the hyperplane to a point x.and the weighted input stx): wx +w, , _ s(x) lwl lwlBecause 6tx, )= 3(x, ,) we have s(x, ,)=s(x. ,) and thus ~/ (xy) =, '(Xy, )lsee Figure 3.28).The adaptation of the weight vector: 61x): AW - '"i/ Uni".+ i: :("ol"b =5/ l". ll"n Xblis in the direction of x, ,x. , and thus in the direction of the weight vector w.Thus the weight vector is multiplied by some scalar.Because the weight vector w before and after adaptation is in the same direction.the orientation of the separating hyperplane will also be the same. We recall from Section 2.2 that the distance along w from the origin to the hyperplane is given by: (I =F Wu lwlBecause 70:):~, -tx, ,) we have At'(, =l. ")(Xa]+C[(X, l=0. Thus before and alter adaptation we will be the same.We found that 1w|will be dillerent before and after adaptation and thus the distance il from the origin to the hyperplane will change.The hyperplane comes closer to x, ,.As we continue learning we observe that the hyperplane twists around the initial hyperplane but still between X6 and x, , (if 2 is small enough) while lwl increases,i. e. until gwtx) becomes a threshold function and the separating hyperplane will be the same as the initial one but now with zero MSE. Example 3.9Consider the simple Ont: -Lllll| Llllt)l|1|l classication problem:DA= (x_}= l3{ and D, ,:, x,, ) = :l}.The initial extended weight vector is iTv= [w. ,.w, ]=[4,2] [see Figure 3.19). Hyperplane boundary classification by one-zero labelling 105glxl-Ilstx)l l Figure 3.29 The initial realized sigmoid function for Example 3.9The initial separating hyperplane (a point) is dened by 4+2x, =0. For the weighted inputs we find s(x, )= 44+2-3= 2, s(x, ,)=4+2-l =-2. From Figure 3.27 we obtain for the internal learning rates 7[x_);0.0l and */ (xg -0.01. We nd for the value of the adaptation of the weight vector: Aw _ _ X,M x, , _ l _ l _ 0.00'"lx. ll +"lx. 
ll ("l3l M lil lo. o2lWith 2: 10 the new weight vector becomes; W0] _ [4,0]w,2.2The separating point Wo! W) 2 1.9 is moving to the left and w,is increased.An additional adaptation will show that the next separating point will be located to the right of the original separating point x=2. Finally we will end up with | vv| =oo and wo/ w, =2. During learning we willjump in the cutter of the error landscape (see Figure 3.30) from one slope to the other until we and w,approach innity and the MSE becomes zero.Iif for x, eDA and x. ,sD,we have x, =x, ,=x,the contribution of those points to the adaptation will be: AW= E7(x. )X. +c7(x. )x. =[7lX. )-vlxollx The value of y(x_)-, -(x, ,) is given by curve A and B in Figure 3.26. We see from 61. 106 The continuous multi-layer Perceplmn Figure 3.30 The error function of Example 3.)Figure 3.26 that if the two points coincide on the actual separating hyperplane,then they do not contribute to the adaptation. If the two coinciding points are on the positive side of the hyperplane (. '(x, )=s(x. ,)>0),they will be treated as a point of D, . If the two coinciding points are on the negative side of the hyperplane (s(x_)= x(x. ,j0.9 and no adaptationwill occur.For a wrongly classied element of D,the target value is,for example.0.1. Theinternal learning rate is:vts)= [0-lf(Sl]fts)[1-f(S)]represented by curve B in Figure 3.32. The curve is only used for values ofs to the right of the solid vertical line because at the left side f(s)0.5 for an element of X and I[x)s0.5 for xeX, . is to multiply after each adaptation step the weights with a scalar such that | vvl remains constant.Multiplying all weights with the same scalar does ltot change the separating hyperplane dened by the dot product vv~i=0. 
By keeping lwl COIII4|llI during learning we are searching for a minimum of the MSE for values of W on a hypersphere in the solution space WC We know,however.that the value oflwl must become innite to reach an optimal solution [see Section 3.7].If we divide the learning process in a sequence of time intervals siichHyperplane boundary classification by single threshold labelling 117Weights are changed after all 14 examples Epoch 0 Epoch 1 EP"" 7 Epoch 3 Epoch 4 Enoch 5 Epoch 10 Epoch 20 59C 1Figure3.38 Simulation result after several epochs (znumbcl of times the total training set of fourteen examples is supplied] of global learningthat lv'v|is kept constant in each interval until a minimum for thehMSE IS reach:and then change lw|subsequently to a larger constant value for t e next seqluen ,we systematically search through the complete solution space W and end up with therequired large value of lwl. Example 3.1 1Suppose we have one neuron with one input x,and we have the following data sets DA: i*2.0, - L9. -1.8, -1.7. 1.95} and D, ,={0.5.0-6.0.7.03}.Note that the two sets are not linear separable.After learning with single threshold labelling we want to have for the output of the neuron g, ,(x)>0.5 for xeDA and g, ,(x)0.5 for XEDE.An optimal solution for the seprallg h)Pl'Pl3 we I WIXI :0 (3 Win I" M5 Ca) 67. 118The continuous multi-layer PerceptionWeights are changed alter all 14 examplesEpoch 10. example 14 Epoch 20, example I4Epoch 100, example 14Figure3.J9 Simulation result after several epochs (= num| ,c,of mmfourteen rtultllllltly ullosen examples of the total training set in:3'l"l>| it: tl) ol lm; :i|learningHyperplane boundary classification by single threshold labelling 119is obtained when wa/ w, has a value between -1.7 and 0.5; in that case eight elements will be classied correctly and one element is misclassied. If we use the unmodied single threshold labelling method.the weights will indeed go to zero but the separation point -wo/ w. 
goes to a constant value.There are two sepa ation points to which the network converges;which one is reached depends on the i tial weights.The values of wo/ w, will be either 1.06 or ~2.67. In both cases ve inputs are misclassied.In order to see why we obtain these solutions we calculate the error measure E with W on a circle around the origin in the waw,plane for a small value of | v| . We take wo= lw|cos 4} and iv,=[wl sin 45 and vary 4) from 0 to 21: rad.In Figure 3.40 we see that for lw| =l we obtain two minima for the MSE one at = S.53 rad (corresponding to the separating point wo/ w, =lwl cos d>/ |w|sin =-0.073/-0.068 =1.06).and a second at =3.50 rad (corresponding to a separating point we, /w,=o. o94/ o. o35 = 2.57). When we use constant lwl during subsequent time intervals of learning as described above and start the first learning interval with | w| =1, we will find one of the solutions mentioned before.If we use [w|=2 in the second learning interval.then we will again nd two possible solutions:one larger than -2.67 and the other smaller than 1.06.Which solution is found depends on the solution found in the previous time interval. Angle (rad)Figtlre3.40 The MSE for example 3,11 with | w|=l.w. ,=| w| cos .w, =lwl sind> and ososszrt. 68. 120 The continuous multi-layer Perceptran Angle (rad)Figurelll The MSE for Example 3.11 with rcspeclivcly | w|= t lw| =z '(; ':5S; :d '"l= "0 Wllh wo= lwIcosa>.w. =|wlsin and-; hl'.57"; enad": )2a_; _}: :: ::r: $;'E; ::: :Cl; gl; ec: :;ec; ion olt: _lhe correct interval between the MSE for several values oflifvl Wlien | vv|bccor min {gum 34! where we-piaucd or the MSE will be in the intcri/ 'il for as . ,m'"'7 me We See hm W """''"l"" to the separating points ~1.7 and 0 S A closeini -k and 5--18 rad corrgspondlng minimum just to the H of :5 18(cones oo reveals that there will be one Bonding to the point 0.5).I3.11 A - - ~ .. pplicalion to the classification of normally distributed classesIn Section 3.8 1 1- . . - ,. 
DID D d wopiesentcd |cla lllVL. Illt)n problem with ii two-dimensional iliila sci , v H epIt. lt. Ll by the points in l-igure 3.42 The (. l'IlL|set is D gencrilcl by .1 .A ':'1>: ..Application to the classification of normally distributed classes 121 Figure3.42 Two-dimensional classication problem with two Gaussian distributed data setsa Gaussian distribution function with probability density function: ly-A1,)26,. 1 (%rx, sti d-ViiWe dene the internal learning rate for input vector xi for the weights of neuron k in the rst layer as: %/ u.(X. l= t5ir(x. )dS lkv 74. 132 The continuous muIti~Iay: r PerceptmnWith this denition we obtain for the adaptation of the weights of neuron k in the first layer due to input x, :A'=5: .1,tnxa-VJ/ -V]'I(xo>= (xxa; (x. ,)++(xxo>gtxo) I P[ix - Xa)'V]"f(Xol =[tx - Xnl'V][(X - Xo)'V]"' 'f(Xa) and lim, ,__,R, ,=0. We now return to the one~dimcnsiona|case and consider the function g, ,'lx) fromIR to R.realized by a continuous Perceptron with m neurons in the rst layer.Each neuron i in the rst layer has a sigmoid transfer function: fie[S(Xl]= [l +eXPtWiia+Wm- l= iThe Taylor expansion of this function round a point X0 will be: a: dxa'L't-Vs)l d Ymgwixo) etc. with _! /:tX0)= m+2Iv2,f[SfXt1): ll in short g: =wm+ZwZ, f,,and df/ ds= f(l f), 79. 142 The continuous nIulli~| ayer Perceplronand we obtain: M I will/ li'_. /llilwlllX 'z. lfit3fii+2fiil'iiil= l Given for some n the Taylor approximation of a given function: f(x)=i 0.(XXo)'+R, ,i= owe can select the weights of the Perceptron w, ,, who and W, for i:1_ 2. that a, =a} forj=0. l,2,. .., n.If the number m of ti t-layer neurons is equal to the number of n+|Taylor coefficients minus one,LL.m= n. then we can take the weights WW and w, , of the.. _ r Ifirst-layer neurons almost arbitrarily.because we can still select the m+l weights wm.w, ,.~-.w, ,,,of the output neuron such that for the n+1 Taylor coeicients we have a, =u} forj=0. l,2.. 
.., n.We can put the requirement u, =a,and the expressions for ct;in a matrix form: . . m suchn= /twz For n=3 and m=2 we obtain:"0 1 In fr;tum "1 =0 lftt iiliu lfizfizlWi2i ' '1.at 0 i1,. 3/i. +2/raw. .. lft23fi2+2/l2l""t2t wt. The weights in the connections from the rst layer to the output neuron can be found from: vv2=AaThe matrix A can alwaysmade non-singular by choosing suitable values of the Wghls w, A,-0 and _w, ,, for l=l,2,~. .., m. For a certain x0 and W| i| the value of f.occurring ll"l matrix A,can be assigned any value between 0 and l by choosing w, ,-0. Thus we can always make det A $0.The conclusion is that we can approximate any continuous function f:R. R in any domain [x. ,. x. ] arbitrarily well by a two-layer Perccptron with n input neurons and one linear output neuron if the function f is approximated by the first n+|Taylor cocictents.The same holds for the [1-dimensional case.The proof is similar and is based on the introduction of the 1107- 1) additional weights WI whml Wu can.U, Function realizable with a two-layer Fercepti-an 143freely select.Note that for a close approximation the number ofrequired input neuronscan become quite large.For a full proofsee Tromp (1993).Thus we have the following .theorem: Theorem 3.9A two-layer continuous Perceptron with sigmoid transfer function for the neurons in the first layer and one linear neuron in the second layer can approximate any continuous function f:|R"~R in any domain with any given accuracy. Although it is important to know that a two-layer Perccptron can approximate any continuous function and that we can calculate the required weights with the method described above.the weights obtained after learning with the descending gradient method from samples of the function will not be the same as the weights found with the method described above,With the Taylor series expansion the realized function will approximate the (known! 
) function very closely in the neighbourhood of the point xo used in the expansion.whereas the realized function obtained after learning will approximate the (unknown! ) function over the entire domain interval of applied samples such that the MSE becomes as small as possible. Example 3.15Let the function we want to approximate be fix) =2x x.With the method described above we want to calculate the weights of a two-layer Perceptron.We approximate the function with the rst three Taylor coefficients.so we need a Perceptron with two neurons in the rst layer.We choose to approximate f(. x) around x0: 1. The rst three Taylor coefficients are 41:1. :1, =0 and a1= l. We select the weights of the two rst-layer neurons arbitrarily (we have to check that the matrix A is not singular): w, m=0.3 Wlll=0'l w, m=0.4 w, ,, =0.2For x0: 1 the values of the outputs ofthe rst-layer neurons become:f, , =0.690 f, ,=o.73tThe matrix: 1 fit fiz A=0 if. ./ inw. .. lft2fiz)i2t0 (f. i~3fii+2fli)w. .i (fr; -3fiz+2fiz)Wm 80. 144 The continuous multi-layer Perceptionbecomes 1 0.690 0.731 A9,0.064 0.l88 0 -0.004 0016For the weights connected with the output neuron we nd with w= A ' it: Wm =56.788 W = l90.867 W =l03.830The function q, ,(. r) realized by the Perceptron with the weights given above is shown in Figure 3.58 together with the function f(x). lhe szime configuration of the network was used to learn the function with the buck-pi'opag: ition grzidient descent l': iriiing rule.The inputs were randomly chosen with n uniform distribution between l iind 3. The initial weights were chosen zity -axis lxaxisFigure 3.58 The function I'C: |il/ .|ILi [upper curve) with ti two-lnycr continuous Percepiruii with two first-latycr neurons and one linear outputneiirnn nntl enleulnictt weights with the Tziylor appmxnnutiun of I'=zt 0.5, and for the second neuron in the rst layer:y,2 < 0.5.The weighted input of the output neuron is . '1 =4 5.297 +6.499y, , + 8.672y.1. 
For the elements of the learning set in the area B,the output y;of the output neuron will be >0.5 as required.For the urea 13, of Figure 3.69 y, ,0.5. For the samples of the learning set in area 132 this will give an output of the neuron in the second layer:y2>0.5 as required.In area A,of Figure 3.69 we have y, , f(vlB) and to class X5 if f(vlB)>f{v| A).This will not.however,be the case if we use the nearest-neighbour method.If one assumes the existence of overlapping probability distributions by which the examples are generated,it is better to use the Bayes method that we will discuss in the next section. . ... , .4 98. 212 The sell-organizing neural network(cm) 5 BO0 1 2 3 4 5 6 7 3 9 10 (mm)x. Figure4.46 Result of vector qtitiiittztition for class XA with a 2 x 2 neural Kohonen network AExample 4.7We consider a one-dimensional case.We have two nite data sets DA and D.The elements of D,are generated according to some Gaussian distribution density function with a mean of ; AA=0.0 and a devizition of a:1.00. The other set is generated by distribution function with [A=2.0l) iiiid deviation a, ,=l.50 (see Figure 4.50),The optimal boundary for classication is I =[.09. The other boundary is t=-4.29 (not given in Figure 4.50). We train with the data set D u tine-dimensional self-organizing neural network algorithms with ve neurons.The / .| l|l of the nal weights of this network A are given in the first row at the bottom of gure 4.50. We do the same for the data set D5. The nal weights of network l3 tire given in the second row.lf we now ztpply the nearest-neighbour method,all inputs in the receptive eld of the leftmost weight of neural net 8 are wrongly classified its elements of class X (see the liist row of Figure 4.50).In the same way all inpltlx in the receptive eld ofthe rightmost weight of the neural net A are wrongly cltixsilieil as members of class XA.IThe Bayes classication 213 (mm) x,. 
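The nearest-neighbour scheme of Example 4.7 can be sketched as follows. The code uses a bare winner-only quantization step in place of the full self-organizing algorithm, and the one-dimensional Gaussian data and all parameters are illustrative.

```python
import random

def quantize(data, n_codes, epochs=30, alpha=0.1, seed=0):
    """Plain competitive learning (quantization only): move the nearest code
    value a fraction alpha towards each presented sample."""
    rng = random.Random(seed)
    codes = [rng.choice(data) for _ in range(n_codes)]
    for _ in range(epochs):
        for v in data:
            i = min(range(n_codes), key=lambda k: abs(codes[k] - v))
            codes[i] += alpha * (v - codes[i])
    return codes

def classify(v, codes_a, codes_b):
    """Nearest-neighbour rule: label by the class owning the closest code."""
    da = min(abs(c - v) for c in codes_a)
    db = min(abs(c - v) for c in codes_b)
    return "A" if da <= db else "B"

rng = random.Random(42)
# Illustrative one-dimensional classes with means 0 and 2, as in Example 4.7.
data_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
data_b = [rng.gauss(2.0, 1.5) for _ in range(200)]

codes_a = quantize(data_a, 5)
codes_b = quantize(data_b, 5)
print(classify(-0.5, codes_a, codes_b), classify(2.5, codes_a, codes_b))
```

As the example in the text shows, inputs falling in the receptive field of an outlying code of the other class are misclassified by this rule; the Bayes method of the next section avoids that.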
Figiire4.47 Result of vector quantization for class X,with a 2 x 2 neural Kohonen network B4.9 The Bayes classification with a self-organizing neural net algorithmIfone knows or assumes that the examples ofdata sets ofdifferent classes are generated according to some underlying probability distribution functions.then the best thing to do is to estimate the parameters of these distributions from the given data sets,and use this information to determine a threshold for classifying new data. ln case of a two-class classication problem (with equal class probabilities pm) and p(B) and equal costs for misclassieation) one assigns an input v,to class XA if the class conditional density function f(v, |A) is larger than the class conditional density function f(v, lB).This method of classifying inputs can be realized by a self-organizing neural net without separately estimating the parameters of the underlying probability distributions,because the neural net will do the job for us. First we have to make a slight modication of the algorithmic adaptation rule.We append both the input vectors v,(hereafter called the master input vector) and the 99. 214 The self ' ' |.3 wgmms Mm """ The Bayes classification 215 @ 9 @ "O O O Q~-- 3 | :|@O %@ 0% { E ..,7 Ellil.@ 9@, (cm) a , @OO'/V [El El ' -O'; @~/ ']t:% 5 D @l D E '3' DE 4 DE EU IE can no (mm) (mm)Figure 4.48 Classication boundary with the nearest-neighbour method forthe represe t t ' h - n a ions in t e neural networks A and B Figure4.49 Classication boundary with the nearest-neighbour method forweight Veda W (hereane H d h the representations in two 4 x 8 neural networks A and B vector V es ,1 ~ r ca e t e master weight vectors) with a so-callcd jlaygK pectively w, ) with a number of components equal to the numbe I classes.The newl d ' - .. 
I weight Vccmr W51 3bPPl'|Cd Input vector will be denoted by y,z [y'_ VI],The appendede denoted by vv, = [w, , w ].If an input vector is an element of the datr t ith kth element of the slave input vector equal toalseifi)t dozskiigtclloais,than we make me then we make that component of the slave input vector e ual ta o(? gTtI1o he kth Class of the initial master weight vector and of the initial slave xgeightovccto C Compznenis Che b 0 V _ _ rareran omy Git : o:': :: naald L The 3lE0l1U'| mlC adaptation rule is then as follows.5 9' ''P VC10l' V(t)= v,- with v(t)= [v, , 7,]. with g(r,s,t) a scalarvalued function with a value between 0 and 1 depending on time and on the distance in the neural net between the winning neuron u,and the neuron ii,to be adapted. To explain the nal result and what is happening during the learning phase we conne ourselves to the two-category classication problem.In this case we can use a one-dimensional slave vector.The one-dimensional slave vectors for master vectors of D,will be given a value 0 and the one-dimensional slave vectors for master vectors of D,will be given the value 1.Due to the vector quantization property of the algorithm the nal weight vectors on the one hand will become similar to the elements 9,.corresponding to the master vectors of the data set DAEXA,and on the other hand similar to vectors i7, corresponding to the examples V, of the data set DBEXE.In a region of the input space (= weight space) where there are only elements of D,the slave element of the weight vectors will be permanently adapted to a value 0, and in a region dominated by elements of D,the slave elements of the weight vectors will be mainly adaptedl.Detennine the winning neuron u,for the master vector v(t) iedy[W. (!).v(1)] =min d, [w, m, v([)]2. Every weight vector VI/ ,[l)= [w, _ W] in the net Wm be changed lo: W+1l= W.(t)+i/ (ms. zJ[m)-w, m] 100. 216 The sell-organizing neural network0.45I i l l 1 I l 0,00 -3.00 I 6.502 2. 
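The master/slave adaptation rule just stated can be sketched for the one-dimensional two-class case. The Gaussian neighbourhood function, the learning-rate and neighbourhood schedules and the data below are illustrative assumptions, not the book's exact settings; the point is that each slave component converges towards the local fraction of class-B samples, so the winning neuron's slave value can be thresholded at 0.5.

```python
import math
import random

def train_master_slave(samples, n_neurons=10, epochs=40, seed=0):
    """One-dimensional self-organizing net whose weights carry a master
    component (the input value) and a slave component (the class label,
    0 or 1), both updated with the same neighbourhood-weighted rule."""
    rng = random.Random(seed)
    master = [rng.uniform(-1, 3) for _ in range(n_neurons)]
    slave = [rng.random() for _ in range(n_neurons)]
    steps = epochs * len(samples)
    t = 0
    for _ in range(epochs):
        for v, label in samples:
            t += 1
            frac = t / steps
            alpha = 0.5 * (1.0 - frac) + 0.01           # decaying rate
            width = max(0.1, n_neurons * (1.0 - frac))  # shrinking radius
            s = min(range(n_neurons), key=lambda r: abs(master[r] - v))
            for r in range(n_neurons):
                g = alpha * math.exp(-((r - s) ** 2) / (2 * width ** 2))
                master[r] += g * (v - master[r])
                slave[r] += g * (label - slave[r])
    return master, slave

rng = random.Random(1)
# Illustrative data: class A around 0 (slave 0), class B around 2 (slave 1).
samples = ([(rng.gauss(0.0, 1.0), 0) for _ in range(150)]
           + [(rng.gauss(2.0, 1.5), 1) for _ in range(150)])
rng.shuffle(samples)

master, slave = train_master_slave(samples)

def classify(v):
    """The winning neuron's slave value decides the class (threshold 0.5)."""
    s = min(range(len(master)), key=lambda r: abs(master[r] - v))
    return "B" if slave[s] > 0.5 else "A"

print(classify(-2.0), classify(4.0))
```

Neurons whose receptive fields straddle the region where the two class densities are equal end with slave values near 0.5, marking the discrimination boundary, as the text explains.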
2 2 2 I I 11211212 2Figure 4.50 Top diagram:two Gaussi: in-distributed.one-dimensional elasscs.First row:weights obiiiinctt afii. -r vector quantization with a onmlimcnsional net of five neurons for data with the left distribution.Second row:weights obtained after vector quantization with a onediniensional net of ve neurons for data with the right distribution.Third row:classication result with nearest-neighbour methodto a value I.In a region of the input space where the unknown class conditional probability density function are the same.f(vl/ )= f(v| B),the number of elements of DA and D will be almost the same.In that region the slave elements of the weight vectors will be adapted as many times to value 1 as to value 0, thus in those regions the slave elements of weight vectors will become equal to 0.5.The nal result will be that a master input vector v, eXA will belong to the receptive eld R(w, ) of a master weight vector w,with a slave element with a value smaller than 0.5, and if vieX, ,, then v, belongs to the receptive eld of a weight vector with a slave element value larger than 0.5.After the learning phase the neural net can be used as a classier:given some input vector v; , one can determine the winning neuron,then the slave value of the corresponding weight vector indicates whethcrt < 0.5) or not 1 > 0.5) the vector belongs to X A. The weight vectors with a slave value of :0.5 will be located near the optimal discrimination curve,We applied the method (4) the two-dimensional classicationThe Hayes classificalion 217045. . v ._ lCll|,V .l ziwo (ttussitin-disiribitted, one dtmcnsititttt hl're45l fElof: ldf: :iTi? lc value of the slave element of the weight vectors . ''. .'-/ let obtained after vector qu. tnitz. ;tttun wgtli a_}iari(ec| iI&in_; _; ti*; {I:3"')'w_ of ten neurons for data or both tsin ul . 
classification result with the Bayes mcthotl.~ ,' 1 .-' I ral net with 10x10 problem discussed in Example _3.8 with artgIiSd1t: :I| ::1:l1-'i 05 neurons.We found a classication error o .p -Example 4.3r _ .i '- 1' n roblem with a one-dimensional input.W ijmislderi l)mi, : giiifzfeydcbjisszl Eiaaiiossialii distribution function with a mean of 6 335 ., i i d b Gaussian I4it=0.0O and deviation i7it=1.00. TheAd: _ta set D?islhieffareincygrgmelemems dlsmbmim?with B f1'00 and E =h-Oowithlilizgtilaarsns gonditiotgal density function.is of DAuDE in the training 56:U351 3 given in Figure 4.51. V _ I .I the rst w= a row below the histogram in Flgure 4'51 '0 lhl/ ". l> 0 3 9 V _ A) fthe ten weight vectors We observe that for input elements v with fiVlBl>fll i .' i - .V ' l . The nal row the slave value is larger than 0.5 and thus will be classified correct y_ _ . h t on:after learning.in Figure 4.51 gives the values of the master weights oft e en neur I t. .,-1 I 101. 218 The self-organizing neural networkClassification of handwritten digits 219Figure 4.52 A three-L ss,lW(| -tllll| ClI. 'l0l| lll data itieation problemExample 4.9We consider a three-category two-diin ional classication problem.The threeclasses were generated with the follow g Gaussian probability distributions (sec Figure 452): class A with lt, =(2, 2) and oA= (2, 2) class B with li. 
.=(l,l) and a, ,:(2, 2) class C with [Ll-Zll).~ 1) and c, =(l,l)The neural network was two-dimen )lt1Il with 10 x 10 neurons.The input vectors of the data set were extended with it e vector with three components.The rst component is equal to I if the input vector is an element of class A;if not.then the value will be zero.The second component is only equal to 1 for input vectors from class B and the third component is only I for elements of class C.The weight vectors of the neurons were ve-dimensional with the rst two representing the master vector and the remaining three the slave vector.All weight values were randomly initialized. The results after training with 1000 examples (lO()0 learning steps] are represented in Figure 4.53. The . r. y coordinates of ' symbol (A,B or C) give the values of the rst and second weight components These master weight vectors represent the quantized data set.lfa symbol is equal to A,then the rst component ofthe pertinent slave weight vector has the largest value.'l'hc same holds for the symbols l3' and 'C' with respect to the second and third slave components.The four bold face symbols in Figure 4.53 are incorrect.The three riglitmost vectors with symbol A have to beFigure4.53 The weights after learning with the data of Figure 4.52 in a two-dimensional 10 x 10 neural net A weight is labelled as A if the first slave element is larger than the other two slave elements.Clrisscs B and C are labelled similarlyclassied as elements of class B.The bold face symbol C must be an A.A larger neural net would improve the results.I4.10 Application of the self-organizing neural net algorithm to the classification of handwritten digitsIf we sample dierent pictures of a some class of pictures (in our case handwritten representations of some digit,see Figure 4.54] with the window introduced in Section 4.2. 
and present in a learning phase the observation vectors v obtained by that window to a self-organizing net, then the weight vectors will become similar to those observations that are common to all pictures in that class. In this way the topological features of a class of pictures will be stored in the weight vectors of the neural net.
We performed a classification experiment for handwritten digits. Figure 4.55 shows some examples of handwritten digits. We used the nearest-neighbour method discussed in Section 4.8, with ten two-dimensional self-organizing neural networks of 7 x 7 neurons, one network for each class of handwritten digits: '0', '1', ..., '9'.

Figure 4.55 Examples of handwritten digits

Each handwritten digit was presented in a square of 30 x 40 pixels. The centre of the observation window can be placed at 30 x 40 different locations in a picture, giving 1200 different observation vectors v_i for each example of a digit. Each net was trained with 10 x 1200 observations from ten different handwritten examples of one type of digit. In the learning phase each observation vector was presented twice to the neural network. If a network was trained with examples of digit i, we denote that network by N_i. After learning, fifteen new handwritten examples of each class were used as a test set. For each example 1200 observations were presented to the ten neural networks. An observation vector v obtained by sampling an example was assigned to net i if |v - w_i| = min_j |v - w_j|, with w_i and w_j the weight vectors of the winning neurons in neural nets N_i and N_j respectively. If the majority of the 1200 observation vectors of one example was assigned to neural network N_k, then the example was classified as the digit k.
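The nearest-neighbour voting procedure described above can be sketched as follows (an illustrative fragment; the function name and the toy data in the test are ours and only stand in for the trained digit nets):

```python
import numpy as np

def classify_digit(observations, nets):
    """Assign a digit by majority vote over all observation vectors.

    observations: array of shape (n_obs, dim), window samples of one digit.
    nets: list of 10 weight matrices; nets[i] has shape (n_neurons, dim)
          and holds the trained weights of net N_i (digit i).
    """
    votes = np.zeros(10, dtype=int)
    for v in observations:
        # Distance of v to the winning (closest) neuron of each net N_i.
        best = [np.min(np.linalg.norm(net - v, axis=1)) for net in nets]
        votes[int(np.argmin(best))] += 1   # assign v to the closest net
    return int(np.argmax(votes))           # the majority decides the digit
```

Presenting only a random subset of the observation vectors, as in the speed-up reported below, amounts to calling the same function with fewer rows in `observations`.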
In Figure 4.56 we have given an outline of the classification procedure and the result of the classification for a handwritten representation of the digit '6'. (Note that we can use the typical distribution of the allocations of observation vectors of some digit to the different nets as a criterion for classification.)

Figure 4.56 Outline of the classification procedure: a 30 x 40 pixel picture is sampled by an observation window with 33 fields, giving 1200 observation vectors of 33 dimensions; the allocations of the 1200 '6' vectors to the ten neural nets of 7 x 7 neurons determine the classification

From the 150 = 10 x 15 examples used in the test set, six examples were wrongly classified, i.e. a score of 96 per cent. Training the ten neural networks took 9 min on an HP 9000; the classification of one example in the test phase took 2 sec. The same classification score was obtained when we used only twenty randomly selected observation vectors out of the 1200 possible observation vectors of a test digit. In this way we obtained a classification time of about 0.1 sec. We see that if a class of pictures has some topological features in common, we can use the neural network for pattern recognition in a straightforward way.

4.11 Topology preservation with a self-organizing algorithm

In Section 4.5 we found that by using the adaptation rule:

    w_r(t+1) = w_r(t) + g(r, s, t)[v(t) - w_r(t)]

we are minimizing the error function:

    E(W) = ½ Σ_s Σ_{v_i ∈ R(w_s)} Σ_r p(v_i) h(r, s, t) |v_i - w_r|²

We could distinguish two different learning phases:

1. The quantization phase (final phase). In this phase we have for the neighbourhood function h(r, s, t) = 0 for r ≠ s and h(r, s, t) = 1 for r = s, and we are minimizing:

    E_Q(W) = ½ Σ_s Σ_{v_i ∈ R(w_s)} p(v_i) |v_i - w_s|²

The quantization phase is characterized by the property of vector quantization: an input data space of M d-dimensional vectors will be replaced by a smaller representative set of N d-dimensional weight vectors of the neural net.
2. The ordering phase (initial phase). In this phase we are minimizing:

    E_O(W, t) = ½ Σ_r Σ_s p(w_s) h(r, s, t) |w_r - w_s|²

The ordering phase is characterized by the property that the neural net will be ordered: neighbouring neurons in the network will obtain similar weights. The net is well-ordered if neighbouring neurons have adjacent receptive fields.
The approximation of the ordering error E_O(W, t) by:

    E_O(W) = Σ_r Σ_s c_rs |w_r - w_s|

reveals more directly the property that we are reducing the sum of weighted mutual distances between all weight vectors in the ordering phase. The weight factor c_rs is large for close neighbours (d_rs = 1) of neurons in the neural net and will be small for distant neighbours (d_rs >> 1).
The properties mentioned above deal mainly with the mapping of input vectors to weight vectors. Besides this quantization mapping φ from the space of input vectors V to the space of weight vectors W, there is a mapping Φ (the projection mapping) from W to the lattice L of neurons, because with each weight vector there is associated a neuron in the neural net L. So we obtain a so-called feature mapping ψ = Φ∘φ from V to the lattice L. In general the input vectors are obtained by observations (measurements) of some object space O (e.g. pictures or signals observed by some window). The observation mapping will be represented by the symbol ω (see Figure 4.57). Frequently one is only interested in the representation of the input space V by the weight vectors of W, and one disregards the position of the neurons associated with the weight vectors.

Figure 4.57 The set of interrelated mappings: the observation ω from the object space O to the input vector space V, the quantization mapping φ from V to the weight vector space W, the projection Φ from W to the neural lattice L, and the feature mapping ψ from V to L
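The quantization error E_Q above is easy to evaluate for a given set of weights (a minimal sketch; uniform sample probabilities p(v_i) = 1/M are assumed by default):

```python
import numpy as np

def quantization_error(w, data, p=None):
    """E_Q = 1/2 * sum_i p(v_i) |v_i - w_s|^2, where w_s is the weight
    vector of the winning (closest) neuron for sample v_i."""
    if p is None:
        p = np.full(len(data), 1.0 / len(data))   # uniform p(v_i)
    # Squared distances of every sample to every weight vector.
    d2 = ((data[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
    return 0.5 * float(np.sum(p * d2.min(axis=1)))  # winner per sample
```

When the weights are exact copies of the data vectors the error is zero; otherwise it measures how well the N weight vectors represent the M data vectors.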
In more sophisticated applications of the self-organizing neural net algorithm, the feature mapping ψ from the vector space V to the lattice L of neurons is used. We will say that an input vector v_i is represented by neuron u_s if u_s is the winning neuron when we present vector v_i to the neural net: u_s = ψ(v_i). It is frequently desired that similar input vectors are represented by the same neuron or by neighbouring neurons. This will not always be the case. If, for example, the training set D consists of many two-dimensional uniformly distributed vectors, and one is using a one-dimensional neural net (with two-dimensional weight vectors) with nine neurons, then the weight vectors will be uniformly distributed over the input space. The sequence of neurons u_i associated with the weight vectors w_i forms a chain through the input space. The input space will be divided into equally sized receptive fields R(w_i) (see, for example, Figure 4.58). Similar input vectors on both sides of the border of, for example, R(w_2) and R(w_3) are represented by the neighbouring neurons u_2 and u_3, but similar input vectors on both sides of the border of R(w_2) and R(w_9) are represented by neurons u_2 and u_9, which are not neighbours at all. If we had used instead a two-dimensional neural lattice, then similar input vectors would always have been represented by the same neuron or by neighbouring neurons with distance 1 (see Figure 4.59).
In Figure 4.59 the mutual relative position of two vectors v_i and v_j in the input space V is to a certain extent preserved by the mutual relative position of the neurons ψ(v_i) and ψ(v_j): the metric of the input space is preserved. A mapping from a metric space V with distance measure d_V to a metric space L with distance measure d_L is metric preserving if the triangular inequality property is preserved, i.e. if

    d_V(v_i, v_k) ≤ d_V(v_i, v_j) + d_V(v_j, v_k)

then

    d_L[ψ(v_i), ψ(v_k)] ≤ d_L[ψ(v_i), ψ(v_j)] + d_L[ψ(v_j), ψ(v_k)]
For metric preservation it is required that if the distance between two vectors from V is small, they will be represented by the same neuron or by neighbouring neurons in the neural net.

Figure 4.58 Two-dimensional data represented by two-dimensional weight vectors in a one-dimensional neural network

Figure 4.59 Metric preservation. Left figure: input space. Right figure: neural lattice with the projection u_s of v_s and the projection u_t of v_t

This requirement can be stated as follows. If two receptive fields R(w_i) and R(w_j) in V are adjacent (the common border cR_ij = R(w_i) ∩ R(w_j) is not empty), then neuron u_i with weight vector w_i is a neighbour of neuron u_j with weight vector w_j. If this property holds for all receptive fields, we say that the feature mapping is topology preserving. One may note that topology preservation is the complement of the well-ordering property. In Figure 4.58 there is no topology preservation, while in Figure 4.59 there is topology preservation.

Figure 4.60 Two adjacent receptive fields

Because the self-organizing algorithm is minimizing the error function E(W), topology preservation will in general not be obtained when the dimension of the neural lattice is smaller than the dimension of the input space (see also Figure 4.58). A continuous path in V then results in a corresponding track in the lattice L, with in general large jumps between the successive winning neurons.
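The behaviour discussed in this section is easy to reproduce with a minimal version of the self-organizing algorithm (an illustrative sketch; the linearly decaying learning rate and the Gaussian neighbourhood with shrinking width are choices of this sketch, not prescribed by the text):

```python
import numpy as np

def train_som(data, n_neurons, epochs=20, seed=0):
    """Train a one-dimensional self-organizing net (a chain of neurons).

    Each step: find the winning neuron s for input v, then move every
    weight w_r towards v with step size eta(t) * h(r, s, t).
    """
    rng = np.random.default_rng(seed)
    w = rng.uniform(0, 1, size=(n_neurons, data.shape[1]))
    steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for v in rng.permutation(data):
            s = np.argmin(np.linalg.norm(w - v, axis=1))    # winning neuron
            eta = 0.5 * (1 - t / steps)                     # decaying rate
            sigma = 1 + (n_neurons / 2) * (1 - t / steps)   # shrinking width
            r = np.arange(n_neurons)
            h = np.exp(-((r - s) ** 2) / (2 * sigma ** 2))  # neighbourhood
            w += (eta * h)[:, None] * (v - w)
            t += 1
    return w

# A chain of nine neurons spread over uniform two-dimensional data:
rng = np.random.default_rng(1)
data = rng.uniform(0, 1, size=(500, 2))
w = train_som(data, 9)
```

Plotting `w` reproduces the situation of Figure 4.58: the chain covers the square, but neurons that are far apart in the lattice can end up owning adjacent receptive fields.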
It might, however, occur that we can obtain topology preservation in a restricted sense if the training set D consists of m-dimensional vectors, whereas the neural lattice has a dimension n much smaller than m. This will be the case if the components of the m-dimensional vectors of the training set D are interrelated such that they can be represented by points in an n-dimensional space while preserving the topology for the elements of D. A trivial example is the set of three-dimensional vectors {[1, 1, 2], [2, 1, 4], [3, 1, 6], [4, 1, 8]}, which can be placed in a one-dimensional row by a mapping ψ(v) = v_1 while preserving, restricted to the set D, the property that if d_V(v_i, v_k) ≤ d_V(v_i, v_j) + d_V(v_j, v_k) then the corresponding lattice distances satisfy the same inequality.

Figure 4.61 Input/weight space. The crosses represent nineteen input vectors. The squares represent the weight vectors in a two-dimensional net with thirty-two neurons

If the neural net contains more neurons than there are elements in the training set, then, due to the vector quantization property, there will be at the final phase of learning a set of weight vectors that are copies of all the input vectors of the training set. During training of the neural net the redundant weights will also be adapted towards the input vectors corresponding to the weight vectors of the surrounding winning neurons. In this way the redundant weight vectors will obtain values that one would find by interpolation between the values of the training set.

Example 4.10
In Figure 4.61 we have given the result of training a two-dimensional neural net with thirty-two neurons on a training set with nineteen two-dimensional input vectors, represented in Figure 4.61 by crosses. The final weights of the neurons are given by small squares.
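The interpolation property can be illustrated with the same kind of minimal one-dimensional net (an illustrative sketch with arbitrary schedules; the training set of five collinear points is ours):

```python
import numpy as np

# Five training vectors on a straight line in the plane (y = 2x).
data = np.array([[x, 2.0 * x] for x in [0.0, 0.25, 0.5, 0.75, 1.0]])

rng = np.random.default_rng(0)
n = 15                                   # more neurons than training vectors
w = rng.uniform(0, 1, size=(n, 2))
steps = 4000
for t in range(steps):
    v = data[rng.integers(len(data))]
    s = np.argmin(np.linalg.norm(w - v, axis=1))        # winning neuron
    eta = 0.5 * (1 - t / steps)
    sigma = 1 + (n / 2) * (1 - t / steps)
    h = np.exp(-((np.arange(n) - s) ** 2) / (2 * sigma ** 2))
    w += (eta * h)[:, None] * (v - w)

# The redundant weights settle between the copies of the training vectors,
# i.e. close to the line y = 2x on which the training set lies.
residual = np.abs(w[:, 1] - 2 * w[:, 0])
```

All fifteen weight vectors end up on (or very near) the line through the five training points, even though ten of them never coincide with a training vector.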
A frequently undesired interpolation of weight vectors between input vectors of the data set can.however.also occur if the number of elements in the data set is larger than the number of neurons but the elements of the data set D are separated by a relatively large empty area.We will explain this phenomenon shortly.In Section 4.5 we found that the algorithm tries to minimize the value of tw, wj| for all pairs of wcigl; t.vectors (w, .wJ).By introducing a weight vector wk between w,and W1. if lw, w, ] is large.a iriuch smaller value of the replacement | w,-wkl+lw, w, | can be obtained The weight vector w,is not.however.representing the input vectors. Example 4.11In Figure 4.62 we have given the result of trtiiiting a two-dimensional network of 5 X S neurons with 1000 input vectors uniformly distributed on a circle.Small circles Figure -0.61 Representation of 1000 input vectors uniformly distributed on a circle by weight vectors in at 5 x5 neural networkMasterslave and multi-net decomposition 229represent the nal weight vectors.Weight vectors are connected with a straight line if the corresponding neurons are neighbours in the neural lattice.|4.13 Masterslave and multi-net decomposition of the sell-organizing neural net algorithmln this section we will discuss the decomposition of input vectors and the application of the selfvorganizing algorithm to the different parts of the decomposed input vector. First we will discuss the Ill(l. tlL'l'-SItlLr. zleenniposiliziti as already applied in Section 4.9 on the Bayes classifier. 
In several applications of the self-organizing algorithm the set of data vectors consists of pairs of vectors. For instance, in the case of function identification from samples, the training set contains pairs of argument vectors and function-value vectors. In the case of function identification we want generalization (or interpolation) from samples. To obtain a proper result the different parts of the vectors of the training set must be treated differently: we want quantization of the argument values and a representation and interpolation of the function values.
One part of the data vectors will be treated by the self-organizing algorithm in the same way as was done in previous sections. We call that part of the input vector the master input vector v̄. The second part of the input vector, called the slave input vector and denoted by ṽ, will not be used to find the winner in the neural net and will only be used to adapt the weight vectors in the neural net. The total input vector will be denoted by v = [v̄, ṽ]. In the same way we make a decomposition of the weight vectors of the neural lattice. One part is called the master weight vector w̄_r and the other part is called the slave weight vector w̃_r. The total weight vector will be denoted by w_r = [w̄_r, w̃_r].
The algorithmic adaptation rule is then as follows. Given some training vector v(t) = [v̄, ṽ]:

1. Determine the winning neuron u_s for the master vector v̄(t), i.e.

    d_V[w̄_s(t), v̄(t)] = min_r d_V[w̄_r(t), v̄(t)]

2. Every weight vector w_r(t) = [w̄_r, w̃_r] in the net will be changed to:

    w_r(t+1) = w_r(t) + g(r, s, t)[v(t) - w_r(t)]

with g(r, s, t) a scalar-valued adaptation function as discussed in Section 4.3.

From the algorithmic adaptation rule we conclude that the master input vectors and the master weight vectors are manipulated as in the previous sections. The preceding theory will thus hold in the same way for the master vectors.
Due to the vector quantization property of the algorithm, the final complete weight
vectors w_r will become similar to the elements v, and thus the slave weight vectors will also become similar to the slave input vectors.
If the number of neurons in the neural lattice is larger than the number of input vectors, then the values of the weight vectors of the redundant neurons will be interpolated between the weight vectors that are similar to the input vectors.
In the next two sections we apply the method of master-slave decomposition to function identification and to the control of a robot arm.
We found in Section 4.11 that it is preferable to have the dimension of the neural net equal to the dimension of the input space. Frequently the dimension of the input space is large and would thus require the dimension of the neural lattice to be similarly large. But the required number of neurons grows exponentially with the dimension of the neural lattice. If the number of neurons per dimension is equal to d and the dimension of the neural net is equal to m, then d^m neurons are required. This number may become larger than the number of elements in the training set, and then we cannot use the neural net for proper vector quantization. Moreover, if the dimension of the input space is larger than the dimension of the neural net, the property of topology preservation is lost; the greater the difference in dimension, the greater the number of defects. Therefore we frequently want the dimension m of the neural net to be low and equal (or close) to the dimension of the input vectors. A solution to this problem is to use the multi-net decomposition method.
The multi-net decomposition method is straightforward: we divide the input vector in some way into k parts and we use k different neural nets. The pth neural net is trained with the pth part of the vectors of the training set.
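The two-step master-slave rule given above can be sketched as follows (an illustrative fragment; the Gaussian form of g(r, s, t) and the explicit lattice-coordinate array are choices of this sketch, not prescribed by the text):

```python
import numpy as np

def master_slave_step(w, v, dim_master, eta, sigma, grid):
    """One master-slave update of a self-organizing net.

    w:    (n_neurons, dim) weights; columns [0:dim_master] are the master part.
    v:    (dim,) training vector [v_master, v_slave].
    grid: (n_neurons, lattice_dim) lattice coordinates of the neurons.
    """
    # 1. The winner is found with the MASTER parts only.
    d = np.linalg.norm(w[:, :dim_master] - v[:dim_master], axis=1)
    s = np.argmin(d)
    # 2. The COMPLETE weight vectors are adapted towards the complete input.
    lattice_dist = np.linalg.norm(grid - grid[s], axis=1)
    g = eta * np.exp(-lattice_dist ** 2 / (2 * sigma ** 2))
    w += g[:, None] * (v - w)
    return w
```

Because the slave part never influences the winner, the master weights quantize the argument space exactly as before, while the slave weights track the values paired with those arguments.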
If we divide the original m-dimensional input vector into k equal parts of dimension m/k and use k neural networks of dimension m/k, and the number of neurons in each dimension of all subnetworks is d, then the number of neurons is reduced from d^m to k·d^(m/k). In Section 4.16 we will apply the multi-net decomposition method to EEG analysis.

4.14 Application of the self-organizing algorithm to function identification

Assume we have several pairs of argument values and function values of some unknown function, and we want to know the functional relationship between arguments and function values. If we do not require a mathematical description of the functional relationship but are satisfied with a (hardware or software) realization of the function in a restricted domain, then we can use a neural network to approximate the unknown function. In Chapter 3 we have shown how we can use a continuous multi-layer Perceptron to identify an unknown function. We can, however, also use a self-organizing neural net algorithm for that purpose. The main difference will be that we obtain a quantized version of the unknown function.
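A quantized realization of a function is essentially a nearest-neighbour lookup over stored (argument, value) pairs; a minimal sketch of that idea (the stored samples below are illustrative):

```python
import numpy as np

class QuantizedFunction:
    """Realize a function by a finite set of (argument, value) weight pairs:
    the output for x is the stored value of the nearest stored argument."""

    def __init__(self, args, values):
        self.args = np.asarray(args, dtype=float)
        self.values = np.asarray(values, dtype=float)

    def __call__(self, x):
        i = np.argmin(np.abs(self.args - x))   # nearest stored argument
        return self.values[i]

# Ten stored samples of y = 10 x^2 on [-0.3, +0.3]:
xs = np.linspace(-0.3, 0.3, 10)
f = QuantizedFunction(xs, 10 * xs ** 2)
```

The output is a staircase approximation of the function; its resolution is set by the number of stored pairs, which in the neural-net realization is the number of neurons.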
At first glance one is inclined to make training vectors composed of pairs of argument and function values. If we use these training vectors we know that the final weight vectors will be copies of these training vectors, and if the number of neurons is larger than the number of training vectors we also obtain weight vectors by interpolation between the training vectors. The weight vectors are then also composed of pairs of argument and function values; the first part corresponds with an argument and the second part with a function value. After training, we present some argument of the function and determine the weight vector with an argument part at minimal distance to the presented argument. Then we read in the pertinent weight vector the second part as the desired function value. However, this method will give incorrect results if there are weight vectors interpolated between the weight vectors that are copies of the training vectors. Interpolated weight vectors will be located in the area between the curves representing the functional relationship.

Example 4.12
In an experiment we applied 1000 training samples [x_i, y_i] of the function y = 10x², with x_i in the interval [-0.3, +0.3], to the self-organizing algorithm. The neural net contained fifty neurons. The result is given in Figure 4.63 (the line represents the training samples and the dots the [x, y] values of the weight vectors). Observe that even in this case, where the number of elements in the training set is much larger than the number of neurons, we obtain interpolated weight vectors that are wrongly located.

Figure 4.63 Representation of 1000 pairs of argument and function values of the function y = 10x² by two-dimensional weight vectors (dots) in a one-dimensional neural net of fifty neurons. Input argument values in [-0.3, +0.3]

In a second experiment we used 1000 samples [x_i, y_i] of the function y = 10x² from the domain [-3, +3]. The result, given in Figure 4.64, is even worse. If we use more samples (i.e. 10000) and extend the learning phase to 100000 steps, then we obtain the result given in Figure 4.65.

Figure 4.64 The representation of 1000 pairs of argument and function values of the function y = 10x² by two-dimensional weight vectors (dots) in a one-dimensional neural net of fifty neurons. Input argument values in [-3, +3]

Figure 4.65 The representation of 10000 pairs of argument and function values of the function y = 10x² by two-dimensional weight vectors (dots) in a one-dimensional neural net of fifty neurons. Input argument values in [-0.3, +0.3]

Proper function identification with the self-organizing neural net algorithm can, however, be obtained by using the master-slave method presented in Section 4.13. If the argument x of the unknown function is m-dimensional, then we use an m-dimensional neural net and take x as the master vector. The corresponding n-dimensional function-value vector y is used as the slave vector. If there are enough samples, then the x values of the weight vectors will be ordered in a regular m-dimensional lattice. The slave elements will be copies of the function values, and if there are more neurons than samples, additional y values will be interpolated between the y values that are copies of the slave training samples.

Figure 4.66 The master-slave method. The representation of 1000 pairs of argument and function values of the function y = 10x² by two-dimensional master-slave weight vectors (dots) in a one-dimensional neural net of fifty neurons. Input argument values in [-0.3, +0.3]
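The master-slave experiment of Examples 4.12 and 4.13 can be reproduced in a few lines (an illustrative sketch; the learning-rate and neighbourhood schedules are ours, not those of the book's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                     # one-dimensional net, fifty neurons
w = rng.uniform(-0.3, 0.3, size=(n, 2))   # columns: [master x, slave y]
steps = 20000
for t in range(steps):
    x = rng.uniform(-0.3, 0.3)
    v = np.array([x, 10 * x ** 2])
    # The winner is selected with ONLY the master (argument) component.
    s = np.argmin(np.abs(w[:, 0] - x))
    eta = 0.5 * (1 - t / steps)
    sigma = 1 + (n / 2) * (1 - t / steps)
    g = eta * np.exp(-((np.arange(n) - s) ** 2) / (2 * sigma ** 2))
    # The complete weight vector (master and slave) is adapted.
    w += g[:, None] * (v - w)

# The slave weights now follow the curve y = 10 x^2 at the quantized
# argument values stored in the master weights (cf. Figure 4.66).
error = np.max(np.abs(w[:, 1] - 10 * w[:, 0] ** 2))
```

In contrast with the naive joint-vector training of Example 4.12, no weight vector can end up between the branches of the curve, because the slave part never influences the competition.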
Figure 4.67 The master-slave method. The representation of 1000 pairs of argument and function values of the function y = 10x² by two-dimensional master-slave weight vectors (dots) in a one-dimensional neural net of fifty neurons. Input argument values in [-3, +3]

Example 4.13
If we repeat the experiments mentioned in the previous example with the master-slave method, then we obtain under the same conditions the results given in Figures 4.66 and 4.67 respectively.

4.15 Application of the self-organizing algorithm to robot arm control

Suppose we use a monitor connected to a camera to observe an object on a square table. The coordinates of the table top will be denoted by u and v. We want a robot to learn to grasp the object from the table given the x and y position of the object on the monitor screen (not u and v). For simplicity our robot consists of an arm with two parts moving in a horizontal plane (see Figure 4.68). With two servomotors we can control the two angles θ₁ and θ₂ in order to reach every point on the table. In a training phase the object is placed somewhere on the table and we form a vector v with the observed values of x, y, θ₁ and θ₂ as the four components. Note

Figure 4.69 Initial representation of table coordinates by the two-dimensional master-slave weight vectors (dots in the right-hand figure) in a two-dim