
ARTIFICIAL NEURAL NETWORKS TECHNOLOGY

A DACS State-of-the-Art Report

Contract Number F30602-89-C-0082 (Data & Analysis Center for Software)

ELIN: A011

August 20, 1992

Prepared for:

Rome Laboratory RL/C3C

Griffiss AFB, NY 13441-5700

Prepared by:

Dave Anderson and George McNeill

Kaman Sciences Corporation
258 Genesee Street

Utica, New York 13502-4627


TABLE OF CONTENTS

1.0 Introduction and Purpose
2.0 What are Artificial Neural Networks?
    2.1 Analogy to the Brain
    2.2 Artificial Neurons and How They Work
    2.3 Electronic Implementation of Artificial Neurons
    2.4 Artificial Network Operations
    2.5 Training an Artificial Neural Network
        2.5.1 Supervised Training
        2.5.2 Unsupervised, or Adaptive Training
    2.6 How Neural Networks Differ from Traditional Computing and Expert Systems
3.0 History of Neural Networks
4.0 Detailed Description of Neural Network Components and How They Work
    4.1 Major Components of an Artificial Neuron
    4.2 Teaching an Artificial Neural Network
        4.2.1 Supervised Learning
        4.2.2 Unsupervised Learning
        4.2.3 Learning Rates
        4.2.4 Learning Laws
5.0 Network Selection
    5.1 Networks for Prediction
        5.1.1 Feedforward, Back-Propagation
        5.1.2 Delta Bar Delta
        5.1.3 Extended Delta Bar Delta
        5.1.4 Directed Random Search
        5.1.5 Higher-order Neural Network or Functional-link Network
        5.1.6 Self-Organizing Map into Back-Propagation
    5.2 Networks for Classification
        5.2.1 Learning Vector Quantization
        5.2.2 Counter-propagation Network
        5.2.3 Probabilistic Neural Network
    5.3 Networks for Data Association
        5.3.1 Hopfield Network
        5.3.2 Boltzmann Machine
        5.3.3 Hamming Network
        5.3.4 Bi-directional Associative Memory
        5.3.5 Spatio-Temporal Pattern Recognition (Avalanche)
    5.4 Networks for Data Conceptualization
        5.4.1 Adaptive Resonance Network
        5.4.2 Self-Organizing Map
    5.5 Networks for Data Filtering
        5.5.1 Recirculation
6.0 How Artificial Neural Networks Are Being Used
    6.1 Language Processing
    6.2 Character Recognition
    6.3 Image (data) Compression
    6.4 Pattern Recognition
    6.5 Signal Processing
    6.6 Financial
    6.7 Servo Control
    6.8 How to Determine if an Application is a Neural Network Candidate
7.0 New Technologies that are Emerging
    7.1 What Currently Exists
        7.1.1 Development Systems
        7.1.2 Hardware Accelerators
        7.1.3 Dedicated Neural Processors
    7.2 What the Next Developments Will Be
8.0 Summary
9.0 References


List of Figures

Figure 2.2.1 A Simple Neuron
Figure 2.2.2 A Basic Artificial Neuron
Figure 2.2.3 A Model of a "Processing Element"
Figure 2.2.4 Sigmoid Transfer Function
Figure 2.4.1 A Simple Neural Network Diagram
Figure 2.4.2 Simple Network with Feedback and Competition
Figure 4.0.1 Processing Element
Figure 4.1.1 Sample Transfer Functions
Figure 5.0.1 An Example Feedforward Back-propagation Network
Figure 5.2.1 An Example Learning Vector Quantization Network
Figure 5.2.2 An Example Counter-propagation Network
Figure 5.2.3 A Probabilistic Neural Network Example
Figure 5.3.1 A Hopfield Network Example
Figure 5.3.2 A Hamming Network Example
Figure 5.3.4 Bi-directional Associative Memory Example
Figure 5.3.5 A Spatio-temporal Pattern Network Example
Figure 5.4.2 An Example Self-organizing Map Network
Figure 5.5.1 An Example Recirculation Network

List of Tables

Table 2.6.1 Comparison of Computing Approaches
Table 2.6.2 Comparisons of Expert Systems and Neural Networks
Table 5.0.1 Network Selector Table


1.0 Introduction and Purpose

This report is intended to help the reader understand what Artificial Neural Networks are, how to use them, and where they are currently being used.

Artificial Neural Networks are being touted as the wave of the future in computing. They are indeed self-learning mechanisms which don't require the traditional skills of a programmer. But unfortunately, misconceptions have arisen. Writers have hyped that these neuron-inspired processors can do almost anything. These exaggerations have created disappointments for some potential users who have tried, and failed, to solve their problems with neural networks. These application builders have often come to the conclusion that neural nets are complicated and confusing. Unfortunately, that confusion has come from the industry itself. An avalanche of articles has appeared touting a large assortment of different neural networks, all with unique claims and specific examples. Currently, only a few of these neuron-based structures, paradigms actually, are being used commercially. One particular structure, the feedforward, back-propagation network, is by far the most popular. Most of the other neural network structures represent models for "thinking" that are still evolving in the laboratories. Yet, all of these networks are simply tools, and as such the only real demand they make is that the network architect learn how to use them.

This report is intended to help that process by explaining these structures, right down to the rules on how to tweak the "nuts and bolts." This report also discusses what types of applications currently utilize the different structures and how some structures lend themselves to specific solutions.

In reading this report, a reader who wants a general understanding of neural networks should read sections 2, 3, 6, 7, and 8. These sections provide an understanding of neural networks (section 2), their history (section 3), how they are currently being applied (section 6), the tools to apply them plus the probable future of neural processing (section 7), and a summary of what it all means (section 8). A more serious reader is invited to delve into the inner workings of neural networks (section 4) and the various ways neural networks can be structured (section 5).


2.0 What are Artificial Neural Networks?

Artificial Neural Networks are relatively crude electronic models based on the neural structure of the brain. The brain basically learns from experience. It is natural proof that some problems that are beyond the scope of current computers are indeed solvable by small, energy-efficient packages. This brain modeling also promises a less technical way to develop machine solutions. This new approach to computing also provides a more graceful degradation during system overload than its more traditional counterparts.

These biologically inspired methods of computing are thought to be the next major advancement in the computing industry. Even simple animal brains are capable of functions that are currently impossible for computers. Computers do rote things well, like keeping ledgers or performing complex math. But computers have trouble recognizing even simple patterns, much less generalizing those patterns of the past into actions of the future.

Now, advances in biological research promise an initial understanding of the natural thinking mechanism. This research shows that brains store information as patterns. Some of these patterns are very complicated and allow us the ability to recognize individual faces from many different angles. This process of storing information as patterns, utilizing those patterns, and then solving problems encompasses a new field in computing. This field, as mentioned before, does not utilize traditional programming but involves the creation of massively parallel networks and the training of those networks to solve specific problems. This field also utilizes words very different from traditional computing, words like behave, react, self-organize, learn, generalize, and forget.

2.1 Analogy to the Brain

The exact workings of the human brain are still a mystery. Yet, some aspects of this amazing processor are known. In particular, the most basic element of the human brain is a specific type of cell which, unlike the rest of the body, doesn't appear to regenerate. Because this type of cell is the only part of the body that isn't slowly replaced, it is assumed that these cells are what provide us with our abilities to remember, think, and apply previous experiences to our every action. These cells, all 100 billion of them, are known as neurons. Each of these neurons can connect with up to 200,000 other neurons, although 1,000 to 10,000 is typical.

The power of the human mind comes from the sheer number of these basic components and the multiple connections between them. It also comes from genetic programming and learning.


The individual neurons are complicated. They have a myriad of parts, sub-systems, and control mechanisms. They convey information via a host of electrochemical pathways. There are over one hundred different classes of neurons, depending on the classification method used. Together these neurons and their connections form a process which is not binary, not stable, and not synchronous. In short, it is nothing like the currently available electronic computers, or even artificial neural networks.

These artificial neural networks try to replicate only the most basic elements of this complicated, versatile, and powerful organ. They do it in a primitive way. But for the software engineer who is trying to solve problems, neural computing was never about replicating human brains. It is about machines and a new way to solve problems.

2.2 Artificial Neurons and How They Work

The fundamental processing element of a neural network is a neuron. This building block of human awareness encompasses a few general capabilities. Basically, a biological neuron receives inputs from other sources, combines them in some way, performs a generally nonlinear operation on the result, and then outputs the final result. Figure 2.2.1 shows the relationship of these four parts.

The four parts of a typical nerve cell:

- Dendrites: accept inputs
- Soma: processes the inputs
- Axon: turns the processed inputs into outputs
- Synapses: the electrochemical contact between neurons

Figure 2.2.1 A Simple Neuron.


Within humans there are many variations on this basic type of neuron, further complicating man's attempts at electrically replicating the process of thinking. Yet, all natural neurons have the same four basic components. These components are known by their biological names - dendrites, soma, axon, and synapses. Dendrites are hair-like extensions of the soma which act like input channels. These input channels receive their input through the synapses of other neurons. The soma then processes these incoming signals over time and turns that processed value into an output, which is sent out to other neurons through the axon and the synapses.

Recent experimental data has provided further evidence that biological neurons are structurally more complex than the simplistic explanation above. They are significantly more complex than the existing artificial neurons that are built into today's artificial neural networks. As biology provides a better understanding of neurons, and as technology advances, network designers can continue to improve their systems by building upon man's understanding of the biological brain.

But currently, the goal of artificial neural networks is not the grandiose recreation of the brain. On the contrary, neural network researchers are seeking an understanding of nature's capabilities with which people can engineer solutions to problems that have not been solved by traditional computing.

To do this, the basic units of neural networks, the artificial neurons, simulate the four basic functions of natural neurons. Figure 2.2.2 shows a fundamental representation of an artificial neuron.

[Figure: inputs x1 ... xn enter the processing element through weights w1 ... wn; the summation forms I = Σ wi xi, and the transfer function produces the output Y = f(I) on the output path.]

Figure 2.2.2 A Basic Artificial Neuron.


In Figure 2.2.2, various inputs to the network are represented by the mathematical symbol, x(n). Each of these inputs is multiplied by a connection weight. These weights are represented by w(n). In the simplest case, these products are simply summed, fed through a transfer function to generate a result, and then output. This process lends itself to physical implementation on a large scale in a small package. This electronic implementation is still possible with other network structures which utilize different summing functions as well as different transfer functions.
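For readers who think in code, here is a minimal sketch of that computation in Python. It is not taken from any particular package; the function name, the example values, and the choice of a sigmoid transfer function are illustrative assumptions.

```python
import math

def neuron_output(inputs, weights):
    # Summation function: multiply each input by its weight and sum the products.
    total = sum(x * w for x, w in zip(inputs, weights))
    # Transfer function: a sigmoid, squashing the sum into a value between 0 and 1.
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical example: three inputs and three hand-picked weights.
print(neuron_output([0.5, 0.9, -0.3], [0.8, 0.2, 0.4]))
```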

Some applications require "black and white," or binary, answers. These applications include the recognition of text, the identification of speech, and the deciphering of scenes in images. These applications are required to turn real-world inputs into discrete values. These potential values are limited to some known set, like the ASCII characters or the most common 50,000 English words. Because of this limitation of output options, these applications don't always utilize networks composed of neurons that simply sum up, and thereby smooth, inputs. These networks may utilize the binary properties of ORing and ANDing of inputs. These functions, and many others, can be built into the summation and transfer functions of a network.

Other networks work on problems where the resolutions are not just one of several known values. These networks need to be capable of an infinite number of responses. Applications of this type include the "intelligence" behind robotic movements. This "intelligence" processes inputs and then creates outputs which actually cause some device to move. That movement can span an infinite number of very precise motions. These networks do indeed want to smooth their inputs which, due to limitations of sensors, come in non-continuous bursts, say thirty times a second. To do that, they might accept these inputs, sum that data, and then produce an output by, for example, applying a hyperbolic tangent as a transfer function. In this manner, output values from the network are continuous and satisfy more real-world interfaces.

Other applications might simply sum and compare to a threshold, thereby producing one of two possible outputs, a zero or a one. Other functions scale the outputs to match the application, such as the values minus one and one. Some functions even integrate the input data over time, creating time-dependent networks.

2.3 Electronic Implementation of Artificial Neurons

In currently available software packages these artificial neurons are called "processing elements" and have many more capabilities than the simple artificial neuron described above. Those capabilities will be discussed later in this report. Figure 2.2.3 is a more detailed schematic of this still simplistic artificial neuron.


[Figure: inputs, scaled by weights w0 ... wn, feed a summation function (sum, max, min, average, OR, AND, etc.), whose result passes through a transfer function (hyperbolic tangent, linear, sigmoid, sine, etc.) to produce the outputs; a learning and recall schedule drives the learning cycle.]

Figure 2.2.3 A Model of a "Processing Element".

In Figure 2.2.3, inputs enter into the processing element from the upper left. The first step is for each of these inputs to be multiplied by their respective weighting factor (w(n)). Then these modified inputs are fed into the summing function, which usually just sums these products. Yet, many different types of operations can be selected. These operations could produce a number of different values which are then propagated forward; values such as the average, the largest, the smallest, the ORed values, the ANDed values, etc. Furthermore, most commercial development products allow software engineers to create their own summing functions via routines coded in a higher-level language (C is commonly supported). Sometimes the summing function is further complicated by the addition of an activation function which enables the summing function to operate in a time-sensitive way.

Either way, the output of the summing function is then sent into a transfer function. This function then turns this number into a real output via some algorithm. It is this algorithm that takes the input and turns it into a zero or a one, a minus one or a one, or some other number. The transfer functions that are commonly supported are sigmoid, sine, hyperbolic tangent, etc. This transfer function also can scale the output or control its value via thresholds. The result of the transfer function is usually the direct output of the processing element. An example of how a transfer function works is shown in Figure 2.2.4.
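To make the idea of selectable summing and transfer operations concrete, here is a small Python sketch. The dictionaries of named operations are an illustrative device, not the interface of any commercial package, and the OR/AND entries are a deliberately crude reading of "ORing and ANDing" of signed values.

```python
import math

# Selectable summation functions, applied to the list of weighted inputs.
SUMMATIONS = {
    "sum": sum,
    "max": max,
    "min": min,
    "average": lambda vals: sum(vals) / len(vals),
    "or": lambda vals: float(any(v > 0 for v in vals)),   # crude ORing
    "and": lambda vals: float(all(v > 0 for v in vals)),  # crude ANDing
}

# Selectable transfer functions, applied to the summation result.
TRANSFERS = {
    "sigmoid": lambda s: 1.0 / (1.0 + math.exp(-s)),
    "hyperbolic tangent": math.tanh,
    "linear": lambda s: s,
    "sine": math.sin,
}

def processing_element(inputs, weights, summation="sum", transfer="sigmoid"):
    weighted = [x * w for x, w in zip(inputs, weights)]
    return TRANSFERS[transfer](SUMMATIONS[summation](weighted))

print(processing_element([1.0, -0.5], [0.6, 0.9], "max", "hyperbolic tangent"))
```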

This sigmoid transfer function takes the value from the summation function, called sum in Figure 2.2.4, and turns it into a value between zero and one.


Figure 2.2.4 Sigmoid Transfer Function.

Finally, the processing element is ready to output the result of its transfer function. This output is then input into other processing elements, or to an outside connection, as dictated by the structure of the network.

All artificial neural networks are constructed from this basic building block - the processing element, or artificial neuron. It is the variety of, and the fundamental differences in, these building blocks that partially cause the implementing of neural networks to be an "art."

2.4 Artificial Network Operations

The other part of the "art" of using neural networks revolves around the myriad of ways these individual neurons can be clustered together. This clustering occurs in the human mind in such a way that information can be processed in a dynamic, interactive, and self-organizing way. Biologically, neural networks are constructed in a three-dimensional world from microscopic components. These neurons seem capable of nearly unrestricted interconnections. That is not true of any proposed, or existing, man-made network. Integrated circuits, using current technology, are two-dimensional devices with a limited number of layers for interconnection. This physical reality restrains the types, and scope, of artificial neural networks that can be implemented in silicon.

Currently, neural networks are the simple clustering of the primitive artificial neurons. This clustering occurs by creating layers which are then connected to one another. How these layers connect is the other part of the "art" of engineering networks to resolve real-world problems.


[Figure: an input layer feeds a hidden layer (there may be several hidden layers), which feeds an output layer.]

Figure 2.4.1 A Simple Neural Network Diagram.

Basically, all artificial neural networks have a similar structure or topology, as shown in Figure 2.4.1. In that structure some of the neurons interface with the real world to receive their inputs. Other neurons provide the real world with the network's outputs. This output might be the particular character that the network thinks that it has scanned or the particular image it thinks is being viewed. All the rest of the neurons are hidden from view.

But a neural network is more than a bunch of neurons. Some early researchers tried to simply connect neurons in a random manner, without much success. Now, it is known that even the brains of snails are structured devices. One of the easiest ways to design a structure is to create layers of elements. It is the grouping of these neurons into layers, the connections between these layers, and the summation and transfer functions that comprise a functioning neural network. The general terms used to describe these characteristics are common to all networks.

Although there are useful networks which contain only one layer, or even one element, most applications require networks that contain at least the three normal types of layers - input, hidden, and output. The layer of input neurons receives the data either from input files or directly from electronic sensors in real-time applications. The output layer sends information directly to the outside world, to a secondary computer process, or to other devices such as a mechanical control system. Between these two layers can be many hidden layers. These internal layers contain many of the neurons in various interconnected structures. The inputs and outputs of each of these hidden neurons simply go to other neurons.

In most networks each neuron in a hidden layer receives the signals from all of the neurons in a layer above it, typically an input layer. After a neuron performs its function it passes its output to all of the neurons in the layer below it, providing a feedforward path to the output. (Note: in section 5 the drawings are reversed; inputs come into the bottom and outputs come out the top.)
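A small Python sketch of that fully connected, layer-to-layer flow follows. The layer sizes, the random starting weights, and the sigmoid transfer are illustrative assumptions, not values from this report.

```python
import math
import random

def layer_forward(inputs, weight_rows):
    # Each neuron in this layer receives the output of every neuron above it.
    return [
        1.0 / (1.0 + math.exp(-sum(x * w for x, w in zip(inputs, row))))
        for row in weight_rows  # one row of weights per neuron in this layer
    ]

random.seed(0)
# 3 inputs -> 4 hidden neurons -> 2 outputs; weights start out random.
hidden_weights = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
output_weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]

hidden = layer_forward([0.2, 0.7, -0.1], hidden_weights)
print(layer_forward(hidden, output_weights))
```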

These lines of communication from one neuron to another are important aspects of neural networks. They are the glue to the system. They are the connections which provide a variable strength to an input. There are two types of these connections. One causes the summing mechanism of the next neuron to add while the other causes it to subtract. In more human terms, one excites while the other inhibits.

Some networks want a neuron to inhibit the other neurons in the same layer. This is called lateral inhibition. The most common use of this is in the output layer. For example, in text recognition, if the probability of a character being a "P" is .85 and the probability of the character being an "F" is .65, the network wants to choose the highest probability and inhibit all the others. It can do that with lateral inhibition. This concept is also called competition.
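In its simplest form, that competition can be sketched as a winner-take-all rule, as in this Python fragment. The first two probabilities reuse the "P"/"F" example above; the third candidate is invented for the example.

```python
def winner_take_all(activations):
    # Lateral inhibition at its simplest: the strongest neuron in the layer
    # keeps its output and suppresses all of the others.
    winner = max(range(len(activations)), key=lambda i: activations[i])
    return [a if i == winner else 0.0 for i, a in enumerate(activations)]

# "P" at .85 beats "F" at .65 (and a third, invented candidate at .10).
print(winner_take_all([0.85, 0.65, 0.10]))  # -> [0.85, 0.0, 0.0]
```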

Another type of connection is feedback. This is where the output of one layer routes back to a previous layer. An example of this is shown in Figure 2.4.2.

[Figure: inputs flow to outputs through layers linked by feedback paths, with competition (or inhibition) among the neurons within a layer.]

Figure 2.4.2 Simple Network with Feedback and Competition.


The way that the neurons are connected to each other has a significant impact on the operation of the network. In the larger, more professional software development packages the user is allowed to add, delete, and control these connections at will. By "tweaking" parameters these connections can be made to either excite or inhibit.

2.5 Training an Artificial Neural Network

Once a network has been structured for a particular application, that network is ready to be trained. To start this process the initial weights are chosen randomly. Then the training, or learning, begins.

There are two approaches to training - supervised and unsupervised. Supervised training involves a mechanism of providing the network with the desired output either by manually "grading" the network's performance or by providing the desired outputs with the inputs. Unsupervised training is where the network has to make sense of the inputs without outside help.

The vast bulk of networks utilize supervised training. Unsupervised training is used to perform some initial characterization on inputs. However, in the full-blown sense of being truly self-learning, it is still just a shining promise that is not fully understood, does not completely work, and thus is relegated to the lab.

2.5.1 Supervised Training.

In supervised training, both the inputs and the outputs are provided. The network then processes the inputs and compares its resulting outputs against the desired outputs. Errors are then propagated back through the system, causing the system to adjust the weights which control the network. This process occurs over and over as the weights are continually tweaked. The set of data which enables the training is called the "training set." During the training of a network the same set of data is processed many times as the connection weights are continually refined.
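The loop below sketches that cycle in Python for a single linear neuron. Back-propagation proper spans multiple layers; this single-neuron delta rule is a simplification that shows the same present-compare-adjust idea. The learning rate, epoch count, and training data are invented for the example.

```python
def train_supervised(training_set, learning_rate=0.1, epochs=100):
    # Weights start at zero here for reproducibility; as noted above,
    # real networks usually start from randomly chosen initial weights.
    weights = [0.0] * len(training_set[0][0])
    for _ in range(epochs):  # the same training set is processed many times
        for inputs, desired in training_set:
            output = sum(x * w for x, w in zip(inputs, weights))
            error = desired - output  # compare actual output to desired output
            weights = [w + learning_rate * error * x  # adjust every weight
                       for w, x in zip(weights, inputs)]
    return weights

# Invented training set consistent with output = 2*x1 - 1*x2.
data = [([1, 0], 2), ([0, 1], -1), ([1, 1], 1), ([2, 1], 3)]
print(train_supervised(data))  # converges toward [2.0, -1.0]
```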

The current commercial network development packages provide tools to monitor how well an artificial neural network is converging on the ability to predict the right answer. These tools allow the training process to go on for days, stopping only when the system reaches some statistically desired point, or accuracy. However, some networks never learn. This could be because the input data does not contain the specific information from which the desired output is derived. Networks also don't converge if there is not enough data to enable complete learning. Ideally, there should be enough data so that part of the data can be held back as a test. Many layered networks with multiple nodes are capable of memorizing data. To determine whether the network is simply memorizing its data in some nonsignificant way, supervised training needs to hold back a set of data to be used to test the system after it has undergone its training. (Note: memorization is avoided by not having too many processing elements.)

If a network simply can't solve the problem, the designer then has to review the inputs and outputs, the number of layers, the number of elements per layer, the connections between the layers, the summation, transfer, and training functions, and even the initial weights themselves. Those changes required to create a successful network constitute a process wherein the "art" of neural networking occurs.

Another part of the designer's creativity governs the rules of training. There are many laws (algorithms) used to implement the adaptive feedback required to adjust the weights during training. The most common technique is backward-error propagation, more commonly known as back-propagation. These various learning techniques are explored in greater depth later in this report.

Yet, training is not just a technique. It involves a "feel," and conscious analysis, to ensure that the network is not overtrained. Initially, an artificial neural network configures itself with the general statistical trends of the data. Later, it continues to "learn" about other aspects of the data which may be spurious from a general viewpoint.

When finally the system has been correctly trained, and no further learning is needed, the weights can, if desired, be "frozen." In some systems this finalized network is then turned into hardware so that it can be fast. Other systems don't lock themselves in but continue to learn while in production use.

2.5.2 Unsupervised, or Adaptive Training.

The other type of training is called unsupervised training. In unsupervised training, the network is provided with inputs but not with desired outputs. The system itself must then decide what features it will use to group the input data. This is often referred to as self-organization or adaptation.

At the present time, unsupervised learning is not well understood. This adaptation to the environment is the promise which would enable science fiction types of robots to continually learn on their own as they encounter new situations and new environments. Life is filled with situations where exact training sets do not exist. Some of these situations involve military action, where new combat techniques and new weapons might be encountered. Because of this unexpected aspect to life and the human desire to be prepared, there continues to be research into, and hope for, this field.


Yet, at the present time, the vast bulk of neural network work is in systems with supervised learning. Supervised learning is achieving results.

One of the leading researchers into unsupervised learning is Teuvo Kohonen, an electrical engineer at the Helsinki University of Technology. He has developed a self-organizing network, sometimes called an auto-associator, that learns without the benefit of knowing the right answer. It is an unusual-looking network in that it contains one single layer with many connections. The weights for those connections have to be initialized and the inputs have to be normalized. The neurons are set up to compete in a winner-take-all fashion.
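One learning step of such a network might be sketched as follows in Python. The two-neuron layer, the learning rate, and the distance-based winner selection are illustrative assumptions rather than Kohonen's exact formulation.

```python
import math

def kohonen_step(inputs, weight_rows, learning_rate=0.5):
    # Normalize the input vector, as the text requires.
    norm = math.sqrt(sum(v * v for v in inputs))
    x = [v / norm for v in inputs]
    # Winner-take-all: the neuron whose weight vector is closest to the input.
    winner = min(range(len(weight_rows)),
                 key=lambda i: sum((a - w) ** 2
                                   for a, w in zip(x, weight_rows[i])))
    # Only the winner learns, pulled toward the input it just won.
    weight_rows[winner] = [w + learning_rate * (a - w)
                           for a, w in zip(x, weight_rows[winner])]
    return winner

weights = [[1.0, 0.0], [0.0, 1.0]]  # weights must be initialized beforehand
print(kohonen_step([0.9, 0.1], weights), weights)
```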

Kohonen continues his research into networks that are structured differently than standard, feedforward, back-propagation approaches. Kohonen's work deals with the grouping of neurons into fields. Neurons within a field are "topologically ordered." Topology is a branch of mathematics that studies how to map from one space to another without changing the geometric configuration. The three-dimensional groupings often found in mammalian brains are an example of topological ordering.

Kohonen has pointed out that the lack of topology in neural network models makes today's neural networks just simple abstractions of the real neural networks within the brain. As this research continues, more powerful self-learning networks may become possible. But currently, this field remains one that is still in the laboratory.

2.6 How Neural Networks Differ from Traditional Computing and Expert Systems

Neural networks offer a different way to analyze data, and to recognize patterns within that data, than traditional computing methods. However, they are not a solution for all computing problems. Traditional computing methods work well for problems that can be well characterized. Balancing checkbooks, keeping ledgers, and keeping tabs on inventory are well defined and do not require the special characteristics of neural networks. Table 2.6.1 identifies the basic differences between the two computing approaches.

Traditional computers are ideal for many applications. They can process data, track inventories, network results, and protect equipment. These applications do not need the special characteristics of neural networks.

Expert systems are an extension of traditional computing and are sometimes called the fifth generation of computing. (First generation computing used switches and wires. The second generation occurred because of the development of the transistor. The third generation involved solid-state technology, the use of integrated circuits, and higher-level languages like COBOL, Fortran, and "C". End-user tools, "code generators," are known as the fourth generation.) The fifth generation involves artificial intelligence.

CHARACTERISTICS       TRADITIONAL COMPUTING             ARTIFICIAL NEURAL
                      (INCLUDING EXPERT SYSTEMS)        NETWORKS

Processing style      Sequential                        Parallel
Functions             Logically (left-brained) via      Gestalt (right-brained) via
                      rules, concepts, calculations     images, pictures, controls
Learning method       By rules (didactically)           By example (Socratically)
Applications          Accounting, word processing,      Sensor processing, speech
                      math, inventory, digital          recognition, pattern
                      communications                    recognition, text recognition

Table 2.6.1 Comparison of Computing Approaches.

Typically, an expert system consists of two parts, an inference engine and a knowledge base. The inference engine is generic. It handles the user interface, external files, program access, and scheduling. The knowledge base contains the information that is specific to a particular problem. This knowledge base allows an expert to define the rules which govern a process. This expert does not have to understand traditional programming. That person simply has to understand both what he wants a computer to do and how the mechanism of the expert system shell works. It is this shell, part of the inference engine, that actually tells the computer how to implement the expert's desires. This implementation occurs by the expert system generating the computer's programming itself; it does that through "programming" of its own. This programming is needed to establish the rules for a particular application. This method of establishing rules is also complex and does require a detail-oriented person.

Efforts to make expert systems general have run into a number of problems. As the complexity of the system increases, the system simply demands too many computing resources and becomes too slow. Expert systems have been found to be feasible only when narrowly confined.

Artificial neural networks offer a completely different approach to problem solving, and they are sometimes called the sixth generation of computing. They try to provide a tool that both programs itself and learns on its own. Neural networks are structured to provide the capability to solve problems without the benefit of an expert and without the need of programming. They can seek patterns in data that no one knows are there.

A comparison of artificial intelligence's expert systems and neural networks is contained in Table 2.6.2.

CHARACTERISTICS       VON NEUMANN ARCHITECTURE          ARTIFICIAL NEURAL
                      USED FOR EXPERT SYSTEMS           NETWORKS

Processors            VLSI (traditional processors)     Artificial neural networks;
                                                        variety of technologies;
                                                        hardware development is ongoing
Memory                Separate                          The same
Processing approach   Processes problem one rule        Multiple, simultaneously
                      at a time; sequential
Connections           Externally programmable           Dynamically self-programming
Self learning         Only algorithmic parameters       Continuously adaptable
                      modified
Fault tolerance       None without special              Significant in the very nature
                      processors                        of the interconnected neurons
Use of neurobiology   None                              Moderate
in design
Programming           Through a rule-based shell;       Self-programming; but network
                      complicated                       must be properly set up
Ability to be fast    Requires big processors           Requires multiple custom-built
                                                        chips

Table 2.6.2 Comparisons of Expert Systems and Neural Networks.

Expert systems have enjoyed significant successes. However, artificial intelligence has encountered problems in areas such as vision, continuous speech recognition and synthesis, and machine learning. Artificial intelligence also is hostage to the speed of the processor that it runs on. Ultimately, it is restricted to the theoretical limit of a single processor. Artificial intelligence is also burdened by the fact that experts don't always speak in rules.

Yet, despite the advantages of neural networks over both expert systems and more traditional computing in these specific areas, neural nets are not complete solutions. They offer a capability that is not ironclad in the way a debugged accounting system is. They learn, and as such, they do continue to make "mistakes." Furthermore, even when a network has been developed, there is no way to ensure that the network is the optimal network.

Neural systems do exact their own demands. They do require their implementor to meet a number of conditions. These conditions include:

- a data set which includes the information which can characterize the problem.

- an adequately sized data set to both train and test the network.

- an understanding of the basic nature of the problem to be solved, so that basic first-cut decisions on creating the network can be made. These decisions include the activation and transfer functions, and the learning methods.

- an understanding of the development tools.

- adequate processing power (some applications demand real-time processing that exceeds what is available in standard, sequential processing hardware; the development of hardware is the key to the future of neural networks).

Once these conditions are met, neural networks offer the opportunity of solving problems in an arena where traditional processors lack both the processing power and a step-by-step methodology. A number of very complicated problems cannot be solved in traditional computing environments. For example, speech is something that all people can easily parse and understand. A person can understand a southern drawl, a Bronx accent, and the slurred words of a baby. Without the massively parallel processing power of a neural network, this process is virtually impossible for a computer. Image recognition is another task that a human can easily do but which stymies even the biggest of computers. A person can recognize a plane as it turns, flies overhead, and disappears into a dot. A traditional computer might try to compare the changing images to a number of very different stored patterns.

This new way of computing requires skills beyond traditional computing. It is a natural evolution. Initially, computing was only hardware and engineers made it work. Then, there were software specialists - programmers, systems engineers, data base specialists, and designers. Now, there are also neural architects. This new professional needs to be skilled differently from his predecessors. For instance, he will need to know statistics in order to choose and evaluate training and testing situations. This skill of making neural networks work is one that will stress the logical thinking of current software engineers.

In summary, neural networks offer a unique way to solve some problems while making their own demands. The biggest demand is that the process is not simply logic. It involves an empirical skill, an intuitive feel as to how a network might be created.


3.0 History of Neural Networks

The study of the human brain is thousands of years old. With the advent of modern electronics, it was only natural to try to harness this thinking process. The first step toward artificial neural networks came in 1943 when Warren McCulloch, a neurophysiologist, and a young mathematician, Walter Pitts, wrote a paper on how neurons might work. They modeled a simple neural network with electrical circuits.

Reinforcing this concept of neurons and how they work was a book written by Donald Hebb. The Organization of Behavior was written in 1949. It pointed out that neural pathways are strengthened each time that they are used.

As computers advanced into their infancy of the 1950s, it became possible to begin to model the rudiments of these theories concerning human thought. Nathaniel Rochester from the IBM research laboratories led the first effort to simulate a neural network. That first attempt failed. But later attempts were successful. It was during this time that traditional computing began to flower and, as it did, the emphasis in computing left the neural research in the background.

Yet, throughout this time, advocates of "thinking machines" continued to argue their cases. In 1956 the Dartmouth Summer Research Project on Artificial Intelligence provided a boost to both artificial intelligence and neural networks. One of the outcomes of this process was to stimulate research in both the intelligent side, AI, as it is known throughout the industry, and in the much lower level neural processing part of the brain.

In the years following the Dartmouth Project, John von Neumann suggested imitating simple neuron functions by using telegraph relays or vacuum tubes. Also, Frank Rosenblatt, a neurobiologist at Cornell, began work on the Perceptron. He was intrigued with the operation of the eye of a fly. Much of the processing which tells a fly to flee is done in its eye. The Perceptron, which resulted from this research, was built in hardware and is the oldest neural network still in use today. A single-layer perceptron was found to be useful in classifying a continuous-valued set of inputs into one of two classes. The perceptron computes a weighted sum of the inputs, subtracts a threshold, and passes one of two possible values out as the result. Unfortunately, the perceptron is limited and was proven as such during the "disillusioned years" in Marvin Minsky and Seymour Papert's 1969 book Perceptrons.
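That computation is compact enough to state directly. The Python fragment below is an illustrative rendering of the rule just described; the weights and threshold are invented for the example.

```python
def perceptron(inputs, weights, threshold):
    # Weighted sum, compared against a threshold, hard-limited to one of two classes.
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

print(perceptron([0.7, 0.3], [1.0, 1.0], threshold=0.5))  # -> class 1
```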

In 1959, Bernard Widrow and Marcian Hoff of Stanford developed models they called ADALINE and MADALINE. These models were named for their use of Multiple ADAptive LINear Elements. MADALINE was the first neural network to be applied to a real-world problem. It is an adaptive filter which eliminates echoes on phone lines. This neural network is still in commercial use.

Unfortunately, these earlier successes caused people to exaggerate the potential of neural networks, particularly in light of the limitations in the electronics then available. This excessive hype, which flowed out of the academic and technical worlds, infected the general literature of the time. Disappointment set in as promises went unfulfilled. Also, a fear set in as writers began to ponder what effect "thinking machines" would have on man. Asimov's series on robots revealed the effects on man's morals and values when machines were capable of doing all of mankind's work. Other writers created more sinister computers, such as HAL from the movie 2001.

These fears, combined with unfulfilled, outrageous claims, caused respected voices to critique the neural network research. The result was to halt much of the funding. This period of stunted growth lasted through 1981.

In 1982 several events caused a renewed interest. John Hopfield of Caltech presented a paper to the National Academy of Sciences. Hopfield's approach was not to simply model brains but to create useful devices. With clarity and mathematical analysis, he showed how such networks could work and what they could do. Yet, Hopfield's biggest asset was his charisma. He was articulate, likeable, and a champion of a dormant technology.

At the same time, another event occurred. A conference was held in Kyoto, Japan. This conference was the US-Japan Joint Conference on Cooperative/Competitive Neural Networks. Japan subsequently announced their Fifth Generation effort. US periodicals picked up that story, generating a worry that the US could be left behind. Soon funding was flowing once again.

By 1985 the American Institute of Physics began what has become an annual meeting - Neural Networks for Computing. By 1987, the Institute of Electrical and Electronics Engineers' (IEEE) first International Conference on Neural Networks drew more than 1,800 attendees.

By 1989 at the Neural Networks for Defense meeting, Bernard Widrow told his audience that they were engaged in World War IV, "World War III never happened," where the battlefields are world trade and manufacturing. The 1990 US Department of Defense Small Business Innovation Research Program named 16 topics which specifically targeted neural networks, with an additional 13 mentioning the possible use of neural networks.

Today, neural networks discussions are occurring everywhere. Their promise seems very bright, as nature itself is the proof that this kind of thing works. Yet, its future, indeed the very key to the whole technology, lies in hardware development. Currently most neural network development is simply proving that the principle works. This research is developing neural networks that, due to processing limitations, take weeks to learn. To take these prototypes out of the lab and put them into use requires specialized chips. Companies are working on three types of neuro chips - digital, analog, and optical. Some companies are working on creating a "silicon compiler" to generate a neural network Application Specific Integrated Circuit (ASIC). These ASICs and neuron-like digital chips appear to be the wave of the near future. Ultimately, optical chips look very promising. Yet, it may be years before optical chips see the light of day in commercial applications.


4.0 Detailed Description of Neural Network Components and How They Work

Now that there is a general understanding of artificial neural networks, it is appropriate to explore them in greater detail. But before jumping into the various networks, a more complete understanding of the inner workings of a neural network is needed. As stated earlier, artificial neural networks are a large class of parallel processing architectures which are useful in specific types of complex problems. These architectures should not be confused with common parallel processing configurations which apply many sequential processing units to standard computing topologies. Instead, neural networks are radically different from conventional Von Neumann computers in that they crudely mimic the fundamental properties of man's brain.

As mentioned earlier, artificial neural networks are loosely based on biology. Current research into the brain's physiology has unlocked only a limited understanding of how neurons work or even what constitutes intelligence in general. Researchers are working in both the biological and engineering fields to further decipher the key mechanisms for how man learns and reacts to everyday experiences. Improved knowledge in neural processing helps create better, more succinct artificial networks. It also creates a cornucopia of new, and ever evolving, architectures. Kunihiko Fukushima, a senior research scientist in Japan, describes the give and take of building a neural network model: "We try to follow physiological evidence as faithfully as possible. For parts not yet clear, however, we construct a hypothesis and build a model that follows that hypothesis. We then analyze or simulate the behavior of the model and compare it with that of the brain. If we find any discrepancy in the behavior between the model and the brain, we change the initial hypothesis and modify the model. We repeat this procedure until the model behaves in the same way as the brain." This common process has created thousands of network topologies.


[Figure: a full processing element model. Inputs i0 ... in arrive through weights w0 ... wn, along with competitive inputs; a selectable summation function (sum, max, min, majority, product, etc.) feeds a transfer function (linear, sigmoid, sgn, BAM, perceptron, etc.), optionally perturbed by a noise generator; scaling (scale factor, offset) and limiting (high limit, low limit) follow, then an output function (direct, highest, two highest, Adaline, etc.). A learning rule (Hebb, delta rule, etc.) driven by an error input, and a learning and recall schedule (recall firing rate, input clamp, temperature, gain mod factor, learning temperature coefficients 1, 2, and 3), govern the element, along with network recall and learning cycle counters and a PE enable.]

Figure 4.0.1 Processing Element.

Neural computing is about machines, not brains. It is the process of trying to build processing systems that draw upon the highly successful designs naturally occurring in biology. This linkage with biology is the reason that there is a common architectural thread throughout today's artificial neural networks. Figure 4.0.1 shows a model of an artificial neuron, or processing element, which embodies a wide variety of network architectures.


This figure is adapted from NeuralWare's simulation model used in NeuralWorks Professional II/Plus. NeuralWare sells an artificial neural network design and development software package. Their processing element model shows that networks designed for prediction can be very similar to networks designed for classification or any other network category. Prediction, classification, and other network categories will be discussed later. The point here is that all artificial neural processing elements have common components.

4.1 Major Components of an Artificial Neuron

This section describes the seven major components which make up an artificial neuron. These components are valid whether the neuron is used for input, output, or is in one of the hidden layers.

Component 1. Weighting Factors: A neuron usually receives many simultaneous inputs. Each input has its own relative weight which gives the input the impact that it needs on the processing element's summation function. These weights perform the same type of function as do the varying synaptic strengths of biological neurons. In both cases, some inputs are made more important than others so that they have a greater effect on the processing element as they combine to produce a neural response.

Weights are adaptive coefficients within the network that determine the intensity of the input signal as registered by the artificial neuron. They are a measure of an input's connection strength. These strengths can be modified in response to various training sets and according to a network's specific topology or through its learning rules.

Component 2. Summation Function: The first step in a processing element's operation is to compute the weighted sum of all of the inputs. Mathematically, the inputs and the corresponding weights are vectors which can be represented as (i1, i2 ... in) and (w1, w2 ... wn). The total input signal is the dot, or inner, product of these two vectors. This simplistic summation function is found by multiplying each component of the i vector by the corresponding component of the w vector and then adding up all the products. Input1 = i1 * w1, input2 = i2 * w2, etc., are added as input1 + input2 + ... + inputn. The result is a single number, not a multi-element vector.

Geometrically, the inner product of two vectors can be considered a measure of their similarity. If the vectors point in the same direction, the inner product is maximum; if the vectors point in opposite directions (180 degrees out of phase), their inner product is minimum.
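A short Python rendering of that summation function; the example vectors are invented to show the same-direction and opposite-direction extremes.

```python
def summation(i_vec, w_vec):
    # The dot (inner) product of the input vector and the weight vector.
    return sum(i * w for i, w in zip(i_vec, w_vec))

print(summation([1.0, 0.0], [1.0, 0.0]))   # same direction: maximal (1.0)
print(summation([1.0, 0.0], [-1.0, 0.0]))  # opposite directions: minimal (-1.0)
```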

The summation function can be more complex than just the simple input and weight sum of products. The input and weighting coefficients can be combined in many different ways before passing on to the transfer function. In addition to a simple product summing, the summation function can select the minimum, maximum, majority, product, or several normalizing algorithms. The specific algorithm for combining neural inputs is determined by the chosen network architecture and paradigm.

Some summation functions have an additional process applied to the result before it is passed on to the transfer function. This process is sometimes called the activation function. The purpose of utilizing an activation function is to allow the summation output to vary with respect to time. Activation functions currently are pretty much confined to research. Most of the current network implementations use an "identity" activation function, which is equivalent to not having one. Additionally, such a function is likely to be a component of the network as a whole rather than of each individual processing element component.

Component 3. Transfer Function: The result of the summation function, almost always the weighted sum, is transformed to a working output through an algorithmic process known as the transfer function. In the transfer function the summation total can be compared with some threshold to determine the neural output. If the sum is greater than the threshold value, the processing element generates a signal. If the sum of the input and weight products is less than the threshold, no signal (or some inhibitory signal) is generated. Both types of response are significant.

The threshold, or transfer, function is generally non-linear. Linear (straight-line) functions are limited because the output is simply proportional to the input. Linear functions are not very useful. That was the problem in the earliest network models, as noted in Minsky and Papert's book Perceptrons.

The transfer function could be something as simple as depending upon whether the result of the summation function is positive or negative. The network could output zero and one, one and minus one, or other numeric combinations. The transfer function would then be a "hard limiter" or step function. See Figure 4.1.1 for sample transfer functions.


[Figure: three sample transfer functions.
Hard limiter: y = -1 for x < 0; y = 1 for x ≥ 0.
Ramping function: y = 0 for x < 0; y = x for 0 ≤ x ≤ 1; y = 1 for x > 1.
Sigmoid functions: y = 1/(1 + e^-x), ranging from 0 to 1; or y = 1 - 1/(1 + x) for x ≥ 0 and y = -1 + 1/(1 - x) for x < 0, ranging from -1 to 1.]

Figure 4.1.1 Sample Transfer Functions.

Another type of transfer function, the threshold or ramping function, could mirror the input within a given range and still act as a hard limiter outside that range. It is a linear function that has been clipped to minimum and maximum values, making it non-linear. Yet another option would be a sigmoid or S-shaped curve. That curve approaches a minimum and maximum value at the asymptotes. It is common for this curve to be called a sigmoid when it ranges between 0 and 1, and a hyperbolic tangent when it ranges between -1 and 1. Mathematically, the exciting feature of these curves is that both the function and its derivatives are continuous. This option works fairly well and is often the transfer function of choice. Other transfer functions are dedicated to specific network architectures and will be discussed later.
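A brief Python sketch of the transfer functions of Figure 4.1.1; the function names are illustrative.

    import math

    def hard_limiter(x):
        # Step function: -1 below the threshold, 1 at or above it.
        return 1.0 if x >= 0 else -1.0

    def ramp(x):
        # Linear within [0, 1], clipped (hard-limited) outside that range.
        return min(max(x, 0.0), 1.0)

    def sigmoid(x):
        # S-shaped curve ranging between 0 and 1; continuous derivative.
        return 1.0 / (1.0 + math.exp(-x))

    def tanh_transfer(x):
        # The hyperbolic tangent variant, ranging between -1 and 1.
        return math.tanh(x)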

Prior to applying the transfer function, uniformly distributed random noise may be added. The source and amount of this noise is determined by the learning mode of a given network paradigm. This noise is normally referred to as the "temperature" of the artificial neurons. The name, temperature, is derived from the physical phenomenon that as people become too hot or cold their ability to think is affected. Electronically, this process is simulated by adding noise. Indeed, by adding different levels of noise to the summation result, more brain-like transfer functions are realized. To more closely mimic nature's characteristics, some experimenters are using a gaussian noise source. Gaussian noise is similar to uniformly distributed noise except that the distribution of random numbers within the temperature range is along a bell curve. The use of temperature is an ongoing research area and is not being applied to many engineering applications.


NASA just announced a network topology which uses what it calls a temperature coefficient in a new feedforward, back-propagation learning function. But this temperature coefficient is a global term which is applied to the gain of the transfer function. It should not be confused with the more common term, temperature, which is simple noise being added to individual neurons. In contrast, the global temperature coefficient allows the transfer function to have a learning variable much like the synaptic input weights. This concept is claimed to create a network which has a significantly faster (by several orders of magnitude) learning rate and provides more accurate results than other feedforward, back-propagation networks.

Component 4. Scaling and Limiting: After the processing element's transfer function, the result can pass through additional processes which scale and limit. This scaling simply multiplies a scale factor times the transfer value, and then adds an offset. Limiting is the mechanism which insures that the scaled result does not exceed an upper or lower bound. This limiting is in addition to the hard limits that the original transfer function may have performed.
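In code, this component is little more than a linear rescale followed by a clamp; the parameter names and default bounds here are illustrative.

    def scale_and_limit(x, scale=1.0, offset=0.0, lower=-1.0, upper=1.0):
        # Scale: multiply by a scale factor, then add an offset.
        y = scale * x + offset
        # Limit: keep the scaled result within the upper and lower bounds.
        return min(max(y, lower), upper)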

This type of scaling and limiting is mainly used in topologies to test biological neuron models, such as James Anderson's brain-state-in-the-box.

Component 5. Output Function (Competition): Each processing element is allowed one output signal which it may output to hundreds of other neurons. This is just like the biological neuron, where there are many inputs and only one output action. Normally, the output is directly equivalent to the transfer function's result. Some network topologies, however, modify the transfer result to incorporate competition among neighboring processing elements. Neurons are allowed to compete with each other, inhibiting processing elements unless they have great strength. Competition can occur at one or both of two levels. First, competition determines which artificial neuron will be active, or provides an output. Second, competitive inputs help determine which processing element will participate in the learning or adaptation process.

Component 6. Error Function and Back-Propagated Value: In most learning networks the difference between the current output and the desired output is calculated. This raw error is then transformed by the error function to match a particular network architecture. The most basic architectures use this error directly, but some square the error while retaining its sign, some cube the error, and other paradigms modify the raw error to fit their specific purposes. The artificial neuron's error is then typically propagated into the learning function of another processing element. This error term is sometimes called the current error.

The current error is typically propagated backwards to a previous layer. Yet, this back-propagated value can be either the current error, the current error scaled in some manner (often by the derivative of the transfer function), or some other desired output depending on the network type. Normally, this back-propagated value, after being scaled by the learning function, is multiplied against each of the incoming connection weights to modify them before the next learning cycle.

Component 7. Learning Function: The purpose of the learning function is to modify the variable connection weights on the inputs of each processing element according to some neural based algorithm. This process of changing the weights of the input connections to achieve some desired result can also be called the adaptation function, as well as the learning mode. There are two types of learning: supervised and unsupervised. Supervised learning requires a teacher. The teacher may be a training set of data or an observer who grades the performance of the network results. Either way, having a teacher is learning by reinforcement. When there is no external teacher, the system must organize itself by some internal criteria designed into the network. This is learning by doing.

4.2 Teaching an Artificial Neural Network

4.2.1 Supervised Learning.

The vast majority of artificial neural network solutions have been trained with supervision. In this mode, the actual output of a neural network is compared to the desired output. Weights, which are usually randomly set to begin with, are then adjusted by the network so that the next iteration, or cycle, will produce a closer match between the desired and the actual output. The learning method tries to minimize the current errors of all processing elements. This global error reduction is created over time by continuously modifying the input weights until an acceptable network accuracy is reached.

With supervised learning, the artificial neural network must be trained before it becomes useful. Training consists of presenting input and output data to the network. This data is often referred to as the training set. That is, for each input set provided to the system, the corresponding desired output set is provided as well. In most applications, actual data must be used. This training phase can consume a lot of time. In prototype systems, with inadequate processing power, learning can take weeks. This training is considered complete when the neural network reaches a user-defined performance level. This level signifies that the network has achieved the desired statistical accuracy as it produces the required outputs for a given sequence of inputs. When no further learning is necessary, the weights are typically frozen for the application. Some network types allow continual training, at a much slower rate, while in operation. This helps a network to adapt to gradually changing conditions.


Training sets need to be fairly large to contain all the needed information if the network is to learn the features and relationships that are important. Not only do the sets have to be large but the training sessions must include a wide variety of data. If the network is trained just one example at a time, all the weights set so meticulously for one fact could be drastically altered in learning the next fact. The previous facts could be forgotten in learning something new. As a result, the system has to learn everything together, finding the best weight settings for the total set of facts. For example, in teaching a system to recognize pixel patterns for the ten digits, if there were twenty examples of each digit, all the examples of the digit seven should not be presented at the same time.

How the input and output data is represented, or encoded, is a major component to successfully instructing a network. Artificial networks only deal with numeric input data. Therefore, the raw data must often be converted from the external environment. Additionally, it is usually necessary to scale the data, or normalize it to the network's paradigm. This pre-processing of real-world stimuli, be they cameras or sensors, into machine readable format is already common for standard computers. Many conditioning techniques which directly apply to artificial neural network implementations are readily available. It is then up to the network designer to find the best data format and matching network architecture for a given application.

After a supervised network performs well on the training data, it is important to see what it can do with data it has not seen before. If a system does not give reasonable outputs for this test set, the training period is not over. Indeed, this testing is critical to insure that the network has not simply memorized a given set of data but has learned the general patterns involved within an application.

4.2.2 Unsupervised Learning.

Unsupervised learning is the great promise of the future. It suggests that computers could someday learn on their own in a true robotic sense. Currently, this learning method is limited to networks known as self-organizing maps. These kinds of networks are not in widespread use. They are basically an academic novelty. Yet, they have shown they can provide a solution in a few instances, proving that their promise is not groundless. They have been proven to be more effective than many algorithmic techniques for numerical aerodynamic flow calculations. They are also being used in the lab where they are split into a front-end network that recognizes short, phoneme-like fragments of speech which are then passed on to a back-end network. The second artificial network recognizes these strings of fragments as words.


This promising field of unsupervised learning is sometimes called self-supervised learning. These networks use no external influences to adjust their weights. Instead, they internally monitor their performance. These networks look for regularities or trends in the input signals, and make adaptations according to the function of the network. Even without being told whether it's right or wrong, the network still must have some information about how to organize itself. This information is built into the network topology and learning rules.

An unsupervised learning algorithm might emphasize cooperation among clusters of processing elements. In such a scheme, the clusters would work together. If some external input activated any node in the cluster, the cluster's activity as a whole could be increased. Likewise, if external input to nodes in the cluster was decreased, that could have an inhibitory effect on the entire cluster.

Competition between processing elements could also form a basis for learning. Training of competitive clusters could amplify the responses of specific groups to specific stimuli. As such, it would associate those groups with each other and with a specific appropriate response. Normally, when competition for learning is in effect, only the weights belonging to the winning processing element will be updated.

At the present state of the art, unsupervised learning is not well understood and is still the subject of research. This research is currently of interest to the government because military situations often do not have a data set available to train a network until a conflict arises.

4.2.3 Learning Rates.

The rate at which ANNs learn depends upon several controllable factors. In selecting the approach there are many trade-offs to consider. Obviously, a slower rate means a lot more time is spent in accomplishing the off-line learning to produce an adequately trained system. With the faster learning rates, however, the network may not be able to make the fine discriminations possible with a system that learns more slowly. Researchers are working on producing the best of both worlds.

Generally, several factors besides time have to be considered when discussing the off-line training task, which is often described as "tiresome." Network complexity, size, paradigm selection, architecture, type of learning rule or rules employed, and desired accuracy must all be considered. These factors play a significant role in determining how long it will take to train a network. Changing any one of these factors may either extend the training time to an unreasonable length or even result in an unacceptable accuracy.


Most learning functions have some provision for a learning rate, or learning constant. Usually this term is positive and between zero and one. If the learning rate is greater than one, it is easy for the learning algorithm to overshoot in correcting the weights, and the network will oscillate. Small values of the learning rate will not correct the current error as quickly, but if small steps are taken in correcting errors, there is a good chance of arriving at the best minimum.

4.2.4 Learning Laws.

Many learning laws are in common use. Most of these laws are some sort of variation of the best known and oldest learning law, Hebb's Rule. Research into different learning functions continues as new ideas routinely show up in trade publications. Some researchers have the modeling of biological learning as their main objective. Others are experimenting with adaptations of their perceptions of how nature handles learning. Either way, man's understanding of how neural processing actually works is very limited. Learning is certainly more complex than the simplifications represented by the learning laws currently developed. A few of the major laws are presented as examples.

Hebb's Rule: The first, and undoubtedly the best known, learning rule was introduced by Donald Hebb. The description appeared in his book The Organization of Behavior in 1949. His basic rule is: If a neuron receives an input from another neuron, and if both are highly active (mathematically have the same sign), the weight between the neurons should be strengthened.
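A one-line Python sketch of a Hebbian weight update; the learning-rate parameter lr is an illustrative name and value.

    def hebb_update(weight, pre_activity, post_activity, lr=0.1):
        # Strengthen the weight when both neurons are active with the same sign.
        return weight + lr * pre_activity * post_activity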

Hopfield Law: It is similar to Hebb's rule with the exception that it specifies the magnitude of the strengthening or weakening. It states, "if the desired output and the input are both active or both inactive, increment the connection weight by the learning rate, otherwise decrement the weight by the learning rate."

The Delta Rule: This rule is a further variation of Hebb's Rule. It is one of the most commonly used. This rule is based on the simple idea of continuously modifying the strengths of the input connections to reduce the difference (the delta) between the desired output value and the actual output of a processing element. This rule changes the synaptic weights in the way that minimizes the mean squared error of the network. This rule is also referred to as the Widrow-Hoff Learning Rule and the Least Mean Square (LMS) Learning Rule.
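A minimal sketch of one Delta Rule update for a single processing element; the variable names are illustrative.

    def delta_rule_update(weights, inputs, desired, actual, lr=0.1):
        # Move each weight in proportion to the delta and its own input.
        delta = desired - actual
        return [w + lr * delta * i for w, i in zip(weights, inputs)]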

The way that the Delta Rule works is that the delta error in the output layer is transformed by the derivative of the transfer function and is then used in the previous neural layer to adjust input connection weights. In other words, this error is back-propagated into previous layers one layer at a time. The process of back-propagating the network errors continues until the first layer is reached. The network type called Feedforward, Back-propagation derives its name from this method of computing the error term.

When using the delta rule, it is important to ensure that the input data set is well randomized. Well ordered or structured presentation of the training set can lead to a network which cannot converge to the desired accuracy. If that happens, then the network is incapable of learning the problem.

The Gradient Descent Rule: This rule is similar to the Delta Rule in that the derivative of the transfer function is still used to modify the delta error before it is applied to the connection weights. Here, however, an additional proportional constant tied to the learning rate is appended to the final modifying factor acting upon the weight. This rule is commonly used, even though it converges to a point of stability very slowly.

It has been shown that different learning rates for different layers of a network help the learning process converge faster. In these tests, the learning rates for the layers close to the output were set lower than those for layers near the input. This is especially important for applications where the input data is not derived from a strong underlying model.

Kohonen's Learning Law: This procedure, developed by Teuvo Kohonen, was inspired by learning in biological systems. In this procedure, the processing elements compete for the opportunity to learn, or update their weights. The processing element with the largest output is declared the winner and has the capability of inhibiting its competitors as well as exciting its neighbors. Only the winner is permitted an output, and only the winner plus its neighbors are allowed to adjust their connection weights.

Further, the size of the neighborhood can vary during the training period. The usual paradigm is to start with a larger definition of the neighborhood, and narrow in as the training process proceeds. Because the winning element is defined as the one that has the closest match to the input pattern, Kohonen networks model the distribution of the inputs. This is good for statistical or topological modeling of the data and is sometimes referred to as self-organizing maps or self-organizing topologies.
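A compact sketch of one Kohonen-style update: the closest-matching element wins, and only the winner and its neighbors move toward the input. Using a simple index window as the neighborhood is an illustrative simplification.

    import numpy as np

    def kohonen_step(weights, x, lr=0.1, radius=1):
        # Winner: the element whose weight vector best matches the input.
        winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        # Only the winner and its neighbors adjust their connection weights.
        lo, hi = max(0, winner - radius), min(len(weights), winner + radius + 1)
        for j in range(lo, hi):
            weights[j] += lr * (x - weights[j])
        return winner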


5.0 Network Selection

Because all artificial neural networks are based on the concept of neurons, connections, and transfer functions, there is a similarity between the different structures, or architectures, of neural networks. The majority of the variations stems from the various learning rules and how those rules modify a network's typical topology. The following sections outline some of the most common artificial neural networks. They are organized in very rough categories of application. These categories are not meant to be exclusive; they are merely meant to separate out some of the confusion over network architectures and their best matches to specific applications.

Basically, most applications of neural networks fall into the following five categories:

- prediction
- classification
- data association
- data conceptualization
- data filtering

Prediction
  Networks: Back-propagation; Delta Bar Delta; Extended Delta Bar Delta; Directed Random Search; Higher-order Neural Networks; Self-Organizing Map into Back-propagation.
  Use for network: Use input values to predict some output (e.g. pick the best stocks in the stock market, predict the weather, identify people with cancer risks).

Classification
  Networks: Learning Vector Quantization; Counter-propagation; Probabilistic Neural Network.
  Use for network: Use input values to determine the classification (e.g. is the input the letter A; is the blob of video data a plane, and what kind of plane is it).

Data association
  Networks: Hopfield; Boltzmann Machine; Hamming Network; Bidirectional Associative Memory; Spatio-temporal Pattern Recognition.
  Use for network: Like classification, but it also recognizes data that contains errors (e.g. not only identify the characters that were scanned but also identify when the scanner wasn't working properly).

Data conceptualization
  Networks: Adaptive Resonance Network; Self-Organizing Map.
  Use for network: Analyze the inputs so that grouping relationships can be inferred (e.g. extract from a database the names of those most likely to buy a particular product).

Data filtering
  Networks: Recirculation.
  Use for network: Smooth an input signal (e.g. take the noise out of a telephone signal).

Table 5.0.1 Network Selector Table.

Table 5.0.1 shows the differences between these network categories and shows which of the more common network topologies belong to which primary category. This chart is intended as a guide and is not meant to be all inclusive. While there are many other network derivations, this chart only includes the architectures explained within this section of this report. Some of these networks, which have been grouped by application, have been used to solve more than one type of problem. Feedforward back-propagation in particular has been used to solve almost all types of problems and indeed is the most popular for the first four categories. The next five subsections describe these five network types.

5.1 Networks for Prediction

The most common use for neural networks is to project what will most likely happen. There are many applications where prediction can help in setting priorities. For example, the emergency room at a hospital can be a hectic place; knowing who needs the most time-critical help can enable a more successful operation. Basically, all organizations must establish priorities which govern the allocation of their resources. This projection of the future is what drove the creation of prediction networks.

5.1.1 Feedforward, Back-Propagation.

The feedforward, back-propagation architecture was developed in the early 1970's by several independent sources (Werbos; Parker; Rumelhart, Hinton and Williams). This independent co-development was the result of a proliferation of articles and talks at various conferences which stimulated the entire industry. Currently, this synergistically developed back-propagation architecture is the most popular, effective, and easy-to-learn model for complex, multi-layered networks. This network is used more than all others combined, in many different types of applications. This architecture has spawned a large class of network types with many different topologies and training methods. Its greatest strength is in non-linear solutions to ill-defined problems.

The typical back-propagation network has an input layer, an output layer, and at least one hidden layer. There is no theoretical limit on the number of hidden layers, but typically there are just one or two. Some work has been done which indicates that a maximum of four layers (three hidden layers plus an output layer) is required to solve problems of any complexity. Each layer is fully connected to the succeeding layer, as shown in Figure 5.0.1. (Note: all of the drawings of networks in section 5 are from NeuralWare's NeuralWorks Professional II/Plus artificial neural network development tool.)

The in and out layers indicate the flow of information during recall. Recall is the process of putting input data into a trained network and receiving the answer. Back-propagation is not used during recall, but only when the network is learning a training set.


Figure 5.0.1 An Example Feedforward Back-propagation Network.

The number of layers and the number of processing elements per layer are important decisions. These parameters to a feedforward, back-propagation topology are also the most ethereal. They are the "art" of the network designer. There is no quantifiable, best answer to the layout of the network for any particular application. There are only general rules picked up over time and followed by most researchers and engineers applying this architecture to their problems.

Rule One: As the complexity in the relationship between the input data and the desired output increases, the number of processing elements in the hidden layer should also increase.

Rule Two: If the process being modeled is separable into multiple stages, then additional hidden layer(s) may be required. If the process is not separable into stages, then additional layers may simply enable memorization and not a true general solution.

Rule Three: The amount of training data available sets an upper bound for the number of processing elements in the hidden layer(s). To calculate this upper bound, use the number of input-output pair examples in the training set and divide that number by the total number of input and output processing elements in the network. Then divide that result again by a scaling factor between five and ten. Larger scaling factors are used for relatively noisy data. Extremely noisy data may require a factor of twenty or even fifty, while very clean input data with an exact relationship to the output might drop the factor to around two. It is important that the hidden layers have few processing elements. Too many artificial neurons and the training set will be memorized. If that happens, then no generalization of the data trends will occur, making the network useless on new data sets.
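As a worked example of Rule Three (the numbers are hypothetical): with 1,000 input-output training pairs, a network with 10 inputs and 2 outputs, and a scaling factor of five for moderately noisy data, the upper bound works out as follows.

    training_pairs = 1000
    inputs, outputs = 10, 2
    scaling_factor = 5      # five to ten; larger for noisier data
    upper_bound = training_pairs / (inputs + outputs) / scaling_factor
    print(upper_bound)      # 16.67, i.e. roughly 16 hidden processing elements at most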

Once the above rules have been used to create a network, the process of teaching begins. This teaching process for a feedforward network normally uses some variant of the Delta Rule, which starts with the calculated difference between the actual outputs and the desired outputs. Using this error, connection weights are increased in proportion to the error times a scaling factor for global accuracy. Doing this for an individual node means that the inputs, the output, and the desired output all have to be present at the same processing element. The complex part of this learning mechanism is for the system to determine which input contributed the most to an incorrect output and how that element must be changed to correct the error. An inactive node would not contribute to the error and would have no need to change its weights.

To solve this problem, training inputs are applied to the input layer of the network, and desired outputs are compared at the output layer. During the learning process, a forward sweep is made through the network, and the output of each element is computed layer by layer. The difference between the output of the final layer and the desired output is back-propagated to the previous layer(s), usually modified by the derivative of the transfer function, and the connection weights are normally adjusted using the Delta Rule. This process proceeds for the previous layer(s) until the input layer is reached.

There are many variations to the learning rules for back-propagation networks. Different error functions, transfer functions, and even the modifying method of the derivative of the transfer function can be used. The concept of "momentum error" was introduced to allow for more prompt learning while minimizing unstable behavior. Here, the error function, or delta weight equation, is modified so that a portion of the previous delta weight is fed through to the current delta weight. This acts, in engineering terms, as a low-pass filter on the delta weight terms, since general trends are reinforced whereas oscillatory behavior is cancelled out. This allows a lower, normally slower, learning coefficient to be used while still achieving faster overall learning.
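The following is a minimal sketch of the forward sweep, back-propagated error, and momentum term described above, trained on the XOR problem. The layer sizes, learning rate, momentum value, and bias handling are all illustrative choices, not prescriptions from this report.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def add_bias(a):
        # Append a constant 1 so each element gets a trainable bias weight.
        return np.hstack([a, np.ones((a.shape[0], 1))])

    # Toy training set: XOR, a classic non-linearly-separable problem.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.uniform(-0.5, 0.5, (3, 4))   # input (+bias) -> hidden weights
    W2 = rng.uniform(-0.5, 0.5, (5, 1))   # hidden (+bias) -> output weights
    dW1_prev, dW2_prev = np.zeros_like(W1), np.zeros_like(W2)
    lr, momentum = 0.5, 0.9

    for epoch in range(10000):
        # Forward sweep: compute each layer's output in turn.
        H = sigmoid(add_bias(X) @ W1)
        Y = sigmoid(add_bias(H) @ W2)
        # Back-propagate the error, scaled by the transfer function's derivative.
        d_out = (T - Y) * Y * (1 - Y)
        d_hid = (d_out @ W2[:-1].T) * H * (1 - H)
        # Delta Rule update with a momentum term: a portion of the previous
        # delta weight is fed through to the current delta weight.
        dW2 = lr * add_bias(H).T @ d_out + momentum * dW2_prev
        dW1 = lr * add_bias(X).T @ d_hid + momentum * dW1_prev
        W2 += dW2
        W1 += dW1
        dW1_prev, dW2_prev = dW1, dW2

    print(Y.round(2))   # should approach [0, 1, 1, 0]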

Another technique that has an effect on convergence speed is to only update the weights after many pairs of inputs and their desired outputs are presented to the network, rather than after every presentation. This is referred to as cumulative back-propagation because the delta weights are accumulated and not applied until the complete set of pairs is presented. The number of input-output pairs that are presented during the accumulation is referred to as an "epoch." This epoch may correspond either to the complete set of training pairs or to a subset.

There are limitations to the feedforward, back-propagation architecture. Back-propagation requires lots of supervised training, with lots of input-output examples. Additionally, the internal mapping procedures are not well understood, and there is no guarantee that the system will converge to an acceptable solution. At times, the learning gets stuck in a local minimum, limiting the best solution. This occurs when the network system finds an error that is lower than the surrounding possibilities but does not finally get to the smallest possible error. Many learning applications add a term to the computations to bump or jog the weights past shallow barriers and find the actual minimum rather than a temporary error pocket.

Typical feedforward, back-propagation applications include speech synthesis from text, robot arms, evaluation of bank loans, image processing, knowledge representation, forecasting and prediction, and multi-target tracking. Each month more back-propagation solutions are announced in the trade journals.

5.1.2 Delta Bar Delta.

The delta bar delta network utilizes the same architecture as a back-propagation network. The difference of delta bar delta lies in its unique algorithmic method of learning. Delta bar delta was developed by Robert Jacobs to improve the learning rate of standard feedforward, back-propagation networks.

As outlined above, the back-propagation procedure is based on a steepest descent approach which minimizes the network's prediction error during the process where the connection weights to each artificial neuron are changed. The standard learning rates are applied on a layer by layer basis and the momentum term is usually assigned globally. Some back-propagation approaches allow the learning rates to gradually decrease as large quantities of training sets pass through the network. Although this method is successful in solving many applications, the convergence rate of the procedure is still too slow to be used on some practical problems.

The delta bar delta paradigm uses a learning method where each weight has its own self-adapting learning coefficient. It also does not use the momentum factor of the back-propagation architecture. The remaining operations of the network, such as feedforward recall, are identical to the normal back-propagation architecture. Delta bar delta is a "heuristic" approach to training artificial networks. What that means is that past error values can be used to infer future calculated error values. Knowing the probable errors enables the system to take intelligent steps in adjusting the weights. However, this process is complicated in that empirical evidence suggests that each weight may have quite different effects on the overall error. Jacobs then suggested the common sense notion that back-propagation learning rules should account for these variations in the effect on the overall error. In other words, every connection weight of a network should have its own learning rate. The claim is that the step size appropriate for one connection weight may not be appropriate for all weights in that layer. Further, these learning rates should be allowed to vary over time. By assigning a learning rate to each connection and permitting this learning rate to change continuously over time, more degrees of freedom are introduced to reduce the time to convergence.

Rules which directly apply to this algorithm are straightforward and easy to implement. Each connection weight has its own learning rate. These learning rates are varied based on the current error information found with standard back-propagation. When the connection weight changes, if the local error has the same sign for several consecutive time steps, the learning rate for that connection is linearly increased. Incrementing linearly prevents the learning rates from becoming too large too fast. When the local error changes signs frequently, the learning rate is decreased geometrically. Decrementing geometrically ensures that the connection learning rates are always positive. Further, they can be decreased more rapidly in regions where the change in error is large.
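A sketch of the per-weight learning-rate rule just described. The constants kappa (linear increase), phi (geometric decrease), and theta (the smoothing factor for the averaged "bar" delta), as well as the use of a smoothed gradient history to detect sign agreement, are illustrative assumptions.

    import numpy as np

    kappa, phi, theta = 0.01, 0.1, 0.7   # illustrative constants

    def delta_bar_delta_step(weights, lrates, grad, bar_delta):
        # Smoothed history of past gradients (the "delta bar").
        new_bar = (1 - theta) * grad + theta * bar_delta
        same_sign = new_bar * grad > 0          # error kept the same sign
        flipped = new_bar * grad < 0            # error changed sign
        lrates = np.where(same_sign, lrates + kappa, lrates)    # linear increase
        lrates = np.where(flipped, lrates * (1 - phi), lrates)  # geometric decrease
        weights = weights - lrates * grad       # per-weight gradient step
        return weights, lrates, new_bar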

By permitting different learning rates for each connection weight in a network, a steepest descent search (in the direction of the negative gradient) is no longer being performed. Instead, the connection weights are updated on the basis of the partial derivatives of the error with respect to the weight itself. It is also based on an estimate of the "curvature of the error surface" in the vicinity of the current point weight value. Additionally, the weight changes satisfy the locality constraint, that is, they require information only from the processing elements to which they are connected.

5.1.3 Extended Delta Bar Delta.

Ali Minai and Ron Williams developed the extended delta bar delta algorithm as a natural outgrowth from Jacobs' work. Here, they enhance the delta bar delta by applying an exponential decay to the learning rate increase, adding the momentum component back in, and putting a cap on the learning rate and momentum coefficient. As discussed in the section on back-propagation, momentum is a factor used to smooth the learning rate. It is a term added to the standard weight change which is proportional to the previous weight change. In this way, good general trends are reinforced, and oscillations are dampened.


The learning rate and the momentum rate for each weight have separate constants controlling their increase and decrease. Once again, the sign of the current error is used to indicate whether an increase or decrease is appropriate. The adjustment for decrease is identical in form to that of delta bar delta. However, the learning rate and momentum rate increases are modified to be exponentially decreasing functions of the magnitude of the weighted gradient components. Thus, greater increases will be applied in areas of small slope or curvature than in areas of high curvature. This is a partial solution to the jump problem of delta bar delta.

To take a step further to prevent wild jumps and oscillations in the weights, ceilings are placed on the individual connection learning rates and momentum rates. And finally, a memory with a recovery feature is built into the algorithm. When in use, after each epoch presentation of the training data, the accumulated error is evaluated. If the error is less than the previous minimum error, the weights are saved in memory as the current best. A tolerance parameter controls the recovery phase. Specifically, if the current error exceeds the minimum previous error, modified by the tolerance parameter, then all connection weight values revert stochastically to the stored best set of weights in memory. Furthermore, the learning and momentum rates are decreased to begin the recovery process.

5.1.4 Directed Random Search.

The previous architectures were all based on learning rules, or paradigms, which are based on calculus. Those paradigms use a gradient descent technique to adjust each of the weights. The architecture of the directed random search, however, uses a standard feedforward recall structure which is not based on back-propagation. Instead, the directed random search adjusts the weights randomly. To provide some order to this process, a direction component is added to the random step which insures that the weights tend toward a previously successful search direction. All processing elements are influenced individually.

This random search paradigm has several important features. Basically, it is fast and easy to use if the problem is well understood and relatively small. The reason that the problem has to be well understood is that the best results occur when the initial weights, the first guesses, are within close proximity to the best weights. It is fast because the algorithm cycles through its training much more quickly than calculus-based techniques (i.e., the delta rule and its variations), since no error terms are computed for the intermediate processing elements. Only the output error is calculated. This learning rule is easy to use because there are only two key parameters associated with it. But the problem needs to result in a small network, because if the number of connections becomes high, the training process becomes long and cumbersome.


To facilitate keeping the weights within the compact region where the algorithm works best, an upper bound is required on the weight's magnitude. Yet, by setting the weight's bounds reasonably high, the network is still allowed to seek what is not exactly known - the true global optimum. The second key parameter to this learning rule involves the initial variance of the random distribution of the weights. In most of the commercial packages there is a vendor recommended number for this initial variance parameter. Yet, the setting of this number is not all that important, as the self-adjusting feature of the directed random search has proven to be robust over a wide range of initial variances.

There are four key components to a random search network. They are the random step, the reversal step, a directed component, and a self-adjusting variance. A sketch combining all four appears after their descriptions below.

Random Step: A random value is added to each weight. Then, the entire training set is run through the network, producing a "prediction error." If this new total training set error is less than the previous best prediction error, the current weight values (which include the random step) become the new set of "best" weights. The current prediction error is then saved as the new, best prediction error.

Reversal Step: If the random step's results are worse than the previous best, then the same random value is subtracted from the original weight value. This produces a set of weights that is in the opposite direction to the previous random step. If the total "prediction error" is less than the previous best error, the current weight values of the reversal step are stored as the best weights. The current prediction error is also saved as the new, best prediction error. If both the forward and reverse steps fail, a completely new set of random values is added to the best weights and the process begins again.

Directed Component: To aid convergence, a set of directed components is created, based on the outcomes of the forward and reversal steps. These directed components reflect the history of success or failure for the previous random steps. The directed components, which are initialized to zero, are added to the random components at each step in the procedure. Directed components provide a "common sense, let's go this way" element to the search. It has been found that the addition of these directed components provides a dramatic performance improvement in convergence.

Self-adjusting Variance: An initial variance parameter is specified to control the initial size (or length) of the random steps which are added to the weights. An adaptive mechanism changes the variance parameter based on the current relative success or failure rate. The learning rule assumes that the current size of the steps for the weights is in the right direction if it records several consecutive successes, and it then expands to try even larger steps. Conversely, if it detects several consecutive failures, it contracts the variance to reduce the step size.
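The sketch below combines the four components in one loop. The error function, the blending factor for the directed component, and the success/failure thresholds are all illustrative assumptions.

    import random

    def directed_random_search(error_of, n_weights, iterations=1000):
        best_w = [0.0] * n_weights
        best_err = error_of(best_w)
        directed = [0.0] * n_weights     # history of successful directions
        variance = 1.0                   # controls the random step size
        successes = failures = 0

        for _ in range(iterations):
            # Random step plus the directed component.
            step = [random.gauss(0.0, variance) + d for d in directed]
            for sign in (+1, -1):        # forward step, then reversal step
                trial = [w + sign * s for w, s in zip(best_w, step)]
                err = error_of(trial)
                if err < best_err:
                    best_w, best_err = trial, err
                    # Blend the successful step into the directed component.
                    directed = [0.8 * d + 0.2 * sign * s
                                for d, s in zip(directed, step)]
                    successes, failures = successes + 1, 0
                    break
            else:                        # both steps failed
                failures, successes = failures + 1, 0
            # Self-adjusting variance: expand on repeated success,
            # contract on repeated failure.
            if successes >= 5:
                variance, successes = variance * 2.0, 0
            elif failures >= 5:
                variance, failures = variance * 0.5, 0
        return best_w, best_err

    # Example: minimize a simple quadratic error over five weights.
    w, e = directed_random_search(lambda w: sum(x * x for x in w), 5)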

For small to moderately sized networks, a directed random search produces good solutions in a reasonable amount of time. The training is automatic, requiring little, if any, user interaction. The number of connection weights imposes a practical limit on the size of a problem that this learning algorithm can effectively solve. If a network has more than 200 connection weights, a directed random search can require a relatively long training time and still end up yielding an acceptable solution.

5.1.5 Higher-order Neural Network or Functional-link Network.

Either name is given to neural networks which expand the standard feedforward, back-propagation architecture to include nodes at the input layer which provide the network with a more complete understanding of the input. Basically, the inputs are transformed in a well understood mathematical way so that the network does not have to learn some basic math functions. These functions do enhance the network's understanding of a given problem. These mathematical functions transform the inputs via higher-order functions such as squares, cubes, or sines. It is from the very name of these functions, higher-order or functionally linked mappings, that the two names for this same concept were derived.

This technique has been shown to dramatically improve the learning rates of some applications. An additional advantage to this extension of back-propagation is that these higher order functions can be applied to other derivations - delta bar delta, extended delta bar delta, or any other enhanced feedforward, back-propagation networks.

There are two basic ways of adding additional input nodes. First, the cross-products of the input terms can be added into the model. This is also called the output product or tensor model, where each component of the input pattern multiplies the entire input pattern vector. A reasonable way to do this is to add all interaction terms between input values. For example, for a back-propagation network with three inputs (A, B and C), the cross-products would include: AA, BB, CC, AB, AC, and BC. This example adds second-order terms to the input structure of the network. Third-order terms, such as ABC, could also be added.

The second method for adding additional input nodes is the functional expansion of the base inputs. Thus, a back-propagation model with A, B and C might be transformed into a higher-order neural network model with inputs: A, B, C, SIN(A), COS(B), LOG(C), MAX(A,B,C), etc. In this model, input variables are individually acted upon by appropriate functions. Many different functions can be used. The overall effect is to provide the network with an enhanced representation of the input. It is even possible to combine the tensor and functional expansion models together.
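A small sketch of building such an expanded input vector, combining second-order cross-products with a functional expansion; the particular functions chosen are just examples of the transformations described above.

    import math
    from itertools import combinations_with_replacement

    def expand_inputs(x):
        expanded = list(x)
        # Tensor model: all second-order cross-products (AA, AB, AC, BB, ...).
        for i, j in combinations_with_replacement(range(len(x)), 2):
            expanded.append(x[i] * x[j])
        # Functional expansion: each input acted upon individually.
        for v in x:
            expanded.extend([math.sin(v), math.cos(v)])
        return expanded

    print(expand_inputs([0.5, 1.0, 2.0]))   # 3 raw + 6 products + 6 functions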

No new information is added, but the representation of the inputs is enhanced. Higher-order representation of the input data can make the network easier to train. The joint or functional activations become directly available to the model. In some cases, a hidden layer is no longer needed. However, there are limitations to the network model. Many more input nodes must be processed to use the transformations of the original inputs. With higher-order systems, the problem is exacerbated. Yet, because of the finite processing time of computers, it is important that the inputs are not expanded more than is needed to get an accurate solution.

Functional-link networks were developed by Yoh-Han Pao and are documented in his book, Adaptive Pattern Recognition and Neural Networks. Pao draws a distinction between truly adding higher order terms, in the sense that some of these terms represent joint activations, versus functional expansion, which increases the dimension of the representation space without adding joint activations. While most developers recognize the difference, researchers typically treat these two aspects in the same way. Pao has been awarded a patent for the functional-link network, so its commercial use may require royalty licensing.

5.1.6 Self-Organizing Map into Back-Propagation.

A hybrid network uses a self-organizing map to conceptually separate the data before that data is used in the normal back-propagation manner. This map helps to visualize topologies and hierarchical structures of higher-order input spaces before they are entered into the feedforward, back-propagation network. The change to the input is similar to having an automatic functional-link input structure. This self-organizing map trains in an unsupervised manner. The rest of the network goes through its normal supervised training.

The self-organizing map, and its unique approach to learning, is described in section 5.4.2.


5.2 Networks for Classification

The previous section describes networks that attempt to make projections of the future. But understanding trends and what impacts those trends might have is only one of several types of applications. The second class of applications is classification. A network that can classify could be used in the medical industry to process both lab results and doctor-recorded patient symptoms to determine the most likely disease. Other applications can separate the "tire kicker" inquiries from the requests for information from real buyers.

5.2.1 Learning Vector Quantization.

This network topology was originally suggested by Teuvo Kohonen in the mid 80's, well after his original work in self-organizing maps. Both this network and self-organizing maps are based on the Kohonen layer, which is capable of sorting items into appropriate categories of similar objects. Specifically, Learning Vector Quantization is an artificial neural network model used both for classification and image segmentation problems.

Topologically, the network contains an input layer, a single Kohonen layer and an output layer. An example network is shown in Figure 5.2.1. The output layer has as many processing elements as there are distinct categories, or classes. The Kohonen layer has a number of processing elements grouped for each of these classes. The number of processing elements per class depends upon the complexity of the input-output relationship. Usually, each class will have the same number of elements throughout the layer. It is the Kohonen layer that learns and performs relational classifications with the aid of a training set. This network uses supervised learning rules. However, these rules vary significantly from the back-propagation rules. To optimize the learning and recall functions, the input layer should contain only one processing element for each separable input parameter. Higher-order input structures could also be used.

Learning Vector Quantization classifies its input data into groupings that it determines. Essentially, it maps an n-dimensional space into an m-dimensional space. That is, it takes n inputs and produces m outputs. The networks can be trained to classify inputs while preserving the inherent topology of the training set. Topology preserving maps preserve nearest neighbor relationships in the training set such that input patterns which have not been previously learned will be categorized by their nearest neighbors in the training data.


Figure 5.2.1. An Example Learning Vector Quantization Network.

In the training mode, this supervised network uses the Kohonen layer such that the distance of a training vector to each processing element is computed and the nearest processing element is declared the winner. There is only one winner for the whole layer. The winner will enable only one output processing element to fire, announcing the class or category the input vector belonged to. If the winning element is in the expected class of the training vector, it is reinforced toward the training vector. If the winning element is not in the class of the training vector, the connection weights entering the processing element are moved away from the training vector. This latter operation is referred to as repulsion. During this training process, individual processing elements assigned to a particular class migrate to the region associated with their specific class.
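A sketch of one such training step: the nearest Kohonen element wins, and it is attracted when its class matches the training vector and repelled otherwise. The variable names and learning rate are illustrative.

    import numpy as np

    def lvq_step(codebook, codebook_class, x, x_class, lr=0.05):
        # Distance of the training vector to each Kohonen processing element.
        dists = np.linalg.norm(codebook - x, axis=1)
        winner = int(np.argmin(dists))      # one winner for the whole layer
        if codebook_class[winner] == x_class:
            codebook[winner] += lr * (x - codebook[winner])   # reinforce
        else:
            codebook[winner] -= lr * (x - codebook[winner])   # repulsion
        return winner   # in recall mode, the winner announces the class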

During the recall mode, the distance of an input vector to each processing element is computed and again the nearest element is declared the winner. That in turn generates one output, signifying a particular class found by the network.

There are some shortcomings with the Learning Vector Quantization architecture. Obviously, for complex classification problems with similar objects or input vectors, the network requires a large Kohonen layer with many processing elements per class. This can be overcome with selectively better choices for, or higher-order representation of, the input parameters.


The learning mechanism has some weaknesses which have been addressed by variants to the paradigm. Normally these variants are applied at different phases of the learning process. They introduce a conscience mechanism, a boundary adjustment algorithm, and an attraction function at different points while training the network.

The simple form of the Learning Vector Quantization network suffers from the defect that some processing elements tend to win too often while others, in effect, do nothing. This particularly happens when the processing elements begin far from the training vectors. Here, some elements are drawn in close very quickly and the others remain permanently far away. To alleviate this problem, a conscience mechanism is added so that a processing element which wins too often develops a "guilty conscience" and is penalized. The actual conscience mechanism is a distance bias which is added to each processing element. This distance bias is proportional to the difference between the win frequency of an element and the average processing element win frequency. As the network progresses along its learning curve, this bias proportionality factor needs to be decreased.

The boundary adjustment algorithm is used to refine a solution once a relatively good solution has been found. This algorithm affects the cases when the winning processing element is in the wrong class and the second best processing element is in the right class. A further limitation is that the training vector must be near the midpoint of the space joining these two processing elements. The winning wrong processing element is moved away from the training vector and the second place element is moved toward the training vector. This procedure refines the boundary between regions where poor classifications commonly occur.

In the early training of the Learning Vector Quantization network, it is sometimes desirable to turn off the repulsion. The winning processing element is only moved toward the training vector if the training vector and the winning processing element are in the same class. This option is particularly helpful when a processing element must move across a region having a different class in order to reach the region where it is needed.

5.2.2 Counter-propagation Network.

Robert Hecht-Nielsen developed the counter-propagation network as a means to combine an unsupervised Kohonen layer with a teachable output layer. This is yet another topology to synthesize complex classification problems, while trying to minimize the number of processing elements and training time. The operation for the counter-propagation network is similar to that of the Learning Vector Quantization network in that the middle Kohonen layer acts as an adaptive look-up table, finding the closest fit to an input stimulus and outputting its equivalent mapping.


The first counter-propagation network consisted of a bi-directional mapping between the input and output layers. In essence, while data is presented to the input layer to generate a classification pattern on the output layer, the output layer in turn would accept an additional input vector and generate an output classification on the network's input layer. The network got its name from this counter-posing flow of information through its structure. Most developers use a uni-flow variant of this formal representation of counter-propagation. In other words, there is only one feedforward path from input layer to output layer.

An example network is shown in Figure 5.2.2. The uni-directional counter-propagation network has three layers. If the inputs are not already normalized before they enter the network, a fourth layer is sometimes added. The main layers include an input buffer layer, a self-organizing Kohonen layer, and an output layer which uses the Delta Rule to modify its incoming connection weights. Sometimes this layer is called a Grossberg Outstar layer.

Figure 5.2.2. An Example Counter-propagation Network.


The size of the input layer depends upon how many separable parameters define the problem. With too few, the network may not generalize sufficiently. With too many, the processing time takes too long.

For the network to operate properly, the input vector must be normalized. This means that for every combination of input values, the total "length" of the input vector must add up to one. This can be done with a preprocessor, before the data is entered into the counter-propagation network. Or, a normalization layer can be added between the input and Kohonen layers. The normalization layer requires one processing element for each input, plus one more for a balancing element. This layer modifies the input set before going to the Kohonen layer to guarantee that all input sets combine to the same total.
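A minimal sketch of such a preprocessor, rescaling each input vector to unit Euclidean length. This is one common reading of "the total 'length' of the input vector must add up to one"; the exact normalization scheme depends on the implementation.

    import math

    def normalize(vec):
        length = math.sqrt(sum(v * v for v in vec))
        if length == 0:
            return list(vec)            # leave the zero vector unchanged
        return [v / length for v in vec]

    print(normalize([3.0, 4.0]))        # -> [0.6, 0.8], unit length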

Normalization of the inputs is necessary to insure that the Kohonen layer finds the correct class for the problem. Without normalization, larger input vectors bias many of the Kohonen processing elements such that weaker value input sets cannot be properly classified. Because of the competitive nature of the Kohonen layer, the larger value input vectors overpower the smaller vectors. Counter-propagation uses a standard Kohonen paradigm which self-organizes the input sets into classification zones. It follows the classical Kohonen learning law described in section 4.2 of this report. This layer acts as a nearest neighbor classifier in that the processing elements in the competitive layer autonomously adjust their connection weights to divide up the input vector space in approximate correspondence to the frequency with which the inputs occur. There need to be at least as many processing elements in the Kohonen layer as output classes. The Kohonen layer usually has many more elements than classes simply because additional processing elements provide a finer resolution between similar objects.

The output layer for counter-propagation is basically made up of processing elements which learn to produce an output when a particular input is applied. Since the Kohonen layer includes competition, only a single output is produced for a given input vector. This layer provides a way of decoding that input to a meaningful output class. It uses the Delta Rule to back-propagate the error between the desired output class and the actual output generated with the training set. The errors only adjust the connection weights coming into the output layer. The Kohonen layer is not affected.

Since only one output from the competitive Kohonen layer is active at a time and all other elements are zero, the only weights adjusted for the output processing elements are the ones connected to the winning element in the competitive layer. In this way the output layer learns to reproduce a certain pattern for each active processing element in the competitive layer. If several competitive elements belong to the same class, that output processing element will evolve weights in response to those competitive processing elements and zero for all others.

There is a problem which could arise with this architecture. The competitive Kohonen layer learns without any supervision. It does not know what class it is responding to. This means that it is possible for a processing element in the Kohonen layer to learn to take responsibility for two or more training inputs which belong to different classes. When this happens, the output of the network will be ambiguous for any inputs which activate this processing element. To alleviate this problem, the processing elements in the Kohonen layer could be pre-conditioned to learn only about a particular class.

5.2.3 Probabilistic Neural Network.

The probabilistic neural network was developed by Donald Specht. His network architecture was first presented in two papers, Probabilistic Neural Networks for Classification, Mapping or Associative Memory and Probabilistic Neural Networks, released in 1988 and 1990, respectively. This network provides a general solution to pattern classification problems by following an approach developed in statistics, called Bayesian classifiers. Bayes theory, developed in the 1950's, takes into account the relative likelihood of events and uses a priori information to improve prediction. The network paradigm also uses Parzen Estimators, which were developed to construct the probability density functions required by Bayes theory.

The probabilistic neural network uses a supervised training set to develop distribution functions within a pattern layer. These functions, in the recall mode, are used to estimate the likelihood of an input feature vector being part of a learned category, or class. The learned patterns can also be combined, or weighted, with the a priori probability, also called the relative frequency, of each category to determine the most likely class for a given input vector. If the relative frequency of the categories is unknown, then all categories can be assumed to be equally likely and the determination of category is solely based on the closeness of the input feature vector to the distribution function of a class.
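The decision rule just described can be stated compactly; in the usual notation of the statistics literature (not symbols defined in this report), the chosen class is

    \hat{C}(\mathbf{x}) = \arg\max_k \; h_k \, f_k(\mathbf{x})

where h_k is the a priori probability, or relative frequency, of class k and f_k is its probability density function. When the h_k are unknown they are set equal, and the rule reduces to picking the class whose density is largest at the input vector.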

An example of a probabilistic neural network is shown in Figure 5.2.3. This network has three layers. The network contains an input layer which has as many elements as there are separable parameters needed to describe the objects to be classified. It has a pattern layer, which organizes the training set such that each input vector is represented by an individual processing element. And finally, the network contains an output layer, called the summation layer, which has as many processing elements as there are classes to be recognized. Each element in this layer combines the outputs of the processing elements within the pattern layer which relate to the same class and prepares that category for output. Sometimes a fourth layer is added to normalize the input vector, if the inputs are not already normalized before they enter the network. As with the counter-propagation network, the input vector must be normalized to provide proper object separation in the pattern layer.

Figure 5.2.3. A Probabilistic Neural Network Example.

As mentioned earlier, the pattern layer represents a neural implementation of a version of a Bayes classifier, where the class-dependent probability density functions are approximated using a Parzen estimator. This approach provides an optimum pattern classifier in terms of minimizing the expected risk of wrongly classifying an object. With the estimator, the approach gets closer to the true underlying class density functions as the number of training samples increases, so long as the training set is an adequate representation of the class distinctions.
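One common form of the Parzen estimator, assuming a Gaussian kernel (the report does not commit to a particular kernel), builds the class-k density from its n_k training vectors x_ki:

    f_k(\mathbf{x}) = \frac{1}{n_k} \sum_{i=1}^{n_k} \frac{1}{(2\pi)^{d/2}\sigma^{d}} \exp\!\left(-\frac{\lVert \mathbf{x}-\mathbf{x}_{ki}\rVert^{2}}{2\sigma^{2}}\right)

where d is the input dimension and sigma is the global smoothing factor mentioned below. The estimate approaches the true class density as the number of training samples grows.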

In the pattern layer, there is a processing element for each input vector in the training set. Normally, there are equal numbers of processing elements for each output class. Otherwise, one or more classes may be skewed incorrectly and the network will generate poor results. Each processing element in the pattern layer is trained once. An element is trained to generate a high output value when an input vector matches the training vector. The training function may include a global smoothing factor to better generalize classification results. In any case, the training vectors do not have to be in any special order in the training set, since the category of a particular vector is specified by its desired output. The learning function simply selects the first untrained processing element in the correct output class and modifies its weights to match the training vector.

The pattern layer operates competitively, where only the highest match to an input vector wins and generates an output. In this way, only one classification category is generated for any given input vector. If the input does not relate well to any patterns programmed into the pattern layer, no output is generated.
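A compact sketch of this recall pass (a Gaussian-kernel Parzen form is assumed, and all names are illustrative):

    import numpy as np

    def pnn_classify(x, patterns, labels, sigma=0.5):
        # patterns : (n, d) array, one row per pattern-layer element
        # labels   : (n,) class index for each stored pattern
        # sigma    : global smoothing factor of the Parzen estimator
        d2 = np.sum((patterns - x) ** 2, axis=1)
        activations = np.exp(-d2 / (2.0 * sigma ** 2))   # pattern layer
        classes = np.unique(labels)
        scores = np.array([activations[labels == c].sum() for c in classes])
        return classes[np.argmax(scores)]                # competitive output

    patterns = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    labels = np.array([0, 0, 1, 1])
    print(pnn_classify(np.array([0.05, 0.1]), patterns, labels))   # -> 0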

The Parzen estimation can be added to the pattern layer to fine tune the classification of objects. This is done by adding the frequency of occurrence for each training pattern built into a processing element. Basically, the probability distribution of occurrence for each example in a class is multiplied into its respective training node. In this way, a more accurate expectation of an object is added to the features which make it recognizable as a class member.

Training of the probabilistic neural network is much simpler than with back-propagation. However, the pattern layer can be quite huge if the distinction between categories is varied and at the same time quite similar in special areas. There are many proponents for this type of network, since the groundwork for optimization is founded in well known, classical mathematics.

5.3 Networks for Data Association

The previous class of networks, classification, is related to networks for data association. In data association, classification is still done. For example, a character reader will classify each of its scanned inputs. However, an additional element exists for most applications. That element is the fact that some data is simply erroneous. Credit card applications might have been rendered unreadable by water stains. The scanner might have lost its light source. The card itself might have been filled out by a five-year-old. Networks for data association recognize these occurrences as simply bad data and they recognize that this bad data can span all classifications.

5.3.1 Hopfield Network.

John Hopfield first presented his cross-bar associative network in 1982 at the National Academy of Sciences. In honor of Hopfield's success and his championing of neural networks in general, this network paradigm is usually referred to as a Hopfield Network. The network can be conceptualized in terms of its energy and the physics of dynamic systems. A processing element in the Hopfield layer will change state only if the overall "energy" of the state space is reduced. In other words, the state of a processing element will vary depending on whether the change will reduce the overall "frustration level" of the network. Primary applications for this sort of network have included associative, or content-addressable, memories and a whole set of optimization problems, such as the combinatoric best route for a traveling salesman.

Figure 5.3.1 outlines a basic Hopfield network. The original network had each processing element operate in a binary format. This is where the elements compute the weighted sum of the inputs and quantize the output to a zero or one. These restrictions were later relaxed, in that the paradigm can use a sigmoid-based transfer function for finer class distinction. Hopfield himself showed that the resulting network is equivalent to the original network designed in 1982.

Figure 5.3.1. A Hopfield Network Example.

The Hopfield network uses three layers: an input buffer, a Hopfield layer, and an output layer. Each layer has the same number of processing elements. The inputs of the Hopfield layer are connected to the outputs of the corresponding processing elements in the input buffer layer through variable connection weights. The outputs of the Hopfield layer are connected back to the inputs of every other processing element except itself. They are also connected to the corresponding elements in the output layer. In normal recall operation, the network applies the data from the input layer through the learned connection weights to the Hopfield layer. The Hopfield layer oscillates until some fixed number of cycles has been completed, and the current state of that layer is passed on to the output layer. This state matches a pattern already programmed into the network.

The learning of a Hopfield network requires that a training pattern be impressed on both the input and output layers simultaneously. The recursive nature of the Hopfield layer provides a means of adjusting all of the connection weights. The learning rule is the Hopfield Law, where connections are increased when both the input and output of a Hopfield element are the same and the connection weights are decreased if the output does not match the input. Obviously, any non-binary implementation of the network must have a threshold mechanism in the transfer function, or matching input-output pairs could be too rare to train the network properly.
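A minimal sketch of this storage and recall cycle, assuming the common bipolar (+1/-1) formulation with an outer-product form of the Hopfield Law (all names are illustrative):

    import numpy as np

    def hopfield_train(patterns):
        # Outer-product rule: weights grow where two elements agree
        # across a pattern and shrink where they differ; the zero
        # diagonal removes self-connections.
        n = patterns.shape[1]
        W = np.zeros((n, n))
        for p in patterns:
            W += np.outer(p, p)
        np.fill_diagonal(W, 0.0)
        return W

    def hopfield_recall(W, x, cycles=10):
        # Iterate until the state settles or the cycle limit is reached.
        x = x.copy()
        for _ in range(cycles):
            x_new = np.where(W @ x >= 0, 1, -1)
            if np.array_equal(x_new, x):
                break
            x = x_new
        return x

    patterns = np.array([[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]])
    W = hopfield_train(patterns)
    noisy = np.array([1, -1, 1, -1, 1, 1])   # first pattern, one bit flipped
    print(hopfield_recall(W, noisy))         # -> [ 1 -1  1 -1  1 -1]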

The Hopfield network has two major limitations when used as a content addressable memory. First, the number of patterns that can be stored and accurately recalled is severely limited. If too many patterns are stored, the network may converge to a novel spurious pattern different from all programmed patterns. Or, it may not converge at all. The storage capacity limit for the network is approximately fifteen percent of the number of processing elements in the Hopfield layer. The second limitation of the paradigm is that the Hopfield layer may become unstable if the stored patterns are too similar. Here an example pattern is considered unstable if it is applied at time zero and the network converges to some other pattern from the training set. This problem can be minimized by modifying the patterns to be more orthogonal to each other.

5.3.2 Boltzmann Machine.

The Boltzmann machine is similar in function and operation to the Hopfield network, with the addition of a simulated annealing technique when determining the original pattern. The Boltzmann machine incorporates the concept of simulated annealing to search the pattern layer's state space for a global minimum. Because of this, the machine will gravitate to an improved set of values over time as data iterates through the system.

Ackley, Hinton, and Sejnowski developed the Boltzmann learning rule in 1985. Like the Hopfield network, the Boltzmann machine has an associated state space energy based upon the connection weights in the pattern layer. The process of learning a training set full of patterns involves the minimization of this state space energy. Because of this, the machine will gravitate to an improved set of values for the connection weights while data iterates through the system.

The Boltzmann machine requires a simulated annealing schedule, which is added to the learning process of the network. Just as in physical annealing, temperatures start at higher values and decrease over time. The increased temperature adds an increased noise factor into each processing element in the pattern layer. Typically, the final temperature is zero. If the network fails to settle properly, adding more iterations at lower temperatures may help to get to an optimum solution.
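A sketch of such a schedule and of the noisy unit it drives (the geometric cooling rate and all constants are illustrative assumptions; the report does not prescribe a particular schedule):

    import numpy as np

    def anneal_schedule(t0=10.0, alpha=0.9, steps=50):
        # Geometric cooling: temperature starts high and decays each step.
        t = t0
        for _ in range(steps):
            yield t
            t *= alpha

    def stochastic_update(net_input, temperature):
        # Boltzmann unit: nearly random at high temperature, nearly
        # deterministic as the temperature approaches zero.
        if temperature <= 0.0:
            return 1 if net_input >= 0 else 0
        p_on = 1.0 / (1.0 + np.exp(-net_input / temperature))
        return 1 if np.random.rand() < p_on else 0

    for t in anneal_schedule(steps=5):
        print(round(t, 2), stochastic_update(0.5, t))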

A Boltzmann machine learning at high temperature behaves much like a random model and at low temperatures it behaves like a deterministic model. Because of the random component in annealed learning, a processing element can sometimes assume a new state value that increases rather than decreases the overall energy of the system. This mimics physical annealing and is helpful in escaping local minima and moving toward a global minimum.

As with the Hopfield network, once a set of patterns is learned, a partial pattern can be presented to the network and it will complete the missing information. The limitation on the number of classes, being less than fifteen percent of the total processing elements in the pattern layer, still applies.

5.3.3 Hamming Network.

The Hamming network is an extension of the Hopfield network in that it adds a maximum likelihood classifier to the front end. This network was developed by Richard Lippmann in the mid 1980's. The Hamming network implements a classifier based upon least error for binary input vectors, where the error is defined by the Hamming distance. The Hamming distance is defined as the number of bits which differ between two corresponding, fixed-length input vectors. One input vector is the noiseless example pattern, the other is a pattern corrupted by real-world events. In this network architecture, the output categories are defined by a noiseless, pattern-filled training set. In the recall mode any incoming input vector is then assigned to the category for which the distance between the example input vectors and the current input vector is minimum.
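A minimal sketch of the distance measure and the resulting minimum-distance assignment (names illustrative; the actual network computes this through matching scores and competition, as described below):

    import numpy as np

    def hamming_distance(a, b):
        # Number of bits that differ between two fixed-length vectors.
        return int(np.sum(a != b))

    def hamming_classify(x, exemplars):
        # Assign the input to the category whose noiseless exemplar is
        # closest in Hamming distance.
        distances = [hamming_distance(x, e) for e in exemplars]
        return int(np.argmin(distances))

    exemplars = np.array([[1, 0, 1, 0, 1], [0, 0, 1, 1, 1]])   # one per class
    corrupted = np.array([1, 0, 1, 0, 0])                      # noisy input
    print(hamming_classify(corrupted, exemplars))              # -> 0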

The Hamming network has three layers. There is an example network shown in Figure 5.3.2. The network uses an input layer with as many nodes as there are separate binary features. It has a category layer, which is the Hopfield layer, with as many nodes as there are categories, or classes. This differs significantly from the formal Hopfield architecture, which has as many nodes in the middle layer as there are input nodes. And finally, there is an output layer which matches the number of nodes in the category layer.

The network is a simple feedforward architecture with the input layer fully connected to the category layer. Each processing element in the category layer is connected back to every other element in that same layer, as well as directly to the corresponding output processing element. The output from the category layer to the output layer is done through competition.


Figure 5.3.2. A Hamming Network Example.

The learning of a Hamming network is similar to the Hopfield methodology in that it requires a single-pass training set. However, in this supervised paradigm, the desired training pattern is impressed upon the input layer while the desired class to which the vector belongs is impressed upon the output layer. Here the output contains only the category output to which the input vector belongs. Again, the recursive nature of the Hopfield layer provides a means of adjusting all connection weights.

The connection weights are first set in the input-to-category subnet such that the matching scores generated by the outputs of the category processing elements are equal to the number of input nodes minus the Hamming distances to the example input vectors. These matching scores range from zero to the total number of input elements and are highest for those input vectors which best match the learned patterns. The category layer's recursive connection weights are trained in the same manner as in the Hopfield network. In normal feedforward operation an input vector is applied to the input layer and must be presented long enough to allow the matching score outputs of the lower input-to-category subnet to settle. This will initialize the input to the Hopfield function in the category layer and allow that portion of the subnet to find the closest class to which the input vector belongs. This layer is competitive, so only one output is enabled at a time.

The Hamming network has a number of advantages over the Hopfield network. It implements the optimum minimum error classifier when input bit errors are random and independent. So, the Hopfield network, with its random set-up nature, can at best match the Hamming solution, and may do worse. Fewer processing elements are required for the Hamming solution, since the middle layer only requires one element per category, instead of an element for each input node. And finally, the Hamming network does not suffer from the spurious classifications which may occur in the Hopfield network. All in all, the Hamming network is both faster and more accurate than the Hopfield network.

5.3.4 Bi-directional Associative Memory.

This network model was developed by Bart Kosko and again generalizes the Hopfield model. A set of paired patterns is learned, with the patterns represented as bipolar vectors. Like the Hopfield network, when a noisy version of one pattern is presented, the closest pattern associated with it is determined.

Figure 5.3.4. Bi-directional Associative Memory Example.

A diagram of an example bi-directional associative memory is shown in Figure 5.3.4. It has as many inputs as output processing nodes. The two hidden layers are made up of two separate associated memories and represent the size of two input vectors. The two lengths need not be the same, although this example shows identical input vector lengths of four each. The middle layers are fully connected to each other. The input and output layers are, for implementation purposes, the means to enter and retrieve information from the network. Kosko's original work targeted the bi-directional associative memory layers for optical processing, which would not need formal input and output structures.

The middle layers are designed to store associated pairs of vectors. When a noisy pattern vector is impressed upon the input, the middle layers oscillate back and forth until a stable equilibrium state is reached. This state, providing the network is not overtrained, corresponds to the closest learned association and will generate the original training pattern on the output. Like the Hopfield network, the bi-directional associative memory network is susceptible to incorrectly finding a trained pattern when complements of the training set are used as the unknown input vector.
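A minimal sketch of that oscillation, assuming bipolar pattern pairs stored in the usual correlation matrix (names illustrative; note the two vector lengths differ here, which the text permits):

    import numpy as np

    def bam_train(pairs_a, pairs_b):
        # Store associated bipolar pairs in a single correlation matrix.
        M = np.zeros((pairs_a.shape[1], pairs_b.shape[1]))
        for a, b in zip(pairs_a, pairs_b):
            M += np.outer(a, b)
        return M

    def bam_recall(M, a, cycles=20):
        # Bounce activity between the two layers until a stable pair emerges.
        sign = lambda v: np.where(v >= 0, 1, -1)
        b = sign(a @ M)
        for _ in range(cycles):
            a_new = sign(M @ b)
            b_new = sign(a_new @ M)
            if np.array_equal(a_new, a) and np.array_equal(b_new, b):
                break
            a, b = a_new, b_new
        return a, b

    A = np.array([[1, -1, 1, -1, 1, -1], [1, 1, -1, -1, 1, 1]])
    B = np.array([[1, 1, -1, -1], [-1, 1, 1, -1]])
    M = bam_train(A, B)
    noisy = np.array([1, -1, 1, -1, 1, 1])   # A[0] with its last bit flipped
    print(bam_recall(M, noisy))              # -> (A[0], B[0])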

5.3.5 Spatio-Temporal Pattern Recognition (Avalanche).

This network, as shown in Figure 5.3.5, came out of Stephen Grossberg's work in the early 1970's. It basically was developed to explain certain cognitive processes for recognizing time varying sequences of events. In his work at the time he called this network paradigm an "Avalanche" network. Robert Hecht-Nielsen became interested in how this network could be applied to engineering applications. The outcome was the spatio-temporal pattern recognition network. Here, specific patterns, for example audio signals, are memorized and then used as a basis to classify incoming repetitive signals. This network has parameters which allow tuning to accommodate detection of time varying signals.

There is a global bias term attached to each processing element. This term is used to normalize the overall activity in the network. It sets a variable threshold against which processing elements must compete, and ensures that the best match wins. The learning paradigm for the network uses a variant of the Kohonen rule and adds a time varying component to the learning function, called the attack function. This function is also used in the recall mode, to provide a latency to the history of signals passing through the network.

The primary application of spatio-temporal pattern networks appears to be in the area of recognizing repetitive audio signals. One group at General Dynamics has applied this network to classify types of ships based on the sounds their propellers make. Another characteristic of the network is that, because of the slow decay of the attack function, even when the periodicity of the input signal varied by as much as a factor of two, the network was still able to correctly classify the propeller signals.


Figure 5.3.5. A Spatio-temporal Pattern Network Example.

5.4 Networks for Data Conceptualization

Another network type is data conceptualization. In many applications data is not just classified, for not all applications involve data that can fit within a class; not all applications read characters or identify diseases. Some applications need to group data that may, or may not be, clearly definable. An example of this is in the processing of a data base for a mailing list of potential customers. Customers might exist within all classifications, yet they might be concentrated within a certain age group and certain income levels. Also, in real life, other information might stretch and twist the region which contains the vast majority of potential buyers. This process is data conceptualization. It simply tries to identify a group as best as it can.


5.4.1 Adaptive Resonance Network.

Developed by Stephen Grossberg in the mid 1970's, the network creates categories of input data based on adaptive resonance. The topology is biologically plausible and uses an unsupervised learning function. It analyzes behaviorally significant input data and detects possible features or classifies patterns in the input vector.

This network was the basis for many other network paradigms, such as counter-propagation and bi-directional associative memory networks. The heart of the adaptive resonance network consists of two highly interconnected layers of processing elements located between an input and output layer. Each input pattern to the lower resonance layer will induce an expected pattern to be sent from the upper layer to the lower layer to influence the next input. This creates a "resonance" between the lower and upper layers to facilitate network adaptation of patterns.

The network is normally used in biological modeling; however, some engineering applications do exist. The major limitation of the network architecture is its noise susceptibility. Even a small amount of noise on the input vector confuses the pattern matching capabilities of a trained network. The adaptive resonance theory network topology is protected by a patent held by Boston University.

5.4.2 Self-Organizing Map.

Developed by Teuvo Kohonen in the early 1980's, the input data is projected onto a two-dimensional layer which preserves order, compacts sparse data, and spreads out dense data. In other words, if two input vectors are close, they will be mapped to processing elements that are close together in the two-dimensional Kohonen layer that represents the features or clusters of the input data. Here, the processing elements represent a two-dimensional map of the input data.

The primary use of the self-organizing map is to visualize topologies and hierarchical structures of higher-dimensional input spaces. The self-organizing network has been used to create area-filling curves in the two-dimensional space created by the Kohonen layer. The Kohonen layer can also be used for optimization problems by allowing the connection weights to settle out into a minimum energy pattern.

A key difference between this network and many other networks is that the self-organizing map learns without supervision. However, when the topology is combined with other neural layers for prediction or categorization, the network first learns in an unsupervised manner and then switches to a supervised mode for the trained network to which it is attached.


An example self-organizing map network is shown in Figure 5.4.2. The self-organizing map typically has two layers. The input layer is fully connected to a two-dimensional Kohonen layer. The output layer shown here is used in a categorization problem and represents three classes to which the input vector can belong. This output layer typically learns using the delta rule and is similar in operation to the counter-propagation paradigm.

Figure 5.4.2. An Example Self-organizing Map Network.

The Kohonen layer processing elements each measure the Euclidean distance of their weights from the incoming input values. During recall, the Kohonen element with the minimum distance is the winner and outputs a one to the output layer, if one is present. This is a competitive win, so all other processing elements are forced to zero for that input vector. Thus the winning processing element is, in a measurable way, the closest to the input value and thus represents the input value in the Kohonen two-dimensional map. So the input data, which may have many dimensions, comes to be represented by a two-dimensional vector which preserves the order of the higher dimensional input data. This can be thought of as an order-preserving projection of the input space onto the two-dimensional Kohonen layer.


During training, the Kohonen processing element with the smallest distance adjusts its weight to be closer to the values of the input data. The neighbors of the winning element also adjust their weights to be closer to the same input data vector. The adjustment of neighboring processing elements is instrumental in preserving the order of the input space. Training is done with the competitive Kohonen learning law described in counter-propagation.
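A sketch of one training step, with a Gaussian neighborhood function as an illustrative assumption (the report only requires that neighbors of the winner also move toward the input):

    import numpy as np

    def som_train_step(W, grid, x, lr=0.5, radius=1.0):
        # W    : (n_elements, d) weight vectors of the Kohonen layer
        # grid : (n_elements, 2) coordinates of each element in the 2-D map
        # x    : input vector
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        # Elements near the winner on the map receive larger updates,
        # which is what preserves the order of the input space.
        map_dist = np.linalg.norm(grid - grid[winner], axis=1)
        h = np.exp(-(map_dist ** 2) / (2.0 * radius ** 2))
        W += lr * h[:, None] * (x - W)
        return winner

    # Hypothetical 3x3 Kohonen layer over 2-D inputs.
    grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
    W = np.random.default_rng(0).random((9, 2))
    som_train_step(W, grid, np.array([0.9, 0.1]))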

The problem of having one processing element take over for a region and representing too much input data exists in this paradigm. As with counter-propagation, this problem is solved by a conscience mechanism built into the learning function. The conscience rule depends on keeping a record of how often each Kohonen processing element wins, and this information is then used during training to bias the distance measurement. This conscience mechanism helps the Kohonen layer achieve its strongest benefit. The processing elements naturally represent approximately equal information about the input data set. Where the input space has sparse data, the representation is compacted in the Kohonen space, or map. Where the input space has high density, the representative Kohonen elements spread out to allow finer discrimination. In this way the Kohonen layer is thought to mimic the knowledge representation of biological systems.
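A sketch of such a conscience, loosely after DeSieno's scheme [DeSieno, 88]; the bias constant and update rate are illustrative assumptions:

    import numpy as np

    def conscience_winner(W, x, win_freq, bias_factor=10.0):
        # Elements that have won more than their fair share (1/n) have
        # their distance penalized, letting under-used elements win so
        # that all elements come to carry roughly equal information.
        n = len(W)
        dist = np.linalg.norm(W - x, axis=1)
        bias = bias_factor * (win_freq - 1.0 / n)  # positive if over-used
        winner = int(np.argmin(dist + bias))
        win_freq *= 0.999                          # decay the running record
        win_freq[winner] += 0.001                  # credit this win
        return winner

    # Hypothetical usage: 9 elements over 2-D inputs, uniform initial record.
    W = np.random.default_rng(0).random((9, 2))
    freq = np.full(9, 1.0 / 9)
    conscience_winner(W, np.array([0.5, 0.5]), freq)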

5.5 Networks for Data Filtering

The last major type of network is data filtering. An early network, the MADALINE, belongs in this category. The MADALINE removed the echoes from a phone line through a dynamic echo cancellation circuit. More recent work has enabled modems to work reliably at 4800 and 9600 baud through dynamic equalization techniques. Both of these applications utilize neural networks which were incorporated into special purpose chips.

5.5.1 Recirculation.

Recirculation networks were introduced by Geoffrey Hinton and James McClelland as a biologically plausible alternative to back-propagation networks. In a back-propagation network, errors are passed backwards through the same connections that are used in the feedforward mechanism, with an additional scaling by the derivative of the feedforward transfer function. This makes the back-propagation algorithm difficult to implement in electronic hardware.

In a recirculation network, data is processed in one direction only and learning is done using only local knowledge. In particular, the knowledge comes from the state of the processing element and the input value on the particular connection to be adapted. Recirculation networks use unsupervised learning, so no desired output vector is required to be presented at the output layer. The network is auto-associative, where there are the same number of outputs as inputs.

This network has two layers between the input and output layers, called the visible and hidden layers. The purpose of the learning rule is to construct in the hidden layer an internal representation of the data presented at the visible layer. An important case of this is to compress the input data by using fewer processing elements in the hidden layer. In this case, the hidden representation can be considered a compressed version of the visible representation. The visible and hidden layers are fully connected to each other in both directions. Also, each element in both the hidden and visible layers is connected to a bias element. These connections have variable weights which learn in the same manner as the other variable weights in the network.

Figure 5.5.1. An Example Recirculation Network.

The learning process for this network is similar to the bi-directional associative memory technique. Here, the input data is presented to the visible layer and passed on to the hidden layer. The hidden layer passes the incoming data back to the visible layer, which in turn passes the results back to the hidden layer and beyond to the output layer. It is the second pass through the hidden layer where learning occurs. In this manner the input data is recirculated through the network architecture.

During training, the output of the hidden layer at the first pass is the encoded version of the input vector. The output of the visible layer on the next pass is the reconstruction of the original input vector from the encoded vector on the hidden layer. The aim of the learning is to reduce the error between the reconstructed vector and the input vector. This error is also reflected in the difference between the outputs of the hidden layer at the first and final passes, since a good reconstruction will mean that the same values are passed to the hidden layer both times around. Learning seeks to reduce the reconstruction error at the hidden layer also.
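A sketch of one recirculation cycle under these rules (the regression constant, learning rate, and names are illustrative assumptions):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def recirculation_step(W_vh, W_hv, x, lr=0.1, regress=0.75):
        # Pass 1: visible -> hidden gives the encoded vector h0.
        h0 = sigmoid(x @ W_vh)
        # Pass 2: hidden -> visible gives a (partially regressed)
        # reconstruction v1 of the input.
        v1 = regress * x + (1 - regress) * sigmoid(h0 @ W_hv)
        # Pass 3: the reconstruction is re-encoded on the hidden layer.
        h1 = sigmoid(v1 @ W_vh)
        # Local learning: each weight change uses only the states at its
        # own two ends, with no back-propagated derivatives.
        W_hv += lr * np.outer(h0, x - v1)    # reduce visible reconstruction error
        W_vh += lr * np.outer(v1, h0 - h1)   # reduce hidden reconstruction error
        return v1

    rng = np.random.default_rng(1)
    W_vh, W_hv = rng.normal(0, 0.1, (8, 3)), rng.normal(0, 0.1, (3, 8))
    recirculation_step(W_vh, W_hv, rng.random(8))   # 8 visible, 3 hidden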

In most applications of the network, an input data signal is smoothed by compressing then reconstructing the input vector on the output layer. The network acts as a low-pass filter whose transition point is controlled by the number of hidden nodes.


6.0 How Artificial Neural Networks Are Being Used

Artificial neural networks are undergoing the change that occurs when a concept leaves the academic environment and is thrown into the harsher world of users who simply want to get a job done. Many of the networks now being designed are statistically quite accurate, but they still leave a bad taste with users who expect computers to solve their problems absolutely. These networks might be 85% to 90% accurate. Unfortunately, few applications tolerate that level of error.

While researchers continue to work on improving the accuracy of their "creations," some explorers are finding uses for the current technology.

In reviewing this state of the art, it is hard not to be overcome by the bright promises or tainted by the unachieved realities. Currently, neural networks are not the user interface which translates spoken words into instructions for a machine, but some day they will be. Someday, VCRs, home security systems, CD players, and word processors will simply be activated by voice. Touch screen and voice editing will replace the word processors of today while bringing spreadsheets and data bases to a level of usability pleasing to most everyone. But for now, neural networks are simply entering the marketplace in niches where their statistical accuracy is valuable as they await what will surely come.

Many of these niches indeed involve applications where answers are nebulous. Loan approval is one. Financial institutions make more money by having the lowest bad loan rate they can achieve. Systems that are "90% accurate" might be an improvement over the current selection process. Indeed, some banks have proven that the failure rate on loans approved by neural networks is lower than those approved by some of their best traditional methods. Also, some credit card companies are using neural networks in their application screening process.

This newest method of seeking the future by analyzing past experiences has generated its own unique problems. One of those problems is to provide a reason behind the computer-generated answer, say as to why a particular loan application was denied. As mentioned throughout this report, the inner workings of neural networks are "black boxes." Some people have even called the use of neural networks "voodoo engineering." To explain how a network learned and why it recommends a particular decision has been difficult. To facilitate this process of justification, several neural network tool makers have provided programs which explain which input through which node dominates the decision making process. From that information, experts in the application should be able to infer the reason that a particular piece of data is important.


Besides this filling of niches, neural network work is progressing in other, more promising application areas. The next section of this report goes through some of these areas and briefly details the current work. This is done to help stimulate within the reader the various possibilities where neural networks might offer solutions, possibilities such as language processing, character recognition, image compression, and pattern recognition, among others.

6.1 Language Processing

Language processing encompasses a wide variety of applications. These applications include text-to-speech conversion, auditory input for machines, automatic language translation, secure voice keyed locks, automatic transcription, aids for the deaf, aids for the physically disabled which respond to voice commands, and natural language processing.

Many companies and universities are researching how a computer, via ANNs, could be programmed to respond to spoken commands. The potential economic rewards are a proverbial gold mine. If this capability could be shrunk to a chip, that chip could become part of almost any electronic device sold today. Literally hundreds of millions of these chips could be sold.

This magic-like capability needs to be able to understand the 50,000 most commonly spoken words. Currently, according to the academic journals, most of the hearing-capable neural networks are trained to only one talker. These one-talker, isolated-word recognizers can recognize a few hundred words. Within the context of speech, with pauses between each word, they can recognize up to 20,000 words.

Some researchers are touting even greater capabilities, but due to the potential reward the true progress, and methods involved, are being closely held. The most highly touted, and demonstrated, speech-parsing system comes from the Apple Corporation. This network, according to an April 1992 Wall Street Journal article, can recognize most any person's speech through a limited vocabulary.

This work continues in Corporate America (particularly venture capital land), in the universities, and in Japan.

6.2 Character Recognition

Character recognition is another area in which neural networks are providing solutions. Some of these solutions are beyond simply academic curiosities. HNC Inc., according to an HNC spokesman, markets a neural network based product that can recognize hand printed characters through a scanner. This product can take cards, like a credit card application form, and put those recognized characters into a data base. This product has been out for two and a half years. It is 98% to 99% accurate for numbers, a little less for alphabetical characters. Currently, the system is built to highlight characters below a certain percent probability of being right so that a user can manually fill in what the computer could not. This product is in use by banks, financial institutions, and credit card companies.

Odin Corp., according to a press release in the November 4, 1991 Electronic Engineering Times, has also proved capable of recognizing characters, including cursive. This capability utilizes Odin's proprietary Quantum Neural Network software package called QNspec. It has proven uncannily successful in analyzing reasonably good handwriting. It actually benefits from the cursive stroking.

The largest amount of research in the field of character recognition is aimed at scanning oriental characters into a computer. Currently, these characters require four or five keystrokes each. This complicated process elongates the task of keying a page of text into hours of drudgery. Several vendors are saying they are close to commercial products that can scan pages.

6.3 Image (data) Compression

A number of studies have been done proving that neural networks can do real-time compression and decompression of data. These networks are auto-associative in that they can reduce eight bits of data to three and then reverse that process upon restructuring to eight bits again. However, they are not lossless. Because of this losing of bits they do not compete favorably with more traditional methods.

6.4 Pattern Recognition

Recently, a number of pattern recognition applications have been written about in the general press. The Wall Street Journal has featured a system that can detect bombs in luggage at airports by identifying, from small variances, patterns from within specialized sensors' outputs. Another article reported on how a physician had trained a back-propagation neural network on data collected in emergency rooms from people who felt that they were experiencing a heart attack to provide a probability of a real heart attack versus a false alarm. His system is touted as being a very good discriminator in an arena where priority decisions have to be made all the time.

Another application involves the grading of rare coins. Digitized images from an electronic camera are fed into a neural network. These images include several angles of the front and back. These images are then compared against known patterns which represent the various grades for a coin. This system has enabled a quick evaluation for about $15 as opposed to the standard three-person evaluation which costs $200. The results have shown that the neural network recommendations are as accurate as the people-intensive grading method.

Yet, by far the biggest use of neural networks as a recognizer of patterns is within the field known as quality control. A number of automated quality applications are now in use. These applications are designed to find that one in a hundred or one in a thousand part that is defective. Human inspectors become fatigued or distracted. Systems now evaluate solder joints, welds, cuttings, and glue applications. One car manufacturer is now even prototyping a system which evaluates the color of paints. This system digitizes pictures of new batches of paint to determine if they are the right shades.

Another major area where neural networks are being built into pattern recognition systems is as processors for sensors. Sensors can provide so much data that the few meaningful pieces of information can become lost. People can lose interest as they stare at screens looking for "the needle in the haystack." Many of these sensor-processing applications exist within the defense industry. These neural network systems have been shown successful at recognizing targets. These sensor processors take data from cameras, sonar systems, seismic recorders, and infrared sensors. That data is then used to identify probable phenomena.

Another field related to defense sensor processing is the recognition of patterns within the sensor data of the medical industry. A neural network is now being used in the scanning of PAP smears. This network is trying to do a better job at reading the smears than can the average lab technician. Missed diagnoses are an all too common problem throughout this industry. In many cases, a professional must perceive patterns from noise, such as identifying a fracture from an X-ray or cancer from an X-ray "shadow." Neural networks promise, particularly when faster hardware becomes available, help in many areas of the medical profession where data is hard to read.

6.5 Signal Processing

Neural networks' promise for signal processing has resulted in a number of experiments in various university labs. Neural networks have proven capable of filtering out noise. Widrow's MADALINE was the first network applied to a real-world problem. It eliminates noise from phone lines.

Another application is a system that can detect engine misfire simply from the noise. This system, developed by Odin Corp., works on engines up to 10,000 RPM. The Odin system satisfies the California Air Resources Board's mandate that by 1994 new automobiles will have to detect misfire in real time. Misfires are suspected of being a leading cause of pollution. The Odin solution requires 3 kbytes of software running on a Motorola 68030 microprocessor.

6.6 Financial

Neural networks are making big inroads into the financial world. Banking, credit card companies, and lending institutions deal with decisions that are not clear cut. They involve learning and statistical trends.

The loan approval process involves filling out forms which hopefully can enable a loan officer to make a decision. The data from these forms is now being used by neural networks which have been trained on the data from past decisions. Indeed, to meet government requirements as to why applications are being denied, these packages are providing information on what input, or combination of inputs, weighed heaviest on the decision.

Credit card companies are also using similar back-propagation networks to aid in establishing credit risks and credit limits.

In the world of direct marketing, neural networks are being applied to data bases so that these phone peddlers can achieve higher ordering rates from those annoying calls that most of us receive at dinner time. (A probably more lucrative business opportunity awaits the person who can devise a system which will tailor all of the data bases in the world so that certain phone numbers are never selected.)

Neural networks are being used in all of the financial markets - stock, bonds, international currency, and commodities. Some users are cackling that these systems just make them "see green," money that is. Indeed, neural networks are reported to be highly successful in the Japanese financial markets. Daiichi Kangyo Bank has reported that for government bond transactions, neural networks have boosted their hit rate from 60% to 75%. Daiwa Research Institute has reported a neural net system which has scored 20% better than the Nikkei average. Daiwa Securities' stock prediction system has boosted the company's hit rate from 70% to 80%.

6.7 Servo Control

Controlling complicated systems is one of the more promising areas of neural networks. Most conventional control systems model the operation of all the system's processes with one set of formulas. To customize a system for a specific process, those formulas must be manually tuned. It is an intensive process which involves the tweaking of parameters until a combination is found that produces the desired results. Neural networks offer two advantages. First, the statistical model of neural networks is more complex than a simple set of formulas, enabling it to handle a wider variety of operating conditions without having to be retuned. Second, because neural networks learn on their own, they don't require control systems experts, just simply enough historical data so that they can adequately train themselves.

Within the oil industry a neural network has been applied to the refinery process. The network controls the flow of materials and is touted to do that in a more vigilant fashion than distractible humans.

NASA is working on a system to control the shuttle during in-flight maneuvers. This system is known as Martingale's Parametric Avalanche (a spatio-temporal pattern recognition network as explained in section 5.3.5). Another prototype application is known as ALVINN, for Autonomous Land Vehicle in a Neural Network. This project has mounted a camera and a laser range finder on the roof of a vehicle which is being taught to stay in the middle of a winding road.

British Columbia Hydroelectric funded a prototype network to control operations of a power-distribution substation that was so successful at optimizing four large synchronous condensers that it refused to let its supplier, Neural Systems, take it out.

6.8 How to Determine if an Application is a Neural Network Candidate

As seen in the sections above, neural networks are being successfully applied in a number of areas. These applications can be grouped into two broad categories. These categories offer a test for anyone who is considering using neural networks. Basically, a potential application should be examined for the following two criteria:

- Can a neural network replace existing technologies in an area where small improvements in performance can result in a major economic impact? Examples of applications which meet this criterion are:

- loan approvals

- credit card approvals

- financial market predictions

- potential customer analysis for the creation of mailing lists

- Can a neural network be used in an area where current technologies have proven inadequate to make a system viable? Examples of applications which meet this criterion are:

- speech recognition

- text recognition


- target analysis

(Another example where other technologies failed was in explosive detection at airports. Previous systems could not achieve the FAA mandated level of performance, but by adding a neural network the system not only exceeded the performance, it allowed the replacement of a $200,000 component.)

The most successful applications have been focused on a single problem in a high value, high volume, or strategically important application.

The easiest implementations of neural networks occur in solutions where they can be made to be "plug compatible" with existing systems. To simply replace an existing element of a system with a neural network eases an installation. It also increases the likelihood of success. These "plug compatible" solutions might be at the front end of many systems where neural networks can recognize patterns and classify data.


7.0 Emerging Technologies

If the 21st Century is to be the age of intelligent machines, then artificial neural networks will become an integral part of life.

In order that software engineers can lead us to this "promised life" they must begin by utilizing the emerging technology of Neural Networks. To do that they must optimize their time by using already implemented hardware and commercial software packages while anticipating what is still to come. To accomplish this understanding, this section is broken into two pieces - what currently exists and what implementors think the next developments will be.

7.1 What Currently Exists

Currently, a number of vendors exist within the marketplace. These vendors are each seeking a share of the neural network business. Some of them do that by hitching their wagon to other packages within the industry. Neural network products exist which are simply add-ons to the popular data bases and spreadsheets. Other products are geared for particular operating systems on particular machines. There are vendors of neural network development tools for most machines. The most popular tools work on either Apple's Macintosh or the IBM PC standard.

Some of these packages are geared toward particular applications such as image processing. Others are general but lack good data routing capabilities. Each of these companies is identifying its weaknesses and working on them. It is an exciting time for them, with both the rewards and risks high.

In choosing a development tool a software engineer needs to be wary of this emerging field. Most products have not evolved into the user friendly routines that draw raves. This is a young field. Its very volatility has created a confusing set of offerings, and features within offerings, which will ultimately be relegated to the trash.

7.1.1 Development Systems.

Good development systems allow a user to prototype a network, train it, tweak it, and use it. These systems run on the standard range of computers. These packages usually don't run on specialized hardware, although some vendors have packaged fast RISC processors into special neural processing boards. Usually, these packages are simply tools which create networks that prove concepts but may be way too slow to run. One of the more complete lists of these vendors is published in the November 1991 issue of Personal Engineering & Instrumentation News.


7.1.2 Hardware Accelerators.

The key to the continued evolution of neural networking lies in the hardware. Traditional hardware does not enable the massive parallelism that is required by neural networks. There are several approaches that are being worked on. One is to develop a processor which is specifically tailored to performing the tasks of individual artificial neurons. Another approach is to package fast processors, primarily RISCs, onto a hardware accelerator. These processors can be packed many to a board to facilitate the parallel nature of neural networks. Other accelerator boards simply provide more horsepower for sequential processing.

Accelerator board products are being developed both independently and by the makers of neural network development systems. Each has specific characteristics that lend themselves to particular solutions.

7.1.3 Dedicated Neural Processors.

Dedicated neural processors are processors with specific capabilities that enable their use in neural networks. Several of the large chip manufacturers have developed neural processors. Some of these processors were created specifically for the development system vendors. Some of these chips package a number of simplistic neurons onto one chip. Others incorporate proprietary concepts, such as creating a specific type of fuzzy neuron. These chips come in many broad technologies - analog, digital, hybrid, and optical. There is no clear winner to date.

7.2 What the Next Developments Will Be

The vendors within the industry predict that migration from tools to applications will continue. In particular, the trend is to move toward hybrid systems. These systems will encompass other types of processes, such as fuzzy logic, expert systems, and genetic algorithms. Indeed, several manufacturers are working on "fuzzy neurons."

The greatest interest is in merging fuzzy logic with neural networks. Fuzzy logic incorporates the inexactness of life into mathematics. In life most pieces of data do not exactly fit into certain categories. For instance, a person is not just short or tall. He can be kinda short, pretty tall, a little above average, or very tall. Fuzzy logic takes these real-world variations into account. In potential applications of neural networks, in systems which solve real problems, this fuzziness is a large part of the problem. In automating a car, to stop is not to slam on the brakes, to speed up is not to "burn rubber." To help neural networks accommodate this fuzziness of life, some researchers are developing fuzzy neurons. These neurons do not simply give yes/no answers. They provide a more fuzzy answer.


Systems built with fuzzy neurons may be initialized to what an expert thinks are the rules and the weights for a given application. This merging of expert systems and fuzzy logic with neural networks utilizes the strengths of all three disciplines to provide a better system than any of them could provide alone. Expert systems have the problem that most experts don't exactly understand all of the nuances of an application and, therefore, are unable to clearly state rules which define the entire problem to someone else. But the neural network doesn't care that the rules are not exact, for neural networks can then learn, and then correct, the expert's rules. It can add nodes for concepts that the expert might not understand. It can tailor the fuzzy logic which defines states like tall, short, fast, or slow. It can tweak itself until it meets the user-identified state of being a workable tool. In short, hybrid systems are the future.


8.0 Summary

In summary, artificial neural networks are one of the promises for the future in computing. They offer an ability to perform tasks outside the scope of traditional processors. They can recognize patterns within vast data sets and then generalize those patterns into recommended courses of action. Neural networks learn; they are not programmed.

Yet, even though they are not traditionally programmed, the designing of neural networks does require a skill. It requires an "art." This art involves the understanding of the various network topologies, current hardware, current software tools, the application to be solved, and a strategy to acquire the necessary data to train the network. This art further involves the selection of learning rules, transfer functions, summation functions, and how to connect the neurons within the network.

Then, the art of neural networking requires a lot of hard work as data is fed into the system, performances are monitored, processes tweaked, connections added, rules modified, and on and on until the network achieves the desired results.

These desired results are statistical in nature. The network is not always right. It is for that reason that neural networks are finding themselves in applications where humans are also unable to always be right. Neural networks can now pick stocks, cull marketing prospects, approve loans, deny credit cards, tweak control systems, grade coins, and inspect work.

Yet, the future holds even more promises. Neural networks need faster hardware. They need to become part of hybrid systems which also utilize fuzzy logic and expert systems. It is then that these systems will be able to hear speech, read handwriting, and formulate actions. They will be able to become the intelligence behind robots that never tire nor become distracted. It is then that they will become the leading edge in an age of "intelligent" machines.


9.0 References

[Aarts, 89] Aarts, Emile, and Korst, Jan, Simulated Annealing and Boltzmann Machines, A Stochastic Approach to Combinatorial Optimization and Neural Computing, Wiley, Tiptree Essex GB, 1989.

[Abu-Mostafa, 85] Abu-Mostafa, Y.S., and St. Jacques, J.M., "Information Capacity of the Hopfield Model", IEEE Transactions on Information Theory, Volume IT-31, Number 4, July 1985.

[Ackley, 85] Ackley, D.H., Hinton, G.E., and Sejnowski, T.J., "A Learning Algorithm for Boltzmann Machines", Cognitive Science, Volume 9, 1985.

[Ahmad, 90] Ahmad, Z., "Improving the Solution of the Inverse Kinematic Problem in Robotics Using Neural Networks", Journal of Neural Network Computing, Volume 1, Number 4, Spring 1990.

[Anderson, 72] Anderson, James A., "A Simple Neural Network Generating an Interactive Memory", Mathematical Biosciences, Volume 14, 1972.

[Anderson, 77] Anderson, James A., Silverstein, Jack W., Ritz, Stephen A., and Jones, Randall S., "Distinctive Features, Categorical Perception, and Probability Learning: Some Applications of a Neural Model", Psychological Review, Volume 84, Number 5, September 1977.

[Anderson, 83] Anderson, James A., "Cognitive and Psychological Computation with Neural Models", IEEE Transactions on Systems, Man, and Cybernetics, Volume SMC-13, Number 5, September 1983.

[Anderson, 86a] Anderson, James A., and Murphy, Gregory L., "Psychological Concepts in a Parallel System", Physica 22D, 1986.

[Anderson, 86b] Anderson, James A., "Cognitive Capabilities of a Parallel System", Springer-Verlag, 1986.

[Anderson, 88] Anderson, J.A., and Rosenfeld, E., eds., Neurocomputing: Foundations of Research, MIT Press, Boston, MA, 1988, page 125.

[Antognetti, 91] Antognetti, P., and Milutinovic, V., Neural Networks: Concepts, Applications, and Implementations (Eds.), Volumes I-IV, Prentice Hall, Englewood Cliffs, NJ, 1991.

[Baba, 77] Baba, Norio, Shoman, T., and Sawaragi, Y., "A Modified Convergence Theorem for a Random Optimization Method", Information Sciences, Volume 13, 1977.


[Baba, 89] Baba, Norio, "A New Approach for Finding the Global Minimum of Error Function of Neural Networks", Neural Networks, Volume 2, 1989.

[Barto, 81] Barto, A.G., and Sutton, R.S., "Goal Seeking Components for Adaptive Intelligence: An Initial Assessment", Air Force Wright Aeronautical Laboratories/Avionics Laboratory Technical Report AFWAL-TR-81-1070, Wright-Patterson AFB, Ohio, 1981.

[Batchelor, 74] Batchelor, B.G., Practical Approach To Pattern Recognition, Plenum Press, New York, 1974.

[Bernstein, 81] Bernstein, J., "Profiles: Marvin Minsky", The New Yorker, December 1981.

[Brown, 87] Brown, Robert J., "An Artificial Neural Network Experiment", Dr. Dobb's Journal, April 1987.

[Burr, 88] Burr, D.J., "An Improved Elastic Net Method for the Traveling Salesman Problem", Proc. of the IEEE First International Conference on Neural Networks, Volume 1, June 1988.

[California, 88] California Scientific Software, "Brainmaker User Guide and Reference Manual", California Scientific Software, 1988.

[Carpenter, 87a] Carpenter, Gail A., and Grossberg, Stephen, "A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine", Computer Vision, Graphics and Image Processing 37, 1987.

[Carpenter, 87b] Carpenter, Gail A., and Grossberg, Stephen, "ART 2: Self-Organization of Stable Category Recognition Codes for Analog Input Patterns", Applied Optics, 1987.

[Carpenter, 87c] Carpenter, Gail A., and Grossberg, Stephen, "ART 3: Hierarchical Search Using Chemical Transmitters in Self-Organizing Pattern Recognition Architectures", Neural Networks, Volume 3, 1987.

[Carpenter, 88] Carpenter, Gail A., and Grossberg, Stephen, "The ART of Adaptive Pattern Recognition by a Self-Organizing Neural Network", Computer, March 1988.

[Carpenter, 91] Carpenter, Gail, A., Grossberg, Stephen, and Rosen, D.B.,"ART-2A: An Adaptive Resonance Algorithm for Rapid Category LearningAnd Recognition", Neural Networks, Volume 4, 1991.


[Caudill, 90] Caudill, Maureen, and Butler, Charles, Naturally Intelligent Systems, The MIT Press, ISBN 0-262-03156-6, 1990.

[Cohen, 83] Cohen, Michael A., and Grossberg, Stephen, "Absolute Stability of Global Pattern Formation and Parallel Memory Storage by Competitive Neural Networks", IEEE Transactions on Systems, Man and Cybernetics, Volume SMC-13, 1983.

[Cottrell, 87] Cottrell, G.W., Munro, P., and Zipser, D., "Learning Internal Representations from Gray-Scale Images: An Example of Extensional Programming", Proc. 9th Annual Conference of the Cognitive Science Society, 1987.

[DeSieno, 88] DeSieno, D., "Adding a Conscience to Competitive Learning", Proc. of the Second Annual IEEE International Conference on Neural Networks, Volume 1, 1988.

[Durbin, 87] Durbin, R., and Willshaw, D., "An Analog Approach to the Traveling Salesman Problem Using an Elastic Net Method", Nature, Volume 326, April 1987.

[Eberhart, 90] Eberhart, Russell C., and Dobbins, Roy W., Neural Network PC Tools: A Practical Guide, Academic Press, ISBN 0-12-228640-5, 1990.

[Fahlmann, 88] Fahlmann, Scott E., "An Empirical Study of Learning Speed in Back-Propagation Networks", CMU Technical Report CMU-CS-88-162, June 1988.

[Fukushima, 75] Fukushima, Kunihiko, "Cognitron: A Self-Organizing Multilayered Neural Network", Biological Cybernetics, Volume 20, 1975.

[Fukushima, 80] Fukushima, Kunihiko, "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position", Biological Cybernetics, Volume 36, 1980.

[Fukushima, 83] Fukushima, Kunihiko, and Ito, T., "Neocognitron: A Neural Network Model for a Mechanism of Visual Pattern Recognition", IEEE Transactions on Systems, Man, and Cybernetics 13(5), September/October 1983, pp. 826-34.

[Fukushima, 88] Fukushima, Kunihiko, "Neocognitron: A Hierarchical Neural Network Capable of Visual Pattern Recognition", Neural Networks, Volume 1, 1988.

[Fukushima, 89] Fukushima, Kunihiko, "Analysis of the Process of Visual Pattern Recognition by the Neocognitron", Neural Networks, Volume 2, 1989.


[Galland, 88] Galland, C., "Biologically Plausible Learning Procedures in Connectionist Networks", M.S. Thesis, Department of Physics, University of Toronto, 1988.

[Gaudiano, 91] Gaudiano, P., and Grossberg, Stephen, "Vector Associative Maps: Unsupervised Real-Time Error-Based Learning and Control of Movement Trajectories", Neural Networks, Volume 4, 1991.

[Glover, 88] Glover, David E., "A Hybrid Optical Fourier/Electronic Neurocomputer Machine Vision Inspection System", Proc. Vision '88 Conference, SME/MVA, 1988.

[Glover, 89] Glover, David E., "Optical Processing and Neurocomputing in an Automated Inspection System", Journal of Neural Network Computing, Fall 1989.

[Golden, 86] Golden, Richard M., "Brain-State-in-a-Box Neural Model is a Gradient Descent Algorithm", Journal of Mathematical Psychology, 1986.

[Gorman, 88] Gorman, R.P., and Sejnowski, T.J., "Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets", Neural Networks, Volume 1, 1988.

[Grossberg, 69a] Grossberg, Stephen, "Embedding Fields: A Theory of Learning with Physiological Implications", Journal of Mathematical Psychology, Volume 6, 1969.

[Grossberg, 69b] Grossberg, Stephen, "Some Networks That Can Learn, Remember, and Reproduce any Number of Complicated Space-Time Patterns, I", Journal of Mathematics and Mechanics, Volume 19, 1969.

[Grossberg, 70] Grossberg, Stephen, "Some Networks That Can Learn, Remember, and Reproduce any Number of Complicated Space-Time Patterns, II", Studies in Applied Mathematics, Volume 49, 1970.

[Grossberg, 71] Grossberg, Stephen, "Embedding Fields: Underlying Philosophy, Mathematics, and Applications to Psychology, Physiology, and Anatomy", Journal of Cybernetics, Volume 1, 1971.

[Grossberg, 76] Grossberg, Stephen, "Adaptive Pattern Classification and Universal Recoding: I. Parallel Development and Coding of Neural Feature Detectors", Biological Cybernetics, Volume 23, 1976.


[Grossberg, 85] Grossberg, Stephen, and Mingolla, E., "Neural Dynamics of Perceptual Grouping: Textures, Boundaries, and Emergent Segmentations", Perception and Psychophysics, Volume 38, 1985.

[Grossberg, 89] Grossberg, Stephen, and Rudd, M.E., "A Neural Network for Visual Motion Perception: Group and Element Apparent Motion", Neural Networks, Volume 2, 1989.

[Hebb, 49] Hebb, D.O., The Organization of Behavior, Wiley, New York, New York, 1949.

[Hecht-Nielsen, 86] Hecht-Nielsen, Robert, "Nearest Matched Filter Classification of Spatio-temporal Patterns", special report published by Hecht-Nielsen Neuro-Computer Corporation, San Diego, California, June 1986.

[Hecht-Nielsen, 87] Hecht-Nielsen, Robert, "Counter-Propagation Networks", IEEE First International Conference on Neural Networks, Volume II, 1987.

[Hecht-Nielsen, 88] Hecht-Nielsen, Robert, "Neurocomputing: Picking the Human Brain", IEEE Spectrum 25(3), March 1988, pp. 36-41.

[Hecht-Nielsen, 90] Hecht-Nielsen, Robert, Neurocomputing, Addison-Wesley, ISBN 0-201-09255-3, 1990.

[Hegde, 88] Hegde, S.V., Sweet, J.L., and Levy, W.B., "Determination of Parameters in a Hopfield/Tank Computational Network", Proc. of the IEEE First International Conference on Neural Networks, Volume 2, June 1987.

[Hinton, 87] Hinton, Geoffrey E., and Sejnowski, Terrence J., "Neural Network Architectures for AI", Tutorial Number MP2, National Conference on Artificial Intelligence (AAAI-87), July 1987.

[Hinton, 88] Hinton, G.E., and McClelland, J.L., "Learning Representations by Recirculation", Proc. of the IEEE Conference on Neural Information Processing Systems, November 1988.

[Hopfield, 82] Hopfield, John J., "Neural Networks and Physical Systems with Emergent Collective Computational Abilities", Proceedings of the National Academy of Sciences, Volume 79, April 1982.

[Hopfield, 83] Hopfield, John J., Feinstein, D.I., and Palmer, R.G., "Unlearning has a Stabilizing Effect in Collective Memories", Nature, Volume 304, July 1983.

[Hopfield, 84] Hopfield, John J., and Tank, David W., "Neural Computation of Decisions in Optimization Problems", Biological Cybernetics, Volume 52, 1985.


[Hopfield, 86a] Hopfield, John J., "Physics, Biological Computation and Complementarity", The Lessons of Quantum Theory, Elsevier Science Publishers B.V., 1986.

[Hopfield, 86b] Hopfield, John J., and Tank, David W., "Collective Computation with Continuous Variables", Disordered Systems and Biological Organization, Springer-Verlag, 1986.

[Isik, 91] Isik, C., and Uhrig, R.E., "Neural Networks and Power Utility Applications", EPRI Knowledge Based Technology Applications Seminar, September 1991.

[Jacobs, 88] Jacobs, R.A., "Increased Rates of Convergence Through Learning Rate Adaptation", Neural Networks, Volume 1, 1988.

[Johnson, 89] Johnson, R. Colin, "Neural Nose to Sniff Out Explosives at JFK Airport", Electronic Engineering Times, May 1, 1989.

[Johnson, 91a] Johnson, R. Colin, "Moto Readies Odin Neural Net", Electronic Engineering Times, November 4, 1991.

[Johnson, 91b] Johnson, R. Colin, "Darpa Continues Neural Funding", Electronic Engineering Times, August 5, 1991.

[Johnson, 92a] Johnson, R. Colin, "Odin Delivers Neural Pack", Electronic Engineering Times, March 9, 1992.

[Johnson, 92b] Johnson, R. Colin, "Gap Closing Between Fuzzy, Neural Nets", Electronic Engineering Times, April 13, 1992.

[Jorgensen, 86] Jorgensen, C., and Matheus, C., "Catching Knowledge in Neural Nets", AI Expert, December 1986.

[Kirkpatrick, 83] Kirkpatrick, S., Gelatt Jr., C.D., and Vecchi, M.P., "Optimization by Simulated Annealing", Science, Volume 220, 1983.

[Klimasauskas, 91] Klimasauskas, Casimir, "Applying Neural Networks: Parts I-VI", PC AI, January-December 1991.

[Kohonen, 82] Kohonen, T., "Self-Organized Formation of Topologically Correct Feature Maps", Biological Cybernetics, Volume 43, 1982.

[Kohonen, 88a] Kohonen, T., Self-Organization and Associative Memory, Second Edition, Springer-Verlag, New York, 1988.


[Kohonen, 88b] Kohonen, T., et al., "Statistical Pattern Recognition with Neural Networks: Benchmark Studies", Proc. of the Second Annual IEEE International Conference on Neural Networks, Volume 1, 1988.

[Kosko, 87] Kosko, Bart, "Adaptive Bidirectional Associative Memories", Applied Optics, Volume 26, 1987.

[Kosko, 92] Kosko, Bart, Neural Networks and Fuzzy Systems, Prentice Hall, Englewood Cliffs, NJ, 1992.

[Lapedes, 87] Lapedes, A., and Farber, R., "Non-Linear Signal Processing Using Neural Networks: Prediction and System Modeling", Los Alamos National Laboratory Report LA-UR-87-2662, 1987.

[Lippman, 87] Lippmann, Richard P., "An Introduction to Computing with Neural Nets", IEEE ASSP Magazine, April 1987.

[Maren, 90] Maren, Alianna, Harston, Craig, and Pap, Robert, Handbook of Neural Computing Applications, Academic Press, ISBN 0-12-546090-2, 1990.

[Matyas, 65] Matyas, J., "Random Optimization", Automation and Remote Control, Volume 26, 1965.

[McClelland, 87] McClelland, J.L., and St. John, M., "Three Short Papers on Language and Connectionism", Carnegie-Mellon University Technical Report AIP-1, 1987.

[McCord, 91] McCord-Nelson, Marilyn, and Illingworth, W.T., A Practical Guide to Neural Nets, Addison-Wesley, ISBN 0-201-52376-0, 1991.

[McCulloch, 43] McCulloch, Warren S., and Pitts, Walter H., "A Logical Calculus of the Ideas Immanent in Nervous Activity", Bulletin of Mathematical Biophysics, Volume 5, 1943.

[McEliece, 86] McEliece, R.J., Posner, E.C., Rodemich, E.R., and Venkatesh, S.S., "Capacity of the Hopfield Associative Memory", California Institute of Technology Internal Report, 1986.

[Miller, 91a] Miller, Donald L., and Pekny, Joseph F., "Exact Solution of Large Asymmetric Traveling Salesman Problems", Science, Volume 251, February 1991.

[Miller, 91b] Miller, W. Thomas, Sutton, Richard S., and Werbos, Paul J., Neural Networks for Control, MIT Press, ISBN 0-262-13261-3, 1991.


[Minai, 90a] Minai, A.A., and Williams, R.D., "Acceleration of Back-Propagation through Learning Rate and Momentum Adaptation", International Joint Conference on Neural Networks, Volume 1, January 1990.

[Minsky, 69] Minsky, Marvin L., and Papert, Seymour S., Perceptrons: An Introduction to Computational Geometry, MIT Press, Cambridge, MA, 1969.

[Miyazaki, 91] Miyazaki, Nobuyuki, "Neural Networks Go to Work in Japan", Electronic Engineering Times, January 28, 1991.

[Nelson, 91] Nelson, M. McCord, and Illingworth, W.T., A Practical Guide to Neural Nets, Addison-Wesley, Reading, MA, 1991.

[NeuralWare, 91] Neural Computing, authored by NeuralWare, Inc. employees for their NeuralWorks Professional II/Plus ANN Development Software, Pittsburgh, PA, 1991.

[North, 91] North, Robert, "Are Neural Networks Practical Engineering Tools or Ivory-tower Technology?", Personal Engineering & Instrumentation News, October 1991.

[Olson, 89] Olson, Willard W., and Huang, Yee-Wei, "Toward Systemic Neural Network Modeling", IJCNN-89 Proceedings, IEEE Cat. #89CH2765-6, June 1989.

[Pao, 89] Pao, Yoh-Han, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, 1989.

[Parker, 85] Parker, David B., "Learning-logic", Report TR-47, Massachusetts Institute of Technology, Center for Computational Research in Economics and Management Science, Cambridge, MA, 1985.

[Parker, 87] Parker, D.B., "Optimal Algorithms for Adaptive Networks: Second Order Back Propagation, Second Order Direct Propagation and Second Order Hebbian Learning", Proc. of the 1st ICNN, Volume II, 1987.

[Porter, 91] Porter, Michael L., "Neural Nets Offer Real Solutions - Not Smoke and Mirrors", Personal Engineering & Instrumentation News, November 1991.

[Reece, 87] Reece, Peter, "Perceptrons and Neural Nets", AI Expert, Volume 2, 1987.

[Reed, 89] Reed, F., "USPS Investigated Neural Nets" (p. 10) and "Neural Networks Fall Short of Miraculous But Still Work" (p. 28), Federal Computer Week, January 23, 1989.


[Reeke, 87] Reeke, G.N., and Edelman, G.M., "Selective Neural Networks and Their Implications for Recognition Automata", International Journal of Supercomputer Applications, Volume 1, 1987.

[Rosenberg, 87] Rosenberg, Charles R., "Revealing the Structure of NETtalk's Internal Representations", Proc. 9th Annual Conference of the Cognitive Science Society, 1987.

[Rosenblatt, 58] Rosenblatt, F., "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain", Psychological Review, Volume 65, 1958.

[Rumelhart, 85] Rumelhart, D.E., Hinton, G.E., and Williams, R.J., "Learning Internal Representations by Error Propagation", Institute for Cognitive Science Report 8506, University of California, San Diego, 1985.

[Rumelhart, 86] Rumelhart, D.E., and McClelland, J.L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, MIT Press, 1986.

[Samad, 89] Samad, Tariq, "Back-Propagation Extensions", Honeywell SSDC Technical Report, Golden Valley, MN, 1989.

[Saridis, 70] Saridis, G.N., "Learning Applied to Successive Approximation Algorithms", IEEE Transactions on Systems Science and Cybernetics, Volume SSC-6, 1970.

[Schalkoff, 92] Schalkoff, R.J., Pattern Recognition: Statistical, Structural, and Neural Approaches, John Wiley & Sons, New York, NY, 1992.

[Scheff, 90] Scheff, K., and Szu, H., "Gram-Schmidt Orthogonalization Neural Networks for Optical Character Recognition", Journal of Neural Network Computing, Winter 1990.

[Schrack, 76] Schrack, G., and Choit, M., "Optimized Relative Step Size Random Searches", Mathematical Programming, Volume 10, 1976.

[Scofield, 88] Scofield, C.L., "Learning Internal Representations in the Coulomb Energy Network", ICNN-88 Proceedings, IEEE Cat. #88CH2632-8, July 1988.

[Sejnowski, 87] Sejnowski, T.J., and Rosenberg, C.R., "Parallel Networks that Learn to Pronounce English Text", Complex Systems, Volume 1, 1987.

[Solis, 81] Solis, F.J., and Wets, R.J.B., "Minimization by Random Search Techniques", Mathematics of Operations Research, Volume 6, 1981.


[Specht, 88] Specht, D.F., "Probabilistic Neural Networks for Classification, Mapping or Associative Memory", ICNN-88 Conference Proceedings, 1988.

[Specht, 90] Specht, D.F., "Probabilistic Neural Networks", Neural Networks, November 1990.

[Stanley, 88] Stanley, Jeannette, "Introduction to Neural Networks", California Scientific Software, 1988.

[Stork, 88] Stork, David G., "Counter-Propagation Networks: Adaptive Hierarchical Networks for Near-Optimal Mappings", Synapse Connection, Volume 1, Number 2, 1988.

[Szu, 87] Szu, H., and Hartley, R., "Fast Simulated Annealing", Physics Letters A, 122(3,4), 1987.

[Szu, 88] Szu, Harold, "Fast TSP Algorithm Based on Binary Neuron Output and Analog Neuron Input Using the Zero-Diagonal Interconnect Matrix and Necessary and Sufficient Constraints of the Permutation Matrix", Proc. of the IEEE First International Conference on Neural Networks, Volume 2, June 1987.

[Tank, 86] Tank, David W., and Hopfield, John J., "Simple Neural Optimization Networks: An A/D Converter, Signal Decision Circuit, and a Linear Programming Circuit", IEEE Transactions on Circuits and Systems, Volume CAS-33, Number 5, May 1986.

[Uttley, 66] Uttley, Albert M., "The Transmission of Information and the Effect of Local Feedback in Theoretical and Neural Networks", Brain Research, Volume 2, 1966.

[Van den Bout, 88] Van den Bout, D.E., and Miller, T.K., "A Traveling Salesman Objective Function That Works", Proc. of the IEEE First International Conference on Neural Networks, Volume 2, June 1987.

[Wassermann, 88] Wassermann, Philip D., "Combined Back-Propagation/Cauchy Machine", Neural Networks: Abstracts of the First INNS Meeting, Volume 1, Pergamon Press, 1988.

[Wassermann, 89] Wassermann, Philip D., Neural Computing, Theory and Practice, Van Nostrand, NY, 1989.

[Watanabe, 85] Watanabe, S., "Theorem of the Ugly Duckling", Pattern Recognition: Human and Mechanical, Wiley, 1985.


[Waltrous, 87] Waltrous, R.L., "Learning Algorithms for Connectionist Networks: Applied Gradient Methods of Nonlinear Optimization", Proc. of the 1st ICNN, Volume II, 1987.

[Widrow, 60a] Widrow, Bernard, and Hoff, Marcian, "Adaptive Switching Circuits", 1960 IRE WESCON Convention Record, Part 4, August 1960.

[Widrow, 60b] Widrow, Bernard, "An Adaptive 'Adaline' Neuron Using Chemical 'Memistors'", Technical Report Number 1553-2, Stanford Electronics Laboratories, October 1960.

[Widrow, 63] Widrow, Bernard, and Smith, Fred W., "Pattern-Recognizing Control Systems", Computer and Information Sciences Symposium Proceedings, Spartan Books, Washington, DC, 1963.

[Widrow, 64] Widrow, Bernard, "Pattern Recognition and Adaptive Control", Applications and Industry, September 1964.

[Widrow, 73] Widrow, Bernard, Gupta, Narendra K., and Maitra, Sidhartha, "Punish/Reward: Learning with a Critic in Adaptive Threshold Systems", IEEE Transactions on Systems, Man, and Cybernetics, Volume SMC-3, Number 5, September 1973.

[Widrow, 75] Widrow, Bernard, Glover, John R., McCool, John M., Kaunitz, John, Williams, Charles S., Hearn, Robert H., Zeidler, James R., Dong, Eugene, and Goodlin, Robert C., "Adaptive Noise Canceling: Principles and Applications", Proceedings of the IEEE, Volume 63, Number 12, December 1975.

[Widrow, 88] Widrow, Bernard, editor, DARPA Neural Network Study, AFCEA International Press, ISBN 0-916159-17-5, 1988.

[Willshaw, 76] Willshaw, D.J., and von der Malsburg, C., "How Patterned Neural Connections Can Be Set Up by Self-Organization", Proc. R. Soc. London B, Volume 194, 1976.

[Wilson, 88] Wilson, G.V., and Pawley, G.S., "On the Stability of the Traveling Salesman Problem Algorithm of Hopfield and Tank", Biological Cybernetics, 1988.

[Wright, 91] Wright, Maury, "Neural-network IC Architectures Define Suitable Applications", EDN, July 4, 1991.

[Yager, 80] Yager, R.R., "A Measurement-Information Discussion of Fuzzy Union and Intersection", International Journal of Man-Machine Studies, 1980.


[Zadeh, 65] Zadeh, L.A., "Fuzzy Sets", Information and Control, 1965.

[Zornetzer, 90] Zornetzer, Steven F., Davis, Joel L., and Lau, Clifford, An Introduction to Neural and Electronic Networks, Academic Press, ISBN 0-12-781881-2, 1990.


An Introduction to Neural Networks

Ben Kröse and Patrick van der Smagt

Eighth edition

November 1996


© 1996 The University of Amsterdam. Permission is granted to distribute single copies of this book for non-commercial use, as long as it is distributed as a whole in its original form, and the names of the authors and the University of Amsterdam are mentioned. Permission is also granted to use this book for non-commercial courses, provided the authors are notified of this beforehand.

The authors can be reached at:

Ben Kröse
Faculty of Mathematics & Computer Science
University of Amsterdam
Kruislaan 403, NL-1098 SJ Amsterdam
THE NETHERLANDS
Phone: +31 20 525 7463
Fax: +31 20 525 7490
email: [email protected]
URL: http://www.fwi.uva.nl/research/neuro/

Patrick van der Smagt
Institute of Robotics and System Dynamics
German Aerospace Research Establishment
P.O. Box 1116, D-82230 Wessling
GERMANY
Phone: +49 8153 282400
Fax: +49 8153 281134
email: [email protected]
URL: http://www.op.dlr.de/FF-DR-RS/


Contents

Preface

I FUNDAMENTALS

1 Introduction

2 Fundamentals
   2.1 A framework for distributed representation
      2.1.1 Processing units
      2.1.2 Connections between units
      2.1.3 Activation and output rules
   2.2 Network topologies
   2.3 Training of artificial neural networks
      2.3.1 Paradigms of learning
      2.3.2 Modifying patterns of connectivity
   2.4 Notation and terminology
      2.4.1 Notation
      2.4.2 Terminology

II THEORY

3 Perceptron and Adaline
   3.1 Networks with threshold activation functions
   3.2 Perceptron learning rule and convergence theorem
      3.2.1 Example of the Perceptron learning rule
      3.2.2 Convergence theorem
      3.2.3 The original Perceptron
   3.3 The adaptive linear element (Adaline)
   3.4 Networks with linear activation functions: the delta rule
   3.5 Exclusive-OR problem
   3.6 Multi-layer perceptrons can do everything
   3.7 Conclusions

4 Back-Propagation
   4.1 Multi-layer feed-forward networks
   4.2 The generalised delta rule
      4.2.1 Understanding back-propagation
   4.3 Working with back-propagation
   4.4 An example
   4.5 Other activation functions
   4.6 Deficiencies of back-propagation
   4.7 Advanced algorithms
   4.8 How good are multi-layer feed-forward networks?
      4.8.1 The effect of the number of learning samples
      4.8.2 The effect of the number of hidden units
   4.9 Applications

5 Recurrent Networks
   5.1 The generalised delta-rule in recurrent networks
      5.1.1 The Jordan network
      5.1.2 The Elman network
      5.1.3 Back-propagation in fully recurrent networks
   5.2 The Hopfield network
      5.2.1 Description
      5.2.2 Hopfield network as associative memory
      5.2.3 Neurons with graded response
   5.3 Boltzmann machines

6 Self-Organising Networks
   6.1 Competitive learning
      6.1.1 Clustering
      6.1.2 Vector quantisation
   6.2 Kohonen network
   6.3 Principal component networks
      6.3.1 Introduction
      6.3.2 Normalised Hebbian rule
      6.3.3 Principal component extractor
      6.3.4 More eigenvectors
   6.4 Adaptive resonance theory
      6.4.1 Background: Adaptive resonance theory
      6.4.2 ART1: The simplified neural network model
      6.4.3 ART1: The original model

7 Reinforcement learning
   7.1 The critic
   7.2 The controller network
   7.3 Barto's approach: the ASE-ACE combination
      7.3.1 Associative search
      7.3.2 Adaptive critic
      7.3.3 The cart-pole system
   7.4 Reinforcement learning versus optimal control

III APPLICATIONS

8 Robot Control
   8.1 End-effector positioning
      8.1.1 Camera-robot coordination is function approximation
   8.2 Robot arm dynamics
   8.3 Mobile robots
      8.3.1 Model based navigation
      8.3.2 Sensor based control

9 Vision
   9.1 Introduction
   9.2 Feed-forward types of networks
   9.3 Self-organising networks for image compression
      9.3.1 Back-propagation
      9.3.2 Linear networks
      9.3.3 Principal components as features
   9.4 The cognitron and neocognitron
      9.4.1 Description of the cells
      9.4.2 Structure of the cognitron
      9.4.3 Simulation results
   9.5 Relaxation types of networks
      9.5.1 Depth from stereo
      9.5.2 Image restoration and image segmentation
      9.5.3 Silicon retina

IV IMPLEMENTATIONS

10 General Purpose Hardware
   10.1 The Connection Machine
      10.1.1 Architecture
      10.1.2 Applicability to neural networks
   10.2 Systolic arrays

11 Dedicated Neuro-Hardware
   11.1 General issues
      11.1.1 Connectivity constraints
      11.1.2 Analogue vs. digital
      11.1.3 Optics
      11.1.4 Learning vs. non-learning
   11.2 Implementation examples
      11.2.1 Carver Mead's silicon retina
      11.2.2 LEP's LNeuro chip

References

Index


List of Figures

2.1 The basic components of an artificial neural network.
2.2 Various activation functions for a unit.
3.1 Single layer network with one output and two inputs.
3.2 Geometric representation of the discriminant function and the weights.
3.3 Discriminant function before and after weight update.
3.4 The Perceptron.
3.5 The Adaline.
3.6 Geometric representation of input space.
3.7 Solution of the XOR problem.
4.1 A multi-layer network with l layers of units.
4.2 The descent in weight space.
4.3 Example of function approximation with a feedforward network.
4.4 The periodic function f(x) = sin(2x) sin(x) approximated with sine activation functions.
4.5 The periodic function f(x) = sin(2x) sin(x) approximated with sigmoid activation functions.
4.6 Slow decrease with conjugate gradient in non-quadratic systems.
4.7 Effect of the learning set size on the generalization.
4.8 Effect of the learning set size on the error rate.
4.9 Effect of the number of hidden units on the network performance.
4.10 Effect of the number of hidden units on the error rate.
5.1 The Jordan network.
5.2 The Elman network.
5.3 Training an Elman network to control an object.
5.4 Training a feed-forward network to control an object.
5.5 The auto-associator network.
6.1 A simple competitive learning network.
6.2 Example of clustering in 3D with normalised vectors.
6.3 Determining the winner in a competitive learning network.
6.4 Competitive learning for clustering data.
6.5 Vector quantisation tracks input density.
6.6 A network combining a vector quantisation layer with a 1-layer feed-forward neural network. This network can be used to approximate functions from R^2 to R^2; the input space R^2 is discretised in 5 disjoint subspaces.
6.7 Gaussian neuron distance function.
6.8 A topology-conserving map converging.
6.9 The mapping of a two-dimensional input space on a one-dimensional Kohonen network.
6.10 Mexican hat.
6.11 Distribution of input samples.
6.12 The ART architecture.
6.13 The ART1 neural network.
6.14 An example ART run.
7.1 Reinforcement learning scheme.
7.2 Architecture of a reinforcement learning scheme with critic element.
7.3 The cart-pole system.
8.1 An exemplar robot manipulator.
8.2 Indirect learning system for robotics.
8.3 The system used for specialised learning.
8.4 A Kohonen network merging the output of two cameras.
8.5 The neural model proposed by Kawato et al.
8.6 The neural network used by Kawato et al.
8.7 The desired joint pattern for joint 1. Joints 2 and 3 have similar time patterns.
8.8 Schematic representation of the stored rooms, and the partial information which is available from a single sonar scan.
8.9 The structure of the network for the autonomous land vehicle.
9.1 Input image for the network.
9.2 Weights of the PCA network.
9.3 The basic structure of the cognitron.
9.4 Cognitron receptive regions.
9.5 Two learning iterations in the cognitron.
9.6 Feeding back activation values in the cognitron.
10.1 The Connection Machine system organisation.
10.2 Typical use of a systolic array.
10.3 The Warp system architecture.
11.1 Connections between M input and N output neurons.
11.2 Optical implementation of matrix multiplication.
11.3 The photo-receptor used by Mead.
11.4 The resistive layer (a) and, enlarged, a single node (b).
11.5 The LNeuro chip.


Preface

This manuscript attempts to provide the reader with an insight in artificial neural networks. Back in 1990, the absence of any state-of-the-art textbook forced us into writing our own. However, in the meantime a number of worthwhile textbooks have been published which can be used for background and in-depth information. We are aware of the fact that, at times, this manuscript may prove to be too thorough or not thorough enough for a complete understanding of the material; therefore, further reading material can be found in some excellent text books such as (Hertz, Krogh, & Palmer, 1991; Ritter, Martinetz, & Schulten, 1990; Kohonen, 1995; Anderson & Rosenfeld, 1988; DARPA, 1988; McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986).

Some of the material in this book, especially parts III and IV, contains timely material and may thus change considerably over time. The choice of describing robotics and vision as neural network applications coincides with the neural network research interests of the authors.

Much of the material presented in chapter 6 has been written by Joris van Dam and Anuj Dev at the University of Amsterdam. Also, Anuj contributed to material in chapter 9. The basis of chapter 7 was formed by a report of Gerard Schram at the University of Amsterdam. Furthermore, we express our gratitude to those people out there in Net-Land who gave us feedback on this manuscript, especially Michiel van der Korst and Nicolas Maudit who pointed out quite a few of our goof-ups. We owe them many kwartjes for their help.

The seventh edition is not drastically different from the sixth one; we corrected some typing errors, added some examples and deleted some obscure parts of the text. In the eighth edition, symbols used in the text have been globally changed. Also, the chapter on recurrent networks has been (albeit marginally) updated. The index still requires an update, though.

Amsterdam/Oberpfaffenhofen, November 1996

Patrick van der Smagt
Ben Kröse


Part I

FUNDAMENTALS


1 Introduction

A first wave of interest in neural networks (also known as `connectionist models' or `parallel distributed processing') emerged after the introduction of simplified neurons by McCulloch and Pitts in 1943 (McCulloch & Pitts, 1943). These neurons were presented as models of biological neurons and as conceptual components for circuits that could perform computational tasks.

When Minsky and Papert published their book Perceptrons in 1969 (Minsky & Papert, 1969) in which they showed the deficiencies of perceptron models, most neural network funding was redirected and researchers left the field. Only a few researchers continued their efforts, most notably Teuvo Kohonen, Stephen Grossberg, James Anderson, and Kunihiko Fukushima.

The interest in neural networks re-emerged only after some important theoretical results were attained in the early eighties (most notably the discovery of error back-propagation), and new hardware developments increased the processing capacities. This renewed interest is reflected in the number of scientists, the amounts of funding, the number of large conferences, and the number of journals associated with neural networks. Nowadays most universities have a neural networks group, within their psychology, physics, computer science, or biology departments.

Artificial neural networks can be most adequately characterised as `computational models' with particular properties such as the ability to adapt or learn, to generalise, or to cluster or organise data, and whose operation is based on parallel processing. However, many of the above-mentioned properties can be attributed to existing (non-neural) models; the intriguing question is to which extent the neural approach proves to be better suited for certain applications than existing models. To date, no unequivocal answer to this question has been found.

Often parallels with biological systems are described. However, there is still so little known (even at the lowest cell level) about biological systems, that the models we are using for our artificial neural systems seem to introduce an oversimplification of the `biological' models.

In this course we give an introduction to artificial neural networks. The point of view we take is that of a computer scientist. We are not concerned with the psychological implication of the networks, and we will at most occasionally refer to biological neural models. We consider neural networks as an alternative computational scheme rather than anything else.

These lecture notes start with a chapter in which a number of fundamental properties are discussed. In chapter 3 a number of `classical' approaches are described, as well as the discussion on their limitations which took place in the early sixties. Chapter 4 continues with the description of attempts to overcome these limitations and introduces the back-propagation learning algorithm. Chapter 5 discusses recurrent networks; in these networks, the restraint that there are no cycles in the network graph is removed. Self-organising networks, which require no external teacher, are discussed in chapter 6. Then, in chapter 7 reinforcement learning is introduced. Chapters 8 and 9 focus on applications of neural networks in the fields of robotics and image processing respectively. The final chapters discuss implementational aspects.


2 Fundamentals

The artificial neural networks which we describe in this course are all variations on the parallel distributed processing (PDP) idea. The architecture of each network is based on very similar building blocks which perform the processing. In this chapter we first discuss these processing units and discuss different network topologies. Learning strategies, as a basis for an adaptive system, will be presented in the last section.

2.1 A framework for distributed representation

An artificial network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections.

A set of major aspects of a parallel distributed model can be distinguished (cf. Rumelhart and McClelland, 1986 (McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986)):

- a set of processing units (`neurons', `cells');

- a state of activation y_k for every unit, which is equivalent to the output of the unit;

- connections between the units. Generally each connection is defined by a weight w_jk which determines the effect which the signal of unit j has on unit k;

- a propagation rule, which determines the effective input s_k of a unit from its external inputs;

- an activation function F_k, which determines the new level of activation based on the effective input s_k(t) and the current activation y_k(t) (i.e., the update);

- an external input (aka bias, offset) θ_k for each unit;

- a method for information gathering (the learning rule);

- an environment within which the system must operate, providing input signals and, if necessary, error signals.
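These components map naturally onto a small data structure; the following sketch (entirely our own naming, not from the text) collects them for a single unit:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Unit:
    """One processing unit of a PDP-style network (hypothetical layout)."""
    activation: float = 0.0                                   # state of activation y_k
    bias: float = 0.0                                         # external input theta_k
    weights: Dict[int, float] = field(default_factory=dict)   # w_jk per sending unit j
    F: Callable[[float, float], float] = lambda y, s: s       # activation function F_k

    def effective_input(self, activations: Dict[int, float]) -> float:
        # Propagation rule: weighted sum of the senders' activations plus bias.
        return sum(w * activations[j] for j, w in self.weights.items()) + self.bias
```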

Figure 2.1 illustrates these basics, some of which will be discussed in the next sections.

2.1.1 Processing units

Each unit performs a relatively simple job: receive input from neighbours or external sources and use this to compute an output signal which is propagated to other units. Apart from this processing, a second task is the adjustment of the weights. The system is inherently parallel in the sense that many units can carry out their computations at the same time.

Within neural systems it is useful to distinguish three types of units: input units (indicated by an index i) which receive data from outside the neural network, output units (indicated by an index o) which send data out of the neural network, and hidden units (indicated by an index h) whose input and output signals remain within the neural network.

[Figure 2.1: The basic components of an artificial neural network. The propagation rule used here is the `standard' weighted summation.]

During operation, units can be updated either synchronously or asynchronously. With synchronous updating, all units update their activation simultaneously; with asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time t, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages.
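To make the distinction concrete, here is a small sketch (not from the text; the weight matrix, biases, and threshold activation are invented for illustration) contrasting the two update schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.normal(size=(n, n))          # hypothetical weights w_jk (row j, column k)
theta = rng.normal(size=n)           # bias terms theta_k
y = rng.choice([-1.0, 1.0], size=n)  # current activations y_k

def F(s):
    """Hard threshold activation."""
    return np.where(s > 0, 1.0, -1.0)

# Synchronous updating: every unit computes its new activation from the
# *old* activation vector, and all units change at once.
y_sync = F(W.T @ y + theta)

# Asynchronous updating: one randomly chosen unit recomputes its activation
# at a time, immediately seeing the values already updated.
y_async = y.copy()
for _ in range(20):
    k = rng.integers(n)
    y_async[k] = F(W[:, k] @ y_async + theta[k])
```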

2.1.2 Connections between units

In most cases we assume that each unit provides an additive contribution to the input of the unit with which it is connected. The total input to unit k is simply the weighted sum of the separate outputs from each of the connected units plus a bias or offset term θ_k:

\[ s_k(t) = \sum_j w_{jk}(t)\, y_j(t) + \theta_k(t). \tag{2.1} \]

The contribution for positive w_jk is considered as an excitation and for negative w_jk as inhibition. In some cases more complex rules for combining inputs are used, in which a distinction is made between excitatory and inhibitory inputs. We call units with a propagation rule (2.1) sigma units.

A different propagation rule, introduced by Feldman and Ballard (Feldman & Ballard, 1982), is known as the propagation rule for the sigma-pi unit:

\[ s_k(t) = \sum_j w_{jk}(t) \prod_m y_{j_m}(t) + \theta_k(t). \tag{2.2} \]

Often, the y_{j_m} are weighted before multiplication. Although these units are not frequently used, they have their value for gating of input, as well as implementation of lookup tables (Mel, 1990).
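A minimal numeric sketch of the two propagation rules, with invented activations, weights, and index sets (only equations (2.1) and (2.2) come from the text):

```python
import numpy as np

y = np.array([0.5, -1.0, 2.0])   # activations y_j of the sending units
w = np.array([0.1, 0.4, -0.3])   # weights w_jk into unit k
theta_k = 0.2                    # bias (offset) term theta_k

# Sigma unit, eq. (2.1): effective input is the weighted sum plus bias.
s_k = w @ y + theta_k            # 0.1*0.5 + 0.4*(-1.0) - 0.3*2.0 + 0.2 = -0.75

# Sigma-pi unit, eq. (2.2): each weight multiplies a *product* of unit
# activations; here each connection j gates a hypothetical pair of inputs.
groups = [(0, 1), (1, 2), (0, 2)]   # index sets {j_m} for each connection
s_k_pi = sum(w[j] * np.prod(y[list(g)]) for j, g in enumerate(groups)) + theta_k
```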

2.1.3 Activation and output rules

We also need a rule which gives the effect of the total input on the activation of the unit. We need a function F_k which takes the total input s_k(t) and the current activation y_k(t) and produces a new value of the activation of the unit k:

\[ y_k(t+1) = F_k\bigl(y_k(t),\, s_k(t)\bigr). \tag{2.3} \]

Often, the activation function is a nondecreasing function of the total input of the unit:

\[ y_k(t+1) = F_k\bigl(s_k(t)\bigr) = F_k\Bigl(\sum_j w_{jk}(t)\, y_j(t) + \theta_k(t)\Bigr), \tag{2.4} \]

although activation functions are not restricted to nondecreasing functions. Generally, some sort of threshold function is used: a hard limiting threshold function (a sgn function), or a linear or semi-linear function, or a smoothly limiting threshold (see figure 2.2). For this smoothly limiting function often a sigmoid (S-shaped) function like

\[ y_k = F(s_k) = \frac{1}{1 + e^{-s_k}} \tag{2.5} \]

is used. In some applications a hyperbolic tangent is used, yielding output values in the range [-1, +1].

[Figure 2.2: Various activation functions for a unit: sgn, semi-linear, and sigmoid.]

In some cases, the output of a unit can be a stochastic function of the total input of the unit. In that case the activation is not deterministically determined by the neuron input, but the neuron input determines the probability p that a neuron gets a high activation value:

\[ p(y_k \leftarrow 1) = \frac{1}{1 + e^{-s_k/T}}, \tag{2.6} \]

in which T (cf. temperature) is a parameter which determines the slope of the probability function. This type of unit will be discussed more extensively in chapter 5.

In all networks we describe we consider the output of a neuron to be identical to its activation level.
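The activation rules above translate directly into code; the following sketch (input values invented) implements the three deterministic functions of figure 2.2, taking a clipped ramp as one plausible reading of "semi-linear", plus the stochastic rule of equation (2.6):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgn(s):
    return np.where(s > 0, 1.0, -1.0)    # hard limiting threshold

def semi_linear(s):
    return np.clip(s, 0.0, 1.0)          # linear between two saturation points

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))      # smoothly limiting threshold, eq. (2.5)

def stochastic(s, T=1.0):
    # Eq. (2.6): the input sets the *probability* of a high activation.
    p = 1.0 / (1.0 + np.exp(-s / T))
    return (rng.random(size=np.shape(s)) < p).astype(float)

s = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # example total inputs s_k
print(sgn(s), semi_linear(s), sigmoid(s), stochastic(s), sep="\n")
```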

2.2 Network topologies

In the previous section we discussed the properties of the basic processing unit in an artificial neural network. This section focuses on the pattern of connections between the units and the propagation of data.

As for this pattern of connections, the main distinction we can make is between:

- Feed-forward networks, where the data flow from input to output units is strictly feed-forward. The data processing can extend over multiple (layers of) units, but no feedback connections are present, that is, connections extending from outputs of units to inputs of units in the same layer or previous layers.

- Recurrent networks that do contain feedback connections. Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the network will evolve to a stable state in which these activations do not change anymore. In other applications, the change of the activation values of the output neurons is significant, such that the dynamical behaviour constitutes the output of the network (Pearlmutter, 1990).


Classical examples of feed-forward networks are the Perceptron and Adaline, which will be discussed in the next chapter. Examples of recurrent networks have been presented by Anderson (Anderson, 1977), Kohonen (Kohonen, 1977), and Hopfield (Hopfield, 1982) and will be discussed in chapter 5.

2.3 Training of artificial neural networks

A neural network has to be configured such that the application of a set of inputs produces (either `direct' or via a relaxation process) the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to `train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule.

2.3.1 Paradigms of learning

We can categorise the learning situations in two distinct sorts. These are:

- Supervised learning or Associative learning, in which the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the network (self-supervised).

- Unsupervised learning or Self-organisation, in which an (output) unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather the system must develop its own representation of the input stimuli.

2.3.2 Modifying patterns of connectivity

Both learning paradigms discussed above result in an adjustment of the weights of the connections between units, according to some modification rule. Virtually all learning rules for models of this type can be considered as a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behaviour (1949) (Hebb, 1949). The basic idea is that if two units j and k are active simultaneously, their interconnection must be strengthened. If j receives input from k, the simplest version of Hebbian learning prescribes to modify the weight w_jk with

\[ \Delta w_{jk} = \gamma\, y_j\, y_k, \tag{2.7} \]

where γ is a positive constant of proportionality representing the learning rate. Another common rule uses not the actual activation of unit k but the difference between the actual and desired activation for adjusting the weights:

\[ \Delta w_{jk} = \gamma\, y_j (d_k - y_k), \tag{2.8} \]

in which d_k is the desired activation provided by a teacher. This is often called the Widrow-Hoff rule or the delta rule, and will be discussed in the next chapter.

Many variants (often very exotic ones) have been published over the last few years. In the next chapters some of these update rules will be discussed.
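Read as code, equations (2.7) and (2.8) amount to one-line weight updates; the numbers below are invented for illustration only:

```python
gamma = 0.1            # learning rate (gamma)
y_j, y_k = 0.8, 0.6    # activations of the two connected units
d_k = 1.0              # desired activation supplied by a teacher

# Hebbian rule, eq. (2.7): strengthen the connection when both units are active.
delta_w_hebb = gamma * y_j * y_k           # 0.048

# Delta (Widrow-Hoff) rule, eq. (2.8): move the weight in proportion to the
# error between desired and actual activation of the receiving unit k.
delta_w_delta = gamma * y_j * (d_k - y_k)  # 0.032
```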

2.4 Notation and terminology

Throughout the years researchers from different disciplines have come up with a vast number of terms applicable in the field of neural networks. Our computer scientist point-of-view enables us to adhere to a subset of the terminology which is less biologically inspired, yet still conflicts arise. Our conventions are discussed below.


2.4.1 Notation

We use the following notation in our formulae. Note that not all symbols are meaningful for all networks, and that in some cases subscripts or superscripts may be left out (e.g., p is often not necessary) or added (e.g., vectors can, contrariwise to the notation below, have indices) where necessary. Vectors are indicated with a bold non-slanted font:

j, k, ...: the units j, k, ...;

i: an input unit;

h: a hidden unit;

o: an output unit;

x^p: the pth input pattern vector;

x^p_j: the jth element of the pth input pattern vector;

s^p: the input to a set of neurons when input pattern vector p is clamped (i.e., presented to the network); often: the input of the network by clamping input pattern vector p;

d^p: the desired output of the network when input pattern vector p was input to the network;

d^p_j: the jth element of the desired output of the network when input pattern vector p was input to the network;

y^p: the activation values of the network when input pattern vector p was input to the network;

y^p_j: the activation values of element j of the network when input pattern vector p was input to the network;

W: the matrix of connection weights;

w_j: the weights of the connections which feed into unit j;

w_jk: the weight of the connection from unit j to unit k;

F_j: the activation function associated with unit j;

γ_jk: the learning rate associated with weight w_jk;

θ: the biases to the units;

θ_j: the bias input to unit j;

U_j: the threshold of unit j in F_j;

E^p: the error in the output of the network when input pattern vector p is input;

E: the energy of the network.

2.4.2 Terminology

Output vs. activation of a unit. Since there is no need to do otherwise, we consider the output and the activation value of a unit to be one and the same thing. That is, the output of each neuron equals its activation value.


Bias, offset, threshold. These terms all refer to a constant (i.e., independent of the network input but adapted by the learning rule) term which is input to a unit. They may be used interchangeably, although the latter two terms are often envisaged as a property of the activation function. Furthermore, this external input is usually implemented (and can be written) as a weight from a unit with activation value 1.

Number of layers. In a feed-forward network, the inputs perform no computation and their layer is therefore not counted. Thus a network with one input layer, one hidden layer, and one output layer is referred to as a network with two layers. This convention is widely though not yet universally used.

Representation vs. learning. When using a neural network one has to distinguish two issues which influence the performance of the system. The first one is the representational power of the network, the second one is the learning algorithm.

The representational power of a neural network refers to the ability of a neural network to represent a desired function. Because a neural network is built from a set of standard functions, in most cases the network will only approximate the desired function, and even for an optimal set of weights the approximation error is not zero.

The second issue is the learning algorithm. Given that there exists a set of optimal weights in the network, is there a procedure to (iteratively) find this set of weights?


Part II

THEORY


3 Perceptron and Adaline

This chapter describes single layer neural networks, including some of the classical approaches to the neural computing and learning problem. In the first part of this chapter we discuss the representational power of the single layer networks and their learning algorithms and will give some examples of using the networks. In the second part we will discuss the representational limitations of single layer networks.

Two `classical' models will be described in the first part of the chapter: the Perceptron, proposed by Rosenblatt (Rosenblatt, 1959) in the late 50's, and the Adaline, presented in the early 60's by Widrow and Hoff (Widrow & Hoff, 1960).

3.1 Networks with threshold activation functions

A single layer feed-forward network consists of one or more output neurons o, each of which isconnected with a weighting factor wio to all of the inputs i. In the simplest case the networkhas only two inputs and a single output, as sketched in gure 3.1 (we leave the output index oout). The input of the neuron is the weighted sum of the inputs plus the bias term. The output

w1

w2

y

+1

x1

x2

Figure 3.1: Single layer network with one output and two inputs.

of the network is formed by the activation of the output neuron, which is some function of theinput:

y = F

2Xi=1

wixi +

!; (3.1)

The activation function F can be linear so that we have a linear network, or nonlinear. In thissection we consider the threshold (or Heaviside or sgn) function:

F(s) =1 if s > 01 otherwise.

(3.2)

The output of the network thus is either +1 or 1, depending on the input. The networkcan now be used for a classication task: it can decide whether an input pattern belongs toone of two classes. If the total input is positive, the pattern will be assigned to class +1, if the


The separation between the two classes in this case is a straight line, given by the equation:

w_1 x_1 + w_2 x_2 + θ = 0.    (3.3)

The single layer network represents a linear discriminant function. A geometrical representation of the linear threshold neural network is given in figure 3.2.

Equation (3.3) can be written as

x_2 = −(w_1 / w_2) x_1 − θ / w_2,    (3.4)

and we see that the weights determine the slope of the line and the bias determines the `offset', i.e., how far the line is from the origin. Note that also the weights can be plotted in the input space: the weight vector is always perpendicular to the discriminant function.

Figure 3.2: Geometric representation of the discriminant function and the weights.

Now that we have shown the representational power of the single layer network with linear threshold units, we come to the second issue: how do we learn the weights and biases in the network? We will describe two learning methods for these types of networks: the `perceptron' learning rule and the `delta' or `LMS' rule. Both methods are iterative procedures that adjust the weights. A learning sample is presented to the network. For each weight the new value is computed by adding a correction to the old value. The threshold is updated in the same way:

w_i(t+1) = w_i(t) + Δw_i(t),    (3.5)
θ(t+1) = θ(t) + Δθ(t).    (3.6)

The learning problem can now be formulated as: how do we compute Δw_i(t) and Δθ(t) in order to classify the learning patterns correctly?

3.2 Perceptron learning rule and convergence theorem

Suppose we have a set of learning samples consisting of an input vector x and a desired output d(x). For a classification task the d(x) is usually +1 or −1. The perceptron learning rule is very simple and can be stated as follows:

1. Start with random weights for the connections;

2. Select an input vector x from the set of training samples;

3. If y ≠ d(x) (the perceptron gives an incorrect response), modify all connections w_i according to: Δw_i = d(x) x_i;


4. Go back to 2.

Note that the procedure is very similar to the Hebb rule; the only difference is that, when the network responds correctly, no connection weights are modified. Besides modifying the weights, we must also modify the threshold θ. This is considered as a connection w_0 between the output neuron and a `dummy' predicate unit which is always on: x_0 = 1. Given the perceptron learning rule as stated above, this threshold is modified according to:

Δθ = { 0      if the perceptron responds correctly;
     { d(x)   otherwise.    (3.7)

3.2.1 Example of the Perceptron learning rule

A perceptron is initialized with the following weights: w_1 = 1, w_2 = 2, θ = −2. The perceptron learning rule is used to learn a correct discriminant function for a number of samples, sketched in figure 3.3. The first sample A, with values x = (0.5, 1.5) and target value d(x) = +1, is presented to the network. From eq. (3.1) it can be calculated that the network output is +1, so no weights are adjusted. The same is the case for point B, with values x = (−0.5, 0.5) and target value d(x) = −1; the network output is negative, so no change. When presenting point C with values x = (0.5, 0.5) the network output will be −1, while the target value d(x) = +1. According to the perceptron learning rule, the weight changes are: Δw_1 = 0.5, Δw_2 = 0.5, Δθ = 1. The new weights are now: w_1 = 1.5, w_2 = 2.5, θ = −1, and sample C is classified correctly.

In figure 3.3 the discriminant function before and after this weight update is shown.

Figure 3.3: Discriminant function before and after weight update.
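To make the rule concrete, the following is a minimal sketch of the perceptron learning rule in Python; the function name, loop structure, and stopping criterion are our own, and the three samples are those of the example in section 3.2.1.

import numpy as np

def train_perceptron(samples, w, theta, max_epochs=100):
    # Perceptron learning rule: adjust w and theta only on incorrect responses.
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in samples:
            y = 1 if np.dot(w, x) + theta > 0 else -1  # threshold output, eq. (3.2)
            if y != d:                                 # step 3: incorrect response
                w = w + d * x                          # weight change d(x) x_i
                theta = theta + d                      # threshold change, eq. (3.7)
                mistakes += 1
        if mistakes == 0:                              # every sample classified correctly
            break
    return w, theta

# Samples A, B, and C from the example of section 3.2.1:
samples = [(np.array([0.5, 1.5]), +1),
           (np.array([-0.5, 0.5]), -1),
           (np.array([0.5, 0.5]), +1)]
w, theta = train_perceptron(samples, np.array([1.0, 2.0]), -2.0)
print(w, theta)  # [1.5 2.5] -1.0, as computed in the example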

3.2.2 Convergence theorem

For the perceptron learning rule there exists a convergence theorem, which states the following:

Theorem 1 If there exists a set of connection weights w* which is able to perform the transformation y = d(x), the perceptron learning rule will converge to some solution (which may or may not be the same as w*) in a finite number of steps for any initial choice of the weights.

Proof Given the fact that the length of the vector w* does not play a role (because of the sgn operation), we take ‖w*‖ = 1. Because w* is a correct solution, the value |w* · x|, where · denotes dot or inner product, will be greater than 0, or: there exists a δ > 0 such that |w* · x| > δ for all inputs x¹. Now define cos α ≡ (w · w*) / ‖w‖.

¹Technically this need not be true for any w*; w* · x could in fact be equal to 0 for a w* which yields no misclassifications (look at the definition of F). However, another w* can be found for which the quantity will not be 0. (Thanks to: Terry Regier, Computer Science, UC Berkeley)


When, according to the perceptron learning rule, connection weights are modified at a given input x, we know that Δw = d(x) x, and the weight after modification is w′ = w + Δw. From this it follows that:

w′ · w* = w · w* + d(x) · w* · x
        = w · w* + sgn(w* · x) · w* · x
        > w · w* + δ

‖w′‖² = ‖w + d(x) x‖²
      = w² + 2 d(x) w · x + x²
      < w² + x²    (because d(x) = −sgn[w · x] !!)
      = w² + M.

After t modifications we have:

w(t) · w* > w · w* + t δ,
‖w(t)‖² < w² + t M,

such that

cos α(t) = (w* · w(t)) / ‖w(t)‖ > (w · w* + t δ) / √(w² + t M).

From this follows that lim_{t→∞} cos α(t) = lim_{t→∞} (δ/√M) √t = ∞, while by definition cos α ≤ 1!

The conclusion is that there must be an upper limit t_max for t. The system modifies its connections only a limited number of times. In other words: after maximally t_max modifications of the weights the perceptron is correctly performing the mapping. t_max will be reached when cos α = 1. If we start with connections w = 0,

t_max = M / δ².    (3.8)

3.2.3 The original Perceptron

The Perceptron, proposed by Rosenblatt (Rosenblatt, 1959), is somewhat more complex than a single layer network with threshold activation functions. In its simplest form it consists of an N-element input layer (`retina') which feeds into a layer of M `association,' `mask,' or `predicate' units h, and a single output unit. The goal of the operation of the perceptron is to learn a given transformation d : {−1, 1}^N → {−1, 1} using learning samples with input x and corresponding output y = d(x). In the original definition, the activity of the predicate units can be any function h of the input layer x, but the learning procedure only adjusts the connections to the output unit. The reason for this is that no recipe had been found to adjust the connections between x and h. Depending on the functions h, perceptrons can be grouped into different families. In (Minsky & Papert, 1969) a number of these families are described and their properties have been analysed. The output unit of a perceptron is a linear threshold element. Rosenblatt (1959) proved the remarkable theorem about perceptron learning, and in the early 60s perceptrons created a great deal of interest and optimism. The initial euphoria was replaced by disillusion after the publication of Minsky and Papert's Perceptrons in 1969 (Minsky & Papert, 1969). In this book they analysed the perceptron thoroughly and proved that there are severe restrictions on what perceptrons can represent.


Figure 3.4: The Perceptron.

3.3 The adaptive linear element (Adaline)

An important generalisation of the perceptron training algorithm was presented by Widrow and Hoff as the `least mean square' (LMS) learning procedure, also known as the delta rule. The main functional difference with the perceptron training rule is the way the output of the system is used in the learning rule. The perceptron learning rule uses the output of the threshold function (either −1 or +1) for learning. The delta-rule uses the net output without further mapping into output values −1 or +1.

The learning rule was applied to the `adaptive linear element,' also named Adaline², developed by Widrow and Hoff (Widrow & Hoff, 1960). In a simple physical implementation (fig. 3.5) this device consists of a set of controllable resistors connected to a circuit which can sum up currents caused by the input voltage signals. Usually the central block, the summer, is also followed by a quantiser which outputs either +1 or −1, depending on the polarity of the sum.

Figure 3.5: The Adaline.

Although the adaptive process is here exemplified in a case when there is only one output, it may be clear that a system with many parallel outputs is directly implementable by multiple units of the above kind.

If the input conductances are denoted by w_i, i = 0, 1, …, n, and the input and output signals by x_i and y, respectively, then the output of the central block is defined to be

y = Σ_{i=1}^{n} w_i x_i + θ,    (3.9)

where θ ≡ w_0. The purpose of this device is to yield a given value y = d^p at its output when the set of values x_i^p, i = 1, 2, …, n, is applied at the inputs. The problem is to determine the coefficients w_i, i = 0, 1, …, n, in such a way that the input-output response is correct for a large number of arbitrarily chosen signal sets. If an exact mapping is not possible, the average error must be minimised, for instance, in the sense of least squares. An adaptive operation means that there exists a mechanism by which the w_i can be adjusted, usually iteratively, to attain the correct values. For the Adaline, Widrow introduced the delta rule to adjust the weights. This rule will be discussed in section 3.4.

²ADALINE first stood for ADAptive LInear NEuron, but when artificial neurons became less and less popular this acronym was changed to ADAptive LINear Element.

3.4 Networks with linear activation functions: the delta rule

For a single layer network with an output unit with a linear activation function the output is simply given by

y = Σ_j w_j x_j + θ.    (3.10)

Such a simple network is able to represent a linear relationship between the value of the output unit and the value of the input units. By thresholding the output value, a classifier can be constructed (such as Widrow's Adaline), but here we focus on the linear relationship and use the network for a function approximation task. In high dimensional input spaces the network represents a (hyper)plane and it will be clear that also multiple output units may be defined.

Suppose we want to train the network such that a hyperplane is fitted as well as possible to a set of training samples consisting of input values x^p and desired (or target) output values d^p. For every given input sample, the output of the network differs from the target value d^p by (d^p − y^p), where y^p is the actual output for this pattern. The delta-rule now uses a cost- or error-function based on these differences to adjust the weights.

The error function, as indicated by the name least mean square, is the summed squared error. That is, the total error E is defined to be

E = Σ_p E^p = ½ Σ_p (d^p − y^p)²,    (3.11)

where the index p ranges over the set of input patterns and E^p represents the error on pattern p. The LMS procedure finds the values of all the weights that minimise the error function by a method called gradient descent. The idea is to make a change in the weight proportional to the negative of the derivative of the error as measured on the current pattern with respect to each weight:

Δ_p w_j = −γ ∂E^p/∂w_j,    (3.12)

where γ is a constant of proportionality. The derivative is

∂E^p/∂w_j = (∂E^p/∂y^p) (∂y^p/∂w_j).    (3.13)

Because of the linear units (eq. (3.10)),

∂y^p/∂w_j = x_j    (3.14)


and

∂E^p/∂y^p = −(d^p − y^p),    (3.15)

such that

Δ_p w_j = γ δ^p x_j,    (3.16)

where δ^p = d^p − y^p is the difference between the target output and the actual output for pattern p.

The delta rule modifies weights appropriately for target and actual outputs of either polarity and for both continuous and binary input and output units. These characteristics have opened up a wealth of new applications.
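As an illustration, here is a minimal sketch of the delta rule in Python for a single linear unit; the training data, learning rate, and number of epochs are invented for the example and are not taken from the text.

import numpy as np

def train_lms(X, d, gamma=0.1, epochs=200):
    # Delta rule (LMS): per-pattern gradient descent on the squared error.
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = np.dot(w, x) + theta  # linear output, eq. (3.10)
            delta = target - y        # delta^p = d^p - y^p
            w += gamma * delta * x    # eq. (3.16)
            theta += gamma * delta    # threshold as a weight from a unit with x_0 = 1
    return w, theta

# Hypothetical data generated by the linear rule d = 2 x_1 - x_2 + 0.5:
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2))
d = 2 * X[:, 0] - X[:, 1] + 0.5
w, theta = train_lms(X, d)
print(w, theta)  # approaches [2, -1] and 0.5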

3.5 Exclusive-OR problem

In the previous sections we have discussed two learning algorithms for single layer networks, but we have not discussed the limitations on the representation of these networks.

x_0   x_1   d
−1    −1   −1
−1     1    1
 1    −1    1
 1     1   −1

Table 3.1: Exclusive-or truth table.

One of Minsky and Papert's most discouraging results shows that a single layer perceptron cannot represent a simple exclusive-or function. Table 3.1 shows the desired relationships between inputs and output units for this function.

In a simple network with two inputs and one output, as depicted in figure 3.1, the net input is equal to:

s = w_1 x_1 + w_2 x_2 + θ.    (3.17)

According to eq. (3.1), the output of the perceptron is zero when s is negative and equal to one when s is positive. In figure 3.6 a geometrical representation of the input domain is given. For a constant θ, the output of the perceptron is equal to one on one side of the dividing line which is defined by:

w_1 x_1 + w_2 x_2 = −θ,    (3.18)

and equal to zero on the other side of this line.

Figure 3.6: Geometric representation of input space. (The three panels show the AND, OR, and XOR problems in the (x_1, x_2) plane.)


To see that such a solution cannot be found, take a look at figure 3.6. The input space consists of four points, and the two solid circles at (1, −1) and (−1, 1) cannot be separated by a straight line from the two open circles at (−1, −1) and (1, 1). The obvious question to ask is: how can this problem be overcome? Minsky and Papert prove in their book that for binary inputs, any transformation can be carried out by adding a layer of predicates which are connected to all inputs. The proof is given in the next section.

For the specific XOR problem we geometrically show that by introducing hidden units, thereby extending the network to a multi-layer perceptron, the problem can be solved. Fig. 3.7a demonstrates that the four input points are now embedded in a three-dimensional space defined by the two inputs plus the single hidden unit.

Figure 3.7: Solution of the XOR problem. a) The perceptron of fig. 3.1 with an extra hidden unit. With the indicated values of the weights w_ij (next to the connecting lines) and the thresholds θ_i (in the circles) this perceptron solves the XOR problem. b) This is accomplished by mapping the four points of figure 3.6 onto the four points indicated here; clearly, separation (by a linear manifold) into the required groups is now possible.

These four points are now easily separated by a linear manifold (plane) into two groups, as desired. This simple example demonstrates that adding hidden units increases the class of problems that are soluble by feed-forward, perceptron-like networks. However, by this generalisation of the basic architecture we have also incurred a serious loss: we no longer have a learning rule to determine the optimal weights!
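The weight values in figure 3.7a are only partly legible in this copy, so the following Python check uses one consistent assignment that solves the problem: the hidden unit receives weights (1, 1) from the inputs with threshold −0.5, and the output unit receives weights (1, 1) from the inputs and −2 from the hidden unit, also with threshold −0.5. Treat these numbers as an assumption rather than a transcription of the figure.

def sgn(s):
    return 1 if s > 0 else -1

def xor_net(x1, x2):
    # Assumed weights as described above: one hidden (predicate) unit suffices.
    h = sgn(x1 + x2 - 0.5)             # hidden unit
    return sgn(x1 + x2 - 2 * h - 0.5)  # output unit

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, '->', xor_net(x1, x2))
# prints -1 for equal inputs and +1 for unequal inputs: the XOR of table 3.1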

3.6 Multi-layer perceptrons can do everything

In the previous section we showed that by adding an extra hidden unit, the XOR problem can be solved. For binary units, one can prove that this architecture is able to perform any transformation given the correct connections and weights. The most primitive proof is the following. For a given transformation y = d(x), we can divide the set of all possible input vectors into two classes:

X⁺ = { x | d(x) = 1 }   and   X⁻ = { x | d(x) = −1 }.    (3.19)

Since there are N input units, the total number of possible input vectors x is 2^N. For every x^p ∈ X⁺ a hidden unit h can be reserved of which the activation y_h is 1 if and only if the specific pattern p is present at the input: we can choose its weights w_ih equal to the specific pattern x^p and the bias θ_h equal to 1 − N such that

y_h^p = sgn( Σ_i w_ih x_i^p − N + ½ )    (3.20)


is equal to 1 for x^p = w_h only. Similarly, the weights to the output neuron can be chosen such that the output is one as soon as one of the M predicate neurons is one:

y_o^p = sgn( Σ_{h=1}^{M} y_h + M − ½ ).    (3.21)

This perceptron will give y_o = 1 only if x ∈ X⁺: it performs the desired mapping. The problem is the large number of predicate units, which is equal to the number of patterns in X⁺, which is maximally 2^N. Of course we can do the same trick for X⁻, and we will always take the minimal number of mask units, which is maximally 2^{N−1}. A more elegant proof is given in (Minsky & Papert, 1969), but the point is that for complex transformations the number of required units in the hidden layer is exponential in N.

3.7 Conclusions

In this chapter we presented single layer feedforward networks for classification tasks and for function approximation tasks. The representational power of single layer feedforward networks was discussed and two learning algorithms for finding the optimal weights were presented. The simple networks presented here have their advantages and disadvantages. The disadvantage is the limited representational power: only linear classifiers can be constructed or, in case of function approximation, only linear functions can be represented. The advantage, however, is that because of the linearity of the system, the training algorithm will converge to the optimal solution. This is not the case anymore for nonlinear systems such as multiple layer networks, as we will see in the next chapter.


4 Back-Propagation

As we have seen in the previous chapter, a single-layer network has severe restrictions: the class of tasks that can be accomplished is very limited. In this chapter we will focus on feed-forward networks with layers of processing units.

Minsky and Papert (Minsky & Papert, 1969) showed in 1969 that a two layer feed-forward network can overcome many restrictions, but did not present a solution to the problem of how to adjust the weights from input to hidden units. An answer to this question was presented by Rumelhart, Hinton and Williams in 1986 (Rumelhart, Hinton, & Williams, 1986), and similar solutions appeared to have been published earlier (Werbos, 1974; Parker, 1985; Cun, 1985).

The central idea behind this solution is that the errors for the units of the hidden layer are determined by back-propagating the errors of the units of the output layer. For this reason the method is often called the back-propagation learning rule. Back-propagation can also be considered as a generalisation of the delta rule for non-linear activation functions¹ and multi-layer networks.

4.1 Multi-layer feed-forward networks

A feed-forward network has a layered structure. Each layer consists of units which receive their input from units from a layer directly below and send their output to units in a layer directly above the unit. There are no connections within a layer. The N_i inputs are fed into the first layer of N_{h,1} hidden units. The input units are merely `fan-out' units; no processing takes place in these units. The activation of a hidden unit is a function F_i of the weighted inputs plus a bias, as given in eq. (2.4). The output of the hidden units is distributed over the next layer of N_{h,2} hidden units, until the last layer of hidden units, of which the outputs are fed into a layer of N_o output units (see figure 4.1).

Although back-propagation can be applied to networks with any number of layers, just as for networks with binary units (section 3.6) it has been shown (Hornik, Stinchcombe, & White, 1989; Funahashi, 1989; Cybenko, 1989; Hartman, Keeler, & Kowalski, 1990) that only one layer of hidden units suffices to approximate any function with finitely many discontinuities to arbitrary precision, provided the activation functions of the hidden units are non-linear (the universal approximation theorem). In most applications a feed-forward network with a single layer of hidden units is used with a sigmoid activation function for the units.

4.2 The generalised delta rule

Since we are now using units with nonlinear activation functions, we have to generalise the delta rule, which was presented in chapter 3 for linear functions, to the set of non-linear activation functions.

¹Of course, when linear activation functions are used, a multi-layer network is not more powerful than a single-layer network.


Figure 4.1: A multi-layer network with l layers of units.

The activation is a differentiable function of the total input, given by

y_k^p = F(s_k^p),    (4.1)

in which

s_k^p = Σ_j w_jk y_j^p + θ_k.    (4.2)

To get the correct generalisation of the delta rule as presented in the previous chapter, we must set

Δ_p w_jk = −γ ∂E^p/∂w_jk.    (4.3)

The error measure E^p is defined as the total quadratic error for pattern p at the output units:

E^p = ½ Σ_{o=1}^{N_o} (d_o^p − y_o^p)²,    (4.4)

where d_o^p is the desired output for unit o when pattern p is clamped. We further set E = Σ_p E^p as the summed squared error. We can write

∂E^p/∂w_jk = (∂E^p/∂s_k^p) (∂s_k^p/∂w_jk).    (4.5)

By equation (4.2) we see that the second factor is

∂s_k^p/∂w_jk = y_j^p.    (4.6)

When we define

δ_k^p = −∂E^p/∂s_k^p,    (4.7)

we will get an update rule which is equivalent to the delta rule as described in the previous chapter, resulting in a gradient descent on the error surface if we make the weight changes according to:

Δ_p w_jk = γ δ_k^p y_j^p.    (4.8)

The trick is to figure out what δ_k^p should be for each unit k in the network. The interesting result, which we now derive, is that there is a simple recursive computation of these δ's which can be implemented by propagating error signals backward through the network.


To compute δ_k^p we apply the chain rule to write this partial derivative as the product of two factors, one factor reflecting the change in error as a function of the output of the unit and one reflecting the change in the output as a function of changes in the input. Thus, we have

δ_k^p = −∂E^p/∂s_k^p = −(∂E^p/∂y_k^p) (∂y_k^p/∂s_k^p).    (4.9)

Let us compute the second factor. By equation (4.1) we see that

∂y_k^p/∂s_k^p = F′(s_k^p),    (4.10)

which is simply the derivative of the squashing function F for the kth unit, evaluated at the net input s_k^p to that unit. To compute the first factor of equation (4.9), we consider two cases. First, assume that unit k is an output unit k = o of the network. In this case, it follows from the definition of E^p that

∂E^p/∂y_o^p = −(d_o^p − y_o^p),    (4.11)

which is the same result as we obtained with the standard delta rule. Substituting this and equation (4.10) in equation (4.9), we get

δ_o^p = (d_o^p − y_o^p) F′_o(s_o^p)    (4.12)

for any output unit o. Secondly, if k is not an output unit but a hidden unit k = h, we do not readily know the contribution of the unit to the output error of the network. However, the error measure can be written as a function of the net inputs from hidden to output layer, E^p = E^p(s_1^p, s_2^p, …, s_j^p, …), and we use the chain rule to write

∂E^p/∂y_h^p = Σ_{o=1}^{N_o} (∂E^p/∂s_o^p) (∂s_o^p/∂y_h^p)
            = Σ_{o=1}^{N_o} (∂E^p/∂s_o^p) (∂/∂y_h^p) Σ_{j=1}^{N_h} w_jo y_j^p
            = Σ_{o=1}^{N_o} (∂E^p/∂s_o^p) w_ho
            = −Σ_{o=1}^{N_o} δ_o^p w_ho.    (4.13)

Substituting this in equation (4.9) yields

δ_h^p = F′(s_h^p) Σ_{o=1}^{N_o} δ_o^p w_ho.    (4.14)

Equations (4.12) and (4.14) give a recursive procedure for computing the δ's for all units in the network, which are then used to compute the weight changes according to equation (4.8). This procedure constitutes the generalised delta rule for a feed-forward network of non-linear units.

4.2.1 Understanding back-propagation

The equations derived in the previous section may be mathematically correct, but what do they actually mean? Is there a way of understanding back-propagation other than reciting the necessary equations?

The answer is, of course, yes. In fact, the whole back-propagation process is intuitively very clear. What happens in the above equations is the following. When a learning pattern is clamped, the activation values are propagated to the output units, and the actual network output is compared with the desired output values, we usually end up with an error in each of the output units. Let's call this error e_o for a particular output unit o. We have to bring e_o to zero.


The simplest method to do this is the greedy method: we strive to change the connections in the neural network in such a way that, next time around, the error e_o will be zero for this particular pattern. We know from the delta rule that, in order to reduce an error, we have to adapt its incoming weights according to

Δw_ho = (d_o − y_o) y_h.    (4.15)

That's step one. But it alone is not enough: when we only apply this rule, the weights from input to hidden units are never changed, and we do not have the full representational power of the feed-forward network as promised by the universal approximation theorem. In order to adapt the weights from input to hidden units, we again want to apply the delta rule. In this case, however, we do not have a value for δ for the hidden units. This is solved by the chain rule which does the following: distribute the error of an output unit o to all the hidden units that it is connected to, weighted by this connection. Differently put, a hidden unit h receives a delta from each output unit o equal to the delta of that output unit weighted with (= multiplied by) the weight of the connection between those units. In symbols: δ_h = Σ_o δ_o w_ho. Well, not exactly: we forgot the activation function of the hidden unit; F′ has to be applied to the delta, before the back-propagation process can continue.

4.3 Working with back-propagation

The application of the generalised delta rule thus involves two phases: During the first phase the input x is presented and propagated forward through the network to compute the output values y_o^p for each output unit. This output is compared with its desired value d_o, resulting in an error signal δ_o^p for each output unit. The second phase involves a backward pass through the network during which the error signal is passed to each unit in the network and appropriate weight changes are calculated.

Weight adjustments with sigmoid activation function. The results from the previous section can be summarised in three equations:

The weight of a connection is adjusted by an amount proportional to the product of an error signal δ on the unit k receiving the input and the output of the unit j sending this signal along the connection:

Δ_p w_jk = γ δ_k^p y_j^p.    (4.16)

If the unit is an output unit, the error signal is given by

δ_o^p = (d_o^p − y_o^p) F′(s_o^p).    (4.17)

Take as the activation function F the `sigmoid' function as defined in chapter 2:

y^p = F(s^p) = 1 / (1 + e^{−s^p}).    (4.18)

In this case the derivative is equal to

F′(s^p) = ∂/∂s^p ( 1 / (1 + e^{−s^p}) )
        = ( 1 / (1 + e^{−s^p})² ) e^{−s^p}
        = ( 1 / (1 + e^{−s^p}) ) ( e^{−s^p} / (1 + e^{−s^p}) )
        = y^p (1 − y^p),    (4.19)


such that the error signal for an output unit can be written as:

δ_o^p = (d_o^p − y_o^p) y_o^p (1 − y_o^p).    (4.20)

The error signal for a hidden unit is determined recursively in terms of error signals of the units to which it directly connects and the weights of those connections. For the sigmoid activation function:

δ_h^p = F′(s_h^p) Σ_{o=1}^{N_o} δ_o^p w_ho = y_h^p (1 − y_h^p) Σ_{o=1}^{N_o} δ_o^p w_ho.    (4.21)
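Pulling equations (4.16)-(4.21) together, the following is a minimal Python sketch of one forward and backward pass for a network with a single layer of sigmoid hidden units and sigmoid output units; the array shapes and variable names are our own.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(x, d, W1, th1, W2, th2, gamma=0.5):
    # Forward phase: propagate the clamped pattern to the output units.
    y_h = sigmoid(W1 @ x + th1)    # hidden activations
    y_o = sigmoid(W2 @ y_h + th2)  # output activations
    # Backward phase: error signals of eqs. (4.20) and (4.21).
    delta_o = (d - y_o) * y_o * (1 - y_o)
    delta_h = y_h * (1 - y_h) * (W2.T @ delta_o)
    # Weight changes, eq. (4.16); thresholds are weights from a unit with activation 1.
    W2 += gamma * np.outer(delta_o, y_h)
    th2 += gamma * delta_o
    W1 += gamma * np.outer(delta_h, x)
    th1 += gamma * delta_h
    return y_o

Here W1 has shape (hidden units x inputs) and W2 has shape (output units x hidden units); calling backprop_step once per pattern gives the per-pattern learning discussed below.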

Learning rate and momentum. The learning procedure requires that the change in weight is proportional to −∂E^p/∂w. True gradient descent requires that infinitesimal steps are taken. The constant of proportionality is the learning rate γ. For practical purposes we choose a learning rate that is as large as possible without leading to oscillation. One way to avoid oscillation at large γ is to make the change in weight dependent on the past weight change by adding a momentum term:

Δw_jk(t+1) = γ δ_k^p y_j^p + α Δw_jk(t),    (4.22)

where t indexes the presentation number and α is a constant which determines the effect of the previous weight change.

The role of the momentum term is shown in figure 4.2. When no momentum term is used, it takes a long time before the minimum has been reached with a low learning rate, whereas for high learning rates the minimum is never reached because of the oscillations. When adding the momentum term, the minimum will be reached faster.

Figure 4.2: The descent in weight space. a) for small learning rate; b) for large learning rate: note the oscillations, and c) with large learning rate and momentum term added.

Learning per pattern. Although, theoretically, the back-propagation algorithm performs gradient descent on the total error only if the weights are adjusted after the full set of learning patterns has been presented, more often than not the learning rule is applied to each pattern separately, i.e., a pattern p is applied, E^p is calculated, and the weights are adapted (p = 1, 2, …, P). There exists empirical indication that this results in faster convergence. Care has to be taken, however, with the order in which the patterns are taught. For example, when using the same sequence over and over again the network may become focused on the first few patterns. This problem can be overcome by using a permuted training method.

4.4 An example

A feed-forward network can be used to approximate a function from examples. Suppose we have a system (for example a chemical process or a financial market) of which we want to know the characteristics. The input of the system is given by the two-dimensional vector x and the output is given by the one-dimensional vector d. We want to estimate the relationship d = f(x) from 80 examples {x^p, d^p} as depicted in figure 4.3 (top left).

Figure 4.3: Example of function approximation with a feedforward network. Top left: The original learning samples; Top right: The approximation with the network; Bottom left: The function which generated the learning samples; Bottom right: The error in the approximation.

A feed-forward network was programmed with two inputs, 10 hidden units with sigmoid activation function and an output unit with a linear activation function. Check for yourself how equation (4.20) should be adapted for the linear instead of sigmoid activation function. The network weights are initialized to small values and the network is trained for 5,000 learning iterations with the back-propagation training rule, described in the previous section. The relationship between x and d as represented by the network is shown in figure 4.3 (top right), while the function which generated the learning samples is given in figure 4.3 (bottom left). The approximation error is depicted in figure 4.3 (bottom right). We see that the error is higher at the edges of the region within which the learning samples were generated. The network is considerably better at interpolation than extrapolation.

4.5 Other activation functions

Although sigmoid functions are quite often used as activation functions, other functions can be used as well. In some cases this leads to a formula which is known from traditional function approximation theories.

For example, from Fourier analysis it is known that any periodic function can be written as an infinite sum of sine and cosine terms (Fourier series):

f(x) = Σ_{n=0}^{∞} (a_n cos nx + b_n sin nx).    (4.23)


We can rewrite this as a summation of sine terms

f(x) = a_0 + Σ_{n=1}^{∞} c_n sin(nx + θ_n),    (4.24)

with c_n = √(a_n² + b_n²) and θ_n = arctan(b_n/a_n). This can be seen as a feed-forward network with a single input unit for x, a single output unit for f(x), and hidden units with an activation function F = sin(s). The factor a_0 corresponds with the bias of the output unit, the factors c_n correspond with the weights from hidden to output unit; the phase factor θ_n corresponds with the bias term of the hidden units and the factor n corresponds with the weights between the input and hidden layer.

The basic difference between the Fourier approach and the back-propagation approach is that in the Fourier approach the `weights' between the input and the hidden units (these are the factors n) are fixed integer numbers which are analytically determined, whereas in the back-propagation approach these weights can take any value and are typically learned using a learning heuristic.

To illustrate the use of other activation functions we have trained a feed-forward network with one output unit, four hidden units, and one input with ten patterns drawn from the function f(x) = sin(2x) sin(x). The result is depicted in figure 4.4. The same function (albeit with other learning points) is learned with a network with eight (!) sigmoid hidden units (see figure 4.5). From the figures it is clear that it pays off to use as much knowledge of the problem at hand as possible.

Figure 4.4: The periodic function f(x) = sin(2x) sin(x) approximated with sine activation functions. (Adapted from (Dastani, 1991).)

Figure 4.5: The periodic function f(x) = sin(2x) sin(x) approximated with sigmoid activation functions. (Adapted from (Dastani, 1991).)

4.6 Deficiencies of back-propagation

Despite the apparent success of the back-propagation learning algorithm, there are some aspects which make the algorithm not guaranteed to be universally useful. Most troublesome is the long training process. This can be a result of a non-optimum learning rate and momentum. A lot of advanced algorithms based on back-propagation learning have some optimised method to adapt this learning rate, as will be discussed in the next section. Outright training failures generally arise from two sources: network paralysis and local minima.

Network paralysis. As the network trains, the weights can be adjusted to very large values. The total input of a hidden unit or output unit can therefore reach very high (either positive or negative) values, and because of the sigmoid activation function the unit will have an activation very close to zero or very close to one. As is clear from equations (4.20) and (4.21), the weight adjustments which are proportional to y_k^p (1 − y_k^p) will then be close to zero, and the training process can come to a virtual standstill.


Local minima. The error surface of a complex network is full of hills and valleys. Because of the gradient descent, the network can get trapped in a local minimum when there is a much deeper minimum nearby. Probabilistic methods can help to avoid this trap, but they tend to be slow. Another suggested possibility is to increase the number of hidden units. Although this will work because of the higher dimensionality of the error space, and the chance to get trapped is smaller, it appears that there is some upper limit of the number of hidden units which, when exceeded, again results in the system being trapped in local minima.

4.7 Advanced algorithms

Many researchers have devised improvements of and extensions to the basic back-propagation algorithm described above. It is too early for a full evaluation: some of these techniques may prove to be fundamental, others may simply fade away. A few methods are discussed in this section.

Maybe the most obvious improvement is to replace the rather primitive steepest descent method with a direction set minimisation method, e.g., conjugate gradient minimisation. Note that minimisation along a direction u brings the function f at a place where its gradient is perpendicular to u (otherwise minimisation along u is not complete). Instead of following the gradient at every step, a set of n directions is constructed which are all conjugate to each other such that minimisation along one of these directions u_j does not spoil the minimisation along one of the earlier directions u_i, i.e., the directions are non-interfering. Thus one minimisation in the direction of u_i suffices, such that n minimisations in a system with n degrees of freedom bring this system to a minimum (provided the system is quadratic). This is different from gradient descent, which directly minimises in the direction of the steepest descent (Press, Flannery, Teukolsky, & Vetterling, 1986).

Suppose the function to be minimised is approximated by its Taylor series

f(x) = f(p) + Σ_i (∂f/∂x_i)|_p x_i + ½ Σ_{i,j} (∂²f/∂x_i∂x_j)|_p x_i x_j + ⋯
     ≈ ½ xᵀ A x − bᵀ x + c,


where T denotes transpose, and

c ≡ f(p),   b ≡ −∇f|_p,   [A]_ij ≡ (∂²f/∂x_i∂x_j)|_p.    (4.25)

A is a symmetric positive definite² n × n matrix, the Hessian of f at p. The gradient of f is

∇f = Ax − b,    (4.27)

such that a change of x results in a change of the gradient as

δ(∇f) = A(δx).    (4.28)

Now suppose f was minimised along a direction u_i to a point where the gradient g_{i+1} of f is perpendicular to u_i, i.e.,

u_iᵀ g_{i+1} = 0,    (4.29)

and a new direction u_{i+1} is sought. In order to make sure that moving along u_{i+1} does not spoil minimisation along u_i, we require that the gradient of f remain perpendicular to u_i, i.e.,

u_iᵀ g_{i+2} = 0,    (4.30)

otherwise we would once more have to minimise in a direction which has a component of u_i. Combining (4.29) and (4.30), we get

0 = u_iᵀ (g_{i+1} − g_{i+2}) = u_iᵀ δ(∇f) = u_iᵀ A u_{i+1}.    (4.31)

When eq. (4.31) holds for two vectors u_i and u_{i+1} they are said to be conjugate. Now, starting at some point p_0, the first minimisation direction u_0 is taken equal to g_0 = −∇f(p_0), resulting in a new point p_1. For i ≥ 0, calculate the directions

u_{i+1} = g_{i+1} + γ_i u_i,    (4.32)

where γ_i is chosen to make u_iᵀ A u_{i+1} = 0 and the successive gradients perpendicular, i.e.,

γ_i = (g_{i+1}ᵀ g_{i+1}) / (g_iᵀ g_i),   with   g_k = −∇f|_{p_k} for all k ≥ 0.    (4.33)

Next, calculate p_{i+2} = p_{i+1} + λ_{i+1} u_{i+1}, where λ_{i+1} is chosen so as to minimise f(p_{i+2})³.

It can be shown that the u's thus constructed are all mutually conjugate (e.g., see (Stoer & Bulirsch, 1980)). The process described above is known as the Fletcher-Reeves method, but there are many variants which work more or less the same (Hestenes & Stiefel, 1952; Polak, 1971; Powell, 1977).
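As an illustration, here is a minimal Python sketch of the Fletcher-Reeves method for the quadratic case f(x) = ½ xᵀAx − bᵀx, where the line minimisation of footnote 3 has the closed form λ = (g·u)/(uᵀAu); the test matrix and vector are invented for the example.

import numpy as np

def fletcher_reeves(A, b, p, tol=1e-10):
    # Conjugate gradient minimisation of f(x) = 1/2 x^T A x - b^T x.
    g = b - A @ p                    # g_0 = -grad f(p_0)
    u = g.copy()                     # u_0: steepest descent direction
    for _ in range(len(b)):          # n iterations suffice for a quadratic system
        lam = (g @ u) / (u @ A @ u)  # exact line minimisation along u
        p = p + lam * u
        g_new = b - A @ p            # new negative gradient
        if np.linalg.norm(g_new) < tol:
            break
        gamma = (g_new @ g_new) / (g @ g)  # eq. (4.33)
        u = g_new + gamma * u              # eq. (4.32)
        g = g_new
    return p

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite
b = np.array([1.0, 1.0])
print(fletcher_reeves(A, b, np.zeros(2)))  # [0.2 0.4], the minimiser A^{-1} b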

Although only n iterations are needed for a quadratic system with n degrees of freedom, due to the fact that we are not minimising quadratic systems, as well as a result of round-off errors, the n directions have to be followed several times (see figure 4.6). Powell introduced some improvements to correct for behaviour in non-quadratic systems. The resulting cost is O(n) which is significantly better than the linear convergence⁴ of steepest descent.

²A matrix A is called positive definite if, for all y ≠ 0,

yᵀ A y > 0.    (4.26)

³This is not a trivial problem (see (Press et al., 1986)). However, line minimisation methods exist with super-linear convergence (see footnote 4).

⁴A method is said to converge linearly if E_{i+1} = c E_i with c < 1. Methods which converge with a higher power, i.e., E_{i+1} = c (E_i)^m with m > 1, are called super-linear.


Figure 4.6: Slow decrease with conjugate gradient in non-quadratic systems. The hills on the left are very steep, resulting in a large search vector u_i. When the quadratic portion is entered the new search direction is constructed from the previous direction and the gradient, resulting in a spiraling minimisation. This problem can be overcome by detecting such spiraling minimisations and restarting the algorithm with u_0 = −∇f.

Some improvements on back-propagation have been presented based on an independent adaptive learning rate parameter for each weight.

Van den Boomgaard and Smeulders (Boomgaard & Smeulders, 1989) show that for a feed-forward network without hidden units an incremental procedure to find the optimal weight matrix W needs an adjustment of the weights with

ΔW(t+1) = γ(t+1) ( d(t+1) − W(t) x(t+1) ) x(t+1),    (4.34)

in which γ is not a constant but a variable (N_i + 1) × (N_i + 1) matrix which depends on the input vector. By using a priori knowledge about the input signal, the storage requirements for γ can be reduced.

Silva and Almeida (Silva & Almeida, 1990) also show the advantages of an independent step size for each weight in the network. In their algorithm the learning rate is adapted after every learning pattern:

γ_jk(t+1) = { u γ_jk(t)   if ∂E(t+1)/∂w_jk and ∂E(t)/∂w_jk have the same signs;
            { d γ_jk(t)   if ∂E(t+1)/∂w_jk and ∂E(t)/∂w_jk have opposite signs,    (4.35)

where u and d are positive constants with values slightly above and below unity, respectively. The idea is to decrease the learning rate in case of oscillations.
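In code, the rule amounts to the following per-weight update; a minimal Python sketch in which the constants u = 1.2 and d = 0.8 are invented placeholders for `slightly above and below unity'.

def adapt_step_size(gamma_jk, grad_now, grad_prev, u=1.2, d=0.8):
    # Silva & Almeida rule, eq. (4.35): grow the step size while the error
    # derivative keeps its sign; shrink it when the sign flips (oscillation).
    if grad_now * grad_prev > 0:
        return u * gamma_jk
    return d * gamma_jk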

4.8 How good are multi-layer feed-forward networks?

From the example shown in figure 4.3 it is clear that the approximation of the network is not perfect. The resulting approximation error is influenced by:

1. The learning algorithm and number of iterations. This determines how well the error on the training set is minimized.


2. The number of learning samples. This determines how well the training samples represent the actual function.

3. The number of hidden units. This determines the `expressive power' of the network. For `smooth' functions only a few hidden units are needed; for wildly fluctuating functions more hidden units will be needed.

In the previous sections we discussed the learning rules such as back-propagation and the other gradient based learning algorithms, and the problem of finding the minimum error. In this section we particularly address the effect of the number of learning samples and the effect of the number of hidden units.

We first have to define an adequate error measure. All neural network training algorithms try to minimize the error of the set of learning samples which are available for training the network. The average error per learning sample is defined as the learning error rate:

E_learning = (1 / P_learning) Σ_{p=1}^{P_learning} E^p,

in which E^p is the difference between the desired output value and the actual network output for the learning samples:

E^p = ½ Σ_{o=1}^{N_o} (d_o^p − y_o^p)².

This is the error which is measurable during the training process.

It is obvious that the actual error of the network will differ from the error at the locations of the training samples. The difference between the desired output value and the actual network output should be integrated over the entire input domain to give a more realistic error measure. This integral can be estimated if we have a large set of samples: the test set. We now define the test error rate as the average error of the test set:

E_test = (1 / P_test) Σ_{p=1}^{P_test} E^p.
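Both error rates can be computed with the same helper; a minimal Python sketch, in which predict is a placeholder name for whatever trained network is being evaluated.

import numpy as np

def error_rate(predict, X, D):
    # Average error per sample: (1/P) sum_p 1/2 ||d^p - y^p||^2.
    return np.mean([0.5 * np.sum((d - predict(x)) ** 2) for x, d in zip(X, D)])

# E_learning = error_rate(predict, X_learning, D_learning)
# E_test     = error_rate(predict, X_test, D_test)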

In the following subsections we will see how these error measures depend on learning set size and number of hidden units.

4.8.1 The effect of the number of learning samples

A simple problem is used as example: a function y = f(x) has to be approximated with a feed-forward neural network. A neural network is created with an input, 5 hidden units with sigmoid activation function and a linear output unit. Suppose we have only a small number of learning samples (e.g., 4) and the network is trained with these samples. Training is stopped when the error does not decrease anymore. The original (desired) function is shown in figure 4.7A as a dashed line. The learning samples and the approximation of the network are shown in the same figure. We see that in this case E_learning is small (the network output goes perfectly through the learning samples) but E_test is large: the test error of the network is large. The approximation obtained from 20 learning samples is shown in figure 4.7B. The E_learning is larger than in the case of 4 learning samples, but the E_test is smaller.

This experiment was carried out with other learning set sizes, where for each learning set size the experiment was repeated 10 times. The average learning and test error rates as a function of the learning set size are given in figure 4.8. Note that the learning error increases with an increasing learning set size, and the test error decreases with increasing learning set size.


Figure 4.7: Effect of the learning set size on the generalization. The dashed line gives the desired function, the learning samples are depicted as circles and the approximation by the network is shown by the drawn line. 5 hidden units are used. a) 4 learning samples. b) 20 learning samples.

A low learning error on the (small) learning set is no guarantee for a good network performance! With increasing number of learning samples the two error rates converge to the same value. This value depends on the representational power of the network: given the optimal weights, how good is the approximation. This error depends on the number of hidden units and the activation function. If the learning error rate does not converge to the test error rate, the learning procedure has not found a global minimum.

Figure 4.8: Effect of the learning set size on the error rate. The average learning error rate and the average test error rate as a function of the number of learning samples.

4.8.2 The effect of the number of hidden units

The same function as in the previous subsection is used, but now the number of hidden units is varied. The original (desired) function, learning samples and network approximation are shown in figure 4.9A for 5 hidden units and in figure 4.9B for 20 hidden units. The effect visible in figure 4.9B is called overtraining. The network fits exactly with the learning samples, but because of the large number of hidden units the function which is actually represented by the network is far more wild than the original one. Particularly in case of learning samples which contain a certain amount of noise (which all real-world data have), the network will `fit the noise' of the learning samples instead of making a smooth approximation.


Figure 4.9: Effect of the number of hidden units on the network performance. The dashed line gives the desired function, the circles denote the learning samples and the drawn line gives the approximation by the network. 12 learning samples are used. a) 5 hidden units. b) 20 hidden units.

This example shows that a large number of hidden units leads to a small error on the training set but does not necessarily lead to a small error on the test set. Adding hidden units will always lead to a reduction of E_learning. However, adding hidden units will first lead to a reduction of E_test, but then lead to an increase of E_test. This effect is called the peaking effect. The average learning and test error rates as a function of the number of hidden units are given in figure 4.10.

Figure 4.10: The average learning error rate and the average test error rate as a function of the number of hidden units.

4.9 Applications

Back-propagation has been applied to a wide variety of research applications. Sejnowski and Rosenberg (Sejnowski & Rosenberg, 1986) produced a spectacular success with NETtalk, a system that converts printed English text into highly intelligible speech.

A feed-forward network with one layer of hidden units has been described by Gorman and Sejnowski (Gorman & Sejnowski, 1988) as a classification machine for sonar signals.

Another application of a multi-layer feed-forward network with a back-propagation training algorithm is to learn an unknown function between input and output signals from the presentation of examples.


It is hoped that the network is able to generalise correctly, so that input values which are not presented as learning patterns will result in correct output values. An example is the work of Josin (Josin, 1988), who used a two-layer feed-forward network with back-propagation learning to perform the inverse kinematic transform which is needed by a robot arm controller (see chapter 8).


5 Recurrent Networks

The learning algorithms discussed in the previous chapter were applied to feed-forward networks: all data flows in a network in which no cycles are present.

But what happens when we introduce a cycle? For instance, we can connect a hidden unit with itself over a weighted connection, connect hidden units to input units, or even connect all units with each other. Although, as we know from the previous chapter, the approximational capabilities of such networks do not increase, we may obtain decreased complexity, network size, etc. to solve the same problem.

An important question we have to consider is the following: what do we want to learn in a recurrent network? After all, when one is considering a recurrent network, it is possible to continue propagating activation values ad infinitum, or until a stable point (attractor) is reached. As we will see in the sequel, there exist recurrent networks which are attractor based, i.e., the activation values in the network are repeatedly updated until a stable point is reached after which the weights are adapted, but there are also recurrent networks where the learning rule is used after each propagation (where an activation value is traversed over each weight only once), while external inputs are included in each propagation. In such networks, the recurrent connections can be regarded as extra inputs to the network (the values of which are computed by the network itself).

In this chapter recurrent extensions to the feed-forward network introduced in the previous chapters will be discussed, yet not to exhaustion. The theory of the dynamics of recurrent networks extends beyond the scope of a one-semester course on neural networks. Yet the basics of these networks will be discussed.

Subsequently some special recurrent networks will be discussed: the Hopfield network in section 5.2, which can be used for the representation of binary patterns; subsequently we touch upon Boltzmann machines, therewith introducing stochasticity in neural computation.

5.1 The generalised delta-rule in recurrent networks

The back-propagation learning rule, introduced in chapter 4, can be easily used for training patterns in recurrent networks. Before we will consider this general case, however, we will first describe networks where some of the hidden unit activation values are fed back to an extra set of input units (the Elman network), or where output values are fed back into hidden units (the Jordan network).

A typical application of such a network is the following. Suppose we have to construct a network that must generate a control command depending on an external input, which is a time series x(t), x(t−1), x(t−2), …. With a feed-forward network there are two possible approaches:

1. create inputs x_1, x_2, …, x_n which constitute the last n values of the input vector. Thus a `time window' of the input vector is input to the network.

2. create inputs x, x′, x″, …. Besides only inputting x(t), we also input its first, second, etc. derivatives. Naturally, computation of these derivatives is not a trivial task for higher-order derivatives.


The disadvantage is, of course, that the input dimensionality of the feed-forward network is multiplied with n, leading to a very large network, which is slow and difficult to train. The Jordan and Elman networks provide a solution to this problem. Due to the recurrent connections, a window of inputs need not be input anymore; instead, the network is supposed to learn the influence of the previous time steps itself.

5.1.1 The Jordan network

One of the earliest recurrent neural networks was the Jordan network (Jordan, 1986a, 1986b). An exemplar network is shown in figure 5.1.

Figure 5.1: The Jordan network. Output activation values are fed back to the input layer, to a set of extra neurons called the state units.

In the Jordan network, the activation values of the output units are fed back into the input layer through a set of extra input units called the state units. There are as many state units as there are output units in the network. The connections between the output and state units have a fixed weight of +1; learning takes place only in the connections between input and hidden units as well as hidden and output units. Thus all the learning rules derived for the multi-layer perceptron can be used to train this network.

5.1.2 The Elman network

The Elman network was introduced by Elman in 1990 (Elman, 1990). In this network a set of context units are introduced, which are extra input units whose activation values are fed back from the hidden units. Thus the network is very similar to the Jordan network, except that (1) the hidden units instead of the output units are fed back; and (2) the extra input units have no self-connections.

The schematic structure of this network is shown in figure 5.2.

Again the hidden units are connected to the context units with a fixed weight of value +1. Learning is done as follows:

1. the context units are set to 0; t = 1;


Figure 5.2: The Elman network. With this network, the hidden unit activation values are fed back to the input layer, to a set of extra neurons called the context units.

2. pattern x^t is clamped, the forward calculations are performed once;

3. the back-propagation learning rule is applied;

4. t ← t + 1; go to 2.

The context units at step t thus always have the activation value of the hidden units at step t − 1.
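A minimal Python sketch of this procedure for an Elman network follows; the layer sizes, initialisation, and names are our own, and thresholds are omitted for brevity.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_elman(xs, ds, n_in, n_hid, n_out, gamma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W_in = rng.normal(0, 0.1, (n_hid, n_in + n_hid))  # inputs + context -> hidden
    W_out = rng.normal(0, 0.1, (n_out, n_hid))        # hidden -> output
    context = np.zeros(n_hid)                         # step 1: context units set to 0
    for x, d in zip(xs, ds):                          # one pattern per time step t
        z = np.concatenate([x, context])              # step 2: clamp pattern + context
        y_h = sigmoid(W_in @ z)                       # forward calculations (once)
        y_o = sigmoid(W_out @ y_h)
        delta_o = (d - y_o) * y_o * (1 - y_o)         # step 3: back-propagation
        delta_h = y_h * (1 - y_h) * (W_out.T @ delta_o)
        W_out += gamma * np.outer(delta_o, y_h)
        W_in += gamma * np.outer(delta_h, z)
        context = y_h.copy()                          # step 4: context <- hidden units
    return W_in, W_out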

Example

As we mentioned above, the Jordan and Elman networks can be used to train a network on reproducing time sequences. The idea of the recurrent connections is that the network is able to `remember' the previous states of the input values. As an example, we trained an Elman network on controlling an object moving in 1D. This object has to follow a pre-specified trajectory x_d. To control the object, forces F must be applied, since the object suffers from friction and perhaps other external forces.

To tackle this problem, we use an Elman net with inputs x and x_d, one output F, and three hidden units. The hidden units are connected to three context units. In total, five units feed into the hidden layer.

The results of training are shown in figure 5.3.

Figure 5.3: Training an Elman network to control an object. The solid line depicts the desired trajectory x_d; the dashed line the realised trajectory. The third line is the error.


The same test can be done with an ordinary feed-forward network with sliding window input. We tested this with a network with five inputs, four of which constituted the sliding window x_{−3}, x_{−2}, x_{−1}, and x_0, and one the desired next position of the object. Results are shown in figure 5.4.

Figure 5.4: Training a feed-forward network to control an object. The solid line depicts the desired trajectory x_d; the dashed line the realised trajectory. The third line is the error.

The disappointing observation is that the results are actually better with the ordinary feed-forward network, which has the same complexity as the Elman network.

5.1.3 Back-propagation in fully recurrent networks

More complex schemes than the above are possible. For instance, independently of each other Pineda (Pineda, 1987) and Almeida (Almeida, 1987) discovered that error back-propagation is in fact a special case of a more general gradient learning method which can be used for training attractor networks. However, also when a network does not reach a fixpoint, a learning method can be used: back-propagation through time (Pearlmutter, 1989, 1990). This learning method, the discussion of which extends beyond the scope of our course, can be used to train a multi-layer perceptron to follow trajectories in its activation values.

5.2 The Hopfield network

One of the earliest recurrent neural networks reported in literature was the auto-associator independently described by Anderson (Anderson, 1977) and Kohonen (Kohonen, 1977) in 1977. It consists of a pool of neurons with connections between each unit i and j, i ≠ j (see figure 5.5). All connections are weighted.

In 1982, Hopfield (Hopfield, 1982) brought together several earlier ideas concerning these networks and presented a complete mathematical analysis based on Ising spin models (Amit, Gutfreund, & Sompolinsky, 1986). For this reason the network, which we will describe in this chapter, is generally referred to as the Hopfield network.

5.2.1 Description

The Hopfield network consists of a set of N interconnected neurons (figure 5.5) which update their activation values asynchronously and independently of other neurons. All neurons are both input and output neurons. The activation values are binary. Originally, Hopfield chose activation values of 1 and 0, but using values +1 and −1 presents some advantages discussed below. We will therefore adhere to the latter convention.


Figure 5.5: The auto-associator network. All neurons are both input and output neurons, i.e., a pattern is clamped, the network iterates to a stable state, and the output of the network consists of the new activation values of the neurons.

The state of the system is given by the activation values¹ y = (y_k). The net input s_k(t+1) of a neuron k at cycle t+1 is a weighted sum

    s_k(t+1) = Σ_{j≠k} y_j(t) w_jk + θ_k.    (5.1)

A simple threshold function (figure 2.2) is applied to the net input to obtain the new activation value y_k(t+1) at time t+1:

    y_k(t+1) = { +1      if s_k(t+1) > U_k
                 −1      if s_k(t+1) < U_k
                 y_k(t)  otherwise,          (5.2)

i.e., y_k(t+1) = sgn(s_k(t+1)). For simplicity we henceforth choose U_k = 0, but this is of course not essential.

A neuron k in the Hopfield network is called stable at time t if, in accordance with equations (5.1) and (5.2),

    y_k(t) = sgn(s_k(t−1)).    (5.3)

A state α is called stable if, when the network is in state α, all neurons are stable. A pattern x^p is called stable if, when x^p is clamped, all neurons are stable.

When the extra restriction w_jk = w_kj is made, the behaviour of the system can be described with an energy function

    E = −1/2 ΣΣ_{j≠k} y_j y_k w_jk − Σ_k θ_k y_k.    (5.4)

Theorem 2 A recurrent network with connections w_jk = w_kj in which the neurons are updated using rule (5.2) has stable limit points.

Proof First, note that the energy expressed in eq. (5.4) is bounded from below, since the y_k are bounded from below and the w_jk and θ_k are constant. Secondly, E is monotonically decreasing when state changes occur, because

    ΔE = −Δy_k ( Σ_{j≠k} y_j w_jk + θ_k )    (5.5)

is always negative when y_k changes according to eqs. (5.1) and (5.2).

¹ Often, these networks are described using the symbols used by Hopfield: V_k for activation of unit k, T_jk for the connection weight between units j and k, and U_k for the external input of unit k. We decided to stick to the more general symbols y_k, w_jk, and θ_k.


The advantage of a +1/−1 model over a 1/0 model then is symmetry of the states of the network. For, when some pattern x is stable, its inverse is stable, too, whereas in the 1/0 model this is not always true (as an example, the pattern 00⋯00 is always stable, but 11⋯11 need not be). Similarly, both a pattern and its inverse have the same energy in the +1/−1 model.

Removing the restriction of bidirectional connections (i.e., w_jk = w_kj) results in a system that is not guaranteed to settle to a stable state.

5.2.2 Hopfield network as associative memory

A primary application of the Hopfield network is an associative memory. In this case, the weights of the connections between the neurons have to be set such that the states of the system corresponding with the patterns which are to be stored in the network are stable. These states can be seen as `dips' in energy space. When the network is cued with a noisy or incomplete test pattern, it will render the incorrect or missing data by iterating to a stable state which is in some sense `near' to the cued pattern.

The Hebb rule can be used (section 2.3.2) to store P patterns:

    w_jk = { Σ_{p=1}^{P} x_j^p x_k^p   if j ≠ k
             0                          otherwise,    (5.6)

i.e., if x_j^p and x_k^p are equal, w_jk is increased, otherwise decreased by one (note that, in the original Hebb rule, weights only increase). It appears, however, that the network gets saturated very quickly, and that about 0.15N memories can be stored before recall errors become severe.
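
The storage prescription (5.6) and the asynchronous update rule (5.2) with U_k = 0 fit in a few lines of numpy. This is a sketch, not code from the text; the random asynchronous update order is an implementation choice:

    import numpy as np

    def hebb_store(patterns):
        # patterns: (P, N) array with entries +1/-1; implements eq. (5.6)
        W = patterns.T @ patterns
        np.fill_diagonal(W, 0)                   # no self-connections
        return W

    def recall(W, x, n_updates=500, rng=np.random.default_rng(0)):
        # Asynchronous updates, eqs. (5.1)-(5.2), with thresholds theta_k = 0.
        y = x.copy()
        for _ in range(n_updates):
            k = rng.integers(len(y))             # one random neuron at a time
            s = W[:, k] @ y                      # net input s_k
            if s != 0:
                y[k] = 1 if s > 0 else -1        # y_k is kept when s_k = 0
        return y

Cueing recall with a noisy copy of a stored ±1 pattern should return the stored pattern, up to the capacity limit of roughly 0.15N mentioned above.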

There are two problems associated with storing too many patterns:

1. the stored patterns become unstable;

2. spurious stable states appear (i.e., stable states which do not correspond with stored patterns).

The first of these two problems can be solved by an algorithm proposed by Bruce et al. (Bruce, Canning, Forrest, Gardner, & Wallace, 1986):

Algorithm 1 Given a starting weight matrix W = [w_jk], for each pattern x^p to be stored and each element x_k^p in x^p define a correction ε_k such that

    ε_k = { 0   if y_k is stable and x^p is clamped;
            1   otherwise.                            (5.7)

Now modify w_jk by Δw_jk = y_j y_k (ε_j + ε_k) if j ≠ k. Repeat this procedure until all patterns are stable.

It appears that, in practice, this algorithm usually converges. There exist cases, however, where the algorithm remains oscillatory (try to find one)!

The second problem stated above can be alleviated by applying the Hebb rule in reverse to the spurious stable state, but with a low learning factor (Hopfield, Feinstein, & Palmer, 1983). Thus these patterns are weakly unstored and will become unstable again.

5.2.3 Neurons with graded response

The network described in section 5.2.1 can be generalised by allowing continuous activation values. Here, the threshold activation function is replaced by a sigmoid. As before, this system can be proved to be stable when a symmetric weight matrix is used (Hopfield, 1984).


Hopfield networks for optimisation problems

An interesting application of the Hopfield network with graded response arises in a heuristic solution to the NP-complete travelling salesman problem (Garey & Johnson, 1979). In this problem, a path of minimal distance must be found between n cities, such that the begin- and end-points are the same.

Hopfield and Tank (Hopfield & Tank, 1985) use a network with n × n neurons. Each row in the matrix represents a city, whereas each column represents the position in the tour. When the network is settled, each row and each column should have one and only one active neuron, indicating a specific city occupying a specific position in the tour. The neurons are updated using rule (5.2) with a sigmoid activation function between 0 and 1. The activation value y_Xj = 1 indicates that city X occupies the jth place in the tour.

An energy function describing this problem can be set up as follows. To ensure a correct solution, the following energy must be minimised:

    E = A/2 Σ_X Σ_j Σ_{k≠j} y_Xj y_Xk
      + B/2 Σ_j Σ_X Σ_{Y≠X} y_Xj y_Yj
      + C/2 ( Σ_X Σ_j y_Xj − n )²                    (5.8)

where A, B, and C are constants. The first and second terms in equation (5.8) are zero if and only if there is a maximum of one active neuron in each row and column, respectively. The last term is zero if and only if there are exactly n active neurons.

To minimise the distance of the tour, an extra term

    D/2 Σ_X Σ_{Y≠X} Σ_j d_XY y_Xj ( y_{Y,j+1} + y_{Y,j−1} )    (5.9)

is added to the energy, where d_XY is the distance between cities X and Y and D is a constant. For convenience, the subscripts are defined modulo n.

The weights are set as follows:

    w_{Xj,Yk} = −A δ_XY (1 − δ_jk)                    (inhibitory connections within each row)
                −B δ_jk (1 − δ_XY)                    (inhibitory connections within each column)
                −C                                    (global inhibition)
                −D d_XY (δ_{k,j+1} + δ_{k,j−1})       (data term)    (5.10)

where δ_jk = 1 if j = k and 0 otherwise. Finally, each neuron has an external bias input Cn.
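
To make the construction concrete, the following sketch builds the weight matrix of eq. (5.10) for a given distance matrix. The values of A, B, C, and D are placeholders (such constants were tuned by hand in the original work), and flattening neuron (X, j) to index X·n + j is an implementation choice:

    import numpy as np

    def tsp_weights(d, A=500.0, B=500.0, C=200.0, D=500.0):
        # d: (n, n) symmetric distance matrix; returns the (n*n, n*n)
        # weight matrix of eq. (5.10).
        n = d.shape[0]
        delta = np.eye(n)                              # delta[j, k] = 1 iff j == k
        W = np.zeros((n * n, n * n))
        for X in range(n):
            for j in range(n):
                for Y in range(n):
                    for k in range(n):
                        W[X * n + j, Y * n + k] = (
                            -A * delta[X, Y] * (1 - delta[j, k])    # row inhibition
                            - B * delta[j, k] * (1 - delta[X, Y])   # column inhibition
                            - C                                     # global inhibition
                            - D * d[X, Y] * (delta[k, (j + 1) % n]
                                             + delta[k, (j - 1) % n]))  # data term
        return W  # each neuron additionally gets an external bias input C*n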

Discussion

Although this application is interesting from a theoretical point of view, the applicability is limited. Whereas Hopfield and Tank state that, in a ten city tour, the network converges to a valid solution in 16 out of 20 trials while 50% of the solutions are optimal, other reports show less encouraging results. For example, (Wilson & Pawley, 1988) find that in only 15% of the runs a valid result is obtained, few of which lead to an optimal or near-optimal solution. The main problem is the lack of global information. Since, for an N-city problem, there are N! possible tours, each of which may be traversed in two directions as well as started in N points, the number of different tours is N!/2N. Differently put, the N-dimensional hypercube in which the solutions are situated is 2N degenerate. The degenerate solutions occur evenly within the hypercube, such that all but one of the final 2N configurations are redundant. The competition between the degenerate tours often leads to solutions which are piecewise optimal but globally inefficient.

5.3 Boltzmann machines

The Boltzmann machine, as first described by Ackley, Hinton, and Sejnowski in 1985 (Ackley, Hinton, & Sejnowski, 1985), is a neural network that can be seen as an extension of Hopfield networks to include hidden units, and with a stochastic instead of deterministic update rule. The weights are still symmetric. The operation of the network is based on the physics principle of annealing. This is a process whereby a material is heated and then cooled very, very slowly to a freezing point. As a result, the crystal lattice will be highly ordered, without any impurities, such that the system is in a state of very low energy. In the Boltzmann machine this system is mimicked by changing the deterministic update of equation (5.2) into a stochastic update, in which a neuron becomes active with a probability p,

    p(y_k ← +1) = 1 / ( 1 + e^{−ΔE_k/T} )    (5.11)

where T is a parameter comparable with the (synthetic) temperature of the system. This stochastic activation function is not to be confused with neurons having a sigmoid deterministic activation function.

In accordance with a physical system obeying a Boltzmann distribution, the network will eventually reach `thermal equilibrium' and the relative probability of two global states α and β will follow the Boltzmann distribution

    P_α / P_β = e^{−(E_α − E_β)/T}    (5.12)

where P_α is the probability of being in the αth global state, and E_α is the energy of that state. Note that at thermal equilibrium the units still change state, but the probability of finding the network in any global state remains constant.

At low temperatures there is a strong bias in favour of states with low energy, but the time required to reach equilibrium may be long. At higher temperatures the bias is not so favourable but equilibrium is reached faster. A good way to beat this trade-off is to start at a high temperature and gradually reduce it. At high temperatures, the network will ignore small energy differences and will rapidly approach equilibrium. In doing so, it will perform a search of the coarse overall structure of the space of global states, and will find a good minimum at that coarse level. As the temperature is lowered, it will begin to respond to smaller energy differences and will find one of the better minima within the coarse-scale minimum it discovered at high temperature.

Like multi-layer perceptrons, the Boltzmann machine consists of a non-empty set of visible and a possibly empty set of hidden units. Here, however, the units are binary-valued and are updated stochastically and asynchronously. The simplicity of the Boltzmann distribution leads to a simple learning procedure which adjusts the weights so as to use the hidden units in an optimal way (Ackley et al., 1985). This algorithm works as follows.

First, the input and output vectors are clamped. The network is then annealed until it approaches thermal equilibrium at a temperature of 0. It then runs for a fixed time at equilibrium and each connection measures the fraction of the time during which both the units it connects are active. This is repeated for all input-output pairs so that each connection can measure ⟨y_j y_k⟩^clamped, the expected probability, averaged over all cases, that units j and k are simultaneously active at thermal equilibrium when the input and output vectors are clamped.


Similarly, ⟨y_j y_k⟩^free is measured when the output units are not clamped but determined by the network.

In order to determine optimal weights in the network, an error function must be determined. Now, the probability P^free(y^p) that the visible units are in state y^p when the system is running freely can be measured. Also, the desired probability P^clamped(y^p) that the visible units are in state y^p is determined by clamping the visible units and letting the network run. Now, if the weights in the network are correctly set, both probabilities are equal to each other, and the error E in the network must be 0. Otherwise, the error must have a positive value measuring the discrepancy between the network's internal mode and the environment. For this effect, the `asymmetric divergence' or `Kullback information' is used:

    E = Σ_p P^clamped(y^p) log [ P^clamped(y^p) / P^free(y^p) ].    (5.13)

Now, in order to minimise E using gradient descent, we must change the weights according to

    Δw_jk = −γ ∂E/∂w_jk.    (5.14)

It is not difficult to show that

    ∂E/∂w_jk = −(1/T) [ ⟨y_j y_k⟩^clamped − ⟨y_j y_k⟩^free ].    (5.15)

Therefore, each weight is updated by

    Δw_jk = γ [ ⟨y_j y_k⟩^clamped − ⟨y_j y_k⟩^free ].    (5.16)
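
A small simulation sketch of this procedure follows, under several assumptions not in the text: 0/1 units, a short fixed annealing schedule, co-occurrence estimates taken from a single final state instead of time averages at equilibrium, the whole set of visible units clamped to the data in the first phase, and an assumed learning rate gamma:

    import numpy as np

    def anneal(W, y, clamped, temps, rng):
        # Stochastic updates, eq. (5.11): unit k turns on with probability
        # 1 / (1 + exp(-dE_k / T)), where dE_k is its energy gap.
        for T in temps:
            for k in rng.permutation(len(y)):
                if not clamped[k]:
                    dE = W[k] @ y
                    y[k] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-dE / T)) else 0.0
        return y

    def learn_step(W, data, n_visible, gamma=0.05, temps=(10.0, 5.0, 1.0),
                   rng=np.random.default_rng(0)):
        # One pass of rule (5.16): <y_j y_k>^clamped - <y_j y_k>^free.
        n = W.shape[0]
        co_c, co_f = np.zeros_like(W), np.zeros_like(W)
        for v in data:                              # data: (P, n_visible) 0/1 vectors
            y = np.concatenate([v, rng.integers(0, 2, n - n_visible)]).astype(float)
            y = anneal(W, y, np.arange(n) < n_visible, temps, rng)
            co_c += np.outer(y, y)                  # clamped co-occurrences
            y = anneal(W, rng.integers(0, 2, n).astype(float),
                       np.zeros(n, dtype=bool), temps, rng)
            co_f += np.outer(y, y)                  # free-running co-occurrences
        dW = gamma * (co_c - co_f) / len(data)
        np.fill_diagonal(dW, 0)                     # keep weights symmetric, no self-loops
        return W + dW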


6 Self-Organising Networks

In the previous chapters we discussed a number of networks which were trained to perform a mapping F: ℝ^n → ℝ^m by presenting the network `examples' (x^p, d^p) with d^p = F(x^p) of this mapping. However, problems exist where such training data, consisting of input and desired output pairs, are not available, but where the only information is provided by a set of input patterns x^p. In these cases the relevant information has to be found within the (redundant) training samples x^p.

Some examples of such problems are:

clustering: the input data may be grouped in `clusters' and the data processing system has to find these inherent clusters in the input data. The output of the system should give the cluster label of the input pattern (discrete output);

vector quantisation: this problem occurs when a continuous space has to be discretised. The input of the system is the n-dimensional vector x, the output is a discrete representation of the input space. The system has to find an optimal discretisation of the input space;

dimensionality reduction: the input data are grouped in a subspace which has lower dimensionality than the dimensionality of the data. The system has to learn an optimal mapping, such that most of the variance in the input data is preserved in the output data;

feature extraction: the system has to extract features from the input signal. This often means a dimensionality reduction as described above.

In this chapter we discuss a number of neuro-computational approaches for these kinds of problems. Training is done without the presence of an external teacher. The unsupervised weight adapting algorithms are usually based on some form of global competition between the neurons.

There are very many types of self-organising networks, applicable to a wide area of problems. One of the most basic schemes is competitive learning as proposed by Rumelhart and Zipser (Rumelhart & Zipser, 1985). A very similar network but with different emergent properties is the topology-conserving map devised by Kohonen. Other self-organising networks are ART, proposed by Carpenter and Grossberg (Carpenter & Grossberg, 1987a; Grossberg, 1976), and Fukushima's cognitron (Fukushima, 1975, 1988).

6.1 Competitive learning

6.1.1 Clustering

Competitive learning is a learning procedure that divides a set of input patterns in clusters that are inherent to the input data.


A competitive learning network is provided only with input vectors x and thus implements an unsupervised learning procedure. We will show its equivalence to a class of `traditional' clustering algorithms shortly. Another important use of these networks is vector quantisation, as discussed in section 6.1.2.

Figure 6.1: A simple competitive learning network. Each of the four outputs o is connected to all inputs i.

An example of a competitive learning network is shown in figure 6.1. All output units o are connected to all input units i with weights w_io. When an input pattern x is presented, only a single output unit of the network (the winner) will be activated. In a correctly trained network, all x in one cluster will have the same winner. For the determination of the winner and the corresponding learning rule, two methods exist.

Winner selection: dot product

For the time being, we assume that both input vectors x and weight vectors w_o are normalised to unit length. Each output unit o calculates its activation value y_o according to the dot product of input and weight vector:

    y_o = Σ_i w_io x_i = w_o^T x.    (6.1)

In a next pass, output neuron k is selected with maximum activation

    ∀o ≠ k:  y_o ≤ y_k.    (6.2)

Activations are reset such that y_k = 1 and y_{o≠k} = 0. This is the competitive aspect of the network, and we refer to the output layer as the winner-take-all layer. The winner-take-all layer is usually implemented in software by simply selecting the output neuron with highest activation value. This function can also be performed by a neural network known as MAXNET (Lippmann, 1989). In MAXNET, all neurons o are connected to other units o′ with inhibitory links and to itself with an excitatory link:

    w_{o,o′} = { −ε   if o ≠ o′
                 +1   otherwise.    (6.3)

It can be shown that this network converges to a situation where only the neuron with highest initial activation survives, whereas the activations of all other neurons converge to zero. From now on, we will simply assume a winner k is selected without being concerned which algorithm is used.

Once the winner k has been selected, the weights are updated according to:

    w_k(t+1) = ( w_k(t) + γ (x(t) − w_k(t)) ) / ‖ w_k(t) + γ (x(t) − w_k(t)) ‖    (6.4)

where the divisor ensures that all weight vectors w are normalised. Note that only the weights of winner k are updated.

The weight update given in equation (6.4) effectively rotates the weight vector w_o towards the input vector x.


Figure 6.2: Example of clustering in 3D with normalised vectors, which all lie on the unity sphere. The three weight vectors are rotated towards the centres of gravity of the three different input clusters.

Each time an input x is presented, the weight vector closest to this input is selected and is subsequently rotated towards the input. Consequently, weight vectors are rotated towards those areas where many inputs appear: the clusters in the input. This procedure is visualised in figure 6.2.

Figure 6.3: Determining the winner in a competitive learning network. a. Three normalised vectors. b. The three vectors having the same directions as in a., but with different lengths. In a., vectors x and w1 are nearest to each other, and their dot product x^T w1 = |x||w1| cos α is larger than the dot product of x and w2. In b., however, the pattern and weight vectors are not normalised, and in this case w2 should be considered the `winner' when x is applied. However, the dot product x^T w1 is still larger than x^T w2.

Winner selection: Euclidean distance

Previously it was assumed that both inputs x and weight vectors w were normalised. Using the activation function given in equation (6.1) gives a `biologically plausible' solution. In figure 6.3 it is shown how the algorithm would fail if unnormalised vectors were to be used. Naturally one would like to accommodate the algorithm for unnormalised input data. To this end, the winning neuron k is selected with its weight vector w_k closest to the input pattern x, using the


Euclidean distance measure:

    k:  ‖w_k − x‖ ≤ ‖w_o − x‖   ∀o.    (6.5)

It is easily checked that equation (6.5) reduces to (6.1) and (6.2) if all vectors are normalised. The Euclidean distance norm is therefore a more general case of equations (6.1) and (6.2). Instead of rotating the weight vector towards the input as performed by equation (6.4), the weight update must be changed to implement a shift towards the input:

    w_k(t+1) = w_k(t) + γ (x(t) − w_k(t)).    (6.6)

Again only the weights of the winner are updated.

A point of attention in these recursive clustering techniques is the initialisation. Especially if the input vectors are drawn from a large or high-dimensional input space, it is not beyond imagination that a randomly initialised weight vector w_o will never be chosen as the winner and will thus never be moved and never be used. Therefore, it is customary to initialise weight vectors to a set of input patterns {x} drawn from the input set at random. Another, more thorough approach that avoids these and other problems in competitive learning is called leaky learning. This is implemented by expanding the weight update given in equation (6.6) with

    w_l(t+1) = w_l(t) + γ′ (x(t) − w_l(t))   ∀l ≠ k    (6.7)

with γ′ the leaky learning rate. A somewhat similar method is known as frequency sensitive competitive learning (Ahalt, Krishnamurthy, Chen, & Melton, 1990). In this algorithm, each neuron records the number of times it is selected winner. The more often it wins, the less sensitive it becomes to competition. Conversely, neurons that consistently fail to win increase their chances of being selected winner.
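
The whole procedure, winner selection by eq. (6.5) plus the updates (6.6) and (6.7), can be sketched compactly. This is an illustrative numpy sketch, not code from the text; the data set X, the number of units and the two rates are placeholders:

    import numpy as np

    def competitive_learning(X, n_units, gamma=0.1, gamma_leaky=0.001,
                             n_iter=500, rng=np.random.default_rng(0)):
        # Initialise weight vectors on randomly drawn input patterns,
        # as suggested in the text above.
        W = X[rng.choice(len(X), n_units, replace=False)].astype(float)
        for _ in range(n_iter):
            x = X[rng.integers(len(X))]
            k = np.argmin(np.linalg.norm(W - x, axis=1))   # winner, eq. (6.5)
            for l in range(n_units):
                rate = gamma if l == k else gamma_leaky    # leaky learning
                W[l] += rate * (x - W[l])                  # eqs. (6.6)/(6.7)
        return W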

Cost function

Earlier it was claimed that a competitive network performs a clustering process on the input data. I.e., input patterns are divided in disjoint clusters such that similarities between input patterns in the same cluster are much bigger than similarities between inputs in different clusters. Similarity is measured by a distance function on the input vectors, as discussed before. A common criterion to measure the quality of a given clustering is the square error criterion, given by

    E = Σ_p ‖w_k − x^p‖²    (6.8)

where k is the winning neuron when input x^p is presented. The weights w are interpreted as cluster centres. It is not difficult to show that competitive learning indeed seeks to find a minimum for this square error by following the negative gradient of the error-function:

Theorem 3 The error function for pattern x^p

    E^p = 1/2 Σ_i ( w_ki − x_i^p )²,    (6.9)

where k is the winning unit, is minimised by the weight update rule in eq. (6.6).

Proof As in eq. (3.12), we calculate the effect of a weight change on the error function. So we have that

    Δ^p w_io = −γ ∂E^p/∂w_io    (6.10)

where γ is a constant of proportionality. Now, we have to determine the partial derivative of E^p:

    ∂E^p/∂w_io = { w_io − x_i^p   if unit o wins
                   0              otherwise    (6.11)


such that

    Δ^p w_io = −γ ( w_io − x_i^p ) = γ ( x_i^p − w_io )    (6.12)

which is eq. (6.6) written down for one element of w_o.

Therefore, eq. (6.8) is minimised by repeated weight updates using eq. (6.6).

An almost identical process of moving cluster centres is used in a large family of conventional clustering algorithms known as square error clustering methods, e.g., k-means, FORGY, ISODATA, CLUSTER.

Example

In figure 6.4, 8 clusters of 6 data points each are depicted. A competitive learning network using Euclidean distance to select the winner was initialised with all weight vectors w_o = 0. The network was trained with γ = 0.1 and γ′ = 0.001 and the positions of the weights after 500 iterations are shown.

Figure 6.4: Competitive learning for clustering data. The data are given by "+". The positions of the weight vectors after 500 iterations are given by "o".

6.1.2 Vector quantisation

Another important use of competitive learning networks is found in vector quantisation. A vector quantisation scheme divides the input space in a number of disjoint subspaces and represents each input vector x by the label of the subspace it falls into (i.e., index k of the winning neuron). The difference with clustering is that we are not so much interested in finding clusters of similar data, but more in quantising the entire input space. The quantisation performed by the competitive learning network is said to `track the input probability density function': the density of neurons and thus subspaces is highest in those areas where inputs are most likely to appear, whereas a more coarse quantisation is obtained in those areas where inputs are scarce. An example of tracking the input density is sketched in figure 6.5.


Figure 6.5: This figure visualises the tracking of the input density. The input patterns are drawn from ℝ²; the weight vectors also lie in ℝ². In the areas where inputs are scarce, the upper part of the figure, only few (in this case two) neurons are used to discretise the input space. Thus, the upper part of the input space is divided into two large separate regions. The lower part, however, where many more inputs have occurred, five neurons discretise the input space into five smaller subspaces.

Vector quantisation through competitive learning results in a more fine-grained discretisation in those areas of the input space where most inputs have occurred in the past.

In this way, competitive learning can be used in applications where data has to be compressed, such as telecommunication or storage. However, competitive learning has also been used in combination with supervised learning methods, and been applied to function approximation problems or classification problems. We will describe two examples: the "counterpropagation" method and the "learning vector quantization".

Counterpropagation

In a large number of applications, networks that perform vector quantisation are combined with another type of network in order to perform function approximation.

Figure 6.6: A network combining a vector quantisation layer with a 1-layer feed-forward neural network. This network can be used to approximate functions from ℝ² to ℝ²; the input space ℝ² is discretised in 5 disjoint subspaces.


An example of such a network is given in figure 6.6. This network can approximate a function

    f: ℝ^n → ℝ^m

by associating with each neuron o a function value [w_1o, w_2o, …, w_mo]^T which is somehow representative for the function values f(x) of inputs x represented by o. This way of approximating a function effectively implements a `look-up table': an input x is assigned to a table entry k with ∀o ≠ k: ‖x − w_k‖ ≤ ‖x − w_o‖, and the function value [w_1k, w_2k, …, w_mk]^T in this table entry is taken as an approximation of f(x). A well-known example of such a network is the Counterpropagation network (Hecht-Nielsen, 1988).

Depending on the application, one can choose to perform the vector quantisation before learning the function approximation, or one can choose to learn the quantisation and the approximation layer simultaneously. As an example of the latter, the network presented in figure 6.6 can be supervisedly trained in the following way:

1. present the network with both input x and function value d = f(x);

2. perform the unsupervised quantisation step. For each weight vector, calculate the distance from its weight vector to the input pattern and find winner k. Update the weights w_ih with equation (6.6);

3. perform the supervised approximation step:

    w_ko(t+1) = w_ko(t) + γ ( d_o − w_ko(t) ).    (6.13)

This is simply the δ-rule with y_o = Σ_h y_h w_ho = w_ko when k is the winning neuron and the desired output is given by d = f(x).
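
These steps translate into a short training loop. The following is a hypothetical numpy sketch, not code from the text; the rates and iteration counts are placeholders:

    import numpy as np

    def train_counterprop(X, D, n_hidden, gamma=0.1, n_iter=2000,
                          rng=np.random.default_rng(0)):
        # X: (P, n) inputs, D: (P, m) targets d = f(x); simultaneous training
        # of the quantisation and approximation layers, steps 1-3 above.
        W_ih = X[rng.choice(len(X), n_hidden, replace=False)].astype(float)
        W_ho = np.zeros((n_hidden, D.shape[1]))
        for _ in range(n_iter):
            p = rng.integers(len(X))                         # step 1: sample (x, d)
            x, d = X[p], D[p]
            k = np.argmin(np.linalg.norm(W_ih - x, axis=1))  # step 2: find winner
            W_ih[k] += gamma * (x - W_ih[k])                 # eq. (6.6)
            W_ho[k] += gamma * (d - W_ho[k])                 # step 3: eq. (6.13)
        return W_ih, W_ho

    def predict(W_ih, W_ho, x):
        # Look-up table behaviour: output the table entry of the winner.
        return W_ho[np.argmin(np.linalg.norm(W_ih - x, axis=1))]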

If we define a function g(x, k) as:

    g(x, k) = { 1   if k is winner
                0   otherwise    (6.14)

it can be shown that this learning procedure converges to

    w_ho = ∫_{ℝ^n} y_o g(x, h) dx.    (6.15)

I.e., each table entry converges to the mean function value over all inputs in the subspace represented by that table entry. As we have seen before, the quantisation scheme tracks the input probability density function, which results in a better approximation of the function in those areas where input is most likely to appear.

Not all functions are represented accurately by this combination of quantisation and approximation layers. E.g., a simple identity or combinations of sines and cosines are much better approximated by multilayer back-propagation networks if the activation functions are chosen appropriately. However, if we expect our input to be (a subspace of) a high dimensional input space ℝ^n and we expect our function f to be discontinuous at numerous points, the combination of quantisation and approximation is not uncommon and probably very efficient. Of course this combination extends itself much further than the presented combination of the single layer competitive learning network and the single layer feed-forward network. The latter could be replaced by a reinforcement learning procedure (see chapter 7). The quantisation layer can be replaced by various other quantisation schemes, such as Kohonen networks (see section 6.2) or octree methods (Jansen, Smagt, & Groen, 1994). In fact, various modern statistical function approximation methods (CART, MARS (Breiman, Friedman, Olshen, & Stone, 1984; Friedman, 1991)) are based on this very idea, extended with the possibility to have the approximation layer influence the quantisation layer (e.g., to obtain a better or locally more fine-grained quantisation). Recent research (Rosen, Goodwin, & Vidal, 1992) also investigates in this direction.


Learning Vector Quantisation

It is an unpleasant habit in neural network literature to also cover Learning Vector Quantisation (LVQ) methods in chapters on unsupervised clustering. Granted that these methods also perform a clustering or quantisation task and use similar learning rules, they are trained supervisedly and perform discriminant analysis rather than unsupervised clustering. These networks attempt to define `decision boundaries' in the input space, given a large set of exemplary decisions (the training set); each decision could, e.g., be a correct class label.

A rather large number of slightly different LVQ methods is appearing in recent literature. They are all based on the following basic algorithm:

1. with each output neuron o, a class label (or decision of some other kind) y_o is associated;

2. a learning sample consists of input vector x^p together with its correct class label y_o^p;

3. using distance measures between weight vectors w_o and input vector x^p, not only the winner k1 is determined, but also the second best k2:

    ‖x^p − w_k1‖ < ‖x^p − w_k2‖ < ‖x^p − w_i‖   ∀i ≠ k1, k2;

4. the labels y_k1^p, y_k2^p are compared with d^p. The weight update rule given in equation (6.6) is used selectively based on this comparison.

An example of the last step is given by the LVQ2 algorithm by Kohonen (Kohonen, 1977), using the following strategy:

    if   y_k1^p ≠ d^p and d^p = y_k2^p
    and  ‖x^p − w_k2‖ − ‖x^p − w_k1‖ < ε,
    then w_k2(t+1) = w_k2(t) + γ (x − w_k2(t))
    and  w_k1(t+1) = w_k1(t) − γ (x − w_k1(t)).

I.e., w_k2 with the correct label is moved towards the input vector, while w_k1 with the incorrect label is moved away from it.
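
In code, one LVQ2 update step might look as follows. This is a sketch under assumptions not in the text: the window parameter eps and the rate gamma are placeholders, and `labels` holds the class label y_o of each codebook vector:

    import numpy as np

    def lvq2_step(W, labels, x, d, gamma=0.05, eps=0.2):
        # W: (n_units, n) codebook vectors; x: sample; d: its correct label.
        dists = np.linalg.norm(W - x, axis=1)
        k1, k2 = np.argsort(dists)[:2]           # winner and second best
        if labels[k1] != d and labels[k2] == d and dists[k2] - dists[k1] < eps:
            W[k2] += gamma * (x - W[k2])         # correct label: move towards x
            W[k1] -= gamma * (x - W[k1])         # wrong label: move away from x
        return W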

The new LVQ algorithms that are emerging all use different implementations of these different steps, e.g., how to define class labels y_o, how many `next-best' winners are to be determined, how to adapt the number of output neurons i and how to selectively use the weight update rule.

6.2 Kohonen network

The Kohonen network (Kohonen, 1982, 1984) can be seen as an extension to the competitive learning network, although this is chronologically incorrect. Also, the Kohonen network has a different set of applications.

In the Kohonen network, the output units in S are ordered in some fashion, often in a two-dimensional grid or array, although this is application-dependent. The ordering, which is chosen by the user¹, determines which output neurons are neighbours.

Now, when learning patterns are presented to the network, the weights to the output units are adapted such that the order present in the input space ℝ^N is preserved in the output, i.e., the neurons in S. This means that learning patterns which are near to each other in the input space (where `near' is determined by the distance measure used in finding the winning unit) must be mapped on output units which are also near to each other, i.e., the same or neighbouring units. Thus, if inputs are uniformly distributed in ℝ^N and the order must be preserved, the dimensionality of S must be at least N. The mapping, which represents a discretisation of the input space, is said to be topology preserving. However, if the inputs are restricted to a subspace of ℝ^N, a Kohonen network of lower dimensionality can be used. For example: data on a two-dimensional manifold in a high-dimensional input space can be mapped onto a two-dimensional Kohonen network, which can for example be used for visualisation of the data.

¹ Of course, variants have been designed which automatically generate the structure of the network (Martinetz & Schulten, 1991; Fritzke, 1991).

Usually, the learning patterns are random samples from ℝ^N. At time t, a sample x(t) is generated and presented to the network. Using the same formulas as in section 6.1, the winning unit k is determined. Next, the weights to this winning unit as well as its neighbours are adapted using the learning rule

    w_o(t+1) = w_o(t) + γ g(o, k) ( x(t) − w_o(t) )   ∀o ∈ S.    (6.16)

Here, g(o, k) is a decreasing function of the grid-distance between units o and k, such that g(k, k) = 1. For example, for g() a Gaussian function can be used, such that (in one dimension!) g(o, k) = exp(−(o − k)²) (see figure 6.7).

Figure 6.7: Gaussian neuron distance function g(). In this case, g() is shown for a two-dimensional grid because it looks nice.

Due to this collective learning scheme, input signals which are near to each other will be mapped on neighbouring neurons. Thus the topology inherently present in the input signals will be preserved in the mapping, such as depicted in figure 6.8.

Figure 6.8: A topology-conserving map converging. The weight vectors of a network with two inputs and 8 × 8 output neurons arranged in a planar grid are shown, at iterations 0, 200, 600, and 1900. A line in each figure connects weight w_{i,(o1,o2)} with weights w_{i,(o1+1,o2)} and w_{i,(o1,o2+1)}. The leftmost figure shows the initial weights; the rightmost when the map is almost completely formed.
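
The learning rule (6.16) on a two-dimensional grid of output units is sketched below. This is an illustrative numpy sketch, not code from the text; the Gaussian width sigma is an assumed constant, whereas in practice both gamma and the neighbourhood width are usually decreased over time:

    import numpy as np

    def train_som(X, grid_w, grid_h, gamma=0.1, sigma=1.5, n_iter=2000,
                  rng=np.random.default_rng(0)):
        # Output units S are arranged on a grid_w x grid_h planar grid;
        # g(o, k) = exp(-grid_dist(o, k)^2 / sigma^2), so g(k, k) = 1.
        coords = np.array([(i, j) for i in range(grid_w) for j in range(grid_h)])
        W = rng.random((grid_w * grid_h, X.shape[1]))
        for _ in range(n_iter):
            x = X[rng.integers(len(X))]
            k = np.argmin(np.linalg.norm(W - x, axis=1))        # winning unit
            g = np.exp(-np.sum((coords - coords[k])**2, axis=1) / sigma**2)
            W += gamma * g[:, None] * (x - W)                   # eq. (6.16)
        return W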

If the intrinsic dimensionality of S is less than N, the neurons in the network are `folded' in the input space, such as depicted in figure 6.9.


Figure 6.9: The mapping of a two-dimensional input space on a one-dimensional Kohonen network.

The topology-conserving quality of this network has many counterparts in biological brains. The brain is organised in many places so that aspects of the sensory environment are represented in the form of two-dimensional maps. For example, in the visual system, there are several topographic mappings of visual space onto the surface of the visual cortex. There are organised mappings of the body surface onto the cortex in both motor and somatosensory areas, and tonotopic mappings of frequency in the auditory cortex. The use of topographic representations, where some important aspect of a sensory modality is related to the physical locations of the cells on a surface, is so common that it obviously serves an important information processing function.

It does not come as a surprise, therefore, that many applications of the Kohonen topology-conserving maps have already been devised. Kohonen himself has successfully used the network for phoneme recognition (Kohonen, Makisara, & Saramaki, 1984). Also, the network has been used to merge sensory data from different kinds of sensors, such as auditory and visual, `looking' at the same scene (Gielen, Krommenhoek, & Gisbergen, 1991). Yet another application is in robotics, such as shown in section 8.1.1.

To explain the plausibility of a similar structure in biological networks, Kohonen remarks that the lateral inhibition between the neurons could be obtained via efferent connections between those neurons. In one dimension, those connection strengths form a `Mexican hat' (see figure 6.10).

Figure 6.10: Mexican hat. Lateral interaction around the winning neuron as a function of distance: excitation to nearby neurons, inhibition to farther off neurons.

6.3 Principal component networks

6.3.1 Introduction

The networks presented in the previous sections can be seen as (nonlinear) vector transformations which map an input vector to a number of binary output elements or neurons. The weights are adjusted in such a way that they could be considered as prototype vectors (vectorial means) for the input patterns for which the competing neuron wins.

The self-organising transform described in this section rotates the input space in such a way that the values of the output neurons are as uncorrelated as possible and the energy or variances of the patterns is mainly concentrated in a few output neurons.


Figure 6.11: Distribution of input samples.

An example is shown in figure 6.11. The two-dimensional samples (x1, x2) are plotted in the figure. It can be easily seen that x1 and x2 are related, such that if we know x1 we can make a reasonable prediction of x2 and vice versa since the points are centered around the line x1 = x2. If we rotate the axes over π/4 we get the (e1, e2) axes as plotted in the figure. Here the conditional prediction has no use because the points have uncorrelated coordinates. Another property of this rotation is that the variance or energy of the transformed patterns is maximised on a lower dimension. This can be intuitively verified by comparing the spreads (d_x1, d_x2) and (d_e1, d_e2) in the figures. After the rotation, the variance of the samples is large along the e1 axis and small along the e2 axis.

This transform is very closely related to the eigenvector transformation known from image processing where the image has to be coded or transformed to a lower dimension and reconstructed again by another transform as well as possible (see section 9.3.2).

The next section describes a learning rule which acts as a Hebbian learning rule, but which scales the vector length to unity. In the subsequent section we will see that a linear neuron with a normalised Hebbian learning rule acts as such a transform, extending the theory in the last section to multi-dimensional outputs.

6.3.2 Normalised Hebbian rule

The model considered here consists of one linear(!) neuron with input weights w. The output y_o(t) of this neuron is given by the usual inner product of its weight w and the input vector x:

    y_o(t) = w(t)^T x(t).    (6.17)

As seen in the previous sections, all models are based on a kind of Hebbian learning. However, the basic Hebbian rule would make the weights grow uninhibitedly if there were correlation in the input patterns. This can be overcome by normalising the weight vector to a fixed length, typically 1, which leads to the following learning rule

    w(t+1) = ( w(t) + γ y(t) x(t) ) / L( w(t) + γ y(t) x(t) )    (6.18)

where L() indicates an operator which returns the vector length, and γ is a small learning parameter. Compare this learning rule with the normalised learning rule of competitive learning. There the delta rule was normalised, here the standard Hebb rule is.


Now the operator which computes the vector length, the norm of the vector, can be approximated by a Taylor expansion around γ = 0:

    L( w(t) + γ y(t) x(t) ) = 1 + γ ∂L/∂γ |_{γ=0} + O(γ²).    (6.19)

When we substitute this expression for the vector length in equation (6.18), it resolves for small γ to²

    w(t+1) = ( w(t) + γ y(t) x(t) ) ( 1 − γ ∂L/∂γ |_{γ=0} + O(γ²) ).    (6.20)

Since ∂L/∂γ |_{γ=0} = y(t)², discarding the higher order terms of γ leads to

    w(t+1) = w(t) + γ y(t) ( x(t) − y(t) w(t) )    (6.21)

which is called the `Oja learning rule' (Oja, 1982). This learning rule thus modifies the weight in the usual Hebbian sense, the first product term being the Hebb rule y_o(t) x(t), but normalises its weight vector directly by the second product term y_o(t) y_o(t) w(t). What exactly does this learning rule do with the weight vector?
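
Before answering that question analytically, the rule is easy to try out numerically. A minimal sketch, assuming zero-mean data X and an illustrative learning rate; it is not code from the text:

    import numpy as np

    def oja(X, gamma=0.01, n_epochs=50, rng=np.random.default_rng(0)):
        # Oja learning rule, eq. (6.21), on zero-mean data X of shape (P, N).
        w = rng.normal(size=X.shape[1])
        for _ in range(n_epochs):
            for x in X[rng.permutation(len(X))]:
                y = w @ x                        # linear neuron, eq. (6.17)
                w += gamma * y * (x - y * w)     # Hebb term minus decay term
        return w

The resulting w has length close to 1 and, as the next section shows, points along the principal eigenvector of the correlation matrix of the data.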

6.3.3 Principal component extractor

Remember probability theory? Consider an N-dimensional signal x(t) with

    mean μ = E(x(t));

    correlation matrix R = E( (x(t) − μ)(x(t) − μ)^T ).

In the following we assume the signal mean to be zero, so μ = 0.

From equation (6.21) we see that the expectation of the weights for the Oja learning rule equals

    E( w(t+1) | w(t) ) = w(t) + γ ( R w(t) − ( w(t)^T R w(t) ) w(t) )    (6.22)

which has a continuous counterpart

    (d/dt) w(t) = R w(t) − ( w(t)^T R w(t) ) w(t).    (6.23)

Theorem 1 Let the eigenvectors e_i of R be ordered with descending associated eigenvalues λ_i such that λ_1 > λ_2 > … > λ_N. With equation (6.23) the weights w(t) will converge to e_1.

Proof 1 Since the eigenvectors of R span the N-dimensional space, the weight vector can be decomposed as

    w(t) = Σ_{i=1}^{N} α_i(t) e_i.    (6.24)

Substituting this in the differential equation and concluding the theorem is left as an exercise.

² Remembering that 1/(1 + aγ) = 1 − aγ + O(γ²).


6.3.4 More eigenvectors

In the previous section it was shown that a single neuron's weight converges to the eigenvector of the correlation matrix with maximum eigenvalue, i.e., the weight of the neuron is directed in the direction of highest energy or variance of the input patterns. Here we tackle the question of how to find the remaining eigenvectors of the correlation matrix given the first found eigenvector.

Consider the signal x which can be decomposed into the basis of eigenvectors e_i of its correlation matrix R,

    x = Σ_{i=1}^{N} α_i e_i.    (6.25)

If we now subtract the component in the direction of e_1, the direction in which the signal has the most energy, from the signal x,

    x̃ = x − α_1 e_1,    (6.26)

we are sure that when we again decompose x̃ into the eigenvector basis, the coefficient α_1 = 0, simply because we just subtracted it. We call x̃ the deflation of x.

If now a second neuron is taught on this signal x̃, then its weights will lie in the direction of the remaining eigenvector with the highest eigenvalue. Since the deflation removed the component in the direction of the first eigenvector, the weight will converge to the remaining eigenvector with maximum eigenvalue. In the previous section we ordered the eigenvalues in magnitude, so according to this definition in the limit we will find e_2. We can continue this strategy and find all the N eigenvectors belonging to the signal x.

We can write the deflation in neural network terms if we see that

    y_o = w^T x = e_1^T Σ_{i=1}^{N} α_i e_i = α_1    (6.27)

since

    w = e_1.    (6.28)

So that the deflated vector x̃ equals

    x̃ = x − y_o w.    (6.29)

The term subtracted from the input vector can be interpreted as a kind of a back-projection or expectation. Compare this to ART described in the next section.
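
In code, this deflation scheme amounts to repeatedly training an Oja neuron and subtracting the found direction from every sample. The following sketch reuses the oja function from the previous snippet and is illustrative only:

    def principal_components(X, n_components):
        # Find successive eigenvectors by deflation, eq. (6.29): after each
        # neuron converges, remove its direction y_o * w from every sample.
        ws = []
        X_defl = np.array(X, dtype=float)
        for _ in range(n_components):
            w = oja(X_defl)
            ws.append(w)
            X_defl = X_defl - np.outer(X_defl @ w, w)   # x~ = x - y_o w
        return np.array(ws)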

6.4 Adaptive resonance theory

The last unsupervised learning network we discuss differs from the previous networks in that it is recurrent; as with networks in the next chapter, the data is not only fed forward but also back from output to input units.

6.4.1 Background: Adaptive resonance theory

In 1976, Grossberg (Grossberg, 1976) introduced a model for explaining biological phenomena. The model has three crucial properties:

1. a normalisation of the total network activity. Biological systems are usually very adaptive to large changes in their environment. For example, the human eye can adapt itself to large variations in light intensities;

2. contrast enhancement of input patterns. The awareness of subtle differences in input patterns can mean a lot in terms of survival. Distinguishing a hiding panther from a resting one makes all the difference in the world. The mechanism used here is contrast enhancement;


3. short-term memory (STM) storage of the contrast-enhanced pattern. Before the input pattern can be decoded, it must be stored in the short-term memory. The long-term memory (LTM) implements an arousal mechanism (i.e., the classification), whereas the STM is used to cause gradual changes in the LTM.

The system consists of two layers, F1 and F2, which are connected to each other via the LTM (see figure 6.12). The input pattern is received at F1, whereas classification takes place in F2. As mentioned before, the input is not directly classified.

Figure 6.12: The ART architecture.

First a characterisation takes place by means of extracting features, giving rise to activation in the feature representation field. The expectations, residing in the LTM connections, translate the input pattern to a categorisation in the category representation field. The classification is compared to the expectation of the network, which resides in the LTM weights from F2 to F1. If there is a match, the expectations are strengthened, otherwise the classification is rejected.

6.4.2 ART1: The simplified neural network model

The ART1 simplified model consists of two layers of binary neurons (with values 1 and 0), called F1 (the comparison layer) and F2 (the recognition layer) (see figure 6.13). Each neuron in F1 is connected to all neurons in F2 via the continuous-valued forward long term memory (LTM) W^f, and vice versa via the binary-valued backward LTM W^b. The other modules are gain 1 and 2 (G1 and G2), and a reset module.

Each neuron in the comparison layer receives three inputs: a component of the input pattern, a component of the feedback pattern, and a gain G1. A neuron outputs a 1 if and only if at least two of these three inputs are high: the `two-thirds rule.'

The neurons in the recognition layer each compute the inner product of their incoming (continuous-valued) weights and the pattern sent over these connections. The winning neuron then inhibits all the other neurons via lateral inhibition.

Gain 2 is the logical `or' of all the elements in the input pattern x.

Gain 1 equals gain 2, except when the feedback pattern from F2 contains any 1; then it is forced to zero.

Finally, the reset signal is sent to the active neuron in F2 if the input vector x and the output of F1 differ by more than some vigilance level.

Operation

The network starts by clamping the input at F1. Because the output of F2 is zero, G1 and G2 are both on and the output of F1 matches its input.


Figure 6.13: The ART1 neural network.

The pattern is sent to F2, and in F2 one neuron becomes active. This signal is then sent back over the backward LTM, which reproduces a binary pattern at F1. Gain 1 is inhibited, and only the neurons in F1 which receive a `one' from both x and F2 remain active.

If there is a substantial mismatch between the two patterns, the reset signal will inhibit the neuron in F2 and the process is repeated.

Instead of following Carpenter and Grossberg's description of the system using differential equations, we use the notation employed by Lippmann (Lippmann, 1987):

1. Initialisation:

       w_ji^b(0) = 1,
       w_ij^f(0) = 1 / (1 + N),

   where N is the number of neurons in F1, M the number of neurons in F2, 0 ≤ i < N, and 0 ≤ j < M. Also, choose the vigilance threshold ρ, 0 ≤ ρ ≤ 1;

2. Apply the new input pattern x;

3. compute the activation values y′ of the neurons in F2:

       y_j′ = Σ_{i=1}^{N} w_ij^f(t) x_i;    (6.30)

4. select the winning neuron k (0 ≤ k < M);

5. vigilance test: if

       ( w_k^b(t) · x ) / ( x · x ) > ρ,    (6.31)

   where · denotes inner product, go to step 7, else go to step 6. Note that w_k^b · x essentially is the inner product of the stored pattern and x, which will be large if the two are near to each other;

6. neuron k is disabled from further activity. Go to step 3;

7. set for all l, 0 ≤ l < N:

       w_kl^b(t+1) = w_kl^b(t) x_l,
       w_lk^f(t+1) = w_kl^b(t) x_l / ( 1/2 + Σ_{i=1}^{N} w_ki^b(t) x_i );


8. re-enable all neurons in F2 and go to step 2.
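
The eight steps translate almost directly into code. The following is a sketch under assumptions not in the text: 0/1 input vectors, a fixed number M of output units, and a list of assigned classes as return value:

    import numpy as np

    def art1(patterns, M, rho=0.7):
        P, N = np.array(patterns).shape
        Wb = np.ones((M, N))                             # backward LTM, step 1
        Wf = np.ones((M, N)) / (1.0 + N)                 # forward LTM, step 1
        classes = []
        for x in np.array(patterns, dtype=float):        # step 2: next pattern
            enabled = np.ones(M, dtype=bool)
            while True:
                y = np.where(enabled, Wf @ x, -np.inf)   # step 3
                k = int(np.argmax(y))                    # step 4: winner
                if (Wb[k] @ x) / max(x @ x, 1.0) > rho:  # step 5: vigilance test
                    Wb[k] = Wb[k] * x                    # step 7
                    Wf[k] = Wb[k] / (0.5 + Wb[k] @ x)
                    classes.append(k)
                    break                                # step 8: next pattern
                enabled[k] = False                       # step 6: disable neuron k
                if not enabled.any():
                    classes.append(-1)                   # no unit passed the test
                    break
        return Wb, classes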

Figure 6.14 shows exemplar behaviour of the network.

Figure 6.14: An example of the behaviour of the Carpenter Grossberg network for letter patterns. The binary input patterns on the left were applied sequentially. On the right the stored patterns (i.e., the weights of W^b for the first four output units) are shown.

6.4.3 ART1: The original model

In later work, Carpenter and Grossberg (Carpenter & Grossberg, 1987a, 1987b) present several neural network models to incorporate parts of the complete theory. We will only discuss the first model, ART1.

The network incorporates a follow-the-leader clustering algorithm (Hartigan, 1975). This algorithm tries to fit each new input pattern in an existing class. If no matching class can be found, i.e., the distance between the new pattern and all existing classes exceeds some threshold, a new class is created containing the new pattern.

The novelty in this approach is that the network is able to adapt to new incoming patterns, while the previous memory is not corrupted. In most neural networks, such as the back-propagation network, all patterns must be taught sequentially; the teaching of a new pattern might corrupt the weights for all previously learned patterns. By changing the structure of the network rather than the weights, ART1 overcomes this problem.

Normalisation

We will refer to a cell in F1 or F2 with k. Each cell k in F1 or F2 receives an input s_k and responds with an activation level y_k. In order to introduce normalisation in the model, we set I = Σ s_k and let the relative input intensity Θ_k = s_k I⁻¹.

So we have a model in which the change of the response y_k of an input at a certain cell k:

depends inhibitorily on all other inputs and the sensitivity of the cell, i.e., the surroundings of each cell have a negative influence on the cell: −y_k Σ_{l≠k} s_l;


has an excitatory response as far as the input at the cell is concerned: +B s_k;

has an inhibitory response for normalisation: −y_k s_k;

has a decay: −A y_k.

Here, A and B are constants. The differential equation for the neurons in F1 and F2 now is

    dy_k/dt = −A y_k + (B − y_k) s_k − y_k Σ_{l≠k} s_l,    (6.32)

with 0 ≤ y_k(0) ≤ B because the inhibitory effect of an input can never exceed the excitatory input.

At equilibrium, when dy_k/dt = 0, and with I = Σ s_k we have that

    y_k (A + I) = B s_k.    (6.33)

Because of the definition of Θ_k = s_k I⁻¹ we get

    y_k = Θ_k B I / (A + I).    (6.34)

Therefore, at equilibrium y_k is proportional to Θ_k, and, since

    B I / (A + I) ≤ B,    (6.35)

the total activity y_total = Σ y_k never exceeds B: it is normalised.

Contrast enhancement

In order to make F2 react better on differences in neuron values in F1 (or vice versa), contrast enhancement is applied: the contrasts between the neuronal values in a layer are amplified. We can show that eq. (6.32) does not suffice anymore. In order to enhance the contrasts, we chop off all the equal fractions (uniform parts) in F1 or F2. This can be done by adding an extra inhibitory input proportional to the inputs from the other cells with a factor C:

    dy_k/dt = −A y_k + (B − y_k) s_k − (y_k + C) Σ_{l≠k} s_l.    (6.36)

At equilibrium, when we set B = (n − 1)C where n is the number of neurons, we have

    y_k = ( nCI / (A + I) ) ( Θ_k − 1/n ).    (6.37)

Now, when an input in which all the s_k are equal is given, then all the y_k are zero: the effect of C is enhancing differences. If we set B ≤ (n − 1)C or C/(B + C) ≥ 1/n, then more of the input shall be chopped off.

Discussion

The description of ART1 continues by defining the differential equations for the LTM. Instead of following Carpenter and Grossberg's description, we will revert to the simplified model as presented by Lippmann (Lippmann, 1987).


7 Reinforcement learning

In the previous chapters a number of supervised training methods have been described in which the weight adjustments are calculated using a set of `learning samples', consisting of input and desired output values. However, such a set of learning examples is not always available. Often the only information is a scalar evaluation r which indicates how well the neural network is performing. Reinforcement learning involves two subproblems. The first is that the `reinforcement' signal r is often delayed since it is a result of network outputs in the past. This temporal credit assignment problem is solved by learning a `critic' network which represents a cost function J predicting future reinforcement. The second problem is to find a learning procedure which adapts the weights of the neural network such that a mapping is established which minimizes J. The two problems are discussed in the next paragraphs, respectively. Figure 7.1 shows a reinforcement-learning network interacting with a system.

7.1 The critic

The first problem is how to construct a critic which is able to evaluate system performance. If the objective of the network is to minimize a directly measurable quantity r, performance feedback is straightforward and a critic is not required. On the other hand, how is current behavior to be evaluated if the objective concerns future system performance? The performance may for instance be measured by the cumulative or future error. Most reinforcement learning methods (such as Barto, Sutton and Anderson (Barto, Sutton, & Anderson, 1983)) use the temporal difference (TD) algorithm (Sutton, 1988) to train the critic.

Suppose the immediate cost of the system at time step k is measured by r(x_k, u_k, k), as a function of system states x_k and control actions (network outputs) u_k. The immediate measure r is often called the external reinforcement signal, in contrast to the internal reinforcement signal in figure 7.1.

Figure 7.1: Reinforcement learning scheme.


Define the performance measure J(x_k, u_k, k) of the system as a discounted cumulative of future cost. The task of the critic is to predict the performance measure:

    J(x_k, u_k, k) = Σ_{i=k}^{∞} γ^{i−k} r(x_i, u_i, i)    (7.1)

in which γ ∈ [0, 1] is a discount factor (usually 0.95).

The relation between two successive predictions can easily be derived:

    J(x_k, u_k, k) = r(x_k, u_k, k) + γ J(x_{k+1}, u_{k+1}, k+1).    (7.2)

If the network is correctly trained, the relation between two successive network outputs J should be:

    J(x_k, u_k, k) = r(x_k, u_k, k) + γ J(x_{k+1}, u_{k+1}, k+1).    (7.3)

If the network is not correctly trained, the temporal difference ε(k) between two successive predictions is used to adapt the critic network:

    ε(k) = [ r(x_k, u_k, k) + γ J(x_{k+1}, u_{k+1}, k+1) ] − J(x_k, u_k, k).    (7.4)

A learning rule for the weights of the critic network w_c(k), based on minimizing ε²(k), can be derived:

    Δw_c(k) = α ε(k) ∂J(x_k, u_k, k)/∂w_c(k)    (7.5)

in which α is the learning rate.
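
For a critic that is linear in some feature vector φ(x), so that J = w_c · φ(x) and ∂J/∂w_c = φ(x), the updates (7.4) and (7.5) reduce to a few lines. The linear parameterisation and the constants below are assumptions for illustration, not part of the text:

    import numpy as np

    def critic_update(w_c, phi, r, phi_next, gamma=0.95, alpha=0.1):
        # phi, phi_next: feature vectors at successive time steps; r: immediate cost.
        J = w_c @ phi
        J_next = w_c @ phi_next
        eps = (r + gamma * J_next) - J          # temporal difference, eq. (7.4)
        return w_c + alpha * eps * phi          # eq. (7.5), with dJ/dw_c = phi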

7.2 The controller network

If the critic is capable of providing an immediate evaluation of performance, the controller network can be adapted such that the optimal relation between system states and control actions is found. Three approaches are distinguished:

1. In case of a nite set of actions U , all actions may virtually be executed. The action whichdecreases the performance criterion most is selected:

u_k = \arg\min_{u \in U} \hat{J}(x_k, u, k)    (7.6)

The RL-method with this `controller' is called Q-learning (Watkins & Dayan, 1992). The method approximates dynamic programming, which will be discussed in the next section.

2. If the performance measure J(x_k, u_k, k) is accurately predicted, then the gradient with respect to the controller command u_k can be calculated, assuming that the critic network is differentiable. If the measure is to be minimized, the weights of the controller w_r are adjusted in the direction of the negative gradient:

\Delta w_r(k) = -\alpha\, \frac{\partial \hat{J}(x_k, u_k, k)}{\partial u(k)}\, \frac{\partial u(k)}{\partial w_r(k)}    (7.7)

with \alpha being the learning rate. Werbos (Werbos, 1992) has discussed some of these gradient based algorithms in detail. Sofge and White (Sofge & White, 1992) applied one of the gradient based methods to optimize a manufacturing process.


3. A direct approach to adapt the controller is to use the difference between the predicted and the `true' performance measure as expressed in equation (7.3). Suppose that the performance measure is to be minimized. If a control action results in a negative difference, i.e., the true performance is better than was expected, the controller has to be `rewarded'. On the other hand, in case of a positive difference, the control action has to be `penalised'. The idea is to explore the set of possible actions during learning and incorporate the beneficial ones into the controller. Learning in this way is related to trial-and-error learning studied by psychologists, in which behaviour is selected according to its consequences.

Generally, the algorithms probabilistically select actions from a set of possible actions and update the action probabilities on the basis of the evaluation feedback, as in the sketch below. Most of the algorithms are based on a look-up table representation of the mapping from system states to actions (Barto et al., 1983). Each table entry has to learn which control action is best when that entry is accessed. It may also be possible to use a parametric mapping from system states to action probabilities. Gullapalli (Gullapalli, 1990) adapted the weights of a single layer network. In the next section the approach of Barto et al. is described.
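The following toy sketch illustrates this third approach with a look-up table of action probabilities; the particular probability-update rule and all names are illustrative assumptions, not Barto et al.'s exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(probs, state):
    """Probabilistically select an action for `state` from its table row."""
    return rng.choice(probs.shape[1], p=probs[state])

def reinforce(probs, state, action, r_hat, eta=0.1):
    """Shift probability mass toward an action that received positive
    internal reinforcement r_hat, and away from a penalised one."""
    p = probs[state].copy()
    p[action] += eta * r_hat * (1.0 - p[action])
    p = np.clip(p, 1e-3, None)          # keep exploring all actions
    probs[state] = p / p.sum()          # renormalise to a distribution
    return probs

# Usage: probs = np.full((n_states, n_actions), 1.0 / n_actions)
```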

7.3 Barto's approach: the ASE-ACE combination

Barto, Sutton and Anderson (Barto et al., 1983) have formulated `reinforcement learning' as a learning strategy which does not need a set of examples provided by a `teacher.' The system described by Barto explores the space of alternative input-output mappings and uses an evaluative feedback (reinforcement signal) on the consequences of the control signal (network output) on the environment. It has been shown that such reinforcement learning algorithms are implementing an on-line, incremental approximation to the dynamic programming method for optimal control, and are also called `heuristic' dynamic programming (Werbos, 1990).

The basic building blocks in the Barto network are an Associative Search Element (ASE), which uses a stochastic method to determine the correct relation between input and output, and an Adaptive Critic Element (ACE), which learns to give a correct prediction of future reward or punishment (figure 7.2). The external reinforcement signal r can be generated by a special sensor (for example a collision sensor of a mobile robot) or be derived from the state vector. For example, in control applications, where the state s of a system should remain in a certain part A of the control space, reinforcement is given by:

r = \begin{cases} 0 & \text{if } s \in A, \\ -1 & \text{otherwise.} \end{cases}    (7.8)

7.3.1 Associative search

In its most elementary form the ASE gives a binary output value y_o(t) \in \{0, 1\} as a stochastic function of an input vector. The total input of the ASE is, similar to the neuron presented in chapter 2, the weighted sum of the inputs, with the exception that the bias input in this case is a stochastic variable N with a zero-mean normal distribution:

s(t) = \sum_{j=1}^{N} w_{Sj}\, x_j(t) + N.    (7.9)

The activation function F is a threshold such that

y_o(t) = \begin{cases} 1 & \text{if } s(t) > 0, \\ 0 & \text{otherwise.} \end{cases}    (7.10)


Figure 7.2: Architecture of a reinforcement learning scheme with critic element.

For updating the weights, a Hebbian type of learning rule is used. However, the update is weighted with the reinforcement signal r(t), and an `eligibility' e_j is defined instead of the product y_o(t) x_j(t) of input and output:

w_{Sj}(t+1) = w_{Sj}(t) + \alpha\, r(t)\, e_j(t),    (7.11)

where \alpha is a learning factor. The eligibility e_j is given by

e_j(t+1) = \delta e_j(t) + (1 - \delta)\, y_o(t)\, x_j(t),    (7.12)

with \delta the decay rate of the eligibility. The eligibility is a sort of `memory'; e_j is high if the signals from the input state unit j and the output unit are correlated over some time.

Using r(t) in expression (7.11) has the disadvantage that learning only takes place when there is an external reinforcement signal. Instead of r(t), usually a continuous internal reinforcement signal \hat{r}(t), given by the ACE, is used.
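A sketch of one ASE time step may clarify the interplay of noise, eligibility and reinforcement; the noise amplitude and learning constants below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def ase_step(w_s, x, e, r_hat, alpha=0.5, delta=0.9, sigma=0.01):
    """One step of the Associative Search Element, eqs. (7.9)-(7.12).

    x is the binary decoder output, w_s the ASE weights, e the
    eligibility traces and r_hat the (internal) reinforcement signal.
    """
    s = w_s @ x + rng.normal(0.0, sigma)   # weighted sum plus noise, eq. (7.9)
    y = 1.0 if s > 0 else 0.0              # threshold output, eq. (7.10)
    w_s = w_s + alpha * r_hat * e          # reinforcement-weighted Hebb rule, eq. (7.11)
    e = delta * e + (1 - delta) * y * x    # eligibility decay and update, eq. (7.12)
    return w_s, e, y
```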

Barto and Anandan (Barto & Anandan, 1985) proved convergence for the case of a single binary output unit and a set of linearly independent patterns x^p. In control applications, the input vector is the (n-dimensional) state vector s of the system. In order to obtain a linearly independent set of patterns x^p, often a `decoder' is used, which divides the range of each of the input variables s_i into a number of intervals. The aim is to divide the input (state) space into a number of disjunct subspaces (or `boxes', as called by Barto). The input vector can therefore only be in one subspace at a time. The decoder converts the input vector into a binary valued vector x, with only one element equal to one, indicating which subspace is currently visited. It has been shown (Krose & Dam, 1992) that instead of an a-priori quantisation of the input space, a self-organising quantisation, based on methods described in this chapter, results in a better performance.

7.3.2 Adaptive critic

The Adaptive Critic Element (ACE, or `evaluation network') is basically the same as described in section 7.1. An error signal is derived from the temporal difference of two successive predictions (in this case denoted by p) and is used for training the ACE:

\hat{r}(t) = r(t) + \gamma p(t) - p(t-1).    (7.13)


p(t) is implemented as a series of `weights' wCj to the ACE such that

p(t) = w_{Ck}    (7.14)

if the system is in state k at time t, denoted by x_k = 1. The function is learned by adjusting the w_{Cj}'s according to a `delta-rule' with an error signal given by \hat{r}(t):

\Delta w_{Cj}(t) = \beta\, \hat{r}(t)\, h_j(t).    (7.15)

\beta is the learning parameter and h_j(t) indicates the `trace' of neuron x_j:

h_j(t) = \lambda h_j(t-1) + (1 - \lambda)\, x_j(t-1).    (7.16)

This trace is a low-pass filter or momentum, through which the credit assigned to state j increases while state j is active and decays exponentially after the activity of j has expired.

If \hat{r}(t) is positive, the action u of the system has resulted in a higher evaluation value, whereas a negative \hat{r}(t) indicates a deterioration of the system. \hat{r}(t) can be considered as an internal reinforcement signal.
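The ACE side can be sketched in the same style; the parameter values are again assumed:

```python
import numpy as np

def ace_step(w_c, h, x_prev, x, r, gamma=0.95, beta=0.5, lam=0.8):
    """One step of the Adaptive Critic Element, eqs. (7.13)-(7.16).

    x is the one-element-active state vector from the decoder, so the
    prediction p(t) = w_c . x picks out w_Ck for the active state k.
    """
    r_hat = r + gamma * (w_c @ x) - (w_c @ x_prev)  # internal reinforcement, eq. (7.13)
    h = lam * h + (1 - lam) * x_prev                # state traces, eq. (7.16)
    w_c = w_c + beta * r_hat * h                    # delta-rule update, eq. (7.15)
    return w_c, h, r_hat
```

The returned r_hat is the signal that replaces r(t) in the ASE update of eq. (7.11).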

7.3.3 The cart-pole system

An example of such a system is the cart-pole balancing system (see figure 7.3). Here, a dynamics controller must control the cart in such a way that the pole always stands up straight. The controller applies a `left' or `right' force F of fixed magnitude to the cart, which may change direction at discrete time intervals. The model has four state variables:

x, the position of the cart on the track;

\theta, the angle of the pole with the vertical;

\dot{x}, the cart velocity; and

\dot{\theta}, the angle velocity of the pole.

Furthermore, a set of parameters specify the pole length and mass, the cart mass, the coefficients of friction between the cart and the track and at the hinge between the pole and the cart, the control force magnitude, and the force due to gravity. The state space is partitioned on the basis of the following quantisation thresholds:

1. x: \pm 0.8, \pm 2.4 m;

2. \theta: 0, \pm 1, \pm 6, \pm 12 degrees;

3. \dot{x}: \pm 0.5, \pm\infty m/s;

4. \dot{\theta}: \pm 50, \pm\infty degrees/s.

This yields 3 \times 6 \times 3 \times 3 = 162 regions corresponding to all of the combinations of the intervals. The decoder output is a 162-dimensional vector. A negative reinforcement signal is provided when the state vector gets out of the admissible range: when x > 2.4, x < -2.4, \theta > 12^\circ or \theta < -12^\circ. The system has proved to solve the problem in about 75 learning steps.
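A sketch of such a decoder; the interior thresholds follow the list above, while the limits of \pm 2.4 m and \pm 12 degrees are the failure boundaries that trigger the negative reinforcement rather than extra boxes (bin ordering is an assumption of this example):

```python
import numpy as np

# Interior quantisation thresholds (angles in degrees).
X_BINS     = [-0.8, 0.8]            # cart position:  3 regions
THETA_BINS = [-6, -1, 0, 1, 6]      # pole angle:     6 regions
XDOT_BINS  = [-0.5, 0.5]            # cart velocity:  3 regions
TDOT_BINS  = [-50, 50]              # angle velocity: 3 regions

def decode(x, theta, x_dot, theta_dot):
    """Return the 162-dimensional binary vector with a single 1 in the
    'box' (3*6*3*3 = 162) that contains the current state."""
    i = np.digitize(x, X_BINS)
    j = np.digitize(theta, THETA_BINS)
    k = np.digitize(x_dot, XDOT_BINS)
    l = np.digitize(theta_dot, TDOT_BINS)
    v = np.zeros(162)
    v[((i * 6 + j) * 3 + k) * 3 + l] = 1.0
    return v
```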


Figure 7.3: The cart-pole system.

7.4 Reinforcement learning versus optimal control

The objective of optimal control is to generate control actions in order to optimize a predefined performance measure. One technique to find such a sequence of control actions, which define an optimal control policy, is Dynamic Programming (DP). The method is based on the principle of optimality, formulated by Bellman (Bellman, 1957): Whatever the initial system state, if the first control action is contained in an optimal control policy, then the remaining control actions must constitute an optimal control policy for the problem with as initial system state the state remaining from the first control action. The `Bellman equations' follow directly from the principle of optimality. Solving the equations backwards in time is called dynamic programming.

Assume that a performance measure J(x_k, u_k, k) = \sum_{i=k}^{N} r(x_i, u_i, i), with r being the immediate costs, is to be minimized. The minimum costs J_{min} of cost J can be derived by the Bellman equations of DP. The equations for the discrete case are (White & Jordan, 1992):

J_{min}(x_k, u_k, k) = \min_{u \in U} \left[ J_{min}(x_{k+1}, u_{k+1}, k+1) + r(x_k, u_k, k) \right],    (7.17)

J_{min}(x_N) = r(x_N).    (7.18)

The strategy for finding the optimal control actions is solving equations (7.17) and (7.18), from which u_k can be derived. This can be achieved backwards, starting at state x_N. The requirements are a bounded N, and a model which is assumed to be an exact representation of the system and the environment. The model has to provide the relation between successive system states resulting from system dynamics, control actions and disturbances. In practice, a solution can be derived only for a small N and simple systems. In order to deal with large or infinite N, the performance measure could be defined as a discounted sum of future costs as expressed by equation 7.2.

Reinforcement learning provides a solution for the problem stated above without the use of a model of the system and environment. RL is therefore often called a `heuristic' dynamic programming technique (Barto, Sutton, & Watkins, 1990; Sutton, Barto, & Wilson, 1992; Werbos, 1992). The RL-technique most directly related to DP is Q-learning (Watkins & Dayan, 1992). The basic idea in Q-learning is to estimate a function, Q, of states and actions, where Q is the minimum discounted sum of future costs J_{min}(x_k, u_k, k) (the name `Q-learning' comes from Watkins' notation). For convenience, the notation with J is continued here:

J(x_k, u_k, k) = J_{min}(x_{k+1}, u_{k+1}, k+1) + r(x_k, u_k, k)    (7.19)

The optimal control rule can be expressed in terms of J by noting that an optimal control action for state x_k is any action u_k that minimizes J according to equation (7.6).

The estimate of minimum cost J is updated at time step k+1 according to equation (7.5). The


temporal difference \varepsilon(k) between the `true' and expected performance is again used:

\varepsilon(k) = \left[ \min_{u \in U} J(x_{k+1}, u, k+1) + r(x_k, u_k, k) \right] - J(x_k, u_k, k).

Watkins has shown that the function converges under some pre-specified conditions to the true optimal Bellman equation (Watkins & Dayan, 1992): (1) the critic is implemented as a look-up table; (2) the learning parameter \alpha must converge to zero; (3) all actions continue to be tried from all states.
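The following tabular sketch shows the idea for a finite state and action set, with discounting as in eq. (7.2); the names and parameter values are hypothetical:

```python
import numpy as np

def q_update(Q, x, u, r, x_next, gamma=0.95, alpha=0.1):
    """One Q-learning step in the cost-minimisation setting: Q[x, u]
    estimates the minimum discounted sum of future costs after taking
    action u in state x."""
    target = r + gamma * Q[x_next].min()   # cheapest continuation
    Q[x, u] += alpha * (target - Q[x, u])  # temporal-difference correction
    return Q

# Usage: Q = np.zeros((n_states, n_actions)); greedy action: Q[x].argmin()
```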


Part III

APPLICATIONS


8 Robot Control

An important area of application of neural networks is in the field of robotics. Usually, these networks are designed to direct a manipulator, which is the most important form of the industrial robot, to grasp objects, based on sensor data. Other applications include the steering and path-planning of autonomous robot vehicles.

In robotics, the major task involves making movements dependent on sensor data. There are four related problems to be distinguished (Craig, 1989):

Forward kinematics. Kinematics is the science of motion which treats motion without regard to the forces which cause it. Within this science one studies the position, velocity, acceleration, and all higher order derivatives of the position variables. A very basic problem in the study of mechanical manipulation is that of forward kinematics. This is the static geometrical problem of computing the position and orientation of the end-effector (`hand') of the manipulator. Specifically, given a set of joint angles, the forward kinematic problem is to compute the position and orientation of the tool frame relative to the base frame (see figure 8.1).

Figure 8.1: An exemplar robot manipulator.
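For a planar two-link arm (a simpler device than the manipulator of figure 8.1, chosen here only for brevity), the forward kinematic problem has a closed form; the link lengths below are assumed values:

```python
import numpy as np

def forward_kinematics(theta1, theta2, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link arm given its joint
    angles: the forward kinematic problem in its simplest form."""
    x = l1 * np.cos(theta1) + l2 * np.cos(theta1 + theta2)
    y = l1 * np.sin(theta1) + l2 * np.sin(theta1 + theta2)
    return np.array([x, y])
```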

Inverse kinematics. This problem is posed as follows: given the position and orientation of the end-effector of the manipulator, calculate all possible sets of joint angles which could be used to attain this given position and orientation. This is a fundamental problem in the practical use of manipulators.

The inverse kinematic problem is not as simple as the forward one. Because the kinematic equations are nonlinear, their solution is not always easy or even possible in a closed form. Also, the questions of existence of a solution, and of multiple solutions, arise.

Solving this problem is a minimal requirement for most robot control systems.


Dynamics. Dynamics is a field of study devoted to studying the forces required to cause motion. In order to accelerate a manipulator from rest, glide at a constant end-effector velocity, and finally decelerate to a stop, a complex set of torque functions must be applied by the joint actuators. In dynamics not only the geometrical properties (kinematics) are used, but also the physical properties of the robot are taken into account. Take for instance the weight (inertia) of the robot arm, which determines the force required to change the motion of the arm. The dynamics introduce two extra problems to the kinematic problems.

1. The robot arm has a `memory'. Its response to a control signal depends also on its history (e.g. previous positions, speed, acceleration).

2. If a robot grabs an object then the dynamics change but the kinematics don't. This is because the weight of the object has to be added to the weight of the arm (that's why robot arms are so heavy, making the relative weight change very small).

Trajectory generation. To move a manipulator from here to there in a smooth, controlled fashion, each joint must be moved via a smooth function of time. Exactly how to compute these motion functions is the problem of trajectory generation.

In the first section of this chapter we will discuss the problems associated with the positioning of the end-effector (in effect, representing the inverse kinematics in combination with sensory transformation). Section 8.2 discusses a network for controlling the dynamics of a robot arm. Finally, section 8.3 describes neural networks for mobile robot control.

8.1 End-effector positioning

The final goal in robot manipulator control is often the positioning of the hand or end-effector in order to be able to, e.g., pick up an object. With the accurate robot arms that are manufactured, this task is often relatively simple, involving the following steps:

1. determine the target coordinates relative to the base of the robot. Typically, when this position is not always the same, this is done with a number of fixed cameras or other sensors which observe the work scene; from the image frame determine the position of the object in that frame, and perform a pre-determined coordinate transformation;

2. with a precise model of the robot (supplied by the manufacturer), calculate the joint angles to reach the target (i.e., the inverse kinematics). This is a relatively simple problem;

3. move the arm (dynamics control) and close the gripper.

The arm motion in point 3 is discussed in section 8.2. Gripper control is not a trivial matter at all, but we will not focus on that.

Involvement of neural networks. So if these parts are relatively simple to solve with a high accuracy, why involve neural networks? The reason is the applicability of robots. When `traditional' methods are used to control a robot arm, accurate models of the sensors and manipulators (in some cases with unknown parameters which have to be estimated from the system's behaviour; yet still with accurate models as starting point) are required, and the system must be calibrated. Also, systems which suffer from wear-and-tear (and which mechanical systems don't?) need frequent recalibration or parameter determination. Finally, the development of more complex (adaptive!) control methods allows the design and use of more flexible (i.e., less rigid) robot systems, both on the sensory and motor side.


8.1.1 Camera-robot coordination is function approximation

The system we focus on in this section is a work floor observed by fixed cameras, and a robot arm. The visual system must identify the target as well as determine the visual position of the end-effector.

The target position x_{target} together with the visual position of the hand x_{hand} are input to the neural controller N(\cdot). This controller then generates a joint position \theta for the robot:

\theta = N(x_{target}, x_{hand}).    (8.1)

We can compare the neurally generated \theta with the optimal \theta_0 generated by a fictitious perfect controller R(\cdot):

\theta_0 = R(x_{target}, x_{hand}).    (8.2)

The task of learning is to make N generate an output `close enough' to \theta_0. There are two problems associated with teaching N(\cdot):

1. generating learning samples which are in accordance with eq. (8.2). This is not trivial, since in useful applications R(\cdot) is an unknown function. Instead, a form of self-supervised or unsupervised learning is required. Some examples to solve this problem are given below;

2. constructing the mapping N(\cdot) from the available learning samples. When the (usually randomly drawn) learning samples are available, a neural network uses these samples to represent the whole input space over which the robot is active. This is evidently a form of interpolation, but has the problem that the input space is of a high dimensionality, and the samples are randomly distributed.

We will discuss three fundamentally different approaches to neural networks for robot end-effector positioning. In each of these approaches, a solution will be found for both the learning sample generation and the function representation.

Approach 1: Feed-forward networks

When using a feed-forward system for controlling the manipulator, a self-supervised learning system must be used.

One such system has been reported by Psaltis, Sideris and Yamamura (Psaltis, Sideris, & Yamamura, 1988). Here, the network, which is constrained to two-dimensional positioning of the robot arm, learns by experimentation. Three methods are proposed:

1. Indirect learning.
In indirect learning, a Cartesian target point x in world coordinates is generated, e.g., by two cameras looking at an object. This target point is fed into the network, which generates an angle vector \theta. The manipulator moves to position \theta, and the cameras determine the new position x' of the end-effector in world coordinates. This x' again is input to the network, resulting in \theta'. The network is then trained on the error \varepsilon_1 = \theta - \theta' (see figure 8.2).

However, minimisation of \varepsilon_1 does not guarantee minimisation of the overall error \varepsilon = x - x'. For example, the network often settles at a `solution' that maps all x's to a single \theta (i.e., the mapping I).

2. General learning.
The method is basically very much like supervised learning, but here the plant input \theta must be provided by the user. Thus the network can directly minimise |\theta - \theta'|. The success of this method depends on the interpolation capabilities of the network. Correct choice of \theta may pose a problem.


Figure 8.2: Indirect learning system for robotics. In each cycle, the network is used in two different places: first in the forward step, then for feeding back the error.

3. Specialised learning.
Keep in mind that the goal of the training of the network is to minimise the error at the output of the plant: \varepsilon = x - x'. We can also train the network by `backpropagating' this error through the plant (compare this with the backpropagation of the error in Chapter 4). This method requires knowledge of the Jacobian matrix of the plant. A Jacobian matrix of a multidimensional function F is a matrix of partial derivatives of F, i.e., the multidimensional form of the derivative. For example, if we have Y = F(X), i.e.,

y_1 = f_1(x_1, x_2, \ldots, x_n),
y_2 = f_2(x_1, x_2, \ldots, x_n),
\vdots
y_m = f_m(x_1, x_2, \ldots, x_n),

then

\delta y_1 = \frac{\partial f_1}{\partial x_1}\,\delta x_1 + \frac{\partial f_1}{\partial x_2}\,\delta x_2 + \ldots + \frac{\partial f_1}{\partial x_n}\,\delta x_n,
\delta y_2 = \frac{\partial f_2}{\partial x_1}\,\delta x_1 + \frac{\partial f_2}{\partial x_2}\,\delta x_2 + \ldots + \frac{\partial f_2}{\partial x_n}\,\delta x_n,
\vdots
\delta y_m = \frac{\partial f_m}{\partial x_1}\,\delta x_1 + \frac{\partial f_m}{\partial x_2}\,\delta x_2 + \ldots + \frac{\partial f_m}{\partial x_n}\,\delta x_n,

or

\delta Y = \frac{\partial F}{\partial X}\,\delta X.    (8.3)

Eq. (8.3) is also written as

\delta Y = J(X)\,\delta X,    (8.4)

where J is the Jacobian matrix of F. So, the Jacobian matrix can be used to calculate the change in the function when its parameters change.

Now, in this case we have

J_{ij} = \left[ \frac{\partial P_i}{\partial \theta_j} \right],    (8.5)

where P_i(\theta) is the i-th element of the plant output for input \theta. The learning rule applied here regards the plant as an additional and unmodifiable layer in the neural network.


Figure 8.3: The system used for specialised learning.

The total error \varepsilon = x - x' is propagated back through the plant by calculating the \delta_j as in eq. (4.14):

\delta_j = \mathcal{F}'(s_j) \sum_i \delta_i\, \frac{\partial P_i(\theta)}{\partial \theta_j},
\delta_i = x_i - x'_i,

where i iterates over the outputs of the plant. When the plant is an unknown function, \frac{\partial P_i(\theta)}{\partial \theta_j} can be approximated by

\frac{\partial P_i(\theta)}{\partial \theta_j} \approx \frac{P_i(\theta + h\,e_j) - P_i(\theta)}{h},    (8.6)

where e_j is used to change the scalar \theta_j into a vector. This approximate derivative can be measured by slightly changing the input to the plant and measuring the changes in the output.
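A sketch of this measurement procedure; the plant can be any function from joint angles to output coordinates, and the step size h is an assumed value:

```python
import numpy as np

def plant_jacobian(plant, theta, h=1e-4):
    """Estimate J_ij = dP_i/dtheta_j of an unknown plant, eq. (8.6),
    by perturbing one joint angle at a time and measuring the change
    in the plant output."""
    x0 = plant(theta)
    J = np.zeros((x0.size, theta.size))
    for j in range(theta.size):
        e_j = np.zeros_like(theta)
        e_j[j] = 1.0                        # unit vector along joint j
        J[:, j] = (plant(theta + h * e_j) - x0) / h
    return J
```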

A somewhat similar approach is taken in (Krose, Korst, & Groen, 1990) and (Smagt & Krose, 1991). Again a two-layer feed-forward network is trained with back-propagation. However, instead of calculating a desired output vector, the input vector which should have invoked the current output vector is reconstructed, and back-propagation is applied to this new input vector and the existing output vector.

The configuration used consists of a monocular manipulator which has to grasp objects. Due to the fact that the camera is situated in the hand of the robot, the task is to move the hand such that the object is in the centre of the image and has some predetermined size (in a later article, a biologically inspired system is proposed (Smagt, Krose, & Groen, 1992) in which the visual flow-field is used to account for the monocularity of the system, such that the dimensions of the object need not be known anymore to the system).

One step towards the target consists of the following operations:

1. measure the distance from the current position to the target position in camera domain, x;

2. use this distance, together with the current state of the robot, as input for the neural network. The network then generates a joint displacement vector \theta;

3. send \theta to the manipulator;

4. again measure the distance from the current position to the target position in camera domain, x';

5. calculate the move made by the manipulator in visual domain, x - {}^{t+1}_{t}R\, x', where {}^{t+1}_{t}R is the rotation matrix of the second camera image with respect to the first camera image;


6. teach the learning pair (x - {}^{t+1}_{t}R\, x', \theta) to the network.

This system has shown to learn correct behaviour in only tens of iterations, and to be very adaptive to changes in the sensor or manipulator (Smagt & Krose, 1991; Smagt, Groen, & Krose, 1993).

By using a feed-forward network, the available learning samples are approximated by a single, smooth function consisting of a summation of sigmoid functions. As mentioned in section 4, a feed-forward network with one layer of sigmoid units is capable of representing practically any function. But how are the optimal weights determined in finite time to obtain this optimal representation? Experiments have shown that, although a reasonable representation can be obtained in a short period of time, an accurate representation of the function that governs the learning samples is often not feasible or extremely difficult (Jansen et al., 1994). The reason for this is the global character of the approximation obtained with a feed-forward network with sigmoid units: every weight in the network has a global effect on the final approximation that is obtained.

Building local representations is the obvious way out: every part of the network is responsible for a small subspace of the total input space. Thus accuracy is obtained locally (Keep It Small & Simple). This is typically obtained with a Kohonen network.

Approach 2: Topology conserving maps

Ritter, Martinetz, and Schulten (Ritter, Martinetz, & Schulten, 1989) describe the use of a Kohonen-like network for robot control. We will only describe the kinematics part, since it is the most interesting and straightforward.

The system described by Ritter et al. consists of a robot manipulator with three degrees of freedom (orientation of the end-effector is not included) which has to grab objects in 3D-space. The system is observed by two fixed cameras which output the (x, y) coordinates of the object and the end-effector (see figure 8.4).

Figure 8.4: A Kohonen network merging the output of two cameras.

Each run consists of two movements. In the gross move, the observed location of the object x (a four-component vector) is input to the network. As with the Kohonen network, the neuron k with highest activation value is selected as winner, because its weight vector w_k is nearest to x. The neurons, which are arranged in a 3-dimensional lattice, correspond in a one-to-one fashion with subregions of the 3D workspace of the robot, i.e., the neuronal lattice is a discrete representation of the workspace. With each neuron a vector \theta and Jacobian matrix A are associated. During the gross move \theta_k is fed to the robot which makes its move, resulting in retinal coordinates x_g of the end-effector. To correct for the discretisation of the working space, an additional move is


made which is dependent on the distance between the neuron and the object in space, w_k - x; this small displacement in Cartesian space is translated to an angle change using the Jacobian A_k:

\theta_{final} = \theta_k + A_k(x - w_k),    (8.7)

which is a first-order Taylor expansion of \theta_{final}. The final retinal coordinates of the end-effector after this fine move are in x_f.

Learning proceeds as follows: when an improved estimate (\theta, A)^* has been found, the following adaptations are made for all neurons j:

w_j^{new} = w_j^{old} + \gamma(t)\, g_{jk}(t)\, \left( x - w_j^{old} \right),

(\theta, A)_j^{new} = (\theta, A)_j^{old} + \gamma'(t)\, g'_{jk}(t)\, \left( (\theta, A)_j^* - (\theta, A)_j^{old} \right).

If g_{jk}(t) = g'_{jk}(t) = \delta_{jk}, this is similar to perceptron learning. Here, as with the Kohonen learning rule, a distance function is used such that g_{jk}(t) and g'_{jk}(t) are Gaussians depending on the distance between neurons j and k with a maximum at j = k (cf. eq. (6.6)).

An improved estimate (\theta, A)^* is obtained as follows:

\theta^* = \theta_k + A_k(x - x_f),    (8.8)

A^* = A_k + A_k(x - w_k - x_f + x_g)\, \frac{(x_f - x_g)^T}{\|x_f - x_g\|^2}    (8.9)
    = A_k + (\Delta\theta - A_k\,\Delta x)\, \frac{\Delta x^T}{\|\Delta x\|^2}.

In eq. (8.8), the final error x - x_f in Cartesian space is translated to an error in joint space via multiplication by A_k. This error is then added to \theta_k to constitute the improved estimate \theta^* (steepest descent minimisation of error).

In eq. (8.9), \Delta x = x_f - x_g, i.e., the change in retinal coordinates of the end-effector due to the fine movement, and \Delta\theta = A_k(x - w_k), i.e., the related joint angles during fine movement. Thus eq. (8.9) can be recognised as an error-correction rule of the Widrow-Hoff type for Jacobians A.

It appears that after 6,000 iterations the system approaches correct behaviour, and that after 30,000 learning steps no noteworthy deviation is present.
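The learning step can be sketched as follows; the array shapes, the precomputed lattice distances and the constant learning parameters are assumptions of this example (the original uses time-dependent \gamma(t) and \gamma'(t)):

```python
import numpy as np

def ritter_update(W, Theta, A, k, x, x_f, x_g, dists, gamma=0.5, sigma=1.0):
    """One learning step of the Kohonen-based controller, eqs. (8.8)-(8.9).

    W[j], Theta[j], A[j] are the camera-space vector, joint vector and
    Jacobian stored at neuron j; k is the winner, x the observed object,
    x_g / x_f the end-effector positions after the gross and fine moves,
    and dists[j] the lattice distance from neuron j to neuron k.
    """
    theta_star = Theta[k] + A[k] @ (x - x_f)          # improved theta, eq. (8.8)
    dx = x_f - x_g                                    # retinal move of the fine step
    dtheta = A[k] @ (x - W[k])                        # joint move of the fine step
    A_star = A[k] + np.outer(dtheta - A[k] @ dx, dx) / (dx @ dx)  # eq. (8.9)
    g = gamma * np.exp(-dists**2 / (2 * sigma**2))    # Gaussian neighbourhood g_jk
    W += g[:, None] * (x - W)
    Theta += g[:, None] * (theta_star - Theta)
    A += g[:, None, None] * (A_star - A)
    return W, Theta, A
```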

8.2 Robot arm dynamics

While end-effector positioning via sensor-robot coordination is an important problem to solve, the robot itself will not move without dynamic control of its limbs.

Again, accurate control with non-adaptive controllers is possible only when accurate models of the robot are available, and the robot is not too susceptible to wear-and-tear. This requirement has led to the current-day robots that are used in many factories. But the application of neural networks in this field changes these requirements.

One of the first neural networks which succeeded in doing dynamic control of a robot arm was presented by Kawato, Furukawa, and Suzuki (Kawato, Furukawa, & Suzuki, 1987). They describe a neural network which generates motor commands from a desired trajectory in joint angles. Their system does not include the trajectory generation or the transformation of visual coordinates to body coordinates.

The network is extremely simple. In fact, the system is a feed-forward network, but by carefully choosing the basis functions, the network can be restricted to one learning layer such that finding the optimal weights is a trivial task. In this case, the basis functions are chosen such that the function that is approximated is a linear combination of those basis functions. This approach is similar to that presented in section 4.5.


Dynamics model. The manipulator used consists of three joints, as the manipulator in figure 8.1 without wrist joint. The desired trajectory \theta_d(t), which is generated by another subsystem, is fed into the inverse-dynamics model (figure 8.5). The error between \theta_d(t) and \theta(t) is fed into the neural model.

Figure 8.5: The neural model proposed by Kawato et al.

The neural model, which is shown in figure 8.6, consists of three perceptrons, each one feeding in one joint of the manipulator. The desired trajectory \theta_d = (\theta_{d1}, \theta_{d2}, \theta_{d3}) is fed into 13 nonlinear subsystems. The resulting signals are weighted and summed, such that

T_{ik}(t) = \sum_{l=1}^{13} w_{lk}\, x_{lk}, \qquad (k = 1, 2, 3),    (8.10)

with

x_{l1} = f_l(\theta_{d1}(t), \theta_{d2}(t), \theta_{d3}(t)),
x_{l2} = x_{l3} = g_l(\theta_{d1}(t), \theta_{d2}(t), \theta_{d3}(t)),

and f_l and g_l as in table 8.1.

Figure 8.6: The neural network used by Kawato et al. There are three neurons, one per joint in the robot arm. Each neuron feeds from thirteen nonlinear subsystems. The upper neuron is connected to the rotary base joint (cf. joint 1 in figure 8.1), the other two neurons to joints 2 and 3.


 l | f_l(\theta_1, \theta_2, \theta_3)                                  | g_l(\theta_1, \theta_2, \theta_3)
---|--------------------------------------------------------------------|---------------------------------------------------------
 1 | \ddot\theta_1                                                      | \ddot\theta_2
 2 | \ddot\theta_1 \sin^2\theta_2                                       | \ddot\theta_3
 3 | \ddot\theta_1 \cos^2\theta_2                                       | \ddot\theta_2 \cos\theta_3
 4 | \ddot\theta_1 \sin^2(\theta_2 + \theta_3)                          | \ddot\theta_3 \cos\theta_3
 5 | \ddot\theta_1 \cos^2(\theta_2 + \theta_3)                          | \dot\theta_1^2 \sin\theta_2 \cos\theta_2
 6 | \ddot\theta_1 \sin\theta_2 \sin(\theta_2 + \theta_3)               | \dot\theta_1^2 \sin(\theta_2 + \theta_3) \cos(\theta_2 + \theta_3)
 7 | \dot\theta_1 \dot\theta_2 \sin\theta_2 \cos\theta_2                | \dot\theta_1^2 \sin\theta_2 \cos(\theta_2 + \theta_3)
 8 | \dot\theta_1 \dot\theta_2 \sin(\theta_2 + \theta_3) \cos(\theta_2 + \theta_3) | \dot\theta_1^2 \cos\theta_2 \sin(\theta_2 + \theta_3)
 9 | \dot\theta_1 \dot\theta_2 \sin\theta_2 \cos(\theta_2 + \theta_3)   | \dot\theta_2^2 \sin\theta_3
10 | \dot\theta_1 \dot\theta_2 \cos\theta_2 \sin(\theta_2 + \theta_3)   | \dot\theta_3^2 \sin\theta_3
11 | \dot\theta_1 \dot\theta_3 \sin(\theta_2 + \theta_3) \cos(\theta_2 + \theta_3) | \dot\theta_2 \dot\theta_3 \sin\theta_3
12 | \dot\theta_1 \dot\theta_3 \sin\theta_2 \cos(\theta_2 + \theta_3)   | \dot\theta_2
13 | \dot\theta_1                                                       | \dot\theta_3

Table 8.1: Nonlinear transformations used in the Kawato model.

The feedback torque T_f(t) in figure 8.5 consists of

T_{fk}(t) = K_{pk}\,(\theta_{dk}(t) - \theta_k(t)) + K_{vk}\,\frac{d\theta_{dk}(t)}{dt}, \qquad (k = 1, 2, 3),

K_{vk} = 0 \text{ unless } |\theta_k(t) - \theta_{dk}(\text{objective point})| < \varepsilon.

The feedback gains K_p and K_v were computed as (517.2, 746.0, 191.4)^T and (16.2, 37.2, 8.4)^T.

Next, the weights adapt using the delta rule

\frac{dw_{ik}}{dt} = x_{ik}\, T_{fk} = x_{ik}\,(T_k - T_{ik}), \qquad (k = 1, 2, 3).    (8.11)
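Discretised, this amounts to a delta rule with the feedback torque as the error signal; a sketch for one joint, with illustrative names and learning rate:

```python
import numpy as np

def kawato_step(w, x, T_f, eta=1e-3):
    """One learning step for one joint: x holds the 13 basis-function
    outputs x_l(theta_d), w the weights of eq. (8.10), and the feedback
    torque T_f acts as the error signal of the delta rule (8.11)."""
    T_i = w @ x              # network torque, eq. (8.10)
    w = w + eta * T_f * x    # weights move so that T_i takes over from T_f
    return w, T_i
```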

A desired move pattern is shown in figure 8.7. After 20 minutes of learning the feedback torques are nearly zero, such that the system has successfully learned the transformation. Although the applied patterns are very dedicated, training with a repetitive pattern \sin(\omega_k t), with \omega_1 : \omega_2 : \omega_3 = 1 : \sqrt{2} : \sqrt{3}, is also successful.

Figure 8.7: The desired joint pattern for joint 1. Joints 2 and 3 have similar time patterns.

The usefulness of neural algorithms is demonstrated by the fact that novel robot architectures, which no longer need a very rigid structure to simplify the controller, are now constructed. For example, several groups (Katayama & Kawato, 1992; Hesselroth, Sarkar, Smagt, & Schulten, 1994) report on work with a pneumatic musculo-skeletal robot arm, with rubber actuators replacing the DC motors. The very complex dynamics and environmental temperature dependency of this arm make the use of non-adaptive algorithms impossible, where neural networks succeed.


8.3 Mobile robots

In the previous sections some applications of neural networks on robot arms were discussed. In this section we focus on mobile robots. Basically, the control of a robot arm and the control of a mobile robot are very similar: the (hierarchical) controller first plans a path, the path is transformed from Cartesian (world) domain to the joint or wheel domain using the inverse kinematics of the system, and finally a dynamic controller takes care of the mapping from set-points in this domain to actuator signals. However, in practice the problems with mobile robots occur more with path-planning and navigation than with the dynamics of the system. Two examples will be given.

8.3.1 Model based navigation

Jorgensen (Jorgensen, 1987) describes a neural approach for path-planning. Robot path-planning techniques can be divided into two categories. The first, called local planning, relies on information available from the current `viewpoint' of the robot. This planning is important, since it is able to deal with fast changes in the environment. Unfortunately, by itself local data is generally not adequate, since occlusion in the line of sight can cause the robot to wander into dead-end corridors or choose non-optimal routes of travel. The second situation is called global path-planning, in which case the system uses global knowledge from a topographic map previously stored into memory. Although global planning permits optimal paths to be generated, it has its weakness. Missing knowledge or incorrectly selected maps can invalidate a global path to an extent that it becomes useless. A possible third, `anticipatory' planning combines both strategies: the local information is constantly used to give a best guess of what the global environment may contain.

Jorgensen investigates two issues associated with neural network applications in unstructured or changing environments. First, can neural networks be used in conjunction with direct sensor readings to associatively approximate global terrain features not observable from a single robot perspective? Secondly, is a neural network fast enough to be useful in path relaxation planning, where the robot is required to optimise motion and situation-sensitive constraints?

For the first problem, the system had to store a number of possible sensor maps of the environment. The robot was positioned in eight positions in each room and 180 sonar scans were made from each position. Based on these data, for each room a map was made. To be able to represent these maps in a neural network, the map was divided into 32 x 32 grid elements, which could be projected onto the 32 x 32 nodes of the neural network. The maps of the different rooms were `stored' in a Hopfield type of network. In the operational phase, the robot wanders around and enters an unknown room. It makes one scan with the sonar, which provides a partial representation of the room map (see figure 8.8). This pattern is clamped onto the network, which will regenerate the best fitting pattern. With this information a global path-planner can be used. The results which are presented in the paper are not very encouraging. With a network of 32 x 32 neurons, the total number of weights is 1024 squared, which costs more than 1 Mbyte of storage if only one byte per weight is used. Also the speed of the recall is low: Jorgensen mentions a recall time of more than two and a half hours on an IBM AT, which is used on board the robot.

Also the use of a simulated annealing paradigm for path planning is not proving to be an effective approach. The large number of settling trials (> 1000) is far too slow for real time, when the same functions could be better served by the use of a potential field approach or distance transform.


Figure 8.8: Schematic representation of the stored rooms, and the partial information which is available from a single sonar scan.

8.3.2 Sensor based control

Very similar to the sensor based control for the robot arm, as described in the previous sections, a mobile robot can be controlled directly using the sensor data. Such an application has been developed at Carnegie Mellon by Touretzky and Pomerleau. The goal of their network is to drive a vehicle along a winding road. The network receives two types of sensor inputs from the sensory system. One is a 30 x 32 pixel image (see figure 8.9) from a camera mounted on the roof of the vehicle, where each pixel corresponds to an input unit of the network. The other input is an 8 x 32 pixel image from a laser range finder. The activation levels of units in the range finder's retina represent the distance to the corresponding objects.

Figure 8.9: The structure of the network for the autonomous land vehicle. The 30 x 32 video input retina and the 8 x 32 range finder input retina feed a hidden layer of 29 units, which projects onto 45 output units coding steering directions from sharp left, via straight ahead, to sharp right.

The network was trained by presenting it samples with as inputs a wide variety of road images taken under different viewing angles and lighting conditions. 1,200 images were presented,


40 times each, while the weights were adjusted using the back-propagation principle. The authors claim that once the network is trained, the vehicle can accurately drive (at about 5 km/hour) along `... a path through a wooded area adjoining the Carnegie Mellon campus, under a variety of weather and lighting conditions.' The speed is nearly twice as high as a non-neural algorithm running on the same vehicle.

Although these results show that neural approaches can be possible solutions for the sensor based control problem, there still are serious shortcomings. In simulations in our own laboratory, we found that networks trained with examples which are provided by human operators are not always able to find a correct approximation of the human behaviour. This is the case if the human operator uses other information than the network's input to generate the steering signal. Also the learning of, in particular, back-propagation networks is dependent on the sequence of samples, and, for all supervised training methods, depends on the distribution of the training samples.


9 Vision

9.1 Introduction

In this chapter we illustrate some applications of neural networks which deal with visual information processing. In the neural literature we find roughly two types of problems: the modelling of biological vision systems and the use of artificial neural networks for machine vision. We will focus on the latter.

The primary goal of machine vision is to obtain information about the environment by processing data from one or multiple two-dimensional arrays of intensity values (`images'), which are projections of this environment on the system. This information can be of different nature:

recognition: the classication of the input data in one of a number of possible classes;

geometric information about the environment, which is important for autonomous systems;

compression of the image for storage and transmission.

Often a distinction is made between low level (or early) vision, intermediate level vision and high level vision. Typical low-level operations include image filtering, isolated feature detection and consistency calculations. At a higher level segmentation can be carried out, as well as the calculation of invariants. The high level vision modules organise and control the flow of information from these modules and combine this information with high level knowledge for analysis.

Computer vision already has a long tradition of research, and many algorithms for image processing and pattern recognition have been developed. There appear to be two computational paradigms that are easily adapted to massive parallelism: local calculations and neighbourhood functions. Calculations that are strictly localised to one area of an image are obviously easy to compute in parallel. Examples are filters and edge detectors in early vision. A cascade of these local calculations can be implemented in a feed-forward network.

The first section describes feed-forward networks for vision. Section 9.3 shows how back-propagation can be used for image compression. In the same section, it is shown that the PCA neuron is ideally suited for image compression. Finally, sections 9.4 and 9.5 describe the cognitron for optical character recognition, and relaxation networks for calculating depth from stereo images.

9.2 Feed-forward types of networks

The early feed-forward networks such as the perceptron and the adaline were essentially designed to be visual pattern classifiers. In principle a multi-layer feed-forward network is able to learn to classify all possible input patterns correctly, but an enormous amount of connections is needed (for the perceptron, Minsky showed that many problems can only be solved if each hidden unit is


connected to all inputs). The question is whether such systems can still be regarded as `vision' systems. No use is made of the spatial relationships in the input patterns, and the problem of classifying a set of `real world' images is the same as the problem of classifying a set of artificial random dot patterns which are, according to Smeulders, no `images.' For that reason, most successful neural vision applications combine self-organising techniques with a feed-forward architecture, such as for example the neocognitron (Fukushima, 1988), described in section 9.4. The neocognitron performs the mapping from input data to output data by a layered structure in which at each stage increasingly complex features are extracted. The lower layers extract local features such as a line at a particular orientation, and the higher layers aim to extract more global features.

Also there is the problem of translation invariance: the system has to classify a pattern correctly independent of the location on the `retina.' However, a standard feed-forward network considers an input pattern which is translated as a totally `new' pattern. Several attempts have been described to overcome this problem, one of the more exotic ones by Widrow (Widrow, Winter, & Baxter, 1988) as a layered structure of adalines.

9.3 Self-organising networks for image compression

In image compression one wants to reduce the number of bits required to store or transmit an image. We can either require a perfect reconstruction of the original or we can accept a small deterioration of the image. The former is called lossless coding and the latter lossy coding. In this section we will consider lossy coding of images with neural networks.

The basic idea behind compression is that an n-dimensional stochastic vector x_n, (part of) the image, is transformed into an m-dimensional stochastic vector

x_m = T\, x_n.    (9.1)

After transmission or storage of this vector \tilde{x}_m, a discrete version of x_m, we can make a reconstruction of x_n by some sort of inverse transform \tilde{T}, so that the reconstructed signal equals

\hat{x}_n = \tilde{T}\, \tilde{x}_m.    (9.2)

The error of the compression and reconstruction stage together can be given as

\xi = E\left[ \| x_n - \hat{x}_n \| \right].    (9.3)

There is a trade-off between the dimensionality of x_m and the error \xi. As one decreases the dimensionality of x_m the error increases, and vice versa, i.e., a better compression leads to a higher deterioration of the image. The basic problem of compression is finding T and \tilde{T} such that the information in x_m is as compact as possible with acceptable error \xi. The definition of acceptable depends on the application area.

The cautious reader has already concluded that dimension reduction is in itself not enough to obtain a compression of the data. The main importance is that some aspects of an image are more important for the reconstruction than others. For example, the mean grey level and generally the low frequency components of the image are very important, so we should code these features with high precision. Others, like high frequency components, are much less important, so these can be coarse-coded. So, when we reduce the dimension of the data, we are actually trying to concentrate the information of the data in a few numbers (the low frequency components) which can be coded with precision, while throwing the rest away (the high frequency components). In this section we will consider coding an image of 256 x 256 pixels. It is a bit tedious to transform the whole image directly by the network. This requires a huge amount of neurons. Because the statistical description over parts of the image is supposed to be stationary, we can


break the image into 1024 blocks of size 8 x 8, which is large enough to entail a local statistical description and small enough to be managed. These blocks can then be coded separately, stored or transmitted, whereafter a reconstruction of the whole image can be made based on these coded 8 x 8 blocks.

9.3.1 Back-propagation

The process above can be interpreted as a 2-layer neural network. The inputs to the network are the 8 x 8 patterns and the desired outputs are the same 8 x 8 patterns as presented on the input units. This type of network is called an auto-associator.

After training with a gradient search method, minimising \xi, the weights between the first and second layer can be seen as the coding matrix T, and those between the second and third as the reconstruction matrix \tilde{T}.

If the number of hidden units is smaller than the number of input (output) units, a compression is obtained; in other words, we are trying to squeeze the information through a smaller channel, namely the hidden layer.
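A linear auto-associator along these lines can be sketched as follows; linear units and all parameter values are simplifying assumptions of this example (the text only requires a gradient search):

```python
import numpy as np

rng = np.random.default_rng(2)

def train_autoassociator(blocks, m=16, eta=0.01, epochs=50):
    """Train coding matrix T (64 -> m) and reconstruction matrix T~
    (m -> 64) on flattened 8x8 image blocks by gradient descent on the
    squared reconstruction error."""
    T = rng.normal(0.0, 0.1, (m, 64))
    T_rec = rng.normal(0.0, 0.1, (64, m))
    for _ in range(epochs):
        for x in blocks:
            h = T @ x                                 # code (hidden layer)
            err = x - T_rec @ h                       # reconstruction error
            T_rec += eta * np.outer(err, h)           # delta rule, output layer
            T += eta * np.outer(T_rec.T @ err, x)     # backpropagated error
    return T, T_rec
```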

This network has been used for the recognition of human faces by Cottrell (Cottrell, Munro, & Zipser, 1987). He uses an input and output layer of 64 x 64 units (!) on which he presented the whole face at once. The hidden layer, which consisted of 64 units, was classified with another network by means of a delta rule. Is this complex network invariant to translations in the input?

9.3.2 Linear networks

It is known from statistics that the optimal transform from an n-dimensional to an m-dimensional stochastic vector, optimal in the sense that \xi contains the lowest energy possible, equals the concatenation of the first m eigenvectors of the correlation matrix R of x_n. So if (e_1, e_2, \ldots, e_n) are the eigenvectors of R, ordered in decreasing corresponding eigenvalue, the transformation matrix is given as T = [e_1\, e_2\, \ldots\, e_m]^T.

In section 6.3.1 a linear neuron with a normalised Hebbian learning rule was able to learn the eigenvectors of the correlation matrix of the input patterns. The definition of the optimal transform given above suits exactly in the PCA network we have described.

So we end up with a 64-m-64 network, where m is the desired number of hidden units, which is coupled to the total error \xi. Since the eigenvalues are ordered in decreasing values, which are the outputs of the hidden units, the hidden units are ordered in importance for the reconstruction.
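One concrete learning rule with exactly this ordering property is Sanger's generalised Hebbian rule, used in the experiments described next; this sketch assumes zero-mean input blocks and an illustrative learning rate:

```python
import numpy as np

def sanger_step(W, x, eta=0.001):
    """One step of Sanger's generalised Hebbian rule: the m rows of W
    converge to the leading eigenvectors of the input correlation
    matrix, ordered by decreasing eigenvalue."""
    y = W @ x                           # outputs of the m linear neurons
    for i in range(W.shape[0]):
        # Hebbian term minus the projections on neurons 0..i
        W[i] += eta * y[i] * (x - W[:i + 1].T @ y[:i + 1])
    return W
```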

Sanger (Sanger, 1989) used this implementation for image compression. The test image is shown in figure 9.1. It is 256 x 256 with 8 bits/pixel.

After training the image four times, thus generating 4 x 1024 learning patterns of size 8 x 8, the weights of the network converge into figure 9.2.

9.3.3 Principal components as features

If parts of the image are very characteristic for the scene, like corners, lines, shades etc., one speaks of features of the image. The extraction of features can make the image understanding task on a higher level much easier. If the image analysis is based on features, it is very important that the features are tolerant of noise, distortion etc.

From an image compression viewpoint it would be smart to code these features with as few bits as possible, just because the definition of features was that they occur frequently in the image.

So one can ask oneself whether the two described compression methods also extract features from the image. Indeed this is true, and it can most easily be seen in fig. 9.2. It might not be clear directly, but one can see that the weights have converged to:


Figure 9.1: Input image for the network. The image is divided into 8 x 8 blocks which are fed to the network.

Figure 9.2: Weights of the PCA network. The final weights of the network trained on the test image. For each neuron (ordered 0 to 7; the last is unused), an 8 x 8 rectangle is shown, in which the grey level of each of the elements represents the value of the weight. Dark indicates a large weight, light a small weight.

neuron 0: the mean grey level;

neuron 1 and neuron 2: the first order gradients of the image;

neuron 3 ... neuron 5: second order derivatives of the image.

The features extracted by the principal component network are the gradients of the image.

9.4 The cognitron and neocognitron

Yet another type of unsupervised learning is found in the cognitron, introduced by Fukushima as early as 1975 (Fukushima, 1975). This network, with primary applications in pattern recognition, was improved at a later stage to incorporate scale, rotation, and translation invariance, resulting in the neocognitron (Fukushima, 1988), which we will not discuss here.

9.4.1 Description of the cells

Central in the cognitron is the type of neuron used. Whereas the Hebb synapse (unit k, say), which is used in the perceptron model, increases an incoming weight (w_{jk}) if and only if the


incoming signal (y_j) is high and a control input is high, the synapse introduced by Fukushima increases (the absolute value of) its weight (|w_{jk}|) only if it has positive input y_j and a maximum activation value y_k = \max(y_{k_1}, y_{k_2}, \ldots, y_{k_n}), where k_1, k_2, \ldots, k_n are all `neighbours' of k. Note that this learning scheme is competitive and unsupervised, and the same type of neuron has, at a later stage, been used in the competitive learning network (section 6.1) as well as in other unsupervised networks.

Fukushima distinguishes between excitatory inputs and inhibitory inputs. The output of an excitatory cell u is given by¹

u(k) = \mathcal{F}\left[ \frac{1 + e}{1 + h} - 1 \right] = \mathcal{F}\left[ \frac{e - h}{1 + h} \right],    (9.4)

where e is the excitatory input from u-cells and h the inhibitory input from v-cells. The activation function is

\mathcal{F}(x) = \begin{cases} x & \text{if } x \geq 0, \\ 0 & \text{otherwise.} \end{cases}    (9.5)

When the inhibitory input is small, i.e., h \ll 1, u(k) can be approximated by u(k) = e - h, which agrees with the formula for a conventional linear threshold element (with a threshold of zero).

When both the excitatory and inhibitory inputs increase in proportion, i.e.,

e = \varepsilon x, \qquad h = \eta x    (9.6)

(\varepsilon, \eta constants) and \eta > \varepsilon, then eq. (9.4) can be transformed into

u(i) = \frac{(\varepsilon - \eta)\, x}{1 + \eta x} = \frac{\varepsilon - \eta}{2\eta} \left[ 1 + \tanh\left( \tfrac{1}{2} \log \eta x \right) \right],    (9.7)

i.e., a squashing function as in figure 2.2.
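In code, the output of a single excitatory cell is simply the following; the weight names are assumed for the example:

```python
import numpy as np

def excitatory_output(a, u_prev, b, v_prev):
    """Output of a cognitron u-cell, eqs. (9.4)-(9.5): e and h are the
    weighted excitatory and inhibitory inputs; negative values are
    clipped to zero by the activation function F."""
    e = float(np.dot(a, u_prev))   # excitatory input via modifiable weights a
    h = b * v_prev                 # inhibitory input via weight b
    x = (1 + e) / (1 + h) - 1      # equals (e - h) / (1 + h)
    return max(x, 0.0)
```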

9.4.2 Structure of the cognitron

The basic structure of the cognitron is depicted in figure 9.3.

Figure 9.3: The basic structure of the cognitron.

The cognitron has a multi-layered structure. The l-th layer U_l consists of excitatory neurons u_l(n) and inhibitory neurons v_l(n), where n = (n_x, n_y) is a two-dimensional location of the cell.

¹ Here our notational system fails. We adhere to Fukushima's symbols.


A cell u_l(n) receives inputs via modifiable connections a_l(v, n) from neurons u_{l-1}(n + v) and connections b_l(n) from neurons v_{l-1}(n), where v is in the connectable area (cf. area of attention) of the neuron. Furthermore, an inhibitory cell v_{l-1}(n) receives inputs via fixed excitatory connections c_{l-1}(v) from the neighbouring cells u_{l-1}(n + v), and yields an output equal to its weighted input:

v_{l-1}(n) = \sum_v c_{l-1}(v)\, u_{l-1}(n + v),    (9.8)

where \sum_v c_{l-1}(v) = 1 and the c's are fixed.

It can be shown that the growing of the synapses (i.e., modification of the a and b weights) ensures that, if an excitatory neuron has a relatively large response, the excitatory synapses grow faster than the inhibitory synapses, and vice versa.

Receptive region

For each cell in the cascaded layers described above a connectable area must be established. A connection scheme as in figure 9.4 is used: a neuron in layer U_l connects to a small region in layer U_{l-1}.

Figure 9.4: Cognitron receptive regions.

If the connection region of a neuron is constant in all layers, a too large number of layers is needed to cover the whole input layer. On the other hand, increasing the region in later layers results in so much overlap that the output neurons have near identical connectable areas and thus all react similarly. This again can be prevented by increasing the size of the vicinity area in which neurons compete, but then only one neuron in the output layer will react to some input stimulus. This is in contradiction with the behaviour of biological brains.

A solution is to distribute the connections probabilistically, such that connections with a large deviation are less numerous.

9.4.3 Simulation results

In order to illustrate the working of the network, a simulation has been run with a four-layered network with 16 x 16 neurons in each layer. The network is trained with four learning patterns, consisting of a vertical, a horizontal, and two diagonal lines. Figure 9.5 shows the activation levels in the layers in the first two learning iterations.

After 20 learning iterations, the learning is halted and the activation values of the neurons in layer 4 are fed back to the input neurons; also, the maximum output neuron alone is fed back, and thus the input pattern is `recognised' (see figure 9.6).


Figure 9.5: Two learning iterations in the cognitron. Four learning patterns (one in each row) are shown in iteration 1 (a.) and 2 (b.). Each column in a. and b. shows one layer in the network. The activation level of each neuron is shown by a circle. A large circle means a high activation. In the first iteration (a.), a structure is already developing in the second layer of the network. In the second iteration, the second layer can distinguish between the four patterns.

9.5 Relaxation types of networks

As demonstrated by the Hopfield network, a relaxation process in a connectionist network can provide a powerful mechanism for solving some difficult optimisation problems. Many vision problems can be considered as optimisation problems, and are potential candidates for an implementation in a Hopfield-like network. A few examples that are found in the literature will be mentioned here.

9.5.1 Depth from stereo

By observing a scene with two cameras one can retrieve depth information out of the images by finding the pairs of pixels in the images that belong to the same point of the scene. The calculation of the depth is relatively easy; finding the correspondences is the main problem. One solution is to find features such as corners and edges and match those, reducing the computational complexity of the matching. Marr (Marr, 1982) showed that the correspondence problem can be solved correctly when taking into account the physical constraints underlying the process. Three matching criteria were defined:

Compatibility: Two descriptive elements can only match if they arise from the same physical marking (corners can only match with corners, `blobs' with `blobs,' etc.);

Uniqueness: Almost always a descriptive element from the left image corresponds to exactly one element in the right image and vice versa;

Continuity: The disparity of the matches varies smoothly almost everywhere over the image.


Figure 9.6: Feeding back activation values in the cognitron.

The four learning patterns are now successively applied to the network (row 1 of figures a-d). Next, the activation values of the neurons in layer 4 are fed back to the input (row 2 of figures a-d). Finally, all the neurons except the most active in layer 4 are set to 0, and the resulting activation values are again fed back (row 3 of figures a-d). After as few as 20 iterations, the network has shown itself to be rather robust.

Marr's `cooperative' algorithm (also a `non-cooperative' or local algorithm has been described (Marr, 1982)) is able to calculate the disparity map from which the depth can be reconstructed. This algorithm is some kind of neural network, consisting of neurons N(x, y, d), where neuron N(x, y, d) represents the hypothesis that pixel (x, y) in the left image corresponds with pixel (x + d, y) in the right image. The update function is

N^{t+1}(x, y, d) = \sigma\left( \sum_{x',y',d' \in S(x,y,d)} N^t(x', y', d') \;-\; \epsilon \sum_{x',y',d' \in O(x,y,d)} N^t(x', y', d') \;+\; N^0(x, y, d) \right).    (9.9)

Here, \epsilon is an inhibition constant, \sigma is a threshold function, S(x, y, d) is the local excitatory neighbourhood, and O(x, y, d) is the local inhibitory neighbourhood, which are chosen as follows:

S(x, y, d) = \{ (r, s, t) \mid (r = x \lor r = x - d) \land s = y \},    (9.10)

O(x, y, d) = \{ (r, s, t) \mid d = t \land \|(r, s) - (x, y)\| \le w \}.    (9.11)


The network is loaded with the cross-correlation of the images at first: N^0(x, y, d) = I_l(x, y) I_r(x + d, y), where I_l and I_r are the intensity matrices of the left and right image respectively. This network state represents all possible matches of pixels. Then the set of possible matches is reduced by recursive application of the update function until the state of the network is stable.

The algorithm converges in about ten iterations. The disparity of a pixel (x, y) is then displayed by the firing neuron in the set \{ N(r, s, d) \mid r = x, s = y \}. In each of these sets there should be exactly one neuron firing, but if the algorithm could not compute the exact disparity, for instance at hidden contours, there may be zero or more than one neuron firing.
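To make the update rule concrete, the following is a minimal Python/numpy sketch of the cooperative algorithm for small grey-value images. The parameter values (w, eps, theta, the number of iterations) and the square approximation of the \|(r, s) - (x, y)\| \le w neighbourhood are illustrative choices, not taken from the text.

    import numpy as np

    def cooperative_stereo(Il, Ir, d_max, w=2, eps=2.0, theta=3.0, iters=10):
        # Neuron N[x, y, d]: pixel (x, y) in the left image matches (x + d, y)
        # in the right image.
        X, Y = Il.shape
        # Initial load: N0(x, y, d) = Il(x, y) * Ir(x + d, y).
        N0 = np.zeros((X, Y, d_max + 1))
        for d in range(d_max + 1):
            N0[:X - d, :, d] = Il[:X - d, :] * Ir[d:, :]
        N = N0.copy()
        for _ in range(iters):
            Nn = np.zeros_like(N)
            for x in range(X):
                for y in range(Y):
                    for d in range(d_max + 1):
                        # S(x, y, d): all disparities on the lines r = x and
                        # r = x - d of row y (eq. 9.10).
                        S = N[x, y, :].sum()
                        if d > 0 and x - d >= 0:
                            S += N[x - d, y, :].sum()
                        # O(x, y, d): same disparity within distance w of
                        # (x, y) (eq. 9.11), approximated by a square window.
                        O = N[max(0, x - w):x + w + 1, max(0, y - w):y + w + 1, d].sum()
                        # Threshold function sigma: a hard step at theta.
                        Nn[x, y, d] = 1.0 if S - eps * O + N0[x, y, d] > theta else 0.0
            N = Nn
        # The firing neuron in {N(x, y, d)} displays the disparity of (x, y).
        return N.argmax(axis=2)

After convergence, N.argmax(axis=2) reads off one disparity per pixel; pixels where no or several neurons fire correspond to the problematic cases mentioned above.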

9.5.2 Image restoration and image segmentation

The restoration of degraded images is a branch of digital picture processing, closely related to image segmentation and boundary finding. An analysis of the major applications and procedures may be found in (Rosenfeld & Kak, 1982). An algorithm which is based on the minimisation of an energy function, and which can be parallelised very well, is given by Geman and Geman (Geman & Geman, 1984). Their approach is based on stochastic modelling, in which image samples are considered to be generated by a random process that changes its statistical properties from region to region. The random process that generates the image samples is a two-dimensional analogue of a Markov process, called a Markov random field. Image segmentation is then considered as a statistical estimation problem in which the system calculates the optimal estimate of the region boundaries for the input image. Simultaneous estimation of the region properties and boundary properties has to be performed, resulting in a set of nonlinear estimation equations that define the optimal estimate of the regions. The system must find the maximum a posteriori probability estimate of the image segmentation. Geman and Geman showed that the problem can be recast into the minimisation of an energy function which, in turn, can be solved approximately by optimisation techniques such as simulated annealing. The interesting point is that simulated annealing can be implemented using a network with local connections, in which the network iterates into a global solution using these local operations.
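Although the full Geman and Geman model is beyond the scope of this summary, the flavour of energy minimisation by simulated annealing with purely local operations can be shown in a few lines. The sketch below restores a binary image (pixel states in {-1, +1}) under an Ising-style energy; the energy form, the constants, and the cooling schedule are all illustrative assumptions, not the authors' model.

    import numpy as np

    def anneal_restore(noisy, beta=1.0, eta=2.0, T0=4.0, cooling=0.95, sweeps=50, seed=0):
        # Metropolis updates on E = -beta * sum s_i s_j - eta * sum s_i y_i,
        # with the temperature T slowly lowered (simulated annealing).
        rng = np.random.default_rng(seed)
        s = noisy.copy()                      # pixel states in {-1, +1}
        X, Y = s.shape
        T = T0
        for _ in range(sweeps):
            for x in range(X):
                for y in range(Y):
                    nb = sum(s[i, j] for i, j in ((x-1, y), (x+1, y), (x, y-1), (x, y+1))
                             if 0 <= i < X and 0 <= j < Y)
                    # Energy change of flipping s[x, y]: only local terms enter,
                    # which is what makes a local, parallel implementation possible.
                    dE = 2 * s[x, y] * (beta * nb + eta * noisy[x, y])
                    if dE < 0 or rng.random() < np.exp(-dE / T):
                        s[x, y] = -s[x, y]
            T *= cooling
        return s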

9.5.3 Silicon retina

Mead and his co-workers (Mead, 1989) have developed an analogue VLSI vision preprocessing chip modelled after the retina. The design not only replicates many of the important functions of the first stages of retinal processing, but it does so by replicating in a detailed way both the structure and dynamics of the constituent biological units. The logarithmic compression from photon input to output signal is accomplished by analogue circuits, while space and time averaging and temporal differentiation are similarly accomplished by analogue processes and a resistive network (see section 11.2.1).


Part IV

IMPLEMENTATIONS


Implementation of neural networks can be divided into three categories:

software simulation;

(hardware) emulation (see footnote 2);

hardware implementation.

The distinction between the former two categories is not clear-cut. We will use the term simulation to describe software packages which can run on a variety of host machines (e.g., PYGMALION, the Rochester Connectionist Simulator, NeuralWare, Nestor, etc.). Implementation of neural networks on general-purpose multi-processor machines such as the Connection Machine, the Warp, transputers, etc., will be referred to as emulation. Hardware implementation will be reserved for neuro-chips and the like which are specifically designed to run neural networks.

To evaluate and provide a taxonomy of the neural network simulators and emulators discussed, we will use the descriptors of table 9.1 (cf. (DARPA, 1988)).

1. Equation type: many networks are defined by the type of equation describing their operation. For example, Grossberg's ART (cf. section 6.4) is described by the differential equation

\frac{dx_k}{dt} = -A x_k + (B - x_k) I_k - x_k \sum_{j \neq k} I_j,    (9.12)

in which -A x_k is a decay term, +B I_k is an external input, -x_k I_k is a normalisation term, and -x_k \sum_{j \neq k} I_j is a neighbour shut-off term for competition. Although differential equations are very powerful, they require a high degree of flexibility in the software and hardware and are thus difficult to implement on special-purpose machines. Other types of equations are, e.g., difference equations as used in the description of Kohonen's topological maps (see section 6.2), and optimisation equations as used in back-propagation networks. (A small numerical sketch of integrating eq. (9.12) is given just after table 9.1.)

2. Connection topology: the design of most general-purpose computers includes random access memory (RAM) such that each memory position can be accessed with uniform speed. Such designs always present a trade-off between size of memory and speed of access. The topology of neural networks can be matched in a hardware design with fast local interconnections instead of global access. Most networks are more or less local in their interconnections, and a global RAM is unnecessary.

3. Processing schema: although most artificial neural networks use a synchronous update, i.e., the output of the network depends on the previous state of the network, asynchronous update, in which components or blocks of components can be updated one by one, can be implemented much more efficiently. Also, continuous update is a possibility encountered in some implementations.

4. Synaptic transmission mode: most artificial neural networks have a transmission mode based on the neuronal activation values multiplied by synaptic weights. In these models, the propagation time from one neuron to another is neglected. On the other hand, biological neurons output a series of pulses in which the frequency determines the neuron output, such that propagation times are an essential part of the model. Currently, models arise which make use of temporal synaptic transmission (Murray, 1989; Tomlinson & Walker, 1990).

Table 9.1: A possible taxonomy.
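As an aside to descriptor 1, the differential equation (9.12) can also be integrated numerically when no analogue hardware is available; the forward-Euler sketch below is one such approach. The constants A and B, the step size, and the step count are illustrative values.

    import numpy as np

    def art_activation(I, A=1.0, B=1.0, dt=0.01, steps=1000):
        # Forward-Euler integration of eq. (9.12):
        # dx_k/dt = -A x_k + (B - x_k) I_k - x_k * sum_{j != k} I_j
        x = np.zeros_like(I, dtype=float)
        for _ in range(steps):
            dx = -A * x + (B - x) * I - x * (I.sum() - I)
            x += dt * dx
        return x

    print(art_activation(np.array([0.2, 1.0, 0.5])))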

The following chapters describe general-purpose hardware which can be used for neural network applications, and neuro-chips and other dedicated hardware.

Footnote 2: The term emulation (see, e.g., (Mallach, 1975) for a good introduction) in computer design means running one computer to execute instructions specific to another computer. It is often used to provide the user with a machine which is seemingly compatible with earlier models.


10 General Purpose Hardware

Parallel computers (Almasi & Gottlieb, 1989) can be divided into several categories. One important aspect is the granularity of the parallelism. Broadly speaking, the granularity ranges from coarse-grain parallelism, typically up to ten processors, to fine-grain parallelism, up to thousands or millions of processors.

Both fine-grain and coarse-grain parallelism are in use for emulation of neural networks. The former model, in which one or more processors can be used for each neuron, corresponds with table 9.1's type 2, whereas the second corresponds with type 1. We will discuss one model of each type of architecture: the (extremely) fine-grain Connection Machine and coarse-grain systolic arrays, viz. the Warp computer. A more complete discussion should also include transputers, which are very popular nowadays due to their very high performance/price ratio (Group, 1987; Board, 1989; Eckmiller, Hartmann, & Hauske, 1990). In this case, descriptor 1 of table 9.1 is most applicable.

Besides the granularity, the computers can be categorised by their operation. The most widely used categorisation is by Flynn (Flynn, 1972) (see table 10.1). It distinguishes two types of parallel computers: SIMD (Single Instruction, Multiple Data) and MIMD (Multiple Instruction, Multiple Data). The former type consists of a number of processors which execute the same instructions but on different data, whereas the latter has a separate program for each processor. Fine-grain computers are usually SIMD, while coarse-grain computers tend to be MIMD (also in correspondence with table 9.1, entries 1 and 2).

                              Number of Data Streams
                              single              multiple

 Number of       single       SISD                SIMD
 Instruction                  (von Neumann)       (vector, array)
 Streams         multiple     MISD                MIMD
                              (pipeline?)         (multiple micros)

Table 10.1: Flynn's classification.

Table 10.2 shows a comparison of several types of hardware for neural network simulation. The speed entry, measured in interconnects per second, is an important measure which is often used to compare neural network simulators. It measures the number of multiply-and-add operations that can be performed per second. However, the comparison is not 100% honest: it does not always include the time needed to fetch the data on which the operations are to be performed, and may also ignore other functions required by some algorithms, such as the computation of a sigmoid function. Also, the speed is of course dependent on the algorithm used.
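As a rough illustration of how the interconnects-per-second figure translates into throughput, consider a hypothetical fully connected layer; all numbers below are made up for the example.

    # One forward pass through a fully connected layer costs one
    # multiply-and-add (one "interconnect") per weight.
    n_in, n_out = 256, 128
    interconnects = n_in * n_out          # 32,768 per pattern
    speed = 10_000_000                    # a machine rated at 10,000 K Int/s
    print(speed / interconnects)          # ~305 forward passes per second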


HARDWARE                          WORD    STORAGE      SPEED      COST    SPEED
                                  LENGTH  (K Intcnts)  (K Int/s)  (K$)    /COST

WORKSTATIONS
 Micro/Mini      PC/AT            16          100          25        5      5.0
 Computers       Sun 3            32          250         250       20     12.5
                 VAX              32          100         100      300      0.33
                 Symbolics        32       32,000          35      100      0.35

 Attached        ANZA             8-32        500          45       10      4.5
 Processors      Δ1               32        1,000      10,000       15    667
                 Transputer       16        2,000       3,000        4    750

 Bus-oriented    Mark III, IV     16        1,000         500       75      6.7
                 MX-1/16          16       50,000     120,000      300    400

MASSIVELY        CM2 (64K)        32       64,000      13,000    2,000      6.5
PARALLEL         Warp (10)        32          320      17,000      300     56.7
                 Warp (20)                             32,000
                 Butterfly (64)   32       60,000       8,000      500     16

SUPER-           Cray XMP         64        2,000      50,000    4,000     12.5
COMPUTERS

Table 10.2: Hardware machines for neural network simulation. The authors are well aware that the mentioned computer architectures are archaic... current computer architectures are several orders of magnitude faster. For instance, current-day Sun Sparc machines (e.g., an Ultra at 200 MHz) benchmark at almost 300,000 dhrystones per second, whereas the archaic Sun 3 benchmarks at about 3,800. Prices of both machines (then vs. now) are approximately the same. Go figure! Nevertheless, the table gives an insight into the performance of different types of architectures.

10.1 The Connection Machine

10.1.1 Architecture

One of the most outstanding fine-grain SIMD parallel machines is Daniel Hillis' Connection Machine (Hillis, 1985; Corporation, 1987), originally developed at MIT and later built at Thinking Machines Corporation. The original model, the CM1, consists of 64K (65,536) one-bit processors, divided up into four units of 16K processors each. The units are connected via a cross-bar switch (the nexus) to up to four front-end computers (see figure 10.1). The large number of extremely simple processors make the machine a data-parallel computer, which can best be envisaged as active memory.

Each processor chip contains 16 processors, a control unit, and a router. It is connected to a memory chip which contains 4K bits of memory per processor. Each processor consists of a one-bit ALU with three inputs and two outputs, and a set of registers. The control unit decodes incoming instructions broadcast by the front-end computers (which can be DEC VAXes or Symbolics Lisp machines). At any time, a processor may be either listening to the incoming instruction or not.

The router implements the communication algorithm: each router is connected to its nearest neighbours via a two-dimensional grid (the NEWS grid) for fast neighbour communication; also, the chips are connected via a Boolean 12-cube, i.e., chips i and j are connected if and only if |i - j| = 2^k for some integer k. Thus at most 12 hops are needed to deliver a message. So there are 4,096 routers connected by 24,576 bidirectional wires.
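Interpreting the Boolean 12-cube in the usual way (two chips are neighbours when their 12-bit addresses differ in one bit position), the number of hops a message needs is simply the Hamming distance between the chip addresses, as the small sketch below illustrates.

    def hops(i, j):
        # One hop per differing address bit on a Boolean n-cube:
        # the Hamming distance between the two chip addresses.
        return bin(i ^ j).count("1")

    print(hops(0b000000000000, 0b111111111111))  # 12, the worst case on a 12-cube
    print(hops(5, 7))                            # 1 hop: 0101 and 0111 differ in one bit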

By slicing the memory of a processor, the CM can also implement virtual processors.

The CM2 differs from the CM1 in that it has 64K bits instead of 4K bits of memory per processor, and an improved I/O system.


Figure 10.1: The Connection Machine system organisation.

10.1.2 Applicability to neural networks

There have been a few researchers trying to implement neural networks on the Connection Machine (Blelloch & Rosenberg, 1987; Singer, 1990). Even though the Connection Machine has a topology which matches the topology of most artificial neural networks very well, the relatively slow message passing system makes the machine not very useful as a general-purpose neural network simulator. It appears that the Connection Machine suffers from a dramatic decrease in throughput due to communication delays (Hummel, 1990). Furthermore, the cost/speed ratio (see table 10.2) is very bad compared to, e.g., a transputer board. As a result, the Connection Machine is not widely used for neural network simulation.

One possible implementation is given in (Blelloch & Rosenberg, 1987). Here, a back-propagation network is implemented by allocating one processor per unit, one per outgoing weight, and one per incoming weight. The processors are arranged such that each processor for a unit is immediately followed by the processors for its outgoing weights and preceded by those for its incoming weights. The feed-forward step is performed by first clamping the input units and next executing a copy-scan operation which moves the activation values to the next k processors (the outgoing weight processors). The weights then multiply themselves with the activation values and perform a send operation in which the resulting values are sent to the processors allocated for incoming weights. A plus-scan then sums these values for the next layer of units in the network. The feedback step is executed similarly. Both the feed-forward and feedback steps can be interleaved and pipelined such that no layer is ever idle. For example, for the feed-forward step, a new pattern x_p is clamped on the input layer while the next layer is computing on x_{p-1}, etc.

To prevent inefficient use of processors, one weight could also be represented by one processor.
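The scan-based data flow is easy to mimic with array primitives. The following sketch is a serial numpy rendering of the feed-forward step only, assuming a dense weight matrix W of shape (n_in, n_out) and sigmoid output units; it is meant to show the roles of the copy-scan, the parallel multiply, and the plus-scan, not to model the machine itself.

    import numpy as np

    def cm_feedforward(y_in, W):
        n_in, n_out = W.shape
        spread = np.repeat(y_in, n_out)        # copy-scan: each activation reaches
                                               # its n_out outgoing weight processors
        prods = spread * W.reshape(-1)         # every weight processor multiplies locally
        # send + plus-scan: regroup the products by destination unit and sum
        by_dest = prods.reshape(n_in, n_out).T.reshape(-1)
        sums = np.add.reduceat(by_dest, np.arange(0, n_in * n_out, n_in))
        return 1.0 / (1.0 + np.exp(-sums))     # squashing done in the unit processors

    print(cm_feedforward(np.array([0.0, 1.0, 0.5]), np.ones((3, 2))))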


10.2 Systolic arrays

Systolic arrays (Kung & Leiserson, 1979) take advantage of laying out algorithms in two dimensions. The design favours compute-bound as opposed to I/O-bound operations. The name systolic is derived from the analogy of pumping blood through a heart and feeding data through a systolic array.

A typical use is depicted in figure 10.2. Here, two band matrices A and B are multiplied and added to C, resulting in an output C + AB. Essential in the design is the reuse of data elements, instead of referencing the memory each time the element is needed.

Figure 10.2: Typical use of a systolic array.

The Warp computer, developed at Carnegie Mellon University, has been used for simulating artificial neural networks (Pomerleau, Gusciora, Touretzky, & Kung, 1988) (see table 10.2). It is a system with ten or more programmable one-dimensional systolic arrays. Two data streams, one of which is bi-directional, flow through the processors (see figure 10.3). To implement a matrix product W x + \theta, the W is not a stream as in figure 10.2 but is stored in the memory of the processors.

Figure 10.3: The Warp system architecture.
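A software rendering of the systolic idea makes the data reuse explicit. The sketch below mimics the basic cell of figure 10.2 and the Warp-style mapping in which W stays resident in the cells while the input stream flows past; it is a serial illustration under those assumptions, not a model of the actual machine.

    def systolic_cell(a, b, c):
        # The elementary systolic operation: consume one element of each
        # stream and pass on c + a * b.
        return c + a * b

    def matvec_systolic(W, x, theta):
        # Warp-style mapping of W x + theta: cell i keeps row i of W in its
        # local memory; the elements of x stream through and are reused by
        # every cell instead of being fetched from memory each time.
        y = list(theta)
        for i, row in enumerate(W):
            for a, b in zip(row, x):
                y[i] = systolic_cell(a, b, y[i])
        return y

    print(matvec_systolic([[1, 2], [3, 4]], [1, 1], [0.5, 0.5]))  # [3.5, 7.5]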


11 Dedicated Neuro-Hardware

Recently, many neuro-chips have been designed and built. Although many techniques, such as digital and analogue electronics, optical computers, chemical implementation, and bio-chips, are investigated for implementing neuro-computers, only digital and analogue electronics, and to a lesser degree optical implementations, are at present feasible techniques. We will therefore concentrate on such implementations.

11.1 General issues

11.1.1 Connectivity constraints

Connectivity within a chip

A major problem with neuro-chips is always the connectivity. A single integrated circuit is, in current-day technology, planar with limited possibility for cross-over connections. This poses a problem: whereas connectivity to the nearest neighbour can be implemented without problems, connectivity to the second-nearest neighbour results in a cross-over of four, which is already problematic. On the other hand, full connectivity between a set of input and output units can easily be attained when the input and output neurons are situated near two edges of the chip (see figure 11.1). Note that the number of neurons in the chip grows linearly with the size of the chip, whereas in the earlier layout, the dependence is quadratic.

Figure 11.1: Connections between M input and N output neurons.

Connectivity between chips

To build large or layered ANN's, the neuro-chips have to be connected together. When only a few neurons have to be connected together, or the chips can be placed in subsequent rows in feed-forward types of networks, this is no problem. But in other cases, when large numbers


of neurons in one chip have to be connected to neurons in other chips, there are a number of problems:

designing chip packages with a very large number of input or output leads;

fan-out of chips: each chip can ordinarily only send signals to a small number of other chips. Amplifiers are needed, which are costly in power dissipation and chip area;

wiring.

A possible solution would be using optical interconnections. In this case, an external light source would reflect light on one set of neurons, which would reflect part of this light using deformable-mirror spatial light modulator technology onto another set of neurons. Also under development are three-dimensional integrated circuits.

11.1.2 Analogue vs. digital

Due to the similarity between artificial and biological neural networks, analogue hardware seems a good choice for implementing artificial neural networks, resulting in cheaper implementations which operate at higher speed. On the other hand, digital approaches offer far greater flexibility and, not to be neglected, arbitrarily high accuracy. Also, digital chips can be designed without the need of very advanced knowledge of the circuitry, using CAD/CAM systems, whereas the design of analogue chips requires good theoretical knowledge of transistor physics as well as experience.

An advantage that analogue implementations have over digital neural networks is that they closely match the physical laws present in neural networks (table 9.1, point 1). First of all, weights in a neural network can be coded by one single analogue element (e.g., a resistor) where several digital elements are needed (see footnote 1). Secondly, very simple rules such as Kirchhoff's laws (see footnote 2) can be used to carry out the addition of input signals. As another example, Boltzmann machines (section 5.3) can be easily implemented by amplifying the natural noise present in analogue devices.
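The Kirchhoff-based addition is worth spelling out: if the input activations are presented as voltages that drive resistors onto a common summing node, the branch currents simply add, so the weighted sum costs no extra circuitry at all. The numbers below are arbitrary.

    # Ohm's law per branch, Kirchhoff's current law at the node:
    # the net current is the weighted sum of the input voltages,
    # with conductances 1/R acting as the weights.
    V = [1.0, 0.5, -0.2]                 # input activations as voltages
    R = [1e3, 2e3, 5e3]                  # synaptic weights as resistors (ohms)
    I = sum(v / r for v, r in zip(V, R))
    print(I)                             # ~0.00121 A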

11.1.3 Optics

As mentioned above, optics could very well be used to interconnect several (layers of) neurons. One can distinguish two approaches. One is to store weights in a planar transmissive or reflective device (e.g., a spatial light modulator) and use lenses and fixed holograms for interconnection. Figure 11.2 shows an implementation of optical matrix multiplication. When N is the linear size of the optical array divided by the wavelength of the light used, the array has capacity for N^2 weights, so it can fully connect N neurons with N neurons (Fahrat, Psaltis, Prata, & Paek, 1985).

A second approach uses volume holographic correlators, offering connectivity between two areas of N^2 neurons for a total of N^4 connections (see footnote 3). A possible use of such volume holograms in an all-optical network would be to use the system for image completion (Abu-Mostafa & Psaltis, 1987). A number of images could be stored in the hologram. The input pattern is correlated with each of them, resulting in output patterns with a brightness varying with the

Footnote 1: On the other hand, the opposite can be found when considering the size of the element, especially when high accuracy is needed. However, once artificial neural networks have outgrown rules like back-propagation, high accuracy might not be needed.

Footnote 2: The Kirchhoff laws state that for two resistors R_1 and R_2 (1) in series, the total resistance can be calculated using R = R_1 + R_2, and (2) in parallel, the total resistance can be found using 1/R = 1/R_1 + 1/R_2 (Feynman, Leighton, & Sands, 1983).

Footnote 3: Well... not exactly. Due to diffraction, the total number of independent connections that can be stored in an ideal medium is N^3, i.e., the volume of the hologram divided by the cube of the wavelength. So, in fact N^{3/2} neurons can be connected with N^{3/2} neurons.


Figure 11.2: Optical implementation of matrix multiplication.

degree of correlation. The images are fed into a threshold device which will conduct the image with the highest brightness better than the others. This enhancement can be repeated for several loops.
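Functionally, the correlate-threshold-feedback loop can be sketched in a few lines. The soft winner-take-all below stands in for the optical threshold device; the stored images, the sharpness parameter, and the number of loops are all illustrative assumptions.

    import numpy as np

    def complete(image, stored, loops=5, sharpness=4.0):
        # Correlate the input with each stored image, let the brightest
        # output dominate, and feed the mixture back as the new input.
        x = image.reshape(-1).astype(float)
        M = np.array([s.reshape(-1) for s in stored], dtype=float)
        for _ in range(loops):
            brightness = M @ x                 # correlation with each stored image
            gain = np.exp(sharpness * brightness)
            gain /= gain.sum()                 # the brightest image is conducted best
            x = gain @ M                       # weighted feedback of the stored images
        return x.reshape(image.shape)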

11.1.4 Learning vs. non-learning

It is generally agreed that the major forte of neural networks is their ability to learn. Whereas a network with fixed, pre-computed weight values could have its merit in industrial applications, on-line adaptivity remains a design goal for most neural systems.

With respect to learning, we can distinguish between the following levels:

1. fixed weights: the design of the network determines the weights. Examples are the retina and cochlea chips of Carver Mead's group discussed below (cf. a ROM (Read-Only Memory) in computer design);

2. pre-programmed weights: the weights in the network can be set only once, when the chip is installed. Many optical implementations fall in this category (cf. PROM (Programmable ROM));

3. programmable weights: the weights can be set more than once by an external device (cf. EPROM (Erasable PROM) or EEPROM (Electrically Erasable PROM));

4. on-site adapting weights: the learning mechanism is incorporated in the network (cf. RAM (Random Access Memory)).

11.2 Implementation examples

11.2.1 Carver Mead's silicon retina

The chips devised by Carver Mead's group at Caltech (Mead, 1989) are heavily inspired by biological neural networks. Mead attempts to build analogue neural chips which match biological neurons as closely as possible, including extremely low power consumption, fully analogue hardware, and operation in continuous time (table 9.1, point 3). One example of such a chip is the Silicon Retina (Mead & Mahowald, 1988).


Retinal structure

The off-center retinal structure can be described as follows. Light is transduced to electrical signals by photo-receptors which have a primary pathway through the triad synapses to the bipolar cells. The bipolar cells are connected to the retinal ganglion cells, which are the output cells of the retina. The horizontal cells, which are also connected via the triad synapses to the photo-receptors, are situated directly below the photo-receptors and have synapses connected to the axons leading to the bipolar cells.

The system can be described in terms of the triad synapse's three elements:

1. the photo-receptor outputs the logarithm of the intensity of the light;

2. the horizontal cells form a network which averages the photo-receptor over space and time;

3. the output of the bipolar cell is proportional to the difference between the photo-receptor output and the horizontal cell output.
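The three-element description translates directly into a functional sketch. The box filter below is a crude stand-in for the weighted spatial average of the resistive network, no temporal dynamics are modelled, and input intensities are assumed strictly positive.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def retina_response(intensity, smooth=5):
        receptor = np.log(intensity)                        # 1. logarithmic photo-receptor
        horizontal = uniform_filter(receptor, size=smooth)  # 2. space-averaging horizontal net
        return receptor - horizontal                        # 3. bipolar cell: the difference

Because of the logarithm in step 1, the output depends on contrast ratios rather than absolute intensities, which is exactly the property claimed for the photo-receptor below.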

The photo-receptor

The photo-receptor circuit outputs a voltage which is proportional to the logarithm of the intensity of the incoming light. There are two important consequences:

1. several orders of magnitude of intensity can be handled in a moderate signal level range;

2. the voltage difference between two points is proportional to the contrast ratio of their illuminance.

The photo-receptor can be implemented using a photo-detector, two FETs (see footnote 4) connected in series (see footnote 5), and one transistor (see figure 11.3). The lowest photo-current is about 10^{-14} A, or 10^5 photons per second, corresponding with a moonlit scene.

Figure 11.3: The photo-receptor used by Mead. To prevent current being drawn from the photo-receptor, the output is only connected to the gate of the transistor.

Footnote 4: Field Effect Transistor.

Footnote 5: A detailed description of the electronics involved is out of place here. However, we will provide figures where useful. See (Mead, 1989) for an in-depth study.


Horizontal resistive layer

Each photo-receptor is connected to its six neighbours via resistors, forming a hexagonal array. The voltage at every node in the network is a spatially weighted average of the photo-receptor inputs, such that farther away inputs have less influence (see figure 11.4(a)).

Figure 11.4: The resistive layer (a) and, enlarged, a single node (b).

Bipolar cell

The output of the bipolar cell is proportional to the difference between the photo-receptor output and the voltage of the horizontal resistive layer. The architecture is shown in figure 11.4(b). It consists of two elements: a wide-range amplifier which drives the resistive network towards the photo-receptor output, and an amplifier sensing the voltage difference between the photo-receptor output and the network potential.

Implementation

A chip was built containing 48 x 48 pixels. The output of every pixel can be accessed independently by providing the chip with the horizontal and vertical address of the pixel. The selectors can be run in two modes: static probe or serial access. In the first mode, a single row and column are addressed and the output of a single pixel is observed as a function of time. In the second mode, both vertical and horizontal shift registers are clocked to provide a serial scan of the processed image for display on a television display.

Performance

Several experiments show that the silicon retina performs similarly to the biological retina (Mead & Mahowald, 1988). Similarities are shown in the sensitivity for intensities, the time responses for a single output when flashes of light are input, and the response to contrast edges.

11.2.2 LEP's LNeuro chip

A radically different approach is the LNeuro chip developed at the Laboratoires d'Electronique Philips (LEP) in France (Theeten, Duranton, Mauduit, & Sirat, 1990; Duranton & Sirat, 1989). Whereas most neuro-chips implement Hopfield networks (section 5.2) or, in some cases, Kohonen


networks (section 6.2) (due to the fact that these networks have local learning rules), these digital neuro-chips can be configured to incorporate any learning rule and network topology.

Architecture

The LNeuro chip, depicted in figure 11.5, consists of a multiply-and-add or relaxation part, and a learning part. The LNeuro 1.0 has a parallelism of 16. The weights w_{ij} are 8 bits long in the relaxation phase and 16 bits in the learning phase.

Figure 11.5: The LNeuro chip. For clarity, only four neurons are drawn.

Multiply-and-add

The multiply-and-add in fact performs a matrix multiplication

y_k(t + 1) = F\left( \sum_j w_{jk}\, y_j(t) \right).    (11.1)

The input activations y_k are kept in the neural state registers. For each neural state there are two registers. These can be used to implement synchronous or asynchronous update. In the former mode, the computed states of the neurons wait in registers until all states are known; then the whole register is written into the register used for the calculations. In asynchronous mode, however, every new state is directly written into the register used for the next calculation.

The arithmetical logical unit (ALU) has an external input to allow for accumulation of external partial products. This can be used to construct larger, structured, or higher-precision networks.

The neural states (y_k) are coded in one to eight bits, whereas either eight or sixteen bits can be used for the weights, which are kept in a RAM. In order to save silicon area, the multiplications w_{jk} y_j are serialised over the bits of y_j, replacing N eight-by-eight-bit parallel multipliers by N eight-bit AND gates. The partial products are saved and added in the tree of adders.
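The bit-serial trick can be checked with a few lines of Python. The sketch below computes \sum_j w_j y_j exactly as described, one bit of the y's per pass, with the AND gates selecting which weights enter the adder tree; the 8-bit width matches the text, everything else is illustrative.

    def bit_serial_products(weights, y, n_bits=8):
        acc = 0
        for bit in range(n_bits):                   # one pass per bit of the y's
            partial = sum(w for w, yj in zip(weights, y)
                          if (yj >> bit) & 1)       # AND gate: pass w_j iff bit is set
            acc += partial << bit                   # shift-and-add of partial products
        return acc

    print(bit_serial_products([3, 5, 2], [1, 2, 3]))  # 3*1 + 5*2 + 2*3 = 19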


The computation thus increases linearly with the number of neurons (instead of quadratically, as in simulation on serial machines).

The activation function is, for reasons of flexibility, kept off-chip. The results of the weighted sum calculation go off-chip serially (i.e., bit by bit), and the result must be written back to the neural state registers.

Finally, a column of latches is included to temporarily store memory values, such that during a multiply of the weight with several bits the memory can be freely accessed. These latches in fact take part in the learning mechanism described below.

Learning

The remaining parts in the chip are dedicated to the learning mechanism. The learning mechanism is designed to implement the Hebbian learning rule (Hebb, 1949)

w_{jk} \leftarrow w_{jk} + \eta_k y_j    (11.2)

where \eta_k is a scalar which only depends on the output neuron k. To simplify the circuitry, eq. (11.2) is simplified to

w_{jk} \leftarrow w_{jk} + g(y_k, y_j)\, \eta_k    (11.3)

where g(y_k, y_j) can have value -1, 0, or +1. In effect, eq. (11.3) either increments or decrements w_{jk} by \eta_k, or keeps w_{jk} unchanged. Thus eq. (11.2) can be simulated by executing eq. (11.3) several times over the same set of weights.

The weights w_{jk} related to the output neuron k are all modified in parallel. A learning step proceeds as follows. Every learning processor (see figure 11.5) LP_j loads the weight w_{jk} from the synaptic memory, the \eta_k from the learning register, and the neural state y_j. Next, they all modify their weights in parallel using eq. (11.3) and write the adapted weights back to the synaptic memory, also in parallel.
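A minimal sketch of such a learning step is given below. The function g is left programmable on the chip; the concrete g used here, and all values, are only an illustration.

    def lneuro_learn_step(W, y_in, k, y_k, eta_k, g):
        # Eq. (11.3) for output neuron k: every learning processor LP_j
        # updates its weight w_jk in parallel by g(y_k, y_j) * eta_k,
        # with g taking values in {-1, 0, +1}.
        for j in range(len(y_in)):
            W[j][k] += g(y_k, y_in[j]) * eta_k
        return W

    # Illustrative g: reinforce agreeing active states, weaken disagreement.
    g = lambda yk, yj: 1 if (yk and yj) else (-1 if yk != yj else 0)
    W = [[0.0, 0.0] for _ in range(3)]
    lneuro_learn_step(W, y_in=[1, 0, 1], k=0, y_k=1, eta_k=0.05, g=g)

Repeating this step several times over the same weights, as noted above, approximates the full Hebbian update of eq. (11.2).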


References

Abu-Mostafa, Y. A., & Psaltis, D. (1987). Optical neural computers. Scientific American, ?, 88-95. [Cited on p. 116.]

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1), 147-169. [Cited on p. 54.]

Ahalt, S. C., Krishnamurthy, A. K., Chen, P., & Melton, D. (1990). Competitive learning algorithms for vector quantization. Neural Networks, 3, 277-290. [Cited on p. 60.]

Almasi, G. S., & Gottlieb, A. (1989). Highly Parallel Computing. The Benjamin/Cummings Publishing Company Inc. [Cited on p. 111.]

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the First International Conference on Neural Networks (Vol. 2, pp. 609-618). IEEE. [Cited on p. 50.]

Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1986). Spin-glass models of neural networks. Physical Review A, 32(2), 1007-1018. [Cited on p. 50.]

Anderson, J. A. (1977). Neural models with cognitive implications. In D. LaBerge & S. J. Samuels (Eds.), Basic Processes in Reading Perception and Comprehension Models (pp. 27-90). Hillsdale, NJ: Erlbaum. [Cited on pp. 18, 50.]

Anderson, J. A., & Rosenfeld, E. (1988). Neurocomputing: Foundations of Research. Cambridge, MA: The MIT Press. [Cited on p. 9.]

Barto, A. G., & Anandan, P. (1985). Pattern-recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360-375. [Cited on p. 78.]

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning problems. IEEE Transactions on Systems, Man and Cybernetics, 13, 834-846. [Cited on pp. 75, 77.]

Barto, A. G., Sutton, R. S., & Watkins, C. (1990). Sequential decision problems and neural networks. In D. Touretsky (Ed.), Advances in Neural Information Processing II. DUNNO. [Cited on p. 80.]

Bellman, R. (1957). Dynamic Programming. Princeton University Press. [Cited on p. 80.]

Blelloch, G., & Rosenberg, C. R. (1987). Network learning on the Connection Machine. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence (pp. 323-326). DUNNO. [Cited on p. 113.]

Board, J. A. B., Jr. (1989). Transputer Research and Applications 2: Proceedings of the Second Conference on the North American Transputer Users Group. Durham, NC: IOS Press. [Cited on p. 111.]


Boomgaard, R. van den, & Smeulders, A. (1989). Self learning image processing using a-priori knowledge of spatial patterns. In T. Kanade, F. C. A. Groen, & L. O. Hertzberger (Eds.), Proceedings of the I.A.S. Conference (pp. 305-314). Elsevier Science Publishers. [Cited on p. 42.]

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth and Brooks/Cole. [Cited on p. 63.]

Bruce, A. D., Canning, A., Forrest, B., Gardner, E., & Wallace, D. J. (1986). Learning and memory properties in fully connected networks. In J. S. Denker (Ed.), AIP Conference Proceedings 151, Neural Networks for Computing (pp. 65-70). DUNNO. [Cited on p. 52.]

Carpenter, G. A., & Grossberg, S. (1987a). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54-115. [Cited on pp. 57, 72.]

Carpenter, G. A., & Grossberg, S. (1987b). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26(23), 4919-4930. [Cited on p. 72.]

Corporation, T. M. (1987). Connection Machine Model CM2 Technical Summary (Tech. Rep. No. HA87-4). Thinking Machines Corporation. [Cited on p. 112.]

Cottrell, G. W., Munro, P., & Zipser, D. (1987). Image compression by back-propagation: a demonstration of extensional programming (Tech. Rep. No. TR 8702). UCSD, Institute of Cognitive Sciences. [Cited on p. 99.]

Craig, J. J. (1989). Introduction to Robotics. Addison-Wesley Publishing Company. [Cited on p. 85.]

Cun, Y. L. (1985). Une procedure d'apprentissage pour reseau a seuil assymetrique. Proceedings of Cognitiva, 85, 599-604. [Cited on p. 33.]

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303-314. [Cited on p. 33.]

DARPA. (1988). DARPA Neural Network Study. AFCEA International Press. [Cited on pp. 9, 109.]

Dastani, M. M. (1991). Functie-Benadering met Feed-Forward Netwerken. Unpublished master's thesis, Universiteit van Amsterdam, Faculteit Computer Systemen. [Cited on pp. 39, 40.]

Duranton, M., & Sirat, J. A. (1989). Learning on VLSI: A general purpose digital neurochip. In Proceedings of the Third International Joint Conference on Neural Networks. DUNNO. [Cited on p. 119.]

Eckmiller, R., Hartmann, G., & Hauske, G. (1990). Parallel Processing in Neural Systems and Computers. Elsevier Science Publishers B.V. [Cited on p. 111.]

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211. [Cited on p. 48.]

Fahrat, N. H., Psaltis, D., Prata, A., & Paek, E. (1985). Optical implementation of the Hopfield model. Applied Optics, 24, 1469-1475. [Cited on p. 116.]

Feldman, J. A., & Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205-254. [Cited on p. 16.]


Feynman, R. P., Leighton, R. B., & Sands, M. (1983). The Feynman Lectures on Physics. Reading (MA), Menlo Park (CA), London, Sidney, Manila: Addison-Wesley Publishing Company. [Cited on p. 116.]

Flynn, M. J. (1972). Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21, 948-960. [Cited on p. 111.]

Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1-141. [Cited on p. 63.]

Fritzke, B. (1991). Let it grow - self-organizing feature maps with problem dependent cell structure. In T. Kohonen, K. Makisara, O. Simula, & J. Kangas (Eds.), Proceedings of the 1991 International Conference on Artificial Neural Networks (pp. 403-408). North-Holland/Elsevier Science Publishers. [Cited on p. 64.]

Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological Cybernetics, 20, 121-136. [Cited on pp. 57, 100.]

Fukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1, 119-130. [Cited on pp. 57, 98, 100.]

Funahashi, K.-I. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3), 183-192. [Cited on p. 33.]

Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability. New York: W. H. Freeman. [Cited on p. 53.]

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-724. [Cited on p. 105.]

Gielen, C., Krommenhoek, K., & Gisbergen, J. van. (1991). A procedure for self-organized sensor-fusion in topologically ordered maps. In T. Kanade, F. C. A. Groen, & L. O. Hertzberger (Eds.), Proceedings of the Second International Conference on Autonomous Systems (pp. 417-423). Elsevier Science Publishers. [Cited on p. 66.]

Gorman, R. P., & Sejnowski, T. J. (1988). Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1(1), 75-89. [Cited on p. 45.]

Grossberg, S. (1976). Adaptive pattern classification and universal recoding I & II. Biological Cybernetics, 23, 121-134, 187-202. [Cited on pp. 57, 69.]

Group, O. U. (1987). Parallel Programming of Transputer Based Machines: Proceedings of the 7th Occam User Group Technical Meeting. Grenoble, France: IOS Press. [Cited on p. 111.]

Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3, 671-692. [Cited on p. 77.]

Hartigan, J. A. (1975). Clustering Algorithms. New York: John Wiley & Sons. [Cited on p. 72.]

Hartman, E. J., Keeler, J. D., & Kowalski, J. M. (1990). Layered neural networks with Gaussian hidden units as universal approximations. Neural Computation, 2(2), 210-215. [Cited on p. 33.]

Hebb, D. O. (1949). The Organization of Behaviour. New York: Wiley. [Cited on pp. 18, 121.]


Hecht-Nielsen, R. (1988). Counterpropagation networks. Neural Networks, 1, 131-139. [Cited on p. 63.]

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison Wesley. [Cited on p. 9.]

Hesselroth, T., Sarkar, K., Smagt, P. van der, & Schulten, K. (1994). Neural network control of a pneumatic robot arm. IEEE Transactions on Systems, Man, and Cybernetics, 24(1), 28-38. [Cited on p. 93.]

Hestenes, M. R., & Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems. Nat. Bur. Standards J. Res., 49, 409-436. [Cited on p. 41.]

Hillis, W. D. (1985). The Connection Machine. The MIT Press. [Cited on p. 112.]

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558. [Cited on pp. 18, 50.]

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092. [Cited on p. 52.]

Hopfield, J. J., Feinstein, D. I., & Palmer, R. G. (1983). `Unlearning' has a stabilizing effect in collective memories. Nature, 304, 158-159. [Cited on p. 52.]

Hopfield, J. J., & Tank, D. W. (1985). `Neural' computation of decisions in optimization problems. Biological Cybernetics, 52, 141-152. [Cited on p. 53.]

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366. [Cited on p. 33.]

Hummel, R. A. (1990). Personal Communication. [Cited on p. 113.]

Jansen, A., Smagt, P. P. van der, & Groen, F. C. A. (1994). Nested networks for robot control. In A. F. Murray (Ed.), Neural Network Applications. Kluwer Academic Publishers. (In press) [Cited on pp. 63, 90.]

Jordan, M. I. (1986a). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 531-546). Hillsdale, NJ: Erlbaum. [Cited on p. 48.]

Jordan, M. I. (1986b). Serial Order: A Parallel Distributed Processing Approach (Tech. Rep. No. 8604). San Diego, La Jolla, CA: Institute for Cognitive Science, University of California. [Cited on p. 48.]

Jorgensen, C. C. (1987). Neural network representation of sensor graphs in autonomous robot path planning. In IEEE First International Conference on Neural Networks (Vol. IV, pp. 507-515). IEEE. [Cited on p. 94.]

Josin, G. (1988). Neural-space generalization of a topological transformation. Biological Cybernetics, 59, 283-290. [Cited on p. 46.]


Kawato, M., Furukawa, K., & Suzuki, R. (1987). A hierarchical neural-network model for control and learning of voluntary movement. Biological Cybernetics, 57, 169-185. [Cited on p. 91.]

Katayama, M., & Kawato, M. (1992). A Parallel-Hierarchical Neural Network Model for Motor Control of a Musculo-Skeletal System (Tech. Rep. No. TR-A-0145). ATR Auditory and Visual Perception Research Laboratories. [Cited on p. 93.]

Kohonen, T. (1977). Associative Memory: A System-Theoretical Approach. Springer-Verlag. [Cited on pp. 18, 50, 64.]

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59-69. [Cited on p. 64.]

Kohonen, T. (1984). Self-Organization and Associative Memory. Berlin: Springer-Verlag. [Cited on p. 64.]

Kohonen, T. (1995). Self-Organizing Maps. Springer. [Cited on p. 9.]

Kohonen, T., Makisara, M., & Saramaki, T. (1984). Phonotopic maps - insightful representation of phonological features for speech recognition. In Proceedings of the 7th IEEE International Conference on Pattern Recognition. DUNNO. [Cited on p. 66.]

Krose, B. J. A., & Dam, J. W. M. van. (1992). Learning to avoid collisions: A reinforcement learning paradigm for mobile robot manipulation. In Proceedings of IFAC/IFIP/IMACS International Symposium on Artificial Intelligence in Real-Time Control (pp. 295-300). Delft: IFAC, Laxenburg. [Cited on p. 78.]

Krose, B. J. A., Korst, M. J. van der, & Groen, F. C. A. (1990). Learning strategies for a vision based neural controller for a robot arm. In O. Kaynak (Ed.), IEEE International Workshop on Intelligent Motor Control (pp. 199-203). IEEE. [Cited on p. 89.]

Kung, H. T., & Leiserson, C. E. (1979). Systolic arrays (for VLSI). In Sparse Matrix Proceedings, 1978. Academic Press. (Also in: Algorithms for VLSI Processor Arrays, C. Mead & L. Conway, eds., Addison-Wesley, 1980, 271-292.) [Cited on p. 114.]

Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE Transactions on Acoustics, Speech, and Signal Processing, 4(2), 4-22. [Cited on pp. 71, 73.]

Lippmann, R. P. (1989). Review of neural networks for speech recognition. Neural Computation, 1, 1-38. [Cited on p. 58.]

Mallach, E. G. (1975). Emulator architecture. Computer, 8, 24-32. [Cited on p. 109.]

Marr, D. (1982). Vision. San Francisco: W. H. Freeman. [Cited on pp. 103, 104.]

Martinetz, T., & Schulten, K. (1991). A "neural-gas" network learns topologies. In T. Kohonen, K. Makisara, O. Simula, & J. Kangas (Eds.), Proceedings of the 1991 International Conference on Artificial Neural Networks (pp. 397-402). North-Holland/Elsevier Science Publishers. [Cited on p. 64.]

McClelland, J. L., & Rumelhart, D. E. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. The MIT Press. [Cited on pp. 9, 15.]

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133. [Cited on p. 13.]

Mead, C. (1989). Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley. [Cited on pp. 105, 117, 118.]

Mead, C. A., & Mahowald, M. A. (1988). A silicon model of early visual processing. Neural Networks, 1(1), 91-97. [Cited on pp. 117, 119.]


Mel, B. W. (1990). Connectionist Robot Motion Planning. San Diego, CA: Academic Press. [Cited on p. 16.]

Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. The MIT Press. [Cited on pp. 13, 26, 31, 33.]

Murray, A. F. (1989). Pulse arithmetic in VLSI neural networks. IEEE Micro, 9(12), 64-74. [Cited on p. 109.]

Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267-273. [Cited on p. 68.]

Parker, D. B. (1985). Learning-Logic (Tech. Rep. No. TR-47). Cambridge, MA: Massachusetts Institute of Technology, Center for Computational Research in Economics and Management Science. [Cited on p. 33.]

Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2), 263-269. [Cited on p. 50.]

Pearlmutter, B. A. (1990). Dynamic Recurrent Neural Networks (Tech. Rep. No. CMU-CS-90-196). Pittsburgh, PA 15213: School of Computer Science, Carnegie Mellon University. [Cited on pp. 17, 50.]

Pineda, F. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19), 2229-2232. [Cited on p. 50.]

Polak, E. (1971). Computational Methods in Optimization. New York: Academic Press. [Cited on p. 41.]

Pomerleau, D. A., Gusciora, G. L., Touretzky, D. S., & Kung, H. T. (1988). Neural network simulation at Warp speed: How we got 17 million connections per second. In IEEE Second International Conference on Neural Networks (Vol. II, pp. 143-150). DUNNO. [Cited on p. 114.]

Powell, M. J. D. (1977). Restart procedures for the conjugate gradient method. Mathematical Programming, 12, 241-254. [Cited on p. 41.]

Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical Recipes: The Art of Scientific Computing. Cambridge: Cambridge University Press. [Cited on pp. 40, 41.]

Psaltis, D., Sideris, A., & Yamamura, A. A. (1988). A multilayer neural network controller. IEEE Control Systems Magazine, 8(2), 17-21. [Cited on p. 87.]

Ritter, H. J., Martinetz, T. M., & Schulten, K. J. (1989). Topology-conserving maps for learning visuo-motor-coordination. Neural Networks, 2, 159-168. [Cited on p. 90.]

Ritter, H. J., Martinetz, T. M., & Schulten, K. J. (1990). Neuronale Netze. Addison-Wesley Publishing Company. [Cited on p. 9.]

Rosen, B. E., Goodwin, J. M., & Vidal, J. J. (1992). Process control with adaptive range coding. Biological Cybernetics, 66, 419-428. [Cited on p. 63.]

Rosenblatt, F. (1959). Principles of Neurodynamics. New York: Spartan Books. [Cited on pp. 23, 26.]


Rosenfeld, A., & Kak, A. C. (1982). Digital Picture Processing. New York: Academic Press. [Cited on p. 105.]

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536. [Cited on p. 33.]

Rumelhart, D. E., & McClelland, J. L. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. The MIT Press. [Cited on pp. 9, 15.]

Rumelhart, D. E., & Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9, 75-112. [Cited on p. 57.]

Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459-473. [Cited on p. 99.]

Sejnowski, T. J., & Rosenberg, C. R. (1986). NETtalk: A Parallel Network that Learns to Read Aloud (Tech. Rep. No. JHU/EECS-86/01). The Johns Hopkins University Electrical Engineering and Computer Science Department. [Cited on p. 45.]

Silva, F. M., & Almeida, L. B. (1990). Speeding up backpropagation. In R. Eckmiller (Ed.), Advanced Neural Computers (pp. 151-160). North-Holland. [Cited on p. 42.]

Singer, A. (1990). Implementations of Artificial Neural Networks on the Connection Machine (Tech. Rep. No. RL90-2). Cambridge, MA: Thinking Machines Corporation. [Cited on p. 113.]

Smagt, P. van der, Groen, F., & Krose, B. (1993). Robot Hand-Eye Coordination Using Neural Networks (Tech. Rep. No. CS-93-10). Department of Computer Systems, University of Amsterdam. (ftp'able from archive.cis.ohio-state.edu) [Cited on p. 90.]

Smagt, P. van der, Krose, B. J. A., & Groen, F. C. A. (1992). Using time-to-contact to guide a robot manipulator. In Proceedings of the 1992 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 177-182). IEEE. [Cited on p. 89.]

Smagt, P. P. van der, & Krose, B. J. A. (1991). A real-time learning neural robot controller. In T. Kohonen, K. Makisara, O. Simula, & J. Kangas (Eds.), Proceedings of the 1991 International Conference on Artificial Neural Networks (pp. 351-356). North-Holland/Elsevier Science Publishers. [Cited on pp. 89, 90.]

Sofge, D., & White, D. (1992). Applied learning: optimal control for manufacturing. In D. Sofge & D. White (Eds.), Handbook of Intelligent Control, Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, New York. (In press) [Cited on p. 76.]

Stoer, J., & Bulirsch, R. (1980). Introduction to Numerical Analysis. New York, Heidelberg, Berlin: Springer-Verlag. [Cited on p. 41.]

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44. [Cited on p. 75.]

Sutton, R. S., Barto, A., & Wilson, R. (1992). Reinforcement learning is direct adaptive optimal control. IEEE Control Systems, 6, 19-22. [Cited on p. 80.]

Theeten, J. B., Duranton, M., Mauduit, N., & Sirat, J. A. (1990). The LNeuro chip: A digital VLSI with on-chip learning mechanism. In Proceedings of the International Conference on Neural Networks (Vol. I, pp. 593-596). DUNNO. [Cited on p. 119.]


Tomlinson, M. S., Jr., & Walker, D. J. (1990). DNNA: A digital neural network architecture. In Proceedings of the International Conference on Neural Networks (Vol. II, pp. 589-592). DUNNO. [Cited on p. 109.]

Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292. [Cited on pp. 76, 80, 81.]

Werbos, P. (1992). Approximate dynamic programming for real-time control and neural modelling. In D. Sofge & D. White (Eds.), Handbook of Intelligent Control, Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, New York. [Cited on pp. 76, 80.]

Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Unpublished doctoral dissertation, Harvard University. [Cited on p. 33.]

Werbos, P. W. (1990). A menu for designs of reinforcement learning over time. In W. T. Miller III, R. S. Sutton, & P. J. Werbos (Eds.), Neural Networks for Control. MIT Press/Bradford. [Cited on p. 77.]

White, D., & Jordan, M. (1992). Optimal control: a foundation for intelligent control. In D. Sofge & D. White (Eds.), Handbook of Intelligent Control, Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, New York. [Cited on p. 80.]

Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record (pp. 96-104). DUNNO. [Cited on pp. 23, 27.]

Widrow, B., Winter, R. G., & Baxter, R. A. (1988). Layered neural nets for pattern recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(7), 1109-1117. [Cited on p. 98.]

Wilson, G. V., & Pawley, G. S. (1988). On the stability of the travelling salesman problem algorithm of Hopfield and Tank. Biological Cybernetics, 58, 63-70. [Cited on p. 53.]

Page 220: 52918523 Artificial Neural Network Technology

Index

Symbols
k-means clustering, 61

A
ACE, 77
activation function, 17, 19
  hard limiting, 17
  Heaviside, 23
  linear, 17
  nonlinear, 33
  semi-linear, 17
  sgn, 23
  sigmoid, 17, 36, 39
    derivative of, 36
  threshold, 51
Adaline, 23
adaline, 18, 23, 27f.
  vision, 97
adaptive critic element, 77
analogue implementation, 115-117
annealing, 54
ART, 57, 69, 109
ASE, 77
  activation function, 77
  bias, 77
associative learning, 18
associative memory, 52
  instability of stored patterns, 52
  spurious stable states, 52
associative search element, 77
asymmetric divergence, 55
asynchronous update, 16, 50
auto-associator, 50
  vision, 99

B
back-propagation, 33, 39, 45, 109, 116
  advanced training algorithms, 40
  conjugate gradient, 40
  derivation of, 34
  discovery of, 13
  gradient descent, 34, 37, 40
  implementation on Connection Machine, 113
  learning by pattern, 37
  learning rate, 37
  local minima, 40
  momentum, 37
  network paralysis, 39
  oscillation in, 37
  understanding, 35f.
  vision, 99
bias, 19f.
bio-chips, 115
bipolar cells
  retina, 118
  silicon retina, 119
Boltzmann distribution, 54
Boltzmann machine, 54, 116

C
Carnegie Mellon University, 114
CART, 63
cart-pole, 79
chemical implementation, 115
CLUSTER, 61
clustering, 57
coarse-grain parallelism, 111
coding
  lossless, 98
  lossy, 98
cognitron, 57, 100
competitive learning, 57
  error-function, 60
  frequency sensitive, 60
conjugate directions, 41
conjugate gradient, 40
connectionist models, 13
Connection Machine, 109, 111-113
  architecture, 112
  communication, 112
  NEWS grid, 112
  nexus, 112
connectivity
  constraints, 115
  optical, 116
convergence
  steepest descent, 41
cooperative algorithm, 104
correlation matrix, 68
counterpropagation, 62
counterpropagation network, 63

D
DARPA neural network study, 9
decoder, 78
deflation, 69
delta rule, 18, 27-29
  generalised, 33, 35f.
digital implementation, 115f.
dimensionality reduction, 57
discovery vs. creation, 13
discriminant analysis, 64
distance measure, 60
dynamic programming, 77
dynamics
  in neural networks, 17
  robotics, 86, 92

E
EEPROM, 117
eigenvector transformation, 67
eligibility, 78
Elman network, 48-50
emulation, 109, 111
energy, 19
  Hopfield network, 51f.
  travelling salesman problem, 53
EPROM, 117
error, 19
  back-propagation, 34
  competitive learning, 60
  learning, 43
  perceptron, 28
  quadratic, 34
  test, 43
error measure, 43
excitation, 16
external input, 20
eye, 69

F
face recognition, 99
feature extraction, 57, 99
feed-forward network, 17f., 20, 33, 35, 42, 45f.
FET, 118
fine-grain parallelism, 111
FORGY, 61
forward kinematics, 85

G
Gaussian, 91
generalised delta rule, 33, 35f.
general learning, 87
gradient descent, 28, 34, 37, 40
granularity of parallelism, 111

H
hard limiting activation function, 17
Heaviside, 23
Hebb rule, 18, 25, 52, 67, 121
  normalised, 67
Hessian, 41
high level vision, 97
holographic correlators, 116
Hopfield network, 50, 94, 119
  as associative memory, 52
    instability of stored patterns, 52
    spurious stable states, 52
  energy, 51f.
  graded response neurons, 52
  optimisation, 53
  stable limit points, 51
  stable neuron in, 51
  stable pattern in, 51
  stable state in, 51
  stable storage algorithm, 52
  stochastic update, 54
  symmetry, 52
  un-learning, 52
horizontal cells
  retina, 118
  silicon retina, 119

I
image compression, 98
  back-propagation, 99
  PCA, 99
  self-organising networks, 98
implementation, 109
  analogue, 115-117
  chemical, 115
  connectivity constraints, 115
  digital, 115f.
  on Connection Machine, 113
  optical, 115f.
  silicon retina, 119
indirect learning, 87
information gathering, 15
inhibition, 16
instability of stored patterns, 52
intermediate level vision, 97
inverse kinematics, 85
Ising spin model, 50
ISODATA, 61

J
Jacobian matrix, 88, 90f.
Jordan network, 48

K
Kirchhoff laws, 116
KISS, 90
Kohonen network, 64, 119
  3-dimensional, 90
  for robot control, 90
Kullback information, 55

L
leaky learning, 60
learning, 18, 20, 117
  associative, 18
  general, 87
  indirect, 87
  LNeuro, 121
  self-supervised, 18, 87
  specialised, 88
  supervised, 18
  unsupervised, 18, 87
learning error, 43
learning rate, 18f.
  back-propagation, 37
learning vector quantisation, 64
LEP, 119
linear activation function, 17
linear convergence, 41
linear discriminant function, 24
linear networks, 28
  vision, 99
linear threshold element, 26
LNeuro, 119f.
  activation function, 121
  ALU, 120
  learning, 121
  RAM, 120
local minima
  back-propagation, 40
look-up table, 16, 63
lossless coding, 98
lossy coding, 98
low level vision, 97
LVQ2, 64

M
Markov random field, 105
MARS, 63
mean vector, 68
MIMD, 111
MIT, 112
mobile robots, 94
momentum, 37
multi-layer perceptron, 54

N
neocognitron, 98, 100
Nestor, 109
NETtalk, 45
network paralysis
  back-propagation, 39
NeuralWare, 109
neuro-computers, 115
nexus, 112
non-cooperative algorithm, 104
normalisation, 67
notation, 19

O
octree methods, 63
offset, 20
Oja learning rule, 68
optical implementation, 115f.
optimisation, 53
oscillation in back-propagation, 37
output vs. activation of a unit, 19

P
panther
  hiding, 69
  resting, 69
parallel distributed processing, 13, 15
parallelism
  coarse-grain, 111
  fine-grain, 111
PCA, 66
  image compression, 99
PDP, 13, 15
Perceptron, 23
perceptron, 13, 18, 23f., 26, 29, 31
  convergence theorem, 24
  error, 28
  learning rule, 24f.
  threshold, 25
  vision, 97
photo-receptor
  retina, 118
  silicon retina, 118f.
positive definite, 41
principal components, 66
PROM, 117
prototype vectors, 66
PYGMALION, 109

R
RAM, 109, 117
recurrent networks, 17, 47
  Elman network, 48-50
  Jordan network, 48
reinforcement learning, 75
relaxation, 17
representation, 20
representation vs. learning, 20
resistor, 116
retina, 98
  bipolar cells, 118
  horizontal cells, 118
  photo-receptor, 118
  retinal ganglion, 118
  structure, 118
  triad synapses, 118
retinal ganglion, 118
robotics, 85
  dynamics, 86, 92
  forward kinematics, 85
  inverse kinematics, 85
  trajectory generation, 86
Rochester Connectionist Simulator, 109
ROM, 117

S
self-organisation, 18, 57
self-organising networks, 57
  image compression, 98
  vision, 98
self-supervised learning, 18, 87
semi-linear activation function, 17
sgn function, 23
sigma-pi unit, 16
sigma unit, 16
sigmoid activation function, 17, 36, 39
  derivative of, 36
silicon retina, 105, 117
  bipolar cells, 119
  horizontal cells, 119
  implementation, 119
  photo-receptor, 118f.
SIMD, 111f.
simulated annealing, 54
simulation, 109
  taxonomy, 109
specialised learning, 88
spurious stable states, 52
stable limit points, 51
stable neuron, 51
stable pattern, 51
stable state, 51
stable storage algorithm, 52
steepest descent, 91
  convergence, 41
stochastic update, 17, 54
summed squared error, 28, 34
supervised learning, 18
synchronous update, 16
systolic, 114
systolic arrays, 111, 114

T
target, 28
temperature, 54
terminology, 19
test error, 43
Thinking Machines Corporation, 112
threshold, 19f.
topologies, 17
topology-conserving map, 57, 65, 109
training, 18
trajectory generation, 86
transistor, 118
transputer, 109, 111
travelling salesman problem, 53
  energy, 53
triad synapses, 118

U
understanding back-propagation, 35f.
universal approximation theorem, 33
unsupervised learning, 18, 57f., 87
update of a unit, 15
  asynchronous, 16, 50
  stochastic, 54
  synchronous, 16

V
vector quantisation, 57f., 61
vision, 97
  high level, 97
  intermediate level, 97
  low level, 97

W
Warp, 109, 111, 114
Widrow-Hoff rule, 18
winner-take-all, 58

X
XOR problem, 29f.