
A Brief Introduction to

Neural Networks

David Kriesel www.dkriesel.com


In remembrance of Dr. Peter Kemp, Notary (ret.), Bonn, Germany.



A Little Preface

"Originally, this paper has been prepared in the framework of a seminar of the University of Bonn in Germany, but it has been and will be extended (after being delivered and published online under www.dkriesel.com on 5/27/2005). First and foremost, to provide a comprehensive overview of the subject of neural networks and, secondly, just to acquire more and more knowledge about LaTeX. And who knows – maybe one day this summary will become a real preface!"

Abstract of this paper, end of 2005

The above abstract has not yet become a preface but, after all, a little preface, since the extended paper (then 40 pages long) turned out to be a download hit.

Ambition and Intention of this Manuscript

At the time I promised to continue the paper chapter by chapter, and this is the result. In the meantime I changed the structure of the document a bit so that it is easier for me to add a chapter or to vary the outline.

The entire text is written and laid out more effectively and with more illustrations than before. I did all the illustrations myself, most of them directly in LaTeX by using XYpic. They reflect what I would have liked to see when becoming acquainted with the subject: text and illustrations should be memorable and easy to understand in order to offer as many people as possible access to the field of neural networks.

Nevertheless, mathematically and formally skilled readers will be able to understand the definitions without reading the running text, while the reverse is true for readers only interested in the subject matter; everything is explained in both colloquial and formal language. Please let me know if you find that I have acted against this principle.

The Sections of This Work are Mostly Independent of Each Other

The document itself is divided into different parts, which are again divided into chapters. Although the chapters contain cross-references, they are also individually accessible to readers with little previous knowledge. There are larger and smaller chapters: while the larger chapters should provide profound insight into a paradigm of neural networks (e.g. the classic neural network structure: the perceptron and its learning procedures), the smaller chapters give a short overview – but this is also explained in the introduction of each chapter. In addition to all the definitions and explanations I have included some excursuses to provide interesting information not directly related to the subject.

When this document was still a term paper, I never kept secret that some of the content is, among other things, inspired by [Zel94], a book I liked a lot, but only one book. In the meantime I have added many other sources and a lot of experience, so this statement is no longer appropriate. What I would like to say is that this script is – in contrast to a book – intended for those readers who think that a book is too comprehensive and who want to get a general idea of, and an easy approach to, the field. But it is not intended to be what is often passed off as lecture notes: it is not simply copied from a book.

Unfortunately, I was not able to find free German sources that are multi-faceted in respect of content (concerning the paradigms of neural networks) and, nevertheless, written in a coherent style. The aim of this work is (even if it could not be fulfilled at first go) to close this gap bit by bit and to provide easy access to the subject.

Terms of Use and License

From the epsilon edition on, the text is licensed under the Creative Commons Attribution-No Derivative Works 3.0 Unported License[1], except for some small portions of the work licensed under more liberal licenses as mentioned (mainly some figures from Wikimedia Commons). A quick license summary:

1. You are free to redistribute this document (even though it is a much better idea to just distribute the URL of my home page, for it always contains the most recent version of the text).

2. You may not modify, transform, or build upon the document except for personal use.

3. You must maintain the author's attribution of the document at all times.

4. You may not use the attribution to imply that the author endorses you or your use of the document.

Since I'm no lawyer, the above bullet-point summary is just informational: if there is any conflict in interpretation between the summary and the actual license, the actual license always takes precedence. Note that this license does not extend to the source files used to produce the document. Those are still mine.

[1] http://creativecommons.org/licenses/by-nd/3.0/


How to Cite this Manuscript

There is no official publisher, so you need to be careful with your citation. Please find more information in English and German on my home page, or more precisely on the subpage concerning the manuscript[2].

It's easy to print this manuscript

This paper is completely illustrated in color, but it can also be printed as is in monochrome: the colors of figures, tables and text are well-chosen so that, in addition to an appealing design, they are still easy to distinguish when printed in monochrome.

There are many tools directly integrated into the text

Different tools are directly integrated into the document to make reading more flexible: but anyone (like me) who prefers reading words on paper rather than on screen can also enjoy some features.

[2] http://www.dkriesel.com/en/science/neural_networks

In the Table of Contents, Different Types of Chapters are Marked

Different types of chapters are directly marked within the table of contents. Chapters that are marked as "fundamental" are definitely ones to read, because almost all subsequent chapters heavily depend on them. Other chapters additionally depend on information given in other (preceding) chapters, which is then also marked in the table of contents.

Speaking Headlines throughout the Text, Short Ones in the Table of Contents

The whole manuscript is now pervaded by such headlines. Speaking headlines are not just title-like ("Reinforcement Learning"), but condense the information given in the associated section into a single sentence. In the named instance, an appropriate headline would be "Reinforcement learning methods provide feedback to the network, whether it behaves well or badly". However, such long headlines would bloat the table of contents in an unacceptable way. So I used short titles like the first one in the table of contents, and speaking ones, like the latter, throughout the text.

Marginal notes are a navigational aid

The entire document contains marginal notes in colloquial language (see the example in the margin: "Hypertext on paper :-)"), allowing you to "skim" the document quickly to find a certain passage in the text (including the titles).

New mathematical symbols are marked by specific marginal notes for easy finding (see the example for x in the margin).

There are several kinds of indexing

This document contains different types of indexing: if you have found a word in the index and opened the corresponding page, you can easily find it by searching for highlighted text – all indexed words are highlighted like this.

Mathematical symbols appearing in several chapters of this document (e.g. Ω for an output neuron; I tried to maintain a consistent nomenclature for regularly recurring elements) are separately indexed under "Mathematical Symbols", so they can easily be assigned to the corresponding term.

Names of persons written in small caps are indexed in the category "Persons" and ordered by their last names.

Acknowledgement

Now I would like to express my gratitude to all the people who contributed, in whatever manner, to the success of this work, since a paper like this needs many helpers. First of all, I want to thank the proofreaders of this paper, who helped me and my readers very much. In alphabetical order: Wolfgang Apolinarski, Kathrin Grave, Paul Imhoff, Thomas Kuhn, Christoph Kunze, Malte Lohmeyer, Joachim Nock, Daniel Plohmann, Daniel Rosenthal, Christian Schulz and Tobias Wilken.

Additionally, I want to thank the readers Dietmar Berger, Igor Buchmuller, Marie Christ, Julia Damaschek, Maximilian Ernestus, Hardy Falk, Anne Feldmeier, Sascha Fink, Andreas Friedmann, Jan Gassen, Markus Gerhards, Sebastian Hirsch, Andreas Hochrath, Nico Hoft, Thomas Ihme, Boris Jentsch, Tim Hussein, Thilo Keller, Mario Krenn, Mirko Kunze, Maikel Linke, Adam Maciak, Benjamin Meier, David Moller, Andreas Muller, Rainer Penninger, Matthias Siegmund, Mathias Tirtasana, Oliver Tischler, Maximilian Voit, Igor Wall, Achim Weber, Frank Weinreis, Gideon Maillette de Buij Wenniger, Philipp Woock and many others for their feedback, suggestions and remarks.

Especially, I would like to thank Beate Kuhl for translating the entire paper from German to English, and for her questions, which made me think about rephrasing some paragraphs and improving their understandability.

I would particularly like to thank Prof. Rolf Eckmiller and Dr. Nils Goerke as well as the entire Division of Neuroinformatics, Department of Computer Science of the University of Bonn – they all made sure that I always learned (and also had to learn) something new about neural networks and related subjects. Especially Dr. Goerke has always been willing to respond to any questions I was not able to answer myself during the writing process. Conversations with Prof. Eckmiller made me step back from the whiteboard to get a better overall view on what I was doing and what I should do next.

Globally, and not only in the context of this paper, I want to thank my parents, who never get tired of buying me specialized and therefore expensive books and who have always supported me in my studies.

For many "remarks" and the very special and cordial atmosphere ;-) I want to thank Andreas Huber and Tobias Treutler. Since our first semester it has rarely been boring with you!

Now I would like to think back to my school days and cordially thank some teachers who (in my opinion) imparted some scientific knowledge to me – although my class participation had not always been wholehearted: Mr. Wilfried Hartmann, Mr. Hubert Peters and Mr. Frank Nokel.

Further I would like to thank the whole team at the notary's office of Dr. Kemp and Dr. Kolb in Bonn, where I have always felt in good hands and who have helped me to keep my printing costs low – in particular Christiane Flamme and Dr. Kemp!

Thanks also go to Wikimedia Commons, from which I took some (few) images and altered them to fit this paper.

Last but not least I want to thank two persons who have rendered outstanding services to both the paper and me and who occupy, so to speak, a place of honor: my girlfriend Verena Thomas, who found many mathematical and logical errors in my paper and discussed them with me, although she has lots of other things to do, and Christiane Schultze, who carefully reviewed the paper for spelling and inconsistencies.

What better could have happened to me than having your accuracy and precision on my side?

David Kriesel


Contents

A Little Preface

I From Biology to Formalization – Motivation, Philosophy, History and Realization of Neural Models

1 Introduction, Motivation and History
  1.1 Why Neural Networks?
    1.1.1 The 100-step rule
    1.1.2 Simple application examples
  1.2 History of Neural Networks
    1.2.1 The beginning
    1.2.2 Golden age
    1.2.3 Long silence and slow reconstruction
    1.2.4 Renaissance
  Exercises

2 Biological Neural Networks
  2.1 The Vertebrate Nervous System
    2.1.1 Peripheral and central nervous system
    2.1.2 Cerebrum
    2.1.3 Cerebellum
    2.1.4 Diencephalon
    2.1.5 Brainstem
  2.2 The Neuron
    2.2.1 Components
    2.2.2 Electrochemical processes in the neuron
  2.3 Receptor Cells
    2.3.1 Various sorts
    2.3.2 Information processing within the nervous system
    2.3.3 Light sensing organs
  2.4 The Amount of Neurons in Living Organisms
  2.5 Technical Neurons as Caricature of Biology
  Exercises

3 Components of Artificial Neural Networks (fundamental)
  3.1 The Concept of Time in Neural Networks
  3.2 Components of Neural Networks
    3.2.1 Connections
    3.2.2 Propagation function and network input
    3.2.3 Activation
    3.2.4 Threshold value
    3.2.5 Activation function
    3.2.6 Output function
    3.2.7 Learning strategy
  3.3 Network topologies
    3.3.1 Feedforward
    3.3.2 Recurrent networks
    3.3.3 Completely linked networks
  3.4 The bias neuron
  3.5 Representing neurons
  3.6 Orders of activation
    3.6.1 Synchronous activation
    3.6.2 Asynchronous activation
  3.7 Input and output of data
  Exercises

4 How to Train a Neural Network? (fundamental)
  4.1 Paradigms of Learning
    4.1.1 Unsupervised learning
    4.1.2 Reinforcement learning
    4.1.3 Supervised learning
    4.1.4 Offline or online learning?
    4.1.5 Questions in advance
  4.2 Training Patterns and Teaching Input
  4.3 Using Training Examples
    4.3.1 Division of the training set
    4.3.2 Order of pattern representation
  4.4 Learning Curve and Error Measurement
    4.4.1 When do we stop learning?
  4.5 Gradient Optimization Procedures
    4.5.1 Problems of gradient procedures
  4.6 Problem Examples
    4.6.1 Boolean functions
    4.6.2 The parity function
    4.6.3 The 2-spiral problem
    4.6.4 The checkerboard problem
    4.6.5 The identity function
  4.7 Hebbian Rule
    4.7.1 Original rule
    4.7.2 Generalized form
  Exercises

II Supervised Learning Network Paradigms

5 The Perceptron
  5.1 The Single-layer Perceptron
    5.1.1 Perceptron learning algorithm and convergence theorem
  5.2 Delta Rule
  5.3 Linear Separability
  5.4 The Multi-layer Perceptron
  5.5 Backpropagation of Error
    5.5.1 Boiling backpropagation down to the delta rule
    5.5.2 Selecting a learning rate
    5.5.3 Initial configuration of a multi-layer perceptron
    5.5.4 Variations and extensions to backpropagation
  5.6 The 8-3-8 encoding problem and related problems
  Exercises

6 Radial Basis Functions
  6.1 Components and Structure
  6.2 Information Processing of an RBF Network
    6.2.1 Information processing in RBF neurons
    6.2.2 Analytical thoughts prior to the training
  6.3 Training of RBF Networks
    6.3.1 Centers and widths of RBF neurons
  6.4 Growing RBF Networks
    6.4.1 Adding neurons
    6.4.2 Limiting the number of neurons
    6.4.3 Deleting neurons
  6.5 Comparing RBF Networks and Multi-layer Perceptrons
  Exercises

7 Recurrent Perceptron-like Networks (depends on Chapter 5)
  7.1 Jordan Networks
  7.2 Elman Networks
  7.3 Training Recurrent Networks
    7.3.1 Unfolding in time
    7.3.2 Teacher forcing
    7.3.3 Recurrent backpropagation
    7.3.4 Training with evolution

8 Hopfield Networks
  8.1 Inspired by Magnetism
  8.2 Structure and Functionality
    8.2.1 Input and output of a Hopfield network
    8.2.2 Significance of weights
    8.2.3 Change in the state of neurons
  8.3 Generating the Weight Matrix
  8.4 Autoassociation and Traditional Application
  8.5 Heteroassociation and Analogies to Neural Data Storage
    8.5.1 Generating the heteroassociative matrix
    8.5.2 Stabilizing the heteroassociations
    8.5.3 Biological motivation of heteroassociation
  8.6 Continuous Hopfield Networks
  Exercises

9 Learning Vector Quantization
  9.1 About Quantization
  9.2 Purpose of LVQ
  9.3 Using Codebook Vectors
  9.4 Adjusting Codebook Vectors
    9.4.1 The procedure of learning
  9.5 Connection to Neural Networks
  Exercises

III Unsupervised Learning Network Paradigms

10 Self-organizing Feature Maps
  10.1 Structure
  10.2 Functionality and Output Interpretation
  10.3 Training
    10.3.1 The topology function
    10.3.2 Monotonically decreasing learning rate and neighborhood
  10.4 Examples
    10.4.1 Topological defects
  10.5 Resolution Dose and Position-dependent Learning Rate
  10.6 Application
    10.6.1 Interaction with RBF networks
  10.7 Variations
    10.7.1 Neural gas
    10.7.2 Multi-SOMs
    10.7.3 Multi-neural gas
    10.7.4 Growing neural gas
  Exercises

11 Adaptive Resonance Theory
  11.1 Task and Structure of an ART Network
    11.1.1 Resonance
  11.2 Learning Process
    11.2.1 Pattern input and top-down learning
    11.2.2 Resonance and bottom-up learning
    11.2.3 Adding an output neuron
  11.3 Extensions

IV Excursi, Appendices and Registers

A Excursus: Cluster Analysis and Regional and Online Learnable Fields
  A.1 k-Means Clustering
  A.2 k-Nearest Neighbouring
  A.3 ε-Nearest Neighbouring
  A.4 The Silhouette Coefficient
  A.5 Regional and Online Learnable Fields
    A.5.1 Structure of a ROLF
    A.5.2 Training a ROLF
    A.5.3 Evaluating a ROLF
    A.5.4 Comparison with popular clustering methods
    A.5.5 Initializing radii, learning rates and multiplier
    A.5.6 Application examples
  Exercises

B Excursus: Neural Networks Used for Prediction
  B.1 About Time Series
  B.2 One-Step-Ahead Prediction
  B.3 Two-Step-Ahead Prediction
    B.3.1 Recursive two-step-ahead prediction
    B.3.2 Direct two-step-ahead prediction
  B.4 Additional Optimization Approaches for Prediction
    B.4.1 Changing temporal parameters
    B.4.2 Heterogeneous prediction
  B.5 Remarks on the Prediction of Share Prices

C Excursus: Reinforcement Learning
  C.1 System Structure
    C.1.1 The gridworld
    C.1.2 Agent and environment
    C.1.3 States, situations and actions
    C.1.4 Reward and return
    C.1.5 The policy
  C.2 Learning Process
    C.2.1 Rewarding strategies
    C.2.2 The state-value function
    C.2.3 Monte Carlo method
    C.2.4 Temporal difference learning
    C.2.5 The action-value function
    C.2.6 Q learning
  C.3 Example Applications
    C.3.1 TD gammon
    C.3.2 The car in the pit
    C.3.3 The pole balancer
  C.4 Reinforcement Learning in Connection with Neural Networks
  Exercises

Bibliography

List of Figures

List of Tables

Index


Part I

From Biology to Formalization – Motivation, Philosophy, History and Realization of Neural Models


Chapter 1

Introduction, Motivation and History

How to teach a computer? You can either write a rigid program – or you can enable the computer to learn on its own. Living beings don't have any programmer writing a program for developing their skills, which then only has to be executed. They learn by themselves – without the initial experience of external knowledge – and thus can solve problems better than any computer today. What qualities are needed to achieve such behavior for devices like computers? Can such cognition be adapted from biology? History, development, decline and resurgence of a wide approach to solving problems.

1.1 Why Neural Networks?

There are problem categories that cannot be formulated as an algorithm – problems that depend on many subtle factors, for example the purchase price of a real estate, which our brain can (approximately) calculate. Without an algorithm a computer cannot do the same. Therefore the question to be asked is: How do we learn to explore such problems?

Exactly – we learn; a capability computers obviously do not have. Humans have a brain that can learn. Computers have some processing units and memory. They allow the computer to perform the most complex numerical calculations in a very short time, but they are not adaptive.

If we compare computer and brain[1], we will note that, theoretically, the computer should be more powerful than our brain: it comprises 10^9 transistors with a switching time of 10^-9 seconds. The brain contains 10^11 neurons, but these only have a switching time of about 10^-3 seconds.

The largest part of the brain is working continuously, while the largest part of the computer is only passive data storage. Thus, the brain is parallel and therefore performing close to its theoretical maximum, from which the computer is powers of ten away (Table 1.1). Additionally, a computer is static – the brain as a biological neural network can reorganize itself during its "lifespan" and is therefore able to learn, to compensate for errors and so forth.

[1] Of course, this comparison is – for obvious reasons – controversially discussed by biologists and computer scientists, since response time and quantity do not tell anything about quality and performance of the processing units, and neurons and transistors cannot be compared directly. Nevertheless, the comparison serves its purpose and indicates the advantage of parallelism by means of processing time.

                                 Brain                 Computer
No. of processing units          ≈ 10^11               ≈ 10^9
Type of processing units         Neurons               Transistors
Type of calculation              massively parallel    usually serial
Data storage                     associative           address-based
Switching time                   ≈ 10^-3 s             ≈ 10^-9 s
Possible switching operations    ≈ 10^13 per second    ≈ 10^18 per second
Actual switching operations      ≈ 10^12 per second    ≈ 10^10 per second

Table 1.1: The (flawed) comparison between brain and computer at a glance. Inspired by: [Zel94]

Within this paper I want to outline how we can use these brain characteristics for a computer system.

So the study of artificial neural networks is motivated by their similarity to successfully working biological systems, which – compared to the overall system – consist of very simple but numerous nerve cells that work massively in parallel and (which is probably one of the most significant aspects) have the capability to learn. There is no need to explicitly program a neural network. For instance, it can learn from training examples or by means of encouragement – with a carrot and a stick, so to speak (reinforcement learning).

One result of this learning procedure is the capability of neural networks to generalize and associate data: after successful training a neural network can find reasonable solutions for similar problems of the same class that were not explicitly trained. This in turn results in a high degree of fault tolerance against noisy input data.

Fault tolerance is closely related to biological neural networks, in which this characteristic is very distinct: as previously mentioned, a human has about 10^11 neurons that continuously reorganize themselves or are reorganized by external influences (complete drunkenness destroys about 10^5 neurons, and some types of food or environmental influences can also destroy brain cells). Nevertheless, our cognitive abilities are not significantly affected. Thus, the brain is tolerant of internal errors – and also of external errors, for we can often read a really "dreadful scrawl" although the individual letters are nearly impossible to read.

Our modern technology, however, is not automatically fault-tolerant. I have never heard that someone forgot to install the hard disk controller in a computer and the graphics card therefore automatically took over its tasks, i.e. removed conductors and developed communication, so that the system as a whole was affected by the missing component but not completely destroyed.

A disadvantage of this distributed fault-tolerant storage is certainly the fact that we cannot realize at first sight what a neural network knows and performs or where its faults lie. Usually, it is easier to perform such analyses for conventional algorithms. Most often we can only transfer knowledge into our neural network by means of a learning procedure, which can cause several errors and is not always easy to manage.

Fault tolerance of data, on the other hand, is already more sophisticated in state-of-the-art technology: let us compare a record and a CD. If there is a scratch on a record, the audio information on this spot will be completely lost (you will hear a pop) and then the music goes on. On a CD the audio data are stored in a distributed way: a scratch causes a blurry sound in its vicinity, but the data stream remains largely unaffected. The listener won't notice anything.

So let us summarize the main characteristics we try to adapt from biology:

- Self-organization and learning capability,

- Generalization capability and

- Fault tolerance.

What types of neural networks particularly develop what kinds of abilities and can be used for which problem classes will be discussed in the course of this paper.

In this introductory chapter I want to clarify the following: "the neural network" does not exist. There are different paradigms for neural networks, how they are trained and where they are used. It is my aim to introduce some of these paradigms and supplement them with some remarks on practical application.

We have already mentioned that our brain works massively in parallel, in contrast to the work of a computer, i.e. every component is active at any time. If we want to state an argument for massive parallel processing, then the 100-step rule can be cited.

1.1.1 The 100-step rule

Experiments have shown that a human can recognize the picture of a familiar object or person in ≈ 0.1 seconds, which at a neuron switching time of ≈ 10^-3 seconds corresponds to ≈ 100 discrete time steps of parallel processing.

A computer following the von Neumann architecture, however, can do practically nothing in 100 time steps of sequential processing, which would be 100 assembler steps or clock cycles.

Now we want to look at a simple application example for a neural network.


Figure 1.1: A small robot with eight sensors and two motors. The arrow indicates the driving direction.

1.1.2 Simple application examples

Let us assume that we have a small robot as shown in Fig. 1.1. This robot has eight distance sensors from which it extracts input data: three sensors are placed on the front right, three on the front left, and two on the back. Each sensor provides a real numeric value at any time, which means we are always receiving an input I ∈ R^8.

Despite its two motors (which will be needed later) the robot in our simple example is not capable of doing much: it shall only drive on, but stop when it might collide with an obstacle. Thus, our output is binary: H = 0 for "Everything is okay, drive on" and H = 1 for "Stop" (the output is called H for "halt signal"). Therefore we need a mapping

    f : R^8 → B^1,

that maps the input signals to a robot activity.

1.1.2.1 The classical way

There are two ways of realizing this mapping. On the one hand, there is the classical way: we sit down and think for a while, and finally the result is a circuit or a small computer program which realizes the mapping (this is easily possible, since the example is very simple). After that we refer to the technical reference of the sensors, study their characteristic curves in order to learn the values for the different obstacle distances, and embed these values into the above-mentioned set of rules. Such procedures are applied in classical artificial intelligence, and if you know the exact rules of a mapping algorithm, you are always well advised to follow this scheme.
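To make the classical approach concrete, here is a minimal sketch in Python. It only illustrates the idea of a hand-written rule: the choice of front sensor indices and the stopping threshold are invented values for illustration, not taken from the robot described above.

    # Classical, hand-coded rule: stop if any front sensor reports an obstacle
    # closer than a fixed threshold. All concrete numbers are assumptions.
    FRONT_SENSORS = [0, 1, 2, 3, 4, 5]   # assumed: the first six readings are the front sensors
    STOP_DISTANCE = 0.2                  # assumed stopping threshold (in metres)

    def halt_signal(sensors):
        """Map eight real sensor readings to the binary halt signal H."""
        return 1 if any(sensors[i] < STOP_DISTANCE for i in FRONT_SENSORS) else 0

    print(halt_signal([1.0, 0.8, 0.15, 0.9, 1.2, 1.1, 0.5, 0.6]))  # -> 1: obstacle ahead, stop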

1.1.2.2 The way of learning

More interesting, and more successful for many mappings and problems that are hard to comprehend at first go, is the way of learning: we show different possible situations to the robot (Fig. 1.2), and the robot shall learn on its own what to do in the course of its robot life.

In this example the robot shall simply learn when to stop. To begin with, we treat the neural network as a kind of black box (Fig. 1.3); this means we do not know its structure but just regard its behavior in practice.

Figure 1.3: Initially, we regard the robot control as a black box whose inner life is unknown. The black box receives eight real sensor values and maps these values to a binary output value.

The situations, in the form of simply measured sensor values (e.g. placing the robot in front of an obstacle, see illustration), which we show to the robot and for which we specify whether to drive on or to stop, are called training examples. Thus, a training example consists of an exemplary input and a corresponding desired output. Now the question is how to transfer this knowledge, the information, into the neural network.

The examples can be taught to a neural network by using a simple learning procedure (a learning procedure is a simple algorithm or a mathematical formula). If we have done everything right and chosen good examples, the neural network will generalize from these examples and find a universal rule for when it has to stop.
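To sketch what such a learning procedure might look like in code – assuming, purely for illustration, a single threshold neuron trained with a perceptron-style learning rule on a handful of invented sensor/halt pairs – consider the following Python fragment. The training data, the learning rate and the number of epochs are made-up values and not part of the robot example above.

    import random

    # Invented training examples: (eight sensor readings, desired halt signal H).
    examples = [
        ([1.0, 0.9, 1.2, 1.1, 0.8, 1.0, 0.7, 0.9], 0),  # free space: drive on
        ([0.1, 0.2, 0.1, 0.9, 1.0, 1.1, 0.8, 0.7], 1),  # obstacle at front right: stop
        ([0.9, 1.0, 1.1, 0.1, 0.2, 0.1, 0.6, 0.8], 1),  # obstacle at front left: stop
        ([0.8, 0.9, 1.0, 0.9, 0.8, 1.0, 0.1, 0.1], 0),  # obstacle only behind: drive on
    ]

    weights = [random.uniform(-0.5, 0.5) for _ in range(8)]
    bias = 0.0
    eta = 0.1  # learning rate (assumed)

    def predict(sensors):
        net = sum(w * s for w, s in zip(weights, sensors)) + bias
        return 1 if net > 0 else 0

    # Perceptron-style learning: nudge the weights whenever the output is wrong.
    for epoch in range(100):
        for sensors, target in examples:
            error = target - predict(sensors)
            for i in range(8):
                weights[i] += eta * error * sensors[i]
            bias += eta * error

    print([predict(s) for s, _ in examples])  # ideally [0, 1, 1, 0] after training

If the training succeeds, the network has not memorized the four situations but has found a weighting of the sensors that also classifies similar, unseen situations – which is exactly the generalization described above.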

Our example can optionally be expanded. For the purpose of direction control it would be possible to control the two motors of our robot separately[2], with the sensor layout remaining the same. In this case we are looking for a mapping

    f : R^8 → R^2,

which gradually controls the two motors by means of the sensor inputs and thus can not only, for example, stop the robot but also let it avoid obstacles. Here it is more difficult to mentally derive the rules, and de facto a neural network would be more appropriate.

Our aim is not to learn the examples by heart, but to realize the principle behind them: ideally, the robot should apply the neural network in any situation and be able to avoid obstacles. In particular, the robot should query the network continuously and repeatedly while driving in order to continuously avoid obstacles. The result is a constant cycle: the robot queries the network. As a consequence, it will drive in one direction, which changes the sensor values. Again the robot queries the network and changes its position, the sensor values change once again, and so on. It is obvious that this system can also be adapted to dynamic, i.e. changing, environments (e.g. moving obstacles in our example).
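Expressed as a control loop – again only a sketch, in which names such as read_sensors(), network() and set_motors() are hypothetical placeholders rather than a real robot API – this cycle might look as follows:

    # Hypothetical sense-act loop: the robot repeatedly queries the network while driving.
    def drive_forever(network, read_sensors, set_motors):
        while True:
            sensors = read_sensors()         # eight real distance values
            left, right = network(sensors)   # the learned mapping f: R^8 -> R^2
            set_motors(left, right)          # acting changes the next sensor readings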

[2] There is a robot called Khepera with more or less similar characteristics. It is round-shaped, approx. 7 cm in diameter, and has two motors with wheels and various sensors. For more information I recommend referring to the internet.


Figure 1.2: The robot is positioned in a landscape that provides sensor values for different situations. We add the desired output values H and so receive our learning examples. The directions in which the sensors are oriented are shown exemplarily on two robots.

1.2 A Brief History of Neural Networks

The field of neural networks has, like any other field of science, a long history of development with many ups and downs, as we will see soon. To continue the style of my work, I will not present this history in text form but more compactly in the form of a timeline. Citations and bibliographical references are mostly given for those topics that will not be further discussed in this paper; citations for keywords that will be explained later are mentioned in the corresponding chapters.

The history of neural networks begins in the early 1940s and thus nearly simultaneously with the history of programmable electronic computers. The youth of this field of research, as with the field of computer science itself, can easily be recognized by the fact that many of the cited persons are still with us.

1.2.1 The beginning

As early as 1943, Warren McCulloch and Walter Pitts introduced models of neurological networks, recreated threshold switches based on neurons and showed that even simple networks of this kind are able to calculate nearly any logic or arithmetic function [MP43]. Furthermore, the first computer precursors ("electronic brains") were developed, among others supported by Konrad Zuse, who was tired of calculating ballistic trajectories by hand.

Figure 1.4: Some institutions of the field of neural networks. From left to right: John von Neumann, Donald O. Hebb, Marvin Minsky, Bernard Widrow, Seymour Papert, Teuvo Kohonen, John Hopfield, "in the order of appearance" as far as possible.

1947: Walter Pitts and Warren McCulloch indicated a practical field of application (which was not mentioned in their work from 1943), namely the recognition of spatial patterns by neural networks [PM47].

1949: Donald O. Hebb formulated the classical Hebbian rule [Heb49], which represents in its more generalized form the basis of nearly all neural learning procedures. The rule implies that the connection between two neurons is strengthened when both neurons are active at the same time. This change in strength is proportional to the product of the two activities. Hebb could postulate this rule, but due to the absence of neurological research he was not able to verify it.
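Written as a formula – using generic symbols here, since the precise notation is only introduced later in the chapter on learning procedures – one common way to state the rule is that the change of the weight w_{i,j} between two neurons i and j is proportional to the product of their activations a_i and a_j, scaled by a learning rate η:

    Δw_{i,j} = η · a_i · a_j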

1950: The neuropsychologist Karl Lashley defended the thesis that brain information storage is realized as a distributed system. His thesis was based on experiments on rats, where only the extent but not the location of the destroyed nerve tissue influenced the rats' performance in finding their way out of a labyrinth.

1.2.2 Golden age

1951: For his dissertation Marvin Minsky developed the neurocomputer Snark, which was already capable of adjusting its weights[3] automatically. But it was never practically put to use, since although it was capable of busily calculating, nobody really knew what it calculated.

1956: Well-known scientists and ambitious students met at the Dartmouth Summer Research Project and discussed, to put it crudely, how to simulate a brain. Differences between top-down and bottom-up research emerged: while the early supporters of artificial intelligence wanted to simulate capabilities by means of software, supporters of neural networks wanted to achieve system behavior by imitating the smallest parts of the system – the neurons.

[3] We will learn soon what weights are.

1957-1958: At the MIT, Frank Rosenblatt, Charles Wightman and their coworkers developed the first successful neurocomputer, the Mark I perceptron, which was capable of recognizing simple numerals by means of a 20 × 20 pixel image sensor and worked electromechanically with 512 motor-driven potentiometers, each potentiometer representing one variable weight.

1959: Frank Rosenblatt described different versions of the perceptron, and formulated and verified his perceptron convergence theorem. He described neuron layers mimicking the retina, threshold switches, and a learning rule adjusting the connecting weights.

1960: Bernard Widrow and Marcian E. Hoff introduced the ADALINE (ADAptive LInear NEuron) [WH60], a fast and precisely learning adaptive system that was the first widely commercially used neural network: it could be found in nearly every analog telephone for real-time adaptive echo filtering and was trained by means of the Widrow-Hoff rule or delta rule. At that time Hoff, later co-founder of Intel Corporation and known as an inventor of the modern microprocessor, was a PhD student of Widrow. One advantage the delta rule had over the original perceptron learning algorithm was its adaptivity: if the difference between the actual output and the correct solution was large, the connecting weights also changed in larger steps, and in ever smaller steps as the target came closer. A disadvantage: misapplication led to infinitesimally small steps close to the target. In the later period of stagnation, and out of fear that neural networks would become scientifically unpopular, the ADALINE was renamed adaptive linear element – which was undone again later on.
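In its simplest form – again with generic symbols, anticipating the detailed treatment of the delta rule in the perceptron chapter – the weight change for an input x_i is proportional to the difference between the desired output t (the teaching input) and the actual output y:

    Δw_i = η · (t − y) · x_i

so that large errors lead to large weight changes and small errors to correspondingly small ones.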

1961: Karl Steinbuch introduced technical realizations of associative memory, which can be seen as predecessors of today's neural associative memories [Ste61]. Additionally, he described concepts for neural techniques and analyzed their possibilities and limits.

1965: In his book Learning Machines, Nils Nilsson gave an overview of the progress and works of this period of neural network research. It was assumed that the basic principles of self-learning and therefore, generally speaking, "intelligent" systems had already been discovered. Today this assumption seems to be an exorbitant overestimation, but at that time it provided for high popularity and sufficient research funds.

1969: Marvin Minsky and Seymour Papert published a precise mathematical analysis of the perceptron [MP69] to show that the perceptron model was not capable of representing many important problems (keywords: XOR problem and linear separability), and so put an end to overestimation, popularity and research funds. The implication that more powerful models would show exactly the same problems and the forecast that the entire field would be a research dead end resulted in a nearly complete decline in research funds for the next 15 years – no matter how incorrect these forecasts were from today's point of view.

1.2.3 Long silence and slow reconstruction

Research funds were, as previously mentioned, extremely scarce. Research went on everywhere, but there were neither conferences nor other events and therefore only few publications. This isolation of individual researchers led to many independently developed neural network paradigms: they researched, but there was no discourse among them.

In spite of the poor appreciation the field received, the basic theories for the still continuing renaissance were laid at that time:

1972: Teuvo Kohonen introduced a model of the linear associator, a model of an associative memory [Koh72]. In the same year, such a model was presented independently and from a neurophysiologist's point of view by James A. Anderson [And72].

1973: Christoph von der Malsburg used a neuron model that was non-linear and biologically more motivated [vdM73].

1974: For his dissertation at Harvard, Paul Werbos developed a learning procedure called backpropagation of error [Wer74], but it was not until one decade later that this procedure gained today's importance.

1976-1980 and thereafter: Stephen Grossberg presented many papers (for instance [Gro76]) in which numerous neural models were analyzed mathematically. Furthermore, he dedicated himself to the problem of keeping a neural network capable of learning without destroying already learned associations. In cooperation with Gail Carpenter this led to models of adaptive resonance theory (ART).

1982: Teuvo Kohonen described the self-organizing feature maps (SOM) [Koh82, Koh98] – also known as Kohonen maps. He was looking for the mechanisms of self-organization in the brain (he knew that the information about the creation of a being is stored in the genome, which does not, however, have enough memory for a structure like the brain; as a consequence, the brain has to organize and create itself for the most part).

John Hopfield also invented the so-called Hopfield networks [Hop82], which are inspired by the laws of magnetism in physics. They were not widely used in technical applications, but the field of neural networks slowly regained importance.

1983: Fukushima, Miyake and Ito introduced the neural model of the Neocognitron, which could recognize handwritten characters [FMI83] and was an extension of the Cognitron network already developed in 1975.

1.2.4 Renaissance

Through the influence of John Hopfield, who had personally convinced many researchers of the importance of the field, and the wide publication of backpropagation by Rumelhart, Hinton and Williams, the field of neural networks slowly showed signs of an upswing.

1985: John Hopfield published an article describing a way of finding acceptable solutions for the Travelling Salesman problem by using Hopfield nets.

1986: The backpropagation of error learning procedure, as a generalization of the delta rule, was separately developed and widely published by the Parallel Distributed Processing Group [RHW86a]: non-linearly separable problems could now be solved by multi-layer perceptrons, and Marvin Minsky's negative evaluations were disproved at a single blow. At the same time a certain kind of fatigue spread in the field of artificial intelligence, caused by a series of failures and unfulfilled hopes.

From this time on, the development of the field of research has been almost explosive. It can no longer be itemized, but some of its results will be seen in the following.

Exercises

Exercise 1: Indicate one example for each of the following topics:

- A book on neural networks or neuroinformatics,
- A collaborative group at a university working with neural networks,
- A software tool realizing neural networks ("simulator"),
- A company using neural networks, and
- A product or service realized by means of neural networks.

Exercise 2: Indicate at least four applications of technical neural networks: two from the field of pattern recognition and two from the field of function approximation.


Exercise 3: Briefly characterize the four development phases of neural networks and give illustrative examples for each phase.


Chapter 2

Biological Neural Networks

How do biological systems solve problems? How does a system of neurons work? How can we understand its functionality? What are different quantities of neurons able to do? Where in the nervous system is information processed? A short biological overview of the complexity of simple elements of neural information processing, followed by some thoughts about their simplification in order to adapt them technically.

Before we begin to describe the technical side of neural networks, it would be useful to briefly discuss the biology of neural networks and the cognition of living organisms – the reader may skip the following chapter without missing any technical information. On the other hand, I recommend reading this excursus if you want to learn something about the underlying neurophysiology and see that our small approaches, the technical neural networks, are only caricatures of nature – and how powerful their natural counterparts must be when our small approaches are already that effective. Now we want to take a brief look at the nervous system of vertebrates: we will start with a very rough granularity and then proceed to the brain and the neural level. For further reading I want to recommend the books [CR00, KSJ00], which helped me a lot during this chapter.

2.1 The Vertebrate Nervous System

The entire information processing system, i.e., the vertebrate nervous system, consists of the central nervous system and the peripheral nervous system, which is only a first and simple subdivision. In reality, such a rigid subdivision does not make sense, but here it is helpful to outline the information processing in a body.

2.1.1 Peripheral and central nervous system

The peripheral nervous system (PNS) comprises the nerves that are situated outside of the brain or the spinal cord. These nerves form a branched and very dense network throughout the whole body. The peripheral nervous system includes, for example, the spinal nerves, which pass out of the spinal cord (two at the level of each vertebra of the spine) and supply extremities, neck and trunk, but also the cranial nerves, which lead directly to the brain.

The central nervous system (CNS), however, is the "mainframe" within the vertebrate. It is the place where information received by the sense organs is stored and managed. Furthermore, it controls the inner processes in the body and, last but not least, coordinates the motor functions of the organism. The vertebrate central nervous system consists of the brain and the spinal cord (Fig. 2.1). Here, however, we want to focus on the brain, which can – for the purpose of simplification – be divided into four areas (Fig. 2.2) to be discussed here.

2.1.2 The cerebrum is responsible for abstract thinking processes.

The cerebrum (telencephalon) is one of the areas of the brain that changed most during evolution. Along an axis running from the lateral face to the back of the head this area is divided into two hemispheres, which are organized in a folded structure. These cerebral hemispheres are connected by one strong nerve cord ("bar") and several small ones. A large number of neurons are located in the cerebral cortex (cortex), which is approx. 2-4 mm thick and divided into different cortical fields, each having a specific task to fulfill. Primary cortical fields are responsible for processing qualitative information, such as the management of different perceptions (e.g. the visual cortex is responsible for the management of vision). Association cortical fields, however, perform more abstract association and thinking processes; they also contain our memory.

Figure 2.1: Illustration of the central nervous system with spinal cord and brain.

Figure 2.2: Illustration of the brain. The colored areas of the brain are discussed in the text. The more we turn from abstract information processing to direct reflexive processing, the darker the areas of the brain are colored.

2.1.3 The cerebellum controls andcoordinates motor functions

The little brain (cerebellum) is located below the cerebrum, therefore it is closer to the spinal cord. Accordingly, it serves less abstract functions with higher priority: Here, large parts of motor coordination are performed, i.e., balance and movements are controlled and errors are continually corrected. For this purpose, the cerebellum has direct sensory information about muscle lengths as well as acoustic and visual information. Furthermore, it also receives messages about more abstract motor signals coming from the cerebrum.

In the human brain the cerebellum is considerably smaller than the cerebrum, but this is rather an exception. In many vertebrates this ratio is less pronounced. If we take a look at vertebrate evolution, we will recognize that the cerebellum is not ”too small” but the cerebrum is ”too large” (at least, it is the most highly developed structure in the vertebrate brain). The two remaining brain areas should also be briefly discussed: the diencephalon and the brainstem.

2.1.4 The diencephalon controls fundamental physiological processes

The interbrain (diencephalon) includes parts of which only the thalamus will be briefly discussed: This part of the diencephalon mediates between sensory and motor signals and the cerebrum. In particular, the thalamus decides which part of the information is transferred to the cerebrum, so that especially less important sensory perceptions can be suppressed at short notice to avoid overloads. Another part of the diencephalon is the hypothalamus, which controls a number of processes within the body. The diencephalon is also heavily involved in the human circadian rhythm (”internal clock”) and the sensation of pain.

2.1.5 The brainstem connects the brain with the spinal cord and controls reflexes.

In comparison with the diencephalon the brainstem or truncus cerebri, respectively, is phylogenetically much older. Roughly speaking, it is the ”extended spinal cord” and thus the connection between brain and spinal cord. The brainstem can also be divided into different areas, some of which will be introduced exemplarily in this chapter. The functions will be discussed from abstract functions towards more fundamental ones. One important component is the pons (= bridge), a kind of way station for many nerve signals from brain to body and vice versa.

If the pons is damaged (e.g. by a cerebral infarct), the result could be the locked-in syndrome – a condition in which a patient is ”walled in” within his own body. He is conscious and aware with no loss of cognitive function, but cannot move or communicate by any means. Only his senses of sight, hearing, smell and taste generally work perfectly normally. Locked-in patients may often be able to communicate with others by blinking or moving their eyes.

Furthermore, the brainstem is responsible for many fundamental reflexes, such as the blinking reflex or coughing.

All parts of the nervous system have one thing in common: information processing. This is accomplished by means of huge accumulations of billions of very similar cells, whose structure is very simple but which communicate continuously. Large groups of these cells send coordinated signals and thus reach the enormous information processing capacity we are familiar with from our brain. We will now leave the level of brain areas and continue with the cellular level of the body – the level of neurons.

2.2 Neurons Are Information Processing Cells

Before specifying the functions and pro-cesses within a neuron, we will give arough description of neuron functions: Aneuron is nothing more than a switch withinformation input and output. The switchwill be activated if there are enough stim-uli of other neurons hitting the informa-tion input. Then, at the information out-put, a pulse is sent to other neurons, forexample.

2.2.1 Components of a neuron

Now we want to take a look at the components of a neuron (Fig. 2.3 on the right page). In doing so, we will follow the way the electrical information takes within the neuron. The dendrites of a neuron receive the information by special connections, the synapses.

Figure 2.3: Illustration of a biological neuron with the components discussed in this text.

2.2.1.1 Synapses weight the individual parts of information

Incoming signals from other neurons orcells are transferred to a neuron by specialconnections, the synapses. For the mostpart, such a connection can be found atthe dendrites of a neuron, sometimes alsodirectly at the Soma. We distinguish be-tween electrical and chemical synapses.

The electrical synapse is the simpler variant. An electrical signal received by the synapse, i.e. coming from the presynaptic side, is directly transferred to the postsynaptic nucleus of the cell. Thus, there is a direct, strong, unadjustable connection between the signal transmitter and the signal receiver, which is, for example, relevant to shortening reactions that must be ”hard coded” within a living organism.

The chemical synapse is the more distinctive variant. Here, the electrical coupling of source and target does not take place; the coupling is interrupted by the synaptic cleft. This cleft electrically separates the presynaptic side from the postsynaptic one. You might think that, nevertheless, the information has to flow, so we will discuss how this happens: It is not an electrical but a chemical process. On the presynaptic side of the synaptic cleft the electrical signal is converted into a chemical signal, a process induced by chemical cues released there (the so-called neurotransmitters). These neurotransmitters cross the synaptic cleft and transfer the information into the nucleus of the cell (this is a very simple explanation, but later on we will see how exactly this works), where it is reconverted into electrical information. The neurotransmitters are degraded very fast, so that it is possible to release very precise information pulses here, too.

In spite of its more complex functioning, the chemical synapse has – compared with the electrical synapse – utmost advantages:

Single way connection: A chemicalsynapse is a single way connection.Due to the fact that there is no directelectrical connection between thepre- and postsynaptic area, electricalpulses in the postsynaptic areacannot flash over to the pre-synapticarea.

Adjustability: There are a large number ofdifferent neurotransmitters that canalso be released in various quanti-ties in a synaptic cleft. There areneurotransmitters that stimulate thepostsynaptic cell nucleus, and oth-ers that make such stimulation slowdown. Some synapses transfer astrongly stimulating signal, some onlyweakly stimulating ones. The adjusta-bility varies a lot, and one of the cen-tral points in the examination of thelearning ability of the brain is thathere the synapses are variable, too.That is, over time they can form astronger or weaker connection.

2.2.1.2 Dendrites collect all parts of information

Dendrites ramify like trees from the cell nucleus of the neuron (which is called soma) and receive electrical signals from many different sources, which are then transferred into the nucleus of the cell. The branching structure of the dendrites is also called the dendrite tree.

2.2.1.3 In the soma the weighted information is accumulated

After the cell nucleus (soma) has received plenty of activating (= stimulating) and inhibiting (= diminishing) signals by synapses or dendrites, the soma accumulates these signals. As soon as the accumulated signal exceeds a certain value (called the threshold value), the cell nucleus of the neuron activates an electrical pulse which is then transmitted to the neurons connected to the current one.

2.2.1.4 The axon transfers outgoing pulses

The pulse is transferred to other neurons by means of the axon. The axon is a long, slender extension of the soma. In an extreme case an axon can stretch up to one meter (e.g. within the spinal cord). The axon is electrically insulated in order to achieve a better conduction of the electrical signal (we will return to this point later on) and it leads to dendrites, which transfer the information to, for example, other neurons. So now we are back at the beginning of our description of the neuron elements.

Remark: An axon can, however, also transfer information to other kinds of cells in order to control them.

2.2.2 Electrochemical processes in the neuron and its components

After having pursued the path of an electrical signal from the dendrites via the synapses to the nucleus of the cell and from there via the axon into other dendrites, we now want to take a small step from biology towards technology. In doing so, a simplified introduction to the electrochemical information processing will be provided.

2.2.2.1 Neurons maintain an electrical membrane potential

One fundamental aspect is the fact thatcompared to their environment the neu-rons show a difference in electrical charge,a potential. In the membrane (=enve-lope) of the neuron the charge is differentfrom the charge on the outside. This differ-ence in charge is a central concept impor-tant to understand the processes withinthe neuron. The difference is called mem-brane potential. The membrane poten-tial, i.e., the difference in charge, is createdby several kinds of loaded atoms (ions),whose concentration varies within and out-side of the neuron. If we penetrate the

membrane from the inside outwards, wewill find certain kinds of ions more oftenor less often than on the inside. This de-scent or ascent of concentration is called aconcentration gradient.

Let us first take a look at the membrane potential in the resting state of the neuron, i.e., we assume that no electrical signals are received from the outside. In this case, the membrane potential is −70 mV. Since we have learned that this potential depends on the concentration gradients of various ions, there is of course the central question of how to maintain these concentration gradients: Normally, diffusion predominates and therefore each ion is eager to decrease concentration gradients and to spread out evenly. If this happens, the membrane potential will move towards 0 mV, so finally there would be no membrane potential anymore. Thus, the neuron actively maintains its membrane potential to be able to process information. How does this work?

The secret is the membrane itself, which ispermeable to some ions, but not for others.To maintain the potential, various mecha-nisms are in progress at the same time:

Concentration gradient: As described above, the ions try to be as uniformly distributed as possible. If the concentration of an ion is higher on the inside of the neuron than on the outside, it will try to diffuse to the outside, and vice versa. The positively charged ion K+ (potassium) occurs very frequently within the neuron but less frequently outside of the neuron, and therefore it slowly diffuses out through the neuron's membrane. But another collection of negative ions, called A−, remains within the neuron since the membrane is not permeable to them. Thus, the inside of the neuron becomes negatively charged. Negative A ions remain, positive K ions disappear, and so the inside of the cell becomes more negative. The result is another gradient.

Electrical gradient: The electrical gradient acts contrary to the concentration gradient. The intracellular charge is now very negative, and therefore it attracts positive ions: K+ wants to get back into the cell.

If these two gradients were now left to take care of themselves, they would eventually balance out, reach a steady state, and a membrane potential of −85 mV would develop. But we want to achieve a resting membrane potential of −70 mV, thus there seem to exist some disturbances which prevent this. Furthermore, there is another important ion, Na+ (sodium), to which the membrane is not very permeable but which, however, slowly pours through the membrane into the cell. As a result, the sodium feels all the more driven into the cell: On the one hand, there is less sodium within the neuron than outside the neuron. On the other hand, sodium is positive but the interior of the cell is negative, which is a second reason for the sodium to want to get into the cell.

Due to the low diffusion of sodium into the cell the intracellular sodium concentration increases. But at the same time the inside of the cell becomes less negative, so that K+ pours in more slowly (we can see that this is a complex mechanism where everything is influenced by everything else). The sodium shifts the intracellular equilibrium from negative to less negative, compared with its environment. But even with these two ions a standstill with all gradients being balanced out could still be achieved. Now the last piece of the puzzle comes into play: a ”pump” – or rather, a transport protein driven by ATP – actively transports ions against the direction they actually want to take!

Sodium is actively pumped out of the cell,although it tries to get into the cellalong the concentration gradient andthe electrical gradient.

Potassium , however, diffuses stronglyout of the cell, but is actively pumpedback into it.

For this reason the pump is also calledsodium-potassium pump. The pumpmaintains the concentration gradient forthe sodium as well as for the potassium,so that some sort of steady state equilib-rium is created and finally the resting po-tential is −70 mV as observed. All in allthe membrane potential is maintained bythe fact that the membrane is imperme-able to some ions and other ions are ac-tively pumped against the concentrationand electrical gradients. Now that weknow that each neuron has a membranepotential we want to observe how a neu-ron receives and transmits signals.


2.2.2.2 The neuron is activated by changes in the membrane potential

Above we have learned that sodium and potassium can diffuse through the membrane – sodium slowly, potassium faster. They move through channels within the membrane, the sodium and potassium channels. In addition to these permanently open channels, which are responsible for diffusion and balanced by the sodium-potassium pump, there also exist channels that are not always open but which only respond ”if required”. Since the opening of these channels changes the concentration of ions within and outside of the membrane, it also changes the membrane potential.

These controllable channels are opened assoon as the accumulated received stimu-lus exceeds a certain threshold. For ex-ample, stimuli can be received from otherneurons or have other causes. There exist,for example, specialized forms of neurons,the sense cells, for which a light incidencecould be such a stimulus. If the incom-ing amount of light exceeds the threshold,controllable channels are opened.

The said threshold (the threshold poten-tial) is at about −55 mV. As soon as thereceived stimuli reach this value, the neu-ron is activated and an electrical signal,an action potential, is initiated. Thenthis signal is transmitted to the cells con-nected to the observed neuron, i.e., thecells ”listen” to the neuron. Now we wantto take a closer look at the different stages

of the action potential (Fig. 2.4 on the nextpage):

Resting state: Only the permanentlyopen sodium and potassium channelsare open. The membrane potential isat −70 mV and actively kept thereby the neuron.

Stimulus up to the threshold: A stimulus opens channels so that sodium can pour in. The intracellular charge becomes more positive. As soon as the membrane potential exceeds the threshold of −55 mV, the action potential is initiated by the opening of many sodium channels.

Depolarization: Sodium is pouring in. Re-member: Sodium wants to pour intothe cell because there is a lower in-tracellular than extracellular concen-tration of sodium. Additionally, thecell is dominated by a negative en-vironment which attracts the posi-tive sodium ions. This massive in-flux of sodium drastically increasesthe membrane potential - up to ap-prox. +30 mV - which is the electricalpulse, i.e., the action potential.

Repolarization: Now the sodium channelsare closed and the potassium channelsare opened. The positively chargedion wants to leave the positive inte-rior of the cell. Additionally, the intra-cellular concentration is much higherthan the extracellular one, which in-creases the efflux of ions even more.The interior of the cell is once againmore negatively charged than the ex-terior.


Figure 2.4: Initiation of action potential over time.


Hyperpolarization: Sodium as well aspotassium channels are closed again.At first the membrane potential isslightly more negative than the rest-ing potential. This is due to thefact that the potassium channels closeslower. As a result, (positivelycharged) potassium is effused becauseof its lower extracellular concentra-tion. After a refractory periodof 1 − 2 ms the resting state is re-established so that the neuron canreact to newly applied stimuli withan action potential. In simple terms,the refractory period is a mandatorybreak a neuron has to take in order toregenerate. The shorter this break is,the more often a neuron can fire pertime.

Then the resulting pulse is transmitted bythe axon.

2.2.2.3 In the axon a pulse is conducted in a saltatory way

We have already learned that the axon is used to transmit the action potential across long distances (remember: you will find an illustration of a neuron including an axon in Fig. 2.3 on page 19). The axon is a long, slender extension of the soma. In vertebrates it is normally coated by a myelin sheath that consists of Schwann cells (in the PNS) or oligodendrocytes (in the CNS)1, which insulate the axon very well from electrical activity. At a distance of 0.1–2 mm there are gaps between these cells, the so-called nodes of Ranvier. The said gaps appear where one insulating cell ends and the next one begins. It is obvious that at such a node the axon is less insulated.

Now you may assume that these less insulated nodes are a disadvantage of the axon – however, they are not. At the nodes, mass can be transferred between the intracellular and extracellular area, a transfer that is impossible at those parts of the axon which are situated between two nodes (internodes) and therefore insulated by the myelin sheath. This mass transfer permits the generation of signals similar to the generation of the action potential within the soma. The action potential is transferred as follows: It does not continuously travel along the axon but jumps from node to node. Thus, a series of depolarizations travels along the nodes of Ranvier. One action potential initiates the next one, and mostly even several nodes are active at the same time here. The pulse ”jumping” from node to node gives this kind of pulse conduction its name: saltatory conduction.

Obviously, the pulse will move faster if its jumps are larger. Axons with large internodes (2 mm) achieve a signal dispersion of approx. 180 meters per second.

1 Schwann cells as well as oligodendrocytes are vari-eties of the glial cells. There are about 50 timesmore glial cells than neurons: They surround theneurons (glia = glue), insulate them from eachother, provide energy, etc.


However, the internodes cannot grow in-definitely, since the action potential to betransferred would fade too much until itreaches the next node. So the nodes have atask, too: to constantly amplify the signal.The cells receiving the action potential areattached to the end of the axon – oftenconnected by dendrites and synapses. Asalready indicated above the action poten-tials cannot only be generated by informa-tion received by the dendrites from otherneurons.

2.3 Receptor Cells Are Modified Neurons

Action potentials can also be generated by sensory information an organism receives from its environment by means of sensory cells. Specialized receptor cells are able to perceive specific stimulus energies such as light, temperature and sound or the existence of certain molecules (as, for example, in the sense of smell). This works because these sense cells are actually modified neurons. They do not receive electrical signals via dendrites; rather, the existence of the stimulus specific for the receptor cell ensures that the ion channels open and an action potential is developed. This process of transforming stimulus energy into changes in the membrane potential is called sensory transduction. Usually, the stimulus energy itself is too weak to directly cause nerve signals. Therefore, the signals are amplified either during transduction or by means of the stimulus-conducting apparatus. The resulting action potential can be processed by other neurons and is then transmitted into the thalamus, which is, as we have already learned, a gate to the cerebral cortex and can therefore sort out sensory impressions according to current relevance and thus prevent an abundance of information from having to be managed.

2.3.1 There are different receptor cells for various sorts of perceptions

Primary receptors transmit their pulsesdirectly to the nervous system. A goodexample for this is the sense of pain.Here, the stimulus intensity is propor-tional to the amplitude of the action po-tential. Technically, this is an amplitudemodulation.

Secondary receptors, however, continuously transmit pulses. These pulses control the amount of the related neurotransmitter, which is responsible for transferring the stimulus. The stimulus in turn controls the frequency of the action potential of the receiving neuron. This process is a frequency modulation, an encoding of the stimulus, which allows the increase and decrease of a stimulus to be perceived more precisely.

There can be individual receptor cells orcells forming complex sense organs (e.g.eyes or ears). They can receive stimuliwithin the body (by means of the intero-ceptors) as well as stimuli outside of thebody (by means of the exteroceptors).


After having outlined how information is received from the environment, it will be interesting to look at how this information is processed.

2.3.2 Information is processed on every level of the nervous system

There is no reason to believe that everyinformation received is transmitted to thebrain and processed there, and that thebrain ensures that it is ”output” in theform of motor pulses (the only thing anorganism can actually do within its envi-ronment is to move). The information pro-cessing is entirely decentralized. In orderto illustrate this principle, we want to takea look at some examples, which leads usagain from the abstract to the fundamen-tal in our hierarchy of information process-ing.

. It is certain that information is processed in the cerebrum, which is the most developed natural information processing structure.

. The midbrain and the thalamus,which serves – as we have alreadylearned – as a gate to the cerebralcortex, are much lower down in the hi-erarchy. The filtering of informationwith respect to the current relevanceexecuted by the midbrain is a veryimportant method of information pro-cessing, too. But even the thalamusdoes not receive any pre-processedstimuli from the outside. Now let us

go on with the lowest level, the sen-sory cells.

. On the lowest level, the receptorcells, the information is not only re-ceived and transferred but directlyprocessed. One of the main aspectsof this subject is to prevent the trans-mission of ”continuous stimuli” tothe central nervous system because ofsensory adaptation: Due to contin-uous stimulation many receptor cellsautomatically become insensitive tostimuli. Thus, receptor cells are nota direct mapping of specific stimu-lus energy onto action potentials butdepend on the past. Other sensorschange their sensitivity according tothe situation: There are taste recep-tors which respond more or less to thesame stimulus according to the nutri-tional condition of the organism.

. Even before a stimulus reaches the receptor cells, information processing can already be executed by a preceding signal-carrying apparatus, for example in the form of amplification: The external and the internal ear have a specific shape to amplify the sound, which also allows – in association with the sensory cells of the sense of hearing – the sensory stimulus to increase only logarithmically with the intensity of the heard signal. On closer examination, this is necessary, since the sound pressure of the signals for which the ear is constructed can vary over a wide exponential range. Here, a logarithmic measurement is an advantage: Firstly, an overload is prevented and, secondly, the fact that the intensity measurement of intensive signals is less precise does not matter much. If a jet fighter is starting next to you, small changes in the noise level can be ignored.

Just to get a feel for sense organs and in-formation processing in the organism wewill briefly describe ”usual” light sensingorgans, i.e. organs often found in nature.At the third light sensing organ describedbelow, the single lens eye, we will discussthe information processing in the eye.

2.3.3 An outline of common light sensing organs

For many organisms it turned out to be extremely useful to be able to perceive electromagnetic radiation in certain regions of the spectrum. Consequently, sense organs have been developed which can detect such electromagnetic radiation. Therefore, the wavelength range of the radiation perceivable by the human eye is called the visible range or simply light. The different wavelengths of this electromagnetic radiation are perceived by the human eye as different colors. The visible range of the electromagnetic radiation is different for each organism. Some organisms cannot see the colors (= wavelength ranges) we can see, others can even perceive additional wavelength ranges (e.g. in the UV range). Before we begin with the human being – in order to get a broader knowledge of the sense of sight – we briefly want to look at two organs of sight which, from an evolutionary point of view, have existed much longer than the human one.

2.3.3.1 Complex eyes and pinhole eyes only provide high temporal or spatial resolution

Let us first take a look at the so-called complex eye (Fig. 2.5 on the right page), also called compound eye, which is, for example, common in insects and crustaceans. The complex eye consists of a great number of small, individual eyes. If we look at the complex eye from the outside, the individual eyes are clearly visible and arranged in a hexagonal pattern. Each individual eye has its own nerve fiber which is connected to the insect brain. Since the individual eyes can be distinguished, it is obvious that the number of pixels, i.e. the spatial resolution, of complex eyes must be very low and the image is blurred. But complex eyes have advantages, too, especially for fast-flying insects. Certain complex eyes process more than 300 images per second (to the human eye, however, movies with 25 images per second appear as a fluent motion).

Pinhole eyes are, for example, found in octopus species and work – you can guess it – similarly to a pinhole camera. A pinhole eye has a very small opening for light entry, which projects a sharp image onto the sensory cells behind it. Thus, the spatial resolution is much higher than in the complex eye. But due to the very small opening for light entry the resultant image is less bright.

Figure 2.5: Complex eye of a robber fly.

2.3.3.2 Single lens eyes combine the advantages of the other two eye types, but they are more complex

The light sensing organ common in vertebrates is the single lens eye. The resultant image is a sharp, high-resolution image of the environment at high or variable light intensity. On the other hand it is more complex. Similar to the pinhole eye, the light enters through an opening (pupil) and is projected onto a layer of sensory cells in the eye (retina). But in contrast to the pinhole eye, the size of the pupil can be adapted to the lighting conditions (by means of the iris muscle, which expands or contracts the pupil). These differences in pupil dilation require the image to be actively focused. Therefore, the single lens eye contains an additional adjustable lens.

2.3.3.3 The retina does not only receive information but is also responsible for information processing

The light signals falling on the eye are received by the retina and directly pre-processed by several layers of information-processing cells. We want to briefly discuss the different steps of this information processing, and in doing so we follow the way of the information carried by the light:

Photoreceptors receive the light signal and cause action potentials (there are different receptors for different color components and light intensities). These receptors are the real light-receiving part of the retina and they are sensitive to such an extent that only one single photon falling on the retina can cause an action potential. Then several photoreceptors transmit their signals to one single

bipolar cell. This means that here the information has already been summarized. Finally, the now transformed light signal travels from several bipolar cells2 into

ganglion cells. Various bipolar cells can transmit their information to one ganglion cell. The higher the number of photoreceptors that affect the ganglion cell, the larger the field of perception, the receptive field, covered by the ganglion cell – and the less sharp the image is in the area of this ganglion cell. So the information is already sorted out directly in the retina and the overall image is, for example, blurred in the peripheral field of vision. So far, we have learned about the information processing in the retina only as a top-down structure. Now we want to take a look at the

2 There are different kinds of bipolar cells, as well, but to discuss all of them would go too far here.

horizontal and amacrine cells. These cells are not connected from the front backwards but laterally. They allow the light signals to influence one another laterally directly during the information processing in the retina – a much more powerful method of information processing than compressing and blurring. When the horizontal cells are excited by a photoreceptor, they are able to excite other nearby photoreceptors and at the same time inhibit more distant bipolar cells and receptors. This ensures the clear perception of outlines and bright points. Amacrine cells can further intensify certain stimuli by distributing information from bipolar cells to several ganglion cells or by inhibiting ganglion cells.

These first steps of transmitting visual information to the brain show that information is processed from the first moment it is received and, on the other hand, is processed in parallel within millions of information-processing cells. The system's power and resistance to errors is based upon this massive division of work.

2.4 The Amount of Neurons in Living Organisms on Different Levels of Development

An overview of different organisms and their neural capacity (in large part from [RD05]):

302 neurons are required by the nervous system of a nematode worm, which serves as a popular model organism in biology. Nematodes live in the soil and feed on bacteria.

10^4 neurons make an ant (to simplify matters we neglect the fact that some ant species also can have more or less efficient nervous systems). Due to the use of different attractants and odors, ants are able to engage in complex social behavior and form huge states with millions of individuals. If you regard such an ant state as an individual, it has a cognitive capacity similar to a chimpanzee or even a human.

With 10^5 neurons the nervous system of a fly can be constructed. In three-dimensional space a fly can evade an object in real time, it can land upon the ceiling upside down, and it has a considerable sensory system because of compound eyes, vibrissae, nerves at the end of its legs and much more. Thus, a fly has considerable differential and integral calculus in high dimensions implemented ”in the hardware”. We all know that a fly is not easy to catch. Of course, the bodily functions are also controlled by neurons, but these should be ignored here.

With 0.8 · 10^6 neurons we have enough cerebral matter to create a honeybee. Honeybees build colonies and have amazing capabilities in the field of aerial reconnaissance and navigation.

4 · 10^6 neurons result in a mouse, and here the world of vertebrates already begins.

1.5 · 10^7 neurons are sufficient for a rat, an animal which is reputed to be extremely intelligent and which is often used to participate in a variety of intelligence tests representative for the animal world. Rats have an extraordinary sense of smell and orientation, and they also show social behavior. The brain of a frog can be positioned within the same dimension. The frog has a complex physique with many functions, it can swim and has evolved complex behavior. A frog can continuously target the said fly by means of its eyes while jumping in three-dimensional space and catch it with its tongue with reasonable probability.

5 · 10^7 neurons make a bat. The bat can navigate in total darkness through a space, exact to several centimeters, by only using its sense of hearing. It uses acoustic signals to localize self-camouflaging insects (e.g. some moths have a certain wing structure that reflects less sound waves, so the echo is small) and also eats its prey while flying.

1.6 · 10^8 neurons are required by the brain of a dog, companion of man for ages. Now take a look at another popular companion of man:

3 · 10^8 neurons can be found in a cat, which is about twice as many as in a dog. We know that cats are very elegant, patient carnivores that can show a variety of behaviors. By the way, an octopus can be positioned within the same dimension. Only very few people know that, for example, in labyrinth orientation the octopus is vastly superior to the rat.

For 6 · 10^9 neurons you already get a chimpanzee, one of the animals very similar to the human.

10^11 neurons make a human. Usually, the human has considerable cognitive capabilities, is able to speak, to abstract, to remember and to use tools as well as the knowledge of other humans to develop advanced technologies and manifold social structures.

With 2 · 10^11 neurons there are nervous systems having more neurons than the human nervous system. Here we should mention elephants and certain whale species.


Remark: Our state-of-the-art computersare not able to keep up with the abovementioned processing power of a fly.

2.5 Transition to Technical Neurons: Neural Networks are a Caricature of Biology

How do we change from biological neuralnetworks to the technical ones? By meansof radical simplification. I want to brieflysummarize the conclusions relevant for thetechnical part:

We have learned that the biological neurons are linked to each other in a weighted way and, when stimulated, they electrically transmit their signal via the axon. From the axon the signals are not directly transferred to the succeeding neurons; they first have to cross the synaptic cleft, where the signal is changed again by variable chemical processes. In the receiving neuron the various inputs that have been post-processed in the synaptic cleft are summarized or accumulated to one single pulse. Depending on how the neuron is stimulated by the cumulated input, the neuron itself emits a pulse or not – thus, the output is non-linear and not proportional to the cumulated input. Our brief summary corresponds exactly with the few elements of biological neural networks we want to take over into the technical approximation:

Vectorial input: The input of technical neurons consists of many components, therefore it is a vector. In nature a neuron receives pulses of 10^3 to 10^4 other neurons on average.

Scalar output: The output of a neuron is a scalar, which means that it consists of only one component. Several scalar outputs in turn form the vectorial input of another neuron. This particularly means that somewhere in the neuron the various input components have to be summarized in such a way that only one component remains.

Synapses change input: In technical neu-ral networks the inputs are pre-processed, too. They are multipliedby a number (the weight) – they areweighted. The set of such weights rep-resents the information storage of aneural network – in both biologicaloriginal and technical adaptation.

Accumulating the inputs: In biology, theinputs are summarized to a pulse ac-cording to the chemical change, i.e.,they are accumulated – on the techni-cal side this is often realized by theweighted sum, which we will get toknow later on. This means that afteraccumulation we continue with onlyone value, a scalar, instead of a vec-tor.

Non-linear characteristic: The input ofour technical neurons is also not pro-portional to the output.


Adjustable weights: The weights weight-ing the inputs are variable, similar tothe chemical processes at the synap-tic cleft. This adds a great dynamicto the network because a large part ofthe ”knowledge” of a neural networkis saved in the weights and in the formand power of the chemical processesin a synaptic cleft.

So our current, only casually formulated and very simple neuron model receives a vectorial input x with components xi. These components are multiplied by the appropriate weights wi and accumulated:

    ∑i wi xi.

The above-mentioned term is called the weighted sum. Then the non-linear mapping f defines the scalar output y:

    y = f(∑i wi xi).

After this transition we now want to specify our neuron model more precisely and add some odds and ends. Afterwards we will take a look at how the weights can be adjusted.
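To make this casually formulated model a bit more tangible, here is a minimal sketch in Python; the function name and the choice of the Fermi function as the non-linear mapping f are illustrative assumptions, not part of the formal model:

    import math

    def neuron_output(x, w):
        """Simple technical neuron: weighted sum of the input vector x
        with the weight vector w, followed by a non-linear mapping f."""
        weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
        # Non-linear mapping f; the logistic (Fermi) function is used here as an example.
        return 1.0 / (1.0 + math.exp(-weighted_sum))

    print(neuron_output([1.0, 0.5], [0.8, -0.3]))  # scalar output y in (0, 1)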

Exercises

Exercise 4: It is estimated that a human brain consists of approx. 10^11 nerve cells, each of which has about 10^3 to 10^4 synapses. For this exercise we assume 10^3 synapses per neuron. Let us further assume that a single synapse could save 4 bits of information. Naively calculated: How much storage capacity does the brain have? Note: The information which neuron is connected to which other neuron is also important.


Chapter 3

Components of Artificial Neural Networks

Formal definitions and colloquial explanations of the components that realize the technical adaptations of biological neural networks. Initial descriptions of how to combine these components into a neural network.

This chapter contains the formal definitions for most of the neural network components used later in the text. After this chapter you will be able to read the individual chapters of this paper without knowing the preceding ones (although this would be useful).

Remark (on definitions): In the following definitions, especially in the function definitions, I indicate the input elements or the target element (e.g. netj or oi) instead of the usual pre-image set or target set (e.g. R or C) – this is necessary and also easier to write and to understand, since the form of these elements can be chosen nearly arbitrarily. But usually these values are numeric.

3.1 The Concept of Time in Neural Networks

In some definitions of this paper we use the term time or the number of cycles of the neural network, respectively. Time is divided into discrete time steps:

Definition 3.1 (The concept of time): The current time (present time) is referred to as (t), the next time step as (t + 1), the preceding one as (t − 1). Any other time steps are referred to analogously. If in the following chapters several mathematical variables (e.g. netj or oi) refer to a certain point of time, the notation will be, for example, netj(t − 1) or oi(t).

Remark: From a biological point of view this is, of course, not very plausible (in the human brain a neuron does not wait for another one), but it significantly simplifies the implementation.

3.2 Components of Neural Networks

A technical neural network consists of simple processing units, the neurons, which are connected by directed, weighted connections. Here, the strength of a connection (or the connecting weight) between two neurons i and j is referred to as wi,j.1

Definition 3.2 (Neural network): A neural network is a sorted triple (N, V, w) with two sets N, V and a function w, where N is the set of neurons and V a set {(i, j) | i, j ∈ N} whose elements are called connections between neuron i and neuron j. The function w : V → R defines the weights, where w((i, j)), the weight of the connection between neuron i and neuron j, is shortly referred to as wi,j. Depending on the point of view it is either undefined or 0 for connections that do not exist in the network.

Remark: So the weights can be implemented in a square weight matrix W or, optionally, in a weight vector W, with the row number of the matrix indicating where the connection begins and the column number of the matrix indicating which neuron is the target. Indeed, in this case the numeric 0 marks a non-existing connection. This matrix representation is also called a Hinton diagram.2

1 Note: In some of the cited literature i and j could be interchanged in wi,j. Here, a common standard does not exist, but in this paper I try to use the notation I found more frequently and in the more significant citations.
2 Note here, too, that in some of the cited literature axes and lines could be interchanged. The published literature is not consistent here, as well.
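To illustrate this matrix representation, here is a small sketch (hypothetical names, plain Python lists) in which the row index marks the neuron where a connection begins and the column index marks its target; 0 marks a non-existing connection:

    # Square weight matrix W for a toy network with neurons 1, 2 and 3.
    # W[i][j] holds w_{i,j}, the weight of the connection from neuron i to neuron j
    # (0-based lists, so the indices are shifted by one).
    W = [
        [0.0, 0.7,  0.0],   # neuron 1 -> neuron 2 with weight 0.7
        [0.0, 0.0, -1.2],   # neuron 2 -> neuron 3 with weight -1.2 (inhibitory)
        [0.0, 0.0,  0.0],   # neuron 3 has no outgoing connections
    ]

    def weight(i, j):
        """Return w_{i,j} for 1-based neuron indices i and j."""
        return W[i - 1][j - 1]

    print(weight(2, 3))  # -1.2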

Figure 3.1: Data processing of a neuron. The data input of other neurons is transformed by the propagation function (often the weighted sum) into the network input; the activation function transforms the network input and sometimes the old activation into the new activation; and the output function (often the identity function) transforms the activation into the data output to other neurons. The activation function of a neuron implies the threshold value.

The neurons and connections comprise the following components and variables (I'm following the path of the data within a neuron, which is, according to fig. 3.1, in top-down direction):

3.2.1 Connections conduct Information that is processed by Neurons

Data are transferred between neurons viaconnections with the connecting weight be-ing either excitatory or inhibitory. Thedefinition of connections has already beenincluded in the definition of the neural net-work.


3.2.2 The Propagation Function converts vector inputs to scalar Network Inputs

Looking at a neuron j, most of the time we will find a lot of neurons with a connection to j, i.e. which transfer their output to j.

For a neuron j the propagation function receives the outputs oi1, . . . , oin of other neurons i1, i2, . . . , in (which are connected to j) and transforms them, in consideration of the connecting weights wi,j, into the network input netj that can be used by the activation function.

Definition 3.3 (Propagation function): Let I = {i1, i2, . . . , in} be the set of neurons such that ∀z ∈ {1, . . . , n} : ∃wiz,j. Then the propagation function of a neuron j is defined as

    fprop : {oi1, . . . , oin} × {wi1,j, . . . , win,j} → netj.    (3.1)

Here the weighted sum is very popular: the multiplication of the output of each neuron i by wi,j, and the addition of the results:

    netj = ∑i∈I (oi · wi,j).    (3.2)

The network input is the result of the propagation function and is already included in the afore-mentioned definition. But for reasons of completeness, let us set the following definition:

Definition 3.4 (Network input): Let j be a neuron and let the assumptions from the propagation function definition be given. Then the network input of j, called netj, is calculated by means of the propagation function:

    netj = fprop(oi1, . . . , oin, wi1,j, . . . , win,j).    (3.3)
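A sketch of the weighted sum as propagation function could look as follows (the function and parameter names are assumptions; the outputs and weights are passed as parallel lists):

    def f_prop(outputs, weights):
        """Weighted sum: computes the network input net_j from the outputs o_i
        of the predecessor neurons and the connecting weights w_{i,j}."""
        return sum(o_i * w_ij for o_i, w_ij in zip(outputs, weights))

    net_j = f_prop([1.0, 0.5, 2.0], [0.5, -1.0, 0.25])
    print(net_j)  # 1.0*0.5 - 0.5*1.0 + 2.0*0.25 = 0.5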

3.2.3 The Activation is the ”switching status” of a Neuron

Based on the model of nature, every neuron is always active, excited, or whatever you want to call it, to a certain extent. The reactions of the neurons to the input values depend on this activation state. The activation state indicates the extent of a neuron's activation and is often shortly referred to as activation. Its formal definition is included in the following definition of the activation function. But generally, it can be defined as follows:

Definition 3.5 (Activation state / activation in general): Let j be a neuron. The activation state aj, shortly called activation, is explicitly assigned to j; it indicates the extent of the neuron's activity and results from the activation function. Provided that the definition of the activation function is given, the activation state of j is defined as

    aj(t) = fact(netj(t), aj(t − 1), Θj).

Thereby the variable Θ stands for the threshold value explained in the following.


3.2.4 Neurons get activated if the network input exceeds their threshold value

When centered around the threshold value, the activation function of a neuron reacts particularly sensitively. From the biological point of view the threshold value represents the threshold at which a neuron starts firing. The threshold value is also widely included in the definition of the activation function, but generally the definition is the following:

Definition 3.6 (General threshold value): Let j be a neuron. The threshold value Θj is explicitly assigned to j and marks the position of the maximum gradient value of the activation function.

3.2.5 The Activation Function determines the activation of a Neuron dependent on network input and threshold value

At a certain time – as we have already learned – the activation aj of a neuron j depends on the prior3 activation state of the neuron and the external input.

Definition 3.7 (Activation function): Let j be a neuron. The activation function is defined as

    fact : netj(t) × aj(t − 1) × Θj → aj(t).    (3.4)

It transforms the network input netj as well as the previous activation state aj(t − 1) into a new activation state aj(t), with the threshold value Θ playing an important role, as already mentioned.

Remark: Unlike the other variables within the neural network (particularly unlike the ones defined so far) the activation function is often defined globally for all neurons or at least for a set of neurons, and only the threshold values are different for each neuron. We should also keep in mind that the threshold values can be changed, for example by a learning procedure. So it can become particularly necessary to relate the threshold value to the time and to write, for instance, Θj as Θj(t) (but for reasons of clarity, I omitted this here). The activation function is also called the transfer function.

3 The prior activation is not always relevant for the current one – we will see examples for both variants.

Example activation functions: The simplest activation function is the binary threshold function (fig. 3.2 on the right page), which can only take two values (it is also referred to as the Heaviside function). If the input is above a certain threshold, the function changes from one value to another, but otherwise remains constant. This implies that the function is not differentiable at the threshold and that the derivative is 0 everywhere else. Due to this fact, backpropagation learning, for example, is impossible (as we will see later). Also very popular is the Fermi function or logistic function (fig. 3.2)

    f(x) = 1 / (1 + e^(−x)),    (3.5)

with a range of values of (0, 1), and the hyperbolic tangent (fig. 3.2) with a range of values of (−1, 1). Both functions are differentiable. The Fermi function can be expanded by a temperature parameter T into the form

    f(x) = 1 / (1 + e^(−x/T)),    (3.6)

and the smaller this parameter is, the more it compresses the function on the x axis. Thus, it can be arbitrarily approximated to the Heaviside function. Incidentally, there exist activation functions which are not explicitly defined but depend on the input according to a random distribution (stochastic activation functions).
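The three example activation functions can be written down directly; the following sketch (assumed names) also includes the temperature variant of the Fermi function:

    import math

    def heaviside(net):
        """Binary threshold function: jumps from 0 to 1 at the origin."""
        return 1.0 if net >= 0.0 else 0.0

    def fermi(net, T=1.0):
        """Logistic (Fermi) function with temperature parameter T, range (0, 1).
        The smaller T is, the steeper the function, approaching the Heaviside function."""
        return 1.0 / (1.0 + math.exp(-net / T))

    def tanh_act(net):
        """Hyperbolic tangent, range (-1, 1)."""
        return math.tanh(net)

    print(heaviside(-0.1), fermi(2.0), fermi(2.0, T=0.1), tanh_act(1.0))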

3.2.6 An Output Function may be used to process the activation once again

The output function of a neuron j calculates the values which are transferred to the other neurons connected to j. More formally:

Definition 3.8 (Output function): Let j be a neuron. The output function

    fout : aj → oj    (3.7)

then calculates the output value oj of the neuron j from its activation state aj.

Remark: Generally, the output function is defined globally, too.

Example: Often this function is the identity, i.e. the activation aj is directly output4:

    fout(aj) = aj, so oj = aj.    (3.8)

4 Other definitions of output functions may be useful if the range of values of the activation function is not sufficient.

Figure 3.2: Various popular activation functions, from top to bottom: Heaviside or binary threshold function, Fermi function, hyperbolic tangent. The Fermi function was expanded by a temperature parameter; the original Fermi function is represented by dark colors, and the temperature parameters of the modified Fermi functions are, from outward to inward, 1/2, 1/5, 1/10 and 1/25.

Unless explicitly specified otherwise, we will use the identity as output function within this paper.
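Putting the pieces of fig. 3.1 together, one data-processing step of a single neuron j might be sketched as follows (hypothetical names; the Fermi function, shifted by the threshold value, serves as the activation function and the identity as the output function – both illustrative choices):

    import math

    def neuron_step(outputs, weights, theta):
        """One data-processing step of neuron j:
        propagation function -> activation function -> output function."""
        net_j = sum(o * w for o, w in zip(outputs, weights))   # weighted sum (propagation)
        a_j = 1.0 / (1.0 + math.exp(-(net_j - theta)))         # Fermi function around threshold theta
        # (the previous activation a_j(t-1) is not needed for this particular choice of f_act)
        o_j = a_j                                              # identity as output function
        return net_j, a_j, o_j

    print(neuron_step([1.0, 0.0, 0.5], [0.6, -0.4, 0.8], theta=0.5))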

3.2.7 Learning strategies adjust a Network to fit our needs

Since we will address this subject in detail later and at first want to get to know the principles of neural network structures, I will only provide a brief and general definition here:

Definition 3.9 (General learning rule): The learning strategy is an algorithm that can be used to change the neural network so that the network can be trained to produce a desired output for a given input.

3.3 Network topologies

After we have become acquainted with thestructure of neural network components,I want to give an overview of the usualtopologies (= designs) of neural networks,i.e. to construct networks consisting of thecomponents. Every topology described inthis paper is illustrated by a map and itsHinton diagram so that the reader can im-mediately see the characteristics and applythem to other networks.

In the Hinton diagram the dotted-drawnweights are represented by light grey fields,the solid-drawn ones by dark grey fields.The input and output arrows, which wereadded for reasons of clarity, cannot befound in the Hinton diagram. In order to

clarify that the connections are betweenthe line neurons and the column neurons,I have inserted the small arrow in theupper-left cell.

3.3.1 Feedforward networks consist of layers and connections towards each following layer

In this paper, feedforward networks (fig. 3.3 on the right page) are the networks we will explore first (even if we will use different topologies later). The neurons are grouped in the following layers: one input layer, n hidden processing layers (invisible from the outside, which is why the neurons are also referred to as hidden) and one output layer. In a feedforward network each neuron in one layer has only directed connections to the neurons of the next following layer (towards the output layer). In fig. 3.3 on the right page the connections permitted for a feedforward network are represented by solid lines. We will often be confronted with feedforward networks in which every neuron i is connected to all neurons of the next following layer (these layers are then called completely linked). To prevent naming conflicts the output neurons are often referred to as Ω.

Definition 3.10 (Feedforward network): The neuron layers of a feedforward network (fig. 3.3 on the right page) are clearly separated: one input layer, one output layer and one or more processing layers which are invisible from the outside (also called hidden layers). Connections are only permitted to neurons of the next following layer.
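A completely linked feedforward network of the kind shown in fig. 3.3 can be written down directly as a weight matrix in which only the blocks above the diagonal (input to hidden, hidden to output) are occupied. A small sketch under these assumptions:

    import random

    # Neurons ordered as i1, i2, h1, h2, h3, Omega1, Omega2 (list indices 0..6).
    layers = [[0, 1], [2, 3, 4], [5, 6]]      # input, hidden and output layer
    n = 7
    W = [[0.0] * n for _ in range(n)]         # 0 marks a non-existing connection

    # Completely linked: every neuron is connected to all neurons of the next layer.
    for lower, upper in zip(layers, layers[1:]):
        for i in lower:
            for j in upper:
                W[i][j] = random.uniform(-1.0, 1.0)

    # Crude textual Hinton diagram: every non-zero entry lies above the diagonal.
    for row in W:
        print("".join("x" if w != 0.0 else "." for w in row))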

3.3.1.1 ShortCut connections skip layers

Some feedforward networks permit so-called shortcut connections (fig. 3.4 on the next page): connections that skip one or more levels. These connections may only be directed towards the output layer, too.

Definition 3.11 (Feedforward network with shortcut connections): Similar to the feedforward network, but the connections may not only be directed towards the next following layer but also towards any other subsequent layer.

3.3.2 Recurrent networks have influence on themselves

Recurrence is defined as the process of aneuron influencing itself by any means orby any connection. Recurrent networks donot always have explicitly defined input oroutput neurons. Therefore in the figuresI omitted all markings that concern thismatter and only numbered the neurons.

3.3.2.1 Direct recurrences start and end at the same neuron

Some networks allow for neurons to be connected to themselves, which is called direct recurrence (or sometimes self-recurrence, fig. 3.5 on the next page). As a result, neurons inhibit and therefore strengthen themselves in order to reach their activation limits.


Figure 3.3: A feedforward network with three layers: two input neurons, three hidden neurons and two output neurons. Characteristic for the Hinton diagram of completely linked feedforward networks is the block building above the diagonal line.



Figure 3.4: A feedforward network with shortcut connections, which are represented by solid lines. On the right side of the feedforward blocks, new connections have been added to the Hinton diagram.


Figure 3.5: A network similar to a feedforward network with directly recurrent neurons. The direct recurrences are represented by solid lines and exactly correspond to the diagonal in the Hinton diagram matrix.

Definition 3.12 (Direct recurrence): We expand the feedforward network by connecting a neuron j to itself, with the weights of these connections being referred to as wj,j. In other words: the diagonal of the weight matrix W may be unequal to 0.


3.3.2.2 Indirect recurrences can influence their starting neuron only by making detours

If connections are allowed towards the input layer, they are called indirect recurrences. Then a neuron j can influence itself indirectly, for example by influencing the neurons of the next layer, which in turn influence j (fig. 3.6).

Definition 3.13 (Indirect recurrence): Again our network is based on a feedforward network, but now additional connections between neurons and their preceding layer are allowed. Therefore, entries below the diagonal of W may be unequal to 0.

3.3.2.3 Lateral recurrences connect neurons within one layer

Connections between neurons within one layer are called lateral recurrences (fig. 3.7). Here, each neuron often inhibits the other neurons of the layer and strengthens itself. As a result only the strongest neuron becomes active (winner-takes-all scheme).

Definition 3.14 (Lateral recurrence): A laterally recurrent network permits connections within one layer.


Figure 3.6: A network similar to a feedforward network with indirectly recurrent neurons. The indirect recurrences are represented by solid lines. As we can see, connections to the preceding layers can exist here, too. The fields that are symmetric to the feedforward blocks in the Hinton diagram are now occupied.



Figure 3.7: A network similar to a feedforward network with laterally recurrent neurons. The lateral recurrences are represented by solid lines. Here, recurrences only exist within the layer. In the Hinton diagram, filled squares are concentrated around the diagonal at the height of the feedforward blocks, but the diagonal itself is left uncovered.

3.3.3 Completely linked networks allow for every possible connection

Completely linked networks permit connections between all neurons, except for direct recurrences. Furthermore, the connections must be symmetric (fig. 3.8). A popular example are the self-organizing maps, which will be introduced in chapter 10.
Definition 3.15 (Complete interconnection): In this case, every neuron is always allowed to be connected to every other neuron – but as a result every neuron can become an input neuron. Therefore, direct recurrences normally cannot be applied here and clearly defined layers no longer exist. Thus, the matrix W may be unequal to 0 everywhere, except along its diagonal.
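As a summary of the topologies discussed in this section, the following minimal sketch (an illustration under the assumption that rows of W are source neurons and columns are target neurons, with the first two neurons forming the input layer, the next three the hidden layer and the last two the output layer, as in the figures) builds boolean masks describing which entries of W may be non-zero for each design.

# Minimal sketch: boolean masks over the weight matrix W (rows = source,
# columns = target) describing which connections each topology permits.
# Assumption: neurons 0-1 are the input layer, 2-4 the hidden layer,
# 5-6 the output layer (0-based indices for the neurons 1-7 of the figures).
import numpy as np

layer = np.array([0, 0, 1, 1, 1, 2, 2])     # layer index of each neuron
n = len(layer)
src, dst = np.meshgrid(layer, layer, indexing="ij")
eye = np.eye(n, dtype=bool)

feedforward  = dst == src + 1               # only to the next following layer
shortcut     = dst > src                    # may skip layers, still "forwards"
direct_rec   = feedforward | eye            # plus self-connections (diagonal)
indirect_rec = feedforward | (dst < src)    # plus links back to earlier layers
lateral      = feedforward | ((dst == src) & ~eye)   # plus links within one layer
complete     = ~eye                         # everything except direct recurrences
# (the symmetry requirement of complete interconnection is not encoded here)

print(feedforward.astype(int))              # compare with the Hinton diagrams above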

3.4 The bias neuron is a technical trick to consider threshold values as connection weights

By now we know that in many network paradigms neurons have a threshold value that indicates when a neuron becomes active. Thus, the threshold value is an activation function parameter of a neuron. From the biological point of view this sounds most plausible, but it is complicated to access the activation function at runtime in order to train the threshold value.



Figure 3.8: A completely linked network with symmetric connections and without direct recurrences. In the Hinton diagram only the diagonal is left blank.

But threshold values Θj1 , . . . , Θjn for neurons j1, j2, . . . , jn can also be realized as the connection weights of a continuously firing neuron: For this purpose an additional bias neuron whose output value is always 1 is integrated into the network and connected to the neurons j1, j2, . . . , jn. These new connections get the weights −Θj1 , . . . , −Θjn , i.e. they get the negative threshold values.
Definition 3.16: A bias neuron is a neuron whose output value is always 1 and which is represented by the symbol BIAS. It is used to represent neuron biases as connection weights, which enables any weight-training algorithm to train the biases at the same time.

Then the threshold value of the neurons j1, j2, . . . , jn is set to 0. Now the threshold values are implemented as connection weights (fig. 3.9) and can directly be trained together with the connection weights, which considerably facilitates the learning process.

In other words: Instead of including the threshold value in the activation function, it is now included in the propagation function. Or even shorter: The threshold value is subtracted from the network input, i.e. it is part of the network input. More formally:


Let j1, j2, . . . , jn be neurons with the threshold values Θj1 , . . . , Θjn . By inserting a bias neuron whose output value is always 1, generating connections between this bias neuron and the neurons j1, j2, . . . , jn and weighting these connections wBIAS,j1 , . . . , wBIAS,jn with −Θj1 , . . . , −Θjn , we can set Θj1 = . . . = Θjn = 0 and receive an equivalent neural network whose threshold values are realized as connection weights.
Remark: Undoubtedly, the advantage of the bias neuron is the fact that it is much easier to implement into the network. One disadvantage is that the representation of the network already becomes quite cluttered with only a few neurons, let alone with a great number of them. By the way, a bias neuron is often referred to as an on neuron.
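A minimal sketch of this equivalence, assuming a binary threshold activation (the neuron fires iff its network input reaches the threshold value): the first function uses the threshold Θ directly, the second uses a bias neuron with output 1 and a connection weight of −Θ together with a threshold of 0.

# Minimal sketch: threshold value vs. bias neuron, assuming a binary
# threshold activation (output 1 iff the network input reaches the threshold).
import numpy as np

def neuron_with_threshold(x, w, theta):
    net = np.dot(w, x)                 # weighted sum (propagation function)
    return 1.0 if net >= theta else 0.0

def neuron_with_bias(x, w, theta):
    x_b = np.append(x, 1.0)            # bias neuron: its output is always 1
    w_b = np.append(w, -theta)         # its connection carries the weight -theta
    net = np.dot(w_b, x_b)
    return 1.0 if net >= 0.0 else 0.0  # the threshold is now 0

x = np.array([0.3, 0.8])
w = np.array([0.5, -0.2])
assert neuron_with_threshold(x, w, 0.1) == neuron_with_bias(x, w, 0.1)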

From now on, the bias neuron is omitted for clarity in the following illustrations, but we know that it exists and that with it the threshold values can simply be treated as weights.

3.5 Representing Neurons

We have already seen that we can eitherwrite its name or its threshold value intoa neuron. Another useful representation,which we will use several times in thefollowing, is to illustrate neurons accord-ing to their type of data processing. Seefig. 3.10 for some examples without fur-ther explanation – the different types ofneurons are explained as soon as we needthem.


Figure 3.10: Different types of neurons that willappear in the following text.

3.6 Take care of the order in which neuron activations are calculated

For a neural network it is very important in which order the individual neurons receive and process the input and output the results. Here, we distinguish two model classes:

3.6.1 Synchronous activation

All neurons change their values synchronously, i.e. they simultaneously calculate network inputs, activation and output, and pass them on. Synchronous activation corresponds closest to its biological counterpart, but it is only useful on certain parallel computers and especially not for feedforward networks. However, synchronous activation can be emulated by saving old neuron activations in memory and using them to generate the new values of the next time step. Thus, only activation values of time step t are used for the calculation of those of time step (t + 1). This order of activation is the most generic.



Figure 3.9: Two equivalent neural networks, one without bias neuron on the left, one with biasneuron on the right. The neuron threshold values can be found in the neurons, the connectingweights at the connections. Furthermore, I omitted the weights of the already existing connections(represented by dotted lines on the right side).

Definition 3.17 (Synchronous activation): All neurons of a network calculate their network inputs at the same time by means of the propagation function, their activations by means of the activation function and their outputs by means of the output function. After that the activation cycle is complete.
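The emulation mentioned above can be sketched as follows (a minimal illustration assuming the weighted sum as propagation function, the identity as output function and some activation function f_act): all new activations of time step t + 1 are computed from the stored activations of time step t before any of them is overwritten.

# Minimal sketch: emulating synchronous activation on a serial machine.
import numpy as np

def synchronous_step(W, a_t, f_act=np.tanh):
    net = W.T @ a_t          # net input of every neuron, from the old activations
    a_next = f_act(net)      # all neurons switch at once
    return a_next            # activations of time step t+1

W = np.random.uniform(-1, 1, (5, 5))   # W[i, j]: weight from neuron i to neuron j
a = np.zeros(5)
a[0] = 1.0                             # some initial activation
for t in range(3):
    a = synchronous_step(W, a)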

3.6.2 Asynchronous activation

Here, the neurons do not change their values simultaneously but at different points of time. For this, there exist different orders, some of which I want to introduce in the following:

3.6.2.1 Random order

Definition 3.18 (Random order of activation): With random order of activation a neuron i is randomly chosen and its neti, ai and oi are updated. For n neurons a cycle is the n-fold execution of this step. Obviously, some neurons are repeatedly updated during one cycle, while others are not updated at all.
Remark: Apparently, this order of activation is not always useful.

3.6.2.2 Random permutation

With random permutation each neuron is chosen exactly once, but in random order, during one cycle.
Definition 3.19 (Random permutation): Initially, a permutation of the neurons is randomly calculated, and it defines the order of activation. Then the neurons are successively updated in this order.
Remark: This order of activation is rarely used, too, because firstly, the order is generally not useful and, secondly, it is very time-consuming to compute a new permutation for every cycle. A Hopfield network (chapter 8) is a topology nominally having a random or a randomly permuted order of activation. But note that in practice, for the previously mentioned reasons, a fixed order of activation is preferred.

For all of these orders, the calculation can use either the previous neuron activations of time t or, where they already exist, the neuron activations of time t + 1 that are currently being computed.

3.6.2.3 Topological order

Definition 3.20 (Topological activation): With topological order of activation the neurons are updated during one cycle according to a fixed order. The order is defined by the network topology. Thus, in feedforward networks (for which this procedure is very reasonable) the input neurons would be updated first, then the inner neurons and finally the output neurons.
Remark: This procedure can only be considered for non-cyclic, i.e. non-recurrent, networks, since otherwise there is no such order of activation.

3.6.2.4 Fixed orders of activationduring implementation

Obviously, fixed orders of activation can be defined as well. Therefore, when implementing, for instance, feedforward networks it is very popular to determine the order of activation once according to the topology and to use this order without further verification at runtime. But this is not necessarily useful for networks that are capable of changing their topology.

3.7 Communication with the outside world: Input and output of data in and from Neural Networks

At last, let us take a look at the fact that, of course, many types of neural networks permit the input of data. These data are then processed and can produce output. Let us, for example, regard the feedforward network shown in fig. 3.3: It has two input neurons and two output neurons, which means that it also has two numerical inputs x1, x2 and two outputs y1, y2. So we take it easy and summarize the input and output components for n input or output neurons within the vectors x = (x1, x2, . . . , xn) and y = (y1, y2, . . . , yn).
Definition 3.21 (Input vector): A network with n input neurons needs n inputs x1, x2, . . . , xn. They are understood as the input vector x = (x1, x2, . . . , xn). As a consequence, the input dimension is referred to as n. Data is put into a neural network by using the components of the input vector as the net inputs of the input neurons.
Definition 3.22 (Output vector): A network with m output neurons provides m outputs y1, y2, . . . , ym. They are understood as the output vector y = (y1, y2, . . . , ym). Thus, the output dimension is referred to as m. Data is output by a neural network by defining the components of the output vector as the outputs of the output neurons.
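As a small illustration of these definitions, the following sketch (assuming the weighted sum as propagation function, tanh as activation function and thresholds already folded into bias weights as described in section 3.4) puts an input vector x into a feedforward network and reads out the output vector y, updating the layers in topological order.

# Minimal sketch: from input vector x to output vector y in a feedforward
# network, layers updated in topological order (input -> hidden -> output).
import numpy as np

def forward(x, weights, f_act=np.tanh):
    a = np.asarray(x, dtype=float)          # input neurons just forward x
    for W, b in weights:                    # one (W, b) pair per weight layer
        a = f_act(a @ W + b)                # net input, then activation
    return a                                # output vector y

rng = np.random.default_rng(0)
weights = [(rng.normal(size=(2, 3)), np.zeros(3)),   # 2 inputs  -> 3 hidden
           (rng.normal(size=(3, 2)), np.zeros(2))]   # 3 hidden -> 2 outputs
y = forward([0.5, -1.0], weights)
print(y)    # two outputs, as for the network of fig. 3.3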

Now we have defined and closely examined the basic components of neural networks – without having seen a network in action. But first we will continue with theoretical explanations and generally describe how a neural network could learn.

Exercises

Exercise 5: Would it be useful (from your point of view) to insert one bias neuron in each layer of a layer-based network, such as a feedforward network? Discuss this in relation to the representation and implementation of the network. Will the result of the network change?
Exercise 6: Show for the Fermi function f(x) as well as for the hyperbolic tangent tanh(x) that the derivative functions can be expressed by the functions themselves so that the two statements

1. f ′(x) = f(x) · (1− f(x)) and

2. tanh′(x) = 1 − tanh²(x)

are true.


Chapter 4

How to Train a Neural Network?

Approaches and thoughts on how to teach machines. Should neural networks be corrected? Should they only be encouraged? Or should they even learn without any help? Thoughts about what we want to change during the learning procedure and how we will change it, about the measurement of errors and when we have learned enough.

As written above, the most interesting characteristic of neural networks is their capability to familiarize themselves with problems by means of training and, after sufficient training, to be able to solve unknown problems of the same class. This approach is referred to as generalization. Before introducing specific learning procedures, I want to propose some basic principles about the learning procedure in this chapter.

4.1 There are different paradigms of learning

Learning is a broad term. A neural network could learn from many things but, of course, there will always be the question of implementability. In principle, a neural network changes when its components are changing, as we have learned above.

Theoretically, a neural network could learn by

1. developing new connections,

2. deleting existing connections,

3. changing connecting weights,

4. changing the threshold values of neu-rons,

5. varying one or more of the three neu-ron functions (remember: activationfunction, propagation function andoutput function),

6. developing new neurons, or

7. deleting existing neurons (and so, of course, existing connections).

As mentioned above, we assume the change in weight to be the most common procedure. Furthermore, deletion of connections can be realized by additionally making sure that a connection, once set to 0, is no longer trained. Moreover, we can develop further connections by setting a non-existing connection (with the value 0 in the connection matrix) to a value different from 0. As for the modification of threshold values I refer to the possibility of implementing them as weights (section 3.4).

The change of neuron functions is difficult to implement, not very intuitive and not biologically motivated. Therefore it is not very popular and I will omit this topic here. The possibilities to develop or delete neurons do not only provide well-adjusted weights during the training of a neural network, but also optimize the network topology. Thus, they attract a growing interest and are often realized by using evolutionary procedures. But since we accept that a large part of the learning possibilities can already be covered by changes in weight, they are also not the subject matter of this paper.

Thus, we let our neural network learn by modifying the connection weights according to rules that can be formulated as algorithms. Therefore a learning procedure is always an algorithm that can easily be implemented by means of a programming language. Later in the text I will assume that the definition of the term desired output (which is worth learning) is known, I will define formally what a training pattern is, and I will assume that we have a training set of learning examples. Let a training set be defined as follows:
Definition 4.1 (Training set): A training set (named P) is a set of training patterns, which we use to train our neural net.

Additionally, I will introduce the three essential paradigms of learning by the differences between their respective training sets.

4.1.1 Unsupervised learning provides input patterns to the network, but no learning aids

Unsupervised learning is the biologically most plausible method, but is not suitable for all problems. Only the input patterns are given; the network tries to identify similar patterns and to classify them into similar categories.
Definition 4.2 (Unsupervised learning): The training set only consists of input patterns; the network tries by itself to detect similarities and to generate pattern classes.
Example: Here again I want to refer to the popular example of Kohonen's self-organising maps (chapter 10).

4.1.2 Reinforcement learning methods provide feedback to the network on whether it behaves well or badly

In reinforcement learning the network receives a logical value or a real value after completion of a sequence, which defines whether the result is right or wrong. It is intuitively obvious that this procedure should be more effective than unsupervised learning since the network receives specific criteria for problem-solving.
Definition 4.3 (Reinforcement learning): The training set consists of input patterns; after completion of a sequence a value is returned to the network indicating whether the result was right or wrong and, possibly, how right or wrong it was.

4.1.3 Supervised learning methods provide training patterns together with appropriate desired outputs

In supervised learning the training set consists of input patterns as well as their correct results in the form of the precise activation of all output neurons. Thus, for each training pattern that is input into the network the output can, for instance, directly be compared with the correct solution, and the network weights can be changed on the basis of the difference. The objective is to change the weights so that the network cannot only associate input and output patterns independently after the training, but can also map unknown, similar input patterns to a plausible result, i.e. to generalise.
Definition 4.4 (Supervised learning): The training set consists of input patterns with correct results so that a precise error vector1 can be returned to the network.
1 The term error vector will be defined in section 4.2, where the mathematical formalisation of learning is discussed.

Remark: This learning procedure is not always biologically plausible, but it is extremely effective and therefore very feasible.

At first we want to look at supervised learning procedures in general, which, in this paper, correspond to the following steps (a minimal sketch of this loop follows the list):

Entering the input pattern (activation of input neurons),

Forward propagation of the input by the network, generation of the output,

Comparing the output with the desired output (teaching input), which provides the error vector (difference vector),

Corrections of the network are calculated based on the error vector,

Corrections are applied.
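A minimal sketch of this scheme, under simplifying assumptions: the "network" is just a weight vector with a linear output, and the correction in step 4 is a plain error-times-input update chosen only for illustration; it is not yet one of the specific learning procedures of this paper.

# Minimal sketch of the supervised learning scheme above, assuming a toy
# "network" that is a single weight vector with a linear output. The
# correction rule (error times input, scaled by a learning rate) is only
# an illustration of step 4, not a procedure defined in this chapter.
import numpy as np

def train_online(w, training_set, learning_rate=0.1, epochs=50):
    for _ in range(epochs):
        for p, t in training_set:          # training pattern p, teaching input t
            y = w @ p                      # steps 1-2: enter pattern, propagate
            e = t - y                      # step 3: difference to the teaching input
            dw = learning_rate * e * p     # step 4: compute the correction
            w = w + dw                     # step 5: apply the correction
    return w

data = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
w = train_online(np.zeros(2), data)
print(w)   # approaches a weight vector reproducing the teaching inputs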

4.1.4 Offline or Online Learning?

It must be noted that learning can be offline (a set of training examples is presented, then the weights are changed, and the total error is calculated by means of an error function or simply accumulated – see also section 4.4) or online (after every presented example the weights are changed). Both procedures have advantages and disadvantages, which will be discussed in the sections on learning procedures where necessary. Offline training procedures are also called batch training procedures since a batch of results is corrected all at once. Such a training section over a whole batch of training examples, including the related changes in weight values, is called an epoch.
Definition 4.5 (Offline learning): Several training patterns are entered into the network all at once, the errors are accumulated and the network learns for all patterns at the same time.
Definition 4.6 (Online learning): The network learns directly from the errors of each training example.

4.1.5 Questions you should take care of before learning

The application of such schemes certainly requires some preliminary thoughts about a few questions, which I want to introduce now as a checklist and, as far as possible, answer in the course of this paper:

. Where does the learning input come from and in what form?

. How must the weights be modified to allow fast and reliable learning?

. How can the success of a learning process be measured in an objective way?

. Is it possible to determine the ”best” learning procedure?

. Is it possible to predict if a learning procedure terminates, i.e. whether it will reach an optimal state after a finite time or if it, for example, will oscillate between different states?

. How can the learned patterns be stored in the network?

. Is it possible to avoid that newly learned patterns destroy previously learned associations (the so-called stability/plasticity dilemma)?

Remark: We will see that all these questions cannot be answered in general but that they have to be discussed individually for each learning procedure and each network topology.

4.2 Training Patterns and Teaching Input

Before we get to know our first learning rule, we need to introduce the teaching input. In the case of supervised learning we assume a training set consisting of training patterns and the corresponding correct output values we want to see at the output neurons after the training. As long as the network is not yet trained, i.e. as long as it is generating wrong outputs, these output values are referred to as the teaching input, and that for each neuron individually. Thus, for a neuron j with the incorrect output oj, tj is the teaching input, which means it is the correct or desired output for a training pattern p.

Definition 4.7 (Training pattern): A training pattern is an input vector p with the components p1, p2, . . . , pn whose desired output is known. By entering the training pattern into the network we receive an output that can be compared with the teaching input, which is the desired output. The set of training patterns is called P. It contains a finite number of ordered pairs (p, t) of training patterns with their corresponding desired output.
Remark: Training patterns are often simply called patterns, which is why they are referred to as p. In the literature as well as in this paper they are synonymously called patterns, training examples etc.

Definition 4.8 (Teaching input): Let j be an output neuron. The teaching input tj is the desired, correct value j should output after the input of a certain training pattern. Analogously to the vector p, the teaching inputs t1, t2, . . . , tn of the neurons can also be combined into a vector t. t always refers to a specific training pattern p and is, as already mentioned, contained in the set P of the training patterns.

Definition 4.9 (Error vector): For several output neurons Ω1, Ω2, . . . , Ωn the difference between the output vector and the teaching input under a training input p,

Ep = (t1 − y1, . . . , tn − yn),

is referred to as the error vector, sometimes also called the difference vector. Depending on whether you are learning offline or online, the difference vector refers to a specific training pattern or to the error of a set of training patterns, normed in a certain way.
Remark: Now I want to briefly summarize the vectors we have defined so far. There is

the input vector x, which can be entered into the neural network. Depending on the type of network used, the neural network will output an

output vector y. Basically, the

training example p is nothing more than an input vector. We only use it for training purposes because we know the corresponding

teaching input t, which is nothing more than the desired output vector for the training example. The

error vector Ep is the difference between the teaching input t and the actual output y.

So, what x and y are for the general network operation, p and t are for the network training – and during training we try to bring y and t as close to each other as possible.

Note: One advice concerning notation: We referred to the output values of a neuron i as oi. Thus, the output of an output neuron Ω is called oΩ. But the output values of a network are referred to as yΩ. Certainly, these network outputs are only neuron outputs, too, but they are the outputs of output neurons. In this respect

yΩ = oΩ

is true.

4.3 Using Training Examples

We have seen how we can basically learn and which steps are required to do so. Now we should take a look at the selection of training data and the learning curve. After successful learning it is particularly interesting whether the network has only memorized – i.e. whether it can use our training examples to produce exactly the right output but provides wrong answers for all other problems of the same class.

Let us assume that we want the network to train a mapping R2 → B1 and therefore use the training examples from fig. 4.1: Then there could be a chance that, finally, the network will exactly mark the colored areas around the training examples with the output 1 (fig. 4.1, top), and will otherwise output 0. Thus, the network has sufficient storage capacity to concentrate on the six training examples with the output 1. This implies an oversized network with too much free storage capacity.

On the other hand a network could have insufficient capacity (fig. 4.1, bottom) – this rough representation of the input data does not correspond to the good generalization performance we desire. Thus, we have to find the balance (fig. 4.1, middle).

4.3.1 It is useful to divide the set of Training Examples

An often proposed solution for these problems is to divide the training set into

. one training set really used to train,

. and one verification set to test our progress

Figure 4.1: Visualization of training results of the same training set on networks with a capacity that is too high (top), correct (middle) or too low (bottom).


– provided that there are enough training examples. The usual division relations are, for instance, 70% for training data and 30% for verification data (randomly chosen). We can finish the training when the network provides good results on the training data as well as on the verification data.

But note: If the verification data provide poor results, do not modify the network structure until these data provide good results – otherwise you run the risk of tailoring the network to the verification data. This means that these data are included in the training, even if they are not used explicitly for the training. The solution is a third set of validation data used only for validation after a supposedly successful training.
Remark: By training with fewer patterns, we obviously withhold information from the network and risk worsening the learning performance. But this paper is not about the 100% exact reproduction of given examples but about successful generalization and approximation of a whole function – for which it can definitely be useful to train less information into the network.
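A minimal sketch of such a division; the 70%/30% relation is the example value from above, while the size of the held-back validation portion is an assumption made for illustration.

# Minimal sketch: randomly dividing a set of training examples into a
# training part and a verification part, plus a separate validation portion
# that is held back entirely.
import random

def split_examples(examples, train_frac=0.7, seed=0):
    shuffled = examples[:]                     # do not modify the caller's list
    random.Random(seed).shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]      # training set, verification set

examples = list(range(100))                    # stand-ins for (p, t) pairs
validation = examples[90:]                     # e.g. keep the last 10% untouched
train, verify = split_examples(examples[:90])  # 70/30 split of the rest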

4.3.2 Order of pattern presentation

You can find different strategies to choose the order of pattern presentation: If patterns are presented in a random sequence, there is no guarantee that the patterns are learned equally well. Always presenting the same sequence of patterns, on the other hand, provokes that the patterns will be memorized when using recurrent networks (later, we will learn more about this type of network). A random permutation would solve both problems, but it is – as already mentioned – very time-consuming to calculate such a permutation.

4.4 Learning curve and error measurement

The learning curve indicates the progress of the error, which can be determined in various ways. The motivation to create a learning curve is that such a curve can indicate whether the network is progressing or not. For this, the error should be normed, i.e. it should represent a distance measure between the correct and the current output of the network. For example, we can take the same pattern-specific, squared error with a prefactor that we will later use to derive the backpropagation of error (let Ω be output neurons and O the set of output neurons):

Errp = 1/2 · ∑Ω∈O (tΩ − yΩ)²   (4.1)

Definition 4.10 (Specific error): The specific error Errp is based on a single training example, which means it is generated online.

Additionally, the root mean square (abbreviated: RMS) and the Euclidean distance are often used.


The Euclidean distance (a generalization of the theorem of Pythagoras) is useful for lower dimensions where we can still visualize its usefulness.
Definition 4.11 (Euclidean distance): The Euclidean distance between two vectors t and y is defined as

Errp = √( ∑Ω∈O (tΩ − yΩ)² ).   (4.2)

Generally, the root mean square is commonly used since it considers extreme outliers to a greater extent.
Definition 4.12 (Root mean square): The root mean square of two vectors t and y is defined as

Errp = √( (1/|O|) · ∑Ω∈O (tΩ − yΩ)² ).   (4.3)

As for offline learning, the total error in the course of one training epoch is interesting and useful, too:

Err = ∑p∈P Errp   (4.4)

Definition 4.13 (Total error): The total error Err is based on all training examples, which means it is generated offline.

Analogously, we can generate a total RMS and a total Euclidean distance over the course of a whole epoch. Of course, other types of error measurement can also be used.
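The error measures above can be sketched as follows, assuming the teaching input t and the output y are given as NumPy vectors over the output neurons.

# Minimal sketch of the error measures defined above.
import numpy as np

def specific_error(t, y):        # eq. 4.1, squared error with prefactor 1/2
    return 0.5 * np.sum((t - y) ** 2)

def euclidean_distance(t, y):    # eq. 4.2
    return np.sqrt(np.sum((t - y) ** 2))

def root_mean_square(t, y):      # eq. 4.3
    return np.sqrt(np.mean((t - y) ** 2))

def total_error(pairs):          # eq. 4.4, summed over all (t, y) pairs of an epoch
    return sum(specific_error(t, y) for t, y in pairs)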

Depending on our error measurement method our learning curve certainly changes, too. A perfect learning curve looks like a negative exponential function, that means it is proportional to e−t (fig. 4.2). Thus, the learning curve can be illustrated by means of a logarithmic scale (fig. 4.2, second diagram from the bottom) – with the said scaling combination a descending line implies an exponential descent of the error.

With the network doing a good job, the problems being not too difficult and the logarithmic representation of Err you can see – metaphorically speaking – a descending line that often forms ”spikes” at the bottom – here, we reach the limit of the 64-bit resolution of our computer and our network has actually learned the optimum of what it is capable of learning.

Typical learning curves can show a few flat areas as well, i.e. they can show some steps, which is no sign of a malfunctioning learning process. As we can also see in fig. 4.2, a well-suited representation can make any slightly decreasing learning curve look good – so just be cautious when reading the literature.

4.4.1 When do we stop learning?

Now, the great question is: When do we stop learning? Generally, the training is stopped when the user in front of the learning computer ”thinks” the error is small enough. Indeed, there is no easy answer and thus I can once again only give you something to think about, which, however, depends on a more objective view on the comparison of several learning curves.


(Four plots of the error (Fehler) against the training epoch (Epoche), with alternating linear and logarithmic axis scalings; see the caption of fig. 4.2 below.)

Figure 4.2: All four illustrations show the same (idealized, because very smooth) learning curve.Note the alternating logarithmic and linear scalings! Also note the small ”inaccurate spikes” visiblein the bend of the curve in the first and second diagram from bottom.


Confidence in the results, for example, is boosted when the network always reaches nearly the same final error rate for different random initializations – so repeated initialization and training will provide a more objective result.

On the other hand, it can be possible that a curve descending fast in the beginning can, after a longer time of learning, be overtaken by another curve: This can indicate that either the learning rate of the worse curve was too high or the worse curve itself simply got stuck in a secondary minimum, but was the first to find it.

Remember: Larger error values are worse than the small ones.

But, in any case, note: Many people only generate a learning curve with respect to the training data (and then they are surprised that only a few things will work) – but for reasons of objectivity and clarity one should not forget to plot the verification data on a second learning curve, which generally provides values that are slightly worse and oscillate more strongly. But with good generalization the curve can decrease, too.

When the network eventually begins to memorize the examples, the shape of the learning curve can provide an indication: If the learning curve of the verification examples is suddenly and rapidly rising while the learning curve of the training data is continuously falling, this could indicate memorizing and a generalization getting poorer and poorer. At this point it could be decided whether the network has already learned well enough at the next point of the two curves, and maybe the final point of learning should be applied here (this procedure is called early stopping).

Once again I want to remind you that these are all acting as indicators and not as hard if-then rules.

Let me say one word about the number of learning epochs: At the moment, various publications often use ≈ 10⁶ to ≈ 10⁷ epochs. Why not try some more? The answer is simple: The current standard PC is not yet fast enough to support 10⁸ epochs. But with increasing processing speed the trend will move towards it all by itself.

4.5 Gradient optimization procedures

In order to establish the mathematical basis for some of the following learning procedures I want to explain briefly what is meant by gradient descent: the backpropagation of error learning procedure, for example, involves this mathematical basis and thus inherits the advantages and disadvantages of the gradient descent.

Gradient descent procedures are generally used where we want to maximize or minimize n-dimensional functions. For reasons of clarity the illustration (fig. 4.3) shows only two dimensions, but in principle there is no limit to the number of dimensions.


Figure 4.3: Visualization of the gradient descent on a two-dimensional error function. We go forward in diametrical opposition to g, i.e. with the steepest descent towards the lowest point, with the step width being proportional to |g| (the steeper the descent, the faster the steps). On the left the area is shown in 3D, on the right the steps over the level curves are shown in 2D. Here it is obvious how a movement is made in the opposite direction of g towards the minimum of the function and continuously slows down proportionally to |g|. Source: http://webster.fhs-hagenberg.ac.at/staff/sdreisei/Teaching/WS2001-2002/PatternClassification/graddescent.pdf


The gradient is a vector g that is defined for any differentiable point of a function, that points from this point exactly towards the steepest ascent and indicates the steepness of this ascent by means of its norm |g|. Thus, the gradient is a generalization of the derivative for multi-dimensional functions2. Accordingly, the negative gradient −g points exactly towards the steepest descent. The gradient operator ∇ is referred to as the nabla operator; the overall notation for the gradient g at the point (x, y) of a two-dimensional function f is, for instance, g(x, y) = ∇f(x, y).
Definition 4.14 (Gradient): Let g be a gradient. Then g is a vector with n components that is defined for any point of a (differentiable) n-dimensional function f(x1, x2, . . . , xn). The gradient operator notation is defined as

g(x1, x2, . . . , xn) = ∇f(x1, x2, . . . , xn).

g points from any point of f towards the steepest ascent from this point, with |g| corresponding to the degree of this ascent.

Gradient descent means going downhill in small steps from any starting point of our function against the gradient g (which means, vividly speaking, the direction in which a ball would roll from the starting point), with the size of the steps being proportional to |g| (the steeper the descent, the broader the steps). Therefore,

2 I don’t want to dwell on how to determine the multidimensional derivative – the interested reader may consult the usual analysis literature.

we move slowly on a flat plateau, and on a steep ascent we run downhill. If we come into a valley, we will – depending on the size of our steps – either jump over it or return into the valley across the opposite hillside, in order to come closer and closer to the deepest point of the valley by walking back and forth, similar to a ball rolling around within a round bowl.
Definition 4.15 (Gradient descent): Let f be an n-dimensional function and s = (s1, s2, . . . , sn) the given starting point. Gradient descent means going from f(s) against the direction of g, i.e. towards −g, with steps of the size of |g|, towards smaller and smaller values of f.

Now we will see that gradient descent procedures are not free from problems (section 4.5.1), but they are nevertheless promising.

4.5.1 Gradient procedures incorporate several problems

As already implied in section 4.5, gradient descent (and therefore backpropagation) is promising but not foolproof. One problem is that the result does not always reveal whether an error has occurred.

4.5.1.1 Convergence towards suboptimal minima

Every gradient descent procedure can, for example, get stuck within a local minimum (part a of fig. 4.4). This problem increases proportionally to the size of the error surface, and there is no universal solution.

Figure 4.4: Possible errors during a gradient descent: a) detecting bad minima, b) quasi-standstill with a small gradient, c) oscillation in canyons, d) leaving good minima.

4.5.1.2 Stagnation at flat plateaus

When passing a flat plateau, for instance, the gradient also becomes negligibly small because there is hardly any descent (part b of fig. 4.4), which requires many further steps. A hypothetically possible gradient of 0 would completely stop the descent.

4.5.1.3 Leaving good minima

On the other hand the gradient is very large at a steep slope, so that large steps can be made and a good minimum can possibly be missed (part d of fig. 4.4).

4.5.1.4 Oscillation in steep canyons

A sudden alternation from one very strong negative gradient to a very strong positive one can even result in oscillation (part c of fig. 4.4). In nature, such an error does not occur very often, so that we can think about the possibilities b and d.

4.6 Problem examples allow for testing self-coded learning strategies

We looked at learning from the formal point of view – not much yet, but a little. Now it is time to look at a few problem examples you can later use to test implemented networks and learning rules.


4.6.1 Boolean functions

A popular example is the one that did not work in the nineteen-sixties: the XOR function (B2 → B1). We need a hidden neuron layer, which we have discussed in detail. Thus, we need at least two neurons in the inner layer. Let the activation function in all layers (except in the input layer, of course) be the hyperbolic tangent. Trivially, we now expect the outputs 1.0 or −1.0, depending on whether the function XOR outputs 1 or 0 – and exactly here is where the first beginner's mistake occurs.

For outputs close to 1 or −1, i.e. close to the limits of the hyperbolic tangent (or, in the case of the Fermi function, 0 or 1), we need very large network inputs. The only chance to reach these network inputs are large weights, which have to be learned: the learning process is greatly prolonged. Therefore it is wiser to enter the teaching inputs 0.9 or −0.9 as desired outputs, or to be satisfied when the network outputs those values instead of 1 and −1.
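A minimal sketch of the resulting XOR training set with the softened teaching inputs suggested above (0.9 instead of 1 and −0.9 instead of −1, for a tanh output neuron):

# Minimal sketch: XOR training patterns with softened teaching inputs.
xor_patterns = [
    ((0, 0), -0.9),   # XOR = 0  ->  teaching input -0.9
    ((0, 1),  0.9),   # XOR = 1  ->  teaching input  0.9
    ((1, 0),  0.9),
    ((1, 1), -0.9),
]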

Another favourite example for single-layer perceptrons are the Boolean functions AND and OR.

4.6.2 The parity function

The parity function maps a set of bits to 1 or 0, depending on whether an even number of input bits is set to 1 or not. Basically, this is the function Bn → B1. It is characterized by easy learnability up to approx. n = 3 (shown in table 4.1), but

i1 i2 i3 | Ω
0  0  0  | 1
0  0  1  | 0
0  1  0  | 0
0  1  1  | 1
1  0  0  | 0
1  0  1  | 1
1  1  0  | 1
1  1  1  | 0

Table 4.1: Illustration of the parity function with three inputs.

the learning effort rapidly increases from n = 4 onward.
Remark: The reader may create a value table for the 2-bit parity function. What is conspicuous?
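A minimal sketch that generates the training examples of the parity function Bn → B1 (output 1 iff an even number of input bits is set to 1); for n = 3 it reproduces table 4.1.

# Minimal sketch: generating the parity training examples.
from itertools import product

def parity_examples(n):
    return [(bits, 1 if sum(bits) % 2 == 0 else 0)
            for bits in product((0, 1), repeat=n)]

for bits, target in parity_examples(3):
    print(bits, target)        # reproduces table 4.1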

4.6.3 The 2-spiral problem

As a training example for a function, let us take two spirals coiled into each other (fig. 4.5), with the function certainly representing a mapping R2 → B1. One of the spirals is assigned to the output value 1, the other spiral to 0. Here, memorizing does not help. The network has to understand the mapping itself. This example can be solved by means of an MLP, too.

4.6.4 The checkerboard problem

We again create a two-dimensional function of the form R2 → B1 and specify checkered training examples (fig. 4.6), with one colored field representing 1 and all the rest of them representing 0. The difficulty increases proportionally to the size of the function: While a 3×3 field is easy to learn, larger fields are more difficult (for these we would eventually use methods that are more suitable for this kind of problem than the MLP).

Figure 4.5: Illustration of the training example of the 2-spiral problem

The 2-spiral problem is very similar to the checkerboard problem, only that, mathematically speaking, the first problem uses polar coordinates instead of Cartesian coordinates. As one last trivial example, I just want to introduce the identity.

Figure 4.6: Illustration of training examples for the checkerboard problem

4.6.5 The identity function

By using linear activation functions the identity mapping from R1 to R1 (of course only within the parameters of the used activation function) is no problem for the network, but we put some obstacles in its way by using our sigmoid functions, so that it would be difficult for the network to learn the identity. Just try it for the fun of it.

Now it is time to have a look at our first mathematical learning rule.


4.7 The Hebbian learning rule is the basis for most other learning rules

In 1949, Donald O. Hebb formulated the Hebbian rule [Heb49], which is the basis for most of the more complicated learning rules we will discuss in this paper. We distinguish between the original form and the more general form, which is a kind of principle for other learning rules.

4.7.1 Original rule

Definition 4.16 (Hebbian rule): ”If neuron j receives an input from neuron i and if both neurons are strongly active at the same time, then increase the weight wi,j (i.e. the strength of the connection between i and j).” Mathematically speaking, the rule is:

∆wi,j ∼ η · oi · aj   (4.5)

with ∆wi,j being the change in weight from i to j, which is proportional to the following factors:

. The output oi of the predecessor neuron i, as well as

. The activation aj of the successor neuron j,

. A constant η, i.e. the learning rate, which will be discussed in section 5.5.2.

The changes in weight ∆wi,j are simply added to the weight wi,j.

Remark: Why am I speaking twice about activation, but in the formula using oi and aj, i.e. the output of neuron i and the activation of neuron j? Remember that the identity is often used as output function and therefore ai and oi of a neuron are often the same. Besides, Hebb postulated his rule long before the specification of technical neurons.

Considering that this learning rule was preferably used with binary activations, it is clear that with the possible activations (1, 0) the weights will either increase or remain constant. Sooner or later they would go ad infinitum, since they can only be corrected ”upwards” when an error occurs. This can be compensated by using the activations (−1, 1)3. Thus, the weights are decreased when the activation of the predecessor neuron differs from the one of the successor neuron, and otherwise they are increased.

4.7.2 Generalized form

Remark: Most of the learning rules discussed further below are specializations of the mathematically more general form [MR86] of the Hebbian rule.
Definition 4.17 (Hebbian rule, more general): The generalized form of the Hebbian rule only specifies the proportionality of the change in weight to the product of two undefined functions, but with defined input values:

∆wi,j = η · h(oi, wi,j) · g(aj , tj)   (4.6)

3 But that is no longer the ”original version” of the Hebbian rule.


Thus, the product of the functions

. g(aj , tj) and

. h(oi, wi,j)

. as well as the constant learning rate η

results in the change in weight. As you can see, h receives the output of the predecessor cell oi as well as the weight from predecessor to successor wi,j, while g expects the actual and desired activation of the successor, aj and tj (here t stands for the above-mentioned teaching input). As already mentioned, g and h are not specified in this general definition. Therefore, we will now return to the path of specialization we discussed before equation 4.6. After this short picture of what a learning rule could look like and of our thoughts about learning itself, we will be introduced to our first network paradigm including its learning procedure.

Exercises

Exercise 7: Calculate the average value µ and the standard deviation σ for the following data points.

p1 = (2, 2, 2)
p2 = (3, 3, 3)
p3 = (4, 4, 4)
p4 = (6, 0, 0)
p5 = (0, 6, 0)
p6 = (0, 0, 6)


Part II

Supervised learning Network Paradigms


Chapter 5

The Perceptron

A classic among the neural networks. If we talk about a neural network, then in the majority of cases we speak about a perceptron or a variation of it. Perceptrons are multi-layer networks without recurrence and with fixed input and output layers. Description of a perceptron, its limits and extensions that should avoid the limitations. Derivation of learning procedures and discussion about their problems.

As already mentioned in the history of neural networks, the perceptron was described by Frank Rosenblatt in 1958 [Ros58]. Initially, Rosenblatt defined the already discussed weighted sum and a non-linear activation function as components of the perceptron.

There is no established definition for a perceptron, but most of the time the term is used to describe a feedforward network with shortcut connections. This network has a layer of scanner neurons (retina) with statically weighted connections to the next following layer, which is called the input layer (fig. 5.1); but the weights of all other layers are allowed to be changed. All neurons subordinate to the retina are pattern detectors. Here we initially use a binary perceptron with every output neuron having exactly two possible output values (e.g. {0, 1} or {−1, 1}). Thus, a binary threshold function is used as activation function, depending on the threshold value Θ of the output neuron.

In a way, the binary activation function represents an IF query which can also be negated by means of negative weights. The perceptron can thus be used to accomplish real logical information processing.

Remark: Whether this method is reasonable is another matter – of course, this is not the easiest way to achieve Boolean logic. I just want to illustrate that perceptrons can be used as simple logical components and that, theoretically speaking, any Boolean function can be realized by means of perceptrons being subtly connected in series or interconnected. But we will see that this is not possible without series connection.


Figure 5.1: Architecture of a perceptron with one layer of variable connections in different views. The solid-drawn weight layer in the two illustrations on the bottom can be trained.
Left side: Example of scanning information in the eye.
Right side, upper part: Drawing of the same example with the indicated fixed-weight layer, using the defined designs of the functional descriptions for neurons.
Right side, lower part: Without the indicated fixed-weight layer, with the name of each neuron corresponding to our convention. The fixed-weight layer will no longer be taken into account in the course of this paper.


Before providing the definition of the perceptron, I want to define some types of neurons used in this chapter.
Definition 5.1 (Input neuron): An input neuron is an identity neuron. It exactly forwards the information received. Thus, it represents the identity function, and in the figures an input neuron is drawn as a circle containing the corresponding identity symbol (cf. fig. 3.10).

Definition 5.2 (Information processing neuron): Information processing neurons process the input information somehow or other, i.e. they do not represent the identity function. A binary neuron sums up all inputs by using the weighted sum as propagation function, which we want to illustrate by the sigma sign Σ. The activation function of such a neuron is the binary threshold function, which can be illustrated by L|H; in the figures, a binary neuron therefore carries both symbols. Other neurons with the weighted sum as propagation function but with the hyperbolic tangent or the Fermi function as activation function, or with a separately defined activation function fact, are drawn analogously with the respective symbol written into the neuron.

These neurons are also referred to as Fermi neurons or Tanh neurons.

Now that we know the components of a perceptron, we should be able to define it.

Definition 5.3 (Perceptron): The perceptron (fig. 5.1) is1 a feedforward network containing a retina that is used only for data acquisition and which has fixed-weighted connections to the first neuron layer (the input layer). The fixed-weight layer is followed by at least one trainable weight layer. One neuron layer is completely linked with the following layer. The first layer of the perceptron consists of the input neurons defined above.

A feedforward network often contains shortcuts, which does not exactly correspond to the original description and therefore is not added to the definition.

Remark (on the retina): We can see that the retina is not included in the lower part of fig. 5.1. As a matter of fact, the first neuron layer is often understood (simplified and sufficient for this method) as the input layer, because this layer only forwards the input values. The retina itself and the static weights behind it are no longer mentioned or mapped, since they do not process information in any case. So, the mapping of a perceptron starts with the input neurons.

1 It may rub some readers the wrong way that Iclaim there is no definition of a perceptron butthen define the perceptron in the following section.I therefore suggest keeping my definition in theback of your mind and just take it for granted inthe course of this paper.


5.1 The Single-layer Perceptron provides only one trainable weight layer

Here, connections with trainable weights go from the input layer to an output neuron Ω, which returns the information whether the pattern entered at the input neurons is recognized or not. Thus, a single-layer perceptron (abbreviated SLP) has only one level of trainable weights (fig. 5.1 on page 72).

Definition 5.4 (Single-layer perceptron): A single-layer perceptron (SLP) is a perceptron having only one variable-weight layer and one layer of output neurons Ω. The technical view of an SLP is shown in fig. 5.2.

Remark: Certainly, the existence of several output neurons Ω1, Ω2, . . . , Ωn does not considerably change the principle of the perceptron (fig. 5.3): A perceptron with several output neurons can also be regarded as several different perceptrons with the same input.

The Boolean functions AND and OR shownin fig. 5.4 on the right page are trivial, com-posable examples.

Now we want to know how to train a single-layer perceptron. For this, we will at firsttake a look at the perceptron learning al-gorithm and then we will look at the Deltarule.


Figure 5.2: A single-layer perceptron with twoinput neurons and one output neuron. The net-work returns the output by means of the ar-row leaving the network. The trainable layer ofweights is situated in the center (labeled). As areminder the bias neuron is again included here.Although the weight wBIAS,Ω is a normal weightand also treated like this, I have represented itby a dotted line – which significantly increasesthe perceivability of larger networks. In future,the bias neuron will no longer be included.


Figure 5.3: Single-layer perceptron with severaloutput neurons



Figure 5.4: Two single-layer perceptrons forBoolean functions. The upper single-layer per-ceptron realizes an AND, the lower one realizesan OR. The activation function of the informa-tion processing neuron is the binary thresholdfunction, the threshold values are written intothe neurons where available.
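To make the figure concrete, the following small Python sketch evaluates both single-layer perceptrons over all four binary inputs. Only the thresholds 1.5 (AND) and 0.5 (OR) are taken from the figure; the unit weights are an assumption chosen for illustration.

# Minimal sketch of the two SLPs from fig. 5.4: weighted sum as propagation
# function, binary threshold as activation function. The unit weights are an
# assumption; only the thresholds appear in the figure.
def slp(inputs, weights, theta):
    net = sum(o * w for o, w in zip(inputs, weights))
    return 1 if net >= theta else 0

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2,
              "AND:", slp((i1, i2), (1, 1), 1.5),
              "OR:", slp((i1, i2), (1, 1), 0.5))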

5.1.1 Perceptron Learning Algorithm and Convergence Theorem

The original perceptron learning algorithm with binary neuron activation function is described in alg. 1. It is proved that the algorithm converges in finite time – so in finite time the perceptron can learn anything it can represent (perceptron convergence theorem, [Ros62]). But don't halloo till you're out of the wood! What the perceptron is capable of representing will be explored later.

During the exploration of linear separabil-ity of problems we will cover the fact thatat least the single-layer perceptron unfor-tunately cannot represent a lot of prob-lems.

5.2 The Delta Rule as a gradient-based learning strategy for SLPs

In the following we deviate from our binary threshold value as activation function, because at least for backpropagation of error we need, as you will see, a differentiable or even a semi-linear activation function. For the now following Delta rule (derived, like backpropagation, in [MR86]) this is not always necessary but useful. This fact, however, will also be pointed out in the appropriate part of this paper.


Algorithm 1 Perceptron learning algorithm. The perceptron learning algorithm reduces the weights to output neurons that return 1 instead of 0, and in the inverse case increases the weights.

while ∃p ∈ P and error too large do
  Input p into the network, calculate output y        {P: set of training patterns}
  for all output neurons Ω do
    if yΩ = tΩ then
      Output is okay, no correction of weights
    else
      if yΩ = 0 then
        for all input neurons i do
          wi,Ω := wi,Ω + oi        {increase weight towards Ω by oi}
        end for
      end if
      if yΩ = 1 then
        for all input neurons i do
          wi,Ω := wi,Ω − oi        {decrease weight towards Ω by oi}
        end for
      end if
    end if
  end for
end while
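As a sketch, algorithm 1 can be transcribed almost literally into Python. The SLP here has one output neuron Ω whose threshold is realized by a constant bias input, and the example pattern set (the OR function) is an assumption for illustration; since OR is linearly separable, the perceptron convergence theorem guarantees termination.

# Sketch of the perceptron learning algorithm (alg. 1) for one output neuron.
def train_perceptron(P, weights, max_sweeps=100):
    """P: list of (input vector o, teaching input t) with t in {0, 1}."""
    for _ in range(max_sweeps):
        corrected = False
        for o, t in P:
            net = sum(oi * wi for oi, wi in zip(o, weights))
            y = 1 if net >= 0 else 0            # binary threshold activation
            if y == t:
                continue                        # output okay, no correction
            corrected = True
            if y == 0:                          # increase weights towards Omega
                weights = [wi + oi for wi, oi in zip(weights, o)]
            else:                               # y = 1: decrease weights
                weights = [wi - oi for wi, oi in zip(weights, o)]
        if not corrected:                       # error no longer too large
            return weights
    return weights

# OR function; the leading constant 1 acts as the bias input.
P = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
print(train_perceptron(P, [0.0, 0.0, 0.0]))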


Compared with the above-mentioned perceptron learning algorithm, the Delta rule has the advantage of being suitable for non-binary activation functions and, when being far away from the learning target, of automatically learning faster.

Let us assume that we have a single-layerperceptron with randomly set weights towhich we want to teach a function bymeans of training examples. The set ofthese training examples is called P . It con-tains, as already defined, the pairs (p, t) ofthe training examples p and the associatedteaching input t. I also want to remind youthat

. x is the input vector and

. y is the output vector of a neural net-work,

. output neurons are referred to asΩ1,Ω2, . . . ,Ω|O|,

. i is the input and

. o is the output of a neuron.

Additionally, we defined that

. the error vector Ep represents the dif-ference (t−y) under a certain trainingexample p.

. Furthermore, let O be the set of out-put neurons and

. I be the set of input neurons.

Another naming convention shall be that,for example, for the output o and theteaching input t an additional index p maybe set in order to indicate that this value ispattern-specific. Sometimes this will con-siderably enhance clarity.

Now our learning target will certainly be that for all training examples the output y of the network is approximately equal to the desired output t, i.e. formally it holds that

∀p : y ≈ t  or  ∀p : Ep ≈ 0.

This means we first have to understand the total error Err as a function of the weights: The total error increases or decreases depending on how we change the weights.

Definition 5.5 (Error function): The error function

Err : W → R

understands the set² of weights W as a vector and maps the values onto the normed output error (normed because otherwise not all errors can be mapped onto one single e ∈ R to perform a gradient descent). It is obvious that a specific error function Errp(W) can analogously be generated for a single pattern p.

As already shown in section 4.5 on the subject of gradient descent, gradient descent procedures calculate the gradient of an arbitrary but finite-dimensional function (here: of the error function Err(W)) and descend against the gradient until a minimum is reached. Err(W) is defined on the set of all weights, which is here understood to be the vector W. So we try to decrease or to minimize the error by, casually speaking, turning the weights – thus you receive information about how

2 Following the tradition of the literature, I previously defined W as a weight matrix. I am aware of this conflict but it shall not bother us here.



Figure 5.5: Exemplary error surface of a neural network with two trainable connections w1 and w2. Generally, neural networks have more than two connections, but this would have made the illustration too complex. And most of the time the error surface is too craggy, which complicates the search for the minimum.

to change the weights (the change in allweights is referred to as ∆W ) by derivingthe error function Err(W ):

∆W ∼ −∇Err(W ). (5.1)

Due to this proportionality a proportionalconstant η provides equality (η will soonget another meaning and a real practicaluse beyond the mere meaning of propor-tional constant. I just want to ask thereader to be patient for a while.):

∆W = −η∇Err(W ). (5.2)

The derivative of the error function according to the weights is now written as a normal partial derivative according to a weight wi,Ω (there are only variable weights to the output neurons Ω), so that it can be handled mathematically more easily. Thus, we turn every single weight and see how the error function is changing, i.e. we derive the error function according to a weight wi,Ω and receive the information ∆wi,Ω of how to change this weight:

∆wi,Ω = −η · ∂Err(W)/∂wi,Ω .  (5.3)

A question therefore arises: How is our error function defined exactly? It is not good if many results are far away from the desired ones; the error function should then provide large values – on the other hand, it is similarly bad if many results are close to the desired ones but there is one extreme outlier very far away. So we use the squared distance between the output vector y and the teaching input t, which provides the error Errp that is specific for a training example p, summed over the output of all output neurons Ω:

Errp(W) = 1/2 ∑Ω∈O (tp,Ω − yp,Ω)² .  (5.4)

Thus, we square the difference of the components of the vectors t and y under the pattern p and sum up these squares. The error definition Err and therefore the definition of the error function Err(W) then result from the summation of the specific errors Errp(W) of all patterns p:

Err(W) = ∑p∈P Errp(W)  (5.5)
       = 1/2 ∑p∈P ∑Ω∈O (tp,Ω − yp,Ω)² ,  (5.6)

where the outer sum runs over all patterns p and the inner sum over all output neurons Ω.
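The two error definitions translate into a few lines of code. The following sketch uses plain Python lists for the vectors; the example numbers are made up for illustration.

# Sketch of equations 5.4 to 5.6: specific error of one pattern and total error.
def err_p(t, y):
    # halved squared distance between teaching input t and output y (eq. 5.4)
    return 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, y))

def err_total(patterns):
    # sum of the specific errors over all patterns (eqs. 5.5 / 5.6)
    return sum(err_p(t, y) for t, y in patterns)

print(err_p([1.0, 0.0], [0.8, 0.1]))                 # 0.5 * (0.04 + 0.01) = 0.025
print(err_total([([1.0], [0.9]), ([0.0], [0.2])]))   # 0.005 + 0.02 = 0.025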


Remark: The attentive reader will certainly wonder where the factor 1/2 in equation 5.4 suddenly appeared from and where the root is in the equation, since the equation is very similar to the Euclidean distance. Both facts result from simple pragmatics: It is a matter of error minimization. Since the root function is monotonically increasing in its argument, we can omit it – the location of the minimum does not change, and we save calculation and implementation effort. Equally, it does not matter whether the part to be minimized is divided in half by the prefactor 1/2: Therefore I am allowed to multiply by 1/2. This is mere convenience, so that it cancels against the factor 2 arising from the derivative in the course of our calculation.

Now we want to continue to derive theDelta rule for linear activation functions.We have already discussed that we turnthe individual weights wi,Ω a bit and seehow the error Err(W ) is changing – whichcorresponds to the derivative of the er-ror function Err(W ) according to the verysame weight wi,Ω. This derivative cor-responds to the sum of the derivativesof all specific errors Errp according tothis weight (since the total error Err(W )results from the sum of the specific er-rors):

∆wi,Ω = −η · ∂Err(W)/∂wi,Ω  (5.7)
      = ∑p∈P −η · ∂Errp(W)/∂wi,Ω .  (5.8)

Once again I want to think about the question of how a neural network processes data. Basically, the data is only transferred through a function, the result of the function is sent through another one, and so on. If we ignore the output function, the path of the neuron outputs oi1 and oi2, which the neurons i1 and i2 enter into a neuron Ω, initially runs through the propagation function (here: weighted sum), from which the network input is received:

oi1, oi2 → fprop
⇒ fprop(oi1, oi2) = oi1 · wi1,Ω + oi2 · wi2,Ω = netΩ

Then this is sent through the activationfunction of the neuron Ω so that we re-ceive the output of this neuron which is atthe same time a component of the outputvector y:

netΩ → fact
⇒ fact(netΩ) = oΩ = yΩ .

As we can see, this output results frommany nested functions:

oΩ = fact(netΩ) (5.9)

= fact(oi1 · wi1,Ω + oi2 · wi2,Ω). (5.10)

It is clear that we could break this output down further to the input neurons (this is unnecessary here, since they do not process information in an SLP). Thus, we want to calculate the derivatives of equation 5.8 and, due to the nested functions, we can apply


the chain rule to factorize the derivative ∂Errp(W)/∂wi,Ω included in equation 5.8.

∂Errp(W)/∂wi,Ω = ∂Errp(W)/∂op,Ω · ∂op,Ω/∂wi,Ω .  (5.11)

Let us take a look at the first multiplicative factor of the above-mentioned equation 5.11, which represents the derivative of the specific error Errp(W) according to the output, i.e. the change of the error Errp with an output op,Ω: The examination of Errp (equation 5.4) clearly shows that this change is exactly the difference between teaching input and output (tp,Ω − op,Ω) (remember: since Ω is an output neuron, op,Ω = yp,Ω). The closer the output is to the teaching input, the smaller is the specific error. Thus we can replace the one by the other. This difference is also called δp,Ω (which is the reason for the name Delta rule):

∂Errp(W)/∂wi,Ω = −(tp,Ω − op,Ω) · ∂op,Ω/∂wi,Ω  (5.12)
               = −δp,Ω · ∂op,Ω/∂wi,Ω  (5.13)

The second multiplicative factor of equation 5.11 and of the following one is the derivative of the output of the neuron Ω for the pattern p according to the weight wi,Ω. So how does op,Ω change when the weight from i to Ω is changed? Due to the requirement at the beginning of the derivation we only have a linear activation function fact; therefore we

can just as well look at the change of thenetwork input when wi,Ω is changing:

∂Errp(W)/∂wi,Ω = −δp,Ω · ∂(∑i∈I op,i wi,Ω)/∂wi,Ω .  (5.14)

The resulting derivative ∂(∑i∈I op,i wi,Ω)/∂wi,Ω can now be simplified: The function ∑i∈I(op,i wi,Ω) to be derived consists of many summands, and only the summand op,i wi,Ω contains the variable wi,Ω, according to which we derive. Thus, ∂(∑i∈I op,i wi,Ω)/∂wi,Ω = op,i and therefore:

∂Errp(W)/∂wi,Ω = −δp,Ω · op,i  (5.15)
               = −op,i · δp,Ω .  (5.16)

This we insert in equation 5.8 on the previ-ous page, which results in our modificationrule for a weight wi,Ω:

∆wi,Ω = η · ∑p∈P op,i · δp,Ω .  (5.17)

However: From the very beginning the derivation has been intended as an offline rule: the errors of all patterns are added up and the weights are only learned after all patterns have been presented. Although this approach is mathematically correct, the implementation is far more time-consuming and, as we will see later in this chapter, partially needs a lot of computational effort during training.

The ”online-learning version” of the Deltarule simply omits the summation and


learning is realized immediately after the presentation of each pattern; this also simplifies the notation (which no longer needs to be related to a pattern p):

∆wi,Ω = η · oi · δΩ. (5.18)

This version of the Delta rule shall be used for the following definition:

Definition 5.6 (Delta rule): If we determine, analogously to the above-mentioned derivation, that the function h of the Hebbian theory (equation 4.6 on page 66) only provides the output oi of the predecessor neuron i, and if the function g is the difference between the desired activation tΩ and the actual activation aΩ, we will receive the Delta rule, also known as Widrow-Hoff rule:

∆wi,Ω = η · oi · (tΩ − aΩ) = ηoiδΩ (5.19)

If the desired output instead of activationis used as teaching input, and thereforethe output function of the output neuronsdoes not represent an identity, we will re-ceive

∆wi,Ω = η · oi · (tΩ − oΩ) = ηoiδΩ (5.20)

and δΩ then corresponds to the difference between tΩ and oΩ.

Remark: In the case of the Delta rule, the change of all weights to an output neuron Ω is proportional

. to the difference between the current activation or output aΩ or oΩ and the corresponding teaching input tΩ. We want to refer to this factor as δΩ, which is also spoken of as "Delta", and

. to the output oi of the predecessor neuron i.

In. 1   In. 2   Output
0       0       0
0       1       1
1       0       1
1       1       0

Table 5.1: Definition of the logical XOR. The input values are shown on the left, the output values on the right.

Apparently, the Delta rule only applies to SLPs, since the formula always refers to the teaching input, and there is no teaching input for the inner processing layers of neurons.
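A minimal sketch of the online Delta rule (eq. 5.18) for an SLP with a linear activation function might look as follows. The training task – a small linear mapping – and the learning rate are assumptions chosen for illustration only.

# Online Delta rule for one linear output neuron Omega.
def train_delta_rule(P, weights, eta=0.1, epochs=50):
    """P: list of (input vector o, teaching input t_Omega)."""
    for _ in range(epochs):
        for o, t in P:
            out = sum(oi * wi for oi, wi in zip(o, weights))   # linear output
            delta = t - out                                    # delta_Omega
            weights = [wi + eta * oi * delta                   # eq. 5.18
                       for wi, oi in zip(weights, o)]
    return weights

# Teach the SLP the mapping t = 2*o1 - 1*o2 from four examples.
P = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
print(train_delta_rule(P, [0.0, 0.0]))   # approaches [2.0, -1.0]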

5.3 An SLP is only capable of representing linearly separable data

Let f be the XOR function which expects two binary inputs and generates a binary output (for the precise definition see table 5.1).

Let us try to represent the XOR func-tion by means of an SLP with two inputneurons i1, i2 and one output neuron Ω(fig. 5.6 on the next page).

Here we use the weighted sum as propagation function, a binary activation function with the threshold value Θ and the identity as output function. Depending on i1 and i2, Ω has to output the value 1 if the following holds:

netΩ = oi1wi1,Ω + oi2wi2,Ω ≥ ΘΩ (5.21)



Figure 5.6: Sketch of a single-layer perceptronthat shall represent the XOR function - which isimpossible.

We assume a positive weight wi2,Ω; the inequation 5.21 is then equivalent to the inequation

oi2 ≥ 1/wi2,Ω · (ΘΩ − oi1 wi1,Ω)  (5.22)

With a constant threshold value ΘΩ, the right-hand part of inequation 5.22 is a straight line through a coordinate system defined by the possible outputs oi1 and oi2 of the input neurons i1 and i2 (fig. 5.7).

For a (as required for inequation 5.22) pos-itive wi2,Ω the output neuron Ω fires at theinput combinations lying above the gener-ated straight line. For a negative wi2,Ω itwould fire for all input combinations lyingbelow the straight line. Note that only thefour corners of the unit square are possi-ble inputs because the XOR function onlyknows binary inputs.

Figure 5.7: Linear separation of n = 2 inputs of input neurons i1 and i2 by a 1-dimensional straight line. A and B show the corners belonging to the sets of the XOR function to be separated.


Figure 5.8: Linear separation of n = 3 inputs from input neurons i1, i2 and i3 by a 2-dimensional plane.

In order to solve the XOR problem, we have to turn and move the straight line so that the input set A = {(0, 0), (1, 1)} is separated from the input set B = {(0, 1), (1, 0)} – which is, obviously, impossible.

Generally, the input options of n many input neurons can be represented in an n-dimensional cube which an SLP separates by an (n − 1)-dimensional hyperplane (fig. 5.8). Only sets that can be separated by such a hyperplane, i.e. which are linearly separable, can be classified by an SLP.

Remark: Unfortunately, it seems that the percentage of linearly separable problems rapidly decreases with increasing n (see table 5.2), which limits the functionality of the SLP.

n    number of binary functions    linearly separable ones    share
1    4                             4                          100%
2    16                            14                         87.5%
3    256                           104                        40.6%
4    65,536                        1,772                      2.7%
5    4.3 · 10^9                    94,572                     0.002%
6    1.8 · 10^19                   5,028,134                  ≈ 0%

Table 5.2: Number of functions concerning n binary inputs, and number and proportion of those functions which can be linearly separated. In accordance with [Zel94, Wid89, Was89].

Additionally, tests for linear separability are difficult. Thus, for more difficult tasks with more inputs we need something more powerful than the SLP.

Example: The XOR problem itself is one of these tasks, since a perceptron that is supposed to represent the XOR function already needs a hidden layer (fig. 5.9).

5.4 A Multi-layer Perceptron contains more trainable weight layers

A perceptron with two or more trainable weight layers (called multi-layer perceptron or MLP) is more powerful than an SLP. As we know, a single-layer perceptron can divide the input space by means of a hyperplane (in a two-dimensional input space by means of a straight line).



Figure 5.9: Neural network realizing the XOR function. Threshold values (where they exist) are located within the neurons.
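I cannot read the exact weights off fig. 5.9 here, so the following numeric check uses a classic weight assignment with the same structure (two inputs, one hidden neuron with threshold 1.5, one output neuron with threshold 0.5); all concrete numbers are assumptions for illustration, not necessarily those of the figure.

# Hedged check of an XOR network of the shape of fig. 5.9.
def threshold(net, theta):
    return 1 if net >= theta else 0

def xor_net(i1, i2):
    h = threshold(1.0 * i1 + 1.0 * i2, 1.5)                # hidden neuron fires only for (1, 1)
    return threshold(1.0 * i1 + 1.0 * i2 - 2.0 * h, 0.5)   # OR of the inputs minus the (1, 1) case

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, "->", xor_net(i1, i2))   # prints the XOR truth table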

A two-stage perceptron (two trainable weight layers, three neuron layers) can classify convex polygons by further processing these straight lines, e.g. in the form "recognize patterns lying above straight line 1, below straight line 2 and below straight line 3". Thus, we – metaphorically speaking – took an SLP with several output neurons and "attached" another SLP (upper part of fig. 5.10). A multi-layer perceptron represents a universal function approximator, which is proven by the Theorem of Cybenko [Cyb89].

Another trainable weight layer proceedsequally, only with convex polygons thatcan be added, subtracted or reworked withother operations (lower part of fig. 5.10 onthe right page).

Generally, it can be mathematically proved that even a multi-layer perceptron with one layer of hidden neurons can arbitrarily precisely approximate any function with a finite number of points of discontinuity as well as its first derivative. Unfortunately, this proof is not constructive, and therefore it is left to us to find the correct number of neurons and weights.

In the following we want to use a widespread abbreviated form for different multi-layer perceptrons: Thus, a two-stage perceptron with 5 neurons in the input layer, 3 neurons in the hidden layer and 4 neurons in the output layer is a 5-3-4-MLP.

Definition 5.7 (Multi-layer perceptron): Perceptrons with more than one layer of variably weighted connections are referred to as multi-layer perceptrons (MLP). Thereby an n-layer or n-stage perceptron has exactly n variable weight layers and n + 1 neuron layers (the retina is disregarded here), with neuron layer 1 being the input layer.

Since three-stage perceptrons can classify sets of any form by combining and separating arbitrarily many convex polygons, another step will not be advantageous with respect to function representations.

Remark: Be cautious when reading the literature: Opinions are divided over the definition of layers. Some sources count the neuron layers, some count the weight layers. Some sources include the retina, some the trainable weight layers. Some exclude (for whatever reason) the output neuron layer. In this paper, I chose the definition that provides, in my opinion, the most information about the learning capabilities – and



Figure 5.10: We know that an SLP represents a straight line. With 2 trainable weight layers severalstraight lines can be combined to form convex polygons (above). By using 3 trainable weight layersseveral polygons can be formed into arbitrary sets (below).


n    classifiable sets
1    hyperplane
2    convex polygon
3    any set
4    any set as well, i.e. no advantage

Table 5.3: Here it is represented which perceptron can classify which types of sets, with n being the number of trainable weight layers.

I will use it consistently. Remember: An n-stage perceptron has exactly n trainable weight layers.

You can find a summary of which percep-trons can classify which types of sets intable 5.3. We now want to face the chal-lenge of training perceptrons with morethan one weight layer.

5.5 Backpropagation of Error generalizes the Delta Rule to allow for MLP training

Next, I want to derive and explain the backpropagation of error learning rule (abbreviated: backpropagation, backprop or BP), which can be used to train multi-stage perceptrons with semi-linear³ activation functions. Binary threshold functions and other non-differentiable functions are no

3 Semi-linear functions are monotonic and differentiable – but generally they are not linear.

longer supported, but that does not matter: We have seen that the Fermi function or the hyperbolic tangent can approximate the binary threshold function arbitrarily closely by means of a temperature parameter T. To a large extent I will follow the derivation according to [Zel94] and [MR86]. Once again I want to point out that this procedure had previously been published by Paul Werbos in [Wer74] but had considerably fewer readers than [MR86].

Backpropagation is a gradient descent pro-cedure (including all strengths and weak-nesses of the gradient descent) with theerror function Err(W ) receiving all nweights as argument (fig. 5.5 on page 78)and assigning them to the output error,i.e. being n-dimensional. On Err(W ) apoint of small error or even a point ofthe smallest error is sought by means ofthe gradient descent. Thus, equally to theDelta rule the backpropagation trains theweights of the neural network. And it is ex-actly the Delta rule or its variable δi for aneuron i which is expanded from one train-able weight layer to several ones by back-propagation.

Let us define in advance that the network input of the individual neurons i results from the weighted sum. Furthermore, as with the derivation of the Delta rule, let op,i, netp,i etc. be defined as the already familiar oi, neti, etc. under the input pattern p we use for training. Let the output function be the identity again, thus op,i = fact(netp,i) holds for any neuron i. Since this is a generalization of the Delta rule, we use the same formula framework



Figure 5.11: Illustration of the position of ourneuron h within the neural network. It is lyingin layer H, the preceding layer is K, the nextfollowing layer is L.

as with the Delta rule (equation 5.20). As already indicated, we have to generalize the variable δ for every neuron.

At first: Where is the neuron for which we want to calculate a δ? It is obvious to select an arbitrary inner neuron h having a set K of predecessor neurons k as well as a set L of successor neurons l, which are also inner neurons (see fig. 5.11). Thereby it is irrelevant whether the predecessor neurons are already input neurons.

Now we perform the same derivation as for the Delta rule and split functions by means of the chain rule. I will not discuss this derivation in great detail, but the principle is similar to that of the Delta

rule (the differences are, as already men-tioned, in the generalized δ). We initiallyderive the error function Err according toa weight wk,h.

∂Err(wk,h)/∂wk,h = ∂Err/∂neth · ∂neth/∂wk,h ,  (5.23)

where the first factor ∂Err/∂neth equals −δh.

The first factor of equation 5.23 is −δh, which we will regard later in this text. The numerator of the second factor of the equation includes the network input, i.e. the weighted sum is included in the numerator so that we can immediately derive it. All summands of the sum drop out again apart from the summand containing wk,h. This summand is referred to as wk,h · ok. If it is derived, the output of the neuron k is left:

∂neth/∂wk,h = ∂(∑k∈K wk,h ok)/∂wk,h  (5.24)
            = ok  (5.25)

As promised we will now discuss the −δh of equation 5.23, which is split again by means of the chain rule:

δh = −∂Err/∂neth  (5.26)
   = −∂Err/∂oh · ∂oh/∂neth  (5.27)

The derivative of the output according to the network input (the second factor in equation 5.27) is certainly equal to


the derivative of the activation function according to the network input:

∂oh/∂neth = ∂fact(neth)/∂neth  (5.28)
          = f′act(neth)  (5.29)

Now we analogously derive the first factorin the equation 5.27 on the previous page.The reader may well mull over this pas-sage. For this we have to point out thatthe derivation of the error function accord-ing to the output of an inner neuron layerdepends on the vector of all network in-puts of the next following layer. This isreflected in equation 5.30:

−∂Err/∂oh = −∂Err(netl1, . . . , netl|L|)/∂oh  (5.30)

According to the definition of the multi-dimensional chain rule, equation 5.31 immediately follows⁴:

−∂Err/∂oh = ∑l∈L ( −∂Err/∂netl · ∂netl/∂oh )  (5.31)

Each summand of the sum in equation 5.31 contains two factors. Now we want to discuss these factors, which are summed up over the next following layer L. We simply calculate the second factor in the following equations 5.32 and 5.33:

∂netl/∂oh = ∂(∑h∈H wh,l · oh)/∂oh  (5.32)
          = wh,l  (5.33)

4 For more information please consult the usual anal-ysis literature.

The same applies for the first factor accord-ing to the definition of our δ:

−∂Err/∂netl = δl  (5.34)

Now we replace:

⇒ −∂Err/∂oh = ∑l∈L δl wh,l  (5.35)

You can find a graphic version of the δgeneralization including all splittings infig. 5.12 on the right page.

The reader might already have noticed that some intermediate results were framed. Exactly those intermediate results were framed which are a factor in the change in weight of wk,h. If the above-mentioned equations are combined with the framed intermediate results, the outcome will be the wanted change in weight ∆wk,h:

∆wk,h = η ok δh  with  (5.36)
δh = f′act(neth) · ∑l∈L (δl wh,l)

But this is certainly only the case if h is an inner neuron (otherwise there would not be a following layer L).

The case of h being an output neuron hasalready been discussed during the deriva-tion of the Delta rule. All in all, the re-sult is the generalization of the Delta rule,called backpropagation of error :

∆wk,h = η ok δh  with
δh = f′act(neth) · (th − yh)            (h is an output neuron)
δh = f′act(neth) · ∑l∈L (δl wh,l)       (h is an inner neuron)
(5.37)



Figure 5.12: Graphical representation of the equations (by equal signs) and chain rule splittings(by arrows) in the framework of the backpropagation derivation. The leaves of the tree reflect thefinal results from the generalization of δ, which are framed in the derivation.


Unlike the Delta rule, δ is treated differently depending on whether h is an output or an inner (i.e. hidden) neuron:

1. If h is an output neuron, then

δp,h = f′act(netp,h) · (tp,h − yp,h)  (5.38)

Thus, under our training pattern p the weight wk,h from k to h is changed proportionally to

. the learning rate η,

. the output op,k of the predeces-sor neuron k,

. the gradient of the activationfunction at the position of thenetwork input of the successorneuron f ′act(netp,h) and

. the difference between teachinginput tp,h and output yp,h of thesuccessor neuron h.


In this case, backpropagation is work-ing on two neuron layers, the outputlayer with the successor neuron h andthe preceding layer with the predeces-sor neuron k.

2. If h is an inner, hidden neuron, then

δp,h = f′act(netp,h) · ∑l∈L (δp,l · wh,l)  (5.39)

Here I want to explicitly mention that backpropagation is now working on three layers. At this, the neuron k is the predecessor of the connection to be changed with the weight wk,h, the neuron h is the successor of the connection to be changed and the neurons l are lying in the layer following the successor neuron. Thus, according to our training pattern p the weight wk,h from k to h is changed proportionally to

. the learning rate η,

. the output of the predecessorneuron op,k,

. the gradient of the activationfunction at the position of thenetwork input of the successorneuron f ′act(netp,h),

. as well as, and this is the difference, the weighted sum of the deltas of all neurons following h, ∑l∈L(δp,l · wh,l).

Definition 5.8 (Backpropagation): If we summarize formulas 5.38 and 5.39, we receive the following total formula for backpropagation (the identifiers p are omitted for reasons of clarity):

∆wk,h = η ok δh  with
δh = f′act(neth) · (th − yh)            (h is an output neuron)
δh = f′act(neth) · ∑l∈L (δl wh,l)       (h is an inner neuron)
(5.40)

It is obvious that backpropagation initially processes the last weight layer directly by means of the teaching input and then works its way backwards from layer to layer, each time taking the already computed deltas into account. Thus, the teaching input leaves traces in all weight layers.
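To make the two cases of equation 5.40 tangible, here is a compact sketch of online backpropagation for a small MLP with the Fermi function in the hidden and output layer. The network size (2-3-1), the learning rate, the number of epochs and the XOR training set are assumptions chosen to keep the example short; depending on the random initialization, a different seed or more epochs may be needed.

import math, random

def fermi(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(1)
n_hidden = 3
w_in = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(n_hidden)]  # from (i1, i2, bias)
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]                 # from (h1..h3, bias)
eta = 0.5
P = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]                         # XOR

for epoch in range(20000):
    for (i1, i2), t in P:
        o_in = [i1, i2, 1.0]
        o_h = [fermi(sum(w * o for w, o in zip(row, o_in))) for row in w_in] + [1.0]
        y = fermi(sum(w * o for w, o in zip(w_out, o_h)))
        # output neuron: delta = f'(net) * (t - y); for the Fermi function f' = y * (1 - y)
        delta_y = y * (1 - y) * (t - y)
        # inner neurons: delta = f'(net) * sum of delta_l * w_{h,l}
        delta_h = [o_h[h] * (1 - o_h[h]) * delta_y * w_out[h] for h in range(n_hidden)]
        # weight changes: Delta w_{k,h} = eta * o_k * delta_h
        w_out = [w + eta * o * delta_y for w, o in zip(w_out, o_h)]
        for h in range(n_hidden):
            w_in[h] = [w + eta * o * delta_h[h] for w, o in zip(w_in[h], o_in)]

for (i1, i2), t in P:
    o_in = [i1, i2, 1.0]
    o_h = [fermi(sum(w * o for w, o in zip(row, o_in))) for row in w_in] + [1.0]
    print(i1, i2, "->", round(fermi(sum(w * o for w, o in zip(w_out, o_h))), 2))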


Remark: Here I am describing the first part of backpropagation (the Delta rule) and the second part (the generalized Delta rule over more layers) in one go, which may meet the requirements of the matter but not of the research history. The first part is obvious, which you will see in the framework of a mathematical gimmick. Decades of development time and work lie between the first and the second, recursive part. Like many groundbreaking inventions, it was not until its development that it was recognized how plausible this invention was.

5.5.1 Heading back: Boiling backpropagation down to the Delta rule

As explained above, the Delta rule is a special case of backpropagation for one-stage perceptrons and linear activation functions – I want to briefly explain this circumstance and develop the Delta rule out of backpropagation in order to raise the understanding of both rules. We have seen that backpropagation is defined by

∆wk,h = η ok δh  with
δh = f′act(neth) · (th − yh)            (h is an output neuron)
δh = f′act(neth) · ∑l∈L (δl wh,l)       (h is an inner neuron)
(5.41)

Since we only use it for one-stage perceptrons, the second part of backpropagation (the case of h being an inner neuron) is omitted without substitution. The result is:

∆wk,h = η ok δh  with  δh = f′act(neth) · (th − oh)  (5.42)

Furthermore, we only want to use linear activation functions, so that the derivative f′act is constant. As is generally known, constants can be combined, and therefore we directly merge the constant derivative f′act and the learning rate η (which is constant for at least one learning cycle) into η. Thus, the result is:

∆wk,h = ηokδh = ηok · (th − oh) (5.43)

This exactly corresponds with the Deltarule definition.

5.5.2 The selection of the learning rate has heavy influence on the learning process

In the meantime we have often seen thatthe change in weight is, in any case, pro-portional to the learning rate η. Thus, theselection of η is very decisive for the be-haviour of backpropagation and for learn-ing procedures in general.


Definition 5.9 (Learning rate): Speed and accuracy of a learning procedure can always be controlled by and are always proportional to a learning rate, which is written as η.

Remark: If the value of the chosen η is too large, the jumps on the error surface are also too large and, for example, narrow valleys could simply be jumped over. Additionally, the movements across the error surface would be very uncontrolled. Thus, a small η is desirable, which, however, can involve a huge, often unacceptable expenditure of time.


Experience shows that good learning ratevalues are in the range of

0.01 ≤ η ≤ 0.9.

The selection of η significantly depends on the problem, the network and the training data, so that it is barely possible to give practical advice. But it is popular to start with a relatively large η, e.g. 0.9, and to slowly decrease it down to 0.1, for instance. For simpler problems η can often be kept constant.

5.5.2.1 Variation of the learning rate over time

During training, another stylistic device can be a variable learning rate: In the beginning, a large learning rate learns well, but later it results in inaccurate learning. A smaller learning rate is more time-consuming, but the result is more precise. Thus, the learning rate should be decreased once or repeatedly – in steps – during the learning process.
Remark: A common error (which also seems to be a very neat solution at first glance) is to continuously decrease the learning rate. Here it easily happens that the decrease of the learning rate is faster than the ascent of a hill of the error function we are climbing. The result is that we simply get stuck at this ascent. Solution: rather reduce the learning rate stepwise as mentioned above.
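A stepwise schedule of this kind can be written down in a few lines; the start value, the reduction factor, the step length and the lower bound below are assumptions, since in practice they depend on the problem.

# Sketch of a stepwise (not continuous) learning rate reduction.
def learning_rate(epoch, eta_start=0.9, factor=0.5, step_length=100, eta_min=0.1):
    # halve eta every step_length epochs, but never go below eta_min
    eta = eta_start * (factor ** (epoch // step_length))
    return max(eta, eta_min)

for epoch in (0, 99, 100, 250, 1000):
    print(epoch, round(learning_rate(epoch), 3))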

5.5.2.2 Different layers – Different learning rates

The farther we move away from the output layer during the learning process, the slower backpropagation is learning. Thus, it is a good idea to select a larger learning rate for the weight layers close to the input layer than for the weight layers close to the output layer.

5.5.3 Initial configuration of a Multi-layer Perceptron

After having discussed the backpropaga-tion of error learning procedure and know-ing how to train an existing network, itwould be useful to consider how to acquiresuch a network.

5.5.3.1 Number of layers: Mostly two or three

Let us begin with the trivial circumstancethat a network should have one layer of in-put neurons and one layer of output neu-rons, which results in at least two layers.

Additionally, we need – as we have alreadylearned during the examination of linearseparability – at least one hidden layer ofneurons, if our problem is not linearly sep-arable (which is, as we have seen, verylikely).

It is possible, as already mentioned, tomathematically prove that this MLP withone hidden neuron layer is already capable


of arbitrarily accurate approximation of ar-bitrary functions 5 – but it is necessary notonly to discuss the representability of aproblem by means of a perceptron but alsothe learnability. Representability meansthat a perceptron can principally realize amapping - but learnability means that weare also able to teach it.

In this respect, experience shows that two hidden neuron layers (or three trainable weight layers) can be very useful to solve a problem, since many problems can be represented by one hidden layer but are very difficult to learn. Two hidden layers are still a good value because three hidden layers are not needed very often. Moreover, any additional layer generates additional sub-minima of the error function in which we can get stuck. All things considered, a promising way is to try one hidden layer at first and, if that fails, to try two.

5.5.3.2 The number of neurons has to be tested

The number of neurons (apart from in-put and output layer, the number of inputand output neurons is already defined bythe problem statement) principally corre-sponds to the number of free parametersof the problem to be represented.

Since we have already discussed the net-work capacity with respect to memorizingor a too imprecise problem representation

5 Note: We have not indicated the number of neu-rons in the hidden layer, we only mentioned thehypothetical possibility.

it is clear that our goal is to have as fewfree parameters as possible but as many asnecessary.

But we also know that there is no patent formula for the question of how many neurons should be used. Thus, the most useful approach is to initially train with only a few neurons and to repeatedly train new networks with more neurons as long as the result significantly improves and, particularly, the generalization performance is not affected (bottom-up approach).

5.5.3.3 Selecting an activation function

Another very important parameter for the way a neural network processes information is the selection of an activation function. The activation function for input neurons is already fixed, since they do not process information.

The first question to be asked is whether we actually want to use the same activation function in the hidden layer and in the output layer – no one prevents us from varying the functions. Generally, however, the activation function is the same for all hidden neurons as well as for the output neurons.

For tasks of function approximation ithas been found reasonable to use the hy-perbolic tangent (left part of fig. 5.13 onpage 95) as activation function of the hid-den neurons, while a linear activation func-tion is used in the output. The latter is


absolutely necessary so that we do not generate a limited output interval. In contrast to the linear input layer, the likewise linear output layer has threshold values and can therefore still process information. However, linear activation functions in the output can also cause huge learning steps and jumps over good minima in the error surface. This can be avoided by setting the learning rate to very small values at the output layer.

An unlimited output interval is not essential for pattern recognition tasks⁶. If the hyperbolic tangent is used in any case, the output interval will be a bit larger. Unlike the hyperbolic tangent, the Fermi function (right part of fig. 5.13) hardly allows learning far away from its threshold value (where its output is close to 0). However, a lot of discretion is given here for selecting an activation function. But generally, the disadvantage of sigmoid functions is the fact that they hardly learn anything far away from their threshold value, unless the network is modified.

5.5.3.4 Weights should be initialized with small, randomly chosen values

The initialization of weights is not as trivial as one might think.

6 Generally, pattern recognition is understood as aspecial case of function approximation with a fewdiscrete output possibilities.

If they are simply initialized with 0, there will be no change in weights at all. If they are all initialized with the same value, they will all be changed equally during training. The simple solution of this problem is called symmetry breaking, which is the initialization of weights with small random


values.
Example: The range of random values could be the interval [−0.5; 0.5], not including 0 or values very close to 0.
Remark: This random initialization has a nice side effect: Chances are that the average of the network inputs is close to 0.

5.5.4 Backpropagation has often been extended and altered

Backpropagation has often been extended.Many of these extensions can simply beimplemented as optional features of back-propagation in order to have a larger testscope. In the following I want to brieflydescribe some of them.

5.5.4.1 Momentum term

Let us assume we are descending a steep slope on skis – what prevents us from immediately stopping at the edge of the slope to the plateau? Exactly – the momentum. With backpropagation, the momentum term [RHW86b] is responsible for the fact that a kind of moment of inertia (momentum) is added to every step size (fig. 5.14 on page 96), by always adding a



Figure 5.13: As a reminder, the illustration of the hyperbolic tangent (left) and the Fermi function (right). The Fermi function was expanded by a temperature parameter. Thereby the original Fermi function is represented by dark colors; the temperature parameters of the modified Fermi functions are, from outward to inward, 1/2, 1/5, 1/10 and 1/25.

proportion of the previous change to every new change in weight:

(∆p wi,j)now = η op,i δp,j + α · (∆p wi,j)previous

Of course, this notation is only used for better understanding. Generally, as already introduced with the concept of time, the moment of the current cycle is referred to as (t), and the previous cycle is identified by (t − 1), which is successively continued. And now we come to the formal definition of the momentum term:

Definition 5.10 (Momentum term): The variation of backpropagation by means of the momentum term is defined as follows:

∆wi,j(t) = ηoiδj + α ·∆wi,j(t− 1) (5.44)

Remark: We accelerate on plateaus (which avoids quasi-standstill on plateaus) and slow down on craggy surfaces (against oscillations). Moreover, the effect of inertia can be varied via the prefactor α; common values are between 0.6 and 0.9. Additionally, the momentum enables the positive effect that our skier swings back and forth several times in a minimum and finally lands in the minimum. Despite its nice one-dimensional appearance, the otherwise very rare error of leaving good minima unfortunately occurs more frequently because of the momentum term – which means that there is again no easy answer (but we become accustomed to this conclusion).

5.5.4.2 Flat spot elimination

It must be pointed out that with the hy-perbolic tangent as well as with the Fermifunction the derivative outside of the closeproximity of Θ is nearly 0. This resultsin the fact that it is very difficult to re-move neurons from the limits of the activa-


Figure 5.14: We want to execute the gradient descent like a skier crossing a slope who would hardly stop immediately at the edge of the plateau.

tion (flat spots), which could extremely extend the learning time. This problem can be dealt with by modifying the derivative, for example by adding a constant (e.g. 0.1), which is called flat spot elimination.
Remark: Interestingly, success has also been achieved by simply using constants as derivatives [Fah88].

5.5.4.3 Second order backpropagation

According to David Parker [Par87], second order backpropagation also uses the second gradient, i.e. the second multi-dimensional derivative of the error function, to obtain more precise estimates of the correct ∆wi,j. Even higher derivatives only rarely improve the estimates. Thus, fewer training cycles are needed, but those require much more computational effort.

For higher order methods we generally use further derivatives (i.e. Hessian matrices, since the functions are multidimensional). As expected, the procedures reduce the number of learning epochs, but significantly increase the computational effort of the individual epochs. So in the end these procedures often need more learning time than backpropagation.

5.5.4.4 Quickpropagation

The quickpropagation learning procedure [Fah88] uses the second derivative of the error function and locally understands the error function to be a parabola. We analytically determine the vertex of said parabola and directly jump to it. Thus, this learning procedure is a second-order procedure. Of course, this does not work with error surfaces that cannot locally be approximated by a parabola (and certainly it is not always possible to directly say whether this is the case).

5.5.4.5 Weight decay

The weight decay according to Paul Werbos [Wer88] is a modification that extends the error by a term punishing large weights. So the error under weight decay, ErrWD,


does not only increase proportionally to the actual error but also proportionally to the square of the weights. As a result the network keeps the weights small during learning.

ErrWD = Err + β · 1/2 ∑w∈W (w)² ,  (5.45)

where the second term punishes large weights.

This approach is inspired by nature, where synaptic weights cannot become infinitely strong either. Additionally, due to these small weights the error function often shows fewer strong fluctuations, allowing easier and more controlled learning.

The prefactor 1/2 again results from simple pragmatics. The factor β controls the strength of the punishment: Values from 0.001 to 0.02 are often used here.

5.5.4.6 Pruning / Optimal Brain Damage

If we have executed the weight decay long enough and determine that for a neuron in the input layer all successor weights are 0 or close to 0, we can remove the neuron, thereby losing this neuron and some weights, and so reduce the chance that the network will memorize. This procedure is called pruning.

Such a method to detect and delete unnecessary weights and neurons is referred to as optimal brain damage [lCDS90]. I only want to describe it briefly: The mean error per output neuron is composed of two competing terms. While one term, as usual, considers the difference between output and teaching input, the other one tries to "press" a weight towards 0. If a weight is strongly needed to minimize the error, the first term will win. If this is not the case, the second term will win. Neurons which only have zero weights can be pruned in the end.

There are many other variations of back-prop and whole books only about thissubject, but since my aim is to offer anoverview of neural networks I just want tomention the variations above as a motiva-tion to read on.

For some of these extensions it is obvious that they can not only be applied to feedforward networks with backpropagation learning procedures.

We have got to know backpropagation and the feedforward topology – now we have to learn how to build a neural network. It is certainly impossible to fully convey this experience in the framework of this paper, so now it is your turn: You could try some of the problem examples from section 4.6.

5.6 The 8-3-8 encoding problem and related problems

The 8-3-8 encoding problem is a clas-sic among the multilayer perceptron testtraining problems. In our MLP wehave one input layer with eight neuronsi1, i2, . . . , i8, one output layer with eight


neurons Ω1, Ω2, . . . , Ω8 and one hidden layer with three neurons. Thus, this network represents a function B8 → B8. Now the training task is that an input of the value 1 into the neuron ij shall lead to an output of the value 1 from the neuron Ωj (only one neuron should be activated at a time), which results in 8 training examples.

During the analysis of the trained network we will see that the network with the 3 hidden neurons represents some kind of binary encoding and that the above mapping is therefore possible (assumed training time: ≈ 10^4 epochs). Thus, our network is a machine in which the input is encoded, decoded and output.
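The training set of the 8-3-8 problem is easy to generate: eight one-hot patterns whose teaching input equals the input. The sketch below only builds these pairs; any MLP training routine (for example the backpropagation sketch above) could consume them.

def encoder_patterns(n=8):
    patterns = []
    for j in range(n):
        one_hot = [1 if k == j else 0 for k in range(n)]
        patterns.append((one_hot, one_hot))   # input pattern j maps to output pattern j
    return patterns

for p, t in encoder_patterns():
    print(p, "->", t)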

Analogously, we can train a 1024-10-1024encoding problem. But is it possible toimprove the efficiency of this procedure?Could there be, for example, a 1024-9-1024- or an 8-2-8-encoding network?

Yes, even that is possible, since the net-work does not depend on binary encod-ings: Thus, an 8-2-8 network is sufficientfor our problem. But the encoding of thenetwork is far more difficult to understand(fig. 5.15) and the training of the networksrequires a lot more time.

An 8-1-8 network, however, does not work,since the possibility that the output of oneneuron is compensated by another one isessential, and if there is only one hiddenneuron, there is certainly no compensatoryneuron.

Figure 5.15: Illustration of the functionality of8-2-8 network encoding. The marked points rep-resent the vectors of the inner neuron activationassociated to the samples. As you can see, itis possible to find inner activation formations sothat each point can be separated from the restof the points by means of a straight line. The il-lustration shows an exemplary separation of onepoint.


Exercises

Exercise 8: Fig. 5.4 on page 75 shows asmall network for the boolean functionsAND and OR. Write a table with all compu-tational variables of neural networks (e.g.network input, activation etc.). Calculatethe four possible inputs of the networksand write down the values of these vari-ables for each input. Do the same for theXOR network (fig. 5.9 on page 84).Exercise 9:

1. Indicate all boolean functions B3 →B1, that are linearly separable and ex-actly characterize them.

2. Indicate those that are not linearly separable and exactly characterize them, too.

Exercise 10: A simple 2-1 network shallbe trained with one single pattern bymeans of backpropagation of error andη = 0.1. Verify if the error

Err = Errp = 1/2 (t − y)²

converges and, if so, at what value. What does the error curve look like? Let the pattern (p, t) be defined by p = (p1, p2) = (0.3, 0.7) and tΩ = 0.4. Randomly initialize the weights in the interval [−1; 1].
Exercise 11: A one-stage perceptron with two input neurons, a bias neuron and the binary threshold function as activation function divides the two-dimensional space into two regions by means of a straight line g. Analytically calculate a set of weight values for such a perceptron so that the following set

P of the 6 patterns of the form (p1, p2, tΩ) with ε ≪ 1 is correctly classified.

P = {(0, 0, −1);
     (2, −1, 1);
     (7 + ε, 3 − ε, 1);
     (7 − ε, 3 + ε, −1);
     (0, −2 − ε, 1);
     (0 − ε, −2, −1)}

Exercise 12: Calculate in a comprehensible way one vector ∆W of all changes in weight by means of the backpropagation of error procedure with η = 1. Let a 2-2-1 MLP with bias neuron be given and let the pattern be defined by

p = (p1, p2, tΩ) = (2, 0, 0.1).

For all weights with the target Ω the initial value should be 1. For all other weights the initial value should be 0.5. What is conspicuous about the changes?


Chapter 6

Radial Basis Functions

RBF networks approximate functions by stretching and compressing Gaussian bells and then summing them up in a spatially shifted manner. Description of their functions and their learning process. Comparison with multi-layer perceptrons.

According to Poggio and Girosi [PG89], radial basis function networks (RBF networks) are a paradigm of neural networks which was developed considerably later than that of the perceptrons. Like perceptrons, RBF networks are built in layers; in this case, however, they have exactly three layers, i.e. only one single layer of hidden neurons.

Like perceptrons the networks have a feed-forward structure and their layers are com-pletely linked. Here, the input layer doesalso not participate in information process-ing. The RBF networks are - like MLPs -universal function approximators.

Despite all things in common: What is thedifference between RBF networks and per-ceptrons? The difference is in the informa-tion processing itself and in the computa-tional rules within the neurons lying out-side of the input layer. So in a momentwe will define a so far unknown type ofneurons.

6.1 Components and Structure of an RBF Network

Initially, we want to discuss colloquially and then define some concepts concerning RBF networks.

Output neurons: In an RBF network the output neurons only contain the identity as activation function and one weighted sum as propagation function. Thus, they do little more than adding all input values and outputting the sum.

Hidden neurons are also called RBF neurons (and the layer in which they are located is referred to as the RBF layer). As propagation function, each hidden neuron receives a norm that calculates the distance between the input into the network and the so-called position of the neuron (its center).


This distance is entered into a radial activation function which calculates and outputs the activation of the neuron.

Definition 6.1 (RBF input neuron): Definition and representation are identical to the definition 5.1 on page 73 of the input neuron.

Definition 6.2 (Center of an RBF neuron): The center ch of an RBF neuron h is the point in the input space where the RBF neuron is located. Generally, the closer the input vector is to the center vector of an RBF neuron, the higher is its activation.

Definition 6.3 (RBF neuron): The so-called RBF neurons h have a propagation function fprop that determines the distance between the center ch of a neuron and the input vector y. This distance represents the network input. The network input is then sent through a radial basis function fact which outputs the activation or the output of the neuron. RBF neurons are represented by the symbol of a circle labeled ||c, x|| Gauß.

Definition 6.4 (RBF output neuron): RBF output neurons Ω use the weighted sum as propagation function fprop and the identity as activation function fact. They are represented by the symbol Σ.

Definition 6.5 (RBF network): An RBF network has exactly three layers in the following order: the input layer consisting of input neurons, the hidden layer (also called RBF layer) consisting of RBF neurons, and the output layer consisting of RBF output neurons. Each layer is completely linked with the next following one, shortcuts do not exist (fig. 6.1) – it is a feedforward topology. The connections between input layer and RBF layer are unweighted, i.e. they only transmit the input. The connections between RBF layer and output layer are weighted. The original definition of an RBF network only referred to one output neuron but, analogous to the perceptrons, it is apparent that such a definition can be generalized. A bias neuron is unknown in the RBF network. The set of input neurons shall be represented by I, the set of hidden neurons by H and the set of output neurons by O.

The inner neurons are called radial basis neurons because from their definition it follows directly that all input vectors with the same distance from the center of a neuron also produce the same output value (fig. 6.2).

6.2 Information processing of an RBF network

Now the question is what is realized by such a network and what is its purpose. Let us go over the RBF network from top to bottom: an RBF network receives the input by means of the unweighted connections. Then the input vector is sent through a norm so that the result is a scalar. This scalar (which, by the way, can only be positive due to the norm) is processed by a radial basis function, for example a Gaussian bell (fig. 6.3).



Figure 6.1: An exemplary RBF network with two input neurons, five hidden neurons and three output neurons. The connections to the hidden neurons are not weighted, they only transmit the input. To the right of the illustration you can find the names of the neurons, which correspond to the known names of the MLP neurons: input neurons are called i, hidden neurons are called h and output neurons are called Ω. The associated sets are referred to as I, H and O.


Figure 6.2: Let ch be the center of an RBF neuron h. Then the activation function facth provides the same value for all inputs lying on the circle.

In short, the processing chain is: input → distance → Gaussian bell → sum → output.

The output values of the different neurons of the RBF layer, i.e. of the different Gaussian bells, are added within the third layer: effectively, Gaussian bells are added here across the whole input space.

Let us assume that we have a second, a third and a fourth RBF neuron and therefore four differently located centers. Each of these neurons now measures another distance from the input to its own center and de facto provides different values, even if the Gaussian bell is the same. Since these values are finally simply accumulated in the output layer, it is easy to understand that any surface can be shaped by dragging, compressing and removing Gaussian bells and subsequent accumulation. Here, the coefficients for the superposition of the Gaussian bells lie in the weights of the connections between the RBF layer and the output layer.

Furthermore, the network architecture offers the possibility to freely define or train height and width of the Gaussian bells – due to which the network paradigm becomes even more versatile. We will get to know methods and approaches for this later.

6.2.1 Information processing in RBF neurons

RBF neurons process information by using norms and radial basis functions.

At first, let us take as an example a simple 1-4-1 RBF network. It is apparent that we will receive a one-dimensional output which can be represented as a function (fig. 6.4). Additionally, the network includes the centers c1, c2, . . . , c4 of the four inner neurons h1, h2, . . . , h4, and therefore it has Gaussian bells which are finally added within the output neuron Ω. The network also possesses four values σ1, σ2, . . . , σ4 which influence the width of the Gaussian bells. However, the height of the Gaussian bells is influenced by the subsequent weights, since the individual output values of the bells are multiplied by those weights.



Figure 6.3: Two individual one- or two-dimensional Gaussian bells. In both cases σ = 0.4 holds and in both cases the center of the Gaussian bell lies in the origin. The distance r to the center (0, 0) is simply calculated from the Pythagorean theorem: r = √(x^2 + y^2).


Figure 6.4: Four different Gaussian bells in one-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. The Gaussian bells have different heights, widths and positions. Their centers c1, c2, . . . , c4 were located at 0, 1, 3, 4, the widths σ1, σ2, . . . , σ4 at 0.4, 1, 0.2, 0.8. You can see an example for a two-dimensional case in fig. 6.5.



Figure 6.5: Four different Gaussian bells in two-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. Once again r = √(x^2 + y^2) applies for the distance. The heights w, widths σ and centers c = (x, y) are: w1 = 1, σ1 = 0.4, c1 = (0.5, 0.5); w2 = −1, σ2 = 0.6, c2 = (1.15, −1.15); w3 = 1.5, σ3 = 0.2, c3 = (−0.5, −1); w4 = 0.8, σ4 = 1.4, c4 = (−2, 0).


Since we use a norm to calculate the distance between the input vector and the center of a neuron h, we have different choices: often the Euclidean distance is chosen to calculate the distance:

rh = ||x − ch|| (6.1)
   = √( ∑i∈I (xi − ch,i)^2 ) (6.2)

Remember: the input vector was referred to as x. Here, the index i passes through the input neurons as well as through the input vector components and the neuron center components. As we can see, the Euclidean distance generates the squares of the differences of all vector components, adds them and extracts the root of the sum. In two-dimensional space this equals the Pythagorean theorem.

Remark: From the definition of a norm it follows directly that the distance can only be positive, because of which we, strictly speaking, use only the positive part of the activation function. By the way, activation functions other than the Gaussian bell are possible. Normally, functions that decrease monotonically in the interval [0; ∞] are selected.

Now that we know the distance rh between the input vector x and the center ch of the RBF neuron h, this distance has to be passed through the activation function. Here we use, as already mentioned, a Gaussian bell:

fact(rh) = exp( −rh^2 / (2σh^2) ) (6.3)

Remark (on nomenclature): It is obvious that both the center ch and the width σh can be understood as part of the activation function fact, and accordingly the individual neurons do not all share exactly the same activation function. One solution would be to number the activation functions like fact1, fact2, . . . , fact|H| with H being the set of hidden neurons. But as a result the explanation would be very confusing. So I simply use the name fact for all activation functions and regard σ and c as variables that are defined for individual neurons but not directly included in the activation function.

Remark: The reader will definitely notice that in the literature the Gaussian bell is often provided with a multiplicative factor. Due to the subsequent multiplication by the weights and the comparability of constant multiplication we do not need this factor (especially because for our purpose the integral of the Gaussian bell does not always have to be 1) and therefore we simply leave it out.
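To make the processing chain tangible, here is a minimal sketch in Python/NumPy of how one forward pass through such a network could look. The names rbf_forward, centers, sigmas and weights are my own and not taken from the text; the heights (weights) in the small example are chosen arbitrarily, only the centers and widths correspond to fig. 6.4.

import numpy as np

def rbf_forward(x, centers, sigmas, weights):
    # x:       input vector, shape (n_in,)
    # centers: RBF centers c_h, shape (n_hidden, n_in)
    # sigmas:  widths sigma_h, shape (n_hidden,)
    # weights: weights w_{h,Omega}, shape (n_hidden, n_out)
    r = np.linalg.norm(x - centers, axis=1)        # r_h = ||x - c_h||, propagation function
    act = np.exp(-(r ** 2) / (2 * sigmas ** 2))    # Gaussian bell, eq. 6.3
    return act @ weights                           # weighted sum in the output neurons

# Tiny 1-4-1 example with the centers and widths of fig. 6.4 (arbitrary heights):
centers = np.array([[0.0], [1.0], [3.0], [4.0]])
sigmas = np.array([0.4, 1.0, 0.2, 0.8])
weights = np.array([[1.0], [-0.5], [1.5], [0.8]])
print(rbf_forward(np.array([0.5]), centers, sigmas, weights))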

6.2.2 Some analytical thoughts prior to the training

The output yΩ of an RBF output neuron Ω results from combining the functions of the RBF neurons to

yΩ = ∑h∈H wh,Ω · fact(||x − ch||). (6.4)

Let us assume that, similar to the multi-layer perceptron, we have a training set P that contains |P| training examples (p, t).


Then we receive |P| functions of the form

yΩ = ∑h∈H wh,Ω · fact(||p − ch||), (6.5)

i.e. one function for each training example.

Certainly, the aim of this effort is that the output y for all training patterns p converges to the corresponding teaching input t.

6.2.2.1 Weights can simply be computed as the solution of a system of equations

Thus we can see that we have |P| equations. Now let us assume that the widths σ1, σ2, . . . , σk, the centers c1, c2, . . . , ck and the training examples p including the teaching input t are given. We are looking for the weights wh,Ω with |H| weights for one output neuron Ω. Thus, our problem can be seen as a system of equations since the only thing we want to change at the moment are the weights.

This demands a case differentiation concerning the number of training examples |P| and the number of RBF neurons |H|:

|P| = |H|: If the number of RBF neurons equals the number of patterns, i.e. |P| = |H|, the equation can be reduced to a matrix multiplication


T = M · G (6.6)
⇔ M^−1 · T = M^−1 · M · G (6.7)
⇔ M^−1 · T = E · G (6.8)
⇔ M^−1 · T = G, (6.9)

where

. T is the vector of the teaching inputs for all training examples,

. M is the |P| × |H| matrix of the outputs of all |H| RBF neurons for the |P| examples (remember: |P| = |H|, so the matrix is square and therefore invertible),

. G is the vector of the desired weights and

. E is a unit matrix of the size appropriate to G.

Mathematically speaking, we can simply calculate the weights: in the case of |P| = |H| there is exactly one RBF neuron available per training example. This means that the network exactly meets the |P| existing nodes after having calculated the weights, i.e. it performs a precise interpolation. To calculate such an equation we certainly do not need an RBF network, though, and therefore we can proceed to the next case.

Remark: Exact interpolation must not be mistaken for the memorizing mentioned with the MLPs: the first point is that we are not talking about the training of RBF networks at the moment.


The second point is that it could be very good for us and definitely intended if the network exactly interpolates between the nodes.

|P| < |H|: The system of equations is under-determined, i.e. there are more RBF neurons than training examples. Certainly, this case normally does not occur very often. In this case, there are many possible solutions, which we do not need to discuss in detail; we can simply select one set of weights out of the many obviously possible ones.

|P| > |H|: Most interesting for further discussion, however, is the case that there are significantly more training examples than RBF neurons, i.e. |P| > |H|. Thus, we again want to use the generalization capability of the neural network.

If we have more training examples than RBF neurons, we cannot assume that every training example is exactly hit. So, if we cannot exactly hit the points and therefore cannot just interpolate as in the above-mentioned ideal case with |P| = |H|, we must try to find a function that approximates our training set P as exactly as possible: as with the MLP, we try to reduce the sum of the squared errors to a minimum.

How do we continue the calculation in the case of |P| > |H|? As above, we have to solve the matrix equation

T = M ·G (6.10)

by means of the matrix M in order to solve the system of equations. The problem is that this time we cannot invert the |P| × |H| matrix M because it is not square (here, |P| ≠ |H| holds). Here, we have to use the Moore-Penrose pseudoinverse M+, which is defined by

M+ = (M^T · M)^−1 · M^T (6.11)

Although the Moore-Penrose pseudoinverse is not the inverse of a matrix, it can be used like one in this case¹. We get equations that are very similar to those in the case of |P| = |H|:

T = M ·G (6.12)

⇔ M+ · T = M+ ·M ·G (6.13)

⇔ M+ · T = E ·G (6.14)

⇔ M+ · T = G (6.15)

Another reason for the use of the Moore-Penrose pseudoinverse is the fact that it minimizes the squared error (which is our aim): the estimation of the vector G in equation 6.15 corresponds to the Gauss-Markov model known from statistics, which is used to minimize the squared error.

¹ In particular, M+ = M^−1 holds if M is invertible. I do not want to discuss further the reasons for these circumstances and applications of M+ – they can easily be found in the literature on linear algebra.


Remark: In the above-mentioned equation 6.11 and the following ones, please do not mistake the T in M^T (the transpose of the matrix M) for the T of the vector of all teaching inputs.
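As an illustration of how such an analytical weight computation could be coded, here is a minimal NumPy sketch; the function name solve_output_weights and the parameter names are my own assumptions, not notation from the text.

import numpy as np

def solve_output_weights(M, T):
    # M: |P| x |H| matrix of the RBF-layer outputs for all training examples
    # T: |P| x |O| matrix whose columns are the teaching inputs T_Omega
    # Returns G with shape |H| x |O| (one weight vector per output neuron).
    # np.linalg.pinv computes the Moore-Penrose pseudoinverse via an SVD, which is
    # numerically more robust than forming (M^T M)^-1 M^T of eq. 6.11 explicitly.
    M_plus = np.linalg.pinv(M)
    return M_plus @ T          # G = M+ * T (eq. 6.15); equals M^-1 * T if |P| = |H|

Note that the same M+ is reused for every column of T, which is exactly the point of the generalization to several outputs discussed in the next section.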

6.2.2.2 The generalization to several outputs is trivial and not very computationally expensive

We have found a mathematically exact way to directly calculate the weights. What happens if there are several output neurons, i.e. |O| > 1, with O being, as usual, the set of the output neurons Ω? In this case, as we have already indicated, not much changes: the additional output neurons have their own set of weights while we do not change the σ and c of the RBF layer. Thus, in an RBF network it is easy, for given σ and c, to realize a lot of output neurons since we only have to calculate the individual vector of weights

GΩ = M+ · TΩ (6.16)

for every new output neuron Ω, with the matrix M+, which generally requires a lot of computational effort, always being the same: so it is quite inexpensive – at least as far as the computational complexity is concerned – to add more output neurons.

6.2.2.3 Computational effort and accuracy

For realistic problems it normally holds that there are considerably more training examples than RBF neurons, i.e. |P| ≫ |H|: you can, without any difficulty, train with 10^6 training examples, if you like. Theoretically, we could find the terms for the mathematically correct solution at the blackboard (after a very long time), but such calculations often turn out to be imprecise and very time-consuming (matrix inversions require a lot of computational effort).

Furthermore, our Moore-Penrose pseudoinverse is, in spite of numeric stability, no guarantee that the output vector corresponds to the teaching vector, since such time-consuming computations can be subject to many imprecisions, even though the calculation is mathematically correct: even our computers can only provide the pseudoinverse matrices by (nonetheless very good) approximation. This means that we also get only an approximation of the correct weights (maybe with a lot of imprecision) and therefore only an (maybe very rough or even unrecognizable) approximation of the desired output values.

If we have enough computing power to analytically determine a weight vector, we should in any case use it only as an initial value for our learning process, which leads us to the real training methods – otherwise it would be boring, wouldn't it?


6.3 Combinations of equation system and gradient strategies are useful for training

Analogous to the MLP, we perform a gradient descent to find the suitable weights by means of the already well known Delta rule. Here, backpropagation is unnecessary since we only have to train one single weight layer – which requires less computing time.

We know that the Delta rule is

∆wh,Ω = η · δΩ · oh, (6.17)

in which we now insert the following:

∆wh,Ω = η · (tΩ − yΩ) · fact(||p − ch||) (6.18)

Here again I explicitly want to mention that it is very popular to divide the training into two phases by analytically computing a set of weights and then retraining it by means of the Delta rule.
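A rough sketch of this second, online retraining phase could look as follows in NumPy; the function name delta_rule_retraining and all parameter names are my own, and the fixed number of epochs is an arbitrary assumption.

import numpy as np

def delta_rule_retraining(P_in, T, centers, sigmas, G, eta=0.1, epochs=100):
    # P_in: training inputs, shape (|P|, n_in)
    # T:    teaching inputs, shape (|P|, |O|)
    # G:    output weights, shape (|H|, |O|), e.g. the analytic solution as start value
    for _ in range(epochs):
        for p, t in zip(P_in, T):
            r = np.linalg.norm(p - centers, axis=1)      # distances to the centers
            o_h = np.exp(-(r ** 2) / (2 * sigmas ** 2))  # outputs of the RBF layer
            y = o_h @ G                                  # current network output
            G += eta * np.outer(o_h, t - y)              # Delta rule, eq. 6.18
    return G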

There is still the question whether to learn offline or online. Here, the answer is similar to the answer for the multi-layer perceptron: initially, one often learns online (faster movement across the error surface). Then, after having approximated the solution, the errors are once again accumulated and, for a more precise approximation, one learns offline in a third learning phase. But, similar to the MLPs, you can be successful with many methods.

We have already implied, however, that in an RBF network not only the weights preceding the output layer can be optimized. So let us take a look at the possibility to vary σ and c.

6.3.1 It is not always trivial to determine centers and widths of RBF neurons

It is obvious that the approximation accuracy of RBF networks can be increased by adapting the widths and positions of the Gaussian bells in the input space to the problem to be approximated. There are several methods to deal with the centers c and the widths σ of the Gaussian bells:

Fixed selection: The centers and widths can be selected fixedly and regardless of the patterns – this is what we have assumed until now.

Conditional, fixed selection: Again centers and widths are selected fixedly, but we have previous knowledge about the functions to be approximated and comply with it.

Adaptive to the learning process: This is definitely the most elegant variant, but certainly the most ambitious one, too. A realization of this approach will not be discussed in this chapter but it can be found in connection with another network topology (section 10.6.1).


Figure 6.6: Example for even coverage of a two-dimensional input space by applying radial basis functions.

6.3.1.1 Fixed selection

In any case, the goal is to cover the input space as evenly as possible. Here, widths of 2/3 of the distance between the centers can be selected so that the Gaussian bells overlap by approx. "one third"² (fig. 6.6). The closer the bells are set, the more precise but also the more time-consuming the whole thing becomes.

This may seem to be very inelegant, but in the field of function approximation we cannot avoid even coverage. Here it is useless if the function to be approximated is precisely represented at some positions but at other positions the return value is only 0.

² It is apparent that a Gaussian bell is mathematically infinitely wide; therefore I ask the reader to excuse this sloppy formulation.

However, a high input dimension requires a great many RBF neurons, which increases the computational effort exponentially with the dimension – and is responsible for the fact that six- to ten-dimensional problems in RBF networks are already called "high-dimensional" (an MLP, for example, does not cause any problems here).
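To illustrate both the 2/3 rule of thumb and the exponential growth, here is a small sketch of such an even grid coverage; the function name grid_centers and its parameters are my own, not from the text.

import numpy as np
from itertools import product

def grid_centers(low, high, per_dim, dim):
    # Evenly cover the cube [low, high]^dim with per_dim**dim RBF centers and
    # choose a single width of 2/3 of the grid spacing (rule of thumb from above).
    axis = np.linspace(low, high, per_dim)
    centers = np.array(list(product(axis, repeat=dim)))
    sigma = 2.0 / 3.0 * (axis[1] - axis[0])
    return centers, sigma

# 10 centers per dimension: 100 neurons in 2D, but already 10**6 neurons in 6D
centers_2d, sigma_2d = grid_centers(-2.0, 2.0, per_dim=10, dim=2)
print(centers_2d.shape, sigma_2d)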

6.3.1.2 Conditional, fixed selection

Let us assume that our training examples are not evenly distributed across the input space. Then it seems obvious to arrange the centers and sigmas of the RBF neurons by means of the pattern distribution. So the training patterns can be analyzed by statistical techniques such as a cluster analysis, and so it can be determined whether there are statistical factors according to which we should distribute the centers and sigmas (fig. 6.7).

A more trivial alternative would be to set |H| centers on positions randomly selected from the set of patterns. So this method allows training patterns p to lie directly at the center of a neuron (fig. 6.8). This is not yet very elegant but a good solution when time is of the essence. Generally, for this method the widths are fixedly selected, as sketched below.
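A minimal sketch of this random selection, assuming the training inputs are available as a NumPy array; the name centers_from_patterns is hypothetical.

import numpy as np

def centers_from_patterns(P_in, n_hidden, sigma, rng=None):
    # P_in: training inputs, shape (|P|, n_in); n_hidden: number of RBF neurons |H|
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(P_in), size=n_hidden, replace=False)
    centers = P_in[idx].copy()              # each selected pattern becomes a center
    sigmas = np.full(n_hidden, sigma)       # widths are fixedly selected
    return centers, sigmas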


Figure 6.7: Example of an uneven coverage of a two-dimensional input space, about which we have previous knowledge, by applying radial basis functions.

If we have reason to believe that the set of training examples has cluster points, we can use clustering methods to determine them. There are different methods to determine clusters in an arbitrarily dimensional set of points. We will be introduced to some of them in excursus A. One neural clustering method is the so-called ROLF (section A.5), and self-organizing maps are also useful in connection with determining the position of RBF neurons (section 10.6.1). Using ROLFs, one can also receive indicators for useful radii of the RBF neurons. Learning vector quantisation (chapter 9) has also provided good results. All these methods have nothing to do with the RBF networks themselves but are only used to generate some previous knowledge.

Figure 6.8: Example of an uneven coverage of a two-dimensional input space by applying radial basis functions. The widths were fixedly selected, the centers of the neurons were randomly distributed throughout the training patterns. This distribution can certainly lead to slightly unrepresentative results, which can be seen at the single data point at the bottom left.


Therefore we will not discuss them in this chapter but independently in the indicated chapters.

Another approach is to use the tried and tested methods: we could shift the positions of the centers and observe how our error function Err changes – a gradient descent, as already known from the MLPs. In a similar manner we could look at how the error depends on the values σ. Analogous to the derivation of backpropagation we derive

∂Err(σh, ch) / ∂σh   and   ∂Err(σh, ch) / ∂ch.

Since the derivation of these terms corresponds to the derivation of backpropagation, we do not want to discuss it here.

But experience shows that no convincing results are obtained by regarding how the error behaves depending on the centers and sigmas. Even if mathematics claims that such methods are promising, the gradient descent, as we already know, leads to problems with very craggy error surfaces.

And that is the crucial point: naturally, RBF networks generate very craggy error surfaces because, if we considerably change a c or a σ, we will significantly change the appearance of the error function.

6.4 Growing RBF Networks automatically adjust the neuron density

In growing RBF networks, the number |H| of RBF neurons is not constant. A certain number |H| of neurons as well as their centers ch and widths σh are previously selected (e.g. by means of a clustering method) and then extended or reduced. In the following text, only simple mechanisms are sketched. For more information, I refer to [Fri94].

6.4.1 Neurons are added to places with large error values

After generating this initial configuration, the vector of the weights G is analytically calculated. Then all specific errors Errp concerning the set P of the training examples are calculated and the maximum specific error

maxP (Errp)

is sought.

The extension of the network is simple: we replace this maximum error with a new RBF neuron. Of course, we have to exercise care in doing this: if the σ are small, the neurons will only influence one another when the distance between them is short. But if the σ are large, the already existing neurons are considerably influenced by the new neuron because of the overlapping of the Gaussian bells.


So it is obvious that we will adjust the already existing RBF neurons when adding the new neuron.

To put it crudely: this adjustment is made by moving the centers c of the other neurons away from the new neuron and reducing their widths σ a bit. Then the current output vector y of the network is compared to the teaching input t and the weight vector G is improved by means of training. Subsequently, a new neuron can be inserted if necessary. This method is particularly suited for function approximations.
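The following NumPy sketch shows one such growth step under simplifying assumptions: it only places a new neuron at the pattern with the largest specific error and recomputes the weights analytically, while the adjustment of the neighboring centers and widths described above is omitted. All names (rbf_layer, grow_step, new_sigma) are mine.

import numpy as np

def rbf_layer(P_in, centers, sigmas):
    # |P| x |H| matrix M of the RBF-layer outputs (Gaussian bells)
    d = np.linalg.norm(P_in[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d ** 2) / (2 * sigmas ** 2))

def grow_step(P_in, T, centers, sigmas, new_sigma, max_neurons):
    M = rbf_layer(P_in, centers, sigmas)
    G = np.linalg.pinv(M) @ T                     # analytic weights
    errors = np.sum((T - M @ G) ** 2, axis=1)     # specific errors Err_p
    if len(centers) >= max_neurons:               # |H|max reached: do not grow further
        return centers, sigmas, G
    worst = np.argmax(errors)                     # maxP (Errp)
    centers = np.vstack([centers, P_in[worst]])   # new neuron at the worst pattern
    sigmas = np.append(sigmas, new_sigma)
    M = rbf_layer(P_in, centers, sigmas)
    return centers, sigmas, np.linalg.pinv(M) @ T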

6.4.2 Limiting the number of neurons

Here it is mandatory to see to it that the network does not grow ad infinitum, which can happen very fast. Thus, it is very useful to previously define a maximum number of neurons |H|max.

6.4.3 Less important neurons are deleted

This leads to the question whether it is possible to continue learning when this limit |H|max is reached. The answer is: this would not stop learning. We only have to look for the "most unimportant" neuron and delete it. A neuron is, for example, unimportant for the network if there is another neuron that has a similar function: it often happens that two Gaussian bells exactly overlap, and at such a position, for instance, one single neuron with a higher Gaussian bell would be appropriate.

But to develop automated procedures in order to find less relevant neurons is very problem-dependent, and we want to leave this to the programmer.

With RBF networks and multi-layer perceptrons we have now become acquainted with, and extensively discussed, two network paradigms for similar problems. Therefore we want to compare these two paradigms and look at their advantages and disadvantages.

6.5 Comparing RBF Networks and Multi-layer Perceptrons

We will compare multi-layer perceptrons and RBF networks with respect to different aspects.

Input dimension: We must be careful with RBF networks in high-dimensional function spaces since the network could very fast require huge memory storage and computational effort. Here, a multi-layer perceptron would cause fewer problems because its number of neurons does not grow exponentially with the input dimension.

Center selection: However, selecting the centers c for RBF networks is (despite the introduced approaches) still a great problem.


Please use any previous knowledge you have when applying them. Such problems do not occur with the MLP.

Output dimension: The advantage of RBF networks is that the training is not much influenced when the output dimension of the network is high. For an MLP, a learning procedure such as backpropagation will thereby be very protracted.

Extrapolation: An advantage and disadvantage of RBF networks is the lack of extrapolation capability: an RBF network returns the result 0 far away from the centers of the RBF layer. On the one hand, unlike the MLP, it does not extrapolate and cannot be used for extrapolation (whereby we can never know whether the extrapolated values of an MLP are reasonable, but experience shows that MLPs are good-natured in that respect). On the other hand, unlike the MLP, the network is capable of using this 0 to tell us "I don't know", which could be an advantage.

Lesion tolerance: For the output of an MLP it is not so important if a weight or a neuron is missing; it will only worsen a little overall. If a weight or a neuron is missing in an RBF network, then large parts of the output remain practically uninfluenced, but one part of the output is strongly affected because a Gaussian bell is directly missing. Thus, we can choose between a strong local error for a lesion and a weak but global one.

Spread: Here the MLP is "advantaged" since RBF networks are used considerably less often – which is not always understood by professionals (at least as far as low-dimensional input spaces are concerned). The MLPs seem to have a considerably longer tradition and they work too well for anyone to take the effort to read a few pages of this paper about RBF networks :-).

Exercises

Exercise 13: An |I|-|H|-|O| RBF network with fixed widths and centers of the neurons should approximate a target function u. For this, |P| training examples of the form (p, t) of the function u are given. Let |P| > |H| be true. The weights should be analytically determined by means of the Moore-Penrose pseudoinverse. Indicate the running time behavior regarding |P| and |O| as precisely as possible.

Note: There are methods for matrix multiplications and matrix inversions that are more efficient than the canonical methods. For better estimations I recommend searching for such methods (and their complexity). In addition to your complexity calculation, please indicate the used methods together with their complexity.


Chapter 7

Recurrent Perceptron-like Networks

Some thoughts about networks with internal states. Learning approaches using such networks, overview of their dynamics.

Generally, recurrent networks are networks that are capable of influencing themselves by means of recurrences, e.g. by including the network output in the following computation steps. There are many types of recurrent networks of nearly arbitrary form, and nearly all of them are referred to as recurrent neural networks. As a result, for the few paradigms introduced here I use the name recurrent multi-layer perceptrons.

Apparently, such a recurrent network is capable of computing more than the ordinary MLP: if the recurrent weights are set to 0, the recurrent network will be reduced to an ordinary MLP. Additionally, the recurrence generates different network-internal states so that, in the context of the network state, different inputs can produce different outputs.

Recurrent networks in themselves have great dynamics that are mathematically difficult to conceive and have to be discussed extensively. The aim of this chapter is only

to briefly discuss how recurrences can be structured and how network-internal states can be generated. Thus, I will briefly introduce two paradigms of recurrent networks and afterwards roughly outline their training.

With a recurrent network, a temporally constant input x may lead to different results: for one thing, the network could converge, i.e. it could transform itself into a fixed state and at some time return a fixed output value y; for another thing, it might never converge, or at least not until a long time later, so that we can no longer recognize a constant output y.

If the network does not converge, it is, for example, possible to check whether periodic cycles or attractors (fig. 7.1) are returned. Here, we can expect the complete variety of dynamical systems. That is the reason why I particularly want to refer to the literature concerning dynamical systems.


Figure 7.1: The Roessler attractor

Further discussions could reveal what will happen if the input of recurrent networks is changed.

In this chapter the related paradigms of recurrent networks according to Jordan and Elman will be introduced.

7.1 Jordan Networks

A Jordan network [Jor86] is a multi-layer perceptron with a set K of so-called context neurons k1, k2, . . . , k|K|. There is one context neuron per output neuron (fig. 7.2). In principle, a context neuron just memorizes an output until it can be processed in the next time step. Therefore, there are weighted connections between each output neuron and one context neuron. The stored values are returned to the actual network by means of complete links between the context neurons and the input layer.

In the original definition of a Jordan network the context neurons are also recurrent to themselves via a connecting weight λ. But most applications omit this recurrence since the Jordan network is already very dynamic and difficult to analyze, even without these additional recurrences.

Definition 7.1 (Context neuron): A context neuron k receives the output value of another neuron i at a time t and then reenters it into the network at time (t + 1).

Definition 7.2 (Jordan network): A Jordan network is a multi-layer perceptron



Figure 7.2: Illustration of a Jordan network. The network output is buffered in the context neurons and with the next time step it is entered into the network together with the new input.

with one context neuron per output neuron. The set of context neurons is called K. The context neurons are completely linked toward the input layer of the network.
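To make the buffering of the output concrete, here is a minimal sketch of one time step of such a Jordan network (without the self-recurrence λ); all names (jordan_step, W_ih, W_kh, W_ho) and the choice of a logistic activation are my own assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def jordan_step(x, context, W_ih, W_kh, W_ho):
    # x:       current input, shape (n_in,)
    # context: buffered output of the previous time step, shape (n_out,)
    # W_ih, W_kh, W_ho: input-to-hidden, context-to-hidden and hidden-to-output weights
    h = sigmoid(x @ W_ih + context @ W_kh)   # hidden layer sees input and context
    y = sigmoid(h @ W_ho)                    # output layer
    return y                                 # the caller stores y as the new context

# Processing a sequence: the context is initialized (e.g. with zeros) and then
# always holds the previous output:
#   context = np.zeros(n_out)
#   for x in sequence:
#       context = jordan_step(x, context, W_ih, W_kh, W_ho)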

7.2 Elman Networks

The Elman networks (a variation of the Jordan networks) [Elm90] have context neurons, too – but one layer of context neurons per information-processing neuron layer (fig. 7.3). Thus, the outputs of each hidden neuron or output neuron are led into the associated context layer (again exactly one context neuron per neuron) and from there reentered into the complete neuron layer during the next time step (i.e. again a complete link on the way back). So the complete information-processing part¹ of the MLP exists a second time as a "context version" – which once again considerably increases dynamics and state variety.

Compared with Jordan networks, the Elman networks often have the advantage of acting more purposefully since every layer can access its own context.

Definition 7.3 (Elman network): An Elman network is an MLP with one context neuron per information-processing neuron. The set of context neurons is called K. This means that there exists one context layer per information-processing neuron layer, with exactly the same number of context neurons.

¹ Remember: The input layer does not process information.



Figure 7.3: Illustration of an Elman network. The entire information-processing part of the network exists, in a manner of speaking, twice. The output of each neuron (except for the output of the input neurons) is buffered and reentered into the associated layer. For reasons of clarity I named the context neurons on the basis of their models in the actual network, but it is not mandatory to do so.


Every neuron has a weighted connection to exactly one context neuron, while the context layer is completely linked toward its original layer.

Now it is interesting to take a look at the training of recurrent networks since, for instance, the ordinary backpropagation of error cannot work on recurrent networks. Once again, the style of the following part is more informal, which means that I will not use any formal definitions.

7.3 Training Recurrent Networks

In order to explain the training as descriptively as possible, we have to agree upon some simplifications that do not affect the learning principle itself.

So for the training let us assume that initially the context neurons are initialized with an input, since otherwise they would have an undefined input (this is no simplification but reality).

Furthermore, we use a Jordan network without a hidden neuron layer for our training attempts, so that the output neurons can directly provide input. This approach is a strong simplification because generally more complicated networks are used. But this does not change the learning principle.

7.3.1 Unfolding in Time

Remember our actual learning procedure for MLPs, the backpropagation of error, which backpropagates the delta values. So, in the case of recurrent networks the delta values would cyclically backpropagate through the network again and again, which makes the training more difficult. On the one hand we cannot know which of the many generated delta values for a weight should be selected for training, i.e. which values are useful. On the other hand we cannot definitely know when learning should be stopped. The advantage of recurrent networks is a great state dynamics within the network operation; the disadvantage is that this dynamics is also granted to the training and therefore makes it difficult.

One learning approach would be the attempt to unfold the temporal states of the network (fig. 7.4): recursions are deleted by attaching the same network again above the context neurons, i.e. the context neurons are, in a manner of speaking, the output neurons of the attached network. More generally spoken, we have to backtrack the recurrences and place "earlier" instances of neurons in the network – thus creating a larger, but forward-oriented network without recurrences. This enables training a recurrent network with any training strategy developed for non-recurrent ones. Here, the input is entered as teaching input into every "copy" of the input neurons. This can be done for a discrete number of time steps. These training paradigms are called unfolding in


time [MP69]. After the unfolding a training by means of backpropagation of error is possible.

But obviously, for one weight wi,j several changing values ∆wi,j are received, which can be treated differently: accumulation, averaging etc. A simple accumulation could possibly result in enormous changes per weight if all changes have the same sign. Hence, the average is not to be scoffed at. We could also introduce a discounting factor, which weakens the influence of the ∆wi,j of the more distant past.
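A small sketch of these three ways of combining the weight changes from the unfolded copies, assuming they are collected in a list; the function name combine_deltas and the factor name gamma are my own.

import numpy as np

def combine_deltas(deltas, mode="mean", gamma=0.5):
    # deltas: list of Delta-w arrays, one per unfolded copy, ordered from the
    #         oldest copy to the most recent one
    deltas = np.array(deltas)
    if mode == "sum":            # simple accumulation
        return deltas.sum(axis=0)
    if mode == "mean":           # averaging
        return deltas.mean(axis=0)
    if mode == "discounted":     # older copies are weakened by the factor gamma
        w = gamma ** np.arange(len(deltas) - 1, -1, -1, dtype=float)
        return np.tensordot(w, deltas, axes=1) / w.sum()
    raise ValueError(mode)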

Unfolding in time is particularly useful if we have the impression that the closer past is more important for the network than the more distant one. The reason for this is that backpropagation has only little influence in the layers farther away from the output (remember: the farther we move away from the output layer, the smaller the influence of backpropagation becomes).

Disadvantages: The training of such an unfolded network will take a long time since a large number of layers could possibly be produced. A problem that is no longer negligible is the limited computational accuracy of ordinary computers, which is exhausted very fast because of so many nested computations (the farther we move away from the output layer, the smaller the influence of backpropagation becomes, so that this limit is reached). Furthermore, with several levels of context neurons this procedure could produce very large networks to be trained.

7.3.2 Teacher forcing

Other procedures are the equivalent teacher forcing and open loop learning. They force open the recurrence during the learning process: during training we simply act as if the recurrence does not exist and apply the teaching input to the context neurons. So backpropagation becomes possible, too. Disadvantage: with Elman networks a teaching input for non-output neurons is not given.

7.3.3 Recurrent backpropagation

Another popular procedure without a limited time horizon is the recurrent backpropagation, which uses methods of differential calculus to solve the problem [Pin87].

7.3.4 Training with evolution

Due to the already long training times, evolutionary algorithms have proved to be of value, especially with recurrent networks. One reason for this is that they are not only unrestricted with respect to recurrences but also have other advantages when the mutation mechanisms are suitably chosen: so, for example, neurons and weights can be adjusted and the network topology can be optimized (of course the result of learning is not necessarily a Jordan or Elman network). With ordinary MLPs, however, evolutionary strategies are less popular



Figure 7.4: Illustration of the unfolding in time with a small exemplary recurrent MLP. Top: the recurrent MLP. Bottom: the unfolded network. For reasons of clarity, I only added names to the lowest part of the unfolded network. Dotted arrows leading into the network mark the inputs; dotted arrows leading out of the network mark the outputs. Each "network copy" represents a time step of the network, with the most current time step being at the bottom.


since they certainly need a lot more time than a directed learning procedure such as backpropagation.


Chapter 8

Hopfield Networks

In a magnetic field, each particle applies a force to any other particle so that all particles adjust their movements in the energetically most favorable way. This natural mechanism is copied to adjust noisy inputs in order to match their real models.

Another supervised learning example of the wide range of neural networks was developed by John Hopfield: the so-called Hopfield networks [Hop82]. Hopfield and his physically motivated networks have contributed a lot to the renaissance of neural networks.

8.1 Hopfield networks are inspired by particles in a magnetic field

The idea for the Hopfield networks originated from the behavior of particles in a magnetic field: every particle "communicates" (by means of magnetic forces) with every other particle (complete link), with each particle trying to reach an energetically favorable state (i.e. a minimum of the energy function). As for the neurons, this state is known as activation. Thus, all particles or neurons rotate and thereby encourage each other to continue this rotation. In a manner of speaking, our neural network is a cloud of particles.

Based on the fact that the particles automatically detect the minima of the energy function, Hopfield had the idea to use the "spin" of the particles to process information: why not let the particles search for minima on self-defined functions? Even if we only use two of those spins, i.e. a binary activation, we will recognize that the developed Hopfield network shows considerable dynamics.

8.2 In a Hopfield network, all neurons influence each other symmetrically

Briefly speaking, a Hopfield network consists of a set K of completely linked neurons



Figure 8.1: Illustration of an exemplary Hopfield network. The arrows ↑ and ↓ mark the binary "spin". Due to the completely linked neurons the layers cannot be separated, which means that a Hopfield network simply consists of a set of neurons.

with binary activation (since we only use two spins), with the weights being symmetric between the individual neurons and without any neuron being directly connected to itself (fig. 8.1). Thus, the state of the |K| neurons, each with two possible states ∈ {−1, 1}, can be described by a string x ∈ {−1, 1}^|K|.

The complete link provides a full square matrix of weights between the neurons. The meaning of the weights will be discussed in the following. Furthermore, we will soon recognize according to which rules the neurons are spinning, i.e. changing their state.

Additionally, the complete link implies that we do not know any input, output or hidden neurons. Thus, we have to think about how we can enter something into the |K| neurons.

Definition 8.1 (Hopfield network): A Hopfield network consists of a set K of completely linked neurons without direct recurrences. The activation function of the neurons is the binary threshold function with outputs ∈ {1, −1}.

Definition 8.2 (State of a Hopfield network): The state of the network consists of the activation states of all neurons. Thus, the state of the network can be understood as a binary string z ∈ {−1, 1}^|K|.

8.2.1 Input and Output of a Hopfield Network are represented by neuron states

We have learned that a network, i.e. a set of |K| particles, that is in a state is automatically looking for a minimum. An input pattern of a Hopfield network is exactly such a state: a binary string x ∈ {−1, 1}^|K| that initializes the neurons. Then the network is looking for the minimum to be entered (which we have previously defined by the input of training examples) on its energy surface.

But when do we know that the minimum has been found? This is simple, too: when the network stops. It can be proved that a Hopfield network with a symmetric weight matrix and zeros in the diagonal always converges [CG88], i.e. at some point it will stand still. Then the output is a binary string y ∈ {−1, 1}^|K|,


namely the state string of the network that has found a minimum.

Now let us take a closer look at the contents of the weight matrix and the rules for the state change of the neurons.

Definition 8.3 (Input and output of a Hopfield network): The input of a Hopfield network is a binary string x ∈ {−1, 1}^|K| that initializes the state of the network. After the convergence of the network, the output is the binary string y ∈ {−1, 1}^|K| generated from the new network state.

8.2.2 Significance of Weights

We have already said that the neurons change their states, i.e. their direction, from −1 to 1 or vice versa. These spins occur depending on the current states of the other neurons and the associated weights. Thus, the weights are capable of controlling the complete change of the network. The weights can be positive, negative, or 0. Colloquially speaking, for a weight wi,j between two neurons i and j the following holds:

If wi,j is positive, it will try to force the two neurons to become equal – the larger it is, the harder the network will try. If the neuron i is in state 1 and the neuron j is in state −1, a high positive weight will advise the two neurons that it is energetically more favorable to be equal.

If wi,j is negative, its behavior will be analogous, only that i and j are urged to be different. A neuron i in state −1 would then try to urge a neuron j into state 1.

Zero weights see to it that the two involved neurons do not influence each other.

The weights as a whole apparently take the network from its current state towards the next minimum of the energy function. We now want to discuss how the neurons follow this way.

8.2.3 A Neuron changes its state according to the influence of the other neurons

The states xk of the individual neurons k change according to the scheme

xk(t) = fact( ∑j∈K wj,k · xj(t − 1) ) (8.1)

at every time step, whereby the function fact generally is the binary threshold function (fig. 8.2) with threshold 0. Colloquially speaking: a neuron k calculates the sum of wj,k · xj(t − 1), which indicates how strongly and in which direction the neuron k is urged by the other neurons j. Thus, the new state of the network (time t) results from the state of the network at the previous time t − 1. This sum is the direction into which the neuron k is urged. Depending on the sign of the sum, the neuron takes state 1 or −1.

Another difference between the Hopfield networks and the other network topologies we already know is the asynchronous update:



Figure 8.2: Illustration of the binary threshold function.

A neuron k is randomly chosen every time, and it recalculates its activation. Thus, the new activation of previously changed neurons directly exerts influence, i.e. one time step corresponds to the change of a single neuron.

Remark: Regardless of the above-mentioned random selection of the neuron, a Hopfield network is often much easier to implement: the neurons are simply processed one after another and their activations are recalculated until no more changes occur.
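This simple sequential variant could look as follows in Python; the function name hopfield_run is mine, and the convention that a net input of exactly 0 maps to state 1 is an arbitrary assumption.

import numpy as np

def hopfield_run(W, x):
    # W: symmetric weight matrix with zero diagonal, shape (|K|, |K|)
    # x: initial state (the input), entries in {-1, 1}
    x = x.copy()
    changed = True
    while changed:                             # converges for symmetric W with zero diagonal
        changed = False
        for k in range(len(x)):                # process the neurons one after another
            net = W[:, k] @ x                  # sum over j of w_{j,k} * x_j, eq. 8.1
            new_state = 1 if net >= 0 else -1  # binary threshold function
            if new_state != x[k]:
                x[k] = new_state
                changed = True
    return x                                   # the output: a state in a minimum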


Definition 8.4 (Change in the state of a Hopfield network): The change in the state of the neurons occurs asynchronously, with the neuron to be updated being randomly chosen and the new state being generated by means of the rule

xk(t) = fact( ∑j∈K wj,k · xj(t − 1) ).

Now that we know how the weights influence the changes in the states of the neurons and force the entire network towards a minimum, there is the question of how to teach the weights to force the network towards a certain minimum.

8.3 The weight matrix is generated directly out of the training patterns

The aim is to generate minima on the mentioned energy surface, so that at an input the network can converge to them. As with many other network paradigms, we use a set P of pattern inputs p ∈ {1, −1}^|K|, representing the minima of our energy surface.

Remark: Unlike many other network paradigms, we do not look for the minima of an unknown error function but define minima on such a function. The purpose is that the network shall automatically take the closest minimum when the input is presented. For now this seems unusual, but we will understand the whole purpose later.

Roughly speaking, the training of a Hopfield network is done by training each training pattern exactly once using the rule described in the following (single shot learning), whereby pi and pj are the states of the neurons i and j under p ∈ P:

wi,j = ∑p∈P pi · pj (8.2)


This results in the weight matrix W. Colloquially speaking: we initialize the network by means of a training pattern and then process the weights wi,j one after another. For each of these weights we verify: are the neurons i, j in the same state, or do the states differ? In the first case we add 1 to the weight, in the second case we add −1.

This we repeat for each training pattern p ∈ P. Finally, the values of the weights wi,j are high when i and j corresponded for many training patterns. Colloquially speaking, this high value tells the neurons: "Often, it is energetically favorable to hold the same state". The same applies to negative weights.

Due to this training we can store a certain fixed number of patterns p in the weight matrix. At an input x the network will converge to the stored pattern that is closest to the input.

Unfortunately, the number of maximum storable and reconstructible patterns p is limited to

|P |MAX ≈ 0.139 · |K|, (8.3)

which in turn only applies to orthogonal patterns. This was shown by precise (and time-consuming) mathematical analyses, which we do not want to specify now. If more patterns are entered, already stored information will be destroyed.
Definition 8.5 (Learning rule for Hopfield networks): The individual elements of the weight matrix W are defined by a single processing of the learning rule

wi,j = ∑p∈P pi · pj,

whereby the diagonal of the matrix is covered with zeros. Here, no more than |P|MAX ≈ 0.139 · |K| training examples can be trained and at the same time maintain their function.
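Under the conventions above, the single-shot learning rule can be written down very compactly; the helper name train_hopfield and the use of an outer product are illustrative assumptions, not prescribed by the text.

import numpy as np

def train_hopfield(patterns):
    """Build the Hopfield weight matrix from a list of +1/-1 patterns.

    Each pattern contributes the outer product p_i * p_j (single shot
    learning); the diagonal is zeroed afterwards.
    """
    patterns = np.asarray(patterns)
    n = patterns.shape[1]            # number of neurons |K|
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)          # w_ij += p_i * p_j
    np.fill_diagonal(W, 0)           # no self-connections
    return W

For example, W = train_hopfield([(-1, -1, 1), (1, -1, -1)]) yields a 3 x 3 matrix that can be used directly with the asynchronous update sketch from section 8.2.3.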

Now we know the functionality of Hopfield networks, but nothing about their practical use.

8.4 Autoassociation and Traditional Application

Hopfield networks like those mentioned above are called autoassociators. An autoassociator a shows exactly the above-mentioned behavior: Firstly, when a known pattern p is entered, exactly this known pattern is returned. Thus,

a(p) = p,

with a being the associative mapping. Secondly, and this is the practical use, this also works with inputs that are close to a pattern:

a(p + ε) = p.

Afterwards, the autoassociator is, in any case, in a stable state, namely the state p.

If the set of patterns P consists, for example, of letters or other characters in the form of pixels, the network will be able to correctly recognize deformed or noisy letters with high probability (fig. 8.3).

The primary fields of application of Hopfield networks are pattern recognition and pattern completion, such as the zip code recognition on letters in the eighties. But soon the Hopfield networks were overtaken by other systems in most of their fields of application, for example by OCR systems in the field of letter recognition. Today Hopfield networks are virtually no longer used; they have not become established in practice.

8.5 Heteroassociation and Analogies to Neural Data Storage

So far we have been introduced to Hopfield networks that converge from an arbitrary input into the closest minimum of a static energy surface.

Another variant is a dynamic energy surface: Here, the appearance of the energy surface depends on the current state, and we receive a heteroassociator instead of an autoassociator. For a heteroassociator,

a(p+ ε) = p

is no longer true, but rather

h(p+ ε) = q,

which means that a pattern is mapped onto another one; h is the heteroassociative mapping. Such heteroassociations are achieved by means of an asymmetric weight matrix V.

Figure 8.3: Illustration of the convergence of an exemplary Hopfield network. Each of the pictures has 10 × 12 = 120 binary pixels. In the Hopfield network each pixel corresponds to one neuron. The upper illustration shows the training examples, the lower shows the convergence of a heavily noisy 3 to the corresponding training example.

Heteroassociations connected in series of the form

h(p + ε) = q
h(q + ε) = r
h(r + ε) = s
...
h(z + ε) = p

can provoke a fast cycle of states

p → q → r → s → . . . → z → p,

whereby a single pattern is never completely accepted: Before a pattern is completely generated, the heteroassociation already tries to generate the successor of this pattern. Additionally, the network would never stop, since after having reached the last state z it would proceed to the first state p again.

8.5.1 Generating the heteroassociative matrix

We generate the matrix V by means of elements v, very similarly to the autoassociative matrix, with (per transition) p being the training example before the transition and q being the training example to be generated from p:

vi,j = ∑p,q∈P, p≠q pi qj    (8.4)

The diagonal of the matrix is again covered by zeros. The neuron states are, as always, adapted during operation. Several transitions can be introduced into the matrix by a simple addition, whereby the said limitation exists here, too.
Definition 8.6 (Learning rule for the heteroassociative matrix): For two training examples p being predecessor and q being successor of a heteroassociative transition, the weights of the heteroassociative matrix V result from the learning rule

vi,j = ∑p,q∈P, p≠q pi qj,

with several heteroassociations being introduced into the network by a simple addition.
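A minimal sketch of this rule for a chain of patterns; the function name and the convention that consecutive list entries form the transitions are assumptions made here for illustration.

import numpy as np

def train_heteroassociative(pattern_chain):
    """Build the asymmetric matrix V from consecutive transitions p -> q.

    pattern_chain: list of +1/-1 patterns; entry t is the predecessor of
    entry t+1.
    """
    pattern_chain = np.asarray(pattern_chain)
    n = pattern_chain.shape[1]
    V = np.zeros((n, n))
    for p, q in zip(pattern_chain[:-1], pattern_chain[1:]):
        V += np.outer(p, q)          # v_ij += p_i * q_j for this transition
    np.fill_diagonal(V, 0)           # diagonal covered by zeros
    return V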

8.5.2 Stabilizing the heteroassociations

We have already mentioned the problem that the patterns are not completely generated but that the next pattern already begins before the generation of the previous pattern is finished.

This problem can be avoided by not only influencing the network by means of the heteroassociative matrix V but also by the already known autoassociative matrix W.

Additionally, the neuron adaptation rule is changed so that competing terms are generated: One term autoassociating an existing pattern and one term trying to convert the very same pattern into its successor. The associative rule provokes that the network stabilizes a pattern, remains


there for a while, goes on to the next pattern, and so on.

xi(t + 1) = f_act( ∑j∈K wi,j xj(t) + ∑k∈K vi,k xk(t − ∆t) )    (8.5)

(the first sum is the autoassociation term, the second the heteroassociation term)

Here, the value ∆t causes, descriptively speaking, the influence of the matrix V to be delayed, since it only refers to a network state ∆t steps behind. The result is a change in state during which the individual states are stable for a while. If ∆t is set to, for example, twenty steps, then the asymmetric weight matrix will only notice any change in the network twenty steps later, so that it initially works with the autoassociative matrix (since it still perceives the predecessor pattern of the current one), and only after that it will work against it.
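A minimal sketch of this combined update, assuming the matrices W and V have been built as above and that a history of past states is kept; delta_t corresponds to ∆t, and the formula is a direct transcription of eq. (8.5).

import numpy as np

def stabilized_step(W, V, history, delta_t):
    """One update combining autoassociation (W) and a delayed
    heteroassociation (V), following eq. (8.5).

    history: list of past state vectors, most recent last
    """
    current = history[-1]
    delayed = history[max(0, len(history) - 1 - delta_t)]
    drive = W @ current + V @ delayed        # the two competing terms
    new_state = np.where(drive >= 0, 1, -1)  # binary threshold
    history.append(new_state)
    return new_state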

8.5.3 Biological motivation of heteroassociation

From a biological point of view the transition of stable states into other stable states is highly motivated: At least in the beginning of the nineties it was assumed that the Hopfield model would achieve an approximation of the state dynamics in the brain, which realizes much by means of state chains: If I were to ask you, dear reader, to recite the alphabet, you would generally manage this better than (please try it immediately) answering the following question:

Which letter in the alphabet follows the letter P?

Another example is the phenomenon that one cannot remember a situation, but the place at which one memorized it the last time is perfectly known. If one returns to this place, the forgotten situation often recurs.

8.6 Continuous Hopfield Networks

So far, we have only discussed Hopfield networks with binary activations. But Hopfield also described a version of his networks with continuous activations [Hop84], which we want to regard at least briefly: continuous Hopfield networks. Here, the activation is no longer calculated by the binary threshold function but by the Fermi function with temperature parameters (fig. 8.4 on the next page).
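As a small illustration (the notation is an assumption made here, not taken from the text), the Fermi function with a temperature parameter T can be written as follows; the smaller T, the closer it gets to the binary threshold function.

import math

def fermi(x, temperature=1.0):
    """Logistic (Fermi) activation with temperature parameter T."""
    return 1.0 / (1.0 + math.exp(-x / temperature))

# the smaller the temperature, the steeper the curve:
# fermi(0.5, 1.0) is roughly 0.62, fermi(0.5, 0.1) is roughly 0.99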

Here, too, the network is stable for symmetric weight matrices with zeros on the diagonal.

Hopfield also stated that continuous Hopfield networks can be applied to find acceptable solutions for the NP-hard travelling salesman problem [HT85]. According to some verification trials [Zel94], however, this statement cannot be upheld any more. But today there are faster algorithms for handling this problem, and therefore the Hopfield network is no longer used here.


Figure 8.4: The already known Fermi function with different temperature parameter variations.

Exercises

Exercise 14: Indicate the storage requirements for a Hopfield network with |K| = 1000 neurons when the weights wi,j shall be stored as whole numbers. Is it possible to limit the value range of the weights in order to save storage space?
Exercise 15: Compute the weights wi,j for a Hopfield network using the training set

P = {(−1,−1,−1,−1,−1, 1); (−1, 1, 1,−1,−1,−1); (1,−1,−1, 1,−1, 1)}.


Chapter 9

Learning Vector Quantization

Learning vector quantization is a learning procedure with the aim of reproducing vector training sets divided into predefined classes as well as possible by using a few representative vectors. If this has been managed, vectors which were unknown until then can easily be assigned to one of these classes.

Slowly, part II of this paper draws to a close – and therefore I want to write a last chapter for this part that will be a smooth transition into the next one: A chapter about learning vector quantization (abbreviated LVQ) [Koh89], described by Teuvo Kohonen, which can be characterized as being related to the self-organizing feature maps. These SOMs are described in the next chapter, which already belongs to part III of this paper, since SOMs learn unsupervised. Thus, after the exploration of LVQ I want to bid farewell to supervised learning.

Beforehand, I want to announce that there are different variations of LVQ, which will be mentioned but not represented exactly. The goal of this chapter is rather to analyze the underlying principle.

9.1 About Quantization

In order to explore learning vector quantization, we should at first get a clearer picture of what quantization (which can also be referred to as discretization) is.

Everybody knows the discrete number sequence

N = {1, 2, 3, . . .},

containing the natural numbers. Discrete means that this sequence consists of separated elements that are not interconnected. The elements of our example are exactly such numbers, because the natural numbers do not include, for example, numbers between 1 and 2. On the other hand, the sequence of real numbers R, for instance, is continuous: It does not matter how close two selected numbers are, there will always be a number between them.


Quantization means that a continuous space is divided into discrete sections: By deleting, for example, all decimal places of the real number 2.71828, it could be assigned to the natural number 2. Here it is obvious that any other number having a 2 in front of the decimal point would also be assigned to the natural number 2, i.e. 2 would be some kind of representative for all real numbers within the interval [2; 3).

It must be noted that quantization can be irregular, too: Thus, for instance, the timeline for a week could be quantized into working days and weekend.

A special case of quantization is digitization: In the case of digitization we always talk about regular quantization of a continuous space into a number system with respect to a certain basis. If we enter, for example, some numbers into the computer, these numbers will be digitized into the binary system (basis 2).
Definition 9.1 (Quantization): Separation of a continuous space into discrete sections.
Definition 9.2 (Digitization): Regular quantization.
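As a tiny illustration of these definitions (names chosen here, not from the text), assigning a real number to the representative of its unit interval looks like this:

import math

def quantize(x):
    """Assign a real number to the natural number representing [n, n+1)."""
    return math.floor(x)

# quantize(2.71828) -> 2, the representative of the interval [2; 3)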

9.2 LVQ divides the input space into distinctive areas

Now it is almost possible to describe by means of its name what LVQ should enable us to do: A set of representatives should be used to divide an input space into classes that reflect the input space as well as possible (fig. 9.1 on the following page). Thus, each element of the input space should be assigned to a vector as a representative, i.e. to a class, where the set of these representatives should represent the entire input space as precisely as possible. Such a vector is called a codebook vector. A codebook vector is the representative of exactly those input space vectors lying closest to it, which divides the input space into the said discrete areas.

It is to be emphasized that we have to know in advance how many classes we have and which training example belongs to which class. Furthermore, it is important that the classes need not be disjoint, which means they may overlap.

Such a separation of data into classes is interesting for many problems for which it is useful to explore only some characteristic representatives instead of the possibly huge original set – be it because it is less time-consuming or because it is sufficiently precise.

9.3 Using Codebook Vectors: The nearest one is the winner

The use of a prepared set of codebook vectors is very simple: For an input vector y, the class affiliation is easily decided by considering which codebook vector is the closest – so, the codebook vectors build a Voronoi diagram out of the set. Since each codebook vector can clearly be associated with a class, each input vector is associated with a class, too.

Figure 9.1: Examples of the quantization of a two-dimensional input space. The lines represent the class limits; the × mark the codebook vectors.
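A minimal sketch of this nearest-codebook-vector decision (function and variable names are chosen here for illustration):

import numpy as np

def classify(y, codebook_vectors, classes):
    """Return the class of the codebook vector closest to the input y."""
    distances = [np.linalg.norm(np.asarray(y) - np.asarray(c))
                 for c in codebook_vectors]
    winner = int(np.argmin(distances))   # the nearest codebook vector wins
    return classes[winner]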

9.4 Adjusting Codebook Vectors

As we have already indicated, LVQ is a supervised learning procedure. Thus, we have a teaching input that tells the learning procedure whether the classification of the input pattern is right or wrong: In other words, we have to know in advance the number of classes to be represented or the number of codebook vectors.

Roughly speaking, the aim of the learning procedure is that training examples are used to cause a previously defined number of randomly initialized codebook vectors to reflect the training data as precisely as possible.

9.4.1 The procedure of learning

Learning works according to a simple scheme. We have (since learning is supervised) a set P of |P| training examples. Additionally, we already know that classes are predefined, too, i.e. we also have a set of classes C. A codebook vector is clearly assigned to each class. Thus, we can say that the set of classes C contains |C| codebook vectors C1, C2, . . . , C|C|.

This leads to the structure of the training examples: They are of the form (p, c) and therefore contain the training input vector p and its class affiliation c. For the class affiliation,

c ∈ {1, 2, . . . , |C|}

holds, which means that it clearly assigns the training example to a class or a codebook vector.
Remark: Intuitively, we could say about learning: ”Why a learning procedure? We calculate the average of all class members and place their codebook vectors there – and that’s it.” But we will see soon that our learning procedure can do a lot more.

We only want to briefly discuss the steps of the fundamental LVQ learning procedure:

Initialization: We place our set of codebook vectors at random positions in the input space.

Training example: A training example p of our training set P is selected and presented.

Distance measurement: We measure the distance ||p − C|| between all codebook vectors C1, C2, . . . , C|C| and our input p.

Winner: The closest codebook vector wins, i.e. the one with

min Ci∈C ||p − Ci||.

Learning process: The learning process takes place according to the rule

∆Ci = η(t) · h(p, Ci) · (p − Ci)    (9.1)
Ci(t + 1) = Ci(t) + ∆Ci,    (9.2)

which we now want to break down.

. We have already seen that the first factor η(t) is a time-dependent learning rate allowing us to differentiate between large learning steps and fine tuning.

. The last factor (p − Ci) obviously is the direction towards which the codebook vector is moved.

. But the function h(p, Ci) is the core of the rule: It makes a case differentiation.

Assignment is correct: The winner vector is the codebook vector of the class that includes p. In this case, the function provides positive values and the codebook vector moves towards p.

Assignment is wrong: The winner vector does not represent the class that includes p. Therefore it moves away from p.

We can see that our definition of the function h was not precise enough. With good reason: From here on, LVQ is divided into different nuances, depending on how exactly h and the learning rate should be defined (they are called LVQ1, LVQ2, LVQ3, OLVQ, etc.). The differences lie, for instance, in the strength of the codebook vector movements. They are not all based on the same principle described here, and as announced I don’t want to discuss them any further. Therefore I don’t give any formal definition regarding the above-mentioned learning rule and LVQ.
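As a rough sketch of the procedure above, using the simplest possible case differentiation (h = +1 for a correct and −1 for a wrong assignment, roughly in the spirit of LVQ1); all names are chosen here for illustration.

import numpy as np

def lvq_step(p, target_class, codebook, classes, eta):
    """One LVQ learning step: move the winning codebook vector towards p
    if its class is correct, away from p otherwise.

    codebook: list of numpy vectors, classes: class label per codebook vector
    """
    distances = [np.linalg.norm(p - c) for c in codebook]
    i = int(np.argmin(distances))                  # winner
    h = 1.0 if classes[i] == target_class else -1.0
    codebook[i] = codebook[i] + eta * h * (p - codebook[i])
    return codebook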

9.5 Connection to Neural Networks

Until now, in spite of the learning process, the question was what LVQ actually has to do with neural networks. The codebook vectors can be understood as neurons with a fixed position within the input space, similar to RBF networks. Additionally, in nature it often occurs that in a group one neuron may fire (a winner neuron, here: a codebook vector) and, in return, inhibit all other neurons.

I decided to place this brief chapter about learning vector quantization here so that this approach can be continued in the following chapter about self-organizing maps: We will classify further inputs by means of neurons distributed throughout the input space, only that, this time, we do not know which input belongs to which class.

Now let us take a look at the unsupervised learning networks!

Exercises

Exercise 16: Indicate a quantization which equally distributes all vectors h ∈ H in the five-dimensional unit cube H into one of 1024 classes.


Part III

Unsupervised learning Network Paradigms


Chapter 10

Self-organizing Feature Maps

A paradigm of unsupervised learning neural networks, which maps an input space by means of its fixed topology and thus independently looks for similarities. Function, learning procedure, variations and neural gas.

If you take a look at the concepts of biological neural networks mentioned in the introduction, one question will arise: How does our brain store and recall the impressions it receives every day? Let me point out that the brain does not have any training examples and therefore no ”desired output”. And while considering this subject we realize that there is no output in this sense at all. Our brain responds to external input by changes in state. These are, so to speak, its output.

Based on this principle and exploring the question of how biological neural networks organize themselves, Teuvo Kohonen developed in the eighties his self-organizing feature maps [Koh82, Koh98], shortly referred to as self-organizing maps or SOMs – a paradigm of neural networks where the output is the state of the network, which learns completely unsupervised, i.e. without a teacher.

Unlike the other network paradigms we have already got to know, for SOMs it is unnecessary to ask what the neurons calculate. We only ask which neuron is active at the moment. Biologically, this is very well motivated: If in biology the neurons are connected to certain muscles, it is less interesting to know how strongly a certain muscle is contracted, but rather which muscle is activated. In other words: We are not interested in the exact output of the neuron but in knowing which neuron provides output. Thus, SOMs are considerably more related to biology than, for example, the feedforward networks, which are increasingly used for calculations.

10.1 Structure of a Self-Organizing Map

Typically, SOMs have – like our brain – the task of mapping a high-dimensional input (N dimensions) onto areas in a low-dimensional grid of cells (G dimensions), to draw a map of the high-dimensional space, so to speak. To generate this map, the SOM simply obtains arbitrarily many points of the input space. During the input of the points the SOM will try to cover as well as possible the positions on which the points appear with its neurons. This particularly means that every neuron can be assigned to a certain position in the input space.

At first, these facts seem to be a bit confusing, and it is recommended to briefly reflect on them. There are two spaces in which SOMs are working:

. The N-dimensional input space and

. the G-dimensional grid on which the neurons are lying and which indicates the neighborhood relationships between the neurons and therefore the network topology.

In a one-dimensional grid, the neurons could be, for instance, like pearls on a string. Every neuron would have exactly two neighbors (except for the two end neurons). A two-dimensional grid could be a square array of neurons (fig. 10.1). Another possible array in two-dimensional space would be some kind of honeycomb shape. Irregular topologies are possible, too, but not very often. Topologies with more dimensions and considerably more neighborhood relationships would also be possible, but due to their lack of visualization capability they are not employed very often.

Figure 10.1: Example topologies of a self-organizing map. Above we can see a one-dimensional topology, below a two-dimensional one.

Remark: Even if N = G is true, the two spaces are not equal and have to be distinguished. In this special case they only have the same dimension.

Initially, we will briefly and formally regard the functionality of a self-organizing map and then make it clear by means of some examples.
Definition 10.1 (SOM neuron): Similar to the neurons in an RBF network, a SOM neuron k occupies a fixed position ck (a center) in the input space.
Definition 10.2 (Self-organizing map): A self-organizing map is a set K of SOM neurons. If an input vector is entered, exactly that neuron k ∈ K is activated which is closest to the input pattern in the input space. The dimension of the input space is referred to as N.
Definition 10.3 (Topology): The neurons are interconnected by neighborhood relationships. These neighborhood relationships are called topology. The training of a SOM is highly influenced by the topology. It is defined by the topology function h(i, k, t), where i is the winner neuron (we will see shortly what that is), k the neuron to be adapted (which will be discussed later) and t the timestep. The dimension of the topology is referred to as G.

10.2 SOMs always activate the Neuron with the least distance to an input pattern

Like many other neural networks, the SOM has to be trained before it can be used. But let us regard the very simple functionality of a complete self-organizing map before training, since there are many analogies to the training. Functionality consists of the following steps:

Input of an arbitrary value p of the input space RN.

Calculation of the distance between every neuron k and p by means of a norm, i.e. calculation of ||p − ck||.

One neuron becomes active, namely the neuron i with the shortest calculated distance to the input. All other neurons remain inactive. This paradigm of activity is also called the winner-takes-all scheme. The output we expect due to the input of a SOM shows which neuron becomes active.

Remark: In many literature citations, the description of SOMs is more formal: Often an input layer is described that is completely linked towards a SOM layer. Then the input layer (N neurons) forwards all inputs to the SOM layer. The SOM layer is laterally linked within itself so that a winner neuron can be established and inhibit the other neurons. I think that this explanation of a SOM is not very descriptive and therefore I tried to provide a clearer description of the network structure.

Now the question is which neuron is activated by which input – and the answer is given by the network itself during training.

10.3 Training

Training makes the SOM topology cover the input space. The training of a SOM is nearly as straightforward as the functionality described above. Basically, it is structured into five steps, which partially correspond to those of the functionality.

Initialization: The network starts with random neuron centers ck ∈ RN from the input space.

Creating an input pattern: A stimulus, i.e. a point p, is selected from the input space RN. Now this stimulus is entered into the network.

Distance measurement: Then the distance ||p − ck|| is determined for every neuron k in the network.

Winner takes all: The winner neuron i is determined as the one which has the smallest distance to p, i.e. which fulfills the condition

||p − ci|| ≤ ||p − ck|| ∀ k ≠ i.

You can see that if there are several winner neurons, one can be selected at will.

Adapting the centers: The neuron centers are moved within the input space according to the rule2

∆ck = η(t) · h(i, k, t) · (p − ck),

where the values ∆ck are simply added to the existing centers. The last factor shows that the change in position of the neuron k is proportional to the distance to the input pattern p and, as usual, to a time-dependent learning rate η(t). The above-mentioned network topology exerts its influence by means of the function h(i, k, t), which will be discussed in the following.

2 Note: In many sources this rule is written η h(p − ck), which wrongly leads the reader to believe that h is a constant. This problem can easily be solved by not omitting the multiplication dots ·.

Definition 10.4 (SOM learning rule): A SOM is trained by presenting an input pattern and determining the associated winner neuron. The winner neuron and its neighbor neurons, which are defined by the topology function, then adapt their centers according to the rule

∆ck = η(t) · h(i, k, t) · (p − ck),    (10.1)
ck(t + 1) = ck(t) + ∆ck(t).    (10.2)
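A compact sketch of one such training step; the topology function h is passed in as a parameter, and all names are illustrative assumptions rather than notation from the text.

import numpy as np

def som_step(centers, grid_positions, p, eta, h, t):
    """One SOM training step: find the winner and move it and its
    neighbors towards the stimulus p.

    centers:        |K| x N array of neuron centers in the input space
    grid_positions: |K| x G array of neuron positions on the grid
    h:              topology function h(g_i, g_k, t) -> float
    """
    distances = np.linalg.norm(centers - p, axis=1)
    i = int(np.argmin(distances))                  # winner neuron
    for k in range(len(centers)):
        neighborhood = h(grid_positions[i], grid_positions[k], t)
        centers[k] += eta * neighborhood * (p - centers[k])
    return centers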

10.3.1 The topology function defines how a learning neuron influences its neighbours

The topology function h is not defined on the input space but on the grid and represents the neighborhood relationships between the neurons, i.e. the topology of the network. It can be time-dependent (which it often is) – which explains the parameter t. The parameter k is the index running through all neurons, and the parameter i is the index of the winner neuron.

In principle, the function should take a large value if k is a neighbor of the winner neuron or even the winner neuron itself, and small values if not. More precisely: The topology function must be unimodal, i.e. it must have exactly one maximum. This maximum must be next to the winner neuron i, for which the distance to itself certainly is 0. Additionally, the time-dependence enables us, for example, to reduce the neighborhood in the course of time.


In order to be able to output large values for the neighbors of i and small values for non-neighbors, the function h needs some kind of distance notion on the grid, because from somewhere it has to know how far i and k are apart from each other on the grid. There are different methods to calculate this distance.

On a two-dimensional grid we could apply, for instance, the Euclidean distance (lower part of fig. 10.2), or on a one-dimensional grid we could simply use the number of connections between the neurons i and k (upper part of the same figure).

Definition 10.5 (Topology function): The topology function h(i, k, t) describes the neighborhood relationships in the topology. It can be any unimodal function that reaches its maximum when i = k. Time-dependence is optional, but often used.

10.3.1.1 Introduction of common distance and topology functions

A common distance function would be, for example, the already known Gaussian bell (see fig. 10.3 on page 149). It is unimodal with a maximum close to 0. Additionally, its width can be changed by means of its parameter σ, which can be used to realize a neighborhood that is reduced in the course of time: We simply relate the time-dependence to σ and the result is

Figure 10.2: Example distances of a one-dimensional SOM topology (above) and a two-dimensional SOM topology (below) between two neurons i and k. In the lower case the Euclidean distance is determined (in two-dimensional space equivalent to the Pythagorean theorem). In the upper case we simply count the discrete path length between i and k. To simplify matters I required a fixed grid edge length of 1 in both cases.


a monotonically decreasing σ(t). Then our topology function could look like this:

h(i, k, t) = exp( −||gi − gk||² / (2 · σ(t)²) ),    (10.3)

where gi and gk represent the neuron positions on the grid, not the neuron positions in the input space, which would be referred to as ci and ck.
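A direct transcription of this Gaussian topology function, with grid positions given as coordinate vectors (the function name is chosen here). It can be plugged into the training-step sketch above, e.g. via h = lambda gi, gk, t: gaussian_h(gi, gk, sigma(t)) for some monotonically decreasing sigma(t).

import numpy as np

def gaussian_h(g_i, g_k, sigma_t):
    """Gaussian bell topology function on the grid, eq. (10.3)."""
    d = np.linalg.norm(np.asarray(g_i) - np.asarray(g_k))
    return np.exp(-(d ** 2) / (2.0 * sigma_t ** 2))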

Other functions that can be used instead of the Gaussian function are, for instance, the cone function, the cylinder function or the Mexican hat function (fig. 10.3 on the next page). Here, the Mexican hat function offers a particular biological motivation: Due to its negative values it rejects some neurons close to the winner neuron, a behavior that has already been observed in nature. This can cause sharply separated map areas – and that is exactly why the Mexican hat function has been suggested by Teuvo Kohonen himself. But this adjustment characteristic is not necessary for the functionality of the map; it could even be possible that the map would diverge, i.e. that it could virtually explode.

10.3.2 Learning rates and neighbourhoods can decrease monotonically over time

To avoid that the later training phases forcefully pull the entire map towards a new pattern, SOMs often work with temporally monotonically decreasing learning rates and neighborhood sizes. At first, let us talk about the learning rate: Typical target values of the learning rate are about two orders of magnitude smaller than the initial value, e.g.

0.01 < η < 0.6

could be true. But this size must also depend on the network topology or the size of the neighborhood.

As we have already seen, a decreasing neighborhood size can be realized, for example, by means of a time-dependent, monotonically decreasing σ with the Gaussian bell being used in the topology function.

The advantage of a decreasing neighborhood size is that in the beginning a moving neuron ”pulls along” many neurons in its vicinity, i.e. the randomly initialized network can unfold fast and properly at the start. Towards the end of the learning process, only a few neurons are influenced at the same time, which stiffens the network as a whole but enables a good ”fine tuning” of the individual neurons.

It must be noted that

h · η ≤ 1

must always be true, since otherwise the neurons would constantly miss the current training example.

But enough of theory – let us take a look at a SOM in action!


Figure 10.3: Gaussian bell, cone function, cylinder function and the Mexican hat function suggested by Kohonen as examples for topology functions of a SOM.


Figure 10.4: Illustration of the two-dimensional input space (left) and the one-dimensional topology space (right) of a self-organizing map. Neuron 3 is the winner neuron since it is closest to p. In the topology, the neurons 2 and 4 are the neighbors of 3. To illustrate the one-dimensional topology of the network, it is plotted into the input space by the dotted line. The arrows mark the movement of the winner neuron and its neighbors towards the training example p.


10.4 Examples for the functionality of SOMs

Let us begin with a simple, mentally comprehensible example.

In this example, we use a two-dimensional input space, i.e. N = 2 is true. Let the grid structure be one-dimensional (G = 1). Furthermore, our example SOM should consist of 7 neurons and the learning rate should be η = 0.5.

The neighborhood function is also kept simple so that we will be able to mentally comprehend the network:

h(i, k, t) = 1 if k is a direct neighbor of i,
             1 if k = i,
             0 otherwise.    (10.4)

Now let us take a look at the above-mentioned network with random initialization of the centers (fig. 10.4 on the left page) and enter a training example p. Obviously, in our example the input pattern is closest to neuron 3, i.e. this is the winning neuron.

We remember the learning rule for SOMs,

∆ck = η(t) · h(i, k, t) · (p − ck),

and process the three factors from the back:

Learning direction: Remember that the neuron centers ck are vectors in the input space, as is the pattern p. Thus, the factor (p − ck) indicates the vector from the neuron k to the pattern p. This is now multiplied by different scalars:

Our topology function h indicates that only the winner neuron and its two closest neighbors (here: 2 and 4) are allowed to learn by returning 0 for all other neurons. A time-dependence is not specified. Thus, our vector (p − ck) is multiplied by either 1 or 0.

The learning rate indicates, as always, the strength of learning. As already mentioned, η = 0.5, i.e. all in all, the result is that the winner neuron and its neighbors (here: 2, 3 and 4) approximate the pattern p half the way (in the figure marked by arrows).

Although the center of neuron 7 – seen from the input space – is considerably closer to the input pattern p than neuron 2, neuron 2 is learning and neuron 7 is not. I want to remind the reader that the network topology specifies which neuron is allowed to learn, and not its position in the input space. This is exactly the mechanism by which a topology can significantly cover an input space without having to be related to it in any way.

After the adaptation of the neurons 2, 3 and 4 the next pattern is applied, and so on. Another example of how such a one-dimensional SOM can develop in a two-dimensional input space with uniformly distributed input patterns in the course of time can be seen in figure 10.5 on the next page.

End states of one- and two-dimensional SOMs with differently shaped input spaces can be seen in figure 10.6 on page 154. As we can see, not every input space can be neatly covered by every network topology. There are so-called exposed neurons – neurons which are located in an area where no input pattern has ever occurred. A one-dimensional topology generally produces fewer exposed neurons than a two-dimensional one: For instance, during training on circularly arranged input patterns it is nearly impossible with a two-dimensional squared topology to avoid the exposed neurons in the center of the circle. These are pulled in every direction during the training so that they finally remain in the center. But this does not make the one-dimensional topology an optimal topology, since it can only find less complex neighborhood relationships than a multi-dimensional one.

10.4.1 Topological defects are failures in SOM unfolding

During the unfolding of a SOM it could happen that a topological defect occurs (fig. 10.7), i.e. the SOM does not unfold correctly. A topological defect can best be described by means of the word ”knotting”.

Figure 10.7: A topological defect in a two-dimensional SOM.

A remedy for topological defects could be to increase the initial values for the neighborhood size, because the more complex the topology is (or the more neighbors each neuron has, respectively, since a three-dimensional or a honeycombed two-dimensional topology could also be generated), the more difficult it is for a randomly initialized map to unfold.

10.5 It is possible to dose the resolution of certain areas in a SOM

We have seen that a SOM is trained by entering input patterns of the input space RN one after another, again and again, so that the SOM will be aligned with these patterns and map them. It could happen that we want a certain subset U of the input space to be mapped more precisely than the rest.


Figure 10.5: Behavior of a SOM with one-dimensional topology (G = 1) after the input of 0, 100, 300, 500, 5000, 50000, 70000 and 80000 randomly distributed input patterns p ∈ R2. During the training, η decreased from 1.0 to 0.1 and the σ parameter of the Gauss function decreased from 10.0 to 0.2.


Figure 10.6: End states of one-dimensional (left column) and two-dimensional (right column) SOMs on different input spaces. 200 neurons were used for the one-dimensional topology, 10 × 10 neurons for the two-dimensional topology and 80,000 input patterns for all maps.



This problem can easily be solved by means of SOMs: During the training, disproportionately many input patterns of the area U are presented to the SOM. If the number of training patterns of U ⊂ RN presented to the SOM exceeds the number of those patterns of the remaining RN \ U, then more neurons will group there while the remaining neurons are sparsely distributed on RN \ U (fig. 10.8 on the next page).

As you can see in the illustration, the edge of the SOM could be deformed. This can be compensated by assigning to the edge of the input space a slightly higher probability of being hit by training patterns (an often applied approach for ”reaching every nook and corner” with the SOMs).

Also, a higher learning rate is often used for edge and corner neurons, since they are only pulled into the center by the topology. This also results in a significantly improved corner coverage.

10.6 Application of SOMs

Regarding the biologically inspired associative data storage, there are many fields of application for self-organizing maps and their variations.

For example, the different phonemes of the Finnish language have successfully been mapped onto a SOM with a two-dimensional discrete grid topology, and thereby neighborhoods have been found (a SOM does nothing else than finding neighborhood relationships). So one tries once more to break down a high-dimensional space into a low-dimensional space (the topology), looks whether some structures have developed – et voilà: Clearly defined areas for the individual phonemes are formed.

Teuvo Kohonen himself made the effort to search many papers mentioning his SOMs for key words. In this large input space the individual papers now occupy individual positions, depending on the occurrence of key words. Then Kohonen created a SOM with G = 2 and used it to map the high-dimensional ”paper space” developed by him.

Thus, it is possible to enter any paper into the completely trained SOM and look which neuron in the SOM is activated. It is likely that the papers neighboring it in the topology are interesting, too. This type of brain-like context-based search also works with many other input spaces.

It is to be noted that the system itself defines what is neighbored, i.e. similar, within the topology – and that’s why it is so interesting.

This example shows that the position c of the neurons in the input space is not significant. It is rather interesting to see which neuron is activated when an unknown input pattern is entered.


Figure 10.8: Training of a SOM with G = 2 on a two-dimensional input space. On the left side, the chance of becoming a training pattern was equal for each coordinate of the input space. On the right side, for the central circle in the input space, this chance is more than ten times larger than for the remaining input space (visible in the larger pattern density in the background). In this circle the neurons are obviously more crowded and the remaining area is covered less densely, but in both cases the neurons are still evenly distributed. The two SOMs were trained by means of 80,000 training examples and decreasing η (1 → 0.2) as well as decreasing σ (5 → 0.5).


Next, we can look for which of the previous inputs this neuron was also activated – and we will immediately discover a group of very similar inputs. The more the inputs diverge within the topology, the less they have in common. Virtually, the topology generates a map of the input characteristics – reduced to descriptively few dimensions in relation to the input dimension.

Therefore, the topology of a SOM often is two-dimensional so that it can be easily visualized, while the input space can be very high-dimensional.

10.6.1 SOMs can be used to determine centers for RBF neurons

SOMs align themselves exactly with the positions of the incoming inputs. As a result they are used, for example, to select the centers of an RBF network. We have already been introduced to the paradigm of the RBF network in chapter 6.

As we have already seen, it is possible to control which areas of the input space should be covered with higher resolution – or, in connection with RBF networks, on which areas of our function the RBF network should work with more neurons, i.e. more exactly. A further useful feature of the combination of RBF networks with SOMs is the topology between the RBF neurons, which is given by the SOM: During the final training of an RBF neuron it can be used to influence neighboring RBF neurons as well.

For this, many neural network simulators offer an additional so-called SOM layer in connection with the simulation of RBF networks.

10.7 Variations of SOMs

There are different variations of SOMs for different variations of representation tasks:

10.7.1 A Neural Gas is a SOM without a static topology

The neural gas is a variation of the self-organizing maps of Thomas Martinetz [MBS93], which was developed out of the difficulty of mapping complex input information that partially only occurs in subspaces of the input space or even changes between subspaces (fig. 10.9 on the next page).

The idea of a neural gas is, roughly speaking, to realize a SOM without a grid structure. Due to the fact that they are derived from the SOMs, the learning steps are very similar to the SOM learning steps, but they include an additional intermediate step:

. Again, random initialization of ck ∈ Rn

. Selection and presentation of a pattern of the input space p ∈ Rn


Figure 10.9: A figure filling different subspaces of the actual input space in different positions can therefore hardly be filled by a SOM.

. Neuron distance measurement

. Identification of the winner neuron i

. Intermediate step: Generation of a forward-sorted list L of neurons ordered by their distance to the winner neuron. Thus, the first neuron in the list L is the neuron closest to the winner neuron.

. Changing the centers by means of the known rule, but with the slightly modified topology function

hL(i, k, t).

The function hL(i, k, t), which is slightly modified compared with the original function h(i, k, t), now regards the first elements of the list as the neighborhood of the winner neuron i. The direct result is that – similar to the free-floating molecules in a gas – the neighborhood relationships between the neurons can change at any time, and the number of neighbors is almost arbitrary, too. The distance within the neighborhood is now represented by the distance within the input space.

The mass of neurons can become as stiffened as a SOM by means of a constantly decreasing neighborhood size. It does not have a fixed dimension but can take the dimension that is locally needed at the moment, which can be very advantageous.

A disadvantage could be that there is no fixed grid forcing the input space to become regularly covered, and therefore holes can occur in the cover or neurons can become isolated.


In spite of all practical hints, it is, as always, the user’s responsibility not to understand this paper as a catalog of easy answers but to explore all advantages and disadvantages himself.
Remark: Unlike a SOM, the neighborhood of a neural gas must initially refer to all neurons, since otherwise some outliers of the random initialization may never approximate the remaining group. Forgetting this is a popular error during the implementation of a neural gas.

With a neural gas it is possible to learn a kind of complex input such as in fig. 10.9 on the left page, since we are not bound to a fixed-dimensional grid. But some computational effort could be necessary for the permanent sorting of the list (here, it could be effective to store the list in an ordered data structure right from the start).
Definition 10.6 (Neural gas): A neural gas differs from a SOM by a completely dynamic neighborhood function. With every learning cycle it is decided anew which neurons are the neighborhood neurons of the winner neuron. Generally, the criterion for this decision is the distance between the neurons and the winner neuron in the input space.
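A sketch of the intermediate ranking step and a rank-based neighborhood; the exponential decay over the rank is one common choice and an assumption made here, not prescribed by the text.

import numpy as np

def neural_gas_step(centers, p, eta, lambda_t):
    """One neural gas adaptation step: rank all neurons by their distance
    to p and adapt them with a neighborhood that decays with the rank."""
    distances = np.linalg.norm(centers - p, axis=1)
    ranking = np.argsort(distances)          # sorted list L, winner first
    for rank, k in enumerate(ranking):
        h_L = np.exp(-rank / lambda_t)       # rank-based neighborhood
        centers[k] += eta * h_L * (p - centers[k])
    return centers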

10.7.2 A Multi-SOM consists of several separate SOMs

In order to present another variant of the SOMs, I want to formulate an extended problem: What do we do with input patterns of which we know that they separate themselves into different (maybe disjoint) areas?

Here, the idea is to use not only one SOM but several ones: a multi-self-organizing map, shortly referred to as M-SOM [GKE01b, GKE01a, GS06]. It is not necessary for the SOMs to have the same topology or size; an M-SOM is simply a combination of M SOMs.

The learning process is analogous to that of the SOMs. However, only the neurons belonging to the winner SOM of each training step are adapted. Thus, it is easy to represent two disjoint clusters of data by means of two SOMs, even if one of the clusters is not represented in every dimension of the input space RN. Actually, the individual SOMs exactly reflect these clusters.
Definition 10.7 (Multi-SOM): A multi-SOM is nothing more than the simultaneous use of M SOMs.

10.7.3 A multi-neural gas consists of several separate neural gases

Analogous to the multi-SOM, we also have a set of M neural gases: a multi-neural gas [GS06, SG06]. This construct behaves analogously to the neural gas and the M-SOM: Again, only the neurons of the winner gas are adapted.

The reader certainly wonders what advantage there is in using a multi-neural gas, since an individual neural gas is already capable of dividing into clusters and of working on complex input patterns with changing dimensions. Basically, this is correct, but a multi-neural gas has two serious advantages over a simple neural gas.

1. With several gases, we can directly tell which neuron belongs to which gas. This is particularly important for clustering tasks, for which multi-neural gases have recently been used. Simple neural gases can also find and cover clusters, but we cannot recognize which neuron belongs to which cluster.


2. A lot of computational effort is saved when large original gases are divided into several smaller ones, since (as already mentioned) the sorting of the list L could require a lot of computational effort, while the sorting of several smaller lists L1, L2, . . . , LM is less time-consuming – even if these lists in total contain the same number of neurons.

As a result we will only receive local instead of global3 sortings, but in most cases these local sortings are sufficient.

3 Unfortunately, this step has not solved the sorting problem faster than the solutions already known :-)

Example (The sorting effort): As we know, the sorting problem can be solved in n · log2(n) pattern comparisons. To sort n = 2^16 input patterns,

n · log2(n) = 2^16 · log2(2^16) = 1048576 ≈ 1 · 10^6

comparisons on average would be necessary. If we now divide the list containing the 2^16 = 65536 neurons into 64 lists to be sorted, each containing 2^10 = 1024 neurons, a sorting effort of

64 · (2^10 · log2(2^10)) = 64 · 10240 = 655360 ≈ 6.5 · 10^5

comparisons will be necessary. Thus, for this example alone, there is a saving of roughly 35% of the comparisons due to the use of a multi-neural gas. It is obvious that this effect is considerably increased by the use of more neurons and/or divisions.
Remark: Now we can choose between two extreme cases of multi-neural gases: One extreme case is the ordinary neural gas M = 1, i.e. we only use one single neural gas. Interestingly enough, the other extreme case (very large M, a few or only one neuron per gas) behaves analogously to K-means clustering (for more information on clustering procedures see excursus A).
Definition 10.8 (Multi-neural gas): A multi-neural gas is nothing more than the simultaneous use of M neural gases.


10.7.4 Growing neural gases can add neurons to themselves

A growing neural gas is a variation of the above-mentioned neural gas to which more and more neurons are added according to certain rules. Thus, this is an attempt to work against the isolation of neurons or the generation of larger holes in the cover.

Here, this subject should only be mentioned but not discussed.
Remark: To build a growing SOM is more difficult inasmuch as new neurons have to be integrated into the neighborhood.

Exercises

Exercise 17: A regular, two-dimensional grid shall cover a two-dimensional surface as ”good” as possible.

1. Which grid structure would suit best for this purpose?

2. Which criteria did you use for ”good” and ”best”?

The very imprecise formulation of this exercise is intentional.


Chapter 11

Adaptive Resonance Theory

An ART network in its original form shall classify binary input vectors, i.e. assign them to a 1-out-of-n output. Simultaneously, patterns not classified so far shall be recognized and assigned to a new class.

As in the other brief chapters, we want to try to figure out the basic idea of the adaptive resonance theory (abbreviated: ART) without discussing its theory profoundly.

In several sections we have already mentioned that it is difficult to use neural networks to learn new information in addition to, but without destroying, the information already stored. This circumstance is called the stability/plasticity dilemma.

In 1987, Stephen Grossberg and Gail Carpenter published the first version of their ART network [Gro76] in order to alleviate this problem. This was followed by a whole family of ART improvements (which we want to briefly discuss, too).

It follows the idea of unsupervised learning, the aim of which is the (initially binary) pattern recognition, or more precisely the categorization of patterns into classes. But additionally an ART network shall be capable of finding new classes.

11.1 Task and Structure ofan ART Network

ART network comprises exactly two lay-ers: The input layer I and the recog-nition layer O with the input layer be-ing completely linked towards the recog-nition layer. This complete link indicatesa top-down weight matrix W that con-tains the weight values of the connectionsbetween each neuron in the input layerand each neuron in the recognition layer(fig. 11.1 on the next page).

Simple binary patterns are entered intothe input layer and transferred to the pattern

recognitionrecognition layer while the recognitionlayer shall return a 1-out-of-|O| encoding,i.e. it should follow the winner-takes-all

163

Page 180: Neural Networks

Chapter 11 Adaptive Resonance Theory dkriesel.com

GFED@ABCi1

44444444444444

##FFFFFFFFFFFFFFFFFFFFF

''OOOOOOOOOOOOOOOOOOOOOOOOOOOOO

))SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS GFED@ABCi2

xxxxxxxxxxxxxxxxxxxxx

44444444444444

##FFFFFFFFFFFFFFFFFFFFF

''OOOOOOOOOOOOOOOOOOOOOOOOOOOOO GFED@ABCi3

wwooooooooooooooooooooooooooooo

xxxxxxxxxxxxxxxxxxxxx

44444444444444

##FFFFFFFFFFFFFFFFFFFFF GFED@ABCi4

uukkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk

wwooooooooooooooooooooooooooooo

xxxxxxxxxxxxxxxxxxxxx

44444444444444

GFED@ABCΩ1

EE

;;xxxxxxxxxxxxxxxxxxxxx

77ooooooooooooooooooooooooooooo

55kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk

GFED@ABCΩ2

OO EE

;;xxxxxxxxxxxxxxxxxxxxx

77ooooooooooooooooooooooooooooo

GFED@ABCΩ3

YY44444444444444

OO EE

;;xxxxxxxxxxxxxxxxxxxxx

GFED@ABCΩ4

ccFFFFFFFFFFFFFFFFFFFFF

YY44444444444444

OO EE

GFED@ABCΩ5

ggOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

ccFFFFFFFFFFFFFFFFFFFFF

YY44444444444444

OO

GFED@ABCΩ6

iiSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

ggOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

ccFFFFFFFFFFFFFFFFFFFFF

YY44444444444444

Figure 11.1: Simplified illustration of the ART network structure. Top: the input layer, bottom: the recognition layer. In this illustration the lateral inhibition of the recognition layer and the control neurons are omitted.

For instance, to realize this 1-out-of-|O| encoding the principle of lateral inhibition can be used – or, in the implementation, the most activated neuron can simply be searched for. For practical reasons an IF query would suit this task best.

11.1.1 Resonance takes place by activities being tossed and turned

But there also exists a bottom-up weight matrix V, which propagates the activities within the recognition layer back into the input layer. Now it is obvious that these activities are bounced back and forth again and again, a fact that leads us to resonance. Every activity within the input layer causes an activity within the recognition layer, while in turn every activity within the recognition layer causes an activity within the input layer.

In addition to the two mentioned layers, there also exist a few neurons in an ART network that exercise control functions such as signal enhancement. But we do not want to discuss this theory further, since here only the basic principle of the ART network should become explicit. I have only mentioned it to explain that, in spite of the recurrences, the ART network will achieve a stable state after an input.


11.2 The learning process of an ART network is divided into top-down and bottom-up learning

The trick of the adaptive resonance theory is not only the configuration of the ART network but also the two-piece learning procedure of the theory: for one thing we train the top-down matrix W, and for another thing we train the bottom-up matrix V (fig. 11.2).

11.2.1 Pattern input and top-down learning

When a pattern is entered into the network it causes – as already mentioned – an activation at the output neurons, and the strongest neuron wins. Then the weights of the matrix W going towards the winning output neuron are changed such that the output of the strongest neuron Ω is enhanced even further, i.e. the class affiliation of the input vector to the class of the output neuron Ω becomes enhanced.

11.2.2 Resonance and bottom-up learning

The training of the backward weights of the matrix V is a bit tricky: only the weights of the respective winner neuron are trained towards the input layer, and our current input pattern is used as teaching input. Thus, the network is trained to enhance input vectors.

11.2.3 Adding an output neuron

Of course, it could happen that the neurons are nearly equally activated or that several neurons are activated, i.e. that the network is indecisive. In this case, the mechanisms of the control neurons activate a signal that adds a new output neuron. Then the current pattern is assigned to this output neuron and the weight sets of the new neuron are trained as usual.
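To make the described mechanics more concrete, here is a minimal Python sketch of the two-piece training step: winner selection, top-down training of W, bottom-up training of V with the input pattern as teaching input, and adding a new output neuron when no existing one responds strongly enough. The dot-product activation, the vigilance threshold and the learning rate are assumptions made only for this illustration; they are not the exact rules of the original ART formulation.

import numpy as np

class SimpleART:
    """Illustrative sketch of the two-piece ART training described above."""
    def __init__(self, n_inputs, vigilance=0.7, eta=0.5):
        self.W = np.empty((0, n_inputs))  # top-down weights (input -> recognition)
        self.V = np.empty((0, n_inputs))  # bottom-up weights (recognition -> input)
        self.vigilance = vigilance        # assumed acceptance threshold
        self.eta = eta                    # assumed learning rate

    def train_pattern(self, p):
        p = np.asarray(p, dtype=float)
        if len(self.W) > 0:
            activations = self.W @ p                        # activation of the recognition layer
            winner = int(np.argmax(activations))            # winner-takes-all
            match = self.V[winner] @ p / max(p.sum(), 1.0)  # how well the winner's class fits p
            if match >= self.vigilance:
                # top-down learning: enhance the winner's response to p
                self.W[winner] += self.eta * (p - self.W[winner])
                # bottom-up learning: train the winner's backward weights with p as teaching input
                self.V[winner] += self.eta * (p - self.V[winner])
                return winner
        # no sufficiently matching neuron: add a new output neuron for this pattern
        self.W = np.vstack([self.W, p[None, :]])
        self.V = np.vstack([self.V, p[None, :]])
        return len(self.W) - 1

net = SimpleART(n_inputs=4)
print(net.train_pattern([1, 0, 1, 0]))   # opens class 0
print(net.train_pattern([1, 0, 1, 0]))   # activates and sharpens class 0 again
print(net.train_pattern([0, 1, 0, 1]))   # sufficiently different: opens a new class

Entering the same binary pattern twice would thus activate and further sharpen the same output neuron, while a sufficiently different pattern opens a new class.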

Thus, the advantage of this system is not only that it divides inputs into classes and finds new classes; it can also tell us, after the activation of an output neuron, what a typical representative of a class looks like – which is a significant feature.

Often, however, the system can only moderately distinguish the patterns. The question is when a new neuron is permitted to become active and when it should learn. In an ART network there are different additional control neurons which answer this question according to different mathematical rules and which are responsible for intercepting special cases.

This is, at the same time, one of the largest objections to ART: the fact that an ART network distinguishes special cases in this way resembles an IF query that has been forced into the mechanism of a neural network.

11.3 Extensions

As already mentioned above, the ART networks have often been extended.



Figure 11.2: Simplified illustration of the two-piece training of an ART network: the trained weights are represented by solid lines. Let us assume that a pattern has been entered into the network and that the numbers mark the outputs. Top: We can see that Ω2 is the winner neuron. Middle: So the weights are trained towards the winner neuron and (bottom) the weights from the winner neuron are trained towards the input layer.

ART-2 [CG87] is an extension to continuous inputs and additionally offers (in an extension called ART-2A) improvements of the learning speed, which results in additional control neurons and layers.

ART-3 [CG90] improves the learning ability of ART-2 by adapting additional biological processes such as the chemical processes within the synapses1.

Apart from the extensions described, many others exist.

1 Because of the frequent extensions of the adaptive resonance theory, wagging tongues already call them "ART-n networks".


Part IV

Excursi, Appendices and Registers


Appendix A

Excursus: Cluster Analysis and Regional and Online Learnable Fields

In Grimm's dictionary the extinct German word "Kluster" is described by "was dicht und dick zusammensitzet" (a thick and dense group of sth.). In static cluster analysis, the formation of groups within point clouds is explored. Introduction of some procedures, comparison of their advantages and disadvantages. Discussion of an adaptive clustering method based on neural networks. A regional and online learnable field models, from a point cloud possibly containing a great many points, a comparatively small set of neurons that are representative for the point cloud.

As already mentioned, many problems can be traced back to problems in cluster analysis. Therefore, it is necessary to research procedures that examine whether groups (so-called clusters) develop within point clouds.

Since cluster analysis procedures need a notion of distance between two points, a metric must be defined on the space where these points are situated.

We briefly want to specify what a metric is.

Definition A.1 (Metric): A relation dist(x1, x2) defined for two objects x1, x2 is referred to as a metric if each of the following criteria applies:

1. dist(x1, x2) = 0 if and only if x1 = x2,

2. dist(x1, x2) = dist(x2, x1), i.e. symmetry,

3. dist(x1, x3) ≤ dist(x1, x2) + dist(x2, x3), i.e. the triangle inequality holds.

Colloquially speaking, a metric is a tool for determining distances between points in any space. Here, the distances have to be symmetrical, and the distance between two points may only be 0 if the two points are equal. Additionally, the triangle inequality must apply.
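As a tiny illustration of Definition A.1, the following Python snippet implements the Euclidean distance and checks the three criteria on a few sample points; the sample points are, of course, arbitrary.

import math

def dist(x1, x2):
    """Euclidean distance - one example of a metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

p, q, r = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)
assert dist(p, p) == 0 and dist(p, q) > 0          # criterion 1
assert dist(p, q) == dist(q, p)                    # criterion 2 (symmetry)
assert dist(p, r) <= dist(p, q) + dist(q, r)       # criterion 3 (triangle inequality)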

Metrics are provided, for example, by the squared distance and the Euclidean distance, which have already been introduced. Based on such metrics we can define a clustering procedure that uses a metric as a distance measure.



Now we want to introduce and briefly discuss different clustering procedures.

A.1 k-Means clustering allocates data to a predefined number of clusters

k-means clustering according to J. MacQueen [Mac67] is an algorithm that is often used because of its low computation and storage complexity and which is regarded as "inexpensive and good". The operation sequence of the k-means clustering algorithm is the following (a short code sketch follows the list):

1. Provide data to be examined.

2. Define k, which is the number of cluster centers.

3. Select k random vectors for the cluster centers (also referred to as codebook vectors).

4. Assign each data point to the nearest codebook vector1.

5. Compute cluster centers for all clusters.

6. Set codebook vectors to new cluster centers.

7. Continue with 4 until the assignments are no longer changed.

1 The name codebook vector was created because the often used name cluster vector was too unclear.
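A compact Python sketch of the listed steps might look as follows; choosing the initial codebook vectors from the data itself and the convergence test used here are assumptions, since the text leaves these details open.

import numpy as np

def k_means(data, k, rng=np.random.default_rng(0)):
    """Sketch of steps 1-7 of the k-means procedure described above."""
    data = np.asarray(data, dtype=float)
    # step 3: choose k random data points as initial codebook vectors
    codebooks = data[rng.choice(len(data), size=k, replace=False)]
    while True:
        # step 4: assign every data point to its nearest codebook vector
        dists = np.linalg.norm(data[:, None, :] - codebooks[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # steps 5 and 6: recompute the cluster centers and move the codebook vectors there
        new_codebooks = np.array([data[assignment == i].mean(axis=0) if np.any(assignment == i)
                                  else codebooks[i] for i in range(k)])
        # step 7: stop when the codebook vectors (and thus the assignments) no longer change
        if np.allclose(new_codebooks, codebooks):
            return assignment, codebooks
        codebooks = new_codebooks

# Example: two obvious groups of points, k = 2
labels, centers = k_means([[0, 0], [0.2, 0.1], [5, 5], [5.2, 4.9]], k=2)
print(labels)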

Step 2 already shows one of the great questions of the k-means algorithm: the number k of cluster centers has to be determined in advance; it cannot be found by the algorithm itself. The problem is that it is not necessarily known in advance how k can best be chosen. Another problem is that the procedure can become quite unstable if the codebook vectors are badly initialized. But since this initialization is random, it is often useful simply to restart the procedure, which does not require much computational effort. If you are fully aware of these weaknesses, you will receive quite good results.

However, complex structures such as "clusters in clusters" cannot be recognized. If k is high, the outer ring of the construction in the following illustration will be recognized as many single clusters. If k is low, the ring with the small inner clusters will be recognized as one cluster.

For an illustration see the upper right part of fig. A.1.

A.2 k-Nearest neighbouring looks for the k nearest neighbours of each data point

The k-nearest neighbouring procedure [CH67] connects each data point to its k closest neighbours, which often results in a division of the data into groups. Then such a group builds a cluster.


The advantage is that the number of clusters arises all by itself. The disadvantage is that a large storage and computational effort is required to find the nearest neighbours (the distances between all data points must be computed and stored).

There are some special cases in which the procedure combines data points belonging to different clusters if k is too high (see the two small clusters in the upper right of the illustration). Clusters consisting of only one single data point are basically connected to another cluster, which is not always intentional.

Furthermore, it is not mandatory that the links between the points are symmetrical.

But this procedure allows a recognition of rings and therefore of "clusters in clusters", which is a clear advantage. Another advantage is that the procedure adaptively responds to the distances in and between the clusters.

For an illustration see the lower left part of fig. A.1.

A.3 ε-Nearest neighbouring looks for neighbours within the radius ε for each data point

Another approach to neighbourhood: here, the neighbourhood detection does not use a fixed number k of neighbours but a radius ε, which is the reason for the name epsilon-nearest neighbouring. Points are neighbours if they are at most ε apart from each other. Here, the storage and computational effort is obviously very high, which is a disadvantage.

But note that there are some special cases: two separate clusters can easily be connected due to the unfavourable situation of a single data point. This can also happen with k-nearest neighbouring, but it would be more difficult, since in that case the number of neighbours per point is limited.

An advantage is the symmetric nature of the neighbourhood relationships. Another advantage is that the combination of minimal clusters due to a fixed number of neighbours is avoided.

On the other hand, it is necessary to skillfully initialize ε in order to be successful, i.e. smaller than half the smallest distance between two clusters. With variable cluster and point distances within clusters this can possibly be a problem.

For an illustration see the lower right part of fig. A.1.
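The clustering induced by ε-nearest neighbouring can be sketched in a few lines of Python: two points are linked if their distance is at most ε, and a cluster is a connected component of the resulting graph. The fragment below is only an illustration of this idea; the quadratic distance computation it performs is exactly the storage and computational effort criticized above.

import numpy as np

def epsilon_clusters(points, eps):
    """Group points into clusters: connected components of the 'distance <= eps' graph."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    labels = [-1] * n          # cluster index per point, -1 = not yet visited
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]        # flood-fill one connected component
        labels[start] = current
        while stack:
            i = stack.pop()
            dists = np.linalg.norm(points - points[i], axis=1)
            for j in np.where(dists <= eps)[0]:
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Example: two well separated groups fall into two clusters for eps = 0.5
print(epsilon_clusters([[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9]], eps=0.5))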

A.4 The silhouette coefficient determines how accurate a given clustering is

As we can see above, there is no easy answer for clustering problems. Each procedure described has very specific disadvantages.


Figure A.1: Top left: our set of points. We will use this set to explore the different clustering methods. Top right: k-means clustering. Using this procedure we chose k = 6. As we can see, the procedure is not capable of recognizing "clusters in clusters" (bottom left of the illustration). Long "lines" of points are a problem, too: they would be recognized as many small clusters (if k is sufficiently large). Bottom left: k-nearest neighbouring. If k is selected too high (higher than the number of points in the smallest cluster), this will result in the cluster combinations shown in the upper right of the illustration. Bottom right: ε-nearest neighbouring. This procedure will cause difficulties if ε is selected larger than the minimum distance between two clusters (see upper left of the illustration), which will then be combined.


In this respect it is useful to have a criterion to decide how good our cluster division is. This possibility is offered by the silhouette coefficient according to [Kau90]. This coefficient measures how well the clusters are delimited from each other and indicates whether points may be sorted into the wrong clusters.

Let P be a point cloud and p a point ∈ P. Let c ⊆ P be a cluster within the point cloud and let p be part of this cluster, i.e. p ∈ c. The set of clusters is called C. In summary,

p ∈ c ⊆ P

applies.

To calculate the silhouette coefficient we initially need the average distance between point p and all its cluster neighbours. This variable is referred to as a(p) and defined as follows:

a(p) = 1/(|c| − 1) · Σ_{q ∈ c, q ≠ p} dist(p, q)    (A.1)

Furthermore, let b(p) be the average distance between our point p and all points of the nearest other cluster (g runs over all clusters except for c):

b(p) = min_{g ∈ C, g ≠ c} 1/|g| · Σ_{q ∈ g} dist(p, q)    (A.2)

The point p is classified well if the distance to the center of its own cluster is minimal and the distance to the centers of the other clusters is maximal. In this case, the following term provides a value close to 1:

s(p) = (b(p) − a(p)) / max{a(p), b(p)}    (A.3)

Apparently, the whole term s(p) can only be within the interval [−1; 1]. A value close to −1 indicates a bad classification of p.

The silhouette coefficient S(P) results from the average of all values s(p):

S(P) = 1/|P| · Σ_{p ∈ P} s(p)    (A.4)

As above, the total quality of the cluster division is expressed by a value in the interval [−1; 1].
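The definitions (A.1) to (A.4) translate almost directly into code. The following Python sketch computes s(p) and S(P) for a given assignment of points to clusters; the Euclidean distance and the value 0 for singleton clusters are assumptions made only for this illustration.

import numpy as np

def silhouette_coefficient(points, labels):
    """Compute S(P) as the mean of s(p) over all points (eq. A.1 - A.4)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    clusters = {l: np.where(labels == l)[0] for l in set(labels)}
    s_values = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) < 2:
            s_values.append(0.0)   # s(p) is undefined for singleton clusters; 0 is a common convention
            continue
        # a(p): average distance to the other points of p's own cluster
        a = np.mean([np.linalg.norm(p - points[j]) for j in own if j != i])
        # b(p): smallest average distance to the points of any other cluster
        b = min(np.mean([np.linalg.norm(p - points[j]) for j in idx])
                for l, idx in clusters.items() if l != labels[i])
        s_values.append((b - a) / max(a, b))
    return float(np.mean(s_values))

# Example: two compact, well separated clusters yield a value close to 1
pts = [[0, 0], [0.1, 0.2], [0.2, 0.1], [5, 5], [5.2, 5.1], [4.9, 5.2]]
print(silhouette_coefficient(pts, [0, 0, 0, 1, 1, 1]))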

Now that different clustering strategies with different characteristics have been presented (much further material can be found in [DHS01]), as well as a measure for the quality of an existing arrangement of given data into clusters, I want to introduce a clustering method based on an unsupervised-learning neural network [SGE05], which was published in 2005. Like all the other methods this one may not be perfect, but it eliminates large standard weaknesses of the known clustering methods.

A.5 Regional and Online Learnable Fields are a neural clustering strategy

The paradigm of neural networks that I want to introduce now is the regional and online learnable fields, referred to as ROLFs for short.


A.5.1 ROLFs try to cover data with neurons

Roughly speaking, the regional and online learnable fields are a set K of neurons which try to cover a set of points as well as possible by means of their distribution in the input space. For this, neurons are added, moved or changed in size during training if necessary. The parameters of the individual neurons will be discussed later.

Definition A.2 (Regional and online learnable field): A regional and online learnable field (abbreviated ROLF or ROLF network) is a set K of neurons that are trained to cover a certain set in the input space as well as possible.

A.5.1.1 ROLF neurons feature a position and a radius in the input space

Here, a ROLF neuron k ∈ K has two parameters: similar to the RBF networks, it has a center ck, i.e. a position in the input space.

But it has yet another parameter: the radius σ, which defines the radius of the perceptive surface surrounding the neuron2. A neuron covers the part of the input space that is situated within this radius.

ck and σk are locally defined for each neuron. This particularly means that the neurons are capable of covering surfaces of different sizes.

2 I write "defines" and not "is" because the actual radius is specified by σ · ρ.

Figure A.2: Structure of a ROLF neuron.


The radius of the perceptive surface is specified by r = ρ · σ (fig. A.2), with the multiplier ρ being globally defined and previously specified for all neurons. Intuitively, the reader will wonder what this multiplier is used for. Its significance will be discussed later. Furthermore, the following has to be observed: it is not necessary for the perceptive surfaces of the different neurons to be of the same size.

Definition A.3 (ROLF neuron): The parameters of a ROLF neuron k are a center ck and a radius σk.

Definition A.4 (Perceptive surface): The perceptive surface of a ROLF neuron k consists of all points within the radius ρ · σ in the input space.



A.5.2 A ROLF learns unsupervised by presenting training examples online

Like many other paradigms of neural networks, our ROLF network learns by receiving many training examples p of a training set P. The learning is unsupervised. For each training example p entered into the network two cases can occur:

1. There is one accepting neuron k for p, or

2. there is no accepting neuron at all.

If in the first case several neurons are suitable, then there will be exactly one accepting neuron insofar as the closest neuron is the accepting one. For the accepting neuron k, ck and σk are adapted.

Definition A.5 (Accepting neuron): The criterion for a ROLF neuron k to be an accepting neuron of a point p is that the point p must be located within the perceptive surface of k. If p is located in the perceptive surfaces of several neurons, then the closest neuron will be the accepting one. If there are several closest neurons, one can be chosen at random.

A.5.2.1 Both positions and radii are adapted throughout learning

Let us assume that we entered a training example p into the network and that there is an accepting neuron k. Then the radius σk moves towards ||p − ck|| (i.e. towards the distance between p and ck) and the center ck moves towards p. Additionally, let us define the two learning rates ησ and ηc for radii and centers.

ck(t + 1) = ck(t) + ηc (p − ck(t))
σk(t + 1) = σk(t) + ησ (||p − ck(t)|| − σk(t))

Note that here σk is a scalar while ck is a vector in the input space.

Definition A.6 (Adapting a ROLF neuron):

A neuron k accepted by a point p is adapted according to the following rules:

ck(t + 1) = ck(t) + ηc (p − ck(t))    (A.5)

σk(t + 1) = σk(t) + ησ (||p − ck(t)|| − σk(t))    (A.6)

A.5.2.2 The radius multiplier ensures that neurons do not only shrink

Now we can understand the function of the multiplier ρ: due to this multiplier the perceptive surface of a neuron includes more than only the points surrounding the neuron within the radius σ. This means that, due to the above-mentioned learning rule, σ cannot only decrease but also increase.

Definition A.7 (Radius multiplier): The radius multiplier ρ > 1 is globally defined and expands the perceptive surface of a neuron k to a multiple of σk. Thus it is ensured that the radius σk cannot only decrease but also increase.

Generally, the radius multiplier is set to values in the lower one-digit range, such as 2 or 3.


So far we have only discussed the case in ROLF training where there is an accepting neuron for the training example p.

A.5.2.3 As and when required, new neurons are generated

This suggests discussing the approach for the case that there is no accepting neuron.

In this case a new accepting neuron k is generated for our training example. The result is that ck and σk certainly have to be initialized.

The initialization of ck can be understood intuitively: the center of the new neuron is simply set on the training example, i.e.

ck = p.

We generate a new neuron because there is no neuron close to p – for logical reasons, we place the neuron exactly on p.

But how do we set a σ when a new neuron is generated? For this purpose there exist different options:

Init-σ: We always select a predefined static σ.

Minimum σ: We take a look at the σ of all neurons and select the minimum.

Maximum σ: We take a look at the σ of all neurons and select the maximum.

Mean σ: We select the mean σ of all neurons.

Currently, the mean-σ variant is the favourite one, although the learning procedure also works with the other ones. In the minimum-σ variant the neurons tend to cover less surface, in the maximum-σ variant they tend to cover more surface.

Definition A.8 (Generating a ROLF neuron): If a new ROLF neuron k is generated by entering a training example p, then ck is initialized with p and σk according to one of the above-mentioned strategies (Init-σ, Minimum-σ, Maximum-σ, Mean-σ).

The training is complete when, after repeated randomly permuted pattern presentation, no new neuron has been generated in an epoch and the positions of the neurons barely change.
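The complete training step described in sections A.5.2.1 to A.5.2.3 can be summarized in a short Python sketch. The concrete parameter values and the mean-σ initialization used below are only one possible choice; they follow the hints given in the text but are not prescribed by it.

import numpy as np

class ROLF:
    """Sketch of a regional and online learnable field (centers ck, radii sigma_k)."""
    def __init__(self, rho=2.0, eta_c=0.05, eta_sigma=0.05, init_sigma=1.0):
        self.centers, self.sigmas = [], []
        self.rho, self.eta_c, self.eta_sigma = rho, eta_c, eta_sigma
        self.init_sigma = init_sigma          # used only for the very first neuron

    def train_example(self, p):
        p = np.asarray(p, dtype=float)
        # accepting neurons: p lies within the perceptive surface (radius rho * sigma)
        accepting = [i for i, (c, s) in enumerate(zip(self.centers, self.sigmas))
                     if np.linalg.norm(p - c) <= self.rho * s]
        if accepting:
            # the closest accepting neuron adapts its center and radius (eq. A.5 and A.6)
            k = min(accepting, key=lambda i: np.linalg.norm(p - self.centers[i]))
            dist = np.linalg.norm(p - self.centers[k])
            self.centers[k] = self.centers[k] + self.eta_c * (p - self.centers[k])
            self.sigmas[k] = self.sigmas[k] + self.eta_sigma * (dist - self.sigmas[k])
        else:
            # no accepting neuron: create one on p, radius via the mean-sigma strategy
            sigma = np.mean(self.sigmas) if self.sigmas else self.init_sigma
            self.centers.append(p.copy())
            self.sigmas.append(sigma)

Training would then consist of repeatedly presenting the (randomly permuted) training set until no new neuron is generated within an epoch and the centers barely move.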

A.5.3 Evaluating a ROLF

The result of the training algorithm is that the training set is gradually covered well and precisely by the ROLF neurons and that a high concentration of points on a spot of the input space does not automatically generate more neurons. Thus, a possibly very large point cloud is reduced to very few representatives (based on the input set).

Then it is very easy to define the number of clusters: two neurons are (according to the definition of the ROLF) connected when their perceptive surfaces overlap (i.e. some kind of nearest neighbouring is executed with the variable perceptive surfaces). A cluster is a group of connected neurons, or a group of points of the input space covered by these neurons (fig. A.3).



Of course, the complete ROLF network can be evaluated by means of other clustering methods, i.e. the neurons can be searched for clusters. Particularly with clustering methods whose storage effort grows quadratically with |P|, the storage effort can be reduced dramatically, since generally there are considerably fewer ROLF neurons than original data points, while the neurons represent the data points quite well.

A.5.4 Comparison with popular clustering methods

It is obvious that storing the neurons rather than storing the input points takes the lion's share of the storage effort of the ROLFs. This is a great advantage for huge point clouds with a lot of points.

Since it is unnecessary to store all the data points, our ROLF, as a neural clustering method, has the capability to learn online, which is definitely a great advantage. Furthermore, it can (similar to ε-nearest neighbouring or k-nearest neighbouring) distinguish clusters from enclosed clusters – but, due to the online presentation of the data, without a quadratically growing storage effort, which is by far the greatest disadvantage of the two neighbouring methods.

Additionally, the issue of the size of the individual clusters proportional to their distance from each other is addressed by using variable perceptive surfaces – which is also not always the case for the two mentioned methods.

Figure A.3: The clustering process. Above: the input set; in the middle: the input space covered by ROLF neurons; below: the input space covered only by the neurons (representatives).



The ROLF compares favourably with k-means clustering as well: firstly, it is unnecessary to know the number of clusters in advance and, secondly, k-means clustering does not recognize clusters enclosed in other clusters as separate clusters.

A.5.5 Initializing radii, learning rates and multiplier is not trivial

Certainly, the disadvantages of the ROLF shall not be concealed: it is not always easy to select the appropriate initial value for σ and ρ. Previous knowledge about the data set can, colloquially speaking, be included in ρ and the σ initial value of the ROLF: fine-grained data clusters should use a small ρ and a small σ initial value. But the smaller the ρ, the smaller the chance that the neurons will grow if necessary. Here again, there is no easy answer, just as for the learning rates ηc and ησ.

For ρ, multipliers in the lower one-digit range such as 2 or 3 are very popular. ηc and ησ work successfully with values of about 0.005 to 0.1; variations during runtime are also imaginable for this type of network. Initial values for σ generally depend on the cluster and data distribution (i.e. they often have to be tested). But compared to wrong initializations they are – at least with the mean-σ strategy – relatively robust after some training time.

As a whole, the ROLF is on a par with the other clustering methods and is particularly interesting for systems with low storage capacity or huge data sets.

A.5.6 Application examples

A first application example may be finding color clusters in RGB images. Another field of application directly described in the ROLF publication is the recognition of words transferred into a 720-dimensional feature space. Thus, we can see that ROLFs are relatively robust against higher dimensions. Further applications can be found in the field of analysis of attacks on network systems and their classification.

Exercises

Exercise 18: Determine at least four adaptation steps for one single ROLF neuron k if the four patterns stated below are presented one after another in the indicated order. Let the initial values for the ROLF neuron be ck = (0.1, 0.1) and σk = 1. Furthermore, let ηc = 0.5 and ησ = 0. Let ρ = 3.

P = {(0.1, 0.1); (0.9, 0.1); (0.1, 0.9); (0.9, 0.9)}.


Appendix B

Excursus: Neural Networks Used for Prediction

Discussion of an application of neural networks: a look ahead into the future of time series.

After discussing the different paradigms of neural networks it is now useful to take a look at an application of neural networks which is brought up often and (as we will see) is also used for fraud: the application of time series prediction. This excursus is structured into a description of time series and estimates of the requirements that are actually needed to predict the values of a time series. Finally, I will say something about the range of software that claims to predict share prices or other economic characteristics by means of neural networks or other procedures.

This chapter should not be a detailed description but rather indicate some approaches for time series prediction. In this respect I will again try to avoid formal definitions.

B.1 About Time Series

A time series is a series of values discretized in time. For example, daily measured temperature values or other meteorological data of a specific site could be represented by a time series. Share price values also represent a time series. Often the measurement of a time series is equidistant in time, and in many time series the future development of their values is very interesting, e.g. the daily weather forecast.

Time series can also be values of an actually continuous function, read at a certain time interval ∆t (fig. B.1).

If we want to predict a time series, we will look for a neural network that maps the previous series values to future developments of the time series, i.e. if we know longer sections of the time series, we will have enough training examples. Of course, these are not examples of the future to be predicted; rather, we try to generalize and extrapolate the past by means of these examples.


Figure B.1: A function x of the time is read at discrete points in time (time-discretized); the result is a time series. The values read are entered into a neural network (in this example an SLP) which shall learn to predict the future values of the time series.


But before we begin to predict a time series we have to answer some questions about the time series regarded and ensure that our time series fulfills some requirements.

1. Do we have any evidence which suggests that future values depend in any way on the past values of the time series? Does the past of a time series include information about its future?

2. Do we have enough past values of the time series that can be used as training patterns?

3. In case of a prediction of a continuous function: what must a useful ∆t look like?

Now these questions shall be explored in detail.

How much information about the future is included in the past values of a time series? This is the most important question to be answered for any time series that should be mapped into the future. If the future values of a time series, for instance, do not depend on the past values, then a time series prediction based on them will be impossible.

In this chapter, we assume systems whose future values can be deduced from their states – the deterministic systems. This leads us to the question of what a system state is.



A system state completely describes a system for a certain point in time. The future of a deterministic system would be clearly defined by means of the complete description of its current state.

The problem in the real world is that such a state concept includes all things that influence our system by any means.

In case of our weather forecast for a specific site we could definitely determine the temperature, the atmospheric pressure and the cloud density as the meteorological state of the place at a time t. But the whole state would include significantly more information. Here, the worldwide phenomena that control the weather would be interesting, as well as small local phenomena such as the cooling system of the local power plant.

So we shall note that the system state is desirable for prediction but not always possible to obtain. Often only fragments of the current states can be acquired, e.g. for a weather forecast these fragments are the said weather data.

But we can partially compensate for this weakness by including in the prediction not only the observable parts of a single (the last) state, but by considering several past points in time. From this we want to derive our first prediction system:

B.2 One-Step-Ahead Prediction

The first attempt to predict the next future value of a time series out of past values is called one-step-ahead prediction (fig. B.2).

Such a predictor system receives the last n observed state parts of the system as input and outputs the prediction for the next state (or state part). The idea of a state space with predictable states is called state space forecasting.

The aim of the predictor is to realize a function

f(xt−n+1, . . . , xt−1, xt) = x̃t+1,    (B.1)

which receives exactly n past values in order to predict the future value. Predicted values shall be marked with a tilde (e.g. x̃) to distinguish them from the actual future values.

To find a linear combination

x̃i+1 = a0 xi + a1 xi−1 + . . . + aj xi−j    (B.2)

that fulfills our conditions by approximation would be the most intuitive and simplest approach.

Such a construction is called a digital filter. Here we use the fact that time series usually have a lot of past values, so that we can set up a series of equations1:


Figure B.2: Representation of the one-step-ahead prediction. An attempt is made to calculate the future value from a series of past values. The predicting element (in this case a neural network) is referred to as the predictor.


xt = a0 xt−1 + . . . + aj xt−1−(n−1)
xt−1 = a0 xt−2 + . . . + aj xt−2−(n−1)
⋮                                    (B.3)
xt−n = a0 xt−n−1 + . . . + aj xt−n−(n−1)

Thus, n equations could be found for n unknown coefficients, and we could solve them (if possible). Or, another and better approach: we could use m > n equations for n unknowns in such a way that the sum of the squared prediction errors over the values that are already known is minimized. This is called the moving average procedure.
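A least-squares fit of the coefficients a0, ..., a_{n-1} over many such equations can be sketched in Python as follows; the window length and the use of numpy's least-squares solver are assumptions made only for this illustration.

import numpy as np

def fit_linear_predictor(series, n):
    """Fit coefficients so that x[t] ~ a0*x[t-1] + ... + a_{n-1}*x[t-n] (least squares)."""
    series = np.asarray(series, dtype=float)
    rows, targets = [], []
    for t in range(n, len(series)):
        rows.append(series[t - n:t][::-1])   # the n most recent past values, newest first
        targets.append(series[t])
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return a

def one_step_ahead(series, a):
    """Predict the next value from the last len(a) observed values."""
    recent = np.asarray(series, dtype=float)[-len(a):][::-1]   # newest first, matching the fit
    return float(np.dot(a, recent))

# Example: a noiseless sine wave is predicted quite well by such a digital filter
t_grid = np.linspace(0, 20, 200)
xs = np.sin(t_grid)
coeffs = fit_linear_predictor(xs, n=4)
print(one_step_ahead(xs, coeffs), np.sin(20 + t_grid[1] - t_grid[0]))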

But this linear structure corresponds to a single-layer perceptron with a linear activation function which has been trained by means of data from the past (the experimental setup would comply with fig. B.1). In fact, training by means of the delta rule provides results very close to the analytical solution.

1 Without going into detail I want to remark that the prediction becomes easier the more past values of the time series are available. I would like to ask the reader to read up on the Nyquist-Shannon sampling theorem.


Even if this approach often provides satisfying results, we have seen that many problems cannot be solved by using a single-layer perceptron. Additional layers with linear activation functions are useless as well, since a multi-layer perceptron with only linear activation functions can be reduced to a single-layer perceptron. Such considerations lead to a non-linear approach.

The multi-layer perceptron and non-linear activation functions provide a universal non-linear function approximator, i.e. we can use an n-|H|-1 MLP for n inputs out of the past. An RBF network could also be used. But remember that here the number n has to remain low, since in RBF networks high input dimensions are very complex to realize. So if we want to include many past values, a multi-layer perceptron will require considerably less computational effort.


B.3 Two-Step-Ahead Prediction

What approaches can we use to see farther into the future?

B.3.1 Recursive two-step-ahead prediction

In order to extend the prediction to, for instance, two time steps into the future, we could perform two one-step-ahead predictions in a row (fig. B.3), i.e. a recursive two-step-ahead prediction. Unfortunately, the value determined by means of a one-step-ahead prediction is generally imprecise, so that errors can build up, and the more predictions are performed in a row the more imprecise the result becomes.

B.3.2 Direct two-step-ahead prediction

We have already suspected that there exists a better approach: just as the system can be trained to predict the next value, we can certainly train it to predict the value after next. This means we directly train, for example, a neural network to look two time steps ahead into the future, which is referred to as direct two-step-ahead prediction (fig. B.4). Obviously, the direct two-step-ahead prediction is technically identical to the one-step-ahead prediction. The only difference is the training.

B.4 Additional Optimization Approaches for Prediction

The possibility to predict values far away in the future is not only important because we try to look farther ahead into the future. There can also be periodic time series where other approaches are hardly possible: if a lecture begins at 9 a.m. every Thursday, it is not very useful to know how many people sat in the lecture room on Monday in order to predict the number of lecture participants. The same applies, for example, to periodically occurring commuter jams.

B.4.1 Changing temporal parameters

Thus, it can be useful to intentionally leave gaps in the future values as well as in the past values of the time series, i.e. to introduce the parameter ∆t which indicates which past value is used for prediction. Technically speaking, we still use a one-step-ahead prediction, only that we extend the input space or train the system to predict values lying farther away.

It is also possible to combine different ∆t: in case of the traffic jam prediction for a Monday, the values of the last few days could be used as data input in addition to the values of the previous Mondays. Thus, we use the last values of several periods, in this case the values of a weekly and a daily period.


Figure B.3: Representation of the two-step-ahead prediction. Attempt to predict the second future value out of a past value series by means of a second predictor and the involvement of an already predicted value.

Figure B.4: Representation of the direct two-step-ahead prediction. Here, the second time step is predicted directly, the first one is omitted. Technically, it does not differ from a one-step-ahead prediction.


We could also include an annual period in the form of the beginning of the holidays (for sure, every one of us has already spent a lot of time on the highway because he forgot about the beginning of the holidays).

B.4.2 Heterogeneous prediction

Another prediction approach would be to predict the future values of a single time series out of several time series, if it is assumed that the additional time series is related to the future of the first one (heterogeneous one-step-ahead prediction, fig. B.5).

If we want to predict two outputs of two related time series, it is certainly possible to perform two parallel one-step-ahead predictions (analytically this is done very often, because otherwise the equations would become very confusing); or, in case of the neural networks, an additional output neuron is attached and the knowledge of both time series is used for both outputs (fig. B.6).

You will find more, and more general, material on time series in [WG94].

B.5 Remarks on the Prediction of Share Prices

Many people observe the changes of a share price in the past and try to infer the future from those values in order to benefit from this knowledge.

Share prices are discontinuous and therefore they are principally difficult functions. Furthermore, the functions can only be used for discrete values – often, for example, in a daily rhythm (including the maximum and minimum values per day, if we are lucky), with the daily variations certainly being eliminated. But this makes the whole thing even more difficult.

There are chartists, i.e. people who look at many diagrams and decide, by means of a lot of background knowledge and decade-long experience, whether the equities should be bought or not (and often they are very successful).

Apart from the share prices it is very interesting to predict the exchange rates of currencies: if we exchange 100 Euros into Dollars, the Dollars into Pounds and the Pounds back into Euros, it could be possible that we will finally receive 110 Euros. But once this is found out, we would do it more often and thus we would change the exchange rates into a state in which such an increasing circulation would no longer be possible (otherwise we could produce money by generating, so to speak, a financial perpetual motion machine).

At the stock exchange, successful stock and currency brokers raise or lower their thumbs – thus indicating whether in their opinion a share price or an exchange rate will increase. Mathematically speaking, they indicate the first bit (the sign) of the first derivative of the exchange rate. Excellent world-class brokers obtain success rates of about 70% this way.


Figure B.5: Representation of the heterogeneous one-step-ahead prediction. Prediction of a time series under consideration of a second one.

Figure B.6: Heterogeneous one-step-ahead prediction of two time series at the same time.


In Great Britain, the heterogeneous one-step-ahead prediction was successfully used to increase the accuracy of such predictions to 76%: in addition to the time series of the values, indicators such as the oil price in Rotterdam or the US national debt were included.

This is just an example to show the dimension of the accuracy of stock-exchange evaluations, since we are still talking only about the first bit of the first derivative! We still do not know how strong the expected increase or decrease will be, and also whether the effort will pay off: possibly, one wrong prediction could nullify the profit of one hundred correct predictions.

How can neural networks be used to predict share prices? Intuitively, we assume that future share prices are a function of the previous share values.

But this assumption is wrong: share prices are not a function of their past values, but a function of their assumed future value. We do not buy shares because their values have increased during the last days, but because we believe that they will further increase tomorrow. If, as a consequence, many people buy a share, they will boost the price. Therefore their assumption was right – a self-fulfilling prophecy has been generated, a phenomenon long known in economics.

The same is true the other way round: we sell shares because we believe that tomorrow the prices will decrease. This will beat down the prices the next day and generally even more the day after next.

Time and again some software appears which uses key words such as neural networks to purport that it is capable of predicting where share prices are going. Do not buy such software! In addition to the above-mentioned scientific exclusions there is one simple reason for this: if these tools work – why should the manufacturer sell them? Normally, useful economic knowledge is kept secret. If we knew a way to definitely gain wealth by means of shares, we would earn our millions by using this knowledge instead of selling it for 30 euros, wouldn't we?


Appendix C

Excursus: Reinforcement Learning

What if there were no training examples, but it were nevertheless possible to evaluate how well we have learned to solve a problem? Let us regard a learning paradigm that is situated between supervised and unsupervised learning.

I now want to introduce a more exotic approach to learning – just to leave the usual paths. We know learning procedures in which the network is told exactly what to do, i.e. we provide exemplary output values. We also know learning procedures like those of the self-organizing maps, into which only input values are entered.

Now we want to explore something in between: the learning paradigm of reinforcement learning, according to Sutton and Barto [SB98].

Reinforcement learning in itself is no neural network but only one of the three learning paradigms already mentioned in chapter 4. In some sources it is counted among the supervised learning procedures, since feedback is given. Due to its very rudimentary feedback it is reasonable to separate it from the supervised learning procedures – apart from the fact that there are no training examples at all.

While it is generally known that procedures such as backpropagation cannot work in the human brain itself, reinforcement learning is usually considered as being biologically more motivated.

The term reinforcement learning comes from cognitive science and psychology and describes the learning system of carrot and stick, which occurs everywhere in nature, i.e. learning by means of good or bad experience, reward and punishment. But there is no learning aid that exactly explains what we have to do: we only receive a total result for a process (Did we win the game of chess or not? And how sure was this victory?), but no results for the individual intermediate steps.

For example, if we ride our bike with worn tires at a speed of exactly 21.5 km/h through a bend over some sand with a grain size of 0.1 mm on average, then nobody could tell us exactly which


handlebar angle we have to adjust or, even worse, how strongly the great number of muscle parts in our arms or legs have to contract for this. Depending on whether we reach the end of the bend unharmed or not, we soon have to face the good or bad learning experience, i.e. a feedback or a reward. Thus, the reward is very simple – but on the other hand it is considerably easier to obtain. If we have now tested different velocities and bend angles often enough and received some rewards, we will get a feel for what works and what does not. The aim of reinforcement learning is to obtain exactly this feeling.

Another example for the quasi-impossibility of setting up a sort of cost or utility function is a tennis player who tries to maximize his athletic glory for a long time by means of complex movements and ballistic trajectories in three-dimensional space, including the wind direction, the importance of the tournament, private factors and many more.

To get straight to the point: since we receive only little feedback, reinforcement learning often means trial and error – and therefore it is very slow.

C.1 System Structure

Now we want to briefly discuss the different quantities and components of the system. We will define them more precisely in the following sections. Broadly speaking, reinforcement learning represents the mutual interaction between an agent and an environmental system (fig. C.2).

The agent shall solve some problem. It could, for instance, be an autonomous robot that shall avoid obstacles. The agent performs some actions within the environment and in return receives feedback from the environment, which in the following is called reward. This cycle of action and reward is characteristic for reinforcement learning. The agent influences the system, the system provides a reward and then changes.

The reward is a real or discrete scalar which describes, as mentioned above, how well we achieve our aim, but it does not give any guidance on how we can achieve it. The aim is always to make the sum of rewards as high as possible in the long term.

C.1.1 The gridworld

As a learning example for reinforcement learning I would like to use the so-called gridworld. We will see that its structure is very simple and easy to figure out, and therefore reinforcement learning is actually not necessary. However, it is very suitable for representing the approach of reinforcement learning. Now let us define, by way of example, the individual components of the reinforcement system by means of the gridworld. Later, each of these components will be examined more exactly.

Environment: The gridworld (fig. C.1) is a simple, discrete world in two dimensions which in the following we want to use as the environmental system.



Agent: As an agent we use a simple robot situated in our gridworld.

State space: As we can see, our gridworld has 5 × 7 fields, with 6 fields being inaccessible. Therefore, our agent can occupy 29 positions in the gridworld. These positions are regarded as states for the agent.

Action space: The actions are still missing. We simply define that the robot can move one field up or down, to the right or to the left (as long as there is no obstacle or the edge of our gridworld).

Task: Our agent's task is to leave the gridworld. The exit is located on the left of the light-colored field.

Non-determinism: The two obstacles can be connected by a "door". When the door is closed (lower part of the illustration), the corresponding field is inaccessible. The position of the door cannot change during a cycle but only between the cycles.

We now have created a small world that will accompany us through the following learning strategies and illustrate them.
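As a concrete illustration, such a gridworld can be written down in a few lines of Python. The 5 × 7 layout below, the coding of obstacles and a reward of 1 for reaching the exit are assumptions made only for this sketch; the text does not prescribe a particular encoding, and the map is not the book's exact one.

# Illustrative gridworld sketch: '#' = obstacle, 'X' = start, 'E' = exit, '.' = free field.
GRID = ["....E",
        ".##..",
        ".##..",
        ".##..",
        "..X..",
        ".....",
        "....."]                       # assumed 5 x 7 layout with 6 inaccessible fields

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """One interaction: return (new_state, reward). Reward 1 on reaching the exit, else 0 (assumption)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    # moves into obstacles or beyond the edge leave the state unchanged
    if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
        r, c = nr, nc
    reward = 1 if GRID[r][c] == "E" else 0
    return (r, c), reward

# Example: starting at the 'X', moving one field to the right
start = (4, 2)
print(step(start, "right"))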

C.1.2 Agent and environment

Our aim is that the agent learns what happens by means of the reward. Thus, it is trained over, of and by means of a dynamic system, the environment, in order to reach an aim. But what does learning mean in this context?


Figure C.1: A graphical representation of our gridworld. Dark-colored cells are obstacles and therefore inaccessible. The exit is located on the right side of the light-colored field. The symbol × marks the starting position of our agent. In the upper part of the figure the door is open, in the lower part it is closed.


Figure C.2: The agent performs some actions within the environment and in return receives a reward.



The agent shall learn a mapping of situations to actions (called a policy), i.e. it shall learn what to do in which situation to achieve a certain (given) aim. The aim is simply shown to the agent by giving a recompense for the achievement.

Such a recompense must not be mistaken for the reward – on the agent's way to the solution it may sometimes be useful to receive less recompense or even punishment when in return the long-term result is maximal (similar to the situation when an investor just sits out the downturn of the share price, or to a pawn sacrifice in a chess game). So, if the agent is heading in the right direction towards the target, it receives a positive reward, and if not it receives no reward at all or even a negative reward (punishment). The recompense is, so to speak, the final sum of all rewards – which is also called return.

After having colloquially named all the basic components, we want to discuss more precisely which components can be used to make up our abstract reinforcement learning system.

In the gridworld: In the gridworld, the agent is a simple robot that should find the exit of the gridworld. The environment is the gridworld itself, which is a discrete gridworld.

Definition C.1 (Agent): In reinforcement learning the agent can be formally described as a mapping of the situation space S into the action space A(st). The meaning of situations st will be defined later; here it should only indicate that the action space depends on the current situation.

Agent: S → A(st) (C.1)

Definition C.2 (Environment): The environment represents a stochastic mapping of an action A in the current situation st to a reward rt and a new situation st+1.

Environment: S × A → P(S × rt)    (C.2)

C.1.3 States, situations and actions

As already mentioned, an agent can be in different states: in case of the gridworld, for example, it can be in different positions (here we get a two-dimensional state vector).

For an agent it is not always possible to perceive all information about its current state, so that we have to introduce the term situation. A situation is a state from the agent's point of view, i.e. only a more or less precise approximation of a state.

Therefore, situations generally do not allow successor situations to be clearly "predicted" – even with a completely deterministic system this may not be possible. If we knew all states and the transitions between them exactly (thus, the complete system), it would be possible to plan optimally and also easy to find an optimal policy (methods are provided, for example, by dynamic programming).


Now we know that reinforcement learning is an interaction between the agent and the system, including actions at and situations st. The agent cannot determine by itself whether the current situation is good or bad: this is exactly the reason why it receives the said reward from the environment.

In the gridworld: States are positions where the agent can be situated. Simply said, the situations equal the states in the gridworld. Possible actions would be to move towards north, south, east or west.

Remark: Situation and action can be vectorial; the reward, however, is always a scalar (in an extreme case even only a binary value), since the aim of reinforcement learning is to get along with little feedback. A complex vectorial reward would equal a real teaching input.

By the way, the cost function should be minimized, which would not be possible, however, with a vectorial reward, since we do not have any intuitive order relations in multi-dimensional space, i.e. we do not directly know what is better or worse.

Definition C.3 (State): Within its environment the agent is in a state. States contain any information about the agent within the environmental system. Thus, it is theoretically possible to clearly predict a successor state to a performed action within a deterministic system out of this godlike state knowledge.

Definition C.4 (Situation): Situations st (here at time t) of a situation space S are the agent's limited, approximate knowledge about its state. This approximation (about which the agent cannot even know how good it is) makes clear predictions impossible.

Definition C.5 (Action): Actions at can be performed by the agent (whereby it could be possible that, depending on the situation, a different action space A(S) exists).

JA(S)They cause state transitions and thereforea new situation from the agent’s point ofview.

C.1.4 Reward and return

As in real life it is our aim to receive a recompense that is as high as possible, i.e. to maximize the sum of the expected rewards r, called return R, in the long term. For finitely many time steps¹ the rewards can simply be added:

Rt = rt+1 + rt+2 + . . . (C.3)
   = ∑_{x=1}^{∞} rt+x (C.4)

Certainly, the return is only estimated here (if we knew all rewards and therefore the return completely, it would no longer be necessary to learn).

Definition C.6 (Reward): A reward rt is a scalar, real or discrete (sometimes even only binary) reward or punishment which the environmental system returns to the agent as a reaction to an action.

Definition C.7 (Return): The return Rt is the accumulation of all received rewards until time t.

¹ In practice, only finitely many time steps will be possible, even though the formulas are stated with an infinite sum in the first place.

C.1.4.1 Dealing with large time spaces

However, not every problem has an explicit target and therefore a finite sum (e.g. our agent can be a robot whose task is to drive around again and again and to avoid obstacles). In order not to receive a diverging sum in the case of an infinite series of reward estimations, a weakening factor 0 < γ < 1 is used, which weakens the influence of future rewards. This is not only useful if there exists no target but also if the target is very far away:

Rt = rt+1 + γrt+2 + γ²rt+3 + . . . (C.5)
   = ∑_{x=1}^{∞} γ^{x−1} rt+x (C.6)

The farther away the reward is, the smaller is the part it plays in the agent's decisions.

Another possibility to handle the return sum would be a limited time horizon τ, so that only the τ following rewards rt+1, . . . , rt+τ are regarded:

Rt = rt+1 + . . . + γ^{τ−1} rt+τ (C.7)
   = ∑_{x=1}^{τ} γ^{x−1} rt+x (C.8)

Thus, we divide the timeline into episodes. Usually, one of the two methods is used to limit the sum, if not both methods together.
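To make the return definitions C.3–C.8 concrete, here is a minimal sketch in plain Python (not part of the original text): the same helper computes the discounted return, optionally cut off after a limited time horizon τ.

def discounted_return(rewards, gamma=0.9, horizon=None):
    """rewards[0] plays the role of r_{t+1}, rewards[1] of r_{t+2}, and so on."""
    if horizon is not None:                 # limited time horizon tau (eq. C.7/C.8)
        rewards = rewards[:horizon]
    return sum(gamma ** (x - 1) * r for x, r in enumerate(rewards, start=1))

# The closer a reward is, the more it contributes (cf. eq. C.5/C.6):
print(discounted_return([0, 0, 0, 1], gamma=0.5))   # 0.125
print(discounted_return([1, 0, 0, 0], gamma=0.5))   # 1.0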

As in daily life, we try to bring our current situation closer to a desired state. Since it is not necessarily only the next expected reward but the expected total sum that decides what the agent will do, it is also possible to perform actions that, on short notice, result in a negative reward (e.g. the pawn sacrifice in a chess game) but will pay off later.

C.1.5 The policy

After having regarded and formalized some system components of reinforcement learning, the actual aim is still to be discussed:

During reinforcement learning the agent learns a policy

Π : S → P (A).

Thus, it continuously adjusts a mapping of the situations to the probabilities P(A) with which any action A is performed in any situation S. A policy can be defined as a strategy to select actions that would maximize the reward in the long term.

In the gridworld: In the gridworld the policy is the strategy according to which the agent tries to exit the gridworld.

Definition C.8 (Policy): The policy Π is a mapping of situations to probabilities of performing every action out of the action space. So it can be formalized as

Π : S → P (A). (C.9)

Basically, we distinguish between two policy paradigms: An open-loop policy represents an open control chain and creates, out of an initial situation s0, a sequence of actions a0, a1, . . .


with ai ≠ ai(si); i > 0. Thus, in the beginning the agent develops a plan and consecutively executes it to the end without considering the interim situations (therefore ai ≠ ai(si), i.e. actions after a0 do not depend on the situations).

In the gridworld: In the gridworld, an open-loop policy would provide a precise sequence of directions towards the exit, such as the way from the given starting position (in abbreviations of the directions) EEEEN.

So an open-loop policy is a sequence of actions without interim feedback. A sequence of actions is generated out of a starting situation. If the system is known exactly, such an open-loop policy can be used successfully and lead to useful results. But to know, for example, the game of chess exactly, it would be necessary to try every possible move, which would be very time-consuming. Thus, for such problems we have to find an alternative to the open-loop policy which incorporates the current situations into the action plan:

A closed-loop policy is a closed loop, in a manner of speaking a function

Π : si → ai with ai = ai(si).

Here, the environment influences our action, or the agent responds to the input of the environment, respectively, as already illustrated in fig. C.2. A closed-loop policy, so to speak, is a reactive plan to map current situations to actions to be performed.

When selecting the actions to be performed, again two basic strategies can be regarded.

In the gridworld: A closed-loop policy would respond to the current position and choose the direction accordingly. In particular, when an obstacle appears dynamically, such a policy is the better choice.

C.1.5.1 Exploitation vs. exploration

As in real life, during reinforcement learning the question often arises whether the existing knowledge should only be exploited or whether new ways should also be explored. Initially, we want to discuss the two extremes:

A greedy policy always chooses the way of the highest reward that can be determined in advance, i.e. the way of the highest known reward. This policy represents the exploitation approach and is very promising when the system used is already known.

In contrast to the exploitation approach, the aim of the exploration approach is to explore a system in as much detail as possible, so that ways leading to the target can also be found which may seem not very promising at first glance but are, nevertheless, very successful.

Let us assume that we are looking for the way to a restaurant. A safe policy would be to always take the way we already know, no matter how suboptimal and long it may be, and not to try to explore better ways.


Another approach would be to explore shorter ways every now and then, even at the risk of taking a long time and being unsuccessful, so that we finally have to take the original way after all and arrive too late at the restaurant.

In reality, a combination of both methods is often applied: at the beginning of the learning process, exploration takes place with a higher probability, while at the end the existing knowledge is exploited more. Here, a static probability distribution is also possible and often applied.

In the gridworld: For finding the way in the gridworld, the restaurant example applies equally.
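The combination described above is often realized as an ε-greedy action selection with a decaying exploration probability. The following is a minimal sketch under these assumptions (the function names, the linear decay schedule and the representation of Q as a dict keyed by (situation, action) are all made up for illustration):

import random

def epsilon_greedy(Q, situation, actions, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)                                   # exploration
    return max(actions, key=lambda a: Q.get((situation, a), 0.0))       # exploitation

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10000):
    """Linearly decaying exploration probability: explore a lot early, little later."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)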

C.2 Learning process

Let us again take a look at daily life. Actions can lead us from one situation into different subsituations, from each subsituation into further sub-subsituations. In a sense, we get a situation tree where links between the nodes must be considered (often there are several ways to reach a situation – so the tree could more accurately be referred to as a situation graph). The leaves of such a tree are the end situations of the system. The exploration approach would search the tree as thoroughly as possible and become acquainted with all leaves. The exploitation approach would unerringly go to the best known leaf.

Analogously to the situation tree, we can also create an action tree. Here, the rewards for the actions are attached to the nodes.

Now we have to adapt from daily life how exactly we learn.

C.2.1 Rewarding strategies

The question of what is rewarded and what kind of reward is given is interesting and very important, since the design of the reward significantly controls system behavior. As we have seen above, there are generally (again, as in daily life) various actions that can be performed in any situation. There are different strategies to evaluate the selected situations and to learn which series of actions would lead to the target. First of all, this principle should be explained in the following.

We now want to indicate some extreme cases as design examples for the reward:

A rewarding scheme similar to that of a chess game is referred to as pure delayed reward: we only receive the reward at the end of the game and not during it. This method is always advantageous when we can finally say whether we were successful or not, but the interim steps do not allow an estimation of our situation. If we win, then

rt = 0 ∀t < τ (C.10)

as well as rτ = 1. If we lose, then rτ = −1. With this rewarding strategy a reward is only returned by the leaves of the situation tree.

Pure negative reward: Here,

rt = −1 ∀t < τ. (C.11)


This system finds the most rapid way to reach the target, because this way is automatically the most favorable one in respect of the reward. The agent receives punishment for anything it does – even if it does nothing. As a result it is least expensive for the agent to reach the target fast.

Another strategy is the avoidance strategy: harmful situations are avoided. Here,

rt ∈ {0, −1}. (C.12)

Most situations do not receive any reward; only a few of them receive a negative reward. The agent will avoid getting too close to such negative situations.

Warning: Rewarding strategies can have unexpected consequences. A robot that is told "have it your own way, but if you touch an obstacle you will be punished" will simply stand still. If standing still is also punished, it will drive in small circles. Reconsidering this, we will understand that this behavior optimally maximizes the robot's return but unfortunately was not what we intended.

Furthermore, we can show that especially small tasks can be solved better by means of negative rewards, while positive, more differentiated rewards are useful for large, complex tasks.
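The reward designs discussed above can be written down as tiny reward functions. The following Python sketch is only illustrative and not from the original text; the function names are made up, and the value returned by the pure negative reward at the final step τ is an assumption, since the text only specifies rt for t < τ.

def pure_delayed_reward(t, tau, won):
    """0 during the game; +1 or -1 only at the final time step tau (chess-like)."""
    if t < tau:
        return 0
    return 1 if won else -1

def pure_negative_reward(t, tau):
    """-1 for every step before the end, so the fastest way is automatically the cheapest."""
    return -1 if t < tau else 0     # value at t == tau is an assumption

def avoidance_reward(situation, harmful_situations):
    """0 almost everywhere, -1 only for situations that are to be avoided."""
    return -1 if situation in harmful_situations else 0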

For our gridworld we want to apply the pure negative reward strategy: the robot shall find the exit as fast as possible.

Figure C.3: Representation of each optimal return per field in our gridworld by means of pure negative reward awarding, at the top with an open and at the bottom with a closed door.

C.2.2 The state-value function

Unlike our agent, we have a godlike view of our gridworld, so that we can swiftly determine which robot starting position can provide which optimal return.

In figure C.3 these optimal returns are shown per field.

In the gridworld: The state-value function for our gridworld exactly represents such a function per situation (= position), with the difference being that here the function is unknown and has to be learned.

Thus, we can see that it would be more practical for the robot to be capable of evaluating the current and future situations. So let us take a look at another system component of reinforcement learning: the state-value function V(s), which with regard to a policy Π is often called VΠ(s), because whether a situation is bad often depends on the general behavior Π of the agent.

A situation that is bad under a policy that seeks risks and explores limits would be, for instance, if an agent on a bicycle turns a corner and the front wheel begins to slide out; due to its daredevil policy the agent would not brake in this situation. With a risk-aware policy the same situation would look much better, and thus it would be evaluated higher by a good state-value function.

The state-value function VΠ(s) simply returns the value the current situation s has for the agent under policy Π. Abstractly speaking, according to the above definitions, the value of the state-value function corresponds to the expected return Rt of a situation st. Here EΠ denotes the expected value under Π and the current situation st:

VΠ(s) = EΠ{Rt | s = st}

Definition C.9 (State-value function): The state-value function VΠ(s) has the task of determining the value of situations under a policy, i.e. to answer the agent's question of whether a situation s is good or bad or how good or bad it is. For this purpose it returns the expectation of the return under the situation:

VΠ(s) = EΠ{Rt | s = st} (C.13)

The optimal state-value function is called V ∗Π(s).

Unfortunately, unlike us, our robot does not have a godlike view of its environment. It does not have a table with optimal returns like the one shown above to orient itself by. The aim of reinforcement learning is that the robot generates its state-value function little by little on the basis of the returns of many trials, and approximates the optimal state-value function V ∗ (if there is one).

In this context I want to introduce two terms closely related to the cycle between state-value function and policy:

C.2.2.1 Policy evaluation

Policy evaluation is the approach of trying a policy a few times, obtaining many rewards that way, and gradually accumulating a state-value function by means of these rewards.

C.2.2.2 Policy improvement

Policy improvement means improving a policy itself, i.e. turning it into a new and better one. In order to improve the policy we have to aim at the return finally having a larger value than before, i.e. until we have found a shorter way to the restaurant and have walked it successfully.

The principle of reinforcement learning is to realize an interaction. We try to evaluate how good a policy is in individual situations. The changed state-value function


Figure C.4: The cycle of reinforcement learning, which ideally leads to an optimal Π∗ and V ∗.

provides information about the system, by means of which we again improve our policy. These two values mutually improve each other, which can be proved mathematically, so that the final result is an optimal policy Π∗ and an optimal state-value function V ∗ (fig. C.4). This cycle sounds simple but is very time-consuming.

At first, let us regard a simple, random policy by means of which our robot could slowly fill in and improve its state-value function without any previous knowledge.

C.2.3 Monte Carlo method

The easiest approach to accumulate a state-value function is mere trial and error. Thus, we select a randomly behaving policy which does not consider the accumulated state-value function for its random decisions. It can be proved that at some point we will find the exit of our gridworld by chance.

Inspired by random-based games of chance, this approach is called the Monte Carlo method.

If we additionally assume a pure negative reward, it is obvious that we can receive an optimum value of −6 for our starting field in the state-value function. Depending on the random way the random policy takes, other (smaller) values than −6 can occur for the starting field. Intuitively, we want to memorize only the better value for a state (i.e. a field). But here caution is advised: in this way, the learning procedure would work only with deterministic systems. Our door, which can be open or closed during a cycle, would produce oscillations for all fields whose shortest way to the target is influenced by it.

With the Monte Carlo method we prefer to use the learning rule²

V(st)new = V(st)old + α(Rt − V(st)old),

in which the update of the state-value function is obviously influenced by both the old state value and the received return (α is the learning rate). Thus, the agent gets some kind of memory; new findings always change the situation value just a little bit. An exemplary learning step is shown in fig. C.5.

In this example, the computation of the state value was applied for only one single state (our initial state). It should be obvious that it is possible (and often done) to train the values for the states visited in between (in case of the gridworld our ways to the target) at the same time. The result

² The learning rule is, among others, derived by means of the Bellman equation, but this derivation is not discussed in this chapter.


of such a calculation related to our example is illustrated in fig. C.6.

The Monte Carlo method seems to be suboptimal, and usually it is significantly slower than the following methods of reinforcement learning. But this method is the only one for which it can be mathematically proved that it works, and therefore it is very useful for theoretical considerations.

Definition C.10 (Monte Carlo learning): Actions are randomly performed regardless of the state-value function, and in the long term an expressive state-value function is accumulated by means of the following learning rule:

V(st)new = V(st)old + α(Rt − V(st)old).
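A minimal sketch of this learning rule in Python, assuming the state-value function is stored in a plain dict with hashable states (the function name and the example returns below are made up for illustration and do not reproduce the exact numbers of fig. C.5):

def monte_carlo_update(V, state, observed_return, alpha=0.5):
    """V(s_t)_new = V(s_t)_old + alpha * (R_t - V(s_t)_old); V is a plain dict."""
    old = V.get(state, 0.0)
    V[state] = old + alpha * (observed_return - old)
    return V[state]

# Two illustrative random episodes from the start field (returns are made up):
V = {}
monte_carlo_update(V, "start", -14.0)   # 0.0  -> -7.0
monte_carlo_update(V, "start", -6.0)    # -7.0 -> -6.5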

C.2.4 Temporal difference learning

Most of our learning is the result of experience, e.g. walking or riding a bicycle without getting injured (or not); even mental skills like mathematical problem solving benefit a lot from experience and simple trial and error. Thus, we initialize our policy with arbitrary values – we try, learn and improve the policy through experience (fig. C.7). In contrast to the Monte Carlo method, we want to do this in a more directed manner.

Just as we learn from experience to react to different situations in different ways, the temporal difference learning method (abbreviated: TD learning) does the same by training VΠ(s) (i.e. the agent learns to

Figure C.5: Application of the Monte Carlo learning rule with a learning rate of α = 0.5. Top: two exemplary ways the agent randomly selects are shown (one with an open and one with a closed door). Bottom: the result of the learning rule for the value of the initial state, considering both ways. Since in the course of time many different ways are taken under the random policy, a very expressive state-value function results.


Figure C.6: Extension of the learning example in fig. C.5 in which the returns for intermediate states are also used to accumulate the state-value function. Here, the low value of the door field can be seen clearly: if this state is possible, it must be very positive. If the door is closed, this state is impossible.

Figure C.7: We try different actions within the environment and as a result we learn and improve the policy.

estimate which situations are worth a lot and which are not). Again the current situation is identified with st, the following situations with st+1 and so on. Thus, the learning formula for the state-value function VΠ(st) is

V(st)new = V(st) + α(rt+1 + γV(st+1) − V(st)),

where the α(. . .) term is the change applied to the previous value.

We can see that the change in the value of the current situation st, which is proportional to the learning rate α, is influenced by

. the received reward rt+1,

. the previous value of the following situation V(st+1), weighted with the factor γ,

. the previous value of the situation V(st).

Definition C.11 (Temporal difference learning): Unlike the Monte Carlo method, TD learning looks ahead by regarding the following situation st+1. Thus, the learning rule is given by

V(st)new = V(st) + α(rt+1 + γV(st+1) − V(st)). (C.14)
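A minimal sketch of one TD update step, again with the state-value function as a plain dict defaulting to 0 (the function name and default parameter values are assumptions for illustration):

def td_update(V, s_t, reward, s_next, alpha=0.1, gamma=0.9):
    """One TD step: nudge V(s_t) toward r_{t+1} + gamma * V(s_{t+1})."""
    old = V.get(s_t, 0.0)
    target = reward + gamma * V.get(s_next, 0.0)
    V[s_t] = old + alpha * (target - old)
    return V[s_t]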

C.2.5 The action-value function

Analogous to the state-value function VΠ(s), the action-value function QΠ(s, a) is another system component of reinforcement learning, which evaluates a


Figure C.8: Exemplary values of an action-value function for the position ×. Moving right, one remains on the fastest way towards the target; moving up is still a quite fast way; moving down is not a good way at all (provided that the door is open in all cases).

certain action a under a certain situation s and the policy Π.

In the gridworld: In the gridworld, the action-value function tells us how good it is to move from a certain field in a certain direction (fig. C.8).

Definition C.12 (Action-value function): Like the state-value function, the action-value function QΠ(st, a) evaluates certain actions on the basis of certain situations under a policy. The optimal action-value function is called Q∗Π(st, a).

As shown in fig. C.9, the actions are performed until a target situation (here referred to as sτ) is achieved (if a target situation exists; otherwise the actions are simply performed again and again).

C.2.6 Q learning

This implies the following learning formula for the action-value function QΠ(s, a), and – analogously to TD learning – its application is called Q learning:

Q(st, a)new = Q(st, a) + α(rt+1 + γ max_a Q(st+1, a) − Q(st, a)),

where γ max_a Q(st+1, a) reflects the greedy strategy and the whole α(. . .) term is the change of the previous value.

Again we break down the change of the current action value (proportional to the learning rate α) under the current situation. It is influenced by

. the received reward rt+1,

. the maximum action value over the following actions, weighted with γ (here a greedy strategy is applied, since it can be assumed that the best known action is selected; with TD learning, on the other hand, we do not care about always getting into the best known next situation),

. the previous value of the action under our situation st, known as Q(st, a) (remember that this is also weighted by means of α).

Usually, the action-value function learns considerably faster than the state-value function. But we must not disregard that reinforcement learning is generally quite slow: the system has to find out by itself what is good. But the advantage of Q


Figure C.9: Actions are performed until the desired target situation is achieved. Attention should be paid to the numbering: rewards are numbered beginning with 1, actions and situations beginning with 0 (this numbering has been adopted as tried and true).

learning is: Π can be initialized arbitrarily, and by means of Q learning the result is always Q∗.

Definition C.13 (Q learning): Q learning trains the action-value function by means of the learning rule

Q(st, a)new = Q(st, a) + α(rt+1 + γ max_a Q(st+1, a) − Q(st, a)) (C.15)

and thus finds Q∗ in any case.
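A minimal sketch of one Q learning step under the assumption that Q is stored as a dict keyed by (situation, action) pairs; the function name and default parameters are made up for illustration, and the max over the following actions is the greedy part of the update (eq. C.15):

def q_update(Q, s_t, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q learning step; the max over the following actions is the greedy part."""
    old = Q.get((s_t, a), 0.0)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s_t, a)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(s_t, a)]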

C.3 Example Applications

C.3.1 TD gammon

TD gammon is a very successful backgammon program based on TD learning, invented by Gerald Tesauro. The situation here is the current configuration of the board. Anyone who has ever

played backgammon knows that the situation space is huge (approx. 10^20 situations). As a result, the state-value function cannot be computed explicitly (particularly in the late eighties, when TD gammon was introduced). The selected rewarding strategy was the pure delayed reward, i.e. the system does not receive the reward until the end of the game, and this reward is at the same time the return. The system was then allowed to practice (initially against a backgammon program, then against an instance of itself). The result was that it achieved the highest ranking in a computer backgammon league and strikingly disproved the theory that a computer program is not capable of mastering a task better than its programmer.

C.3.2 The car in the pit

Let us take a look at a car parked on a one-dimensional road at the bottom of a deep pit, unable to get over the slope on either side straight away by means of its engine power alone in order to leave


the pit. Trivially, the executable actions here are the possibilities of driving forwards and backwards. The intuitive solution we think of immediately is to move backwards, gain momentum at the opposite slope, and oscillate back and forth in this way several times to dash out of the pit.

The actions of a reinforcement learning system would be "full throttle forward", "full reverse" and "doing nothing".

Here, "everything costs" would be a good choice for awarding the reward, so that the system learns fast how to leave the pit and realizes that our problem cannot be solved by means of mere forward-directed engine power. So the system will slowly build up momentum.

The policy can no longer be stored as a table, since the state space is hard to discretize. A function has to be generated as the policy.

C.3.3 The pole balancer

The pole balancer was developed by Barto, Sutton and Anderson.

Let a situation be given that includes a vehicle capable of moving either to the right at full throttle or to the left at full throttle (bang-bang control). Only these two actions can be performed; standing still is impossible. On the top of this car an upright pole is hinged that could tip over to either side. The pole is built in such a way that it always tips over to one side, so it never stands still (let us assume that the pole is rounded at the lower end).

The angle of the pole relative to the vertical line is referred to as α. Furthermore, the vehicle always has a position x in our one-dimensional world and a velocity ẋ. Our one-dimensional world is limited, i.e. there are maximum and minimum values x can adopt.

The aim of our system is to learn to steer the car in such a way that it can balance the pole, i.e. to prevent the pole from tipping over. This is best achieved by an avoidance strategy: as long as the pole is balanced the reward is 0; if the pole tips over, the reward is −1.

Interestingly, the system is soon capable of keeping the pole balanced by tilting it sufficiently fast and with small movements. In doing so, the system mostly stays in the center of the space, since this is farthest from the walls, which it understands as negative (if it touches a wall, the pole will tip over).

C.3.3.1 Swinging up an inverted pendulum

Here the initial situation is more difficult for the system: initially the pole hangs down, has to be swung up over the vehicle and finally has to be stabilized. In the literature this task is called swinging up an inverted pendulum.


C.4 Reinforcement Learning in Connection with Neural Networks

Finally, the reader may ask why a paper on "neural networks" includes a chapter about reinforcement learning.

The answer is very simple. We have already been introduced to supervised and unsupervised learning procedures. Although we do not always have an omniscient teacher who makes supervised learning possible, this does not mean that we do not receive any feedback at all. There is often something in between, some kind of criticism or school mark. Problems like this can be solved by means of reinforcement learning.

But not every problem is as easy to solve as our gridworld: in our backgammon example we have approx. 10^20 situations, and the situation tree has a large branching factor, let alone other games. Here, the tables used in the gridworld can no longer be used as state- and action-value functions. Thus, we have to find approximators for these functions.

And which learning approximators for these reinforcement learning components immediately come to mind? Exactly: neural networks.

Exercises

Exercise 19: A robot control system shall be persuaded by means of reinforcement learning to find a strategy in order to exit a labyrinth as fast as possible.

. What could an appropriate state-value function look like?

. How would you generate an appropriate reward?

Assume that the robot is capable of avoiding obstacles and at any time knows its position (x, y) and orientation φ.

Exercise 20: Describe the function of the two components ASE and ACE as they have been proposed by Barto, Sutton and Anderson to control the pole balancer.

Bibliography: [BSA83].

Exercise 21: Indicate several "classical" problems of computer science which could be solved efficiently by means of reinforcement learning. Please give reasons for your answers.


Bibliography

[And72] James A. Anderson. A simple neural network generating an interactive memory. Mathematical Biosciences, 14:197–220, 1972.

[BSA83] A. Barto, R. Sutton, and C. Anderson. Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):834–846, September 1983.

[CG87] G. A. Carpenter and S. Grossberg. ART2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26:4919–4930, 1987.

[CG88] M.A. Cohen and S. Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. Computer Society Press Technology Series Neural Networks, pages 70–81, 1988.

[CG90] G. A. Carpenter and S. Grossberg. ART 3: Hierarchical search using chemical transmitters in self-organising pattern recognition architectures. Neural Networks, 3(2):129–152, 1990.

[CH67] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[CR00] N.A. Campbell and J.B. Reece. Biologie. Spektrum Akademischer Verlag, 2000.

[Cyb89] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

[DHS01] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern classification. Wiley, New York, 2001.

[Elm90] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, April 1990.

[Fah88] S. E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, CMU, 1988.


[FMI83] K. Fukushima, S. Miyake, and T. Ito. Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):826–834, September/October 1983.

[Fri94] B. Fritzke. Fast learning with incremental RBF networks. Neural Processing Letters, 1(1):2–5, 1994.

[GKE01a] N. Goerke, F. Kintzler, and R. Eckmiller. Self organized classification of chaotic domains from a nonlinear attractor. In Neural Networks, 2001. Proceedings. IJCNN'01. International Joint Conference on, volume 3, 2001.

[GKE01b] N. Goerke, F. Kintzler, and R. Eckmiller. Self organized partitioning of chaotic attractors for control. Lecture Notes in Computer Science, pages 851–856, 2001.

[Gro76] S. Grossberg. Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23:121–134, 1976.

[GS06] Nils Goerke and Alexandra Scherbart. Classification using multi-soms and multi-neural gas. In IJCNN, pages 3895–3902, 2006.

[Heb49] Donald O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley, New York, 1949.

[Hop82] John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. of the National Academy of Science, USA, 79:2554–2558, 1982.

[Hop84] J.J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092, 1984.

[HT85] J.J. Hopfield and D.W. Tank. Neural computation of decisions in optimization problems. Biological Cybernetics, 52(3):141–152, 1985.

[Jor86] M. I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Conference of the Cognitive Science Society, pages 531–546. Erlbaum, 1986.

[Kau90] L. Kaufman. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.

[Koh72] T. Kohonen. Correlation matrix memories. IEEE Transactions on Computers, C-21:353–359, 1972.


[Koh82] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.

[Koh89] Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, third edition, 1989.

[Koh98] T. Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1–6, 1998.

[KSJ00] E.R. Kandel, J.H. Schwartz, and T.M. Jessell. Principles of neural science. Appleton & Lange, 2000.

[lCDS90] Y. le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan Kaufmann, 1990.

[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics and Probability, Vol. 1, pages 281–296, 1967.

[MBS93] Thomas M. Martinetz, Stanislav G. Berkovich, and Klaus J. Schulten. 'Neural-gas' network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4(4):558–569, 1993.

[MP43] W.S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5(4):115–133, 1943.

[MP69] M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, Mass., 1969.

[MR86] J. L. McClelland and D. E. Rumelhart. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2. MIT Press, Cambridge, 1986.

[Par87] David R. Parker. Optimal algorithms for adaptive networks: Second order back propagation, second order direct propagation, and second order Hebbian learning. In Maureen Caudill and Charles Butler, editors, IEEE First International Conference on Neural Networks (ICNN'87), volume II, pages II-593–II-600, San Diego, CA, June 1987. IEEE.

[PG89] T. Poggio and F. Girosi. A theory of networks for approximation and learning. MIT Press, Cambridge, Mass., 1989.

[Pin87] F. J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229–2232, 1987.


[PM47] W. Pitts and W.S. McCulloch. How we know universals: the perception of auditory and visual forms. Bulletin of Mathematical Biology, 9(3):127–147, 1947.

[RD05] G. Roth and U. Dicke. Evolution of the brain and intelligence. Trends in Cognitive Sciences, 9(5):250–257, 2005.

[RHW86a] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, October 1986.

[RHW86b] David E. Rumelhart, Geoffrey E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP research group, editors, Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations. MIT Press, 1986.

[Ros58] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.

[Ros62] F. Rosenblatt. Principles of Neurodynamics. Spartan, New York, 1962.

[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[SG06] A. Scherbart and N. Goerke. Unsupervised system for discovering patterns in time-series, 2006.

[SGE05] Rolf Schatten, Nils Goerke, and Rolf Eckmiller. Regional and online learnable fields. In Sameer Singh, Maneesha Singh, Chidanand Apte, and Petra Perner, editors, ICAPR (2), volume 3687 of Lecture Notes in Computer Science, pages 74–83. Springer, 2005.

[Ste61] K. Steinbuch. Die Lernmatrix. Kybernetik (Biological Cybernetics), 1:36–45, 1961.

[vdM73] C. von der Malsburg. Self-organizing of orientation sensitive cells in striate cortex. Kybernetik, 14:85–100, 1973.

[Was89] P. D. Wasserman. Neural Computing Theory and Practice. Van Nostrand Reinhold, New York, 1989.

[Wer74] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.

[Wer88] P. J. Werbos. Backpropagation: Past and future. In Proceedings ICNN-88, San Diego, pages 343–353, 1988.


[WG94] A.S. Weigend and N.A. Gershenfeld. Time series prediction. Addison-Wesley, 1994.

[WH60] B. Widrow and M. E. Hoff. Adaptive switching circuits. In Proceedings WESCON, pages 96–104, 1960.

[Wid89] R. Widner. Single-stage logic. AIEE Fall General Meeting, 1960. Wasserman, P., Neural Computing, Theory and Practice, Van Nostrand Reinhold, 1989.

[Zel94] Andreas Zell. Simulation Neuronaler Netze. Addison-Wesley, 1994. German.


List of Figures

1.1 Robot with 8 sensors and 2 motors . . . . . . . . . . . . . . . . . . . . . 61.3 Black box with eight inputs and two outputs . . . . . . . . . . . . . . . 71.2 Learning examples for the example robot . . . . . . . . . . . . . . . . . 81.4 Institutions of the field of neural networks . . . . . . . . . . . . . . . . . 9

2.1 Central Nervous System . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.2 Brain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3 Biological Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.4 Action Potential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.5 Complex Eyes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1 Data processing of a neuron . . . . . . . . . . . . . . . . . . . . . . . . . 363.2 Various popular activation functions . . . . . . . . . . . . . . . . . . . . 393.3 Feedforward network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.4 Feedforward network with shortcuts . . . . . . . . . . . . . . . . . . . . 423.5 Directly recurrent network . . . . . . . . . . . . . . . . . . . . . . . . . . 423.6 Indirectly recurrent network . . . . . . . . . . . . . . . . . . . . . . . . . 433.7 Laterally recurrent network . . . . . . . . . . . . . . . . . . . . . . . . . 443.8 Completely linked network . . . . . . . . . . . . . . . . . . . . . . . . . . 453.10 Examples for different types of neurons . . . . . . . . . . . . . . . . . . 463.9 Example network with and without bias neuron . . . . . . . . . . . . . . 47

4.1 Training examples and network capacities . . . . . . . . . . . . . . . . . 564.2 Learning curve with different scalings . . . . . . . . . . . . . . . . . . . 594.3 Gradient descent, 2D visualization . . . . . . . . . . . . . . . . . . . . . 614.4 Possible errors during a gradient descent . . . . . . . . . . . . . . . . . . 634.5 The 2-spiral problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654.6 Checkerboard problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.1 The perceptron in three different views . . . . . . . . . . . . . . . . . . . 725.2 Single-layer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745.3 Single-layer perceptron with several output neurons . . . . . . . . . . . . 745.4 AND and OR single-layer perceptron . . . . . . . . . . . . . . . . . . . . 75

213

Page 230: Neural Networks

List of Figures dkriesel.com

5.5 Error surface of a network with 2 connections . . . . . . . . . . . . . . . 785.6 Sketch of a XOR-SLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825.7 Two-dimensional linear separation . . . . . . . . . . . . . . . . . . . . . 825.8 Three-dimensional linear separation . . . . . . . . . . . . . . . . . . . . 835.9 The XOR network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845.10 Multi-layer perceptrons and output sets . . . . . . . . . . . . . . . . . . 855.11 Position of an inner neuron for derivation of backpropagation . . . . . . 875.12 Illustration of the backpropagation derivation . . . . . . . . . . . . . . . 895.13 Fermi function and hyperbolic tangent . . . . . . . . . . . . . . . . . . . 955.14 Momentum term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965.15 Functionality of 8-2-8 encoding . . . . . . . . . . . . . . . . . . . . . . . 98

6.1 RBF network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1036.2 Distance function in the RBF network . . . . . . . . . . . . . . . . . . . 1046.3 Individual Gaussian bells in one- and two-dimensional space . . . . . . . 1056.4 Accumulating Gaussian bells in one-dimensional space . . . . . . . . . . 1056.5 Accumulating Gaussian bells in two-dimensional space . . . . . . . . . . 1066.6 Even coverage of an input space with radial basis functions . . . . . . . 1126.7 Uneven coverage of an input space with radial basis functions . . . . . . 1136.8 Random, uneven coverage of an input space with radial basis functions . 113

7.1 Roessler attractor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1187.2 Jordan network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1197.3 Elman network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1207.4 Unfolding in Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

8.1 Hopfield network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1268.2 Binary threshold function . . . . . . . . . . . . . . . . . . . . . . . . . . 1288.3 Convergence of a Hopfield network . . . . . . . . . . . . . . . . . . . . . 1308.4 Fermi function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

9.1 Examples for quantization . . . . . . . . . . . . . . . . . . . . . . . . . . 137

10.1 Example topologies of a SOM . . . . . . . . . . . . . . . . . . . . . . . . 14410.2 Example distances of SOM topologies . . . . . . . . . . . . . . . . . . . 14710.3 SOM topology functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 14910.4 First example of a SOM . . . . . . . . . . . . . . . . . . . . . . . . . . . 15010.7 Topological defect of a SOM . . . . . . . . . . . . . . . . . . . . . . . . . 15210.5 Training a SOM with one-dimensional topology . . . . . . . . . . . . . . 15310.6 SOMs with one- and two-dimensional topologies and different input spaces15410.8 Resolution optimization of a SOM to certain areas . . . . . . . . . . . . 156

214 D. Kriesel – A Brief Introduction to Neural Networks (EPSILON2-EN)

Page 231: Neural Networks

dkriesel.com List of Figures

10.9 Figure to be classified by neural gas . . . . . . . . . . . . . . . . . . . . 158

11.1 Structure of an ART network . . . . . . . . . . . . . . . . . . . . . . . . 16411.2 Learning process of an ART network . . . . . . . . . . . . . . . . . . . . 166

A.1 Comparing cluster analysis methods . . . . . . . . . . . . . . . . . . . . 172A.2 ROLF neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174A.3 Clustering by means of a ROLF . . . . . . . . . . . . . . . . . . . . . . . 177

B.1 Neural network reading time series . . . . . . . . . . . . . . . . . . . . . 180B.2 One-step-ahead prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 182B.3 Two-step-ahead prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 184B.4 Direct two-step-ahead prediction . . . . . . . . . . . . . . . . . . . . . . 184B.5 Heterogeneous one-step-ahead prediction . . . . . . . . . . . . . . . . . . 186B.6 Heterogeneous one-step-ahead prediction with two outputs . . . . . . . . 186

C.1 Gridworld . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191C.2 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191C.3 Gridworld with optimal returns . . . . . . . . . . . . . . . . . . . . . . . 197C.4 Reinforcement learning cycle . . . . . . . . . . . . . . . . . . . . . . . . 199C.5 The Monte Carlo method . . . . . . . . . . . . . . . . . . . . . . . . . . 200C.6 Extended Monte Carlo method . . . . . . . . . . . . . . . . . . . . . . . 201C.7 Improving the policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201C.8 Action-value function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202C.9 Reinforcement learning timeline . . . . . . . . . . . . . . . . . . . . . . . 203

D. Kriesel – A Brief Introduction to Neural Networks (EPSILON2-EN) 215

Page 232: Neural Networks
Page 233: Neural Networks

List of Tables

1.1 Brain vs. Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

4.1 3-bit parity function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.1 Definition of the XOR function . . . . . . . . . . . . . . . . . . . . . . . 815.2 Number of linearly separable functions . . . . . . . . . . . . . . . . . . . 835.3 classifiable sets per perceptron . . . . . . . . . . . . . . . . . . . . . . . 86

217

Page 234: Neural Networks
Page 235: Neural Networks

Index

*100-step rule . . . . . . . . . . . . . . . . . . . . . . . . 5

AAction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193action potential . . . . . . . . . . . . . . . . . . . . 23action space . . . . . . . . . . . . . . . . . . . . . . . 193action-value function . . . . . . . . . . . . . . 201activation . . . . . . . . . . . . . . . . . . . . . . . . . . 37activation function . . . . . . . . . . . . . . . . . 38

selection of . . . . . . . . . . . . . . . . . . . . . 93ADALINE . . see adaptive linear neuronadaptive linear element . . . see adaptive

linear neuronadaptive linear neuron . . . . . . . . . . . . . . 10adaptive resonance theory . . . . . 11, 163agent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .192algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . .52amacrine cell . . . . . . . . . . . . . . . . . . . . . . . 30approximation. . . . . . . . . . . . . . . . . . . . .109ART . . . . see adaptive resonance theoryART-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165ART-2A. . . . . . . . . . . . . . . . . . . . . . . . . . .166ART-3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166artificial intelligence . . . . . . . . . . . . . . . . 10associative data storage . . . . . . . . . . . 155

ATP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22attractor . . . . . . . . . . . . . . . . . . . . . . . . . . 117autoassociator . . . . . . . . . . . . . . . . . . . . . 129axon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20, 25

Bbackpropagation . . . . . . . . . . . . . . . . . . . . 90

second order . . . . . . . . . . . . . . . . . . . 96backpropagation of error. . . . . . . . . . . .86

recurrent . . . . . . . . . . . . . . . . . . . . . . 122bar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136bias neuron. . . . . . . . . . . . . . . . . . . . . . . . .45binary threshold function . . . . . . . . . . 38bipolar cell . . . . . . . . . . . . . . . . . . . . . . . . . 29black box . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7brain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16brainstem . . . . . . . . . . . . . . . . . . . . . . . . . . 18

Ccapability to learn . . . . . . . . . . . . . . . . . . . 4center

of a ROLF neurons. . . . . . . . . . . .174of a SOM Neuron . . . . . . . . . . . . . 144

219

Page 236: Neural Networks

Index dkriesel.com

of an RBF neuron . . . . . . . . . . . . . 102distance to the . . . . . . . . . . . . . . 107

central nervous system . . . . . . . . . . . . . 16cerebellum . . . . . . . . . . . . . see little braincerebral cortex . . . . . . . . . . . . . . . . . . . . . 16cerebrum . . . . . . . . . . . . . . . . . . . . . . . . . . . 16change in weight. . . . . . . . . . . . . . . . . . . .66cluster analysis . . . . . . . . . . . . . . . . . . . . 169clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . 169CNS . . . . . . . see central nervous systemcodebook vector . . . . . . . . . . . . . . 136, 170complete linkage. . . . . . . . . . . . . . . . . . . .40complex eye . . . . . . . . . . . . . . . . . . . . . . . . 28compound eye . . . . . . . . see complex eyeconcentration gradient . . . . . . . . . . . . . . 21concept of time . . . . . . . . . . . . . . . . . . . . . 35cone function . . . . . . . . . . . . . . . . . . . . . .148connection. . . . . . . . . . . . . . . . . . . . . . . . . .36context-based search . . . . . . . . . . . . . . 155continuous . . . . . . . . . . . . . . . . . . . . . . . . 135cortex . . . . . . . . . . . . . . see cerebral cortex

visual . . . . . . . . . . . . . . . . . . . . . . . . . . 17cortical field . . . . . . . . . . . . . . . . . . . . . . . . 16

association . . . . . . . . . . . . . . . . . . . . . 17primary . . . . . . . . . . . . . . . . . . . . . . . . 17

cylinder function . . . . . . . . . . . . . . . . . . 148

DDartmouth Summer Research Project9Delta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81Delta rule . . . . . . . . . . . . . . . . . . . . . . . . . . 81Dendrite . . . . . . . . . . . . . . . . . . . . . . . . . . . 20dendrite

tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20depolarization . . . . . . . . . . . . . . . . . . . . . . 23diencephalon . . . . . . . . . . . . see interbrain

difference vector . . . . . . . see error vectordigital filter . . . . . . . . . . . . . . . . . . . . . . . 181digitization . . . . . . . . . . . . . . . . . . . . . . . . 136discrete . . . . . . . . . . . . . . . . . . . . . . . . . . . 135discretization . . . . . . . . . see quantizationdistance

Euclidean . . . . . . . . . . . . . . . . . 57, 169squared. . . . . . . . . . . . . . . . . . . .78, 169

dynamical system . . . . . . . . . . . . . . . . . 117

Eearly stopping . . . . . . . . . . . . . . . . . . . . . . 60electronic brain . . . . . . . . . . . . . . . . . . . . . . 9Elman network . . . . . . . . . . . . . . . . . . . . 119environment . . . . . . . . . . . . . . . . . . . . . . .192episode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194epoch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54epsilon-nearest neighboring . . . . . . . . 171error

specific . . . . . . . . . . . . . . . . . . . . . . . . . 57total . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

error function . . . . . . . . . . . . . . . . . . . . . . 77specific . . . . . . . . . . . . . . . . . . . . . . . . . 77

error vector . . . . . . . . . . . . . . . . . . . . . . . . 55evolutionary algorithms . . . . . . . . . . . 122exploitation approach . . . . . . . . . . . . . 195exploration approach . . . . . . . . . . . . . . 195exteroceptor . . . . . . . . . . . . . . . . . . . . . . . . 26

Ffault tolerance . . . . . . . . . . . . . . . . . . . . . . . 4Feedforward . . . . . . . . . . . . . . . . . . . . . . . . 40Fermi function . . . . . . . . . . . . . . . . . . . . . 38

220 D. Kriesel – A Brief Introduction to Neural Networks (EPSILON2-EN)

Page 237: Neural Networks

dkriesel.com Index

flat spot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96elimination . . . . . . . . . . . . . . . . . . . . . 96

function approximation . . . . . . . . . . . . . 93function approximator

universal . . . . . . . . . . . . . . . . . . . . . . . 84

Gganglion cell . . . . . . . . . . . . . . . . . . . . . . . . 29Gauss-Markov model . . . . . . . . . . . . . . 109Gaussian bell . . . . . . . . . . . . . . . . . . . . . .147generalization . . . . . . . . . . . . . . . . . . . . 4, 51glial cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62gradient descent . . . . . . . . . . . . . . . . . . . . 62

problems . . . . . . . . . . . . . . . . . . . . . . . 62grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144gridworld. . . . . . . . . . . . . . . . . . . . . . . . . .190

HHeaviside function see binary threshold

functionHebbian rule . . . . . . . . . . . . . . . . . . . . . . . 66

generalized form. . . . . . . . . . . . . . . .66heteroassociator . . . . . . . . . . . . . . . . . . . 130Hinton diagram . . . . . . . . . . . . . . . . . . . . 36history of development. . . . . . . . . . . . . . .8Hopfield networks . . . . . . . . . . . . . . . . . 125

continuous . . . . . . . . . . . . . . . . . . . . 132horizontal cell . . . . . . . . . . . . . . . . . . . . . . 30hyperbolic tangent . . . . . . . . . . . . . . . . . 38hyperpolarization . . . . . . . . . . . . . . . . . . . 25hypothalamus . . . . . . . . . . . . . . . . . . . . . . 17

Iindividual eye . . . . . . . . see ommatidiuminput dimension . . . . . . . . . . . . . . . . . . . . 49input patterns . . . . . . . . . . . . . . . . . . . . . . 52input vector . . . . . . . . . . . . . . . . . . . . . . . . 48interbrain . . . . . . . . . . . . . . . . . . . . . . . . . . 17internodesn. . . . . . . . . . . . . . . . . . . . . . . . .25interoceptor . . . . . . . . . . . . . . . . . . . . . . . . 26interpolation

precise . . . . . . . . . . . . . . . . . . . . . . . . 108ion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21iris . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

JJordan network. . . . . . . . . . . . . . . . . . . .118

Kk-means clustering . . . . . . . . . . . . . . . . 170k-nearest neighbouring . . . . . . . . . . . . 170

Llayer

hidden . . . . . . . . . . . . . . . . . . . . . . . . . 40input . . . . . . . . . . . . . . . . . . . . . . . . . . .40output . . . . . . . . . . . . . . . . . . . . . . . . . 40

learnability . . . . . . . . . . . . . . . . . . . . . . . . . 93learning

D. Kriesel – A Brief Introduction to Neural Networks (EPSILON2-EN) 221

Page 238: Neural Networks

Index dkriesel.com

    batch . . . see learning, offline
    offline . . . 53
    online . . . 53
    reinforcement . . . 52
    supervised . . . 53
    unsupervised . . . 52
learning rate . . . 91
    variable . . . 92
learning strategy . . . 40
learning vector quantization . . . 135
lens . . . 29
linear separability . . . 83
linear associator . . . 11
little brain . . . 17
locked-in syndrome . . . 18
logistic function . . . see Fermi function
    temperature parameter . . . 39
LVQ . . . see learning vector quantization
LVQ1 . . . 138
LVQ2 . . . 138
LVQ3 . . . 138

M
M-SOM . . . see self-organizing map, multi-
Mark I perceptron . . . 10
Mathematical Symbols
    (t) . . . see concept of time
    A(S) . . . see action space
    Ep . . . see error vector
    G . . . see topology
    N . . . see self-organizing map, input dimension
    P . . . see training set
    Q∗Π(s, a) . . . see action-value function, optimal
    QΠ(s, a) . . . see action-value function
    Rt . . . see return
    S . . . see situation space
    T . . . see temperature parameter
    V∗Π(s) . . . see state-value function, optimal
    VΠ(s) . . . see state-value function
    W . . . see weight matrix
    ∆wi,j . . . see change in weight
    Π . . . see policy
    Θ . . . see threshold value
    α . . . see momentum
    β . . . see weight decay
    δ . . . see Delta
    η . . . see learning rate
    ∇ . . . see nabla operator
    ρ . . . see radius multiplier
    Err . . . see error, total
    Err(W) . . . see error function
    Errp . . . see error, specific
    Errp(W) . . . see error function, specific
    ErrWD . . . see weight decay
    aj . . . see activation
    at . . . see action
    c . . . see center of an RBF neuron; see neuron, self-organizing map, center
    m . . . see output dimension
    n . . . see input dimension
    p . . . see training pattern
    rh . . . see center of an RBF neuron, distance to the
    rt . . . see reward
    st . . . see situation
    t . . . see teaching input
    wi,j . . . see weight
    x . . . see input vector
    y . . . see output vector
    fact . . . see activation function
    fout . . . see output function
membrane . . . 21
    -potential . . . 21
memorized . . . 56
metric . . . 169
Mexican hat function . . . 148
MLP . . . see perceptron, multi-layer
momentum . . . 94
momentum term . . . 95
Monte Carlo method . . . 199
Moore-Penrose pseudo inverse . . . 109
moving average procedure . . . 182
myelin sheath . . . 25

N
nabla operator . . . 62
Neocognitron . . . 12
nervous system . . . 15
network input . . . 37
neural gas . . . 157
    growing . . . 161
    multi- . . . 159
neural network . . . 36
    recurrent . . . 117
neuron . . . 35
    accepting . . . 175
    binary . . . 73
    context . . . 118
    Fermi . . . 73
    identity . . . 73
    information processing . . . 73
    input . . . 73
    RBF . . . 102
        output . . . 102
    ROLF . . . 174
    self-organizing map . . . 144
    Tanh . . . 73
    winner . . . 146
neuron layers . . . see layer
neurotransmitters . . . 19
nodes of Ranvier . . . 25

O
oligodendrocytes . . . 25
OLVQ . . . 138
on-neuron . . . see bias neuron
one-step-ahead prediction . . . 181
    heterogeneous . . . 185
open loop learning . . . 122
optimal brain damage . . . 97
order of activation . . . 46
    asynchronous
        fixed order . . . 48
        random order . . . 47
        randomly permuted order . . . 47
        topological order . . . 48
    synchronous . . . 46
output dimension . . . 49
output function . . . 39
output vector . . . 49

P
parallelism . . . 5
pattern . . . see training pattern
pattern recognition . . . 94, 130
perceptron . . . 73
    multi-layer . . . 84
        recurrent . . . 117
    single-layer . . . 74
perceptron convergence theorem . . . 75
perceptron learning algorithm . . . 75
period . . . 117
peripheral nervous system . . . 15
Persons
    Anderson . . . 204 f.
    Anderson, James A. . . . 11
    Barto . . . 189, 204 f.
    Carpenter, Gail . . . 11, 163
    Elman . . . 118
    Fukushima . . . 12
    Girosi . . . 101
    Grossberg, Stephen . . . 11, 163
    Hebb, Donald O. . . . 9, 66
    Hinton . . . 12
    Hoff, Marcian E. . . . 10
    Hopfield, John . . . 12, 125
    Ito . . . 12
    Jordan . . . 118
    Kohonen, Teuvo . . . 11, 135, 143, 155
    Lashley, Karl . . . 9
    MacQueen, J. . . . 170
    Martinetz, Thomas . . . 157
    McCulloch, Warren . . . 8 f.
    Minsky, Marvin . . . 9, 11
    Miyake . . . 12
    Nilsson, Nils . . . 10
    Papert, Seymour . . . 11
    Parker, David . . . 96
    Pitts, Walter . . . 8 f.
    Poggio . . . 101
    Pythagoras . . . 58
    Rosenblatt, Frank . . . 10, 71
    Rumelhart . . . 12
    Steinbuch, Karl . . . 10
    Sutton . . . 189, 204 f.
    Tesauro, Gerald . . . 203
    von der Malsburg, Christoph . . . 11
    Werbos, Paul . . . 11, 86, 96
    Widrow, Bernard . . . 10
    Wightman, Charles . . . 10
    Williams . . . 12
    Zuse, Konrad . . . 9
pinhole eye . . . 28
PNS . . . see peripheral nervous system
pole balancer . . . 204
policy . . . 194
    closed loop . . . 195
    evaluation . . . 198
    greedy . . . 195
    improvement . . . 198
    open loop . . . 194
pons . . . 18
propagation function . . . 37
pruning . . . 97
pupil . . . 29

Q
Q learning . . . 202
quantization . . . 135
quickpropagation . . . 96

R
RBF network . . . 102
    growing . . . 114
receptive field . . . 30
receptor cell . . . 26
    photo- . . . 29
    primary . . . 26
    secondary . . . 26
recurrence . . . 41
    direct . . . 41
    indirect . . . 43
    lateral . . . 43
recurrent . . . 117
refractory period . . . 25
regional and online learnable fields . . . 173
reinforcement learning . . . 189
repolarization . . . 23
representability . . . 93
resonance . . . 164
retina . . . 29, 73
return . . . 193
reward
    avoidance strategy . . . 197
    pure delayed . . . 196
    pure negative . . . 196
RMS . . . see root mean square
ROLFs . . . see regional and online learnable fields
root mean square . . . 57

S
saltatory conductor . . . 25
Schwann cell . . . 25
self-fulfilling prophecy . . . 187
self-organizing feature maps . . . 11
self-organizing map . . . 143
    multi- . . . 159
sensory adaptation . . . 27
sensory transduction . . . 26
shortcut connections . . . 41
silhouette coefficient . . . 173
single lens eye . . . 29
single shot learning . . . 128
situation . . . 192
situation space . . . 193
situation tree . . . 196
SLP . . . see perceptron, single-layer
Snark . . . 9
sodium-potassium pump . . . 22
SOM . . . see self-organizing map
soma . . . 20
spin . . . 125
spinal cord . . . 16
stability/plasticity dilemma . . . 163
state . . . 192
state space forecasting . . . 181
state-value function . . . 198
stimulus . . . 23, 145
stimulus-conducting apparatus . . . 26
surface, perceptive . . . 174
swing up an inverted pendulum . . . 204
symmetry breaking . . . 94
synapse
    chemical . . . 19
    electrical . . . 19
synapses . . . 19
synaptic cleft . . . 19

T
target . . . 36
TD gammon . . . 203
TD learning . . . see temporal difference learning
teacher forcing . . . 122
teaching input . . . 55
telencephalon . . . see cerebrum
temporal difference learning . . . 200
thalamus . . . 17
threshold potential . . . 23
threshold value . . . 38
time horizon . . . 194
time series . . . 179
time series prediction . . . 179
topological defect . . . 152
topology . . . 145
topology function . . . 146
training pattern . . . 54
    set of . . . 55
training set . . . 52
transfer function . . . see activation function
Truncus cerebri . . . see brainstem
two-step-ahead prediction . . . 183
    direct . . . 183

U
unfolding in time . . . 121

V
Voronoi diagram . . . 137

W
weight . . . 36
weight matrix . . . 36
    bottom-up . . . 164
    top-down . . . 163
weight vector . . . 36
weighted sum . . . 37
Widrow-Hoff rule . . . see Delta rule
winner-takes-all scheme . . . 43