
A Brief Introduction to Neural Networks

David Kriesel
dkriesel.com

Download location: http://www.dkriesel.com/en/science/neural_networks

NEW – for the programmers: Scalable and efficient NN framework, written in JAVA
http://www.dkriesel.com/en/tech/snipe


In remembrance of Dr. Peter Kemp, Notary (ret.), Bonn, Germany.


A small preface

"Originally, this work has been prepared in the framework of a seminar of the University of Bonn in Germany, but it has been and will be extended (after being presented and published online under www.dkriesel.com on 5/27/2005). First and foremost, to provide a comprehensive overview of the subject of neural networks and, second, just to acquire more and more knowledge about LaTeX. And who knows – maybe one day this summary will become a real preface!"

Abstract of this work, end of 2005

The above abstract has not yet become a preface, but at least a little preface, ever since the extended text (then 40 pages long) has turned out to be a download hit.

Ambition and intention of this manuscript

The entire text is written and laid out more effectively and with more illustrations than before. I did all the illustrations myself, most of them directly in LaTeX by using XYpic. They reflect what I would have liked to see when becoming acquainted with the subject: Text and illustrations should be memorable and easy to understand to offer as many people as possible access to the field of neural networks.

Nevertheless, the mathematically and formally skilled readers will be able to understand the definitions without reading the running text, while the opposite holds for readers only interested in the subject matter; everything is explained in both colloquial and formal language. Please let me know if you find out that I have violated this principle.

The sections of this text are mostly independent of each other

The document itself is divided into different parts, which are again divided into chapters. Although the chapters contain cross-references, they are also individually accessible to readers with little previous knowledge. There are larger and smaller chapters: While the larger chapters should provide profound insight into a paradigm of neural networks (e.g. the classic neural network structure: the perceptron and its learning procedures), the smaller chapters give a short overview – but this is also explained in the introduction of each chapter. In addition to all the definitions and explanations I have included some excursuses to provide interesting information not directly related to the subject.

Unfortunately, I was not able to find free German sources that are multi-faceted in respect of content (concerning the paradigms of neural networks) and, nevertheless, written in a coherent style. The aim of this work is (even if it could not be fulfilled at first go) to close this gap bit by bit and to provide easy access to the subject.

Want to learn not only by reading, but also by coding? Use SNIPE!

SNIPE1 is a well-documented JAVA library that implements a framework for neural networks in a speedy, feature-rich and usable way. It is available at no cost for non-commercial purposes. It was originally designed for high performance simulations with lots and lots of neural networks (even large ones) being trained simultaneously. Recently, I decided to give it away as a professional reference implementation that covers network aspects handled within this work, while at the same time being faster and more efficient than lots of other implementations due to the original high-performance simulation design goal. Those of you who are up for learning by doing and/or have to use a fast and stable neural networks implementation for some reason should definitely have a look at Snipe.

1 Scalable and Generalized Neural Information Processing Engine, downloadable at http://www.dkriesel.com/tech/snipe, online JavaDoc at http://snipe.dkriesel.com

However, the aspects covered by Snipe are not entirely congruent with those covered by this manuscript. Some kinds of neural networks are not supported by Snipe, while for other kinds Snipe may offer many more capabilities than this manuscript will ever cover in the form of practical hints. Anyway, in my experience almost all of the implementation requirements of my readers are covered well. On the Snipe download page, look for the section "Getting started with Snipe" – you will find an easy step-by-step guide concerning Snipe and its documentation, as well as some examples.

SNIPE: This manuscript frequently incorporates Snipe. Shaded Snipe-paragraphs like this one are scattered among large parts of the manuscript, providing information on how to implement their context in Snipe. This also implies that those who do not want to use Snipe just have to skip the shaded Snipe-paragraphs! The Snipe-paragraphs assume the reader has had a close look at the "Getting started with Snipe" section. Often, class names are used. As Snipe consists of only a few different packages, I omitted the package names within the qualified class names for the sake of readability.


It’s easy to print this manuscript

This text is completely illustrated in color, but it can also be printed as is in monochrome: The colors of figures, tables and text are well-chosen so that in addition to an appealing design the colors are still easy to distinguish when printed in monochrome.

There are many tools directly integrated into the text

Different aids are directly integrated in the document to make reading more flexible; however, anyone (like me) who prefers reading words on paper rather than on screen can also enjoy some features.

In the table of contents, different types of chapters are marked

Different types of chapters are directly marked within the table of contents. Chapters that are marked as "fundamental" are definitely ones to read because almost all subsequent chapters heavily depend on them. Other chapters additionally depend on information given in other (preceding) chapters, which is then marked in the table of contents, too.

Speaking headlines throughout the text, short ones in the table of contents

The whole manuscript is now pervaded by such headlines. Speaking headlines are not just title-like ("Reinforcement Learning"), but condense the information given in the associated section into a single sentence. In the named instance, an appropriate headline would be "Reinforcement learning methods provide feedback to the network, whether it behaves well or badly". However, such long headlines would bloat the table of contents in an unacceptable way. So I used short titles like the first one in the table of contents, and speaking ones, like the latter, throughout the text.

Marginal notes are a navigational aid

The entire document contains marginal notes in colloquial language (see the example in the margin: "Hypertext on paper :-)"), allowing you to "scan" the document quickly to find a certain passage in the text (including the titles).

New mathematical symbols are marked by specific marginal notes for easy finding (see the example for x in the margin).

There are several kinds of indexing

This document contains different types of indexing: If you have found a word in the index and opened the corresponding page, you can easily find it by searching for highlighted text – all indexed words are highlighted like this.

Mathematical symbols appearing in several chapters of this document (e.g. Ω for an output neuron; I tried to maintain a consistent nomenclature for regularly recurring elements) are separately indexed under "Mathematical Symbols", so they can easily be assigned to the corresponding term.

Names of persons written in small caps are indexed in the category "Persons" and ordered by the last names.

Terms of use and license

Beginning with the epsilon edition, the text is licensed under the Creative Commons Attribution-No Derivative Works 3.0 Unported License2, except for some little portions of the work licensed under more liberal licenses as mentioned (mainly some figures from Wikimedia Commons). A quick license summary:

1. You are free to redistribute this document (even though it is a much better idea to just distribute the URL of my homepage, for it always contains the most recent version of the text).

2. You may not modify, transform, or build upon the document except for personal use.

3. You must maintain the author's attribution of the document at all times.

4. You may not use the attribution to imply that the author endorses you or your document use.

Since I'm no lawyer, the above bullet-point summary is just informational: if there is any conflict in interpretation between the summary and the actual license, the actual license always takes precedence. Note that this license does not extend to the source files used to produce the document. Those are still mine.

2 http://creativecommons.org/licenses/by-nd/3.0/

How to cite this manuscript

There’s no official publisher, so you need to be careful with your citation. Please find more information in English and German on my homepage, or rather on the subpage concerning the manuscript3.

Acknowledgement

Now I would like to express my gratitude to all the people who contributed, in whatever manner, to the success of this work, since a work like this needs many helpers. First of all, I want to thank the proofreaders of this text, who helped me and my readers very much. In alphabetical order: Wolfgang Apolinarski, Kathrin Gräve, Paul Imhoff, Thomas Kühn, Christoph Kunze, Malte Lohmeyer, Joachim Nock, Daniel Plohmann, Daniel Rosenthal, Christian Schulz and Tobias Wilken.

3 http://www.dkriesel.com/en/science/neural_networks

Additionally, I want to thank the readers Dietmar Berger, Igor Buchmüller, Marie Christ, Julia Damaschek, Jochen Döll, Maximilian Ernestus, Hardy Falk, Anne Feldmeier, Sascha Fink, Andreas Friedmann, Jan Gassen, Markus Gerhards, Sebastian Hirsch, Andreas Hochrath, Nico Höft, Thomas Ihme, Boris Jentsch, Tim Hussein, Thilo Keller, Mario Krenn, Mirko Kunze, Maikel Linke, Adam Maciak, Benjamin Meier, David Möller, Andreas Müller, Rainer Penninger, Lena Reichel, Alexander Schier, Matthias Siegmund, Mathias Tirtasana, Oliver Tischler, Maximilian Voit, Igor Wall, Achim Weber, Frank Weinreis, Gideon Maillette de Buij Wenniger, Philipp Woock and many others for their feedback, suggestions and remarks.

Additionally, I’d like to thank Sebastian Merzbach, who examined this work in a very conscientious way, finding inconsistencies and errors. In particular, he cleared lots and lots of language clumsiness from the English version.

Especially, I would like to thank Beate Kuhl for translating the entire text from German to English, and for her questions which made me think of changing the phrasing of some paragraphs.

I would particularly like to thank Prof. Rolf Eckmiller and Dr. Nils Goerke as well as the entire Division of Neuroinformatics, Department of Computer Science of the University of Bonn – they all made sure that I always learned (and also had to learn) something new about neural networks and related subjects. Especially Dr. Goerke has always been willing to respond to any questions I was not able to answer myself during the writing process. Conversations with Prof. Eckmiller made me step back from the whiteboard to get a better overall view on what I was doing and what I should do next.

Globally, and not only in the context of this work, I want to thank my parents, who never got tired of buying me specialized and therefore expensive books and who have always supported me in my studies.

For many "remarks" and the very special and cordial atmosphere ;-) I want to thank Andreas Huber and Tobias Treutler. Since our first semester it has rarely been boring with you!

Now I would like to think back to my school days and cordially thank some teachers who (in my opinion) had imparted some scientific knowledge to me – although my class participation had not always been wholehearted: Mr. Wilfried Hartmann, Mr. Hubert Peters and Mr. Frank Nökel.

Furthermore I would like to thank the whole team at the notary’s office of Dr. Kemp and Dr. Kolb in Bonn, where I have always felt in good hands and who have helped me to keep my printing costs low – in particular Christiane Flamme and Dr. Kemp!


Thanks also go to the Wikimedia Commons, from which I took some (few) images and altered them to suit this text.

Last but not least I want to thank two people who made outstanding contributions to this work and who occupy, so to speak, a place of honor: my girlfriend Verena Thomas, who found many mathematical and logical errors in my text and discussed them with me, although she has lots of other things to do, and Christiane Schultze, who carefully reviewed the text for spelling mistakes and inconsistencies.

David Kriesel


Contents

A small preface

I From biology to formalization – motivation, philosophy, history and realization of neural models

1 Introduction, motivation and history
  1.1 Why neural networks?
    1.1.1 The 100-step rule
    1.1.2 Simple application examples
  1.2 History of neural networks
    1.2.1 The beginning
    1.2.2 Golden age
    1.2.3 Long silence and slow reconstruction
    1.2.4 Renaissance
  Exercises

2 Biological neural networks
  2.1 The vertebrate nervous system
    2.1.1 Peripheral and central nervous system
    2.1.2 Cerebrum
    2.1.3 Cerebellum
    2.1.4 Diencephalon
    2.1.5 Brainstem
  2.2 The neuron
    2.2.1 Components
    2.2.2 Electrochemical processes in the neuron
  2.3 Receptor cells
    2.3.1 Various types
    2.3.2 Information processing within the nervous system
    2.3.3 Light sensing organs
  2.4 The amount of neurons in living organisms
  2.5 Technical neurons as caricature of biology
  Exercises

3 Components of artificial neural networks (fundamental)
  3.1 The concept of time in neural networks
  3.2 Components of neural networks
    3.2.1 Connections
    3.2.2 Propagation function and network input
    3.2.3 Activation
    3.2.4 Threshold value
    3.2.5 Activation function
    3.2.6 Common activation functions
    3.2.7 Output function
    3.2.8 Learning strategy
  3.3 Network topologies
    3.3.1 Feedforward
    3.3.2 Recurrent networks
    3.3.3 Completely linked networks
  3.4 The bias neuron
  3.5 Representing neurons
  3.6 Orders of activation
    3.6.1 Synchronous activation
    3.6.2 Asynchronous activation
  3.7 Input and output of data
  Exercises

4 Fundamentals on learning and training samples (fundamental)
  4.1 Paradigms of learning
    4.1.1 Unsupervised learning
    4.1.2 Reinforcement learning
    4.1.3 Supervised learning
    4.1.4 Offline or online learning?
    4.1.5 Questions in advance
  4.2 Training patterns and teaching input
  4.3 Using training samples
    4.3.1 Division of the training set
    4.3.2 Order of pattern representation
  4.4 Learning curve and error measurement
    4.4.1 When do we stop learning?
  4.5 Gradient optimization procedures
    4.5.1 Problems of gradient procedures
  4.6 Exemplary problems
    4.6.1 Boolean functions
    4.6.2 The parity function
    4.6.3 The 2-spiral problem
    4.6.4 The checkerboard problem
    4.6.5 The identity function
    4.6.6 Other exemplary problems
  4.7 Hebbian rule
    4.7.1 Original rule
    4.7.2 Generalized form
  Exercises

II Supervised learning network paradigms

5 The perceptron, backpropagation and its variants
  5.1 The singlelayer perceptron
    5.1.1 Perceptron learning algorithm and convergence theorem
    5.1.2 Delta rule
  5.2 Linear separability
  5.3 The multilayer perceptron
  5.4 Backpropagation of error
    5.4.1 Derivation
    5.4.2 Boiling backpropagation down to the delta rule
    5.4.3 Selecting a learning rate
  5.5 Resilient backpropagation
    5.5.1 Adaption of weights
    5.5.2 Dynamic learning rate adjustment
    5.5.3 Rprop in practice
  5.6 Further variations and extensions to backpropagation
    5.6.1 Momentum term
    5.6.2 Flat spot elimination
    5.6.3 Second order backpropagation
    5.6.4 Weight decay
    5.6.5 Pruning and Optimal Brain Damage
  5.7 Initial configuration of a multilayer perceptron
    5.7.1 Number of layers
    5.7.2 The number of neurons
    5.7.3 Selecting an activation function
    5.7.4 Initializing weights
  5.8 The 8-3-8 encoding problem and related problems
  Exercises

6 Radial basis functions
  6.1 Components and structure
  6.2 Information processing of an RBF network
    6.2.1 Information processing in RBF neurons
    6.2.2 Analytical thoughts prior to the training
  6.3 Training of RBF networks
    6.3.1 Centers and widths of RBF neurons
  6.4 Growing RBF networks
    6.4.1 Adding neurons
    6.4.2 Limiting the number of neurons
    6.4.3 Deleting neurons
  6.5 Comparing RBF networks and multilayer perceptrons
  Exercises

7 Recurrent perceptron-like networks (depends on chapter 5)
  7.1 Jordan networks
  7.2 Elman networks
  7.3 Training recurrent networks
    7.3.1 Unfolding in time
    7.3.2 Teacher forcing
    7.3.3 Recurrent backpropagation
    7.3.4 Training with evolution

8 Hopfield networks
  8.1 Inspired by magnetism
  8.2 Structure and functionality
    8.2.1 Input and output of a Hopfield network
    8.2.2 Significance of weights
    8.2.3 Change in the state of neurons
  8.3 Generating the weight matrix
  8.4 Autoassociation and traditional application
  8.5 Heteroassociation and analogies to neural data storage
    8.5.1 Generating the heteroassociative matrix
    8.5.2 Stabilizing the heteroassociations
    8.5.3 Biological motivation of heteroassociation
  8.6 Continuous Hopfield networks
  Exercises

9 Learning vector quantization
  9.1 About quantization
  9.2 Purpose of LVQ
  9.3 Using codebook vectors
  9.4 Adjusting codebook vectors
    9.4.1 The procedure of learning
  9.5 Connection to neural networks
  Exercises

III Unsupervised learning network paradigms

10 Self-organizing feature maps
  10.1 Structure
  10.2 Functionality and output interpretation
  10.3 Training
    10.3.1 The topology function
    10.3.2 Monotonically decreasing learning rate and neighborhood
  10.4 Examples
    10.4.1 Topological defects
  10.5 Adjustment of resolution and position-dependent learning rate
  10.6 Application
    10.6.1 Interaction with RBF networks
  10.7 Variations
    10.7.1 Neural gas
    10.7.2 Multi-SOMs
    10.7.3 Multi-neural gas
    10.7.4 Growing neural gas
  Exercises

11 Adaptive resonance theory
  11.1 Task and structure of an ART network
    11.1.1 Resonance
  11.2 Learning process
    11.2.1 Pattern input and top-down learning
    11.2.2 Resonance and bottom-up learning
    11.2.3 Adding an output neuron
  11.3 Extensions

IV Excursi, appendices and registers

A Excursus: Cluster analysis and regional and online learnable fields
  A.1 k-means clustering
  A.2 k-nearest neighboring
  A.3 ε-nearest neighboring
  A.4 The silhouette coefficient
  A.5 Regional and online learnable fields
    A.5.1 Structure of a ROLF
    A.5.2 Training a ROLF
    A.5.3 Evaluating a ROLF
    A.5.4 Comparison with popular clustering methods
    A.5.5 Initializing radii, learning rates and multiplier
    A.5.6 Application examples
  Exercises

B Excursus: neural networks used for prediction
  B.1 About time series
  B.2 One-step-ahead prediction
  B.3 Two-step-ahead prediction
    B.3.1 Recursive two-step-ahead prediction
    B.3.2 Direct two-step-ahead prediction
  B.4 Additional optimization approaches for prediction
    B.4.1 Changing temporal parameters
    B.4.2 Heterogeneous prediction
  B.5 Remarks on the prediction of share prices

C Excursus: reinforcement learning
  C.1 System structure
    C.1.1 The gridworld
    C.1.2 Agent and environment
    C.1.3 States, situations and actions
    C.1.4 Reward and return
    C.1.5 The policy
  C.2 Learning process
    C.2.1 Rewarding strategies
    C.2.2 The state-value function
    C.2.3 Monte Carlo method
    C.2.4 Temporal difference learning
    C.2.5 The action-value function
    C.2.6 Q learning
  C.3 Example applications
    C.3.1 TD gammon
    C.3.2 The car in the pit
    C.3.3 The pole balancer
  C.4 Reinforcement learning in connection with neural networks
  Exercises

Bibliography

List of Figures

Index


Part I

From biology to formalization – motivation, philosophy, history and realization of neural models


Chapter 1

Introduction, motivation and history

How to teach a computer? You can either write a fixed program – or you can enable the computer to learn on its own. Living beings do not have any programmer writing a program for developing their skills, which then only has to be executed. They learn by themselves – without the previous knowledge from external impressions – and thus can solve problems better than any computer today. What qualities are needed to achieve such a behavior for devices like computers? Can such cognition be adapted from biology? History, development, decline and resurgence of a wide approach to solving problems.

1.1 Why neural networks?

There are problem categories that cannot be formulated as an algorithm – problems that depend on many subtle factors, for example the purchase price of a piece of real estate, which our brain can (approximately) calculate. Without an algorithm a computer cannot do the same. Therefore the question to be asked is: How do we learn to tackle such problems?

Exactly – we learn; a capability computers obviously do not have. Humans have a brain that can learn. Computers have some processing units and memory. They allow the computer to perform the most complex numerical calculations in a very short time, but they are not adaptive.

If we compare computer and brain1, we will note that, theoretically, the computer should be more powerful than our brain: It comprises 10^9 transistors with a switching time of 10^-9 seconds. The brain contains 10^11 neurons, but these only have a switching time of about 10^-3 seconds.

The largest part of the brain is working continuously, while the largest part of the computer is only passive data storage. Thus, the brain is parallel and therefore performing close to its theoretical maximum, from which the computer is orders of magnitude away (Table 1.1). Additionally, a computer is static – the brain as a biological neural network can reorganize itself during its "lifespan" and therefore is able to learn, to compensate errors and so forth.

                                  Brain                 Computer
No. of processing units           ≈ 10^11               ≈ 10^9
Type of processing units          Neurons               Transistors
Type of calculation               massively parallel    usually serial
Data storage                      associative           address-based
Switching time                    ≈ 10^-3 s             ≈ 10^-9 s
Possible switching operations     ≈ 10^13 per second    ≈ 10^18 per second
Actual switching operations       ≈ 10^12 per second    ≈ 10^10 per second

Table 1.1: The (flawed) comparison between brain and computer at a glance. Inspired by: [Zel94]

1 Of course, this comparison is – for obvious reasons – controversially discussed by biologists and computer scientists, since response time and quantity do not tell anything about quality and performance of the processing units, and neurons and transistors cannot be compared directly. Nevertheless, the comparison serves its purpose and indicates the advantage of parallelism by means of processing time.

Within this text I want to outline how we can use the said characteristics of our brain for a computer system.

So the study of artificial neural networks is motivated by their similarity to successfully working biological systems, which – in comparison to the overall system – consist of very simple but numerous nerve cells that work massively in parallel and (which is probably one of the most significant aspects) have the capability to learn. There is no need to explicitly program a neural network. For instance, it can learn from training samples or by means of encouragement – with a carrot and a stick, so to speak (reinforcement learning).

One result from this learning procedure is the capability of neural networks to generalize and associate data: After successful training a neural network can find reasonable solutions for similar problems of the same class that were not explicitly trained. This in turn results in a high degree of fault tolerance against noisy input data.

Fault tolerance is closely related to biological neural networks, in which this characteristic is very distinct: As previously mentioned, a human has about 10^11 neurons that continuously reorganize themselves or are reorganized by external influences (about 10^5 neurons can be destroyed while in a drunken stupor, some types of food or environmental influences can also destroy brain cells). Nevertheless, our cognitive abilities are not significantly affected.

Thus, the brain is tolerant against internal errors – and also against external errors, for we can often read a really "dreadful scrawl" although the individual letters are nearly impossible to read.

Our modern technology, however, is not automatically fault-tolerant. I have never heard that someone forgot to install the hard disk controller into a computer and therefore the graphics card automatically took over its tasks, i.e. removed conductors and developed communication, so that the system as a whole was affected by the missing component, but not completely destroyed.

A disadvantage of this distributed fault-tolerant storage is certainly the fact that we cannot realize at first sight what a neural network knows and performs or where its faults lie. Usually, it is easier to perform such analyses for conventional algorithms. Most often we can only transfer knowledge into our neural network by means of a learning procedure, which can cause several errors and is not always easy to manage.

Fault tolerance of data, on the other hand, is already more sophisticated in state-of-the-art technology: Let us compare a record and a CD. If there is a scratch on a record, the audio information on this spot will be completely lost (you will hear a pop) and then the music goes on. On a CD the audio data are stored in a distributed way: A scratch causes a blurry sound in its vicinity, but the data stream remains largely unaffected. The listener won't notice anything.

So let us summarize the main characteristics we try to adapt from biology:

. Self-organization and learning capability,

. Generalization capability and

. Fault tolerance.

What types of neural networks particularly develop what kinds of abilities and can be used for what problem classes will be discussed in the course of this work.

In the introductory chapter I want to clarify the following: "The neural network" does not exist. There are different paradigms for neural networks, how they are trained and where they are used. My goal is to introduce some of these paradigms and supplement some remarks for practical application.

We have already mentioned that our brain works massively in parallel, in contrast to the functioning of a computer, i.e. every component is active at any time. If we want to state an argument for massive parallel processing, then the 100-step rule can be cited.

1.1.1 The 100-step rule

Experiments showed that a human can recognize the picture of a familiar object or person in ≈ 0.1 seconds, which, at a neuron switching time of ≈ 10^-3 seconds, corresponds to ≈ 100 discrete time steps of parallel processing.

A computer following the von Neumann architecture, however, can do practically nothing in 100 time steps of sequential processing, which are 100 assembler steps or cycle steps.

Now we want to look at a simple application example for a neural network.


Figure 1.1: A small robot with eight sensors and two motors. The arrow indicates the driving direction.

1.1.2 Simple application examples

Let us assume that we have a small robot as shown in fig. 1.1. This robot has eight distance sensors from which it extracts input data: Three sensors are placed on the front right, three on the front left, and two on the back. Each sensor provides a real numeric value at any time, that means we are always receiving an input I ∈ R^8.

Despite its two motors (which will be needed later) the robot in our simple example is not capable of doing much: It shall only drive on but stop when it might collide with an obstacle. Thus, our output is binary: H = 0 for "Everything is okay, drive on" and H = 1 for "Stop" (the output is called H for "halt signal"). Therefore we need a mapping

f : R^8 → B^1,

that maps the input signals to a robot activity.

1.1.2.1 The classical way

There are two ways of realizing this mapping. On the one hand, there is the classical way: We sit down and think for a while, and finally the result is a circuit or a small computer program which realizes the mapping (this is easily possible, since the example is very simple). After that we refer to the technical reference of the sensors, study their characteristic curve in order to learn the values for the different obstacle distances, and embed these values into the aforementioned set of rules. Such procedures are applied in classic artificial intelligence, and if you know the exact rules of a mapping algorithm, you are always well advised to follow this scheme.

1.1.2.2 The way of learning

On the other hand, more interesting and more successful for many mappings and problems that are hard to comprehend straightaway is the way of learning: We show different possible situations to the robot (fig. 1.2), and the robot shall learn on its own what to do in the course of its robot life.

In this example the robot shall simply learn when to stop. We first treat the neural network as a kind of black box (fig. 1.3). This means we do not know its structure but just regard its behavior in practice.

Figure 1.3: Initially, we regard the robot control as a black box whose inner life is unknown. The black box receives eight real sensor values and maps these values to a binary output value.

The situations in the form of simply measured sensor values (e.g. placing the robot in front of an obstacle, see illustration), which we show to the robot and for which we specify whether to drive on or to stop, are called training samples. Thus, a training sample consists of an exemplary input and a corresponding desired output. Now the question is how to transfer this knowledge, the information, into the neural network.

The samples can be taught to a neural network by using a simple learning procedure (a learning procedure is a simple algorithm or a mathematical formula). If we have done everything right and chosen good samples, the neural network will generalize from these samples and find a universal rule for when it has to stop.
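To make the black-box view a bit more concrete, here is a minimal sketch in JAVA (the language Snipe is written in). Note that none of the names below are Snipe API – they are purely hypothetical and only illustrate what a training sample and the black box look like from the outside:

// Hypothetical sketch only – not Snipe API.
// A training sample: eight real sensor readings plus the desired output H.
class TrainingSample {
    final double[] sensorValues; // input I from R^8
    final double haltSignal;     // desired output H: 0 = drive on, 1 = stop
    TrainingSample(double[] sensorValues, double haltSignal) {
        this.sensorValues = sensorValues;
        this.haltSignal = haltSignal;
    }
}

// The black box: all we require is that it maps R^8 to the halt signal
// and that some learning procedure can adjust its inner workings from samples.
interface RobotController {
    double map(double[] sensorValues);                  // f : R^8 -> B^1
    void train(java.util.List<TrainingSample> samples); // a simple learning procedure
}

Whatever sits behind the interface – a hand-written rule set or a neural network – is irrelevant to the robot; only the mapping matters.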

Our example can be optionally expanded. For the purpose of direction control it would be possible to control the motors of our robot separately2, with the sensor layout being the same. In this case we are looking for a mapping

f : R^8 → R^2,

which gradually controls the two motors by means of the sensor inputs and thus can not only, for example, stop the robot but also let it avoid obstacles. Here it is more difficult to analytically derive the rules, and de facto a neural network would be more appropriate.

Our goal is not to learn the samples by heart, but to realize the principle behind them: Ideally, the robot should apply the neural network in any situation and be able to avoid obstacles. In particular, the robot should query the network continuously and repeatedly while driving in order to continuously avoid obstacles. The result is a constant cycle: The robot queries the network. As a consequence, it will drive in one direction, which changes the sensor values. Again the robot queries the network and changes its position, the sensor values are changed once again, and so on. It is obvious that this system can also be adapted to dynamic, i.e. changing, environments (e.g. the moving obstacles in our example). A minimal sketch of this query cycle follows after the footnote below.

2 There is a robot called Khepera with more or less similar characteristics. It is round-shaped, approx. 7 cm in diameter, has two motors with wheels and various sensors. For more information I recommend referring to the internet.
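Here is the promised sketch of the query cycle, again in JAVA with purely hypothetical types (no real robot or Snipe API is meant): a trained controller mapping the eight sensor values to two motor speeds is queried once per control step, and acting on its output changes what the sensors see next.

// Hypothetical sketch of the sense-act cycle; none of these types belong to a real API.
interface MotorController {
    double[] map(double[] sensorValues); // f : R^8 -> R^2, two motor speeds
}

interface RobotIO {
    boolean isRunning();
    double[] readSensors();                         // current input from R^8
    void setMotorSpeeds(double left, double right); // actuate the two motors
}

class DriveLoop {
    // One control step per iteration: read sensors, query the network, set the motors.
    static void run(MotorController net, RobotIO robot) {
        while (robot.isRunning()) {
            double[] sensors = robot.readSensors();
            double[] motors = net.map(sensors);
            robot.setMotorSpeeds(motors[0], motors[1]); // this changes the next sensor reading
        }
    }
}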


Figure 1.2: The robot is positioned in a landscape that provides sensor values for different situations. We add the desired output values H and so receive our learning samples. The directions in which the sensors are oriented are exemplarily applied to two robots.

1.2 A brief history of neural networks

The field of neural networks has, like any other field of science, a long history of development with many ups and downs, as we will see soon. To continue the style of my work I will not represent this history in text form but more compactly in the form of a timeline. Citations and bibliographical references are added mainly for those topics that will not be further discussed in this text. Citations for keywords that will be explained later are mentioned in the corresponding chapters.

The history of neural networks begins in the early 1940s and thus nearly simultaneously with the history of programmable electronic computers. The youth of this field of research, as with the field of computer science itself, can be easily recognized due to the fact that many of the cited persons are still with us.

1.2.1 The beginning

As early as 1943, Warren McCulloch and Walter Pitts introduced models of neurological networks, recreated threshold switches based on neurons and showed that even simple networks of this kind are able to calculate nearly any logic or arithmetic function [MP43]. Furthermore, the first computer precursors ("electronic brains") were developed, among others supported by Konrad Zuse, who was tired of calculating ballistic trajectories by hand.

Figure 1.4: Some institutions of the field of neural networks. From left to right: John von Neumann, Donald O. Hebb, Marvin Minsky, Bernard Widrow, Seymour Papert, Teuvo Kohonen, John Hopfield, "in the order of appearance" as far as possible.

1947: Walter Pitts and Warren McCulloch indicated a practical field of application (which was not mentioned in their work from 1943), namely the recognition of spatial patterns by neural networks [PM47].

1949: Donald O. Hebb formulated the classical Hebbian rule [Heb49] which represents in its more generalized form the basis of nearly all neural learning procedures. The rule implies that the connection between two neurons is strengthened when both neurons are active at the same time. This change in strength is proportional to the product of the two activities. Hebb could postulate this rule, but due to the absence of neurological research he was not able to verify it.
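Written as a formula, this is only a rough sketch (the symbols here are mine, not Hebb's; η denotes a learning rate and a_i, a_j the activities of the two connected neurons – the rule and its generalized form are treated properly in section 4.7):

\Delta w_{i,j} = \eta \cdot a_i \cdot a_j

so the weight w_{i,j} grows whenever both neurons are strongly active at the same time.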

1950: The neuropsychologist Karl Lashley defended the thesis that brain information storage is realized as a distributed system. His thesis was based on experiments on rats, where only the extent but not the location of the destroyed nerve tissue influenced the rats’ performance in finding their way out of a labyrinth.

1.2.2 Golden age

1951: For his dissertation Marvin Minsky developed the neurocomputer Snark, which was already capable of adjusting its weights3 automatically. But it has never been practically implemented, since it is capable of busily calculating, but nobody really knows what it calculates.

1956: Well-known scientists and ambitious students met at the Dartmouth Summer Research Project and discussed, to put it crudely, how to simulate a brain. Differences between top-down and bottom-up research developed. While the early supporters of artificial intelligence wanted to simulate capabilities by means of software, supporters of neural networks wanted to achieve system behavior by imitating the smallest parts of the system – the neurons.

3 We will learn soon what weights are.

1957-1958: At the MIT, Frank Rosenblatt, Charles Wightman and their coworkers developed the first successful neurocomputer, the Mark I perceptron, which was capable of recognizing simple numerics by means of a 20 × 20 pixel image sensor and worked electromechanically with 512 motor-driven potentiometers – each potentiometer representing one variable weight.

1959: Frank Rosenblatt described different versions of the perceptron, formulated and verified his perceptron convergence theorem. He described neuron layers mimicking the retina, threshold switches, and a learning rule adjusting the connecting weights.

1960: Bernard Widrow and Marcian E. Hoff introduced the ADALINE (ADAptive LInear NEuron) [WH60], a fast and precise adaptive learning system that was the first widely commercially used neural network: It could be found in nearly every analog telephone for real-time adaptive echo filtering and was trained by means of the Widrow-Hoff rule or delta rule. At that time Hoff, later co-founder of Intel Corporation and known as the inventor of modern microprocessors, was a PhD student of Widrow. One advantage the delta rule had over the original perceptron learning algorithm was its adaptivity: If the difference between the actual output and the correct solution was large, the connecting weights also changed in larger steps – the smaller the steps, the closer the target was. Disadvantage: misapplication led to infinitesimally small steps close to the target. During the following stagnation, and out of fear of the scientific unpopularity of neural networks, ADALINE was renamed adaptive linear element – which was undone again later on.
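As a hedged illustration of this adaptivity (the symbols are only a sketch in my notation; the delta rule is derived properly in chapter 5): with learning rate η, correct solution t, actual output o and i-th input x_i, the weight update of a single linear neuron reads

\Delta w_i = \eta \cdot (t - o) \cdot x_i

so a large error t - o produces a large weight step and a small error only a small one.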

1961: Karl Steinbuch introduced technical realizations of associative memory, which can be seen as predecessors of today’s neural associative memories [Ste61]. Additionally, he described concepts for neural techniques and analyzed their possibilities and limits.

1965: In his book Learning Machines, Nils Nilsson gave an overview of the progress and works of this period of neural network research. It was assumed that the basic principles of self-learning and therefore, generally speaking, "intelligent" systems had already been discovered. Today this assumption seems to be an exorbitant overestimation, but at that time it provided for high popularity and sufficient research funds.

1969: Marvin Minsky and Seymour Papert published a precise mathematical analysis of the perceptron [MP69] to show that the perceptron model was not capable of representing many important problems (keywords: XOR problem and linear separability), and so put an end to overestimation, popularity and research funds. The implication that more powerful models would show exactly the same problems and the forecast that the entire field would be a research dead end resulted in a nearly complete decline in research funds for the next 15 years – no matter how incorrect these forecasts were from today’s point of view.

1.2.3 Long silence and slow reconstruction

The research funds were, as previously mentioned, extremely short. Everywhere research went on, but there were neither conferences nor other events and therefore only a few publications. This isolation of individual researchers provided for many independently developed neural network paradigms: They researched, but there was no discourse among them.

In spite of the poor appreciation the field received, the basic theories for the still continuing renaissance were laid at that time:

1972: Teuvo Kohonen introduced a model of the linear associator, a model of an associative memory [Koh72]. In the same year, such a model was presented independently and from a neurophysiologist’s point of view by James A. Anderson [And72].

1973: Christoph von der Malsburg used a neuron model that was non-linear and biologically more motivated [vdM73].

1974: For his dissertation at Harvard, Paul Werbos developed a learning procedure called backpropagation of error [Wer74], but it was not until one decade later that this procedure reached today’s importance.

1976-1980 and thereafter: Stephen Grossberg presented many papers (for instance [Gro76]) in which numerous neural models are analyzed mathematically. Furthermore, he dedicated himself to the problem of keeping a neural network capable of learning without destroying already learned associations. In cooperation with Gail Carpenter this led to models of adaptive resonance theory (ART).

1982: Teuvo Kohonen described the self-organizing feature maps (SOM) [Koh82, Koh98] – also known as Kohonen maps. He was looking for the mechanisms involving self-organization in the brain (He knew that the information about the creation of a being is stored in the genome, which has, however, not enough memory for a structure like the brain. As a consequence, the brain has to organize and create itself for the most part).

John Hopfield also invented the so-called Hopfield networks [Hop82] which are inspired by the laws of magnetism in physics. They were not widely used in technical applications, but the field of neural networks slowly regained importance.

1983: Fukushima, Miyake and Ito introduced the neural model of the Neocognitron which could recognize handwritten characters [FMI83] and was an extension of the Cognitron network already developed in 1975.

1.2.4 Renaissance

Through the influence of John Hopfield, who had personally convinced many researchers of the importance of the field, and the wide publication of backpropagation by Rumelhart, Hinton and Williams, the field of neural networks slowly showed signs of upswing.

1985: John Hopfield published an article describing a way of finding acceptable solutions for the Travelling Salesman problem by using Hopfield nets.

1986: The backpropagation of error learning procedure as a generalization of the delta rule was separately developed and widely published by the Parallel Distributed Processing Group [RHW86a]: Non-linearly-separable problems could be solved by multilayer perceptrons, and Marvin Minsky’s negative evaluations were disproven at a single blow. At the same time a certain kind of fatigue spread in the field of artificial intelligence, caused by a series of failures and unfulfilled hopes.

From this time on, the development of the field of research has almost been explosive. It can no longer be itemized, but some of its results will be seen in the following.

Exercises

Exercise 1. Give one example for each of the following topics:

. A book on neural networks or neuroinformatics,

. A collaborative group of a university working with neural networks,

. A software tool realizing neural networks ("simulator"),

. A company using neural networks, and

. A product or service being realized by means of neural networks.

Exercise 2. Show at least four applications of technical neural networks: two from the field of pattern recognition and two from the field of function approximation.

Exercise 3. Briefly characterize the four development phases of neural networks and give expressive examples for each phase.


Chapter 2

Biological neural networks

How do biological systems solve problems? How does a system of neurons work? How can we understand its functionality? What are different quantities of neurons able to do? Where in the nervous system does information processing occur? A short biological overview of the complexity of simple elements of neural information processing followed by some thoughts about their simplification in order to technically adapt them.

Before we begin to describe the technicalside of neural networks, it would be use-ful to briefly discuss the biology of neu-ral networks and the cognition of livingorganisms – the reader may skip the fol-lowing chapter without missing any tech-nical information. On the other hand Irecommend to read the said excursus ifyou want to learn something about theunderlying neurophysiology and see thatour small approaches, the technical neuralnetworks, are only caricatures of nature– and how powerful their natural counter-parts must be when our small approachesare already that effective. Now we wantto take a brief look at the nervous systemof vertebrates: We will start with a veryrough granularity and then proceed withthe brain and up to the neural level. Forfurther reading I want to recommend thebooks [CR00,KSJ00], which helped me alot during this chapter.

2.1 The vertebrate nervous system

The entire information processing system, i.e. the vertebrate nervous system, consists of the central nervous system and the peripheral nervous system, which is only a first and simple subdivision. In reality, such a rigid subdivision does not make sense, but here it is helpful to outline the information processing in a body.

2.1.1 Peripheral and central nervous system

The peripheral nervous system (PNS) comprises the nerves that are situated outside of the brain or the spinal cord. These nerves form a branched and very dense network throughout the whole body. The peripheral nervous system includes, for example, the spinal nerves, which pass out of the spinal cord (two within the level of each vertebra of the spine) and supply extremities, neck and trunk, but also the cranial nerves, which lead directly to the brain.

The central nervous system (CNS), however, is the "mainframe" within the vertebrate. It is the place where information received by the sense organs is stored and managed. Furthermore, it controls the inner processes in the body and, last but not least, coordinates the motor functions of the organism. The vertebrate central nervous system consists of the brain and the spinal cord (Fig. 2.1). However, we want to focus on the brain, which can – for the purpose of simplification – be divided into four areas (Fig. 2.2) to be discussed here.

2.1.2 The cerebrum is responsible for abstract thinking processes.

The cerebrum (telencephalon) is one of the areas of the brain that changed most during evolution. Along an axis, running from the lateral face to the back of the head, this area is divided into two hemispheres, which are organized in a folded structure. These cerebral hemispheres are connected by one strong nerve cord ("bar") and several small ones. A large number of neurons are located in the cerebral cortex (cortex), which is approx. 2-4 mm thick and divided into different cortical fields, each having a specific task to fulfill. Primary cortical fields are responsible for processing qualitative information, such as the management of different perceptions (e.g. the visual cortex is responsible for the management of vision). Association cortical fields, however, perform more abstract association and thinking processes; they also contain our memory.

Figure 2.1: Illustration of the central nervous system with spinal cord and brain.

Figure 2.2: Illustration of the brain. The colored areas of the brain are discussed in the text. The more we turn from abstract information processing to direct reflexive processing, the darker the areas of the brain are colored.

2.1.3 The cerebellum controls and coordinates motor functions

The cerebellum is located below the cerebrum, therefore it is closer to the spinal cord. Accordingly, it serves less abstract functions with higher priority: Here, large parts of motor coordination are performed, i.e., balance and movements are controlled and errors are continually corrected. For this purpose, the cerebellum has direct sensory information about muscle lengths as well as acoustic and visual information. Furthermore, it also receives messages about more abstract motor signals coming from the cerebrum.

In the human brain the cerebellum is considerably smaller than the cerebrum, but this is rather an exception. In many vertebrates this ratio is less pronounced. If we take a look at vertebrate evolution, we will notice that the cerebellum is not "too small" but the cerebrum is "too large" (at least, it is the most highly developed structure in the vertebrate brain). The two remaining brain areas should also be briefly discussed: the diencephalon and the brainstem.

2.1.4 The diencephalon controls fundamental physiological processes

The interbrain (diencephalon) includes several parts, of which only the thalamus will be briefly discussed: This part of the diencephalon mediates between sensory and motor signals and the cerebrum. Particularly, the thalamus decides which part of the information is transferred to the cerebrum, so that especially less important sensory perceptions can be suppressed at short notice to avoid overloads. Another part of the diencephalon is the hypothalamus, which controls a number of processes within the body. The diencephalon is also heavily involved in the human circadian rhythm ("internal clock") and the sensation of pain.

2.1.5 The brainstem connects the brain with the spinal cord and controls reflexes.

In comparison with the diencephalon, the brainstem (truncus cerebri) is phylogenetically much older. Roughly speaking, it is the "extended spinal cord" and thus the connection between brain and spinal cord. The brainstem can also be divided into different areas, some of which will be introduced exemplarily in this chapter. The functions will be discussed from abstract functions towards more fundamental ones. One important component is the pons (= bridge), a kind of transit station for many nerve signals from brain to body and vice versa.

If the pons is damaged (e.g. by a cerebral infarct), the result could be the locked-in syndrome – a condition in which a patient is "walled in" within his own body. He is conscious and aware with no loss of cognitive function, but cannot move or communicate by any means. Only his senses of sight, hearing, smell and taste generally work perfectly normally. Locked-in patients may often be able to communicate with others by blinking or moving their eyes.

Furthermore, the brainstem is responsible for many fundamental reflexes, such as the blinking reflex or coughing.

All parts of the nervous system have one thing in common: information processing. This is accomplished by huge accumulations of billions of very similar cells, whose structure is very simple but which communicate continuously. Large groups of these cells send coordinated signals and thus reach the enormous information processing capacity we are familiar with from our brain. We will now leave the level of brain areas and continue with the cellular level of the body – the level of neurons.

2.2 Neurons are information processing cells

Before specifying the functions and processes within a neuron, we will give a rough description of neuron functions: A neuron is nothing more than a switch with information input and output. The switch will be activated if there are enough stimuli of other neurons hitting the information input. Then, at the information output, a pulse is sent to, for example, other neurons.

2.2.1 Components of a neuron

Now we want to take a look at the components of a neuron (Fig. 2.3). In doing so, we will follow the way the electrical information takes within the neuron. The dendrites of a neuron receive the information by special connections, the synapses.


Figure 2.3: Illustration of a biological neuron with the components discussed in this text.

2.2.1.1 Synapses weight the individual parts of information

Incoming signals from other neurons or cells are transferred to a neuron by special connections, the synapses. Such connections can usually be found at the dendrites of a neuron, sometimes also directly at the soma. We distinguish between electrical and chemical synapses.

The electrical synapse is the simpler variant. An electrical signal received by the synapse, i.e. coming from the presynaptic side, is directly transferred to the postsynaptic nucleus of the cell. Thus, there is a direct, strong, unadjustable connection between the signal transmitter and the signal receiver, which is, for example, relevant to shortening reactions that must be "hard coded" within a living organism.

The chemical synapse is the more distinctive variant. Here, the electrical coupling of source and target does not take place; the coupling is interrupted by the synaptic cleft. This cleft electrically separates the presynaptic side from the postsynaptic one. You might think that, nevertheless, the information has to flow, so we will discuss how this happens: It is not an electrical but a chemical process. On the presynaptic side of the synaptic cleft the electrical signal is converted into a chemical signal, a process induced by chemical cues released there (the so-called neurotransmitters). These neurotransmitters cross the synaptic cleft and transfer the information into the nucleus of the cell (this is a very simple explanation, but later on we will see how this exactly works), where it is reconverted into electrical information. The neurotransmitters are degraded very fast, so that it is possible to release very precise information pulses here, too.

In spite of the more complex functioning, the chemical synapse has – compared with the electrical synapse – decisive advantages:

One-way connection: A chemical synapse is a one-way connection. Due to the fact that there is no direct electrical connection between the pre- and postsynaptic area, electrical pulses in the postsynaptic area cannot flash over to the presynaptic area.

Adjustability: There is a large number of different neurotransmitters that can also be released in various quantities in a synaptic cleft. There are neurotransmitters that stimulate the postsynaptic cell nucleus, and others that slow down such stimulation. Some synapses transfer a strongly stimulating signal, some only weakly stimulating ones. The adjustability varies a lot, and one of the central points in the examination of the learning ability of the brain is that the synapses are variable here, too. That is, over time they can form a stronger or weaker connection.

2.2.1.2 Dendrites collect all parts of information

Dendrites branch like trees from the cell nucleus of the neuron (which is called soma) and receive electrical signals from many different sources, which are then transferred into the nucleus of the cell. The amount of branching dendrites is also called the dendrite tree.

2.2.1.3 In the soma the weighted information is accumulated

After the cell nucleus (soma) has received plenty of activating (= stimulating) and inhibiting (= diminishing) signals by synapses or dendrites, the soma accumulates these signals. As soon as the accumulated signal exceeds a certain value (called the threshold value), the cell nucleus of the neuron activates an electrical pulse which is then transmitted to the neurons connected to the current one.

2.2.1.4 The axon transfers outgoing pulses

The pulse is transferred to other neurons by means of the axon. The axon is a long, slender extension of the soma. In an extreme case, an axon can stretch up to one meter (e.g. within the spinal cord). The axon is electrically isolated in order to achieve a better conduction of the electrical signal (we will return to this point later on) and it leads to dendrites, which transfer the information to, for example, other neurons. So now we are back at the beginning of our description of the neuron elements. An axon can, however, also transfer information to other kinds of cells in order to control them.


2.2.2 Electrochemical processes in the neuron and its components

After having pursued the path of an electrical signal from the dendrites via the synapses to the nucleus of the cell and from there via the axon into other dendrites, we now want to take a small step from biology towards technology. In doing so, a simplified introduction of the electrochemical information processing will be provided.

2.2.2.1 Neurons maintain electrical membrane potential

One fundamental aspect is the fact that compared to their environment the neurons show a difference in electrical charge, a potential. In the membrane (= envelope) of the neuron the charge is different from the charge on the outside. This difference in charge is a central concept that is important to understand the processes within the neuron. The difference is called membrane potential. The membrane potential, i.e., the difference in charge, is created by several kinds of charged atoms (ions), whose concentration varies within and outside of the neuron. If we penetrate the membrane from the inside outwards, we will find certain kinds of ions more often or less often than on the inside. This descent or ascent of concentration is called a concentration gradient.

Let us first take a look at the membrane potential in the resting state of the neuron, i.e., we assume that no electrical signals are received from the outside. In this case, the membrane potential is −70 mV. Since we have learned that this potential depends on the concentration gradients of various ions, there is of course the central question of how to maintain these concentration gradients: Normally, diffusion predominates and therefore each ion is eager to decrease concentration gradients and to spread out evenly. If this happened, the membrane potential would move towards 0 mV, so finally there would be no membrane potential anymore. Thus, the neuron actively maintains its membrane potential to be able to process information. How does this work?

The secret is the membrane itself, which is permeable to some ions, but not to others. To maintain the potential, various mechanisms are in progress at the same time:

Concentration gradient: As described above, the ions try to be as uniformly distributed as possible. If the concentration of an ion is higher on the inside of the neuron than on the outside, it will try to diffuse to the outside and vice versa. The positively charged ion K+ (potassium) occurs very frequently within the neuron but less frequently outside of the neuron, and therefore it slowly diffuses out through the neuron's membrane. But another group of negative ions, collectively called A−, remains within the neuron since the membrane is not permeable to them. Thus, the inside of the neuron becomes negatively charged.


Negative A− ions remain, positive K+ ions disappear, and so the inside of the cell becomes more negative. The result is another gradient.

Electrical gradient: The electrical gradient acts contrary to the concentration gradient. The intracellular charge is now very strong, therefore it attracts positive ions: K+ wants to get back into the cell.

If these two gradients were now left alone, they would eventually balance out, reach a steady state, and a membrane potential of −85 mV would develop. But we want to achieve a resting membrane potential of −70 mV, thus there seem to exist some disturbances which prevent this. Furthermore, there is another important ion, Na+ (sodium), for which the membrane is not very permeable but which, however, slowly pours through the membrane into the cell. As a result, the sodium is driven into the cell all the more: On the one hand, there is less sodium within the neuron than outside the neuron. On the other hand, sodium is positively charged but the interior of the cell has negative charge, which is a second reason for the sodium wanting to get into the cell.

Due to the low diffusion of sodium into the cell the intracellular sodium concentration increases. But at the same time the inside of the cell becomes less negative, so that K+ pours in more slowly (we can see that this is a complex mechanism where everything is influenced by everything). The sodium shifts the intracellular equilibrium from negative to less negative, compared with its environment. But even with these two ions a standstill with all gradients being balanced out could still be achieved. Now the last piece of the puzzle gets into the game: a "pump" (more precisely, the transporter protein Na+/K+-ATPase, which is driven by ATP) actively transports ions against the direction they actually want to take!

Sodium is actively pumped out of the cell, although it tries to get into the cell along the concentration gradient and the electrical gradient.

Potassium, however, diffuses strongly out of the cell, but is actively pumped back into it.

For this reason the pump is also called the sodium-potassium pump. The pump maintains the concentration gradient for the sodium as well as for the potassium, so that some sort of steady-state equilibrium is created and finally the resting potential is −70 mV as observed. All in all the membrane potential is maintained by the fact that the membrane is impermeable to some ions and other ions are actively pumped against the concentration and electrical gradients. Now that we know that each neuron has a membrane potential, we want to observe how a neuron receives and transmits signals.

2.2.2.2 The neuron is activated by changes in the membrane potential

Above we have learned that sodium and potassium can diffuse through the membrane – sodium slowly, potassium faster.


They move through channels within the membrane, the sodium and potassium channels. In addition to these permanently open channels responsible for diffusion and balanced by the sodium-potassium pump, there also exist channels that are not always open but which only respond "if required". Since the opening of these channels changes the concentration of ions within and outside of the membrane, it also changes the membrane potential.

These controllable channels are opened as soon as the accumulated received stimulus exceeds a certain threshold. For example, stimuli can be received from other neurons or have other causes. There exist, for example, specialized forms of neurons, the sensory cells, for which a light incidence could be such a stimulus. If the incoming amount of light exceeds the threshold, controllable channels are opened.

The said threshold (the threshold potential) lies at about −55 mV. As soon as the received stimuli reach this value, the neuron is activated and an electrical signal, an action potential, is initiated. Then this signal is transmitted to the cells connected to the observed neuron, i.e. the cells "listen" to the neuron. Now we want to take a closer look at the different stages of the action potential (Fig. 2.4):

Resting state: Only the permanently open sodium and potassium channels are permeable. The membrane potential is at −70 mV and actively kept there by the neuron.

Stimulus up to the threshold: A stimulus opens channels so that sodium can pour in. The intracellular charge becomes more positive. As soon as the membrane potential exceeds the threshold of −55 mV, the action potential is initiated by the opening of many sodium channels.

Depolarization: Sodium is pouring in. Remember: Sodium wants to pour into the cell because there is a lower intracellular than extracellular concentration of sodium. Additionally, the cell is dominated by a negative environment which attracts the positive sodium ions. This massive influx of sodium drastically increases the membrane potential – up to approx. +30 mV – which is the electrical pulse, i.e., the action potential.

Repolarization: Now the sodium channels are closed and the potassium channels are opened. The positively charged ions want to leave the positive interior of the cell. Additionally, the intracellular concentration is much higher than the extracellular one, which increases the efflux of ions even more. The interior of the cell is once again more negatively charged than the exterior.

Hyperpolarization: Sodium as well as potassium channels are closed again. At first the membrane potential is slightly more negative than the resting potential. This is due to the fact that the potassium channels close more slowly. As a result, (positively charged) potassium effuses because of its lower extracellular concentration. After a refractory period of 1–2 ms the resting state is re-established, so that the neuron can react to newly applied stimuli with an action potential. In simple terms, the refractory period is a mandatory break a neuron has to take in order to regenerate. The shorter this break is, the more often a neuron can fire per time.

Figure 2.4: Initiation of the action potential over time.

Then the resulting pulse is transmitted by the axon.

2.2.2.3 In the axon a pulse is conducted in a saltatory way

We have already learned that the axon is used to transmit the action potential across long distances (remember: you will find an illustration of a neuron including an axon in Fig. 2.3). The axon is a long, slender extension of the soma. In vertebrates it is normally coated by a myelin sheath that consists of Schwann cells (in the PNS) or oligodendrocytes (in the CNS)1, which insulate the axon very well from electrical activity. At a distance of 0.1–2 mm there are gaps between these cells, the so-called nodes of Ranvier. The said gaps appear where one insulating cell ends and the next one begins. It is obvious that at such a node the axon is less insulated.

Now you may assume that these less insulated nodes are a disadvantage of the axon – however, they are not. At the nodes, mass can be transferred between the intracellular and extracellular area, a transfer that is impossible at those parts of the axon which are situated between two nodes (internodes) and therefore insulated by the myelin sheath. This mass transfer permits the generation of signals similar to the generation of the action potential within the soma. The action potential is transferred as follows: It does not continuously travel along the axon but jumps from node to node. Thus, a series of depolarizations travels along the nodes of Ranvier. One action potential initiates the next one, and mostly even several nodes are active at the same time here. The pulse "jumping" from node to node is responsible for the name of this type of pulse conduction: saltatory conduction.

Obviously, the pulse will move faster if its jumps are larger. Axons with large internodes (2 mm) achieve a signal propagation speed of approx. 180 meters per second. However, the internodes cannot grow indefinitely, since the action potential to be transferred would fade too much before it reaches the next node. So the nodes have a task, too: to constantly amplify the signal. The cells receiving the action potential are attached to the end of the axon – often connected by dendrites and synapses. As already indicated above, action potentials are not only generated by information received by the dendrites from other neurons.

1 Schwann cells as well as oligodendrocytes are varieties of the glial cells. There are about 50 times more glial cells than neurons: They surround the neurons (glia = glue), insulate them from each other, provide energy, etc.


2.3 Receptor cells are modified neurons

Action potentials can also be generated by sensory information an organism receives from its environment through its sensory cells. Specialized receptor cells are able to perceive specific stimulus energies such as light, temperature and sound or the existence of certain molecules (like, for example, the sense of smell). This works because these sensory cells are actually modified neurons. They do not receive electrical signals via dendrites; rather, the existence of the stimulus that is specific for the receptor cell ensures that the ion channels open and an action potential is developed. This process of transforming stimulus energy into changes in the membrane potential is called sensory transduction. Usually, the stimulus energy itself is too weak to directly cause nerve signals. Therefore, the signals are amplified either during transduction or by means of the stimulus-conducting apparatus. The resulting action potential can be processed by other neurons and is then transmitted into the thalamus, which is, as we have already learned, a gateway to the cerebral cortex and can therefore reject sensory impressions according to current relevance and thus prevent an abundance of information from having to be managed.

2.3.1 There are different receptor cells for various types of perceptions

Primary receptors transmit their pulses directly to the nervous system. A good example for this is the sense of pain. Here, the stimulus intensity is proportional to the amplitude of the action potential. Technically, this is an amplitude modulation.

Secondary receptors, however, continuously transmit pulses. These pulses control the amount of the related neurotransmitter, which is responsible for transferring the stimulus. The stimulus in turn controls the frequency of the action potential of the receiving neuron. This process is a frequency modulation, an encoding of the stimulus, which makes it possible to better perceive the increase and decrease of a stimulus.

There can be individual receptor cells or cells forming complex sensory organs (e.g. eyes or ears). They can receive stimuli within the body (by means of the interoceptors) as well as stimuli outside of the body (by means of the exteroceptors).

After having outlined how information is received from the environment, it will be interesting to look at how the information is processed.


2.3.2 Information is processed on every level of the nervous system

There is no reason to believe that all received information is transmitted to the brain and processed there, and that the brain ensures that it is "output" in the form of motor pulses (the only thing an organism can actually do within its environment is to move). The information processing is entirely decentralized. In order to illustrate this principle, we want to take a look at some examples, which leads us again from the abstract to the fundamental in our hierarchy of information processing.

. It is certain that information is processed in the cerebrum, which is the most developed natural information processing structure.

. The midbrain and the thalamus, which serves – as we have already learned – as a gateway to the cerebral cortex, are situated much lower in the hierarchy. The filtering of information with respect to the current relevance executed by the midbrain is a very important method of information processing, too. But even the thalamus does not receive any stimuli from the outside that have not already been preprocessed. Now let us continue with the lowest level, the sensory cells.

. On the lowest level, i.e. at the receptor cells, the information is not only received and transferred but directly processed. One of the main aspects of this subject is to prevent the transmission of "continuous stimuli" to the central nervous system because of sensory adaptation: Due to continuous stimulation many receptor cells automatically become insensitive to stimuli. Thus, receptor cells are not a direct mapping of specific stimulus energy onto action potentials but depend on the past. Other sensors change their sensitivity according to the situation: There are taste receptors which respond more or less to the same stimulus according to the nutritional condition of the organism.

. Even before a stimulus reaches the receptor cells, information processing can already be executed by a preceding signal-carrying apparatus, for example in the form of amplification: The external and the internal ear have a specific shape to amplify the sound, which also allows – in association with the sensory cells of the sense of hearing – the sensory stimulus to increase only logarithmically with the intensity of the heard signal. On closer examination, this is necessary, since the sound pressure of the signals for which the ear is constructed can vary over a wide exponential range. Here, a logarithmic measurement is an advantage. Firstly, an overload is prevented, and secondly, the fact that the intensity measurement of intensive signals will be less precise doesn't matter much either. If a jet fighter is taking off next to you, small changes in the noise level can be ignored.

Just to get a feeling for sensory organs and information processing in the organism, we will briefly describe "usual" light sensing organs, i.e. organs often found in nature. For the third light sensing organ described below, the single lens eye, we will discuss the information processing in the eye.

2.3.3 An outline of common light sensing organs

For many organisms it turned out to be extremely useful to be able to perceive electromagnetic radiation in certain regions of the spectrum. Consequently, sensory organs have been developed which can detect such electromagnetic radiation; the wavelength range of the radiation perceivable by the human eye is called the visible range or simply light. The different wavelengths of this electromagnetic radiation are perceived by the human eye as different colors. The visible range of the electromagnetic radiation is different for each organism. Some organisms cannot see the colors (= wavelength ranges) we can see, others can even perceive additional wavelength ranges (e.g. in the UV range). Before we begin with the human being – in order to get a broader knowledge of the sense of sight – we briefly want to look at two organs of sight which, from an evolutionary point of view, have existed much longer than the human.

2.3.3.1 Compound eyes and pinhole eyes provide only either high temporal or high spatial resolution

Let us first take a look at the so-called compound eye (Fig. 2.5), which is, for example, common in insects and crustaceans. The compound eye consists of a great number of small, individual eyes. If we look at the compound eye from the outside, the individual eyes are clearly visible and arranged in a hexagonal pattern. Each individual eye has its own nerve fiber which is connected to the insect brain. Since the individual eyes can be distinguished, it is obvious that the number of pixels, i.e. the spatial resolution, of compound eyes must be very low and the image is blurred. But compound eyes have advantages, too, especially for fast-flying insects. Certain compound eyes process more than 300 images per second (to the human eye, however, movies with 25 images per second appear as a fluent motion).

Pinhole eyes are, for example, found in octopus species and work – as you can guess – similarly to a pinhole camera. A pinhole eye has a very small opening for light entry, which projects a sharp image onto the sensory cells behind. Thus, the spatial resolution is much higher than in the compound eye. But due to the very small opening for light entry the resulting image is less bright.


Figure 2.5: Compound eye of a robber fly

2.3.3.2 Single lens eyes combine the advantages of the other two eye types, but they are more complex

The light sensing organ common in vertebrates is the single lens eye. The resulting image is a sharp, high-resolution image of the environment at high or variable light intensity. On the other hand it is more complex. Similar to the pinhole eye, the light enters through an opening (pupil) and is projected onto a layer of sensory cells in the eye (retina). But in contrast to the pinhole eye, the size of the pupil can be adapted to the lighting conditions (by means of the iris muscle, which expands or contracts the pupil). These differences in pupil dilation require the image to be actively focused. Therefore, the single lens eye contains an additional adjustable lens.

2.3.3.3 The retina does not only receive information but is also responsible for information processing

The light signals falling on the eye are received by the retina and directly preprocessed by several layers of information-processing cells. We want to briefly discuss the different steps of this information processing, and in doing so, we follow the way of the information carried by the light:

Photoreceptors receive the light signal and cause action potentials (there are different receptors for different color components and light intensities). These receptors are the real light-receiving part of the retina and they are sensitive to such an extent that only one single photon falling on the retina can cause an action potential. Then several photoreceptors transmit their signals to one single

bipolar cell. This means that here the information has already been summarized. Finally, the now transformed light signal travels from several bipolar cells2 into

ganglion cells. Various bipolar cells can transmit their information to one ganglion cell. The higher the number of photoreceptors that affect the ganglion cell, the larger the field of perception, the receptive field, which covers the ganglions – and the less sharp is the image in the area of this ganglion cell. So the information is already reduced directly in the retina and the overall image is, for example, blurred in the peripheral field of vision. So far, we have learned about the information processing in the retina only as a top-down structure. Now we want to take a look at the

horizontal and amacrine cells. These cells are not connected from the front backwards but laterally. They allow the light signals to influence one another laterally directly during the information processing in the retina – a much more powerful method of information processing than compressing and blurring. When the horizontal cells are excited by a photoreceptor, they are able to excite other nearby photoreceptors and at the same time inhibit more distant bipolar cells and receptors. This ensures the clear perception of outlines and bright points. Amacrine cells can further intensify certain stimuli by distributing information from bipolar cells to several ganglion cells or by inhibiting ganglions.

These first steps of transmitting visual information to the brain show that information is processed from the first moment the information is received and, on the other hand, is processed in parallel within millions of information-processing cells. The system's power and resistance to errors is based upon this massive division of work.

2 There are different kinds of bipolar cells, as well, but to discuss all of them would go too far.

2.4 The amount of neurons in living organisms at different stages of development

An overview of different organisms and their neural capacity (in large part from [RD05]):

302 neurons are required by the nervous system of a nematode worm, which serves as a popular model organism in biology. Nematodes live in the soil and feed on bacteria.

10^4 neurons make an ant (to simplify matters we neglect the fact that some ant species can also have more or less efficient nervous systems). Due to the use of different attractants and odors, ants are able to engage in complex social behavior and form huge states with millions of individuals. If you regard such an ant state as an individual, it has a cognitive capacity similar to a chimpanzee or even a human.

With 10^5 neurons the nervous system of a fly can be constructed. A fly can evade an object in real time in three-dimensional space, it can land upon the ceiling upside down, and it has a considerable sensory system because of compound eyes, vibrissae, nerves at the end of its legs and much more. Thus, a fly has considerable differential and integral calculus in high dimensions implemented "in hardware". We all know that a fly is not easy to catch. Of course, the bodily functions are also controlled by neurons, but these should be ignored here.

With 0.8 · 10^6 neurons we have enough cerebral matter to create a honeybee. Honeybees build colonies and have amazing capabilities in the field of aerial reconnaissance and navigation.

4 · 10^6 neurons result in a mouse, and here the world of vertebrates already begins.

1.5 · 10^7 neurons are sufficient for a rat, an animal which is reputed to be extremely intelligent and which is often used to participate in a variety of intelligence tests representative for the animal world. Rats have an extraordinary sense of smell and orientation, and they also show social behavior. The brain of a frog can be positioned within the same dimension. The frog has a complex build with many functions, it can swim and has evolved complex behavior. A frog can continuously target the said fly by means of its eyes while jumping in three-dimensional space and catch it with its tongue with considerable probability.

5 · 10^7 neurons make a bat. The bat can navigate in total darkness through a room, with an accuracy of several centimeters, by using only its sense of hearing. It uses acoustic signals to localize self-camouflaging insects (e.g. some moths have a certain wing structure that reflects fewer sound waves, so that the echo will be small) and also eats its prey while flying.

1.6 · 10^8 neurons are required by the brain of a dog, companion of man for ages. Now take a look at another popular companion of man:

3 · 10^8 neurons can be found in a cat, which is about twice as many as in a dog. We know that cats are very elegant, patient carnivores that can show a variety of behaviors. By the way, an octopus can be positioned within the same magnitude. Only very few people know that, for example, in labyrinth orientation the octopus is vastly superior to the rat.

For 6 · 10^9 neurons you already get a chimpanzee, one of the animals being very similar to the human.

10^11 neurons make a human. Usually, the human has considerable cognitive capabilities, is able to speak, to abstract, to remember and to use tools as well as the knowledge of other humans to develop advanced technologies and manifold social structures.

With 2 · 10^11 neurons there are nervous systems having more neurons than the human nervous system. Here we should mention elephants and certain whale species.

Our state-of-the-art computers are not able to keep up with the aforementioned processing power of a fly. Recent research results suggest that the processes in nervous systems might be vastly more powerful than people thought until not long ago: Michaeva et al. describe a separate, synapse-integrated way of information processing [MBW+10]. Posterity will show if they are right.

2.5 Transition to technical neurons: neural networks are a caricature of biology

How do we change from biological neural networks to the technical ones? Through radical simplification. I want to briefly summarize the conclusions relevant for the technical part:

We have learned that the biological neurons are linked to each other in a weighted way and, when stimulated, they electrically transmit their signal via the axon. From the axon the signals are not directly transferred to the succeeding neurons; they first have to cross the synaptic cleft, where the signal is changed again by variable chemical processes. In the receiving neuron the various inputs that have been post-processed in the synaptic cleft are summarized or accumulated to one single pulse. Depending on how the neuron is stimulated by the cumulated input, the neuron itself emits a pulse or not – thus, the output is non-linear and not proportional to the cumulated input. Our brief summary corresponds exactly to the few elements of biological neural networks we want to take over into the technical approximation:

Vectorial input: The input of technical neurons consists of many components, therefore it is a vector. In nature a neuron receives pulses of 10^3 to 10^4 other neurons on average.

Scalar output: The output of a neuron is a scalar, which means that the output only consists of one component. Several scalar outputs in turn form the vectorial input of another neuron. This particularly means that somewhere in the neuron the various input components have to be summarized in such a way that only one component remains.

Synapses change input: In technical neural networks the inputs are preprocessed, too. They are multiplied by a number (the weight) – they are weighted. The set of such weights represents the information storage of a neural network – in both the biological original and the technical adaptation.

Accumulating the inputs: In biology, the inputs are summarized to a pulse according to the chemical change, i.e., they are accumulated – on the technical side this is often realized by the weighted sum, which we will get to know later on. This means that after accumulation we continue with only one value, a scalar, instead of a vector.

Non-linear characteristic: The output of our technical neurons is likewise not proportional to the input.

Adjustable weights: The weights weighting the inputs are variable, similar to the chemical processes at the synaptic cleft. This adds great dynamics to the network, because a large part of the "knowledge" of a neural network is saved in the weights and in the form and power of the chemical processes in a synaptic cleft.

So our current, only casually formulated and very simple neuron model receives a vectorial input x with components xi. These are multiplied by the appropriate weights wi and accumulated:

    ∑i wi xi.

The aforementioned term is called the weighted sum. Then the nonlinear mapping f defines the scalar output y:

    y = f(∑i wi xi).
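As a small illustration of this casual model, here is a minimal sketch in Java (it is not Snipe code; the class and method names are invented for this example, and the hyperbolic tangent is just one possible choice for the nonlinear mapping f):

// Minimal sketch of the casual neuron model: y = f(sum_i wi * xi),
// with f = tanh chosen as an example of a nonlinear mapping.
public final class SimpleNeuron {
    private final double[] w;                  // one weight per input component

    public SimpleNeuron(double[] weights) {
        this.w = weights.clone();
    }

    public double output(double[] x) {
        double weightedSum = 0.0;
        for (int i = 0; i < w.length; i++) {
            weightedSum += w[i] * x[i];        // accumulate the weighted inputs
        }
        return Math.tanh(weightedSum);         // nonlinear mapping f
    }
}

For instance, new SimpleNeuron(new double[]{0.5, -1.0}).output(new double[]{1.0, 2.0}) computes tanh(0.5 · 1.0 − 1.0 · 2.0) = tanh(−1.5) ≈ −0.91.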

After this transition we now want to specify our neuron model more precisely and add some odds and ends. Afterwards we will take a look at how the weights can be adjusted.

Exercises

Exercise 4. It is estimated that a human brain consists of approx. 10^11 nerve cells, each of which has about 10^3 to 10^4 synapses. For this exercise we assume 10^3 synapses per neuron. Let us further assume that a single synapse could save 4 bits of information. Naïvely calculated: How much storage capacity does the brain have? Note: The information which neuron is connected to which other neuron is also important.


Chapter 3

Components of artificial neural networks

Formal definitions and colloquial explanations of the components that realize the technical adaptations of biological neural networks. Initial descriptions of how to combine these components into a neural network.

This chapter contains the formal definitions for most of the neural network components used later in the text. After this chapter you will be able to read the individual chapters of this work without having to know the preceding ones (although this would be useful).

3.1 The concept of time in neural networks

In some definitions of this text we use the term time or the number of cycles of the neural network, respectively. Time is divided into discrete time steps:

Definition 3.1 (The concept of time). The current time (present time) is referred to as (t), the next time step as (t + 1), the preceding one as (t − 1). All other time steps are referred to analogously. If in the following chapters several mathematical variables (e.g. netj or oi) refer to a certain point in time, the notation will be, for example, netj(t − 1) or oi(t).

From a biological point of view this is, of course, not very plausible (in the human brain a neuron does not wait for another one), but it significantly simplifies the implementation.

3.2 Components of neural networks

A technical neural network consists of simple processing units, the neurons, and directed, weighted connections between those neurons. Here, the strength of a connection (or the connecting weight) between two neurons i and j is referred to as wi,j.1

Definition 3.2 (Neural network). A neural network is a sorted triple (N, V, w) with two sets N, V and a function w, where N is the set of neurons and V is a set {(i, j) | i, j ∈ N} whose elements are called connections between neuron i and neuron j. The function w : V → R defines the weights, where w((i, j)), the weight of the connection between neuron i and neuron j, is shortened to wi,j. Depending on the point of view it is either undefined or 0 for connections that do not exist in the network.

SNIPE: In Snipe, an instance of the class NeuralNetworkDescriptor is created in the first place. The descriptor object roughly outlines a class of neural networks, e.g. it defines the number of neuron layers in a neural network. In a second step, the descriptor object is used to instantiate an arbitrary number of NeuralNetwork objects. To get started with Snipe programming, the documentations of exactly these two classes are – in that order – the right thing to read. The presented layout involving descriptor and dependent neural networks is very reasonable from the implementation point of view, because it enables one to create and maintain general parameters of even very large sets of similar (but not necessarily equal) networks.

So the weights can be implemented in a square weight matrix W or, optionally, in a weight vector W, with the row number of the matrix indicating where the connection begins, and the column number of the matrix indicating which neuron is the target. Indeed, in this case the numeric 0 marks a non-existing connection. This matrix representation is also called a Hinton diagram.2

1 Note: In some of the cited literature i and j could be interchanged in wi,j. Here, a consistent standard does not exist. But in this text I try to use the notation I found more frequently and in the more significant citations.
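A tiny Java fragment illustrating this matrix convention (a sketch for this text, not Snipe code; the values are made up):

// Weight matrix W for a small network with 3 neurons.
// W[i][j] holds the weight wi,j of the connection from neuron i to neuron j;
// the value 0 marks a connection that does not exist.
double[][] W = {
    { 0.0, 0.7, -0.3 },   // connections starting at neuron 0
    { 0.0, 0.0,  1.2 },   // connections starting at neuron 1
    { 0.0, 0.0,  0.0 }    // neuron 2 has no outgoing connections
};
double w01 = W[0][1];     // the weight w0,1 from neuron 0 to neuron 1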

The neurons and connections comprise the following components and variables (I am following the path of the data within a neuron, which is, according to Fig. 3.1, in top-down direction):

3.2.1 Connections carry information that is processed by neurons

Data are transferred between neurons via connections, with the connecting weight being either excitatory or inhibitory. The definition of connections has already been included in the definition of the neural network.

SNIPE: Connection weights can be set using the method NeuralNetwork.setSynapse.

3.2.2 The propagation function converts vector inputs to scalar network inputs

Looking at a neuron j, we will usually find a lot of neurons with a connection to j, i.e. neurons which transfer their output to j.

2 Note that, here again, in some of the cited literature axes and rows could be interchanged. The published literature is not consistent here either.


Figure 3.1: Data processing of a neuron. The activation function of a neuron implies the threshold value. The figure shows the data flow: data input of other neurons → propagation function (often the weighted sum; transforms the outputs of other neurons to the net input) → network input → activation function (transforms the net input and sometimes the old activation to the new activation) → activation → output function (often the identity function; transforms the activation to the output for other neurons) → data output to other neurons.

For a neuron j the propagation function receives the outputs oi1, . . . , oin of other neurons i1, i2, . . . , in (which are connected to j), and transforms them, in consideration of the connecting weights wi,j, into the network input netj that can be further processed by the activation function. Thus, the network input is the result of the propagation function.

Definition 3.3 (Propagation function and network input). Let I = {i1, i2, . . . , in} be the set of neurons such that ∀z ∈ {1, . . . , n} : ∃wiz,j. Then the network input of j, called netj, is calculated by the propagation function fprop as follows:

    netj = fprop(oi1, . . . , oin, wi1,j, . . . , win,j)   (3.1)

Here the weighted sum is very popular: the multiplication of the output of each neuron i by wi,j, and the summation of the results:

    netj = ∑i∈I (oi · wi,j)   (3.2)
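A brief sketch of the weighted sum as a propagation function in Java (a fragment for illustration only, assuming the weight matrix convention from above; it is not taken from Snipe):

// Network input of neuron j according to equation (3.2):
// netj = sum over all neurons i of oi * wi,j.
// outputs[i] is oi; W[i][j] is wi,j; non-existing connections carry weight 0.
static double netInput(int j, double[] outputs, double[][] W) {
    double net = 0.0;
    for (int i = 0; i < outputs.length; i++) {
        net += outputs[i] * W[i][j];
    }
    return net;
}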

SNIPE: The propagation function in Snipe was implemented using the weighted sum.

3.2.3 The activation is the "switching status" of a neuron

Based on the model of nature, every neuron is, to a certain extent, at all times active, excited, or whatever you will call it. The reactions of the neurons to the input values depend on this activation state. The activation state indicates the extent of a neuron's activation and is often shortly referred to as activation. Its formal definition is included in the following definition of the activation function. But generally, it can be defined as follows:

Definition 3.4 (Activation state / activation in general). Let j be a neuron. The activation state aj, in short activation, is explicitly assigned to j, indicates the extent of the neuron's activity and results from the activation function.

SNIPE: It is possible to get and set activation states of neurons by using the methods getActivation or setActivation in the class NeuralNetwork.

3.2.4 Neurons get activated if the network input exceeds their threshold value

Near the threshold value, the activation function of a neuron reacts particularly sensitively. From the biological point of view the threshold value represents the threshold at which a neuron starts firing. The threshold value is also mostly included in the definition of the activation function, but generally the definition is the following:

Definition 3.5 (Threshold value in general). Let j be a neuron. The threshold value Θj is uniquely assigned to j and marks the position of the maximum gradient value of the activation function.

3.2.5 The activation function determines the activation of a neuron dependent on network input and threshold value

At a certain time – as we have already learned – the activation aj of a neuron j depends on the previous3 activation state of the neuron and the external input.

Definition 3.6 (Activation function and activation). Let j be a neuron. The activation function is defined as

    aj(t) = fact(netj(t), aj(t − 1), Θj).   (3.3)

It transforms the network input netj, as well as the previous activation state aj(t − 1), into a new activation state aj(t), with the threshold value Θ playing an important role, as already mentioned.

Unlike the other variables within the neural network (particularly unlike the ones defined so far), the activation function is often defined globally for all neurons, or at least for a set of neurons, and only the threshold values are different for each neuron. We should also keep in mind that the threshold values can be changed, for example by a learning procedure. So it can in particular become necessary to relate the threshold value to the time and to write, for instance, Θj as Θj(t) (but for reasons of clarity, I omitted this here). The activation function is also called the transfer function.

3 The previous activation is not always relevant for the current one – we will see examples for both variants.
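Expressed as code, Definition 3.6 is simply a function of the current net input, the previous activation and the threshold. The following Java sketch illustrates the idea (the interface and class names are invented here; they only loosely mirror what the Snipe note below generalizes as neuron behaviors):

// General form of equation (3.3): aj(t) = fact(netj(t), aj(t-1), Θj).
interface ActivationFunction {
    double activate(double netInput, double previousActivation, double threshold);
}

// Example: a binary threshold (Heaviside) activation that ignores the
// previous activation and fires exactly if the net input reaches the threshold.
class BinaryThreshold implements ActivationFunction {
    public double activate(double netInput, double previousActivation, double threshold) {
        return netInput >= threshold ? 1.0 : 0.0;
    }
}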

36 D. Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN)

Page 55: Brief Introduction to Neural Networks

dkriesel.com 3.2 Components of neural networks

SNIPE: In Snipe, activation functions are generalized to neuron behaviors. Such behaviors can represent just normal activation functions, or even incorporate internal states and dynamics. Corresponding parts of Snipe can be found in the package neuronbehavior, which also contains some of the activation functions introduced in the next section. The interface NeuronBehavior allows for the implementation of custom behaviors. Objects that inherit from this interface can be passed to a NeuralNetworkDescriptor instance. It is possible to define individual behaviors per neuron layer.

3.2.6 Common activation functions

The simplest activation function is the binary threshold function (Fig. 3.2), which can only take on two values (also referred to as Heaviside function). If the input is above a certain threshold, the function changes from one value to another, but otherwise remains constant. This implies that the function is not differentiable at the threshold and for the rest the derivative is 0. Due to this fact, backpropagation learning, for example, is impossible (as we will see later). Also very popular is the Fermi function or logistic function (Fig. 3.2)

    1 / (1 + e^(−x)),   (3.4)

which maps to the range of values of (0, 1), and the hyperbolic tangent (Fig. 3.2), which maps to (−1, 1). Both functions are differentiable. The Fermi function can be expanded by a temperature parameter T into the form

    1 / (1 + e^(−x/T)).   (3.5)

The smaller this parameter, the more it compresses the function on the x axis. Thus, one can arbitrarily approximate the Heaviside function. Incidentally, there exist activation functions which are not explicitly defined but depend on the input according to a random distribution (stochastic activation function).
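As a small companion sketch in plain Java (not the Snipe classes mentioned below), the Fermi function with temperature parameter T can be written as follows; the hyperbolic tangent is available directly as Math.tanh:

// Fermi (logistic) function with temperature parameter T, cf. equation (3.5);
// it maps to (0, 1). For T = 1 it reduces to the plain Fermi function (3.4);
// the smaller T is chosen, the closer the curve gets to the Heaviside function.
static double fermi(double x, double T) {
    return 1.0 / (1.0 + Math.exp(-x / T));
}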

A alternative to the hypberbolic tangentthat is really worth mentioning was sug-gested by Anguita et al. [APZ93], whohave been tired of the slowness of the work-stations back in 1993. Thinking abouthow to make neural network propagationsfaster, they quickly identified the approx-imation of the e-function used in the hy-perbolic tangent as one of the causes ofslowness. Consequently, they "engineered"an approximation to the hyperbolic tan-gent, just using two parabola pieces andtwo half-lines. At the price of deliveringa slightly smaller range of values than thehyperbolic tangent ([−0.96016; 0.96016] in-stead of [−1; 1]), dependent on what CPUone uses, it can be calculated 200 timesfaster because it just needs two multipli-cations and one addition. What’s more,it has some other advantages that will bementioned later.

SNIPE: The activation functions introduced here are implemented within the classes Fermi and TangensHyperbolicus, both of which are located in the package neuronbehavior. The fast hyperbolic tangent approximation is located within the class TangensHyperbolicusAnguita.


Figure 3.2: Various popular activation functions, from top to bottom: Heaviside or binary threshold function, Fermi function, hyperbolic tangent. The Fermi function was expanded by a temperature parameter. The original Fermi function is represented by dark colors, the temperature parameters of the modified Fermi functions are, ordered ascending by steepness, 1/2, 1/5, 1/10 and 1/25.

3.2.7 An output function may be used to process the activation once again

The output function of a neuron j calculates the values which are transferred to the other neurons connected to j. More formally:

Definition 3.7 (Output function). Let j be a neuron. The output function

fout(aj) = oj (3.6)

calculates the output value oj of the neuron j from its activation state aj.

Generally, the output function is defined globally, too. Often this function is the identity, i.e. the activation aj is directly output4:

fout(aj) = aj , so oj = aj (3.7)

Unless explicitly specified differently, we will use the identity as output function within this text.

3.2.8 Learning strategies adjust a network to fit our needs

Since we will address this subject later in detail and at first want to get to know the principles of neural network structures, I will only provide a brief and general definition here:

4 Other definitions of output functions may be useful if the range of values of the activation function is not sufficient.


Definition 3.8 (General learning rule). The learning strategy is an algorithm that can be used to change and thereby train the neural network, so that the network produces a desired output for a given input.

3.3 Network topologies

After we have become acquainted with the composition of the elements of a neural network, I want to give an overview of the usual topologies (= designs) of neural networks, i.e. to construct networks consisting of these elements. Every topology described in this text is illustrated by a map and its Hinton diagram so that the reader can immediately see the characteristics and apply them to other networks.

In the Hinton diagram the dotted weights are represented by light grey fields, the solid ones by dark grey fields. The input and output arrows, which were added for reasons of clarity, cannot be found in the Hinton diagram. In order to clarify that the connections are between the line neurons and the column neurons, I have inserted the small arrow in the upper-left cell.

SNIPE: Snipe is designed for realization of arbitrary network topologies. In this respect, Snipe defines different kinds of synapses depending on their source and their target. Any kind of synapse can separately be allowed or forbidden for a set of networks using the setAllowed methods in a NeuralNetworkDescriptor instance.

3.3.1 Feedforward networks consist of layers and connections towards each following layer

In this text feedforward networks (fig. 3.3 on the following page) are the networks we will first explore (even if we will use different topologies later). The neurons are grouped in the following layers: One input layer, n hidden processing layers (invisible from the outside, that's why the neurons are also referred to as hidden neurons) and one output layer. In a feedforward network each neuron in one layer has only directed connections to the neurons of the next layer (towards the output layer). In fig. 3.3 on the next page the connections permitted for a feedforward network are represented by solid lines. We will often be confronted with feedforward networks in which every neuron i is connected to all neurons of the next layer (these layers are called completely linked). To prevent naming conflicts the output neurons are often referred to as Ω.

Definition 3.9 (Feedforward network). The neuron layers of a feedforward network (fig. 3.3 on the following page) are clearly separated: One input layer, one output layer and one or more processing layers which are invisible from the outside (also called hidden layers). Connections are only permitted to neurons of the following layer.


Figure 3.3: A feedforward network with three layers: two input neurons, three hidden neurons and two output neurons. Characteristic for the Hinton diagram of completely linked feedforward networks is the formation of blocks above the diagonal.

3.3.1.1 Shortcut connections skip layers

Some feedforward networks permit the so-called shortcut connections (fig. 3.4 on the next page): connections that skip one or more levels. These connections may only be directed towards the output layer, too.

Definition 3.10 (Feedforward network with shortcut connections). Similar to the feedforward network, but the connections may not only be directed towards the next layer but also towards any other subsequent layer.

3.3.2 Recurrent networks have influence on themselves

Recurrence is defined as the process of a neuron influencing itself by any means or by any connection. Recurrent networks do not always have explicitly defined input or output neurons. Therefore in the figures I omitted all markings that concern this matter and only numbered the neurons.

3.3.2.1 Direct recurrences start and end at the same neuron

Some networks allow for neurons to be connected to themselves, which is called direct recurrence (or sometimes self-recurrence) (fig. 3.5 on the facing page). As a result, neurons inhibit and therefore strengthen themselves in order to reach their activation limits.


Figure 3.4: A feedforward network with shortcut connections, which are represented by solid lines. On the right side of the feedforward blocks new connections have been added to the Hinton diagram.

Figure 3.5: A network similar to a feedforward network with directly recurrent neurons. The direct recurrences are represented by solid lines and exactly correspond to the diagonal in the Hinton diagram matrix.

Definition 3.11 (Direct recurrence). Now we expand the feedforward network by connecting a neuron j to itself, with the weights of these connections being referred to as wj,j. In other words: the diagonal of the weight matrix W may be different from 0.


3.3.2.2 Indirect recurrences can influence their starting neuron only by making detours

If connections are allowed towards the input layer, they will be called indirect recurrences. Then a neuron j can use indirect forward connections to influence itself, for example, by influencing the neurons of the next layer and the neurons of this next layer influencing j (fig. 3.6).

Definition 3.12 (Indirect recurrence). Again our network is based on a feedforward network, now with additional connections between neurons and their preceding layer being allowed. Therefore, the part of W below the diagonal may be different from 0.

3.3.2.3 Lateral recurrences connect neurons within one layer

Connections between neurons within one layer are called lateral recurrences (fig. 3.7 on the facing page). Here, each neuron often inhibits the other neurons of the layer and strengthens itself. As a result only the strongest neuron becomes active (winner-takes-all scheme).

Definition 3.13 (Lateral recurrence). A laterally recurrent network permits connections within one layer.

3.3.3 Completely linked networks allow any possible connection

Completely linked networks permit connections between all neurons, except for direct recurrences. Furthermore, the connections must be symmetric (fig. 3.8 on the next page). A popular example are the self-organizing maps, which will be introduced in chapter 10.

Figure 3.6: A network similar to a feedforward network with indirectly recurrent neurons. The indirect recurrences are represented by solid lines. As we can see, connections to the preceding layers can exist here, too. The fields that are symmetric to the feedforward blocks in the Hinton diagram are now occupied.


Figure 3.7: A network similar to a feedforward network with laterally recurrent neurons. The direct recurrences are represented by solid lines. Here, recurrences only exist within the layer. In the Hinton diagram, filled squares are concentrated around the diagonal in the height of the feedforward blocks, but the diagonal is left uncovered.


Definition 3.14 (Complete interconnection). In this case, every neuron is always allowed to be connected to every other neuron – but as a result every neuron can become an input neuron. Therefore, direct recurrences normally cannot be applied here and clearly defined layers no longer exist. Thus, the matrix W may be unequal to 0 everywhere, except along its diagonal.

3.4 The bias neuron is a technical trick to consider threshold values as connection weights

By now we know that in many network paradigms neurons have a threshold value that indicates when a neuron becomes active. Thus, the threshold value is an activation function parameter of a neuron. From the biological point of view this sounds most plausible, but it is complicated to access the activation function at runtime in order to train the threshold value.

But threshold values Θj1, . . . , Θjn for neurons j1, j2, . . . , jn can also be realized as the connecting weight of a continuously firing neuron: For this purpose an additional bias neuron whose output value is always 1 is integrated in the network and connected to the neurons j1, j2, . . . , jn. These new connections get the weights −Θj1, . . . , −Θjn, i.e. they get the negative threshold values.


Figure 3.8: A completely linked network with symmetric connections and without direct recurrences. In the Hinton diagram only the diagonal is left blank.


Definition 3.15. A bias neuron is a neuron whose output value is always 1 and which is represented by a neuron labeled BIAS. It is used to represent neuron biases as connection weights, which enables any weight-training algorithm to train the biases at the same time.

Then the threshold value of the neurons j1, j2, . . . , jn is set to 0. Now the threshold values are implemented as connection weights (fig. 3.9 on page 46) and can directly be trained together with the connection weights, which considerably facilitates the learning process.

In other words: Instead of including the threshold value in the activation function, it is now included in the propagation function. Or even shorter: The threshold value is subtracted from the network input, i.e. it is part of the network input. More formally:

Let j1, j2, . . . , jn be neurons with threshold values Θj1, . . . , Θjn. By inserting a bias neuron whose output value is always 1, generating connections between the said bias neuron and the neurons j1, j2, . . . , jn and weighting these connections wBIAS,j1, . . . , wBIAS,jn with −Θj1, . . . , −Θjn, we can set Θj1 = . . . = Θjn = 0 and receive an equivalent neural network whose threshold values are realized by connection weights.
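As a small numerical illustration of this equivalence, the following plain-Java fragment (hypothetical names, not Snipe code) computes the activation of a neuron j once with an explicit threshold value and once with a bias neuron whose connection carries the weight −Θj and whose threshold is set to 0; both variants yield the same result.

public class BiasNeuronDemo {
    public static void main(String[] args) {
        double[] outputs = {0.3, 0.8};    // outputs o_i of the predecessor neurons
        double[] weights = {0.5, -0.2};   // weights w_{i,j} towards neuron j
        double theta = 0.1;               // threshold value of neuron j

        double net = 0.0;                 // weighted sum of the inputs
        for (int i = 0; i < outputs.length; i++)
            net += outputs[i] * weights[i];

        // Variant 1: threshold handled inside the activation function.
        double a1 = (net >= theta) ? 1.0 : 0.0;

        // Variant 2: bias neuron with constant output 1, weight -theta, threshold 0.
        double netWithBias = net + 1.0 * (-theta);
        double a2 = (netWithBias >= 0.0) ? 1.0 : 0.0;

        System.out.println(a1 + " equals " + a2);
    }
}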

Undoubtedly, the advantage of the bias neuron is the fact that it is much easier to implement it in the network. One disadvantage is that the representation of the network already becomes quite ugly with only a few neurons, let alone with a great number of them. By the way, a bias neuron is often referred to as on neuron.

From now on, the bias neuron is omitted for clarity in the following illustrations, but we know that it exists and that the threshold values can simply be treated as weights because of it.

SNIPE: In Snipe, a bias neuron was implemented instead of neuron-individual biases. The neuron index of the bias neuron is 0.

3.5 Representing neurons

We have already seen that we can either write its name or its threshold value into a neuron. Another useful representation, which we will use several times in the following, is to illustrate neurons according to their type of data processing. See fig. 3.10 for some examples without further explanation – the different types of neurons are explained as soon as we need them.

Figure 3.10: Different types of neurons that will appear in the following text.

3.6 Take care of the order in which neuron activations are calculated

For a neural network it is very important in which order the individual neurons receive and process the input and output the results. Here, we distinguish two model classes:

3.6.1 Synchronous activation

All neurons change their values synchronously, i.e. they simultaneously calculate network inputs, activation and output, and pass them on. Synchronous activation corresponds closest to its biological counterpart, but – if it is to be implemented in hardware – it is only useful on certain parallel computers and especially not for feedforward networks. This order of activation is the most generic and can be used with networks of arbitrary topology.


Figure 3.9: Two equivalent neural networks, one without bias neuron on the left, one with bias neuron on the right. The neuron threshold values can be found in the neurons, the connecting weights at the connections. Furthermore, I omitted the weights of the already existing connections (represented by dotted lines on the right side).

Definition 3.16 (Synchronous activation). All neurons of a network calculate network inputs at the same time by means of the propagation function, activation by means of the activation function and output by means of the output function. After that the activation cycle is complete.

SNIPE: When implementing in software, one could model this very general activation order by every time step calculating and caching every single network input, and after that calculating all activations. This is exactly how it is done in Snipe, because Snipe has to be able to realize arbitrary network topologies.
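A sketch of this two-phase procedure in plain Java (illustrative only, not the actual Snipe source; the weight-matrix representation and the binary threshold activation are assumptions made for the example):

public class SynchronousActivation {
    // One synchronous step: w[i][j] is the weight from neuron i to neuron j,
    // a holds the previous activations, theta is a common threshold value.
    public static double[] step(double[][] w, double[] a, double theta) {
        int n = a.length;
        double[] net = new double[n];
        for (int j = 0; j < n; j++)          // phase 1: cache all network inputs
            for (int i = 0; i < n; i++)
                net[j] += a[i] * w[i][j];
        double[] next = new double[n];
        for (int j = 0; j < n; j++)          // phase 2: compute all new activations
            next[j] = (net[j] >= theta) ? 1.0 : 0.0;
        return next;
    }
}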

3.6.2 Asynchronous activation

Here, the neurons do not change their values simultaneously but at different points of time. For this, there exist different orders, some of which I want to introduce in the following:

3.6.2.1 Random order

Definition 3.17 (Random order of activation). With random order of activation a neuron i is randomly chosen and its neti, ai and oi are updated. For n neurons a cycle is the n-fold execution of this step. Obviously, some neurons are repeatedly updated during one cycle, and others, however, not at all.

Apparently, this order of activation is not always useful.


3.6.2.2 Random permutation

With random permutation each neuron is chosen exactly once, but in random order, during one cycle.

Definition 3.18 (Random permutation). Initially, a permutation of the neurons is calculated randomly and therefore defines the order of activation. Then the neurons are successively processed in this order.

This order of activation is likewise rarely used because, firstly, the order is generally useless and, secondly, it is very time-consuming to compute a new permutation for every cycle. A Hopfield network (chapter 8) is a topology nominally having a random or a randomly permuted order of activation. But note that in practice, for the previously mentioned reasons, a fixed order of activation is preferred.

For all orders either the previous neuron activations at time t or, if already existing, the neuron activations at time t + 1, for which we are calculating the activations, can be taken as a starting point.

3.6.2.3 Topological order

Definition 3.19 (Topological activation). With topological order of activation the neurons are updated during one cycle and according to a fixed order. The order is defined by the network topology.

This procedure can only be considered for non-cyclic, i.e. non-recurrent, networks, since otherwise there is no order of activation. Thus, in feedforward networks (for which the procedure is very reasonable) the input neurons would be updated first, then the inner neurons and finally the output neurons. This may save us a lot of time: Given a synchronous activation order, a feedforward network with n layers of neurons would need n full propagation cycles in order to enable input data to have influence on the output of the network. Given the topological activation order, we just need one single propagation. However, not every network topology allows for finding a special activation order that enables saving time.
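For a strictly layered feedforward network, such a topological pass can be sketched as follows (plain Java; the layer-wise weight arrays are an assumption made only for this illustration): every neuron's activation is computed right after its network input, layer by layer, so one single pass through the network suffices.

public class TopologicalPropagation {
    // weights[l][i][j] is the weight from neuron i of layer l to neuron j of layer l+1.
    public static double[] propagate(double[] input, double[][][] weights) {
        double[] current = input;                        // activations of the input layer
        for (int l = 0; l < weights.length; l++) {
            double[] next = new double[weights[l][0].length];
            for (int j = 0; j < next.length; j++) {
                double net = 0.0;                        // network input of neuron j
                for (int i = 0; i < current.length; i++)
                    net += current[i] * weights[l][i][j];
                next[j] = Math.tanh(net);                // activation right after the net input
            }
            current = next;
        }
        return current;                                  // activations of the output layer
    }
}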

SNIPE: Those who want to use Snipe for implementing feedforward networks may save some calculation time by using the feature fastprop (mentioned within the documentation of the class NeuralNetworkDescriptor). Once fastprop is enabled, it will cause the data propagation to be carried out in a slightly different way. In the standard mode, all net inputs are calculated first, followed by all activations. In the fastprop mode, for every neuron, the activation is calculated right after the net input. The neuron values are calculated in ascending neuron index order. The neuron numbers are ascending from input to output layer, which provides us with the perfect topological activation order for feedforward networks.

3.6.2.4 Fixed orders of activation during implementation

Obviously, fixed orders of activation can be defined as well. Therefore, when implementing, for instance, feedforward networks it is very popular to determine the order of activation once according to the topology and to use this order without further verification at runtime. But this is not necessarily useful for networks that are capable of changing their topology.

3.7 Communication with the outside world: input and output of data in and from neural networks

Finally, let us take a look at the fact that, of course, many types of neural networks permit the input of data. Then these data are processed and can produce output. Let us, for example, regard the feedforward network shown in fig. 3.3 on page 40: It has two input neurons and two output neurons, which means that it also has two numerical inputs x1, x2 and outputs y1, y2. As a simplification we summarize the input and output components for n input or output neurons within the vectors x = (x1, x2, . . . , xn) and y = (y1, y2, . . . , yn).

Definition 3.20 (Input vector). A network with n input neurons needs n inputs x1, x2, . . . , xn. They are considered as input vector x = (x1, x2, . . . , xn). As a consequence, the input dimension is referred to as n. Data is put into a neural network by using the components of the input vector as network inputs of the input neurons.

Definition 3.21 (Output vector). A network with m output neurons provides m outputs y1, y2, . . . , ym. They are regarded as output vector y = (y1, y2, . . . , ym). Thus, the output dimension is referred to as m. Data is output by a neural network by the output neurons adopting the components of the output vector in their output values.

SNIPE: In order to propagate data through a NeuralNetwork instance, the propagate method is used. It receives the input vector as an array of doubles, and returns the output vector in the same way.

Now we have defined and closely examined the basic components of neural networks – without having seen a network in action. But first we will continue with theoretical explanations and generally describe how a neural network could learn.

Exercises

Exercise 5. Would it be useful (from your point of view) to insert one bias neuron in each layer of a layer-based network, such as a feedforward network? Discuss this in relation to the representation and implementation of the network. Will the result of the network change?

Exercise 6. Show for the Fermi function f(x) as well as for the hyperbolic tangent tanh(x), that their derivatives can be expressed by the respective functions themselves so that the two statements

1. f ′(x) = f(x) · (1− f(x)) and

2. tanh′(x) = 1 − tanh^2(x)


are true.


Chapter 4

Fundamentals on learning and training samples

Approaches and thoughts of how to teach machines. Should neural networks be corrected? Should they only be encouraged? Or should they even learn without any help? Thoughts about what we want to change during the learning procedure and how we will change it, about the measurement of errors and when we have learned enough.

As written above, the most interesting characteristic of neural networks is their capability to familiarize with problems by means of training and, after sufficient training, to be able to solve unknown problems of the same class. This approach is referred to as generalization. Before introducing specific learning procedures, I want to propose some basic principles about the learning procedure in this chapter.

4.1 There are different paradigms of learning

Learning is a comprehensive term. A learning system changes itself in order to adapt to e.g. environmental changes. A neural network could learn from many things but, of course, there will always be the question of how to implement it. In principle, a neural network changes when its components are changing, as we have learned above. Theoretically, a neural network could learn by

1. developing new connections,

2. deleting existing connections,

3. changing connecting weights,

4. changing the threshold values of neurons,

5. varying one or more of the three neuron functions (remember: activation function, propagation function and output function),

6. developing new neurons, or

7. deleting existing neurons (and so, of course, existing connections).


As mentioned above, we assume the change in weight to be the most common procedure. Furthermore, deletion of connections can be realized by additionally taking care that a connection is no longer trained when it is set to 0. Moreover, we can develop further connections by setting a non-existing connection (with the value 0 in the connection matrix) to a value different from 0. As for the modification of threshold values I refer to the possibility of implementing them as weights (section 3.4). Thus, we perform any of the first four of the learning paradigms by just training synaptic weights.

The change of neuron functions is difficult to implement, not very intuitive and not exactly biologically motivated. Therefore it is not very popular and I will omit this topic here. The possibilities to develop or delete neurons do not only provide well adjusted weights during the training of a neural network, but also optimize the network topology. Thus, they attract a growing interest and are often realized by using evolutionary procedures. But, since we accept that a large part of learning possibilities can already be covered by changes in weight, they are also not the subject matter of this text (however, it is planned to extend the text towards those aspects of training).

SNIPE: Methods of the class NeuralNetwork allow for changes in connection weights, and addition and removal of both connections and neurons. Methods in NeuralNetworkDescriptor enable the change of neuron behaviors, respectively activation functions, per layer.

Thus, we let our neural network learn by modifying the connecting weights according to rules that can be formulated as algorithms. Therefore a learning procedure is always an algorithm that can easily be implemented by means of a programming language. Later in the text I will assume that the definition of the term desired output which is worth learning is known (and I will define formally what a training pattern is) and that we have a training set of learning samples. Let a training set be defined as follows:

Definition 4.1 (Training set). A training set (named P) is a set of training patterns, which we use to train our neural net.

I will now introduce the three essential paradigms of learning by presenting the differences between their respective training sets.

4.1.1 Unsupervised learning provides input patterns to the network, but no learning aids

Unsupervised learning is the biologically most plausible method, but is not suitable for all problems. Only the input patterns are given; the network tries to identify similar patterns and to classify them into similar categories.

Definition 4.2 (Unsupervised learning). The training set only consists of input patterns, the network tries by itself to detect similarities and to generate pattern classes.


Here I want to refer again to the popular example of Kohonen’s self-organising maps (chapter 10).

4.1.2 Reinforcement learning methods provide feedback to the network, whether it behaves well or badly

In reinforcement learning the network receives a logical or a real value after completion of a sequence, which defines whether the result is right or wrong. Intuitively it is clear that this procedure should be more effective than unsupervised learning since the network receives specific criteria for problem-solving.

Definition 4.3 (Reinforcement learning). The training set consists of input patterns, after completion of a sequence a value is returned to the network indicating whether the result was right or wrong and, possibly, how right or wrong it was.

4.1.3 Supervised learning methods provide training patterns together with appropriate desired outputs

In supervised learning the training set consists of input patterns as well as their correct results in the form of the precise activation of all output neurons. Thus, for each training pattern that is fed into the network the output, for instance, can directly be compared with the correct solution and the network weights can be changed according to their difference. The objective is to change the weights to the effect that the network can not only associate input and output patterns independently after the training, but can provide plausible results to unknown, similar input patterns, i.e. it generalises.

Definition 4.4 (Supervised learning). The training set consists of input patterns with correct results so that a precise error vector1 can be returned to the network.

This learning procedure is not always biologically plausible, but it is extremely effective and therefore very practicable.

At first we want to look at the supervised learning procedures in general, which, in this text, correspond to the following steps (sketched in code after the list):

Entering the input pattern (activation of input neurons),

Forward propagation of the input by the network, generation of the output,

Comparing the output with the desired output (teaching input), provides error vector (difference vector),

Corrections of the network are calculated based on the error vector,

Corrections are applied.

1 The term error vector will be defined in section 4.2, where mathematical formalisation of learning is discussed.
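The following plain-Java skeleton only fixes this order of operations; propagate and applyCorrections are hypothetical placeholders (not part of any specific library) to be filled in by a concrete network and learning rule.

public abstract class SupervisedTrainer {

    abstract double[] propagate(double[] pattern);                          // forward propagation
    abstract void applyCorrections(double[] errorVector, double[] pattern); // concrete learning rule

    // One online learning epoch: the weights are adjusted after every single sample.
    public void trainOneEpoch(double[][] patterns, double[][] teachingInputs) {
        for (int p = 0; p < patterns.length; p++) {
            double[] y = propagate(patterns[p]);              // enter pattern, generate output
            double[] errorVector = new double[y.length];
            for (int k = 0; k < y.length; k++)                // compare with the teaching input
                errorVector[k] = teachingInputs[p][k] - y[k];
            applyCorrections(errorVector, patterns[p]);       // calculate and apply corrections
        }
    }
}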


4.1.4 Offline or online learning?

It must be noted that learning can be offline (a set of training samples is presented, then the weights are changed, the total error is calculated by means of an error function operation or simply accumulated – see also section 4.4) or online (after every sample presented the weights are changed). Both procedures have advantages and disadvantages, which will be discussed in the learning procedures section if necessary. Offline training procedures are also called batch training procedures since a batch of results is corrected all at once. Such a training section of a whole batch of training samples including the related change in weight values is called epoch.

Definition 4.5 (Offline learning). Several training patterns are entered into the network at once, the errors are accumulated and it learns for all patterns at the same time.

Definition 4.6 (Online learning). The network learns directly from the errors of each training sample.

4.1.5 Questions you should answer before learning

The application of such schemes certainly requires preliminary thoughts about some questions, which I want to introduce now as a check list and, if possible, answer them in the course of this text:

. Where does the learning input come from and in what form?

. How must the weights be modified to allow fast and reliable learning?

. How can the success of a learning process be measured in an objective way?

. Is it possible to determine the "best" learning procedure?

. Is it possible to predict if a learning procedure terminates, i.e. whether it will reach an optimal state after a finite time or if it, for example, will oscillate between different states?

. How can the learned patterns be stored in the network?

. Is it possible to avoid that newly learned patterns destroy previously learned associations (the so-called stability/plasticity dilemma)?

We will see that all these questions cannot be generally answered but that they have to be discussed for each learning procedure and each network topology individually.

4.2 Training patterns and teaching input

Before we get to know our first learning rule, we need to introduce the teaching input. In this case of supervised learning we assume a training set consisting of training patterns and the corresponding correct output values we want to see at the output neurons after the training. While the network has not finished training, i.e. as long as it is generating wrong outputs, these output values are referred to as teaching input, and that for each neuron individually. Thus, for a neuron j with the incorrect output oj, tj is the teaching input, which means it is the correct or desired output for a training pattern p.

Definition 4.7 (Training patterns). A training pattern is an input vector p with the components p1, p2, . . . , pn whose desired output is known. By entering the training pattern into the network we receive an output that can be compared with the teaching input, which is the desired output. The set of training patterns is called P. It contains a finite number of ordered pairs (p, t) of training patterns with corresponding desired output.

Training patterns are often simply called patterns, that is why they are referred to as p. In the literature as well as in this text they are called synonymously patterns, training samples etc.

Definition 4.8 (Teaching input). Let j be an output neuron. The teaching input tj is the desired and correct value j should output after the input of a certain training pattern. Analogously to the vector p the teaching inputs t1, t2, . . . , tn of the neurons can also be combined into a vector t. t always refers to a specific training pattern p and is, as already mentioned, contained in the set P of the training patterns.

SNIPE: Classes that are relevant for training data are located in the package training. The class TrainingSampleLesson allows for storage of training patterns and teaching inputs, as well as simple preprocessing of the training data.

Definition 4.9 (Error vector). For several output neurons Ω1, Ω2, . . . , Ωn the difference between output vector and teaching input under a training input p

Ep = (t1 − y1, . . . , tn − yn)

is referred to as error vector, sometimes it is also called difference vector. Depending on whether you are learning offline or online, the difference vector refers to a specific training pattern, or to the error of a set of training patterns which is normalized in a certain way.

Now I want to briefly summarize the vectors we have defined so far. There is the

input vector x, which can be entered into the neural network. Depending on the type of network being used the neural network will output an

output vector y. Basically, the

training sample p is nothing more than an input vector. We only use it for training purposes because we know the corresponding

teaching input t which is nothing more than the desired output vector to the training sample. The

error vector Ep is the difference between the teaching input t and the actual output y.


So, what x and y are for the general network operation are p and t for the network training – and during training we try to bring y as close to t as possible. One piece of advice concerning notation: We referred to the output values of a neuron i as oi. Thus, the output of an output neuron Ω is called oΩ. But the output values of a network are referred to as yΩ. Certainly, these network outputs are only neuron outputs, too, but they are outputs of output neurons. In this respect

yΩ = oΩ

is true.

4.3 Using training samples

We have seen how we can learn in principle and which steps are required to do so. Now we should take a look at the selection of training data and the learning curve. After successful learning it is particularly interesting whether the network has only memorized – i.e. whether it can use our training samples to quite exactly produce the right output but to provide wrong answers for all other problems of the same class.

Suppose that we want the network to train a mapping R2 → B1 and for that purpose use the training samples from fig. 4.1: Then there could be a chance that, finally, the network will exactly mark the colored areas around the training samples with the output 1 (fig. 4.1, top), and otherwise will output 0. Thus, it has sufficient storage capacity to concentrate on the six training samples with the output 1. This implies an oversized network with too much free storage capacity.

Figure 4.1: Visualization of training results of the same training set on networks with a capacity being too high (top), correct (middle) or too low (bottom).



On the other hand a network could have insufficient capacity (fig. 4.1, bottom) – this rough presentation of input data does not correspond to the good generalization performance we desire. Thus, we have to find the balance (fig. 4.1, middle).

4.3.1 It is useful to divide the set of training samples

An often proposed solution for these problems is to divide the training set into

. one training set really used to train,

. and one verification set to test our progress

– provided that there are enough training samples. The usual division relations are, for instance, 70% for training data and 30% for verification data (randomly chosen). We can finish the training when the network provides good results on the training data as well as on the verification data.

SNIPE: The method splitLesson within the class TrainingSampleLesson allows for splitting a TrainingSampleLesson with respect to a given ratio.
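Such a 70/30 split can also be coded by hand in a few lines; the following plain-Java sketch shuffles sample indices and cuts them at the given ratio (the class and method names are chosen freely and are independent of Snipe's splitLesson).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SampleSplit {
    // Splits the indices 0..n-1 randomly into a training part (e.g. ratio = 0.7)
    // and a verification part containing the remaining indices.
    public static List<List<Integer>> split(int n, double ratio) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices);                        // random selection of samples
        int cut = (int) Math.round(n * ratio);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, cut))); // training indices
        parts.add(new ArrayList<>(indices.subList(cut, n))); // verification indices
        return parts;
    }
}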

But note: If the verification data provide poor results, do not modify the network structure until these data provide good results – otherwise you run the risk of tailoring the network to the verification data. This means that these data are included in the training, even if they are not used explicitly for the training. The solution is a third set of validation data used only for validation after a supposedly successful training.

By training fewer patterns, we obviously withhold information from the network and risk worsening the learning performance. But this text is not about 100% exact reproduction of given samples but about successful generalization and approximation of a whole function – for which it can definitely be useful to train less information into the network.

4.3.2 Order of pattern presentation

You can find different strategies to choose the order of pattern presentation: If patterns are presented in random sequence, there is no guarantee that the patterns are learned equally well (however, this is the standard method). Always the same sequence of patterns, on the other hand, causes the patterns to be memorized when using recurrent networks (later, we will learn more about this type of networks). A random permutation would solve both problems, but it is – as already mentioned – very time-consuming to calculate such a permutation.

SNIPE: The method shuffleSamples located in the class TrainingSampleLesson permutes a lesson.


4.4 Learning curve and error measurement

The learning curve indicates the progress of the error, which can be determined in various ways. The motivation to create a learning curve is that such a curve can indicate whether the network is progressing or not. For this, the error should be normalized, i.e. represent a distance measure between the correct and the current output of the network. For example, we can take the same pattern-specific, squared error with a prefactor, which we are also going to use to derive the backpropagation of error (let Ω be output neurons and O the set of output neurons):

Errp = 1/2 · ∑Ω∈O (tΩ − yΩ)^2 (4.1)

Definition 4.10 (Specific error). The specific error Errp is based on a single training sample, which means it is generated online.

Additionally, the root mean square (abbreviated: RMS) and the Euclidean distance are often used.

The Euclidean distance (generalization of the theorem of Pythagoras) is useful for lower dimensions where we can still visualize its usefulness.

Definition 4.11 (Euclidean distance). The Euclidean distance between two vectors t and y is defined as

Errp = √( ∑Ω∈O (tΩ − yΩ)^2 ). (4.2)

Generally, the root mean square is the more commonly used measure since it considers extreme outliers to a greater extent.

Definition 4.12 (Root mean square). The root mean square of two vectors t and y is defined as

Errp = √( ∑Ω∈O (tΩ − yΩ)^2 / |O| ). (4.3)

As for offline learning, the total error in the course of one training epoch is interesting and useful, too:

Err = ∑p∈P Errp (4.4)

Definition 4.13 (Total error). The total error Err is based on all training samples, that means it is generated offline.

Analogously we can generate a total RMS and a total Euclidean distance in the course of a whole epoch. Of course, it is possible to use other types of error measurement. To get used to further error measurement methods, I suggest having a look into the technical report of Prechelt [Pre94]. In this report, both error measurement methods and sample problems are discussed (this is why there will be a similar suggestion during the discussion of exemplary problems).
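Equations (4.1) to (4.4) translate almost literally into code; a plain-Java sketch follows, with class and method names chosen for illustration only (unrelated to Snipe's ErrorMeasurement).

public class ErrorMeasures {

    // Specific squared error of eq. (4.1): 1/2 times the sum of (t - y)^2 over all output neurons.
    public static double specificError(double[] t, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < t.length; i++) sum += (t[i] - y[i]) * (t[i] - y[i]);
        return 0.5 * sum;
    }

    // Euclidean distance of eq. (4.2).
    public static double euclidean(double[] t, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < t.length; i++) sum += (t[i] - y[i]) * (t[i] - y[i]);
        return Math.sqrt(sum);
    }

    // Root mean square of eq. (4.3).
    public static double rms(double[] t, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < t.length; i++) sum += (t[i] - y[i]) * (t[i] - y[i]);
        return Math.sqrt(sum / t.length);
    }

    // Total error of eq. (4.4): sum of the specific errors over all training patterns.
    public static double totalError(double[][] teachingInputs, double[][] outputs) {
        double sum = 0.0;
        for (int p = 0; p < teachingInputs.length; p++)
            sum += specificError(teachingInputs[p], outputs[p]);
        return sum;
    }
}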

SNIPE: There are several static methods representing different methods of error measurement implemented in the class ErrorMeasurement.

Depending on our method of error measurement our learning curve certainly changes, too. A perfect learning curve looks like a negative exponential function, that means it is proportional to e−t (fig. 4.2 on the following page). Thus, the representation of the learning curve can be illustrated by means of a logarithmic scale (fig. 4.2, second diagram from the bottom) – with the said scaling combination a descending line implies an exponential descent of the error.

With the network doing a good job, the problems being not too difficult and the logarithmic representation of Err you can see – metaphorically speaking – a descending line that often forms "spikes" at the bottom – here, we reach the limit of the 64-bit resolution of our computer and our network has actually learned the optimum of what it is capable of learning.

Typical learning curves can show a few flat areas as well, i.e. they can show some steps, which is no sign of a malfunctioning learning process. As we can also see in fig. 4.2, a well-suited representation can make any slightly decreasing learning curve look good – so just be cautious when reading the literature.

4.4.1 When do we stop learning?

Now, the big question is: When do we stop learning? Generally, the training is stopped when the user in front of the learning computer "thinks" the error was small enough. Indeed, there is no easy answer and thus I can once again only give you something to think about, which, however, depends on a more objective view on the comparison of several learning curves.

Confidence in the results, for example, is boosted when the network always reaches nearly the same final error rate for different random initializations – so repeated initialization and training will provide a more objective result.

On the other hand, it can be possible that a curve descending fast in the beginning can, after a longer time of learning, be overtaken by another curve: This can indicate that either the learning rate of the worse curve was too high or the worse curve itself simply got stuck in a local minimum, but was the first to find it.

Remember: Larger error values are worse than the small ones.

But, in any case, note: Many people only generate a learning curve in respect of the training data (and then they are surprised that only a few things will work) – but for reasons of objectivity and clarity it should not be forgotten to plot the verification data on a second learning curve, which generally provides values that are slightly worse and with stronger oscillation. But with good generalization the curve can decrease, too.

When the network eventually begins to memorize the samples, the shape of the learning curve can provide an indication: If the learning curve of the verification samples is suddenly and rapidly rising while the learning curve of the training data is continuously falling, this could indicate memorizing and a generalization getting poorer and poorer. At this point it could be decided whether the network has already learned well enough at the next point of the two curves, and maybe the final point of learning is to be applied here (this procedure is called early stopping).


Figure 4.2: All four illustrations show the same (idealized, because very smooth) learning curve, plotted as error over training epochs. Note the alternating logarithmic and linear scalings! Also note the small "inaccurate spikes" visible in the sharp bend of the curve in the first and second diagram from bottom.



Once again I want to remind you that these are all acting as indicators and that one should not draw hard if-then conclusions from them.

4.5 Gradient optimization procedures

In order to establish the mathematical basis for some of the following learning procedures I want to explain briefly what is meant by gradient descent: the backpropagation of error learning procedure, for example, involves this mathematical basis and thus inherits the advantages and disadvantages of the gradient descent.

Gradient descent procedures are generally used where we want to maximize or minimize n-dimensional functions. For clarity the illustration (fig. 4.3 on the next page) shows only two dimensions, but in principle there is no limit to the number of dimensions.

The gradient is a vector g that is defined for any differentiable point of a function, that points from this point exactly towards the steepest ascent and indicates the gradient in this direction by means of its norm |g|. Thus, the gradient is a generalization of the derivative for multi-dimensional functions. Accordingly, the negative gradient −g exactly points towards the steepest descent. The gradient operator ∇ is referred to as nabla operator, the overall notation of the gradient g of the point (x, y) of a two-dimensional function f being g(x, y) = ∇f(x, y).

Definition 4.14 (Gradient). Let g be a gradient. Then g is a vector with n components that is defined for any point of a (differentiable) n-dimensional function f(x1, x2, . . . , xn). The gradient operator notation is defined as

g(x1, x2, . . . , xn) = ∇f(x1, x2, . . . , xn).

g points from any point of f towards the steepest ascent from this point, with |g| corresponding to the degree of this ascent.

Gradient descent means going downhill in small steps from any starting point of our function against the gradient g (which means, vividly speaking, in the direction in which a ball would roll from the starting point), with the size of the steps being proportional to |g| (the steeper the descent, the longer the steps). Therefore, we move slowly on a flat plateau, and on a steep ascent we run downhill rapidly. If we came into a valley, we would – depending on the size of our steps – jump over it or we would return into the valley across the opposite hillside in order to come closer and closer to the deepest point of the valley by walking back and forth, similar to our ball moving within a round bowl.


Figure 4.3: Visualization of the gradient descent on a two-dimensional error function. We move forward in the opposite direction of g, i.e. with the steepest descent towards the lowest point, with the step width being proportional to |g| (the steeper the descent, the faster the steps). On the left the area is shown in 3D, on the right the steps over the contour lines are shown in 2D. Here it is obvious how a movement is made in the opposite direction of g towards the minimum of the function and continuously slows down proportionally to |g|. Source: http://webster.fhs-hagenberg.ac.at/staff/sdreisei/Teaching/WS2001-2002/PatternClassification/graddescent.pdf

Definition 4.15 (Gradient descent). Let f be an n-dimensional function and s = (s1, s2, . . . , sn) the given starting point. Gradient descent means going from f(s) against the direction of g, i.e. towards −g with steps of the size of |g| towards smaller and smaller values of f.
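A tiny numerical illustration of this definition in plain Java; the function f(x, y) = x^2 + y^2 and the step-size factor are assumptions made only for this example.

public class GradientDescentDemo {
    public static void main(String[] args) {
        // f(x, y) = x^2 + y^2 has the gradient g(x, y) = (2x, 2y) and its minimum at (0, 0).
        double x = 3.0, y = -2.0;   // starting point s
        double eta = 0.1;           // factor making the step size proportional to |g|
        for (int step = 0; step < 100; step++) {
            double gx = 2 * x;      // gradient components at the current point
            double gy = 2 * y;
            x -= eta * gx;          // move against the gradient, i.e. towards -g
            y -= eta * gy;
        }
        System.out.println("approximate minimum at (" + x + ", " + y + ")");
    }
}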

Gradient descent procedures are by no means error-free optimization procedures (as we will see in the following sections) – however, they still work well on many problems, which makes them an optimization paradigm that is frequently used. Anyway, let us have a look at their potential disadvantages so we can keep them in mind a bit.

4.5.1 Gradient procedures incorporate several problems

As already implied in section 4.5, the gradient descent (and therefore the backpropagation) is promising but not foolproof. One problem is that the result does not always reveal if an error has occurred.

4.5.1.1 Often, gradient descents converge to suboptimal minima

Every gradient descent procedure can, for example, get stuck within a local minimum (part a of fig. 4.4 on the facing page).


Figure 4.4: Possible errors during a gradient descent: a) Detecting bad minima, b) Quasi-standstillwith small gradient, c) Oscillation in canyons, d) Leaving good minima.

This problem increases with the size of the error surface, and there is no universal solution. In reality, one cannot know whether the optimal minimum has been reached, and a training is considered successful if an acceptable minimum is found.

4.5.1.2 Flat plateaus on the error surface may cause training slowness

When passing a flat plateau, for instance, the gradient also becomes negligibly small because there is hardly a descent (part b of fig. 4.4), which requires many further steps. A hypothetically possible gradient of 0 would completely stop the descent.

4.5.1.3 Even if good minima are reached, they may be left afterwards

On the other hand the gradient is very large at a steep slope so that large steps can be made and a good minimum can possibly be missed (part d of fig. 4.4).

4.5.1.4 Steep canyons in the error surface may cause oscillations

A sudden alternation from one very strong negative gradient to a very strong positive one can even result in oscillation (part c of fig. 4.4). In nature, such an error does not occur very often so that we can think about the possibilities b and d.


4.6 Exemplary problems allow for testing self-coded learning strategies

We looked at learning from the formal point of view – not much yet but a little. Now it is time to look at a few exemplary problems you can later use to test implemented networks and learning rules.

4.6.1 Boolean functions

A popular example is the one that did not work in the nineteen-sixties: the XOR function (B2 → B1). We need a hidden neuron layer, which we have discussed in detail. Thus, we need at least two neurons in the inner layer. Let the activation function in all layers (except in the input layer, of course) be the hyperbolic tangent. Trivially, we now expect the outputs 1.0 or −1.0, depending on whether the function XOR outputs 1 or 0 – and exactly here is where the first beginner’s mistake occurs.

For outputs close to 1 or −1, i.e. close to the limits of the hyperbolic tangent (or in case of the Fermi function 0 or 1), we need very large network inputs. The only chance to reach these network inputs are large weights, which have to be learned: The learning process is greatly prolonged. Therefore it is wiser to enter the teaching inputs 0.9 or −0.9 as desired outputs or to be satisfied when the network outputs those values instead of 1 and −1.

i1 i2 i3 Ω
0  0  0  1
0  0  1  0
0  1  0  0
0  1  1  1
1  0  0  0
1  0  1  1
1  1  0  1
1  1  1  0

Table 4.1: Illustration of the parity function with three inputs.

Other favourite examples for singlelayer perceptrons are the boolean functions AND and OR.

4.6.2 The parity function

The parity function maps a set of bits to 1 or 0, depending on whether an even number of input bits is set to 1 or not. Basically, this is the function Bn → B1. It is characterized by easy learnability up to approx. n = 3 (shown in table 4.1), but the learning effort rapidly increases from n = 4. The reader may create a score table for the 2-bit parity function. What is conspicuous?
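For larger n the truth table can be generated mechanically; here is a plain-Java sketch that reproduces table 4.1 for n = 3 (the even-number-of-ones convention is taken from the text above).

public class ParityTable {
    public static void main(String[] args) {
        int n = 3;                                          // number of input bits
        for (int pattern = 0; pattern < (1 << n); pattern++) {
            // Output 1 if the number of set bits is even, otherwise 0.
            int output = (Integer.bitCount(pattern) % 2 == 0) ? 1 : 0;
            for (int bit = n - 1; bit >= 0; bit--)
                System.out.print(((pattern >> bit) & 1) + " ");
            System.out.println("-> " + output);
        }
    }
}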

4.6.3 The 2-spiral problem

As a training sample for a function let us take two spirals coiled into each other (fig. 4.5 on the facing page) with the function certainly representing a mapping R2 → B1. One of the spirals is assigned to the output value 1, the other spiral to 0. Here, memorizing does not help. The network has to understand the mapping itself. This example can be solved by means of an MLP, too.


Figure 4.5: Illustration of the training samples of the 2-spiral problem


4.6.4 The checkerboard problem

We again create a two-dimensional function of the form R2 → B1 and specify checkered training samples (fig. 4.6) with one colored field representing 1 and all the rest of them representing 0. The difficulty increases proportionally to the size of the function: While a 3×3 field is easy to learn, the larger fields are more difficult (here we eventually use methods that are more suitable for this kind of problems than the MLP).

Figure 4.6: Illustration of training samples for the checkerboard problem

The 2-spiral problem is very similar to the checkerboard problem, only that, mathematically speaking, the first problem is using polar coordinates instead of Cartesian coordinates. I just want to introduce as an example one last trivial case: the identity.

4.6.5 The identity function

By using linear activation functions the identity mapping from R¹ to R¹ (of course only within the parameters of the used activation function) is no problem for the network, but we put some obstacles in its way by using our sigmoid functions, so that it would be difficult for the network to learn the identity. Just try it for the fun of it.

Now, it is time to have a look at our first mathematical learning rule.

4.6.6 There are lots of other exemplary problems

For lots and lots of further exemplary problems, I want to recommend the technical report written by Prechelt [Pre94], which has also been named in the sections about error measurement procedures.

4.7 The Hebbian learning rule is the basis for most other learning rules

In 1949, Donald O. Hebb formulated the Hebbian rule [Heb49], which is the basis for most of the more complicated learning rules we will discuss in this text. We distinguish between the original form and the more general form, which is a kind of principle for other learning rules.

4.7.1 Original rule

Definition 4.16 (Hebbian rule). "If neuron j receives an input from neuron i and if both neurons are strongly active at the same time, then increase the weight wi,j (i.e. the strength of the connection between i and j)." Mathematically speaking, the rule is:

∆wi,j ∼ ηoiaj (4.5)

with ∆wi,j being the change in weight from i to j, which is proportional to the following factors:

. the output oi of the predecessor neuron i, as well as

. the activation aj of the successor neuron j,

. a constant η, i.e. the learning rate, which will be discussed in section 5.4.3.

The changes in weight ∆wi,j are simply added to the weight wi,j.

Why am I speaking twice about activation, but in the formula I am using oi and aj, i.e. the output of neuron i and the activation of neuron j? Remember that the identity is often used as output function and therefore ai and oi of a neuron are often the same. Besides, Hebb postulated his rule long before the specification of technical neurons. Considering that this learning rule was preferably used with binary activations, it is clear that with the possible activations (1, 0) the weights will either increase or remain constant. Sooner or later they would go ad infinitum, since they can only be corrected "upwards" when an error occurs. This can be compensated by using the activations (−1, 1)². Thus, the weights are decreased when the activation of the predecessor neuron dissents from the one of the successor neuron, otherwise they are increased.

² But that is no longer the "original version" of the Hebbian rule.
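A minimal sketch of the original rule (reading relation 4.5 as an update with equality, which is my own simplification) illustrates this behaviour:

// Sketch: Hebbian learning steps for a single connection, Delta w = eta * o_i * a_j.
public class HebbianStep {
    static double hebbianUpdate(double w, double eta, double oi, double aj) {
        return w + eta * oi * aj;   // the change in weight is simply added to the weight
    }

    public static void main(String[] args) {
        double eta = 0.1, w = 0.0;
        for (int step = 0; step < 5; step++) w = hebbianUpdate(w, eta, 1.0, 1.0);
        System.out.println("activations from {0, 1}:  w = " + w);  // weights only grow or stay constant
        w = 0.0;
        for (int step = 0; step < 5; step++) w = hebbianUpdate(w, eta, 1.0, -1.0);
        System.out.println("activations from {-1, 1}: w = " + w);  // dissenting activations decrease the weight
    }
}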


4.7.2 Generalized form

Most of the learning rules discussed in the following are a specialization of the mathematically more general form [MR86] of the Hebbian rule.

Definition 4.17 (Hebbian rule, more general). The generalized form of the Hebbian rule only specifies the proportionality of the change in weight to the product of two undefined functions, but with defined input values.

∆wi,j = η · h(oi, wi,j) · g(aj , tj) (4.6)

Thus, the product of the functions

. g(aj , tj) and

. h(oi, wi,j)

. as well as the constant learning rate η

results in the change in weight. As you can see, h receives the output of the predecessor cell oi as well as the weight from predecessor to successor wi,j, while g expects the actual and desired activation of the successor aj and tj (here t stands for the aforementioned teaching input). As already mentioned, g and h are not specified in this general definition. Therefore, we will now return to the path of specialization we discussed before equation 4.6. After we have had a short picture of what a learning rule could look like, and of our thoughts about learning itself, we will be introduced to our first network paradigm including the learning procedure.

Exercises

Exercise 7. Calculate the average value µ and the standard deviation σ for the following data points.

p1 = (2, 2, 2)
p2 = (3, 3, 3)
p3 = (4, 4, 4)
p4 = (6, 0, 0)
p5 = (0, 6, 0)
p6 = (0, 0, 6)


Part II

Supervised learning network paradigms


Chapter 5

The perceptron, backpropagation and its variants

A classic among the neural networks. If we talk about a neural network, then in the majority of cases we speak about a perceptron or a variation of it.

Perceptrons are multilayer networks without recurrence and with fixed input and output layers. Description of a perceptron, its limits and extensions that should avoid the limitations. Derivation of learning procedures and discussion of their problems.

As already mentioned in the history of neural networks, the perceptron was described by Frank Rosenblatt in 1958 [Ros58]. Initially, Rosenblatt defined the already discussed weighted sum and a non-linear activation function as components of the perceptron.

There is no established definition for a perceptron, but most of the time the term is used to describe a feedforward network with shortcut connections. This network has a layer of scanner neurons (retina) with statically weighted connections to the following layer, which is called the input layer (fig. 5.1); the weights of all other layers are allowed to be changed. All neurons subordinate to the retina are pattern detectors. Here we initially use a binary perceptron with every output neuron having exactly two possible output values (e.g. 0, 1 or −1, 1). Thus, a binary threshold function is used as activation function, depending on the threshold value Θ of the output neuron.

In a way, the binary activation function represents an IF query which can also be negated by means of negative weights. The perceptron can thus be used to accomplish true logical information processing.

Whether this method is reasonable is another matter – of course, this is not the easiest way to achieve Boolean logic. I just want to illustrate that perceptrons can be used as simple logical components and that, theoretically speaking, any Boolean function can be realized by means of perceptrons being connected in series or interconnected in a sophisticated way.


Figure 5.1: Architecture of a perceptron with one layer of variable connections in different views. The solid-drawn weight layer in the two illustrations on the bottom can be trained.
Left side: Example of scanning information in the eye.
Right side, upper part: Drawing of the same example with indicated fixed-weight layer using the defined designs of the functional descriptions for neurons.
Right side, lower part: Without indicated fixed-weight layer, with the name of each neuron corresponding to our convention. The fixed-weight layer will no longer be taken into account in the course of this work.


But we will see that this is not possible without connecting them serially. Before providing the definition of the perceptron, I want to define some types of neurons used in this chapter.

Definition 5.1 (Input neuron). An input neuron is an identity neuron. It exactly forwards the information received. Thus, it represents the identity function, which is indicated by its symbol; the input neuron is therefore drawn as a circle containing this identity symbol.

Definition 5.2 (Information processing neuron). Information processing neurons somehow process the input information, i.e. do not represent the identity function. A binary neuron sums up all inputs by using the weighted sum as propagation function, which we want to illustrate by the sign Σ. The activation function of the neuron is then the binary threshold function, which can be illustrated by L|H. This leads us to the complete depiction of information processing neurons: a neuron symbol containing Σ and L|H. Other neurons that use the weighted sum as propagation function but the activation function hyperbolic tangent or Fermi function, or a separately defined activation function fact, are represented analogously (Σ combined with Tanh, Fermi or fact). These neurons are also referred to as Fermi neurons or Tanh neurons.

Now that we know the components of a perceptron we should be able to define it.

Definition 5.3 (Perceptron). The perceptron (fig. 5.1) is¹ a feedforward network containing a retina that is used only for data acquisition and which has fixed-weighted connections with the first neuron layer (input layer). The fixed-weight layer is followed by at least one trainable weight layer. One neuron layer is completely linked with the following layer. The first layer of the perceptron consists of the input neurons defined above.

A feedforward network often contains shortcuts, which does not exactly correspond to the original description and therefore is not included in the definition. We can see that the retina is not included in the lower part of fig. 5.1. As a matter of fact, the first neuron layer is often understood (simplified and sufficient for this method) as input layer, because this layer only forwards the input values. The retina itself and the static weights behind it are no longer mentioned or displayed, since they do not process information in any case. So, the depiction of a perceptron starts with the input neurons.

¹ It may confuse some readers that I claim that there is no definition of a perceptron but then define the perceptron in the following section. I therefore suggest keeping my definition in the back of your mind and just taking it for granted in the course of this work.


SNIPE: The methods setSettingsTopologyFeedForward and the variation -WithShortcuts in a NeuralNetworkDescriptor instance apply settings to a descriptor which are appropriate for feedforward networks or feedforward networks with shortcuts. The respective kinds of connections are allowed, all others are not, and fastprop is activated.

5.1 The singlelayer perceptron provides only one trainable weight layer

Here, connections with trainable weights go from the input layer to an output neuron Ω, which returns the information whether the pattern entered at the input neurons was recognized or not. Thus, a singlelayer perceptron (abbreviated SLP) has only one level of trainable weights (fig. 5.1).

Definition 5.4 (Singlelayer perceptron). A singlelayer perceptron (SLP) is a perceptron having only one layer of variable weights and one layer of output neurons Ω. The technical view of an SLP is shown in fig. 5.2.

Certainly, the existence of several output neurons Ω1, Ω2, . . . , Ωn does not considerably change the concept of the perceptron (fig. 5.3): A perceptron with several output neurons can also be regarded as several different perceptrons with the same input.


Figure 5.2: A singlelayer perceptron with two input neurons and one output neuron. The network returns the output by means of the arrow leaving the network. The trainable layer of weights is situated in the center (labeled). As a reminder, the bias neuron is again included here. Although the weight wBIAS,Ω is a normal weight and also treated like this, I have represented it by a dotted line – which significantly increases the clarity of larger networks. In future, the bias neuron will no longer be included.


Figure 5.3: Singlelayer perceptron with several output neurons



Figure 5.4: Two singlelayer perceptrons for Boolean functions. The upper singlelayer perceptron realizes an AND, the lower one realizes an OR. The activation function of the information processing neuron is the binary threshold function. Where available, the threshold values are written into the neurons.

The Boolean functions AND and OR shown in fig. 5.4 are trivial examples that can easily be composed.
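As a small illustration (my own), the two SLPs can be evaluated directly, using unit weights and the threshold values 1.5 for AND and 0.5 for OR as in fig. 5.4:

// Sketch: the two singlelayer perceptrons of fig. 5.4, with weights of 1 and a binary
// threshold activation. The threshold value decides whether AND or OR is realized.
public class BooleanSLP {
    static int slp(int i1, int i2, double threshold) {
        double net = 1.0 * i1 + 1.0 * i2;       // weighted sum with both weights = 1
        return net >= threshold ? 1 : 0;        // binary threshold function
    }

    public static void main(String[] args) {
        for (int i1 = 0; i1 <= 1; i1++)
            for (int i2 = 0; i2 <= 1; i2++)
                System.out.printf("%d %d   AND: %d   OR: %d%n",
                        i1, i2, slp(i1, i2, 1.5), slp(i1, i2, 0.5));
    }
}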

Now we want to know how to train a singlelayer perceptron. We will therefore at first take a look at the perceptron learning algorithm and then we will look at the delta rule.

5.1.1 Perceptron learning algorithm and convergence theorem

The original perceptron learning algorithm with binary neuron activation function is described in alg. 1. It has been proven that the algorithm converges in finite time – so in finite time the perceptron can learn anything it can represent (perceptron convergence theorem, [Ros62]). But please do not get your hopes up too soon! What the perceptron is capable of representing will be explored later.

During the exploration of linear separability of problems we will cover the fact that at least the singlelayer perceptron unfortunately cannot represent a lot of problems.

5.1.2 The delta rule as a gradient based learning strategy for SLPs

In the following we deviate from our binary threshold value as activation function, because at least for backpropagation of error we need, as you will see, a differentiable or even a semi-linear activation function. For the delta rule that follows (derived, like backpropagation, in [MR86]) this is not always necessary, but useful. This fact, however, will also be pointed out in the appropriate part of this work. Compared with the aforementioned perceptron learning algorithm, the delta rule has the advantage of being suitable for non-binary activation functions and, when far away from the learning target, of learning automatically faster.


 1: while ∃p ∈ P and error too large do
 2:   Input p into the network, calculate output y        {P: set of training patterns}
 3:   for all output neurons Ω do
 4:     if yΩ = tΩ then
 5:       Output is okay, no correction of weights
 6:     else
 7:       if yΩ = 0 then
 8:         for all input neurons i do
 9:           wi,Ω := wi,Ω + oi                            {increase weight towards Ω by oi}
10:         end for
11:       end if
12:       if yΩ = 1 then
13:         for all input neurons i do
14:           wi,Ω := wi,Ω − oi                            {decrease weight towards Ω by oi}
15:         end for
16:       end if
17:     end if
18:   end for
19: end while

Algorithm 1: Perceptron learning algorithm. The perceptron learning algorithm reduces the weights to output neurons that return 1 instead of 0, and in the inverse case increases weights.
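A direct Java transcription of algorithm 1 might look as follows. The bias input (which realizes the threshold), the AND training set and all identifiers are my own illustrative choices, not part of the original text:

// Sketch: perceptron learning algorithm (alg. 1) for a single output neuron Omega.
public class PerceptronLearning {
    static int output(double[] w, int[] o) {               // binary threshold at 0; the actual
        double net = 0;                                     // threshold is encoded in the bias weight
        for (int i = 0; i < w.length; i++) net += w[i] * o[i];
        return net > 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        int[][] p = {{1,0,0}, {1,0,1}, {1,1,0}, {1,1,1}};   // patterns, first component = bias input 1
        int[]   t = {0, 0, 0, 1};                           // teaching inputs (here: AND)
        double[] w = new double[3];                         // weights towards Omega

        boolean errors = true;
        while (errors) {                                    // while there is a p with wrong output
            errors = false;
            for (int n = 0; n < p.length; n++) {
                int y = output(w, p[n]);
                if (y == t[n]) continue;                    // output okay, no correction of weights
                errors = true;
                for (int i = 0; i < w.length; i++)
                    if (y == 0) w[i] += p[n][i];            // increase weight towards Omega by o_i
                    else        w[i] -= p[n][i];            // decrease weight towards Omega by o_i
            }
        }
        System.out.println(java.util.Arrays.toString(w));   // a separating weight vector for AND
    }
}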


Suppose that we have a singlelayer perceptron with randomly set weights which we want to teach a function by means of training samples. The set of these training samples is called P. It contains, as already defined, the pairs (p, t) of the training samples p and the associated teaching input t. I also want to remind you that

. x is the input vector and

. y is the output vector of a neural network,

. output neurons are referred to as Ω1, Ω2, . . . , Ω|O|,

. i is the input and

. o is the output of a neuron.

Additionally, we defined that

. the error vector Ep represents the difference (t − y) under a certain training sample p.

. Furthermore, let O be the set of output neurons and

. I be the set of input neurons.

Another naming convention shall be that, for example, for an output o and a teaching input t an additional index p may be set in order to indicate that these values are pattern-specific. Sometimes this will considerably enhance clarity.

Now our learning target will certainly be that for all training samples the output y of the network is approximately the desired output t, i.e. formally it is true that

∀p : y ≈ t   or   ∀p : Ep ≈ 0.

This means we first have to understand the total error Err as a function of the weights: The total error increases or decreases depending on how we change the weights.

Definition 5.5 (Error function). The error function

Err : W → R

regards the set² of weights W as a vector and maps the values onto the normalized output error (normalized because otherwise not all errors can be mapped onto one single e ∈ R to perform a gradient descent). It is obvious that a specific error function Errp(W) can analogously be generated for a single pattern p.

² Following the tradition of the literature, I previously defined W as a weight matrix. I am aware of this conflict, but it should not bother us here.



Figure 5.5: Exemplary error surface of a neural network with two trainable connections w1 and w2. Generally, neural networks have more than two connections, but this would have made the illustration too complex. And most of the time the error surface is too craggy, which complicates the search for the minimum.

As already shown in section 4.5, gradient descent procedures calculate the gradient of an arbitrary but finite-dimensional function (here: of the error function Err(W)) and move down against the direction of the gradient until a minimum is reached. Err(W) is defined on the set of all weights which we here regard as the vector W. So we try to decrease or to minimize the error by simply tweaking the weights – thus one receives information about how to change the weights (the change in all weights is referred to as ∆W) by calculating the gradient ∇Err(W) of the error function Err(W):

∆W ∼ −∇Err(W ). (5.1)

Due to this relation there is a proportionality constant η for which equality holds (η will soon get another meaning and a real practical use beyond the mere meaning of a proportionality constant; I just ask the reader to be patient for a while):

∆W = −η∇Err(W ). (5.2)

To simplify further analysis, we now rewrite the gradient of the error function according to all weights as a usual partial derivative according to a single weight wi,Ω (the only variable weights exist between the input layer and the output neuron Ω). Thus, we tweak every single weight and observe how the error function changes, i.e. we derive the error function according to a weight wi,Ω and obtain the value ∆wi,Ω of how to change this weight:

\Delta w_{i,\Omega} = -\eta \frac{\partial Err(W)}{\partial w_{i,\Omega}}. \qquad (5.3)

Now the following question arises: How is our error function defined exactly? It is not good if many results are far away from the desired ones; the error function should then provide large values – on the other hand, it is similarly bad if many results are close to the desired ones but there exists an extremely far outlying result. The squared distance between the output vector y and the teaching input t appears adequate to our needs. It provides the error Errp that is specific for a training sample p over the output of all output neurons Ω:

Err_p(W) = \frac{1}{2} \sum_{\Omega \in O} (t_{p,\Omega} - y_{p,\Omega})^2. \qquad (5.4)

Thus, we calculate the squared difference of the components of the vectors t and y, given the pattern p, and sum up these squares. The summation of the specific errors Errp(W) of all patterns p then yields the definition of the error Err and therefore the definition of the error function Err(W):

Err(W) = \sum_{p \in P} Err_p(W) \qquad (5.5)
       = \frac{1}{2} \overbrace{\sum_{p \in P}}^{\text{sum over all } p} \underbrace{\sum_{\Omega \in O} (t_{p,\Omega} - y_{p,\Omega})^2}_{\text{sum over all } \Omega}. \qquad (5.6)

The observant reader will certainly wonder where the factor 1/2 in equation 5.4 suddenly came from and why there is no root in the equation, as this formula looks very similar to the Euclidean distance. Both facts result from simple pragmatics: Our intention is to minimize the error. Because the root function is monotonically increasing, the position of the minimum does not depend on it, so we can simply omit it for reasons of calculation and implementation effort, since we do not need it for minimization. Similarly, it does not matter if the term to be minimized is divided by 2: Therefore I am allowed to multiply by 1/2. This is just done so that it cancels with a 2 in the course of our calculation.
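As a tiny sketch (my own illustration), the specific error of equation 5.4 is straightforward to compute:

// Sketch: specific error Err_p of equation 5.4, i.e. half the summed squared
// difference between teaching input and output over all output neurons.
public class SpecificError {
    static double errP(double[] t, double[] y) {
        double sum = 0;
        for (int omega = 0; omega < y.length; omega++) {
            double diff = t[omega] - y[omega];
            sum += diff * diff;
        }
        return 0.5 * sum;
    }

    public static void main(String[] args) {
        // e.g. teaching input (1, 0) and output (0.8, 0.3) yield 0.5 * (0.04 + 0.09) = 0.065
        System.out.println(errP(new double[]{1.0, 0.0}, new double[]{0.8, 0.3}));
    }
}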

Now we want to continue deriving the delta rule for linear activation functions. We have already discussed that we tweak the individual weights wi,Ω a bit and see how the error Err(W) is changing – which corresponds to the derivative of the error function Err(W) according to the very same weight wi,Ω. This derivative corresponds to the sum of the derivatives of all specific errors Errp according to this weight (since the total error Err(W) results from the sum of the specific errors):

\Delta w_{i,\Omega} = -\eta \frac{\partial Err(W)}{\partial w_{i,\Omega}} \qquad (5.7)
                    = \sum_{p \in P} -\eta \frac{\partial Err_p(W)}{\partial w_{i,\Omega}}. \qquad (5.8)

Once again I want to think about the question of how a neural network processes data. Basically, the data is only transferred through a function, the result of the function is sent through another one, and so on. If we ignore the output function, the path of the neuron outputs oi1 and oi2, which the neurons i1 and i2 entered into a neuron Ω, initially is the propagation function (here weighted sum), from which the network input is going to be received. This is then sent through the activation function of the neuron Ω so that we receive the output of this neuron, which is at the same time a component of the output vector y:

net_\Omega \to f_\text{act}(net_\Omega) = o_\Omega = y_\Omega.

As we can see, this output results from many nested functions:

oΩ = fact(netΩ) (5.9)

= fact(oi1 · wi1,Ω + oi2 · wi2,Ω). (5.10)

It is clear that we could break down the output into the single input neurons (this is unnecessary here, since they do not process information in an SLP). Thus, we want to calculate the derivatives of equation 5.8 and, due to the nested functions, we can apply the chain rule to factorize the derivative ∂Errp(W)/∂wi,Ω in equation 5.8:

\frac{\partial Err_p(W)}{\partial w_{i,\Omega}} = \frac{\partial Err_p(W)}{\partial o_{p,\Omega}} \cdot \frac{\partial o_{p,\Omega}}{\partial w_{i,\Omega}}. \qquad (5.11)

Let us take a look at the first multiplicative factor of the above equation 5.11, which represents the derivative of the specific error Errp(W) according to the output, i.e. the change of the error Errp with an output op,Ω: The examination of Errp (equation 5.4) clearly shows that this change is exactly the difference between teaching input and output (tp,Ω − op,Ω) (remember: since Ω is an output neuron, op,Ω = yp,Ω). The closer the output is to the teaching input, the smaller is the specific error. Thus we can replace one by the other. This difference is also called δp,Ω (which is the reason for the name delta rule):

\frac{\partial Err_p(W)}{\partial w_{i,\Omega}} = -(t_{p,\Omega} - o_{p,\Omega}) \cdot \frac{\partial o_{p,\Omega}}{\partial w_{i,\Omega}} \qquad (5.12)
 = -\delta_{p,\Omega} \cdot \frac{\partial o_{p,\Omega}}{\partial w_{i,\Omega}} \qquad (5.13)

The second multiplicative factor of equation 5.11 and of the following one is the derivative of the output specific to the pattern p of the neuron Ω according to the weight wi,Ω. So how does op,Ω change when the weight from i to Ω is changed? Due to the requirement at the beginning of the derivation, we only have a linear activation function fact, therefore we can just as well look at the change of the network input when wi,Ω is changing:

\frac{\partial Err_p(W)}{\partial w_{i,\Omega}} = -\delta_{p,\Omega} \cdot \frac{\partial \sum_{i \in I}(o_{p,i} w_{i,\Omega})}{\partial w_{i,\Omega}}. \qquad (5.14)

The resulting derivative \frac{\partial \sum_{i \in I}(o_{p,i} w_{i,\Omega})}{\partial w_{i,\Omega}} can now be simplified: the function \sum_{i \in I}(o_{p,i} w_{i,\Omega}) to be derived consists of many summands, and only the summand o_{p,i} w_{i,\Omega} contains the variable wi,Ω, according to which we derive. Thus, \frac{\partial \sum_{i \in I}(o_{p,i} w_{i,\Omega})}{\partial w_{i,\Omega}} = o_{p,i} and therefore:

\frac{\partial Err_p(W)}{\partial w_{i,\Omega}} = -\delta_{p,\Omega} \cdot o_{p,i} \qquad (5.15)
 = -o_{p,i} \cdot \delta_{p,\Omega}. \qquad (5.16)

We insert this in equation 5.8, which results in our modification rule for a weight wi,Ω:

\Delta w_{i,\Omega} = \eta \cdot \sum_{p \in P} o_{p,i} \cdot \delta_{p,\Omega}. \qquad (5.17)

However: From the very beginning the derivation has been intended as an offline rule by means of the question of how to add the errors of all patterns and how to learn them after all patterns have been represented. Although this approach is mathematically correct, the implementation is far more time-consuming and, as we will see later in this chapter, partially needs a lot of computational effort during training.

The "online-learning version" of the delta rule simply omits the summation and learning is realized immediately after the presentation of each pattern; this also simplifies the notation (which is no longer necessarily related to a pattern p):

∆wi,Ω = η · oi · δΩ. (5.18)

This version of the delta rule shall be used for the following definition:

Definition 5.6 (Delta rule). If we determine, analogously to the aforementioned derivation, that the function h of the Hebbian theory (equation 4.6) only provides the output oi of the predecessor neuron i, and if the function g is the difference between the desired activation tΩ and the actual activation aΩ, we will receive the delta rule, also known as Widrow-Hoff rule:

∆wi,Ω = η · oi · (tΩ − aΩ) = ηoiδΩ (5.19)

If we use the desired output (instead of the activation) as teaching input, and therefore the output function of the output neurons does not represent an identity, we obtain

∆wi,Ω = η · oi · (tΩ − oΩ) = ηoiδΩ (5.20)

and δΩ then corresponds to the difference between tΩ and oΩ.

In the case of the delta rule, the change of all weights to an output neuron Ω is proportional

. to the difference between the current activation or output aΩ or oΩ and the corresponding teaching input tΩ. We want to refer to this factor as δΩ, which is also called "Delta".

Apparently the delta rule only applies for SLPs, since the formula is always related to the teaching input, and there is no teaching input for the inner processing layers of neurons.

In. 1  In. 2 | Output
  0      0   |   0
  0      1   |   1
  1      0   |   1
  1      1   |   0

Table 5.1: Definition of the logical XOR. The input values are shown on the left, the output values on the right.
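A minimal sketch of the online delta rule (equation 5.18) for an SLP with one output neuron and linear activation; the toy target function, the learning rate and the seed are my own choices, not from the text:

// Sketch: online delta rule for a 2-input SLP with linear activation, learning the
// toy mapping t = 0.5*o_1 - 0.3*o_2. The weights approach 0.5 and -0.3.
import java.util.Random;

public class DeltaRuleOnline {
    public static void main(String[] args) {
        double[] w = {0.0, 0.0};
        double eta = 0.05;
        Random rnd = new Random(42);

        for (int step = 0; step < 10000; step++) {
            double[] o = {rnd.nextDouble(), rnd.nextDouble()};  // outputs of the input neurons
            double t = 0.5 * o[0] - 0.3 * o[1];                 // teaching input
            double out = w[0] * o[0] + w[1] * o[1];             // linear activation: output = net input
            double delta = t - out;                             // delta_Omega
            for (int i = 0; i < w.length; i++)
                w[i] += eta * o[i] * delta;                     // Delta w_{i,Omega} = eta * o_i * delta_Omega
        }
        System.out.printf("w1 = %.3f, w2 = %.3f%n", w[0], w[1]);
    }
}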

5.2 An SLP is only capable of representing linearly separable data

Let f be the XOR function which expects two binary inputs and generates a binary output (for the precise definition see table 5.1).

Let us try to represent the XOR function by means of an SLP with two input neurons i1, i2 and one output neuron Ω (fig. 5.6).



Figure 5.6: Sketch of a singlelayer perceptron that shall represent the XOR function – which is impossible.

Here we use the weighted sum as propagation function, a binary activation function with the threshold value Θ and the identity as output function. Depending on i1 and i2, Ω has to output the value 1 if the following holds:

netΩ = oi1wi1,Ω + oi2wi2,Ω ≥ ΘΩ (5.21)

We assume a positive weight wi2,Ω; inequality 5.21 is then equivalent to

o_{i_2} \geq \frac{1}{w_{i_2,\Omega}} (\Theta_\Omega - o_{i_1} w_{i_1,\Omega}) \qquad (5.22)

With a constant threshold value ΘΩ, the right part of inequality 5.22 is a straight line through a coordinate system defined by the possible outputs oi1 and oi2 of the input neurons i1 and i2 (fig. 5.7).

Figure 5.7: Linear separation of n = 2 inputs of the input neurons i1 and i2 by a 1-dimensional straight line. A and B show the corners belonging to the sets of the XOR function that are to be separated.

For a (as required for inequality 5.22) positive wi2,Ω the output neuron Ω fires for input combinations lying above the generated straight line. For a negative wi2,Ω it would fire for all input combinations lying below the straight line. Note that only the four corners of the unit square are possible inputs because the XOR function only knows binary inputs.

In order to solve the XOR problem, we have to turn and move the straight line so that input set A = {(0, 0), (1, 1)} is separated from input set B = {(0, 1), (1, 0)} – this is, obviously, impossible.

n | number of binary functions | linearly separable ones | share
1 | 4           | 4         | 100%
2 | 16          | 14        | 87.5%
3 | 256         | 104       | 40.6%
4 | 65,536      | 1,772     | 2.7%
5 | 4.3 · 10⁹   | 94,572    | 0.002%
6 | 1.8 · 10¹⁹  | 5,028,134 | ≈ 0%

Table 5.2: Number of functions concerning n binary inputs, and number and proportion of the functions thereof which can be linearly separated. In accordance with [Zel94, Wid89, Was89].

Generally, the input parameters of n many input neurons can be represented in an n-dimensional cube which is separated by an SLP through an (n−1)-dimensional hyperplane (fig. 5.8). Only sets that can be separated by such a hyperplane, i.e. which are linearly separable, can be classified by an SLP.

Figure 5.8: Linear separation of n = 3 inputs from input neurons i1, i2 and i3 by a 2-dimensional plane.

Unfortunately, it seems that the percentage of the linearly separable problems rapidly decreases with increasing n (see table 5.2), which limits the functionality of the SLP. Additionally, tests for linear separability are difficult. Thus, for more difficult tasks with more inputs we need something more powerful than the SLP. The XOR problem itself is one of these tasks, since a perceptron that is supposed to represent the XOR function already needs a hidden layer (fig. 5.9).



Figure 5.9: Neural network realizing the XOR function. Threshold values (as far as they exist) are located within the neurons.

5.3 A multilayer perceptron contains more trainable weight layers

A perceptron with two or more trainable weight layers (called multilayer perceptron or MLP) is more powerful than an SLP. As we know, a singlelayer perceptron can divide the input space by means of a hyperplane (in a two-dimensional input space by means of a straight line). A two-stage perceptron (two trainable weight layers, three neuron layers) can classify convex polygons by further processing these straight lines, e.g. in the form "recognize patterns lying above straight line 1, below straight line 2 and below straight line 3". Thus, we – metaphorically speaking – took an SLP with several output neurons and "attached" another SLP (upper part of fig. 5.10). A multilayer perceptron represents a universal function approximator, which is proven by the Theorem of Cybenko [Cyb89].

Another trainable weight layer proceeds analogously, now with the convex polygons. Those can be added, subtracted or somehow processed with other operations (lower part of fig. 5.10).

Generally, it can be mathematically proven that even a multilayer perceptron with one layer of hidden neurons can arbitrarily precisely approximate functions with only finitely many discontinuities as well as their first derivatives. Unfortunately, this proof is not constructive and therefore it is left to us to find the correct number of neurons and weights.

In the following we want to use a widespread abbreviated form for different multilayer perceptrons: We denote a two-stage perceptron with 5 neurons in the input layer, 3 neurons in the hidden layer and 4 neurons in the output layer as a 5-3-4-MLP.

Definition 5.7 (Multilayer perceptron). Perceptrons with more than one layer of variably weighted connections are referred to as multilayer perceptrons (MLP). An n-layer or n-stage perceptron has thereby exactly n variable weight layers and n + 1 neuron layers (the retina is disregarded here), with neuron layer 1 being the input layer.

Since three-stage perceptrons can classify sets of any form by combining and separating arbitrarily many convex polygons, another step will not be advantageous with respect to function representations.



Figure 5.10: We know that an SLP represents a straight line. With 2 trainable weight layers, several straight lines can be combined to form convex polygons (above). By using 3 trainable weight layers several polygons can be formed into arbitrary sets (below).


n | classifiable sets
1 | hyperplane
2 | convex polygon
3 | any set
4 | any set as well, i.e. no advantage

Table 5.3: Representation of which perceptron can classify which types of sets, with n being the number of trainable weight layers.

Be cautious when reading the literature: There are many different definitions of what is counted as a layer. Some sources count the neuron layers, some count the weight layers. Some sources include the retina, some the trainable weight layers. Some exclude (for some reason) the output neuron layer. In this work, I chose the definition that provides, in my opinion, the most information about the learning capabilities – and I will use it consistently. Remember: An n-stage perceptron has exactly n trainable weight layers. You can find a summary of which perceptrons can classify which types of sets in table 5.3. We now want to face the challenge of training perceptrons with more than one weight layer.

5.4 Backpropagation of error generalizes the delta rule to allow for MLP training

Next, I want to derive and explain the backpropagation of error learning rule (abbreviated: backpropagation, backprop or BP), which can be used to train multi-stage perceptrons with semi-linear³ activation functions. Binary threshold functions and other non-differentiable functions are no longer supported, but that doesn't matter: We have seen that the Fermi function or the hyperbolic tangent can arbitrarily approximate the binary threshold function by means of a temperature parameter T. To a large extent I will follow the derivation according to [Zel94] and [MR86]. Once again I want to point out that this procedure had previously been published by Paul Werbos in [Wer74] but had considerably fewer readers than in [MR86].

Backpropagation is a gradient descent procedure (including all strengths and weaknesses of the gradient descent) with the error function Err(W) receiving all n weights as arguments (fig. 5.5) and assigning them to the output error, i.e. being n-dimensional. On Err(W) a point of small error or even a point of the smallest error is sought by means of the gradient descent. Thus, in analogy to the delta rule, backpropagation trains the weights of the neural network. And it is exactly the delta rule or its variable δi for a neuron i which is expanded from one trainable weight layer to several ones by backpropagation.

³ Semi-linear functions are monotonic and differentiable – but generally they are not linear.


5.4.1 The derivation is similar to the one of the delta rule, but with a generalized delta

Let us define in advance that the network input of the individual neurons i results from the weighted sum. Furthermore, as with the derivation of the delta rule, let op,i, netp,i etc. be defined as the already familiar oi, neti, etc. under the input pattern p we used for the training. Let the output function be the identity again, thus op,i = fact(netp,i) holds for any neuron i. Since this is a generalization of the delta rule, we use the same formula framework as with the delta rule (equation 5.20). As already indicated, we have to generalize the variable δ for every neuron.

First of all: Where is the neuron for which we want to calculate δ? It is obvious to select an arbitrary inner neuron h having a set K of predecessor neurons k as well as a set L of successor neurons l, which are also inner neurons (see fig. 5.11). It is therefore irrelevant whether the predecessor neurons are already the input neurons.

Now we perform the same derivation as for the delta rule and split functions by means of the chain rule. I will not discuss this derivation in great detail, but the principle is similar to that of the delta rule (the differences are, as already mentioned, in the generalized δ). We initially derive the error function Err according to a weight wk,h.


Figure 5.11: Illustration of the position of our neuron h within the neural network. It is lying in layer H, the preceding layer is K, the subsequent layer is L.

\frac{\partial Err(w_{k,h})}{\partial w_{k,h}} = \underbrace{\frac{\partial Err}{\partial net_h}}_{=-\delta_h} \cdot \frac{\partial net_h}{\partial w_{k,h}} \qquad (5.23)

The first factor of equation 5.23 is −δh, which we will deal with later in this text. The numerator of the second factor of the equation includes the network input, i.e. the weighted sum is included in the numerator so that we can immediately derive it. Again, all summands of the sum drop out apart from the summand containing wk,h.


This summand is referred to as wk,h · ok. If we calculate the derivative, what remains is the output of neuron k:

\frac{\partial net_h}{\partial w_{k,h}} = \frac{\partial \sum_{k \in K} w_{k,h} o_k}{\partial w_{k,h}} \qquad (5.24)
 = o_k \qquad (5.25)

As promised, we will now discuss the −δh of equation 5.23, which is split up again according to the chain rule:

\delta_h = -\frac{\partial Err}{\partial net_h} \qquad (5.26)
 = -\frac{\partial Err}{\partial o_h} \cdot \frac{\partial o_h}{\partial net_h} \qquad (5.27)

The derivative of the output according to the network input (the second factor in equation 5.27) clearly equals the derivative of the activation function according to the network input:

\frac{\partial o_h}{\partial net_h} = \frac{\partial f_\text{act}(net_h)}{\partial net_h} \qquad (5.28)
 = f'_\text{act}(net_h) \qquad (5.29)

Consider this an important passage! We now analogously derive the first factor in equation 5.27. Therefore, we have to point out that the derivative of the error function according to the output of an inner neuron layer depends on the vector of all network inputs of the next following layer. This is reflected in equation 5.30:

-\frac{\partial Err}{\partial o_h} = -\frac{\partial Err(net_{l_1}, \ldots, net_{l_{|L|}})}{\partial o_h} \qquad (5.30)

According to the definition of the multi-dimensional chain rule, we immediately obtain equation 5.31:

-\frac{\partial Err}{\partial o_h} = \sum_{l \in L} \left( -\frac{\partial Err}{\partial net_l} \cdot \frac{\partial net_l}{\partial o_h} \right) \qquad (5.31)

Each summand of the sum in equation 5.31 contains two factors. Now we want to discuss these factors, which are added up over the subsequent layer L. We simply calculate the second factor in the following equation 5.33:

\frac{\partial net_l}{\partial o_h} = \frac{\partial \sum_{h \in H} w_{h,l} \cdot o_h}{\partial o_h} \qquad (5.32)
 = w_{h,l} \qquad (5.33)

The same applies for the first factor according to the definition of our δ:

-\frac{\partial Err}{\partial net_l} = \delta_l \qquad (5.34)

Now we insert:

\Rightarrow -\frac{\partial Err}{\partial o_h} = \sum_{l \in L} \delta_l w_{h,l} \qquad (5.35)

You can find a graphic version of the δ generalization including all splittings in fig. 5.12.

The reader might already have noticed that some intermediate results were shown in frames. Exactly those intermediate results were highlighted in that way which are a factor in the change in weight of wk,h.



Figure 5.12: Graphical representation of the equations (by equal signs) and chain rule splittings (by arrows) in the framework of the backpropagation derivation. The leaves of the tree reflect the final results from the generalization of δ, which are framed in the derivation.


If the aforementioned equations are combined with the highlighted intermediate results, the outcome will be the wanted change in weight ∆wk,h:

\Delta w_{k,h} = \eta o_k \delta_h \quad \text{with} \quad \delta_h = f'_\text{act}(net_h) \cdot \sum_{l \in L} (\delta_l w_{h,l}) \qquad (5.36)

– of course only in case of h being an inner neuron (otherwise there would not be a subsequent layer L).

The case of h being an output neuron has already been discussed during the derivation of the delta rule. All in all, the result is the generalization of the delta rule, called backpropagation of error:

\Delta w_{k,h} = \eta o_k \delta_h \quad \text{with} \quad
\delta_h = \begin{cases} f'_\text{act}(net_h) \cdot (t_h - y_h) & (h \text{ outside}) \\ f'_\text{act}(net_h) \cdot \sum_{l \in L} (\delta_l w_{h,l}) & (h \text{ inside}) \end{cases} \qquad (5.37)

In contrast to the delta rule, δ is treated differently depending on whether h is an output or an inner (i.e. hidden) neuron:

1. If h is an output neuron, then

\delta_{p,h} = f'_\text{act}(net_{p,h}) \cdot (t_{p,h} - y_{p,h}) \qquad (5.38)

Thus, under our training pattern p the weight wk,h from k to h is changed proportionally according to

. the learning rate η,

. the output op,k of the predecessor neuron k,

. the gradient of the activation function at the position of the network input of the successor neuron, f'_act(netp,h), and

. the difference between teaching input tp,h and output yp,h of the successor neuron h.

In this case, backpropagation is working on two neuron layers, the output layer with the successor neuron h and the preceding layer with the predecessor neuron k.

2. If h is an inner, hidden neuron, then

\delta_{p,h} = f'_\text{act}(net_{p,h}) \cdot \sum_{l \in L} (\delta_{p,l} \cdot w_{h,l}) \qquad (5.39)

holds. I want to explicitly mention that backpropagation is now working on three layers. Here, neuron k is the predecessor of the connection to be changed with the weight wk,h, the neuron h is the successor of the connection to be changed and the neurons l are lying in the layer following the successor neuron. Thus, according to our training pattern p, the weight wk,h from k to h is proportionally changed according to

. the learning rate η,

. the output of the predecessor neuron op,k,

. the gradient of the activation function at the position of the network input of the successor neuron, f'_act(netp,h),

. as well as, and this is the difference, the weighted sum of the deltas of all neurons l following h, \sum_{l \in L}(\delta_{p,l} \cdot w_{h,l}).


Definition 5.8 (Backpropagation). If we summarize formulas 5.38 and 5.39, we receive the following final formula for backpropagation (the identifiers p are omitted for reasons of clarity):

\Delta w_{k,h} = \eta o_k \delta_h \quad \text{with} \quad
\delta_h = \begin{cases} f'_\text{act}(net_h) \cdot (t_h - y_h) & (h \text{ outside}) \\ f'_\text{act}(net_h) \cdot \sum_{l \in L} (\delta_l w_{h,l}) & (h \text{ inside}) \end{cases} \qquad (5.40)

SNIPE: An online variant of backpropagation is implemented in the method trainBackpropagationOfError within the class NeuralNetwork.

It is obvious that backpropagation initially processes the last weight layer directly by means of the teaching input and then works backwards from layer to layer while considering each preceding change in weights. Thus, the teaching input leaves traces in all weight layers. Here I describe the first part (delta rule) and the second part of backpropagation (generalized delta rule on more layers) in one go, which may meet the requirements of the subject matter but not of the research history. The first part is obvious, which you will soon see in the framework of a mathematical gimmick. Decades of development time and work lie between the first and the second, recursive part. Like many groundbreaking inventions, it was not until its development that it was recognized how plausible this invention was.
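To make the formula tangible, here is a compact, self-contained sketch of online backpropagation for a 2-2-1 MLP with hyperbolic tangent activations, trained on XOR with the teaching inputs ±0.9 recommended in section 4.6.1. It is my own illustration (not the SNIPE implementation); network size, learning rate and seed are arbitrary choices, and like any gradient descent it may occasionally get stuck in a local minimum:

// Sketch: online backpropagation of error (equation 5.40) for a minimal 2-2-1 MLP.
import java.util.Random;

public class BackpropXor {
    public static void main(String[] args) {
        double[][] in = {{0,0},{0,1},{1,0},{1,1}};
        double[]   t  = {-0.9, 0.9, 0.9, -0.9};             // XOR with teaching inputs +-0.9
        Random rnd = new Random(1);
        double[][] wIH = new double[3][2];                   // input (+bias) -> hidden weights
        double[]   wHO = new double[3];                      // hidden (+bias) -> output weights
        for (double[] row : wIH) for (int h = 0; h < 2; h++) row[h] = rnd.nextDouble() - 0.5;
        for (int h = 0; h < 3; h++) wHO[h] = rnd.nextDouble() - 0.5;
        double eta = 0.3;

        for (int epoch = 0; epoch < 10000; epoch++) {
            for (int p = 0; p < in.length; p++) {
                double[] ok = {1, in[p][0], in[p][1]};       // outputs of the input layer (incl. bias)
                double[] oh = {1, 0, 0};                     // outputs of the hidden layer (incl. bias)
                for (int h = 0; h < 2; h++) {
                    double net = 0;
                    for (int k = 0; k < 3; k++) net += wIH[k][h] * ok[k];
                    oh[h + 1] = Math.tanh(net);
                }
                double netO = 0;
                for (int h = 0; h < 3; h++) netO += wHO[h] * oh[h];
                double y = Math.tanh(netO);

                double deltaO = (1 - y * y) * (t[p] - y);    // h "outside": f'(net) * (t - y); tanh' = 1 - tanh^2
                double[] deltaH = new double[2];
                for (int h = 0; h < 2; h++)                  // h "inside": f'(net) * delta_l * w_{h,l}
                    deltaH[h] = (1 - oh[h + 1] * oh[h + 1]) * deltaO * wHO[h + 1];

                for (int h = 0; h < 3; h++) wHO[h] += eta * oh[h] * deltaO;   // Delta w = eta * o_k * delta_h
                for (int k = 0; k < 3; k++)
                    for (int h = 0; h < 2; h++) wIH[k][h] += eta * ok[k] * deltaH[h];
            }
        }
        for (int p = 0; p < in.length; p++) {                // print the learned outputs
            double[] ok = {1, in[p][0], in[p][1]};
            double[] oh = {1, 0, 0};
            for (int h = 0; h < 2; h++) {
                double net = 0;
                for (int k = 0; k < 3; k++) net += wIH[k][h] * ok[k];
                oh[h + 1] = Math.tanh(net);
            }
            double netO = 0;
            for (int h = 0; h < 3; h++) netO += wHO[h] * oh[h];
            System.out.printf("%s -> %+.2f (teaching input %+.1f)%n",
                    java.util.Arrays.toString(in[p]), Math.tanh(netO), t[p]);
        }
    }
}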

5.4.2 Heading back: Boiling backpropagation down to the delta rule

As explained above, the delta rule is a special case of backpropagation for one-stage perceptrons and linear activation functions – I want to briefly explain this circumstance and develop the delta rule out of backpropagation in order to augment the understanding of both rules. We have seen that backpropagation is defined by

\Delta w_{k,h} = \eta o_k \delta_h \quad \text{with} \quad
\delta_h = \begin{cases} f'_\text{act}(net_h) \cdot (t_h - y_h) & (h \text{ outside}) \\ f'_\text{act}(net_h) \cdot \sum_{l \in L} (\delta_l w_{h,l}) & (h \text{ inside}) \end{cases} \qquad (5.41)

Since we only use it for one-stage perceptrons, the second part of backpropagation (the case of h being an inner neuron) is omitted without substitution. The result is:

\Delta w_{k,h} = \eta o_k \delta_h \quad \text{with} \quad \delta_h = f'_\text{act}(net_h) \cdot (t_h - o_h) \qquad (5.42)

Furthermore, we only want to use linear activation functions, so that f'_act is constant. As is generally known, constants can be combined, and therefore we directly merge the constant derivative f'_act and (being constant for at least one learning cycle) the learning rate η into η. Thus, the result is:

\Delta w_{k,h} = \eta o_k \delta_h = \eta o_k \cdot (t_h - o_h) \qquad (5.43)

This exactly corresponds to the delta rule definition.


5.4.3 The selection of the learning rate has heavy influence on the learning process

In the meantime we have often seen that the change in weight is, in any case, proportional to the learning rate η. Thus, the selection of η is crucial for the behaviour of backpropagation and for learning procedures in general.

Definition 5.9 (Learning rate). Speed and accuracy of a learning procedure can always be controlled by and are always proportional to a learning rate which is written as η.

If the value of the chosen η is too large, the jumps on the error surface are also too large and, for example, narrow valleys could simply be jumped over. Additionally, the movements across the error surface would be very uncontrolled. Thus, a small η is desirable, which, however, can cost a huge, often unacceptable amount of time. Experience shows that good learning rate values are in the range of

0.01 ≤ η ≤ 0.9.

The selection of η significantly depends on the problem, the network and the training data, so that it is barely possible to give practical advice. But for instance it is popular to start with a relatively large η, e.g. 0.9, and to slowly decrease it down to 0.1. For simpler problems η can often be kept constant.

5.4.3.1 Variation of the learning rate over time

During training, another stylistic device can be a variable learning rate: In the beginning, a large learning rate leads to good results, but later it results in inaccurate learning. A smaller learning rate is more time-consuming, but the result is more precise. Thus, during the learning process the learning rate needs to be decreased by one order of magnitude once or repeatedly.

A common error (which also seems to be a very neat solution at first glance) is to continually decrease the learning rate. Here it quickly happens that the descent of the learning rate is larger than the ascent of a hill of the error function we are climbing. The result is that we simply get stuck at this ascent. Solution: rather reduce the learning rate in steps, as mentioned above.
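A stepwise schedule of this kind can be expressed very simply; the concrete epoch boundaries and values below are arbitrary illustrations of my own, not recommendations from the text:

// Sketch: stepwise reduction of the learning rate by one order of magnitude,
// instead of a continuous decrease.
public class LearningRateSchedule {
    static double eta(int epoch) {
        if (epoch < 1000) return 0.9;    // coarse search across the error surface
        if (epoch < 5000) return 0.09;   // reduced by one order of magnitude
        return 0.009;                    // fine tuning towards the end of training
    }

    public static void main(String[] args) {
        System.out.println(eta(0) + "  " + eta(2000) + "  " + eta(8000));
    }
}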

5.4.3.2 Different layers – Different learning rates

The farther we move away from the output layer during the learning process, the slower backpropagation is learning. Thus, it is a good idea to select a larger learning rate for the weight layers close to the input layer than for the weight layers close to the output layer.


5.5 Resilient backpropagation is an extension to backpropagation of error

We have just raised two backpropagation-specific properties that can occasionally be a problem (in addition to those which are already caused by gradient descent itself): On the one hand, users of backpropagation can choose a bad learning rate. On the other hand, the further the weights are from the output layer, the slower backpropagation learns. For this reason, Martin Riedmiller et al. enhanced backpropagation and called their version resilient backpropagation (short: Rprop) [RB93, Rie94]. I want to compare backpropagation and Rprop, without explicitly declaring one version superior to the other. Before actually dealing with formulas, let us informally compare the two primary ideas behind Rprop (and their consequences) to the already familiar backpropagation.

Learning rates: Backpropagation uses by default a learning rate η, which is selected by the user, and applies to the entire network. It remains static until it is manually changed. We have already explored the disadvantages of this approach. Here, Rprop pursues a completely different approach: there is no global learning rate. First, each weight wi,j has its own learning rate ηi,j, and second, these learning rates are not chosen by the user, but are automatically set by Rprop itself. Third, the weight changes are not static but are adapted for each time step of Rprop. To account for the temporal change, we have to correctly call it ηi,j(t). This not only enables more focused learning, also the problem of an increasingly slowed down learning throughout the layers is solved in an elegant way.

Weight change: When using backpropagation, weights are changed proportionally to the gradient of the error function. At first glance, this is really intuitive. However, we incorporate every jagged feature of the error surface into the weight changes. It is at least questionable whether this is always useful. Here, Rprop takes other ways as well: the amount of weight change ∆wi,j simply directly corresponds to the automatically adjusted learning rate ηi,j. Thus the change in weight is not proportional to the gradient, it is only influenced by the sign of the gradient. Until now we still do not know how exactly the ηi,j are adapted at run time, but let me anticipate that the resulting process looks considerably less rugged than an error function.

In contrast to backprop, the weight update step is replaced and an additional step for the adjustment of the learning rate is added. Now how exactly are these ideas being implemented?


5.5.1 Weight changes are not proportional to the gradient

Let us first consider the change in weight. We have already noticed that the weight-specific learning rates directly serve as absolute values for the changes of the respective weights. There remains the question of where the sign comes from – this is a point at which the gradient comes into play. As with the derivation of backpropagation, we derive the error function Err(W) by the individual weights wi,j and obtain the gradients ∂Err(W)/∂wi,j. Now, the big difference: rather than multiplicatively incorporating the absolute value of the gradient into the weight change, we consider only the sign of the gradient. The gradient hence no longer determines the strength, but only the direction of the weight change.

If the sign of the gradient ∂Err(W)/∂wi,j is positive, we must decrease the weight wi,j. So the weight is reduced by ηi,j. If the sign of the gradient is negative, the weight needs to be increased. So ηi,j is added to it. If the gradient is exactly 0, nothing happens at all. Let us now create a formula from this colloquial description. The corresponding terms are affixed with a (t) to show that everything happens at the same time step. This might decrease clarity at first glance, but is nevertheless important because we will soon look at another formula that operates on different time steps. Instead, we shorten the gradient to g = ∂Err(W)/∂wi,j.

Definition 5.10 (Weight change in Rprop).

\Delta w_{i,j}(t) = \begin{cases} -\eta_{i,j}(t), & \text{if } g(t) > 0 \\ +\eta_{i,j}(t), & \text{if } g(t) < 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (5.44)

We now know how the weights are changed – there remains the question of how the learning rates are adjusted. Finally, once we have understood the overall system, we will deal with the remaining details like initialization and some specific constants.

5.5.2 Many dynamically adjusted learning rates instead of one static one

To adjust the learning rate ηi,j, we again have to consider the associated gradients g of two time steps: the gradient that has just passed (t − 1) and the current one (t). Again, only the sign of the gradient matters, and we now must ask ourselves: What can happen to the sign over two time steps? It can stay the same, and it can flip.

If the sign changes from g(t − 1) to g(t), we have skipped a local minimum in the gradient. Hence, the last update was too large and ηi,j(t) has to be reduced as compared to the previous ηi,j(t − 1). One can say that the search needs to be more accurate. In mathematical terms, we obtain a new ηi,j(t) by multiplying the old ηi,j(t − 1) with a constant η↓, which is between 0 and 1. In this case we know that in the last time step (t − 1) something went wrong – hence we additionally reset the weight update for the weight wi,j at time step (t) to 0, so that it is not applied at all (not shown in the following formula).

However, if the sign remains the same, one can perform a (careful!) increase of ηi,j to get past shallow areas of the error function. Here we obtain our new ηi,j(t) by multiplying the old ηi,j(t − 1) with a constant η↑ which is greater than 1.

Definition 5.11 (Adaptation of learning rates in Rprop).

\eta_{i,j}(t) = \begin{cases} \eta^{\uparrow} \eta_{i,j}(t-1), & g(t-1)g(t) > 0 \\ \eta^{\downarrow} \eta_{i,j}(t-1), & g(t-1)g(t) < 0 \\ \eta_{i,j}(t-1) & \text{otherwise.} \end{cases} \qquad (5.45)

Caution: This also implies that Rprop is exclusively designed for offline learning. If the gradients do not have a certain continuity, the learning process slows down to the lowest rates (and remains there). When learning online, one changes – loosely speaking – the error function with each new training pattern, since it is based on only one pattern at a time. This may often work well in backpropagation, and it is very often even faster than the offline version, which is why it is used there frequently. It lacks, however, a clear mathematical motivation, and that is exactly what we need here.

5.5.3 We are still missing a few details to use Rprop in practice

A few minor issues remain unanswered, namely

1. How large are η↑ and η↓ (i.e. how much are learning rates reinforced or weakened)?

2. How to choose ηi,j(0) (i.e. how are the weight-specific learning rates initialized)?⁴

3. How are the upper and lower bounds ηmin and ηmax for ηi,j set?

We now answer these questions with a quick motivation. The initial value for the learning rates should be somewhere in the order of the initialization of the weights. ηi,j(0) = 0.1 has proven to be a good choice. The authors of the Rprop paper explain plausibly that this value – as long as it is positive and without an exorbitantly high absolute value – does not need to be dealt with very critically, as it will be quickly overridden by the automatic adaptation anyway.

Equally uncritical is ηmax, for which they recommend, without further mathematical justification, a value of 50, which is used throughout most of the literature. One can set this parameter to lower values in order to allow only very cautious updates. Small update steps should be allowed in any case, so we set ηmin = 10⁻⁶.

4 Pro tip: since the ηi,j can be changed only by multiplication, 0 would be a rather suboptimal initialization :-)


Now only the parameters η↑ and η↓ are left. Let us start with η↓: if this value is used, we have skipped a minimum, and we do not know where exactly it lies on the skipped track. Analogous to the procedure of binary search, where the target object is often skipped as well, we assume it was in the middle of the skipped track. So we need to halve the learning rate, which is why the canonical choice η↓ = 0.5 is selected. If the value of η↑ is used, learning rates shall be increased with caution. Here we cannot generalize the principle of binary search and simply use the value 2.0, otherwise the learning rate update would end up consisting almost exclusively of changes in direction. Independent of the particular problems, a value of η↑ = 1.2 has proven to be promising. Slight changes of this value have not significantly affected the rate of convergence, which allowed for setting this value as a constant as well.
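To make the interplay of equations 5.44 and 5.45 concrete, here is a minimal sketch of one Rprop step in Python, using the constants motivated above; array and function names are chosen freely for illustration and are not part of Snipe.

import numpy as np

ETA_PLUS, ETA_MINUS = 1.2, 0.5    # eta-up and eta-down
ETA_MIN, ETA_MAX = 1e-6, 50.0     # bounds for the weight-specific learning rates

def rprop_step(weights, grad, grad_prev, eta):
    """One offline Rprop update; all arguments are arrays of equal shape."""
    sign_change = grad * grad_prev
    # adapt the learning rates according to equation 5.45
    eta = np.where(sign_change > 0, np.minimum(eta * ETA_PLUS, ETA_MAX), eta)
    eta = np.where(sign_change < 0, np.maximum(eta * ETA_MINUS, ETA_MIN), eta)
    # weight change: only the sign of the gradient matters (equation 5.44)
    step = np.where(grad > 0, -eta, np.where(grad < 0, eta, 0.0))
    # if the sign flipped, the previous step skipped a minimum: suppress this update
    step = np.where(sign_change < 0, 0.0, step)
    return weights + step, eta

The function would be called once per epoch with the gradient accumulated over all training samples, keeping grad and eta between epochs and initializing eta to 0.1 everywhere.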

With the advancing computational capabilities of computers one can observe a more and more widespread use of networks that consist of a large number of layers, i.e. deep networks. For such networks it is crucial to prefer Rprop over the original backpropagation, because backprop, as already indicated, learns very slowly at weights which are far from the output layer. For problems with a smaller number of layers, I would recommend testing the more widespread backpropagation (with both offline and online learning) and the less common Rprop equivalently.

SNIPE: In Snipe resilient backpropagation is supported via the method trainResilientBackpropagation of the class NeuralNetwork. Furthermore, you can also use an additional improvement to resilient propagation, which is, however, not dealt with in this work. There are getters and setters for the different parameters of Rprop.

5.6 Backpropagation has often been extended and altered besides Rprop

Backpropagation has often been extended. Many of these extensions can simply be implemented as optional features of backpropagation in order to have a larger scope for testing. In the following I want to briefly describe some of them.

5.6.1 Adding momentum to learning

Let us assume we descend a steep slope on skis – what prevents us from immediately stopping at the edge of the slope to the plateau? Exactly – our momentum. With backpropagation the momentum term [RHW86b] is responsible for the fact that a kind of moment of inertia (momentum) is added to every step size (fig. 5.13 on the next page), by always adding a fraction of the previous change to every new change in weight:

(∆p wi,j)now = η · op,i · δp,j + α · (∆p wi,j)previous.


Of course, this notation is only used for a better understanding. Generally, as already defined by the concept of time, when referring to the current cycle as (t), the previous cycle is identified by (t − 1), and so on. And now we come to the formal definition of the momentum term:

Definition 5.12 (Momentum term). The variation of backpropagation by means of the momentum term is defined as follows:

∆wi,j(t) = η oi δj + α · ∆wi,j(t − 1)        (5.46)

We accelerate on plateaus (avoiding quasi-standstill) and slow down on craggy surfaces (preventing oscillations). Moreover, the effect of inertia can be varied via the prefactor α; common values are between 0.6 and 0.9. Additionally, the momentum enables the positive effect that our skier swings back and forth several times in a minimum, and finally lands in the minimum. Despite its nice one-dimensional appearance, the otherwise very rare error of leaving good minima unfortunately occurs more frequently because of the momentum term – which means that this is again no optimal solution (but we are by now accustomed to this condition).
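As a small sketch (the names here are chosen freely for illustration), the momentum term of equation 5.46 only requires remembering the previous weight change:

def momentum_update(weight, gradient_term, delta_prev, eta=0.1, alpha=0.9):
    """gradient_term stands for the product o_i * delta_j of the backprop rule."""
    delta = eta * gradient_term + alpha * delta_prev   # equation 5.46
    return weight + delta, delta   # the returned delta is delta_prev of the next step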

Figure 5.13: We want to execute the gradient descent like a skier crossing a slope, who would hardly stop immediately at the edge to the plateau.

5.6.2 Flat spot elimination prevents neurons from getting stuck

It must be pointed out that with the hyperbolic tangent as well as with the Fermi function the derivative outside of the close proximity of Θ is nearly 0. This results in the fact that it becomes very difficult to move neurons away from the limits of the activation (flat spots), which could extremely extend the learning time. This problem can be dealt with by modifying the derivative, for example by adding a constant (e.g. 0.1), which is called flat spot elimination or – more colloquially – fudging.
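A minimal sketch of the idea, assuming the hyperbolic tangent as activation function and the constant 0.1 suggested above:

import numpy as np

def tanh_derivative_fudged(net_input, fudge=0.1):
    """Derivative of tanh plus a small constant, so that it never becomes 0."""
    return (1.0 - np.tanh(net_input) ** 2) + fudge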

It is an interesting observation that success has also been achieved by using derivatives defined as constants [Fah88]. A nice example making use of this effect is the fast hyperbolic tangent approximation by Anguita et al. introduced in section 3.2.6 on page 37. In the outer regions of its (as well approximated and accelerated) derivative, it makes use of a small constant.

5.6.3 The second derivative can be used, too

According to David Parker [Par87], second order backpropagation also uses the second gradient, i.e. the second multi-dimensional derivative of the error function, to obtain more precise estimates of the correct ∆wi,j. Even higher derivatives only rarely improve the estimations. Thus, fewer training cycles are needed, but those require much more computational effort.

In general, we use further derivatives (i.e. Hessian matrices, since the functions are multidimensional) for higher order methods. As expected, these procedures reduce the number of learning epochs, but significantly increase the computational effort of the individual epochs. So in the end these procedures often need more learning time than backpropagation.

The quickpropagation learning procedure [Fah88] uses the second derivative of the error propagation and locally understands the error function to be a parabola. We analytically determine the vertex (i.e. the lowest point) of the said parabola and directly jump to this point. Thus, this learning procedure is a second-order procedure. Of course, this does not work with error surfaces that cannot locally be approximated by a parabola (and it is certainly not always possible to directly say whether this is the case).
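The update rule itself is not spelled out in this text; under the parabola assumption just described, the standard quickprop step from [Fah88] can be written as

∆wi,j(t) = ( g(t) / (g(t − 1) − g(t)) ) · ∆wi,j(t − 1),

i.e. the jump to the vertex of the parabola fitted through the two most recent gradient measurements (with g again abbreviating ∂Err(W)/∂wi,j).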

5.6.4 Weight decay: Punishment of large weights

The weight decay according to Paul Werbos [Wer88] is a modification that extends the error by a term punishing large weights. So the error under weight decay ErrWD does not only increase proportionally to the actual error but also proportionally to the square of the weights. As a result the network keeps the weights small during learning.

ErrWD = Err + β · (1/2) · Σw∈W w²        (5.47)

Here, the second summand is the punishment term. This approach is inspired by nature, where synaptic weights cannot become infinitely strong as well. Additionally, due to these small weights, the error function often shows weaker fluctuations, allowing easier and more controlled learning.

The prefactor 1/2 again resulted from simple pragmatics. The factor β controls the strength of punishment: values from 0.001 to 0.02 are often used here.
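As a sketch (with freely chosen names), the punishment term of equation 5.47 and its effect on the gradient can be written as follows – every weight receives an additional pull of β · w towards 0:

import numpy as np

def weight_decay_error(err, weights, beta=0.01):
    """Error under weight decay, equation 5.47."""
    return err + beta * 0.5 * np.sum(weights ** 2)

def weight_decay_gradient(grad, weights, beta=0.01):
    """Gradient of the extended error: the decay term contributes beta * w."""
    return grad + beta * weights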

5.6.5 Cutting networks down: Pruning and Optimal Brain Damage

If we have executed the weight decay long enough and notice that for a neuron in the input layer all successor weights are 0 or close to 0, we can remove the neuron, hence losing this neuron and some weights, and thereby reduce the possibility that the network will memorize. This procedure is called pruning.

Such a method to detect and delete unnecessary weights and neurons is referred to as optimal brain damage [lCDS90]. I only want to describe it briefly: the mean error per output neuron is composed of two competing terms. While one term, as usual, considers the difference between output and teaching input, the other one tries to "press" a weight towards 0. If a weight is strongly needed to minimize the error, the first term will win. If this is not the case, the second term will win. Neurons which only have zero weights can be pruned in the end.

There are many other variations of backprop, and there are whole books about this subject alone, but since my aim is to offer an overview of neural networks, I just want to mention the variations above as a motivation to read on.

For some of these extensions it is obvious that they can be applied not only to feedforward networks with backpropagation learning procedures.

We have gotten to know backpropagation and feedforward topology – now we have to learn how to build a neural network. It is of course impossible to fully communicate this experience in the framework of this work. To obtain at least some of this knowledge, I now advise you to deal with some of the exemplary problems from section 4.6.

5.7 Getting started – Initial configuration of a multilayer perceptron

After having discussed the backpropagation of error learning procedure and knowing how to train an existing network, it would be useful to consider how to implement such a network.

5.7.1 Number of layers: Two or three may often do the job, but more are also used

Let us begin with the trivial circumstance that a network should have one layer of input neurons and one layer of output neurons, which results in at least two layers.

Additionally, we need – as we have already learned during the examination of linear separability – at least one hidden layer of neurons, if our problem is not linearly separable (which is, as we have seen, very likely).

It is possible, as already mentioned, to mathematically prove that this MLP with one hidden neuron layer is already capable of approximating arbitrary functions with any accuracy⁵ – but it is necessary not only to discuss the representability of a problem by means of a perceptron but also the learnability. Representability means that a perceptron can, in principle, realize a mapping – but learnability means that we are also able to teach it.

5 Note: We have not indicated the number of neurons in the hidden layer, we only mentioned the hypothetical possibility.

In this respect, experience shows that two hidden neuron layers (or three trainable weight layers) can be very useful to solve a problem, since many problems can be represented by a hidden layer but are very difficult to learn.

One should keep in mind that any additional layer generates additional sub-minima of the error function in which we can get stuck. All these things considered, a promising way is to try it with one hidden layer at first and, if that fails, to retry with two layers. Only if that fails, one should consider more layers. However, given the increasing calculation power of current computers, deep networks with a lot of layers are also used with success.

5.7.2 The number of neurons has to be tested

The number of neurons (apart from input and output layer, where the number of input and output neurons is already defined by the problem statement) principally corresponds to the number of free parameters of the problem to be represented.

Since we have already discussed the network capacity with respect to memorizing or a too imprecise problem representation, it is clear that our goal is to have as few free parameters as possible but as many as necessary.

But we also know that there is no standard solution for the question of how many neurons should be used. Thus, the most useful approach is to initially train with only a few neurons and to repeatedly train new networks with more neurons as long as the result significantly improves and, particularly, the generalization performance is not affected (bottom-up approach).

5.7.3 Selecting an activation function

Another very important parameter for the way of information processing of a neural network is the selection of an activation function. The activation function for input neurons is fixed to the identity function, since they do not process information.

The first question to be asked is whether we actually want to use the same activation function in the hidden layer and in the output layer – no one prevents us from choosing different functions. Generally, the activation function is the same for all hidden neurons as well as for the output neurons respectively.

For tasks of function approximation it has been found reasonable to use the hyperbolic tangent (left part of fig. 5.14 on page 102) as activation function of the hidden neurons, while a linear activation function is used in the output. The latter is absolutely necessary so that we do not generate a limited output interval. Contrary to the input layer, which uses linear activation functions as well, the output layer still processes information, because it has threshold values. However, linear activation functions in the output can also cause huge learning steps and jumping over good minima in the error surface. This can be avoided by setting the learning rate to very small values in the output layer.

An unlimited output interval is not essential for pattern recognition tasks⁶. If the hyperbolic tangent is used in any case, the output interval will be a bit larger. Unlike with the hyperbolic tangent, with the Fermi function (right part of fig. 5.14 on the following page) it is difficult to learn something far from the threshold value (where its result is close to 0). However, a lot of freedom is given here for selecting an activation function. But generally, the disadvantage of sigmoid functions is the fact that they hardly learn anything for values far from their threshold value, unless the network is modified.

6 Generally, pattern recognition is understood as a special case of function approximation with a few discrete output possibilities.

5.7.4 Weights should be initialized with small, randomly chosen values

The initialization of weights is not as trivial as one might think. If they are simply initialized with 0, there will be no change in weights at all. If they are all initialized by the same value, they will all change equally during training. The simple solution of this problem is called symmetry breaking, which is the initialization of weights with small random values. The range of random values could be the interval [−0.5; 0.5], not including 0 or values very close to 0. This random initialization has a nice side effect: chances are that the average of network inputs is close to 0, a value that hits (in most activation functions) the region of the greatest derivative, allowing for strong learning impulses right from the start of learning.
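A sketch of such a symmetry-breaking initialization for a weight matrix (the interval is the one described above; the function name and the size of the excluded zone around 0 are assumptions chosen for illustration):

import numpy as np

def init_weights(shape, limit=0.5, dead_zone=0.01, rng=np.random.default_rng()):
    """Small random weights in [-limit, limit], avoiding values too close to 0."""
    w = rng.uniform(-limit, limit, size=shape)
    too_small = np.abs(w) < dead_zone
    # push values that are almost 0 to the edge of the excluded zone, keeping their sign
    w[too_small] = np.sign(w[too_small] + 1e-12) * dead_zone
    return w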

SNIPE: In Snipe, weights are initialized randomly (if a synapse initialization is wanted). The maximum absolute weight value of a synapse initialized at random can be set in a NeuralNetworkDescriptor using the method setSynapseInitialRange.

5.8 The 8-3-8 encoding problem and related problems

The 8-3-8 encoding problem is a classic among the multilayer perceptron test training problems. In our MLP we have an input layer with eight neurons i1, i2, . . . , i8, an output layer with eight neurons Ω1, Ω2, . . . , Ω8 and one hidden layer with three neurons. Thus, this network represents a function B⁸ → B⁸. Now the training task is that an input of a value 1 into the neuron ij should lead to an output of a value 1 from the neuron Ωj (and only this one neuron should be activated), which results in 8 training samples.
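The training set is tiny and easy to generate; as a sketch (independent of the Snipe helper mentioned later in this section), the 8 samples are simply the unit vectors of B⁸, each serving as its own teaching input:

import numpy as np

def encoder_samples(n=8):
    """n training samples for the n-3-n encoder problem: teaching input equals input."""
    patterns = np.eye(n)          # row j has a 1 in component j and 0 elsewhere
    return [(p, p) for p in patterns]

for p, t in encoder_samples(8):
    print(p, "->", t)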

During the analysis of the trained network we will see that the network with the 3 hidden neurons represents some kind of binary encoding and that the above mapping is possible (assumed training time: ≈ 10⁴ epochs). Thus, our network is a machine in which the input is first encoded and afterwards decoded again.

Figure 5.14: As a reminder, the illustration of the hyperbolic tangent (left) and the Fermi function (right). The Fermi function was expanded by a temperature parameter; the original Fermi function is represented by dark colors, and the temperature parameters of the modified Fermi functions are, ordered ascending by steepness, 1/2, 1/5, 1/10 and 1/25.

Analogously, we can train a 1024-10-1024 encoding problem. But is it possible to improve the efficiency of this procedure? Could there be, for example, a 1024-9-1024 or an 8-2-8 encoding network?

Yes, even that is possible, since the network does not depend on binary encodings: thus, an 8-2-8 network is sufficient for our problem. But the encoding of the network is far more difficult to understand (fig. 5.15 on the next page) and the training of the networks requires a lot more time.

SNIPE: The static method getEncoderSampleLesson in the class TrainingSampleLesson allows for creating simple training sample lessons of arbitrary dimensionality for encoder problems like the above.

An 8-1-8 network, however, does not work, since the possibility that the output of one neuron is compensated by another one is essential, and if there is only one hidden neuron, there is certainly no compensatory neuron.

Exercises

Exercise 8. Fig. 5.4 on page 75 shows a small network for the boolean functions AND and OR. Write tables with all computational parameters of neural networks (e.g. network input, activation etc.). Perform the calculations for the four possible inputs of the networks and write down the values of these variables for each input. Do the same for the XOR network (fig. 5.9 on page 84).


Figure 5.15: Illustration of the functionality of the 8-2-8 network encoding. The marked points represent the vectors of the inner neuron activation associated to the samples. As you can see, it is possible to find inner activation formations so that each point can be separated from the rest of the points by a straight line. The illustration shows an exemplary separation of one point.

Exercise 9.

1. List all boolean functions B³ → B¹ that are linearly separable and characterize them exactly.

2. List those that are not linearly separable and characterize them exactly, too.

Exercise 10. A simple 2-1 network shall be trained with one single pattern by means of backpropagation of error and η = 0.1. Verify if the error

Err = Errp = (1/2)(t − y)²

converges and, if so, at what value. What does the error curve look like? Let the pattern (p, t) be defined by p = (p1, p2) = (0.3, 0.7) and tΩ = 0.4. Randomly initialize the weights in the interval [−1; 1].

Exercise 11. A one-stage perceptron with two input neurons, bias neuron and binary threshold function as activation function divides the two-dimensional space into two regions by means of a straight line g. Analytically calculate a set of weight values for such a perceptron so that the following set P of the 6 patterns of the form (p1, p2, tΩ) with ε ≪ 1 is correctly classified.

P = {(0, 0, −1); (2, −1, 1); (7 + ε, 3 − ε, 1); (7 − ε, 3 + ε, −1); (0, −2 − ε, 1); (0 − ε, −2, −1)}


Exercise 12. Calculate in a comprehensible way one vector ∆W of all changes in weight by means of the backpropagation of error procedure with η = 1. Let a 2-2-1 MLP with bias neuron be given and let the pattern be defined by

p = (p1, p2, tΩ) = (2, 0, 0.1).

For all weights with the target Ω the initial value of the weights should be 1. For all other weights the initial value should be 0.5. What is conspicuous about the changes?


Chapter 6

Radial basis functions

RBF networks approximate functions by stretching and compressing Gaussian bells and then summing them spatially shifted. Description of their functionality and their learning process. Comparison with multilayer perceptrons.

According to Poggio and Girosi [PG89], radial basis function networks (RBF networks) are a paradigm of neural networks which was developed considerably later than that of perceptrons. Like perceptrons, the RBF networks are built in layers. But in this case, they have exactly three layers, i.e. only one single layer of hidden neurons.

Like perceptrons, the networks have a feedforward structure and their layers are completely linked. Here, the input layer again does not participate in information processing. The RBF networks are – like MLPs – universal function approximators.

Despite all these things in common: what is the difference between RBF networks and perceptrons? The difference lies in the information processing itself and in the computational rules within the neurons outside of the input layer. So, in a moment we will define a so far unknown type of neurons.

6.1 Components and structure of an RBF network

Initially, we want to discuss colloquially and then define some concepts concerning RBF networks.

Output neurons: In an RBF network the output neurons only contain the identity as activation function and one weighted sum as propagation function. Thus, they do little more than adding all input values and returning the sum.

Hidden neurons are also called RBF neurons (and the layer in which they are located is referred to as RBF layer). As propagation function, each hidden neuron calculates a norm that represents the distance between the input to the network and the so-called position of the neuron (center). This is inserted into a radial activation function which calculates and outputs the activation of the neuron.

Definition 6.1 (RBF input neuron). Definition and representation is identical to the definition 5.1 on page 73 of the input neuron.

Definition 6.2 (Center of an RBF neuron). The center ch of an RBF neuron h is the point in the input space where the RBF neuron is located. In general, the closer the input vector is to the center vector of an RBF neuron, the higher is its activation.

Definition 6.3 (RBF neuron). The so-called RBF neurons h have a propagation function fprop that determines the distance between the center ch of a neuron and the input vector y. This distance represents the network input. Then the network input is sent through a radial basis function fact which returns the activation or the output of the neuron. RBF neurons are represented by the symbol ||c, x|| Gauß.

Definition 6.4 (RBF output neuron). RBF output neurons Ω use the weighted sum as propagation function fprop and the identity as activation function fact. They are represented by the symbol Σ.

Definition 6.5 (RBF network). An RBF network has exactly three layers in the following order: the input layer consisting of input neurons, the hidden layer (also called RBF layer) consisting of RBF neurons and the output layer consisting of RBF output neurons. Each layer is completely linked with the following one, shortcuts do not exist (fig. 6.1 on the next page) – it is a feedforward topology. The connections between input layer and RBF layer are unweighted, i.e. they only transmit the input. The connections between RBF layer and output layer are weighted. The original definition of an RBF network only referred to one output neuron, but – in analogy to the perceptrons – it is apparent that such a definition can be generalized. A bias neuron is not used in RBF networks. The set of input neurons shall be represented by I, the set of hidden neurons by H and the set of output neurons by O.

Therefore, the inner neurons are called radial basis neurons because from their definition it follows directly that all input vectors with the same distance from the center of a neuron also produce the same output value (fig. 6.2 on page 108).

6.2 Information processing of an RBF network

Now the question is what can be realized by such a network and what is its purpose. Let us go over the RBF network from top to bottom: an RBF network receives the input by means of the unweighted connections. Then the input vector is sent through a norm so that the result is a scalar. This scalar (which, by the way, can only be positive due to the norm) is processed by a radial basis function, for example by a Gaussian bell (fig. 6.3 on the next page). In short: input → distance → Gaussian bell → sum → output.

Figure 6.1: An exemplary RBF network with two input neurons, five hidden neurons and three output neurons. The connections to the hidden neurons are not weighted, they only transmit the input. Right of the illustration you can find the names of the neurons, which coincide with the names of the MLP neurons: input neurons are called i, hidden neurons are called h and output neurons are called Ω. The associated sets are referred to as I, H and O.

Figure 6.2: Let ch be the center of an RBF neuron h. Then the activation function facth is radially symmetric around ch.

The output values of the different neurons of the RBF layer, or of the different Gaussian bells, are added within the third layer: basically, in relation to the whole input space, Gaussian bells are added here.

Suppose that we have a second, a third and a fourth RBF neuron and therefore four differently located centers. Each of these neurons now measures another distance from the input to its own center and de facto provides different values, even if the Gaussian bell is the same. Since these values are finally simply accumulated in the output layer, one can easily see that any surface can be shaped by dragging, compressing and removing Gaussian bells and subsequently accumulating them. Here, the parameters for the superposition of the Gaussian bells are in the weights of the connections between the RBF layer and the output layer.

Furthermore, the network architecture offers the possibility to freely define or train height and width of the Gaussian bells – due to which the network paradigm becomes even more versatile. We will get to know methods and approaches for this later.

6.2.1 Information processing in RBF neurons

RBF neurons process information by using norms and radial basis functions.

At first, let us take as an example a simple 1-4-1 RBF network. It is apparent that we will receive a one-dimensional output which can be represented as a function (fig. 6.4 on the facing page). Additionally, the network includes the centers c1, c2, . . . , c4 of the four inner neurons h1, h2, . . . , h4, and therefore it has Gaussian bells which are finally added within the output neuron Ω. The network also possesses four values σ1, σ2, . . . , σ4 which influence the width of the Gaussian bells. On the contrary, the height of the Gaussian bell is influenced by the subsequent weights, since the individual output values of the bells are multiplied by those weights.


Figure 6.3: Two individual one- or two-dimensional Gaussian bells. In both cases σ = 0.4 holds and the centers of the Gaussian bells lie in the coordinate origin. The distance r to the center (0, 0) is simply calculated according to the Pythagorean theorem: r = √(x² + y²).

Figure 6.4: Four different Gaussian bells in one-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network. The Gaussian bells have different heights, widths and positions. Their centers c1, c2, . . . , c4 are located at 0, 1, 3, 4, the widths σ1, σ2, . . . , σ4 at 0.4, 1, 0.2, 0.8. You can see a two-dimensional example in fig. 6.5 on the following page.


Figure 6.5: Four different Gaussian bells in two-dimensional space generated by means of RBF neurons are added by an output neuron of the RBF network (panels: Gaussian 1 to 4 and the sum of the four Gaussians). Once again r = √(x² + y²) applies for the distance. The heights w, widths σ and centers c = (x, y) are: w1 = 1, σ1 = 0.4, c1 = (0.5, 0.5); w2 = −1, σ2 = 0.6, c2 = (1.15, −1.15); w3 = 1.5, σ3 = 0.2, c3 = (−0.5, −1); w4 = 0.8, σ4 = 1.4, c4 = (−2, 0).


Since we use a norm to calculate the distance between the input vector and the center of a neuron h, we have different choices: often the Euclidean norm is chosen to calculate the distance:

rh = ||x − ch||                    (6.1)
   = √( Σi∈I (xi − ch,i)² )        (6.2)

Remember: the input vector was referred to as x. Here, the index i runs through the input neurons and thereby through the input vector components and the neuron center components. As we can see, the Euclidean distance generates the squared differences of all vector components, adds them and extracts the root of the sum. In two-dimensional space this corresponds to the Pythagorean theorem. From the definition of a norm it directly follows that the distance can only be positive. Strictly speaking, we hence only use the positive part of the activation function. By the way, activation functions other than the Gaussian bell are possible. Normally, functions that are monotonically decreasing over the interval [0; ∞] are chosen.

Now that we know the distance rh between the input vector x and the center ch of the RBF neuron h, this distance has to be passed through the activation function. Here we use, as already mentioned, a Gaussian bell:

fact(rh) = exp( −rh² / (2σh²) )        (6.3)

It is obvious that both the center ch and the width σh can be seen as part of the activation function fact, and hence the activation functions should not all be referred to as fact simultaneously. One solution would be to number the activation functions like fact1, fact2, . . . , fact|H| with H being the set of hidden neurons. But as a result the explanation would be very confusing. So I simply use the name fact for all activation functions and regard σ and c as variables that are defined for individual neurons but not directly included in the activation function.

The reader will certainly notice that in the literature the Gaussian bell is often normalized by a multiplicative factor. We can, however, avoid this factor because we are multiplying anyway with the subsequent weights, and consecutive multiplications, first by a normalization factor and then by the connections’ weights, would only yield different factors there. We do not need this factor (especially because for our purpose the integral of the Gaussian bell does not have to be 1) and therefore simply leave it out.
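Putting distance, Gaussian bell and weighted sum together, the forward pass of an RBF network with one output neuron can be sketched in a few lines (an illustrative sketch with freely chosen names; the weighted output sum is formalized as equation 6.4 below):

import numpy as np

def rbf_output(x, centers, sigmas, weights):
    """x: input vector; centers: |H| x dim array; sigmas, weights: length |H|."""
    r = np.linalg.norm(centers - x, axis=1)        # distances r_h, equations 6.1 and 6.2
    act = np.exp(-(r ** 2) / (2 * sigmas ** 2))    # Gaussian bell, equation 6.3
    return np.dot(weights, act)                    # weighted sum in the output neuron

For instance, the 1-4-1 network of fig. 6.4 would correspond to centers = np.array([[0.], [1.], [3.], [4.]]) and sigmas = np.array([0.4, 1, 0.2, 0.8]), with the weights scaling the heights of the bells.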

6.2.2 Some analytical thoughts prior to the training

The output yΩ of an RBF output neuron Ω results from combining the functions of an RBF neuron to

yΩ = Σh∈H wh,Ω · fact(||x − ch||).        (6.4)

Suppose that, similar to the multilayer perceptron, we have a set P that contains |P | training samples (p, t). Then we obtain |P | functions of the form

yΩ = Σh∈H wh,Ω · fact(||p − ch||),        (6.5)

i.e. one function for each training sample.

Of course, with this effort we are aiming at letting the output y for all training patterns p converge to the corresponding teaching input t.

6.2.2.1 Weights can simply be computed as the solution of a system of equations

Thus, we have |P | equations. Now let us assume that the widths σ1, σ2, . . . , σk, the centers c1, c2, . . . , ck and the training samples p including the teaching input t are given. We are looking for the weights wh,Ω with |H| weights for one output neuron Ω. Thus, our problem can be seen as a system of equations since the only thing we want to change at the moment are the weights.

This demands a distinction of cases concerning the number of training samples |P | and the number of RBF neurons |H|:

|P | = |H|: If the number of RBF neurons equals the number of patterns, i.e. |P | = |H|, the equation can be reduced to a matrix multiplication

T = M · G                            (6.6)
⇔ M−1 · T = M−1 · M · G              (6.7)
⇔ M−1 · T = E · G                    (6.8)
⇔ M−1 · T = G,                       (6.9)

where

. T is the vector of the teaching inputs for all training samples,

. M is the |P | × |H| matrix of the outputs of all |H| RBF neurons to |P | samples (remember: |P | = |H|, so the matrix is square and we can therefore attempt to invert it),

. G is the vector of the desired weights and

. E is a unit matrix with the same size as G.

Mathematically speaking, we can simply calculate the weights: in the case of |P | = |H| there is exactly one RBF neuron available per training sample. This means that the network exactly meets the |P | existing nodes after having calculated the weights, i.e. it performs a precise interpolation. To calculate such an equation we certainly do not need an RBF network, and therefore we can proceed to the next case.

Exact interpolation must not be mistaken for the memorizing ability mentioned with the MLPs: first, we are not talking about the training of RBF networks at the moment. Second, it could be advantageous for us and might in fact be intended if the network exactly interpolates between the nodes.

|P | < |H|: The system of equations is under-determined, there are more RBF neurons than training samples, i.e. |P | < |H|. Certainly, this case normally does not occur very often. In this case, there is a huge variety of solutions which we do not need in such detail. We can select one set of weights out of many obviously possible ones.

|P | > |H|: But most interesting for further discussion is the case if there are significantly more training samples than RBF neurons, that means |P | > |H|. Thus, we again want to use the generalization capability of the neural network.

If we have more training samples than RBF neurons, we cannot assume that every training sample is exactly hit. So, if we cannot exactly hit the points and therefore cannot just interpolate as in the aforementioned ideal case with |P | = |H|, we must try to find a function that approximates our training set P as closely as possible: as with the MLP we try to reduce the sum of the squared error to a minimum.

How do we continue the calculation in the case of |P | > |H|? As above, to solve the system of equations, we have to find the solution G of the matrix multiplication

T = M · G.        (6.10)

The problem is that this time we cannot invert the |P | × |H| matrix M because it is not a square matrix (here, |P | ≠ |H| is true). Here, we have to use the Moore-Penrose pseudo inverse M+, which is defined by

M+ = (MT · M)−1 · MT        (6.11)

Although the Moore-Penrose pseudo inverse is not the inverse of a matrix, it can be used similarly in this case¹. We get equations that are very similar to those in the case of |P | = |H|:

T = M · G                    (6.12)
⇔ M+ · T = M+ · M · G        (6.13)
⇔ M+ · T = E · G             (6.14)
⇔ M+ · T = G                 (6.15)

Another reason for the use of the Moore-Penrose pseudo inverse is the fact that it minimizes the squared error (which is our goal): the estimate of the vector G in equation 6.15 corresponds to the Gauss-Markov model known from statistics, which is used to minimize the squared error. In equation 6.11 and the following ones, please do not mistake the T in MT (the transpose of the matrix M) for the T of the vector of all teaching inputs.

1 Particularly, M+ = M−1 is true if M is invertible. I do not want to go into detail of the reasons for these circumstances and applications of M+ – they can easily be found in literature for linear algebra.
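As a minimal sketch (assuming numpy and freely chosen names), the analytical weight computation of equation 6.15 fits into a few lines; numpy's pinv routine computes the Moore-Penrose pseudo inverse:

import numpy as np

def analytic_weights(patterns, teach, centers, sigmas):
    """patterns: |P| x dim, teach: length |P|, centers: |H| x dim, sigmas: length |H|."""
    r = np.linalg.norm(patterns[:, None, :] - centers[None, :, :], axis=2)
    M = np.exp(-(r ** 2) / (2 * sigmas ** 2))   # |P| x |H| matrix M of RBF outputs
    return np.linalg.pinv(M) @ teach            # G = M+ . T  (equation 6.15)

If teach is passed as a |P | × |O| matrix instead of a vector, the same call returns one weight vector per output neuron, which is exactly the generalization discussed in the next section.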


6.2.2.2 The generalization to several outputs is trivial and not very computationally expensive

We have found a mathematically exact way to directly calculate the weights. What will happen if there are several output neurons, i.e. |O| > 1, with O being, as usual, the set of the output neurons Ω? In this case, as we have already indicated, not much changes: the additional output neurons have their own set of weights while we do not change the σ and c of the RBF layer. Thus, in an RBF network it is easy, for given σ and c, to realize a lot of output neurons since we only have to calculate the individual vector of weights

GΩ = M+ · TΩ        (6.16)

for every new output neuron Ω, whereas the matrix M+, which generally requires a lot of computational effort, always stays the same: so it is quite inexpensive – at least concerning the computational complexity – to add more output neurons.

6.2.2.3 Computational effort and accuracy

For realistic problems it normally applies that there are considerably more training samples than RBF neurons, i.e. |P | ≫ |H|: you can, without any difficulty, use 10⁶ training samples, if you like. Theoretically, we could find the terms for the mathematically correct solution on the blackboard (after a very long time), but such calculations often seem to be imprecise and very time-consuming (matrix inversions require a lot of computational effort).

Furthermore, our Moore-Penrose pseudo inverse is, in spite of numeric stability, no guarantee that the output vector corresponds to the teaching vector, because such extensive computations can be prone to many inaccuracies, even though the calculation is mathematically correct: our computers can only provide us with (nonetheless very good) approximations of the pseudo-inverse matrices. This means that we also get only approximations of the correct weights (maybe with a lot of accumulated numerical errors) and therefore only an approximation (maybe very rough or even unrecognizable) of the desired output.

If we have enough computing power to analytically determine a weight vector, we should nevertheless use it only as an initial value for our learning process, which leads us to the real training methods – but otherwise it would be boring, wouldn't it?

6.3 Combinations of equation system and gradient strategies are useful for training

Analogous to the MLP we perform a gradient descent to find the suitable weights by means of the already well-known delta rule. Here, backpropagation is unnecessary since we only have to train one single weight layer – which requires less computing time.

We know that the delta rule is

∆wh,Ω = η · δΩ · oh,        (6.17)

in which we now insert as follows:

∆wh,Ω = η · (tΩ − yΩ) · fact(||p − ch||)        (6.18)

Here again I explicitly want to mention that it is very popular to divide the training into two phases by analytically computing a set of weights and then refining it by training with the delta rule.
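A sketch of this second phase (offline, using the |P | × |H| matrix M of RBF outputs from above; the names are chosen freely): the analytically computed weights serve as the starting point and equation 6.18 is applied repeatedly.

def refine_weights(M, teach, G, eta=0.01, epochs=100):
    """Offline delta rule on the single trainable layer; M is |P| x |H|, G has length |H|."""
    for _ in range(epochs):
        y = M @ G                         # outputs for all training samples
        # M.T @ (teach - y) sums the per-sample terms (t - y) * o_h of equation 6.18
        G = G + eta * (M.T @ (teach - y))
    return G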

There is still the question whether to learn offline or online. Here, the answer is similar to the answer for the multilayer perceptron: initially, one often trains online (faster movement across the error surface). Then, after having approximated the solution, the errors are once again accumulated and, for a more precise approximation, one trains offline in a third learning phase. However, similar to the MLPs, you can be successful by using many methods.

As already indicated, in an RBF network not only the weights between the hidden and the output layer can be optimized. So let us now take a look at the possibility to vary σ and c.

6.3.1 It is not always trivial to determine centers and widths of RBF neurons

It is obvious that the approximation accuracy of RBF networks can be increased by adapting the widths and positions of the Gaussian bells in the input space to the problem that needs to be approximated. There are several methods to deal with the centers c and the widths σ of the Gaussian bells:

Fixed selection: The centers and widths can be selected in a fixed manner and regardless of the training samples – this is what we have assumed until now.

Conditional, fixed selection: Again centers and widths are selected fixedly, but we have previous knowledge about the functions to be approximated and comply with it.

Adaptive to the learning process: This is definitely the most elegant variant, but certainly the most challenging one, too. A realization of this approach will not be discussed in this chapter but it can be found in connection with another network topology (section 10.6.1).

6.3.1.1 Fixed selection

In any case, the goal is to cover the input space as evenly as possible. Here, widths of 2/3 of the distance between the centers can be selected so that the Gaussian bells overlap by approx. "one third"² (fig. 6.6). The closer the bells are set, the more precise but also the more time-consuming the whole thing becomes.

Figure 6.6: Example for an even coverage of a two-dimensional input space by applying radial basis functions.

This may seem to be very inelegant, but in the field of function approximation we cannot avoid even coverage. Here it is useless if the function to be approximated is precisely represented at some positions but at other positions the return value is only 0. However, a high input dimension requires a great many RBF neurons, which increases the computational effort exponentially with the dimension – and is responsible for the fact that six- to ten-dimensional problems in RBF networks are already called "high-dimensional" (an MLP, for example, does not cause any problems here).

2 It is apparent that a Gaussian bell is mathematically infinitely wide, therefore I ask the reader to excuse this sloppy formulation.

6.3.1.2 Conditional, fixed selection

Suppose that our training samples are not evenly distributed across the input space. It then seems obvious to arrange the centers and sigmas of the RBF neurons by means of the pattern distribution. So the training patterns can be analyzed by statistical techniques such as a cluster analysis, and so it can be determined whether there are statistical factors according to which we should distribute the centers and sigmas (fig. 6.7 on the facing page).

A more trivial alternative would be to set |H| centers on positions randomly selected from the set of patterns. So this method would allow for every training pattern p to be directly in the center of a neuron (fig. 6.8 on the next page). This is not yet very elegant but a good solution when time is an issue. Generally, for this method the widths are fixedly selected.
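A sketch of this trivial variant (the names and the common width are assumptions chosen for illustration): draw |H| of the training patterns as centers and fix one shared width.

import numpy as np

def random_centers(patterns, n_hidden, sigma=0.5, rng=np.random.default_rng()):
    """Pick n_hidden centers directly from the training patterns, with one fixed width."""
    idx = rng.choice(len(patterns), size=n_hidden, replace=False)
    return patterns[idx], np.full(n_hidden, sigma)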

If we have reason to believe that the set of training samples is clustered, we can use clustering methods to determine them. There are different methods to determine clusters in an arbitrarily dimensional set of points. We will be introduced to some of them in excursus A. One neural clustering method is given by the so-called ROLFs (section A.5), and self-organizing maps are also useful in connection with determining the position of RBF neurons (section 10.6.1). Using ROLFs, one can also receive indicators for useful radii of the RBF neurons. Learning vector quantisation (chapter 9) has also provided good results. All these methods have nothing to do with the RBF networks themselves but are only used to generate some previous knowledge. Therefore we will not discuss them in this chapter but independently in the indicated chapters.

Figure 6.7: Example of an uneven coverage of a two-dimensional input space, of which we have previous knowledge, by applying radial basis functions.

Another approach is to use the tried and tested methods: we could slightly move the positions of the centers and observe how our error function Err changes – a gradient descent, as already known from the MLPs.

Figure 6.8: Example of an uneven coverage of a two-dimensional input space by applying radial basis functions. The widths were fixedly selected, the centers of the neurons were randomly distributed throughout the training patterns. This distribution can certainly lead to slightly unrepresentative results, which can be seen at the single data point down to the left.


In a similar manner we could look at how the error depends on the values σ. Analogous to the derivation of backpropagation we derive

∂Err(σh, ch)/∂σh  and  ∂Err(σh, ch)/∂ch.

Since the derivation of these terms corresponds to the derivation of backpropagation we do not want to discuss it here.

But experience shows that no convincing results are obtained by regarding how the error behaves depending on the centers and sigmas. Even if mathematics claims that such methods are promising, the gradient descent, as we already know, leads to problems with very craggy error surfaces.

And that is the crucial point: naturally, RBF networks generate very craggy error surfaces because, if we considerably change a c or a σ, we will significantly change the appearance of the error function.

6.4 Growing RBF networks automatically adjust the neuron density

In growing RBF networks, the number |H| of RBF neurons is not constant. A certain number |H| of neurons as well as their centers ch and widths σh are previously selected (e.g. by means of a clustering method) and then extended or reduced.

In the following text, only simple mechanisms are sketched. For more information, I refer to [Fri94].

6.4.1 Neurons are added to places with large error values

After generating this initial configuration, the vector of the weights G is analytically calculated. Then all specific errors Errp concerning the set P of the training samples are calculated and the maximum specific error

maxp∈P (Errp)

is sought.

The extension of the network is simple: we replace this maximum error with a new RBF neuron. Of course, we have to exercise care in doing this: if the σ are small, the neurons will only influence each other if the distance between them is short. But if the σ are large, the already existing neurons are considerably influenced by the new neuron because of the overlapping of the Gaussian bells.

So it is obvious that we will adjust the already existing RBF neurons when adding the new neuron.

To put it simply, this adjustment is made by moving the centers c of the other neurons away from the new neuron and reducing their width σ a bit. Then the current output vector y of the network is compared to the teaching input t and the weight vector G is improved by means of training. Subsequently, a new neuron can be inserted if necessary. This method is particularly suited for function approximations.

6.4.2 Limiting the number of neurons

Here it is mandatory to ensure that the network will not grow ad infinitum, which can happen very fast. Thus, it is very useful to previously define a maximum number of neurons |H|max.

6.4.3 Less important neurons are deleted

This leads to the question whether it is possible to continue learning when this limit |H|max is reached. The answer is: this would not stop learning. We only have to look for the "most unimportant" neuron and delete it. A neuron is, for example, unimportant for the network if there is another neuron that has a similar function: it often occurs that two Gaussian bells exactly overlap, and at such a position, for instance, one single neuron with a higher Gaussian bell would be appropriate.

But developing automated procedures in order to find less relevant neurons is highly problem dependent, and we want to leave this to the programmer.

With RBF networks and multilayer perceptrons we have already become acquainted with and extensively discussed two network paradigms for similar problems. Therefore we want to compare these two paradigms and look at their advantages and disadvantages.

6.5 Comparing RBF networks and multilayer perceptrons

We will compare multilayer perceptrons and RBF networks with respect to different aspects.

Input dimension: We must be careful with RBF networks in high-dimensional function spaces since the network could very quickly require huge memory storage and computational effort. Here, a multilayer perceptron would cause fewer problems because its number of neurons does not grow exponentially with the input dimension.

Center selection: However, selecting the centers c for RBF networks is (despite the introduced approaches) still a major problem. Please use any previous knowledge you have when applying them. Such problems do not occur with the MLP.

Output dimension: The advantage of RBF networks is that the training is not much influenced when the output dimension of the network is high. For an MLP, a learning procedure such as backpropagation will thereby become very time-consuming.

Extrapolation: An advantage as well as a disadvantage of RBF networks is the lack of extrapolation capability: an RBF network returns the result 0 far away from the centers of the RBF layer. On the one hand it does not extrapolate: unlike the MLP, it cannot be used for extrapolation (whereby we could never know whether the extrapolated values of the MLP are reasonable, but experience shows that MLPs are suitable for that matter). On the other hand, unlike the MLP, the network is capable of using this 0 to tell us "I don't know", which could be an advantage.

Lesion tolerance: For the output of an MLP, it is not so important if a weight or a neuron is missing. It will only worsen a little in total. If a weight or a neuron is missing in an RBF network, then large parts of the output remain practically uninfluenced. But one part of the output is heavily affected because a Gaussian bell is directly missing. Thus, we can choose between a strong local error for lesion and a weak but global error.

Spread: Here the MLP is "advantaged" since RBF networks are used considerably less often – which is not always understood by professionals (at least as far as low-dimensional input spaces are concerned). The MLPs seem to have a considerably longer tradition and they are working too well to take the effort to read some pages of this work about RBF networks :-).

Exercises

Exercise 13. An |I|-|H|-|O| RBF network with fixed widths and centers of the neurons should approximate a target function u. For this, |P | training samples of the form (p, t) of the function u are given. Let |P | > |H| be true. The weights should be analytically determined by means of the Moore-Penrose pseudo inverse. Indicate the running time behavior regarding |P | and |O| as precisely as possible.

Note: There are methods for matrix multiplications and matrix inversions that are more efficient than the canonical methods. For better estimations, I recommend looking for such methods (and their complexity). In addition to your complexity calculations, please indicate the used methods together with their complexity.


Chapter 7

Recurrent perceptron-like networks
Some thoughts about networks with internal states.

Generally, recurrent networks are networks that are capable of influencing themselves by means of recurrences, e.g. by including the network output in the following computation steps. There are many types of recurrent networks of nearly arbitrary form, and nearly all of them are referred to as recurrent neural networks. As a result, for the few paradigms introduced here I use the name recurrent multilayer perceptrons.

Apparently, such a recurrent network is capable of computing more than the ordinary MLP: If the recurrent weights are set to 0, the recurrent network will be reduced to an ordinary MLP. Additionally, the recurrence generates different network-internal states so that different inputs can produce different outputs in the context of the network state.

Recurrent networks in themselves have a great dynamic that is mathematically difficult to conceive and has to be discussed extensively. The aim of this chapter is only to briefly discuss how recurrences can be structured and how network-internal states can be generated. Thus, I will briefly introduce two paradigms of recurrent networks and afterwards roughly outline their training.

With a recurrent network, an input x that is constant over time may lead to different results: On the one hand, the network could converge, i.e. it could transform itself into a fixed state and at some time return a fixed output value y. On the other hand, it could never converge, or at least not until a long time later, so that it can no longer be recognized, and as a consequence, y constantly changes.

If the network does not converge, it is, for example, possible to check whether periodic states or attractors (fig. 7.1 on the following page) are returned. Here, we can expect the complete variety of dynamical systems. That is the reason why I particularly want to refer to the literature concerning dynamical systems.


Figure 7.1: The Roessler attractor

Further discussions could reveal what will happen if the input of recurrent networks is changed.

In this chapter the related paradigms of recurrent networks according to Jordan and Elman will be introduced.

7.1 Jordan networks

A Jordan network [Jor86] is a multilayer perceptron with a set K of so-called context neurons k1, k2, . . . , k|K|. There is one context neuron per output neuron (fig. 7.2 on the next page). In principle, a context neuron just memorizes an output until it can be processed in the next time step. Therefore, there are weighted connections between each output neuron and one context neuron. The stored values are returned to the actual network by means of complete links between the context neurons and the input layer.

In the original definition of a Jordan network the context neurons are also recurrent to themselves via a connecting weight λ. But most applications omit this recurrence since the Jordan network is already very dynamic and difficult to analyze, even without these additional recurrences.

Definition 7.1 (Context neuron). A context neuron k receives the output value of another neuron i at a time t and then reenters it into the network at a time (t + 1).

Definition 7.2 (Jordan network). A Jordan network is a multilayer perceptron with one context neuron per output neuron. The set of context neurons is called K. The context neurons are completely linked toward the input layer of the network.

Figure 7.2: Illustration of a Jordan network. The network output is buffered in the context neurons and with the next time step it is entered into the network together with the new input.
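To make the buffering mechanism concrete, here is a minimal Python sketch of a Jordan network forward pass. The layer sizes, the random weights and the tanh activation are arbitrary assumptions for illustration; only the structure (one context neuron per output neuron, fed back towards the input side of the network) follows the text.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 3, 1

# Weights of an MLP whose first layer sees the external input plus the context
# neurons (which buffer the previous output, one context neuron per output neuron).
W_hidden = rng.normal(size=(n_hidden, n_in + n_out))   # (input + context) -> hidden
W_out    = rng.normal(size=(n_out, n_hidden))          # hidden -> output

context = np.zeros(n_out)   # buffered output, initialized (see section 7.3)

def jordan_step(x):
    """One time step of the Jordan network."""
    global context
    extended = np.concatenate([x, context])   # context is fed back via the input side
    h = np.tanh(W_hidden @ extended)
    y = np.tanh(W_out @ h)
    context = y.copy()                        # store the output for the next time step
    return y

# A constant input can still yield different outputs because the internal state changes.
for t in range(4):
    print(t, jordan_step(np.array([1.0, 0.0])))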

7.2 Elman networks

The Elman networks (a variation of the Jordan networks) [Elm90] have context neurons, too, but one layer of context neurons per information processing neuron layer (fig. 7.3 on the following page). Thus, the outputs of each hidden neuron or output neuron are led into the associated context layer (again exactly one context neuron per neuron) and from there they are reentered into the complete neuron layer during the next time step (i.e. again a complete link on the way back). So the complete information processing part of the MLP (remember: the input layer does not process information) exists a second time as a "context version" – which once again considerably increases dynamics and state variety.

Compared with Jordan networks, the Elman networks often have the advantage of acting more purposefully since every layer can access its own context.

Definition 7.3 (Elman network). An Elman network is an MLP with one context neuron per information processing neuron. The set of context neurons is called K. This means that there exists one context layer per information processing neuron layer with exactly the same number of context neurons. Every neuron has a weighted connection to exactly one context neuron while the context layer is completely linked towards its original layer.

Figure 7.3: Illustration of an Elman network. The entire information processing part of the network exists, in a way, twice. The output of each neuron (except for the output of the input neurons) is buffered and reentered into the associated layer. For the reason of clarity I named the context neurons on the basis of their models in the actual network, but it is not mandatory to do so.

Now it is interesting to take a look at the training of recurrent networks since, for instance, ordinary backpropagation of error cannot work on recurrent networks. Once again, the style of the following part is rather informal, which means that I will not use any formal definitions.

7.3 Training recurrent networks

In order to explain the training as comprehensibly as possible, we have to agree on some simplifications that do not affect the learning principle itself.

So for the training let us assume that in the beginning the context neurons are initialized with an input, since otherwise they would have an undefined input (this is no simplification but reality).

Furthermore, we use a Jordan network without a hidden neuron layer for our training attempts so that the output neurons can directly provide input. This approach is a strong simplification because generally more complicated networks are used. But this does not change the learning principle.

7.3.1 Unfolding in time

Remember our actual learning procedure for MLPs, the backpropagation of error, which backpropagates the delta values. So, in case of recurrent networks the delta values would backpropagate cyclically through the network again and again, which makes the training more difficult. On the one hand we cannot know which of the many generated delta values for a weight should be selected for training, i.e. which values are useful. On the other hand we cannot definitely know when learning should be stopped. The advantage of recurrent networks is their great state dynamics within the network; the disadvantage of recurrent networks is that these dynamics are also granted to the training and therefore make it difficult.

One learning approach would be the attempt to unfold the temporal states of the network (fig. 7.4 on the next page): Recursions are deleted by putting a similar network above the context neurons, i.e. the context neurons are, as a manner of speaking, the output neurons of the attached network. More generally speaking, we have to backtrack the recurrences and place "earlier" instances of neurons in the network – thus creating a larger, but forward-oriented network without recurrences. This enables training a recurrent network with any training strategy developed for non-recurrent ones. Here, the input is entered as teaching input into every "copy" of the input neurons. This can be done for a discrete number of time steps. These training paradigms are called unfolding in time [MP69]. After the unfolding a training by means of backpropagation of error is possible.

But obviously, for one weight wi,j several changing values ∆wi,j are received, which can be treated differently: accumulation, averaging etc. A simple accumulation could possibly result in enormous changes per weight if all changes have the same sign. Hence, averaging is not to be underestimated either. We could also introduce a discounting factor, which weakens the influence of the ∆wi,j of the past.
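The following small sketch illustrates the three ways of combining the per-copy weight changes just mentioned. The numbers and the discount factor are made up for illustration only.

import numpy as np

# Suppose unfolding over several time steps produced one weight change per
# "network copy" for the same weight w_{i,j} (values are made up).
deltas = np.array([0.30, 0.25, 0.20, 0.28, 0.27])   # oldest ... most recent

accumulated = deltas.sum()           # simple accumulation: can become very large
averaged    = deltas.mean()          # averaging keeps the step size moderate

gamma = 0.8                          # discount factor < 1: the further in the past,
weights = gamma ** np.arange(len(deltas) - 1, -1, -1)
discounted = (weights * deltas).sum()   # ...the weaker the influence of a delta

print(accumulated, averaged, discounted)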

Unfolding in time is particularly useful if we receive the impression that the closer past is more important for the network than the one being further away. The reason for this is that backpropagation has only little influence in the layers farther away from the output (remember: the farther we are from the output layer, the smaller the influence of backpropagation).

Disadvantages: the training of such an unfolded network will take a long time since a large number of layers could possibly be produced. A problem that is no longer negligible is the limited computational accuracy of ordinary computers, which is exhausted very fast because of so many nested computations (the farther we are from the output layer, the smaller the influence of backpropagation, so that this limit is reached). Furthermore, with several levels of context neurons this procedure could produce very large networks to be trained.

Figure 7.4: Illustration of the unfolding in time with a small exemplary recurrent MLP. Top: The recurrent MLP. Bottom: The unfolded network. For reasons of clarity, I only added names to the lowest part of the unfolded network. Dotted arrows leading into the network mark the inputs. Dotted arrows leading out of the network mark the outputs. Each "network copy" represents a time step of the network with the most recent time step being at the bottom.

7.3.2 Teacher forcing

Other procedures are the equivalent teacher forcing and open loop learning. They detach the recurrence during the learning process: We simply pretend that the recurrence does not exist and apply the teaching input to the context neurons during the training. So, backpropagation becomes possible, too. Disadvantage: with Elman networks a teaching input for non-output neurons is not given.

7.3.3 Recurrent backpropagation

Another popular procedure without limited time horizon is the recurrent backpropagation using methods of differential calculus to solve the problem [Pin87].

7.3.4 Training with evolution

Due to the already long training times, evolutionary algorithms have proved to be of value, especially with recurrent networks. One reason for this is that they are not only unrestricted with respect to recurrences but they also have other advantages when the mutation mechanisms are chosen suitably: So, for example, neurons and weights can be adjusted and the network topology can be optimized (of course the result of learning is not necessarily a Jordan or Elman network). With ordinary MLPs, however, evolutionary strategies are less popular since they certainly need a lot more time than a directed learning procedure such as backpropagation.


Chapter 8

Hopfield networks
In a magnetic field, each particle applies a force to any other particle so that all particles adjust their movements in the energetically most favorable way. This natural mechanism is copied to adjust noisy inputs in order to match their real models.

Another supervised learning example of the wide range of neural networks was developed by John Hopfield: the so-called Hopfield networks [Hop82]. Hopfield and his physically motivated networks have contributed a lot to the renaissance of neural networks.

8.1 Hopfield networks are inspired by particles in a magnetic field

The idea for the Hopfield networks originated from the behavior of particles in a magnetic field: Every particle "communicates" (by means of magnetic forces) with every other particle (completely linked), with each particle trying to reach an energetically favorable state (i.e. a minimum of the energy function). As for the neurons this state is known as activation. Thus, all particles or neurons rotate and thereby encourage each other to continue this rotation. As a manner of speaking, our neural network is a cloud of particles.

Based on the fact that the particles automatically detect the minima of the energy function, Hopfield had the idea to use the "spin" of the particles to process information: Why not let the particles search for minima on arbitrary functions? Even if we only use two of those spins, i.e. a binary activation, we will recognize that the developed Hopfield network shows considerable dynamics.

8.2 In a Hopfield network, all neurons influence each other symmetrically

Briefly speaking, a Hopfield network consists of a set K of completely linked neurons with binary activation (since we only use two spins), with the weights being symmetric between the individual neurons and without any neuron being directly connected to itself (fig. 8.1). Thus, the state of |K| neurons with two possible states ∈ {−1, 1} can be described by a string x ∈ {−1, 1}^{|K|}.

Figure 8.1: Illustration of an exemplary Hopfield network. The arrows ↑ and ↓ mark the binary "spin". Due to the completely linked neurons the layers cannot be separated, which means that a Hopfield network simply includes a set of neurons.

The complete link provides a full square matrix of weights between the neurons. The meaning of the weights will be discussed in the following. Furthermore, we will soon recognize according to which rules the neurons are spinning, i.e. changing their state.

Additionally, the complete link leads to the fact that we do not know any input, output or hidden neurons. Thus, we have to think about how we can input something into the |K| neurons.

Definition 8.1 (Hopfield network). A Hopfield network consists of a set K of completely linked neurons without direct recurrences. The activation function of the neurons is the binary threshold function with outputs ∈ {1, −1}.

Definition 8.2 (State of a Hopfield network). The state of the network consists of the activation states of all neurons. Thus, the state of the network can be understood as a binary string z ∈ {−1, 1}^{|K|}.

8.2.1 Input and output of a Hopfield network are represented by neuron states

We have learned that a network, i.e. a set of |K| particles, that is in a state is automatically looking for a minimum. An input pattern of a Hopfield network is exactly such a state: a binary string x ∈ {−1, 1}^{|K|} that initializes the neurons. Then the network is looking for the minimum to be taken (which we have previously defined by the input of training samples) on its energy surface.

But when do we know that the minimum has been found? This is simple, too: when the network stops. It can be proven that a Hopfield network with a symmetric weight matrix that has zeros on its diagonal always converges [CG88], i.e. at some point it will stand still. Then the output is a binary string y ∈ {−1, 1}^{|K|}, namely the state string of the network that has found a minimum.


Now let us take a closer look at the contents of the weight matrix and the rules for the state change of the neurons.

Definition 8.3 (Input and output of a Hopfield network). The input of a Hopfield network is a binary string x ∈ {−1, 1}^{|K|} that initializes the state of the network. After the convergence of the network, the output is the binary string y ∈ {−1, 1}^{|K|} generated from the new network state.

8.2.2 Significance of weights

We have already said that the neurons change their states, i.e. their direction, from −1 to 1 or vice versa. These spins occur dependent on the current states of the other neurons and the associated weights. Thus, the weights are capable of controlling the complete change of the network. The weights can be positive, negative, or 0. Colloquially speaking, for a weight wi,j between two neurons i and j the following holds:

If wi,j is positive, it will try to force the two neurons to become equal – the larger it is, the harder the network will try. If the neuron i is in state 1 and the neuron j is in state −1, a high positive weight will advise the two neurons that it is energetically more favorable to be equal.

If wi,j is negative, its behavior will be analogous, only that i and j are urged to be different. A neuron i in state −1 would try to urge a neuron j into state 1.

Zero weights lead to the two involved neurons not influencing each other.

The weights as a whole apparently define the path from the current state of the network towards the next minimum of the energy function. We now want to discuss how the neurons follow this path.

8.2.3 A neuron changes its state according to the influence of the other neurons

Once a network has been trained and initialized with some starting state, the change of state xk of the individual neurons k occurs according to the scheme

x_k(t) = f_{act}\left( \sum_{j \in K} w_{j,k} \cdot x_j(t-1) \right) \qquad (8.1)

in each time step, where the function fact generally is the binary threshold function (fig. 8.2 on the next page) with threshold 0. Colloquially speaking: a neuron k calculates the sum of wj,k · xj(t − 1), which indicates how strongly and into which direction the neuron k is forced by the other neurons j. Thus, the new state of the network (time t) results from the state of the network at the previous time t − 1. This sum is the direction into which the neuron k is pushed. Depending on the sign of the sum the neuron takes state 1 or −1.

Another difference between Hopfield networks and other already known network topologies is the asynchronous update: A neuron k is randomly chosen every time, which then recalculates its activation.


Figure 8.2: Illustration of the binary threshold function (Heaviside function).

Thus, the new activation of the previously changed neurons immediately influences the network, i.e. one time step indicates the change of a single neuron.

Regardless of the aforementioned random selection of the neuron, a Hopfield network is often much easier to implement: The neurons are simply processed one after the other and their activations are recalculated until no more changes occur.

Definition 8.4 (Change in the state of a Hopfield network). The change of state of the neurons occurs asynchronously, with the neuron to be updated being randomly chosen and the new state being generated by means of this rule:

x_k(t) = f_{act}\left( \sum_{j \in K} w_{j,k} \cdot x_j(t-1) \right).
Now that we know how the weights influence the changes in the states of the neurons and force the entire network towards a minimum, there is the question of how to teach the weights to force the network towards a certain minimum.
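As an illustration of the update rule, here is a minimal Python sketch of the recall process, assuming a symmetric weight matrix W with zero diagonal is already given (how to obtain it is the subject of the next section). It repeatedly processes the neurons, in random order, until no more changes occur, as described above.

import numpy as np

def hopfield_recall(W, x, max_sweeps=100, rng=np.random.default_rng(0)):
    """Asynchronous Hopfield update: pick a neuron, recompute its state via the
    binary threshold of sum_j w_{j,k} * x_j, repeat until nothing changes."""
    x = x.copy()
    n = len(x)
    for _ in range(max_sweeps):
        changed = False
        for k in rng.permutation(n):             # random order, one neuron per "time step"
            new_state = 1 if W[:, k] @ x >= 0 else -1
            if new_state != x[k]:
                x[k] = new_state
                changed = True
        if not changed:                          # converged: the network "stands still"
            return x
    return x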

8.3 The weight matrix is generated directly out of the training patterns

The aim is to generate minima on the mentioned energy surface, so that at an input the network can converge to them. As with many other network paradigms, we use a set P of training patterns p ∈ {1, −1}^{|K|}, representing the minima of our energy surface.

Unlike many other network paradigms, we do not look for the minima of an unknown error function but define minima on such a function. The purpose is that the network shall automatically take the closest minimum when the input is presented. For now this seems unusual, but we will understand the whole purpose later.

Roughly speaking, the training of a Hopfield network is done by training each training pattern exactly once using the rule described in the following (single shot learning), where pi and pj are the states of the neurons i and j under p ∈ P:

w_{i,j} = \sum_{p \in P} p_i \cdot p_j \qquad (8.2)

This results in the weight matrix W. Colloquially speaking: We initialize the network by means of a training pattern and then process the weights wi,j one after another.


For each of these weights we verify: Are the neurons i, j in the same state or do the states vary? In the first case we add 1 to the weight, in the second case we add −1.

This we repeat for each training pattern p ∈ P. Finally, the values of the weights wi,j are high when i and j corresponded with many training patterns. Colloquially speaking, this high value tells the neurons: "Often, it is energetically favorable to hold the same state". The same applies to negative weights.

Due to this training we can store a certain fixed number of patterns p in the weight matrix. At an input x the network will converge to the stored pattern that is closest to the input p.

Unfortunately, the number of the maximum storable and reconstructible patterns p is limited to

|P|_{MAX} \approx 0.139 \cdot |K|, \qquad (8.3)

which in turn only applies to orthogonal patterns. This was shown by precise (and time-consuming) mathematical analyses, which we do not want to specify now. If more patterns are entered, already stored information will be destroyed.

Definition 8.5 (Learning rule for Hopfield networks). The individual elements of the weight matrix W are defined by a single processing of the learning rule

w_{i,j} = \sum_{p \in P} p_i \cdot p_j ,

where the diagonal of the matrix is covered with zeros. Here, no more than |P|_MAX ≈ 0.139 · |K| training samples can be trained and at the same time maintain their function.
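A minimal sketch of the single shot learning rule (8.2): the weight matrix is the sum of the products p_i · p_j over all training patterns, with the diagonal set to zero. The example patterns are made up; the final line only checks that one stored pattern is a fixed point of the threshold update.

import numpy as np

def hopfield_train(patterns):
    """Single shot learning: w_{i,j} = sum_p p_i * p_j, with a zero diagonal."""
    patterns = np.asarray(patterns)          # shape (|P|, |K|), entries in {-1, 1}
    W = patterns.T @ patterns                # sums p_i * p_j over all patterns
    np.fill_diagonal(W, 0)                   # no neuron is directly connected to itself
    return W

# Example: store two 6-neuron patterns (made up) and check that the first one
# is a fixed point of the threshold update.
P = [[1, -1, 1, -1, 1, -1],
     [1, 1, -1, -1, 1, 1]]
W = hopfield_train(P)
p = np.array(P[0])
print(np.array_equal(np.where(W @ p >= 0, 1, -1), p))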

Now we know the functionality of Hopfield networks but nothing about their practical use.

8.4 Autoassociation and traditional application

Hopfield networks, like those mentioned above, are called autoassociators. An autoassociator a exactly shows the aforementioned behavior: Firstly, when a known pattern p is entered, exactly this known pattern is returned. Thus,

a(p) = p,

with a being the associative mapping. Secondly, and that is the practical use, this also works with inputs that are close to a pattern:

a(p + ε) = p.

Afterwards, the autoassociator is, in any case, in a stable state, namely in the state p.

If the set of patterns P consists of, for example, letters or other characters in the form of pixels, the network will be able to correctly recognize deformed or noisy letters with high probability (fig. 8.3 on the following page).

The primary fields of application of Hopfield networks are pattern recognition and pattern completion, such as the zip code recognition on letters in the eighties. But soon the Hopfield networks were replaced by other systems in most of their fields of application, for example by OCR systems in the field of letter recognition. Today, Hopfield networks are virtually no longer used; they have not become established in practice.

Figure 8.3: Illustration of the convergence of an exemplary Hopfield network. Each of the pictures has 10 × 12 = 120 binary pixels. In the Hopfield network each pixel corresponds to one neuron. The upper illustration shows the training samples, the lower shows the convergence of a heavily noisy 3 to the corresponding training sample.

8.5 Heteroassociation and analogies to neural data storage

So far we have been introduced to Hopfield networks that converge from an arbitrary input into the closest minimum of a static energy surface.

Another variant is a dynamic energy surface: Here, the appearance of the energy surface depends on the current state and we receive a heteroassociator instead of an autoassociator. For a heteroassociator

a(p+ ε) = p

is no longer true, but rather

h(p+ ε) = q,

which means that a pattern is mapped onto another one. h is the heteroassociative mapping. Such heteroassociations are achieved by means of an asymmetric weight matrix V.


Heteroassociations connected in series of the form

h(p + ε) = q
h(q + ε) = r
h(r + ε) = s
⋮
h(z + ε) = p

can provoke a fast cycle of states

p → q → r → s → . . . → z → p,

whereby a single pattern is never completely accepted: Before a pattern is entirely completed, the heteroassociation already tries to generate the successor of this pattern. Additionally, the network would never stop, since after having reached the last state z, it would proceed to the first state p again.

8.5.1 Generating the heteroassociative matrix

We generate the matrix V by means of elements v, very similar to the autoassociative matrix, with p being (per transition) the training sample before the transition and q being the training sample to be generated from p:

v_{i,j} = \sum_{p,q \in P,\, p \neq q} p_i q_j \qquad (8.4)

The diagonal of the matrix is again filled with zeros. The neuron states are, as always, adapted during operation. Several transitions can be introduced into the matrix by a simple addition, whereby the said limitation exists here, too.

Definition 8.6 (Learning rule for the heteroassociative matrix). For two training samples p being predecessor and q being successor of a heteroassociative transition, the weights of the heteroassociative matrix V result from the learning rule

v_{i,j} = \sum_{p,q \in P,\, p \neq q} p_i q_j ,

with several heteroassociations being introduced into the network by a simple addition.

8.5.2 Stabilizing the heteroassociations

We have already mentioned the problem that the patterns are not completely generated but that the next pattern is already beginning before the generation of the previous pattern is finished.

This problem can be avoided by not only influencing the network by means of the heteroassociative matrix V but also by the already known autoassociative matrix W.

Additionally, the neuron adaptation rule is changed so that competing terms are generated: One term autoassociating an existing pattern and one term trying to convert the very same pattern into its successor. The associative rule causes the network to stabilize a pattern, remain there for a while, go on to the next pattern, and so on.

x_i(t+1) = f_{act}\left( \underbrace{\sum_{j \in K} w_{i,j} x_j(t)}_{\text{autoassociation}} + \underbrace{\sum_{k \in K} v_{i,k} x_k(t - \Delta t)}_{\text{heteroassociation}} \right) \qquad (8.5)

Here, the value ∆t causes, descriptively speaking, the influence of the matrix V to be delayed, since it only refers to a network being ∆t versions behind. The result is a change in state, during which the individual states are stable for a short while. If ∆t is set to, for example, twenty steps, then the asymmetric weight matrix will realize any change in the network only twenty steps later, so that it initially works with the autoassociative matrix (since it still perceives the predecessor pattern of the current one), and only after that it will work against it.
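A small sketch of how the combined rule (8.5) could be iterated, keeping a short history of states so that the heteroassociative matrix V only sees the state ∆t steps in the past. For brevity all neurons are updated at once here (the text updates them asynchronously), and the matrices in the demo are random placeholders, not trained ones.

import numpy as np
from collections import deque

def run_association(W, V, x0, delta_t=20, steps=100):
    """Iterate x_i(t+1) = f_act( sum_j w_{i,j} x_j(t) + sum_k v_{i,k} x_k(t - delta_t) )."""
    history = deque([x0.copy()] * (delta_t + 1), maxlen=delta_t + 1)
    states = []
    for _ in range(steps):
        x_now, x_delayed = history[-1], history[0]
        drive = W @ x_now + V @ x_delayed     # autoassociation + delayed heteroassociation
        x_next = np.where(drive >= 0, 1, -1)  # binary threshold activation
        history.append(x_next)
        states.append(x_next)
    return states

# Demo with random placeholder matrices (symmetric W with zero diagonal, asymmetric V).
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(6, 6)); W = W + W.T; np.fill_diagonal(W, 0)
V = rng.integers(-1, 2, size=(6, 6)); np.fill_diagonal(V, 0)
print(run_association(W, V, x0=np.ones(6, dtype=int), delta_t=3, steps=5)[-1])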

8.5.3 Biological motivation of heteroassociation

From a biological point of view the transition of stable states into other stable states is highly motivated: At least in the beginning of the nineties it was assumed that the Hopfield model achieves an approximation of the state dynamics in the brain, which realizes a lot by means of state chains: When I ask you, dear reader, to recite the alphabet, you will generally manage this better than (please try it immediately) answering the following question:

Which letter in the alphabet follows the letter P?

Another example is the phenomenon that one cannot remember a situation, but the place at which one memorized it the last time is perfectly known. If one returns to this place, the forgotten situation often comes back to mind.

8.6 Continuous Hopfield networks

So far, we have only discussed Hopfield networks with binary activations. But Hopfield also described a version of his networks with continuous activations [Hop84], which we want to cover at least briefly: continuous Hopfield networks. Here, the activation is no longer calculated by the binary threshold function but by the Fermi function with temperature parameters (fig. 8.4 on the next page).

Here, the network is stable for symmetric weight matrices with zeros on the diagonal, too.

Hopfield also stated that continuous Hopfield networks can be applied to find acceptable solutions for the NP-hard travelling salesman problem [HT85]. According to some verification trials [Zel94], this statement cannot be upheld any more. But today there are faster algorithms for handling this problem, and therefore the Hopfield network is no longer used here.


Figure 8.4: The already known Fermi function with different temperature parameter variations.

Exercises

Exercise 14. Indicate the storage requirements for a Hopfield network with |K| = 1000 neurons when the weights wi,j shall be stored as integers. Is it possible to limit the value range of the weights in order to save storage space?

Exercise 15. Compute the weights wi,j for a Hopfield network using the training set

P = {(−1,−1,−1,−1,−1, 1);
(−1, 1, 1,−1,−1,−1);
(1,−1,−1, 1,−1, 1)}.


Chapter 9

Learning vector quantization
Learning vector quantization is a learning procedure with the aim to represent the vector training sets divided into predefined classes as well as possible by using a few representative vectors. If this has been managed, vectors which were unknown until then could easily be assigned to one of these classes.

Slowly, part II of this text is nearing its end – and therefore I want to write a last chapter for this part that will be a smooth transition into the next one: A chapter about the learning vector quantization (abbreviated LVQ) [Koh89] described by Teuvo Kohonen, which can be characterized as being related to the self-organizing feature maps. These SOMs are described in the next chapter that already belongs to part III of this text, since SOMs learn unsupervised. Thus, after the exploration of LVQ I want to bid farewell to supervised learning.

In advance, I want to announce that there are different variations of LVQ, which will be mentioned but not covered in detail. The goal of this chapter is rather to analyze the underlying principle.

9.1 About quantization

In order to explore learning vector quantization we should at first get a clearer picture of what quantization (which can also be referred to as discretization) is.

Everybody knows the sequence of discrete numbers

N = {1, 2, 3, . . .},

which contains the natural numbers. Discrete means that this sequence consists of separated elements that are not interconnected. The elements of our example are exactly such numbers, because the natural numbers do not include, for example, numbers between 1 and 2. On the other hand, the sequence of real numbers R, for instance, is continuous: It does not matter how close two selected numbers are, there will always be a number between them.


Quantization means that a continuous space is divided into discrete sections: By deleting, for example, all decimal places of the real number 2.71828, it could be assigned to the natural number 2. Here it is obvious that any other number having a 2 before the decimal point would also be assigned to the natural number 2, i.e. 2 would be some kind of representative for all real numbers within the interval [2; 3).

It must be noted that a sequence can be irregularly quantized, too: For instance, the timeline for a week could be quantized into working days and weekend.

A special case of quantization is digitization: In case of digitization we always talk about regular quantization of a continuous space into a number system with respect to a certain basis. If we enter, for example, some numbers into the computer, these numbers will be digitized into the binary system (basis 2).

Definition 9.1 (Quantization). Separation of a continuous space into discrete sections.

Definition 9.2 (Digitization). Regular quantization.

9.2 LVQ divides the input space into separate areas

Now it is almost possible to describe by means of its name what LVQ should enable us to do: A set of representatives should be used to divide an input space into classes that reflect the input space as well as possible (fig. 9.1 on the facing page). Thus, each element of the input space should be assigned to a vector as a representative, i.e. to a class, where the set of these representatives should represent the entire input space as precisely as possible. Such a vector is called codebook vector. A codebook vector is the representative of exactly those input space vectors lying closest to it, which divides the input space into the said discrete areas.

It is to be emphasized that we have to know in advance how many classes we have and which training sample belongs to which class. Furthermore, it is important that the classes need not be disjoint, which means they may overlap.

Such separation of data into classes is interesting for many problems for which it is useful to explore only some characteristic representatives instead of the possibly huge set of all vectors – be it because it is less time-consuming or because it is sufficiently precise.

9.3 Using codebook vectors: the nearest one is the winner

The use of a prepared set of codebook vectors is very simple: For an input vector y the class association is easily decided by considering which codebook vector is the closest – so, the codebook vectors build a Voronoi diagram out of the set. Since each codebook vector can clearly be associated to a class, each input vector is associated to a class, too.

Figure 9.1: Examples for quantization of a two-dimensional input space. The lines represent the class limits, the × mark the codebook vectors.

9.4 Adjusting codebook vectors

As we have already indicated, the LVQ is a supervised learning procedure. Thus, we have a teaching input that tells the learning procedure whether the classification of the input pattern is right or wrong: In other words, we have to know in advance the number of classes to be represented or the number of codebook vectors.

Roughly speaking, it is the aim of the learning procedure that training samples are used to cause a previously defined number of randomly initialized codebook vectors to reflect the training data as precisely as possible.

9.4.1 The procedure of learning

Learning works according to a simple scheme. We have (since learning is supervised) a set P of |P| training samples. Additionally, we already know that classes are predefined, too, i.e. we also have a set of classes C. A codebook vector is clearly assigned to each class. Thus, we can say that the set of classes C contains the codebook vectors C1, C2, . . . , C|C|.

This leads to the structure of the training samples: They are of the form (p, c) and therefore contain the training input vector p and its class affiliation c. For the class affiliation,

c ∈ {1, 2, . . . , |C|}

holds, which means that it clearly assigns the training sample to a class or a codebook vector.

Intuitively, we could say about learning: "Why a learning procedure? We calculate the average of all class members and place their codebook vectors there – and that's it." But we will see soon that our learning procedure can do a lot more.

I only want to briefly discuss the steps of the fundamental LVQ learning procedure:

Initialization: We place our set of codebook vectors on random positions in the input space.

Training sample: A training sample p of our training set P is selected and presented.

Distance measurement: We measure the distance ||p − C|| between all codebook vectors C1, C2, . . . , C|C| and our input p.

Winner: The closest codebook vector wins, i.e. the one with min_{C_i \in C} ||p − C_i||.

Learning process: The learning process takes place according to the rule

\Delta C_i = \eta(t) \cdot h(p, C_i) \cdot (p - C_i) \qquad (9.1)
C_i(t+1) = C_i(t) + \Delta C_i, \qquad (9.2)

which we now want to break down.

. We have already seen that the first factor η(t) is a time-dependent learning rate allowing us to differentiate between large learning steps and fine tuning.

. The last factor (p − Ci) is obviously the direction toward which the codebook vector is moved.

. But the function h(p, Ci) is the core of the rule: It implements a distinction of cases.

Assignment is correct: The winner vector is the codebook vector of the class that includes p. In this case, the function provides positive values and the codebook vector moves towards p.

Assignment is wrong: The winner vector does not represent the class that includes p. Therefore it moves away from p.

We can see that our definition of the function h was not precise enough. With good reason: From here on, the LVQ is divided into different nuances, dependent on how exactly h and the learning rate should be defined (called LVQ1, LVQ2, LVQ3, OLVQ, etc.). The differences are, for instance, in the strength of the codebook vector movements. They are not all based on the same principle described here, and as announced I don't want to discuss them any further. Therefore I don't give any formal definition regarding the aforementioned learning rule and LVQ.
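Since the text deliberately leaves h and the exact variant open, the following sketch only illustrates the spirit of the simplest case distinction (in the style of LVQ1, which is an assumption on my part): h is +1 when the winner's class matches the sample's class and −1 otherwise.

import numpy as np

def lvq1_epoch(codebooks, codebook_classes, samples, sample_classes, eta=0.1):
    """One pass over the training set with the simplest case distinction:
    move the winning codebook vector towards p if its class matches, else away."""
    for p, c in zip(samples, sample_classes):
        distances = np.linalg.norm(codebooks - p, axis=1)   # ||p - C_i|| for all i
        i = int(np.argmin(distances))                        # the closest vector wins
        h = 1.0 if codebook_classes[i] == c else -1.0        # correct vs. wrong assignment
        codebooks[i] += eta * h * (p - codebooks[i])         # Delta C_i = eta * h * (p - C_i)
    return codebooks

# Example with two classes in 2D (random initialization, data made up):
rng = np.random.default_rng(0)
C = rng.uniform(0, 1, size=(2, 2)); C_cls = np.array([0, 1])
X = np.array([[0.1, 0.1], [0.9, 0.9]]); y = np.array([0, 1])
print(lvq1_epoch(C, C_cls, X, y))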

9.5 Connection to neural networks

Until now, in spite of the learning process, the question was what LVQ has to do with neural networks. The codebook vectors can be understood as neurons with a fixed position within the input space, similar to RBF networks. Additionally, in nature it often occurs that in a group one neuron may fire (a winner neuron, here: a codebook vector) and, in return, inhibit all other neurons.

I decided to place this brief chapter about learning vector quantization here so that this approach can be continued in the following chapter about self-organizing maps: We will classify further inputs by means of neurons distributed throughout the input space, only that this time, we do not know which input belongs to which class.

Now let us take a look at the unsupervised learning networks!

Exercises

Exercise 16. Indicate a quantization which equally distributes all vectors H ∈ H in the five-dimensional unit cube H into one of 1024 classes.


Part III

Unsupervised learning network paradigms


Chapter 10

Self-organizing feature maps
A paradigm of unsupervised learning neural networks, which maps an input space by its fixed topology and thus independently looks for similarities. Function, learning procedure, variations and neural gas.

If you take a look at the concepts of biological neural networks mentioned in the introduction, one question will arise: How does our brain store and recall the impressions it receives every day? Let me point out that the brain does not have any training samples and therefore no "desired output". And while already considering this subject we realize that there is no output in this sense at all, either. Our brain responds to external input by changes in state. These are, so to speak, its output.

Based on this principle and exploring the question of how biological neural networks organize themselves, Teuvo Kohonen developed in the eighties his self-organizing feature maps [Koh82, Koh98], shortly referred to as self-organizing maps or SOMs: a paradigm of neural networks where the output is the state of the network, and which learns completely unsupervised, i.e. without a teacher.

Unlike the other network paradigms we have already got to know, for SOMs it is unnecessary to ask what the neurons calculate. We only ask which neuron is active at the moment. Biologically, this is very motivated: If in biology the neurons are connected to certain muscles, it will be less interesting to know how strongly a certain muscle is contracted but which muscle is activated. In other words: We are not interested in the exact output of the neuron but in knowing which neuron provides output. Thus, SOMs are considerably more related to biology than, for example, the feedforward networks, which are increasingly used for calculations.

10.1 Structure of a self-organizing map

Typically, SOMs have – like our brain – the task to map a high-dimensional input (N dimensions) onto areas in a low-dimensional grid of cells (G dimensions), to draw a map of the high-dimensional space, so to speak. To generate this map, the SOM simply obtains arbitrarily many points of the input space. During the input of the points the SOM will try to cover as well as possible the positions on which the points appear with its neurons. This particularly means that every neuron can be assigned to a certain position in the input space.

At first, these facts seem to be a bit confusing, and it is recommended to briefly reflect about them. There are two spaces in which SOMs are working:

. The N-dimensional input space and

. the G-dimensional grid on which the neurons are lying and which indicates the neighborhood relationships between the neurons and therefore the network topology.

In a one-dimensional grid, the neurons could be, for instance, like pearls on a string. Every neuron would have exactly two neighbors (except for the two end neurons). A two-dimensional grid could be a square array of neurons (fig. 10.1). Another possible array in two-dimensional space would be some kind of honeycomb shape. Irregular topologies are possible, too, but not very often. Topologies with more dimensions and considerably more neighborhood relationships would also be possible, but due to their lack of visualization capability they are not employed very often.

Figure 10.1: Example topologies of a self-organizing map. Above we can see a one-dimensional topology, below a two-dimensional one.

Even if N = G is true, the two spaces are not equal and have to be distinguished. In this special case they only have the same dimension.

Initially, we will briefly and formally regard the functionality of a self-organizing map and then make it clear by means of some examples.

Definition 10.1 (SOM neuron). Similar to the neurons in an RBF network, a SOM neuron k does not occupy a fixed position ck (a center) in the input space.

Definition 10.2 (Self-organizing map). A self-organizing map is a set K of SOM neurons. If an input vector is entered, exactly that neuron k ∈ K is activated which is closest to the input pattern in the input space. The dimension of the input space is referred to as N.

Definition 10.3 (Topology). The neurons are interconnected by neighborhood relationships. These neighborhood relationships are called topology. The training of a SOM is highly influenced by the topology. It is defined by the topology function h(i, k, t), where i is the winner neuron (we will learn soon what a winner neuron is), k the neuron to be adapted (which will be discussed later) and t the timestep. The dimension of the topology is referred to as G.

10.2 SOMs always activate the neuron with the least distance to an input pattern

Like many other neural networks, the SOM has to be trained before it can be used. But let us regard the very simple functionality of a complete self-organizing map before training, since there are many analogies to the training. Functionality consists of the following steps:

Input of an arbitrary value p of the input space RN.

Calculation of the distance between every neuron k and p by means of a norm, i.e. calculation of ||p − ck||.

One neuron becomes active, namely such neuron i with the shortest calculated distance to the input. All other neurons remain inactive. This paradigm of activity is also called winner-takes-all scheme. The output we expect due to the input of a SOM shows which neuron becomes active.

In many literature citations, the description of SOMs is more formal: Often an input layer is described that is completely linked towards a SOM layer. Then the input layer (N neurons) forwards all inputs to the SOM layer. The SOM layer is laterally linked in itself so that a winner neuron can be established and inhibit the other neurons. I think that this explanation of a SOM is not very descriptive and therefore I tried to provide a clearer description of the network structure.

Now the question is which neuron is activated by which input – and the answer is given by the network itself during training.

10.3 Training

Training makes the SOM topology cover the input space. The training of a SOM is nearly as straightforward as the functionality described above. Basically, it is structured into five steps, which partially correspond to those of the functionality.

Initialization: The network starts with random neuron centers ck ∈ RN from the input space.

Creating an input pattern: A stimulus, i.e. a point p, is selected from the input space RN. Now this stimulus is entered into the network.

Distance measurement: Then the distance ||p − ck|| is determined for every neuron k in the network.

Winner takes all: The winner neuron i is determined, which has the smallest distance to p, i.e. which fulfills the condition ||p − ci|| ≤ ||p − ck|| ∀ k ≠ i. You can see that from several winner neurons one can be selected at will.

Adapting the centers: The neuron centers are moved within the input space according to the rule²

∆ck = η(t) · h(i, k, t) · (p − ck),

where the values ∆ck are simply added to the existing centers. The last factor shows that the change in position of the neuron k is proportional to the distance to the input pattern p and, as usual, to a time-dependent learning rate η(t). The above-mentioned network topology exerts its influence by means of the function h(i, k, t), which will be discussed in the following.

² Note: In many sources this rule is written ηh(p − ck), which wrongly leads the reader to believe that h is a constant. This problem can easily be solved by not omitting the multiplication dots ·.

Definition 10.4 (SOM learning rule). A SOM is trained by presenting an input pattern and determining the associated winner neuron. The winner neuron and its neighbor neurons, which are defined by the topology function, then adapt their centers according to the rule

\Delta c_k = \eta(t) \cdot h(i, k, t) \cdot (p - c_k), \qquad (10.1)
c_k(t+1) = c_k(t) + \Delta c_k(t). \qquad (10.2)
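A minimal sketch of one training step according to Definition 10.4. The topology function is passed in as a parameter; the placeholder h used here is the simple direct-neighbor function of the example in section 10.4, and the grid, the learning rate and the input are made up for illustration.

import numpy as np

def som_step(centers, p, eta, h, t):
    """One SOM training step: find the winner, then move every neuron k by
    Delta c_k = eta(t) * h(i, k, t) * (p - c_k)."""
    i = int(np.argmin(np.linalg.norm(centers - p, axis=1)))   # winner neuron
    for k in range(len(centers)):
        centers[k] += eta * h(i, k, t) * (p - centers[k])     # adapt the centers
    return centers, i

# A placeholder topology function on a one-dimensional grid: only the winner and
# its direct neighbors learn (any unimodal function would do here).
def h_simple(i, k, t):
    return 1.0 if abs(i - k) <= 1 else 0.0

rng = np.random.default_rng(0)
centers = rng.uniform(0, 1, size=(7, 2))        # 7 neurons, two-dimensional input space
centers, winner = som_step(centers, np.array([0.5, 0.5]), eta=0.5, h=h_simple, t=0)
print(winner)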

10.3.1 The topology function defines how a learning neuron influences its neighbors

The topology function h is not defined on the input space but on the grid and represents the neighborhood relationships between the neurons, i.e. the topology of the network. It can be time-dependent (which it often is) – which explains the parameter t. The parameter k is the index running through all neurons, and the parameter i is the index of the winner neuron.

In principle, the function shall take a large value if k is a neighbor of the winner neuron or even the winner neuron itself, and small values if not. A more precise definition: The topology function must be unimodal, i.e. it must have exactly one maximum. This maximum must be next to the winner neuron i, for which the distance to itself certainly is 0.

Additionally, the time-dependence enables us, for example, to reduce the neighborhood in the course of time.


In order to be able to output large values for the neighbors of i and small values for non-neighbors, the function h needs some kind of distance notion on the grid, because from somewhere it has to know how far i and k are apart from each other on the grid. There are different methods to calculate this distance.

On a two-dimensional grid we could apply, for instance, the Euclidean distance (lower part of fig. 10.2) or on a one-dimensional grid we could simply use the number of the connections between the neurons i and k (upper part of the same figure).

Definition 10.5 (Topology function). The topology function h(i, k, t) describes the neighborhood relationships in the topology. It can be any unimodal function that reaches its maximum when i = k holds. Time-dependence is optional, but often used.

10.3.1.1 Introduction of common distance and topology functions

A common distance function would be, for example, the already known Gaussian bell (see fig. 10.3 on page 153). It is unimodal with a maximum close to 0. Additionally, its width can be changed by applying its parameter σ, which can be used to realize the neighborhood being reduced in the course of time: We simply relate the time-dependence to σ and the result is a monotonically decreasing σ(t). Then our topology function could look like this:

Figure 10.2: Example distances of a one-dimensional SOM topology (above) and a two-dimensional SOM topology (below) between two neurons i and k. In the lower case the Euclidean distance is determined (in two-dimensional space equivalent to the Pythagorean theorem). In the upper case we simply count the discrete path length between i and k. To simplify matters I required a fixed grid edge length of 1 in both cases.

h(i, k, t) = e^{-\frac{\|g_i - g_k\|^2}{2 \cdot \sigma(t)^2}} \qquad (10.3)

where gi and gk represent the neuron positions on the grid, not the neuron positions in the input space, which would be referred to as ci and ck.
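A small sketch of the topology function (10.3) with a monotonically decreasing σ(t); the grid positions are treated as vectors, and the concrete decay schedule and numbers are only examples, not prescriptions from the text.

import numpy as np

def sigma(t, sigma_start=3.0, decay=0.99):
    """A monotonically decreasing neighborhood width (the schedule is only an example)."""
    return sigma_start * decay ** t

def h_gauss(g_i, g_k, t):
    """Topology function of eq. (10.3): a Gaussian bell over the *grid* distance
    between the winner position g_i and the position g_k of the neuron to adapt."""
    d = np.linalg.norm(np.asarray(g_i) - np.asarray(g_k))
    return np.exp(-d**2 / (2 * sigma(t)**2))

# The same grid distance yields a smaller value later in training (shrinking neighborhood).
print(h_gauss([0, 0], [1, 1], t=0), h_gauss([0, 0], [1, 1], t=100))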

Other functions that can be used instead of the Gaussian function are, for instance, the cone function, the cylinder function or the Mexican hat function (fig. 10.3 on the facing page). Here, the Mexican hat function offers a particular biological motivation: Due to its negative values it rejects some neurons close to the winner neuron, a behavior that has already been observed in nature. This can cause sharply separated map areas – and that is exactly why the Mexican hat function has been suggested by Teuvo Kohonen himself. But this adjustment characteristic is not necessary for the functionality of the map; it could even be possible that the map would diverge, i.e. it could virtually explode.

10.3.2 Learning rates and neighborhoods can decrease monotonically over time

To avoid that the later training phases forcefully pull the entire map towards a new pattern, SOMs often work with temporally monotonically decreasing learning rates and neighborhood sizes. At first, let us talk about the learning rate: Typical sizes of the target value of a learning rate are two orders of magnitude smaller than the initial value, e.g.

0.01 < η < 0.6

could be true. But this size must also depend on the network topology or the size of the neighborhood.

As we have already seen, a decreasingneighborhood size can be realized, for ex-ample, by means of a time-dependent,monotonically decreasing σ with theGaussin bell being used in the topologyfunction.

The advantage of a decreasing neighborhood size is that in the beginning a moving neuron "pulls along" many neurons in its vicinity, i.e. the randomly initialized network can unfold fast and properly in the beginning. At the end of the learning process, only a few neurons are influenced at the same time, which stiffens the network as a whole but enables a good "fine tuning" of the individual neurons.

It must be noted that

h · η ≤ 1

must always be true, since otherwise the neurons would constantly miss the current training sample.

But enough of theory – let us take a look at a SOM in action!



Figure 10.3: Gaussian bell, cone function, cylinder function and the Mexican hat function suggested by Kohonen as examples for topology functions of a SOM.


Figure 10.4: Illustration of the two-dimensional input space (left) and the one-dimensional topology space (right) of a self-organizing map. Neuron 3 is the winner neuron since it is closest to p. In the topology, the neurons 2 and 4 are the neighbors of 3. The arrows mark the movement of the winner neuron and its neighbors towards the training sample p.

To illustrate the one-dimensional topology of the network, it is plotted into the input space by the dotted line. The arrows mark the movement of the winner neuron and its neighbors towards the pattern.


10.4 Examples for the functionality of SOMs

Let us begin with a simple example that we can comprehend mentally.

In this example, we use a two-dimensional input space, i.e. N = 2 holds. Let the grid structure be one-dimensional (G = 1). Furthermore, our example SOM should consist of 7 neurons and the learning rate should be η = 0.5.

The neighborhood function is also kept simple so that we will be able to mentally comprehend the network:

h(i, k, t) = { 1   if k is a direct neighbor of i,
               1   if k = i,
               0   otherwise.                      (10.4)

Now let us take a look at the above-mentioned network with random initialization of the centers (fig. 10.4 on the preceding page) and enter a training sample p. Obviously, in our example the input pattern is closest to neuron 3, i.e. this is the winning neuron.

We remember the learning rule for SOMs

∆ck = η(t) · h(i, k, t) · (p − ck)

and process the three factors from the back:

Learning direction: Remember that the neuron centers ck are vectors in the input space, as is the pattern p. Thus, the factor (p − ck) indicates the vector from the neuron k to the pattern p. This is now multiplied by different scalars:

Our topology function h indicates that only the winner neuron and its two closest neighbors (here: 2 and 4) are allowed to learn, by returning 0 for all other neurons. A time-dependence is not specified. Thus, our vector (p − ck) is multiplied by either 1 or 0.

The learning rate indicates, as always, the strength of learning. As already mentioned, η = 0.5, i.e. all in all, the result is that the winner neuron and its neighbors (here: 2, 3 and 4) move halfway towards the pattern p (in the figure marked by arrows).
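The following Java sketch runs exactly one such training step for the example above. Only η = 0.5, the seven neurons on a one-dimensional grid and the simple neighborhood function (10.4) come from the text; the concrete center coordinates are assumed values for illustration.

```java
// One SOM training step for the example: 7 neurons on a one-dimensional grid,
// two-dimensional input space, eta = 0.5, neighborhood function (10.4).
// The concrete center coordinates are assumed values for illustration.
public class SomStep {

    public static void main(String[] args) {
        double[][] c = {                       // centers c_1 ... c_7 (assumed)
            {0.2, 0.8}, {0.7, 0.6}, {0.4, 0.4}, {0.1, 0.3},
            {0.8, 0.2}, {0.6, 0.9}, {0.45, 0.35}
        };
        double[] p = {0.5, 0.3};               // training sample
        double eta = 0.5;

        // identify the winner neuron i (smallest distance to p)
        int winner = 0;
        double best = Double.MAX_VALUE;
        for (int k = 0; k < c.length; k++) {
            double d = Math.hypot(c[k][0] - p[0], c[k][1] - p[1]);
            if (d < best) { best = d; winner = k; }
        }

        // learning rule: delta c_k = eta * h(i,k) * (p - c_k),
        // with h = 1 for the winner and its direct grid neighbors, 0 otherwise
        for (int k = 0; k < c.length; k++) {
            double h = (Math.abs(k - winner) <= 1) ? 1.0 : 0.0;
            c[k][0] += eta * h * (p[0] - c[k][0]);
            c[k][1] += eta * h * (p[1] - c[k][1]);
            System.out.printf("neuron %d: (%.3f, %.3f)%n", k + 1, c[k][0], c[k][1]);
        }
        System.out.println("winner neuron: " + (winner + 1));
    }
}
```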

Although the center of neuron 7 – seen from the input space – is considerably closer to the input pattern p than neuron 2, neuron 2 is learning and neuron 7 is not. I want to remind you that the network topology specifies which neuron is allowed to learn, not its position in the input space. This is exactly the mechanism by which a topology can significantly cover an input space without having to be related to it in any way.

After the adaptation of the neurons 2, 3 and 4 the next pattern is applied, and so on. Another example of how such a one-dimensional SOM can develop in a two-dimensional input space with uniformly distributed input patterns in the course of time can be seen in figure 10.5 on the facing page.

End states of one- and two-dimensional SOMs with differently shaped input spaces can be seen in figure 10.6 on page 158. As we can see, not every input space can be neatly covered by every network topology. There are so-called exposed neurons – neurons which are located in an area where no input pattern has ever occurred. A one-dimensional topology generally produces fewer exposed neurons than a two-dimensional one: for instance, during training on circularly arranged input patterns it is nearly impossible with a two-dimensional square topology to avoid exposed neurons in the center of the circle. These are pulled in every direction during the training so that they finally remain in the center. But this does not make the one-dimensional topology an optimal topology, since it can only find less complex neighborhood relationships than a multi-dimensional one.

10.4.1 Topological defects are failures in SOM unfolding

During the unfolding of a SOM it could happen that a topological defect (fig. 10.7) occurs, i.e. the SOM does not unfold correctly. A topological defect can best be described by means of the word "knotting".

A remedy for topological defects could be to increase the initial values for the neighborhood size, because the more complex the topology is (or the more neighbors each neuron has, respectively, since a three-dimensional or a honeycombed two-dimensional topology could also be generated) the more difficult it is for a randomly initialized map to unfold.

Figure 10.7: A topological defect in a two-dimensional SOM.

10.5 It is possible to adjust the resolution of certain areas in a SOM

Figure 10.5: Behavior of a SOM with one-dimensional topology (G = 1) after the input of 0, 100, 300, 500, 5000, 50000, 70000 and 80000 randomly distributed input patterns p ∈ R2. During the training, η decreased from 1.0 to 0.1 and the σ parameter of the Gaussian function decreased from 10.0 to 0.2.

Figure 10.6: End states of one-dimensional (left column) and two-dimensional (right column) SOMs on different input spaces. 200 neurons were used for the one-dimensional topology, 10 × 10 neurons for the two-dimensional topology and 80,000 input patterns for all maps.

We have seen that a SOM is trained by entering input patterns of the input space RN one after another, again and again, so that the SOM will be aligned with these patterns and map them. It could happen that we want a certain subset U of the input space to be mapped more precisely than the rest.

This problem can easily be solved by means of SOMs: during the training, disproportionally many input patterns of the area U are presented to the SOM. If the number of training patterns of U ⊂ RN presented to the SOM exceeds the number of those patterns of the remaining RN \ U, then more neurons will group there while the remaining neurons are sparsely distributed on RN \ U (fig. 10.8 on the next page).

As you can see in the illustration, the edge of the SOM could be deformed. This can be compensated by assigning to the edge of the input space a slightly higher probability of being hit by training patterns (an often applied approach for reaching every corner with the SOMs).

Also, a higher learning rate is often used for edge and corner neurons, since they are only pulled into the center by the topology. This also results in a significantly improved corner coverage.

10.6 Application of SOMs

Regarding the biologically inspired associative data storage, there are many fields of application for self-organizing maps and their variations.

For example, the different phonemes of the Finnish language have successfully been mapped onto a SOM with a two-dimensional discrete grid topology, and thereby neighborhoods have been found (a SOM does nothing else than finding neighborhood relationships). So one tries once more to break down a high-dimensional space into a low-dimensional space (the topology), looks whether some structures have developed – et voilà: clearly defined areas for the individual phenomena are formed.

Teuvo Kohonen himself made the effort to search many papers mentioning his SOMs in their keywords. In this large input space the individual papers now occupy individual positions, depending on the occurrence of keywords. Then Kohonen created a SOM with G = 2 and used it to map the high-dimensional "paper space" developed by him.

Thus, it is possible to enter any paper into the completely trained SOM and look which neuron in the SOM is activated. It is likely that the neighboring papers in the topology will be interesting, too. This type of brain-like context-based search also works with many other input spaces.

It is to be noted that the system itself defines what is neighboring, i.e. similar, within the topology – and that is why it is so interesting.

Figure 10.8: Training of a SOM with G = 2 on a two-dimensional input space. On the left side, the chance to become a training pattern was equal for each coordinate of the input space. On the right side, for the central circle in the input space, this chance is more than ten times larger than for the remaining input space (visible in the larger pattern density in the background). In this circle the neurons are obviously more crowded and the remaining area is covered less densely, but in both cases the neurons are still evenly distributed. The two SOMs were trained by means of 80,000 training samples and decreasing η (1 → 0.2) as well as decreasing σ (5 → 0.5).

This example shows that the position c of the neurons in the input space is not significant. It is rather interesting to see which neuron is activated when an unknown input pattern is entered. Next, we can look at which of the previous inputs this neuron was also activated – and will immediately discover a group of very similar inputs. The more the inputs within the topology diverge, the less they have in common. Virtually, the topology generates a map of the input characteristics – reduced to descriptively few dimensions in relation to the input dimension.

Therefore, the topology of a SOM is often two-dimensional so that it can be easily visualized, while the input space can be very high-dimensional.

10.6.1 SOMs can be used to determine centers for RBF neurons

SOMs arrange themselves exactly towards the positions of the incoming inputs. As a result they are used, for example, to select the centers of an RBF network. We have already been introduced to the paradigm of the RBF network in chapter 6.

As we have already seen, it is possible to control which areas of the input space should be covered with higher resolution – or, in connection with RBF networks, on which areas of our function the RBF network should work with more neurons, i.e. work more exactly. As a further useful feature of the combination of RBF networks with SOMs, one can use the topology obtained through the SOM: during the final training of an RBF neuron it can be used to influence neighboring RBF neurons in different ways.

For this, many neural network simulators offer an additional so-called SOM layer in connection with the simulation of RBF networks.

10.7 Variations of SOMs

There are different variations of SOMs for different variations of representation tasks:

10.7.1 A neural gas is a SOM without a static topology

The neural gas is a variation of the self-organizing maps by Thomas Martinetz [MBS93], which has been developed from the difficulty of mapping complex input information that partially only occurs in subspaces of the input space or even changes between subspaces (fig. 10.9 on the following page).

The idea of a neural gas is, roughly speaking, to realize a SOM without a grid structure. Due to the fact that they are derived from the SOMs, the learning steps are very similar to the SOM learning steps, but they include an additional intermediate step:

. again, random initialization of ck ∈ Rn

. selection and presentation of a pattern of the input space p ∈ Rn

Figure 10.9: A figure that fills different subspaces of the actual input space at different positions can therefore hardly be filled by a SOM.

. neuron distance measurement

. identification of the winner neuron i

. Intermediate step: generation of a list L of neurons sorted in ascending order by their distance to the winner neuron. Thus, the first neuron in the list L is the neuron that is closest to the winner neuron.

. changing the centers by means of the known rule but with the slightly modified topology function

hL(i, k, t).

The function hL(i, k, t), which is slightly modified compared with the original function h(i, k, t), now regards the first elements of the list as the neighborhood of the winner neuron i. The direct result is that – similar to the free-floating molecules in a gas – the neighborhood relationships between the neurons can change anytime, and the number of neighbors is almost arbitrary, too. The distance within the neighborhood is now represented by the distance within the input space.

The bulk of neurons can become as stiffened as a SOM by means of a constantly decreasing neighborhood size. It does not have a fixed dimension but it can take the dimension that is locally needed at the moment, which can be very advantageous.

A disadvantage could be that there is no fixed grid forcing the input space to become regularly covered, and therefore holes can occur in the cover or neurons can be isolated.


In spite of all practical hints, it is as always the user's responsibility not to understand this text as a catalog of easy answers but to explore all advantages and disadvantages himself.

Unlike a SOM, the neighborhood of a neural gas must initially refer to all neurons, since otherwise some outliers of the random initialization may never reach the remaining group. To forget this is a popular error during the implementation of a neural gas.

With a neural gas it is possible to learn a kind of complex input such as in fig. 10.9 on the preceding page, since we are not bound to a fixed-dimensional grid. But some computational effort could be necessary for the permanent sorting of the list (here, it could be effective to store the list in an ordered data structure right from the start).
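A minimal Java sketch of this intermediate step follows. The text only prescribes that hL acts on the list position rather than on a grid distance; the concrete rank weight exp(−rank/λ), the parameter λ and the toy centers are my own assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of one neural gas learning step: the neurons are sorted by their
// distance to the winner neuron and the neighborhood acts on the list rank.
// The rank weight exp(-rank/lambda), lambda and the centers are assumptions.
public class NeuralGasStep {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] c = { {0.1, 0.9}, {0.8, 0.2}, {0.5, 0.5}, {0.9, 0.9} }; // assumed centers
        double[] p = {0.6, 0.4};        // presented pattern
        double eta = 0.3, lambda = 1.0; // assumed learning rate and neighborhood range

        // identification of the winner neuron (closest to the pattern p)
        int winner = 0;
        for (int k = 1; k < c.length; k++)
            if (dist(c[k], p) < dist(c[winner], p)) winner = k;
        final int w = winner;

        // intermediate step: list L, sorted ascending by distance to the winner neuron
        Integer[] L = {0, 1, 2, 3};
        Arrays.sort(L, Comparator.comparingDouble(k -> dist(c[k], c[w])));

        // adapt the centers with the rank-based neighborhood h_L
        for (int rank = 0; rank < L.length; rank++) {
            int k = L[rank];
            double hL = Math.exp(-rank / lambda);
            for (int d = 0; d < c[k].length; d++)
                c[k][d] += eta * hL * (p[d] - c[k][d]);
            System.out.printf("rank %d -> neuron %d, h_L = %.3f%n", rank, k, hL);
        }
    }
}
```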

Definition 10.6 (Neural gas). A neural gas differs from a SOM by a completely dynamic neighborhood function. With every learning cycle it is decided anew which neurons are the neighborhood neurons of the winner neuron. Generally, the criterion for this decision is the distance between the neurons and the winner neuron in the input space.

10.7.2 A multi-SOM consists of several separate SOMs

In order to present another variant of the SOMs, I want to formulate an extended problem: What do we do with input patterns from which we know that they are confined to different (maybe disjoint) areas?

Here, the idea is to use not only one SOM but several ones: a multi-self-organizing map, shortly referred to as M-SOM [GKE01b, GKE01a, GS06]. It is unnecessary that the SOMs have the same topology or size; an M-SOM is just a combination of M SOMs.

The learning process is analogous to that of the SOMs. However, only the neurons belonging to the winner SOM of each training step are adapted. Thus, it is easy to represent two disjoint clusters of data by means of two SOMs, even if one of the clusters is not represented in every dimension of the input space RN. Actually, the individual SOMs exactly reflect these clusters.

Definition 10.7 (Multi-SOM). A multi-SOM is nothing more than the simultaneous use of M SOMs.

10.7.3 A multi-neural gas consists of several separate neural gases

Analogous to the multi-SOM, we also have a set of M neural gases: a multi-neural gas [GS06, SG06]. This construct behaves analogously to neural gas and M-SOM: again, only the neurons of the winner gas are adapted.

The reader certainly wonders what advantage there is in using a multi-neural gas, since an individual neural gas is already capable of dividing into clusters and working on complex input patterns with changing dimensions. Basically, this is correct, but a multi-neural gas has two serious advantages over a simple neural gas.

1. With several gases, we can directly tell which neuron belongs to which gas. This is particularly important for clustering tasks, for which multi-neural gases have been used recently. Simple neural gases can also find and cover clusters, but now we cannot recognize which neuron belongs to which cluster.

2. A lot of computational effort is saved when large original gases are divided into several smaller ones since (as already mentioned) the sorting of the list L could use a lot of computational effort, while the sorting of several smaller lists L1, L2, . . . , LM is less time-consuming – even if these lists in total contain the same number of neurons.

As a result we will only obtain local instead of global sortings, but in most cases these local sortings are sufficient.

Now we can choose between two extreme cases of multi-neural gases: one extreme case is the ordinary neural gas M = 1, i.e. we only use one single neural gas. Interestingly enough, the other extreme case (very large M, a few or only one neuron per gas) behaves analogously to k-means clustering (for more information on clustering procedures see excursus A).

Definition 10.8 (Multi-neural gas). A multi-neural gas is nothing more than the simultaneous use of M neural gases.

10.7.4 Growing neural gases can add neurons to themselves

A growing neural gas is a variation of the aforementioned neural gas to which more and more neurons are added according to certain rules. Thus, this is an attempt to work against the isolation of neurons or the generation of larger holes in the cover.

Here, this subject should only be mentioned but not discussed.

To build a growing SOM is more difficult because new neurons have to be integrated into the neighborhood.

Exercises

Exercise 17. A regular, two-dimensional grid shall cover a two-dimensional surface as "well" as possible.

1. Which grid structure would suit best for this purpose?

2. Which criteria did you use for "well" and "best"?

The very imprecise formulation of this exercise is intentional.


Chapter 11

Adaptive resonance theory

An ART network in its original form shall classify binary input vectors, i.e. assign them to a 1-out-of-n output. Simultaneously, the so far unclassified patterns shall be recognized and assigned to a new class.

As in the other smaller chapters, we want to try to figure out the basic idea of the adaptive resonance theory (abbreviated: ART) without discussing its theory profoundly.

In several sections we have already mentioned that it is difficult to use neural networks to learn new information in addition to, but without destroying, the already existing information. This circumstance is called the stability/plasticity dilemma.

In 1987, Stephen Grossberg and Gail Carpenter published the first version of their ART network [Gro76] in order to alleviate this problem. This was followed by a whole family of ART improvements (which we want to discuss briefly, too).

The idea is that of unsupervised learning, whose aim is the (initially binary) pattern recognition, or more precisely the categorization of patterns into classes. Additionally, an ART network shall be capable of finding new classes.

11.1 Task and structure of an ART network

An ART network comprises exactly two layers: the input layer I and the recognition layer O, with the input layer being completely linked towards the recognition layer. This complete link induces a top-down weight matrix W that contains the weight values of the connections between each neuron in the input layer and each neuron in the recognition layer (fig. 11.1 on the following page).

Simple binary patterns are entered into the input layer and transferred to the recognition layer, while the recognition layer shall return a 1-out-of-|O| encoding, i.e. it should follow the winner-takes-all scheme. For instance, to realize this 1-out-of-|O| encoding the principle of lateral inhibition can be used – or, in the implementation, the most activated neuron can simply be searched for. For practical reasons an IF query would suit this task best.

Figure 11.1: Simplified illustration of the ART network structure. Top: the input layer, bottom: the recognition layer. In this illustration the lateral inhibition of the recognition layer and the control neurons are omitted.

11.1.1 Resonance takes place by activities being tossed and turned

But there also exists a bottom-up weight matrix V, which propagates the activities within the recognition layer back into the input layer. Now it is obvious that these activities are bounced back and forth again and again, a fact that leads us to resonance. Every activity within the input layer causes an activity within the recognition layer, while in turn every activity in the recognition layer causes an activity within the input layer.

In addition to the two mentioned layers, an ART network also contains a few neurons that exercise control functions such as signal enhancement. But we do not want to discuss this theory further, since here only the basic principle of the ART network should become explicit. I have only mentioned it to explain that, in spite of the recurrences, the ART network will achieve a stable state after an input.


11.2 The learning process of an ART network is divided into top-down and bottom-up learning

The trick of adaptive resonance theory is not only the configuration of the ART network but also the two-piece learning procedure of the theory: on the one hand we train the top-down matrix W, on the other hand we train the bottom-up matrix V (fig. 11.2 on the next page).

11.2.1 Pattern input and top-down learning

When a pattern is entered into the network it causes – as already mentioned – an activation at the output neurons, and the strongest neuron wins. Then the weights of the matrix W going towards the output neuron are changed such that the output of the strongest neuron Ω is further enhanced, i.e. the class affiliation of the input vector to the class of the output neuron Ω becomes enhanced.

11.2.2 Resonance and bottom-up learning

The training of the backward weights of the matrix V is a bit tricky: only the weights of the respective winner neuron are trained towards the input layer, and our current input pattern is used as teaching input. Thus, the network is trained to enhance input vectors.

11.2.3 Adding an output neuron

Of course, it could happen that the neurons are nearly equally activated or that several neurons are activated, i.e. that the network is indecisive. In this case, the mechanisms of the control neurons activate a signal that adds a new output neuron. Then the current pattern is assigned to this output neuron and the weight sets of the new neuron are trained as usual.

Thus, the advantage of this system is not only to divide inputs into classes and to find new classes; it can also tell us after the activation of an output neuron what a typical representative of a class looks like – which is a significant feature.

Often, however, the system can only moderately distinguish the patterns. The question is when a new neuron is permitted to become active and when it should learn. In an ART network there are different additional control neurons which answer this question according to different mathematical rules and which are responsible for intercepting special cases.

At the same time, one of the largest objections to an ART is the fact that an ART network uses a special distinction of cases, similar to an IF query, that has been forced into the mechanism of a neural network.

11.3 Extensions

As already mentioned above, the ART networks have often been extended.



Figure 11.2: Simplified illustration of the two-piece training of an ART network: the trained weights are represented by solid lines. Let us assume that a pattern has been entered into the network and that the numbers mark the outputs. Top: We can see that Ω2 is the winner neuron. Middle: So the weights are trained towards the winner neuron and (bottom) the weights of the winner neuron are trained towards the input layer.

ART-2 [CG87] is extended to continuous inputs and additionally offers (in an extension called ART-2A) enhancements of the learning speed, which results in additional control neurons and layers.

ART-3 [CG90] improves the learning ability of ART-2 by adapting additional biological processes such as the chemical processes within the synapses1.

Apart from the described ones there exist many other extensions.

1 Because of the frequent extensions of the adaptive resonance theory wagging tongues already call them "ART-n networks".


Part IV

Excursi, appendices and registers


Appendix A

Excursus: Cluster analysis and regional and online learnable fields

In Grimm's dictionary the extinct German word "Kluster" is described by "was dicht und dick zusammensitzet" (a thick and dense group of sth.). In static cluster analysis, the formation of groups within point clouds is explored. Introduction of some procedures, comparison of their advantages and disadvantages. Discussion of an adaptive clustering method based on neural networks. From a point cloud, possibly with a great many points, a regional and online learnable field models a comparatively small set of neurons that are representative for the point cloud.

As already mentioned, many problems can be traced back to problems in cluster analysis. Therefore, it is necessary to research procedures that examine whether groups (so-called clusters) exist within point clouds.

Since cluster analysis procedures need a notion of distance between two points, a metric must be defined on the space where these points are situated.

We briefly want to specify what a metric is.

Definition A.1 (Metric). A relation dist(x1, x2) defined for two objects x1, x2 is referred to as a metric if each of the following criteria applies:

1. dist(x1, x2) = 0 if and only if x1 = x2,

2. dist(x1, x2) = dist(x2, x1), i.e. symmetry,

3. dist(x1, x3) ≤ dist(x1, x2) + dist(x2, x3), i.e. the triangle inequality holds.

Colloquially speaking, a metric is a tool for determining distances between points in any space. Here, the distances have to be symmetrical, and the distance between two points may only be 0 if the two points are equal. Additionally, the triangle inequality must apply.

Metrics are provided by, for example, the squared distance and the Euclidean distance, which have already been introduced. Based on such metrics we can define a clustering procedure that uses a metric as distance measure.

Now we want to introduce and briefly discuss different clustering procedures.

A.1 k-means clustering allocates data to a predefined number of clusters

k-means clustering according to J. MacQueen [Mac67] is an algorithm that is often used because of its low computation and storage complexity and which is regarded as "inexpensive and good". The operation sequence of the k-means clustering algorithm is the following:

1. Provide data to be examined.

2. Define k, which is the number of cluster centers.

3. Select k random vectors for the cluster centers (also referred to as codebook vectors).

4. Assign each data point to the nearest codebook vector.1

5. Compute cluster centers for all clusters.

6. Set codebook vectors to new cluster centers.

7. Continue with 4 until the assignments are no longer changed.

1 The name codebook vector was created because the often used name cluster vector was too unclear.
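The steps above translate almost one-to-one into code. The following Java sketch is a deliberately naive implementation on assumed toy data; to keep it deterministic, the initial codebook vectors are simply spread over the data instead of being drawn at random.

```java
// Straightforward k-means clustering following the steps above.
// The toy data and k = 2 are assumed values for illustration.
public class KMeans {

    public static void main(String[] args) {
        double[][] data = { {0.1, 0.2}, {0.2, 0.1}, {0.15, 0.25},
                            {0.9, 0.8}, {0.85, 0.9}, {0.95, 0.85} };
        int k = 2;

        // step 3: choose initial codebook vectors (here: spread over the data
        // instead of truly random, an assumption to keep the sketch deterministic)
        double[][] codebook = new double[k][];
        for (int j = 0; j < k; j++) codebook[j] = data[j * data.length / k].clone();

        int[] assignment = new int[data.length];
        boolean changed = true;
        while (changed) {                       // step 7: repeat until assignments settle
            changed = false;
            // step 4: assign each data point to the nearest codebook vector
            for (int i = 0; i < data.length; i++) {
                int bestJ = 0; double bestD = Double.MAX_VALUE;
                for (int j = 0; j < k; j++) {
                    double d = Math.hypot(data[i][0] - codebook[j][0], data[i][1] - codebook[j][1]);
                    if (d < bestD) { bestD = d; bestJ = j; }
                }
                if (assignment[i] != bestJ) { assignment[i] = bestJ; changed = true; }
            }
            // steps 5 and 6: recompute cluster centers and move codebook vectors there
            for (int j = 0; j < k; j++) {
                double sx = 0, sy = 0; int n = 0;
                for (int i = 0; i < data.length; i++)
                    if (assignment[i] == j) { sx += data[i][0]; sy += data[i][1]; n++; }
                if (n > 0) { codebook[j][0] = sx / n; codebook[j][1] = sy / n; }
            }
        }
        for (int i = 0; i < data.length; i++)
            System.out.println("point " + i + " -> cluster " + assignment[i]);
    }
}
```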

Step 2 already shows one of the great questions of the k-means algorithm: the number k of cluster centers has to be determined in advance. This cannot be done by the algorithm. The problem is that it is not necessarily known in advance how k can best be determined. Another problem is that the procedure can become quite unstable if the codebook vectors are badly initialized. But since this is random, it is often useful to restart the procedure. This has the advantage of not requiring much computational effort. If you are fully aware of those weaknesses, you will receive quite good results.

However, complex structures such as "clusters in clusters" cannot be recognized. If k is high, the outer ring of the construction in the following illustration will be recognized as many single clusters. If k is low, the ring with the small inner clusters will be recognized as one cluster.

For an illustration see the upper right part of fig. A.1 on page 174.

A.2 k-nearest neighboring looks for the k nearest neighbors of each data point

The k-nearest neighboring procedure [CH67] connects each data point to its k closest neighbors, which often results in a division into groups. Then such a group builds a cluster. The advantage is that the number of clusters occurs all by itself. The disadvantage is that a large storage and computational effort is required to find the nearest neighbors (the distances between all data points must be computed and stored).

There are some special cases in which the procedure combines data points belonging to different clusters if k is too high (see the two small clusters in the upper right of the illustration). Clusters consisting of only one single data point are basically connected to another cluster, which is not always intentional.

Furthermore, it is not mandatory that the links between the points are symmetric.

But this procedure allows a recognition of rings and therefore of "clusters in clusters", which is a clear advantage. Another advantage is that the procedure adaptively responds to the distances in and between the clusters.

For an illustration see the lower left part of fig. A.1.
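A small Java sketch of the idea: connect every point to its k closest neighbors and read the clusters off as connected components of the resulting graph. The toy data and k = 2 are assumptions for illustration; the (possibly asymmetric) links are symmetrized before the component search.

```java
import java.util.ArrayList;
import java.util.List;

// k-nearest neighboring as a clustering sketch: connect each point to its
// k closest neighbors and treat connected components of this graph as clusters.
// Toy data and k = 2 are assumed for illustration.
public class KNearestNeighboring {

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    public static void main(String[] args) {
        double[][] p = { {0.0, 0.0}, {0.1, 0.0}, {0.0, 0.1},
                         {2.0, 2.0}, {2.1, 2.0}, {2.0, 2.1} };
        int k = 2;
        int n = p.length;
        boolean[][] linked = new boolean[n][n];

        // link every point to its k nearest neighbors (links need not be
        // symmetric, so we symmetrize them for the component search)
        for (int i = 0; i < n; i++) {
            List<Integer> order = new ArrayList<>();
            for (int j = 0; j < n; j++) if (j != i) order.add(j);
            final int fi = i;
            order.sort((a, b) -> Double.compare(dist(p[fi], p[a]), dist(p[fi], p[b])));
            for (int m = 0; m < k; m++) {
                int j = order.get(m);
                linked[i][j] = linked[j][i] = true;
            }
        }

        // connected components = clusters (simple depth-first search)
        int[] cluster = new int[n];
        int current = 0;
        for (int i = 0; i < n; i++) {
            if (cluster[i] != 0) continue;
            current++;
            java.util.Deque<Integer> stack = new java.util.ArrayDeque<>();
            stack.push(i);
            while (!stack.isEmpty()) {
                int v = stack.pop();
                if (cluster[v] != 0) continue;
                cluster[v] = current;
                for (int j = 0; j < n; j++) if (linked[v][j] && cluster[j] == 0) stack.push(j);
            }
        }
        for (int i = 0; i < n; i++) System.out.println("point " + i + " -> cluster " + cluster[i]);
    }
}
```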

A.3 ε-nearest neighboring looks for neighbors within the radius ε for each data point

Another approach to neighboring: here, the neighborhood detection does not use a fixed number k of neighbors but a radius ε, which is the reason for the name epsilon-nearest neighboring. Points are neighbors if they are at most ε apart from each other. Here, the storage and computational effort is obviously very high, which is a disadvantage.

But note that there are some special cases: two separate clusters can easily be connected due to the unfavorable situation of a single data point. This can also happen with k-nearest neighboring, but it would be more difficult since in this case the number of neighbors per point is limited.

An advantage is the symmetric nature of the neighborhood relationships. Another advantage is that the combination of minimal clusters due to a fixed number of neighbors is avoided.

On the other hand, it is necessary to skillfully initialize ε in order to be successful, i.e. smaller than half the smallest distance between two clusters. With variable cluster and point distances within clusters this can possibly be a problem.

For an illustration see the lower right part of fig. A.1.

A.4 The silhouette coefficient determines how accurate a given clustering is

Figure A.1: Top left: our set of points. We will use this set to explore the different clustering methods. Top right: k-means clustering. Using this procedure we chose k = 6. As we can see, the procedure is not capable of recognizing "clusters in clusters" (bottom left of the illustration). Long "lines" of points are a problem, too: they would be recognized as many small clusters (if k is sufficiently large). Bottom left: k-nearest neighboring. If k is selected too high (higher than the number of points in the smallest cluster), this will result in the cluster combinations shown in the upper right of the illustration. Bottom right: ε-nearest neighboring. This procedure will cause difficulties if ε is selected larger than the minimum distance between two clusters (see upper left of the illustration), which will then be combined.

As we can see above, there is no easy answer for clustering problems. Each procedure described has very specific disadvantages. In this respect it is useful to have a criterion to decide how good our cluster division is. This possibility is offered by the silhouette coefficient according to [Kau90]. This coefficient measures how well the clusters are delimited from each other and indicates whether points may be assigned to the wrong clusters.

Let P be a point cloud and p a point in P. Let c ⊆ P be a cluster within the point cloud and p be part of this cluster, i.e. p ∈ c. The set of clusters is called C. In summary,

p ∈ c ⊆ P

applies.

To calculate the silhouette coefficient, we initially need the average distance between point p and all its cluster neighbors. This variable is referred to as a(p) and defined as follows:

a(p) = 1/(|c| − 1) · ∑_{q∈c, q≠p} dist(p, q)    (A.1)

Furthermore, let b(p) be the average distance between our point p and all points of the nearest other cluster (g runs over all clusters except for c):

b(p) = min_{g∈C, g≠c} 1/|g| · ∑_{q∈g} dist(p, q)    (A.2)

The point p is classified well if the distance to the center of its own cluster is minimal and the distance to the centers of the other clusters is maximal. In this case, the following term provides a value close to 1:

s(p) = (b(p) − a(p)) / max{a(p), b(p)}    (A.3)

Apparently, the whole term s(p) can only be within the interval [−1; 1]. A value close to −1 indicates a bad classification of p.

The silhouette coefficient S(P) results from the average of all values s(p):

S(P) = 1/|P| · ∑_{p∈P} s(p).    (A.4)

As above, the total quality of the cluster division is expressed by the interval [−1; 1].
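Equations (A.1) to (A.4) can be computed directly. The following Java sketch does exactly that for an assumed toy clustering, using the Euclidean distance as the metric; points, cluster assignment and class name are invented for illustration.

```java
// Direct computation of the silhouette coefficient (A.1)-(A.4).
// The toy points, their cluster assignment and the Euclidean metric are assumptions.
public class Silhouette {

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    public static void main(String[] args) {
        double[][][] clusters = {                       // set C of clusters
            { {0.0, 0.0}, {0.1, 0.1}, {0.0, 0.2} },     // cluster c1
            { {1.0, 1.0}, {1.1, 0.9} }                  // cluster c2
        };

        double total = 0; int count = 0;
        for (int ci = 0; ci < clusters.length; ci++) {
            for (double[] p : clusters[ci]) {
                // a(p): average distance to all other points of the own cluster (A.1)
                double a = 0;
                for (double[] q : clusters[ci]) if (q != p) a += dist(p, q);
                a /= (clusters[ci].length - 1);

                // b(p): minimum over all other clusters of the average distance (A.2)
                double b = Double.MAX_VALUE;
                for (int g = 0; g < clusters.length; g++) {
                    if (g == ci) continue;
                    double avg = 0;
                    for (double[] q : clusters[g]) avg += dist(p, q);
                    b = Math.min(b, avg / clusters[g].length);
                }

                double s = (b - a) / Math.max(a, b);    // s(p), equation (A.3)
                total += s; count++;
            }
        }
        System.out.printf("S(P) = %.3f%n", total / count); // equation (A.4)
    }
}
```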

Now that different clustering strategies with different characteristics have been presented (lots of further material is presented in [DHS01]), as well as a measure to indicate the quality of an existing arrangement of given data into clusters, I want to introduce a clustering method based on an unsupervised learning neural network [SGE05] which was published in 2005. Like all the other methods this one may not be perfect, but it eliminates large standard weaknesses of the known clustering methods.

A.5 Regional and online learnable fields are a neural clustering strategy

The paradigm of neural networks which I want to introduce now is that of the regional and online learnable fields, shortly referred to as ROLFs.


A.5.1 ROLFs try to cover data with neurons

Roughly speaking, the regional and online learnable fields are a set K of neurons which try to cover a set of points as well as possible by means of their distribution in the input space. For this, neurons are added, moved or changed in their size during training if necessary. The parameters of the individual neurons will be discussed later.

Definition A.2 (Regional and online learnable field). A regional and online learnable field (abbreviated ROLF or ROLF network) is a set K of neurons that are trained to cover a certain set in the input space as well as possible.

A.5.1.1 ROLF neurons feature a position and a radius in the input space

Here, a ROLF neuron k ∈ K has two parameters: similar to the RBF networks, it has a center ck, i.e. a position in the input space.

But it has yet another parameter: the radius σ, which defines the radius of the perceptive surface surrounding the neuron2.

ck and σk are locally defined for each neuron. This particularly means that the neurons are capable of covering surfaces of different sizes.

2 I write "defines" and not "is" because the actual radius is specified by σ · ρ.

Figure A.2: Structure of a ROLF neuron.

The radius of the perceptive surface is specified by r = ρ · σ (fig. A.2), with the multiplier ρ being globally defined and previously specified for all neurons. Intuitively, the reader will wonder what this multiplier is used for. Its significance will be discussed later. Furthermore, the following has to be observed: it is not necessary for the perceptive surfaces of the different neurons to be of the same size.

Definition A.3 (ROLF neuron). The parameters of a ROLF neuron k are a center ck and a radius σk.

Definition A.4 (Perceptive surface). The perceptive surface of a ROLF neuron k consists of all points within the radius ρ · σ in the input space.

A.5.2 A ROLF learns unsupervised by presenting training samples online

Like many other paradigms of neural networks, our ROLF network learns by receiving many training samples p of a training set P. The learning is unsupervised. For each training sample p entered into the network two cases can occur:

1. There is one accepting neuron k for p, or

2. there is no accepting neuron at all.

If in the first case several neurons are suitable, then there will be exactly one accepting neuron insofar as the closest neuron is the accepting one. For the accepting neuron k, ck and σk are adapted.

Definition A.5 (Accepting neuron). The criterion for a ROLF neuron k to be an accepting neuron of a point p is that the point p must be located within the perceptive surface of k. If p is located in the perceptive surfaces of several neurons, then the closest neuron will be the accepting one. If there are several closest neurons, one can be chosen randomly.

A.5.2.1 Both positions and radii are adapted throughout learning

Let us assume that we entered a training sample p into the network and that there is an accepting neuron k. Then the radius moves towards ||p − ck|| (i.e. towards the distance between p and ck) and the center ck towards p. Additionally, let us define the two learning rates ησ and ηc for radii and centers.

ck(t + 1) = ck(t) + ηc(p − ck(t))
σk(t + 1) = σk(t) + ησ(||p − ck(t)|| − σk(t))

Note that here σk is a scalar while ck is a vector in the input space.

Definition A.6 (Adapting a ROLF neuron). A neuron k accepted by a point p is adapted according to the following rules:

ck(t + 1) = ck(t) + ηc(p − ck(t))    (A.5)
σk(t + 1) = σk(t) + ησ(||p − ck(t)|| − σk(t))    (A.6)
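A minimal Java sketch of the accepting-neuron test and the adaptation rules (A.5) and (A.6) follows; the concrete parameter values and the class name are assumptions for illustration.

```java
// Sketch of a single ROLF adaptation step: check whether an existing neuron
// accepts the pattern p (p lies within its perceptive surface of radius rho*sigma)
// and, if so, adapt center and radius according to (A.5) and (A.6).
// The parameter values eta_c, eta_sigma and rho are assumed for illustration.
public class RolfAdaptation {

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    public static void main(String[] args) {
        double[] c = {0.1, 0.1};   // center c_k of the neuron
        double sigma = 1.0;        // radius parameter sigma_k
        double rho = 3.0;          // radius multiplier
        double etaC = 0.5, etaSigma = 0.1;

        double[] p = {0.9, 0.1};   // training sample

        double d = dist(p, c);     // ||p - c_k(t)||, computed before the update
        if (d <= rho * sigma) {    // accepting neuron: p lies in the perceptive surface
            // (A.5): move the center towards p
            c[0] += etaC * (p[0] - c[0]);
            c[1] += etaC * (p[1] - c[1]);
            // (A.6): move the radius towards ||p - c_k(t)||
            sigma += etaSigma * (d - sigma);
            System.out.printf("adapted: c = (%.2f, %.2f), sigma = %.3f%n", c[0], c[1], sigma);
        } else {
            System.out.println("no accepting neuron - a new neuron would be generated");
        }
    }
}
```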

A.5.2.2 The radius multiplier allows neurons not only to shrink

Now we can understand the function of the multiplier ρ: due to this multiplier the perceptive surface of a neuron includes more than only the points surrounding the neuron within the radius σ. This means that, due to the aforementioned learning rule, σ can not only decrease but also increase.

Definition A.7 (Radius multiplier). The radius multiplier ρ > 1 is globally defined and expands the perceptive surface of a neuron k to a multiple of σk. So it is ensured that the radius σk can not only decrease but also increase.


Generally, the radius multiplier is set to values in the lower one-digit range, such as 2 or 3.

So far we have only discussed the case in the ROLF training that there is an accepting neuron for the training sample p.

A.5.2.3 As required, new neurons are generated

This suggests discussing the approach for the case that there is no accepting neuron.

In this case a new accepting neuron k is generated for our training sample. The result is of course that ck and σk have to be initialized.

The initialization of ck can be understood intuitively: the center of the new neuron is simply set on the training sample, i.e.

ck = p.

We generate a new neuron because there is no neuron close to p – for logical reasons, we place the neuron exactly on p.

But how to set a σ when a new neuron is generated? For this purpose there exist different options:

Init-σ: We always select a predefined static σ.

Minimum σ: We take a look at the σ of each neuron and select the minimum.

Maximum σ: We take a look at the σ of each neuron and select the maximum.

Mean σ: We select the mean σ of all neurons.

Currently, the mean-σ variant is the favorite one, although the learning procedure also works with the other ones. In the minimum-σ variant the neurons tend to cover less of the surface, in the maximum-σ variant they tend to cover more of the surface.

Definition A.8 (Generating a ROLF neuron). If a new ROLF neuron k is generated by entering a training sample p, then ck is initialized with p and σk according to one of the aforementioned strategies (init-σ, minimum-σ, maximum-σ, mean-σ).

The training is complete when, after repeated randomly permuted pattern presentation, no new neuron has been generated in an epoch and the positions of the neurons barely change.
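The case without an accepting neuron can be sketched just as briefly. The following Java fragment generates a new neuron with ck = p and the mean-σ strategy; the helper class, the existing neurons and the value of ρ are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of generating a new ROLF neuron when no existing neuron accepts the
// pattern p: the new center is set to p, the new sigma follows the mean-sigma
// strategy. Helper class and concrete values are assumptions for illustration.
public class RolfGrow {

    static class Neuron {
        double[] c; double sigma;
        Neuron(double[] c, double sigma) { this.c = c; this.sigma = sigma; }
    }

    public static void main(String[] args) {
        List<Neuron> net = new ArrayList<>();
        net.add(new Neuron(new double[]{0.0, 0.0}, 0.4));
        net.add(new Neuron(new double[]{1.0, 1.0}, 0.6));
        double rho = 2.0;                       // radius multiplier (assumed)

        double[] p = {5.0, 5.0};                // pattern far away from all neurons

        boolean accepted = false;
        for (Neuron k : net)
            if (Math.hypot(p[0] - k.c[0], p[1] - k.c[1]) <= rho * k.sigma) accepted = true;

        if (!accepted) {
            // mean-sigma strategy: initialize sigma with the mean of all existing sigmas
            double meanSigma = 0;
            for (Neuron k : net) meanSigma += k.sigma;
            meanSigma /= net.size();
            net.add(new Neuron(p.clone(), meanSigma));   // c_k = p
            System.out.println("new neuron at (5.0, 5.0) with sigma = " + meanSigma);
        }
    }
}
```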

A.5.3 Evaluating a ROLF

The result of the training algorithm is that the training set is gradually covered well and precisely by the ROLF neurons and that a high concentration of points on a spot of the input space does not automatically generate more neurons. Thus, a possibly very large point cloud is reduced to very few representatives (based on the input set).

Then it is very easy to define the number of clusters: two neurons are (according to the definition of the ROLF) connected when their perceptive surfaces overlap (i.e. some kind of nearest neighboring is executed with the variable perceptive surfaces). A cluster is a group of connected neurons or a group of points of the input space covered by these neurons (fig. A.3).

Of course, the complete ROLF network can be evaluated by means of other clustering methods, i.e. the neurons can be searched for clusters. Particularly with clustering methods whose storage effort grows quadratically with |P|, the storage effort can be reduced dramatically since generally there are considerably fewer ROLF neurons than original data points, but the neurons represent the data points quite well.

A.5.4 Comparison with popular clustering methods

It is obvious that storing the neurons rather than storing the input points takes the biggest part of the storage effort of the ROLFs. This is a great advantage for huge point clouds with a lot of points.

Since it is unnecessary to store the entire point cloud, our ROLF, as a neural clustering method, has the capability to learn online, which is definitely a great advantage. Furthermore, it can (similar to ε-nearest neighboring or k-nearest neighboring) distinguish clusters from enclosed clusters – but due to the online presentation of the data without a quadratically growing storage effort, which is by far the greatest disadvantage of the two neighboring methods.

Figure A.3: The clustering process. Top: the input set, middle: the input space covered by ROLF neurons, bottom: the input space only covered by the neurons (representatives).


Additionally, the issue of the size of the individual clusters proportional to their distance from each other is addressed by using variable perceptive surfaces – which is also not always the case for the two mentioned methods.

The ROLF compares favorably with k-means clustering, as well: firstly, it is unnecessary to previously know the number of clusters and, secondly, k-means clustering does not recognize clusters enclosed by other clusters as separate clusters.

A.5.5 Initializing radii, learning rates and multiplier is not trivial

Certainly, the disadvantages of the ROLFs shall not be concealed: it is not always easy to select the appropriate initial values for σ and ρ. Previous knowledge about the data set can, so to speak, be included in ρ and the initial value of σ of the ROLF: fine-grained data clusters should use a small ρ and a small σ initial value. But the smaller the ρ, the smaller the chance that the neurons will grow if necessary. Here again, there is no easy answer, just like for the learning rates ηc and ησ.

For ρ, multipliers in the lower single-digit range such as 2 or 3 are very popular. ηc and ησ successfully work with values of about 0.005 to 0.1; variations during runtime are also imaginable for this type of network. Initial values for σ generally depend on the cluster and data distribution (i.e. they often have to be tested). But compared to wrong initializations – at least with the mean-σ strategy – they are relatively robust after some training time.

As a whole, the ROLF is on a par with the other clustering methods and is particularly interesting for systems with low storage capacity or huge data sets.

A.5.6 Application examples

A first application example could be finding color clusters in RGB images. Another field of application directly described in the ROLF publication is the recognition of words transferred into a 720-dimensional feature space. Thus, we can see that ROLFs are relatively robust against higher dimensions. Further applications can be found in the field of analysis of attacks on network systems and their classification.

Exercises

Exercise 18. Determine at least four adaptation steps for one single ROLF neuron k if the four patterns stated below are presented one after another in the indicated order. Let the initial values for the ROLF neuron be ck = (0.1, 0.1) and σk = 1. Furthermore, let ηc = 0.5 and ησ = 0. Let ρ = 3.

P = {(0.1, 0.1); (0.9, 0.1); (0.1, 0.9); (0.9, 0.9)}.


Appendix B

Excursus: neural networks used for prediction

Discussion of an application of neural networks: a look ahead into the future of time series.

After discussing the different paradigms of neural networks it is now useful to take a look at an application of neural networks which is brought up often and (as we will see) is also used for fraud: the application of time series prediction. This excursus is structured into the description of time series and estimations about the requirements that are actually needed to predict the values of a time series. Finally, I will say something about the range of software which is supposed to predict share prices or other economic characteristics by means of neural networks or other procedures.

This chapter should not be a detailed description but rather indicate some approaches for time series prediction. In this respect I will again try to avoid formal definitions.

B.1 About time series

A time series is a series of values discretized in time. For example, daily measured temperature values or other meteorological data of a specific site could be represented by a time series. Share price values also represent a time series. Often the measurement of time series is equidistant in time, and in many time series the future development of their values is very interesting, e.g. for the daily weather forecast.

Time series can also be values of an actually continuous function read at a certain time interval ∆t (fig. B.1 on the next page).

Figure B.1: A function x that depends on the time is sampled at discrete time steps (time discretized), which means that the result is a time series. The sampled values are entered into a neural network (in this example an SLP) which shall learn to predict the future values of the time series.

If we want to predict a time series, we will look for a neural network that maps the previous series values to future developments of the time series, i.e. if we know longer sections of the time series, we will have enough training samples. Of course, these are not examples for the future to be predicted, but we try to generalize and to extrapolate the past by means of the said samples.

But before we begin to predict a time series we have to answer some questions about this time series we are dealing with and ensure that it fulfills some requirements.

1. Do we have any evidence which suggests that future values depend in any way on the past values of the time series? Does the past of a time series include information about its future?

2. Do we have enough past values of the time series that can be used as training patterns?

3. In case of a prediction of a continuous function: What must a useful ∆t look like?

Now these questions shall be explored in detail.

How much information about the future is included in the past values of a time series? This is the most important question to be answered for any time series that should be mapped into the future. If the future values of a time series, for instance, do not depend on the past values, then a time series prediction based on them will be impossible.

In this chapter, we assume systems whose future values can be deduced from their states – deterministic systems. This leads us to the question of what a system state is.

A system state completely describes a system for a certain point of time. The future of a deterministic system would be clearly defined by means of the complete description of its current state.

The problem in the real world is that such a state concept includes all things that influence our system by any means.

In case of our weather forecast for a specific site we could definitely determine the temperature, the atmospheric pressure and the cloud density as the meteorological state of the place at a time t. But the whole state would include significantly more information. Here, the worldwide phenomena that control the weather would be interesting, as well as small local phenomena such as the cooling system of the local power plant.

So we shall note that the system state is desirable for prediction but not always possible to obtain. Often only fragments of the current states can be acquired, e.g. for a weather forecast these fragments are the said weather data.

However, we can partially overcome these weaknesses by using not only one single state (the last one) for the prediction, but by using several past states. From this we want to derive our first prediction system:

B.2 One-step-ahead prediction

The first attempt to predict the next future value of a time series out of past values is called one-step-ahead prediction (fig. B.2 on the following page).

Such a predictor system receives the last n observed state parts of the system as input and outputs the prediction for the next state (or state part). The idea of a state space with predictable states is called state space forecasting.

The aim of the predictor is to realize a function

f(x_{t−n+1}, …, x_{t−1}, x_t) = x̃_{t+1},   (B.1)

which receives exactly n past values in order to predict the future value. Predicted values shall be headed by a tilde (e.g. x̃) to distinguish them from the actual future values.

The most intuitive and simplest approach would be to find a linear combination

x̃_{i+1} = a_0 x_i + a_1 x_{i−1} + … + a_j x_{i−j}   (B.2)

that approximately fulfills our conditions.

Such a construction is called digital filter.

Figure B.2: Representation of the one-step-ahead prediction. It is tried to calculate the future value from a series of past values. The predicting element (in this case a neural network) is referred to as predictor.

Here we use the fact that time series usually have a lot of past values so that we can set up a series of equations1:

x_t = a_0 x_{t−1} + … + a_j x_{t−1−(n−1)}
x_{t−1} = a_0 x_{t−2} + … + a_j x_{t−2−(n−1)}
⋮   (B.3)
x_{t−n} = a_0 x_{t−n−1} + … + a_j x_{t−n−1−(n−1)}

Thus, we can set up n equations for n unknown coefficients and solve them (if possible). Or another, better approach: we could use m > n equations for n unknowns in such a way that the sum of the mean squared errors of the already known predictions is minimized. This is called moving average procedure.
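To make the idea concrete, here is a minimal sketch in Python (assuming NumPy is available; function names and the data handling are illustrative, not part of the text above) that fits the coefficients of such a digital filter by least squares over all available windows:

import numpy as np

def fit_linear_predictor(series, n):
    # windows of the last n values (most recent first) and their successors
    X = np.array([series[t - n:t][::-1] for t in range(n, len(series))])
    y = np.array(series[n:])
    # least-squares solution of the (usually overdetermined) system X a = y,
    # i.e. the sum of squared one-step prediction errors is minimized
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_next(series, coeffs):
    window = np.array(series[-len(coeffs):][::-1])
    return float(window @ coeffs)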

But this linear structure corresponds to a singlelayer perceptron with a linear activation function which has been trained by means of data from the past (the experimental setup would comply with fig. B.1 on page 182). In fact, the training by means of the delta rule provides results very close to the analytical solution.

1 Without going into detail, I want to remark that the prediction becomes easier the more past values of the time series are available. I would like to ask the reader to read up on the Nyquist-Shannon sampling theorem.

Even if this approach often provides satisfying results, we have seen that many problems cannot be solved by using a singlelayer perceptron. Additional layers with linear activation functions are useless, as well, since a multilayer perceptron with only linear activation functions can be reduced to a singlelayer perceptron. Such considerations lead to a non-linear approach.

The multilayer perceptron and non-linear activation functions provide a universal non-linear function approximator, i.e. we can use an n-|H|-1-MLP for n inputs out of the past. An RBF network could also be used. But remember that here the number n has to remain low since in RBF networks high input dimensions are very complex to realize. So if we want to include many past values, a multilayer perceptron will require considerably less computational effort.
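As a sketch of this non-linear variant (hedged: scikit-learn's MLPRegressor is used here merely as a stand-in for any multilayer perceptron implementation, and the noisy sine series is an invented toy example):

import numpy as np
from sklearn.neural_network import MLPRegressor

def make_windows(series, n):
    # inputs: the last n values, target: the next value
    X = np.array([series[t - n:t] for t in range(n, len(series))])
    y = np.array(series[n:])
    return X, y

series = np.sin(np.linspace(0, 20, 400)) + 0.05 * np.random.randn(400)
X, y = make_windows(series, n=8)
# an n-|H|-1 architecture: 8 inputs, one hidden layer, one output neuron
mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(X[:300], y[:300])
print(mlp.predict(X[300:305]))   # one-step-ahead predictions on unseen windows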

184 D. Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN)

Page 203: Brief Introduction to Neural Networks

dkriesel.com B.4 Additional optimization approaches for prediction

B.3 Two-step-ahead prediction

What approaches can we use to see farther into the future?

B.3.1 Recursive two-step-ahead prediction

In order to extend the prediction to, for instance, two time steps into the future, we could perform two one-step-ahead predictions in a row (fig. B.3 on the following page), i.e. a recursive two-step-ahead prediction. Unfortunately, the value determined by means of a one-step-ahead prediction is generally imprecise, so that errors can build up, and the more predictions are performed in a row the more imprecise the result becomes.

B.3.2 Direct two-step-ahead prediction

We have already guessed that there exists a better approach: Just like the system can be trained to predict the next value, we can certainly train it to predict the next but one value. This means we directly train, for example, a neural network to look two time steps ahead into the future, which is referred to as direct two-step-ahead prediction (fig. B.4 on the next page). Obviously, the direct two-step-ahead prediction is technically identical to the one-step-ahead prediction. The only difference is the training.
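Since only the training differs, the only change in code is the target: a small sketch (a hypothetical helper continuing the windowing idea from above):

def make_windows_direct(series, n, k=2):
    # same inputs as before, but the target lies k steps ahead of the window
    # (k = 2 yields training pairs for direct two-step-ahead prediction)
    X = [series[t - n:t] for t in range(n, len(series) - k + 1)]
    y = [series[t + k - 1] for t in range(n, len(series) - k + 1)]
    return X, y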

B.4 Additional optimization approaches for prediction

The possibility to predict values far away in the future is not only important because we try to look farther ahead into the future. There can also be periodic time series where other approaches are hardly possible: If a lecture begins at 9 a.m. every Thursday, it is not very useful to know how many people sat in the lecture room on Monday to predict the number of lecture participants. The same applies, for example, to periodically occurring commuter jams.

B.4.1 Changing temporal parameters

Thus, it can be useful to intentionally leave gaps in the future values as well as in the past values of the time series, i.e. to introduce the parameter ∆t which indicates which past value is used for prediction. Technically speaking, we still use a one-step-ahead prediction, only that we extend the input space or train the system to predict values lying farther away.

It is also possible to combine different ∆t: In case of the traffic jam prediction for a Monday the values of the last few days could be used as data input in addition to the values of the previous Mondays. Thus, we use the last values of several periods, in this case the values of a weekly and a daily period. We could also include an annual period in the form of the beginning of the holidays (for sure, every one of us has

D. Kriesel – A Brief Introduction to Neural Networks (ZETA2-EN) 185

Page 204: Brief Introduction to Neural Networks

Appendix B Excursus: neural networks used for prediction dkriesel.com

predictor

xt−3

..

xt−2

00

..

xt−1

00

--

xt

++

00

xt+1

OO

xt+2

predictor

JJ

Figure B.3: Representation of the two-step-ahead prediction. Attempt to predict the second future value out of a past value series by means of a second predictor and the involvement of an already predicted value.


Figure B.4: Representation of the direct two-step-ahead prediction. Here, the second time step is predicted directly, the first one is omitted. Technically, it does not differ from a one-step-ahead prediction.


already spent a lot of time on the highway because he forgot the beginning of the holidays).
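A sketch of such a combined input vector as described in section B.4.1 (the daily series and the chosen offsets ∆t are only illustrative assumptions):

import numpy as np

def lagged_features(series, lags):
    # one input component per past offset, e.g. the last three days plus the
    # same weekday of the previous two weeks
    lags = sorted(lags)
    start = lags[-1]
    X = np.array([[series[t - lag] for lag in lags]
                  for t in range(start, len(series))])
    y = np.array(series[start:])
    return X, y

# combine a daily and a weekly period on a hypothetical daily series
X, y = lagged_features(np.arange(100.0), lags=[1, 2, 3, 7, 14])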

B.4.2 Heterogeneous prediction

Another prediction approach would be to predict the future values of a single time series out of several time series, if it is assumed that the additional time series are related to the future of the first one (heterogeneous one-step-ahead prediction, fig. B.5 on the following page).

If we want to predict two outputs of two related time series, it is certainly possible to perform two parallel one-step-ahead predictions (analytically this is done very often because otherwise the equations would become very confusing); or, in case of the neural networks, an additional output neuron is attached and the knowledge of both time series is used for both outputs (fig. B.6 on the next page).

Further and more general material on time series can be found in [WG94].

B.5 Remarks on the prediction of share prices

Many people observe the changes of ashare price in the past and try to con-clude the future from those values in or-der to benefit from this knowledge. Shareprices are discontinuous and therefore theyare principally difficult functions. Further-more, the functions can only be used for

discrete values – often, for example, in adaily rhythm (including the maximum andminimum values per day, if we are lucky)with the daily variations certainly beingeliminated. But this makes the wholething even more difficult.

There are chartists, i.e. people who lookat many diagrams and decide by meansof a lot of background knowledge anddecade-long experience whether the equi-ties should be bought or not (and oftenthey are very successful).

Apart from the share prices it is very interesting to predict the exchange rates of currencies: If we exchange 100 Euros into Dollars, the Dollars into Pounds and the Pounds back into Euros it could be possible that we will finally receive 110 Euros. But once found out, we would do this more often and thus we would change the exchange rates into a state in which such an increasing circulation would no longer be possible (otherwise we could produce money by generating, so to speak, a financial perpetual motion machine).

At the stock exchange, successful stockand currency brokers raise or lower theirthumbs – and thereby indicate whether intheir opinion a share price or an exchangerate will increase or decrease. Mathemat-ically speaking, they indicate the first bit(sign) of the first derivative of the ex-change rate. In that way excellent world-class brokers obtain success rates of about70%.

In Great Britain, the heterogeneous one-step-ahead prediction was successfully



Figure B.5: Representation of the heterogeneous one-step-ahead prediction. Prediction of a time series under consideration of a second one.


Figure B.6: Heterogeneous one-step-ahead prediction of two time series at the same time.


used to increase the accuracy of such predictions to 76%: In addition to the time series of the values, indicators such as the oil price in Rotterdam or the US national debt were included.

This is just an example to show the magnitude of the accuracy of stock-exchange evaluations, since we are still talking only about the first bit of the first derivative! We still do not know how strong the expected increase or decrease will be and also whether the effort will pay off: Probably, one wrong prediction could nullify the profit of one hundred correct predictions.

How can neural networks be used to pre-dict share prices? Intuitively, we assumethat future share prices are a function ofthe previous share values.

But this assumption is wrong: Share prices are no function of their past values, but a function of their assumed future value. We do not buy shares because their values have increased during the last days, but because we believe that they will further increase tomorrow. If, as a consequence, many people buy a share, they will boost the price. Therefore their assumption was right – a self-fulfilling prophecy has been generated, a phenomenon long known in economics.

The same applies the other way around:We sell shares because we believe that to-morrow the prices will decrease. This willbeat down the prices the next day and gen-erally even more the day after the next.

Again and again some software appears which uses scientific key words such as "neural networks" to purport that it is capable of predicting where share prices are going. Do not buy such software! In addition to the aforementioned scientific exclusions there is one simple reason for this: If these tools work – why should the manufacturer sell them? Normally, useful economic knowledge is kept secret. If we knew a way to definitely gain wealth by means of shares, we would earn our millions by using this knowledge instead of selling it for 30 euros, wouldn't we?


Appendix C

Excursus: reinforcement learning

What if there were no training samples, but it would nevertheless be possible to evaluate how well we have learned to solve a problem? Let us examine a learning paradigm that is situated between supervised and unsupervised learning.

I now want to introduce a more exotic approach of learning – just to leave the usual paths. We know learning procedures in which the network is exactly told what to do, i.e. we provide exemplary output values. We also know learning procedures like those of the self-organizing maps, into which only input values are entered.

Now we want to explore something in-between: The learning paradigm of reinforcement learning – reinforcement learning according to Sutton and Barto [SB98].

Reinforcement learning in itself is no neural network but only one of the three learning paradigms already mentioned in chapter 4. In some sources it is counted among the supervised learning procedures since a feedback is given. Due to its very rudimentary feedback it is reasonable to separate it from the supervised learning procedures – apart from the fact that there are no training samples at all.

While it is generally known that pro-cedures such as backpropagation cannotwork in the human brain itself, reinforce-ment learning is usually considered as be-ing biologically more motivated.

The term reinforcement learning comes from cognitive science and psychology and it describes the learning system of carrot and stick, which occurs everywhere in nature, i.e. learning by means of good or bad experience, reward and punishment. But there is no learning aid that exactly explains what we have to do: We only receive a total result for a process (Did we win the game of chess or not? And how sure was this victory?), but no results for the individual intermediate steps.

For example, if we ride our bike with worn tires and at a speed of exactly 21.5 km/h through a turn over some sand with a grain size of 0.1 mm, on the average, then nobody could tell us exactly which handlebar angle we have to adjust or, even worse, how strongly the great number of muscle parts in our arms or legs have to contract for this. Depending on whether we reach the end of the curve unharmed or not, we soon have to face the learning experience, a feedback or a reward, be it good or bad. Thus, the reward is very simple – but on the other hand it is considerably easier to obtain. If we now have tested different velocities and turning angles often enough and received some rewards, we will get a feel for what works and what does not. The aim of reinforcement learning is to acquire exactly this feeling.

Another example for the quasi-impossibility to achieve a sort of cost or utility function is a tennis player who tries to maximize his athletic success in the long term by means of complex movements and ballistic trajectories in the three-dimensional space including the wind direction, the importance of the tournament, private factors and many more.

To get straight to the point: Since wereceive only little feedback, reinforcementlearning often means trial and error – andtherefore it is very slow.

C.1 System structure

Now we want to briefly discuss the different quantities and components of the system. We will define them more precisely in the following sections. Broadly speaking, reinforcement learning represents the mutual interaction between an agent and an environmental system (fig. C.2).

The agent shall solve some problem. He could, for instance, be an autonomous robot that shall avoid obstacles. The agent performs some actions within the environment and in return receives a feedback from the environment, which in the following is called reward. This cycle of action and reward is characteristic for reinforcement learning. The agent influences the system, the system provides a reward and then changes.

The reward is a real or discrete scalar which describes, as mentioned above, how well we achieve our aim, but it does not give any guidance how we can achieve it. The aim is always to make the sum of rewards as high as possible in the long term.

C.1.1 The gridworld

As a learning example for reinforcement learning I would like to use the so-called gridworld. We will see that its structure is very simple and easy to figure out and therefore reinforcement is actually not necessary. However, it is very suitable for representing the approach of reinforcement learning. Now let us define the individual components of the reinforcement system by means of the gridworld as an example. Later, each of these components will be examined more exactly.

Environment: The gridworld (fig. C.1 on the facing page) is a simple, discrete world in two dimensions which in the following we want to use as environmental system.

Agent: As an agent we use a simple robot situated in our gridworld.

State space: As we can see, our gridworld has 5 × 7 fields with 6 fields being inaccessible. Therefore, our agent can occupy 29 positions in the gridworld. These positions are regarded as states for the agent.

Action space: The actions are still missing. We simply define that the robot could move one field up or down, to the right or to the left (as long as there is no obstacle or the edge of our gridworld).

Task: Our agent's task is to leave the gridworld. The exit is located on the right of the light-colored field.

Non-determinism: The two obstacles can be connected by a "door". When the door is closed (lower part of the illustration), the corresponding field is inaccessible. The position of the door cannot change during a cycle but only between the cycles.

We now have created a small world that will accompany us through the following learning strategies and illustrate them.
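A rough sketch of these components in code follows (the layout, names and reward are illustrative assumptions only; the book's gridworld is the one shown in fig. C.1):

import random

# '#' obstacle, 'D' the door field, 'S' start, 'E' exit
# (layout is only similar to fig. C.1, not identical)
GRID = [
    "#######",
    "#.....#",
    "#.#D#.#",
    "#.#.#.E",
    "#..S..#",
    "#######",
]
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def step(state, action, door_open):
    # one state transition: move if the target cell is free, otherwise stay
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    cell = GRID[nr][nc]
    if cell == "#" or (cell == "D" and not door_open):
        nr, nc = r, c
    reward = -1                       # every step is punished (see section C.2.1)
    done = GRID[nr][nc] == "E"
    return (nr, nc), reward, done

door_open = random.random() < 0.5     # the door only changes between cycles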

C.1.2 Agent and environment

Our aim is that the agent learns what happens by means of the reward. Thus, it is trained by and through interaction with a dynamic system, the environment, in order to reach an aim. But what does learning mean in this context?

Figure C.1: A graphical representation of our gridworld. Dark-colored cells are obstacles and therefore inaccessible. The exit is located on the right side of the light-colored field. The symbol × marks the starting position of our agent. In the upper part of the figure the door is open, in the lower part it is closed.

Figure C.2: The agent performs some actions within the environment and in return receives a reward.

The agent shall learn a mapping of situations to actions (called policy), i.e. it shall learn what to do in which situation to achieve a certain (given) aim. The aim is simply shown to the agent by giving an award for the achievement.

Such an award must not be mistaken for the reward – on the agent's way to the solution it may sometimes be useful to receive a smaller award or a punishment when in return the long-term result is maximal (similar to the situation when an investor just sits out the downturn of the share price or to a pawn sacrifice in a chess game). So, if the agent is heading in the right direction towards the target, it receives a positive reward, and if not it receives no reward at all or even a negative reward (punishment). The award is, so to speak, the final sum of all rewards – which is also called return.

After having colloquially named all the ba-sic components, we want to discuss moreprecisely which components can be used tomake up our abstract reinforcement learn-ing system.

In the gridworld: In the gridworld, the agent is a simple robot that should find the exit of the gridworld. The environment is the gridworld itself, a discrete two-dimensional world.

Definition C.1 (Agent). In reinforcement learning the agent can be formally described as a mapping of the situation space S into the action space A(s_t). The meaning of situations s_t will be defined later and should only indicate that the action space depends on the current situation.

Agent: S → A(s_t)   (C.1)

Definition C.2 (Environment). The environment represents a stochastic mapping of an action A in the current situation s_t to a reward r_t and a new situation s_{t+1}.

Environment: S × A → P(S × r_t)   (C.2)

C.1.3 States, situations and actions

As already mentioned, an agent can be in different states: In case of the gridworld, for example, it can be in different positions (here we get a two-dimensional state vector).

For an agent it is not always possible to realize all information about its current state, so we have to introduce the term situation. A situation is a state from the agent's point of view, i.e. only a more or less precise approximation of a state.

Therefore, situations generally do not allow to clearly "predict" successor situations – even with a completely deterministic system this may not be applicable. If we knew all states and the transitions between them exactly (thus, the complete system), it would be possible to plan optimally and also easy to find an optimal policy (methods are provided, for example, by dynamic programming).

Now we know that reinforcement learning is an interaction between the agent and the system including actions a_t and situations s_t. The agent cannot determine by itself whether the current situation is good or bad: This is exactly the reason why it receives the said reward from the environment.

In the gridworld: States are positionswhere the agent can be situated. Sim-ply said, the situations equal the statesin the gridworld. Possible actions wouldbe to move towards north, south, east orwest.

Situation and action can be vectorial, the reward is always a scalar (in an extreme case even only a binary value) since the aim of reinforcement learning is to get along with little feedback. A complex vectorial reward would equal a real teaching input.

By the way, the cost function should be minimized, which would not be possible, however, with a vectorial reward since we do not have any intuitive order relations in multi-dimensional space, i.e. we do not directly know what is better or worse.

Definition C.3 (State). Within its en-vironment the agent is in a state. Statescontain any information about the agentwithin the environmental system. Thus,it is theoretically possible to clearly pre-dict a successor state to a performed ac-tion within a deterministic system out ofthis godlike state knowledge.

Definition C.4 (Situation). Situations s_t (here at time t) of a situation space S are the agent's limited, approximate knowledge about its state. This approximation (about which the agent cannot even know how good it is) makes clear predictions impossible.

Definition C.5 (Action). Actions a_t can be performed by the agent (whereupon it could be possible that depending on the situation another action space A(S) exists). They cause state transitions and therefore a new situation from the agent's point of view.

C.1.4 Reward and return

As in real life it is our aim to receive an award that is as high as possible, i.e. to maximize the sum of the expected rewards r, called return R, in the long term. For finitely many time steps1 the rewards can simply be added:

R_t = r_{t+1} + r_{t+2} + …   (C.3)
    = Σ_{x=1}^{∞} r_{t+x}   (C.4)

Certainly, the return is only estimatedhere (if we knew all rewards and thereforethe return completely, it would no longerbe necessary to learn).

Definition C.6 (Reward). A reward r_t is a scalar, real or discrete (even sometimes only binary) reward or punishment which the environmental system returns to the agent as reaction to an action.

1 In practice, only finitely many time steps will be possible, even though the formulas are stated with an infinite sum in the first place.

Definition C.7 (Return). The return R_t is the accumulation of all rewards received from time t on.

C.1.4.1 Dealing with long periods of time

However, not every problem has an explicit target and therefore a finite sum (e.g. our agent can be a robot having the task to drive around again and again and to avoid obstacles). In order not to receive a diverging sum in case of an infinite series of reward estimations a weakening factor 0 < γ < 1 is used, which weakens the influence of future rewards. This is not only useful if there exists no target but also if the target is very far away:

R_t = r_{t+1} + γ^1 r_{t+2} + γ^2 r_{t+3} + …   (C.5)
    = Σ_{x=1}^{∞} γ^{x−1} r_{t+x}   (C.6)

The farther the reward is away, the smaller is the influence it has on the agent's decisions.

Another possibility to handle the return sum would be a limited time horizon τ so that only τ many following rewards r_{t+1}, …, r_{t+τ} are regarded:

R_t = r_{t+1} + … + γ^{τ−1} r_{t+τ}   (C.7)
    = Σ_{x=1}^{τ} γ^{x−1} r_{t+x}   (C.8)

Thus, we divide the timeline into episodes. Usually, one of the two methods is used to limit the sum, if not both methods together.

As in daily living we try to approximate our current situation to a desired state. Since it is not mandatory that only the next expected reward but the expected total sum decides what the agent will do, it is also possible to perform actions that, on short notice, result in a negative reward (e.g. the pawn sacrifice in a chess game) but will pay off later.
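A small sketch of equations (C.5) to (C.8) in code (the reward list and γ are arbitrary examples):

def discounted_return(rewards, gamma=0.9, horizon=None):
    # rewards = [r_{t+1}, r_{t+2}, ...]; optionally truncated to a horizon tau
    if horizon is not None:
        rewards = rewards[:horizon]
    return sum(gamma ** (x - 1) * r for x, r in enumerate(rewards, start=1))

print(discounted_return([0, 0, 0, 1], gamma=0.9))   # 0.9**3 = 0.729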

C.1.5 The policy

After having considered and formalized some system components of reinforcement learning, the actual aim is still to be discussed:

During reinforcement learning the agent learns a policy

Π : S → P(A).

Thus, it continuously adjusts a mapping of the situations to the probabilities P(A), with which any action A is performed in any situation S. A policy can be defined as a strategy to select actions that would maximize the reward in the long term.

In the gridworld: In the gridworld the pol-icy is the strategy according to which theagent tries to exit the gridworld.

Definition C.8 (Policy). The policy Π is a mapping of situations to probabilities of performing every action out of the action space. So it can be formalized as

Π : S → P(A).   (C.9)

Basically, we distinguish between two policy paradigms: An open loop policy represents an open control chain and creates out of an initial situation s_0 a sequence of actions a_0, a_1, … with a_i ≠ a_i(s_i); i > 0. Thus, in the beginning the agent develops a plan and consecutively executes it to the end without considering the intermediate situations (therefore a_i ≠ a_i(s_i): actions after a_0 do not depend on the situations).

In the gridworld: In the gridworld, anopen-loop policy would provide a precisedirection towards the exit, such as the wayfrom the given starting position to (in ab-breviations of the directions) EEEEN.

So an open-loop policy is a sequence of actions without interim feedback. A sequence of actions is generated out of a starting situation. If the system is known well and truly, such an open-loop policy can be used successfully and lead to useful results. But, for example, to know the chess game well and truly it would be necessary to try every possible move, which would be very time-consuming. Thus, for such problems we have to find an alternative to the open-loop policy, which incorporates the current situations into the action plan:

A closed loop policy is a closed loop, a function

Π : s_i → a_i with a_i = a_i(s_i),

in a manner of speaking. Here, the envi-ronment influences our action or the agentresponds to the input of the environment,respectively, as already illustrated in fig.C.2. A closed-loop policy, so to speak, isa reactive plan to map current situationsto actions to be performed.

In the gridworld: A closed-loop policywould be responsive to the current posi-tion and choose the direction according tothe action. In particular, when an obsta-cle appears dynamically, such a policy isthe better choice.

When selecting the actions to be per-formed, again two basic strategies can beexamined.

C.1.5.1 Exploitation vs. exploration

As in real life, during reinforcement learning often the question arises whether the existing knowledge is only willfully exploited or new ways are also explored. Initially, we want to discuss the two extremes:

A greedy policy always chooses the way of the highest reward that can be determined in advance, i.e. the way of the highest known reward. This policy represents the exploitation approach and is very promising when the used system is already known.

In contrast to the exploitation approach it is the aim of the exploration approach to explore a system as detailed as possible so that also such paths leading to the target can be found which may be not very promising at first glance but are in fact very successful.

If we are looking for the way to a restaurant, a safe policy would be to always take the way we already know, no matter how suboptimal and long it may be, and not to try to explore better ways. Another approach would be to explore shorter ways every now and then, even at the risk of taking a long time and being unsuccessful, and therefore finally having to take the original way and arrive too late at the restaurant.

In reality, often a combination of both methods is applied: In the beginning of the learning process the system explores with a higher probability, while at the end more of the existing knowledge is exploited. A static probability distribution is also possible and often applied.

In the gridworld: For finding the way inthe gridworld, the restaurant example ap-plies equally.
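A common way to realize such a combination is an ε-greedy action selection with a decaying exploration probability; a sketch (the dictionary-based action values and the decay schedule are illustrative assumptions, not taken from the text):

import random

def epsilon_greedy(q_values, epsilon):
    # q_values maps each possible action to its currently estimated value
    if random.random() < epsilon:
        return random.choice(list(q_values))     # exploration
    return max(q_values, key=q_values.get)       # exploitation (greedy)

for episode in range(500):
    epsilon = max(0.05, 0.99 ** episode)   # explore a lot at first, less later
    # ... run one episode, choosing actions via epsilon_greedy(Q[state], epsilon)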

C.2 Learning process

Let us again take a look at daily life. Actions can lead us from one situation into different subsituations, from each subsituation into further sub-subsituations. In a sense, we get a situation tree where links between the nodes must be considered (often there are several ways to reach a situation – so the tree could more accurately be referred to as a situation graph). The leaves of such a tree are the end situations of the system. The exploration approach would search the tree as thoroughly as possible and become acquainted with all leaves. The exploitation approach would unerringly go to the best known leaf.

Analogous to the situation tree, we also can create an action tree. Here, the rewards for the actions are within the nodes. Now we have to adapt from daily life how we learn exactly.

C.2.1 Rewarding strategies

Interesting and very important is the question of what a reward is awarded for and what kind of reward it is, since the design of the reward significantly controls system behavior. As we have seen above, there generally are (again as in daily life) various actions that can be performed in any situation. There are different strategies to evaluate the selected situations and to learn which series of actions would lead to the target. First of all, this principle should be explained in the following.

We now want to indicate some extremecases as design examples for the reward:

A rewarding similar to the rewarding in a chess game is referred to as pure delayed reward: We only receive the reward at the end of and not during the game. This method is always advantageous when we finally can say whether we were successful or not, but the interim steps do not allow an estimation of our situation. If we win, then

r_t = 0   ∀t < τ   (C.10)

as well as r_τ = 1. If we lose, then r_τ = −1. With this rewarding strategy a reward is only returned by the leaves of the situation tree.

Pure negative reward: Here,

r_t = −1   ∀t < τ.   (C.11)

This system finds the most rapid way to reach the target because this way is automatically the most favorable one in respect of the reward. The agent receives punishment for anything it does – even if it does nothing. As a result it is the most inexpensive method for the agent to reach the target fast.

Another strategy is the avoidance strategy: Harmful situations are avoided. Here,

r_t ∈ {0, −1}.   (C.12)

Most situations do not receive any reward, only a few of them receive a negative reward. The agent will avoid getting too close to such negative situations.

Warning: Rewarding strategies can have unexpected consequences. A robot that is told "have it your own way but if you touch an obstacle you will be punished" will simply stand still. If standing still is also punished, it will drive in small circles. Reconsidering this, we will understand that this behavior optimally fulfills the return of the robot but unfortunately was not intended to do so.

Furthermore, we can show that especially small tasks can be solved better by means of negative rewards while positive, more differentiated rewards are useful for large, complex tasks.

For our gridworld we want to apply thepure negative reward strategy: The robotshall find the exit as fast as possible.

C.2.2 The state-value function

Unlike our agent we have a godlike view of our gridworld, so that we can swiftly determine which robot starting position can provide which optimal return.

In figure C.3 on the next page these opti-mal returns are applied per field.

In the gridworld: The state-value functionfor our gridworld exactly represents sucha function per situation (= position) withthe difference being that here the functionis unknown and has to be learned.

Thus, we can see that it would be more practical for the robot to be capable of evaluating the current and future situations. So let us take a look at another system component of reinforcement learning: the state-value function V(s), which with regard to a policy Π is often called V_Π(s), because whether a situation is bad often depends on the general behavior Π of the agent.

Figure C.3: Representation of each optimal return per field in our gridworld by means of pure negative reward awarding, at the top with an open and at the bottom with a closed door.

A situation being bad under a policy that is searching for risks and checking out limits would be, for instance, if an agent on a bicycle turns a corner and the front wheel begins to slide out. And due to its daredevil policy the agent would not brake in this situation. With a risk-aware policy the same situation would look much better, and thus it would be evaluated higher by a good state-value function.

V_Π(s) simply returns the value the current situation s has for the agent under policy Π. Abstractly speaking, according to the above definitions, the value of the state-value function corresponds to the return R_t (the expected value) of a situation s_t. E_Π denotes the expected value of the return under Π and the current situation s_t:

V_Π(s) = E_Π{R_t | s = s_t}

Definition C.9 (State-value function). The state-value function V_Π(s) has the task of determining the value of situations under a policy, i.e. to answer the agent's question of whether a situation s is good or bad or how good or bad it is. For this purpose it returns the expectation of the return under the situation:

V_Π(s) = E_Π{R_t | s = s_t}   (C.13)

The optimal state-value function is called V*_Π(s).

Unfortunately, unlike us our robot does not have a godlike view of its environment. It does not have a table with optimal returns like the one shown above to orient itself by. The aim of reinforcement learning is that the robot generates its state-value function bit by bit on the basis of the returns of many trials and approximates the optimal state-value function V* (if there is one).

In this context I want to introduce twoterms closely related to the cycle betweenstate-value function and policy:

C.2.2.1 Policy evaluation

Policy evaluation is the approach to trya policy a few times, to provide many re-wards that way and to gradually accumu-late a state-value function by means ofthese rewards.



Figure C.4: The cycle of reinforcement learningwhich ideally leads to optimal Π∗ and V ∗.

C.2.2.2 Policy improvement

Policy improvement means to improve a policy itself, i.e. to turn it into a new and better one. In order to improve the policy we have to aim at the return finally having a larger value than before, i.e. until we have found a shorter way to the restaurant and have walked it successfully.

The principle of reinforcement learning is to realize an interaction. It is tried to evaluate how good a policy is in individual situations. The changed state-value function provides information about the system with which we again improve our policy. These two values lift each other, which can mathematically be proved, so that the final result is an optimal policy Π* and an optimal state-value function V* (fig. C.4). This cycle sounds simple but is very time-consuming.

At first, let us regard a simple, random policy by which our robot could slowly fill in and improve its state-value function without any previous knowledge.

C.2.3 Monte Carlo method

The easiest approach to accumulate a state-value function is mere trial and error. Thus, we select a randomly behaving policy which does not consider the accumulated state-value function for its random decisions. It can be proved that at some point we will find the exit of our gridworld by chance.

Inspired by random-based games of chancethis approach is called Monte Carlomethod.

If we additionally assume a pure negative reward, it is obvious that we can receive an optimum value of −6 for our starting field in the state-value function. Depending on the random way the random policy takes, values other (smaller) than −6 can occur for the starting field. Intuitively, we want to memorize only the better value for one state (i.e. one field). But here caution is advised: In this way, the learning procedure would work only with deterministic systems. Our door, which can be open or closed during a cycle, would produce oscillations for all fields and such oscillations would influence their shortest way to the target.

With the Monte Carlo method we prefer to use the learning rule2

V(s_t)_new = V(s_t)_old + α(R_t − V(s_t)_old),

in which the update of the state-value function is obviously influenced by both the old state value and the received return (α is the learning rate). Thus, the agent gets some kind of memory; new findings always change the situation value just a little bit. An exemplary learning step is shown in fig. C.5.

2 The learning rule is, among others, derived by means of the Bellman equation, but this derivation is not discussed in this chapter.

In this example, the computation of thestate value was applied for only one singlestate (our initial state). It should be ob-vious that it is possible (and often done)to train the values for the states visited in-between (in case of the gridworld our waysto the target) at the same time. The resultof such a calculation related to our exam-ple is illustrated in fig. C.6 on the facingpage.

The Monte Carlo method seems to besuboptimal and usually it is significantlyslower than the following methods of re-inforcement learning. But this method isthe only one for which it can be mathemat-ically proved that it works and thereforeit is very useful for theoretical considera-tions.

Definition C.10 (Monte Carlo learning). Actions are randomly performed regardless of the state-value function and in the long term an expressive state-value function is accumulated by means of the following learning rule:

V(s_t)_new = V(s_t)_old + α(R_t − V(s_t)_old).
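A sketch of this learning rule in code (assuming an episode is stored as the list of visited states plus the rewards received after each of them; the dictionary V and α = 0.5 follow fig. C.5, everything else is illustrative):

alpha = 0.5
V = {}                                    # state-value estimates, default 0.0

def monte_carlo_update(visited_states, rewards):
    # rewards[i] is the reward received after leaving visited_states[i];
    # each state is nudged towards the return observed from it onwards
    for i, s in enumerate(visited_states):
        R = sum(rewards[i:])              # undiscounted return from step i on
        v_old = V.get(s, 0.0)
        V[s] = v_old + alpha * (R - v_old)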

C.2.4 Temporal difference learning

Figure C.5: Application of the Monte Carlo learning rule with a learning rate of α = 0.5. Top: two exemplary ways the agent randomly selects are applied (one with an open and one with a closed door). Bottom: The result of the learning rule for the value of the initial state considering both ways. Due to the fact that in the course of time many different ways are walked given a random policy, a very expressive state-value function is obtained.

Figure C.6: Extension of the learning example in fig. C.5 in which the returns for intermediate states are also used to accumulate the state-value function. Here, the low value on the door field can be seen very well: If this state is possible, it must be very positive. If the door is closed, this state is impossible.

Figure C.7: We try different actions within the environment and as a result we learn and improve the policy.

Most of the learning is the result of experience; e.g. walking or riding a bicycle without getting injured (or not), even mental skills like mathematical problem solving benefit a lot from experience and simple trial and error. Thus, we initialize our policy with arbitrary values – we try, learn and improve the policy due to experience (fig. C.7). In contrast to the Monte Carlo method we want to do this in a more directed manner.

Just as we learn from experience to react to different situations in different ways, temporal difference learning (abbreviated: TD learning) does the same by training V_Π(s) (i.e. the agent learns to estimate which situations are worth a lot and which are not). Again the current situation is identified with s_t, the following situations with s_{t+1} and so on. Thus, the learning formula for the state-value function V_Π(s_t) is

V(s_t)_new = V(s_t) + α(r_{t+1} + γ V(s_{t+1}) − V(s_t)),

where the α(·) term is the change of the previous value.

We can see that the change in value of the current situation s_t, which is proportional to the learning rate α, is influenced by

. the received reward r_{t+1},

. the return of the following situation V(s_{t+1}), weighted with a factor γ,

. the previous value of the situation V(s_t).

Definition C.11 (Temporal difference learning). Unlike the Monte Carlo method, TD learning looks ahead by regarding the following situation s_{t+1}. Thus, the learning rule is given by

V(s_t)_new = V(s_t) + α(r_{t+1} + γ V(s_{t+1}) − V(s_t)),   (C.14)

where the α(·) term is the change of the previous value.
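A sketch of one such TD step (V as a dictionary defaulting to 0; α and γ are freely chosen here):

def td_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    # move V(s_t) a little towards the target r_{t+1} + gamma * V(s_{t+1})
    v_old = V.get(s, 0.0)
    target = r_next + gamma * V.get(s_next, 0.0)
    V[s] = v_old + alpha * (target - v_old)
    return V[s]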

C.2.5 The action-value function

Analogous to the state-value function V_Π(s), the action-value function Q_Π(s, a) is another system component of reinforcement learning, which evaluates a certain action a under a certain situation s and the policy Π.

Figure C.8: Exemplary values of an action-value function for the position ×. Moving right, one remains on the fastest way towards the target, moving up is still a quite fast way, moving down is not a good way at all (provided that the door is open for all cases).

In the gridworld: In the gridworld, theaction-value function tells us how good itis to move from a certain field into a cer-tain direction (fig. C.8).

Definition C.12 (Action-value function). Like the state-value function, the action-value function Q_Π(s_t, a) evaluates certain actions on the basis of certain situations under a policy. The optimal action-value function is called Q*_Π(s_t, a).

As shown in fig. C.9, the actions are per-formed until a target situation (here re-ferred to as sτ ) is achieved (if there exists atarget situation, otherwise the actions aresimply performed again and again).

C.2.6 Q learning

This results in the learning formula for the action-value function Q_Π(s, a); analogously to TD learning, its application is called Q learning:

Q(s_t, a)_new = Q(s_t, a) + α(r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a)),

in which the term max_a Q(s_{t+1}, a) realizes the greedy strategy and the whole α(·) term is again the change of the previous value.

Again we break down the change of the current action value (proportional to the learning rate α) under the current situation. It is influenced by

. the received reward r_{t+1},

. the maximum action value over the following actions weighted with γ (here, a greedy strategy is applied since it can be assumed that the best known action is selected; with TD learning, on the other hand, we do not mind to always get into the best known next situation),

. the previous value of the action under our situation s_t known as Q(s_t, a) (remember that this is also weighted by means of α).

Figure C.9: Actions are performed until the desired target situation is achieved. Attention should be paid to numbering: Rewards are numbered beginning with 1, actions and situations beginning with 0 (this has simply been adopted as a convention).

Usually, the action-value function learns considerably faster than the state-value function. But we must not disregard that reinforcement learning is generally quite slow: The system has to find out by itself what is good. But the advantage of Q learning is: Π can be initialized arbitrarily, and by means of Q learning the result is always Q*.

Definition C.13 (Q learning). Q learning trains the action-value function by means of the learning rule

Q(s_t, a)_new = Q(s_t, a) + α(r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a))   (C.15)

and thus finds Q* in any case.
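A sketch of one Q learning step on a tabular action-value function (Q stored as a dictionary keyed by (situation, action); parameter values are illustrative):

def q_update(Q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    # the target uses the best known action value of the following situation
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    q_old = Q.get((s, a), 0.0)
    Q[(s, a)] = q_old + alpha * (r_next + gamma * best_next - q_old)
    return Q[(s, a)]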

C.3 Example applications

C.3.1 TD gammon

TD gammon is a very successful backgammon program based on TD learning invented by Gerald Tesauro. The situation here is the current configuration of the board. Anyone who has ever played backgammon knows that the situation space is huge (approx. 10^20 situations). As a result, the state-value functions cannot be computed explicitly (particularly in the late eighties when TD gammon was introduced). The selected rewarding strategy was the pure delayed reward, i.e. the system receives the reward not before the end of the game and at the same time the reward is the return. Then the system was allowed to practice itself (initially against a backgammon program, then against an entity of itself). The result was that it achieved the highest ranking in a computer-backgammon league and strikingly disproved the theory that a computer program is not capable of mastering a task better than its programmer.

C.3.2 The car in the pit

Let us take a look at a car parking on a one-dimensional road at the bottom of a deep pit without being able to get over the slope on both sides straight away by means of its engine power in order to leave the pit. Trivially, the executable actions here are the possibilities to drive forwards and backwards. The intuitive solution we think of immediately is to move backwards, to gain momentum at the opposite slope and oscillate in this way several times to dash out of the pit.

The actions of a reinforcement learningsystem would be "full throttle forward","full reverse" and "doing nothing".

Here, "everything costs" would be a good choice for awarding the reward so that the system learns fast how to leave the pit and realizes that our problem cannot be solved by means of mere forward-directed engine power. So the system will slowly build up the movement.

The policy can no longer be stored as atable since the state space is hard to dis-cretize. As policy a function has to begenerated.

C.3.3 The pole balancer

The pole balancer was developed byBarto, Sutton and Anderson.

Let be given a situation including a vehiclethat is capable to move either to the rightat full throttle or to the left at full throt-tle (bang bang control). Only these twoactions can be performed, standing stillis impossible. On the top of this car ishinged an upright pole that could tip overto both sides. The pole is built in such away that it always tips over to one side soit never stands still (let us assume that thepole is rounded at the lower end).

The angle of the pole relative to the vertical line is referred to as α. Furthermore, the vehicle always has a fixed position x in our one-dimensional world and a velocity ẋ. Our one-dimensional world is limited, i.e. there are maximum values and minimum values x can adopt.

The aim of our system is to learn to steerthe car in such a way that it can balancethe pole, to prevent the pole from tippingover. This is achieved best by an avoid-ance strategy: As long as the pole is bal-anced the reward is 0. If the pole tips over,the reward is -1.

Interestingly, the system is soon capableto keep the pole balanced by tilting it suf-ficiently fast and with small movements.At this the system mostly is in the cen-ter of the space since this is farthest fromthe walls which it understands as negative(if it touches the wall, the pole will tipover).

C.3.3.1 Swinging up an inverted pendulum

More difficult for the system is the fol-lowing initial situation: the pole initiallyhangs down, has to be swung up over thevehicle and finally has to be stabilized. Inthe literature this task is called swing upan inverted pendulum.


C.4 Reinforcement learning in connection with neural networks

Finally, the reader would like to ask why atext on "neural networks" includes a chap-ter about reinforcement learning.

The answer is very simple. We have already been introduced to supervised and unsupervised learning procedures. Although we do not always have an omniscient teacher who makes supervised learning possible, this does not mean that we do not receive any feedback at all. There is often something in between, some kind of criticism or school mark. Problems like this can be solved by means of reinforcement learning.

But not every problem is as easily solved as our gridworld: In our backgammon example we have approx. 10^20 situations and the situation tree has a large branching factor, let alone other games. Here, the tables used in the gridworld can no longer be realized as state- and action-value functions. Thus, we have to find approximators for these functions.

And which learning approximators forthese reinforcement learning componentscome immediately into our mind? Exactly:neural networks.

Exercises

Exercise 19. A robot control system shall be persuaded by means of reinforcement learning to find a strategy in order to exit a maze as fast as possible.

. What could an appropriate state-value function look like?

. How would you generate an appropriate reward?

Assume that the robot is capable of avoiding obstacles and at any time knows its position (x, y) and orientation φ.

Exercise 20. Describe the function ofthe two components ASE and ACE asthey have been proposed by Barto, Sut-ton and Anderson to control the polebalancer.

Bibliography: [BSA83].

Exercise 21. Indicate several "classical"problems of informatics which could besolved efficiently by means of reinforce-ment learning. Please give reasons foryour answers.


Bibliography

[And72] James A. Anderson. A simple neural network generating an interactive memory. Mathematical Biosciences, 14:197–220, 1972.

[APZ93] D. Anguita, G. Parodi, and R. Zunino. Speed improvement of the back-propagation on current-generation workstations. In WCNN'93, Portland: World Congress on Neural Networks, July 11-15, 1993, Oregon Convention Center, Portland, Oregon, volume 1. Lawrence Erlbaum, 1993.

[BSA83] A. Barto, R. Sutton, and C. Anderson. Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):834–846, September 1983.

[CG87] G. A. Carpenter and S. Grossberg. ART2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26:4919–4930, 1987.

[CG88] M.A. Cohen and S. Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. Computer Society Press Technology Series Neural Networks, pages 70–81, 1988.

[CG90] G. A. Carpenter and S. Grossberg. ART 3: Hierarchical search using chemical transmitters in self-organising pattern recognition architectures. Neural Networks, 3(2):129–152, 1990.

[CH67] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[CR00] N.A. Campbell and J.B. Reece. Biologie. Spektrum Akademischer Verlag, 2000.

[Cyb89] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

[DHS01] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley, New York, 2001.


[Elm90] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, April 1990.

[Fah88] S. E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, CMU, 1988.

[FMI83] K. Fukushima, S. Miyake, and T. Ito. Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):826–834, September/October 1983.

[Fri94] B. Fritzke. Fast learning with incremental RBF networks. Neural Processing Letters, 1(1):2–5, 1994.

[GKE01a] N. Goerke, F. Kintzler, and R. Eckmiller. Self organized classification of chaotic domains from a nonlinear attractor. In Neural Networks, 2001. Proceedings. IJCNN'01. International Joint Conference on, volume 3, 2001.

[GKE01b] N. Goerke, F. Kintzler, and R. Eckmiller. Self organized partitioning of chaotic attractors for control. Lecture Notes in Computer Science, pages 851–856, 2001.

[Gro76] S. Grossberg. Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23:121–134, 1976.

[GS06] Nils Goerke and Alexandra Scherbart. Classification using multi-SOMs and multi-neural gas. In IJCNN, pages 3895–3902, 2006.

[Heb49] Donald O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley, New York, 1949.

[Hop82] John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. of the National Academy of Science, USA, 79:2554–2558, 1982.

[Hop84] J.J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092, 1984.

[HT85] J.J. Hopfield and D.W. Tank. Neural computation of decisions in optimization problems. Biological Cybernetics, 52(3):141–152, 1985.

[Jor86] M. I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Conference of the Cognitive Science Society, pages 531–546. Erlbaum, 1986.

[Kau90] L. Kaufman. Finding groups in data: an introduction to cluster analysis. In Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.

[Koh72] T. Kohonen. Correlation matrix memories. IEEEtC, C-21:353–359, 1972.

[Koh82] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.

[Koh89] Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, third edition, 1989.

[Koh98] T. Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1–6, 1998.

[KSJ00] E.R. Kandel, J.H. Schwartz, and T.M. Jessell. Principles of neural science.Appleton & Lange, 2000.

[lCDS90] Y. le Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan Kaufmann, 1990.

[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics and Probability, Vol. 1, pages 281–296, 1967.

[MBS93] Thomas M. Martinetz, Stanislav G. Berkovich, and Klaus J. Schulten. ’Neural-gas’ network for vector quantization and its application to time-series prediction. IEEE Trans. on Neural Networks, 4(4):558–569, 1993.

[MBW+10] K.D. Micheva, B. Busse, N.C. Weiler, N. O’Rourke, and S.J. Smith. Single-synapse analysis of a diverse synapse population: proteomic imaging methods and markers. Neuron, 68(4):639–653, 2010.

[MP43] W.S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5(4):115–133, 1943.

[MP69] M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, Mass,1969.

[MR86] J. L. McClelland and D. E. Rumelhart. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2. MIT Press, Cambridge, 1986.

[Par87] David R. Parker. Optimal algorithms for adaptive networks: Second order back propagation, second order direct propagation, and second order Hebbian learning. In Maureen Caudill and Charles Butler, editors, IEEE First International Conference on Neural Networks (ICNN’87), volume II, pages II–593–II–600, San Diego, CA, June 1987. IEEE.

[PG89] T. Poggio and F. Girosi. A theory of networks for approximation and learning. MIT Press, Cambridge Mass., 1989.

[Pin87] F. J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229–2232, 1987.

[PM47] W. Pitts and W.S. McCulloch. How we know universals: the perception of auditory and visual forms. Bulletin of Mathematical Biology, 9(3):127–147, 1947.

[Pre94] L. Prechelt. Proben1: A set of neural network benchmark problems and benchmarking rules. Technical Report, 21:94, 1994.

[RB93] M. Riedmiller and H. Braun. A direct adaptive method for faster back-propagation learning: The rprop algorithm. In Neural Networks, 1993., IEEE International Conference on, pages 586–591. IEEE, 1993.

[RD05] G. Roth and U. Dicke. Evolution of the brain and intelligence. Trends in Cognitive Sciences, 9(5):250–257, 2005.

[RHW86a] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, October 1986.

[RHW86b] David E. Rumelhart, Geoffrey E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP research group, editors, Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations. MIT Press, 1986.

[Rie94] M. Riedmiller. Rprop - description and implementation details. Technical report, University of Karlsruhe, 1994.

[Ros58] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.

[Ros62] F. Rosenblatt. Principles of Neurodynamics. Spartan, New York, 1962.

[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[SG06] A. Scherbart and N. Goerke. Unsupervised system for discovering patterns in time-series, 2006.

[SGE05] Rolf Schatten, Nils Goerke, and Rolf Eckmiller. Regional and online learnable fields. In Sameer Singh, Maneesha Singh, Chidanand Apté, and Petra Perner, editors, ICAPR (2), volume 3687 of Lecture Notes in Computer Science, pages 74–83. Springer, 2005.

[Ste61] K. Steinbuch. Die Lernmatrix. Kybernetik (Biological Cybernetics), 1:36–45, 1961.

[vdM73] C. von der Malsburg. Self-organizing of orientation sensitive cells in striate cortex. Kybernetik, 14:85–100, 1973.

[Was89] P. D. Wasserman. Neural Computing Theory and Practice. New York: Van Nostrand Reinhold, 1989.

[Wer74] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.

[Wer88] P. J. Werbos. Backpropagation: Past and future. In Proceedings ICNN-88, San Diego, pages 343–353, 1988.

[WG94] A.S. Weigend and N.A. Gershenfeld. Time series prediction. Addison-Wesley, 1994.

[WH60] B. Widrow and M. E. Hoff. Adaptive switching circuits. In Proceedings WESCON, pages 96–104, 1960.

[Wid89] R. Widner. Single-stage logic. AIEE Fall General Meeting, 1960. Wasserman, P.: Neural Computing, Theory and Practice, Van Nostrand Reinhold, 1989.

[Zel94] Andreas Zell. Simulation Neuronaler Netze. Addison-Wesley, 1994. German.

List of Figures

1.1 Robot with 8 sensors and 2 motors
1.2 Learning samples for the example robot
1.3 Black box with eight inputs and two outputs
1.4 Institutions of the field of neural networks

2.1 Central nervous system
2.2 Brain
2.3 Biological neuron
2.4 Action potential
2.5 Compound eye

3.1 Data processing of a neuron
3.2 Various popular activation functions
3.3 Feedforward network
3.4 Feedforward network with shortcuts
3.5 Directly recurrent network
3.6 Indirectly recurrent network
3.7 Laterally recurrent network
3.8 Completely linked network
3.9 Example network with and without bias neuron
3.10 Examples for different types of neurons

4.1 Training samples and network capacities
4.2 Learning curve with different scalings
4.3 Gradient descent, 2D visualization
4.4 Possible errors during a gradient descent
4.5 The 2-spiral problem
4.6 Checkerboard problem

5.1 The perceptron in three different views
5.2 Singlelayer perceptron
5.3 Singlelayer perceptron with several output neurons
5.4 AND and OR singlelayer perceptron

5.5 Error surface of a network with 2 connections
5.6 Sketch of a XOR-SLP
5.7 Two-dimensional linear separation
5.8 Three-dimensional linear separation
5.9 The XOR network
5.10 Multilayer perceptrons and output sets
5.11 Position of an inner neuron for derivation of backpropagation
5.12 Illustration of the backpropagation derivation
5.13 Momentum term
5.14 Fermi function and hyperbolic tangent
5.15 Functionality of 8-2-8 encoding

6.1 RBF network
6.2 Distance function in the RBF network
6.3 Individual Gaussian bells in one- and two-dimensional space
6.4 Accumulating Gaussian bells in one-dimensional space
6.5 Accumulating Gaussian bells in two-dimensional space
6.6 Even coverage of an input space with radial basis functions
6.7 Uneven coverage of an input space with radial basis functions
6.8 Random, uneven coverage of an input space with radial basis functions

7.1 Roessler attractor
7.2 Jordan network
7.3 Elman network
7.4 Unfolding in time

8.1 Hopfield network
8.2 Binary threshold function
8.3 Convergence of a Hopfield network
8.4 Fermi function

9.1 Examples for quantization

10.1 Example topologies of a SOM
10.2 Example distances of SOM topologies
10.3 SOM topology functions
10.4 First example of a SOM
10.5 Training a SOM with one-dimensional topology
10.6 SOMs with one- and two-dimensional topologies and different input spaces
10.7 Topological defect of a SOM
10.8 Resolution optimization of a SOM to certain areas

10.9 Shape to be classified by neural gas

11.1 Structure of an ART network
11.2 Learning process of an ART network

A.1 Comparing cluster analysis methods
A.2 ROLF neuron
A.3 Clustering by means of a ROLF

B.1 Neural network reading time series
B.2 One-step-ahead prediction
B.3 Two-step-ahead prediction
B.4 Direct two-step-ahead prediction
B.5 Heterogeneous one-step-ahead prediction
B.6 Heterogeneous one-step-ahead prediction with two outputs

C.1 Gridworld
C.2 Reinforcement learning
C.3 Gridworld with optimal returns
C.4 Reinforcement learning cycle
C.5 The Monte Carlo method
C.6 Extended Monte Carlo method
C.7 Improving the policy
C.8 Action-value function
C.9 Reinforcement learning timeline

Index

100-step rule

A
Action
action potential
action space
action-value function
activation
activation function
    selection of
ADALINE, see adaptive linear neuron
adaptive linear element, see adaptive linear neuron
adaptive linear neuron
adaptive resonance theory
agent
algorithm
amacrine cell
approximation
ART, see adaptive resonance theory
ART-2
ART-2A
ART-3
artificial intelligence
associative data storage
ATP
attractor
autoassociator
axon

B
backpropagation
    second order
backpropagation of error
    recurrent
bar
basis
bias neuron
binary threshold function
bipolar cell
black box
brain
brainstem

C
capability to learn
center
    of a ROLF neuron
    of a SOM neuron

    of an RBF neuron
        distance to the
central nervous system
cerebellum
cerebral cortex
cerebrum
change in weight
cluster analysis
clusters
CNS, see central nervous system
codebook vector
complete linkage
compound eye
concentration gradient
cone function
connection
context-based search
continuous
cortex, see cerebral cortex
    visual
cortical field
    association
    primary
cylinder function

D
Dartmouth Summer Research Project
deep networks
Delta
delta rule
dendrite
    tree
depolarization
diencephalon, see interbrain
difference vector, see error vector
digital filter
digitization
discrete
discretization, see quantization
distance
    Euclidean
    squared
dynamical system

E
early stopping
electronic brain
Elman network
environment
episode
epoch
epsilon-nearest neighboring
error
    specific
    total
error function
    specific
error vector
evolutionary algorithms
exploitation approach
exploration approach
exteroceptor

F
fastprop
fault tolerance
feedforward
Fermi function
flat spot elimination

fudging, see flat spot elimination
function approximation
function approximator
    universal

G
ganglion cell
Gauss-Markov model
Gaussian bell
generalization
glial cell
gradient
gradient descent
    problems
grid
gridworld

H
Heaviside function, see binary threshold function
Hebbian rule
    generalized form
heteroassociator
Hinton diagram
history of development
Hopfield networks
    continuous
horizontal cell
hyperbolic tangent
hyperpolarization
hypothalamus

I
individual eye, see ommatidium
input dimension
input patterns
input vector
interbrain
internodes
interoceptor
interpolation
    precise
ion
iris

J
Jordan network

K
k-means clustering
k-nearest neighboring

L
layer
    hidden
    input
    output
learnability
learning

    batch, see learning, offline
    offline
    online
    reinforcement
    supervised
    unsupervised
learning rate
    variable
learning strategy
learning vector quantization
lens
linear separability
linear associator
locked-in syndrome
logistic function, see Fermi function
    temperature parameter
LVQ, see learning vector quantization
LVQ1
LVQ2
LVQ3

M
M-SOM, see self-organizing map, multi
Mark I perceptron
Mathematical Symbols
    (t), see time concept
    A(S), see action space
    E_p, see error vector
    G, see topology
    N, see self-organizing map, input dimension
    P, see training set
    Q*_Π(s, a), see action-value function, optimal
    Q_Π(s, a), see action-value function
    R_t, see return
    S, see situation space
    T, see temperature parameter
    V*_Π(s), see state-value function, optimal
    V_Π(s), see state-value function
    W, see weight matrix
    ∆w_i,j, see change in weight
    Π, see policy
    Θ, see threshold value
    α, see momentum
    β, see weight decay
    δ, see Delta
    η, see learning rate
    η↑, see Rprop
    η↓, see Rprop
    η_max, see Rprop
    η_min, see Rprop
    η_i,j, see Rprop
    ∇, see nabla operator
    ρ, see radius multiplier
    Err, see error, total
    Err(W), see error function
    Err_p, see error, specific
    Err_p(W), see error function, specific
    Err_WD, see weight decay
    a_t, see action
    c, see center of an RBF neuron; see neuron, self-organizing map, center
    m, see output dimension
    n, see input dimension
    p, see training pattern
    r_h, see center of an RBF neuron, distance to the
    r_t, see reward
    s_t, see situation
    t, see teaching input
    w_i,j, see weight
    x, see input vector
    y, see output vector

    f_act, see activation function
    f_out, see output function
membrane
    potential
memorized
metric
Mexican hat function
MLP, see perceptron, multilayer
momentum
momentum term
Monte Carlo method
Moore-Penrose pseudo inverse
moving average procedure
myelin sheath

N
nabla operator
Neocognitron
nervous system
network input
neural gas
    growing
    multi-
neural network
    recurrent
neuron
    accepting
    binary
    context
    Fermi
    identity
    information processing
    input
    RBF
        output
    ROLF
    self-organizing map
    tanh
    winner
neuron layers, see layer
neurotransmitters
nodes of Ranvier

O
oligodendrocytes
OLVQ
on-neuron, see bias neuron
one-step-ahead prediction
    heterogeneous
open loop learning
optimal brain damage
order of activation
    asynchronous
        fixed order
        random order
        randomly permuted order
        topological order
    synchronous
output dimension
output function
output vector

P
parallelism
pattern, see training pattern
pattern recognition
perceptron
    multilayer
        recurrent

    singlelayer
perceptron convergence theorem
perceptron learning algorithm
period
peripheral nervous system
Persons
    Anderson
    Anderson, James A.
    Anguita
    Barto
    Carpenter, Gail
    Elman
    Fukushima
    Girosi
    Grossberg, Stephen
    Hebb, Donald O.
    Hinton
    Hoff, Marcian E.
    Hopfield, John
    Ito
    Jordan
    Kohonen, Teuvo
    Lashley, Karl
    MacQueen, J.
    Martinetz, Thomas
    McCulloch, Warren
    Minsky, Marvin
    Miyake
    Nilsson, Nils
    Papert, Seymour
    Parker, David
    Pitts, Walter
    Poggio
    Pythagoras
    Riedmiller, Martin
    Rosenblatt, Frank
    Rumelhart
    Steinbuch, Karl
    Sutton
    Tesauro, Gerald
    von der Malsburg, Christoph
    Werbos, Paul
    Widrow, Bernard
    Wightman, Charles
    Williams
    Zuse, Konrad
pinhole eye
PNS, see peripheral nervous system
pole balancer
policy
    closed loop
    evaluation
    greedy
    improvement
    open loop
pons
propagation function
pruning
pupil

Q
Q learning
quantization
quickpropagation

R
RBF network
    growing
receptive field
receptor cell
    photo-
    primary
    secondary

recurrence
    direct
    indirect
    lateral
refractory period
regional and online learnable fields
reinforcement learning
repolarization
representability
resilient backpropagation
resonance
retina
return
reward
    avoidance strategy
    pure delayed
    pure negative
RMS, see root mean square
ROLFs, see regional and online learnable fields
root mean square
Rprop, see resilient backpropagation

S
saltatory conductor
Schwann cell
self-fulfilling prophecy
self-organizing feature maps
self-organizing map
    multi-
sensory adaptation
sensory transduction
shortcut connections
silhouette coefficient
single lens eye
Single Shot Learning
situation
situation space
situation tree
SLP, see perceptron, singlelayer
Snark
SNIPE
sodium-potassium pump
SOM, see self-organizing map
soma
spin
spinal cord
stability / plasticity dilemma
state
state space forecasting
state-value function
stimulus
stimulus-conducting apparatus
surface, perceptive
swing up an inverted pendulum
symmetry breaking
synapse
    chemical
    electrical
synapses
synaptic cleft

T
target
TD gammon
TD learning, see temporal difference learning
teacher forcing
teaching input
telencephalon, see cerebrum
temporal difference learning
thalamus

threshold potential
threshold value
time concept
time horizon
time series
time series prediction
topological defect
topology
topology function
training pattern
    set of
training set
transfer function, see activation function
truncus cerebri, see brainstem
two-step-ahead prediction
    direct

U
unfolding in time

V
Voronoi diagram

W
weight
weight matrix
    bottom-up
    top-down
weight vector
weighted sum
Widrow-Hoff rule, see delta rule
winner-takes-all scheme
