
  • Advances in Industrial Control

    Springer-Verlag London Ltd.

  • Other titles published in this Series:

    Adaptive Internal Model Control Aniruddha Datta

    Price-Based Commitment Decisions in the Electricity Market Eric Allen and Marija Ilic

    Compressor Surge and Rotating Stall: Modeling and Control Jan Tommy Gravdahl and Olav Egeland

    Radiotherapy Treatment Planning: New System Approaches Olivier Haas

    Feedback Control Theory for Dynamic Traffic Assignment Pushkin Kachroo and Kaan Ozbay

    Autotuning of PID Controllers Cheng-Ching Yu

    Robust Aeroservoelastic Stability Analysis Rick Lind and Marty Brenner

    Performance Assessment of Control Loops: Theory and Applications Biao Huang and Sirish L. Shah

    Data Mining and Knowledge Discovery for Process Monitoring and Control Xue Z. Wang

    Advances in PID Control Tan Kok Kiong, Wang Qing-Guo and Hang Chang Chieh with Tore J. Hägglund

    Advanced Control with Recurrent High-order Neural Networks: Theory and Industrial Applications George A. Rovithakis and Manolis A. Christodoulou

    Structure and Synthesis of PID Controllers Aniruddha Datta, Ming-Tzu Ho and Shankar P. Bhattacharyya

    Data-driven Techniques for Fault Detection and Diagnosis in Chemical Processes Evan L. Russell, Leo H. Chiang and Richard D. Braatz

    Bounded Dynamic Stochastic Systems: Modelling and Control Hong Wang

    Non-linear Model-based Process Control Rashid M. Ansari and Moses O. Tade

    Identification and Control of Sheet and Film Processes Andrew P. Featherstone, Jeremy G. VanAntwerp and Richard D. Braatz

    Precision Motion Control: Design and Implementation Tan Kok Kiong, Lee Tong Heng, Dou Huifang and Huang Sunan

  • G.P. Liu

    Nonlinear Identification and Control A Neural Network Approach

    With 88 Figures

    Springer

  • G.P. Liu, BEng, MEng, PhD School of Mechanical, Materials, Manufacturing Engineering and Management, University of Nottingham, University Park, Nottingham, NG7 2RD, UK

    ISBN 978-1-4471-1076-7 ISBN 978-1-4471-0345-5 (eBook) DOI 10.1007/978-1-4471-0345-5

    British Library Cataloguing in Publication Data Liu, G. P. (Guo Ping), 1962-

    Nonlinear identification and control. - (Advances in industrial control) 1. Nonlinear control theory 2. Neural networks (Computer science) I. Title 629.8'36 ISBN 9781447110767

    Library of Congress Cataloging-in-Publication Data Liu, G.P. (Guo Ping), 1962-

    Nonlinear identification and control/ G.P. Liu p. cm. -- (Advances in industrial control)

    Includes bibliographical references and index. ISBN 978-1-4471-1076-7 (alk. paper) 1. Automatic control. 2. Neural networks (Computer science) 3. Nonlinear theories. 4.

    System identification. I. Title. II. Series. TJ213 .L522 2001 629.8--dc21 2001042662

    Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

    http://www.springer.co.uk

    Springer-Verlag London 2001 Originally published by Springer-Verlag London Berlin Heidelberg 2001 Softcover reprint of the hardcover 1st edition 2001

    The use of registered names, trademarks etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

    The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

    Typesetting: Electronic text files prepared by author

    69/3830-543210 Printed on acid-free paper SPIN 10770966

  • Advances in Industrial Control

    Series Editors

    Professor Michael J. Grimble, Professor of Industrial Systems and Director Professor Michael A. Johnson, Professor of Control Systems and Deputy Director Industrial Control Centre Department of Electronic and Electrical Engineering University of Strathclyde Graham Hills Building 50 George Street Glasgow G1 1QE United Kingdom

    Series Advisory Board

    Professor Dr-Ing J. Ackermann DLR Institut für Robotik und Systemdynamik Postfach 1116 D-82230 Weßling Germany

    Professor I.D. Landau Laboratoire d'Automatique de Grenoble ENSIEG, BP 46 38402 Saint Martin d'Hères France

    Dr D.C. McFarlane Department of Engineering University of Cambridge Cambridge CB2 1QJ United Kingdom

    Professor B. Wittenmark Department of Automatic Control Lund Institute of Technology PO Box 118 S-221 00 Lund Sweden

    Professor D.W. Clarke Department of Engineering Science University of Oxford Parks Road Oxford OX1 3PJ United Kingdom

  • Professor Dr-Ing M. Thoma Institut für Regelungstechnik Universität Hannover Appelstr. 11 30167 Hannover Germany

    Professor H. Kimura Department of Mathematical Engineering and Information Physics Faculty of Engineering The University of Tokyo 7-3-1 Hongo Bunkyo Ku Tokyo 113 Japan

    Professor A.J. Laub College of Engineering - Dean's Office University of California One Shields Avenue Davis California 95616-5294 United States of America

    Professor J.B. Moore Department of Systems Engineering The Australian National University Research School of Physical Sciences GPO Box 4 Canberra ACT 2601 Australia

    Dr M.K. Masten Texas Instruments 2309 Northcrest Plano TX 75075 United States of America

    Professor Ton Backx AspenTech Europe B.V. De Waal 32 NL-5684 PH Best The Netherlands

  • Dedication

    To Weihong and Louise

  • SERIES EDITORS' FOREWORD

    The series Advances in Industrial Control aims to report and encourage technology transfer in control engineering. The rapid development of control technology has an impact on all areas of the control discipline. New theory, new controllers, actuators, sensors, new industrial processes, computer methods, new applications, new philosophies ..., new challenges. Much of this development work resides in industrial reports, feasibility study papers and the reports of advanced collaborative projects. The series offers an opportunity for researchers to present an extended exposition of such new work in all aspects of industrial control for wider and rapid dissemination.

    The time for nonlinear control to enter routine application seems to be approaching. Nonlinear control has had a long gestation period but much of the past has been concerned with methods that involve formal nonlinear functional model representations. It seems more likely that the breakthrough will come through the use of other more flexible and amenable nonlinear system modelling tools. This Advances in Industrial Control monograph by Guoping Liu gives an excellent introduction to the type of new nonlinear system modelling methods currently being developed and used. Neural networks appear prominent in these new modelling directions. The monograph presents a systematic development of this exciting subject. It opens with a useful tutorial introductory chapter on the various tools to be used. In subsequent chapters Doctor Liu leads the reader through identification, and then onto nonlinear control using nonlinear system neural network representations. Each chapter culminates with some examples and the final chapter is a worked-out case-study for combustion processes.

    We feel the structured presentation of modern nonlinear identification methods and their use in control schemes will be of interest to postgraduate students, industrial engineers and academics alike. We welcome this addition to the Advances in Industrial Control monograph series.

    M.J. Grimble and M.A. Johnson Industrial Control Centre Glasgow, Scotland, U.K.

  • PREFACE

    It is well known that linear models have been widely used in system identification for two major reasons. First, the effects that different and combined input signals have on the output are easily determined. Second, linear systems are homogeneous. However, control systems encountered in practice possess the property of linearity only over a certain range of operation; all physical systems are nonlinear to some degree. In many cases, linear models are not suitable to represent these systems and nonlinear models have to be considered. Since there are nonlinear effects in practical systems, e.g., harmonic generation, intermodulation, desensitisation, gain expansion and chaos, neither of the above principles for linear models is valid for nonlinear systems. Therefore, nonlinear system identification is much more difficult than linear system identification.

    Any attempt to restrict attention strictly to linear control can only lead to severe complications in system design. To operate linearly over a wide range of variation of signal amplitude and frequency would require components of an extremely high quality; such a system would probably be impractical from the viewpoints of cost, space, and weight. In addition, the restriction of linearity severely limits the system characteristics that can be realised.

    Recently, neural networks have become an attractive tool that can be used to construct a model of complex nonlinear processes. This is because neural networks have an inherent ability to learn and approximate a nonlinear function arbitrarily well. This therefore provides a possible way of modelling complex nonlinear processes effectively. A large number of identification and control structures have been proposed on the basis of neural networks in recent years.

    The purpose of this monograph is to give the broad aspects of nonlinear identification and control using neural networks. Basically, the monograph consists of three parts. The first part gives an introduction to fundamental principles of neural networks. Then several methods for nonlinear identification using neural networks are presented. In the third part, various techniques for nonlinear control using neural networks are studied. A number of simulated and industrial examples are used throughout the monograph to demonstrate the operation of the techniques of nonlinear identification and control using neural networks. It should be emphasised here that methods for nonlinear control systems have not progressed as rapidly as have techniques for linear control systems. Comparatively speaking, at the present time they are still in the development stage. We believe that the fundamental theory, various design methods and techniques, and many application examples of nonlinear identification and control using neural networks that are presented in this monograph will enable one to analyse and synthesise nonlinear control systems quantitatively. The monograph, which is mostly based on the author's recent research work, is organised as follows.

    Chapter 1 gives an overview of what neural networks are, followed by a description of the model of a neuron (the basic element of a neural network) and commonly used architectures of neural networks. Various types of neural networks are presented, e.g., radial basis function networks, polynomial basis function networks, fuzzy neural networks and wavelet networks. The function approximation properties of neural networks are discussed. A few widely used learning algorithms are introduced, such as the sequential learning algorithm, the error back-propagation learning algorithm and the least-mean-squares algorithm. Many applications of neural networks to classification, filtering, modelling, prediction, control and hardware implementation are mentioned.

    Chapter 2 presents a sequential identification scheme for nonlinear dynamical systems. A novel neural network architecture, referred to as a variable neural network, is studied and shown to be useful in approximating the unknown nonlinearities of dynamical systems. In the variable neural network, the number of basis functions can be either increased or decreased with time according to specified design strategies so that the network will not overfit or underfit the data set. The identification model varies gradually to span the appropriate state-space and is of sufficient complexity to provide an approximation to the dynamical system. The sequential identification scheme, different from the conventional methods of optimising a cost function, attempts to ensure stability of the overall system while the neural network learns the system dynamics. The stability and convergence of the overall identification scheme are guaranteed by the developed parameter adjustment laws. An example illustrates the modelling of an unknown nonlinear dynamical system using variable network identification techniques.

    Chapter 3 considers a recursive identification scheme using neural networks for nonlinear control systems. This comprises a structure selection procedure and a recursive weight learning algorithm. The orthogonal least squares algorithm is introduced for off-line structure selection and the growing network technique is used for on-line structure selection. An on-line recursive weight learning algorithm is developed to adjust the weights so that the identified model can adapt to variations of the characteristics and operating points in nonlinear systems. The convergence of both the weights and estimation errors is established. The recursive identification scheme using neural networks is demonstrated by three examples. The first is identification of unknown systems represented by a nonlinear input-output dynamical model. The second is identification of unknown systems represented by a nonlinear state-space dynamical model. The third is the identification of the Santa Fe time series.


    Chapter 4 is devoted to model selection and identification of nonlinear systems via neural networks and genetic algorithms based on multiobjective performance criteria. It considers three performance indices (or cost functions) as the objectives, which are the Euclidean distance and maximum difference measurements between the real nonlinear system and the nonlinear model, and the complexity measurement of the nonlinear model, instead of a single performance index. An algorithm based on the method of inequalities, least squares and genetic algorithms is developed for optimising over the multiobjective criteria. Volterra polynomial basis function networks and Gaussian radial basis function networks are applied to the identification of a practical system (a large-scale pilot liquid level nonlinear system) and a simulated unknown nonlinear system with mixed noise.

    In Chapter 5, identification schemes using wavelet networks are discussed for nonlinear dynamical systems. Based on fixed wavelet networks, parameter adaptation laws are developed. This guarantees the stability of the overall identification scheme and the convergence of both the parameters and the state errors. Using the decomposition and reconstruction techniques of multi-resolution decompositions, variable wavelet networks are introduced to achieve desired estimation accuracy and a suitable sized network, and to adapt to variations of the characteristics and operating points in nonlinear systems. B-spline wavelets are used to form the wavelet networks. A simulated example demonstrates the operation of the wavelet network identification to obtain a model with different estimation accuracy.

    Chapter 6 is concerned with the adaptive control of nonlinear dynamical systems using neural networks. Based on Gaussian radial basis function neural networks, an adaptive control scheme is presented. The location of the centres and the determination of the widths of the Gaussian radial basis functions in neural networks are analysed to make a compromise between orthogonality and smoothness. The developed weight adaptive laws ensure the overall control scheme is stable, even in the presence of modelling error. The tracking errors converge to the required accuracy through the adaptive control algorithm derived by combining the variable neural network and Lyapunov synthesis techniques. An example details the adaptive control design of an unknown nonlinear time-variant dynamical system using variable network identification techniques.

    Chapter 7 studies neural network based predictive control for nonlinear control systems. An affine nonlinear predictor structure is presented. It is shown that the use of nonlinear programming techniques can be avoided by using a set of affine nonlinear predictors to predict the output of the nonlinear process. The nonlinear predictive controller based on this design is both simple and easy to implement in practice. Some simulation results of nonlinear predictive neural control using growing neural networks are given.

    Chapter 8 considers neural network based variable structure control for the design of discrete nonlinear systems. Sliding mode control is used to provide good stability and robustness performance for nonlinear systems. A nonlinear neural predictor is introduced to predict the outputs of the nonlinear process and to make the variable structure control algorithm simple. When the predictor model is inaccurate, variable structure control with sliding modes is used to improve the stability of the system. A simulated example illustrates the variable structure neural control of a nonlinear dynamical system.

    Chapter 9 describes a neural control strategy for the active stabilisation of combustion processes. The characteristics of these processes include not only several interacting physical phenomena, but also a wide variety of dynamical behaviour. In terms of their impact on the system performance, pressure oscillations are undesirable since they result in excessive vibration, causing high levels of acoustic noise and, in extreme cases, mechanical failure. The active acoustic control algorithm is comprised of three parts: an output model, an output predictor and a feedback controller. The output model established using neural networks is used to predict the output in order to overcome the time delay of the system, which is often very large compared with the sampling period. An output-feedback controller is introduced which employs the output of the predictor to suppress instability in the combustion process. The approach developed is first demonstrated by a simulated unstable combustor with six modes. Results are also presented showing its application to an experimental combustion test rig with a commercial combustor.

    Much of the work described in this book is based on a series of publications by the author. The following publishers are gratefully acknowledged for permission to publish aspects of the author's work which appeared in their journals: the Institution of Electrical Engineers, Taylor and Francis Ltd., Elsevier Science Ltd., and the Institute of Electrical and Electronics Engineers. The author wishes to thank his wife Weihong and daughter Louise for their constant encouragement, understanding and patience during the preparation of the manuscript.

    Guoping Liu School of Mechanical, Materials, Manufacturing

    Engineering and Management University of Nottingham

    Nottingham NG7 2RD United Kingdom

    May 2001

  • TABLE OF CONTENTS

    Symbols and Abbreviations

    1. Neural Networks
       1.1 Introduction
       1.2 Model of a Neuron
       1.3 Architectures of Neural Networks
           1.3.1 Single Layer Networks
           1.3.2 Multilayer Networks
           1.3.3 Recurrent Networks
           1.3.4 Lattice Networks
       1.4 Various Neural Networks
           1.4.1 Radial Basis Function Networks
           1.4.2 Gaussian RBF Networks
           1.4.3 Polynomial Basis Function Networks
           1.4.4 Fuzzy Neural Networks
           1.4.5 Wavelet Neural Networks
           1.4.6 General Form of Neural Networks
       1.5 Learning and Approximation
           1.5.1 Background to Function Approximation
           1.5.2 Universal Approximation
           1.5.3 Capacity of Neural Networks
           1.5.4 Generalisation of Neural Networks
           1.5.5 Error Back Propagation Algorithm
           1.5.6 Recursive Learning Algorithms
           1.5.7 Least Mean Square Algorithm
       1.6 Applications of Neural Networks
           1.6.1 Classification
           1.6.2 Filtering
           1.6.3 Modelling and Prediction
           1.6.4 Control
           1.6.5 Hardware Implementation
       1.7 Mathematical Preliminaries
       1.8 Summary

    2. Sequential Nonlinear Identification
       2.1 Introduction
       2.2 Variable Neural Networks
           2.2.1 Variable Grids
           2.2.2 Variable Networks
           2.2.3 Selection of Basis Functions
       2.3 Dynamical System Modelling by Neural Networks
       2.4 Stable Nonlinear Identification
       2.5 Sequential Nonlinear Identification
       2.6 Sequential Identification of Multivariable Systems
       2.7 An Example
       2.8 Summary

    3. Recursive Nonlinear Identification
       3.1 Introduction
       3.2 Nonlinear Modelling by VPBF Networks
       3.3 Structure Selection of Neural Networks
           3.3.1 Off-line Structure Selection
           3.3.2 On-line Structure Selection
       3.4 Recursive Learning of Neural Networks
       3.5 Examples
       3.6 Summary

    4. Multiobjective Nonlinear Identification
       4.1 Introduction
       4.2 Multiobjective Modelling with Neural Networks
       4.3 Model Selection by Genetic Algorithms
           4.3.1 Genetic Algorithms
           4.3.2 Model Selection
       4.4 Multiobjective Identification Algorithm
       4.5 Examples
       4.6 Summary

    5. Wavelet Based Nonlinear Identification
       5.1 Introduction
       5.2 Wavelet Networks
           5.2.1 One-dimensional Wavelets
           5.2.2 Multi-dimensional Wavelets
           5.2.3 Wavelet Networks
       5.3 Identification Using Fixed Wavelet Networks
       5.4 Identification Using Variable Wavelet Networks
           5.4.1 Variable Wavelet Networks
           5.4.2 Parameter Estimation
       5.5 Identification Using B-spline Wavelets
           5.5.1 One-dimensional B-spline Wavelets
           5.5.2 n-dimensional B-spline Wavelets
       5.6 An Example
       5.7 Summary

    6. Nonlinear Adaptive Neural Control
       6.1 Introduction
       6.2 Adaptive Control
       6.3 Adaptive Neural Control
       6.4 Adaptation Algorithm with Variable Networks
       6.5 Examples
       6.6 Summary

    7. Nonlinear Predictive Neural Control
       7.1 Introduction
       7.2 Predictive Control
       7.3 Nonlinear Neural Predictors
       7.4 Predictive Neural Control
       7.5 On-line Learning of Neural Predictors
       7.6 Sequential Predictive Neural Control
       7.7 An Example
       7.8 Summary

    8. Variable Structure Neural Control
       8.1 Introduction
       8.2 Variable Structure Control
       8.3 Variable Structure Neural Control
       8.4 Generalised Variable Structure Neural Control
       8.5 Recursive Learning for Variable Structure Control
       8.6 An Example
       8.7 Summary

    9. Neural Control Application to Combustion Processes
       9.1 Introduction
       9.2 Model of Combustion Dynamics
       9.3 Neural Network Based Mode Observer
       9.4 Output Predictor and Controller
       9.5 Active Control of a Simulated Combustor
       9.6 Active Control of an Experimental Combustor
       9.7 Summary

    References

    Index

  • SYMBOLS AND ABBREVIATIONS

    The symbols and abbreviations listed here are used unless otherwise stated.

    C            field of complex numbers
    diag{.}      diagonal matrix
    dim(.)       dimension of a vector
    exp(.)       exponential function
    GA           genetic algorithm
    GAs          genetic algorithms
    GRBF         Gaussian radial basis function
    ḡ            complex conjugate of g
    ||f||_n      n-norm of the function f
    <.,.>        inner product
    λ(.)         eigenvalue of a matrix
    λmax(.)      maximum eigenvalue of a matrix
    λmin(.)      minimum eigenvalue of a matrix
    MIMO         multi-input multi-output
    MIMS         multi-input multi-state
    MoI          method of inequalities
    MLP          multilayer perceptron
    max{.}       maximum
    min{.}       minimum
    |.|          modulus
    NARMA        nonlinear auto-regressive moving average
    NARMAX       NARMA model with exogenous inputs
    NN           neural network
    NNs          neural networks
    N            integer numbers
    N+           non-negative integer numbers
    ω            angular frequency
    ∂/∂x         partial derivative with respect to x
    φ(.)         basis function
    r            reference input
    RBF          radial basis function
    R            field of real numbers (-∞, ∞)
    R+           field of non-negative real numbers [0, ∞)
    sign(.)      sign function
    SISO         single-input single-output
    SISS         single-input single-state
    sup{.}       supremum
    t            time
    u            system control input
    VPBF         Volterra polynomial basis function
    x            system state vector
    y            system output

  • CHAPTER 1

    NEURAL NETWORKS

    1.1 Introduction

    The field of neural networks has its roots in neurobiology. The structure and functionality of neural networks have been motivated by the architecture of the human brain. Following the complex neural architecture, a neural network consists of layers of simple processing units coupled by weighted interconnections. With the development of computer technology, significant progress in neural network research has been made. A number of neural networks have been proposed in recent years.

    The multilayer perceptron (MLP) (Rumelhart et al., 1986) is a network that is built upon McCulloch and Pitts' model of neurons (McCulloch and Pitts, 1943) and the perceptron (Rosenblatt, 1958). The perceptron maps the input, generally binary, onto a binary valued output. The MLP extends this mapping to real valued outputs for binary or real valued inputs. The decision regions that can be formed by this network extend beyond the linearly separable regions that are formed by the perceptron. The nonlinearity inherent in the network enables it to perform better than the traditional linear methods (Lapedes and Farber, 1987). It has been observed that this input-output network mapping can be viewed as a hypersurface constructed in the input space (Lapedes and Farber, 1988). A surface interpolation method, called the radial basis functions, has been cast into a network whose architecture is similar to that of the MLP (Broomhead and Lowe, 1988). Other surface interpolation methods, for example, the multivariate adaptive regression splines (Friedman, 1991) and B-splines (Lane et al., 1989), have also found their way into new forms of networks. Another view presented in Lippmann (1987), and Lapedes and Farber (1988), is that the network provides an approximation to an underlying function. This has resulted in applying polynomial approximation methods to neural networks, such as the Sigma-Pi units (Rumelhart et al., 1986), the Volterra polynomial network (Rayner and Lynch, 1989) and the orthogonal network (Qian et al., 1990). The application of wavelet transforms to neural networks (Pati and Krishnaprasad, 1990) has also derived its inspiration from function approximation.

    While these networks may have little relationship to biological neural networks, it has become common in the neural network area to refer to them as neural networks. These networks share one important characteristic: they are able to approximate any continuous mapping to a sufficient accuracy if they have the resources to do so (Friedman, 1991; Stinchcombe and White, 1989).

    As its name implies, a neural network is a network of simple processing elements called neurons connected to each other via links. The architecture of the network and the functionality of the neurons determine the response of the network to an input pattern. The network does no more than provide an input-output mapping. Thus, a simple mathematical model can represent these networks. This chapter will investigate the neural network architectures and their functional representation by considering the multilayer network, which laid the foundation for the development of many other classes of feedforward networks.

    1.2 Model of a Neuron

    A neuron is an information-processing unit that is fundamental to the operation of a neural network. The model of a neuron is illustrated in Figure 1.1 (Haykin, 1994). There are three basic elements in the neuron model: connecting links, an adder and an activation function.

    Fig. 1.1. Model of a neuron

    Each connecting link is characterised by a weight or strength of its own. Specifically, a signal $u_j$ at the j-th input connected to the k-th neuron is multiplied by the weight $w_{kj}$. For the subscripts of the weight $w_{kj}$, the first subscript refers to the neuron and the second subscript refers to the input to which the weight is attached. The reverse of this notation is also used in the literature.

    The adder sums the input signals weighted by the respective connecting link of the neuron. The operations described here constitute a linear combiner.

    The activation function limits the amplitude of the output of a neuron; it is also referred to in the literature as a squashing function in that it squashes the permissible amplitude range of the output signal to some finite value. Typically, the normalised amplitude range of the output of a neuron is written as the closed unit interval [0,1] or alternatively [-1,1].

    In mathematical terms, a neuron may be described by the following pair of equations:

    $$v_k = \sum_{j=1}^{n} w_{kj} u_j \qquad (1.1)$$

    $$y_k = \varphi(v_k) \qquad (1.2)$$

    where $u_j$ is the input signal, $w_{kj}$ the weight of the neuron, $v_k$ the output of the linear combiner, $\varphi(\cdot)$ the activation function and $y_k$ the output signal of the neuron.
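    To make the notation concrete, the short sketch below (an illustration in Python/NumPy, not code from the book; the function name and the example values are invented) evaluates equations (1.1) and (1.2) for a single neuron with a logistic activation.

```python
import numpy as np

def neuron_output(u, w, activation=lambda v: 1.0 / (1.0 + np.exp(-v))):
    """Evaluate one neuron: v_k = sum_j w_kj * u_j (1.1), then y_k = phi(v_k) (1.2)."""
    v = np.dot(w, u)          # linear combiner: weighted sum over the connecting links
    return activation(v)      # activation function limits the output amplitude

# Example with three inputs and one neuron (all numbers arbitrary)
u = np.array([0.5, -1.0, 2.0])   # input signals u_j
w = np.array([0.8, 0.2, -0.4])   # connecting-link weights w_kj
print(neuron_output(u, w))
```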

    The activation function defines the output of a neuron in terms of the activity level at its input. There are many types of activation functions. Here three basic types of activation functions are introduced: threshold function, piecewise-linear function and sigmoid function.

    When the threshold function is used as an activation function, it is described by

    $$\varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases} \qquad (1.3)$$

    A neuron employing such a threshold function is referred to in the literature as the McCulloch-Pitts model, in recognition of the pioneering work done by McCulloch and Pitts (1943). In this model, the output of a neuron takes the value of 1 if the total internal activity level of that neuron is nonnegative and 0 otherwise.

    The activation function using a piecewise-linear function is given by

    $$\varphi(v) = \begin{cases} 1 & \text{if } v \ge \tfrac{1}{2} \\ v & \text{if } -\tfrac{1}{2} < v < \tfrac{1}{2} \\ 0 & \text{if } v \le -\tfrac{1}{2} \end{cases} \qquad (1.4)$$

    where the amplification factor inside the linear region of operation is assumed to be unity. This activation function may be viewed as an approximation to a nonlinear amplifier. There are two special forms of the piecewise-linear function: (a) it is a linear combiner if the linear region of operation is maintained without running into saturation, and (b) it reduces to a threshold function if the amplification factor of the linear region is made infinitely large.

    The sigmoid function is a widely used form of activation function in neural networks. It is defined as a strictly increasing function that exhibits smoothness and asymptotic properties. An example of the sigmoid is the logistic function, described by

    $$\varphi(v) = \frac{1}{1 + e^{-av}} \qquad (1.5)$$

    where $a$ is the slope parameter of the sigmoid function. By varying the parameter $a$, sigmoid functions of different slopes can be obtained. In the limit, as the slope parameter approaches infinity, the sigmoid function becomes simply a threshold function. Note also that the sigmoid function is differentiable, whereas the threshold function is not.
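    As a small illustration (hypothetical code, not from the book), the three activation functions (1.3), (1.4) and (1.5) can be written out directly; the piecewise-linear form mirrors the case definitions given above and the logistic sigmoid exposes the slope parameter a.

```python
import numpy as np

def threshold(v):
    """Threshold (McCulloch-Pitts) activation, equation (1.3)."""
    return np.where(v >= 0.0, 1.0, 0.0)

def piecewise_linear(v):
    """Piecewise-linear activation, equation (1.4), unit gain in the linear region."""
    return np.where(v >= 0.5, 1.0, np.where(v <= -0.5, 0.0, v))

def logistic(v, a=1.0):
    """Logistic sigmoid, equation (1.5); a is the slope parameter."""
    return 1.0 / (1.0 + np.exp(-a * v))

v = np.array([-1.0, -0.25, 0.0, 0.25, 1.0])
print(threshold(v), piecewise_linear(v), logistic(v, a=2.0))
```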

    1.3 Architectures of Neural Networks

    In recent years a number of neural network architectures have been proposed. Here, four different classes of network architectures (or structures) are introduced: single layer networks, multilayer networks, recurrent networks and lattice networks.

    1.3.1 Single Layer Networks

    A network of neurons organised in the form of layers is viewed as a layered neural network. The simplest form of a layered network is one that has an input layer of source nodes that projects onto an output layer of neurons (computation nodes), but not vice versa. In other words, this network is strictly of a feedforward type. It is illustrated in Figure 1.2 for the case of five nodes in the input layer and four nodes in the output layer. Such a network is called a single-layer network, with the designation "single layer" referring to the output layer of computation nodes (neurons) but not to the input layer of source nodes because no computation is performed there.


    Fig. 1.2. Architecture of a single layer network

    1.3.2 Multilayer Networks

    The multilayer network has an input layer, one or several hidden layers and an output layer. Each layer consists of neurons, with each neuron in a layer connected to neurons in the layer below. This network has a feedforward architecture, which is shown in Figure 1.3. The number of input neurons defines the dimensionality of the input space being mapped by the network and the number of output neurons the dimensionality of the output space into which the input is mapped.

    In a feedforward neural network, the overall mapping is achieved via intermediate mappings from one layer to another. These intermediate mappings depend on two factors. The first is the connection mapping that transforms the output of the lower-layer neurons into an input to the neuron of interest; the second is the activation function of the neuron itself.


    Fig. 1.3. Architecture of a multilayer network
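    A minimal sketch of this layer-to-layer mapping (illustrative only; the layer sizes and random weights are invented and there is no bias term, matching the neuron model (1.1)-(1.2)):

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def feedforward(u, layer_weights):
    """Propagate an input through successive layers: each layer applies its
    connection mapping (a weight matrix) followed by the activation function."""
    x = np.asarray(u, dtype=float)
    for W in layer_weights:
        x = logistic(W @ x)
    return x

# Example: 3 inputs -> 4 hidden neurons -> 2 output neurons (weights are arbitrary)
rng = np.random.default_rng(0)
layer_weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
print(feedforward([0.2, -0.7, 1.0], layer_weights))
```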

    1.3.3 Recurrent Networks

    A recurrent neural network has at least one feedback loop, which distinguishes it from a feedforward neural network. The recurrent network may consist of a single layer or multiple layers of neurons, and each neuron may feed its output signal back to the inputs of all the other neurons. A class of recurrent networks with hidden neurons is illustrated in the architectural graph of Figure 1.4. In this structure, the feedback connections originate from the hidden neurons as well as the output neurons. The presence of feedback loops in recurrent networks has a profound impact on the learning capability of the network, and on its performance. Moreover, the feedback loops use particular branches composed of unit-delay elements, which result in a nonlinear dynamical behaviour by virtue of the nonlinear nature of the neurons.



    Fig. 1.4. Architecture of a recurrent network

    1.3.4 Lattice Networks

    A lattice network may consist of a one-dimensional, two-dimensional, or higher-dimensional array of neurons. The dimension of the lattice refers to the number of dimensions of the space in which the graph lies. A set of source nodes in this network supplies the input signals to the array. The architectural graph of Figure 1.5 depicts a two-dimensional lattice of two-by-two neurons fed from a layer of three source nodes. Note that in this case each source node is connected to every neuron in the lattice. A lattice network is really a feedforward network with the output neurons arranged in rows and columns.


    Fig. 1.5. Architecture of a lattice network


    1.4 Various Neural Networks

    Many different types of neural networks have been developed in recent years. This section introduces several main neural networks that are widely used in control systems.

    1.4.1 Radial Basis Function Networks

    Radial basis functions (RBF) have been introduced as a technique for multivariable interpolation (Powell, 1987). Broomhead and Lowe demonstrated that these functions can be cast into an architecture similar to that of the multilayer network, hence named the RBF network (Broomhead and Lowe, 1988).

    In the RBF network, which is a single hidden layer network, the input-to-hidden-layer connection transforms the input into a distance from a point in the input space, unlike in the MLP, where it is transformed into a distance from a hyperplane in the input space. However, it has been seen from multilayer networks that the hidden neurons can be viewed as constructing basis functions which are then combined to form the overall mapping. For the RBF network, the basis function constructed at the k-th hidden neuron is given by

    $$\varphi_k(u) = g\!\left(\|u - d_k\|_2\right) \qquad (1.6)$$

    where $\|\cdot\|_2$ is a distance measure, $u$ the input vector, $d_k$ the unit centre in the input space and $g(\cdot)$ a nonlinear function. The basis functions are radially symmetric about the centre $d_k$ in the input space, hence they are named radial basis functions. Some examples of nonlinear functions used as a radial basis function $g(\cdot)$ are the following:

    (a) the local RBFs

    $$g(r) = \exp\!\left(-\frac{r^2}{\sigma^2}\right) \quad \text{(Gaussian)} \qquad (1.7)$$

    $$g(r) = (r^2 + \sigma^2)^{-\frac{1}{2}} \quad \text{(inverse multiquadric)} \qquad (1.8)$$

    (b) the global RBFs

    $$g(r) = r \quad \text{(linear)} \qquad (1.9)$$

    $$g(r) = r^3 \quad \text{(cubic)} \qquad (1.10)$$

    $$g(r) = \sqrt{r^2 + c^2} \quad \text{(multiquadric)} \qquad (1.11)$$

    $$g(r) = r^2 \ln(r) \quad \text{(thin plate splines)} \qquad (1.12)$$

    $$g(r) = \ln(r^2 + \sigma^2) \quad \text{(shifted logarithm)} \qquad (1.13)$$

    $$g(r) = \left(1 - \exp\!\left(-\frac{r^2}{\sigma^2}\right)\right)\ln(r) \quad \text{(pseudo potential functions)} \qquad (1.14)$$

    where $r = \|u - d_k\|_2$ and $\sigma$ is a real number commonly called the receptive width, or simply the width, of the locally-tuned function, which describes the sharpness of the hyperbolic cone used in the radial basis function.
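    A few of the radial basis functions (1.7)-(1.14) are easy to evaluate once $r = \|u - d_k\|_2$ has been formed; the following sketch (illustrative Python, with the width and centre values chosen arbitrarily) shows this for four of them.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / sigma**2)            # (1.7), local

def inverse_multiquadric(r, sigma=1.0):
    return 1.0 / np.sqrt(r**2 + sigma**2)      # (1.8), local

def multiquadric(r, c=1.0):
    return np.sqrt(r**2 + c**2)                # (1.11), global

def thin_plate_spline(r):
    return r**2 * np.log(r)                    # (1.12), global, defined for r > 0

u = np.array([0.3, -0.2])
d_k = np.array([0.0, 0.5])
r = np.linalg.norm(u - d_k)                    # distance to the unit centre d_k
print(gaussian(r), inverse_multiquadric(r), multiquadric(r), thin_plate_spline(r))
```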


    As observed earlier, any functional description that is a linear combination of a set of basis functions can be cast into a feedforward architecture. The traditional methods used in surface interpolation and function approximation all have a functional form similar to that of the RBF network.

    1.4.2 Gaussian RBF Networks

    The radial basis function network with Gaussian hidden neurons is named the Gaussian radial basis function (GRBF) network, also referred to as a network of localised receptive fields by Moody and Darken, who were inspired by the biological neurons in the visual cortex (Moody and Darken, 1989). The GRBF network is related to a variety of different methods (Niranjan and Fallside, 1990), particularly, Parzen window density estimation which is the same as kernel density estimation with a Gaussian kernel, potential functions method for pattern classification, and maximum likelihood Gaussian classifiers, which all can be described by a GRBF network formalism.

    Following (1.6) and (1.7), the GRBF network can be described in a more general form. Instead of using the simple Euclidean distance between an input and a unit centre as in the usual formalism, a weighted distance scheme is used as follows:

    $$\|u - d_k\|^2_{C_k} = (u - d_k)^T C_k^{-1} (u - d_k) \qquad (1.15)$$

    where $C_k$ is a weighting matrix of the k-th basis function whose centre is $d_k$. The effect of the weighting matrix is to transform the equidistant lines from being hyperspherical to hyperellipsoidal. Thus, a Gaussian RBF is given by

    $$\varphi_k(u; d, C) = \exp\!\left(-(u - d_k)^T C_k^{-1} (u - d_k)\right) \qquad (1.16)$$

    where $d$ and $C$ represent the centres and the weighting matrices. Using the same $C_k$ for all the basis functions is equivalent to linearly transforming the input by the matrix $C_k^{-1/2}$ and then using the Euclidean distance $(u - d_k)^T (u - d_k)$. In general, a different $C_k$ is used.

    The Gaussian RBF network mapping is given by

    $$f(u; p) = \sum_{k=1}^{n} w_k \varphi_k(u; d, C) \qquad (1.17)$$

    where $p = \{w, d, C\}$. Clearly, the Gaussian RBF network is determined by the set of parameters $\{w_k, d_k, C_k\}$. To learn a mapping using this network, one can estimate all of these parameters or, alternatively, provide a scheme to choose the widths $C_k$ and the centres $d_k$ of the Gaussians and adapt only the weights $w_k$. Adapting only the weights is much easier and more popular, since the estimation problem is then linear.
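    The sketch below (illustrative only; the centres, weighting matrices and output weights are invented numbers) evaluates the Gaussian RBF mapping (1.15)-(1.17) with the basis-function parameters held fixed, which corresponds to the linear, weights-only learning case mentioned above.

```python
import numpy as np

def grbf_basis(u, d, C):
    """Gaussian RBF with weighted distance (u-d)^T C^{-1} (u-d), cf. (1.15)-(1.16)."""
    diff = u - d
    return np.exp(-diff @ np.linalg.solve(C, diff))

def grbf_network(u, weights, centres, C_list):
    """Network output f(u; p) = sum_k w_k * phi_k(u; d, C), equation (1.17)."""
    return sum(w * grbf_basis(u, d, C) for w, d, C in zip(weights, centres, C_list))

# Two basis functions in a two-dimensional input space (all numbers illustrative)
centres = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
C_list = [0.5 * np.eye(2), np.diag([0.3, 0.8])]   # hyperellipsoidal receptive fields
weights = [1.2, -0.7]
print(grbf_network(np.array([0.4, 0.2]), weights, centres, C_list))
```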


    1.4.3 Polynomial Basis Function Networks

    Multivariate polynomial expansions have been suggested as a candidate for discriminant functions in pattern classification (Duda and Hart, 1973; Kohonen, 1984) and are widely used in function approximation, particularly when the input is one dimensional (Powell, 1981). Recently, the polynomial expansion of a function with multiple variables has been cast into the framework of neural networks. Its functional representation is described by

    $$f(u) = w_0 + \sum_{i=1}^{n} w_i u_i + \sum_{i_1=1}^{n}\sum_{i_2=i_1}^{n} w_{i_1 i_2} u_{i_1} u_{i_2} + \cdots + \sum_{i_1=1}^{n}\sum_{i_2=i_1}^{n}\cdots\sum_{i_k=i_{k-1}}^{n} w_{i_1 i_2 \cdots i_k} u_{i_1} u_{i_2} \cdots u_{i_k} + O(u^{k+1}) \qquad (1.18)$$

    $$\hat{f}(u; p) = \sum_{j=1}^{N} w_j \varphi_j(u) \qquad (1.19)$$

    where $p = \{w_j\}$ is the set of the concatenated weights and $\{\varphi_j\}$ the set of basis functions formed from the polynomial input terms, $N$ is the number of the polynomial basis functions, $k$ is the order of the polynomial expansion, and $O(u^{k+1})$ denotes the approximation error caused by the high-order ($\ge k+1$) terms of the input vector. The basis functions are essentially polynomials of zero, first and higher orders of the input vector $u \in R^n$. This method can be considered as expanding the input to a higher dimensional space. An important difference between polynomial networks and other networks such as the RBF network is that the polynomial basis functions themselves are not parameterised and hence adaptation of the basis functions during learning is not needed.
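    A small illustration (hypothetical code) of the expansion (1.18)-(1.19) for a two-dimensional input up to second order; the basis functions are the fixed monomials, so only the weight vector is adjustable, and the weight values here are invented.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_basis(u, order):
    """Monomial basis functions phi_j(u) of degree 0..order, as in (1.18)-(1.19)."""
    phi = [1.0]                                        # zero-order term
    for k in range(1, order + 1):
        for idx in combinations_with_replacement(range(len(u)), k):
            phi.append(np.prod([u[i] for i in idx]))   # u_{i1} * u_{i2} * ... * u_{ik}
    return np.array(phi)

u = np.array([0.5, -2.0])
phi = polynomial_basis(u, order=2)      # [1, u1, u2, u1^2, u1*u2, u2^2]
w = np.array([0.1, 1.0, -0.5, 0.2, 0.3, -0.1])
print(phi, float(w @ phi))              # network output: weighted sum of the monomials
```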

    1.4.4 Fuzzy Neural Networks

    Fuzzy neural networks have their origins in fuzzy sets and fuzzy inference systems, which were developed by Zadeh (1973). A survey of fuzzy sets in approximate reasoning is given in Dubois and Prade (1991). The fuzzy reasoning is usually an "if-then" rule (or fuzzy conditional statement), for example,

    If pressure is HIGH, then volume is SMALL

    where pressure and volume are linguistic variables, and HIGH and SMALL linguistic values. The linguistic values are characterised by appropriate membership functions. The "if" part of the rule is referred to as the antecedent and the "then" part is known as the consequent.

    Another type of fuzzy if-then rule has fuzzy sets involved only in the antecedent part. For example, the dependency of the air resistance (force) on the speed of a moving object may be described as


    If velocity is HIGH, then force = k * (velocity)², where HIGH is the only linguistic value here, and the consequent part is given by a non-fuzzy equation of the input variable, velocity.

    Suppose there is a rule base that consists of two fuzzy if-then rules, which are

    Rule 1: If $u_1$ is $A_1$ and $u_2$ is $B_1$, then $e_1(u) = a_1 u_1 + b_1 u_2 + c_1$
    Rule 2: If $u_1$ is $A_2$ and $u_2$ is $B_2$, then $e_2(u) = a_2 u_1 + b_2 u_2 + c_2$

    To construct a fuzzy reasoning mechanism, the firing strength of the i-th rule may be defined as the T-norm (usually multiplication or minimum operator) of the membership values on the antecedent part

    $$\varphi_i(u) = \mu_{A_i}(u_1)\,\mu_{B_i}(u_2) \qquad (1.20)$$

    or

    $$\varphi_i(u) = \min\{\mu_{A_i}(u_1),\, \mu_{B_i}(u_2)\} \qquad (1.21)$$

    where $\mu_{A_i}(\cdot)$ and $\mu_{B_i}(\cdot)$ are usually chosen to be bell-shaped functions with maximum equal to 1 (Jang and Sun, 1993) and minimum equal to 0, such as

    $$\mu_{A_i}(u) = \frac{1}{1 + \left(\dfrac{u - c_{A_i}}{\sigma_i}\right)^{2 b_{A_i}}} \qquad (1.22)$$

    where $\{c_{A_i}\}$, $\{b_{A_i}\}$ and $\{\sigma_i\}$ are the parameter sets.

    A fuzzy reasoning mechanism may be stated thus: the overall output is chosen to be a weighted sum of each rule's output (Takagi and Hayashi, 1991). Thus, a fuzzy neural network can be given by

    $$f(u) = \sum_{i=1}^{m} \frac{\varphi_i(u)}{\sum_{j=1}^{m} \varphi_j(u)}\, e_i(u) \qquad (1.23)$$

    where $m$ is the number of fuzzy if-then rules. The approximation capability of fuzzy neural networks or fuzzy inference systems has been established by numerous researchers (see, for example, Wang, 1993; Brown and Harris, 1994). The functional equivalence of fuzzy neural networks to RBF networks has also been studied (Jang and Sun, 1993). Both fuzzy and RBF neural networks transform an input space into an output space by clustering the input space, applying gains to each cluster, and interpolating the regions between the clusters.
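    The two-rule example above can be written out directly; the sketch below (illustrative code, not from the book, with invented membership and consequent parameters and a bell-shaped membership of the form in (1.22)) computes the product firing strengths (1.20) and the normalised weighted sum (1.23).

```python
import numpy as np

def bell(u, c, sigma, b):
    """Bell-shaped membership with centre c, width sigma and shape b, cf. (1.22)."""
    return 1.0 / (1.0 + ((u - c) / sigma) ** (2 * b))

def fuzzy_network(u1, u2, rules):
    """Weighted sum of rule consequents, equations (1.20) and (1.23)."""
    strengths, outputs = [], []
    for (cA, sA, bA), (cB, sB, bB), (a, b_coef, c) in rules:
        phi = bell(u1, cA, sA, bA) * bell(u2, cB, sB, bB)   # product T-norm firing strength
        strengths.append(phi)
        outputs.append(a * u1 + b_coef * u2 + c)            # rule consequent e_i(u)
    strengths = np.array(strengths)
    return float(np.dot(strengths, outputs) / strengths.sum())

# Two rules with invented antecedent and consequent parameters
rules = [((0.0, 1.0, 2), (0.0, 1.0, 2), (1.0, 0.5, 0.0)),
         ((2.0, 1.0, 2), (2.0, 1.0, 2), (-0.5, 1.0, 1.0))]
print(fuzzy_network(0.5, 1.5, rules))
```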

    1.4.5 Wavelet Neural Networks

    Wavelet neural networks were introduced in the 1990s (Zhang and Benveniste, 1992; Liu et al., 1998), based on wavelet transform theory initiated by Morlet et al. (1982), though the theory goes as far back as 1952 (Calderon and Zygmund, 1952). Wavelet transform theory was developed to analyse signals with varied frequency resolutions as a unifying idea of looking at nonstationary signals at various time locations. For reviews and tutorials on wavelets, see, for example, Rioul and Vetterli (1991), Strang (1989), Strichartz (1993) and numerous complementary texts such as Chui (1992), Ruskai (1991) and Newland (1993).

    The wavelet transform provides a better alternative to the classical short-time Fourier or Gabor transform (Gabor, 1946) and the windowed Fourier transform (Daubechies, 1990) for time-frequency analysis. For a continuous input signal, the time and scale parameters of the wavelet transform can be continuous, which leads to a continuous wavelet transform, or discrete, which results in a wavelet series expansion. This is analogous to the classical continuous Fourier transform and discrete Fourier transform (Daubechies, 1990). The terms wavelet transform and wavelet series will be used interchangeably, though strictly, the wavelet transform relates to continuous signals while the wavelet series handles discrete transforms.

    There exist some significant differences between wavelet series expansions and classical Fourier series, which are:

    (a) Wavelets are local in both the frequency domain (via dilations) and in the time domain (via translations). On the other hand, Fourier basis functions are localised only in the frequency domain but not in the time domain. Small frequency changes in the Fourier transform will cause changes everywhere in the time domain.

    (b) Many classes of functions can be described in a more compact way by wavelets than by the Fourier series. Also, the wavelet basis functions are more effective than classical Fourier basis functions in achieving a comparable function approximation. For example, a discontinuity within a function could be represented efficiently by a few wavelets, whereas it may require many more basis functions from the Fourier expansion.

    The wavelets (Daubechies, 1988) refer to a family of functions that take the following form in the continuous case:

    $$\psi_{s,t}(u) = |s|^{-1/2}\, \psi\!\left(\frac{u - t}{s}\right) \qquad (1.24)$$

    where $s$ is a scaling or dilation factor and $t$ a translation factor of the original function $\psi(u)$.

    The continuous wavelet transform of a function $g(u) \in L^2(R)$ (square integrable space) is defined by

    $$[Wg(u)](s, t) = |s|^{-1/2} \int_{-\infty}^{\infty} g(u)\, \psi\!\left(\frac{u - t}{s}\right) du \qquad (1.25)$$

    This transform can decompose $g(u)$ into its components at different scales in frequency and space (location) by varying the scaling/dilation factor $s$ and the translation factor $t$, respectively.

    The function $g(u)$ can be reconstructed by performing the inverse operation, that is

    $$g(u) = \frac{1}{C_\psi} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} [Wg(u)](s, t)\, \psi_{s,t}(u)\, \frac{ds\, dt}{s^2} \qquad (1.26)$$

    if the wavelet $\psi(u)$ satisfies the admissibility condition (Daubechies, 1988) given by

    $$C_\psi = \int_{-\infty}^{\infty} \frac{|\Psi(\omega)|^2}{|\omega|}\, d\omega < \infty \qquad (1.27)$$

    where $\Psi(\omega)$ is the Fourier transform of $\psi(u)$.

    Similar to the discrete Fourier transform (a discrete version of the continuous Fourier transform), there also exists a discrete wavelet transform to calculate the wavelet transform for discrete signals. For this case, the basic wavelet function given in (1.24) needs to be discretised at various $s$ and $t$ values. For example, a typical scaling and translation basis would be

    $$s_j = 2^{-j} \qquad (1.28)$$

    $$t_{j,k} = k\, 2^{-j} \qquad (1.29)$$

    where $j \in N^+$ and $k \in N^+$. The discrete basic wavelet function is given by

    $$\psi_{j,k}(u) = 2^{j/2}\, \psi(2^{j} u - k) \qquad (1.30)$$

    In practice, orthonormal wavelet functions are widely used. For example, the following Haar wavelet is one such wavelet:

    $$\psi(u) = \begin{cases} 1 & \text{if } 0 \le u < \tfrac{1}{2} \\ -1 & \text{if } \tfrac{1}{2} \le u < 1 \\ 0 & \text{otherwise} \end{cases}$$


    A wavelet network approximates a function $g(u)$, $u \in R^d$, by a weighted sum of dilated and translated wavelets:

    $$g(u) = g_0 + \sum_{i=1}^{m} w_i\, \psi\!\left(S_i (u - t_i)\right) \qquad (1.34)$$

    where $S_i = \mathrm{diag}(s_{i1}, \ldots, s_{id})$, $d$ is the dimension of the input, and $g_0$ is introduced to deal with nonzero mean functions on finite domains. The original formulation of the wavelet network was based on the tensor product of one-dimensional wavelets, but recently the radial wavelet function has been applied.

    To obtain the orientation selective nature of dilations and to improve flexibility, a rotation transform can be incorporated by

    $$g(u) = g_0 + \sum_{i=1}^{m} w_i\, \psi\!\left(R_i S_i (u - t_i)\right) \qquad (1.35)$$

    where $R_i$ is a rotation matrix.

    Wavelet theory and networks have been widely employed in applications in diverse areas, such as geophysics (Kumar and Foufoula-Georgiou, 1993) and system identification (Sjoberg et al., 1995; Liu et al., 1999, 2000).
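    As an illustration of the wavelet network form (1.34) for a scalar input (not code from the book; the Mexican-hat mother wavelet and all parameter values are arbitrary choices made for the example):

```python
import numpy as np

def mexican_hat(x):
    """A common choice of mother wavelet psi (second derivative of a Gaussian)."""
    return (1.0 - x**2) * np.exp(-x**2 / 2.0)

def wavelet_network(u, g0, weights, scales, translations):
    """Scalar wavelet network g(u) = g0 + sum_i w_i * psi(s_i * (u - t_i)), cf. (1.34)."""
    u = np.asarray(u, dtype=float)
    out = g0 * np.ones_like(u)
    for w, s, t in zip(weights, scales, translations):
        out += w * mexican_hat(s * (u - t))
    return out

# Three wavelet units with invented parameters
u = np.linspace(-2.0, 2.0, 5)
print(wavelet_network(u, g0=0.1, weights=[1.0, -0.5, 0.3],
                      scales=[1.0, 2.0, 0.5], translations=[-1.0, 0.0, 1.0]))
```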

    1.4.6 General Form of Neural Networks

    There are many other types of neural networks. Forms of neural networks based on orthogonal polynomial expansions can be used, such as Hermite polynomials, Legendre polynomials and Bernstein polynomials. Apart from the polynomial expansion, orthogonal basis functions such as the Fourier series may also be employed. The surface interpolation method of splines has been adopted in the development of spline networks (Friedman, 1991). Kernel functions, which are commonly used in kernel density estimation procedures, may also be introduced as forms of neural networks.

    The mathematical formalism of the networks allows recent developments in neural networks to deviate from the biological plausibility that served as an impetus in the first place. This is not a cause for concern because the ultimate aim of such developments is to build machines rather than to understand and model biologically intelligent systems. What should be avoided is to refer to them simply as neural networks. However, to avoid confusion in the terminology we will continue to refer to these as neural networks, with the emphasis placed on the fact that they are no more than a special class of nonlinear model.

    The functional description of neural networks has a common form of expression. Essentially, neural networks are parametric and can be described as a linear combination of basis functions. So, the neural network is generally denoted by

    $$f(u; w) = \sum_{k=1}^{m} w_k \varphi_k(u) \qquad (1.36)$$

    where $w$ is the parameter vector containing the coefficients $w_k$ and the set of parameters that define the basis functions $\varphi_k(u)$, and $m$ is the number of basis functions used in the overall mapping of the network. For each parameter vector $w \in P$, the network mapping $f \in F_w$, where $P$ is the parameter set and $F_w$ the set of functions that can be described by the chosen neural network.
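    All of the networks above share the generic form (1.36); the brief sketch below (illustrative code, with invented centres and weights) shows how the same linear-combination wrapper can be applied to any chosen family of basis functions, here fixed Gaussian RBFs.

```python
import numpy as np

def network_output(u, weights, basis_functions):
    """Generic network f(u; w) = sum_k w_k * phi_k(u), equation (1.36)."""
    return sum(w * phi(u) for w, phi in zip(weights, basis_functions))

# Example: three Gaussian radial basis functions with fixed centres (values illustrative)
centres = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
basis_functions = [(lambda u, d=d: np.exp(-np.sum((u - d)**2))) for d in centres]
weights = [0.5, -1.0, 0.8]
print(network_output(np.array([0.3, 0.4]), weights, basis_functions))
```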

    1.5 Learning and Approximation

    Neural networks learn from the examples presented to them, which are in the form of input-output pairs. To simplify the presentation, a single variable function is taken into account. Let the input to the network be denoted by $u$ and the output by $y$. The neural network maps an input pattern to an output pattern, described by

    $$f: u \rightarrow y \qquad (1.37)$$

    An assumption made about these examples is that they are consistent with an underlying mapping, say $f^*$. Then the relationship between the input and the output can be stated as

    $$y = f^*(u) + v \qquad (1.38)$$

    where $v$ is the measurement noise, which is an unknown random signal. Here, let us assume that the measurements are noise free. Then, the data set of $N$ examples is described as

    $$D = \{(u_k, y_k) : k = 1, 2, \ldots, N\} \qquad (1.39)$$

    which contains the information that is available about the unknown mapping $f^*$.

    Let the set $F_w = \{f(u; w) : w \in P\}$ describe all functions that can be mapped by the neural network. The task of learning is to approximate $f^*(u)$ by choosing a suitable $f(u; w)$. This requires a measure of approximation accuracy to be defined, a simple example of which is the approximation error.

    1.5.1 Background to Function Approximation

    The basic approximation problem treated in this book can be stated as follows: for a given $f(u)$, find the function amongst the set $F_w = \{f(u; w) : w \in P\}$ that has the least distance to $f(u)$. This is equivalent to finding the $f(u; w)$ that has the least approximation error, i.e.,

    $$\min_{w} \| f(u) - f(u; w) \|_2 \qquad (1.40)$$

    It is not sufficient that the function $f(u; w)$ to be found most closely approximates $f(u)$ alone. To guarantee the approximation to be sufficiently good, the least approximation error must be below a threshold. If the set $F_w$, which contains all the functions that can be mapped by the network, is sufficiently large, then there is a reasonable chance of satisfying the above requirement.


In practice, the underlying function f(u) to be approximated is unknown. The information about the function f(u) is contained in the discrete data set D (see equation (1.39)). Then, the measure of approximation accuracy must be based on this discrete set and is given by

$$ e(D; f_w) = \sum_{k=1}^{n} |y_k - f(u_k; w)|^2 \qquad (1.41) $$

which is known as the squared error measure. The approximation problem is to find f(u; w), with shorthand f_w, that has the least e(D; f_w), which is referred to as the least squares approximation.
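To make the squared error measure concrete, the sketch below (an illustrative assumption, not taken from the book) evaluates e(D; f_w) of equation (1.41) for a linear-in-parameters Gaussian RBF network and finds the least squares weights in closed form, which is possible precisely because the parameters enter linearly. The target function, basis centres and width are arbitrary choices for the example.

```python
import numpy as np

def design_matrix(u, centres, width):
    """Rows hold the basis function outputs phi_k(u_i) for each example u_i."""
    return np.exp(-((u[:, None] - centres[None, :]) ** 2) / (2.0 * width ** 2))

def squared_error(u, y, w, centres, width):
    """Squared error measure (1.41): e(D; f_w) = sum_k |y_k - f(u_k; w)|^2."""
    predictions = design_matrix(u, centres, width) @ w
    return np.sum((y - predictions) ** 2)

# Hypothetical data set D = {(u_k, y_k)} drawn from an assumed underlying mapping
u = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * u)

centres = np.linspace(0.0, 1.0, 8)
Phi = design_matrix(u, centres, width=0.2)

# Least squares approximation: the weights minimising e(D; f_w)
w_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(squared_error(u, y, w_star, centres, width=0.2))
```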

The network mapping f(u; w) is defined by a specific set of parameters w ∈ P. Finding the closest function f(u; w) to f(u) is equivalent to finding an optimal parameter vector, denoted by w*, the corresponding map being f(u; w*) with shorthand description f_{w*}. Thus, the learning problem of seeking the best approximation to the underlying function becomes that of estimating the optimal set of parameter values. From the distance measure given in (1.41), which is a function of w, the estimation problem can be stated as

$$ w^* = \arg\min_{w} e(D; f_w) \qquad (1.42) $$

Generally, w appears nonlinearly in f(u; w). It is clear that the above problem is a nonlinear optimisation problem, which can be solved by any of the standard procedures or algorithms such as those in Luenberger (1984).

The function e(D; f_w) can be viewed as an error surface defined over the space of the parameter w, called the parameter space. This surface will have either one or several minima, depending on how the parameter w enters f(u; w). If f(u; w) is linear in w, e(D; f_w) is convex and has only one minimum, which is the global minimum of the error surface. On the other hand, if f(u; w) is nonlinear in w, the error surface may have several local minima due to the non-convexity of e(D; f_w). One must bear in mind the effects caused by the presence of local minima when choosing an optimisation procedure or algorithm.

    1.5.2 Universal Approximation

For function approximation, it is assumed that a choice of f(u; w) and F_w has been made. Now, let us see why neural networks have been a popular choice for representing f(u; w). The selection of F_w determines the goodness of function approximation that can be achieved. For example, if F_w contains a single member which is a constant, then the best approximation in the Hilbert space H is the mean of the output values. It is clear that this is bound to be a bad function approximation if the range of f(u) is large. However, if F_w spans the entire space H, then f(u) can be exactly represented by some f(u; w). The approximation ability of the representation f(u; w) is therefore crucial to the goodness of function approximation.


Neural networks with at least a single hidden layer have been shown to have the capacity to approximate any arbitrary function in C(R^m) (the space of continuous functions) if there is a sufficiently large number of basis functions (or hidden nodes) (Cybenko, 1989). This property of neural networks is referred to as the universal approximation property.

This approximation ability of neural networks can also be understood from a geometric view in the function space. If the neural network consists of N hidden neurons, then the function to be mapped is represented by a linear combination of the N basis functions φ_k(u). For the case where these N basis functions are linearly independent, the set of functions the network can map spans an N-dimensional subspace of the infinite-dimensional Hilbert space H. By increasing the number of linearly independent basis functions to infinity, the subspace spanned by the neural network mapping is extended to the entire Hilbert space H. For the Gaussian RBF network, the linear independence of basis functions with different centres holds (Poggio and Girosi, 1990a,b), and similar arguments can be extended to other types of neural networks to show that their basis functions are linearly independent.
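This geometric argument can be illustrated numerically. In the assumed Gaussian RBF setting used in the earlier sketches, increasing the number of basis functions enlarges the subspace spanned by the network, and the least squares approximation error of a fixed continuous function typically decreases; the target function, basis width and network sizes below are illustrative choices only.

```python
import numpy as np

u = np.linspace(-1.0, 1.0, 200)
target = np.exp(-u) * np.cos(3.0 * u)        # an arbitrary continuous target function

def rbf_design(u, m, width=0.3):
    """Design matrix of m Gaussian basis functions with evenly spaced centres."""
    centres = np.linspace(-1.0, 1.0, m)
    return np.exp(-((u[:, None] - centres[None, :]) ** 2) / (2.0 * width ** 2))

for m in (2, 4, 8, 16, 32):
    Phi = rbf_design(u, m)
    w, *_ = np.linalg.lstsq(Phi, target, rcond=None)   # least squares weights
    error = np.linalg.norm(target - Phi @ w)            # residual approximation error
    print(f"m = {m:2d} basis functions, approximation error = {error:.4f}")
```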

    1.5.3 Capacity of Neural Networks

The universal approximation property of neural networks does not provide any information about the capacity of a network with a finite number of basis functions (or hidden units). It does, however, indicate that the capacity of the network depends on the number of basis functions. This is also evident from the fact that a larger set F_w can give a good approximation to a wider class of functions, because increasing the number of hidden units increases the size of this set.

The notion of capacity was introduced for pattern classifiers by Cover (1965). The classifier typically provides an output value of either 0 or 1 and thus constructs only a class of piecewise constant functions. The notion was subsequently developed by introducing the concept of the Vapnik-Chervonenkis dimension, or VC dimension, as a measure of the capacity of a classifier network (Vapnik and Chervonenkis, 1971). The VC dimension of a classifier network is defined as the maximum number of dichotomies that the network can induce on the input space, where a dichotomy is a partition of the input space into two sub-regions. This is closely related to the number of hidden units in the network, in analogy with the number of coefficients or degrees of freedom (Baum and Haussler, 1989).

The notion of the VC dimension has been extended to networks that map arbitrary real-valued functions, giving a capacity measure for such networks. This capacity is also found to be directly related to the number of hidden units in the network. Generally, for commonly used neural networks the number of parameters provides a measure of their capacity.


    1.5.4 Generalisation of Neural Networks

If, having learned to map the examples in the data set D, a neural network correctly predicts input-output observations that are consistent with the underlying function f(u) but are not contained in D, then the neural network is said to generalise well. The generalisation ability of a network depends critically on its functional form f(u; w) and the data set D.

In order that a network has the capacity to generalise, its functional form f(u; w) must be able to provide a sufficiently good approximation to the unknown underlying function f(u). This implies that the capacity of the network, and hence the number of parameters, should be large. The universal approximation property of neural networks seems to suggest that the functional representation is not important as long as a sufficiently large network is chosen.

After the functional form is chosen, the network parameters must be estimated from the data set D. If the number of examples contained in this data set is less than the number of parameters, there exist infinitely many parameter solutions that will fit the data. In this case the learning algorithm cannot give consistent estimates and may find an estimate that is not necessarily closest to the unknown f(u), so the network will generalise poorly. The generalisation problem of neural networks can also be understood from a statistical point of view. If there are an infinite number of functions that can fit the data set D exactly, the probability that the estimate found will be close to f(u) will be very low. With an increasing number of examples this probability increases and in turn the generalisation of the network is improved. Thus, the network size that gives good generalisation depends on the number of examples that are used to estimate the parameters. It has been shown that an upper bound on the number of parameters of the network can be derived on the basis of the size of the data set D (Baum and Haussler, 1989).

It has been observed that choosing overly large networks is bound to result in poor generalisation (Chauvin, 1989), which is referred to as overfitting. Imposing smoothness constraints is a powerful way of reducing the dimensionality of the functional representation problem. Good generalisation can be achieved by choosing large networks with added penalty terms to provide smoother basis functions (Hinton, 1987; Hanson and Pratt, 1989).

    1.5.5 Error Back Propagation Algorithm

A learning algorithm for neural networks is often defined by the optimisation criterion and the optimisation procedure together. A widely used optimisation algorithm for neural network learning is based on the least squares error criterion. The least squares method gives an estimate that is the maximum likelihood estimate under the assumption that the observation noise statistics are Gaussian (White, 1989). Alternative criteria based on cross-entropy for classifiers have also been proposed (Solla et al., 1988). These criteria view the classifier as constructing a conditional probability, for which the cross-entropy distance measure is more suitable (Basseville, 1989). Since the interest here is in estimating functions rather than constructing probability estimates, the least squares criterion is more appropriate.

Neural network learning can be viewed as a block estimation problem where all the information or data are assumed to be available together. Here, a brief investigation of error back propagation is given, which is the most commonly used neural network learning algorithm.

    The error back propagation learning algorithm, which is devised for the MLP, was the first learning algorithm developed for multilayer feedforward networks. Essentially, it is a stochastic gradient descent procedure minimising the squared error criterion. Given the network mapping f(u;w) and the data set D, the squared error function is given by

$$ J_e(D; f_w) = \sum_{k=1}^{n} (y_k - f(u_k; w))^2 \qquad (1.43) $$

The error back propagation algorithm updates the parameters according to

$$ w^{(j)} = w^{(j-1)} - \alpha \nabla_w J_e(D; f_w) \qquad (1.44) $$

at the j-th iteration, where α is a constant and

$$ \nabla_w J_e(D; f_w) = \frac{\partial J_e(D; f_w)}{\partial w}\bigg|_{w = w^{(j-1)}} \qquad (1.45) $$

Substituting for J_e(D; f_w) and differentiating gives

$$ \nabla_w J_e(D; f_w) = -2 \sum_{k=1}^{n} e_k \nabla_w f(u_k; w^{(j-1)}) \qquad (1.46) $$

where

$$ e_k = y_k - f(u_k; w^{(j-1)}) \qquad (1.47) $$

$$ \nabla_w f(u_k; w^{(j-1)}) = \frac{\partial f(u_k; w)}{\partial w}\bigg|_{w = w^{(j-1)}} \qquad (1.48) $$

The parameter vector is adapted in the direction of decreasing J_e(D; f_w), where the descent direction is averaged over all the samples. The iteration is repeated until the squared error falls below a required threshold.
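A minimal sketch of the update (1.44)-(1.48) is given below for the linear-in-parameters RBF network used in the earlier examples; for a multilayer perceptron the gradient of f with respect to w would instead be obtained by back propagating the errors layer by layer, which is not shown here. The learning rate, stopping threshold, basis functions and data are illustrative assumptions.

```python
import numpy as np

def gradient_descent_fit(u, y, centres, width=0.2, alpha=0.002,
                         threshold=1e-3, max_iter=20000):
    """Batch gradient descent on J_e(D; f_w) = sum_k (y_k - f(u_k; w))^2."""
    Phi = np.exp(-((u[:, None] - centres[None, :]) ** 2) / (2.0 * width ** 2))
    w = np.zeros(centres.size)
    for _ in range(max_iter):
        e = y - Phi @ w                # errors e_k = y_k - f(u_k; w)
        grad = -2.0 * Phi.T @ e        # equation (1.46): -2 * sum_k e_k * grad_w f(u_k; w)
        w = w - alpha * grad           # equation (1.44): step against the gradient
        if np.sum(e ** 2) < threshold: # stop once the squared error is small enough
            break
    return w

u = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * u)
w = gradient_descent_fit(u, y, centres=np.linspace(0.0, 1.0, 8))
```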

    This algorithm could be efficiently implemented in feedforward networks by back propagating the errors (Chan and Fallside, 1987). Further, it could be implemented within the highly parallel architecture of neural networks.

The error back propagation learning algorithm has a characteristically slow rate of convergence. Such behaviour is caused by the shape of the error surface in the parameter space, in which sharp valleys and long plateaux exist. A scheme for adapting the step size, or learning rate, has been proposed, based on the angle between the previous gradient direction and the current gradient direction in the parameter space (Chan and Fallside, 1987).

When the learning problem is viewed as one of minimising a cost function, the slow rate of the gradient descent procedure in the error back propagation method becomes fairly obvious. The nonlinear optimisation procedure considers only the gradient of the current iteration. Methods that are faster but need more computation have been developed, for example, the method of line search along the gradient direction, the conjugate gradient descent method, which utilises information about previous descent directions, and the quasi-Newton descent direction method, which utilises the Hessian of the cost function along with the gradient (Luenberger, 1984). These methods have also been applied to neural network learning.
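As an illustration of using a quasi-Newton method in place of plain gradient descent, the sketch below minimises the same squared error cost with SciPy's BFGS routine; the cost, gradient and data follow the earlier illustrative RBF setting and are not taken from the original text.

```python
import numpy as np
from scipy.optimize import minimize

u = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * u)
centres, width = np.linspace(0.0, 1.0, 8), 0.2
Phi = np.exp(-((u[:, None] - centres[None, :]) ** 2) / (2.0 * width ** 2))

def cost(w):
    """Squared error criterion J_e(D; f_w)."""
    e = y - Phi @ w
    return np.sum(e ** 2)

def grad(w):
    """Analytical gradient: -2 * Phi^T (y - Phi w)."""
    return -2.0 * Phi.T @ (y - Phi @ w)

# Quasi-Newton (BFGS) minimisation of the cost with the supplied gradient
result = minimize(cost, np.zeros(centres.size), jac=grad, method="BFGS")
print(result.fun, result.nfev)
```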

    1.5.6 Recursive Learning Algorithms

The problem of recursive learning in neural networks can be viewed as a recursive parameter estimation problem, for which a variety of algorithms exist (Ljung and Soderstrom, 1983; Young, 1984). The general sequential learning algorithm operates as follows. Let the data set be defined as

$$ D = \{(u_n, y_n) : f^*(u_n) = y_n,\ n = 1, 2, \ldots\} \qquad (1.49) $$

which is received sequentially, so that at time n the observation v^(n) = {(u_n, y_n) : f*(u_n) = y_n} is received. The neural network or nonlinear model mapping is given by f(u; w). Let the set of parameter values be w^(n-1) before the n-th observation is received, which is known as the a priori estimate. On learning v^(n), let the parameter values be modified to w^(n), known as the a posteriori estimate. The operation of the recursive learning algorithm is to provide a functional relationship between the posterior estimate w^(n), the prior estimate w^(n-1) and the n-th observation. In general, it can be described mathematically by

$$ w^{(n)} = h(w^{(n-1)}, v^{(n)}) \qquad (1.50) $$

where h(·,·) is a nonlinear function. Recursive learning algorithms have an important advantage over block algorithms in that the model or network can continually improve its approximation as it learns. If the network mapping were exactly the same as the underlying model mapping f(u), then y_n and f(u_n; w^(n-1)) would be equal. A difference between the two is indicative of the approximation error between the network and the underlying model. This difference is also the prediction error, which is the error in the predicted value when compared with the actual value. The prediction error, denoted by e_n, is defined by

$$ e_n = y_n - f(u_n; w^{(n-1)}) \qquad (1.51) $$


In sequential learning, the prediction error can be calculated for each observation as it arrives and hence provides a dynamic performance index that can be used in evaluating different models and algorithms.
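To make the sequential scheme concrete, the following sketch (an illustration, not from the original text) runs a generic recursive learning loop: `predict` evaluates f(u_n; w^(n-1)), `update` plays the role of h(·,·) in (1.50), and the prediction errors e_n of (1.51) are collected as the dynamic performance index. Both `predict` and `update` are placeholders to be supplied by a particular algorithm, such as the LMS rule of the next subsection.

```python
import numpy as np

def sequential_learning(observations, w0, predict, update):
    """Generic recursive learning loop.

    For each observation v(n) = (u_n, y_n), the prediction error
    e_n = y_n - f(u_n; w(n-1)) is computed with the a priori estimate,
    and the a posteriori estimate w(n) = h(w(n-1), v(n)) is then formed.
    """
    w = np.array(w0, dtype=float)
    errors = []
    for u_n, y_n in observations:
        e_n = y_n - predict(u_n, w)     # prediction error, equation (1.51)
        w = update(w, u_n, y_n, e_n)    # one step of h(., .), equation (1.50)
        errors.append(e_n)
    return w, np.array(errors)
```

The returned sequence of prediction errors can then be inspected to compare different models or update rules.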

    1.5.7 Least Mean Square Algorithm

A commonly used algorithm for neural networks is the least mean square (LMS) algorithm (Widrow and Hoff, 1960). It is a special case of the stochastic approximation algorithm (Robbins and Monro, 1951). For the n-th observation v^(n), the parameter vector is adapted by

$$ w^{(n)} = w^{(n-1)} + \eta\, e_n \nabla_w f(u_n; w^{(n-1)}) \qquad (1.52) $$

where e_n is the prediction error and η the learning rate or adaptation step size. The above LMS learning is a recursive version of the stochastic gradient descent procedure, with the gradient being estimated on the basis of the current sample rather than the ensemble of examples as in the block estimation procedure. It can be shown that such a procedure minimises the least squares cost function defined in equation (1.43) (the block estimation cost function) and, further, that the LMS algorithm converges slowly to the underlying set of optimal parameters.
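Continuing the illustrative setting of the earlier sketches, the LMS update can be written out directly. For a linear-in-parameters network f(u; w) = Σ_k w_k φ_k(u), the gradient of f with respect to w is simply the vector of basis function outputs, so the update reduces to the classical Widrow-Hoff form; the Gaussian basis, learning rate, data stream and underlying mapping below are assumptions made for the example, and the loop could equally be driven by the `sequential_learning` routine sketched above.

```python
import numpy as np

centres, width, eta = np.linspace(0.0, 1.0, 8), 0.2, 0.1

def phi(u):
    """Basis function outputs; for a linear-in-parameters network these equal grad_w f(u; w)."""
    return np.exp(-((u - centres) ** 2) / (2.0 * width ** 2))

def predict(u, w):
    """Network output f(u; w) = sum_k w_k * phi_k(u)."""
    return np.dot(w, phi(u))

def lms_update(w, u_n, y_n, e_n):
    """LMS step: w(n) = w(n-1) + eta * e_n * grad_w f(u_n; w(n-1))."""
    return w + eta * e_n * phi(u_n)

# A stream of observations from an assumed underlying mapping (a sine wave here)
rng = np.random.default_rng(0)
stream = [(u, np.sin(2.0 * np.pi * u)) for u in rng.uniform(0.0, 1.0, 500)]

w, errors = np.zeros(centres.size), []
for u_n, y_n in stream:
    e_n = y_n - predict(u_n, w)          # prediction error e_n, equation (1.51)
    w = lms_update(w, u_n, y_n, e_n)     # LMS parameter update, equation (1.52)
    errors.append(e_n)

# Early versus late average error magnitude as a simple performance index
print(np.abs(np.array(errors[:50])).mean(), np.abs(np.array(errors[-50:])).mean())
```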

    1.6 Applications of Neural Networks

    Neural networks have been widely applied to many areas. Applications of neural networks for classification, filtering, modelling, prediction, control and hardware implementation are introduced here.

    1.6.1 Classification

With the growth of information technology and the availability of cheap computer systems, the rapid expansion of medical knowledge makes the development of Computer-Aided Diagnostic (CAD) systems increasingly attractive. Such systems assist clinicians in improving clinical decision-making, and neural networks have contributed to them. For example, RBF networks have been applied to classify various categories of low back disorders (Bounds et al., 1990), taking in many items of clinical information and classifying the different cases of low back disorders. Besides RBF networks, classification studies have been made with MLP networks, fuzzy logic, k-nearest neighbours and closest class mean classifiers, and the results have also been compared with clinicians' diagnoses.

Classification and feature extraction of speech signals is the single most widely applied and reported application of neural networks (see, for example, Renals, 1989; Bengio, 1992). Primarily, neural networks are used to classify spoken vowels based on speech spectrograms. It is worth noting that consistent and superior performance was obtained with neural networks compared with other known methods or networks for speech classification and feature extraction. Neural networks have also found their way into the process industries (Leonard and Kramer, 1991; Lohninger, 1993). Possible applications include process state identification, sensor validation, malfunction diagnosis and fault detection. Neural networks have also been employed in the classification of marine phytoplankton from multivariate flow cytometry data (Wilkins et al., 1994).

    1.6.2 Filtering

    Neural networks have drawn considerable attention from the signal processing community (Casdagli, 1989; Chen et al., 1990; LeCun et al., 1990). Remarkable claims have been made concerning the superior performance of neural networks over traditional methods in signal processing. One of the major areas of signal processing application is filtering.

The filtering property of neural networks employing Gaussian radial basis functions has been discussed and reported by researchers (Tattersall et al., 1991) and applied in filtering chaotic data (Holzfuss and Kadtke, 1993). Gaussian RBFs are a particularly good choice for this purpose because of the local property of this network, which enables wild oscillations to be damped out. Actually, the RBF method for multivariate approximation schemes is developed by imposing a smoothness constraint on the approximation function. This smoothness constraint can be synthesised in the frequency domain by the use of the generalised Fourier transform. Analysis and application of the generalised inverse Fourier transform lead to a smooth approximating scheme. Moreover, it has been shown that the neural network approach is a very promising method for smoothing scattered data (Barnhill, 1983).

Neural networks as filters have been used in digital communications, such as channel equalisation and overcoming co-channel interference. Significant robustness and good filtering properties of neural networks for systems with high signal-to-noise ratios have been reported (Holzfuss and Kadtke, 1993).

    1.6.3 Modelling and Prediction

Since neural networks were used for nonlinear prediction of chaotic time series (Casdagli, 1989), there has been a growing interest in using neural networks for various prediction tasks (Leung and Haykin, 1991). These prediction tasks include various nonlinear time series, such as annual sunspots, Canadian lynx data, ice ages and measles; chaotic data include the Ikeda map, the Lorenz equations, the Mackey-Glass delay differential equation, the Henon map, the logistic map, Duffing oscillators, radar backscatter, fluid turbulence flow, electrochemical systems (electrodissolution of copper in phosphoric acid) and many others. Neural networks have become popular for prediction of a variety of different time series, for example, chaotic time series (Platt, 1991), speech waveforms (Fallside, 1989) and economic data (Weigend et al., 1991). Most of these studies demonstrate the effectiveness with which good predictions can be made with neural networks. In some cases the accuracy achieved by networks of different sizes has been investigated, and in others the performance of different types of network has been compared.

Unlike speech waveforms or astronomical data, chaotic time series are usually deterministic and the underlying models generating them are known. So, the neural network learning performance can be evaluated directly by applying it to the prediction of chaotic time series. A review of the use of neural networks for the prediction of chaotic systems can be found in Casdagli et al. (1994).

Extensive studies of neural network prediction and modelling capabilities have been reported (Carlin et al., 1994). Based on real-world data, these studies are used to identify the dynamic actuator characteristics of a hydraulic industrial robot, to model carbon consumption in a metallurgical industrial process and to estimate the water content in fish food products based on NIR spectroscopy. The use of neural networks for system identification was popularised by a series of research papers (Chen et al., 1990; Narendra and Parthasarathy, 1990; Chen and Billings, 1992).

    1.6.4 Control

Neural networks have also received widespread attention and have been applied to the control of dynamical systems. They are employed to adaptively compensate for plant nonlinearities (Sanner and Slotine, 1992; Feng, 1994; Liu, 2001). Under mild assumptions about the degree of smoothness exhibited by the nonlinear functions, it has been shown that nonlinear optimal neural control is globally stable, with tracking errors converging to a neighbourhood of zero. A variant of neural networks (with Gaussian RBFs) has been used to optimise and control a repetitively pulsed, small-angle negative ion source designed to produce a high-current, low-emittance beam of negative hydrogen ions for injection into various accelerators used in nuclear physics (Mead et al., 1992). Neural networks have demonstrated, amongst other things, the versatility of nonlinear adaptive basis functions, simple and rapid training algorithms, and a variety of optional capabilities that can be incorporated, such as Kalman noise filtering. Neural networks have been used to design more powerful feedback-feedforward controllers for robotic applications (Parisini and Zoppoli, 1993). Apart from showing the desirable properties the neural network can achieve, this work highlights how much computational load is involved, particularly how the computation increases rapidly with the dimension of the problem.

There are many other interesting applications of neural networks in the control of dynamical and industrial systems. Space, however, precludes mention of these, but details can be found in the following: biomedical control (Nie and Linkens, 1993), chemical and industrial processes (Roscheisen et al., 1992; Liu and Daley, 1999a,c, 2001) and servomechanisms (Lee and Tan, 1993).


    1.6.5 Hardware Implementation

The hardware implementation of neural networks has developed in parallel with neural networks themselves, and many reports have appeared describing this development. An optical disk implementation of a neural network, applied to a handwritten classification task, was reported (Neifield et al., 1991). The optical-disk-based system was designed to recognise the handwritten numerals 0-9, and computed the Euclidean distance between an unknown input and 650 stored patterns (RBF centres or reference vectors) at a demonstrated rate of 26 000 pattern comparisons/s. This application chose the RBF structure of neural networks because it lies between supervised output-error-driven learning algorithms such as backpropagation and memory-intensive sample-based systems such as k-nearest neighbour classifiers.

In space-based applications, three different neurocomputers were designed and their performances compared (Watkins and Chau, 1992), based on custom analogue VLSI circuits and digital systems built on commercially available digital signal processors (Motorola 56000). These neurocomputer systems were controlled by a PC host running code written in C and 56000 assembly language. In computer vision, an experimental analogue VLSI chip to reconstruct a smooth surface from sparse depth data was reported (Harris, 1987, 1988). Using state-of-the-art CMOS technology, it showed how a neural VLSI chip can provide thin-plate spline smoothing of images. This work was later extended (Harris, 1994) and implemented via Delbruck's bump-resistor concept (Delbruck, 1991). Hardware development of neural networks is described in Anderson et al. (1993), while a good review of various neurocomputer developments is provided by Glesner and Pochmuller (1994).

    1.7 Mathematical Preliminaries

This section outlines some fundamental mathematical concepts that are necessary for the remaining chapters.

    The framework of metric spaces provides a general way of measuring the goodness of approximation, since there is a distance function defined for all functions that belong to this space. A special case of metric spaces, normed linear spaces, gives a more convenient method for approximation.

In most approximation problems, f(x) is in the space C(R^n), which is the set of continuous functions defined in R^n. The L_p-norm in C(R^n) is defined as

$$ \| f \|_p := \left( \int_{x \in R^n} |f(x)|^p\, dx \right)^{1/p} \qquad (1.53) $$

In a normed linear space, the distance between the functions f(x) and f*(x) is given the shorthand description

$$ D(f, f^*) := \| f - f^* \|_2 \qquad (1.54) $$

and is the norm of the difference between the two functions, which is a suitable distance function. Since the difference f - f* is the error function, this measure is the approximation error.

The commonly used norms are the 1-, 2- and ∞-norms. The L_1-norm has the property that the magnitude of error in the case of discrete data makes no difference to the final approximation (Powell, 1981). The L_∞-norm, also known as the Chebyshev norm, is much used in approximation theory. This norm can be expressed as

$$ \| f \|_\infty := \sup_{x \in R^n} |f(x)| \qquad (1.55) $$

which gives the maximum value of |f(x)|. The ∞-norm of the difference would then give the maximum difference between the two functions over all points x, which is also the maximum error of approximation.

The L_2-norm, or Euclidean norm, occurs naturally in theoretical studies of Hilbert spaces (Powell, 1981). The practical reasons for considering the L_2-norm are even stronger. From a statistical point of view, if the errors in the data have a normal distribution, the most appropriate choice for data fitting is the L_2-norm. Further, highly efficient algorithms can be developed to find the best approximation. The L_2-norm is given by

$$ \| f \|_2 := \left( \int_{x \in R^n} |f(x)|^2\, dx \right)^{1/2} \qquad (1.56) $$

The L_2-norm defines the L_2-space of functions, the square-integrable real functions. Since an inner product can be defined in this space, it is also the Hilbert space of square-integrable real functions, denoted by H (Linz, 1979). All continuous functions in C(R^n), and therefore f and f*, are a subset of this Hilbert space.

Typically, for a function to be admitted into H, its L_2-norm must be finite. There exist continuous functions in C(R^n) with infinite L_2-norm. However, when the norm is taken over the input space D ⊂ R^n, the norms of these functions can be made finite. Since the input space is always finite, all continuous functions can be admitted into H (see also Linz, 1979).

N and R denote the set of integers and real numbers, respectively. L_2(R) denotes the vector space of measurable, square-integrable one-dimensional functions f(x). For f, g ∈ L_2(R), the inner product and norm for the space L_2(R) are written as

$$ \langle f, g \rangle := \int_{-\infty}^{\infty} f(x) \overline{g(x)}\, dx \qquad (1.57) $$

$$ \| f \|_2 := \langle f, f \rangle^{1/2} \qquad (1.58) $$

where the bar denotes the conjugate of the function g(·). L_2(R^n) is the vector space of measurable, square-integrable n-dimensional functions f(x_1, x_2, ..., x_n). For f, g ∈ L_2(R^n), the inner product of f(x_1, x_2, ..., x_n) with g(x_1, x_2, ..., x_n) is written as

$$ \langle f, g \rangle := \int_{R^n} f(x_1, \ldots, x_n)\, \overline{g(x_1, \ldots, x_n)}\, dx_1 \cdots dx_n $$

The above mathematical notation introduced in this section will be used throughout this book.
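For sampled functions on a finite interval, these norms and inner products can be approximated numerically; the sketch below (an illustration, not from the original text) uses simple Riemann sums, an assumption that reflects the bounded input spaces discussed above, with arbitrary example functions.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 10001)
dx = x[1] - x[0]

f = np.exp(-x ** 2)                 # an example square-integrable function
g = np.cos(x) * np.exp(-x ** 2)     # a second example function

def lp_norm(values, p):
    """L_p norm ||f||_p = (integral |f|^p dx)^(1/p), approximated by a Riemann sum."""
    return (np.sum(np.abs(values) ** p) * dx) ** (1.0 / p)

# Chebyshev (infinity) norm: the supremum of |f| over the sampled interval
linf_norm = np.max(np.abs(f))

# Inner product <f, g> = integral f(x) * conj(g(x)) dx; the functions here are real
inner = np.sum(f * np.conj(g)) * dx

print(lp_norm(f, 1), lp_norm(f, 2), linf_norm, inner)
```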

    1.8 Summary

This chapter has presented an overview of neural networks. It started with a description of the model of a neuron (the basic element of a neural network) and commonly used architectures of neural networks. Various neural networks were then discussed, such as radial basis function networks, Gaussian RBF networks, polynomial basis function networks, fuzzy neural networks and wavelet networks. Function approximation by neural networks was then considered, taking the view that function approximation is essentially a linear combination of a set of basis functions defined at the hidden layer of a single hidden layer network. Learning by neural networks and its relation to function approximation were discussed, together with measures of approximation goodness. Three learning algorithms were introduced: the error back propagation algorithm, the recursive learning algorithm and the least mean square algorithm. Applications of neural networks to classification, filtering, modelling, prediction, control and hardware implementation were briefly detailed. Some fundamental mathematical concepts that are necessary in this book have also been provided.

CHAPTER 2

SEQUENTIAL NONLINEAR IDENTIFICATION

    2.1 Introduction

The identification of nonlinear systems using neural networks has become a widely studied research area in recent years. System identification mainly consists of two steps: the first is to choose an appropriate identification model and the second is to adjust the parameters of the model according to some adaptive laws so that the response of the model to an input signal can approximate the response of the real system to the same input. Since neural networks have good approximation capabilities and inherent adaptivity features, they provide a powerful tool for the identification of systems with unknown nonlinearities (Antsaklis, 1990; Miller et al., 1990).

The application of neural network architectures to nonlinear system identification has been demonstrated by several studies in discrete time (see, for example, Chen et al., 1990; Narendra and Parthasarathy, 1990; Billings and Chen, 1992; Qin et al., 1992; Willis et al., 1992; Kuschewski et al., 1993; Liu and Kadirkamanathan, 1995) and in continuous time (Polycarpou and Ioannou, 1991; Sanner and Slotine, 1992; Sadegh, 1993). For the most part, the studies of discrete-time systems are based on first replacing unknown functions in the difference equation by static neural networks and then deriving update laws using optimisation methods (e.g., gradient descent/ascent methods) for a cost function (quadratic in general), which has led to various back-propagation-type algorithms (Williams and Zipser, 1989; Werbos, 1990; Narendra and Parthasarathy, 1991). Though such schemes perform well in many cases, in general some problems arise, such as the stability of the overall identification scheme and the convergence of the output error. Alternative approaches based on the model reference adaptive control scheme (Narendra and Annaswamy, 1989; Slotine and Li, 1991) have been developed (Polycarpou and Ioannou, 1991; Sanner and Slotine, 1992; Sadegh, 1993), where the stability of the overall scheme is taken into consideration.

Most of the neural-network-based identification schemes view the problem as deriving model parameter adaptive laws, having chosen a structure for the neural network. However, choosing structural details such as the number of basis functions (hidden units in a single hidden layer) in the model must be done a priori. This can often lead to an over-determined or under-determined network structure, which in turn leads to an identification model that is not optimal. In the discrete-time formulation, some approaches have been developed for determining the number of hidden units (or basis functions) using decision theory (Baum and Haussler, 1989) and model comparison methods such as minimum description length (Smyth, 1991) and Bayesian methods (MacKay, 1992). The problem with these methods is that they require all observations to be available together and hence are not suitable for on-line or sequential identification tasks.