
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 48, NO. 7, JULY 2000

Efficient Training of Neural Nets for Nonlinear Adaptive Filtering Using a Recursive Levenberg-Marquardt Algorithm

Lester S. H. Ngia, Student Member, IEEE, and Jonas Sjöberg, Member, IEEE

Abstract: The Levenberg-Marquardt algorithm is often superior to other training algorithms in off-line applications. This motivates the proposal of using a recursive version of the algorithm for on-line training of neural nets for nonlinear adaptive filtering. The performance of the suggested algorithm is compared with other alternative recursive algorithms, such as the recursive versions of the off-line steepest-descent and Gauss-Newton algorithms. The advantages and disadvantages of the different algorithms are pointed out. The algorithms are tested on some examples, and it is shown that, generally, the recursive Levenberg-Marquardt algorithm has better convergence properties than the other algorithms.

Index Terms: Gauss-Newton, Levenberg-Marquardt, neural networks, nonlinear adaptive filtering, off-line training, on-line training, steepest descent.

    I. INTRODUCTION

NEURAL nets can be trained to learn from their environment. They are known to be universal approximators, e.g., [1]-[4], such that they can be used to approximate any smooth nonlinear function. Neural net training can be performed off-line, or in batch mode, where all data are used in each training iteration of the parameter update, e.g., [5]-[8]. The other possibility is to train the net with an on-line, or recursive, algorithm, where a new data sample becomes available in each training iteration, e.g., [9]-[12]. Such on-line algorithms are discussed in this article, and an on-line Levenberg-Marquardt algorithm for neural net filters is specially derived.

Consider a regressor vector φ(t), which is based on measured input-output signals up to time t, and a parameter vector θ. A linear regression model is then expressed as ŷ(t|θ) = φ^T(t)θ and a pseudo-linear regression model as ŷ(t|θ) = φ^T(t, θ)θ. Such models, or filter structures, are described in [13] and [14]. The specific type of filter structure depends on the contents of the regressor, and the FIR and IIR filters are included in these two classes of models.

There exist several approaches to describe recursive training algorithms. In the Bayesian approach, the parameter vector is described as a stochastic time-varying variable with a certain prior probability distribution. Then, the recursive training can be described as a tracking problem, where the algorithm tries to follow the time-varying parameter vector. Another approach is to present the algorithms as modifications of off-line algorithms so that it is possible to use and generalize insights from the off-line training. An overview of these approaches and of recursive algorithms for the linear and pseudo-linear regression models can be found in [14]-[17].

Manuscript received November 30, 1998; revised November 11, 1999. This work was supported in part by the Swedish Research Council for Engineering Sciences (TFR). The associate editor coordinating the review of this paper and approving it for publication was Prof. Lang Tong.

The authors are with the Signal Processing Group, Department of Signals and Systems, Chalmers University of Technology, Gothenburg, Sweden (e-mail: [email protected]).

Publisher Item Identifier S 1053-587X(00)04936-9.

The chosen approach to derive recursive algorithms for nonlinear regression models is based on modifications of off-line training algorithms. The selected technique in this category is similar to the one used to obtain the recursive algorithms for a pseudo-linear regression model in [14]. The nonlinear regression models can be described as ŷ(t|θ) = g(φ(t), θ), where g is a nonlinear function. In this case, g can be a feedforward or radial basis function neural net.

The main purpose of this paper is to introduce an on-line Levenberg-Marquardt algorithm to train neural nets for nonlinear adaptive filtering. This is motivated by the fact that the Levenberg-Marquardt algorithm is often the best method in off-line training. Indeed, as shown in the examples, the recursive Levenberg-Marquardt algorithm has similar advantages over other recursive algorithms, exactly as its corresponding off-line algorithm does.

The applications of neural nets as nonlinear adaptive filters are not new. For example, neural nets are trained by a variant of a recursive steepest-descent algorithm in [11] and by a recursive linearized least squares algorithm in [10]. The extended Kalman filter algorithm, which is derived from the earlier mentioned Bayesian approach, is also used to train neural nets, e.g., [18]-[20]. This algorithm is close to a recursive Gauss-Newton algorithm [14]. To the authors' knowledge, the only work that is close to this paper is found in [21], but their algorithm cannot be used to estimate time-varying systems. The derivation of their algorithm ignores the second-order information (or off-diagonal elements) of the Hessian matrix, a modification made to reduce the computational complexity. This simplification causes a slow convergence rate. In contrast, the proposed recursive Levenberg-Marquardt algorithm can estimate time-varying systems, and it uses the second-order information in a way that does not add large computational complexity.

For the sake of clarity, the derivation is constrained to the single-input single-output (SISO) and nonlinear regression case, i.e., ŷ(t|θ) = g(φ(t), θ). It is straightforward to generalize the result to the multi-input multi-output (MIMO) and pseudo-nonlinear regression case, i.e., ŷ(t|θ) = g(φ(t, θ), θ).



Fig. 1. Nonlinear ARX model that uses a feedforward neural net, with M - 1 hidden layers of sigmoidal neuron units, as the nonlinear dynamics. A description of the model is given in Appendix A.

step length to assure that the criterion decreases in every iteration.

    A. Search Directions

The basis for the local search is the gradient of the criterion, which (here written for unit weighting) is

G(θ) = dV_N(θ)/dθ = -(1/N) Σ_{t=1}^{N} ψ(t, θ) ε(t, θ)    (10)

where

ψ(t, θ) = dŷ(t|θ)/dθ.    (11)

It is well known that methods that use only the local gradient, i.e., the steepest-descent search direction, perform inefficient minimization, especially in ill-conditioned problems close to the minimum. Then, it is optimal to use the Newton search direction

-[H(θ)]^{-1} G(θ)    (12)

where H(θ) is the second derivative, or Hessian, matrix

H(θ) = d²V_N(θ)/dθ² = (1/N) Σ_{t=1}^{N} [ψ(t, θ) ψ^T(t, θ) - ψ'(t, θ) ε(t, θ)].    (13)

The true Newton direction will thus require that the second derivative ψ'(t, θ) be computed. In addition, far from the minimum, H(θ) need not be positive semi-definite. Therefore, alternative search directions are usually used in practice. The following are some common off-line algorithms that are described by (9).

Steepest Descent Algorithm: Take

R^(i) = I.    (14)

If the step length is constant, then the steepest descent algorithm becomes the traditional back-propagation error algorithm [7].

Gauss-Newton Algorithm: Neglect the second term in (13) to obtain

R^(i) = (1/N) Σ_{t=1}^{N} ψ(t, θ̂^(i)) ψ^T(t, θ̂^(i)).    (15)

The neglected term can be expected to be close to zero near the minimum. In addition, by omitting this term, R^(i) is guaranteed to be positive semi-definite. However, R^(i) can be almost singular, and this leads to a long Gauss-Newton step in the unimportant parameter space. This is the case in an overparameterized model or when the data sets are not informative enough.

Levenberg-Marquardt Algorithm: Set δ^(i) ≥ 0, and use

R_LM^(i) = R^(i) + δ^(i) I    (16)

where R^(i) is defined by (15). Now, R_LM^(i) is used instead of R^(i) to assure a decrease in each iteration. Choosing δ^(i) = 0 gives a full Gauss-Newton step, which is the largest possible update with the algorithm. Taking δ^(i) large will cause the diagonal elements of R_LM^(i) to dominate and thus lead to the steepest descent direction with a short step length.

It is well known that the Levenberg-Marquardt algorithm is often the best choice in many neural net training problems, e.g., [6], [8]. The reason is that neural net minimization problems are often ill-conditioned, as shown in [25] and [26]. The Levenberg-Marquardt algorithm disregards nuisance directions in the parameter space that influence the criterion only marginally. At the same time, it still performs an update close to the efficient Gauss-Newton direction within the subspace of the important parameters. This is further explained in [27].
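To make the distinction between the three search directions concrete, the following is a minimal numerical sketch (in Python) of an off-line training loop on a toy one-neuron model. It assumes a plain sum-of-squares criterion with unit weighting; the function names, the toy model, and the step length are illustrative and are not taken from the paper.

# Minimal sketch of the off-line search directions (14)-(16) for a
# sum-of-squares criterion V(theta) = 1/(2N) sum_t eps(t, theta)^2.
import numpy as np

def search_direction(psi, eps, method="lm", lam=1e-1):
    """psi: N x d Jacobian of yhat w.r.t. theta; eps: N residuals."""
    N, d = psi.shape
    grad = -psi.T @ eps / N                      # gradient of V, cf. (10)
    if method == "sd":                           # steepest descent, cf. (14): R = I
        R = np.eye(d)
    elif method == "gn":                         # Gauss-Newton, cf. (15)
        R = psi.T @ psi / N
    else:                                        # Levenberg-Marquardt, cf. (16)
        R = psi.T @ psi / N + lam * np.eye(d)
    return -np.linalg.solve(R, grad)             # direction used in the update (9)

# Toy example: fit yhat(t) = theta0 * tanh(theta1 * u(t)) to noisy data.
rng = np.random.default_rng(0)
u = rng.standard_normal(200)
y = 1.5 * np.tanh(0.8 * u) + 0.05 * rng.standard_normal(200)
theta = np.array([0.2, 2.0])
for _ in range(20):
    eps = y - theta[0] * np.tanh(theta[1] * u)
    psi = np.column_stack([np.tanh(theta[1] * u),                     # dyhat/dtheta0
                           theta[0] * u / np.cosh(theta[1] * u) ** 2])  # dyhat/dtheta1
    theta = theta + 0.5 * search_direction(psi, eps, method="lm")
print(theta)  # approaches the true values [1.5, 0.8]

Setting lam to zero recovers the full Gauss-Newton step, whereas a large lam scales the update toward a short steepest-descent step, which mirrors the discussion of (16) above.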



    Fig. 3. Adaptive filter used in a system identification mode.

TABLE II: COMPARISON OF THE NUMBER OF FLOPS PER RECURSION TO IMPLEMENT EACH ALGORITHM

The overall proposed recursive Levenberg-Marquardt algorithm, as described in (34)-(40), is summarized in Appendix C.

    V. EXAMPLES

In the examples, the adaptive filter is used as a system identifier, as shown in Fig. 3. The adjustable filter has an input signal u(t) and produces an output signal ŷ(t). The adaptive algorithm adjusts the parameters in the filter to minimize the error ε(t) = y(t) - ŷ(t), where y(t) is the response from the unknown system that is corrupted by a disturbance v(t).

The proposed recursive Levenberg-Marquardt algorithm is tested on three examples. To provide some comparison, two other recursive algorithms are also applied to the same filter structure in each example. All filter structures belong to the general SISO NARX model shown in Fig. 1. The algorithms are as follows.

Recursive Steepest-Descent Algorithm (RecSD): The algorithm suggested in [11] is used, which is a variant of (24) and (25). It has a rectangular sliding data window, and at each time instant several iterations of the parameter update are performed. The use of a longer data window and of more iterations per time instant helps to improve the tracking and convergence rate at the expense of increased computational complexity. If the window length and the number of iterations are both equal to one, then the suggested algorithm is the same as the one in (24) and (25).

Recursive Gauss-Newton Algorithm (RecGN): It uses the algorithm in (29), which is the same as the recursive linearized least squares algorithm in [10].

Recursive Levenberg-Marquardt Algorithm (RecLM): It applies the algorithm described in (34)-(40), which is summarized in Appendix C.

Table II compares the number of flops required to implement the RecSD, RecGN, and RecLM algorithms. As in Table I, the number of flops does not include the computation of the gradient ψ(t).
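As a complement to the flop comparison, the following Python sketch shows the general shape of a recursive Gauss-Newton update with a forgetting factor and a Levenberg-Marquardt style diagonal term. It is only a hedged illustration: the class name, the initialization constants, and the coupling between the adaptation gain and the forgetting factor are assumptions made here, not the paper's recursion (34)-(40).

# Hedged sketch: recursive Gauss-Newton update with forgetting plus a
# Levenberg-Marquardt style diagonal term (illustrative, not (34)-(40)).
import numpy as np

class RecursiveLMSketch:
    def __init__(self, theta0, forget=0.995, delta=1e-2):
        self.theta = np.asarray(theta0, dtype=float)  # parameter estimate
        d = self.theta.size
        self.R = 1e-3 * np.eye(d)   # running Gauss-Newton Hessian estimate
        self.forget = forget        # forgetting factor that discounts old data
        self.delta = delta          # regularization added to R before solving

    def update(self, psi, eps):
        """psi: d-vector dyhat/dtheta at the current estimate; eps: prediction error."""
        gamma = 1.0 - self.forget                        # adaptation gain
        self.R = self.R + gamma * (np.outer(psi, psi) - self.R)
        d = self.theta.size
        step = np.linalg.solve(self.R + self.delta * np.eye(d), psi * eps)
        self.theta = self.theta + gamma * step
        return self.theta

In the system identification setting of Fig. 3, psi and eps would be produced at each time instant by the neural net filter, and keeping delta constant corresponds to the nonadaptive Levenberg-Marquardt parameter examined in Example 1.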

Fig. 4. Time-invariant NFIR system with a one-layer neural net for Example 1.

Fig. 5. Surface and contour graph for the fit of the off-line error criterion (7) with unit weighting.

A. Example 1: Influence of the Design Variable in a Simulated Time-Invariant System

This example illustrates the performance of the algorithms in a time-invariant system. A neural net as shown in Fig. 4 is the system to be identified, and it has a constant parameter vector. The input signal is a sequence of samples of white noise with mean 0 and variance 1. Fig. 5 shows the surface and contour graph for the fit of the off-line error criterion (7) as a function of the bias and weight parameters.

An NFIR filter structure, which is the same as the NFIR model in the system shown in Fig. 4, is used with the regressor of (3a). The values of the design variables in RecSD, RecGN, and RecLM are shown in Table III, except for the Levenberg-Marquardt parameter in RecLM; different values for it are given in the example. Two different initial values of the parameter estimate θ̂(0) are chosen, one for Case 1 and one for Case 2 (given in the captions of Figs. 6 and 8). In each case, two kinds of experiments are performed on RecLM: the first uses two different constant values of the Levenberg-Marquardt parameter, and the second uses the adaptive parameter described in (36)-(38).

In Case 1, RecLM begins with two constant values of the Levenberg-Marquardt parameter. Fig. 6 shows the locus of the parameter estimate θ̂(t) on the same error contour as in Fig. 5. The following are the main points.

Among the algorithms, RecSD will converge the slowest due to its zig-zag path, and RecGN will converge the fastest due to its shortest path to the target.


TABLE III: VALUES OF THE DESIGN VARIABLES IN EACH ALGORITHM FOR EXAMPLE 1

Fig. 6. Locus of the parameter estimate θ̂(t) on the error contour for Case 1 (θ̂(0) = [17; 4]). Note that a small part of the locus for RecGN is outside the figure.

In the beginning, RecGN has, however, taken a few long steps in the nuisance parameter space before quickly returning to the correct path toward the target.

A careful choice of the value of the Levenberg-Marquardt parameter in RecLM with a nonadaptive setting can improve the convergence.

Fig. 7 depicts the convergence of the parameter estimate θ̂(t) over time.

Clearly, RecSD converges the slowest and RecGN converges the fastest among the algorithms.

RecLM with a constant Levenberg-Marquardt parameter initially converges faster but finally converges slower than RecSD. This re-emphasizes that the parameter should not be fixed to a constant value.

If the Levenberg-Marquardt parameter is allowed to be adapted over time, a good choice of its initial value may not be crucial. This is shown by the RecLM that uses the adaptive parameter: it converges faster than the RecLM with a constant parameter, and it has a similarly fast convergence rate to that of RecGN.

In Case 2, RecLM begins with two different constant values of the Levenberg-Marquardt parameter. Fig. 8 shows the following results.

In the beginning, RecSD has taken the same path as the other algorithms but later descends quickly toward the target. Thus, RecSD will converge the fastest among the algorithms.

Fig. 7. Convergence of the parameter estimate θ̂(t) for Case 1 (θ̂(0) = [17; 4]).

Fig. 8. Locus of the parameter estimate θ̂(t) on the error contour for Case 2 (θ̂(0) = [7; -2]). Note that a small part of the locus for RecGN and for one of the constant-parameter RecLM runs is outside the figure.

RecGN has taken a few long steps in the unimportant direction of the parameter space and converged to a local minimum.

Fig. 9 illustrates that the RecLM that uses the adaptive Levenberg-Marquardt parameter has a similarly fast convergence rate to that of RecSD. Note that the convergence of RecGN is not plotted in Fig. 9 because it converged to a bad local minimum.

The results shown in Figs. 6-9 indicate that the recursive Levenberg-Marquardt algorithm is indeed a method in between the recursive steepest-descent and Gauss-Newton algorithms. These results are similar to the ones obtained by the corresponding off-line training algorithms. In addition, the adaptive Levenberg-Marquardt parameter is preferable for obtaining fast convergence in the recursive Levenberg-Marquardt algorithm. In the next two examples, the recursive Levenberg-Marquardt algorithm will use only the adaptive parameter.


Fig. 9. Convergence of the parameter estimate θ̂(t) for Case 2 (θ̂(0) = [7; -2]). Note that a small part of the convergence for the RecLM with the adaptive parameter and for one of the constant-parameter RecLM runs is outside the figure.

Fig. 10. Time-varying NARX system with a two-layer neural net for Example 2.

B. Example 2: Influence of the Design Variable in a Simulated Time-Varying System

This example illustrates the performance of the algorithms in a time-varying system. A NARX system is shown in Fig. 10. The input signal is a sequence of samples of white noise with mean 0 and variance 1. The parameters at the start are chosen as uniformly distributed random values, which are tabulated in Table IV. After a number of samples, all parameters are changed by random values with a uniform distribution, and after a further interval, the parameters are changed back to their original values.

The algorithms use the same filter structure as the NARX model that is used to generate the output data, as depicted in Fig. 10, with the regressor of (3c). Table V shows the values of the design variables in the recursive algorithms. Note the weighting sequence, as plotted in Fig. 2, for the two cases in RecLM. The sensitivity of the algorithm with respect to this weighting is discussed in Section IV-A3. In this configuration, RecLM requires a similar number of flops as RecSD

TABLE IV: VALUES OF THE TRUE PARAMETERS IN THE TIME-VARYING SYSTEM FOR EXAMPLE 2

TABLE V: VALUES OF THE DESIGN VARIABLES IN EACH ALGORITHM FOR EXAMPLE 2

and 25% more flops than RecGN. The initial values of the parameter estimate are the same in all algorithms; they are selected randomly with a uniform distribution.

Fig. 11 exhibits the convergence of the normalized mean square error (NMSE) of the algorithms, which is defined as

NMSE(dB) = 10 log10( E[ε²(t)] / E[y²(t)] ).    (41)

The NMSE(dB) is approximated by smoothing ε²(t) and y²(t) using a first-order filter with a decay factor of 0.95.
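For concreteness, the smoothed NMSE(dB) curve can be computed as in the following sketch; note that the exact form of (41) given above is a reconstruction, and the numerical floor and the function name here are illustrative.

# Hedged sketch of the smoothed NMSE(dB) in (41): the squared error and the
# squared system output are low-pass filtered with a first-order filter
# (decay factor 0.95, as stated in the text) before taking their ratio in dB.
import numpy as np

def smoothed_nmse_db(y, yhat, decay=0.95):
    eps2_bar, y2_bar = 0.0, 0.0
    nmse = np.zeros(len(y))
    for t in range(len(y)):
        eps2_bar = decay * eps2_bar + (1 - decay) * (y[t] - yhat[t]) ** 2
        y2_bar = decay * y2_bar + (1 - decay) * y[t] ** 2
        nmse[t] = 10.0 * np.log10(max(eps2_bar, 1e-12) / max(y2_bar, 1e-12))
    return nmse

The ERLE(dB) of (42), used in Example 3, is simply the negative of this quantity.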

Fig. 11 illustrates the following points. RecSD converges slower than RecGN and RecLM (Case a) at every time instant.

RecGN converges slower than RecLM (Case a) over an early interval of samples, although they have the same discounting effect, i.e., the weighting in RecLM (Case a) has the same constant 0.995 as in RecGN.

RecGN and RecLM have an almost similar convergence rate after the initial transient because they track equally well at the two time instants where the parameters change.

RecLM (Case b) converges more slowly and tracks more poorly than RecLM (Case a). This is quite natural because the estimates of the variables in the gradient are rather poor


Fig. 11. Convergence of the normalized mean square error (NMSE) in decibels. The two different cases of RecLM correspond to a weighting sequence with initial value 0.995 and final value 1 (Case a) or 0.999 (Case b).

Fig. 12. Convergence of ŵ(t), together with its true value w, as shown in Table IV.

in the transient phase. RecLM (Case a) discounts more of these poor estimates and thus converges faster than RecLM (Case b). RecLM (Case a) also discounts more old data, and thus it is more alert to parameter changes than RecLM (Case b). The differences in convergence performance between the two cases are quite significant, although they differ only by 0.001 in the weighting. This is due to the sensitivity of the algorithm to this value. Note that if RecGN uses the same weighting as RecLM (Case b), it will have similarly slow convergence and poor tracking as RecLM (Case b); this is not shown in Fig. 11.

A similar result is shown in Fig. 12, which illustrates the convergence of one of the parameter estimates, ŵ(t).

Among the algorithms, RecLM (Case a) converges the fastest to the true parameter value during the initial transient phase.

    Fig. 13. Estimated impulse response of the network echo channel.

RecGN and RecLM (Case a) can track the parameter changes at the two change instants better than RecSD and RecLM (Case b).

The results depicted in Figs. 11 and 12 imply that the recursive Levenberg-Marquardt algorithm, with the adaptive Levenberg-Marquardt parameter and a correct choice of the weighting sequence, has better convergence and tracking properties than the recursive steepest-descent and Gauss-Newton algorithms.

C. Example 3: Echo Cancellation Performance in a Real Network Echo Channel

The data used in the example are an echo of a telephone network channel, recorded at a mobile switching center in Sweden. Network echo is a phenomenon in which a delayed and distorted version of an original signal is reflected back to the source due to the impedance mismatch at the hybrid of a local telephone exchange. In practice, the network echo is only nearly linear. The input signal is a white noise sequence, and the sampling rate is 8 kHz.

The algorithms RecSD, RecGN, and RecLM are applied to an NFIR filter structure, which uses the regressor of (3a). To demonstrate the degree of nonlinearity in the network echo channel, a linear FIR filter is also compared. It has a linear regression structure as in (4) and uses the same regressor of (3a).

Its filter parameters are estimated using the NLMS algorithm with a step size of 0.5, which is similar to (24) and (25), except that the gradient reduces to the regressor in this linear case. An earlier analysis of the recorded data shows that the network echo channel has an estimated impulse response as shown in Fig. 13. Thus, the chosen values are 203 for the input delay and 25 for the regressor length, which are sufficient to represent the energy contained in the true impulse response of the echo channel.

The values of the design variables in RecSD, RecGN, and RecLM are shown in Table VI. They use one hidden layer with five neurons as a trade-off between the achieved ERLE and

(Analysis shows that the performances of the NLMS and RLS algorithms on the linear FIR filter are almost similar.)


TABLE VI: VALUES OF THE DESIGN VARIABLES IN EACH ALGORITHM FOR EXAMPLE 3

Fig. 14. Comparison of ERLE convergence. Note that the linear FIR filter structure is estimated by the NLMS algorithm with a step size of 0.5.

computational complexity. The resulting number of parameters is 136. In this configuration, the computational complexity of RecLM is similar to that of RecSD, and RecLM needs 15% more flops than RecGN. The linear FIR filter has the lowest number of parameters (25), matching the predetermined regressor length after considering that the input signal is delayed. All algorithms have the same initial parameter estimates, which are chosen to be uniformly distributed random values. Note that the linear FIR filter uses only the first 25 out of the 136 initial estimates.

The four different adaptive filters operate as echo cancelers to mimic the transfer function of the feedback coupling path. The figure of merit for the effectiveness of an echo canceler is the echo return loss enhancement (ERLE), which is defined as

ERLE(dB) = -NMSE(dB).    (42)

Fig. 14 shows the following main results.

Among the algorithms, the linear FIR filter has the lowest ERLE because the echo path is nonlinear, and it converges the fastest because it uses the smallest number of parameters.

Compared with the linear filter, RecSD has the smallest ERLE gain of 1.5 dB, and RecLM has the highest ERLE gain of 8 dB.

RecLM converges 0.3 s faster than RecGN and has an ERLE that is 2.5 dB higher.

To an extent, the results show the superior efficiency of the recursive Levenberg-Marquardt algorithm.

    VI. CONCLUSION

The Levenberg-Marquardt numerical search is known to be more efficient than the steepest-descent and Gauss-Newton methods in the off-line training of neural nets. This paper proposes a recursive Levenberg-Marquardt algorithm for on-line training of neural nets for nonlinear adaptive filtering.

The recursive Levenberg-Marquardt algorithm is compared with the recursive steepest-descent and Gauss-Newton algorithms in three system identification examples. All algorithms use the same filter structures in each example. The results indicate that the recursive Levenberg-Marquardt algorithm has similar advantages to its corresponding off-line algorithm. It is an efficient algorithm for adapting nonlinear systems on both simulated and real practical data.

Although the recursive Levenberg-Marquardt algorithm consumes a higher number of flops than the recursive steepest-descent and Gauss-Newton algorithms, this is abundantly compensated by its increased efficiency. The high efficiency is also unmatched by the recursive steepest-descent algorithm even when its computational complexity is allowed to equal that of the recursive Levenberg-Marquardt algorithm by increasing the length of the data window and the number of parameter-update iterations at each time instant.

APPENDIX A
DESCRIPTION OF THE SISO NARX MODEL

Fig. 1 illustrates the nonlinear ARX model with M - 1 hidden layers of sigmoidal neuron units as the nonlinear dynamics. The net input to unit i at layer m is

x_i^m(t) = Σ_{j=1}^{N_{m-1}} w_{i,j}^m z_j^{m-1}(t) + b_i^m    (43)

where
N_m is the number of units at layer m;
w_{i,j}^m is a weight term;
b_i^m is a bias term.

The parameter vector θ consists of all the weight and bias terms (44), and its dimension equals the total number of weight and bias terms. The input-output relationship of unit i at layer m is

z_i^m(t) = σ(x_i^m(t))    (45)

where σ(·) is the nonlinear function of the unit. At layer 0, z^0(t) is equivalent to the regressor φ(t), and at layer M, z^M(t) is equivalent to the model output ŷ(t|θ).
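As an illustration of (43) and (45), the following Python sketch computes the output of the feedforward net in Fig. 1 from a regressor phi(t). The choice of tanh as the sigmoidal unit and of a linear output unit, as well as the weight and bias containers, are assumptions made here for concreteness; they are not dictated by the paper, and the parameter ordering of (44) is not reproduced.

# Hedged sketch of the forward pass of the Appendix A net: eqs. (43) and (45).
import numpy as np

def narx_net_output(phi, weights, biases):
    """phi: regressor z^0(t); weights[m], biases[m]: parameters of layer m+1."""
    z = np.asarray(phi, dtype=float)
    M = len(weights)
    for m in range(M):
        x = weights[m] @ z + biases[m]       # net input, cf. (43)
        z = np.tanh(x) if m < M - 1 else x   # sigmoidal hidden units, cf. (45); linear output assumed
    return z                                 # z at layer M plays the role of yhat(t|theta)

# Example: one hidden layer with five units and a 25-dimensional regressor,
# which gives 136 parameters in total, as in Example 3.
rng = np.random.default_rng(0)
W = [0.1 * rng.standard_normal((5, 25)), 0.1 * rng.standard_normal((1, 5))]
b = [np.zeros(5), np.zeros(1)]
print(narx_net_output(rng.standard_normal(25), W, b))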

APPENDIX B
DERIVATION OF THE GRADIENT VECTOR

Consider the neural net filter structure shown in Fig. 1. The gradient vector in (11) is reproduced as

ψ(t, θ) = dŷ(t|θ)/dθ.    (46)
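Under the same illustrative assumptions as the forward-pass sketch in Appendix A (tanh hidden units, linear scalar output), the gradient vector ψ(t, θ) = dŷ(t|θ)/dθ can be obtained by backpropagating through the net, as sketched below. This is only a generic backpropagation sketch, not the paper's derivation in Appendix B.

# Hedged sketch: dyhat/dtheta for a scalar-output net with tanh hidden units,
# computed by a standard backward pass (illustrative, not the paper's (46) ff.).
import numpy as np

def narx_net_gradient(phi, weights, biases):
    """Return (dW, db): gradients of yhat w.r.t. each layer's weights and biases."""
    z = [np.asarray(phi, dtype=float)]
    M = len(weights)
    for m in range(M):                          # forward pass, storing activations
        x = weights[m] @ z[-1] + biases[m]
        z.append(np.tanh(x) if m < M - 1 else x)
    dW, db = [None] * M, [None] * M
    delta = np.ones(1)                          # dyhat/dx at the (assumed linear) output unit
    for m in reversed(range(M)):                # backward pass
        dW[m] = np.outer(delta, z[m])
        db[m] = delta
        if m > 0:
            delta = (weights[m].T @ delta) * (1.0 - z[m] ** 2)   # d tanh = 1 - tanh^2
    return dW, db

Stacking the entries of dW and db in the order chosen for θ in (44) gives the gradient vector ψ(t, θ) used by the recursive algorithms.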


[13] L. Ljung, System Identification: Theory for the User. Englewood Cliffs, NJ: Prentice-Hall, 1999.

[14] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification. Cambridge, MA: MIT Press, 1983.

[15] G. C. Goodwin and K. S. Sin, Adaptive Filtering Prediction and Control. Englewood Cliffs, NJ: Prentice-Hall, 1984.

[16] S. Haykin, Adaptive Filter Theory. Upper Saddle River, NJ: Prentice-Hall, 1996.

[17] B. Widrow and S. Stearns, Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1984.

[18] Y. Iiguni and S. Sakai, A real time learning algorithm for a multilayered neural network based on the extended Kalman filter, IEEE Trans. Signal Processing, vol. 40, pp. 959-966, Apr. 1992.

[19] H. Kinjo, S. Tamaki, and T. Yamamoto, Linearized adaptive filter for a nonlinear system using neural network based on the extended Kalman filter, Trans. IEE Jpn., vol. 117, no. 3, pp. 279-286, Mar. 1997.

[20] M. B. Matthews and G. S. Moschytz, Neural-network nonlinear adaptive filtering using the extended Kalman filter algorithm, in Proc. Int. Neural Networks Conf., 1990, pp. 115-118.

[21] S. Kollias and D. Anastassiou, An adaptive least squares algorithm for the efficient training of artificial neural networks, IEEE Trans. Circuits Syst., vol. 36, pp. 1092-1101, Aug. 1989.

[22] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P. Y. Glorennec, H. Hjalmarsson, and A. Juditsky, Nonlinear black-box modeling in system identification: A unified overview, Automatica, vol. 31, no. 12, pp. 1691-1724, Dec. 1995.

[23] J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Englewood Cliffs, NJ: Prentice-Hall, 1983.

[24] R. Fletcher, Practical Methods of Optimization. Wiltshire, U.K.: Wiley, 1997.

[25] S. Saarinen, R. Bramley, and G. Cybenko, Ill-conditioning in neural network training problems, SIAM J. Sci. Comput., vol. 14, no. 3, pp. 693-714, May 1993.

[26] Q. J. Zhang, Y. J. Zhang, and W. Ye, Local-sparse connection multilayer networks, in Proc. IEEE Int. Conf. Neural Networks, 1995, pp. 1254-1257.

[27] J. Sjöberg and M. Viberg, Separable nonlinear least-squares minimization: Possible improvements for neural net fitting, in Proc. IEEE Workshop Neural Networks Signal Process., 1997, pp. 345-354.

[28] T. Söderström, An on-line algorithm for approximate maximum likelihood identification of linear dynamic systems, Lund Inst. Technol., Lund, Sweden, Tech. Rep. 7308, 1973.

[29] S. Gunnarsson, On covariance modification and regularization in recursive least squares identification, in Proc. IFAC Symp. Syst. Ident., 1995, pp. 935-940.

[30] J. J. Moré, The Levenberg-Marquardt algorithm: Implementation and theory, in Numerical Analysis, ser. Springer Lecture Notes in Mathematics, G. A. Watson, Ed. Berlin, Germany, 1977, vol. 630, pp. 105-116.

[31] M. D. Hebden, An algorithm for minimization using exact second derivatives, Atomic Energy Res. Establish., Harwell, U.K., Tech. Rep. TP515, 1973.

Lester S. H. Ngia (S'97) was born in Malaysia in 1965. He received the B.E.E. degree from the University of Technology of Malaysia, Kuala Lumpur, in 1988, the M.B.A. degree from the Edinburgh Business School, Heriot-Watt University, Edinburgh, U.K., in 1995, and the M.Sc.E.E. degree from Chalmers University of Technology (CTH), Gothenburg, Sweden, in 1997. Since February 1997, he has been working toward the Ph.D. degree in electrical engineering at CTH.

He was with Intel Corporation in Malaysia as a Product Engineer from 1988 to 1992 and a Senior Product, Quality and Reliability Engineer from 1992 to 1995, working in the area of testing and reliability for speech and data communication VLSI products. Since February 1997, he has been a Teaching and Research Assistant with the Department of Signals and Systems, CTH. His research interests include mathematical modeling of linear and nonlinear dynamic systems, with applications to network and acoustic echo cancellation problems.

Jonas Sjöberg (M'97) was born in Sweden. He received the M.Sc. degree from Uppsala University, Uppsala, Sweden, in 1989 and the Ph.D. degree in electrical engineering from Linköping University, Linköping, Sweden, in May 1995.

He was a Postdoctoral Fellow at the Swiss Federal Institute of Technology, Zürich, and at the Vienna University of Technology, Vienna, Austria. He was a Visiting Researcher at the Technion-Israel Institute of Technology, Haifa, in 1998. Since 1996, he has been an Assistant Professor with the Department of Signals and Systems, Chalmers University of Technology, Gothenburg, Sweden.

His current research interests include algorithms and applications of system identification, especially for nonlinear systems. He is currently an Associate Editor of Control Engineering Practice.