
Neural Networks 19 (2006) 1648–1660
www.elsevier.com/locate/neunet

A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems

Radhakant Padhi^{a,1}, Nishant Unnikrishnan^{b}, Xiaohua Wang^{b}, S.N. Balakrishnan^{b,∗}

^{a} Department of Aerospace Engineering, Indian Institute of Science, Bangalore, India
^{b} Department of Mechanical and Aerospace Engineering, University of Missouri – Rolla, MO 65409, USA

Received 27 January 2004; received in revised form 18 August 2006; accepted 18 August 2006

Abstract

Even though dynamic programming offers an optimal control solution in a state feedback form, the method is overwhelmed by computational and storage requirements. Approximate dynamic programming implemented with an Adaptive Critic (AC) neural network structure has evolved as a powerful alternative technique that obviates the need for excessive computations and storage requirements in solving optimal control problems. In this paper, an improvement to the AC architecture, called the "Single Network Adaptive Critic (SNAC)", is presented. This approach is applicable to a wide class of nonlinear systems where the optimal control (stationary) equation can be explicitly expressed in terms of the state and costate variables. The selection of this terminology is guided by the fact that it eliminates the use of one neural network (namely the action network) that is part of a typical dual network AC setup. As a consequence, the SNAC architecture offers three potential advantages: a simpler architecture, lesser computational load and elimination of the approximation error associated with the eliminated network. In order to demonstrate these benefits and the control synthesis technique using SNAC, two problems have been solved with the AC and SNAC approaches and their computational performances are compared. One of these problems is a real-life Micro-Electro-Mechanical-System (MEMS) problem, which demonstrates that the SNAC technique is applicable to complex engineering systems.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Optimal control; Nonlinear control; Approximate dynamic programming; Adaptive critic; Single network adaptive critic; SNAC architecture

1. Introduction

Many difficult real-life control design problems can be formulated in the framework of optimal control theory. It is well-known that the dynamic programming formulation offers the most comprehensive solution approach to nonlinear optimal control in a state feedback form (Bryson & Ho, 1975). A feedback control is desirable because of its beneficial properties like robustness with respect to noise and modeling uncertainties. However, solving the associated Hamilton–Jacobi–Bellman (HJB) equation demands a very large (rather infeasible) number of computations and storage

∗ Corresponding address: Department of Mechanical and Aerospace Engineering, University of Missouri–Rolla, 1870 Miner Circle, Rolla, MO 65409, USA. Tel.: +1 573 341 4675; fax: +1 573 341 4607.

E-mail addresses: [email protected] (R. Padhi), [email protected] (N. Unnikrishnan), [email protected] (X. Wang), [email protected] (S.N. Balakrishnan).

1 Tel.: +91 80 2293 2756; fax: +91 80 2360 0134.

space dedicated to this purpose. An innovative idea was proposed by Werbos (1992) to get around this numerical complexity by using an 'Approximate Dynamic Programming (ADP)' formulation. The solution to the ADP formulation is obtained through a dual neural network approach called the Adaptive Critic (AC). In one version of the AC approach, called Dual Heuristic Programming (DHP), one network (called the action network) represents the mapping between the state and control variables, while a second network (called the critic network) represents the mapping between the state and costate variables. The optimal solution is reached after the two networks iteratively train each other successfully. This training process is carried out for a very large number of states within a 'domain of interest', within which the closed-loop state is supposed to lie during the operation of the plant. Note that the domain of interest for which the networks need to be trained can be larger than the actual domain of operation, and hence it is usually not a difficult task to guess such a domain while doing the off-line training.



Nomenclature

X        State vector
U        Control vector
J        Cost function
Ψ        Utility function
λ        Costate vector
S_D      Riccati matrix
K_D      Gain matrix
Δt       Discrete time interval
μ_T      Mean time
σ_T      Standard deviation
ȳ_i      Sample mean of the i-th data group
n        Data size
t_0      Test statistic
A        Area of electrostatic plate
ε        Permittivity
g_0      Initial air gap
m        Mass of electrostatic plate
b        Damping constant
R        Resistance
Q        Charge
g        Gap between plate and base
ġ        Rate of change of gap
Q_W, R_W Weighting matrices for state and control respectively

This DHP process, aided by the nonlinear function approximation capabilities of neural networks, overcomes the computational complexity that had been the bottleneck of the dynamic programming approach. Another important advantage of this method is that the solution can be implemented on-line. This is because for computing the control on-line, one needs only to use (not train) the neural networks; the time-consuming neural network training process can be carried out off-line. Proofs for both the stability of the AC algorithm and the fact that the process converges to the optimal control are found in Liu and Balakrishnan (2000) for linear systems. A related but separate development towards stability and global convergence proofs is found in Murray, Cox, Lendaris, and Saeks (2002) for input-affine nonlinear systems in their discussion of 'adaptive dynamic programming', the philosophy of which is closely related to approximate dynamic programming.

Among many successful uses of the AC method for nonlinear control design, we cite Balakrishnan and Biega (1996), in which the authors solved an aircraft control problem using this technique, and Han and Balakrishnan (2002), where the adaptive critic technique was used for agile missile control. Another application of the adaptive critic technique can be seen in Venayagamoorthy, Harley, and Wunsch (2002), where the authors used it for neurocontrol of a turbogenerator. The authors in Ferrari and Stengel (2002) implemented an adaptive critic global controller for a business jet. Recently, Padhi and Balakrishnan have extended the applicability of this technique to distributed parameter systems (Padhi, 2001; Padhi & Balakrishnan, 2003a, 2003b). In fact, various types of AC designs are available in the literature; an interested reader can refer to Prokhorov and Wunsch (1997) for more details. More details about the use of neural networks for control applications can be found in Hunt (1992) and Miller, Sutton and Werbos (1990).

In this paper a significant improvement to the adaptive critic architecture is proposed. It is named the Single Network Adaptive Critic (SNAC) because it uses only the critic network instead of the action–critic dual network setup used in a typical adaptive critic architecture. SNAC is applicable to a large class of problems for which the optimal control (stationary) equation is explicitly solvable for the control in terms of the state and costate variables. As an added benefit, the iterative training loops between the action and critic networks are no longer required. This leads to significant computational savings, besides eliminating the approximation error due to the action network. The computational savings are quantitatively demonstrated in this paper by solving two challenging nonlinear problems. Note that while applying both the AC and SNAC techniques discussed in this paper, we assume that the plant equations and parameters are known.

In the control literature, there is an alternate approach for solving optimal control problems using a neural network trained with the back propagation through time (BPTT) approach (Prokhorov, 2003). An interested reader can find the details of BPTT in Werbos (1990). In Prokhorov (2003), the author attempted to compare the computational complexity of the BPTT based approach with the approach presented in Padhi and Balakrishnan (2003a). Prokhorov was able to solve the problem successfully in a single network framework, as opposed to the dual-network framework presented in Padhi and Balakrishnan (2003a). However, even though the motivation was to carry out a comparison study for computational complexity, no 'quantitative' comparison was made. In this paper, it is clearly shown through comparison studies why SNAC is better for a certain class of problems. Unlike the approach in Prokhorov (2003), the SNAC approach presented in this paper is more control-designer friendly, since the neural networks embed more control theoretic knowledge. For example, for control affine problems with a quadratic cost function, the critic serves as the state dependent Riccati operator, which has rich information and interpretation in a control theoretic sense.

The rest of the paper is organized as follows. In Section 2, the approximate dynamic programming technique is outlined. In Section 3, the AC (DHP) technique is outlined. In Section 4, the newly-developed SNAC technique is presented in detail. Numerical results for two different example problems (with increasing complexity) are presented in Section 5. Conclusions are drawn in Section 6.

2. Approximate dynamic programming

In this section, the principles of approximate (discrete) dynamic programming, on which both the AC and SNAC


approaches rely, are described. An interested reader can find more details about the derivations in Balakrishnan and Biega (1996) and Werbos (1992).

In a discrete-time formulation, we want to find an admissible control $U_k$ which causes the system described by the state equation

$X_{k+1} = F_k(X_k, U_k)$ (1)

to follow an admissible trajectory from an initial point $X_1$ to a final desired point $X_N$ while minimizing a desired cost function $J$ given by

$J = \sum_{k=1}^{N-1} \Psi_k(X_k, U_k)$ (2)

where the subscript $k$ denotes the time step. $X_k$ and $U_k$ represent the $n \times 1$ state vector and $m \times 1$ control vector, respectively, at time step $k$. The functions $F_k$ and $\Psi_k$ are assumed to be differentiable with respect to both $X_k$ and $U_k$. Moreover, $\Psi_k$ is assumed to be convex (e.g. a quadratic function in $X_k$ and $U_k$). One can notice that when $N \to \infty$, this leads to the infinite time problem. The aim is to find $U_k$ as a function of $X_k$, so that the control can be implemented as feedback.

Note that a prime requirement for applying AC or SNAC is to formulate the problem in a discrete-time framework. If the discrete system dynamics and cost function are not available, the continuous expressions can be discretized before deriving the costate and optimal control equations. In this process, the control designer has the freedom of using any appropriate discretization scheme. For example, one can use the Euler approximation for the state equation and the trapezoidal approximation for the cost function (Gupta, 1995).
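To make this discretization step concrete, a minimal sketch in Python/NumPy is given below (the paper's own implementation was in MATLAB; the toy plant `f` and the names `step_euler` and `utility` are assumptions for illustration only):

```python
import numpy as np

dt = 0.01  # discrete time interval (Delta t)

def f(x, u):
    # Placeholder continuous dynamics x_dot = f(x, u); replace with the
    # plant of interest (this toy system is an assumption for illustration).
    return np.array([x[1], -x[0] + u[0]])

def step_euler(x, u):
    # Euler approximation of the state equation: X_{k+1} = X_k + dt * f(X_k, U_k)
    return x + dt * f(x, u)

def utility(x, u, Q, R):
    # Discretized quadratic utility: Psi_k = (X_k^T Q X_k + U_k^T R U_k) * dt / 2
    return 0.5 * (x @ Q @ x + u @ R @ u) * dt
```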

Next, the steps in obtaining the optimal control are described. First, the cost function in Eq. (2) is rewritten, for convenience, to start from time step $k$ as

$J_k = \sum_{\bar{k}=k}^{N-1} \Psi_{\bar{k}}(X_{\bar{k}}, U_{\bar{k}})$. (3)

Then Jk can be split into

$J_k = \Psi_k + J_{k+1}$ (4)

where $\Psi_k$ and $J_{k+1} = \sum_{\bar{k}=k+1}^{N-1} \Psi_{\bar{k}}$ represent the utility function at time step $k$ and the cost-to-go from time step $k+1$ to $N$, respectively. The $n \times 1$ costate vector at time step $k$ is defined as

$\lambda_k = \frac{\partial J_k}{\partial X_k}$. (5)

The necessary condition for optimality is given by

$\frac{\partial J_k}{\partial U_k} = 0$. (6)

However,

$\frac{\partial J_k}{\partial U_k} = \left(\frac{\partial \Psi_k}{\partial U_k}\right) + \left(\frac{\partial J_{k+1}}{\partial U_k}\right) = \left(\frac{\partial \Psi_k}{\partial U_k}\right) + \left(\frac{\partial X_{k+1}}{\partial U_k}\right)^T \left(\frac{\partial J_{k+1}}{\partial X_{k+1}}\right) = \left(\frac{\partial \Psi_k}{\partial U_k}\right) + \left(\frac{\partial X_{k+1}}{\partial U_k}\right)^T \lambda_{k+1}$. (7)

Thus, combining Eqs. (6) and (7), the optimal control equation can be written as

$\left(\frac{\partial \Psi_k}{\partial U_k}\right) + \left(\frac{\partial X_{k+1}}{\partial U_k}\right)^T \lambda_{k+1} = 0$. (8)

The costate equation is derived in the following way

$\lambda_k = \frac{\partial J_k}{\partial X_k} = \left(\frac{\partial \Psi_k}{\partial X_k}\right) + \left(\frac{\partial J_{k+1}}{\partial X_k}\right) = \left(\frac{\partial \Psi_k}{\partial X_k}\right) + \left(\frac{\partial X_{k+1}}{\partial X_k}\right)^T \left(\frac{\partial J_{k+1}}{\partial X_{k+1}}\right)$. (9)

Note that by using Eq. (8), on the optimal path, the costate equation (9) can be simplified to

$\lambda_k = \left(\frac{\partial \Psi_k}{\partial X_k}\right) + \left(\frac{\partial X_{k+1}}{\partial X_k}\right)^T \lambda_{k+1}$. (10)

Eqs. (1), (8) and (10) have to be solved simultaneously, along with appropriate boundary conditions, for the synthesis of the optimal control. Note that the equations derived satisfy only necessary conditions. The sufficiency condition is very difficult to verify in the case of nonlinear discrete-time systems. In this work, we assume the sufficiency conditions to hold good. We also assume that a unique optimal controller exists which will drive the system through the optimal trajectory.

Some of the broad classes of problems include fixed initial and final states, fixed initial state and free final state, etc. For the infinite time regulator class of problems, however, the boundary conditions usually take the form: $X_0$ is fixed and $\lambda_N \to 0$ as $N \to \infty$. If the state equation and cost function are such that one can obtain an explicit solution for the control variable in terms of the state and costate variables from Eq. (8), only then is SNAC applicable. Note that many control affine nonlinear systems (of the form $X_{k+1} = f(X_k) + g(X_k)U_k$) with a quadratic cost function (of the form $J = \frac{1}{2}\sum_{k=1}^{\infty}(X_k^T Q X_k + U_k^T R U_k)$) fall under such a class. In this case the explicit expression for the control is $U_k = -R^{-1}[g(X_k)]^T \lambda_{k+1}$. Such problems have wide applicability in many real-life settings, including aircraft and robot control. Note that whenever the problem formulation allows the use of SNAC, it should always be preferred over AC because of its potential advantages. Since the control at step $k$ is a function of the costate at step $k+1$, closed form solutions do not exist except for linear systems and a limited number of nonlinear systems. Even though the dynamic programming approach offers such a framework, it is well-known that it runs into the "curse-of-dimensionality", requiring huge (infeasible) amounts of computation and storage. Avoiding this computational complexity is the main contribution of the AC and SNAC techniques. The SNAC technique leads to controller expressions explicitly in terms of the costate.
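As a hedged illustration of this explicit solvability, the following snippet evaluates $U_k = -R^{-1}[g(X_k)]^T \lambda_{k+1}$ for a generic control affine system (the function names are assumptions, not the authors' code):

```python
import numpy as np

def optimal_control(x_k, lam_next, g, R):
    # Explicit solution of Eq. (8) for control affine dynamics with a
    # quadratic cost: U_k = -R^{-1} [g(X_k)]^T lambda_{k+1}.
    # g(x_k) returns the n x m input matrix; R is the m x m control weight.
    return -np.linalg.solve(R, g(x_k).T @ lam_next)
```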


3. Adaptive critics for optimal control synthesis

In this section, the application of adaptive critics (AC) for optimal control synthesis is reviewed. In an AC framework, two neural networks (called the 'action' and 'critic' networks) are iteratively trained. After successful training, these networks capture the relationship between the state and control variables, and between the state and costate variables, respectively. We review the steps in this section in fair detail.

3.1. State generation for neural network training

State generation is an important part of the training procedure for both the AC and the newly-developed SNAC. For this purpose, define $S_i = \{X_k : X_k \in \text{domain of operation}\}$, where the action and critic networks have to be trained. This set is chosen so that its elements cover a large number of points of the state space in which the state trajectories are expected to lie. Obviously, fixing such a domain before designing the control is not a trivial task. However, for the regulator class of problems, a stabilizing controller drives the states towards the origin. From this observation, a 'telescopic method' is arrived at as follows.

For $i = 1, 2, \ldots$ define the set $S_i$ as $S_i = \{X_k : \|X_k\|_\infty \leq c_i\}$, where $c_i$ is a positive constant. At the beginning, a small value of $c_1$ is fixed and both networks are trained with the states generated in $S_1$. After convergence, $c_2$ is chosen such that $c_2 > c_1$. Then the networks are trained again for states within $S_2$, and so on. Values of $c_1 = 0.05$ and $c_i = c_1 + 0.05(i-1)$ for $i = 2, 3, \ldots$ are used in this study. The network training is continued until $i = I$, where $S_I$ covers the domain of interest.
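A minimal sketch of this telescopic state generation follows, assuming uniform sampling within each $S_i$ (the sampling distribution is not specified in the text, so that choice is an assumption):

```python
import numpy as np

def generate_states(i, n_states, n_dim, rng=None):
    # Sample points from S_i = {X : ||X||_inf <= c_i},
    # with c_1 = 0.05 and c_i = c_1 + 0.05 * (i - 1).
    if rng is None:
        rng = np.random.default_rng()
    c_i = 0.05 + 0.05 * (i - 1)
    return rng.uniform(-c_i, c_i, size=(n_states, n_dim))
```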

3.2. Neural network training

The training procedure for the action network, which captures the relationship between $X_k$ and $U_k$, is as follows (Fig. 1):

1. Generate set $S_i$ (see Section 3.1). For each element $X_k$ of $S_i$, follow the steps below:
   a. Input $X_k$ to the action network to obtain $U_k$
   b. Get $X_{k+1}$ from the state equation (1) using $X_k$ and $U_k$
   c. Input $X_{k+1}$ to the critic network to get $\lambda_{k+1}$
   d. Using $X_k$ and $\lambda_{k+1}$, calculate $U_k^t$ (target $U_k$) from the optimal control equation (8)
2. Train the action network for all $X_k$ in $S_i$, the output being the corresponding $U_k^t$.

The steps for training the critic network, which captures the relationship between $X_k$ and $\lambda_k$, are as follows (Fig. 1; a sketch of one such training pass is given after this list):

1. Generate set $S_i$ (see Section 3.1). For each element $X_k$ of $S_i$, follow the steps below:
   a. Input $X_k$ to the action network to obtain $U_k$
   b. Get $X_{k+1}$ from the state equation (1) using $X_k$ and $U_k$
   c. Input $X_{k+1}$ to the critic network to get $\lambda_{k+1}$
   d. Using $X_k$ and $\lambda_{k+1}$, calculate $\lambda_k^t$ from the costate equation (10)
2. Train the critic network for all $X_k$ in $S_i$, the output being the corresponding $\lambda_k^t$.
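The sketch below illustrates one such target-generation pass for both networks. It is a schematic, assumption-level example: `action` and `critic` stand for any trainable approximators exposing a `predict` method (unlike the MATLAB toolbox networks actually used in the paper), and the target functions encode Eqs. (8) and (10).

```python
import numpy as np

def ac_training_pass(states, action, critic, step, control_target, costate_target):
    # One dual-network (DHP) target-generation pass over the set S_i.
    U_t, lam_t = [], []
    for x_k in states:
        u_k = action.predict(x_k)           # step a: action network output U_k
        x_next = step(x_k, u_k)             # step b: state equation (1)
        lam_next = critic.predict(x_next)   # step c: critic output lambda_{k+1}
        U_t.append(control_target(x_k, lam_next))         # step d: Eq. (8) target
        lam_t.append(costate_target(x_k, u_k, lam_next))  # step d: Eq. (10) target
    # The action network is then fit on (states, U_t) and the critic on (states, lam_t).
    return np.array(U_t), np.array(lam_t)
```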

3.3. Convergence conditions

In order to check the individual convergence of the critic and action networks, a set of new states, $S_i^c$, and target outputs are generated as described in Section 3.2. Let these target outputs be $\lambda_k^t$ for the critic network and $U_k^t$ for the action network. Let the outputs from the trained networks (using the same inputs from the set $S_i^c$) be $\lambda_k^a$ for the critic network and $U_k^a$ for the action network. Tolerance values $tol_c$ and $tol_a$ are used as convergence criteria for the critic and action networks respectively. The following quantities are defined as relative errors: $e_{c_k} \triangleq \|\lambda_k^t - \lambda_k^a\|/\|\lambda_k^t\|$ and $e_{a_k} \triangleq \|U_k^t - U_k^a\|/\|U_k^t\|$. Also define $e_c \triangleq \{e_{c_k}\},\ k = 1, \ldots, |S|$ and $e_a \triangleq \{e_{a_k}\},\ k = 1, \ldots, |S|$. When $\|e_c\| < tol_c$, the convergence criterion for the critic network training is met, and when $\|e_a\| < tol_a$, the convergence criterion for the action network is met.

After successful training runs of the action and critic networks (i.e. after the convergence criteria are met), cycle error criteria are checked. For training cycle $n > 1$, the errors are defined as $err_{c_n} = \|e_{c_n} - e_{c_{n-1}}\|/\|e_{c_n}\|$ and $err_{a_n} = \|e_{a_n} - e_{a_{n-1}}\|/\|e_{a_n}\|$ for the critic and action networks respectively. Also, by defining $tol_{c_f} = \beta_c\, tol_c$ and $tol_{a_f} = \beta_a\, tol_a$, where $0 < \beta_c, \beta_a \leq 1$, the cycle convergence criterion is met (for $n > 1$) if both $|err_{c_n} - err_{c_{n-1}}| < tol_{c_f}$ and $|err_{a_n} - err_{a_{n-1}}| < tol_{a_f}$. Further discussion on this adaptive critic method can be found in Balakrishnan and Biega (1996), Padhi (2001) and Werbos (1992). Note that this iterative training cycle will not be needed in the newly-developed SNAC technique (Section 4).
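A compact sketch of these convergence tests follows (the function names and the small `eps` guard against division by zero are assumptions):

```python
import numpy as np

def relative_errors(targets, outputs, eps=1e-12):
    # e_k = ||t_k - a_k|| / ||t_k||, computed over the check set S_i^c
    return np.array([np.linalg.norm(t - a) / (np.linalg.norm(t) + eps)
                     for t, a in zip(targets, outputs)])

def network_converged(targets, outputs, tol):
    # Individual network criterion: ||e|| < tol
    return np.linalg.norm(relative_errors(targets, outputs)) < tol

def cycle_converged(err_n, err_prev, tol, beta):
    # Cycle criterion for n > 1: |err_n - err_{n-1}| < beta * tol, 0 < beta <= 1
    return abs(err_n - err_prev) < beta * tol
```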

3.4. Initialization of networks: Pre-training

Initialization plays an important role in any optimization process. Before starting the training process outlined in Section 3.2, the networks should be appropriately initialized (we call this process 'pre-training'). The following approach works well for quadratic regulator design problems. First, the system dynamics in Eq. (1) are linearized (Gopal, 1993) and the following standard representation of a linear system in the discrete-time formulation is obtained:

$X_{k+1} = A_D X_k + B_D U_k$. (11a)

Using standard discrete linear quadratic regulator (DLQR) optimal control theory (Bryson & Ho, 1975; Lewis, 1992), we can solve for the Riccati matrix $S_D$ and the gain matrix $K_D$. With the availability of $S_D$ and $K_D$, we know from DLQR theory that the following relationships are satisfied:

$\lambda_k = S_D X_k$ (11b)
$U_k = -K_D X_k$. (11c)

The critic and action networks are initially trained with the static relationships given in Eqs. (11b) and (11c) respectively, before starting the actual AC training process outlined in Section 3.2. Intuitively, the idea is to start with relationships that are 'close' to the optimal ones (at least in a small neighborhood of the origin).
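A hedged sketch of this pre-training step using SciPy's discrete Riccati solver is shown below (the linearized pair $(A_D, B_D)$ is assumed given):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def dlqr(A_D, B_D, Q, R):
    # Riccati matrix S_D and gain K_D of the discrete LQR problem, so that
    # lambda_k = S_D X_k (Eq. (11b)) and U_k = -K_D X_k (Eq. (11c)).
    S_D = solve_discrete_are(A_D, B_D, Q, R)
    K_D = np.linalg.solve(R + B_D.T @ S_D @ B_D, B_D.T @ S_D @ A_D)
    return S_D, K_D
```

The critic would then be pre-trained on pairs $(X_k, S_D X_k)$ and the action network on $(X_k, -K_D X_k)$; for SNAC pre-training, Section 4.3 uses the shifted map $\bar{S}_D = S_D(A_D - B_D K_D)$ instead.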


Fig. 1. Adaptive critic network training.

Note that there is no unique way of carrying out the pre-training process. Even though we have used the approach discussed above in the demonstrative problems in Section 5, the pre-training could also have been done with any other stabilizing control solution. For example, in Balakrishnan and Biega (1996) this pre-training process is carried out with an arbitrarily chosen (non-optimal) stabilizing controller, before starting the AC process.

4. Single network adaptive critic (SNAC) synthesis

In this section, the newly developed single network adaptive critic (SNAC) technique is discussed in detail. As mentioned in Section 1, the SNAC technique retains all the powerful features of the AC methodology while eliminating the action network completely. Note that in the SNAC design, the critic network captures the functional relationship between the state $X_k$ and the costate $\lambda_{k+1}$, whereas in the AC design the critic network captures the relationship between the state $X_k$ and the costate $\lambda_k$. However, the SNAC method is applicable only to problems where the optimal control equation (8) is explicitly solvable for the control variable $U_k$ in terms of the state variable $X_k$ and costate variable $\lambda_{k+1}$ (control affine systems with quadratic cost functions fall into this class); no such restriction exists for the AC technique. As mentioned earlier, Eqs. (1), (8) and (10) have to be solved simultaneously, along with appropriate boundary conditions, for the synthesis of the optimal control. If the state equation and cost function are such that one can obtain an explicit solution for the control variable in terms of the state and costate variables from Eq. (8), only then is SNAC applicable. Note that many control affine nonlinear systems (of the form $X_{k+1} = f(X_k) + g(X_k)U_k$) with a quadratic cost function (of the form $J = \frac{1}{2}\sum_{k=1}^{\infty}(X_k^T Q X_k + U_k^T R U_k)$) fall under such a class. In this case the explicit expression for the control is $U_k = -R^{-1}[g(X_k)]^T \lambda_{k+1}$. Because this explicit relation exists between the costate and the control, the SNAC structure maps $X_k$ (the state at instant $k$) to $\lambda_{k+1}$ (the costate at instant $k+1$) and computes the control value from it. On examining the optimal control expression in terms of the costate and comparing it with the optimal control law for a linear system of the form $x_{k+1} = Ax_k + Bu_k$, it can be seen from the structure in Fig. 2 that the critic network captures the relation $\lambda_{k+1} = (I + PBR^{-1}B^T)^{-1}PAx_k$ for linear systems, where $P$ is the solution of the algebraic Riccati equation. This work extends this result to nonlinear systems using the nonlinear function approximation properties of neural networks.

Fig. 2. Single network adaptive critic scheme.

4.1. Neural network training

In the SNAC approach, the steps for training the critic network, which captures the relationship between $X_k$ and $\lambda_{k+1}$, are as follows (Fig. 2; a sketch of one such pass is given after this list):

1. Generate $S_i$ (see Section 3.1). For each element $X_k$ of $S_i$, follow the steps below:


   a. Input $X_k$ to the critic network to obtain $\lambda_{k+1} = \lambda_{k+1}^a$
   b. Calculate $U_k$ from the optimal control equation (8), since $X_k$ and $\lambda_{k+1}$ are known
   c. Get $X_{k+1}$ from the state equation (1) using $X_k$ and $U_k$
   d. Input $X_{k+1}$ to the critic network to get $\lambda_{k+2}$
   e. Using $X_{k+1}$ and $\lambda_{k+2}$, calculate $\lambda_{k+1}^t$ from the costate equation (10)
2. Train the critic network for all $X_k$ in $S_i$, the output being the corresponding $\lambda_{k+1}^t$.
3. Check for convergence of the critic network (Section 4.2). If convergence is achieved, revert to step 1 with $i = i + 1$. Otherwise, repeat steps 1–2.
4. Continue steps 1–3 until $i = I$.
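A sketch of one SNAC target-generation pass following steps a–e above is shown below (an assumption-level interface, as in the Section 3.2 sketch; `critic` maps $X_k$ to $\lambda_{k+1}$):

```python
import numpy as np

def snac_training_pass(states, critic, step, optimal_control, costate_target):
    # One SNAC critic target-generation pass over the set S_i.
    lam_t = []
    for x_k in states:
        lam_next = critic.predict(x_k)        # step a: lambda_{k+1}^a
        u_k = optimal_control(x_k, lam_next)  # step b: Eq. (8)
        x_next = step(x_k, u_k)               # step c: state equation (1)
        lam_next2 = critic.predict(x_next)    # step d: lambda_{k+2}
        lam_t.append(costate_target(x_next, lam_next2))  # step e: Eq. (10) target
    # The critic is then fit on (states, lam_t); no action network is involved.
    return np.array(lam_t)
```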

4.2. Convergence condition

The convergence check in the SNAC scheme is carried out as in the AC case. First, a set $S_i^c$ of states is generated as explained in Section 3.1. Let the target outputs be $\lambda_{k+1}^t$ and the outputs from the trained network (using the same inputs from the set $S_i^c$) be $\lambda_{k+1}^a$. A tolerance value $tol$ is used to test the convergence of the critic network. By defining the relative error $e_{c_k} \triangleq \|\lambda_{k+1}^t - \lambda_{k+1}^a\|/\|\lambda_{k+1}^t\|$ and $e_c \triangleq \{e_{c_k}\},\ k = 1, \ldots, |S|$, the training process is stopped when $\|e_c\| < tol$.

4.3. Initialization of networks: Pre-training

For regulator problems, as in Section 3.4, the idea is to pre-train the neural network(s) with the solution of the linearized problem (using DLQR theory). However, note that $S_D$ gives the relationship between $X_k$ and $\lambda_k$ (see Eq. (11b)), whereas the critic network in SNAC has to be trained to capture the functional relationship between $X_k$ and $\lambda_{k+1}$. This can be done by observing that

$\lambda_{k+1} = S_D X_{k+1} = S_D(A_D X_k + B_D U_k) = S_D(A_D X_k - B_D K_D X_k) = \bar{S}_D X_k$ (11d)

where $\bar{S}_D \triangleq S_D(A_D - B_D K_D)$. The relationship in Eq. (11d) is used to pre-train the networks.

Once the iterative process of training the critic network is accomplished, the SNAC approach converges to the solution of the Riccati equation; this is shown in the Appendix. As in the case of numerical solutions to the algebraic Riccati equation for linear systems, an initial stabilizing controller is required to ensure convergence of SNAC to the solution of the algebraic Riccati equation.

5. Numerical results

In this section, numerical results from two representative problems are reported. These are (i) a Van der Pol's oscillator and (ii) an electrostatic actuator. The goals of this study are (i) to investigate the performance of the newly-developed SNAC controller in stabilizing nonlinear systems and (ii) to compare quantitatively the computations required by the SNAC and the AC. A personal computer with a 930 MHz Pentium III processor and 320 MB of RAM was used to conduct the numerical experiments. The software used for training was MATLAB V. 5.2, Release 12. The Neural Network Toolbox V.3.0 in MATLAB was used with the Levenberg–Marquardt back-propagation scheme (Hagan, Demuth, & Beale, 1996) for training the networks.

5.1. Example 1: Van der Pol’s oscillator

5.1.1. Problem description and optimality conditions

The motivation for selecting a Van der Pol oscillator was that it is a nonlinear benchmark problem (Yesildirek, 1994). The homogeneous system for this problem has an unstable equilibrium point at the origin ($x_1 = x_2 = 0$), and the system has a stable limit cycle as well. These properties make it a non-trivial regulator problem in the sense that without an appropriate control, the states starting from any non-zero initial condition will never go to zero (rather, they will approach the limit cycle).

The system dynamics of a Van der Pol's oscillator are given by

$\dot{x}_1 = x_2$
$\dot{x}_2 = \alpha(1 - x_1^2)x_2 - x_1 + (1 + x_1^2 + x_2^2)u$. (12)

Our goal in this problem is to drive $X \triangleq [x_1, x_2]^T \to 0$ as $t \to \infty$. A quadratic cost function is formulated as

$J = \frac{1}{2}\int_0^\infty \left(X^T Q_W X + R_W u^2\right) dt$ (13)

where $Q_W = \mathrm{diag}(q_1, q_2)$ and $R_W = 1$. Discretization of Eqs. (12) and (13) using the Euler and trapezoidal methods (Gupta, 1995) leads to

$\begin{bmatrix} x_{1_{k+1}} \\ x_{2_{k+1}} \end{bmatrix} = \begin{bmatrix} x_{1_k} \\ x_{2_k} \end{bmatrix} + \Delta t \begin{bmatrix} x_{2_k} \\ \alpha(1 - x_{1_k}^2)x_{2_k} - x_{1_k} + (1 + x_{1_k}^2 + x_{2_k}^2)u_k \end{bmatrix}$ (14)

$J = \sum_{k=1}^{N \to \infty} \frac{1}{2}\left(X_k^T Q_W X_k + R_W u_k^2\right)\Delta t$. (15)

Note that the discretized cost function is slightly different from what would have been obtained by using the trapezoidal method as such. We ignore the slight discrepancy at $k = 1, (N-1)$ for the sake of simplicity (the case $k = N-1$ does not matter here since $N \to \infty$). Substituting $\Psi_k = (X_k^T Q_W X_k + R_W u_k^2)\,\Delta t/2$ in Eqs. (8) and (10) yields the following optimal control and costate equations:

$u_k = -R_W^{-1}\left(1 + x_{1_k}^2 + x_{2_k}^2\right)\lambda_{2_{k+1}}$ (16)

$\begin{bmatrix} \lambda_{1_k} \\ \lambda_{2_k} \end{bmatrix} = \Delta t\, Q_W X_k + \begin{bmatrix} 1 & \Delta t\left[2x_{1_k}u_k - 1 - 2\alpha x_{1_k}x_{2_k}\right] \\ \Delta t & 1 + \Delta t\left[2x_{2_k}u_k + \alpha(1 - x_{1_k}^2)\right] \end{bmatrix} \begin{bmatrix} \lambda_{1_{k+1}} \\ \lambda_{2_{k+1}} \end{bmatrix}$. (17)
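For reference, a hedged sketch of Eqs. (14), (16) and (17) in code is given below (parameter values follow Section 5.1.2; $\alpha$ is left as a parameter since its numerical value is not stated in this excerpt, and the function names are assumptions):

```python
import numpy as np

dt, QW, RW = 0.01, np.diag([1.0, 2.0]), 1.0

def vdp_step(x, u, alpha):
    # Eq. (14): Euler-discretized Van der Pol dynamics
    x1, x2 = x
    return x + dt * np.array([x2,
                              alpha * (1 - x1**2) * x2 - x1
                              + (1 + x1**2 + x2**2) * u])

def vdp_control(x, lam_next):
    # Eq. (16): explicit optimal control from state and next costate
    x1, x2 = x
    return -(1.0 / RW) * (1 + x1**2 + x2**2) * lam_next[1]

def vdp_costate(x, u, lam_next, alpha):
    # Eq. (17): lambda_k = dt * QW * X_k + (dX_{k+1}/dX_k)^T lambda_{k+1}
    x1, x2 = x
    JT = np.array([[1.0, dt * (2 * x1 * u - 1 - 2 * alpha * x1 * x2)],
                   [dt,  1 + dt * (2 * x2 * u + alpha * (1 - x1**2))]])
    return dt * (QW @ x) + JT @ lam_next
```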


5.1.2. Selection of design parameters

For this problem, we chose $\Delta t = 0.01$, $Q_W = \mathrm{diag}(1, 2)$, $R_W = 1$, $tol_a = tol_c = 0.05$ and $\beta_c = \beta_a = 0.2$. The domain of interest (for neural network training) was chosen to be $S_I = \{X : |x_i| \leq 1,\ i = 1, 2\}$. The 'telescopic method' described in Section 3.1 was used for state generation. Each time, 1000 points were randomly selected for training the networks.

In the AC synthesis, the critic network was selected such that it consists of two sub-networks, each having a 2-6-1 structure (i.e. two neurons in the input layer, six neurons in the hidden layer and one neuron in the output layer). Similarly, in the SNAC synthesis, the critic network was also assumed to have two sub-networks, each having a 2-6-1 structure. For the activation functions, the hyperbolic tangent function was selected for the input and hidden layers and a linear function was chosen for the output layer (in both critic and action networks).

The reason for choosing two 2-6-1 sub-networks as the structure for the critic network (instead of a single 2-6-2 network) is as follows. In optimal control theory, even though the costate (critic) variables are essential, they have no physical meaning. Prior to arriving at the optimal solution, the order of magnitude the costates can take is unknown. Because of this, one element of the costate vector can be of very low order whereas the other can be of very high order. In such a case, even though it is theoretically sound to assume a single 2-6-2 network, there can be numerical problems in training. It may also lead to non-optimal values for some components of the costate vector. This may happen in spite of the convergence check criterion being met, because the error in the high magnitude component may suppress the error in the low magnitude component in the back-propagation training process. Such problems will not arise if we assume two 2-6-1 networks instead, since in this case the weights of the individual networks see only the error for a particular component of the costate vector. For the action network in the AC synthesis, a 2-6-1 structure was chosen. Note that the control vector for this problem has only one component, and hence there is no need for a sub-network structure.

It is well-known in the neural network literature that a two-layer neural network with a sufficient number of neurons in the hidden layer can approximate any continuous function with an arbitrarily small error bound (Barto, Sutton, & Anderson, 1983). However, to the best of our knowledge, the number of neurons required for any particular application is still an open problem. Hence, selecting the structure of a neural network is more of an art than a science. In our application problems, we selected six neurons for the hidden layer, and this was satisfactory in the sense that we were able to meet the convergence tolerance values that we chose, which led to satisfactory simulation results.

5.1.3. Analysis of results

After synthesizing the neural networks as discussed in Section 4, we carried out simulation studies to validate the proposed methods. Arbitrary initial conditions for the states were chosen from the same domain of initial conditions as used for training. Using the synthesized neural networks for computing the control, the system was simulated for 10 s (large enough for our purpose). Simulations showed that both states were successfully driven to zero (the control goes to zero as well) with time. Figs. 3 and 4 are plots of the system state (i.e. position and velocity) histories. It can be seen that both the AC and SNAC techniques perform well in regulating the states (i.e. driving them to zero). Fig. 5 is the plot of the corresponding control history which, as expected, goes to zero as well. Note that the trajectories from the AC and SNAC techniques are very close to each other. This shows that the newly developed SNAC approach is as good as the AC approach in synthesizing the optimal controller.

Fig. 3. Position trajectories (Van der Pol's oscillator).

Fig. 4. Velocity trajectories (Van der Pol's oscillator).

As pointed out earlier, one of the main advantages of the SNAC approach over the AC is that it leads to substantial computational savings. To demonstrate this quantitatively, both techniques were statistically analyzed based on ten independent runs. Fig. 6 shows the time taken by the AC and SNAC methods to complete the process. This plot clearly indicates that it takes significantly less time to train the network using the SNAC


Fig. 5. Corresponding control trajectories (Van der Pol’s oscillator).

Fig. 6. Training time comparison [AC/SNAC] (Van der Pol’s oscillator).

technique as compared to the AC technique. It was observed that $\mu_{T_{SNAC}} = 0.52\,\mu_{T_{AC}}$, where $\mu_{T_{SNAC}} = 339.8445$ s and $\mu_{T_{AC}} = 642.0185$ s are the mean times taken to train (including checking for convergence) using the SNAC and AC schemes respectively. It was also observed that $\sigma_{T_{SNAC}} = 1.676839$ s and $\sigma_{T_{AC}} = 1.887459$ s, where $\sigma_{T_{SNAC}}$ and $\sigma_{T_{AC}}$ are the standard deviations for the SNAC and AC schemes respectively. The small standard deviation values indicate that there is not much variation in the total time taken for each test run, and hence both techniques are fairly consistent.

Next, we viewed the data (run times) as if they were a random sample from a normal distribution. In that case the appropriate test statistic for comparing whether the two means are significantly different is the two-sample t-test (Montgomery, 2001). The statistic used is

$t_0 = \frac{\bar{y}_1 - \bar{y}_2}{S_p\sqrt{\frac{1}{n} + \frac{1}{n}}}$ (18)

where $\bar{y}_i$ is the sample mean of the $i$-th group, $n$ is the sample size, and $S_p^2$ is an estimate of the common variance. Details on how to compute $S_p^2$ are given in Montgomery (2001).

Fig. 7. Average training times for each step (Van der Pol's oscillator).

Fig. 8. Cost comparison for different initial conditions (Van der Pol's oscillator).

To determine whether to reject the hypothesis that the two means of the training time data are the same, we compare $t_0$ to the t distribution with $2n - 2$ degrees of freedom. If $|t_0| > t_{\alpha/2,2n-2}$ (where $\alpha$ is the significance level of the test), we conclude that the two means are different. It was seen that $t_0 = 378.4785 > t_{0.025,18} = 2.101$. Hence, we conclude that the two means (training time using SNAC and training time using AC) are significantly different (which we had already observed from the fact that $\mu_{T_{SNAC}} = 0.52\,\mu_{T_{AC}}$).

Fig. 7 compares the average times taken by AC and SNAC for each step in the telescopic training process discussed earlier. The average was taken over ten independent runs. From the figure, it is evident that the means of the two sets of data are significantly different. The cost comparison based on Eq. (15) for six different sets of initial conditions using the AC and SNAC methodologies, for a simulation of $t_f = 10$ s, is given in Fig. 8. It can be seen that the costs for the AC and SNAC schemes are very close to each other in each case.

Next, a paired comparison design test (Montgomery, 2001) was conducted to determine whether the two sets of costs are statistically different. Blocking is a design technique used to improve the precision with which comparisons among the factors of interest are made; a block is a set of relatively homogeneous experimental conditions. In our case, the initial condition of the simulation is the block. The test statistic used to test whether the two sets of data are statistically different is

$t_0 = \frac{\bar{d}}{S_d/\sqrt{n}}$ (19)


where $\bar{d}$ is the sample mean of the differences, $S_d$ is the sample standard deviation of the differences and $n$ is the number of samples. An interested reader can refer to Montgomery (2001) for details. The computed value of the paired t-test statistic is $t_0 = -0.2857$. Since $|t_0| = 0.2857 < t_{0.025,5} = 2.571$, we conclude that the two sets of data are not significantly different. This means that the costs generated by both SNAC and AC simulations are statistically similar, which confirms once more that the SNAC technique finds the same optimal controller as the AC technique.
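A sketch of this paired test using SciPy follows (Eq. (19) is what `ttest_rel` computes on the per-block cost differences; each pair shares an initial condition, i.e. a block):

```python
from scipy import stats

def costs_differ(costs_ac, costs_snac, alpha=0.05):
    # Paired t-test of Eq. (19): t0 = dbar / (S_d / sqrt(n))
    # on the differences of AC and SNAC costs for matching initial conditions.
    t0, p = stats.ttest_rel(costs_ac, costs_snac)
    return t0, p < alpha
```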

5.2. Example 2: A micro-electro-mechanical-system (MEMS) actuator

5.2.1. Problem statement and optimality conditions

The next problem considered in this study is a MEMS device, namely an electrostatic actuator (Senturia, 2001). In addition to demonstrating the computational advantage, this problem also shows that the SNAC technique is applicable to complex engineering systems of practical significance. The schematic diagram for this problem is shown in Fig. 9.

Fig. 9. Electrostatic actuator.

There are two interlinked domains in the dynamics of the system: the electrical domain and the mechanical domain. The governing equations are given by

$\dot{Q} - \frac{1}{R}\left(V_{in} - \frac{Qg}{\varepsilon A}\right) = 0$
$m\ddot{g} + b\dot{g} + k(g - g_0) + \frac{Q^2}{2\varepsilon A} = 0$ (20)

where $Q$ denotes the charge, $g$ the gap between the plate and the base, and $\dot{g}$ the rate of change of the gap when the plate moves. $V_{in}$ is the input voltage used to move the plate to the desired position. The mass $m$ represents the mechanical inertia of the moving plate; a dashpot $b$ captures the mechanical damping forces that arise from the viscosity of the air that gets squeezed when the plate moves; a spring $k$ represents the stiffness encountered when the plate actuator moves; and $R$ is the source resistance of the voltage source that drives the transducer. The various parameters used in Eq. (20), along with their associated values, are given in Table 1.

Table 1
Parameters used in modeling the electrostatic actuator (Senturia, 2001)

Parameter          Symbol   Value   Units
Area               A        100     µm²
Permittivity       ε        1       C²/(N µm²)
Initial gap        g_0      1       µm
Mass               m        1       mg
Damping constant   b        0.5     mg/s
Spring constant    k        1       mg/s²
Resistance         R        0.001   Ω

Defining the state variable $Z = [z_1\ z_2\ z_3]^T = [Q\ g\ \dot{g}]^T$, Eq. (20) can be written as

$\dot{z}_1 = \frac{1}{R}\left(V_{in} - \frac{z_1 z_2}{\varepsilon A}\right)$
$\dot{z}_2 = z_3$
$\dot{z}_3 = -\frac{1}{m}\left(\frac{z_1^2}{2\varepsilon A} + bz_3 + k(z_2 - g_0)\right)$. (21)

The function of the control input in this problem is to bring the plate to a desired position, i.e. the gap $g$ has to be maintained at some desired value. We selected the desired value of the gap as 0.5 µm. An optimal controller is designed to drive the plate to this desired value. At the equilibrium point, $z_2 = 0.5$ and $\dot{Z} = 0$. Using this information in Eq. (21) leads to

$\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} \frac{1}{R}\left(V_{in} - \frac{z_1}{2\varepsilon A}\right) \\ z_3 \\ -\frac{1}{m}\left(\frac{z_1^2}{2\varepsilon A} + k(0.5 - g_0) + bz_3\right) \end{bmatrix}$. (22)

Solving Eq. (22) for $z_1$, $z_3$ and $V_{in}$, the values of the states at the equilibrium (operating) point are obtained as $Z_0 = [10\ 0.5\ 0]^T$ and the associated steady state control value is $V_{in_0} = 0.05$. Next, the deviated state is defined as $X = [x_1\ x_2\ x_3]^T \triangleq Z - Z_0$ and the deviated control as $u \triangleq (V_{in} - V_{in_0})$. In terms of these variables, the error dynamics of the system are

$\dot{x}_1 = \frac{1}{R}\left(u - \frac{x_1}{2\varepsilon A} - \frac{x_2}{\sqrt{\varepsilon A}} - \frac{x_1 x_2}{\varepsilon A}\right)$
$\dot{x}_2 = x_3$
$\dot{x}_3 = -\frac{1}{m}\left(\frac{x_1^2}{2\varepsilon A} + \frac{x_1}{\sqrt{\varepsilon A}} + kx_2 + bx_3 + \frac{1}{2} + \frac{k}{2} - g_0 k\right)$. (23)

Now an optimal regulator problem can be formulated to drive $X \to 0$ with a cost function $J$ given by

$J = \frac{1}{2}\int_0^\infty \left(X^T Q_w X + R_w u^2\right) dt$ (24)

where $Q_w \geq 0$ and $R_w > 0$ are weighting matrices for the state and control respectively. Next, using the Euler and trapezoidal techniques (Gupta, 1995), the state equation and cost function


were discretized as follows (see Box I):

Box I:
$\begin{bmatrix} x_{1_{k+1}} \\ x_{2_{k+1}} \\ x_{3_{k+1}} \end{bmatrix} = X_k + \Delta t \begin{bmatrix} \frac{u_k}{R} - \frac{x_{1_k}}{2\varepsilon AR} - \frac{x_{2_k}}{R\sqrt{\varepsilon A}} - \frac{x_{1_k}x_{2_k}}{R\varepsilon A} \\ x_{3_k} \\ -\frac{x_{1_k}^2}{2\varepsilon Am} - \frac{x_{1_k}}{m\sqrt{\varepsilon A}} - \frac{kx_{2_k}}{m} - \frac{bx_{3_k}}{m} - \frac{1}{2m} - \frac{k}{2m} + \frac{g_0 k}{m} \end{bmatrix}$

$J = \sum_{k=1}^{N \to \infty} \frac{1}{2}\left(X_k^T Q_W X_k + R_W u_k^2\right)\Delta t$. (25)

Next, by using $\Psi_k = (X_k^T Q_W X_k + R_W u_k^2)\,\Delta t/2$ in Eqs. (8) and (10), the optimal control and costate equations can be obtained as follows:

$u_k = -R_w^{-1}\,\frac{\lambda_{1_{k+1}}}{R}$ (26)

$\lambda_k = \Delta t\, Q_w X_k + \left[\frac{\partial F_k}{\partial X_k}\right]^T \lambda_{k+1}$ (27)

where $F_k$ stands for all the terms on the right hand side of Box I.
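A hedged sketch of Box I and Eq. (26) in code follows (parameter values come from Table 1; the function names are assumptions for illustration):

```python
import numpy as np

dt = 0.01
A_, eps, g0, m, b, k, R = 100.0, 1.0, 1.0, 1.0, 0.5, 1.0, 0.001
epsA = eps * A_  # the product (epsilon * A) appearing throughout Eq. (23)

def mems_step(x, u):
    # Box I: Euler-discretized error dynamics of Eq. (23)
    x1, x2, x3 = x
    f = np.array([u / R - x1 / (2 * epsA * R) - x2 / (R * np.sqrt(epsA))
                  - x1 * x2 / (R * epsA),
                  x3,
                  -x1**2 / (2 * epsA * m) - x1 / (m * np.sqrt(epsA))
                  - k * x2 / m - b * x3 / m - 1 / (2 * m) - k / (2 * m)
                  + g0 * k / m])
    return x + dt * f

def mems_control(lam_next, Rw=1.0):
    # Eq. (26): u_k = -Rw^{-1} * lambda_{1,k+1} / R
    return -(1.0 / Rw) * lam_next[0] / R
```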

5.2.2. Selection of design parameters

For this problem, values of $\Delta t = 0.01$, $Q_w = I_3$, $R_w = 1$, $tol_a = tol_c = 0.05$ and $\beta_c = \beta_a = 0.01$ were chosen, and the domain of the state was $S_I = \{X : |x_i| \leq 1,\ i = 1, 2, 3\}$. The 'telescopic method' described in Section 3.1 was used for state generation. Each time, 1000 points were randomly selected for training the networks. In the SNAC synthesis, the tolerance value $tol = 0.05$ was selected for the convergence check.

Following the discussion in Section 5.1.2, in the AC synthesis the critic network was selected to have three sub-networks, each having a 3-6-1 structure. A 3-6-1 network was selected as the action network. Three sub-networks, each having a 3-6-1 structure, were used in the SNAC process. In each network, the hyperbolic tangent function was chosen for the input and hidden layers and a linear function was chosen for the output layer.

5.2.3. Analysis of results

Simulations were carried out using the same initial conditions for both AC and SNAC schemes. For demonstration purposes, the initial condition chosen is $[Q\ g\ \dot{g}]^T_{t=0} = [9.85\ 1.5\ -1]^T$. Figs. 10–12 show the trajectories of $Q$, $g$ and $\dot{g}$ respectively over twenty-five seconds using both AC and SNAC techniques. These figures indicate that both the AC and SNAC schemes performed well in driving the states to their respective steady state values. As before, the state trajectories from both AC and SNAC techniques are very similar to each other. Fig. 13 shows the control trajectory obtained using the two schemes, which again shows that the SNAC technique is capable of obtaining the same optimal control solution as the AC technique. Note that the control trajectories in both schemes also settle at the steady state value $V_{in_0} = 0.05$. It can be seen from Fig. 11 that the position of the actuator has been forced to the desired value of 0.5 µm. The velocity of the plate is driven to the steady state value of zero and the charge is driven to its desired steady state value.

Fig. 10. SNAC/AC trajectories for charge (Electrostatic actuator).

Fig. 11. SNAC/AC trajectories for position (Electrostatic actuator).

We carried out an analysis similar to that in Section 5.1 for comparing the computational performances of the two techniques. Fig. 14 illustrates the times taken by each scheme under comparison (AC/SNAC) in our study for the MEMS problem, over ten independent runs.

It was observed that $\mu_{T_{SNAC}} = 0.59\,\mu_{T_{AC}}$, where $\mu_{T_{SNAC}} = 531.4063$ s and $\mu_{T_{AC}} = 890.3377$ s are the mean times taken to train (including checking for convergence) in the SNAC and AC schemes respectively.


Fig. 12. SNAC/AC trajectories for velocity (Electrostatic actuator).

Fig. 13. Associated control trajectories (Electrostatic actuator).

Fig. 14. Training time comparison [AC/SNAC] (Electrostatic actuator).

It was also observed that $\sigma_{T_{SNAC}} = 2.224001$ s and $\sigma_{T_{AC}} = 2.098956$ s, where $\sigma_{T_{SNAC}}$ and $\sigma_{T_{AC}}$ are the standard deviations for the SNAC and AC schemes respectively. The small standard deviation values indicate that there is not much variation in the total time taken for each test run, and hence both techniques are fairly consistent.

Fig. 15. Average training times for each step (Electrostatic actuator).

Fig. 16. Cost comparison for different initial conditions (Electrostaticactuator).

A two-sample t-test was performed on the data over the ten independent runs (similar to the exercise in Section 5.1.3). The test statistic obtained was $t_0 = 371.1623$. Since $t_0 = 371.1623 > t_{0.025,18} = 2.101$, we conclude that the two training time means are statistically different.

Fig. 15 compares the average times taken by AC and SNAC for each step in the telescopic training process discussed earlier. The average was taken over ten independent runs. From the figure, it is evident that the means of the two sets of data are significantly different. The cost comparison based on Eq. (25) for six different sets of initial conditions using the AC and SNAC techniques, for a simulation of $t_f = 25$ s (large enough for the $t_f \to \infty$ approximation), is given in Fig. 16. It can be seen that the costs for the AC and SNAC schemes are very close to each other in each case. On conducting a paired comparison test (similar to the test mentioned in Section 5.1.3) on the two sets of cost data over six tests, the test statistic was calculated to be $t_0 = 0.5371$. Since $t_0 = 0.5371 < t_{0.025,5} = 2.571$, we conclude that the two sets of cost data are not statistically different and hence the SNAC technique leads to the same optimal solution as the AC technique.

6. Conclusions

A new single network adaptive critic (SNAC) approach is presented for optimal control synthesis. This approach is applicable to a wide class of nonlinear systems. The technique essentially retains all the powerful properties


of a typical adaptive critic (AC) technique. However, in SNAC, the action network is no longer needed. As an important additional advantage, the associated iterative training loops are also eliminated. This leads to a great simplification of the architecture and results in substantial computational savings. Besides, it also eliminates the neural network approximation error due to the eliminated action network. Large computational savings with SNAC have been demonstrated using two interesting examples. In addition, the MEMS problem also demonstrates that SNAC is applicable to complex engineering systems of practical significance. Note that while applying both the AC and SNAC techniques discussed in this paper, we assume that the plant equations and parameters are known.

Acknowledgements

This research was supported by NSF-USA grants 0201076 and 0324428. The authors also express their gratitude to the anonymous reviewers, whose constructive criticisms led to substantial improvements in this paper.

Appendix

Consider a linear system described by $x_{k+1} = Ax_k + Bu_k$. The optimal control equation can be derived to be $u_k = -R^{-1}B^T\lambda_{k+1}$. On substituting the optimal control equation into the state equation, we get the closed-loop system given below, together with the costate equation.

A.1. Dominant equations

$x_{k+1} = Ax_k - BR^{-1}B^T\lambda_{k+1}$
$\lambda_k = Qx_k + A^T\lambda_{k+1}$ (A.1)

where $x \in \mathbb{R}^{n\times 1}$ is the state variable, $\lambda \in \mathbb{R}^{n\times 1}$ is the costate variable, $Q \in \mathbb{R}^{n\times n} \geq 0$ is the penalty on the states, $R \in \mathbb{R}^{m\times m} > 0$ is the penalty on the control, $A \in \mathbb{R}^{n\times n}$ with $\det(A) \neq 0$, $B \in \mathbb{R}^{n\times m}$, and $(A, B)$ is controllable.

A.2. Iterative process

The nonlinear relationship between the costate at step $k+1$ and the state at step $k$ captured by the SNAC critic neural network can be expressed as

$\lambda_{k+1} = g(x_k)$. (A.2)

On substituting Eq. (A.2) in the costate equation in Eq. (A.1), we get

$g_{n+1}(x_k) = Qx_{k+1} + A^Tg_n(x_{k+1})$
$x_{k+1} = Ax_k - BR^{-1}B^Tg_n(x_k)$, $n = 0, 1, 2, 3, \ldots$ (A.3)

In the above equation, $n$ is the training iteration number. Eq. (A.3) can be re-written as

$g_{n+1}(x_k) = Q\left(Ax_k - BR^{-1}B^Tg_n(x_k)\right) + A^Tg_n\left(Ax_k - BR^{-1}B^Tg_n(x_k)\right)$. (A.4)

Claim 1. Consider a linear system, where the mapping between $\lambda_{k+1}$ and $x_k$ is linear. Let the initial approximation be the linear relation $\lambda_{k+1} = g_0 x_k$. Then the mappings obtained at each training iteration, $g_1, g_2, \ldots, g_n$, are all linear functions of $x_k$.

Proof. By mathematical induction:

(1) If $g_0(x_k) = g_0 x_k$, then $g_1(x_k) = g_1 x_k$:

$g_1(x_k) = Q(Ax_k - BR^{-1}B^Tg_0 x_k) + A^Tg_0(Ax_k - BR^{-1}B^Tg_0 x_k)$
$= Q(A - BR^{-1}B^Tg_0)x_k + A^Tg_0(A - BR^{-1}B^Tg_0)x_k$
$= (Q + A^Tg_0)(A - BR^{-1}B^Tg_0)x_k \triangleq g_1 x_k$.

(2) If $g_n(x_k) = g_n x_k$, then $g_{n+1}(x_k) = g_{n+1} x_k$:

$g_{n+1}(x_k) = Q(Ax_k - BR^{-1}B^Tg_n x_k) + A^Tg_n(Ax_k - BR^{-1}B^Tg_n x_k)$
$= Q(A - BR^{-1}B^Tg_n)x_k + A^Tg_n(A - BR^{-1}B^Tg_n)x_k$
$= (Q + A^Tg_n)(A - BR^{-1}B^Tg_n)x_k \triangleq g_{n+1} x_k$.

(3) Hence $g_1, g_2, \ldots, g_n$ are all linear functions of $x_k$.

Assuming a linear SNAC approximation, according to Claim 1 the recursive relation can be written as

$g_{n+1}x_k = Q(A - BR^{-1}B^Tg_n)x_k + A^Tg_n(A - BR^{-1}B^Tg_n)x_k$. (A.5)

This relation holds for all $x_k$. The mapping $g_{n+1}$ can therefore be expressed as

$g_{n+1} = (Q + A^Tg_n)(A - BR^{-1}B^Tg_n)$. (A.6)

Claim 2. If the iterative process of training the critic network in the SNAC method converges, it converges to the solution of the algebraic Riccati equation.

Proof. Assume the converged value of the linear mapping between $x_k$ and $\lambda_{k+1}$ to be $g$. We can write

$g = (Q + A^Tg)(A - BR^{-1}B^Tg)$. (A.7)

The discrete-time Riccati equation is

$P = A^TPA - A^TPB(R + B^TPB)^{-1}B^TPA + Q$. (A.8)

It can be seen that the critic network in the SNAC structure ends up capturing the relation

$g = (I + PBR^{-1}B^T)^{-1}PA$. (A.9)

The matrix inversion lemma is

$(A + BCD)^{-1} = A^{-1} - A^{-1}B(DA^{-1}B + C^{-1})^{-1}DA^{-1}$. (A.10)


Eq. (A.9) comes from the relations

$\lambda_{k+1} = gx_k$ (A.11)
$\lambda_k = Px_k$. (A.12)

Substituting Eq. (A.9) in Eq. (A.7), the whole equation turns into an equation in $P$. The following shows that the resulting equation is the Riccati equation itself.

$(I + PBR^{-1}B^T)^{-1}P = Q + A^T(I + PBR^{-1}B^T)^{-1}PA - QBR^{-1}B^T(I + PBR^{-1}B^T)^{-1}P - A^T(I + PBR^{-1}B^T)^{-1}PABR^{-1}B^T(I + PBR^{-1}B^T)^{-1}P$. (A.13)

Multiplying both sides of Eq. (A.13) on the right by $(P^{-1} + BR^{-1}B^T)$, the left hand side of Eq. (A.13) reduces to

$LS: (I + PBR^{-1}B^T)^{-1}P(P^{-1} + BR^{-1}B^T) = (P^{-1} + BR^{-1}B^T)^{-1}(P^{-1} + BR^{-1}B^T) = I$. (A.14)

The right hand side of Eq. (A.13) becomes

RS := Q(P−1+ B R−1 BT) + AT(P−1

+ B R−1 BT)−1

×A(P−1+ B R−1 BT)

− Q B R−1 BT− AT(P−1

+ B R−1 BT)−1 AB R−1 BT

= Q(P−1+ B R−1 BT) + (AT(P−1

+ B R−1 BT)−1

×AP−1+ AT(P−1

+ B R−1 BT)−1 AB R−1 BT)

− Q B R−1 BT− AT(P−1

+ B R−1 BT)−1 AB R−1 BT

= Q P−1+ AT(P−1

+ B R−1 BT)−1 AP−1. (A.15)

Equating the two sides we get

I = Q P−1+ AT(P−1

+ B R−1 BT)−1 AP−1. (A.16)

That is,

P = Q + AT(P−1+ B R−1 BT)−1 A. (A.17)

From Eq. (A.10) we get

$(P^{-1} + BR^{-1}B^T)^{-1} = P - PB(B^TPB + R)^{-1}B^TP$. (A.18)

Substituting Eq. (A.18) in Eq. (A.17) gives

$P = Q + A^T\left(P - PB(B^TPB + R)^{-1}B^TP\right)A$. (A.19)

The above equation is the discrete Riccati equation,

$P = Q + A^TPA - A^TPB(B^TPB + R)^{-1}B^TPA$ (A.20)

which is the same as Eq. (A.8).

This proves that for linear systems (where the mapping between the costate at stage $k+1$ and the state at stage $k$ is linear), the SNAC iteration, on convergence, converges to the solution of the discrete Riccati equation.
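The linear recursion of Eq. (A.6) and the fixed point of Eq. (A.9) can be checked numerically. The sketch below is an illustration with an arbitrarily chosen stable pair $(A, B)$, not data from the paper; since $A$ itself is stable, the zero initialization $g_0 = 0$ corresponds to a stabilizing (zero) controller, as the convergence argument requires.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def snac_recursion(A, B, Q, R, iters=500):
    # Iterate Eq. (A.6): g_{n+1} = (Q + A^T g_n)(A - B R^{-1} B^T g_n),
    # starting from g_0 = 0 (a stabilizing start for this stable A).
    g = np.zeros_like(A)
    BRB = B @ np.linalg.solve(R, B.T)
    for _ in range(iters):
        g = (Q + A.T @ g) @ (A - BRB @ g)
    return g

A = np.array([[0.9, 0.1], [0.0, 0.8]])   # assumed stable example system
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
P = solve_discrete_are(A, B, Q, R)        # DARE solution P of Eq. (A.8)
# g from Eq. (A.9): g = (I + P B R^{-1} B^T)^{-1} P A
g_ric = np.linalg.solve(np.eye(2) + P @ B @ np.linalg.solve(R, B.T), P @ A)
print(np.allclose(snac_recursion(A, B, Q, R), g_ric))  # expected: True
```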

References

Balakrishnan, S. N., & Biega, V. (1996). Adaptive-critic based neural networks for aircraft optimal control. Journal of Guidance, Control and Dynamics, 19(4), 893–898.

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, SMC-13, 834–846.

Bryson, A. E., & Ho, Y. C. (1975). Applied optimal control. Taylor and Francis.

Ferrari, S., & Stengel, R. F. (2002). An adaptive critic global controller. In Proceedings of the American control conference (pp. 2665–2670). Anchorage, USA.

Gopal, M. (1993). Modern control system theory (2nd ed.). Wiley.

Gupta, S. K. (1995). Numerical methods for engineers. Wiley Eastern Ltd. and New Age International Ltd.

Hagan, M. T., Demuth, H. B., & Beale, M. (1996). Neural network design. PWS Publishing Company.

Han, D., & Balakrishnan, S. N. (2002). Adaptive critics based neural networks for agile missile control. Journal of Guidance, Control and Dynamics, 25, 404–407.

Hunt, K. J. (1992). Neural networks for control systems—A survey. Automatica, 28(6), 1083–1112.

Lewis, F. (1992). Applied optimal control and estimation. Prentice-Hall.

Liu, X., & Balakrishnan, S. N. (2000). Convergence analysis of adaptive critic based optimal control. In Proceedings of the American control conference (pp. 1929–1933). Chicago, USA.

Miller, W. T., Sutton, R., & Werbos, P. J. (Eds.) (1990). Neural networks for control. MIT Press.

Montgomery, D. C. (2001). Design and analysis of experiments (5th ed.). John Wiley and Sons, Inc.

Murray, J. J., Cox, C. J., Lendaris, G. G., & Saeks, R. E. (2002). Adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, 32, 140–153.

Padhi, R. (2001). Optimal control of distributed parameter systems using adaptive critic neural networks. Ph.D. dissertation. University of Missouri–Rolla.

Padhi, R., & Balakrishnan, S. N. (2003a). Optimal process control using neural networks. Asian Journal of Control, 5(2), 217–229.

Padhi, R., & Balakrishnan, S. N. (2003b). Proper orthogonal decomposition based neurocontrol synthesis of a chemical reactor process using approximate dynamic programming. Neural Networks, 16, 719–728.

Prokhorov, D. V., & Wunsch, D. C., II (1997). Adaptive critic designs. IEEE Transactions on Neural Networks, 8, 997–1007.

Prokhorov, D. V. (2003). Optimal controllers for discretized distributed parameter systems. In Proceedings of the American control conference (pp. 549–554). Denver, USA.

Senturia, S. D. (2001). Microsystem design. Kluwer Academic Publishers.

Venayagamoorthy, G. K., Harley, R. G., & Wunsch, D. C. (2002). Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator. IEEE Transactions on Neural Networks, 13, 764–773.

Werbos, P. J. (1992). Approximate dynamic programming for real-time control and neural modeling. In D. A. White, & D. A. Sofge (Eds.), Handbook of intelligent control. Multiscience Press.

Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.

Yesildirek, A. (1994). Nonlinear systems control using neural networks. Ph.D. thesis. University of Texas at Arlington.