
Stochastic optimal control via Bellman's principle


Available online at www.sciencedirect.com

Automatica 39 (2003) 2109–2114
www.elsevier.com/locate/automatica

Brief Paper

Stochastic optimal control via Bellman’s principle

Luis G. Crespo a,*, Jian-Qiao Sun b

a National Institute of Aerospace, 144 Research Drive, Hampton, VA 23666, USA
b Department of Mechanical Engineering, University of Delaware, Newark, DE 19711, USA

Received 28 January 2003; received in revised form 27 May 2003; accepted 20 June 2003

Abstract

This paper presents a strategy for finding optimal controls of non-linear systems subject to random excitations. The method is capable of generating global control solutions when state and control constraints are present. The solution is global in the sense that controls for all initial conditions in a region of the state space are obtained. The approach is based on Bellman’s principle of optimality, the cumulant-neglect closure method and the short-time Gaussian approximation. Problems with state-dependent diffusion terms, non-closeable hierarchies of moment equations for the states and singular state boundary conditions are considered in the examples. The uncontrolled and controlled system responses are evaluated by creating a Markov chain with a control-dependent transition probability matrix via the generalized cell mapping method. In all numerical examples, excellent controlled performances were obtained.
© 2003 Elsevier Ltd. All rights reserved.

Keywords: Stochastic control; Optimality; Non-linearity; Random processes; Numerical algorithms; Random vibrations

1. Introduction

The optimal control of stochastic systems is a difficult problem, particularly when the system is strongly non-linear and there are state and control constraints. Very few closed-form solutions to such problems are available in the literature (Bratus, Dimentberg, & Iourtchenko, 2000; Dimentberg, Iourtchenko, & Bratus, 2000). Given its complexity, we must resort to numerical methods (Kushner & Dupuis, 2001). While some numerical methods of solution to the Hamilton–Jacobi–Bellman (HJB) equation are known, they usually require knowledge of the boundary/asymptotic behavior of the solution in advance (Bratus et al., 2000). Numerical strategies to find deterministic optimal control solutions based on Bellman’s principle of optimality (BPO) are available (Crespo & Sun, 2000). In this paper these tools are extended to the stochastic control problem. The method, which involves both

This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Ioannis Paschalidis under the direction of Editor Tamer Basar.

* Corresponding author. Tel.: +1-757-766-1689; fax: +1-757-766-1812.
E-mail addresses: [email protected] (L.G. Crespo), [email protected] (J.-Q. Sun).

analytical and numerical steps, offers several advantages: (i) it can be applied to strongly non-linear systems, (ii) it takes into account state and control constraints and (iii) it leads to global solutions, from which topological features, e.g. switching curves, can be extracted. Former developments can be found in Crespo and Sun (2002), where comparisons with analytical solutions were made. Simulations for the examples presented are available at http://research.nianet.org/∼lgcrespo/simulations.html.

2. Stochastic optimal control

2.1. Problem formulation

Consider a system governed by the stochastic differential equation in the Stratonovich sense dx(t) = m(x(t), u(t)) dt + σ(x(t), u(t)) dB(t), where x(t) ∈ Rⁿ is the state vector, u(t) ∈ Rᵐ is the control, B(t) is a vector of independent unit Wiener processes and the functions m(·) and σ(·) are in general non-linear functions of their arguments. Itô’s calculus (Risken, 1984) leads to the equation

dx(t) = [m(x, u) + (1/2) (∂σ(x, u)/∂x) σ(x, u)ᵀ] dt + σ(x, u) dB(t).   (1)

0005-1098/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/S0005-1098(03)00238-3


2110 L.G. Crespo, J.-Q. Sun / Automatica 39 (2003) 2109–2114

The corresponding Fokker–Planck–Kolmogorov (FPK) equation is given by

∂ρ/∂t = −(∂/∂x)[ρ (m(x, u) + (1/2) (∂σ(x, u)/∂x) σ(x, u)ᵀ)] + (1/2) (∂²/∂x²)[ρ σ(x, u) σ(x, u)ᵀ],   (2)

where ρ(x, t | x0, t0) is the conditional probability density function (PDF) of the response. Let the cost functional be

J(u, x0, t0, T) = E[Φ(x(T), T) + ∫_{t0}^{T} L(x(t), u(t)) dt],   (3)

where E[·] is the expected value operator, [t0, T] is the time interval of interest, Φ(x(T), T) is the terminal cost and L(x(t), u(t)) is the Lagrangian function. The optimal control problem is to find the control u(t) ∈ U ⊂ Rᵐ for t ∈ [t0, T] in Eq. (1) that drives the system from the initial condition x(t0) = x0 to the target set defined by Ψ(x(T), T) = 0 such that the cost functional J(·) is minimized. The fixed final state condition leads to control solutions of the feedback type, i.e. u(x).

2.2. Bellman’s principle of optimality

Let V(x0, t0, T) = J(u*, x0, t0, T) be the so-called value function or optimal cost function (Yong & Zhou, 1999). The BPO can be stated as

V(x0, t0, T) = inf_{u∈U} E[∫_{t0}^{t} L(x(t), u(t)) dt + ∫_{t}^{T} L(x*(t), u*(t)) dt + Φ(x*(T), T)],   (4)

where t0 ≤ t ≤ T. Consider the problem of finding the optimal control for a system starting from xi in the time interval [iτ, T], where τ is a discrete time step. Define the incremental and the accumulative costs as

Jτ = E[∫_{iτ}^{(i+1)τ} L(x(t), u(t)) dt],   (5)

JT = E[Φ(x*(T), T) + ∫_{(i+1)τ}^{T} L(x*(t), u*(t)) dt],   (6)

where {x*(t), u*(t)} is the optimal solution pair over the time interval [(i+1)τ, T]. In this context, the BPO is given by V(xi, iτ, T) = inf_{u∈U} {Jτ + JT}. The incremental cost Jτ is the cost for the system to march one time step forward starting from a deterministic initial condition xi neighboring V(x((i+1)τ), (i+1)τ, T). The system moves to an intermediate set of the state variables. The accumulative cost JT is the optimal cost of reaching the target set Ψ(x(T), T) = 0 starting from this intermediate set and is calculated through the accumulation of incremental costs over time intervals between (i+1)τ and T, i.e. V(x((i+1)τ), (i+1)τ, T) for the processed state space.

3. Solution approach

To evaluate the expected values in Eq. (6) for a given control, ρ(x, τ | x0, 0) is needed. For a given feedback control law u = f(x), the response x(t) is a stationary Markov process (Lin & Cai, 1995). For a small τ, ρ(x, τ | x0, 0) is known to be approximately Gaussian within an error of order O(τ²) (Risken, 1984). We can derive dynamic equations for the moments of the states from Eq. (1). Such equations are closed using the cumulant-neglect closure method (Lin & Cai, 1995; Sun & Hsu, 1989) and used by a backward search algorithm to generate the global control solution.
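The short-time Gaussian step can be illustrated as follows. This sketch is not from the paper: the scalar system dx = −x³ dt + √(2D) dB and all numerical values are assumptions chosen for simplicity. The mean m1 and variance P are propagated under the cumulant-neglect (Gaussian) closure, so that ρ(x, τ | x0, 0) ≈ N(m1(τ), P(τ)):

```python
import numpy as np

def gaussian_moment_step(m1, P, D, dt):
    """One Euler step of the closed moment equations for
    dx = -x^3 dt + sqrt(2 D) dB under Gaussian (cumulant-neglect) closure,
    which sets E[x^3] = m1^3 + 3 m1 P and E[x^4] = m1^4 + 6 m1^2 P + 3 P^2."""
    Ex3 = m1**3 + 3.0 * m1 * P
    Ex4 = m1**4 + 6.0 * m1**2 * P + 3.0 * P**2
    dm1 = -Ex3                       # d/dt E[x]
    dEx2 = -2.0 * Ex4 + 2.0 * D      # d/dt E[x^2]; the diffusion adds 2D
    dP = dEx2 - 2.0 * m1 * dm1       # since P = E[x^2] - m1^2
    return m1 + dt * dm1, P + dt * dP

# Short-time transition density starting from the deterministic state x0 = 1:
m1, P = 1.0, 0.0
for _ in range(10):                  # tau = 10 * 0.01
    m1, P = gaussian_moment_step(m1, P, D=0.05, dt=0.01)
# rho(x, tau | 1, 0) is then approximated by the Gaussian N(m1, P)
```

Starting from a point mass (P = 0), the variance grows with τ while the mean drifts toward the origin, which is exactly the Gaussian transition density the backward search needs.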

3.1. Backward search algorithm

The backward solution process starts from the last segment of the time interval [T − τ, T]. Since the terminal condition for the fixed final state problem is specified, a family of local optimal solutions for all initial conditions x(T − τ) is found. The optimal control in the interval [iτ, T] is determined by minimizing the sum of the incremental and the accumulative cost leading to V(xi, iτ, T) subject to the condition x((i+1)τ) = x*_{i+1}, where x*_{i+1} is a random variable whose support is mostly on the processed state region. The condition x((i+1)τ) = x*_{i+1} must be applied in a probabilistic sense. Notice, however, that ρ(x, τ | x0, 0) covers the entire state space no matter how small τ is. To quantify the continuity condition, let Γ be the extended target set such that x*_{i+1} ∈ Γ. For a given control, define PΓ as

PΓ = ∫_{x∈Γ} ρ(x, τ | xi, 0) dx;   (7)

then PΓ is the probability of reaching the extended target set Γ in time τ starting from xi. The controlled response x(t) starting from a set of initial conditions xi will become a candidate for the optimal solution when PΓ is maximal.
The numerical procedure is presented next. Discretize a finite state region D ⊂ Rⁿ into a countable number of parts/cells. Let U be a set consisting of a countable number of admissible controls ui for i = 1, 2, …, I. The control is assumed to be constant over the time intervals. Let Γ ⊂ Rⁿ denote the discretized target set Ψ(x(T), T) = 0 and JT = E[Φ(x(T), T)] be the terminal cost. In this framework, the algorithm is as follows:

(1) Find all the cells that surround the target set Γ. Denote the corresponding cell centers zj.

(2) Construct the conditional probability density function ρ(x, τ | zj, 0) for each control ui and for all cell centers zj. Call every combination (zj, ui) a candidate pair.


(3) Calculate the incremental cost Jτ(zj, ui), the accumulative cost JT(z*_k, u*_k) and PΓ for all candidate pairs. z*_k is an image cell of zj in Γ and u*_k is the optimal control of z*_k found in previous iterations.

(4) Search for the candidate pairs that minimize Jτ(zj, ui) + JT(z*_k, u*_k) and satisfy PΓ ≥ β max{PΓ}, where 0 < β < 1 is a factor set in advance. Denote such pairs as (z*_j, u*_i).

(5) Save the minimized accumulative cost function JT(z*_j, u*_i) = Jτ(z*_j, u*_i) + JT(z*_k, u*_k) and the optimal pairs (z*_k, u*_k).

(6) Expand the target set Γ by including the cells z*_j.

(7) Repeat the search from Steps (1) to (6) until the initial condition x0 is reached.

As a result, the optimal control solution for all the cells covered by Γ is found. The choice of image cells, i.e. x*_{i+1}, could certainly be biased. This, however, is avoided by (i) using non-uniform integration times such that the growth of Γ is gradual, i.e. mapping most of the probability to neighboring cells, and (ii) restricting the potential optimal pairs to be candidate pairs with high PΓ. These considerations led to the same global control solution regardless of the cell size (Crespo & Sun, 1999).
The resulting dynamics of the conditional PDF is simulated using the generalized cell mapping method (GCM) (Crespo & Sun, 2002). Notice that if a simulation is run for a long time, all probability will eventually leave D. This is a consequence of using a finite computational domain to model a diffusion process; therefore all numerical simulations face this problem. While some probability is leaking out of D, probability mass is also being brought back from its complement. In this study the computational domain is set such that the leakage of probability during the transient controlled response is very small.
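Steps (1)-(6) can be sketched on a one-dimensional toy problem. Everything below is an illustrative assumption, not the paper's implementation: scalar dynamics dx = u dt + √(2D) dB, Lagrangian L = x² + u², a single target cell at the origin, a frontier that grows one layer of cells per pass, and an accumulative cost conditioned on landing in the processed region Γ:

```python
import math
import numpy as np

N = 41                                    # cells discretizing D = [-2, 2]
centers = np.linspace(-2.0, 2.0, N)
h = centers[1] - centers[0]
U = np.array([-1.0, 0.0, 1.0])            # admissible control levels
tau, D, beta = 0.05, 0.02, 0.5            # time step, diffusion, P_Gamma factor

J = np.full(N, np.inf)                    # accumulative cost per cell
processed = np.zeros(N, dtype=bool)       # the growing target set Gamma
j0 = N // 2                               # single target cell at x = 0
processed[j0], J[j0] = True, 0.0

def cell_probs(x0, u):
    # Short-time Gaussian transition density N(x0 + u*tau, 2*D*tau),
    # integrated over every cell to give transition probabilities.
    mean, sd = x0 + u * tau, math.sqrt(2.0 * D * tau)
    edges = np.append(centers - h / 2.0, centers[-1] + h / 2.0)
    cdf = np.array([0.5 * (1.0 + math.erf((e - mean) / (sd * math.sqrt(2.0))))
                    for e in edges])
    return np.diff(cdf)

for _ in range(N):
    # step (1): cells that surround the current target set
    frontier = [j for j in range(N) if not processed[j]
                and (processed[max(j - 1, 0)] or processed[min(j + 1, N - 1)])]
    if not frontier:
        break
    newJ = {}
    for j in frontier:
        cands = []
        for u in U:                       # step (2): candidate pairs (z_j, u_i)
            p = cell_probs(centers[j], u)
            Pg = p[processed].sum()       # P_Gamma of Eq. (7)
            Jinc = (centers[j]**2 + u**2) * tau           # incremental cost
            # step (3): accumulative cost, conditioned on landing in Gamma
            Jacc = Jinc + p[processed] @ J[processed] / max(Pg, 1e-12)
            cands.append((Jacc, Pg, u))
        Pmax = max(c[1] for c in cands)
        ok = [c for c in cands if c[1] >= beta * Pmax]    # step (4)
        newJ[j] = min(ok)[0]              # step (5): save minimized cost
    for j, val in newJ.items():
        J[j], processed[j] = val, True    # step (6): expand Gamma
```

After the loop terminates, J holds a feedback cost-to-go over all cells, and the minimizing u in each cell is the global control solution in the sense of Section 2; the conditioning in step (3) is a simplification of the image-cell bookkeeping described above.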

4. Non-analytically closeable terms

Closure methods readily handle polynomials. However, for other types of non-linearities they may not only require tedious and lengthy integrations, but may also lead to expressions that do not admit a closed form. This prevents the integration of the moment equations for the states, even numerically. In order to overcome this difficulty, the cellular structure of the state space can be used to approximate such non-linearities with multiple Taylor expansions. Once the approximations are available, the infinite hierarchy of moments can be closed.
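The idea can be illustrated with the following sketch (the helper name and numerical values are assumptions; the speed profile v = ar/(br² + c) is taken from the example of Section 4.1): a second-order Taylor expansion of a non-closeable f about a cell center c turns E[f(x)] into an expression in the first two moments only.

```python
import numpy as np

def taylor2_mean(f, c, m1, P, h=1e-4):
    """Approximate E[f(x)] via a second-order Taylor expansion of f about
    the cell center c:
        E[f] ~= f(c) + f'(c) E[x - c] + (1/2) f''(c) E[(x - c)^2],
    with E[x - c] = m1 - c and E[(x - c)^2] = P + (m1 - c)^2.
    Derivatives are taken by central finite differences."""
    f0 = f(c)
    f1 = (f(c + h) - f(c - h)) / (2.0 * h)
    f2 = (f(c + h) - 2.0 * f0 + f(c - h)) / h**2
    d = m1 - c
    return f0 + f1 * d + 0.5 * f2 * (P + d**2)

# Vortex speed profile from the example (a = 15, b = 10, c = 2) as a
# function of the radius; not closeable analytically in the moments.
speed = lambda r: 15.0 * r / (10.0 * r**2 + 2.0)
approx = taylor2_mean(speed, c=0.5, m1=0.55, P=0.01)
```

By construction the approximation is exact for quadratic f, and its accuracy within a cell improves as the cell size shrinks, which is what makes the cellular structure of the state space a natural fit for this closure.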

4.1. Example

The optimal steering of a vehicle on a vortex field is studied next. The vehicle moves on the (x1, x2) plane with constant velocity relative to the vortex. The control u is the heading angle with respect to the positive x1-axis. The velocity field is given by v(x1, x2) = ar/(br² + c), where r = |x| is the radius from the center of the vortex. The vehicle dynamics is given by the Stratonovich equations

ẋ1 = cos(u) − v x2/r + σ v w1,
ẋ2 = sin(u) + v x1/r + σ v w2,   (8)

where σ is a constant, w1 and w2 are correlated Gaussian white noise processes with zero mean such that E[w1(t)w1(t′)] = 2D1 δ(t − t′), E[w2(t)w2(t′)] = 2D2 δ(t − t′) and E[w1(t)w2(t′)] = 2D12 δ(t − t′). The control objective is to drive the vehicle to the target set Γ such that the cost functional

J = E[∫_{t0}^{T} λ v(x1, x2) dt]   (9)

is minimized. This cost models the risk associated with the selected path. Details can be found in Crespo (2003). For uncorrelated white noise processes we find

ṁ10 = cos(u) − E[x2 v/r] + σ² D1 E[v ∂v/∂x1],

ṁ01 = sin(u) + E[x1 v/r] + σ² D2 E[v ∂v/∂x2],

ṁ20 = 2 m10 cos(u) − 2 E[x1 x2 v/r] + 2σ² D1 E[(x1 v/r) ∂v/∂x1] + 2σ² D1 E[v²],   (10)

ṁ02 = 2 m01 sin(u) + 2 E[x1 x2 v/r] + 2σ² D2 E[(x2 v/r) ∂v/∂x2] + 2σ² D2 E[v²],

ṁ11 = cos(u) m01 − E[x2² v/r] + σ² D1 E[(x2 v/r) ∂v/∂x1] + sin(u) m10 + E[x1² v/r] + σ² D2 E[(x1 v/r) ∂v/∂x2].

Several expected values in these equations cannot be analytically expressed by lower-order moments. Let f(x1, x2) denote a non-analytically closeable function in Eq. (10). Second-order Taylor expansions about the cell centers were used (Crespo, 2003).
The region D = [−2, 2] × [−2, 2] is discretized with 1089 cells. Take U = {−π, −14π/15, …, 14π/15}, λ = 1, a = 15, b = 10, c = 2, σ = 1, D1 = 0.05 and D2 = 0.05. Let Γ be the set of cells corresponding to the target set Ψ = {x1 = 2, x2}, i.e. Γ contains the rightmost column of cells in D. Fig. 1 shows the mean vector field of the vortex. A trajectory of the vehicle moving freely is superimposed. The center of the vortex attracts probability due to the state dependence of the diffusion and not to the existence of an attracting point at the origin. This behavior does not have a corresponding counterpart in a deterministic analysis. The time evolution of relevant indices is shown in Fig. 2. The vector field of the mean of the controlled response is shown in Fig. 3.


Fig. 1. Expected vector field of the uncontrolled trajectories.

Fig. 2. Uncontrolled trajectories: moments of x1 (- -) and moments of x2 (–).

Fig. 3. Expected vector field of the controlled trajectories. Target cells are marked with crosses.

Fig. 4. PDF of the controlled response after 0.3 time units.

A controlled trajectory is shown in Fig. 1. In the process, the control keeps the system in D with probability one. A discontinuity in the vector field exists in spite of having an unbounded control set. Such a discontinuity implies a dichotomy in the long-term behavior of the controlled response, whose dynamics depends strongly on the initial state. While for initial conditions above the discontinuity the control drives the vehicle against the velocity field, for initial conditions below it the control moves the vehicle in the direction of the current. Fig. 4 shows the controlled response of a system starting from x = (0.8, −0.32) after 0.3 time units. The system bifurcates when it reaches the discontinuity.

5. Singular boundary conditions

A transformation of variables is used to remove non-smooth state constraints. The optimal control problem, i.e. the system dynamics, the cost functional, the admissible state and control spaces, and the target set, is transformed to a new domain where it is solved by the method. Then, the global feedback control solution is transformed back to the physical domain, where it is evaluated. For optimal bang-bang solutions, the inverse transformation of the solution satisfies the control constraints.

5.1. Example

A vibro-impact system with a one-sided rigid barrier subject to a Gaussian white noise excitation is considered. The impact is assumed to be perfectly elastic. The control of the vibro-impact system is then subject to a state constraint given by the one-sided rigid barrier. The equation of motion is given by

ÿ + ε(y² − 1)ẏ + 2ζω ẏ + ω² y = u(t) + w(t),   (11)

subject to the impact condition ẏ(t⁺impact) = −ẏ(t⁻impact) at y(timpact) = −h, where timpact is the time instant at which


impact occurs, y(t) is the displacement, w(t) is a Gaussian white noise process satisfying E[w(t)] = 0 and E[w(t)w(t + τ′)] = 2D δ(τ′), and u(t) is a bounded control satisfying |u| ≤ ū. The control objective is to drive the system from any initial condition to rest, i.e. (y, ẏ) = (0, 0), while the rate of energy dissipation is maximized. The computational domain is Dy = [−h, a] × [−v, v]. As suggested in Dimentberg (1988), we introduce the transformation y = |x| − h, leading to

ẍ + ε(x² − 2h|x| + h² − 1)ẋ + 2ζω ẋ + ω² x = sgn(x)(u(t) + w(t) + hω²).   (12)

The transformed domain is given by Dx = [−2h, 2a] × [−v, v], the new target set is (x, ẋ) = (±h, 0), and the cost functional is given by

Jx = E[∫_{t0}^{∞} [(|x| − h)² + ẋ²] dt].   (13)

For x1 = x and x2 = ẋ, the Itô equation is given by

dx1 = x2 dt,

dx2 = (−ε x1² x2 + 2εh |x1| x2 − (εγ + 2ζω) x2 − ω² x1 + sgn(x1) ν) dt + sgn(x1) √(2D) dB,   (14)

where B(t) is a unit Wiener process, ν ≡ u + hω² and γ ≡ h² − 1. Some manipulations lead to

ṁ10 = m01,

ṁ20 = 2 m11,

ṁ01 = −ε m21 + 2εh E[|x1| x2] − (εγ + 2ζω) m01 − ω² m10 + E[sgn(x1)] ν,

ṁ11 = m02 − ε m31 + 2εh E[|x1| x1 x2] − (εγ + 2ζω) m11 − ω² m20 + E[x1 sgn(x1)] ν,   (15)

ṁ02 = −2ε m22 + 4εh E[|x1| x2²] − 2(εγ + 2ζω) m02 − 2ω² m11 + 2 E[x2 sgn(x1)] ν + 2D E[sgn(x1)²].   (16)
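To see how the transformation builds the barrier into the dynamics, one can integrate Eq. (14) directly; the recovered displacement y = |x1| − h then satisfies y ≥ −h identically, so no reflection logic is needed. The sketch below is an illustrative assumption using the example's linear case (ε = ζ = 0, ω = 1, D = 0.1) with a fixed control u = 0:

```python
import numpy as np

rng = np.random.default_rng(1)
h, omega, D = 1.0, 1.0, 0.1
eps, zeta, u = 0.0, 0.0, 0.0              # linear, uncontrolled case
nu = u + h * omega**2                     # nu = u + h*omega^2, as in Eq. (14)
gamma = h**2 - 1.0
dt, nsteps, nsamp = 1e-3, 5000, 2000      # 5 time units, 2000 sample paths

x1 = np.full(nsamp, 0.5)                  # assumed initial condition
x2 = np.zeros(nsamp)
for _ in range(nsteps):
    s = np.sign(x1)                       # sgn(x1); momentarily 0 at x1 = 0
    drift = (-eps * x1**2 * x2 + 2.0 * eps * h * np.abs(x1) * x2
             - (eps * gamma + 2.0 * zeta * omega) * x2
             - omega**2 * x1 + s * nu)
    dW = rng.normal(0.0, np.sqrt(dt), nsamp)
    x1, x2 = x1 + x2 * dt, x2 + drift * dt + s * np.sqrt(2.0 * D) * dW

y = np.abs(x1) - h                        # back to the physical displacement
# the one-sided barrier y >= -h holds by construction, with no impact handling
```

Every sample of x1 that crosses zero corresponds to an elastic impact of y at the barrier, which is why the transformed problem can be solved on Dx without any non-smooth constraint.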

The above equations can be closed and integrated (Crespo, 2003). Now, we solve for optimal controls in the domain Dx using the cost function Jx and the target set (x, ẋ) = (±h, 0). The control solution u*(x1, x2) is then transformed back to the physical domain, i.e. u*(y, ẏ). The optimal control in both domains is bang-bang, being fully determined by switching curves. From u*(x1, x2), an approximation of such curves is built and transformed to the physical domain. As an example, take ū = 1, h = 1, a = 1, v = 1, ε = 0, ζ = 0, ω = 1 and D = 0.1. The transformed state space is Dx = [−2, 2] × [−2, 2], where 849 cells are used. The transformed target set is formed by the cells that contain the points (x, ẋ) = (±1, 0). Since the optimal control is bang-bang, we have U = {−1, 1}. The uncontrolled response is marginally stable since ε = 0 and ζ = 0. At impact, the system is reflected back in such a way that y remains the same and the sign of ẏ is reversed. The crossing of any other boundary is an irreversible process in the sense that the complement of Dy acts as a sink cell. The uncontrolled response for an initial uniform distribution over [−0.86, −0.78] × [−0.64, −0.57] is first studied. After 3 time units, only 22% of the probability is within Dy, indicating the strong effect of diffusion on the marginally stable system. Besides, the probability left in Dy is not concentrated about the target. The global optimal control solution in Dx is shown in Fig. 5, where the switching curves are also shown. Approximations of the switching curves were obtained by curve fitting. The control solution is mapped back to Dy, leading to Fig. 6. Notice the qualitative differences in the controlled response for states in the third quadrant of Dy. While for the cells marked with crosses,

Fig. 5. Global optimal control solution in Dx. Cells with circles are the regions where the optimal control is u* = ū; otherwise u* = −ū. Switching curves are superimposed.

Fig. 6. Global optimal control solution in Dy. Previous conventions apply.


Fig. 7. Stationary PDF of the controlled response.

Fig. 8. Time evolutions of PD (−−), JD (−), and the moments m10 (−·−), m01 (−−), m20 (−·−), and m02 (−−) of the controlled response.

the control speeds up the system favoring impact, in the others it intends to avoid it. For the same initial condition, the stationary controlled PDF is shown in Fig. 7. Stationarity is reached in 4 time units with 98% of the probability within Dy. As before, the switching curves split a uni-modal PDF into a bi-modal one. The time evolution of relevant indices is shown in Fig. 8. As before, the controlled response (i) converges to the target set with high probability, (ii) minimizes the cost and (iii) maximizes the probability of staying in Dy.

6. Conclusions

This paper proposes a strategy to find optimal controls of non-linear stochastic systems using Bellman’s principle of optimality, the cumulant-neglect closure method and the short-time Gaussian approximation. Control problems of several challenging non-linear systems with fixed final state conditions subject to state and control constraints were studied to demonstrate the effectiveness of the approach. Processes with a state-dependent diffusion part, non-analytically closeable equations for the moments of the states and singular boundary conditions are considered. In all cases, the uncontrolled and controlled system responses were evaluated in the space of the conditional PDFs using the generalized cell mapping method. In all cases, excellent controlled performances were obtained. Relaxation of the computational complexity, extensions to high-dimensional systems and a rigorous study of the convergence and stability of the algorithm should be pursued in the future.

References

Bratus, A., Dimentberg, M., & Iourtchenko, D. (2000). Optimal bounded response control for a second-order system under a white-noise excitation. Journal of Vibration and Control, 6, 741–755.

Crespo, L. G. (2003). Stochastic optimal controls via dynamic programming. NASA/CR 2003-2124191.

Crespo, L. G., & Sun, J. Q. (2000). Solution of fixed final state optimal control problems via simple cell mapping. Non-linear Dynamics, 23, 391–403.

Crespo, L. G., & Sun, J. Q. (2002). Stochastic optimal control of non-linear systems via short-time Gaussian approximation and cell mapping. Non-linear Dynamics, 28, 323–342.

Dimentberg, M. (1988). Statistical dynamics of non-linear and time-varying systems. Taunton, U.K.: Research Studies Press.

Dimentberg, M., Iourtchenko, D., & Bratus, A. (2000). Optimal bounded control of steady-state random vibrations. Probabilistic Engineering Mechanics, 15, 381–386.

Kushner, H. J., & Dupuis, P. (2001). Numerical methods for stochastic control problems in continuous time. New York: Springer.

Lin, Y. K., & Cai, G. Q. (1995). Probabilistic structural dynamics: Advanced theory and applications. New York: McGraw-Hill.

Risken, H. (1984). The Fokker–Planck equation—methods of solution and applications. New York: Springer.

Sun, J. Q., & Hsu, C. S. (1989). Cumulant-neglect closure method for asymmetric non-linear systems driven by Gaussian white noise. Journal of Sound and Vibration, 135, 338–345.

Yong, J., & Zhou, X. Y. (1999). Stochastic controls, Hamiltonian systems and HJB equations. New York: Springer.

Luis G. Crespo obtained his Ph.D. from the University of Delaware in 2002. He is currently a Staff Scientist of the National Institute of Aerospace, NIA (former ICASE), and a member of the Uncertainty Based Methods group of NASA Langley Research Center. His research interests are the dynamics, control and optimization of deterministic and stochastic systems.

Jian-Qiao Sun obtained his Ph.D. in Mechanical Engineering from the University of California at Berkeley in 1988. He is currently a professor of Mechanical Engineering at the University of Delaware. His research interests include non-linear dynamics, structural acoustic control, biomechanics and nano-scale mechanical measurement.