



646 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. AC-17, NO. 5, OCTOBER 1972


Stochastic Approximation Algorithms for the Local Optimization of Functions with Nonunique Stationary Points

Abstract-The aim of this paper is the provision of a framework for a practical stochastic unconstrained optimization theory. The results are based on certain concepts of stochastic approximation, although not restricted to those procedures, and aim at incorporating the great flexibility of currently available deterministic optimization ideas into the stochastic problem, whenever optimization must be done by Monte Carlo or sampling methods. Hills with nonunique stationary points are treated. A framework has been provided with which convergence of stochastic versions of conjugate gradient, partan, etc., can be discussed and proved.

I. INTRODUCTION

This paper describes some versions of Kiefer-Wolfowitz-like (KW) stochastic approximation procedures that are rather versatile. They allow for many more options in the techniques of the iterations than previous methods, and, hopefully, make a useful step in the direction of a truly practical theory of algorithms for stochastic unconstrained optimization, and suggest approaches to the development of algorithms for constrained stochastic optimization.

The classical KW procedure has been well studied mathematically [1]-[5], etc., and a number of papers on applications have been written (e.g., [6]-[13]) over the past two decades; yet it is generally very inefficient, and infrequently used as a basis for practical algorithms. More precise definitions will be given in the next section; in this section, the philosophy of the paper will be discussed.

Manuscript received September 17, 1971; revised February 16, 1972, and April 13, 1972. Paper recommended by D. Sworder, Chairman of the IEEE S-CS Stochastic Control Committee. This research was supported in part by the Office of Scientific Research under Contract AFOSR 71-2078 and in part by the Office of Naval Research under Contract NONR N00014-67-A-0191-0018. The author is with the Center for Dynamical Systems, Divisions of Applied Mathematics and Engineering, Brown University, Providence, R.I. 02912.

Loosely speaking, let f(x) be a smooth real-valued function of a vector parameter x. We wish to find the value of x, namely θ, that minimizes f(x). But for each x, we cannot calculate f(x); we can only observe f(x) + ξ(x), where ξ(x) is corrupting noise of mean zero. Let X_0 denote an initial estimate of θ and let a_n, c_n be real positive sequences. Let DY(X_n,c_n) denote a noise-corrupted estimate of the gradient of f(x) at x = X_n. DY(X_n,c_n) is obtained by a finite difference estimate using noise-corrupted values of f(x) (where the values of x for the calculation of DY(X_n,c_n) are set equal to X_n perturbed in each of the coordinate directions by the finite difference interval c_n). The usual KW iteration has the form (*)

X_{n+1} = X_n − a_n DY(X_n,c_n).  (*)

Sometimes a_n is replaced by a matrix A_n, which may incorporate an estimate of the inverse of the Hessian at θ, if it exists.

Thus in (*), a gradient is estimated, a single step (proportional to a_n times the gradient estimate, or a_n times a matrix times the gradient estimate) is taken, and the process is repeated. Furthermore, it is a stochastic version


KUSHNER: STOCHASTIC APPROXIMATION ALGORITHMS 647

of the classical deterministic Newton-Raphson method. The classical theory assumes that there is a unique stationary point, which is a minimum. There may not be a unique stationary point in practical examples. It would still be desirable to know that the X_n converge with probability 1 anyway. Indeed, in deterministic optimization (see, e.g., [14] and [15]), it is usually proved that the algorithms yield a sequence converging to the set where the gradient f_x(x) is zero. Such proofs do not use the contraction type of arguments of the Newton-Raphson method. In applications to control, it is not unusual for the f(x) to have many stationary points. Even aside from this important point, the structure of (*) is inherently inefficient (and not only due to the inefficiency involved in choosing inadequate values for {a_n, c_n}).
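The basic iteration (*) is easy to sketch in code. The quadratic objective, the noise level, and the gain sequences a_n = a/(n+1), c_n = c/(n+1)^{1/4} below are illustrative assumptions, not choices prescribed by the paper:

```python
import numpy as np

# Sketch of the classical KW iteration (*): estimate the gradient of f by
# central differences of noisy observations, step against it, repeat.
# f(x) = |x|^2 with additive N(0, 0.1^2) observation noise is an assumed
# test problem; a_n and c_n are one common admissible choice of gains.
rng = np.random.default_rng(0)

def noisy_f(x):
    return float(x @ x) + 0.1 * rng.normal()   # observe f(x) + xi(x)

def kw(x0, n_iters=2000, a=0.5, c=0.5):
    x = np.array(x0, dtype=float)
    r = x.size
    for n in range(n_iters):
        a_n = a / (n + 1)
        c_n = c / (n + 1) ** 0.25
        grad = np.empty(r)
        for i in range(r):                     # central difference in coordinate i
            e = np.zeros(r)
            e[i] = c_n
            grad[i] = (noisy_f(x + e) - noisy_f(x - e)) / (2 * c_n)
        x = x - a_n * grad                     # X_{n+1} = X_n - a_n DY(X_n, c_n)
    return x

x_final = kw([2.0, -1.5])                      # should end near the minimizer 0
```

Note that each iteration consumes 2r noisy function evaluations, one pair per coordinate, which is one source of the inefficiency discussed above.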

In the practical deterministic theory, iteration or search in the gradient direction is almost never done. Let {X_n} denote a sequence of estimates of the minimum (or of a stationary point), for a deterministic problem. Let us see how X_{i+1} is frequently chosen, if X_0, ..., X_i are available. Usually, the gradient f_x(X_i) is calculated, and a direction d_i is selected. The direction d_i is usually some function of past X_j and f_x(X_j), for j ≤ i, as in the conjugate gradient or partan methods. Then a one-dimensional search is made along the line through X_i in direction d_i, until some criterion is satisfied. The search then stops, some point on the line is selected to be X_{i+1}, and then a new direction is selected, etc. The convergence proofs are often based on the assumption that each one-directional search cycle reduces the value of the function f(x) by some function of the gradient at the starting point of the cycle, or else locates the minimum along that direction to within ε_i (where ε_i → 0 as the cycle number i → ∞). Deterministic methods are usually designed not only for asymptotic efficiency (say, quadratic convergence), but also for their decent initial behavior, say, good ridge-following ability. It is worthwhile to develop stochastic methods with the same general structure; namely, compute X_{i+1} from X_i by selecting a direction d_i by some rule, and searching on the line through X_i in direction d_i until some criterion is satisfied; then select d_{i+1}, etc. It is desirable, analogously, that in the stochastic case a reduction "per cycle" of the average value of the performance function imply the desired convergence. This will be accomplished in a natural way. Indeed, there is great flexibility in the methods of the one-dimensional searches, and in the methods of selecting the d_i.

The aim of this paper is the development of a useful framework for computational methods for stochastic optimization, which incorporates the advantages and flexibility of some of the more recent developments in deterministic optimization. It is also desired to extend the usefulness of the classical KW method by studying the case (Theorem 1) where f(x) may have more than one stationary point, and to investigate its role as a one-dimensional search subroutine.

Theorem 3, which is the foundation of Theorem 1, gives a theorem that has broader use (than application merely to Theorem 1), in the same sense that the "model" algorithms of Polak [14, ch. 1] have a broad applicability to many deterministic hill-climbing methods. They put emphasis on structure and general properties, rather than on the details of particular algorithms. Both of these theorems are essentially concerned with KW (or Newton-Raphson)-like methods.

Our Theorem 2 is more in the spirit of the more recent deterministic ideas (those concerning the use of a sequence of one-dimensional search cycles). Let X_i denote the ith estimate of the stationary point. A direction d_i is selected. (The main restriction on d_i is essentially that the probability that d_i is orthogonal to the true gradient f_x(X_i) is zero.) Then a one-dimensional stochastic approximation procedure proceeds along d_i until some random time at which some criterion is satisfied. The criterion too can be rather arbitrary. Then a new direction is chosen, etc. The X_i denote only the first (or last) points of each cycle. The format allows great versatility. It could include versions of partan, conjugate gradient, or other deterministic methods. Indeed, there is no intrinsic reason why each one-dimensional search has to be a stochastic approximation; the method is valid for one-dimensional search methods that are not stochastic approximations. Some such methods are under investigation. Theorem 4 gives an algorithmic model of Theorem 3, which is valid as long as the conditional average improvement for each one-dimensional search satisfies certain general properties. Indeed, a purpose of Theorem 3 is the isolation of those properties that would be useful for each one-dimensional search to possess in the stochastic case.

There is no overlap between the results of this paper and those usual in stochastic approximation, say, of Dvoretsky [2]. Here, there may be many stationary points, and the contraction ideas of Dvoretsky (broad and important as they are) are not applicable.

II. CONVERGENCE THEOREMS

Definitions

For each x ∈ R^r (Euclidean r-dimensional space), let H(y|x) denote the distribution function of a real-valued random variable with finite mean f(x) = ∫ y dH(y|x) and uniformly bounded variance; thus, for some real σ̄, sup_x ∫ [y − f(x)]² dH(y|x) = σ̄² < ∞.

Let f_x(x) and f_xx(x) denote the gradient and Hessian of f(x), if they exist, with components f_i(x) and f_ij(x), respectively. The interest here is in minimizing f(x), when possible. Next, the usual KW iterative procedure for locating the minimum of f(x) (under conditions on f(x) which are close to the assumption that f(x) has a unique stationary point, which is a minimum) will be described. Let {X_n} denote the sequence of estimates that are constructed by the KW method. Define¹ D_0 = {x: f_x(x) = 0}.

¹ D_0 may contain only one point. In general, it consists of a collection of bounded or unbounded disjoint closed sets, some of which may contain only one point. All of these cases occur in applications.



In Theorem 1, it is shown that, even if there is not a unique stationary point, X_n → D_0 with probability 1. This is a type of result that is commonly obtained in deterministic optimization. To describe the exact procedure, suppose that X_0, ..., X_n are available and let

{c_n^i, i = 1, ..., r; n = 0, 1, ...} and {a_n, n = 0, 1, ...} be sequences of positive real numbers. Let e_i denote the unit vector in the ith coordinate direction in R^r. To construct X_{n+1}, let Y_n^{2i} and Y_n^{2i−1}, i = 1, ..., r, be random variables drawn from H(y|x), with parameters x = X_n + e_i c_n^i and x = X_n − e_i c_n^i, i = 1, ..., r, respectively, and let ξ_n denote the vector with ith component²

ξ_n^i = (Y_n^{2i} − Y_n^{2i−1}) − (f(X_n + e_i c_n^i) − f(X_n − e_i c_n^i)).

Let DY(X_n,c_n), Df(X_n,c_n), and Dξ(X_n,c_n) denote the vectors with ith components [Y_n^{2i} − Y_n^{2i−1}]/2c_n^i, [f(X_n + e_i c_n^i) − f(X_n − e_i c_n^i)]/2c_n^i, and ξ_n^i/2c_n^i, respectively. For a sequence of matrices H_n, let the KW process define X_{n+1} by

X_{n+1} = X_n − a_n H_n DY(X_n,c_n).  (1)

Let³ B_n denote the smallest σ algebra that measures X_0, Y_k^s, s = 1, ..., 2r, k ≤ n − 1 (i.e., all the data that are known when X_n is constructed). Recall that D_0 = {x: f_x(x) = 0}, the set of points where the first-order necessary condition for minimality holds.

Theorem 1: Assume Conditions Ia-Id.

Condition Ia:

Σ_n a_n²/(c_n^i)² < ∞,  c_n^i → 0 as n → ∞,  Σ_n a_n c_n^i < ∞,  i = 1, ..., r,  Σ_n a_n = ∞.

Condition Ib: For some real positive ε̄ and ε, let the positive definite matrices H_n satisfy (in the sense of the partial ordering of positive definite matrices⁴)

ε̄ ≥ H_n ≥ ε;

the H_n are B_n measurable.

Condition Ic: For a real⁵ σ²,

E_{B_n} ξ_n = 0,  E_{B_n} |ξ_n|² = σ_n² ≤ σ².

Condition Id: f_x(x) and f_xx(x) exist and are continuous on R^r. There is a real K_0 so that, for any vector⁶ y, |y′f_xx(x)y| ≤ K_0|y|², and f(x) is bounded from below.

Then, if {X_n} is bounded with probability 1, there is a finite random variable f^0 so that f_x(X_n) → 0, f(X_n) → f^0, and X_n → D_0, with probability 1 (i.e., |f_x(X_n)| > ε only finitely often with probability 1 for each ε > 0).

If we assume either Condition (Ie-1), (Ie-2), or (Ie-3), then the above conclusions hold.

Condition Ie-1: {X_n} is bounded with probability 1.

Condition Ie-2: f(x) → ∞ as |x| → ∞, and lim inf_{|x|→∞} |f_x(x)| > 0.

Condition Ie-3: |f_x(x)| > 0 for all large |x|, and for some positive definite matrix P, with probability 1,

x′PH_n f_x(x) ≥ 0

for all sufficiently large |x| (for |x| ≥ some real r_0) and all n.

Discussion: Condition Ie is used only to insure that the iterates X_n are uniformly bounded (i.e., that lim sup |X_n| < ∞ with probability 1). Any condition which guarantees that {X_n} is bounded can replace Condition Ie. Condition Ie-3 is a type of Lyapunov function condition. It states that for large |x|, f(x) increases (in suitable directions) as |x| increases, and hence that there is no relative minimum at x = ∞ (although f_x(x) → 0 as |x| → ∞ is possible). The H_n play no real role in the proofs (provided Condition Ib holds), but they do allow some versatility in the algorithm, for H_n may be a (bounded) estimate of the inverse of the Hessian at the minimum (if it exists), or it may be (see also Theorem 2) a transformation that aids movement of the path along ridges (as is done, say, in partan). Basically, all that is required is that the average direction of each iteration be strictly within π/2 of the gradient direction (say, within π/2 − ε for some real ε > 0, ε < π/2); this is guaranteed by Condition Ib. Also, many other methods for estimating the gradient can be used besides central difference methods.
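For instance (an illustrative check, not from the paper), polynomial gains a_n = n^{-p} and c_n^i = n^{-q} satisfy Condition Ia exactly when 2p − 2q > 1, p + q > 1, p ≤ 1, and q > 0, by the usual p-series tests:

```python
# Illustrative check (not from the paper) that polynomial gains a_n = n**-p,
# c_n = n**-q satisfy Condition Ia.  By the p-series test:
#   sum a_n^2/c_n^2 < inf  <=>  2p - 2q > 1
#   sum a_n c_n     < inf  <=>  p + q > 1
#   sum a_n         = inf  <=>  p <= 1,   and c_n -> 0  <=>  q > 0
def satisfies_condition_Ia(p, q):
    return 2*p - 2*q > 1 and p + q > 1 and p <= 1 and q > 0

assert satisfies_condition_Ia(1.0, 0.25)      # the familiar KW choice
assert not satisfies_condition_Ia(1.0, 0.5)   # 2p - 2q = 1: the first sum diverges
```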

The method of Theorem 2 is closer to standard methods for deterministic unconstrained optimization. At the ith cycle, we start with an estimate X_i and a direction d_i for the next one-dimensional search. Iteration proceeds on the line through X_i in the directions ±d_i until some criterion is satisfied. The last iterate along that line is called X_{i+1}, and it is the initial value of the next cycle. A new direction d_{i+1} is determined and we continue. The ith one-dimensional search is a KW procedure, which stops at the n_i th iteration, for some finite with probability 1 random variable n_i. The sequence of iterates in the ith cycle is denoted by Z_0^i, ..., Z_{n_i}^i. Thus X_i = Z_0^i and X_{i+1} = Z_{n_i}^i = Z_0^{i+1}.

Suppose X_0, ..., X_i, Z_0^i, ..., Z_k^i, k < n_i, are available. Then Z_{k+1}^i is chosen as follows. Draw the random variables⁷ Y_{2k+1}^i and Y_{2k}^i from H(y|x) with parameters (for some finite difference interval c_k^i > 0) Z_k^i + d_i c_k^i and Z_k^i − d_i c_k^i,

² c_n^i is the finite difference interval to be used in the estimate of f_i(X_n), the ith component of the gradient at X_n. c_n is the vector (c_n^1, ..., c_n^r).

³ Conditioning on B_n is equivalent to conditioning on X_0, Y_k^s, s = 1, ..., 2r, k ≤ n − 1. A function that is B_n measurable is a suitable function of X_0, Y_k^s, s = 1, ..., 2r, k ≤ n − 1, i.e., of all the data used in the construction of X_1, ..., X_n.

⁴ ε̄ and ε are the upper and lower bounds (uniformly in n) to the largest and smallest eigenvalues, respectively, of H_n.

⁵ All norms are Euclidean. E_{B_n}X and E[X|B_n] both denote the conditional expectation of X given B_n.

⁶ The prime denotes transpose.

⁷ The superscript i denotes the cycle number, the subscript k denotes the iteration number in the cycle. In Theorem 1, the subscript n denoted the iteration number and the superscript i the directional component.



respectively. Define Z_{k+1}^i by

Z_{k+1}^i = Z_k^i − a_k^i d_i [Y_{2k+1}^i − Y_{2k}^i]/2c_k^i.  (2)

Define

ξ_k^i = (Y_{2k+1}^i − Y_{2k}^i) − (f(Z_k^i + d_i c_k^i) − f(Z_k^i − d_i c_k^i)).

Let B_k^i measure all the data up to and including the kth step of the ith cycle (i.e., up to and including the data used for the construction of Z_k^i) and let B_i = B_0^i measure X_0, ..., X_i. Note that the c_k^i and ξ_k^i are defined differently here than in Theorem 1.

Theorem 2: Assume Condition Id and Conditions IIa-IId.

Condition IIa: a_k^i, c_k^i are B_k^i measurable random variables satisfying 0 < c_k^i → 0 as i + k → ∞ and, for real sequences bounding them,

C̄_k^i ≥ c_k^i ≥ C_k^i,  Ā_n^i ≥ a_n^i ≥ A_n^i.

Condition IId:

∞ > B_2′ ≥ Σ_{k=0}^{n_i−1} a_k^i ≥ B_1′ > 0.

Then, if {X_i} is bounded with probability 1, there is a random variable f^0 so that f(X_i) → f^0, f_x(X_i) → 0, and X_i → D_0 with probability 1. {X_i} is bounded with probability 1 under Condition Ie-1 or Ie-2.

Discussion: Condition IIb is not very restrictive, and the directions can be selected in many ways. d_i can be the (noisy) estimated gradient direction at X_i. In deterministic optimization the gradient direction is not usually the best direction in which to search. With slight modifications that do not affect their convergence properties, methods such as partan or conjugate gradient choose a direction that is related to the current gradient, and that (deterministically) satisfies Condition IIb for γ_2 = 1 and some γ_1 > 0. We must give ourselves the same flexibility in the stochastic case. Also, we allow the a_k^i, c_k^i to vary within some limits.

In Condition IId, we can set B_1′ = 0. The notation and method of the proof is simpler when B_1′ > 0, but it should be intuitively clear that B_1′ > 0 is not necessary. Thus, the main burden of Condition IId is that we not spend too much time (defined by Σ_k a_k^i → ∞, as i → ∞) in any one direction. Within Condition IId, the decision whether to stop or not can depend in an arbitrary way on the past data. Some random stopping rules are under investigation. |d_i′f_x(X_i)| is the absolute value of the derivative at the initial point of the ith iteration. Condition IIb says that, with some probability γ_2, this must be no smaller than some fraction, γ_1, of the norm of the gradient. This is, seemingly, a minimal condition.
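The cycle structure of Theorem 2 can be sketched as follows. The choices below are illustrative stand-ins for the paper's general conditions: the direction d_i is the normalized noisy gradient (in the spirit of Condition IIb), and each one-dimensional search runs a fixed number of KW steps, a particularly simple stopping criterion:

```python
import numpy as np

# Sketch of the cycle structure of Theorem 2 (illustrative choices throughout):
# pick a direction d_i, run a short one-dimensional KW search along it, and
# take the last iterate of the cycle as X_{i+1}.
rng = np.random.default_rng(1)

def noisy_f(x):
    return float(x @ x) + 0.05 * rng.normal()          # f(x) = |x|^2 plus noise

def noisy_grad(x, c=0.1):
    g = np.empty(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = c
        g[i] = (noisy_f(x + e) - noisy_f(x - e)) / (2 * c)
    return g

def cycle_search(x0, n_cycles=30, steps=10):
    x = np.array(x0, dtype=float)
    for i in range(n_cycles):
        d = noisy_grad(x)
        d /= max(float(np.linalg.norm(d)), 1e-12)      # direction d_i for this cycle
        z = x.copy()
        for k in range(steps):                         # 1-D KW search along d_i
            t = steps * i + k + 1
            a_k, c_k = 0.5 / t, 0.2 / t ** 0.25
            deriv = (noisy_f(z + c_k * d) - noisy_f(z - c_k * d)) / (2 * c_k)
            z = z - a_k * deriv * d                    # Z_{k+1} = Z_k - a_k (est. derivative) d_i
        x = z                                          # X_{i+1} = last iterate of the cycle
    return x

x_star = cycle_search([3.0, -2.0])
```

Each cycle costs only two noisy observations per step, regardless of the dimension, which is one motivation for the cycle format.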

The foregoing algorithms fall into a general framework that includes many nonstochastic approximation algorithms, in the same manner that convergence of many deterministic optimization schemes can be proved by merely working with properties of the subroutines (the properties of the method for selecting the direction of iteration and the method of search for a local minimum in that direction). See, e.g., Polak [14] or Zangwill [15]. In this sense, Theorems 3 and 4 generalize Theorems 2 and 1, respectively. Indeed, hopefully useful, but nonstochastic approximation, methods that satisfy Theorem 3 are being investigated.

Theorem 3 fits the situation of Theorem 2, and Theorem 4 fits the situation of Theorem 1. Theorems 3 and 4 were motivated by these respective methods, but cover many more cases. In Theorem 3, {X_n} denotes a sequence of R^r valued random variables.

Theorem 3: Let B_n measure X_0, ..., X_n. Suppose that f(x) is bounded from below. Let there be a real-valued function a_n(x) ≥ 0, a real number ā > 0, and a continuous real-valued function c(x) on R^r − D_0, with a_n(x) ≥ ā > 0 for n ≥ c(x), so that

E_{B_n}f(X_{n+1}) − f(X_n) ≤ −a_n(X_n)δ(X_n) + β_n,  (3)

where the random variables β_n satisfy

Σ_n E|β_n| < ∞,  (4)

and δ(x) is a continuous real-valued nonnegative function, positive for x ∉ D_0. Then, if {X_n} is bounded with probability 1,

f(X_n) converges to a finite random variable with probability 1  (5a)

f_x(X_n) → 0 with probability 1  (5b)

X_n → D_0 with probability 1.  (5c)
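As a numerical illustration of the theorem's conclusion on a hill with nonunique stationary points (all dynamics and parameters below are assumptions for the sketch), a noisy gradient iteration on the double well f(x) = (x² − 1)², for which D_0 = {−1, 0, 1}, satisfies a descent bound of the form (3) with summable noise effects, and its iterates settle into D_0:

```python
import numpy as np

# Illustration (assumed dynamics and parameters) of convergence to a
# nonunique stationary set: f(x) = (x^2 - 1)^2 has D_0 = {-1, 0, 1}.
# The iterates of a noisy gradient iteration settle into D_0, though
# which point of D_0 is reached depends on the start and the noise.
rng = np.random.default_rng(2)

fx = lambda x: 4.0 * x * (x * x - 1.0)        # gradient of the double well

def noisy_descent(x0, n_iters=5000):
    x = x0
    for n in range(n_iters):
        a_n = 1.0 / (n + 10)
        x = x - a_n * (fx(x) + 0.5 * rng.normal())   # observed gradient + noise
    return x

finals = [noisy_descent(x0) for x0 in (-2.0, -0.3, 0.4, 2.0)]
# distance of each run's final iterate from the nearest point of D_0
dists = [min(abs(x - s) for s in (-1.0, 0.0, 1.0)) for x in finals]
```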

Discussion: In relating Theorem 3 to the stochastic approximation procedure of Theorem 2, the β_n term in (3) is due to the "average" noise effects and to the bias effects due to the use of "noninfinitesimal" finite difference intervals. The −a_n(X_n)δ(X_n) term is an estimate of the improvement in the average value of f(x) in the nth cycle, which is due to the fact that, for large n, the iteration is in a direction that on average (and for a sufficient number of iterates of the nth one-dimensional search cycle) is strictly within π/2 of the gradient. In general, for hill descending methods in the presence of observation noise, there would always be the conflicting effects of noise and of "pull" in the gradient or other direction "near" the gradient. Theorem 3 implies that (loosely speaking), modulo the summable noise and bias effects β_n, a return to an x ∉ D_0 infinitely often implies that



(ultimately) the per-return average reduction in f(x) is at least āδ(x). In that case, for that sequence, f(X_n) → −∞, a contradiction to the convergence of f(X_n).

Theorem 4: Assume the conditions (3), (4), except that the a_n(x) is replaced by a real sequence a_n ≥ 0, where a_n may tend to zero and

Σ_n a_n = ∞,  sup_n a_n < ∞.  (6)

Let δ̂_n(x) denote a sequence of functions for which |δ̂_n(x)| ≤ δ_1(x), where δ_1(x) is bounded on bounded sets, and suppose that {X_n} has the form

X_{n+1} − X_n = a_n h_n(X_n) + γ_n,  (7)

where (with probability 1)

lim_{M→∞} sup_{N≥M} |Σ_{n=M}^{N} γ_n| = 0,  |h_n(x)| ≤ δ_1(x).

Then (5) holds if {|X_n|} is bounded with probability 1.

Thus the methods used for the searches need not be stochastic approximation; arbitrary data on f(x) can be used in the search, provided only that (3) and (4) (or their analogs in Theorem 4) hold. That the conditions are natural is illustrated by the simplicity with which they represent the stochastic approximation problems of Theorems 2 and 1, as seen in the proofs.

III. MONTE CARLO OPTIMIZATION FOR CONTROL PROBLEMS

Suppose that the random variable Y drawn from H(y|x) takes the specific form Y = Y(x,ψ), for some function Y(·,·), where ψ is a random variable whose distribution does not depend on x. Define EY(x,ψ) = f(x), and suppose that f(·) and Y(·,ψ) are continuously differentiable in x (for each ψ). Let sup_x E|Y_x(x,ψ) − f_x(x)|² = M_0 < ∞. Suppose that we can observe the gradient Y_x(x,ψ). Then in (1) or (2) it is not necessary to use finite differences to estimate the derivatives. Write ξ_n for Y_x(X_n,ψ_n) − f_x(X_n), and suppose that {ψ_n} is a sequence of mutually independent random variables, each with the distribution of ψ. Replace (1) with

X_{n+1} = X_n − a_n H_n Y_x(X_n,ψ_n) = X_n − a_n H_n [f_x(X_n) + ξ_n]  (8)

and (2) with (for {ψ_k^i} mutually independent in i and k)

Z_{k+1}^i = Z_k^i − a_k^i d_i′Y_x(Z_k^i,ψ_k^i) d_i.  (9)

Replace Condition Ia with Σ_n a_n² < ∞, Σ_n a_n = ∞, and similarly for Condition IIa.

Next consider a control problem for the following (q + 1)-dimensional dynamical system (with state sequence V_0, ..., V_k):

V_{n+1} = F(V_n,u_n,w_n),  n = 0, ..., k − 1,

where {w_n} is a sequence of random variables whose distribution does not depend on the {V_n,u_n}, and the u_n are open-loop controls, or parameters of closed-loop controls. Let V_{k,0} denote the zeroth component of V_k, where we write, for some function P,

V_{k,0} = P(u_0, ..., u_{k−1}, w_0, ..., w_{k−1}).  (10)

Suppose that it is desired to minimize EV_{k,0} = f(u_0, ..., u_{k−1}) over the sequence of vectors u_0, ..., u_{k−1}. If the minimization is to be done by Monte Carlo, with {w_n} generated by a computer, and if F has continuous derivatives with respect to V and u, then the idea at the beginning of the section can be utilized. Let p(k) denote the (q + 1) column vector [1, 0, ..., 0]′ and define⁸

p(i) = F_V′(V_i,u_i,w_i)p(i + 1),  i ≤ k − 1.

Then

Y_{u_i}(u_0, ..., u_{k−1}, w_0, ..., w_{k−1}) = F_u′(V_i,u_i,w_i)p(i + 1).  (11)

For each i, let ψ_i ≡ {w_n^i} denote a sequence distributed as {w_n}, and let the ψ_i be independent in i. ψ_i is the noise sequence on the ith Monte Carlo run. Let {u_n^i, n = 0, ..., k − 1} ≡ X_i denote the control sequence used on the ith Monte Carlo run. Then V_{k,0}^i = Y(X_i,ψ_i) = P(u_0^i, ..., u_{k−1}^i, w_0^i, ...) is the cost on the ith run, and its derivative Y_x(X_i,ψ_i) can be calculated exactly. Furthermore, variance reduction (e.g., antithetic variable) methods can be used; e.g., we may use

(1/2)[Y_x(X_i,ψ_i) + Y_x(X_i,−ψ_i)]

for Y_x(X_i,ψ_i) in (9), etc.

IV. CONCLUSIONS

Several forms of iterative procedures for stochastic optimization have been described. The stochastic approximation procedures suggest certain properties that stochastic optimization procedures should probably have. An attempt has been made to devise general forms that incorporate these properties (e.g., Theorems 3 and 4) but that do not have the general drawbacks of stochastic approximation. The deterministic optimization concept of "model" has been carried over to the stochastic case, and multiminimum hills have been treated.

APPENDIX

The proofs depend heavily on the so-called martingale theorem. A real-valued random process {Y_n} is called a supermartingale if E[Y_{n+1}|Y_n, ..., Y_0] ≤ Y_n (with probability 1), and a martingale if = replaces ≤. Equivalently, let {B_n} denote a sequence of nondecreasing σ algebras so that Y_0, ..., Y_n is measurable over B_n. Then {Y_n} is a supermartingale if⁹ E_{B_n}Y_{n+1} ≤ Y_n with probability 1.

⁸ F_x(x,u,w) is the matrix whose i,jth component is the derivative of the ith component of F(x,u,w) with respect to the jth component of x.

⁹ The notations E[X|B] and E_B X are used interchangeably for conditional expectation.
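The adjoint recursion (10)-(11) of Section III is easy to sketch for a single Monte Carlo run with a fixed noise sequence. The two-dimensional system below (state = [running cost, z]), its Jacobians, and all the numbers are illustrative assumptions; the backward pass reproduces the finite-difference gradient of the run cost:

```python
import numpy as np

# Sketch of the adjoint computation (10)-(11) for one Monte Carlo run with a
# fixed noise sequence.  The system is an assumed example: the zeroth state
# component accumulates a running cost, so V_{k,0} is the total cost of the run.
def F(V, u, w):
    cost, z = V
    return np.array([cost + u * u + z * z, 0.5 * z + u + w])

def F_V(V, u, w):                 # Jacobian of F with respect to V
    _, z = V
    return np.array([[1.0, 2.0 * z],
                     [0.0, 0.5]])

def F_u(V, u, w):                 # derivative of F with respect to the scalar control u
    return np.array([2.0 * u, 1.0])

def run_and_grad(u_seq, w_seq, V0):
    k = len(u_seq)
    V = [np.array(V0, dtype=float)]
    for n in range(k):                                   # forward: V_{n+1} = F(V_n,u_n,w_n)
        V.append(F(V[n], u_seq[n], w_seq[n]))
    p = np.array([1.0, 0.0])                             # p(k) = [1, 0]'
    grad = np.zeros(k)
    for i in range(k - 1, -1, -1):
        grad[i] = F_u(V[i], u_seq[i], w_seq[i]) @ p      # (11): F_u' p(i+1)
        p = F_V(V[i], u_seq[i], w_seq[i]).T @ p          # p(i) = F_V' p(i+1)
    return V[-1][0], grad                                # cost V_{k,0} and its u-gradient

u = np.array([0.3, -0.2, 0.1])
w = np.array([0.05, -0.1, 0.2])
cost, grad = run_and_grad(u, w, [0.0, 1.0])

# finite-difference check of (11) on the same run (same fixed noise)
eps = 1e-6
fd = np.array([(run_and_grad(u + eps * np.eye(3)[i], w, [0.0, 1.0])[0] - cost) / eps
               for i in range(3)])
```

Since the noise sequence is held fixed within the run, the exact derivative of the run cost is obtained in one backward sweep, which is the point of (10)-(11).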


KUSHNER: STOCHASTIC APPROXIMATION ALGORITHMS 651

For a real-valued function Y define Y⁻ ≡ min [0,Y]. Note that the symbols K_i, i = 1,2,..., are used for arbitrary real positive numbers, whose values may change from theorem to theorem.

Lemma 1 (Martingale Convergence Theorem, Loève [16]): Suppose that inf_n EY_n⁻ > −∞ for a supermartingale {Y_n}. Then Y_n converges to some random variable Y with probability 1.

Corollary 1: Suppose that {f_n} is a sequence of real-valued random variables with f_n ≥ M for a real number M, {B_n} is a sequence of nondecreasing σ-algebras, and f₀, ..., f_n is measurable over B_n. Suppose that the sequence of random variables {q_n} satisfies E Σ|q_i| < ∞. If E_{B_n} f_{n+1} − f_n ≤ q_n, then f_n converges to a random variable f̄ with probability 1.

Proof: Define Y_n by

    Y_n = f_n + E_{B_n} Σ_{i=n}^∞ |q_i|.

Then {Y_n} is a supermartingale, Lemma 1 holds, and since Σ|q_i| < ∞ with probability 1, f_n converges also.

Q.E.D.

Proof of Theorem 3: By Corollary 1 (with f_n = f(X_n)), the sequence f(X_n) converges with probability 1.

Suppose {X_n} is bounded with probability 1 and lim sup_n δ(X_n) > 0 with probability greater than zero. Then there is some compact set D′ that is disjoint from D₀ and for which inf_{x∈D′} δ(x) = b > 0, sup_{x∈D′} c(x) ≤ n₀ < ∞, and X_n is in D′ infinitely often with probability at least some p₀ > 0. Then, for n > n₀ (I_A is the indicator function of the set A, and only the X_i that are in D′ are counted in the last inequality below),

But the left-hand side is bounded from below in n, since f(x) is bounded from below. Thus [by the Borel-Cantelli lemma and (4)] δ(X_i)I_{{X_i ∈ D′}} → 0 with probability 1, which contradicts lim sup_n δ(X_n) > 0 with probability > 0. Then the with-probability-1 boundedness of {X_n} and the properties of δ(·) imply that f_x(X_n) → 0 and X_n → D₀ with probability 1. Q.E.D.
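Theorem 3's conclusion, that f_x(X_n) → 0 with probability 1, can be checked on a toy instance of the gradient iteration (8) with H_n = I. The quadratic hill, noise level, and step constants below are hypothetical choices satisfying Σ a_n = ∞, Σ a_n² < ∞.

```python
import random

def gradient_iteration(grad, x0, steps=20000, a=0.5, noise=0.3, seed=0):
    """Iteration (8) with H_n = I:  X_{n+1} = X_n - a_n * Y_x(X_n, psi_n),
    where Y_x = f_x + zero-mean noise and a_n = a/(n+1), so that
    sum a_n = infinity and sum a_n^2 < infinity (Condition Ia as modified)."""
    rng = random.Random(seed)
    x = x0
    for n in range(steps):
        a_n = a / (n + 1)
        x -= a_n * (grad(x) + rng.gauss(0.0, noise))
    return x

# Hypothetical hill f(x) = (x - 2)^2, f_x(x) = 2(x - 2); stationary point x = 2.
x_star = gradient_iteration(lambda x: 2.0 * (x - 2.0), x0=-5.0)
```

The iterates drift into a shrinking neighborhood of the stationary set D₀ = {2}, where f_x vanishes.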

Proof of Theorem 4: Suppose that the {a_n(·)} in (3) is replaced by the bounded real number sequence {a_n}, where {a_n} satisfies (6) and (7). As in Theorem 3, f(X_n) converges with probability 1. Let D₁ and D denote any compact sets with the properties that D ⊂ D₁ and

(i) inf_{x∈D, y∉D₁} |x − y| = δ₀ > 0 (i.e., D is strictly interior to D₁);
(ii) inf_{x∈D₁} |δ(x)| = δ₁ > 0 (i.e., D₁ does not intersect D₀);
(iii) sup_{x∈D₁} |δ₁(x)| = δ₂ < ∞ (true by compactness of D₁).

It will only be shown that X_n is in D only finitely often with probability 1, for each compact D for which there is a compact D₁ ⊃ D so that D and D₁ satisfy (i)-(iii). By the with-probability-1 boundedness of {X_n}, and the fact that δ(·) is positive and continuous in R^r − D₀, (5b) and (5c) will follow. (Indeed, if {X_n} is not required to be bounded, and X_n does not tend to D₀, then it will follow that |X_n| → ∞.) The main difficulty in the proof is due to the possibility that a_n → 0. Thus, it is necessary to estimate the "length" of stay of each visit to D₁ or D. It will be shown that each length of stay is long enough to allow an estimate of the type used in the last proof.

Define the sequences m_i, m_i′ by m₀′ = 0 and

    m_i = inf {n : n ≥ m_{i−1}′, X_n ∈ D}
    m_i′ = inf {n : n ≥ m_i, X_n ∉ D₁}.

Thus m_i is essentially the time of the ith entry of X_n into D, after being out of D₁ i − 1 times. m_i is finite (i.e., defined) on an ω-set Ω_i. Observe that the theorem follows if it is shown that m_i < ∞ only finitely often with probability 1.
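The entry and exit times m_i, m_i′ can be computed mechanically for any finite trajectory. The following sketch (the sets and the sample path are hypothetical, purely for illustration) mirrors the definitions above.

```python
def entry_exit_times(path, in_D, in_D1):
    """Compute m_i (first entry into D at or after the previous exit m_{i-1}'
    from D1, with m_0' = 0) and m_i' (first exit from D1 at or after m_i),
    as in the proof of Theorem 4.  `path` is a finite trajectory X_0, X_1, ...
    and `in_D`, `in_D1` are membership tests with D contained in D1."""
    m, m_prime = [], []
    start = 0  # plays the role of m_0' = 0
    while True:
        mi = next((n for n in range(start, len(path)) if in_D(path[n])), None)
        if mi is None:
            break  # no further entry into D
        m.append(mi)
        mi_p = next((n for n in range(mi, len(path)) if not in_D1(path[n])), None)
        if mi_p is None:
            break  # the trajectory never leaves D1 again
        m_prime.append(mi_p)
        start = mi_p
    return m, m_prime

# Hypothetical sets D = [-1, 1] inside D1 = [-2, 2], and a short oscillating path.
path = [3.0, 0.5, 0.2, 1.5, 2.5, 3.0, 0.8, 2.1]
m, m_prime = entry_exit_times(path, lambda x: abs(x) <= 1, lambda x: abs(x) <= 2)
```

For the sample path above, the trajectory enters D at times 1 and 6 and exits D₁ at times 4 and 7.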

Define

    (A1)

The second line follows by only counting (and bounding) the random strings between the m_i and m_i′ − 1, for which m_i < ∞ and L_i < δ₀/2 (i.e., where the effects of the noise terms alone are insufficient to move X_n by more than half the distance between D₁ and D in [m_i, m_i′ − 1]).

Suppose X_{m_i} ∈ D (i.e., m_i < ∞). Then, as j increases, X_{m_i+j} must leave D₁ eventually, or else the ith sum in (A1) would be infinite. Thus m_i′ < ∞ if m_i < ∞ (with probability 1). Let L_i < δ₀/2. Then, by (7), if X_{m_i′} is to be outside of D₁ for a finite m_i′, it is required at least that |Σ_{j=m_i}^{m_i′−1} a_j f_x(X_j)| ≥ δ₀/2. But then by (iii), Σ_{j=m_i}^{m_i′−1} a_j ≥ δ₀/(2δ₂) > 0. Then (A1) yields

But L_i → 0 with probability 1. Thus it is necessary that m_i < ∞ only finitely often with probability 1 if the right-hand side of (A2) is to be finite. Q.E.D.

Proof of Theorem 1: The proof relates the process of Theorem 1 to that of Theorem 4. By Taylor's expansion with remainder, we can write Df(X_n,c_n) = f_x(X_n) + B_n,


(A6)

where we define ā_n = ε₁a_n/2, δ(x) = |f_x(x)|², and β_n = E_{B_n} f(X_{n+1}) − f(X_n); for n < ñ, define ā_n = 0.

    ¹⁰ Recall that f_{ij} is the partial derivative of f with respect to x_i and then x_j.

Returning to (A3), let us check the conditions on the terms of (7). Observe that

    Σ_n |a_n H_n B_n| ≤ K₂ Σ_n a_n c_n < ∞

and that the components of the vector E_n, defined by E_n = Σ_{i=0}^n a_i H_i Dξ(X_i,c_i), are martingales. Also, the second moment E|E_n|² is uniformly bounded in n. Thus E_n converges with probability 1 as n → ∞, and hence |Σ_{i=0}^n a_i H_i B_i + a_i H_i Dξ(X_i,c_i)| converges with probability 1. Also lim_{M→∞} sup_{n≥M} |E_n − E_M| = 0 with probability 1. Hence the required (from Theorem 4) condition on γ_n [defined above (A3)] holds. Define δ₁(x) = |f_x(x)|.

Suppose that {X_n} is bounded with probability 1. Then all the conditions of Theorem 4 hold, and (5b) and (5c) follow.

The boundedness with probability 1 of {X_n} can be proved using either Condition Ie-1, Ie-2, or Ie-3. In the interests of saving space, the reader is referred to [17, pp. 121-122]. Q.E.D.
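The decomposition Df(X_n,c_n) = f_x(X_n) + B_n used in the proof of Theorem 1 is just Taylor's expansion of a central difference: the bias B_n is O(c_n) when f_xx is bounded. A small numerical check, with a hypothetical test function:

```python
def central_difference(f, x, c):
    """Finite-difference estimate Df(x, c) = (f(x + c) - f(x - c)) / (2c).
    Taylor's expansion with remainder gives Df(x, c) = f_x(x) + B with
    |B| <= K0 * c whenever |f_xx| <= K0 near x, so the bias vanishes
    as the difference interval c shrinks."""
    return (f(x + c) - f(x - c)) / (2.0 * c)

# Hypothetical test function: f(x) = x**3, so f_x(1) = 3 and the bias is c**2.
bias_large = abs(central_difference(lambda x: x ** 3, 1.0, 0.5) - 3.0)
bias_small = abs(central_difference(lambda x: x ** 3, 1.0, 0.01) - 3.0)
```

Shrinking c from 0.5 to 0.01 reduces the bias from 0.25 to 10⁻⁴, while (as ρ_n shows) it inflates the noise term a_n²/c_n², which is exactly the trade-off the gain conditions balance.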

Lemma 2 provides a very useful bound for a truncated one-dimensional stochastic approximation. It will be used in the proof of Theorem 2. Let B_n denote the smallest σ-algebra that measures X₀, ..., X_n.

First note that, if x is scalar (hence H_n ≡ 1), then (A5) can be written as (here and in the future c_n is scalar)

    E_{B_n} f(X_{n+1}) − f(X_n) ≤ −a_n(1 − c_nK₀ − 4a_nK₀)|f_x(X_n)|² + ρ_n   (A7)

where

    ρ_n = a_nc_nK₀ + a_n²c_n²K₀² + 2K₀a_n²σ²/c_n².

Lemma 2: Let x be a scalar and let {X_n} denote the iterates in a scalar-valued stochastic approximation procedure with f_{xx}(·) bounded and continuous (sup_x |f_{xx}(x)| = K₀ < ∞). Thus

    X_{n+1} = X_n − a_n DY(X_n,c_n),   X₀ = x (real), a_n > 0, c_n > 0
            = X_n − a_n Df(X_n,c_n) − a_n Dξ(X_n,c_n).

Let ξ_n satisfy E_{B_n} ξ_n = 0, E_{B_n}|ξ_n|² ≤ σ² < ∞. Let the procedure terminate at a time N (N can be a nonanticipative random variable) and let Σ_{i=0}^{N−1} a_i ≥ B₁′ > 0, for some real


number B₁′. Let 1 − c_nK₀ − 4a_nK₀ ≥ 1/2 for n = 0, ..., N − 1. Define G by¹¹

    G = E_x Σ_{i=0}^{N−1} (a_ic_i + a_i²/c_i²).

There is a real Q > 0 and a function c(·) on (0,∞), where c(u) is positive for u > 0 and is nondecreasing as u → ∞, so that if c(|f_x(x)|) ≥ G, then

    E_x f(X_N) − f(x) ≤ −Q|f_x(x)|² + E_x Σ_{n=0}^{N−1} ρ_n.   (A8)

Proof: The method of the proof is to bound the f_x(X_n) term in (A7) by a constant times f_x(x), for sufficiently many n. Write (see proof of Theorem 1)

    X_{n+1} = X_n − a_n f_x(X_n) − b_{1n} − b_{2n}

where b_{1n} ≡ a_n(Df(X_n,c_n) − f_x(X_n)), b_{2n} ≡ a_nξ_n/c_n, and |b_{1n}| ≤ K₀a_nc_n. Define

    B_{1n} ≡ Σ_{i=0}^{n−1} b_{1i},   B_{2n} ≡ Σ_{i=0}^{n−1} b_{2i},   B_n = B_{1n} + B_{2n}
    B_i* = sup_n |B_{in}|,   B* ≤ B₁* + B₂*.

For each u ∈ (0,∞), c(u) will be chosen so that G ≤ c(u) implies that

    (A9)-(A11)

By the Chebyshev inequality (A9) and the submartingale estimate (A10), the value of c(u) given by (A11) implies (A8) for G ≤ c(u). Thus, let (A11) define c(u).

Define d(x) = |f_x(x)|/2K₀. For y ∈ I_x ≡ {y : x − d(x) ≤ y ≤ x + d(x)}, we have that sgn f_x(y) = sgn f_x(x) and 2|f_x(x)| ≥ |f_x(y)| ≥ |f_x(x)|/2. This follows from the expression (for some θ ∈ [0,1]) f_x(y) − f_x(x) = f_{xx}(x + θ(y − x))(y − x). Define m:

    m = min [N, inf {n : X_n ∉ I_x}].

Then, by (A7) and the hypotheses of the lemma,

    E_x f(X_m) − f(x) ≤ −(1/2) E_x Σ_{n=0}^{m−1} a_n|f_x(X_n)|² + u.   (A12)

    ¹¹ The notation E_xY is the expectation conditioned on X₀ = x.

Now estimate E_x Σ_{i=0}^{m−1} a_i. If m = N, then Σ_{i=0}^{N−1} a_i ≥ B₁′. If m < N, then (since |f_x(X_i)| ≤ 2|f_x(x)|, i ≤ m − 1)

    |f_x(x)|/2K₀ = d(x) ≤ |X_m − x| = |Σ_{i=0}^{m−1} a_i f_x(X_i) + B_m|.

Substituting this into (A13) gives the lemma. Q.E.D.
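The truncated scalar procedure of Lemma 2 can be sketched as follows; the objective, the noise model, and the gain sequences a_n = a₀/(n+1), c_n = c₀/(n+1)^{1/4} are hypothetical choices consistent with the conditions of the lemma.

```python
import random

def kw_scalar(f, x0, N, a0=0.5, c0=0.5, noise=0.1, seed=2):
    """Truncated scalar Kiefer-Wolfowitz procedure of Lemma 2:
        X_{n+1} = X_n - a_n * DY(X_n, c_n),
    where DY(x, c) = (Y(x + c) - Y(x - c)) / (2c) uses noisy observations
    Y(x) = f(x) + xi with E[xi] = 0, E[xi^2] <= sigma^2.  The gains
    a_n = a0/(n+1) and c_n = c0/(n+1)**0.25 are hypothetical choices."""
    rng = random.Random(seed)
    x = x0
    for n in range(N):
        a_n = a0 / (n + 1)
        c_n = c0 / (n + 1) ** 0.25
        y_plus = f(x + c_n) + rng.gauss(0.0, noise)
        y_minus = f(x - c_n) + rng.gauss(0.0, noise)
        x -= a_n * (y_plus - y_minus) / (2.0 * c_n)
    return x

# Hypothetical hill f(x) = (x - 1)^2 with stationary point x = 1.
x_N = kw_scalar(lambda x: (x - 1.0) ** 2, x0=4.0, N=20000)
```

Consistent with (A8), the terminal iterate sits near the stationary point, with the residual scatter governed by the ρ_n terms.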

Proof of Theorem 2: The proof that f(X_n) converges is the same as in Theorem 1 and is omitted. Again using the Taylor expansion with remainder, we can write the iteration (2) as

where B_{ki} is (1/2)c_{ki} times the sum of the second derivatives d²f(X_k^i + td_ic_{ki})/dt², evaluated at some point t ∈ [0,1] and at some t ∈ [−1,0]. As previously,

Applying Lemma 2 to the ith one-dimensional search gives the following. Fix d_i and let the term (for vector x) f_x′(x)d_i replace the scalar f_x(x) in Lemma 2. Define (analogously to G in the lemma)

Let ñ denote the smallest integer for which 1 − c_{ni}K₀ − 4a_{ni}K₀ ≥ 1/2 for n ≥ ñ. There is a real Q > 0 (independent of G_k and i) so that, for k ≥ ñ,

with probability 1 relative to the set where c(|f_x′(X_k)d_i|) ≥ G_k, where c(·) is the function introduced in the lemma, and where

A,' = KIExi

Now it can be written that, for n ≥ ñ,


    E f(X_n) − E f(X_ñ) ≤ −Q E Σ_{i=ñ}^{n−1} |f_x′(X_i)d_i|² I_{{c(|f_x′(X_i)d_i|) ≥ G_i}} + E K Σ_i G_i.   (A17)

Since f(x) is bounded from below, E Σ G_i < ∞, and G_i → 0 with probability 1, (A17) implies (by the Borel-Cantelli lemma) that |f_x′(X_i)d_i| → 0 with probability 1. But this and

    P_{B_i}{|f_x′(X_i)d_i| ≥ γ₁|f_x′(X_i)|} ≥ γ₂ > 0

for some γ_i > 0 imply that f_x(X_i) → 0 with probability 1. The proof of the boundedness of {X_n} under Condition Ie-1 or Ie-2 is essentially the same as for Theorem 1 and is also omitted. Q.E.D.

REFERENCES

[1] J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," Ann. Math. Statist., vol. 23, pp. 462-466, 1952.
[2] A. Dvoretzky, "On stochastic approximation," in Proc. 3rd Berkeley Symp. Math. Statist. Probability, vol. 1, 1956, pp. 39-55.
[3] J. R. Blum, "Multidimensional stochastic approximation procedures," Ann. Math. Statist., vol. 25, pp. 737-744, 1954.
[4] J. H. Venter, "On Dvoretzky stochastic approximation theorems," Ann. Math. Statist., vol. 37, pp. 1534-1544, 1966.
[5] ——, "On convergence of the Kiefer-Wolfowitz procedure," Ann. Math. Statist., vol. 38, pp. 1031-1036, 1967.
[6] H. J. Kushner, "A simple iterative procedure for the identification of unknown parameters of a linear time varying system," Trans. ASME, J. Basic Eng., ser. D, vol. 85, pp. 227-235, 1963.
[7] ——, "Hill climbing methods for the optimization of multiparameter noise-disturbed systems," Trans. ASME, J. Basic Eng., ser. D, vol. 85, 1963.
[8] ——, "Adaptive techniques for the optimization of binary detection systems," in Conv. Rec., 1963 IEEE Int. Conv.
[9] K. B. Gray, "Applications of stochastic approximation to the optimization of random circuits," Symp. Appl. Math., vol. 16, pp. 172-192, 1964.
[10] Ya. Z. Tsypkin, "Adaptation, training and self-organization in automatic systems," Automat. Remote Contr. (USSR), vol. 27, pp. 16-51, 1966.
[11] Y. C. Ho and R. C. K. Lee, "Identification of linear dynamic systems," Inform. Contr., vol. 8, pp. 93-110, 1965.
[12] D. J. Sakrison, "A continuous Kiefer-Wolfowitz procedure for random processes," Ann. Math. Statist., vol. 35, pp. 590-599, 1964.
[13] D. F. Elliott and D. D. Sworder, "Applications of a simplified multidimensional stochastic approximation algorithm," IEEE Trans. Automat. Contr. (Short Papers), vol. AC-15, pp. 101-104, Feb. 1970.
[14] E. Polak, Computational Methods in Optimization. New York: Academic, 1971.
[15] W. I. Zangwill, Nonlinear Programming: A Unified Approach. Englewood Cliffs, N.J.: Prentice-Hall, 1969.
[16] M. Loève, Probability Theory, 3rd ed. Princeton, N.J.: Van Nostrand, 1963.
[17] H. J. Kushner, "Stochastic approximation like algorithms for the optimization of constrained and multimode stochastic problems," Cent. Dyn. Syst., Brown Univ., Providence, R.I., Rep. 72-1.

Harold J. Kushner (S’54-A’56-M’59) re- ceived the B.S. degree in electrical engineering from the City College of t.he City University of New York, New York, K.Y., in 1955, and the M S . and Ph.D. degrees, also in elect,rical engineering, from the University of Wisconsin, Madison, in 1956 and 1958, respectively.

He worked at. M.I.T. Lincoln Laboratories, Lexington, Mass., until 1963, R.I.A.S. until 1964, and since then has been a t Brown

University, Providence, R.I., where he is a member of the Center for Dynamical Systems and is a Professor of Applied Mathematics and Engineering. He has worked on almost, all aspeck of stochastic control theory: optimization, stability, comput.ationa1 methods, qualitative properties, filtering, etc. H i current interests include these as well as operations research problems that arise in hospitals and medical systems. He has been a consultant for NASA, Lincoln Laborat.ories, RAND Corporation, and ot.hers.

Dr. Kushner was t.he first Chairman of the IEEE Control Systems Society Stochastic Systems Committee:He is a member of t.he American hIat.hematical Society, the Institute of Mathemat,ical Statistics, and the Operations Research Societ,y of -4merica.