SIAM J. CONTROL AND OPTIMIZATION
Vol. 14, No. 5, August 1976

STOCHASTIC APPROXIMATION ALGORITHMS OF THE MULTIPLIER TYPE FOR THE SEQUENTIAL MONTE CARLO

OPTIMIZATION OF STOCHASTIC SYSTEMS*

HAROLD J. KUSHNER† AND MILTON L. KELMANSON‡

Abstract. Many stochastic control (or parametrized) systems have (expected value) objective functions of largely unknown form, but where noise corrupted observations can be taken at any selected value of a finite-dimensional parameter $x$. The parameter $x$ must satisfy equality and inequality constraints. The usual numerical techniques of nonlinear programming or control theory are not usually helpful here. The paper discusses a number of algorithms (with convergence proofs) for selecting a sequence of parameter values $\{X_n\}$, where $X_n$ depends on $X_{n-1}$ and observations taken at $X_{n-1}$, and the limit points are both feasible and satisfy the Kuhn-Tucker necessary condition w.p.1 (with probability 1). The algorithms are stochastic "small step" versions of the deterministic combined penalty function-multiplier methods.

1. Introduction. For some integers $s$, $t$, let $f(\cdot)$, $\phi_i(\cdot)$, $i = 1, \cdots, s$, $q_i(\cdot)$, $i = 1, \cdots, t$, denote continuous, twice differentiable real-valued functions on Euclidean $r$-space $R^r$, with uniformly bounded mixed second derivatives. $\phi(\cdot)$, $q(\cdot)$ denote the vectors with the components $\phi_i(\cdot)$, $q_i(\cdot)$, respectively. Define the sets $C = \{x : q_i(x) \le 0,\ i = 1, \cdots, t\}$ and $B = \{x : \phi_i(x) = 0,\ i = 1, \cdots, s\}$. For each $x \in R^r$, let $H(\cdot\,|x)$ and $\bar H(\cdot\,|x)$ denote distribution functions of real-valued and $R^r$-valued, respectively, random variables with uniformly (in $x$) bounded variance (covariance, resp.), and $\int y\,dH(y|x) = f(x)$, $\int y\,d\bar H(y|x) = f_x(x)$, where $f_x(\cdot)$ is the gradient of $f(\cdot)$. The paper is concerned with several algorithms for finding (sequentially) a local minimum of $f(x)$ in $C$, $B$ or $C \cap B$. The functions $\phi_i(\cdot)$, $q_i(\cdot)$ are known and their values or the values of their derivatives can be calculated at any $x$. We do not assume that $f(\cdot)$ is known but, given a parameter $X$, we can draw one or more random variables from the distribution (with parameter value $X$) $H(\cdot\,|X)$ or $\bar H(\cdot\,|X)$, depending on the case. If $X_i$, $i \le n$, and $Y_i$, $i \le n$, are the first $n$ parameter values at which draws are made, and the drawn values, respectively, and $X_{n+1}$ is the $(n+1)$st parameter value at which a draw (denoted by $Y_{n+1}$) is to be made, then we suppose that $E[Y_{n+1} \mid X_i,\ i \le n+1;\ Y_i,\ i \le n] = f(X_{n+1})$ (or $f_x(X_{n+1})$), according to the case.

The algorithms are roughly of the stochastic approximation type. An initial estimate, $X_0$, of a local minimum is made, one or more observations are taken at $X_0$, a new estimate $X_1$ is calculated, etc. As is generally true in nonlinear programming, it is quite difficult (except under certain convexity conditions) to devise practical computational algorithms which are guaranteed to (eventually) find a true local or global minimum. In this paper, the algorithms generate (as is usual in deterministic nonlinear programming) a sequence $\{X_n\}$ whose limit points are feasible (meaning that they satisfy the constraints; they are in $C$, $B$ or $C \cap B$, according to the case) and which satisfy one of the usual local necessary conditions for minimality; in particular, either the Kuhn-Tucker condition, or the necessary condition of the calculus, according to the case.

" Division of Applied Mathematics and Engineering, Brown University, Providence, RhodeIsland 02912. The work of this author was supported by the Air Force Office of Scientific Researchunder AF-AFOSR 71-2078C and in part by the National Science Foundation under Eng-73-03846-AO1,Office of Naval Research NONR N1467-AD-191001805.

‡ Division of Applied Mathematics, Brown University, Providence, Rhode Island 02912. The work of this author was supported in part by CAPES, Ministry of Education, Brasil.


References Kushner [1] and Kushner and Gavin [2] dealt with a family of (inequality constrained only) algorithms, based on the deterministic methods of feasible directions, in which each iterate satisfied the constraints. The search was divided into cycles, $X_n$ denoting the initial point of the $n$th search cycle, and all limit points (w.p.1) of $\{X_n\}$ satisfied a necessary condition for minimality. The general conditions on each search cycle implicitly required that the "search effort" per cycle increase as the cycle number increased. Part of the reason for the requirement of increasing effort per cycle is the difficulty of analyzing the algorithm when the iterates are on or near the boundary of the feasible region. Numerical experiments (such as those reported in [2]) suggest that more efficient use is made of the observations if the effort per cycle does not increase. In Kushner and Sanvicente [3], a penalty function-like method was developed (for the inequality constrained case). There the iterates were not constrained to be feasible (the algorithm guaranteed that the limits would be), but the method shares with the deterministic penalty function method the numerical disadvantage that the penalty functions ultimately increase extremely rapidly for $x$ outside of the feasible set.

Here, several stochastic approximation-like versions of the so-called deterministic methods of multipliers [4]-[7] will be developed. The methods in [4]-[7] do not require feasibility, and avoid some of the numerical problems associated with penalty function techniques. Intuitively, it seems very reasonable to expect that the numerical advantages which the techniques have in the deterministic case (say, with the methods in [4]-[7]) would also hold for the stochastic algorithms discussed below.

For the sake of simplicity of notation in the proofs, we do the pure equality constraint case in §2, and the pure inequality constraint case in §§3 and 4. It should be fairly clear that the combined problem can be handled by a combination of the ideas in the proofs of §§2, 3 and 4.

There are numerous applications of these systematic Monte Carlo optimization techniques. Typically $f(x)$ represents the average response of a physical system with parameter $x$. Only noise corrupted data $f(x) + \psi$ is available at each chosen value of $x$ ($\psi = $ observation noise). If the system is complex, $f(\cdot)$ will not be known, and we may have to resort to an "experimental" method for optimization. Experience with such methods in the stochastic case suggests that "small step" methods are probably preferable. The paper is concerned with convergence theorems for several such methods.

2. Equality constraints. The algorithms in this section are stochastic approximations of those discussed by Miele et al. [4] (in the sense that the Kiefer-Wolfowitz method is a stochastic version of Newton's method), where no actual convergence proofs are given. In order to minimize the number of terms in our expansions, we assume that the observations are taken with $\bar H(\cdot\,|x)$; i.e., given $X_n$, observe $Y_n$, where $Y_n$ is distributed as $\bar H(\cdot\,|X_n)$. Let $\mathscr{B}_n$ denote the smallest $\sigma$-algebra determined by $X_0, \cdots, X_n$. Then $E_{\mathscr{B}_n} Y_n = f_x(X_n)$, $\operatorname{covar}_{\mathscr{B}_n} \psi_n \le \sigma^2 I$ for some real $\sigma^2$, where $\psi_n \equiv Y_n - f_x(X_n)$. There are only minor changes in the assumptions if $f_x(X_n)$ must be estimated via finite differences, and the changes will be discussed later. Let $k$ denote a (henceforth fixed) positive real number, and define the functions $P(\cdot)$, $\Phi(\cdot)$, $L(\cdot,\cdot)$ and $W(\cdot,\cdot)$ by $P(x) = |\phi(x)|^2$, $\Phi(x) = $ Jacobian of $\phi(\cdot)$ at $x$, $\Phi_n \equiv \Phi(X_n)$, $L(x, \lambda) = f(x) + \lambda'\phi(x)$ for a vector $\lambda$ with real components (where $'$ denotes transpose, and the norm is Euclidean), and $W(x, \lambda) = L(x, \lambda) + (k/2)P(x)$.

The following assumptions will be used. We require $s \le r$.

(A1) Let $\{a_n\}$ be a sequence of positive real numbers with $\sum_n a_n = \infty$.

(A2) $\sum_n a_n^2 < \infty$.

(A3) $\Phi'(x)\phi(x) = 0$ implies that $\Phi(x)$ is of full rank (hence also that $\phi(x) = 0$).

For each $h \in R^r$, define $\pi(x)h$ to be the projection of $h$ on the orthogonal complement of the subspace of $R^r$ determined by the rows of $\Phi(x)$. If $\Phi(x)$ is of full rank, then

(2.1) $\pi(x)h = [I - \Phi'(x)(\Phi(x)\Phi'(x))^{-1}\Phi(x)]h$.

In any case, (2.1) holds if the inverse is interpreted as the pseudoinverse, which we will do.
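As a concrete illustration of (2.1), here is a minimal NumPy sketch (ours, not from the paper) of the projection $\pi(x)h$; the pseudoinverse is used throughout, so the rank-deficient case is covered as noted above. The one-constraint Jacobian `Phi` standing in for $\Phi(x)$ is a hypothetical example.

```python
import numpy as np

def project_orthogonal(Phi: np.ndarray, h: np.ndarray) -> np.ndarray:
    """pi(x)h of (2.1): project h onto the orthogonal complement of the
    row space of Phi = Phi(x).  np.linalg.pinv reads (Phi Phi')^{-1} as a
    pseudoinverse, which also covers rank-deficient Phi."""
    PPt_pinv = np.linalg.pinv(Phi @ Phi.T)       # (Phi Phi')^+, an s x s matrix
    return h - Phi.T @ (PPt_pinv @ (Phi @ h))    # [I - Phi'(Phi Phi')^+ Phi] h

# Example: one linear constraint phi(x) = x_1 + x_2, so Phi = [[1, 1]].
Phi = np.array([[1.0, 1.0]])
h = np.array([1.0, 0.0])
print(project_orthogonal(Phi, h))                # [0.5, -0.5], orthogonal to (1, 1)
```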

For each $\varepsilon > 0$, define the set $G_\varepsilon = \{x : |\pi(x)f_x(x)|^2 \le \varepsilon\}$, and write $G_0 = G$. $G \cap B$ is the set of feasible stationary points. It is closed and is the union of a collection of disjoint closed and connected sets $S_1, \cdots$, on each of which $f(x)$ is constant, say, taking value $f_i$ on $S_i$. Assumption (A4) is not a serious practical restriction.

(A4) There are only finitely many sets $S_1, \cdots$.

In the equality constrained case, we must show that the sequences $\{X_n\}$ generated by the algorithms converge to $G \cap B$.

ALGORITHM 1. Given the iterate $X_n$ (with components denoted by $X_n^i$, $i = 1, \cdots, r$), $X_{n+1}$ is given by the parameterized (by $\lambda_n$) form

(2.2) $X_{n+1} = X_n - a_n\Big[Y_n + \Phi_n'\lambda_n + \frac{k}{2}P_x(X_n)\Big] = X_n - a_n\Big[f_x(X_n) + \Phi_n'\lambda_n + \frac{k}{2}P_x(X_n) + \psi_n\Big]$,

where $P_x(x) = 2\Phi'(x)\phi(x)$.

If $x$ is a constrained minimum, then we know from calculus that there are a vector $\lambda = (\lambda_1, \cdots, \lambda_s)$ and a scalar $\lambda_0$ (not both zero) so that $\lambda_0 f_x(x) + \sum_i \lambda_i\phi_{i,x}(x) = \lambda_0 f_x(x) + \Phi'(x)\lambda = 0$. By (A3), the $\{\phi_{i,x}(x)\}$ are linearly independent, and we can take $\lambda_0 \ne 0$. This suggests that we choose $\lambda_n$ so that the norm of the estimated gradient of the Lagrangian is minimized. Namely, we let $\lambda_n$ minimize $|L_x(X_n, \lambda_n) + \psi_n|^2$. Equivalently, we let (following the idea in [4]) $\lambda_n$ satisfy the orthogonality relationship

(2.3) $\Phi_n(f_x(X_n) + \psi_n + \Phi_n'\lambda_n) = 0$.

If $\Phi_n$ is of full rank, then $\Phi_n\Phi_n'$ is invertible and $\lambda_n$ is unique and is given by¹

(2.4) $\lambda_n = -[\Phi_n\Phi_n']^{-1}\Phi_nf_x(X_n) - [\Phi_n\Phi_n']^{-1}\Phi_n\psi_n \equiv \bar\lambda_n + \tilde\psi_n = E_{\mathscr{B}_n}\lambda_n + (\lambda_n - E_{\mathscr{B}_n}\lambda_n)$.

¹ It may be computationally preferable to use the pseudoinverse of $\Phi'(x)$ for $(\Phi(x)\Phi'(x))^{-1}\Phi(x)$.
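To fix ideas, here is a minimal runnable sketch of Algorithm 1 on a toy problem. Everything outside (2.2)-(2.4) is our own illustrative choice (the quadratic objective, the Gaussian noise model, and the step sizes $a_n = 1/n$, which satisfy (A1)-(A2)); the paper itself prescribes only the iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (our choice): minimize f(x) = |x|^2 subject to
# phi(x) = x_1 + x_2 - 1 = 0; the constrained minimum is (0.5, 0.5).
def grad_obs(x):
    """One noisy gradient draw Y_n = f_x(x) + psi_n from Hbar(.|x)."""
    return 2.0 * x + rng.normal(scale=0.5, size=2)

def phi(x):
    return np.array([x[0] + x[1] - 1.0])

Phi = np.array([[1.0, 1.0]])      # Jacobian of phi (constant here)
k = 1.0                           # penalty weight; any k > 0 works (Theorem 2.1)
x = np.zeros(2)                   # X_0

for n in range(1, 20001):
    a_n = 1.0 / n                 # satisfies (A1), (A2)
    Y = grad_obs(x)
    # Multiplier (2.4): lambda_n = -(Phi Phi')^+ Phi (f_x(X_n) + psi_n)
    lam = -np.linalg.pinv(Phi @ Phi.T) @ (Phi @ Y)
    P_x = 2.0 * Phi.T @ phi(x)    # gradient of P(x) = |phi(x)|^2
    x = x - a_n * (Y + Phi.T @ lam + 0.5 * k * P_x)   # iteration (2.2)

print(x)                          # should end near (0.5, 0.5)
```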

If $x = X_n$, we write $\pi_nh$ in (2.1). If $\Phi_n$ or $\Phi(x)$ are not of full rank, then (2.3) still has a solution, although not necessarily a unique one, and we may suppose that it is (2.4) with the pseudoinverse of $[\Phi_n\Phi_n']$ replacing the inverse. We will use the forms (2.1), (2.4) with the inverse indicating either the inverse or pseudoinverse. Define the function $\bar\lambda(\cdot)$ by $\bar\lambda(x) = -[\Phi(x)\Phi'(x)]^{-1}\Phi(x)f_x(x)$, where the pseudoinverse is used. By (A3), $(\Phi(x)\Phi'(x))$ is invertible at all feasible $x$.

From (2.1), (2.2), (2.4),

(2.5) $X_{n+1} = X_n - a_n\Big[\pi_nf_x(X_n) + \pi_n\psi_n + \frac{k}{2}P_x(X_n)\Big]$.

In all proofs $K$ denotes a positive real number. Its value may change from usage to usage.

THEOREM 2.1. Assume the conditions in the introduction and also (A1) to (A4). Then there is a null set $N$ so that if $\omega \notin N$ and $\sup_n |X_n(\omega)| < \infty$ and $x$ is any limit point of $\{X_n(\omega)\}$, then $\phi(x) = 0$ and there is a vector (perhaps depending on the limit $x$) $\bar\psi = (\bar\psi_1, \cdots, \bar\psi_s)$ for which

$f_x(x) + \Phi'(x)\bar\psi = 0$.

(An equivalent statement is that if $\omega \notin N$ and $\sup_n |X_n(\omega)| < \infty$, then all limit points are in $G \cap B$.)

Remark. We note that convergence takes place for all values of $k > 0$, unlike in the deterministic case [4]-[7], where $k$ must be greater than some minimum value.

Remark. It follows from the arguments of Part (i) of the proof of Theorem 2.1 that if $P(x) \to \infty$ as $|x| \to \infty$ and if either $a_n|\pi_nf_x(X_n)|^2 \to 0$ w.p.1 as $n \to \infty$, or $|\pi(x)f_x(x)| \le K|\Phi'(x)\phi(x)|$ for large $x$ and some real $K$, then $P(X_n) \to 0$ w.p.1, and $\sup_n |X_n(\omega)| < \infty$ w.p.1 is implied.

Proof. Until further notice, we suppose that there is some real $M$ for which $|X_n| \le M$ w.p.1, all $n$, and that the generic variable $x$ satisfies $|x| \le M$.

Part (i). By a straightforward Taylor series expansion and the use of (2.3) (which certainly holds if $P_x'(X_n)$ replaces $\psi_n$ there) we get

$P(X_{n+1}) - P(X_n) \le -a_nP_x'(X_n)\Big[f_x(X_n) + \Phi_n'\lambda_n + \psi_n + \frac{k}{2}P_x(X_n)\Big] + a_n^2K\Big|f_x(X_n) + \Phi_n'\lambda_n + \psi_n + \frac{k}{2}P_x(X_n)\Big|^2$
$\le -a_nk|\Phi_n'\phi(X_n)|^2 + a_n^2K[|\pi_nf_x(X_n) + \pi_n\psi_n|^2 + |\Phi_n'\phi(X_n)|^2]$,

which yields

(2.6) $P(X_{n+1}) - P(X_n) \le -a_nk|\Phi_n'\phi(X_n)|^2 + a_n^2K[|\pi_nf_x(X_n)|^2 + |\pi_n\psi_n|^2 + |\Phi_n'\phi(X_n)|^2]$.

Define the set $\tilde B_\varepsilon = \{x : |\Phi'(x)\phi(x)|^2 \le \varepsilon\}$. By (A3), $\tilde B_\varepsilon \to B$ as $\varepsilon \to 0$. For each $\varepsilon > 0$, there is a $\delta_\varepsilon > 0$ so that $\delta_\varepsilon \to 0$ as $\varepsilon \to 0$ and $\tilde B_\varepsilon \subset B_{\delta_\varepsilon} \equiv \{x : |P(x)| \le \delta_\varepsilon\}$ (by (A3)). By (A2),

$\lim_{N\to\infty} \sum_{n=N}^\infty a_n^2|\pi_n\psi_n|^2 = 0$

w.p.1. For large $n$, the other second order (in $a_n$) terms in (2.6) are at most half of the absolute value of the first order term, for $X_n$ in $R^r - \tilde B_\varepsilon$. The last two sentences (the convergence and the dominance) and the divergence of $\sum_n a_n$ imply that $X_n \in \tilde B_\varepsilon \subset B_{\delta_\varepsilon}$ infinitely often w.p.1, for any $\varepsilon > 0$. Fix $\varepsilon$. The same sentences and (2.6) imply that $\{X_n\}$ can go from $B_{\delta_\varepsilon}$ to $R^r - B_{3\delta_\varepsilon}$ only finitely often (w.p.1) without entering $B_{2\delta_\varepsilon} - B_{\delta_\varepsilon}$ first, and they also imply that the sequence $\{X_n\}$ can go from a point in $B_{2\delta_\varepsilon}$ to $R^r - B_{3\delta_\varepsilon}$ at most finitely often w.p.1. Since $\varepsilon > 0$ is arbitrary, $\Phi_n'\phi(X_n) \to 0$ and hence, by (A3), $\phi(X_n) \to 0$ (w.p.1).

Part (ii). Now we turn to evaluating the limits of $f(X_n)$. Similarly to (2.6), we have

(2.7)
$f(X_{n+1}) - f(X_n) \le -a_nf_x'(X_n)\Big[f_x(X_n) + \Phi_n'\lambda_n + \frac{k}{2}P_x(X_n) + \psi_n\Big] + a_n^2K[|\pi_nf_x(X_n)|^2 + |\Phi_n'\phi(X_n)|^2 + |\pi_n\psi_n|^2]$
$\le -a_nf_x'(X_n)\pi_nf_x(X_n) - a_nf_x'(X_n)\pi_n\psi_n - \frac{k}{2}a_nf_x'(X_n)P_x(X_n) + a_n^2K[|\pi_nf_x(X_n)|^2 + |\Phi_n'\phi(X_n)|^2 + |\pi_n\psi_n|^2]$.

We have $f_x'(X_n)\pi_nf_x(X_n) = |\pi_nf_x(X_n)|^2$ (by the definition of $\pi(x)$ and projection). Note that by (A2) and the bound $M$, $\sum_n a_nf_x'(X_n)\pi_n\psi_n$ is a square summable convergent martingale. Also $P_x(X_n) \to 0$ w.p.1 by Part (i). Using these two facts together with the divergence of $\sum_n a_n$ and (2.7), and an argument like that in Part (i) (to show $X_n \in \tilde B_\varepsilon$ or $B_{\delta_\varepsilon}$ infinitely often w.p.1), we can show that $X_n \in G_\varepsilon$ infinitely often w.p.1 for each $\varepsilon > 0$.

Since the² $S_i$ are disjoint and closed, and since $f(x) = f_i$ on $S_i$, for each small $\delta > 0$ there is an $\varepsilon > 0$ so that we can write $G_\varepsilon \cap B_\varepsilon = \bigcup_i S_i^\varepsilon$, where the $\{S_i^\varepsilon\}$ are closed, connected and disjoint and $S_i^\varepsilon \supset S_i$, and the maximum variation of $f(x)$ on each $S_i^\varepsilon$ is less than $\delta$. We can (and will) also suppose that if $f_i \ne f_j$, then $|f_i - f_j| \ge 3\delta$. Let $f_i > f_j$. So for $\{X_n\}$ to go from $S_j^\varepsilon$ to $S_i^\varepsilon$, the sequence $\{f(X_n)\}$ must increase by at least $\delta$ while outside of $S_i^\varepsilon \cup S_j^\varepsilon$. Now, the inequality (2.7) and the convergence of $\sum_n a_nf_x'(X_n)\pi_n\psi_n$ and $\sum_n a_n^2|\pi_n\psi_n|^2$, and the asymptotic dominance of the other second order terms (in (2.7)) by the first order term for $X_n$ outside of $G_\varepsilon$, and $P(X_n) \to 0$ w.p.1, together imply that $\{X_n\}$ can make only finitely many excursions from $S_j^\varepsilon$ to $S_i^\varepsilon$ (w.p.1). Indeed, by the same reasoning $\{X_n\}$ can make only finitely many excursions from $S_j^\varepsilon$ into any set $A$ where $\inf_{x\in A} f(x) \ge f_j + 2\delta$. Since $\delta$ is arbitrarily small, $\{f(X_n)\}$ converges w.p.1.

² See the paragraph above (A4) for the definition.

Let $A_d(y)$ denote a ball in $R^r$ with center $y$ and diameter $d$, and suppose that for a real $\varepsilon_1 > 0$,

$\inf_{x\in A_{2d}(y)} |\pi(x)f_x(x)| \ge \varepsilon_1$.

Define $m_1 = \min\{n : X_n \in A_d(y)\}$, $m_1' = \min\{n : X_n \notin A_{2d}(y),\ n > m_1\}$ and,³ in general, $m_i = \min\{n : X_n \in A_d(y),\ n > m_{i-1}'\}$, $m_i' = \min\{n : X_n \notin A_{2d}(y),\ n > m_i\}$. Summing (2.7) and using the convergence and dominance cited in the last paragraph yields, w.p.1,

$\lim_i \sum_{n=m_i}^{m_i'-1} [f(X_{n+1}) - f(X_n)] \le -\lim_i \varepsilon_1K\sum_{n=m_i}^{m_i'-1} a_n$.

But by summing $X_{n+1} - X_n$ from (2.5) over $[m_i, m_i' - 1]$ and using the summability of $\sum_n a_n\pi_n\psi_n$, and the convergence of $P_x(X_n)$ to 0, and the fact that the distance traveled over those iterates is at least $d$, we get

$\lim_i \sum_{n=m_i}^{m_i'-1} a_n \ge Kd$,

which contradicts the convergence of $\{f(X_n)\}$, unless $m_i < \infty$ only finitely often w.p.1. Since $y$ and $d$ are arbitrary, we conclude that $\{X_n\}$ must eventually stay in $G_\varepsilon \cap B_\varepsilon$ w.p.1 for any $\varepsilon > 0$, and hence that $X_n \to G \cap B$ w.p.1, if $\sup_n |X_n| \le M$ w.p.1.

The proof without the bound $M$, but with $\sup_n |X_n| < \infty$ w.p.1, proceeds similarly. We repeat the above proof, but stop the iteration $\{X_n\}$ at the first instant that $|X_n| > M$. Then we conclude that $X_n \to G \cap B$ with a probability $\ge P\{\sup_n |X_n| \le M\}$. Since $M$ is arbitrary, the theorem holds as stated. Q.E.D.

Remark. If we replaced (A3) by "$\Phi'(x)\phi(x) = 0$ implies $\phi(x) = 0$", then the theorem would read: there are a number $\bar\psi_0$ and a vector $\bar\psi$, $(\bar\psi_0, \bar\psi) \ne 0$, such that $\bar\psi_0f_x(x) + \Phi'(x)\bar\psi = 0$ at almost all limit points.

Finite differences. Let $e_i$ denote the unit vector in the $i$th coordinate direction, and $\{c_n\}$ a sequence of positive real numbers which converges to zero. Let $X_n$ be given, and let $Y(X_n \pm c_ne_i)$ denote a random draw from $H(\cdot\,|X_n \pm c_ne_i)$. Define (the finite difference version of (2.2))

(2.8) $X_{n+1}^i = X_n^i - a_n\Big[\frac{Y(X_n + c_ne_i) - Y(X_n - c_ne_i)}{2c_n} + (\Phi_n'\lambda_n)^i + \frac{k}{2}P_x^i(X_n)\Big], \qquad i = 1, \cdots, r.$

Let $\lambda_n$ be determined by (2.3), but with the finite difference estimate replacing $f_x(X_n) + \psi_n$ there.
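The following is a small sketch (our construction) of the two-sided difference estimate that replaces the direct gradient draw in (2.8); the function-value oracle `f_obs` is a hypothetical stand-in for a draw from $H(\cdot\,|x)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_obs(x):
    """Noisy function-value draw from H(.|x); here f(x) = |x|^2 plus
    Gaussian noise (our toy model)."""
    return float(x @ x) + rng.normal(scale=0.5)

def fd_gradient(x, c_n):
    """Two-sided Kiefer-Wolfowitz estimate of f_x(x), as used in (2.8)."""
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = c_n
        g[i] = (f_obs(x + e) - f_obs(x - e)) / (2.0 * c_n)
    return g
```

With, e.g., $a_n = 1/n$ and $c_n = n^{-1/4}$ we get $\sum_n (a_n/c_n)^2 < \infty$, so the modified (A2) of Theorem 2.2 below holds; `fd_gradient` then simply replaces `grad_obs` in the Algorithm 1 sketch.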

THEOREM 2.2. Assume the conditions of Theorem 2.1, but with $a_n/c_n$ replacing $a_n$ in (A2). Let $E_{\mathscr{B}_n}Y(X_n \pm c_ne_i) = f(X_n \pm c_ne_i)$ w.p.1. Then the conclusion of Theorem 2.1 holds.

The proof is almost exactly the same as that of Theorem 2.1, except that $f(X_n \pm c_ne_i)$ must be expanded, and the "noise" in the iteration (2.8) is proportional to $a_n/c_n$, rather than to $a_n$. Note that we do not require $\sum_n a_nc_n < \infty$ as is common in stochastic approximation (the $a_nc_n$ terms in the expansion are ultimately dominated by the $a_n$ term outside of any $\tilde B_\varepsilon$).

³ Undefined $m_i$ or $m_i'$ are set equal to $\infty$.


Observe that, in both Theorems 2.1 and 2.2, if $X_{n_j}(\omega) \to X(\omega)$, then the limit $\bar\lambda(X(\omega))$ can be used as the multiplier $\bar\psi$ at $X(\omega)$.

ALGORITHM 2. Again, for the sake of simplicity, we use $\bar H(\cdot\,|x)$. In this algorithm (2.2) is still used, but $\lambda_n$ is selected to assure that the "first order" change in $\phi(X_{n+1}) - \phi(X_n)$ is proportional to $-\phi(X_n)$. In particular, for some positive number $k_1$, we select $\lambda_n$ to enforce the relationship (following the idea in [4])

$\Phi_n(X_{n+1} - X_n) = -a_nk_1\phi(X_n)$,

or, equivalently, the relationships (2.9) or (2.10):

(2.9) $\Phi_n\Big[f_x(X_n) + \psi_n + \Phi_n'\lambda_n + \frac{k}{2}P_x(X_n)\Big] = k_1\phi(X_n)$,

(2.10) $\Phi_n[f_x(X_n) + \psi_n + \Phi_n'\lambda_n] = [k_1I - k\Phi_n\Phi_n']\phi(X_n)$.

There is not necessarily a solution to (2.10) unless we replace (A3) by (A3').

(A3') $\Phi(x)$ is of full rank for each $x \in R^r$.

Assuming (A3') and solving (2.10) for $\lambda_n$ yields

(2.11) $\lambda_n = [\Phi_n\Phi_n']^{-1}\{[k_1I - k\Phi_n\Phi_n']\phi(X_n) - \Phi_n(f_x(X_n) + \psi_n)\}$.
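Since (2.11) is just the solution of the linear system (2.10), a sketch is short; the names follow the Algorithm 1 sketch above, and the literal matrix inverse relies on the full-rank assumption (A3').

```python
import numpy as np

def multiplier_alg2(Phi, Y, phi_x, k, k1):
    """Solve (2.10) for lambda_n as in (2.11).

    Phi = Phi(X_n) (full rank by (A3')), Y = f_x(X_n) + psi_n is the noisy
    gradient draw, phi_x = phi(X_n)."""
    PPt = Phi @ Phi.T
    rhs = (k1 * np.eye(PPt.shape[0]) - k * PPt) @ phi_x - Phi @ Y
    return np.linalg.solve(PPt, rhs)
```

The result is then used in the unchanged iteration (2.2).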

THEOREM 2.3. Under the assumptions of Theorem 2.1, except with (A3') replacing (A3), the conclusion of Theorem 2.1 holds for Algorithm 2.

Proof. The proof is very close to that of Theorem 2.1 and will only be sketched. We first suppose here, as there, that $|X_n| \le M$ w.p.1, for all $n$, and then let $M \to \infty$ as in that proof. The first inequality of Part (i) of the proof of Theorem 2.1 still holds. The replacement of $\lambda_n$ in that inequality by its value in (2.11), noting (2.10), yields

(2.12) $P(X_{n+1}) - P(X_n) \le -a_nk_1|\phi(X_n)|^2 + a_n^2K[|\pi_nf_x(X_n)|^2 + |\pi_n\psi_n|^2 + |\phi(X_n)|^2]$,

and we can conclude, as in the proof of Theorem 2.1, that $X_n \to B$ w.p.1. Similarly, we can get

(2.13)

f(X,+ 1) -f(X,) _< a.fj(X,) (X,) + , + 0’.2, + -P,,(X.)2K[lrc,L(X,)12 z

<= a,f’(X,)[rc,L(X,) + rc., ++ aZ, K[Irc.L(X,)l 2 + Irt,,l z +D

An argument similar to that in Part (ii) of the proof of Theorem 2.1 yields that $X_n \to G \cap B$ as $n \to \infty$ w.p.1.

There is an obvious finite difference analogue to Algorithm 2, but we omit the details.

3. Inequality constraints. Two different types of algorithms for the inequality constrained problem will be discussed, one here and one in §4. Generally, "small step" methods require some sort of nonsingularity or linear independence assumption on the set of gradients of the constraint functions to prevent the iterates from getting "hung up" at some nonfeasible point. The problem exists in the deterministic case as well. See, for example, Polak [8, pp. 142-143]. Assumption (A3'') below and (A3''') in §4 are two different types of such an assumption. In this section, we constrain the Lagrange multipliers corresponding to the inequality constraints to be nonnegative. In §4, the signs are not constrained, but the algorithm is more involved and additional conditions are needed to assure that those multipliers are nonnegative at the limit points. In this section, the requirement that those multipliers be nonnegative forces us to use a stronger condition on the noise (A6). These problems and conditions seem to be rather natural for the stochastic algorithms.

For notational simplicity, we draw the observations from $\bar H(\cdot\,|x)$ rather than from $H(\cdot\,|x)$, but there is an obvious finite difference analogue. Also, we treat the pure inequality case. Define $\bar q_i(x) = \max[0, q_i(x)]$ and $P(x) = \sum_i \bar q_i^2(x)$. Then $P_x(x) = 2Q'(x)\bar q(x)$, where $Q(x)$ is the Jacobian of $q(x)$. We let the components of $q(x)$ or $\bar q(x)$ range over the $q_i(x)$ or $\bar q_i(x)$ for which $q_i(x) \ge 0$. Define $Q_n \equiv Q(X_n)$.

ALGORITHM 3. We use the same iteration as in Algorithm 1, namely,

(3.1) $X_{n+1} = X_n - a_n\Big[f_x(X_n) + \psi_n + Q_n'\lambda_n + \frac{k}{2}P_x(X_n)\Big]$,

where $\lambda_n$ is a $\lambda$ that minimizes in

(3.2) $\min_{\text{all } \lambda^i \ge 0} |f_x(X_n) + \psi_n + Q_n'\lambda|^2 \equiv \min_{\text{all } \lambda^i \ge 0} |L_x(X_n, \lambda) + \psi_n|^2$.

By the Kuhn-Tucker theorem, there is a vector $c_n$ with nonnegative components $c_n^i$, $i = 1, \cdots, t$, so that $\lambda_n$ satisfies (note that the gradient of the constraint $-\lambda^i \le 0$ with respect to $\lambda$ is the unit vector with a 1 in the $i$th position)

(3.3) $Q_n(f_x(X_n) + \psi_n + Q_n'\lambda_n) - c_n = 0$,

where $c_n^i = 0$ if $\lambda_n^i > 0$.

For each $h \in R^r$, define $\pi^+(x)h \equiv h + Q'(x)\lambda$, where $\lambda$ minimizes in $\min_{\text{all }\lambda^i \ge 0} |h + Q'(x)\lambda|$. Note that $\pi^+(x)h$ is defined analogously to $\pi(x)h$ in §2, but that it is the "error" in the projection of $h$ onto $K(x)$, the cone generated by the nonnegative linear combinations of the row vectors of $Q(x)$. Let $\tilde\pi_n^+h$ be defined as $\pi_n^+h$, but where we use only the rows of the matrix $\tilde Q_n$, which is obtained from $Q_n$ by deleting all rows of $Q_n$ for which $\lambda_n^i = 0$. (The indices of the deleted rows are random variables, $\psi_n$ dependent.)
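The minimization in (3.2) is a nonnegative least-squares problem: with design matrix $Q'(x)$ (columns are the constraint gradients) and target $-h$, it is exactly what `scipy.optimize.nnls` solves. Here is a small sketch (ours) computing both the multiplier and $\pi^+(x)h$; the toy constraints are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def pi_plus(Q, h):
    """Return (pi+(x)h, lam), where lam >= 0 minimizes |h + Q'lam| as in
    (3.2), and pi+(x)h = h + Q'lam.

    nnls solves min |A lam - b| over lam >= 0; take A = Q', b = -h."""
    lam, _ = nnls(Q.T, -h)
    return h + Q.T @ lam, lam

# Toy data (ours): q_1(x) = -x_1 <= 0, q_2(x) = -x_2 <= 0, so the rows of Q
# are the constraint gradients; Y stands for a draw f_x(X_n) + psi_n.
Q = np.array([[-1.0, 0.0],
              [0.0, -1.0]])
Y = np.array([-0.3, 1.2])
residual, lam = pi_plus(Q, Y)
print(lam)          # nonnegative multiplier lambda_n: here [0.0, 1.2]
```

In the full iteration (3.1), the penalty gradient $(k/2)P_x(X_n) = kQ'(X_n)\bar q(X_n)$ is added to this drift term.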

Define the set $F_\varepsilon = \{x : |\pi^+(x)f_x(x)|^2 \le \varepsilon\}$ and $F_0 = F$. Then $F \cap C$ is the set of feasible points satisfying the Kuhn-Tucker condition $f_x(x) + \sum_{i\,\text{active}} \lambda^iq_{i,x}(x) = 0$ for some $\lambda$ with all $\lambda^i \ge 0$, and we must show that $X_n \to F \cap C$ w.p.1. $F \cap C$ is closed and is the union of closed, connected and disjoint sets $U_1, \cdots$, on each of which $f(\cdot)$ is constant taking, say, the value $f_i$ on $U_i$.

We need the following assumptions.

(A3'') $Q'(x)\bar q(x) = 0$ implies that $\bar q(x) = 0$, and at any $x \in \partial C$ (the boundary of $C$) the gradients of the active constraints are linearly independent.

(A5) There are only finitely many sets $U_1, \cdots$.

(A6) There is a real $k_2 > 0$ (independent of $n$, $\omega$) so that

$E_{\mathscr{B}_n}f_x'(X_n)\tilde\pi_n^+(f_x(X_n) + \psi_n) \ge k_2f_x'(X_n)\pi_n^+f_x(X_n)$.

Assumption (A6) would seem to be difficult to verify explicitly in general, yet it holds in most of the specific special cases which we have checked graphically (by selecting simple noise distributions and constructing the projections), and we expect that it holds in a large enough number of cases for the algorithm to be useful. Some such condition appeared in all the variants of the algorithm in which $\lambda_n^i \ge 0$ was required.

THEOREM 3.1. Assume the conditions in the introduction and also (A1), (A2), (A3''), (A5) and (A6). Then there is a null set $N$ so that if $\omega \notin N$ and $\sup_n |X_n(\omega)| < \infty$ and $x$ is a limit point of $\{X_n(\omega)\}$, then $\bar q(x) = 0$ and there is a vector $\bar\psi$, $\bar\psi^i \ge 0$, with $\bar\psi^i = 0$ if $q_i(x) < 0$, for which

(3.4) $f_x(x) + Q'(x)\bar\psi = 0$.

Remark. A condition similar to that in the remark after the statement of Theorem 2.1 implies that $\sup_n |X_n(\omega)| < \infty$ w.p.1.

Proof. As in the proof of Theorem 2.1, we can and will suppose that $|X_n(\omega)| \le$ some $M < \infty$, and that the generic variable $x$ satisfies $|x| \le M$. The proof is very close to that of Theorem 2.1 and will only be outlined.

Part (i). Note that

(3.5) $|Q_n'\lambda_n|^2 \le K|f_x(X_n) + \psi_n|^2$,

and that (3.2) and (3.3), with $c_n^i \ge 0$ ($c_n^i = 0$ if $\lambda_n^i > 0$) and $\bar q(X_n) \ge 0$, imply that

$\bar q'(X_n)Q_n\Big[f_x(X_n) + \psi_n + Q_n'\lambda_n + \frac{k}{2}P_x(X_n)\Big] \ge k|Q_n'\bar q(X_n)|^2$.

Substituting these estimates in the first inequality of the proof of Theorem 2.1 yields

(3.6) $P(X_{n+1}) - P(X_n) \le -a_nk|Q_n'\bar q(X_n)|^2 + a_n^2K[|f_x(X_n)|^2 + |\psi_n|^2 + |P_x(X_n)|^2]$.

By (A3''), for each $\varepsilon > 0$ there is a $\delta_\varepsilon > 0$ so that $|Q'(x)\bar q(x)|^2 \le \delta_\varepsilon$ implies that $x \in N_\varepsilon(C)$, an $\varepsilon$-neighborhood of $C$. Thus, using the fact that the $a_n^2$ terms are summable and arguing as in Part (i) of the proof of Theorem 2.1, we can conclude that the $X_n$ must ultimately be in $N_\varepsilon(C)$, for each $\varepsilon > 0$.

Part (ii). A truncated Taylor series expansion yields

(3.7) $f(X_{n+1}) - f(X_n) \le -a_nf_x'(X_n)\Big[f_x(X_n) + \psi_n + Q_n'\lambda_n + \frac{k}{2}P_x(X_n)\Big] + a_n^2K[|f_x(X_n)|^2 + |\psi_n|^2 + |P_x(X_n)|^2]$.

Using $Q_n'\lambda_n = \tilde Q_n'\tilde\lambda_n$, we find that

(3.8) $f_x'(X_n)[f_x(X_n) + \psi_n + Q_n'\lambda_n] = f_x'(X_n)\tilde\pi_n^+(f_x(X_n) + \psi_n) \equiv E_{\mathscr{B}_n}f_x'(X_n)\tilde\pi_n^+(f_x(X_n) + \psi_n) + \rho_n$,

where $\{\rho_n\}$ is a sequence of orthogonal random variables and $\sum_n a_n\rho_n$ is a square summable convergent martingale. Substituting (3.8) and (3.5) into (3.7) and using (A6) yields

$f(X_{n+1}) - f(X_n) \le -a_nk_2f_x'(X_n)\pi_n^+f_x(X_n) - a_n\rho_n - \frac{k}{2}a_nf_x'(X_n)P_x(X_n) + a_n^2K[|f_x(X_n)|^2 + |\psi_n|^2 + |P_x(X_n)|^2]$.

The $a_n\rho_n$ and $a_n^2$ terms are summable, and $P_x(X_n) \to 0$ as $n \to \infty$ by Part (i). Since the $U_i$ are disjoint and closed, and since $f(x) = f_i$ on $U_i$, for each sufficiently small $\delta > 0$ there is an $\varepsilon > 0$ so that we can write $C_\varepsilon \cap F_\varepsilon = \bigcup_i U_i^\varepsilon$ (where $C_\varepsilon = N_\varepsilon(C)$), the $U_i^\varepsilon$ being closed, connected, disjoint, with $U_i^\varepsilon \supset U_i$, and the maximum variation of $f(x)$ on each $U_i^\varepsilon$ less than $\delta$. Now we complete the proof exactly as we completed the proof in Part (ii) of Theorem 2.1; that is, substitute $C$, $F$, $C_\varepsilon$, $F_\varepsilon$, $U_i$, $U_i^\varepsilon$ for $B$, $G$, $B_\varepsilon$, $G_\varepsilon$, $S_i$, $S_i^\varepsilon$ there. Q.E.D.

4. Inequality constraints: Algorithm 4. We now consider another algorithm for minimizing $f(x)$, $x \in C \cap B$. The inequalities are handled by converting them to equalities, via the common technique of adding a slack variable. Let $z = (z^1, \cdots, z^t)$ denote a vector of nonnegative real variables and define $\phi_i(x, z) = q_i(x) + z^i$, $i = 1, \cdots, t$, with $\phi_{t+1}(\cdot), \cdots, \phi_{t+s}(\cdot)$ denoting the original $s$ equality constraints. For $i > t$, we may write either $\phi_i(x)$ or $\phi_i(x, z)$. The function $\Phi(\cdot)$, the Jacobian of $\phi(x, z)$ with respect to $x$ (and $\Phi_n \equiv \Phi(X_n)$), is defined exactly as in §2. Note that $\Phi(\cdot)$ is a $(t+s) \times r$ matrix. For notational simplicity, we draw the random variables from $\bar H(\cdot\,|x)$, rather than from $H(\cdot\,|x)$, although, here too, there is an obvious finite difference analogue. Define $P(x, z) = \sum_i \phi_i^2(x, z)$, and let $\{v_n\}$, $\{w_n\}$ be sequences of positive real numbers tending to zero.

ALGORITHM 4. We iterate on both variables $x$ and $z$. Assume $X_0$, $Z_0$ are given. The iterates $\{X_n\}$ and, in certain cases, the iterates $\{Z_n\}$, are computed exactly as $\{X_n\}$ would be in Algorithm 1. $\{\lambda_n\}$ will be defined below. Define (the observation vector is $Y_n \equiv f_x(X_n) + \psi_n$, as in §2)

(4.1) $X_{n+1} = X_n - a_n\Big[f_x(X_n) + \psi_n + \Phi_n'\lambda_n + \frac{k}{2}P_x(X_n, Z_n)\Big]$.

If $Z_n^i > v_n$, define ($\lambda_n^i$ is the $i$th component of $\lambda_n$)

(4.2a) $Z_{n+1}^i = \max[0,\ Z_n^i - a_n(\lambda_n^i + k\phi_i(X_n, Z_n))]$,

and define $\bar a_n^i$ by $Z_{n+1}^i = Z_n^i - \bar a_n^i(\lambda_n^i + k\phi_i(X_n, Z_n))$. If $Z_n^i \le v_n$, use (4.2b, c, d):

(4.2b) $Z_{n+1}^i = \max[0,\ Z_n^i - a_n(\lambda_n^i + k\phi_i(X_n, Z_n))]$ if $\phi_i(X_n, Z_n) \le 0$ and $\lambda_n^i + k\phi_i(X_n, Z_n) \le 0$,

(4.2c) $Z_{n+1}^i = Z_n^i$ if $\phi_i(X_n, Z_n) \le 0$ and $\lambda_n^i + k\phi_i(X_n, Z_n) > 0$,

(4.2d) $Z_{n+1}^i = Z_n^i + a_nw_n$ if $\phi_i(X_n, Z_n) > 0$.

Define the $(t+s) \times t$ matrices $I(z, v)$ (resp., $J$) as follows. All entries are zero except that the $(i, i)$th entry ($i \le t$) takes the value 1 if $z^i > v$ (resp., always takes the value 1). Denote $I_n \equiv I(Z_n, v_n)$ and $\bar\Phi(x, z, v) = [\Phi(x), I(z, v)]$. $\bar\Phi(x, z, v)$ (a $(t+s) \times (r+t)$ matrix) is the Jacobian of $\phi(x, z)$, where we use $\partial\phi_i(x, z)/\partial z^i = 1$ only if $z^i > v$ (i.e., we exclude the $\partial\phi_i(x, z)/\partial z^i$ term for the "$v$ active constraints": $q_i$ is said to be a $v$ active constraint at $x$ if, loosely speaking, $z^i \le v$).

Define $\bar f_x(x) = \begin{bmatrix} f_x(x) \\ 0 \end{bmatrix}$ and $\bar\psi_n = \begin{bmatrix} \psi_n \\ 0 \end{bmatrix}$, where the zeros are $t$-dimensional vectors. The multiplier is chosen by forcing it to satisfy an orthogonality relationship like (2.3). In particular, we let $\lambda_n$ be a $\lambda$ minimizing the norm of the estimated gradient of a particular Lagrangian, namely, a minimizer in

(4.3) $|\bar f_x(X_n) + \bar\psi_n + \bar\Phi_n'\lambda|^2 = |f_x(X_n) + \psi_n + \Phi_n'\lambda|^2 + \sum_{i: Z_n^i > v_n} (\lambda^i)^2$.

The minimization of (4.3) yields (2.3), (2.4) with $\bar\Phi_n$, $\bar f_x(X_n)$, $\bar\psi_n$ replacing $\Phi_n$, $f_x(X_n)$, $\psi_n$ there. The choice of (4.3) as the quantity to minimize is motivated by the fact that if $x$ is a constrained minimum of $f(x)$, then the Kuhn-Tucker condition is

$f_x(x) + \sum_{i\,\text{active}} \lambda^iq_{i,x}(x) + \sum_{i=t+1}^{t+s} \lambda^i\phi_{i,x}(x) = 0, \qquad \lambda^i \ge 0 \text{ for } i \le t$,

and so in (4.3) we seek to penalize the use of the "$v_n$ inactive" constraints.

and so in (4.3), we seek to penalize the use of the "v, inactive" constraints.For any vector/ define (x, z, v)/ as n(x)h was defined in 2, where (x, z, v),/

replace (x), h there, and write ,h ft(X,, Z,, v,)h. If 2, is to minimize (4.3),then it maast satisfy

(4.4) 0 .[f(X.) + . + .],and we choose

(4.5) 2. [.,]-’.f(X.) " +For each real e>0, define the sets C+,F+ and let C- =C+, F =F+:

when qi is used, then =< t)

C+ {(x,z)’zi>=O, all i,lb(x,z)l 2=<

F+ { (x, z)" minzt+s

i=t+l

Note that in $C^+$, $q_i(x) = -z^i$, so $x$ specifies $z$, and we can unambiguously speak of $x$ in $C^+$. The points satisfying the Kuhn-Tucker necessary condition are the points in $C^+ \cap F^+$ for which there are nonnegative minimizing $\lambda^i$, $i \le t$. For $x \in C^+ \cap F^+$, define $\bar\lambda(x)$ (with components $\bar\lambda^i(x)$) to be the multiplier attaining the minimum in the definition of $F^+$ (if it is not unique, use the one determined by the appropriate pseudoinverse; i.e., the one of minimum norm). Now $C^+ \cap F^+$ is closed and can be written as the union of a collection of disjoint closed and connected sets $T_1, \cdots$, on each of which $f(\cdot)$ is constant taking, say, the value $f_i$ on $T_i$. We need the following assumptions:

(A3''') The rows of $\Phi(x)$ are linearly independent for each $x$.

(A7) For a real number $\gamma > 0$, $\sup_n E|\psi_n|^{2+\gamma} < \infty$.

(A8) $\sum_n a_nw_n = \infty$.

(A9) For each $T_j$, if, for some $i \le t$, we have $\bar\lambda^i(x) < 0$ (resp., $> 0$) at a point $x \in T_j$, then $\bar\lambda^i(x) < 0$ (resp., $> 0$) at all $x \in T_j$.

(A10) There are a finite number of sets $T_j$.

(A11) Suppose that for some $i$, $j$, $\bar\lambda^i(x) < 0$ on some $T_j$, and let $\{x_n, z_n\}$ denote an arbitrary sequence in the $\delta$-neighborhood $N_\delta(T_j) \equiv \{y : |y - u| < \delta, \text{ some } u \in T_j\}$ which tends to $(\bar x, \bar z)$, where $\bar z = -q(\bar x)$. Suppose that there are positive real numbers $\delta$, $\delta_1$, $\varepsilon_1$ so that $\theta_n^i$, the $i$th element of the minimizing $\theta$ in

$\min_\theta \Big[|f_x(x_n) + \Phi'(x_n)\theta|^2 + \sum_{j: z_n^j > v} (\theta^j)^2\Big], \qquad v \le \varepsilon_1,$

is less than $-\delta_1$ for any such $\{x_n, z_n\}$ sequence.

Assumption (A9) is apparently not restrictive in applications. It basically eliminates the possibility that a $T_j$ contains both saddle-like points and other points which are neither saddles nor local maxima nor local minima. For an example of the type of situation excluded by (A9), consider $x = (x^1, x^2)$, $f(x) = x^1x^2$, $q_1(x) = x^2$. On the boundary determined by $q_1 = 0$ (denote it as $T_1$), $f(x) = 0$ and $f_x(x) = (0, x^1)$, $q_{1,x}(x) = (0, 1)$, and $\bar\lambda(x) = -x^1$. If we add smooth constraints which bound $|x^1|$ and $x^2$ from below, the algorithm can be shown to converge without (A9), for this type of problem. We strongly suspect that (A9) can be dispensed with, but cannot prove it, as yet.

Condition (A11) may seem a little strange. If $\bar\lambda^i < 0$, $x \in T_j$, we will require that $\lambda_n^i < 0$ for large $n$ and $X_n$ near $T_j$. But the "discontinuous" term $\sum_{i: Z_n^i > v_n} (\lambda^i)^2$ creates discontinuities in the $\bar\lambda_n$ as the $Z_n^i$ vary above and below $v_n$. If there is only one active constraint in $T_j$, the condition is no restriction, nor is it a restriction if the $q_{i,x}(\cdot)$ for all "nearly active" constraints are constant or nearly orthogonal (as, for example, if all $q_i$ were of the form $q_i(x) = \alpha_i'x + \beta_i$). Assumption (A11) is implied by (A9) if the signs of the $\bar\lambda^i(x)$ (when $>0$ or $<0$) are the signs of $-q_{i,x}'(x)f_x(x)$. The condition can be weakened by using $\phi_i(x, z) = q_i(x) + bz^i$ for small $b$ ($i \le t$) in lieu of $q_i(x) + z^i$. We then need to multiply the $(\lambda^i)^2$ term by $b$, the "ones" in $I_n$ and $J$ by $b$, and the $\lambda_n^i$ in (4.2a, b) by $b$. The smaller $b$ is, the less restrictive is the corresponding condition (A11). The condition is needed only to show that $\bar\psi^i \ge 0$, $i \le t$, in Theorem 4.1.

THEOREM 4.1. Assume (A1)-(A2), (A3'''), (A7)-(A11). Then there is a null set $N$ so that if $\omega \notin N$, and $\sup_n |X_n(\omega)| < \infty$ and $x$ is any limit of a convergent subsequence of $\{X_n(\omega)\}$, then $\phi_i(x) = 0$, $i > t$, $q_i(x) \le 0$, $i \le t$, and there is a vector (perhaps depending on $x$) $\bar\psi = (\bar\psi^1, \cdots, \bar\psi^{t+s})$, $\bar\psi^i \ge 0$, $i = 1, \cdots, t$, for which

(4.6) $f_x(x) + \sum_{i: q_i(x) = 0} \bar\psi^iq_{i,x}(x) + \sum_{i=t+1}^{t+s} \bar\psi^i\phi_{i,x}(x) = 0$.

(Equation (4.6) is the Kuhn-Tucker necessary condition for constrained optimality.)

Remark. A condition analogous to that mentioned after the statement of Theorem 2.1 yields that $\sup_n |X_n| < \infty$ w.p.1.

Proof. As in the proof of Theorem 2.1, we can and will suppose that $\sup_n |X_n| \le M < \infty$ w.p.1 for some real $M$. Also, for notational simplicity, we ignore the constraints $\phi_{t+1}(\cdot), \cdots, \phi_{t+s}(\cdot)$. The general proof is almost the same as the one given below.

Part (i). Letting $P_{x,z}(x, z)$ denote the gradient of $P(x, z)$ with respect to $(x, z)$, we get $P_{x,z}(x, z) = 2[\Phi(x), J]'\phi(x, z)$. Let $\hat I_n$ denote the diagonal matrix with entries $\hat I_n^{i,i} = 1$ if (4.2a) is used to calculate $Z_{n+1}^i$ and is untruncated, and $\hat I_n^{i,i} = 0$ otherwise. Using (4.1)-(4.2), a Taylor series expansion and majorization of some of the second order terms yields

(4.7)
$P(X_{n+1}, Z_{n+1}) - P(X_n, Z_n) \le -a_n\phi'(X_n, Z_n)[\Phi_n, \hat I_nI_n']\Big[\bar f_x(X_n) + \bar\psi_n + \bar\Phi_n'\lambda_n + \frac{k}{2}P_{x,z}(X_n, Z_n)\Big]$
$- \sum_{(4.2b)} a_nk\phi_i(X_n, Z_n)[\lambda_n^i + k\phi_i(X_n, Z_n)] + \sum_{(4.2d)} a_nw_n\phi_i(X_n, Z_n)$
$+ a_n^2K[|f_x(X_n)|^2 + |\psi_n|^2 + |\phi(X_n, Z_n)|^2 + w_n^2]$,

where $\sum_{(4.2b)}$ or $\sum_{(4.2d)}$ indicates that the summation is over those $i$ for which (4.2b) or (4.2d) is used at iteration $n$.

If (4.2a) is used and truncated, then (here we have $Z_{n+1}^i = 0$)

(4.8) $a_n[\lambda_n^i + k(q_i(X_n) + Z_n^i)] \ge Z_n^i > v_n$, or $v_n(1 - a_nk) \le a_n\lambda_n^i + a_nkq_i(X_n)$.

By $\sup_n |X_n| \le M$, $q(X_n)$ is bounded. The fact that the minimum in (4.3) is $\le |\bar f_x(X_n) + \bar\psi_n|^2$ and the definitions of $\bar\lambda_n$, $\tilde\psi_n$ imply that there is a real $K$ for which $|\bar\lambda_n| \le K|f_x(X_n)|$, $|\tilde\psi_n| \le K|\psi_n|$. These facts, together with (4.8) and (A7) and the Borel-Cantelli lemma, imply that (4.2a) will be truncated only finitely often w.p.1. Neglecting this "finitely often occurring" event will not affect the rest of the proof, and we will do so. Then setting $\hat I_n = $ identity and using $JI_n' = I_nI_n'$ and (4.4), (4.5), (4.7) we get

(4.9) $P(X_{n+1}, Z_{n+1}) - P(X_n, Z_n) \le -a_nK|\bar\Phi'(X_n, Z_n, v_n)\phi(X_n, Z_n)|^2 + \sum_{(4.2d)} a_nw_n\phi_i(X_n, Z_n) + a_n^2K[|f_x(X_n)|^2 + |\psi_n|^2 + |\phi(X_n, Z_n)|^2 + |w_n|^2]$.

Using the fact that $w_n \to 0$, an analogue of the argument of Part (i) of the proof of Theorem 2.1 (but with (A3''') instead of (A3)) yields that $(X_n, Z_n) \in C_\varepsilon^+$ infinitely often w.p.1 for each $\varepsilon > 0$, and that $P_{x,z}(X_n, Z_n) \to 0$ w.p.1 as $n \to \infty$. Since $Z_n^i \ge 0$, this implies that $\phi_i(x, z) = 0$ and $q_i(x) \le 0$ for any limit point $(x, z)$ of $\{X_n(\omega), Z_n(\omega)\}$ (for $\omega$ not in some null set). Note that if $X_{n_j}(\omega) \to x$, then $Z_{n_j}^i(\omega) \to -q_i(x)$.

Part (ii). Now we evaluate $f(X_{n+1}) - f(X_n)$:

(4.10) $f(X_{n+1}) - f(X_n) \le -a_nf_x'(X_n)\Big[f_x(X_n) + \psi_n + \Phi_n'\lambda_n + \frac{k}{2}P_x(X_n, Z_n)\Big] + a_n^2K[|f_x(X_n)|^2 + |\psi_n|^2 + |\phi(X_n, Z_n)|^2]$.

It is helpful to rewrite (4.10) as

(4.11) $f(X_{n+1}) - f(X_n) \le -a_n\bar f_x'(X_n)\Big[\bar f_x(X_n) + \bar\psi_n + \bar\Phi_n'\lambda_n + \frac{k}{2}P_{x,z}(X_n, Z_n)\Big] + a_n^2K[|f_x(X_n)|^2 + |\psi_n|^2 + |\phi(X_n, Z_n)|^2]$.

Equivalently, using (4.5) (see above (4.4) for the definition of $\bar\pi_n$), we get

(4.12) $f(X_{n+1}) - f(X_n) \le -a_n\bar f_x'(X_n)[\bar\pi_n\bar f_x(X_n) + \bar\pi_n\bar\psi_n] - \frac{k}{2}a_n\bar f_x'(X_n)P_{x,z}(X_n, Z_n) + a_n^2K[|f_x(X_n)|^2 + |\psi_n|^2 + |\phi(X_n, Z_n)|^2]$.

The first term on the right-hand side of (4.12) can be written as

(4.13) $-a_n|\bar\pi_n\bar f_x(X_n)|^2 - a_n\bar f_x'(X_n)\bar\pi_n\bar\psi_n$.

Recall that

(4.14) $|\bar\pi_n\bar f_x(X_n)|^2 = \min_\lambda \Big[\Big|f_x(X_n) + \sum_i \lambda^iq_{i,x}(X_n)\Big|^2 + \sum_{i: Z_n^i > v_n} (\lambda^i)^2\Big]$,

and that $\bar\lambda_n$ is a minimizing $\lambda$ in (4.14). Let $\{x_n, z_n\}$ denote any sequence with all $z_n^i \ge 0$ and limit $(\bar x, \bar z)$, with $q_i(\bar x) + \bar z^i = 0$, $i \le t$. Note that

(4.15) $\bar\pi(x_n, z_n, v_n)\bar f_x(x_n) \to 0 \Rightarrow (\bar x, \bar z) \in F^+$.

Equations (4.12), (4.13) and (4.15), the summability of both $\sum_n a_n\bar f_x'(X_n)\bar\pi_n\bar\psi_n$ and $\sum_n a_n^2|\bar\psi_n|^2$, the convergence $P_{x,z}(X_n, Z_n) \to 0$, and the divergence of $\sum_n a_n$ imply that $(X_n, Z_n) \in F_\varepsilon^+$ infinitely often w.p.1 for each $\varepsilon > 0$ (for otherwise the sum over $n$ of the right-hand side of (4.12) would tend to $-\infty$).

Thus $\{X_n, Z_n\} \in F_\varepsilon^+ \cap C_\varepsilon^+$ infinitely often w.p.1. Also, $F_\varepsilon^+ \cap C_\varepsilon^+$ tends to the closed set $F^+ \cap C^+$ as $\varepsilon \to 0$. Given small $\delta > 0$, there is an $\varepsilon > 0$ so that we can write $C_\varepsilon^+ \cap F_\varepsilon^+ = \bigcup_i T_i^\varepsilon$, where $T_i^\varepsilon$ is closed and connected, contains $T_i$, and the $T_i^\varepsilon$ are disjoint. We can suppose that the maximum variation of $f(x)$ on each $T_i^\varepsilon$ is less than $\delta$ and that if $f_i \ne f_j$, then $|f_i - f_j| \ge 3\delta$.

Using the technique of Part (ii) of the proof of Theorem 2.1, we can show that $\{f(X_n)\}$ converges and that $(X_n, Z_n)$ is in $F_\varepsilon^+ \cap C_\varepsilon^+$ for large $n$ w.p.1, for any fixed $\varepsilon > 0$. Furthermore (again using the same idea as in the proof of Theorem 2.1), we can show that $(X_n, Z_n) \to F^+ \cap C^+$ as $n \to \infty$ w.p.1, which is the desired result (4.6), except for the nonnegativity of the $\bar\psi^i$, $i \le t$.

Part (iii). We can suppose that $f_x(x) \ne 0$ at the limit points, for otherwise we can take $\bar\psi^i = 0$ for all $i$. Thus we only need to consider limit points $x$ ($(x, z)$, resp.) on the boundary of $C$ ($C^+$, resp.). By (A9), and the fact that the $T_j$ are closed (and we can suppose bounded, since $\sup_n |X_n(\omega)| < \infty$), for any $i$, $j$, either $\bar\lambda^i(x) = 0$ on $T_j$, or $\inf_{x\in T_j} \bar\lambda^i(x) > 0$, or $\sup_{x\in T_j} \bar\lambda^i(x) \equiv -l_{ij} < 0$. Fix $T_j$, and suppose that $l_{ij} > 0$. This implies that $q_i(\cdot)$ is active on $T_j$.

By Parts (i) and (ii), for each $\varepsilon > 0$, $\{X_n, Z_n\}$ are ultimately in $F_\varepsilon^+ \cap C_\varepsilon^+$; hence for any $\delta > 0$ the sequence is ultimately in $\bigcup_k N_\delta(T_k)$, and we can suppose that $\delta$ is small enough so that the $\{\bar N_\delta(T_k)\}$ are disjoint and $\bar N_\delta(T_k) \supset T_k$. So for any small $\delta > 0$, the tail of $\{X_n, Z_n\}$ is (w.p.1) contained in one of (which one depends on $\omega$) the $\{N_\delta(T_k)\}$. Let $\{X_n, Z_n\} \to T_j$; then (for $\omega \notin$ a null set) (A11) implies that $\lambda_n^i \le -l_i/2 < 0$ for some real $l_i > 0$ and all large $n$, and $\phi_i(X_n, Z_n) \to 0$ as $n \to \infty$. For this $\{X_n, Z_n\}$ sequence, (4.16a) holds for large $n$ when $Z_n^i > v_n$:

(4.16a) $Z_{n+1}^i \ge Z_n^i + a_nl_i/2$.

If $Z_n^i \le v_n$, then $Z_n^i$ cannot decrease and may increase, but $v_n$ decreases; ultimately $v_n > Z_n^i$. Combining (4.2b, c) we get (for large $n$)

(4.16b) $Z_{n+1}^i \ge Z_n^i - a_n[\lambda_n^i + k\phi_i(X_n, Z_n)]I_{\{\lambda_n^i + k\phi_i(X_n, Z_n) \le 0\}} \ge Z_n^i + (a_nl_i/2)I_{\{\lambda_n^i + k\phi_i(X_n, Z_n) \le 0\}}$.

If (4.2d) holds,

(4.16c) $Z_{n+1}^i = Z_n^i + a_nw_n$.

In (4.16a, b) we can replace $l_i/2$ by $w_n$. Then (4.16a, b, c), the fact that $\sum_n a_n\tilde\psi_n$ is a square summable (hence convergent) martingale, and (A8) imply that $Z_n^i \to \infty$, contradicting the fact that $q_i(\cdot)$ is active in $T_j$. Thus all $\bar\lambda^i(x) \ge 0$ on $T_j$ if $\{X_n, Z_n\} \in N_\delta(T_j)$ infinitely often, for each $\delta > 0$, as desired. Q.E.D.

Note added in proof. The conditions requiring square summability of $\{a_n\}$ and orthogonality of $\{\psi_n\}$ have been considerably weakened. See General convergence theorems for stochastic approximation via weak convergence, by the first author, to appear. Also, numerical experiments indicate that the algorithms work reasonably well, with appropriate parameter selections.

REFERENCES

[1] H. J. KUSHNER, Stochastic approximation algorithms for constrained optimization problems, Ann. Statist., 2 (1974), pp. 713-723.

[2] H. J. KUSHNER AND T. GAVIN, Stochastic approximation-like algorithms for constrained systems: Algorithms and numerical results, IEEE Trans. Automatic Control, AC-19 (1974), pp. 349-357.

[3] H. J. KUSHNER AND E. SANVICENTE, Penalty function methods for constrained stochastic approximation, J. Math. Anal. Appl., 46 (1974), pp. 499-512.

[4] A. MIELE, E. E. CRAGG, R. R. IYER AND A. V. LEVY, Use of the augmented penalty function in mathematical programming problems, Part I, J. Optimization Theory Appl., 8 (1971), pp. 115-130.

[5] M. J. D. POWELL, A method for nonlinear constraints in minimization problems, in Optimization, R. Fletcher, ed., Academic Press, New York, 1969, pp. 283-298.

[6] D. P. BERTSEKAS, Combined primal-dual and penalty methods for constrained minimization, this Journal, 13 (1975), pp. 521-544.

[7] R. T. ROCKAFELLAR, Augmented Lagrange multiplier functions and duality in nonconvex programming, this Journal, 12 (1974), pp. 268-285.

[8] E. POLAK, Computational Methods in Optimization, Academic Press, New York, 1971.
