
A Stochastic Approximation Framework for a Class of Randomized Optimization Algorithms

Jiaqiao Hu, Member, IEEE, Ping Hu, and Hyeong Soo Chang, Senior Member, IEEE

Abstract—We study a class of random sampling-based algorithms for solving general non-differentiable optimization problems. These are iterative approaches that are based on sampling from and updating an underlying distribution function over the set of feasible solutions. In particular, we propose a novel and systematic framework to investigate the convergence and asymptotic convergence rates of these algorithms by exploiting their connections to the well-known stochastic approximation (SA) method. Such an SA framework unifies our understanding of these randomized algorithms and provides new insight into their design and implementation issues. Our preliminary numerical experiments indicate that new implementations of these algorithms based on the proposed framework may lead to improved performance over existing procedures.

Index Terms—Algorithm design and analysis, optimization, stochastic approximation.

I. INTRODUCTION

IN many control and optimization applications, we are concerned with finding the optimal solution to the problem of

the form

$$x^* \in \operatorname*{arg\,max}_{x \in \mathcal{X}} H(x) \qquad (1)$$

where $x$ is a vector of decision variables, $\mathcal{X}$ is a non-empty compact set in $\Re^n$, and $H : \mathcal{X} \to \Re$ is a bounded deterministic function. We assume the existence of an optimal solution $x^*$. However, the function $H$ is not necessarily continuous and may possess multiple local maxima. In other words, no structural assumptions are imposed on the objective function. This setting arises in many complex systems of interest, e.g., when the explicit form of $H$ is not readily available and the objective function is provided as a "black box" that returns only the function value for a specified solution. In attempts at solving such optimization problems, various random search methods that rely only on the objective function values have been proposed, which include simulated annealing [19], genetic

Manuscript received May 15, 2009; revised January 15, 2010 and October 15, 2010; accepted March 30, 2011. Date of publication May 31, 2011; date of current version December 29, 2011. The work of J. Hu was supported in part by the National Science Foundation under Grant DMI-0900332, and in part by the Air Force Office of Scientific Research under Grant FA95501010340. The work of H. S. Chang was supported by the Special Research Grant of Sogang University (200811037). Recommended for publication by Associate Editor C. Szepesvari.

J. Hu and P. Hu are with the Department of Applied Mathematics and Statistics, State University of New York, Stony Brook, NY 11794 USA (e-mail: [email protected]; [email protected]).

H. S. Chang is with the Department of Computer Science and Engineering, Sogang University, Seoul 121-742, Korea (e-mail: [email protected]).

Digital Object Identifier 10.1109/TAC.2011.2158128

algorithms [12], tabu search [11], nested partitions [35], generalized hill climbing [18], and stochastic adaptive search [43], to name just a few.

In this paper, we focus on a class of adaptive random search algorithms known as the model-based methods [45]. These methods differ from other random search algorithms in that a set/population of candidate solutions is generated at each iteration by sampling from an intermediate (parameterized) probability distribution "model" over the solution space. The idea is to iteratively modify the distribution model based on the sampled solutions to bias the future search towards regions containing high-quality solutions. Examples of model-based methods are ant colony optimization [8], [9], estimation of distribution algorithms (EDAs) [24], [31], annealing adaptive search (AAS) [43], probability collectives (PCs) [42], the cross-entropy (CE) method [34], and model reference adaptive search (MRAS) [16]; see also the stochastic approximation Monte-Carlo method [25], [26], which shares certain similarities with CE. These methods retain the primary strengths of population-based approaches such as genetic algorithms—improving upon algorithms that work with a single iterate at a time (e.g., simulated annealing)—while at the same time providing more flexibility and robustness in exploring the entire solution space. Besides their many successful applications to hard optimization problems (e.g., [2], [9], [24], [34], [43], and [44]), these algorithms also have potential applications in a number of areas related to estimation and control, e.g., estimation of rare-event probabilities [33], [34], data clustering [20], multi-target tracking [37], neural network training [25], and decision making under uncertainty [7], [17], [28].

However, while sequential random search methods such as simulated annealing, nested partitions, and stochastic adaptive search are relatively well studied and understood, our understanding of model-based methods is limited to specific settings [6], [8], [33], [40]. In particular, many model-based algorithms have a heuristic nature and there is little support for their theoretical convergence even in their idealized versions. This paper aims to develop an improved understanding of the asymptotic properties of some model-based algorithms by exploiting their connections to the well-known stochastic approximation (SA) method [21], [22], [32], [38]. More specifically, we show that these algorithms implicitly interpret a deterministic optimization problem (which might be a discrete combinatorial problem) as a sequence of stochastic optimization problems in terms of the parameters of the probability distribution rather than the original decision variables. Moreover, the Monte-Carlo random sampling noise inherent in these algorithms can be naturally treated as the uncertainty in evaluating the objective functions of the stochastic counterparts of the original problem. As a result, these model-based algorithms can be viewed as gradient recur-

0018-9286/$26.00 © 2011 IEEE


sions on the continuous parameter space for solving a sequence of stochastic optimization problems with differentiable structures. To the best of our knowledge, algorithms of this type have never before been integrated with stochastic approximation in this manner. To fix ideas, we will use the CE method as an exemplary instance of model-based algorithms and analyze the convergence and convergence rate of the method via its equivalent gradient interpretation. Although our discussion will be centered around CE, we hope that with the help of existing tools in stochastic approximation and the proposed connection between model-based algorithms and stochastic approximation procedures, we will eventually be able to gain more insight into the capability and potential of many other algorithms in this class as well.

The rest of the paper is structured as follows. In Section II, we briefly review model-based algorithms and derive their equivalent gradient recursions. This result is then used in Sections III and IV to study the convergence and the asymptotic convergence rate properties of the CE method. Our analysis based on the gradient interpretation of CE leads to a modified implementation of the algorithm, whose performance is illustrated on a set of well-known benchmark optimization problems in Section V. Finally, we conclude this paper in Section VI.

II. MODEL-BASED ALGORITHMS AND THEIR ASSOCIATED GRADIENT ITERATIONS

The model-based optimization framework for solving (1) consists of the following steps: let $g_k$ be a probability distribution on $\mathcal{X}$ at the $k$th iteration of an algorithm: 1) randomly generate candidate solutions by sampling from $g_k$; 2) evaluate/estimate the performance of the generated candidate solutions; and 3) update $g_k$ to obtain a new distribution $g_{k+1}$ based on a selected set of candidate solutions. Schematically, we seek a sequence of distributions $\{g_k\}$ with the hope that $g_k \to g^*$ as $k \to \infty$, where $g^*$ is a limiting distribution that concentrates its mass around the optimal solution $x^*$.
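To make the three steps concrete, the following is a minimal sketch of the generic model-based loop, assuming user-supplied sampling, evaluation, and updating routines (all names here are illustrative, not from the original):

```python
import numpy as np

def model_based_search(sample, evaluate, update, theta0, num_iters=100, pop_size=50):
    """Generic model-based optimization loop (illustrative sketch).

    sample(theta, n)    -> n candidate solutions drawn from the current model
    evaluate(X)         -> objective values H(x) for the candidates
    update(theta, X, H) -> new distribution parameter based on the sample
    """
    theta = theta0
    best_x, best_h = None, -np.inf
    for _ in range(num_iters):
        X = sample(theta, pop_size)       # 1) generate candidate solutions
        H = evaluate(X)                   # 2) evaluate their performance
        i = int(np.argmax(H))
        if H[i] > best_h:                 # track the best solution seen
            best_x, best_h = X[i], float(H[i])
        theta = update(theta, X, H)       # 3) update the distribution model
    return best_x, best_h
```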

Examples for the sequence of distributions $\{g_k\}$ include the following: 1) the proportional selection scheme—introduced in EDAs [31] and PCs [42], and used in the instantiation of MRAS in [16]; 2) the Boltzmann distribution with decreasing temperature schedule—used in AAS [43]; 3) the optimal importance sampling measure—used in the CE method [34]. However, in all three cases, the sequence $\{g_k\}$ is unknown a priori, or else the problem would essentially be solved. So at each iteration $k$, sampling is often performed from a surrogate distribution that approximates $g_{k+1}$. One popular approach, introduced in [34], [42], is to specify a family of parameterized distributions $\{f_\theta,\ \theta \in \Theta\}$ (with $\Theta$ being the parameter space) and then project $g_{k+1}$ onto the family to obtain a sequence of sampling distributions $\{f_{\theta_k}\}$ with desired convergence properties. The projection is carried out by minimizing the Kullback–Leibler (KL) divergence between $g_{k+1}$ and the parameterized family $\{f_\theta\}$, i.e.,

$$\theta_{k+1} = \operatorname*{arg\,min}_{\theta \in \Theta} D(g_{k+1}, f_\theta) \qquad (2)$$

where $D(g, f) := \mathbb{E}_g[\ln(g(X)/f(X))] = \int_{\mathcal{X}} \ln \frac{g(x)}{f(x)}\, g(x)\, \nu(dx)$ and $\nu$ is the Lebesgue/discrete measure on $\mathcal{X}$. The idea is that the parameterized family is specified with some structure so that once the parameter is determined, sampling from each of these distributions should be a relatively easy task. Moreover, the task of updating the entire distribution can be simplified to the task of updating its associated parameters, and the sequence $\{g_k\}$ (henceforth referred to as reference distributions) is only used implicitly to guide the parameter updating procedure.

We remark that there are other ways of constructing surrogate distributions. For example, in AAS, Markov chain Monte Carlo techniques are frequently used to approximate the target Boltzmann distribution at each iteration, whereas in traditional EDAs, empirical distribution models are directly constructed to generate new candidate solutions. However, these algorithms can also be accommodated within the above projection framework by projecting the target distributions onto a family of parameterized distributions. Specifically, under such a framework, the three example sequences of reference distributions can be written as follows:

1) proportional selection: $g_{k+1}(x) = \dfrac{S(H(x))\, g_k(x)}{\mathbb{E}_{g_k}[S(H(X))]}$;

2) Boltzmann distribution: $g_{k+1}(x) = \dfrac{e^{H(x)/T_k}}{\int_{\mathcal{X}} e^{H(x)/T_k}\, \nu(dx)}$, where $\{T_k\}$ is a sequence of temperature parameters determined by an annealing schedule;

3) importance sampling: $g_{k+1}(x) = \dfrac{H(x)\, f_{\theta_k}(x)}{\mathbb{E}_{\theta_k}[H(X)]}$;

where $S : \Re \to \Re^+$ is a positive increasing (possibly iteration-varying) function, and $X$ is a generic random variable taking values in

$\mathcal{X}$. Throughout this paper, we use $P_g$ and $\mathbb{E}_g$ to represent the probability and expectation taken with respect to the density/mass function $g$, whereas for a given distribution parameterized by $\theta$, we use $P_\theta$ and $\mathbb{E}_\theta$ to represent the probability and expectation taken with respect to the underlying parameterized distribution.

A. Natural Exponential Families

When the parameterized family belongs to natural exponential families (NEFs), the optimization problem (2) can be solved analytically in closed form for an arbitrary $g$, which makes the projection approach convenient to implement in practice.

Definition 2.1: A parameterized family $\{f_\theta,\ \theta \in \Theta \subseteq \Re^d\}$ on $\mathcal{X}$ is called a natural exponential family if there exist continuous mappings $\Gamma : \Re^n \to \Re^d$ and $K : \Re^d \to \Re$ such that $f_\theta(x) = \exp\{\theta^T \Gamma(x) - K(\theta)\}$, where $\Theta$ is the natural parameter space $\Theta = \{\theta \in \Re^d : |K(\theta)| < \infty\}$ and $K(\theta) = \ln \int_{\mathcal{X}} \exp\{\theta^T \Gamma(x)\}\, \nu(dx)$.

Let $\operatorname{int}(\Theta)$ be the interior of $\Theta$. It is well known (e.g., [30]) that the function $K(\theta)$ is strictly convex on $\operatorname{int}(\Theta)$ with gradient $\nabla_\theta K(\theta) = \mathbb{E}_\theta[\Gamma(X)]$ and Hessian matrix $\operatorname{Cov}_\theta[\Gamma(X)]$, where $\operatorname{Cov}_\theta$ is the covariance with respect to $f_\theta$. Therefore, the Jacobian of the mean vector function $m(\theta) := \mathbb{E}_\theta[\Gamma(X)]$ is strictly positive definite and invertible. From the inverse function theorem, it follows that $m(\theta)$ is also invertible.
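As a concrete illustration, the univariate normal family can be written in NEF form with sufficient statistic $\Gamma(x) = (x, x^2)$. The following sketch (our own example, not from the text) shows the resulting mean vector function $m(\theta) = \mathbb{E}_\theta[\Gamma(X)]$ and its inverse:

```python
import numpy as np

# Univariate normal N(mu, sigma^2) as an NEF: Gamma(x) = (x, x^2) and
# natural parameter theta = (mu/sigma^2, -1/(2 sigma^2)).

def natural_param(mu, sigma2):
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def mean_vector(theta):
    # m(theta) = E_theta[Gamma(X)] = (E[X], E[X^2]) = (mu, mu^2 + sigma^2)
    sigma2 = -1.0 / (2.0 * theta[1])
    mu = theta[0] * sigma2
    return np.array([mu, mu ** 2 + sigma2])

def inverse_mean_vector(eta):
    # m is invertible: recover (mu, sigma^2) from (E[X], E[X^2])
    mu, second_moment = eta
    return mu, second_moment - mu ** 2
```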

B. Model-Based Algorithms as Gradient Iterations on the Parameter Space

Let $\{f_\theta,\ \theta \in \Theta\}$ be described as above and $f_{\theta_k}$ be a surrogate sampling


distribution obtained at the $k$th iteration of an algorithm. We consider a general reference distribution in the following form:

$$g_{k+1}(x) = \alpha_k\, \bar{g}_{k+1}(x) + (1 - \alpha_k)\, f_{\theta_k}(x) \qquad (3)$$

where $\alpha_k \in (0, 1]$ is a "smoothing" parameter ensuring that the difference between the reference distribution $g_{k+1}$ and the sampling distribution $f_{\theta_k}$ is only incremental, so that the new distribution $f_{\theta_{k+1}}$ obtained by minimizing $D(g_{k+1}, f_\theta)$ does not deviate too much from the current distribution $f_{\theta_k}$. When $\{g_{k+1}\}$ is used as the reference distribution sequence, the following lemma states a key link between the two successive mean vectors in model-based algorithms.

Lemma 2.1: If $f_\theta$ belongs to NEFs and the new parameter $\theta_{k+1}$ obtained via minimizing $D(g_{k+1}, f_\theta)$ is an interior point of $\Theta$, i.e., $\theta_{k+1} \in \operatorname{int}(\Theta)$, for all $k$, then

$$m(\theta_{k+1}) - m(\theta_k) = \alpha_k \big( \mathbb{E}_{\bar{g}_{k+1}}[\Gamma(X)] - m(\theta_k) \big).$$

Proof: Since $\theta_{k+1} \in \operatorname{int}(\Theta)$, it satisfies the first order necessary condition for optimality. Thus, we have from Lemma 2 in [16] that $m(\theta_{k+1}) = \mathbb{E}_{g_{k+1}}[\Gamma(X)]$. It follows from (3) that

$$m(\theta_{k+1}) = \alpha_k\, \mathbb{E}_{\bar{g}_{k+1}}[\Gamma(X)] + (1 - \alpha_k)\, \mathbb{E}_{\theta_k}[\Gamma(X)].$$

Thus, the difference between the two successive mean parameter vectors can be written as

$$m(\theta_{k+1}) - m(\theta_k) = \alpha_k \big( \mathbb{E}_{\bar{g}_{k+1}}[\Gamma(X)] - m(\theta_k) \big) = -\alpha_k \nabla_\theta D(\bar{g}_{k+1}, f_\theta)\big|_{\theta = \theta_k} \qquad (4)$$

where the last equality follows from the properties of NEFs.

Lemma 2.1 brings out explicitly the updating direction of the mean vectors at each step, which is in the direction of the negative gradient of the time-varying objective function for the minimization problem $\min_{\theta \in \Theta} D(\bar{g}_{k+1}, f_\theta)$. In particular, in the case of CE, i.e., when $\bar{g}_{k+1}$ in the right-hand side of recursion (3) is replaced with the importance sampling density $\bar{g}_{k+1}(x) = H(x) f_{\theta_k}(x) / \mathbb{E}_{\theta_k}[H(X)]$, (4) becomes

$$m(\theta_{k+1}) - m(\theta_k) = \alpha_k \nabla_\theta \ln \mathbb{E}_\theta[H(X)]\big|_{\theta = \theta_k} \qquad (5)$$

where the interchange of derivative and integral above is guaranteed by the dominated convergence theorem. So the updating direction is in the gradient of the objective function for the maximization problem $\max_{\theta \in \Theta} \ln \mathbb{E}_\theta[H(X)]$. The optimal solution of this optimization problem is a parameter whose associated sampling distribution assigns maximum probability to the set of optimal solutions of (1). Note that the smoothing parameter sequence $\{\alpha_k\}$ turns out to be the gain sequence for the gradient iteration, so that the special case of constant smoothing corresponds to constant gain sequences. This suggests that MRAS, PCs, CE, as well as other algorithms that can be accommodated by the framework, are essentially gradient-based recursions for solving a sequence of optimization problems on the parameter space (of the distribution model) with smooth differentiable structures. This key insight, which is new to model-based methods, is crucial to understanding why these algorithms work well for hard optimization problems with little structure. Furthermore, the connection to gradient search allows us to incorporate the rich body of theory and tools from gradient methods and stochastic approximation procedures to analyze model-based algorithms for general non-continuous, non-differentiable optimization problems. To fix ideas, in the sequel of this paper we will use the CE method as an exemplary instance of model-based algorithms, and investigate its convergence properties in detail.
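The mean-vector update can be checked numerically on a toy family. The sketch below (our own illustration) uses a Bernoulli NEF on {0, 1} with $\Gamma(x) = x$, so that $m(\theta)$ is the sigmoid function, and verifies that the KL projection of the smoothed reference distribution moves the mean exactly as in (4):

```python
import numpy as np

def m(theta):                    # mean vector function of the Bernoulli NEF
    return 1.0 / (1.0 + np.exp(-theta))

H = np.array([1.0, 3.0])         # objective values H(0), H(1)
theta_k, alpha_k = 0.2, 0.3      # current parameter and smoothing/gain

f = np.array([1.0 - m(theta_k), m(theta_k)])   # current sampling pmf
g_bar = H * f / np.sum(H * f)                  # CE-type reference pmf
E_gbar_Gamma = g_bar[1]                        # E_{g_bar}[Gamma(X)] with Gamma(x) = x

# Update (4): m(theta_{k+1}) - m(theta_k) = alpha_k * (E_{g_bar}[Gamma] - m(theta_k))
m_next = m(theta_k) + alpha_k * (E_gbar_Gamma - m(theta_k))
theta_next = np.log(m_next / (1.0 - m_next))   # invert m to recover theta_{k+1}
print(theta_next, m_next)
```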

III. CONVERGENCE OF THE CE METHOD

The CE method was motivated by an adaptive importance sampling algorithm for estimating rare event probabilities in stochastic networks [33]. It was later discovered that the method could be adapted to solving combinatorial and continuous optimization problems. Since then, the CE method together with its various extensions and adaptations has gradually become a useful and important tool for Monte Carlo simulation and multi-extremal nonlinear optimization [34]. The idealized version of CE for solving (1) is given below:

Algorithm 1. Idealized CE Method

Step 0: Choose an initial pdf/pmf $f_{\theta_0}$ on $\mathcal{X}$, $\theta_0 \in \operatorname{int}(\Theta)$. Specify constants $\rho \in (0, 1]$ and $\varepsilon > 0$, a non-decreasing function $S : \Re \to \Re^+$, and a gain sequence $\{\alpha_k\}$. Set $k = 0$.
Step 1: Calculate the $(1-\rho)$-quantile $\gamma_{k+1}$ of $H(X)$, where $X$ is a random vector taking values in $\mathcal{X}$ with distribution $f_{\theta_k}$.
Step 2: Compute parameter $\theta_{k+1} = \operatorname*{arg\,min}_{\theta \in \Theta} D(g_{k+1}, f_\theta)$, where $g_{k+1}$ is given by (6).
Step 3: If a stopping rule is satisfied, then return $\theta_{k+1}$ and terminate; otherwise set $k := k + 1$ and go to Step 1.

In Algorithm 1, the function $S$ is used to account for the cases where the objective function $H(x)$ is negative for some $x$. In the standard implementation of CE, $S$ is often taken to be a constant function, i.e., $S(y) = 1$ for all $y$. At Step 2, the reference distribution $g_{k+1}$ is given by

$$g_{k+1}(x) = \alpha_k \frac{\chi_\varepsilon(H(x), \gamma_{k+1})\, S(H(x))\, f_{\theta_k}(x)}{\mathbb{E}_{\theta_k}[\chi_\varepsilon(H(X), \gamma_{k+1})\, S(H(X))]} + (1 - \alpha_k)\, f_{\theta_k}(x) \qquad (6)$$

with $\gamma_{k+1}$ as in Step 1, where for purely technical reasons, we have made a slight modification of the CE method by replacing the original indicator function with a continuous threshold function

$$\chi_\varepsilon(y, \gamma) = \begin{cases} 1 & \text{if } y \ge \gamma, \\ (y - \gamma + \varepsilon)/\varepsilon & \text{if } \gamma - \varepsilon < y < \gamma, \\ 0 & \text{if } y \le \gamma - \varepsilon. \end{cases}$$

The motivation for using such a threshold function is to concentrate the computational effort on the top $\rho$-percent of the selected "elite" solutions. This is in the same spirit as the selection scheme employed in many population-based approaches such as genetic algorithms.
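In code, the piecewise-linear threshold assumed above can be written compactly as a clipped ramp; a minimal sketch:

```python
import numpy as np

def chi(y, gamma, eps):
    """Continuous threshold: 0 below gamma - eps, a linear ramp on
    (gamma - eps, gamma), and 1 at or above gamma."""
    return np.clip((y - (gamma - eps)) / eps, 0.0, 1.0)
```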


Proposition 3.1: In Algorithm 1, if $f_\theta$ belongs to NEFs and $\theta_{k+1} \in \operatorname{int}(\Theta)$ for all $k$, then the mean parameter vector function $m(\theta)$ satisfies

$$m(\theta_{k+1}) - m(\theta_k) = \alpha_k \nabla_\theta \ln \mathbb{E}_\theta[\chi_\varepsilon(H(X), \gamma_{k+1})\, S(H(X))]\big|_{\theta = \theta_k}. \qquad (7)$$

Proof: Follows from the proof of Lemma 2.1.

Algorithm 1 describes the idealized setting where quantile values and expectations can be evaluated exactly whenever needed. In practice, since only a finite number of candidate solutions are generated at each iteration, we need to replace expected values by their corresponding sample averages and estimate the true quantiles by sample quantiles based on observed data. This results in the following general Monte-Carlo version of the CE method.

Algorithm 2. The Monte-Carlo Version of CE

Step 0: Choose an initial pdf/pmf $f_{\theta_0}$ on $\mathcal{X}$, $\theta_0 \in \operatorname{int}(\Theta)$. Specify constants $\rho \in (0, 1]$ and $\varepsilon > 0$, parameter sequences $\{\alpha_k\}$ and $\{N_k\}$, and a bounded non-decreasing function $S : \Re \to \Re^+$ satisfying $S(y) > 0$ for all $y$. Set $k = 0$.
Step 1: Randomly sample $N_k$ independent and identically distributed (i.i.d.) solutions $\Lambda_k = \{X_k^1, \ldots, X_k^{N_k}\}$ from the distribution $f_{\theta_k}$.
Step 2: Calculate the sample $(1-\rho)$-quantile $\tilde{\gamma}_{k+1} = H_{(\lceil (1-\rho) N_k \rceil)}$, where $\lceil a \rceil$ is the smallest integer greater than $a$, and $H_{(i)}$ is the $i$th-order statistic of the sequence $\{H(X_k^i)\}$.
Step 3: Compute a new parameter $\theta_{k+1} = m^{-1}(\eta_{k+1})$, where $\eta_k := m(\theta_k)$, and

$$\eta_{k+1} = \eta_k + \alpha_k \left( \frac{\sum_{i=1}^{N_k} \chi_\varepsilon(H(X_k^i), \tilde{\gamma}_{k+1})\, S(H(X_k^i))\, \Gamma(X_k^i)}{\sum_{i=1}^{N_k} \chi_\varepsilon(H(X_k^i), \tilde{\gamma}_{k+1})\, S(H(X_k^i))} - \eta_k \right) \qquad (8)$$

is an empirical estimate of recursion (7) based on the sampled solutions in $\Lambda_k$.
Step 4: If a stopping rule is satisfied, then return $\theta_{k+1}$ (and/or the best solution sampled thus far) and terminate; otherwise set $k := k + 1$ and go to Step 1.
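A minimal runnable sketch of Algorithm 2 for continuous problems, using independent univariate normals so that the mean vector is $\eta = (\mathbb{E}[X], \mathbb{E}[X^2])$ componentwise, is given below. The specific constants and helper names are our own illustrative choices, not the paper's prescriptions:

```python
import numpy as np

def mc_ce(H, dim, num_iters=200, rho=0.1, eps=0.01, rng=np.random.default_rng(0)):
    """Monte-Carlo CE sketch (Algorithm 2) with independent normals."""
    alpha = lambda k: 5.0 / (k + 100) ** 0.501   # gain sequence (assumed constants)
    N = lambda k: 200 + int(k ** 0.6)            # growing sample size (assumed)
    mu = rng.uniform(-5.0, 5.0, dim)             # initial mean
    var = np.full(dim, 1000.0)                   # large initial variance
    for k in range(num_iters):
        X = mu + np.sqrt(var) * rng.standard_normal((N(k), dim))   # Step 1
        h = np.apply_along_axis(H, 1, X)
        gamma = np.quantile(h, 1.0 - rho)                          # Step 2
        w = np.clip((h - (gamma - eps)) / eps, 0.0, 1.0)           # threshold weights
        w /= w.sum()
        a = alpha(k)                                               # Step 3
        m1 = mu + a * (w @ X - mu)                                 # eta_1 = E[X]
        m2 = (mu**2 + var) + a * (w @ X**2 - (mu**2 + var))        # eta_2 = E[X^2]
        mu, var = m1, np.maximum(m2 - m1**2, 1e-12)
    return mu

# Example: maximize H(x) = -||x||^2, whose optimum is the origin.
print(mc_ce(lambda x: -np.sum(x**2), dim=5))
```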

Note that since our primary purpose is to analyze the asymptotic properties of CE, we have not specified a stopping rule for Algorithm 2. In practice, we can stop the algorithm whenever a given computational budget is consumed, or when the parameter does not change much for several consecutive iterations. A detailed discussion of how to choose an appropriate stopping criterion can be found in, e.g., [34].

Since Algorithm 2 is randomized, it induces a probability distribution over the sequence of sampled solutions $\{\Lambda_k\}$. In what follows, $P$ and $\mathbb{E}$ are the probability and expectation taken with respect to this distribution, and probability one convergence of random vectors and matrices is to be understood with respect to $P$. Also, we define $\{\mathcal{F}_k\}$ as the sequence of increasing $\sigma$-fields generated by the collection of solutions $\{\Lambda_0, \ldots, \Lambda_{k-1}\}$. Note that our knowledge about the random parameter $\theta_k$ is completely determined by $\mathcal{F}_k$, and given $\mathcal{F}_k$, the solutions $X_k^1, \ldots, X_k^{N_k}$ generated at the $k$th iteration of the algorithm are i.i.d., conditionally independent of the past. We use $P(\cdot\,|\,\mathcal{F}_k)$ and $\mathbb{E}[\cdot\,|\,\mathcal{F}_k]$ to denote the conditional probability and expectation taken with respect to $\mathcal{F}_k$. Throughout the paper, for any two functions $f$ and $g$, we write $g(n) = O(f(n))$ if $\limsup_{n \to \infty} |g(n)/f(n)| < \infty$, and write $g(n) = \Theta(f(n))$ if $g(n) = O(f(n))$ and $f(n) = O(g(n))$. For a given set $A$, we use $I\{A\}$ to denote the indicator function of the set $A$.

To see the connection of the Monte-Carlo version of CE to stochastic gradient search, we make the following assumption on the parameter $\theta_{k+1}$ computed at Step 3 of Algorithm 2:

Assumption A1: The parameter $\theta_{k+1}$ computed at Step 3 of Algorithm 2 satisfies $\theta_{k+1} \in \operatorname{int}(\Theta)$ for all $k$.

Under the above assumption, we define the gradient term $L(\eta_k)$ as in recursion (7), the combined effect of bias and noise caused by Monte-Carlo random sampling $\beta_{k+1}$, and an additional error term $e_{k+1}$ due to sample average approximation, where with a slight abuse of notation, we have used $\gamma_{k+1}$ to represent the true $(1-\rho)$-quantile of $H(X)$ under $f_{\theta_k}$. Thus, under our notation, (8) can be written in the form of a generalized Robbins–Monro algorithm

$$\eta_{k+1} = \eta_k + \alpha_k \big( L(\eta_k) + \beta_{k+1} + e_{k+1} \big). \qquad (9)$$

Note that choosing $\varepsilon = 0$ in Algorithm 2 results in the basic SA recursion [i.e., with the original indicator function in (9)], whereas using a strictly positive $\varepsilon$ in the algorithm has the effect of injecting extra Monte Carlo noise into the basic SA updating step. The idea of the injection of extra noise into gradient recursions has proven effective in allowing algorithms to escape from local optima, and is a common way of converting algorithms of SA type into global optimizers (e.g., [29], [41]).

To analyze the asymptotic behavior of (9), we apply a standard argument of the ordinary differential equation (ODE) method, which was introduced in the pioneering work of [27] and [21]. Ever since its introduction, the ODE method has inspired many important works on studying stochastic approximation and related adaptive recursive algorithms; see, for example, [3]–[5], [21], [22], [38], [39], and the references therein. Within the context of Algorithm 2, the basic idea of the ODE approach to the analysis of (9) is to show that the


sequence of iterates $\{\eta_k\}$ generated by (9) asymptotically approaches the solution set of the following ODE:

$$\frac{d\eta(t)}{dt} = L(\eta(t)), \qquad t \ge 0 \qquad (10)$$

where $L(\eta) = \nabla_\theta \ln \mathbb{E}_\theta[\chi_\varepsilon(H(X), \gamma(\eta))\, S(H(X))]\big|_{\theta = m^{-1}(\eta)}$ and $\gamma(\eta)$ is the true $(1-\rho)$-quantile of $H(X)$ under $f_{m^{-1}(\eta)}$. In what follows, we shall assume that $L$ is continuous and (10) has a unique integral curve for a given initial condition, i.e., the ODE (10) is well-posed. Since the gradient has a closed form expression and is continuous, the continuity of $L$ can generally be verified by inspecting the continuity of the quantile $\gamma(\eta)$ as a function of the distribution parameter $\eta$. Let $\|\cdot\|$ be the usual Euclidean norm. The following definition, due to [3], provides a useful way to describe the recurrence behavior of the solutions to (10).

Definition 3.1: (Chain recurrence) Given an initial condition $\eta(0) = \eta_0$, let $\eta(t, \eta_0)$ be the solution to (10). A point $\eta$ is said to be chain recurrent if for any $\delta > 0$ and $T > 0$, there exist an integer $n$, points $\eta_0, \eta_1, \ldots, \eta_n$ with $\eta_0 = \eta_n = \eta$, and time instances $t_1, \ldots, t_n \ge T$ such that $\|\eta(t_i, \eta_{i-1}) - \eta_i\| \le \delta$ for $i = 1, \ldots, n$. A compact invariant set $\Lambda$ (i.e., for any $\eta_0 \in \Lambda$, the trajectory satisfies $\eta(t, \eta_0) \in \Lambda$ for all $t \ge 0$) is said to be internally chain recurrent if every point $\eta \in \Lambda$ is chain recurrent.

The following regularity conditions will be used throughout the analysis.

Assumptions:
A2. The gain sequence satisfies $\alpha_k > 0$, $\alpha_k \to 0$ as $k \to \infty$, and $\sum_{k=0}^{\infty} \alpha_k = \infty$. The sample size satisfies $N_k = \Theta(k^{\lambda})$ for some constant $\lambda > 0$.
A3. For a given $\rho$ and a distribution family $\{f_\theta,\ \theta \in \Theta\}$, the $(1-\rho)$-quantile of $H(X)$ is unique for each $\theta \in \Theta$.
A2 is the assumption on the input parameters. In particular, the assumption on the gain sequence is a typical SA condition. A3 will be used to show the convergence of the sequence of sample quantiles in Algorithm 2 to the true quantiles. The assumption is satisfied for many objective functions and distribution families encountered in practice.

The lemma below establishes that the bias term vanishes as $k \to \infty$. Its proof is based on the following proposition.

Proposition 3.2: Let $\gamma_{k+1}$ be the true $(1-\rho)$-quantile of $H(X)$ with respect to $f_{\theta_k}$ and $\tilde{\gamma}_{k+1}$ be the corresponding sample $(1-\rho)$-quantile. If A2 and A3 are satisfied, then $\tilde{\gamma}_{k+1} - \gamma_{k+1} \to 0$ as $k \to \infty$ w.p.1.

Proof: Similar to the proof of Lemma 7 in [16]; see also the proof of Lemma 4.1.

Lemma 3.1: If Assumptions A2 and A3 are satisfied, then $\beta_{k+1} \to 0$ as $k \to \infty$ w.p.1.

Proof: We prove Lemma 3.1 in Appendix A.

We have the following convergence theorem for Algorithm 2. The result follows from Lemma 3.1 and then by directly applying Theorem 1.2 in [3]; see, e.g., [5] and [22] for results similar to Theorem 1.2 in [3] under slightly different conditions.

Theorem 3.1: Assume that $L$ is continuous with a unique integral curve and A1–A3 hold. Then the sequence $\{\eta_k\}$ generated by (8) converges to a compact connected internally chain recurrent set of (10) w.p.1. Furthermore, if the internally chain recurrent sets of (10) are isolated equilibrium points, then w.p.1 $\{\eta_k\}$ converges to a unique equilibrium point.

Proof: To show the desired result, we establish that conditions A1–A3 in [3] hold. Relative to A1 in [3], the boundedness of $\{\eta_k\}$ follows from the fact that $\Gamma$ is continuous and $\mathcal{X}$ is compact. A2 in [3] is a direct consequence of Lemma 3.1. To establish A3 in [3], we let $M_k = \sum_{i=0}^{k} \alpha_i e_{i+1}$. By construction, it is easy to see that $\{M_k\}$ is a martingale, and its second moment can be shown to be uniformly bounded. Therefore, the $L^2$-bounded martingale convergence theorem (e.g., [36]) implies that $M_k$ converges w.p.1 to a finite random vector; this in turn shows condition A3 in [3], which completes the proof.

Note that since the mean vector function is invertible, the natural exponential distribution can be reparameterized in terms of $\eta = m(\theta)$. Therefore, Algorithm 2 can be viewed as a stochastic gradient descent procedure on the transformed parameter space for finding the optimal parameter of the distribution that assigns maximum probability to the set of optimal solutions to (1) [cf. (9)]. Theorem 3.1 shows that the sequence of iterates $\{\eta_k\}$ generated by the algorithm will asymptotically approach the limiting solution of the underlying ODE (10); see also the discussion at the beginning of Section IV. This result seems to be the strongest available in this generality, and is consistent with that of [16], which provides counterexamples indicating that both CE and its variants are local improvement methods. Theorem 3.1 implies that if all relevant regularity conditions are satisfied, then the local/global convergence of CE can essentially be determined by the properties of the solutions of the underlying ODE. We illustrate this issue by an example.

Example 3.1: Consider maximizing the function

(11)


by sampling from the parameterized pmf

, where,

, and . It is easy to verify that the mapping is given by and . If we take , , and , then since for all , we have . It follows that and . Consequently, . By setting , we have that is an equilibrium point of the ODE . Let be a candidate Lyapunov function. The derivative of along the trajectories of the ODE is given by , which is negative definite for all . Therefore, is globally asymptotically stable, in which case Theorem 3.1 implies that the sequence of sampling distributions obtained in Algorithm 2 will converge to a degenerate distribution that assigns unit mass to the optimal solution .

On the other hand, suppose we use the pmf for some constant . It is straightforward to see that and . If we take , then since and , we have from the definition of quantiles that . Thus, . By setting , it can be seen that the isolated equilibrium points , , and are the only chain recurrent points of the ODE . Therefore, Theorem 3.1 implies that the sequence generated by Algorithm 2 will converge to one of them.

In a third case, suppose we use the pmf for some constant . We have and . If we take , then since , it follows that and . Thus, we have and . It follows that the sequence will converge to the set of chain recurrent points of (10), which is the set of all satisfying and .

IV. ASYMPTOTIC NORMALITY

In this section, we analyze the convergence rate of Algorithm 2, assuming that the convergence of $\{\eta_k\}$ occurs to a unique limit point. In particular, we consider the special case when the ODE (10) has a unique globally asymptotically stable equilibrium $\eta^*$. Then under our conditions, the sequence $\{\eta_k\}$ generated by Algorithm 2 converges w.p.1 to $\eta^*$ as $k \to \infty$ by Theorem 3.1. Alternatively, one can invoke the Kushner–Clark condition [21] by assuming that $\eta^*$ is locally asymptotically stable and $\{\eta_k\}$ enters some compact set in the domain of attraction of $\eta^*$ infinitely often. We also assume that $\eta^* \in m(\operatorname{int}(\Theta))$. Since $m$ is continuously differentiable on $\operatorname{int}(\Theta)$, the inverse function theorem implies that $m^{-1}$ is continuously differentiable on some open neighborhood of $\eta^*$. This further implies that the sequence of sampling distributions $\{f_{\theta_k}\}$ generated by Algorithm 2 converges point-wise to a limiting distribution w.p.1.

Let $J_L(\eta)$ be the Jacobian matrix of $L$ defined in (10). Note that since $L$ is the gradient of some underlying function, $J_L$ turns out to be the Hessian of that function. Thus, recursion (9) can essentially be viewed as a gradient-based algorithm for solving the corresponding maximization problem. Consequently, by the probability one convergence of the sequence $\{\eta_k\}$ to $\eta^*$, it is reasonable to expect that $\eta^*$ is a global (local when $\eta^*$ is locally stable) maximizer of the objective function in its neighborhood (see [1] for an in-depth discussion of the connections between local optima of a cost function and the stable equilibria of the gradient descent flow). The following assumption about $J_L$ is natural.

Assumption B1: The Jacobian matrix $J_L(\eta)$ is continuous and symmetric negative definite in a small neighborhood of $\eta^*$.

In addition, we need the following regularity condition on the distribution function of the objective function. Let $p_{\theta_k}$ be the probability density/mass function of $H(X)$ when $X$ is distributed according to $f_{\theta_k}$.

Assumption B2: (i) (Continuous optimization) For a given $\rho$, there exist constants $\delta > 0$ and $c > 0$ such that $p_{\theta_k}$ is bounded below by $c$ on a $\delta$-neighborhood of the true $(1-\rho)$-quantile $\gamma_{k+1}$, almost surely for $k$ sufficiently large. (ii) (Discrete finite optimization) For a given $\rho$, there exists a constant $\delta > 0$ such that the mass $p_{\theta_k}(\gamma_{k+1})$ is bounded below by $\delta$, almost surely for $k$ sufficiently large.

Case (i) of B2 essentially requires that the density function of $H(X)$ evaluated at the true $(1-\rho)$-quantile is bounded away from zero. Note that by the point-wise convergence of $\{f_{\theta_k}\}$, the sequence of density functions $\{p_{\theta_k}\}$ will converge to a limiting density $p_{\theta^*}$. Thus, case (i) is generally satisfied if $p_{\theta^*}$ is strictly positive at $\gamma^*$, where $\gamma^*$ is the true $(1-\rho)$-quantile of $H(X)$ under the limiting distribution. In many practical situations, it is frequently observed that for discrete optimization problems, the sequence of probability mass functions will converge to a mass function with most of its mass concentrated on some $x^*$, under which the $(1-\rho)$-quantile of $H(X)$ is given by $H(x^*)$, and condition (ii) is satisfied.

We consider a standard gain sequence of the form $\alpha_k = a/k^{\alpha}$ for constants $a > 0$ and $\alpha \in (1/2, 1]$, and let $N_k = \Theta(k^{\lambda})$ for $\lambda > 0$. Note that both $\{\alpha_k\}$ and $\{N_k\}$ satisfy A2. Define the difference $\delta_k := \eta_k - \eta^*$. It follows from (9) that $\delta_k$ satisfies the recursion


By using a Taylor expansion of $L$ in a small neighborhood of $\eta^*$ and the fact that $L(\eta^*) = 0$, the recursion for $\delta_k$ can be rewritten with the Jacobian evaluated at a point on the line segment between $\eta_k$ and $\eta^*$, and then recast into the form of a recursion in Fabian [10].

The convergence rate analysis of Algorithm 2 is investigated along the lines of [10], which provides general sufficient conditions to establish asymptotic normality results for SA. More specifically, we show in Proposition 4.1 and Lemma 4.2 below that conditions (2.2.1), (2.2.2), and (2.2.3) in [10] hold.

The following is a strengthened version of Proposition 3.2, which indicates that the sample quantile converges to the true quantile at least at a polynomial rate.

Lemma 4.1: Let . For any , if A3 and B2 hold, and , then w.p.1.

Proof: The proof is given in Appendix B.

Lemma 4.1 gives rise to a strengthened version of Lemma 3.1, which shows that the amplified bias term vanishes to zero asymptotically.

Proposition 4.1: For any constant , let . If A3 and B2 hold, then as w.p.1.

Proof: See Appendix D for a proof.

Moreover, the amplified noise has the following properties.

Lemma 4.2: Let , , and for constants , , and . If A1 and A3 hold, and in addition, for , then and there exists a matrix such that as w.p.1. Moreover, the sequence is uniformly square integrable in the sense that .

Proof: See Appendix E.

The asymptotic normality of Algorithm 2 follows directly from Lemma 4.2 and Proposition 4.1 above, and then by applying Theorem 2.2 in [10].

Theorem 4.1: Let , , and , . If A1, A3, B1, and B2 hold, , and , then

where is an orthogonal matrix such that with being a diagonal matrix, and the th entry of is given by

if ,
if .

We remark that in contrast to the optimal convergence rate of $O(k^{-1/2})$ for general stochastic approximation algorithms, the asymptotic rate for Algorithm 2 is at least . In particular, in the case , Theorem 4.1 implies convergence in probability at a faster rate. This indicates that a better asymptotic rate can be achieved by choosing a sample size that increases sufficiently fast. However, we note that increasing sample sizes too fast may have a negative impact on the algorithm's practical performance, since the normality result is stated in terms of the number of iterations $k$, not the sample size.

V. NUMERICAL EXAMPLES

In all previous numerical studies, the CE algorithm was implemented using a smoothed parameter updating procedure (e.g., [16], [34]):

$$\theta_{k+1} = \upsilon\, \hat{\theta}_{k+1} + (1 - \upsilon)\, \theta_k \qquad (12)$$

where $\upsilon \in (0, 1]$ is a smoothing parameter, and $\hat{\theta}_{k+1}$ is the new parameter calculated at Step 3 of Algorithm 2 with $\alpha_k = 1$ in (8) for all $k$. Such a smoothed parameter updating procedure is primarily used to prevent premature convergence of the algorithm and often works well in practice. However, there is little theoretical support for its usage. Note that our proposed approach differs from this procedure in that the reference distributions rather than the output parameters are smoothed [cf. (3)]. When $f_\theta$ belongs to NEFs, the new parameter $\theta_{k+1}$ at Step 3 of Algorithm 2 can still be obtained in analytical form. For example, in continuous optimization, if multivariate normal distributions are used in CE, then the explicit parameter updating equations are given by

$$\mu_{k+1} = (1 - \alpha_k)\, \mu_k + \alpha_k\, \tilde{\mu}_{k+1}$$

and

$$\Sigma_{k+1} = (1 - \alpha_k)\, \Sigma_k + \alpha_k\, \tilde{\Sigma}_{k+1} + \alpha_k (1 - \alpha_k)(\tilde{\mu}_{k+1} - \mu_k)(\tilde{\mu}_{k+1} - \mu_k)^T$$

where $\tilde{\mu}_{k+1}$ and $\tilde{\Sigma}_{k+1}$ denote the weighted sample mean and covariance of the elite solutions. This is clearly different from the smoothed parameter updating procedure, in that there is an extra term $\alpha_k (1 - \alpha_k)(\tilde{\mu}_{k+1} - \mu_k)(\tilde{\mu}_{k+1} - \mu_k)^T$ involved in updating the covariance matrix.
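The difference between the two procedures is easiest to see side by side. A sketch under the multivariate normal parameterization (here w denotes the normalized elite weights for the current sample; the function names are ours):

```python
import numpy as np

def smoothed_parameter_update(mu, Sigma, X, w, v):
    """Classical CE smoothing (12): convex combination of old and new parameters."""
    mu_hat = w @ X
    d = X - mu_hat
    Sigma_hat = (w[:, None] * d).T @ d
    return (1 - v) * mu + v * mu_hat, (1 - v) * Sigma + v * Sigma_hat

def smoothed_reference_update(mu, Sigma, X, w, alpha):
    """Smoothing of the reference distribution (3): update the NEF mean vector
    eta = (E[X], E[X X^T]) and convert back; the rank-one cross term appears."""
    mu_hat = w @ X
    d = X - mu_hat
    Sigma_hat = (w[:, None] * d).T @ d
    mu_new = (1 - alpha) * mu + alpha * mu_hat
    Sigma_new = ((1 - alpha) * Sigma + alpha * Sigma_hat
                 + alpha * (1 - alpha) * np.outer(mu_hat - mu, mu_hat - mu))
    return mu_new, Sigma_new
```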

To illustrate the potential practical influence of such a parameter updating procedure, which was derived using the gradient interpretation of CE, we consider computational experiments on a set of 12 benchmark problems taken from [16], [23], and [34].

1) Shekel function ( , , ): , with , , and , where represents the th row of . The function has a global maximum and .
2) Rosenbrock function ( , ): , where .
3) Zakharov function ( , ): , where .
4) Rastrigin function ( , ): , where .
5) Ackley function ( , ): , where .
6) Levy function ( , ): , where , and .
7) Trigonometric function ( , ): , where .
8) Griewank function ( , ): , where .
9) Brown function ( , ): , where .
10) Powell function ( , ): , where .
11) Cragg and Levy function ( , , ): , where .
12) Pintér function ( , ): , where , , and .
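For reference, the standard textbook forms of two of these benchmarks are sketched below (written as maximization by negating; the paper's exact shifts and scalings may differ from these definitions):

```python
import numpy as np

def rosenbrock(x):
    """Negated Rosenbrock; global maximum 0 at x = (1, ..., 1)."""
    x = np.asarray(x, dtype=float)
    return -np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (x[:-1] - 1.0) ** 2)

def rastrigin(x):
    """Negated Rastrigin; highly multimodal, global maximum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return -(10.0 * x.size + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x)))
```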

In Algorithm 2, we consider two choices of the parameterized distribution families: multivariate normal distributions and independent univariate normal distributions. We have experimented with different sets of parameters in Algorithm 2. We found empirically that the performance of the algorithm is primarily determined by the choice of the gain sequence $\{\alpha_k\}$, but is insensitive to the choices of the initial solutions (e.g., mean vector and covariance matrix of the initial normal density), provided that the initial sampling variance is large enough. In both cases, the initial means are uniformly selected from the solution space and the covariance matrices are initialized as $n$-by-$n$ diagonal matrices with diagonal entries equal to 1000. Intuitively, the gain sequence $\{\alpha_k\}$ serves as a tradeoff between exploration and exploitation. A slow decay rate in the gain sequence helps to keep the search of the algorithm at a more scattered level, whereas a fast decay rate will lead to a more rapid convergence of the sampling distributions, so the search will become more concentrated on specific regions of the solution space. Thus, in cases where explorative search is considered more desirable (e.g., for highly multimodal problems), a slow decay rate of $\alpha_k$ is often preferred. For all test problems, we have used a relatively conservative gain sequence, where the constant 100 is used to keep the initial step-sizes small in the hope of preventing unstable behavior in early iterations of the algorithm; on the other hand, the slow decay rate 0.501 helps to prevent premature convergence and maintain a non-negligible step size (to prevent sluggish performance) in the later iterations of the algorithm (see, e.g., [39] for a detailed discussion of gain sequences of such a form). Our preliminary experimental results also indicate some robustness with respect to the choice of $\varepsilon$, in that values between 0.01 and 0.2 all seem to work well. Throughout our experiment, we take fixed values of $\rho$ and $\varepsilon$, and a sample size in which the constant 400 ensures that the number of top $\rho$-percent elite solutions used to update the distribution parameters is at least on the order of tens (see, e.g., [34]). Note that the combination of $\{\alpha_k\}$, $\{N_k\}$, and $\rho$ satisfies the relevant conditions in Theorem 3.1 for convergence.
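Concretely, a gain sequence of the described form can be coded as below; the text fixes only the offset 100 and the decay rate 0.501, so the numerator constant c is an assumed placeholder:

```python
def alpha(k, c=1.0):
    """Conservative polynomially decaying gain; c is a placeholder constant."""
    return c / (k + 100) ** 0.501
```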

The performance of Algorithm 2 was also compared with that of the original CE with the smoothed parameter updating procedure (12), where the smoothing parameter is held fixed for all $k$. All other parameters in CE are taken to be the same as in Algorithm 2. Our performance comparison is made based on the same amount of computational effort, where for both Algorithm 2 and CE the same total number of function evaluations is allocated to each test problem.

For each test function, we performed 100 independent replication runs of all algorithms. Numerical results are reported in Table I, where the first entry is the averaged value of the function evaluated at the best solution visited by an algorithm (with standard error given in parentheses), and the second indicates the number of replication runs in which an $\epsilon$-optimal solution (i.e., a solution whose function value is within $\epsilon$ of the optimum) was found out of 100 runs. For a subset of the test functions, we also plotted in Fig. 1 the averaged objective function values at the estimated optimal solutions as a function of the number of function evaluations used. The results indicate promising performance of the proposed smoothed reference distribution updating procedure. In particular, in the multivariate normal case, the procedure finds more than 90% of the $\epsilon$-optimal solutions on three of the functions, and finds $\epsilon$-optimal solutions in all 100 runs in the remaining nine test cases. In contrast, the smoothed parameter updating procedure does not seem to work well with multivariate normal distributions in most of the test cases, whereas for univariate normal distributions, the procedure works well on multimodal functions, but not on badly-scaled functions. We see that for some functions, even with such a smoothed parameter updating procedure, which was primarily designed to prevent premature convergence, the algorithm may still frequently stagnate at solutions that are far from optimal.

VI. CONCLUSION AND FUTURE WORK

We have proposed a novel framework to study a class of randomized optimization algorithms by exploiting their connections to the stochastic approximation procedure. In particular, we have used the CE method as an exemplary algorithm instance of the framework, and analyzed its convergence and convergence rate properties in both continuous and combinatorial domains. Our analysis also provided some guidance on the implementation issues of the algorithm. A preliminary numerical study indicates that the proposed parameter updating procedure based on the stochastic gradient interpretation of CE may lead to improved performance over the commonly used smoothed parameter updating procedure.


TABLE I
PERFORMANCE OF ALGORITHM 2 VERSUS CE WITH SMOOTHED PARAMETER UPDATING ON BENCHMARK PROBLEMS, BASED ON 100 INDEPENDENT REPLICATIONS (STANDARD ERRORS IN PARENTHESES)

The proposed framework has also generated many interesting future research topics. For example, the ODE approach adopted in our analysis will generally lead to local convergence results; however, this does not explain the superior empirical performance of CE on hard optimization problems with many local optima. A possible research direction is to investigate whether the effect of the injected Monte-Carlo noise term in CE may allow the algorithm to escape local optimal solutions, so that the algorithm may also be a global optimizer.

The convergence of Algorithm 2 requires increasing the sample size to obtain an asymptotically unbiased estimator for the true gradient because of the ratio bias, but increasing sample sizes may have a negative impact on the algorithm's practical performance, as the normality result shows an asymptotic convergence rate stated in terms of the number of iterations, independent of the sample size. This motivates the search for gradient estimation schemes or new algorithm instances that maintain a constant sample size at each iteration. As an example, note that finding the zeros of the gradient in (5) is equivalent to setting only the numerator expectation equal to zero, i.e., $\mathbb{E}_\theta[S(H(X))(\Gamma(X) - m(\theta))] = 0$. Thus, we only need to find an estimator for $\mathbb{E}_\theta[S(H(X))(\Gamma(X) - m(\theta))]$, for which a simple sample average estimator given by $\frac{1}{N}\sum_{i=1}^{N} S(H(X_i))(\Gamma(X_i) - m(\theta))$ is unbiased for all finite $N$. This gives rise to a new optimization algorithm, which can be interpreted as a gradient-based method for solving $\max_{\theta} \mathbb{E}_\theta[S(H(X))]$ and allows for an unbiased gradient estimator at the expense of inducing a larger variance.
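A sketch of this constant-sample-size estimator, with names following the NEF notation above (S, Gamma, and m are the objects defined in Section II; the function itself is our own illustration):

```python
import numpy as np

def unbiased_gradient_estimate(X, h, Gamma, m_theta, S=lambda y: y):
    """Sample average of S(H(X_i)) * (Gamma(X_i) - m(theta)): unbiased for the
    gradient of E_theta[S(H(X))] for any fixed sample size (no ratio bias).

    X: (N, n) sampled solutions; h: (N,) objective values H(X_i);
    Gamma: sufficient statistic map; m_theta: current mean vector m(theta)."""
    G = np.apply_along_axis(Gamma, 1, X)            # (N, d) sufficient statistics
    return np.mean(S(h)[:, None] * (G - m_theta), axis=0)
```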

APPENDIX A
PROOF OF LEMMA 3.1

Since , there exists a constant such that for all . We have

(13)

where the second inequality follows from the definition of the threshold function , and the third equality follows from the definition of quantiles. Thus, it is sufficient to show that

as w.p.1. For notational convenience, define and . Note that

Since the mapping is continuous and is compact, it is clear that is bounded for all . Thus, by Proposition 3.2 and the continuity of , the first term above vanishes to zero w.p.1.

To show that the second term also converges to zero, note that conditional on , the samples are i.i.d. and there exist constants and such that , where is the th component of the vector . For any , let be the event that . We have from the Hoeffding inequality [13] that

for some constant . Next, by unconditioning, we get

Moreover, we have from A2 that . Finally, the Borel–Cantelli lemma implies that . Since this holds for arbitrary , we have

w.p.1. A similar argument can be used to show that w.p.1. Therefore, by (13) we have as w.p.1.


Fig. 1. Average performance of Algorithm 2 vs. CE with smoothed parameter updating procedure on test functions.

APPENDIX B
PROOF OF LEMMA 4.1

Our proof is based on the proof of Lemma 2.4 in [15]. Let and be the lower and upper bounds for . For given and , it can be shown that the $(1-\rho)$-quantile can be obtained as the optimal solution of the following problem (e.g., [14])

(14)

where , , and

$$\phi(y, \gamma) = \begin{cases} (1 - \rho)(y - \gamma) & \text{if } y \ge \gamma \\ \rho(\gamma - y) & \text{if } y < \gamma. \end{cases}$$

Similarly, the sample $(1-\rho)$-quantile can be expressed as the solution to the sample average approximation of (14)

(15)


where and are i.i.d. with distribution function .
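A quick numerical check of this quantile characterization (assuming the standard pinball-loss form given above; the paper's exact scaling may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=10_000)       # stand-in for objective values H(X_i)
rho = 0.1

def sample_objective(gamma):      # sample-average counterpart, as in (15)
    z = h - gamma
    return np.mean(np.where(z >= 0.0, (1.0 - rho) * z, -rho * z))

grid = np.linspace(-4.0, 4.0, 4001)
gamma_hat = grid[np.argmin([sample_objective(g) for g in grid])]
print(gamma_hat, np.quantile(h, 1.0 - rho))   # agree up to grid/sampling error
```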

For a given and a constant , define . Let be a collection of open balls centered at with radius . Since is compact, we can find a collection of finite points such that . Moreover, for an arbitrary , there exists such that . Thus, by the constructions of and , we have

It follows that if , then for all

This implies that , . Next, by using Bonferroni's inequality, we have

(16)

where . Thus, by noting that and applying Hoeffding's inequality [13] to the right-hand side of (16), we get . Next, by unconditioning on , we have

(17)

To complete the proof of Lemma 4.1, we need the following intermediate result, which states that if the two functions and are sufficiently close, then their optimal solutions will also be close.

Proposition B.1: Assume that A3 and B2 hold. There exists a constant such that

(18)

almost surely for sufficiently large.

Let be the event that (18) holds at the th iteration of Algorithm 2. We have for a sufficiently small

where and .

It follows that for a given

Since , it is easy to verify that . Since , the Borel–Cantelli lemma implies . Hence we have as w.p.1.

APPENDIX C

PROOF OF PROPOSITION B.1

Define the difference , and let for notational convenience. Note that the function is convex, and for a given , its subdifferential is given by

(e.g., [14]). Before we proceed any further, we need to distinguish between the continuous and the discrete finite optimization cases.

Case 1 (Continuous Optimization): It is easy to see that is twice differentiable. Let and be constants as defined in B2. Since w.p.1 as by Proposition 3.2, a Taylor expansion of in a small neighborhood of implies that

where lies on the line segment between and , and we have used the fact that since is the optimal solution to the convex optimization problem (14). It is straightforward to verify that . Thus, for almost every sample path generated by Algorithm 2, we have from B2 that for sufficiently large


where the third inequality follows from the inequality for any two real-valued functions and . Consequently, it is clear that there exists a constant such that almost surely for all sufficiently large.

Case 2 (Discrete Finite Optimization): Since the solution space is finite, the function is convex and piece-wise linear, and its subdifferential at can be written as

We have from part (ii) of B2 that for almost every sample path generated by Algorithm 2, and . For a sufficiently small , by the definition of subderivatives, we have for any . It follows that

Hence, the desired result holds when is sufficiently small.

APPENDIX D
PROOF OF PROPOSITION 4.1

Again, we define and . By (13), it is sufficient to show that as w.p.1. Note that

Thus, by Lemma 4.1 and the continuity of , the first term above converges to zero as w.p.1. By using the same argument as in the proof of Lemma 3.1, it is easy to show that the second term also vanishes to zero as w.p.1.

APPENDIX E
PROOF OF LEMMA 4.2

Recall that . Note that conditional on , the solutions in are i.i.d. Thus, it follows trivially that . To show the second claim, let . We have

Since under A1, A3, and the choices of and , the sequence of sampling distributions converges point-wise to w.p.1, the dominated convergence theorem implies that the sequence converges w.p.1 to a limiting matrix given by

if ,
if .

By Hölder's inequality, for any with , we have

(19)

for some constant . By Chebyshev's inequality,

(20)

By taking , we have

(21)

A straightforward calculation shows that the right-hand side of (21) is on the order of . Thus, combining (20) and (21), the right-hand side of (19) is on the order of


, which vanishes to zero as by taking . This completes the proof of the lemma.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and three anonymous reviewers for their comments and suggestions, which have led to a substantially improved paper.

REFERENCES

[1] P. A. Absil and K. Kurdyka, "On the stable equilibrium points of gradient systems," Syst. Control Lett., vol. 55, pp. 573–577, 2006.
[2] G. Allon, D. P. Kroese, T. Raviv, and R. Y. Rubinstein, "Application of the Cross-Entropy method to the buffer allocation problem in a simulation-based environment," Ann. Oper. Res., vol. 134, pp. 137–151, 2005.
[3] M. Benaim, "A dynamical system approach to stochastic approximations," SIAM J. Control Optim., vol. 34, pp. 437–472, 1996.
[4] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximation. New York: Springer-Verlag, 1990.
[5] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. New Delhi, India: Cambridge Univ. Press; Hindustan Book Agency, 2008.
[6] A. Costa, O. D. Jones, and D. Kroese, "Convergence properties of the Cross-Entropy method for discrete optimization," Oper. Res. Lett., vol. 35, pp. 573–580, 2007.
[7] F. Dambreville, "Cross-Entropic learning of a machine for the decision in a partially observable universe," J. Global Optim., vol. 37, pp. 541–555, 2007.
[8] M. Dorigo and C. Blum, "Ant colony optimization theory: A survey," Theoret. Comput. Sci., vol. 344, pp. 243–278, 2005.
[9] M. Dorigo and L. M. Gambardella, "Ant colony system: A cooperative learning approach to the traveling salesman problem," IEEE Trans. Evol. Comput., vol. 1, no. 1, pp. 53–66, Apr. 1997.
[10] V. Fabian, "On asymptotic normality in stochastic approximation," Ann. Math. Statist., vol. 39, pp. 1327–1332, 1968.
[11] F. W. Glover, "Tabu search: A tutorial," Interfaces, vol. 20, pp. 74–94, 1990.
[12] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning. Boston, MA: Kluwer, 1989.
[13] W. Hoeffding, "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., vol. 58, pp. 13–30, 1963.
[14] T. Homem-de-Mello, "A study on the Cross-Entropy method for rare event probability estimation," INFORMS J. Comput., vol. 19, pp. 381–394, 2007.
[15] T. Homem-de-Mello, "On rates of convergence for stochastic optimization problems under non-independent and identically distributed sampling," SIAM J. Optim., vol. 19, pp. 524–551, 2008.
[16] J. Hu, M. C. Fu, and S. I. Marcus, "A model reference adaptive search algorithm for global optimization," Oper. Res., vol. 55, pp. 549–568, 2007.
[17] J. Hu, M. C. Fu, and S. I. Marcus, "A model reference adaptive search algorithm for stochastic optimization with applications to Markov decision processes," in Proc. 46th IEEE Conf. Decision Control, New Orleans, LA, 2007, pp. 975–980.
[18] A. W. Johnson and S. H. Jacobson, "A class of convergent generalized hill climbing algorithms," Appl. Math. Comput., vol. 125, pp. 359–373, 2002.
[19] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, pp. 671–680, 1983.
[20] D. P. Kroese, R. Y. Rubinstein, and T. Taimre, "Application of the Cross-Entropy method to clustering and vector quantization," J. Global Optim., vol. 37, pp. 137–157, 2007.
[21] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag, 1978.
[22] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications. New York: Springer-Verlag, 1997.
[23] M. Laguna and R. Marti, "Experimental testing of advanced scatter search designs for global optimization of multimodal functions," J. Global Optim., vol. 33, pp. 235–255, 2005.
[24] P. Larrañaga and J. A. Lozano, Eds., Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Boston, MA: Kluwer, 2002.
[25] F. Liang, "Annealing stochastic approximation Monte Carlo for neural network training," Mach. Learn., vol. 68, pp. 201–233, 2007.
[26] F. Liang, C. Liu, and R. J. Carroll, "Stochastic approximation in Monte Carlo computation," J. Amer. Statist. Assoc., vol. 102, pp. 305–320, 2007.
[27] L. Ljung, "Analysis of recursive stochastic algorithms," IEEE Trans. Autom. Control, vol. AC-22, no. 4, pp. 551–575, Aug. 1977.
[28] S. Mannor, R. Y. Rubinstein, and Y. Gat, "The Cross-Entropy method for fast policy search," in Proc. 20th Int. Conf. Mach. Learn., Washington, DC, 2003, pp. 512–519.
[29] J. L. Maryak and D. C. Chin, "Global random optimization by simultaneous perturbation stochastic approximation," in Proc. American Control Conf., Arlington, VA, 2001, pp. 756–762.
[30] C. N. Morris, "Natural exponential families with quadratic variance functions," Ann. Statist., vol. 10, pp. 65–80, 1982.
[31] H. Mühlenbein and G. Paaß, "From recombination of genes to the estimation of distributions: I. Binary parameters," in Parallel Problem Solving From Nature—PPSN IV, H.-M. Voigt, W. Ebeling, I. Rechenberg, and H.-P. Schwefel, Eds. New York: Springer, 1996, pp. 178–187.
[32] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, pp. 400–407, 1951.
[33] R. Y. Rubinstein, "Optimization of computer simulation models with rare events," Eur. J. Oper. Res., vol. 99, pp. 89–112, 1997.
[34] R. Y. Rubinstein and D. P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. New York: Springer, 2004.
[35] L. Shi and S. Ólafsson, "Nested partitions method for global optimization," Oper. Res., vol. 48, pp. 390–407, 2000.
[36] A. Shiryaev, Probability Theory. New York: Springer-Verlag, 1996.
[37] D. Sigalov and N. Shimkin, "Cross-Entropy based data association for multi-target tracking," presented at the 3rd Int. Conf. Perform. Eval. Methodol. Tools, 2008, Article No. 30.
[38] J. C. Spall, "Multivariate stochastic approximation using simultaneous perturbation gradient approximation," IEEE Trans. Autom. Control, vol. 37, no. 3, pp. 332–341, Mar. 1992.
[39] J. C. Spall, Introduction to Stochastic Search and Optimization. New York: Wiley, 2003.
[40] T. Stützle and M. Dorigo, "A short convergence proof for a class of ant colony optimization algorithms," IEEE Trans. Evol. Comput., vol. 6, no. 4, pp. 358–365, Aug. 2002.
[41] G. G. Yin, "Rates of convergence for a class of global stochastic optimization algorithms," SIAM J. Optim., vol. 10, pp. 99–120, 1999.
[42] D. H. Wolpert, "Finding bounded rational equilibria part I: Iterative focusing," in Proc. 11th Int. Symp. Dynamic Games Applicat., T. Vincent, Ed., Tucson, AZ, Dec. 18–21, 2004.
[43] Z. B. Zabinsky, Stochastic Adaptive Search for Global Optimization. Norwell, MA: Kluwer, 2003.
[44] H. Zhang and M. C. Fu, "Applying model reference adaptive search to American-style option pricing," in Proc. 38th Winter Simulation Conf., Monterey, CA, 2006, pp. 711–718.
[45] M. Zlochin, M. Birattari, N. Meuleau, and M. Dorigo, "Model-based search for combinatorial optimization: A critical survey," Ann. Oper. Res., vol. 131, pp. 373–395, 2004.

Jiaqiao Hu (M'11) received the B.S. degree in automation from Shanghai Jiao Tong University, Shanghai, China, in 1997, the M.S. degree in applied mathematics from the University of Maryland, Baltimore County, in 2001, and the Ph.D. degree in electrical engineering from the University of Maryland, College Park, in 2006.

Since 2006, he has been with the Department of Applied Mathematics and Statistics, State University of New York, Stony Brook, where he is currently an Assistant Professor. His research interests include Markov decision processes, simulation-based optimization, global optimization, applied probability, and stochastic modeling and analysis. He is the coauthor of the book Simulation-Based Algorithms for Markov Decision Processes (Springer, 2007).

Dr. Hu is a member of INFORMS.


Ping Hu received the B.S. degree in mathematics from Peking University, Beijing, China, in 2006. He is currently pursuing the Ph.D. degree in operations research in the Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY.

His research interest focuses on Monte-Carlo simulation and optimization algorithms.

Hyeong Soo Chang (SM'07) received the B.S. and M.S. degrees in electrical engineering and the Ph.D. degree in electrical and computer engineering, all from Purdue University, West Lafayette, IN, in 1994, 1996, and 2001, respectively.

Since 2003, he has been with the Department of Computer Science and Engineering, Sogang University, Seoul, Korea, where he is now an Associate Professor. He is coauthor of the book Simulation-Based Algorithms for Markov Decision Processes (Springer, 2007). His main research interests include Markov decision processes, Markov games, computational learning theory, computational intelligence, and stochastic optimization.

Prof. Chang currently serves as an Associate Editor for the IEEE TRANSACTIONS ON AUTOMATIC CONTROL.