
Optimization Methods & Software, 2013, Vol. 28, No. 4, 743–755
http://dx.doi.org/10.1080/10556788.2011.645542

New combinatorial direction stochastic approximation algorithms

Zi Xu*

Department of Mathematics, College of Sciences, Shanghai University, Shanghai 200444, People's Republic of China

(Received 10 March 2011; final version received 28 November 2011)

*Email: [email protected]

© 2013 Taylor & Francis

The stochastic approximation problem is to find some roots or minimizers of a nonlinear function whose expression is unknown and whose evaluations are contaminated with noise. In order to accelerate the classical RM algorithm, this paper proposes a new three-term combinatorial direction stochastic approximation algorithm and its general framework, which employ a weighted combination of the current noisy gradient and several previous noisy gradients as the iterative direction. Both the almost sure convergence and the asymptotic rate of convergence of the new algorithms are established. Numerical experiments show that the new algorithm outperforms the RM algorithm and another existing combined direction algorithm.

Keywords: stochastic approximation; conjugate gradient; Robbins–Monro algorithm; combinatorial direction

AMS Subject Classification: 90C15; 62L20

1. Introduction

Consider a loss function f : R^n → R, with gradient g : R^n → R^n. The general optimization problem can be modelled as follows:

    \min_{x \in \mathbb{R}^n} f(x),    (1)

where x represents the input variables and f(x) is the objective function. However, the assumption in the stochastic approximation setting is that the expression of f or g is unknown but can be evaluated inaccurately through simulation, and only f̃(x), a noisy estimate of f(x), or g̃(x), a noisy estimate of g(x), can be obtained. This random phenomenon makes stochastic approximation different from deterministic optimization. The most common form for f is an expectation, i.e.

    \min_{x \in \mathbb{R}^n} E[J(x, \omega)],    (2)

where E[J(x, ω)] is the expected response and J(x, ω) is a performance measure evaluated at x, a vector of decision variables, and ω, a vector of uncontrollable random conditions.
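As a concrete illustration (an example constructed here, not one taken from the paper), consider the performance measure J(x, ω) = ½‖x − ω‖² with ω ∼ N(μ, σ²I). Then

    f(x) = E[J(x, \omega)] = \tfrac{1}{2}\|x - \mu\|^2 + \tfrac{n\sigma^2}{2}, \qquad g(x) = \nabla f(x) = x - \mu,

and a single draw of ω gives the noisy gradient

    \nabla_x J(x, \omega) = x - \omega = g(x) + \xi, \qquad \xi = \mu - \omega, \quad E[\xi] = 0,

which is exactly of the form assumed in (3) below.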



The expected performance of a stochastic system is usually too complex for theoretical analysis, since the evaluations of the response function are often obtained through simulation at each parameter setting and thereby include some error.

The goal of a stochastic approximation (SA) algorithm is to generate a sequence of parameter values that converges almost surely to the solution as the number of iterations goes to infinity. There has been growing interest in stochastic approximation algorithms, motivated, for example, by problems in the adaptive control and statistical identification of complex systems, the optimization of processes by large Monte Carlo simulations, the training of recurrent neural networks, the recovery of images from noisy sensor data, and the design of complex queuing and discrete-event systems (see [19] for more details). Therefore, progress in general SA methodology can have a potential bearing on a wide range of practical implementations. This article focuses on the case where only a noisy observation g̃_k of the true gradient g_k = g(x_k) is available at any point x_k ∈ R^n, where

    \tilde{g}_k = g_k + \xi_k    (3)

and ξ_k denotes the noise. The Robbins–Monro (RM) algorithm [15] (extended by Blum [2] to the multidimensional case) estimates a root of g with the following recursion:

    x_{k+1} = x_k - \alpha_k \tilde{g}_k.    (4)

The almost sure convergence of this algorithm can be established if the step size α_k satisfies the following condition:

    \alpha_k > 0, \quad \sum_{k \ge 1} \alpha_k = \infty, \quad \sum_{k \ge 1} \alpha_k^2 < \infty,    (5)

as has been investigated by many researchers (see, for example, [13,20]). Besides, different forms, including the finite difference stochastic approximation (FDSA) and simultaneous perturbation stochastic approximation (SPSA) algorithms, have been developed (see [10,17] for more details). Though the RM algorithm is a classical algorithm, the major disadvantage of the RM algorithm and its difference forms, including the FDSA and SPSA algorithms, is their slow speed of convergence. There have been many efforts to accelerate the RM and SPSA algorithms, e.g. [11,18]. However, most of them concern the choice of the step size α_k, such as the Kesten algorithm (see [5,9]). Spall [19] has given an extensive review of the choice of α_k in stochastic approximation algorithms when the iterative direction is −g̃_k, the same as in the RM algorithm.
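To make the recursion concrete, the following minimal sketch (not the paper's code; the quadratic test problem, the noise level and the step-size constants are illustrative assumptions, with α_k = a/(k + 1 + A) as later used in (54)) runs the RM iteration (4) on noisy gradients of the form (3).

    import numpy as np

    def rm(noisy_grad, x1, a=1.0, A=100.0, n_iter=5000, seed=0):
        # Robbins-Monro recursion (4): x_{k+1} = x_k - alpha_k * noisy gradient,
        # with alpha_k = a / (k + 1 + A), which satisfies condition (5).
        rng = np.random.default_rng(seed)
        x = np.array(x1, dtype=float)
        for k in range(1, n_iter + 1):
            x = x - a / (k + 1 + A) * noisy_grad(x, rng)
        return x

    # Illustrative problem (assumed here): f(x) = 0.5 * ||x - x_star||^2, so g(x) = x - x_star,
    # observed only through g_tilde(x) = g(x) + xi with xi ~ N(0, sigma^2 I), as in (3).
    x_star = np.array([1.0, -2.0, 0.5])
    noisy_grad = lambda x, rng, sigma=0.5: (x - x_star) + sigma * rng.standard_normal(x.size)

    print(rm(noisy_grad, x1=np.zeros(3)))   # approaches x_star as the iterations accumulate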

Actually, due to the effect of noise, −g̃_k may not always be a good iterative direction, especially when the function is ill conditioned. However, little work has been done so far on the choice of the iterative direction. A conjugate direction method is given by Schraudolph and Graepel [16] when some information about the Hessian is known; since second-order information is difficult to obtain, the method is restrictive in its application. Xu and Dai [22] introduce a stochastic approximation frame algorithm which can yield better finite-sample behaviour than the RM algorithm. Xu [21] proposes a two-term combined direction stochastic approximation algorithm (denoted the CD algorithm for convenience in this paper), which employs a weighted combination of the current noisy negative gradient and one previous noisy negative gradient as the iterative direction.

Under the influence of noise, if only one previous noisy gradient is used, the included information may not be enough to correct the current gradient direction. If one more previous noisy gradient is selected to construct the current iterative direction, the method can be expected to work better.


Note that the main cost in stochastic approximation algorithms is the measurement of noisy gradients through simulation. Thus, compared with these measurements, the cost of the matrix–vector or vector–vector multiplications per iteration can be neglected. In this paper, we propose a three-term combinatorial direction algorithm which employs a weighted combination of the current noisy gradient and two previous noisy gradients as the iterative direction. Furthermore, a general framework of combinatorial direction algorithms is also proposed. Numerical results show that the new combinatorial direction algorithms yield better finite-sample performance than the RM algorithm after the same number of gradient evaluations.

The rest of this paper is organized as follows. In the next section, new combinatorial direction algorithms are proposed. In Section 3, both the almost sure convergence result and the asymptotic convergence rate of the new algorithms are presented. Numerical results are reported in Section 4. The last section provides some concluding remarks.

2. New combinatorial direction algorithms

2.1 Three-term combinatorial direction algorithm

Because of the influence of noise, −g̃_k may not always be the best iterative direction, especially when the function is ill conditioned. Note that the conjugate gradient (CG) method works better than the steepest descent method in deterministic optimization, because the CG method uses a conjugate direction at each iteration [4]. Furthermore, three-term conjugate gradient methods, which use a combination of the current negative gradient and two other directions, also show a good descent property [3]. In this section, a three-term combinatorial direction stochastic approximation algorithm is proposed, which is a stochastic analogue of the three-term conjugate gradient method, though the conjugate property cannot be expected to hold. At each iteration of the new algorithm, a weighted combination of the current noisy negative gradient and two previous noisy negative gradients is chosen as the iterative direction. It may work worse than the noisy steepest descent direction −g̃_k at any single iteration; however, we expect it to work better over several iterations, especially when some 'good' previous noisy gradients are chosen. The algorithm has the following procedure:

    x_{k+1} = x_k + \alpha_k d_k,    (6)

where

    d_k = -\tilde{g}_k - \frac{\tilde{g}_k^T \tilde{g}_{u_1(k)}}{\|\tilde{g}_{u_1(k)}\|^2}\,\tilde{g}_{u_1(k)} - \frac{\tilde{g}_k^T \tilde{g}_{u_2(k)}}{\|\tilde{g}_{u_2(k)}\|^2}\,\tilde{g}_{u_2(k)},    (7)

where u_1(k) and u_2(k) are subscripts less than k, and α_k is chosen in the same way as in the RM algorithm. How to choose u_1(k) and u_2(k) is the key point of this method. Actually, if the norm of the noisy gradient achieves its minimum at some point x_t during several successive iterations before the current iterative point x_k, it is reasonable to believe that the noisy negative gradient at the point x_{t−1} is a 'good' direction. Here, 'good' means that the direction −g̃_{t−1} can be regarded as a good descent direction with large probability. Thus, −g̃_{u_1(k)} can be used to correct the current noisy negative gradient −g̃_k, where u_1(k) denotes t − 1. The same idea can be applied to choosing u_2(k). If the norm of the noisy gradient achieves its second minimum at some point x_t̃ during several successive iterations before the current iterative point x_k, it is reasonable to believe that the noisy negative gradient at the point x_{t̃−1} is also a 'good' direction. Based on this idea, we give the choice of u_1(k) and u_2(k) as follows:

    u_1(k) =
      \begin{cases}
        \arg\min_{k-r+1 \le i \le k} \|\tilde{g}_i\| - 1 & \text{if } k > r, \\
        \arg\min_{1 \le i \le k} \|\tilde{g}_i\| - 1 & \text{otherwise},
      \end{cases}    (8)


    u_2(k) =
      \begin{cases}
        \arg\min_{i \in [k-r+1, k] \setminus \{u_1(k)\}} \|\tilde{g}_i\| - 1 & \text{if } k > r, \\
        \arg\min_{i \in [1, k] \setminus \{u_1(k)\}} \|\tilde{g}_i\| - 1 & \text{otherwise},
      \end{cases}    (9)

where r is a given positive integer. As assumed before, the major cost in solving a stochastic approximation problem is the evaluation of function values or gradients; the choice of u_1(k) in (8) and u_2(k) in (9) needs no additional computations of function values or gradients beyond those of the RM algorithm. The new algorithm, with u_1(k) and u_2(k) chosen as in (8) and (9), respectively, can be described concretely as follows.

Algorithm 1 (Three-Term Combinatorial Direction Algorithm)

Step 1: Given r, choose an initial point x_1, compute g̃_1, and set k = 1.
Step 2: If some given stopping criterion is satisfied or some given maximum iteration number has been reached, then stop; otherwise go to Step 3.
Step 3: Choose α_k from some given sequence; compute u_1(k) and u_2(k) according to (8) and (9), respectively; update d_k by (7); go to Step 4.
Step 4: Compute x_{k+1} = x_k + α_k d_k; set k = k + 1; go to Step 2.
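The following Python sketch (assumed code, not the author's MATLAB implementation) mirrors Algorithm 1: it keeps the noisy gradients of the last r iterations, selects u_1(k) and u_2(k) as the iterates just before the two smallest ‖g̃_i‖ in that window (the intent of (8)–(9)), and skips a correction term whenever the corresponding index is not yet available in the very first iterations, which is a handling assumption of this sketch.

    import numpy as np

    def three_term_cd(noisy_grad, x1, a=1.0, A=100.0, r=10, n_iter=5000, seed=0):
        rng = np.random.default_rng(seed)
        x = np.array(x1, dtype=float)
        grads = []                                     # grads[i-1] stores g_tilde_i
        for k in range(1, n_iter + 1):
            gk = noisy_grad(x, rng)                    # new noisy gradient at x_k
            grads.append(gk)
            lo = k - r + 1 if k > r else 1             # window [k-r+1, k] or [1, k], cf. (8)-(9)
            by_norm = sorted(range(lo, k + 1),
                             key=lambda i: np.linalg.norm(grads[i - 1]))
            d = -gk                                    # direction (7), built term by term
            for t in by_norm[:2]:                      # two smallest ||g_tilde_i||; use index t - 1
                u = t - 1
                if u >= 1 and float(grads[u - 1] @ grads[u - 1]) > 0.0:
                    gu = grads[u - 1]
                    d -= (gk @ gu) / (gu @ gu) * gu
            x = x + a / (k + 1 + A) * d                # Step 4, with alpha_k as in (54)
        return x

With the noisy_grad oracle of the earlier RM sketch, three_term_cd can be called in exactly the same way as rm.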

The difference between the CD algorithm [21] and Algorithm 1 can be shown as follows. The CD algorithm can be regarded as an 'acceleration technique' to speed up the convergence of the RM algorithm with a matrix-valued step size α_k A_k^{(1)}, i.e.

    x_{k+1} = x_k - \alpha_k A_k^{(1)} \tilde{g}_k,    (10)

where A_k^{(1)} = I + \tilde{g}_{u(k)} \tilde{g}_{u(k)}^T / \|\tilde{g}_{u(k)}\|^2 is a rank-one correction to the identity matrix I. The three-term combinatorial direction stochastic approximation algorithm can also be rewritten with a matrix-valued step size α_k A_k^{(2)}, where A_k^{(2)} = I + \tilde{g}_{u_1(k)} \tilde{g}_{u_1(k)}^T / \|\tilde{g}_{u_1(k)}\|^2 + \tilde{g}_{u_2(k)} \tilde{g}_{u_2(k)}^T / \|\tilde{g}_{u_2(k)}\|^2 is a rank-two correction to the identity matrix I. However, the algorithm is constructed so that matrix–vector multiplications are avoided; instead, only two vector–vector multiplications are needed at each iteration, which can be neglected compared with the cost of the simulation estimation of the gradient. In Section 4, it is shown that the three-term combinatorial direction stochastic approximation algorithm outperforms the CD algorithm.

2.2 General form

In this section, we give a general combinatorial direction stochastic approximation algorithm. At each iteration of the new algorithm, a weighted combination of the current noisy negative gradient and m previous noisy negative gradients is chosen as the iterative direction. It has the following recursive procedure:

    d_k = -\tilde{g}_k - \sum_{j=1}^{m} \frac{\tilde{g}_k^T \tilde{g}_{u_j(k)}}{\|\tilde{g}_{u_j(k)}\|^2}\,\tilde{g}_{u_j(k)},    (11)

    x_{k+1} = x_k + \alpha_k d_k,    (12)


where m is a given constant, 1 ≤ m < k, and u_j(k) (j = 1, ..., m) are subscripts less than k. The u_j(k) may be chosen as follows:

    u_1(k) =
      \begin{cases}
        \arg\min_{k-r+1 \le i \le k} \|\tilde{g}_i\| - 1 & \text{if } k > r, \\
        \arg\min_{1 \le i \le k} \|\tilde{g}_i\| - 1 & \text{otherwise},
      \end{cases}    (13)

    u_j(k) =
      \begin{cases}
        \arg\min_{i \in [k-r+1, k] \setminus \{u_1(k), \ldots, u_{j-1}(k)\}} \|\tilde{g}_i\| - 1 & \text{if } k > r, \\
        \arg\min_{i \in [1, k] \setminus \{u_1(k), \ldots, u_{j-1}(k)\}} \|\tilde{g}_i\| - 1 & \text{otherwise},
      \end{cases}    (14)

where j = 2, ..., m. For convenience, we denote the general combinatorial direction stochastic approximation algorithm framework by GCDSA algorithm in this paper.

Algorithm 2 (GCDSA Algorithm) In Step 3 of Algorithm 1, we compute u_1(k) to u_m(k) according to (13) and (14), respectively.

Actually, Algorithm 1 given in Section 2.1 is a special case of the GCDSA algorithm with m = 2. The CD algorithm proposed by Xu [21] is also a special case of the GCDSA algorithm with m = 1.
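A compact sketch of the GCDSA direction (11) and of an index-selection rule in the spirit of (13)–(14) is given below (assumed illustrative code; the treatment of the first iterations, where fewer than m admissible indices exist, is a choice made here, not specified by the framework).

    import numpy as np

    def gcdsa_direction(g_k, prev_grads):
        # Direction (11): the current noisy negative gradient corrected by the
        # m = len(prev_grads) selected previous noisy gradients g_tilde_{u_j(k)}.
        d = -np.array(g_k, dtype=float)
        for gu in prev_grads:
            d -= (g_k @ gu) / (gu @ gu) * gu
        return d

    def select_indices(grad_norms, k, r, m):
        # Selection in the spirit of (13)-(14): take the m smallest ||g_tilde_i|| in the
        # window [k-r+1, k] (or [1, k] when k <= r) and return the preceding subscripts i - 1.
        # grad_norms[i-1] stores ||g_tilde_i||; indices with i - 1 < 1 are dropped here.
        lo = k - r + 1 if k > r else 1
        window = sorted(range(lo, k + 1), key=lambda i: grad_norms[i - 1])
        return [i - 1 for i in window[:m] if i - 1 >= 1]

Setting m = 0, 1 or 2 recovers the RM, CD and Algorithm 1 directions, respectively.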

The remaining problem in the GCDSA algorithm is the choice of m. First, m = 1 is not enough to correct the current noisy gradient; the intuitive reason has been stated at the beginning of this section and will also be verified in our numerical experiments in Section 4. However, too large an m is not suitable either, for two major reasons. One is that it is useless to correct the current gradient with a gradient taken at a point far away from the current iterate. The other is that too large an m may lead to a large covariance matrix for the direction d_k, due to the multiplication and addition operations on the noisy gradients, and therefore may cause instability of the algorithm.

Actually, there exists a tradeoff between the stability and the finite-sample performance of the GCDSA algorithm. Too small an m may cause poor finite-sample performance, whereas too large an m can cause poor stability in the initial iterations. This phenomenon is actually observed in our numerical experiments in Section 4: the GCDSA algorithm with m = 5 diverges to infinity in the initial iterations for problem P3, while it performs well with m < 3 for the same problem. What we need is an algorithm that achieves better finite-sample performance while maintaining stability in the initial iterations. Though there is no theoretical guarantee as to which m is always the best choice for this algorithm, our numerical results lead us to recommend m < 5 (especially m = 2), which can always achieve a good balance between stability and finite-sample performance.

3. Convergence analysis

3.1 Almost sure convergence

We consider the problem

    \arg\min_{x \in \mathbb{R}^n} f(x),    (15)

where R^n denotes the n-dimensional Euclidean space and f : R^n → R is a continuously differentiable function such that, for some constant L, we have

    \|\nabla f(x) - \nabla f(\bar{x})\| \le L \|x - \bar{x}\| \quad \forall x, \bar{x} \in \mathbb{R}^n,    (16)


where ‖·‖ denotes the norm of a vector. Consider the stochastic approximation procedure

    x_{k+1} = x_k + \alpha_k (s_k + \omega_k),    (17)

where α_k is a deterministic positive step size, s_k is a descent direction and ω_k is a random noise term. Let x_k be a sequence generated by the above procedure and let F_k be an increasing sequence of σ-fields, which should be interpreted as the history of the algorithm up to time k, just before ξ_k is generated. Both x_k and s_k are F_k-measurable.

To clarify, a σ-field is a technical concept for a collection of sets that is closed under complementation and countable unions of its members. In probability theory, it is interpreted as the collection of events which can be assigned probabilities. Elements of the σ-field F_k are called F_k-measurable (sets).

Assumption 3.1 There exist positive scalars c_1 and c_2 such that, for all k,

    -g(x_k)^T s_k \ge c_1 \|g(x_k)\|^2,    (18)

    \|s_k\| \le c_2 (1 + \|g(x_k)\|).    (19)

Assumption 3.2

    \alpha_k > 0, \quad \sum_{k \ge 1} \alpha_k = \infty, \quad \sum_{k \ge 1} \alpha_k^2 < \infty.    (20)

Assumption 3.3 For all k, there exists a positive deterministic constant A such that ω_k satisfies

    E[\omega_k \mid F_k] = 0,    (21)

    E[\|\omega_k\|^2 \mid F_k] \le A (1 + \|g(x_k)\|^2).    (22)

Remark 1 Assumption 3.1 is satisfied by, but not limited to, descent directions s_k such as s_k = −g(x_k). The step-size condition given in Assumption 3.2 is essential for convergence in stochastic approximation algorithms. Assumption 3.3 on ω_k is of the same type as that considered in [14], which is a general condition on the error in the gradient.

First, we state the following existing result and omit the proof.

Proposition 3.4 (Bertsekas and Tsitsiklis [1]) Let x_k be a sequence generated by (17). If Assumptions 3.1–3.3 are satisfied, then either f(x_k) → −∞ or else f(x_k) converges to a finite value and lim_{k→+∞} g(x_k) = 0. Furthermore, every limit point of x_k is a stationary point of f.

Assumption 3.5 For all k, there exists a positive deterministic constant A such that ξ_k satisfies

    E[\xi_k \mid F_k] = 0,    (23)

    E[\|\xi_k\|^2 \mid F_k] \le \frac{A}{(m+1)^2} (1 + \|g(x_k)\|^2).    (24)
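For instance (an illustrative check, not part of the paper's assumptions), the noise model used in the experiments of Section 4, ξ_k ∼ N(0, σ²I) drawn independently of the past, satisfies Assumption 3.5:

    E[\xi_k \mid F_k] = 0, \qquad
    E[\|\xi_k\|^2 \mid F_k] = n\sigma^2 \le \frac{A}{(m+1)^2}\,(1 + \|g(x_k)\|^2)
    \quad \text{for any } A \ge (m+1)^2 n \sigma^2 .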

Theorem 3.6 Under Assumptions 3.2 and 3.5, for the GCDSA algorithm, either f(x_k) → −∞ or f(x_k) converges to a finite value and lim_{k→∞} g_k = 0 with probability 1. Furthermore, every limit point of x_k is a stationary point of f.


Proof The procedure of the GCDSA algorithm can be rewritten as follows:

    x_{k+1} = x_k + \alpha_k s_k + \alpha_k \omega_k,    (25)

where

    s_k = -g_k - \sum_{j=1}^{m} \frac{g_k^T \tilde{g}_{u_j(k)}}{\|\tilde{g}_{u_j(k)}\|^2}\,\tilde{g}_{u_j(k)},    (26)

    \omega_k = -\xi_k - \sum_{j=1}^{m} \frac{\xi_k^T \tilde{g}_{u_j(k)}}{\|\tilde{g}_{u_j(k)}\|^2}\,\tilde{g}_{u_j(k)}.    (27)

Here s_k is F_k-measurable, where F_k denotes the increasing sequence of σ-fields generated by {x_1, g_1, ξ_1, ..., x_{k−1}, g_{k−1}, ξ_{k−1}, x_k, g_k}. We have

    -s_k^T g_k = \|g_k\|^2 + \sum_{j=1}^{m} \frac{(g_k^T \tilde{g}_{u_j(k)})^2}{\|\tilde{g}_{u_j(k)}\|^2} \ge \|g_k\|^2,    (28)

and

    \|s_k\| \le (m+1) \|g_k\|.    (29)

On the other hand, due to the choice of u_1(k) to u_m(k), we can conclude that u_1(k), ..., u_m(k) ≤ k − 1, that is, g̃_{u_1(k)}, ..., g̃_{u_m(k)} ∈ F_k. So, using (27) and (23), we get

    E[\omega_k \mid F_k] = E[\xi_k \mid F_k] + \sum_{j=1}^{m} \frac{E[\xi_k \mid F_k]^T \tilde{g}_{u_j(k)}}{\|\tilde{g}_{u_j(k)}\|^2}\,\tilde{g}_{u_j(k)} = 0.    (30)

From (27) we have

    \|\omega_k\| \le (m+1) \|\xi_k\|,    (31)

and, by Assumption 3.5, we get

    E[\|\omega_k\|^2 \mid F_k] \le (m+1)^2 E[\|\xi_k\|^2 \mid F_k] \le A (1 + \|g_k\|^2).    (32)

Making use of (28)–(30) and (32), the proof is completed by directly applying Proposition 3.4 to procedure (25) with c_1 = 1 and c_2 = m + 1. ∎
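The two key bounds (28) and (29) used above can also be checked numerically; the following sketch (illustrative, with random vectors in place of g_k and the previous noisy gradients) verifies them for m = 2.

    import numpy as np

    rng = np.random.default_rng(2)
    n, m = 8, 2
    for _ in range(1000):
        gk = rng.standard_normal(n)                    # the true gradient g_k
        prev = rng.standard_normal((m, n))             # previous noisy gradients g_tilde_{u_j(k)}
        sk = -gk - sum((gk @ gu) / (gu @ gu) * gu for gu in prev)   # deterministic part (26)
        assert -sk @ gk >= gk @ gk - 1e-12             # bound (28)
        assert np.linalg.norm(sk) <= (m + 1) * np.linalg.norm(gk) + 1e-12   # bound (29)
    print("bounds (28) and (29) hold in all random trials")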

3.2 Asymptotic rate of convergence

As is typical for stochastic algorithms, in this section we analyse the asymptotic rate of convergence of the GCDSA algorithm. First, suppose x∗ is the limit point of the sequence {x_n}; the asymptotic distribution of x_n − x∗ can be established if we make the following further assumptions.

Assumption 3.7 Let a be positive, let α be in (1/2, 1], and

    \lim_{n \to \infty} n^{\alpha} \alpha_n = a.    (33)


Assumption 3.8 The sequence {ξ_n} is independent and identically distributed, E ξ_n = 0,

    \lim_{n \to \infty} n^{\alpha} E\|\xi_n\|^2 = 0,    (34)

and E(ξ_n ξ_n^T) = L, where L is the covariance matrix of ξ_n.

Assumption 3.9 Let A = ∇²f(x∗) be positive definite. There exists an orthogonal matrix P such that P^T A P = Λ, where Λ is diagonal. Let λ = min_{1≤i≤n} Λ_{(ii)}, where Λ_{(ii)} denotes the (i, i) element of the matrix Λ.

Assumption 3.10 If α = 1, then β_+ = α; otherwise β_+ = 0. Besides, β_+ < 2λ.

Theorem 3.11 Under Assumptions 3.7–3.10, for the GCDSA algorithm, if x_n → x∗, then we have

    n^{\alpha/2} (x_n - x^*) \xrightarrow{F} N(0, P M P^T),    (35)

where 0 is the zero vector, \xrightarrow{F} denotes convergence in distribution,

    M_{(ij)} = (P^T \Sigma P)_{(ij)} \, (\Lambda_{(ii)} + \Lambda_{(jj)} - \beta_+)^{-1},    (36)

and Σ = a² L.

Proof By straightforward deduction, the recursive expression of the combined direction algorithm can be written as follows:

    U_{n+1} = (I - n^{-\alpha} \Gamma_n) U_n + n^{-(\alpha+\beta)/2} \Phi_n V_n + n^{-\alpha-\beta/2} T_n,    (37)

where U_n, Γ_n, Γ, β, T_n, Φ_n, Φ and V_n are given by

    U_n = x_n - x^*,    (38)

    \Gamma_n = (n^{\alpha} \alpha_n) \nabla^2 f(\bar{x}_n), \quad \Gamma = a \nabla^2 f(x^*),    (39)

    \beta = \alpha,    (40)

    \Phi_n = \Phi = I,    (41)

    V_n = -(n^{\alpha} \alpha_n)\,\xi_n,    (42)

    T_n = -\alpha_n n^{\alpha+\beta/2} \sum_{j=1}^{m} \frac{\tilde{g}_n^T \tilde{g}_{u_j(n)}}{\|\tilde{g}_{u_j(n)}\|^2}\,\tilde{g}_{u_j(n)}, \quad T = 0,    (43)

where x̄_n in (39) is a point on the line segment between x_n and x∗. The theorem can be proved by applying Fabian's theorem [6]. We now show that conditions (2.2.1)–(2.2.3) and (2.2.5) of Fabian's theorem hold. Condition (2.2.5) of Fabian's theorem [6] is a direct result of Assumption 3.10. We now show that the other conditions hold.

We see that Γ_n → a∇²f(x∗) by x_n → x∗ and the continuity of ∇²f(x). On the other hand,

    E\|T_n\| \le m \alpha_n n^{\alpha+\beta/2} E\|\tilde{g}_n\|.    (44)


Taking limits over n on both sides of the above inequality and using Assumption 3.7, we get

    0 \le \lim_{n \to \infty} E\|T_n\| \le m a \lim_{n \to \infty} n^{\beta/2} E\|g_n + \xi_n\|.    (45)

Using Levi's monotone convergence theorem, g_n → 0 and (34) in Assumption 3.8, we have

    m a \lim_{n \to \infty} n^{\beta/2} E\|g_n + \xi_n\| = m a \lim_{n \to \infty} n^{\beta/2} E\|\xi_n\| \le m a \lim_{n \to \infty} \big(n^{\beta} E\|\xi_n\|^2\big)^{1/2} = 0.    (46)–(49)

That is, E‖T_n − T‖ → 0 with T = 0, so condition (2.2.1) is satisfied.

By Assumption 3.8, we get E V_n = −E(n^α α_n ξ_n) = 0 and

    \lim_{n \to \infty} \|E V_n V_n^T - \Sigma\| = \lim_{n \to \infty} \|(n^{\alpha} \alpha_n)^2 E(\xi_n \xi_n^T) - a^2 L\| = 0.    (50)

So condition (2.2.2) is satisfied.

For every r > 0, we have

    0 \le \lim_{n \to \infty} E\big[\chi\{\|V_j\|^2 \ge r j^{\alpha}\}\,\|V_j\|^2\big] \le \lim_{n \to \infty} E\|V_j\|^2,    (51)

where χ is an indicator function. From (34), we can conclude that

    \lim_{n \to \infty} E\|V_j\|^2 = a^2 \lim_{n \to \infty} E\|\xi_j\|^2 = 0.    (52)

So

    \lim_{n \to \infty} E\big[\chi\{\|V_j\|^2 \ge r j^{\alpha}\}\,\|V_j\|^2\big] = 0,    (53)

that is, condition (2.2.3) is satisfied. Altogether, the proof of the theorem is completed by applying Fabian's theorem [6]. ∎

Remark 2 The result of Theorem 3.11 shows that if x_k → x∗, the rate at which the iterate x_k approaches x∗ is proportional, in a stochastic sense, to n^{−α/2} for large n. That is, the asymptotic rate of convergence of x_k to x∗ in the GCDSA algorithm is O(n^{−α/2}) under general conditions. Furthermore, the maximum asymptotic rate of convergence of the GCDSA algorithm is O(1/√n), obtained by choosing α = 1.
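The O(n^{−α/2}) rate can be observed empirically; the sketch below (an illustrative check on the m = 0, i.e. RM, special case with a one-dimensional quadratic, all parameter values assumed here) prints √k times the root-mean-square error at several checkpoints, which stabilizes rather than grows when α = 1.

    import numpy as np

    # f(x) = x^2 / 2, g(x) = x, noisy gradient g_tilde = x + xi with xi ~ N(0, 1),
    # step size alpha_k = a / (k + 1 + A), i.e. alpha = 1, so the rate should be O(1/sqrt(k)).
    rng = np.random.default_rng(3)
    reps, a, A = 500, 2.0, 10.0
    x = np.full(reps, 5.0)                       # 500 independent replications, x* = 0
    for k in range(1, 100_001):
        x = x - a / (k + 1 + A) * (x + rng.standard_normal(reps))
        if k in (1_000, 10_000, 100_000):
            print(k, np.sqrt(k) * np.sqrt(np.mean(x ** 2)))   # roughly constant across checkpoints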

4. Numerical results

In this section, we compare Algorithm 1 and several special cases of the GCDSA algorithm (i.e. Algorithm 2) with the classical RM algorithm and the CD algorithm given by Xu [21]. Actually, the RM algorithm, the CD algorithm and Algorithm 1 can be regarded as special cases of the GCDSA algorithm with m = 0, 1, 2, respectively. In many references, only one test problem is used to compare different stochastic algorithms (see e.g. [16,18,19]). In this section, in order to compare the performance of these algorithms sufficiently, nine nonlinear functions and a practical inventory problem are chosen as test problems.


4.1 Unconstrained test problems

Nine functions were chosen from the collection of unconstrained minimization test problems in Moré et al. [12]. They are listed as follows.

P1. The Helical valley function
P2. The Penalty function I
P3. The Penalty function II
P4. The extended Rosenbrock function (m = 4)
P5. The extended Powell singular function
P6. The Beale function
P7. The extended Rosenbrock function (m = 10)
P8. The extended Rosenbrock function (m = 50)
P9. The extended Rosenbrock function (m = 500)

The initial points for the nine test problems, together with the optimal function value and the dimension of each problem, are given in the last three columns of Table 1. Normally distributed noise is added to the gradient evaluations of the above nine functions, namely ξ_k ∼ N(0, σ²I), where I is the identity matrix. For all algorithms, we choose α_k of the following form [19]:

    \alpha_k = \frac{a}{k + 1 + A}.    (54)

For each test problem, we use the same stability constants a and A for all algorithms, given in the first two columns of Table 1. MATLAB software was used to carry out this study.

For each test problem, as is typical for stochastic algorithms, we ran each algorithm 50 times and observed the average function value after 5000 iterations, which we call the 'finite-sample performance' of each algorithm. The results with σ = 0.5 and σ = 1.0 are listed in Tables 2 and 3, respectively. To avoid making the tables too broad, the power of ten of the average function values is not listed separately; it is the same for each problem and is given in the first column of Tables 2 and 3, respectively.
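The experimental protocol can be reproduced schematically as follows (an assumed sketch, not the MATLAB code used for the paper; a two-dimensional Rosenbrock function stands in for the higher-dimensional test problems of Table 1): exact gradients are perturbed with N(0, σ²I) noise, the step size follows (54), and the terminal function values of 50 independent runs of 5000 iterations are averaged.

    import numpy as np

    def rosenbrock(x):
        return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

    def rosenbrock_grad(x):
        return np.array([-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
                         200.0 * (x[1] - x[0] ** 2)])

    def rm_run(sigma, a, A, n_iter, rng):
        # One replication of the RM algorithm with g_tilde_k = g(x_k) + xi_k, xi_k ~ N(0, sigma^2 I),
        # and alpha_k = a / (k + 1 + A) as in (54).
        x = np.array([-0.5, -0.5])
        for k in range(1, n_iter + 1):
            x = x - a / (k + 1 + A) * (rosenbrock_grad(x) + sigma * rng.standard_normal(2))
        return rosenbrock(x)

    rng = np.random.default_rng(4)
    finals = [rm_run(sigma=0.5, a=1.0, A=100.0, n_iter=5000, rng=rng) for _ in range(50)]
    print(np.mean(finals))    # averaged terminal function value ('finite-sample performance')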

Tables 2 and 3 show that Algorithm 1 outperforms the RM algorithm for all test problems under different noise levels. Algorithm 1 outperforms the CD algorithm given by Xu [21] for almost all test problems, except P3 when σ = 0.5. Besides, Algorithm 2 with m = 3 and m = 5 performs better than the other algorithms for problems P2, P4, P6, P7, P8 and P9; however, it performs very badly for problems P1 and P3. 'NaN' in Tables 2 and 3 means the algorithm diverges in the initial iterations in that case. Actually, there exists a tradeoff between the stability and the finite-sample performance of the GCDSA algorithm: m = 0 may cause poor finite-sample performance, whereas too large an m can cause poor stability in the initial iterations.

Table 1. The initialization of the parameters and x_1 for the nine test problems.

Problem   a     A    x_1                                   f*              n
P1        1     100  [−1; 0; 0]                            0               3
P2        0.1   100  [1; 2; . . . ; 10]                    7.087 × 10^−5   10
P3        0.1   100  [1; 1; . . . ; 1]                     2.937 × 10^−4   10
P4        1     100  [−0.5; . . . ; −0.5]                  0               4
P5        0.1   50   [1; 3; −1; 0; . . . ; 1; 3; −1; 0]    0               16
P6        1     100  [1; 1]                                0               2
P7        1     100  [−0.5; . . . ; −0.5]                  0               10
P8        1     100  [−0.5; . . . ; −0.5]                  0               50
P9        1     100  [−0.5; . . . ; −0.5]                  0               500


Table 2. The mean function values at the terminal 5000 iterations of the tested algorithms over 50 independent runs with σ = 0.5.

Problem    RM      CD (m = 1)   Alg. 1 (m = 2)   Alg. 2 (m = 3)   Alg. 2 (m = 5)
P1 (e-4)   10.00   1.21         0.68             15.4             NaN
P2 (e-3)   43.1    3.23         0.76             0.20             0.09
P3 (e-1)   1.75    1.69         1.71             NaN              NaN
P4 (e-3)   24.6    1.99         1.33             1.15             0.76
P5 (e-1)   21.0    4.42         1.97             8.26             5.30
P6 (e-4)   20.0    3.74         1.92             1.59             1.25
P7 (e-3)   34.5    5.04         3.23             2.04             1.62
P8 (e-2)   17.1    2.18         1.65             1.04             0.69
P9 (e-1)   17.2    3.00         1.76             1.07             0.72

Table 3. The mean function values at the terminal 5000 iterations of the tested algorithms over 50 independent runs with σ = 1.0.

Problem    RM (m = 0)   CD (m = 1)   Alg. 1 (m = 2)   Alg. 2 (m = 3)   Alg. 2 (m = 5)
P1 (e-4)   11.3         8.47         2.94             4.52             NaN
P2 (e-3)   43.1         3.39         0.76             0.22             0.12
P3 (e-1)   1.76         1.72         1.67             NaN              NaN
P4 (e-3)   25.2         3.28         2.79             2.18             2.57
P5 (e-1)   21.0         5.38         2.52             8.84             5.84
P6 (e-4)   19.4         6.42         6.37             5.09             5.19
P7 (e-3)   38.6         8.13         6.20             4.94             5.62
P8 (e-2)   18.5         4.41         3.29             2.68             2.15
P9 (e-1)   18.3         4.51         3.31             2.62             2.12

Through the numerical results of the GCDSA algorithm, we recommend m = 1 and m = 2, which can achieve a good balance between stability and finite-sample performance. We remark that the main cost in all these algorithms is the number of measurements of the noisy gradient; thus, all these algorithms can be regarded as being compared under the same computational cost. The performance of all algorithms on problem P2 can be seen more clearly in Figure 1. From the numerical results, we conclude that Algorithm 1 is a promising algorithm.

We mention that, during our experiments, the expressions of the nine test functions are only used to produce g̃_k and to compare the performance of the different algorithms. We do not use any exact information about function values or gradients during the iterations of any algorithm.

4.2 (s, S) inventory problem

To investigate the effectiveness and efficiency of a stochastic optimization method, Fu and Healy [8] consider the optimization of a periodic review (s, S) inventory system. Full backlogging of orders is assumed and order lead times are taken to be zero. Let X_i denote the inventory level in period i. The cost incurred in period i is

    f(s, S, X_i) = I_{\{X_i < s\}}\,[K + c(S - X_i)] + h \max\{0, X_i\} + p \max\{0, -X_i\},

where I_{\{\cdot\}} is the indicator function of the set {·} and K, c, h and p are constants. The long-run average cost is

    F(s, S) = E[f(s, S, X_i)] = \frac{c}{\lambda} + \frac{K + h\big[s - 1/\lambda + (\lambda/2)(S^2 - s^2)\big] + \big((h + p)/\lambda\big) e^{-\lambda s}}{1 + \lambda(S - s)}.
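For reference, the per-period cost and a brute-force Monte Carlo estimate of F(s, S) can be sketched as follows (illustrative code; the review dynamics, the cost parameters K and p, and the trial values of s, S and λ are assumptions made here, not the eight cases of [8]).

    import numpy as np

    def period_cost(x, s, S, K, c, h, p):
        # Per-period cost f(s, S, X_i) as defined above.
        order_cost = K + c * (S - x) if x < s else 0.0
        return order_cost + h * max(0.0, x) + p * max(0.0, -x)

    def average_cost(s, S, lam, K=100.0, c=1.0, h=1.0, p=10.0, n_periods=100_000, seed=0):
        # Monte Carlo estimate of F(s, S): exponential(lam) demands (mean 1/lam), zero lead
        # time, full backlogging; order up to S whenever the inventory level drops below s.
        rng = np.random.default_rng(seed)
        x, total = float(S), 0.0
        for _ in range(n_periods):
            total += period_cost(x, s, S, K, c, h, p)
            if x < s:
                x = float(S)                      # replenish immediately (zero lead time)
            x -= rng.exponential(1.0 / lam)       # demand of the next period
        return total / n_periods

    print(average_cost(s=10.0, S=40.0, lam=0.1))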


[Figure 1 appears here: true function values (log scale) versus iteration count for the five algorithms RM, CD, Alg. 1, Alg. 2 (m = 3) and Alg. 2 (m = 5) on problem P2.]

Figure 1. The true function values of five algorithms for problem P2 during 5000 iterations, σ = 1.0.

Table 4. Experimental results for the Hybrid algorithm, the RM algorithm, the CD algorithm, and Algorithms 1 and 2 with m = 3 and 5.

Case   a      Hybrid   RM       CD       Alg. 1 (m = 2)   Alg. 2 (m = 3)   Alg. 2 (m = 5)   F*
1      20     741.     741.7    741.4    741.3            741.3            741.3            740.9
2      100    221x     2210.2   2212.7   2216.1           2218.5           2220.1           2200.0
3      200    118x     1210.2   1189.6   1186.6           1186.9           1186.3           1184.4
4      100    28xx     2685     2662     2651             2655             2653             2643.4
5      5000   170xx    17,085   17,086   17,086           17,087           17,088           17,078
6      5000   215xx    21,539   21,527   21,525           21,527           21,521           21,496
7      5000   285xx    28,866   28,577   28,356           28,489           28,503           28,164
8      5000   33xxx    32,895   32,711   32,691           32,634           32,637           32,583

The objective is to minimize F(s, S). Following [8], eight cases are considered, in which demands are assumed to obey independent exponential distributions with rate λ and c = h = 1. The point (s^(0), S^(0)) = (1/(2λ), 1/λ) is selected as the initial point. All estimates were computed over a common set of 16 independently seeded replications of 100,000 period demands. PA derivative estimators with N = 50 are used to generate gradient estimates according to Fu [7]. The acceleration factor and update period used were the same as those employed in [8]. Table 4 shows the mean function value of all the tested algorithms, and the optimal values of the eight cases are shown in the last column. From this table, we can see that the RM algorithm performs the best for cases 2 and 5, but performs the worst for the other six cases. The CD algorithm outperforms the RM algorithm except for cases 2 and 5. Algorithm 2 with m = 3 and m = 5 performs better than the CD algorithm for four cases, and worse for the other four cases. Algorithm 1 performs more efficiently than the RM algorithm and the CD algorithm except for case 2, performs better than Algorithm 2 with m = 3 except for case 8, and outperforms Algorithm 2 with m = 5 for more than half of the test cases; it performs the best among all the tested algorithms. In the case of the (s, S) inventory problem, the results of the simulation experiments indicate that Algorithm 1 is recommended.

5. Conclusions

In this paper, in order to accelerate the convergence of the RM algorithm, a three-term combinatorial direction stochastic approximation algorithm is proposed.


Furthermore, a general combinatorial direction algorithm (GCDSA) is also given. Both the almost sure convergence property and the asymptotic rate of convergence of the GCDSA algorithm are established under some general assumptions. Numerical simulation experiments show that the GCDSA algorithm with m = 2 outperforms the RM algorithm and the CD algorithm given by Xu [21] in most cases. It is a promising algorithm.

Acknowledgements

The author is very grateful to the referees for many useful suggestions that improved this paper. The author also wishes to thank Prof. Yu-hong Dai (Institute of Computational Mathematics, Chinese Academy of Sciences) for his useful advice on this research. The paper was presented at ICOTA8, and the research is supported by China NSF under grant 11101261 and Key Disciplines of Shanghai Municipality (S30104).

References

[1] D.P. Bertsekas and J.N. Tsitsiklis, Gradient convergence in gradient methods with errors, SIAM J. Optim. 10 (2000), pp. 627–642.
[2] J.R. Blum, Multidimensional stochastic approximation methods, Ann. Math. Statist. 25 (1954), pp. 737–744.
[3] Y.H. Dai and Y. Yuan, Convergence of three-term conjugate gradient methods, Math. Numer. Sinica 21 (1999), pp. 355–362.
[4] Y.H. Dai and Y. Yuan, Nonlinear Conjugate Gradient Methods, Shanghai Scientific and Technical Publishers, Shanghai, China, 2000 (in Chinese).
[5] B. Delyon and A. Juditsky, Accelerated stochastic approximation, SIAM J. Optim. 3 (1993), pp. 868–881.
[6] V. Fabian, On asymptotic normality in stochastic approximation, Ann. Math. Statist. 39 (1968), pp. 1327–1332.
[7] M.C. Fu, Optimization via simulation: A review, Ann. Oper. Res. 53 (1994), pp. 199–247.
[8] M.C. Fu and K.J. Healy, Techniques for optimization via simulation: An experimental study on an (s, S) inventory system, IIE Trans. 29 (1997), pp. 191–199.
[9] H. Kesten, Accelerated stochastic approximation, Ann. Math. Statist. 29 (1958), pp. 41–59.
[10] J. Kiefer and J. Wolfowitz, Stochastic estimation of the maximum of a regression function, Ann. Math. Statist. 23 (1952), pp. 462–466.
[11] H.J. Kushner and D.S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer, Berlin, 1978.
[12] J.J. Moré, B.S. Garbow, and K.E. Hillstrom, Testing unconstrained optimization software, ACM Trans. Math. Softw. 7 (1981), pp. 17–41.
[13] M.B. Nevel'son and R.Z. Khas'minskij, Stochastic Approximation and Recursive Estimation, American Mathematical Society, Providence, RI, 1973.
[14] B.T. Poljak and Y.Z. Tsypkin, Pseudogradient adaptation and training algorithms, Automat. Remote Control 12 (1973), pp. 83–94.
[15] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist. 22 (1951), pp. 400–407.
[16] N.N. Schraudolph and T. Graepel, Combining conjugate direction methods with stochastic approximation of gradients, Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, Key West, FL, 2002, pp. 7–13.
[17] J.C. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE Trans. Automat. Control 37 (1992), pp. 332–341.
[18] J.C. Spall, Adaptive stochastic approximation by the simultaneous perturbation method, IEEE Trans. Automat. Control 45 (2000), pp. 1839–1853.
[19] J.C. Spall, Introduction to Stochastic Search and Optimization, Wiley, New York, 2003.
[20] M. Wasan, Stochastic Approximation, Cambridge University Press, Cambridge, 1970.
[21] Z. Xu, A combined direction stochastic approximation algorithm, Optim. Lett. 4 (2010), pp. 117–129.
[22] Z. Xu and Y.H. Dai, A stochastic approximation frame algorithm with adaptive directions, Numer. Math. Theory Methods Appl. 1 (2008), pp. 460–474.
