
Noname manuscript No. (will be inserted by the editor)

Primal-Dual Stochastic Hybrid Approximation Algorithm

Tomás Tinoco De Rubira · Gabriela Hug

Received: date / Accepted: date

T. Tinoco De Rubira: Power Systems Laboratory, ETH Zurich, Switzerland. Tel.: +41-44-632-43-84. E-mail: [email protected]

G. Hug: Power Systems Laboratory, ETH Zurich, Switzerland. Tel.: +41-44-632-81-91. E-mail: [email protected]

Abstract A new algorithm for solving convex stochastic optimization problems with expectation functions in both the objective and constraints is presented. The algorithm combines a stochastic hybrid procedure, which was originally designed to solve problems with expectation only in the objective, with dual stochastic gradient ascent. More specifically, the algorithm generates primal iterates by minimizing deterministic approximations of the Lagrangian that are updated using noisy subgradients, and dual iterates by applying stochastic gradient ascent to the true Lagrangian. The sequence of primal iterates produced by the algorithm is shown to have a subsequence that converges almost surely to an optimal point under certain conditions. Numerical experience with the new algorithm and with benchmark algorithms that include a primal-dual stochastic approximation algorithm and algorithms based on sample-average approximations is reported. The test problem used originates in power systems operations planning under high penetration of renewable energy, where optimal risk-averse power generation policies are sought. In particular, the problem consists of a two-stage stochastic optimization problem with quadratic objectives and a Conditional Value-At-Risk constraint.

Keywords Stochastic Approximation · Stochastic Hybrid Approximation · Expected-Value Constraints · Conditional Value-At-Risk

1 Introduction

In many applications, optimization problems arise whose objective function or constraints contain expectation functions. For example, in portfolio optimization, one is interested in selecting the portfolio that results in the lowest expected loss, and a bound on the risk of high losses may be enforced using Conditional Value-At-Risk (CVaR) [1]. In power systems operations planning, an important task is planning generator output powers in such a way that the expected system operation cost is minimized in the presence of uncertain inputs from renewable energy sources [2]. In communication networks, one goal is to maximize throughput while bounding the expected packet loss or time delay in transmission [3]. In call centers, one is interested in minimizing cost while ensuring that the expected number of answered calls is above a certain level [4]. Solving optimization problems such as these, which have expectation functions, is in general a challenging task. The reason is that the multi-dimensional integrals that define the expectation functions cannot be computed with high accuracy in practice when the uncertainty has more than a few dimensions [5]. Hence, existing approaches used for solving these problems work with approximations instead of the exact expectation functions.

Two families of approaches have been studied for solving optimization problems with expectation functions in a computationally tractable way. The first approach is based on external sampling, which consists of sampling realizations of the uncertainty and replacing the stochastic problem with a deterministic approximation of it. This approach is also referred to as the Sample-Average Approximation (SAA) approach, since the deterministic approximation is obtained by replacing expected values with sample averages [6]. The resulting deterministic problem, which is typically very large, is solved using techniques that exploit the problem structure, such as the Benders or L-shaped decomposition method [7] [8]. Several authors have analyzed the performance of SAA-based algorithms in the literature. The reader is referred to [1], [6], [9] and [10] for discussions on the asymptotic convergence of solutions and optimal values, and on the accuracy obtained using finite sample sizes.

The second approach for solving optimization problems with expectation functions is based on internal sampling, which consists of periodically sampling realizations of the uncertainty during the execution of the algorithm. Some of the most studied algorithms of this family are Stochastic Approximation (SA) algorithms. These algorithms update a solution estimate by taking steps along random directions. These random directions typically have the property that, in expectation, they equal negative gradients or subgradients of the function to be minimized [6]. The advantage of these algorithms is that they typically have low demand for computer memory. Their convergence, however, is very sensitive to the choice of step lengths, i.e., the lengths of the steps taken along the random directions. The fundamental issue is that the step lengths need to be small enough to adequately suppress noise, while at the same time large enough to avoid slow convergence. Since first proposed by Robbins and Monro [11], SA algorithms have been an active area of research. Most of the work has been done in the context of stochastic optimization problems with deterministic constraints. The reader is referred to [5] and [12] for important advancements and insights on these algorithms. To handle expectation functions in the constraints, to the best of our knowledge, the SA approaches explored in the literature have been based on the Lagrangian method [3] [13] [14] [15]. More specifically, stochastic gradient or similar steps are used for minimizing the Lagrangian with respect to the primal variables and for maximizing the Lagrangian with respect to the dual variables.


Another important internal-sampling algorithm, referred to as the stochastic hybrid approximation algorithm, is proposed in [16] for solving stochastic problems with deterministic constraints. It consists of solving a sequence of deterministic approximations of the original problem whose objective functions are periodically updated using noisy subgradients. Its development was motivated by a dynamic resource allocation problem, and it belongs to an important line of research that uses techniques from approximate dynamic programming for solving problems of this type [17] [18] [19]. The authors in [16] show that the iterates produced by the algorithm converge almost surely to an optimal point under certain conditions. In [2], the authors apply this algorithm to a two-stage stochastic optimization problem with quadratic objectives and linear constraints for determining optimal generation policies in power systems with high penetration of renewable energy. The initial deterministic approximation used consists of the certainty-equivalent problem, i.e., the problem obtained by replacing random variables with their expected values. The results obtained show the superiority of the hybrid algorithm over benchmark algorithms, which include stochastic gradient descent and (sequential) algorithms based on SAA with Benders decomposition, on the problem considered. In particular, the hybrid algorithm is shown to produce better iterates than the other algorithms, and to be more robust against noise during the initial iterations compared to stochastic gradient descent. The authors attribute these properties to the quality of the initial deterministic approximation used, and to an interesting connection between the hybrid algorithm and a stochastic proximal gradient algorithm when applied to the problem considered.

In this work, the stochastic hybrid approximation algorithm proposed in [16] and further investigated in [2] is extended to also handle constraints with expectation functions. This is done by applying the hybrid procedure to the Lagrangian and combining it with dual stochastic gradient ascent. An analysis of the convergence of this new algorithm is presented. Furthermore, numerical experience obtained from applying this and benchmark algorithms to a risk-averse version of the problem considered in [2] is reported. The benchmark algorithms include a primal-dual SA algorithm and SAA-based algorithms with and without decomposition techniques. The test problem considered consists of a two-stage stochastic problem with quadratic objectives and a CVaR constraint that bounds the risk of high second-stage costs. The instances of the test problem used for the experiments come from real electric power networks having around 2500 and 3000 nodes, and hence several thousand combined first- and second-stage variables. The empirical results obtained suggest that the hybrid algorithm can exploit accurate initial deterministic approximations of the expectation functions and converge to a solution faster than the benchmark algorithms.

This paper is organized as follows: In Section 2, the general stochastic optimization problem considered as well as important notation are introduced. In Sections 3 and 4, the SAA-based and primal-dual SA benchmark algorithms are described. In Section 5, the new primal-dual stochastic hybrid approximation algorithm is presented along with a convergence analysis and simple examples that illustrate its properties. In Section 6, an application from power systems operations planning of stochastic optimization involving expectation functions in both the objective and constraints is described, and the numerical experience obtained from applying the algorithms considered in this work to this application is reported. Lastly, in Section 7, the key contributions and findings of this work are summarized, and future research directions are discussed.

2 General Problem

The general stochastic optimization problem considered is of the form

\[
\begin{array}{ll}
\underset{x \in \mathcal{X}}{\text{minimize}} & \mathcal{F}(x) \qquad (1) \\[2pt]
\text{subject to} & \mathcal{G}(x) \le 0,
\end{array}
\]
where $\mathcal{X}$ is a compact convex set, $\mathcal{F} := \mathbb{E}[F(\,\cdot\,,w)]$, $F(\,\cdot\,,w)$ is convex for each random vector $w \in \Omega$, $\mathcal{G} := \mathbb{E}[G(\,\cdot\,,w)]$, $G(\,\cdot\,,w)$ is a vector-valued function composed of convex functions for each $w \in \Omega$, $\Omega$ is a compact set, and $\mathbb{E}[\,\cdot\,]$ denotes expectation with respect to $w$. It is assumed that all functions are continuous in $\mathcal{X}$, and that Slater's condition, and hence strong duality, holds [20].

The Lagrangian of problem (1) is given by

\[
\mathcal{L}(x,\lambda) := \mathcal{F}(x) + \lambda^T \mathcal{G}(x),
\]

and the “noisy” Lagrangian is defined as

\[
L(x,\lambda,w) := F(x,w) + \lambda^T G(x,w).
\]

The vector $x^*$ denotes a primal optimal point of problem (1). Similarly, the vector $\lambda^*$ denotes a dual optimal point of problem (1). It is assumed that $\lambda^* \in \Lambda$, where
\[
\Lambda := \{\lambda \mid 0 \le \lambda \le \lambda^{\max}\mathbf{1}\},
\]
$\mathbf{1}$ is the vector of ones, and $\lambda^{\max}$ is some known positive scalar.

In the following sections, unless otherwise stated, all derivatives and subdifferentials are taken with respect to the primal variables $x$.

3 Algorithms Based on Sample-Average Approximations

A widely-used external-sampling approach for solving stochastic optimization problems of the form of (1) is to sample independent realizations of the random vector $w$, say $\{w_l\}_{l=1}^N$, where $N \in \mathbb{Z}_{++}$, and replace expected values with sample averages. The resulting deterministic problem approximation is
\[
\begin{array}{ll}
\underset{x \in \mathcal{X}}{\text{minimize}} & N^{-1}\sum_{l=1}^{N} F(x,w_l) \qquad (2) \\[2pt]
\text{subject to} & N^{-1}\sum_{l=1}^{N} G(x,w_l) \le 0.
\end{array}
\]


It can be shown that, under certain conditions and for sufficiently large $N$, the solutions of (2) are arbitrarily close to those of (1) with high probability [1] [6].

Due to the requirement of choosing $N$ sufficiently large in order to get an accurate estimate of a solution of (1), problem (2) can be difficult to solve with standard (deterministic) optimization algorithms. This is particularly true when $F(\,\cdot\,,w)$ or $G(\,\cdot\,,w)$ are computationally expensive to evaluate, e.g., when they are defined in terms of optimal values of other optimization problems that depend on both $x$ and $w$. In this case, a common strategy for solving (2) is to use decomposition. One widely-used decomposition approach is based on cutting planes and consists of first expressing (2) as
\[
\begin{array}{ll}
\underset{x \in \mathcal{X}}{\text{minimize}} & \tilde{F}(x) + N^{-1}\sum_{l=1}^{N} \hat{F}(x,w_l) \\[2pt]
\text{subject to} & \tilde{G}(x) + N^{-1}\sum_{l=1}^{N} \hat{G}(x,w_l) \le 0,
\end{array}
\]
where $\tilde{F}$, $\tilde{G}$ and $\hat{F}(\,\cdot\,,w)$, $\hat{G}(\,\cdot\,,w)$ denote the deterministic and computationally expensive sampled parts of the objective and constraints, respectively, and then performing the following step at each iteration $k \in \mathbb{Z}_+$:
\[
x_k = \operatorname{argmin}\Big\{ \tilde{F}(x) + \max_{0 \le j < k}\big\{ a_j + b_j^T(x - x_j) \big\} \;\Big|\; x \in \mathcal{X},\; \tilde{G}_i(x) + \max_{0 \le j < k}\big\{ c_{ij} + d_{ij}^T(x - x_j) \big\} \le 0,\; \forall i \Big\}, \qquad (3)
\]
where
\[
a_j = N^{-1}\sum_{l=1}^{N} \hat{F}(x_j,w_l), \qquad b_j \in N^{-1}\sum_{l=1}^{N} \partial \hat{F}(x_j,w_l),
\]
\[
c_{ij} = N^{-1}\sum_{l=1}^{N} \hat{G}_i(x_j,w_l), \qquad d_{ij} \in N^{-1}\sum_{l=1}^{N} \partial \hat{G}_i(x_j,w_l),
\]
and $\hat{G}_i(x,w)$ and $\tilde{G}_i(x)$ denote the $i$-th components of $\hat{G}(x,w)$ and $\tilde{G}(x)$, respectively. Some key properties of this algorithm are that the functions
\[
\max_{0 \le j < k}\big\{ a_j + b_j^T(x - x_j) \big\} \quad \text{and} \quad \max_{0 \le j < k}\big\{ c_{ij} + d_{ij}^T(x - x_j) \big\}
\]
are gradually-improving piecewise-linear under-estimators of $N^{-1}\sum_{l=1}^{N}\hat{F}(\,\cdot\,,w_l)$ and $N^{-1}\sum_{l=1}^{N}\hat{G}_i(\,\cdot\,,w_l)$, respectively, and that parallelization can be easily exploited for computing $a_j$, $b_j$, $c_{ij}$ and $d_{ij}$ efficiently. The interested reader is referred to [7] and [8] for more details about decomposition techniques based on cutting planes.
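To make the cutting-plane step (3) concrete, the following is a minimal sketch on a one-dimensional toy problem, written with CVXPY (the modeling library also used for the experiments in Section 6.6). All problem data here, i.e., $\tilde{F}$, $\hat{F}$, $\hat{G}$, the sample size and the bounds, are hypothetical stand-ins and not related to the test problem of Section 6.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N = 200
w = rng.normal(size=N)                        # sampled realizations of w

ftilde = lambda x: x**2                       # deterministic part of the objective
fhat = lambda x, wl: (x - wl)**2              # sampled part of the objective
ghat = lambda x, wl: wl - x                   # sampled part of the constraint

xj = [0.0]                                    # cut points x_j
for k in range(1, 15):
    # sample-average values and subgradients at the cut points
    a = [np.mean([fhat(xp, wl) for wl in w]) for xp in xj]
    b = [np.mean([2.0*(xp - wl) for wl in w]) for xp in xj]
    c = [np.mean([ghat(xp, wl) for wl in w]) for xp in xj]
    d = [-1.0 for xp in xj]
    # master problem (3): minimize ftilde plus objective cuts,
    # subject to the constraint cuts and the set X = [-5, 5]
    x = cp.Variable()
    obj_cuts = cp.max(cp.hstack([a[j] + b[j]*(x - xj[j]) for j in range(k)]))
    con_cuts = cp.max(cp.hstack([c[j] + d[j]*(x - xj[j]) for j in range(k)]))
    prob = cp.Problem(cp.Minimize(ftilde(x) + obj_cuts),
                      [con_cuts <= 0, -5 <= x, x <= 5])
    prob.solve()
    xj.append(float(x.value))

print("final cutting-plane iterate:", xj[-1])
```

Each pass adds one cut per function at the latest iterate, so the piecewise-linear under-estimators improve monotonically, as described above.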

4 Primal-Dual Stochastic Approximation Algorithm

As already mentioned, the authors in [13] and [14] describe an internal-sampling primal-dual stochastic approximation algorithm for solving problems of the form of (1). It consists of performing the following steps at each iteration $k \in \mathbb{Z}_+$:
\[
\begin{aligned}
x_{k+1} &= \Pi_x\big(x_k - \alpha_k \xi_k\big) \qquad &(4a)\\
\lambda_{k+1} &= \Pi_\lambda\big(\lambda_k + \alpha_k G(x_k,w_k)\big), \qquad &(4b)
\end{aligned}
\]
where $\xi_k \in \partial L(x_k,\lambda_k,w_k)$, $\alpha_k$ are scalar step lengths, $\Pi_x$ is the projection onto $\mathcal{X}$, $\Pi_\lambda$ is the projection onto $\Lambda$, and $w_k$ are independent identically distributed (i.i.d.) samples of the random vector $w$. Since $\xi_k$ is a noisy subgradient of $\mathcal{L}(\,\cdot\,,\lambda_k)$ at $x_k$, and $G(x_k,w_k)$ is a noisy gradient of $\mathcal{L}(x_k,\,\cdot\,)$ at $\lambda_k$, i.e.,
\[
\mathbb{E}[\xi_k \mid W_{k-1}] \in \partial \mathcal{L}(x_k,\lambda_k) \quad \text{and} \quad \mathbb{E}[G(x_k,w_k) \mid W_{k-1}] = \nabla_\lambda \mathcal{L}(x_k,\lambda_k),
\]
where $W_k$ denotes the observation history $(w_0,\ldots,w_k)$, it is clear that this algorithm performs stochastic subgradient descent to minimize the Lagrangian with respect to $x$, and stochastic gradient ascent to maximize the Lagrangian with respect to $\lambda$.

In [13], the authors provide a convergence proof for this algorithm under certain conditions based on the "ODE method". These conditions include, among others, $\mathcal{F}$ being strictly convex and continuously differentiable, the components of $\mathcal{G}$ being convex and continuously differentiable, and the existence of a point $x$ such that $\mathcal{G}(x) < 0$.
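As an illustration, the following is a minimal sketch of updates (4a)–(4b) on a hypothetical scalar problem: minimize $\mathbb{E}[(x-w)^2]$ subject to $\mathbb{E}[w-x] \le 0$ over $x \in [-5,5]$, with $w \sim \mathcal{N}(1,1)$. The problem data are illustrative; only the update rule follows algorithm (4).

```python
import numpy as np

rng = np.random.default_rng(0)
x, lam = 0.0, 0.0
lam_max = 10.0

for k in range(5000):
    alpha = 1.0 / (k + 50)                   # step lengths satisfying Assumption 3
    wk = rng.normal(loc=1.0)                 # i.i.d. sample of w
    xi = 2.0*(x - wk) - lam                  # xi_k, subgradient of L(., lam_k, w_k)
    Gk = wk - x                              # noisy constraint value G(x_k, w_k)
    x = float(np.clip(x - alpha*xi, -5.0, 5.0))          # (4a)
    lam = float(np.clip(lam + alpha*Gk, 0.0, lam_max))   # (4b)

print(f"x = {x:.3f}, lambda = {lam:.3f}")    # x should approach E[w] = 1
```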

5 Primal-Dual Stochastic Hybrid Approximation Algorithm

In [16], the authors propose an internal-sampling stochastic hybrid approximation procedure for solving stochastic optimization problems of the form
\[
\underset{x \in \mathcal{X}}{\text{minimize}} \;\; \mathbb{E}[F(x,w)], \qquad (5)
\]
where $\mathcal{X}$ and $F(\,\cdot\,,w)$ have the same properties as those of problem (1). The procedure consists of generating iterates
\[
x_k = \operatorname{argmin}\{ F_k(x) \mid x \in \mathcal{X} \}
\]
at each iteration $k \in \mathbb{Z}_+$, where the $F_k$ are strongly convex deterministic differentiable approximations of $\mathbb{E}[F(\,\cdot\,,w)]$. The first approximation $F_0$ is provided by the user, and the rest are obtained by performing "noisy" slope corrections at each iteration using
\[
F_{k+1}(x) = F_k(x) + \alpha_k\big(g_k - \nabla F_k(x_k)\big)^T x,
\]
where $g_k \in \partial F(x_k,w_k)$, $w_k$ are i.i.d. samples of the random vector $w$, and $\alpha_k$ are step lengths that satisfy conditions that are common in stochastic approximation algorithms [14] [21]. The slope corrections performed at each iteration are "noisy" because they are based on subgradients associated with specific realizations of the random vector. Roughly speaking, the conditions satisfied by the step lengths ensure that this noise is eventually averaged out. The authors show that the iterates $x_k$ converge almost surely to an optimal point of (5) under the stated and other minor assumptions. They claim that the strengths of this approach are its ability to exploit an accurate initial approximation $F_0$, and the fact that the slope corrections do not change the structure of the approximations.

In [2], the authors apply this stochastic hybrid approximation procedure to a two-stage stochastic generator dispatch problem that arises in power systems with high penetration of renewable energy. Motivated by the fact that certainty-equivalent models are common in power systems operations planning, the authors use the function $F_0 = F(\,\cdot\,,\mathbb{E}[w])$ as the initial approximation. The results obtained show how the algorithm outperforms common SAA-based algorithms on the problem considered due to its ability to exploit the accuracy of the given initial function approximation. The algorithm is also shown to be less susceptible to noise compared to stochastic gradient descent, and this is attributed to an interesting connection between the algorithm and a stochastic proximal gradient algorithm when applied to the problem considered.

In order to apply the stochastic hybrid approximation procedure to solve problem (1), which also has expectation functions in the constraints as opposed to only in the objective, a primal-dual extension is proposed here. The proposed algorithm works with deterministic approximations of the Lagrangian, and combines the stochastic hybrid approximation procedure with dual stochastic gradient ascent, the latter motivated by algorithm (4). More specifically, the proposed algorithm consists of performing the following steps at each iteration $k \in \mathbb{Z}_+$:
\[
\begin{aligned}
x_k &= \operatorname{argmin}\{ F_k(x) + \lambda_k^T G_k(x) \mid x \in \mathcal{X} \} \qquad &(6a)\\
\lambda_{k+1} &= \Pi_\lambda\big(\lambda_k + \alpha_k G_k\big) \qquad &(6b)\\
F_{k+1}(x) &= F_k(x) + \alpha_k\big(g_k - g_k(x_k)\big)^T x \qquad &(6c)\\
G_{k+1}(x) &= G_k(x) + \alpha_k\big(J_k - J_k(x_k)\big)x, \qquad &(6d)
\end{aligned}
\]
where $F_k$ and $G_k$ are deterministic differentiable approximations of the stochastic functions $\mathcal{F}$ and $\mathcal{G}$, respectively, $G_k := G(x_k,w_k)$, $g_k \in \partial F(x_k,w_k)$, $g_k(\,\cdot\,) := \nabla F_k(\,\cdot\,)$, $J_k$ is a matrix whose rows are subgradients (transposed) of the components of $G(\,\cdot\,,w_k)$ at $x_k$, $J_k(\,\cdot\,) := \frac{\partial}{\partial x}G_k(\,\cdot\,)$, $\alpha_k$ are scalar step lengths, and $w_k$ are i.i.d. samples of the random vector $w$. Step (6a) generates primal iterates by minimizing deterministic approximations of the Lagrangian. Step (6b) updates the dual iterates using noisy measurements of feasibility, namely, $G_k = G(x_k,w_k)$. Lastly, steps (6c) and (6d) perform slope corrections on the deterministic approximations $F_k$ and $G_k$, respectively, using noisy subgradients of the stochastic function $\mathcal{F}$ and of the components of $\mathcal{G}$.

5.1 Convergence

Almost-sure convergence of a subsequence of the primal iterates produced by algorithm (6) to an optimal point of problem (1) is shown under certain assumptions. The assumptions include the problem-specific ones stated in Section 2, as well as the additional ones stated below. The proof uses approximate primal and dual solutions $x_k^\sigma$ and $\lambda_k^\sigma$, respectively, of problem (1) that solve the regularized problem
\[
\begin{array}{ll}
\underset{x \in \mathcal{X}}{\text{minimize}} & \mathcal{F}(x) + \dfrac{\sigma}{2}\|x - x_k\|_2^2 \qquad (7) \\[2pt]
\text{subject to} & \mathcal{G}(x) \le 0
\end{array}
\]
for $\sigma > 0$ and $k \in \mathbb{Z}_+$. The proof also uses the approximate Lagrangian defined by
\[
\mathcal{L}_k(x,\lambda) := F_k(x) + \lambda^T G_k(x).
\]

Assumption 1 (Initial Function Approximations) The initial objective function approximation $F_0$ is strongly convex and continuously differentiable in $\mathcal{X}$. The components of the initial constraint function approximation $G_0$ are convex and continuously differentiable in $\mathcal{X}$.

Assumption 2 (Sampled Subgradients) The subgradients $g_k$ and matrices of subgradients $J_k$ defined above are uniformly bounded for all $k \in \mathbb{Z}_+$.

Assumption 3 (Step Lengths) The step lengths $\alpha_k$ lie inside the open interval $(0,1)$, and satisfy $\sum_{k=0}^{\infty} \alpha_k = \infty$ almost surely and $\sum_{k=0}^{\infty} \mathbb{E}[\alpha_k^2] < \infty$.

Assumption 4 (Approximate Dual Solutions) For each $\sigma > 0$ there exists a constant $M > 0$ such that the approximate dual solutions $\lambda_k^\sigma$ satisfy
\[
\|\lambda_{k+1}^\sigma - \lambda_k^\sigma\|_2 \le M\big(\|x_{k+1} - x_k\|_2 + \|x_{k+1}^\sigma - x_k^\sigma\|_2\big)
\]
for all $k \in \mathbb{Z}_+$. Furthermore, for all $\sigma > 0$ small enough, $\lambda_k^\sigma \in \Lambda$ for all $k \in \mathbb{Z}_+$.

Assumption 5 (Drift) For all $\sigma > 0$ small enough, the sequences of partial sums
\[
\sum_{k=0}^{K} (\lambda_{k+1} - \lambda_k)^T \Gamma_k, \qquad \sum_{k=0}^{K} (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T \Psi_k, \qquad \sum_{k=0}^{K} (x_{k+1}^\sigma - x_k^\sigma)^T \Upsilon_k, \qquad (8)
\]
where $\Gamma_k := G_k(x_k^\sigma) - G_k(x_k)$, $\Psi_k := \lambda_k^\sigma - \lambda_k$ and $\Upsilon_k := \nabla \mathcal{L}_k(x_k^\sigma,\lambda_k)$, are almost-surely bounded above.

Assumptions 1, 2, and 3, made on the initial function approximations, subgradients, and step lengths, respectively, follow those made in [16]. Assumption 4 is reasonable in view of (7), the definition of $\Lambda$, and $\lambda^* \in \Lambda$. Assumption 5 is more obscure, and hence Section 5.2 is devoted to providing an intuitive justification for why it may hold in practice. Furthermore, the examples in Section 5.3 show the behavior of the partial sums in (8) when using initial function approximations and dual bounds $\lambda^{\max}$ of varying quality.

Lemma 1 The functions $F_k$ and $G_k$ and their derivatives are uniformly Lipschitz for all $k \in \mathbb{Z}_+$, and hence also uniformly bounded in $\mathcal{X}$.


Proof From (6d), the function $G_k$ can be expressed as
\[
G_k(x) = G_0(x) + R_k x, \quad \forall x \in \mathcal{X}, \qquad (9)
\]
for $k \in \mathbb{Z}_+$, where $R_k$ is a matrix and $R_0 = 0$. It follows that for all $k \in \mathbb{Z}_+$ and $x \in \mathcal{X}$,
\[
\begin{aligned}
G_{k+1}(x) &= G_k(x) + \alpha_k\big(J_k - J_k(x_k)\big)x \\
&= G_0(x) + R_k x + \alpha_k\big(J_k - J_0(x_k) - R_k\big)x \\
&= G_0(x) + \big((1-\alpha_k)R_k + \alpha_k\big(J_k - J_0(x_k)\big)\big)x.
\end{aligned}
\]
Hence
\[
R_{k+1} = (1-\alpha_k)R_k + \alpha_k\big(J_k - J_0(x_k)\big).
\]
From Assumptions 1 and 2, there is some constant $M_1$ such that
\[
\|J_k - J_0(x_k)\|_2 \le M_1
\]
for all $k \in \mathbb{Z}_+$. Hence, Assumption 3 gives that $\|R_k\|_2 \le M_1$ implies
\[
\begin{aligned}
\|R_{k+1}\|_2 &= \big\|(1-\alpha_k)R_k + \alpha_k\big(J_k - J_0(x_k)\big)\big\|_2 \\
&\le (1-\alpha_k)\|R_k\|_2 + \alpha_k\|J_k - J_0(x_k)\|_2 \\
&\le (1-\alpha_k)M_1 + \alpha_k M_1 = M_1.
\end{aligned}
\]
Since $R_0 = 0$, it holds by induction that $\|R_k\|_2 \le M_1$ for all $k \in \mathbb{Z}_+$. It follows that for any $x$ and $y \in \mathcal{X}$ and $k \in \mathbb{Z}_+$,
\[
\begin{aligned}
\|G_k(x) - G_k(y)\|_2 &= \|G_0(x) + R_k x - G_0(y) - R_k y\|_2 \\
&\le \|R_k\|_2\|x - y\|_2 + \|G_0(x) - G_0(y)\|_2 \\
&\le (M_1 + M_2)\|x - y\|_2,
\end{aligned}
\]
where $M_2 > 0$ exists because Assumption 1 implies that $G_0$ is Lipschitz in the compact set $\mathcal{X}$.

From (9), the Jacobian of $G_k$ is given by
\[
J_k(x) = J_0(x) + R_k, \quad \forall x \in \mathcal{X},
\]
for each $k \in \mathbb{Z}_+$. Assumption 1 and $\|R_k\|_2 \le M_1$ imply that the functions $J_k$, $k \in \mathbb{Z}_+$, are uniformly Lipschitz in $\mathcal{X}$.

The derivation of these properties for the functions $F_k$, $k \in \mathbb{Z}_+$, is similar and hence omitted. $\square$

Lemma 2 The functions $F_k$, $k \in \mathbb{Z}_+$, are uniformly strongly convex.

Proof From (6c) and Lemma 1, $F_k$ can be expressed as
\[
F_k(x) = F_0(x) + r_k^T x, \quad \forall x \in \mathcal{X},
\]
for $k \in \mathbb{Z}_+$, where the $r_k$ are uniformly bounded vectors with $r_0 = 0$. Assumption 1 gives that $F_0$ is strongly convex, and therefore the $F_k$ are uniformly strongly convex, since their strong-convexity constant is independent of $k$. $\square$


Lemma 3 There exists an $M > 0$ such that $\|x_{k+1} - x_k\|_2 \le \alpha_k M$ for all $k \in \mathbb{Z}_+$. For convenience, we say that $\|x_{k+1} - x_k\|_2$ is $O(\alpha_k)$.

Proof Equation (6a) implies that
\[
\begin{aligned}
\big(g_k(x_k) + J_k(x_k)^T \lambda_k\big)^T (x - x_k) &\ge 0, \quad \forall x \in \mathcal{X} \qquad &(10)\\
\big(g_{k+1}(x_{k+1}) + J_{k+1}(x_{k+1})^T \lambda_{k+1}\big)^T (x - x_{k+1}) &\ge 0, \quad \forall x \in \mathcal{X}, \qquad &(11)
\end{aligned}
\]
for all $k \in \mathbb{Z}_+$. In particular, (10) holds with $x = x_{k+1}$, and (11) holds with $x = x_k$. Hence,
\[
\big(g_{k+1}(x_{k+1}) - g_k(x_k)\big)^T \delta_k \le \big(J_k(x_k)^T \lambda_k - J_{k+1}(x_{k+1})^T \lambda_{k+1}\big)^T \delta_k,
\]
where $\delta_k := x_{k+1} - x_k$. Using (6c) and (6d) gives
\[
\big(g_k(x_{k+1}) - g_k(x_k)\big)^T \delta_k \le \big(J_k(x_k)^T \lambda_k - J_k(x_{k+1})^T \lambda_{k+1}\big)^T \delta_k - \alpha_k\big(\Delta_k^T \lambda_{k+1} + \eta_k\big)^T \delta_k, \qquad (12)
\]
where
\[
\begin{aligned}
\eta_k &:= g_k - g_k(x_k) \qquad &(13)\\
\Delta_k &:= J_k - J_k(x_k). \qquad &(14)
\end{aligned}
\]
From Lemma 2, $F_k$ is uniformly strongly convex, so there exists a $C > 0$ independent of $k$ such that
\[
C\|\delta_k\|_2^2 \le \big(g_k(x_{k+1}) - g_k(x_k)\big)^T \delta_k
\]
for all $k \in \mathbb{Z}_+$. It follows from this and (12) that
\[
\begin{aligned}
C\|\delta_k\|_2^2 &\le \big(\lambda_k^T J_k(x_k) - \lambda_{k+1}^T J_k(x_{k+1})\big)\delta_k - \alpha_k\big(\Delta_k^T \lambda_{k+1} + \eta_k\big)^T \delta_k \\
&\le \big(\lambda_k^T J_k(x_k) - \lambda_{k+1}^T J_k(x_{k+1})\big)\delta_k + \alpha_k M_1 \|\delta_k\|_2 \qquad (15)
\end{aligned}
\]
for all $k \in \mathbb{Z}_+$, where $M_1 > 0$ exists because of the uniform boundedness of $\lambda_k$ (by construction), and that of $\Delta_k$ and $\eta_k$ (by Assumption 2 and Lemma 1). From the convexity of the component functions of $G_k$ and the fact that $\lambda_k \ge 0$ and $\lambda_{k+1} \ge 0$, it holds that
\[
\begin{aligned}
\lambda_{k+1}^T \big(G_k(x_k) - G_k(x_{k+1})\big) &\ge -\lambda_{k+1}^T J_k(x_{k+1})\delta_k \\
\lambda_k^T \big(G_k(x_{k+1}) - G_k(x_k)\big) &\ge \lambda_k^T J_k(x_k)\delta_k.
\end{aligned}
\]
Adding these inequalities gives
\[
\begin{aligned}
\big(\lambda_k^T J_k(x_k) - \lambda_{k+1}^T J_k(x_{k+1})\big)\delta_k &\le (\lambda_{k+1} - \lambda_k)^T \big(G_k(x_k) - G_k(x_{k+1})\big) \\
&\le \|\lambda_{k+1} - \lambda_k\|_2 M_2 \|\delta_k\|_2 \\
&\le \alpha_k \|G_k\|_2 M_2 \|\delta_k\|_2 \\
&\le \alpha_k M_2 M_3 \|\delta_k\|_2,
\end{aligned}
\]
where $M_2 > 0$ and $M_3 > 0$ exist due to the uniform boundedness of $G_k$ and Lemma 1. It follows from this and (15) that
\[
C\|\delta_k\|_2^2 \le \alpha_k(M_1 + M_2 M_3)\|\delta_k\|_2,
\]
or equivalently,
\[
\|\delta_k\|_2 \le \alpha_k(M_1 + M_2 M_3)/C
\]
for all $k \in \mathbb{Z}_+$. $\square$

Corollary 1 For each $\sigma > 0$, $\|x_{k+1}^\sigma - x_k^\sigma\|_2$ and $\|\lambda_{k+1}^\sigma - \lambda_k^\sigma\|_2$ are $O(\alpha_k)$.

Proof For each $\sigma > 0$ and $k \in \mathbb{Z}_+$, the optimality of $x_k^\sigma$ implies that there exists a $g_k^\sigma \in \partial\mathcal{F}(x_k^\sigma)$ such that
\[
\big(g_k^\sigma + \sigma(x_k^\sigma - x_k)\big)^T (x - x_k^\sigma) \ge 0
\]
for all $x \in \mathcal{X}$ such that $\mathcal{G}(x) \le 0$. It follows from this that
\[
\begin{aligned}
\big(g_k^\sigma + \sigma(x_k^\sigma - x_k)\big)^T (x_{k+1}^\sigma - x_k^\sigma) &\ge 0 \\
\big(g_{k+1}^\sigma + \sigma(x_{k+1}^\sigma - x_{k+1})\big)^T (x_k^\sigma - x_{k+1}^\sigma) &\ge 0.
\end{aligned}
\]
Adding these inequalities, rearranging, and using $(g_{k+1}^\sigma - g_k^\sigma)^T (x_{k+1}^\sigma - x_k^\sigma) \ge 0$ gives
\[
\begin{aligned}
\sigma\|x_{k+1}^\sigma - x_k^\sigma\|_2^2 &\le \sigma(x_{k+1} - x_k)^T (x_{k+1}^\sigma - x_k^\sigma) + (g_k^\sigma - g_{k+1}^\sigma)^T (x_{k+1}^\sigma - x_k^\sigma) \\
&\le \sigma\|x_{k+1} - x_k\|_2 \|x_{k+1}^\sigma - x_k^\sigma\|_2.
\end{aligned}
\]
It follows from this and Lemma 3 that $\|x_{k+1}^\sigma - x_k^\sigma\|_2$ is $O(\alpha_k)$. Assumption 4 then gives that $\|\lambda_{k+1}^\sigma - \lambda_k^\sigma\|_2$ is also $O(\alpha_k)$. $\square$

Lemma 4 For all $\sigma > 0$ small enough, there exists an $M > 0$ such that
\[
\frac{1}{2}\|\lambda_{k+1} - \lambda_{k+1}^\sigma\|_2^2 \le \frac{1}{2}\|\lambda_k - \lambda_k^\sigma\|_2^2 + (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T \Psi_k + \alpha_k(\lambda_k - \lambda_k^\sigma)^T G_k + \alpha_k^2 M
\]
for all $k \in \mathbb{Z}_+$, where $\Psi_k$ is as defined in Assumption 5.

Proof From equation (6b), the fact that $\lambda_k^\sigma \in \Lambda$ for small enough $\sigma$, Corollary 1, and the uniform boundedness of $G_k$, it follows that there exist positive scalars $M_1$ and $M_2$ independent of $k$ such that
\[
\begin{aligned}
\frac{1}{2}\|\lambda_{k+1} - \lambda_{k+1}^\sigma\|_2^2 &= \frac{1}{2}\big\|\Pi_\lambda\big(\lambda_k + \alpha_k G_k\big) - \lambda_{k+1}^\sigma\big\|_2^2 \\
&\le \frac{1}{2}\|\lambda_k + \alpha_k G_k - \lambda_{k+1}^\sigma\|_2^2 \\
&= \frac{1}{2}\|\lambda_k - \lambda_{k+1}^\sigma\|_2^2 + \alpha_k(\lambda_k - \lambda_{k+1}^\sigma)^T G_k + \frac{1}{2}\|\alpha_k G_k\|_2^2 \\
&\le \frac{1}{2}\|\lambda_k - \lambda_{k+1}^\sigma\|_2^2 + \alpha_k(\lambda_k - \lambda_k^\sigma)^T G_k + \alpha_k^2 M_1 \\
&\le \frac{1}{2}\|\lambda_k - \lambda_k^\sigma\|_2^2 + (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T \Psi_k + \alpha_k(\lambda_k - \lambda_k^\sigma)^T G_k + \alpha_k^2 M_2. \qquad \square
\end{aligned}
\]


Lemma 5 For all $\sigma > 0$ small enough, the sequence $\{T_k^\sigma\}_{k\in\mathbb{Z}_+}$ defined by
\[
T_k^\sigma := F_k(x_k^\sigma) + \lambda_k^T G_k(x_k^\sigma) - F_k(x_k) - \lambda_k^T G_k(x_k) + \frac{1}{2}\|\lambda_k - \lambda_k^\sigma\|_2^2
\]
satisfies
\[
\begin{aligned}
T_{k+1}^\sigma - T_k^\sigma \le{}& (\lambda_{k+1} - \lambda_k)^T \Gamma_k + (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T \Psi_k + (x_{k+1}^\sigma - x_k^\sigma)^T \Upsilon_k \\
&+ \alpha_k \Big( L(x_k^\sigma,\lambda_k^\sigma,w_k) - L(x_k,\lambda_k^\sigma,w_k) + \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \Big) \\
&+ \alpha_k (\lambda_k - \lambda_k^\sigma)^T G(x_k^\sigma,w_k) + O(\alpha_k^2)
\end{aligned}
\]
for all $k \in \mathbb{Z}_+$, where $\Gamma_k$, $\Psi_k$ and $\Upsilon_k$ are as defined in Assumption 5.

Proof From the definition of $T_k^\sigma$, it follows that
\[
\begin{aligned}
T_{k+1}^\sigma - T_k^\sigma ={}& F_{k+1}(x_{k+1}^\sigma) + \lambda_{k+1}^T G_{k+1}(x_{k+1}^\sigma) - F_{k+1}(x_{k+1}) - \lambda_{k+1}^T G_{k+1}(x_{k+1}) \\
&- F_k(x_k^\sigma) - \lambda_k^T G_k(x_k^\sigma) + F_k(x_k) + \lambda_k^T G_k(x_k) \\
&+ \frac{1}{2}\|\lambda_{k+1} - \lambda_{k+1}^\sigma\|_2^2 - \frac{1}{2}\|\lambda_k - \lambda_k^\sigma\|_2^2.
\end{aligned}
\]
Using equations (6c) and (6d), and adding and subtracting $\lambda_k^T G_k(x_{k+1})$ as well as $\lambda_k^T G_k(x_{k+1}^\sigma)$ on the right-hand side of the above equation, gives
\[
\begin{aligned}
T_{k+1}^\sigma - T_k^\sigma ={}& F_k(x_k) + \lambda_k^T G_k(x_k) - F_k(x_{k+1}) - \lambda_k^T G_k(x_{k+1}) \\
&+ F_k(x_{k+1}^\sigma) + \lambda_k^T G_k(x_{k+1}^\sigma) - F_k(x_k^\sigma) - \lambda_k^T G_k(x_k^\sigma) \\
&+ (\lambda_{k+1} - \lambda_k)^T \big(G_k(x_{k+1}^\sigma) - G_k(x_{k+1})\big) \\
&+ \alpha_k \eta_k^T (x_{k+1}^\sigma - x_{k+1}) + \alpha_k \lambda_{k+1}^T \Delta_k (x_{k+1}^\sigma - x_{k+1}) \\
&+ \frac{1}{2}\|\lambda_{k+1} - \lambda_{k+1}^\sigma\|_2^2 - \frac{1}{2}\|\lambda_k - \lambda_k^\sigma\|_2^2,
\end{aligned}
\]
where $\eta_k$ and $\Delta_k$ are defined in (13) and (14), respectively. From (6a), it holds that
\[
F_k(x_k) + \lambda_k^T G_k(x_k) - F_k(x_{k+1}) - \lambda_k^T G_k(x_{k+1}) \le 0.
\]
Using this, Lemma 4, the convexity of $\mathcal{L}_k(\,\cdot\,,\lambda_k)$, and the definitions of $\eta_k$ and $\Delta_k$ gives
\[
\begin{aligned}
T_{k+1}^\sigma - T_k^\sigma \le{}& (\lambda_{k+1} - \lambda_k)^T \big(G_k(x_{k+1}^\sigma) - G_k(x_{k+1})\big) + (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T (\lambda_k^\sigma - \lambda_k) \\
&+ (x_{k+1}^\sigma - x_k^\sigma)^T \nabla\mathcal{L}_k(x_{k+1}^\sigma,\lambda_k) \\
&+ \alpha_k (g_k + J_k^T \lambda_{k+1})^T (x_{k+1}^\sigma - x_{k+1}) - \alpha_k \big(g_k(x_k) + J_k(x_k)^T \lambda_{k+1}\big)^T (x_{k+1}^\sigma - x_{k+1}) \\
&+ \alpha_k (\lambda_k - \lambda_k^\sigma)^T G_k + O(\alpha_k^2).
\end{aligned}
\]
From equation (6b) and the uniform boundedness of $G_k$, it holds that $\|\lambda_{k+1} - \lambda_k\|_2$ is $O(\alpha_k)$. From Lemma 3 and Corollary 1, it holds that $\|x_{k+1} - x_k\|_2$, $\|x_{k+1}^\sigma - x_k^\sigma\|_2$ and $\|\lambda_{k+1}^\sigma - \lambda_k^\sigma\|_2$ are $O(\alpha_k)$. From these bounds, Lemma 1, and Assumption 2, it follows that
\[
\begin{aligned}
T_{k+1}^\sigma - T_k^\sigma \le{}& (\lambda_{k+1} - \lambda_k)^T \Gamma_k + (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T \Psi_k + (x_{k+1}^\sigma - x_k^\sigma)^T \Upsilon_k \\
&+ \alpha_k (g_k + J_k^T \lambda_k)^T (x_k^\sigma - x_k) - \alpha_k \big(g_k(x_k) + J_k(x_k)^T \lambda_k\big)^T (x_k^\sigma - x_k) \\
&+ \alpha_k (\lambda_k - \lambda_k^\sigma)^T G_k + O(\alpha_k^2),
\end{aligned}
\]
where $\Gamma_k$, $\Psi_k$ and $\Upsilon_k$ are as defined in Assumption 5. From equation (6a), it holds that
\[
\big(g_k(x_k) + J_k(x_k)^T \lambda_k\big)^T (x_k^\sigma - x_k) \ge 0.
\]
Using this and the fact that $g_k + J_k^T \lambda_k$ is a subgradient of $L(x,\lambda_k,w_k) + \frac{\sigma}{2}\|x - x_k\|_2^2$ at $x_k$ gives
\[
\begin{aligned}
T_{k+1}^\sigma - T_k^\sigma \le{}& (\lambda_{k+1} - \lambda_k)^T \Gamma_k + (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T \Psi_k + (x_{k+1}^\sigma - x_k^\sigma)^T \Upsilon_k \\
&+ \alpha_k \Big( L(x_k^\sigma,\lambda_k,w_k) - L(x_k,\lambda_k,w_k) + \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \Big) \\
&+ \alpha_k (\lambda_k - \lambda_k^\sigma)^T G_k + O(\alpha_k^2).
\end{aligned}
\]
Then, adding and subtracting $\alpha_k (\lambda_k^\sigma)^T G(x_k^\sigma,w_k)$, and using $G_k = G(x_k,w_k)$ and the definition of $L(\,\cdot\,,\,\cdot\,,w_k)$, gives
\[
\begin{aligned}
T_{k+1}^\sigma - T_k^\sigma \le{}& (\lambda_{k+1} - \lambda_k)^T \Gamma_k + (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T \Psi_k + (x_{k+1}^\sigma - x_k^\sigma)^T \Upsilon_k \\
&+ \alpha_k \Big( L(x_k^\sigma,\lambda_k^\sigma,w_k) - L(x_k,\lambda_k^\sigma,w_k) + \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \Big) \\
&+ \alpha_k (\lambda_k - \lambda_k^\sigma)^T G(x_k^\sigma,w_k) + O(\alpha_k^2). \qquad \square
\end{aligned}
\]

Lemma 6 For all $\sigma > 0$ small enough, the sequence $\{x_k - x_k^\sigma\}_{k\in\mathbb{Z}_+}$ has a subsequence that converges to zero almost surely.

Proof From Lemma 5, for all $\sigma > 0$ small enough it holds that
\[
\begin{aligned}
T_{k+1}^\sigma - T_k^\sigma \le{}& (\lambda_{k+1} - \lambda_k)^T \Gamma_k + (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T \Psi_k + (x_{k+1}^\sigma - x_k^\sigma)^T \Upsilon_k \\
&+ \alpha_k \Big( L(x_k^\sigma,\lambda_k^\sigma,w_k) - L(x_k,\lambda_k^\sigma,w_k) + \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \Big) \\
&+ \alpha_k (\lambda_k - \lambda_k^\sigma)^T G(x_k^\sigma,w_k) + O(\alpha_k^2)
\end{aligned}
\]
for all $k \in \mathbb{Z}_+$. Hence, summing over $k$ from $0$ to $K$ gives
\[
\begin{aligned}
T_{K+1}^\sigma - T_0^\sigma \le{}& \sum_{k=0}^{K} (\lambda_{k+1} - \lambda_k)^T \Gamma_k + \sum_{k=0}^{K} (\lambda_{k+1}^\sigma - \lambda_k^\sigma)^T \Psi_k + \sum_{k=0}^{K} (x_{k+1}^\sigma - x_k^\sigma)^T \Upsilon_k \\
&+ \sum_{k=0}^{K} \alpha_k \Big( L(x_k^\sigma,\lambda_k^\sigma,w_k) - L(x_k,\lambda_k^\sigma,w_k) + \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \Big) \\
&+ \sum_{k=0}^{K} \alpha_k (\lambda_k - \lambda_k^\sigma)^T G(x_k^\sigma,w_k) + \sum_{k=0}^{K} O(\alpha_k^2).
\end{aligned}
\]
From this and Assumptions 3 and 5, there exists an $M > 0$ independent of $K$ such that
\[
T_{K+1}^\sigma - T_0^\sigma - M \le \sum_{k=0}^{K} \alpha_k \Big( L(x_k^\sigma,\lambda_k^\sigma,w_k) - L(x_k,\lambda_k^\sigma,w_k) + \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \Big) + \sum_{k=0}^{K} \alpha_k (\lambda_k - \lambda_k^\sigma)^T G(x_k^\sigma,w_k)
\]
for all $K$ almost surely. By rearranging, taking expectations, and using the properties $(\lambda_k^\sigma)^T \mathcal{G}(x_k^\sigma) = 0$ and $\lambda_k^T \mathcal{G}(x_k^\sigma) \le 0$, which follow from the fact that the primal-dual point $(x_k^\sigma,\lambda_k^\sigma)$ solves (7), it follows that
\[
\sum_{k=0}^{K} \mathbb{E}\Big[\alpha_k \Big( \mathcal{L}(x_k,\lambda_k^\sigma) - \mathcal{L}(x_k^\sigma,\lambda_k^\sigma) - \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \Big)\Big] \le M + \mathbb{E}[T_0^\sigma] - \mathbb{E}[T_{K+1}^\sigma].
\]
Taking $K \to \infty$ and using the boundedness of the terms on the right-hand side gives
\[
\sum_{k=0}^{\infty} \mathbb{E}\Big[\alpha_k \Big( \mathcal{L}(x_k,\lambda_k^\sigma) - \mathcal{L}(x_k^\sigma,\lambda_k^\sigma) - \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \Big)\Big] < \infty. \qquad (16)
\]
Again, since $(x_k^\sigma,\lambda_k^\sigma)$ solves (7),
\[
\begin{aligned}
\mathcal{L}(x_k,\lambda_k^\sigma) &= \mathcal{F}(x_k) + (\lambda_k^\sigma)^T \mathcal{G}(x_k) \\
&= \mathcal{F}(x_k) + (\lambda_k^\sigma)^T \mathcal{G}(x_k) + \frac{\sigma}{2}\|x_k - x_k\|_2^2 \\
&\ge \mathcal{F}(x_k^\sigma) + (\lambda_k^\sigma)^T \mathcal{G}(x_k^\sigma) + \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \\
&= \mathcal{L}(x_k^\sigma,\lambda_k^\sigma) + \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2,
\end{aligned}
\]
and hence
\[
\mathcal{L}(x_k,\lambda_k^\sigma) - \mathcal{L}(x_k^\sigma,\lambda_k^\sigma) - \frac{\sigma}{2}\|x_k^\sigma - x_k\|_2^2 \ge 0.
\]
It follows from this last inequality, (16), and Assumption 3 that there exists a subsequence $\{(x_{n_k}, x_{n_k}^\sigma, \lambda_{n_k}^\sigma)\}_{k\in\mathbb{Z}_+}$ such that
\[
\mathcal{L}(x_{n_k},\lambda_{n_k}^\sigma) - \mathcal{L}(x_{n_k}^\sigma,\lambda_{n_k}^\sigma) - \frac{\sigma}{2}\|x_{n_k}^\sigma - x_{n_k}\|_2^2 \to 0 \quad \text{as } k \to \infty \qquad (17)
\]
almost surely. Since $(x_k^\sigma,\lambda_k^\sigma)$ solves (7), there exists an $\eta_k^\sigma \in \partial\mathcal{L}(x_k^\sigma,\lambda_k^\sigma)$ such that
\[
\big(\eta_k^\sigma + \sigma(x_k^\sigma - x_k)\big)^T (x - x_k^\sigma) \ge 0
\]
for all $x \in \mathcal{X}$, and hence
\[
(\eta_{n_k}^\sigma)^T (x_{n_k} - x_{n_k}^\sigma) \ge \sigma\|x_{n_k}^\sigma - x_{n_k}\|_2^2
\]
for all $k \in \mathbb{Z}_+$. It follows from this that
\[
\begin{aligned}
\mathcal{L}(x_{n_k},\lambda_{n_k}^\sigma) - \mathcal{L}(x_{n_k}^\sigma,\lambda_{n_k}^\sigma) - \frac{\sigma}{2}\|x_{n_k}^\sigma - x_{n_k}\|_2^2 &\ge (\eta_{n_k}^\sigma)^T (x_{n_k} - x_{n_k}^\sigma) - \frac{\sigma}{2}\|x_{n_k}^\sigma - x_{n_k}\|_2^2 \\
&\ge \sigma\|x_{n_k}^\sigma - x_{n_k}\|_2^2 - \frac{\sigma}{2}\|x_{n_k}^\sigma - x_{n_k}\|_2^2 \\
&= \frac{\sigma}{2}\|x_{n_k}^\sigma - x_{n_k}\|_2^2.
\end{aligned}
\]
This and (17) then give that $(x_{n_k}^\sigma - x_{n_k}) \to 0$ as $k \to \infty$ almost surely. $\square$

Theorem 1 The sequence $\{x_k\}_{k\in\mathbb{Z}_+}$ of primal iterates produced by algorithm (6) has a subsequence that converges almost surely to an optimal point of problem (1).

Proof Since $x_k^\sigma$ is an optimal primal point of (7), it follows that $\mathcal{G}(x_k^\sigma) \le 0$ and
\[
\begin{aligned}
\mathcal{F}(x_k^\sigma) &\le \mathcal{F}(x^*) + \frac{\sigma}{2}\big(\|x^* - x_k\|_2^2 - \|x_k^\sigma - x_k\|_2^2\big) \\
&\le \mathcal{F}(x^*) + \sigma M_1
\end{aligned}
\]
for all $k \in \mathbb{Z}_+$ and $\sigma > 0$, where $x^*$ is an optimal primal point of problem (1), and $M_1 > 0$ exists due to $x^*$, $x_k^\sigma$ and $x_k$ all being inside the compact convex set $\mathcal{X}$. Letting $m_{-1} := 0$, it follows from the above inequalities, Lemma 6, and the continuity of $\mathcal{F}$ and $\mathcal{G}$ in $\mathcal{X}$, that for each $k \in \mathbb{Z}_+$ there exist $m_k > m_{k-1}$ and $\sigma_k > 0$ small enough with probability one such that
\[
\begin{aligned}
|\mathcal{F}(x_{m_k}) - \mathcal{F}(x_{m_k}^{\sigma_k})| &\le 1/k \\
\|\mathcal{G}(x_{m_k}) - \mathcal{G}(x_{m_k}^{\sigma_k})\|_2 &\le 1/k \\
\mathcal{F}(x_{m_k}^{\sigma_k}) &\le \mathcal{F}(x^*) + M_1/k.
\end{aligned}
\]
It follows that the resulting sequence $\{x_{m_k}\}_{k\in\mathbb{Z}_+}$, which is a subsequence of $\{x_k\}_{k\in\mathbb{Z}_+}$, satisfies
\[
\begin{aligned}
\mathcal{F}(x_{m_k}) &= \mathcal{F}(x_{m_k}^{\sigma_k}) + \mathcal{F}(x_{m_k}) - \mathcal{F}(x_{m_k}^{\sigma_k}) \\
&\le \mathcal{F}(x_{m_k}^{\sigma_k}) + |\mathcal{F}(x_{m_k}) - \mathcal{F}(x_{m_k}^{\sigma_k})| \\
&\le \mathcal{F}(x_{m_k}^{\sigma_k}) + 1/k \\
&\le \mathcal{F}(x^*) + (M_1 + 1)/k
\end{aligned}
\]
and
\[
\begin{aligned}
\mathcal{G}(x_{m_k})^T e_i &= \mathcal{G}(x_{m_k}^{\sigma_k})^T e_i + \big(\mathcal{G}(x_{m_k}) - \mathcal{G}(x_{m_k}^{\sigma_k})\big)^T e_i \\
&\le \mathcal{G}(x_{m_k}^{\sigma_k})^T e_i + \|\mathcal{G}(x_{m_k}) - \mathcal{G}(x_{m_k}^{\sigma_k})\|_2 \\
&\le 1/k
\end{aligned}
\]
for each $k$, where $e_i$ is any standard basis vector. These inequalities and the boundedness of $\{x_{m_k}\}_{k\in\mathbb{Z}_+}$ imply that the sequence $\{x_k\}_{k\in\mathbb{Z}_+}$ has a subsequence that converges almost surely to a point $x \in \mathcal{X}$ that satisfies $\mathcal{G}(x) \le 0$ and $\mathcal{F}(x) = \mathcal{F}(x^*)$. $\square$

5.2 Drift Assumption

Assumption 5 states that for all $\sigma > 0$ small enough, the sequences of partial sums given in (8) are almost-surely bounded above. This assumption is only needed for Lemma 6 and Theorem 1. To provide a practical justification for this assumption, it is enough to consider the scalar case, i.e., $\lambda_k \in \mathbb{R}$ and $x_k \in \mathbb{R}$.

By the compactness of $\Lambda$ and $\mathcal{X}$, and Assumption 4, the sequences $\{\lambda_k\}$, $\{\lambda_k^\sigma\}$ and $\{x_k^\sigma\}$ are bounded. From this and Lemma 1, it holds that the sequences $\{\Gamma_k\}$, $\{\Psi_k\}$ and $\{\Upsilon_k\}$ are also bounded. Furthermore, equation (6b), the uniform boundedness of $G_k$, and Corollary 1 give that there exists an $M_1 > 0$ such that
\[
|\lambda_{k+1} - \lambda_k| \le \alpha_k M_1, \qquad |\lambda_{k+1}^\sigma - \lambda_k^\sigma| \le \alpha_k M_1, \qquad |x_{k+1}^\sigma - x_k^\sigma| \le \alpha_k M_1
\]
for all $k \in \mathbb{Z}_+$. Similarly, these inequalities, equations (6c) and (6d), the boundedness of $\{\lambda_k\}$, $\{\lambda_k^\sigma\}$ and $\{x_k^\sigma\}$, Lemma 1, and Assumption 2 imply that there exists an $M_2 > 0$ such that
\[
|\Gamma_{k+1} - \Gamma_k| \le \alpha_k M_2, \qquad |\Psi_{k+1} - \Psi_k| \le \alpha_k M_2, \qquad |\Upsilon_{k+1} - \Upsilon_k| \le \alpha_k M_2
\]
for all $k \in \mathbb{Z}_+$. Hence, the partial sums in (8) (in the scalar case) are all of the form
\[
S_n := \sum_{k=1}^{n} (a_{k+1} - a_k) b_k,
\]
where $\{a_k\}$ and $\{b_k\}$ are bounded, and $|a_{k+1} - a_k| \le \alpha_k M$ and $|b_{k+1} - b_k| \le \alpha_k M$ for all $k \in \mathbb{Z}_+$ for some $M > 0$. Unfortunately, these properties of $\{a_k\}$ and $\{b_k\}$ alone, which can be assumed to be non-negative without loss of generality, are not enough to ensure that the partial sums $S_n$ are bounded above in the general case, as counterexamples can be constructed. However, due to Assumption 3, all counterexamples require a high degree of synchronization between $\{a_k\}$ and $\{b_k\}$, in order to make $b_k$ consistently (and infinitely often) favor increases in $a_k$, i.e., $a_{k+1} - a_k > 0$, over decreases. Due to this strong synchronization requirement, it seems reasonable to expect that in practical applications the sequences of partial sums in (8) are bounded above, and hence that Assumption 5 holds.


5.3 Illustrative Examples

Two simple deterministic two-dimensional examples are used to illustrate the key properties of algorithm (6). In particular, they are used to show the effects of the quality of the initial function approximation $G_0$ and of the dual bound $\lambda^{\max}$ (used for defining $\Lambda$ in Section 2) on the performance of the algorithm, and to illustrate the geometric ideas behind the algorithm. The effects of the quality of $F_0$ are similar to or less severe than those of $G_0$, and hence are not shown. Both examples considered are problems of the form

\[
\begin{array}{lll}
\underset{x_1,x_2}{\text{minimize}} & \mathcal{F}(x_1,x_2) & (18a) \\[2pt]
\text{subject to} & \mathcal{G}(x_1,x_2) \le 0 & (18b) \\[2pt]
& x^{\min} \le x_i \le x^{\max}, \quad i \in \{1,2\}. & (18c)
\end{array}
\]
The first example consists of the Quadratic Program (QP) obtained using
\[
\begin{aligned}
\mathcal{F}(x_1,x_2) &= x_1^2 + x_2^2 \\
\mathcal{G}(x_1,x_2) &= -0.5x_1 - x_2 + 2 \\
(x^{\min}, x^{\max}) &= (-6, 6).
\end{aligned}
\]
The primal optimal point of the resulting problem is $(x_1^*, x_2^*) = (0.8, 1.6)$, and the dual optimal point associated with constraint (18b) is $\lambda^* = 3.2$. For the objective function, the initial approximation
\[
F_0(x_1,x_2) = 0.3(x_1 - 1)^2 + 2(x_2 - 0.5)^2
\]
is used. For the constraint function, the initial approximations
\[
\begin{aligned}
G_0^1(x_1,x_2) &= -1.5x_1 - 2x_2 \\
G_0^2(x_1,x_2) &= 3x_1 + x_2 \\
G_0^3(x_1,x_2) &= 4x_1 + 9x_2
\end{aligned}
\]
are used, which can be considered "good", "poor", and "very poor", respectively, according to how their gradients compare against that of the exact function at the optimal point. Note that for this case, the function approximations are of the same form as the exact functions, i.e., the objective function approximation is quadratic and the constraint function approximations are linear. The algorithm parameters used in this example are $\lambda_0 = \lambda^*/8$ and $\alpha_k = 1/(k+3)$.

Figure 1 shows the primal error, dual error, and drift as a function of the iterations of the algorithm for the different initial constraint approximations and dual bounds. The drift shown is the maximum of the partial sums in (8) using $\sigma = 1$. As the figure shows, the primal and dual errors approach zero relatively fast and the drift remains small when using $G_0^1$ and $G_0^2$. On the other hand, when using the "very poor" approximation $G_0^3$ and a loose dual bound, the primal error, dual error and drift all become large during the initial iterations of the algorithm. In this case, a very large number of iterations is required by the algorithm to approach the solution. Compared to this case, the performance of the algorithm is significantly improved when using a tighter dual bound. This is because the better bound prevents the dual iterates from continuing to grow, which allows the slope corrections to "catch up" and make the constraint approximation more consistent with the exact function.

Fig. 1 Algorithm performance on QP

Figure 2 shows the contours of the constraint and objective function approximations during certain iterations using $G_0^2$, as well as those of the exact functions. The gradients of these functions are shown at the optimal primal points associated with the functions. As the figure shows, the initial approximations (top left) are poor, since their gradients point in directions opposite to those of the exact functions (bottom right). As the number of iterations increases, the slope corrections performed by the algorithm improve the direction of these gradients and hence the quality of the iterates. An important observation is that the shape of the contours does not change, since only the slope and not the curvature of the function approximations is affected. For example, the gradient of the objective function approximation at the solution matches well that of the exact function, but its contours are elliptical and not circular.

Fig. 2 Contours of objective and constraint approximations of QP using $G_0^2$ and $\lambda^{\max} = 14$

The second example consists of the Quadratically Constrained Quadratic Program (QCQP) obtained using
\[
\begin{aligned}
\mathcal{F}(x_1,x_2) &= x_1^2 + x_2^2 \\
\mathcal{G}(x_1,x_2) &= (x_1 - 1.8)^2 + (x_2 - 1.4)^2 - 1 \\
(x^{\min}, x^{\max}) &= (-6, 6).
\end{aligned}
\]
The primal optimal point for this problem is $(x_1^*, x_2^*) = (1, 0.8)$, and the dual optimal point associated with constraint (18b) is $\lambda^* = 1.3$. The initial objective function approximation used is
\[
F_0(x_1,x_2) = 0.3(x_1 - 1)^2 + 2(x_2 - 0.5)^2,
\]
and the initial constraint function approximations used are
\[
\begin{aligned}
G_0^1(x_1,x_2) &= -1.5x_1 - 2x_2 \\
G_0^2(x_1,x_2) &= -1.5x_1 + 3x_2 \\
G_0^3(x_1,x_2) &= 2x_1 + 4x_2,
\end{aligned}
\]
which again can be considered "good", "poor", and "very poor", respectively. Note that for this case, only the objective function approximation is of the same form as the exact function. The initial constraint approximations are not, since they are linear while the exact function is quadratic. The algorithm parameters used in this example are $\lambda_0 = \lambda^*/6$ and $\alpha_k = 1/(k+3)$.

Figures 3 and 4 show the performance of the algorithm and the evolution of the function approximations, respectively, on the second example. The results obtained are similar to those obtained on the first example. That is, the algorithm can perform well with reasonable initial function approximations. With very poor initial function approximations, the primal error, dual error and drift can all increase significantly during the initial iterations until the approximations become reasonable, resulting in the algorithm requiring a large number of iterations to approach the solution. Furthermore, when the initial function approximation is very poor, the quality of the dual bound plays an important role in the performance of the algorithm.

Fig. 3 Algorithm performance on QCQP


Fig. 4 Contours of objective and constraint approximations of QCQP using $G_0^2$ and $\lambda^{\max} = 8$

6 Application: Optimal Risk-Averse Generator Dispatch

In this section, the performance of the algorithms described in Sections 3, 4 and 5 is compared on a stochastic optimization problem that contains expectation functions in both the objective and constraints. This problem originates in power systems operations planning under high penetration of renewable energy, where optimal risk-averse power generation policies are sought. Mathematically, the problem is formulated as a two-stage stochastic optimization problem with quadratic objectives and a CVaR constraint that limits the risk of high second-stage costs.

In electric power systems, which are also known as power networks or grids, the output powers of generators are typically planned in advance, e.g., hours ahead. The reason for this is that low-cost generators are typically slow and hence require some time to change output levels. Uncertainty in the operating conditions of a system, such as power injections from renewable energy sources, must be taken into account during operations planning. This is because real-time imbalances in power production and consumption are undesirable, as they must be resolved with resources that typically have a high cost, such as fast-ramping generators or load curtailments. Furthermore, due to the large quantities of money involved in the operation of a power system and society's dependence on reliable electricity supply, risk-averse operation policies are desirable. The problem of determining optimal risk-averse power generation policies hence consists of determining generator output powers in advance that minimize expected operation cost, while ensuring that the risk of high balancing costs is below an acceptable limit.


6.1 Problem Formulation

Mathematically, a simplified version of the problem of determining optimal risk-averse power generation policies can be formulated as
\[
\begin{array}{lll}
\underset{p \in \mathcal{P}}{\text{minimize}} & \phi_0(p) + \mathbb{E}[Q(p,r)] & (19a) \\[2pt]
\text{subject to} & \mathrm{CVaR}_\gamma\big(Q(p,r) - Q^{\max}\big) \le 0, & (19b)
\end{array}
\]
where $p \in \mathbb{R}^{n_p}$ is a vector of planned generator output powers, $n_p$ is the number of generators, $r \in \mathbb{R}_+^{n_r}$ is a vector of uncertain renewable powers available in the near future, say one hour ahead, $n_r$ is the number of renewable energy sources, $\phi_0$ is a strongly convex separable quadratic function that quantifies planned generation cost, $Q(p,r)$ quantifies the real-time power balancing cost for a given plan of generator outputs and realization of the uncertainty, $Q^{\max}$ is a positive constant that defines a desirable limit on the balancing cost, the set
\[
\mathcal{P} := \{p \mid p^{\min} \le p \le p^{\max}\}
\]
represents the power limits of the generators, and $\mathrm{CVaR}_\gamma$ is Conditional Value-At-Risk with parameter $\gamma \in [0,1]$. CVaR is a coherent risk measure that quantifies dangers beyond the Value-At-Risk (VaR) [22], and is given by the formula
\[
\mathrm{CVaR}_\gamma(Z) = \inf_{t \in \mathbb{R}} \Big( t + \frac{1}{1-\gamma}\,\mathbb{E}\big[(Z - t)_+\big] \Big), \qquad (20)
\]
for any random variable $Z$, where $(\,\cdot\,)_+ := \max\{\,\cdot\,,0\}$ [1] [23]. An important property of this risk measure is that constraint (19b) is a convex conservative approximation (when $Q(\,\cdot\,,r)$ is convex) of the chance constraint
\[
\mathrm{Prob}\big(Q(p,r) \le Q^{\max}\big) \ge \gamma,
\]
and hence it ensures that the balancing cost is below the predefined limit with a desired probability [23].
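For intuition, the following is a minimal sketch of estimating $\mathrm{CVaR}_\gamma(Z)$ from samples via formula (20); the infimum over $t$ is attained at the $\gamma$-quantile of $Z$ (the VaR), which the sketch plugs in directly. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=100_000)         # samples of the random variable Z
gamma = 0.95

t = np.quantile(Z, gamma)            # VaR_gamma(Z), a minimizer of (20)
cvar = t + np.mean(np.maximum(Z - t, 0.0)) / (1.0 - gamma)
print(f"VaR = {t:.3f}, CVaR = {cvar:.3f}")   # about 1.645 and 2.063 for N(0,1)
```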

The function $Q(p,r)$ is given by the optimal objective value of a second-stage optimization problem that represents real-time balancing of the power system:
\[
\begin{array}{lll}
\underset{q,\theta,s}{\text{minimize}} & \phi_1(q) & (21a) \\[2pt]
\text{subject to} & Y(p+q) + Rs - A\theta - Dd = 0 & (21b) \\[2pt]
& p^{\min} \le p + q \le p^{\max} & (21c) \\[2pt]
& z^{\min} \le J\theta \le z^{\max} & (21d) \\[2pt]
& 0 \le s \le r. & (21e)
\end{array}
\]
Here, $\phi_1$ is a strongly convex separable quadratic function that quantifies the cost of power balancing adjustments, $q \in \mathbb{R}^{n_p}$ are the power adjustments, which for simplicity are treated as changes in planned generator powers, $\theta \in \mathbb{R}^{n_b}$ are node voltage phase angles, $n_b$ is the number of non-slack nodes, $s \in \mathbb{R}^{n_r}$ are the powers from renewable sources after possible curtailments, $d \in \mathbb{R}^{n_d}$ are load power consumptions, $n_d$ is the number of loads, $Y$, $R$, $A$, $D$, $J$ are sparse matrices, constraint (21b) enforces conservation of power using a DC power flow model [24], constraint (21c) enforces generator power limits, constraint (21d) enforces edge flow limits due to thermal ratings of transmission lines and transformers, and constraint (21e) enforces renewable power curtailment limits. Under certain assumptions, which are made here, the functions $Q(\,\cdot\,,r)$ and $\mathbb{E}[Q(\,\cdot\,,r)]$ are convex and differentiable [2].

Using the definition (20) of CVaR and adding bounds for the variable $t$, the first-stage problem (19) can be reformulated as
\[
\begin{array}{lll}
\underset{p,\,t}{\text{minimize}} & \mathbb{E}\big[\phi_0(p) + Q(p,r)\big] & (22a) \\[2pt]
\text{subject to} & \mathbb{E}\big[(1-\gamma)t + (Q(p,r) - Q^{\max} - t)_+\big] \le 0 & (22b) \\[2pt]
& (p,t) \in \mathcal{P} \times \mathcal{T}, & (22c)
\end{array}
\]
where $\mathcal{T} := [t^{\min}, t^{\max}]$. In order to avoid affecting the problem, $t^{\max}$ can be set to zero, while $t^{\min}$ can be set to a sufficiently large negative number. The resulting problem is clearly of the form of (1).

6.2 Uncertainty

As already noted, the random vector $r$ represents the available powers from renewable energy sources in the near future, say one hour ahead. It is modeled here by the expression
\[
r = \Pi_r(r_0 + \delta),
\]
where $r_0$ is a vector of base powers, $\delta \sim \mathcal{N}(0,\Sigma)$ are perturbations, $\Pi_r$ is the projection operator onto the set
\[
\mathcal{R} := \{r \mid 0 \le r \le r^{\max}\},
\]
and the vector $r^{\max}$ represents the capacities of the renewable energy sources.

The use of Gaussian random variables for modeling renewable energy uncertainty is common in the power systems operations planning literature [25] [26]. To model spatial correlation between nearby renewable sources, a sparse non-diagonal covariance matrix $\Sigma$ is considered here. The off-diagonal entries of this matrix that correspond to pairs of renewable energy sources that are "nearby" are such that the correlation coefficient between their powers equals a pre-selected value $\rho$. In the absence of geographical information, a pair of sources is considered "nearby" if the network shortest path between the nodes where they are connected is less than or equal to some pre-selected number of edges $N$. This maximum number of edges is referred to here as the correlation distance.
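The following is a minimal sketch of this sampling model for three hypothetical renewable sources; the covariance structure and capacities are illustrative stand-ins, and a dense Cholesky factor plays the role of $\Sigma^{1/2}$ (the experiments in Section 6.6 use the sparse code CHOLMOD instead).

```python
import numpy as np

rng = np.random.default_rng(0)
rmax = np.array([10.0, 10.0, 10.0])   # capacities r_max of three sources
r0 = 0.5 * rmax                       # base powers
std = 0.5 * rmax                      # perturbation standard deviations
rho = 0.05                            # correlation of "nearby" source pairs

corr = np.array([[1.0, rho, 0.0],     # sources 1-2 and 2-3 are "nearby"
                 [rho, 1.0, rho],
                 [0.0, rho, 1.0]])
Sigma = np.outer(std, std) * corr     # covariance matrix
L = np.linalg.cholesky(Sigma)         # factor playing the role of Sigma^(1/2)

delta = L @ rng.standard_normal(3)    # delta ~ N(0, Sigma)
r = np.clip(r0 + delta, 0.0, rmax)    # projection onto R = {r | 0 <= r <= rmax}
print("sampled renewable powers:", r)
```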

6.3 Noisy Functions and Subgradients

By comparing problems (1) and (22), the functions $F(\,\cdot\,,w)$ and $G(\,\cdot\,,w)$ are given by
\[
\begin{aligned}
F(x,w) &= \phi_0(p) + Q(p,r) \\
G(x,w) &= (1-\gamma)t + (Q(p,r) - Q^{\max} - t)_+
\end{aligned}
\]
for all $x = (p,t)$, where $w = r$. Subgradients of these functions are given by the expressions
\[
\begin{bmatrix} \nabla\phi_0(p) + \nabla_p Q(p,r) \\ 0 \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} 0 \\ 1-\gamma \end{bmatrix} + \mathbf{1}\{Q(p,r) - Q^{\max} - t \ge 0\} \begin{bmatrix} \nabla_p Q(p,r) \\ -1 \end{bmatrix},
\]
respectively, where $\mathbf{1}\{\,\cdot\,\}$ is the indicator function defined by
\[
\mathbf{1}\{A\} = \begin{cases} 1 & \text{if } A \text{ is true}, \\ 0 & \text{otherwise}, \end{cases}
\]
for any event or condition $A$. As shown in [2], $\nabla_p Q(p,r) = -\nabla\phi_1(q^*)$, where $q^*$ is part of the optimal point of problem (21).

6.4 Function Approximations

In order to apply the primal-dual stochastic hybrid approximation algorithm (6), initial function approximations $F_0$ and $G_0$ need to be defined. $F_0$ needs to be strongly convex and continuously differentiable, while $G_0$ needs to be component-wise convex and continuously differentiable. Motivated by the results obtained in [2], the approximations used here are also based on the certainty-equivalent formulation of problem (1), namely, the problem obtained by replacing $\mathbb{E}[f(p,r)]$ with $f(p,\bar{r})$, where $\bar{r} := \mathbb{E}[r]$, for any stochastic function $f$. More specifically, $F_0$ is defined by
\[
F_0(x) := \phi_0(p) + Q(p,\bar{r}) + \frac{1}{2}\varepsilon t^2, \qquad (23)
\]
for all $x = (p,t)$, where $\varepsilon > 0$, and $G_0$ is defined by
\[
G_0(x) := (1-\gamma)t + (Q(p,\bar{r}) - Q^{\max} - t)_+, \qquad (24)
\]
for all $x = (p,t)$. This choice of $G_0$ is differentiable everywhere except where $Q(p,\bar{r}) - Q^{\max} - t = 0$. As will be seen in the experiments, this deficiency does not cause problems on the test cases considered. For cases for which it may cause problems, e.g., cases for which $Q(p,\bar{r}) - Q^{\max} - t = 0$ occurs near or at the solution, the smooth function
\[
(1-\gamma)t + \frac{1}{\nu}\log\Big(1 + e^{\nu(Q(p,\bar{r}) - Q^{\max} - t)}\Big)
\]
may be used instead, where $\nu$ is a positive parameter.

During the $k$-th iteration of algorithm (6), the function approximations can be expressed as
\[
\begin{aligned}
F_k(x) &= F_0(x) + \zeta_k^T x \\
G_k(x) &= G_0(x) + \upsilon_k^T x,
\end{aligned}
\]
for all $x = (p,t)$, where the vectors $\zeta_k$ and $\upsilon_k$ follow the recursive relations
\[
\begin{aligned}
\zeta_{k+1} &= \zeta_k + \alpha_k\big(g_k - g_0(x_k) - \zeta_k\big) \\
\upsilon_{k+1} &= \upsilon_k + \alpha_k\big(J_k - J_0(x_k) - \upsilon_k^T\big)^T
\end{aligned}
\]
with $\zeta_0 = \upsilon_0 = 0$, and where $g_k$, $g_0$, $J_k$ and $J_0$ are as defined in Section 5.
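A minimal sketch of these recursions for the single-constraint case, where $J_k$ and $J_0(x_k)$ reduce to row vectors and the transposes can be dropped, is given below; all numerical inputs are hypothetical.

```python
import numpy as np

def slope_correction(zeta, upsilon, alpha, gk, g0xk, Jk, J0xk):
    """One update of the slope-correction vectors (single-constraint case)."""
    zeta_next = zeta + alpha * (gk - g0xk - zeta)
    upsilon_next = upsilon + alpha * (Jk - J0xk - upsilon)
    return zeta_next, upsilon_next

# example with x = (p, t) of dimension 3 + 1 = 4
zeta, upsilon = np.zeros(4), np.zeros(4)
zeta, upsilon = slope_correction(
    zeta, upsilon, alpha=0.25,
    gk=np.array([1.0, 0.5, -0.2, 0.0]),      # g_k, sampled subgradient of F
    g0xk=np.array([0.8, 0.6, -0.1, 0.0]),    # g_0(x_k), gradient of F_0
    Jk=np.array([0.3, 0.1, 0.0, -0.05]),     # J_k (one row, as a vector)
    J0xk=np.array([0.2, 0.2, 0.0, -0.05]))   # J_0(x_k)
print(zeta, upsilon)
```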

6.5 Subproblems

To perform step (6a) of the primal-dual stochastic hybrid approximation algorithm, the subproblem
\[
\underset{x \in \mathcal{X}}{\text{minimize}} \;\; F_k(x) + \lambda_k^T G_k(x)
\]
needs to be solved, where $\lambda_k \ge 0$. For problem (22) with initial function approximations (23) and (24), this subproblem is given by
\[
\begin{array}{ll}
\underset{p,\,t,\,z}{\text{minimize}} & \phi_0(p) + Q(p,\bar{r}) + \dfrac{1}{2}\varepsilon t^2 + \lambda_k(1-\gamma)t + \lambda_k z + (\zeta_k + \lambda_k \upsilon_k)^T \begin{bmatrix} p \\ t \end{bmatrix} \\[6pt]
\text{subject to} & Q(p,\bar{r}) - Q^{\max} - t - z \le 0 \\[2pt]
& (p,t) \in \mathcal{P} \times \mathcal{T} \\[2pt]
& 0 \le z.
\end{array}
\]
By embedding the definition of $Q(p,\bar{r})$, this subproblem can be reformulated as
\[
\begin{array}{lll}
\underset{p,\,t,\,q,\,\theta,\,s,\,z}{\text{minimize}} & \phi_0(p) + \phi_1(q) + \dfrac{1}{2}\varepsilon t^2 + \lambda_k(1-\gamma)t + \lambda_k z + (\zeta_k + \lambda_k \upsilon_k)^T \begin{bmatrix} p \\ t \end{bmatrix} & (25) \\[6pt]
\text{subject to} & Y(p+q) + Rs - A\theta - Dd = 0 \\[2pt]
& \phi_1(q) - Q^{\max} - t - z \le 0 \\[2pt]
& p^{\min} \le p + q \le p^{\max} \\[2pt]
& z^{\min} \le J\theta \le z^{\max} \\[2pt]
& p^{\min} \le p \le p^{\max} \\[2pt]
& t^{\min} \le t \le t^{\max} \\[2pt]
& 0 \le s \le \bar{r} \\[2pt]
& 0 \le z,
\end{array}
\]
which is a sparse QCQP.
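The following is a minimal sketch of assembling subproblem (25) with CVXPY on a tiny hypothetical two-node network; the matrices, limits, slope-correction vectors, and cost functions are illustrative stand-ins, not data from the test cases of Section 6.7.

```python
import numpy as np
import cvxpy as cp

n_p, n_r, n_b = 2, 1, 2                         # generators, sources, non-slack nodes
Y = np.array([[1.0, 0.0], [0.0, 1.0]])          # generator incidence
R = np.array([[1.0], [0.0]])                    # renewable incidence
A = np.array([[1.0, 0.0], [-1.0, 1.0]])         # reduced DC network matrix
D = np.array([[0.0], [1.0]])                    # load incidence
d = np.array([1.5])                             # load consumption
rbar = np.array([0.5])                          # certainty-equivalent renewables
pmin, pmax = np.zeros(n_p), 2.0*np.ones(n_p)
zlmin, zlmax = -5.0*np.ones(n_b), 5.0*np.ones(n_b)
Jm = np.eye(n_b)                                # flow matrix J
tmin, tmax, Qmax, gamma, eps = -1.0, 0.0, 1.0, 0.95, 1e-6
lam = 0.1                                       # current dual iterate lambda_k
zeta = np.zeros(n_p + 1)                        # slope corrections of F_k
ups = np.zeros(n_p + 1)                         # slope corrections of G_k

p, q, th = cp.Variable(n_p), cp.Variable(n_p), cp.Variable(n_b)
s, t, z = cp.Variable(n_r), cp.Variable(), cp.Variable()
phi0 = cp.sum_squares(p) + 10.0*cp.sum(p)       # hypothetical quadratic costs
phi1 = 5.0*cp.sum_squares(q)
lin = (zeta[:n_p] + lam*ups[:n_p]) @ p + (zeta[-1] + lam*ups[-1])*t
obj = phi0 + phi1 + 0.5*eps*cp.square(t) + lam*(1.0 - gamma)*t + lam*z + lin

cons = [Y @ (p + q) + R @ s - A @ th - D @ d == 0,
        phi1 - Qmax - t - z <= 0,
        pmin <= p + q, p + q <= pmax,
        zlmin <= Jm @ th, Jm @ th <= zlmax,
        pmin <= p, p <= pmax,
        tmin <= t, t <= tmax,
        0 <= s, s <= rbar,
        z >= 0]
cp.Problem(cp.Minimize(obj), cons).solve()
print("subproblem solution p:", p.value)
```

The auxiliary variable $z \ge Q(p,\bar{r}) - Q^{\max} - t$ replaces the $(\,\cdot\,)_+$ term exactly because $\lambda_k \ge 0$ makes minimizing $\lambda_k z$ push $z$ to the maximum of zero and the bracketed quantity.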

6.6 Implementation

The benchmark primal-dual stochastic approximation algorithm (4) and the new primal-dual stochastic hybrid approximation algorithm (6) were implemented in Python using the scientific computing packages NumPy and SciPy [27] [28]. The code for these algorithms has been included in the Python package OPTALG¹.

¹ https://github.com/ttinoco/OPTALG


The optimal risk-averse generation planning problem (22) was also constructed using Python and included in the Python package GRIDOPT². In particular, the C library PFNET³ was used through its Python wrapper for representing power networks and for constructing the required constraints and functions of the problem. For solving the sparse QPs (21) and sparse QCQPs (25), the modeling library CVXPY [29] was used together with the second-order cone solver ECOS [30]. The SAA-based benchmark algorithms of Section 3, namely the one that solves (2) directly and algorithm (3) that uses decomposition and cutting planes, were also implemented using CVXPY and ECOS.

The covariance matrix $\Sigma$ of the renewable power perturbations defined in Section 6.2 was constructed using routines available in PFNET. For obtaining $\Sigma^{1/2}$ to sample $\delta \sim \mathcal{N}(0,\Sigma)$, the sparse Cholesky code CHOLMOD [31] was used via its Python wrapper⁴.

² https://github.com/ttinoco/GRIDOPT
³ https://github.com/ttinoco/PFNET
⁴ http://pythonhosted.org/scikits.sparse/

6.7 Numerical Experiments

The stochastic optimization algorithms described in Sections 3, 4 and 5 were applied to two instances of the optimal risk-averse generation planning problem (22) in order to test and compare their performance. The test cases used were constructed from real power networks from North America (Canada) and Europe (Poland). Table 1 shows important information about each of the cases, including name, number of nodes in the network, number of edges, number of first-stage variables (the dimension of $(p,t)$ in (22)), number of second-stage variables (the dimension of $(q,\theta,s)$ in (21)), and the dimension of the uncertainty (the dimension of the vector $r$ in (22)).

Table 1 Information about test cases

name     nodes   edges   dim (p, t)   dim (q, θ, s)   dim r
Case A   2454    2781    212          2900            236
Case B   3012    3572    380          3688            298

The separable generation cost function ϕ0 in (22) was constructed using uniformly random coefficients in [0.01, 0.05] for the quadratic terms, and uniformly random coefficients in [10, 50] for the linear terms. These coefficients assume that power quantities are in units of MW and are consistent with those found in several MATPOWER cases [32]. For the separable generation adjustment cost function ϕ1 in (21), the coefficients of the quadratic terms were set equal to those of ϕ0 scaled by a factor greater than one, and the coefficients of the linear terms were set to zero. The scaling factor was set large enough to make balancing costs with adjustments higher than with planned powers. The coefficients of the linear terms were set to zero to penalize positive and negative power adjustments equally.

2 https://github.com/ttinoco/GRIDOPT
3 https://github.com/ttinoco/PFNET
4 http://pythonhosted.org/scikits.sparse/

Page 26: Primal-Dual Stochastic Hybrid Approximation Algorithm · 2020-05-12 · scribed. In Section 5, the new primal-dual stochastic hybrid approximation algorithm is presented along with

26 Tomas Tinoco De Rubira, Gabriela Hug

For the t-regularization added in (23) to make F0 strongly convex, ε = 10^{-6} was used.

The renewable energy sources used to construct the test cases of Table 1 were added manually to each of the power networks at the nodes with adjustable or fixed generators. The capacity r_i^{max} of each source was set to 1^T d / n_r, where d is the vector of load power consumptions and n_r is the number of renewable energy sources. The base powers were set using r_0 = 0.5 r^{max} so that the base renewable energy penetration was 50% of the load, which is a high-penetration scenario. The standard deviations of the renewable power perturbations, i.e., diag(Σ^{1/2}), were also set to 0.5 r^{max}, which corresponds to a high-variability scenario. For the off-diagonals of Σ, a correlation coefficient ρ of 0.05 and a correlation distance N of 5 edges were used.
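In sketch form, this setup amounts to the following. The thresholded correlation structure (correlation ρ between sources within N edges of each other) is our reading of the ρ/N description and is an assumption, as is the hypothetical helper dist giving the edge distance between two source nodes.

```python
import numpy as np

def build_perturbation_stats(d, n_r, dist, rho=0.05, N=5):
    """Capacities, base powers, and covariance of renewable perturbations.

    d    : vector of load power consumptions
    n_r  : number of renewable energy sources
    dist : dist(i, j) -> edge distance between sources i and j (hypothetical)
    """
    r_max = np.full(n_r, d.sum() / n_r)  # r_max_i = 1^T d / n_r
    r0 = 0.5 * r_max                     # 50% base penetration
    std = 0.5 * r_max                    # high-variability scenario
    C = np.eye(n_r)                      # correlation matrix
    for i in range(n_r):
        for j in range(i + 1, n_r):
            if dist(i, j) <= N:          # assumed thresholded correlation
                C[i, j] = C[j, i] = rho
    Sigma = np.outer(std, std) * C       # Sigma_ij = std_i std_j C_ij
    return r_max, r0, Sigma
```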

For the risk-limiting constraint (22b), the desirable balancing cost limit Q^{max} was set to 0.8 Q_0, where Q_0 := E[Q(p_0, r)] and p_0 is the optimal power generation policy for the certainty-equivalent problem without risk-limiting constraint, i.e.,

$$p_0 := \underset{p \in \mathcal{P}}{\operatorname{argmin}} \;\; \varphi_0(p) + Q(p, \mathbb{E}[r]).$$

The CVaR parameter γ defined in Section 6.1, which is related to the probability of satisfying the inequality Q(p, r) ≤ Q^{max}, was set to 0.95. For bounding the variable t in problem (22), t^{max} and t^{min} were set to 0 and −0.1 Q_0, respectively. Choosing this value of t^{min} required some experimentation. In particular, algorithms (4) and (6) were executed for a few iterations to check that t^{min} was sufficiently negative to allow for feasibility, but not too large in magnitude. The latter was needed to avoid excessively large values of G(x_k, w_k) (and hence large increases in λ_k) due to poor values of t in the initial iterations.

Each of the algorithms was applied to the test cases shown in Table 1 and its performance was recorded. In particular, the SAA cutting-plane algorithm (3) with 1000 scenarios (SAA CP 1000), the primal-dual stochastic approximation algorithm (4) (PD SA), and the primal-dual stochastic hybrid approximation algorithm (6) considering and ignoring the risk-limiting constraint (PD SHA and SHA, respectively) were executed for a fixed time period. For the PD SA algorithm, (p_0, t^{min}) was used as the initial primal solution estimate. For both PD SHA and PD SA, 0 was used as the initial dual solution estimate λ_0, and the step lengths α_k used were of the form (k_0 + k)^{-1}, where k_0 is a constant. A k_0 value of 50 was used for the PD SHA algorithm, while a k_0 value of 350 was used for the PD SA algorithm. These values were chosen to reduce the susceptibility of the algorithms to noise during the initial iterations. Only for the SAA CP 1000 algorithm, 24 parallel processes were used, in order to compute values and subgradients of the sample-average second-stage cost function efficiently.

For evaluating the progress of the algorithms as a function of time, the quantities E[F(x, w)], E[G(x, w)] and Prob(Q(p, r) ≤ Q^{max}) were evaluated using 2000 fixed scenarios at fixed time intervals. Figures 5 and 6 show the results obtained on a computer cluster at the Swiss Federal Institute of Technology (ETH) with computers equipped with Intel® Xeon® E5-2697 CPUs (2.70 GHz) and running the operating system CentOS 6.8. In the figures, the objective values shown were normalized by ϕ_0(p_0) + Q_0, while the constraint function values shown were normalized by Q_0, which was defined in the previous paragraph.
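The progress evaluation just described reduces to sample averages over the fixed scenario set. The helper below assumes hypothetical callables F(x, w), G(x, w), and Q(p, r) for the sampled objective, constraint, and second-stage cost; it is illustrative and not the actual GRIDOPT evaluation code.

```python
import numpy as np

def evaluate_progress(x, p, scenarios, F, G, Q, Q_max):
    """Estimate E[F], E[G], and Prob(Q(p, r) <= Q_max) over fixed scenarios.

    x, p      : current iterate and its generation policy
    scenarios : list of fixed samples (2000 in the experiments)
    F, G, Q   : hypothetical callables for F(x, w), G(x, w), Q(p, r)
    """
    F_vals = np.array([F(x, w) for w in scenarios])
    G_vals = np.array([G(x, w) for w in scenarios])
    Q_vals = np.array([Q(p, w) for w in scenarios])
    return F_vals.mean(), G_vals.mean(), np.mean(Q_vals <= Q_max)
```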


The dashed lines represent references and correspond to the results obtained by solving the certainty-equivalent problem ignoring the risk-limiting constraint (CE), and by the SAA-based algorithm that consists of solving problem (2) directly (without decomposition) using 1000 scenarios (SAA 1000). The times needed to compute these references are shown in Table 2. Table 3 shows the (maximum) memory requirements of the algorithms on each of the test cases.

Fig. 5 Performance of algorithms on Case A

Table 2 Computation times of references in minutes

algorithm   Case A   Case B
CE          0.0075   0.018
SAA 1000    15       92

Table 3 Memory requirements of algorithms in gigabytes

algorithm     Case A   Case B
CE            0.14     0.16
SAA 1000      8.4      13
SAA CP 1000   1.5      1.8
PD SA         0.12     0.20
PD SHA        0.14     0.16


Fig. 6 Performance of algorithms on Case B

From the plots shown in Figures 5 and 6, several important observations can be made. First, the CE solution can result in both poor expected cost and high risk of violating Q(p, r) ≤ Q^{max}. On Case A, for example, all algorithms find power generation policies that result in lower expected cost and in Q(p, r) ≤ Q^{max} being satisfied with significantly higher probability.

Second, by ignoring the risk-limiting constraint, the SHA algorithm converges fast, i.e., in a few minutes, and finds a power generation policy that results in the lowest expected cost. The probability of satisfying Q(p, r) ≤ Q^{max}, however, is only around 0.7 and 0.8 for the two cases, and hence the policy found is not sufficiently risk-averse.

Third, by considering the risk-limiting constraint, the PD SHA and PD SA algorithms produce, in only a few minutes, power generation policies that are risk-averse, i.e., Prob(Q(p, r) ≤ Q^{max}) ≥ γ. This is, of course, at the expense of higher expected costs compared with those obtained with the SHA algorithm. The iterates produced by the PD SA algorithm experience significant variations and do not appear to settle during the execution period. This is particularly visible in the plot of expected cost for both test cases, and in the plot of probability for Case B. The PD SHA algorithm, on the other hand, produces iterates that exhibit much smaller variations (despite using larger step lengths) and converge during the execution period. The expected cost and risk associated with the policy found by the PD SHA algorithm match those obtained with the reference policy found with the SAA 1000 algorithm, which requires more time and much more computer memory, as seen in Tables 2 and 3.

Lastly, the other SAA-based algorithm, namely SAA CP 1000, also produces policies whose expected cost and risk approach those obtained using the SAA 1000 algorithm. However, it requires more computer memory and, despite using 24 parallel processes, more time compared to the PD SHA algorithm.


The power generation policies associated with the last iterates produced by the primal-dual stochastic hybrid approximation algorithm considering and ignoring the risk-limiting constraint (PD SHA and SHA, respectively) were compared. The results are shown in Figure 7. The plots show the (sorted) power output differences in MW of the generators with respect to p_0, which is the solution of the certainty-equivalent problem without risk-limiting constraint (CE). As the plots show, the power generation policy obtained when ignoring the risk-limiting constraint exhibits a consistent negative bias with respect to the risk-averse policy on both test cases. This negative bias corresponds to lower planned generator powers, which is expected. The reason is that higher planned generator powers help reduce the risk of requiring generator output increases in real time when renewable power injections are low. On the other hand, when renewable power injections are high, renewable power curtailments can be used as a means for balancing instead of generator output adjustments.

Fig. 7 Power generation policies found by algorithms

7 Conclusions

In this work, a new algorithm for solving convex stochastic optimization problems with expectation functions in both the objective and constraints has been described and evaluated. The algorithm combines a stochastic hybrid procedure, which was originally designed to solve problems with expectation only in the objective, with dual stochastic gradient ascent. More specifically, the algorithm generates primal iterates by minimizing deterministic approximations of the Lagrangian that are updated using noisy subgradients, and dual iterates by applying stochastic gradient ascent to the true Lagrangian. A proof has been provided that the sequence of primal iterates produced by this algorithm has a subsequence that converges almost surely to an optimal point under certain conditions. The proof relies on a "bounded drift" assumption, for which we have provided an intuitive justification for why it may hold in practice as well as examples that illustrate its validity using different initial function approximations. Furthermore, experimental results obtained from applying the new and benchmark algorithms to instances of a stochastic optimization problem that originates in power systems operations planning under high penetration of renewable


energy have been reported. In particular, the performance of the new algorithm has been compared with that of a primal-dual stochastic approximation algorithm and algorithms based on sample-average approximations with and without decomposition. The results obtained showed that the new algorithm had superior performance compared to the benchmark algorithms on the test cases considered. In particular, the new algorithm was able to leverage accurate initial function approximations and produce iterates that approached a solution quickly, without requiring large computer memory or parallel computing resources.

Next research steps for this work include testing the new algorithm on problems with multiple expected-value constraints, extending the theoretical analysis, exploring new techniques for improving performance, and extending or modifying the algorithm to handle more complicated problems. Promising techniques for improving the performance of the algorithm include importance sampling, parallel computing, and the incorporation of curvature updates. Interesting directions for extending or modifying the algorithm include handling non-convex stochastic problems and multi-stage problems. With regard to the theory, simple and easy-to-check conditions that guarantee the bounded drift condition are needed, as well as an extension of the convergence analysis that shows when the sequence of iterates, and not just a subsequence, converges, and at what rate. Lastly, reliable termination conditions need to be investigated and tested in order to make the algorithm effective and easy to use in practice.

Acknowledgment

We thank the coordinating editor and reviewers for their constructive feedback and suggestions for improving the paper. We also acknowledge the support provided by the ETH Zurich Postdoctoral Fellowship FEL-11 15-1, which made this work possible.

References

1. W. Wang and S. Ahmed. Sample average approximation of expected value constrained stochastic programs. Operations Research Letters, 36(5):515–519, 2008.

2. T. Tinoco De Rubira and G. Hug. Adaptive certainty-equivalent approach for optimal generator dispatch under uncertainty. In European Control Conference, June 2016.

3. S. Bhatnagar, N. Hemachandra, and V.K. Mishra. Stochastic approximation algorithms for constrained optimization via simulation. ACM Transactions on Modeling and Computer Simulation, 21(3):1–22, February 2011.

4. J. Atlason, M.A. Epelman, and S.G. Henderson. Call center staffing with simulation and cutting plane methods. Annals of Operations Research, 127(1-4):333–358, 2004.

5. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.


6. T. Homem-de-Mello and G. Bayraksan. Monte Carlo sampling-based methods for stochastic optimization. Surveys in Operations Research and Management Science, 19(1):56–85, 2014.

7. J.R. Birge and H. Tang. L-shaped method for two stage problems of stochastic convex programming. Technical report, University of Michigan, Department of Industrial and Operations Engineering, August 1993.

8. R. Van Slyke and R. Wets. L-shaped linear programs with applications to optimal control and stochastic programming. SIAM Journal on Applied Mathematics, 17(4):638–663, 1969.

9. A.J. King and R.T. Rockafellar. Asymptotic theory for solutions in statistical estimation and stochastic programming. Mathematics of Operations Research, 18(1):148–162, February 1993.

10. A. Shapiro. Asymptotic behavior of optimal solutions in stochastic programming. Mathematics of Operations Research, 18(4):829–845, 1993.

11. H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.

12. J. Spall. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45(10):1839–1853, October 2000.

13. H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Applied Mathematical Sciences. Springer-Verlag, 1978.

14. H. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer-Verlag, 2003.

15. J. Zhang and D. Zheng. A stochastic primal-dual algorithm for joint flow control and MAC design in multi-hop wireless networks. In Conference on Information Sciences and Systems, pages 339–344, March 2006.

16. R.K. Cheung and W.B. Powell. SHAPE - A stochastic hybrid approximation procedure for two-stage stochastic programs. Operations Research, 48(1):73–79, 2000.

17. W.B. Powell. The optimizing-simulator: Merging simulation and optimization using approximate dynamic programming. In Winter Simulation Conference, pages 96–109, 2005.

18. W.B. Powell, A. Ruszczynski, and H. Topaloglu. Learning algorithms for separable approximations of discrete stochastic optimization problems. Mathematics of Operations Research, 29(4):814–836, 2004.

19. W.B. Powell and H. Topaloglu. Stochastic programming in transportation and logistics. In Stochastic Programming, volume 10 of Handbooks in Operations Research and Management Science, pages 555–635. Elsevier, 2003.

20. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

21. S. Bhatnagar, H. Prasad, and L. Prashanth. Stochastic Approximation Algorithms, pages 17–28. Springer London, 2013.

22. R.T. Rockafellar and S. Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking and Finance, pages 1443–1471, 2002.


23. A. Nemirovski and A. Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969–996, 2007.

24. J. Zhu. Optimization of Power System Operation. John Wiley & Sons, Inc., 2009.

25. D. Phan and S. Ghosh. Two-stage stochastic optimization for optimal power flow under renewable generation uncertainty. ACM Transactions on Modeling and Computer Simulation, 24(1):1–22, January 2014.

26. L. Roald, F. Oldewurtel, T. Krause, and G. Andersson. Analytical reformulation of security constrained optimal power flow with probabilistic constraints. In IEEE PowerTech Conference, pages 1–6, June 2013.

27. E. Jones, T. Oliphant, and P. Peterson. SciPy: Open source scientific tools for Python. http://www.scipy.org, 2001.

28. S. Van der Walt, S.C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, March 2011.

29. S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

30. A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), pages 3071–3076, 2013.

31. Y. Chen, T.A. Davis, W.W. Hager, and S. Rajamanickam. Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Transactions on Mathematical Software, 35(3):1–14, October 2008.

32. R. Zimmerman, C. Murillo-Sanchez, and R. Thomas. MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education. IEEE Transactions on Power Systems, 26(1):12–19, February 2011.