
Page 1 [source: nemirovs/BFG2009Final.pdf]

Convex Stochastic and Large-Scale Deterministic Programming via Robust Stochastic Approximation and its Extensions

Arkadi Nemirovski
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Institute of Technology

Joint research with Anatoli Juditsky†, Guanghui Lan‡, and Alexander Shapiro§
†: Joseph Fourier University, Grenoble, France; ‡: ISyE, University of Florida; §: ISyE, Georgia Tech

BFG’09, Leuven, September 14-18, 2009

Robust Stochastic Approximation and Extensions

Page 2

Overview

• Convex Stochastic Minimization and Convex-Concave Stochastic Saddle Point problems
• Classical Stochastic Approximation
• Robust Stochastic Approximation & Stochastic Mirror Prox
  – construction
  – efficiency estimates
  – how it works
• Acceleration via Randomization
  – Matrix Game & ℓ₁-minimization with uniform fit
  – ℓ₁-minimization with Least Squares fit
  – Minimizing maxima of convex polynomials over a simplex

Page 3

Convex Stochastic Minimization Problem

♣ The basic Convex Stochastic Minimization problem is

min_{x∈X} f(x)    (SM)

• X ⊂ R^n: convex compact set
• f(x): X → R: convex and Lipschitz continuous objective given by a Stochastic Oracle.
♠ At the i-th call to the oracle, the input being x ∈ X, the oracle returns a stochastic gradient G(x,ξ_i) ∈ R^n of f, with independent ξ_i ∼ P, i = 1,2,..., and

E_{ξ∼P} G(x,ξ) ∈ ∂f(x)  ∀x ∈ X.

♣ In many cases (but not always!) G(x,ξ) ∈ ∂_x F(x,ξ), where F(x,ξ) is convex in x. Here (SM) is the problem of minimizing the expectation f(x) = E F(x,ξ) of an integrand F(x,ξ), convex in x, given by a first order oracle.

Page 4

Convex-Concave Stochastic Saddle Point Problem

♣ The Convex Stochastic Minimization problem is a particular case of the Convex-Concave Stochastic Saddle Point problem

min_{x∈X} max_{y∈Y} f(x,y)    (SP)

• Z = X × Y ⊂ R^{n_x} × R^{n_y} = R^n: convex compact set
• f(x,y): X × Y → R: Lipschitz continuous, convex in x ∈ X and concave in y ∈ Y, given by a Stochastic Oracle.
♠ At the i-th call to the oracle, the input being z = (x,y) ∈ Z, the Stochastic Oracle returns a "stochastic derivative"

G(z,ξ_i) = [G_x(z,ξ_i); G_y(z,ξ_i)] ∈ R^{n_x} × R^{n_y}

of f, with independent ξ_i ∼ P, i = 1,2,..., and

F(x,y) := E_{ξ∼P} G(x,y,ξ) ∈ ∂_x f(x,y) × ∂_y(−f(x,y)).

Page 5

Classical Stochastic Approximation

min_{x∈X} f(x)  (SM)        min_{x∈X} max_{y∈Y} f(x,y)  (SP)

♣ The very first method for solving (SM) is the classical Stochastic Approximation (Robbins & Monro ’51), which mimics the usual Subgradient Descent:

x_{t+1} = argmin_{u∈X} { ½‖u − x_t‖₂² + γ_t⟨G(x_t,ξ_t), u⟩ }
⇔ x_{t+1} = closest to x_t − γ_t G(x_t,ξ_t) point of X,    γ_t = a/t

• When f is strongly convex on X, SA with properly chosen a ensures that E‖x_t − argmin_X f‖₂² ≤ O(1/t).
• If, in addition, argmin_X f ∈ int X and f′ is Lipschitz continuous, then also E f(x_t) − min_X f ≤ O(1/t).
♠ A similar construction, with similar results, works for (SP) with strongly convex-concave f.
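As an illustration (not from the talk), the classical SA recursion with γ_t = a/t can be sketched on a toy strongly convex problem; the objective, box constraint, and noise level below are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy strongly convex problem: f(x) = 0.5*||x - x_star||^2 over the box X = [-1, 1]^n,
# with stochastic gradients G(x, xi) = (x - x_star) + xi, E[xi] = 0.
n = 5
x_star = np.full(n, 0.3)

def G(x):
    return (x - x_star) + rng.normal(0.0, 0.5, size=n)

def classical_sa(a, T):
    x = np.zeros(n)
    for t in range(1, T + 1):
        x = x - (a / t) * G(x)        # "short steps" gamma_t = a / t
        x = np.clip(x, -1.0, 1.0)     # projection: closest point of X
    return x

# The modulus of strong convexity here is 1, so a = 1 is a proper choice,
# and E||x_T - x*||^2 = O(1/T).
x_T = classical_sa(a=1.0, T=20000)
```

With a mis-chosen scale factor a (say a = 0.01) the same loop converges dramatically more slowly, which is exactly the sensitivity discussed on the next slide.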

Page 6

Classical Stochastic Approximation (continued)

• Good news on Classical SA: with smooth strongly convex f attaining its minimum in int X, the rate of convergence
E f(x_t) − min_X f ≤ O(1/t)
is the best allowed under the circumstances by the Laws of Statistics.
• Bad news on Classical SA: it requires strong convexity of the objective f and is highly sensitive to the scale factor a in the stepsize rule γ_t = a/t.
• To choose a properly, we need to know the modulus of strong convexity of f, and overestimating this modulus may result in a dramatic slowdown of the convergence.
⇒ The Stochastic Optimization community commonly treats SA as a highly unstable and impractical optimization technique.

Page 7

Robust Stochastic Approximation

x_{t+1} = argmin_{u∈X} { ½‖u − x_t‖₂² + γ_t⟨G(x_t,ξ_t), u⟩ },    γ_t = a/t    (CSA)

♣ It was discovered long ago [Nem.&Yudin’79] that SA admits a robust version which does not require strong convexity of the objective in (SM). To make SA robust, it suffices
• to replace the "short steps" γ_t = a/t with "long steps" γ_t = a/√t, and
• to add to (CSA) the averaging x̄_t = [Σ_{τ=1}^t γ_τ]⁻¹ Σ_{τ=1}^t γ_τ x_τ
and to treat the averages x̄_t, rather than the search points x_t, as approximate solutions.

♠ Another important improvement (in its rudimentary form going back to [Nem.&Yudin’79]) is to replace the quadratic prox term ½‖u − x_t‖₂² in (CSA) with a more general prox term, which allows one to adjust, to some extent, the method to the "geometry" of the problem of interest.
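The two modifications (long steps and weighted averaging) can be sketched on a toy non-strongly-convex problem; the objective, constraint set, noise level, and the constant a below are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Non-strongly-convex toy objective f(x) = ||x - x_star||_1 over X = [-1, 1]^n,
# with stochastic subgradients G(x, xi) = sign(x - x_star) + xi.
n = 5
x_star = np.full(n, 0.3)
f = lambda x: np.abs(x - x_star).sum()

def G(x):
    return np.sign(x - x_star) + rng.normal(0.0, 0.5, size=n)

def robust_sa(a, T):
    x = np.zeros(n)
    num = np.zeros(n)   # running sum of gamma_t * x_t
    den = 0.0           # running sum of gamma_t
    for t in range(1, T + 1):
        g = a / np.sqrt(t)                       # "long steps" gamma_t = a / sqrt(t)
        num += g * x
        den += g
        x = np.clip(x - g * G(x), -1.0, 1.0)     # projected stochastic subgradient step
    return num / den                             # the average, not the last iterate

x_bar = robust_sa(a=0.1, T=40000)
```

The last search point x_T keeps oscillating (the subgradient never vanishes), while the weighted average x̄_T settles near the minimizer; no strong-convexity modulus enters the stepsize rule.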

Page 8

Robust Stochastic Approximation [Jud.&Lan&Nem.&Shap.’09]

min_{x∈X} max_{y∈Y} f(x,y)
G(x,y,ξ):  E_ξ G(x,y,ξ) ∈ ∂_x f(x,y) × ∂_y[−f(x,y)]

Setup for RSA: a norm ‖·‖ on R^n ⊃ Z = X × Y and a convex continuous distance generating function ω(z): Z → R such that the set Z° = {z ∈ Z: ∂ω(z) ≠ ∅} is convex, and ω restricted to Z° is C¹ and strongly convex:
∀z, z′ ∈ Z°:  ⟨ω′(z) − ω′(z′), z − z′⟩ ≥ α‖z − z′‖²    [α > 0]

♠ Robust Stochastic Approximation:
z_1 = z_ω := argmin_Z ω(·)
z_{t+1} = argmin_{u∈Z} { [ω(u) − ⟨ω′(z_t), u⟩] + γ_t⟨G(z_t,ξ_t), u⟩ }
z̄_t = [Σ_{τ=1}^t γ_τ]⁻¹ Σ_{τ=1}^t γ_τ z_τ

Page 9

Stochastic Mirror Prox [Jud.&Nem.&Tauvel’08]

min_{x∈X} max_{y∈Y} f(x,y)
G(x,y,ξ):  E_ξ G(x,y,ξ) ∈ ∂_x f(x,y) × ∂_y[−f(x,y)]

♠ Stochastic Mirror Prox algorithm:
z_1 = z_ω := argmin_Z ω(·)
w_t = argmin_{u∈Z} { [ω(u) − ⟨ω′(z_t), u⟩] + γ_t⟨G(z_t,ξ_{2t−1}), u⟩ }
z_{t+1} = argmin_{u∈Z} { [ω(u) − ⟨ω′(z_t), u⟩] + γ_t⟨G(w_t,ξ_{2t}), u⟩ }
z̄_t = [Σ_{τ=1}^t γ_τ]⁻¹ Σ_{τ=1}^t γ_τ w_τ
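A noiseless special case of this scheme, in the Euclidean setup ω(z) = ½‖z‖₂² where each prox-step is a step followed by projection, can be sketched on an invented bilinear saddle point over unit balls (the problem data and stepsize are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# Bilinear saddle point min_{||x||<=1} max_{||y||<=1} f(x,y) = y^T A x + c^T x.
# Operator F(z) = [A^T y + c; -A x]; here the oracle is exact (sigma = 0).
n, m = 8, 6
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)

def F(x, y):
    return A.T @ y + c, -(A @ x)

def proj(v):                      # projection onto the unit Euclidean ball
    nrm = np.linalg.norm(v)
    return v if nrm <= 1.0 else v / nrm

def mirror_prox(gamma, T):
    x, y = np.zeros(n), np.zeros(m)
    xs, ys = np.zeros(n), np.zeros(m)
    for _ in range(T):
        gx, gy = F(x, y)                                       # gradient at z_t
        wx, wy = proj(x - gamma * gx), proj(y - gamma * gy)    # extrapolation point w_t
        gx, gy = F(wx, wy)                                     # gradient at w_t
        x, y = proj(x - gamma * gx), proj(y - gamma * gy)      # update z_{t+1}
        xs += wx; ys += wy                                     # average the w_t's
    return xs / T, ys / T

xb, yb = mirror_prox(gamma=0.05, T=5000)

# ErrSad(x,y) = max_{||y'||<=1} f(x,y') - min_{||x'||<=1} f(x',y),
# both inner problems having closed-form solutions over the ball.
def err_sad(x, y):
    fmax = np.linalg.norm(A @ x) + c @ x
    fmin = -np.linalg.norm(A.T @ y + c)
    return fmax - fmin
```

The two prox-steps per iteration, with the gradient re-evaluated at the extrapolation point w_t, are what buy the O(1/N) rate for the smooth part of the operator.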

Page 10

Saddle Point accuracy measure

min_{x∈X} max_{y∈Y} f(x,y)    (SP)

♣ Accuracy measure for (SP):

ErrSad(x,y) = max_{y′∈Y} f(x,y′) − min_{x′∈X} f(x′,y).

Explanation: (SP) gives rise to two optimization programs with equal optimal values:

Primal: min_{x∈X} f̄(x),  f̄(x) = max_{y∈Y} f(x,y)
Dual:  max_{y∈Y} f̲(y),  f̲(y) = min_{x∈X} f(x,y)

ErrSad(x,y) is the duality gap: the sum of the non-optimalities of x and y as approximate solutions to the respective problems.

Page 11

RSA & SMP: Efficiency Estimates

min_{x∈X} max_{y∈Y} f(x,y),  G(x,y,ξ): E_ξ G(x,y,ξ) ∈ ∂_x f(x,y) × ∂_y[−f(x,y)]    (SP)

Theorem: Assume that E{‖G(z,ξ)‖∗²} ≤ M² for all z ∈ Z = X × Y, and let N ≥ 1. The N-step RSA with properly chosen constant steps γ_t = γ, 1 ≤ t ≤ N, ensures that

E{ErrSad(z̄_N)} ≤ O(1)ΘM/√N
[ Θ = √( 2 max_{z∈Z} [ω(z) − ω(z_ω) − ⟨ω′(z_ω), z − z_ω⟩] / α ) ]

♠ With a "light tail" of ‖G(z,ξ)‖∗/M, the probabilities of large deviations in RSA admit exponential bounds:

E{exp{‖G(z,ξ)‖∗²/M²}} ≤ exp{1} ∀z ∈ Z ⇒
∀Ω > 1:  Prob{ ErrSad(z̄_N) > O(1)ΩΘM/√N } ≤ 2 exp{−Ω}

• Similar results hold true for SMP.

Page 12

Good Geometry case

♣ Let rint Z ⊂ rint Z⁺, where Z⁺ is the direct product of K "basic blocks":
• K_b unit Euclidean balls B_i ⊂ E_i = R^{n_i};
• K_s spectahedrons S_j ⊂ F_j;
  F_j: spaces of symmetric k_j × k_j matrices of a given block-diagonal structure, with the Frobenius inner product;
  S_j: the set of all positive semidefinite matrices from F_j with unit trace.
♠ A good setup for RSA/SMP is given by
• the norm ‖z‖ = √( Σ_{i=1}^{K_b} ‖z^i‖₂² + Σ_{j=1}^{K_s} ‖z^j‖_tr² )
  z^i: ball components of z; z^j: spectahedron components of z; ‖·‖_tr: trace norm of a symmetric matrix;
• the d.g.f. ω(z) = Σ_{i=1}^{K_b} ½‖z^i‖₂² + Σ_{j=1}^{K_s} Entropy(z^j)
  Entropy(u) = Σ_ℓ λ_ℓ(u) ln λ_ℓ(u), λ_ℓ(u) being the eigenvalues of a symmetric matrix u.

Page 13

Good Geometry case (continued)

♣ With the above setup, we have
Θ = O(1)√( K_b + Σ_{j=1}^{K_s} ln k_j )
⇒ E{ErrSad(z̄_N)} ≤ O(1)[K_b + Σ_{j=1}^{K_s} ln k_j]^{1/2} M/√N

Good news: When K = O(1), the rate of convergence is nearly dimension-independent and, moreover, nearly the best allowed under the circumstances by the Laws of Statistics.

♣ The computational cost of a step is dominated by the costs of a call to the SO and of a prox-step g ↦ argmin_{z∈Z} [g^T z + ω(z)].
♠ When Z = Z⁺, the cost C of a prox-step is
C = O(1)( Σ_i dim(E_i) + Σ_j C_j )
• C_j: the cost of the eigenvalue decomposition of a matrix A ∈ F_j.
♠ When Z is simple (a product of K balls and simplexes) we have C = O(1) dim Z.

Page 14

How it works: Convex Stochastic Programming

min_{x∈X⊂R^n} f(x) := E_{ξ∼P} F(x,ξ)
G(x,ξ) ∈ ∂_x F(x,ξ),  E_{ξ∼P}‖G(x,ξ)‖∗² ≤ M²    (SM)

♣ Till now, the workhorse of Convex Stochastic Minimization has been the Sample Average Approximation method. In SAA, problem (SM) is replaced with its empirical counterpart

min_x f_N(x),  f_N(x) = (1/N) Σ_{i=1}^N F(x,ξ_i)    [ξ_1,...,ξ_N: sample drawn from P]

♣ In the Good Geometry case, the SAA sample size N resulting, with probability close to 1, in an ε-solution to (SM) is O(1)n(M/ε)² [Shap.’03], which is by a factor O(n) worse than the number of steps of RSA/SMP yielding a solution of the same quality.

Page 15

How it works (continued)

min_{x∈X⊂R^n} f(x) := E_{ξ∼P} F(x,ξ)    (SM)
⇒ min_{x∈X} f_N(x) = (1/N) Σ_{i=1}^N F(x,ξ_i)    (SAA)

♣ A single call to the first order oracle for f_N is as costly as N calls to the SO.
⇒ RSA and SMP are much cheaper computationally than SAA. When Z is simple, the overall computational effort of the N-step RSA/SMP is of the same order as the effort to compute a single subgradient of f_N(·)!

♠ Our experimental results consistently demonstrate that while the solutions built by RSA and SAA with a common sample size N are of nearly the same quality, RSA outperforms SAA by orders of magnitude in terms of running time.

Page 16

How it works: Typical experiment

f∗ = min_x { f(x) = E_{ξ∼N(a,I_n)} φ(ξ^T x) : x ≥ 0, Σ_i x_i = 1 },  ‖a‖_∞ = 1
φ(·): piecewise linear convex loss function

                    n = 1000, f∗ = −5.9439            n = 5000, f∗ = −5.6066
Method     f(z̄_4000) : f(z̄_4000)−f∗    CPU      f(z̄_4000) : f(z̄_4000)−f∗    CPU
SAA        −5.9384 : 0.0055             253      −5.5967 : 0.0099             1283
N-RSA      −5.9365 : 0.0074             12       −5.5935 : 0.0131             49
E-RSA      −5.9193 : 0.0246             13       −5.5354 : 0.0712             77

• N-RSA: ‖·‖ = ‖·‖₁, ω(x) = Σ_i x_i ln x_i;  E f(z̄_N) − f∗ ≤ O(1)√(ln n)/√N
• E-RSA: ‖·‖ = ‖·‖₂, ω(x) = ½ x^T x;  E f(z̄_N) − f∗ ≤ O(1)√n/√N

Page 17

SMP vs RSA

min_{x∈X} max_{y∈Y} f(x,y)    (SP)
F(x,y) := E_ξ G(x,y,ξ) ∈ ∂_x f(x,y) × ∂_y[−f(x,y)]

Question: What, if any, are the advantages of SMP as compared to RSA?
Answer: SMP can utilize the fact that some components of F(x,y) are regular and well-observable.

♣ Assumption A:
‖F(z) − F(z′)‖∗ ≤ L‖z − z′‖ + σ  ∀z, z′ ∈ Z = X × Y
E_ξ{‖G(z,ξ) − F(z)‖∗²} ≤ σ²

Fact: Under A, the quantity M² = sup_{z∈Z} E_ξ‖G(z,ξ)‖∗² can be as large as O(1)(ΘL + σ)²
⇒ The efficiency of RSA cannot be better than

E{ErrSad(z̄_N^RSA)} ≤ O(1)Θ (ΘL + σ)/√N

even when σ = 0, i.e., f ∈ C^{1,1}(L): no noise at all!

Page 18

SMP vs RSA (continued)

min_{x∈X} max_{y∈Y} f(x,y)
F(x,y) := E_ξ G(x,y,ξ) ∈ ∂_x f(x,y) × ∂_y[−f(x,y)]
‖F(z) − F(z′)‖∗ ≤ L‖z − z′‖ + σ,  E_ξ{‖G(z,ξ) − F(z)‖∗²} ≤ σ²

Theorem: The N-step SMP with properly chosen constant stepsizes ensures that

E{ErrSad(z̄_N^SMP)} ≤ O(1)Θ[ ΘL/N + σ/√N ]    (!)

When E_ξ{exp{‖G(z,ξ) − F(z)‖∗²/σ²}} ≤ exp{1}, for all Ω > 0 one has

Prob{ ErrSad(z̄_N^SMP) > O(1)[ Θ(ΘL/N + σ/√N) + ΩΘσ/√N ] } ≤ exp{−Ω²/3} + exp{−ΩN}

Good news: When Z is the product of O(1) high-dimensional balls, (!) is the best efficiency estimate achievable under the circumstances.

Page 19

Applying SMP: Stochastic Feasibility problem

♣ We want to solve a feasible system of "simple" and "difficult" convex constraints:

S_µ(x) ≤ 0, 1 ≤ µ ≤ p
D_ν(x) ≤ 0, 1 ≤ ν ≤ q
x ∈ X = {x ∈ R^n : ‖x‖₂ ≤ 1}

• The S_µ are given by a precise First Order oracle and are smooth:
‖S′_µ(x)‖₂ ≤ L_µ,  ‖S′_µ(x) − S′_µ(x′)‖₂ ≤ √(ln(p+q)) L_µ ‖x − x′‖₂.
• The D_ν are given by a Stochastic Oracle. At the i-th call to the SO, x ∈ X being the input, the SO returns unbiased estimates h_ν(x,ξ_i), H_ν(x,ξ_i) of D_ν(x), D′_ν(x), with independent ξ_i ∼ P and

E_ξ{ max_ν h_ν²(x,ξ)/M_ν² } ≤ 1,  max_ν E_ξ{‖H_ν(x,ξ)‖₂²/M_ν²} ≤ ln(p+q).

Page 20

Stochastic Feasibility problem (continued)

S_µ(x) ≤ 0, 1 ≤ µ ≤ p;  D_ν(x) ≤ 0, 1 ≤ ν ≤ q;  x ∈ X = {x ∈ R^n : ‖x‖₂ ≤ 1}
S_µ:  ‖S′_µ(x)‖₂ ≤ L_µ,  ‖S′_µ(x) − S′_µ(x′)‖₂ ≤ √(ln(p+q)) L_µ ‖x − x′‖₂
D_ν(x) = E_ξ h_ν(x,ξ),  E_ξ{ max_ν h_ν²(x,ξ)/M_ν² } ≤ 1
D′_ν(x) = E_ξ H_ν(x,ξ),  max_ν E_ξ{‖H_ν(x,ξ)‖₂²/M_ν²} ≤ ln(p+q)

♣ Properly applying the N-step SMP to the Saddle Point problem

min_{x∈X} max_{y≥0, Σ_i y_i=1} [ Σ_{µ=1}^p y_µ α_µ S_µ(x) + Σ_{ν=1}^q y_{p+ν} α_{p+ν} D_ν(x) ]

we build a solution x_N ∈ X such that

S_µ(x_N) ≤ √(ln(p+q)) (L_µ/N) ζ,  D_ν(x_N) ≤ √(ln(p+q)) (M_ν/√N) ζ,  ζ ≥ 0, E{ζ} ≤ O(1).

Page 21

Acceleration via Randomization

♣ When solving large-scale well-structured convex optimization problems, e.g., the ℓ₁ minimization problem

min_x { ‖x‖₁ : ‖Ax − b‖_∞ ≤ δ }

with a large-scale and possibly dense matrix A,
• polynomial time methods, which generate high accuracy solutions, become prohibitively time consuming;
• when medium accuracy solutions are sought, our best choice is computationally cheap first order methods.

♠ There are interesting and important cases when first order methods can be accelerated significantly by randomization: passing from computationally demanding deterministic first order oracles to their computationally cheap stochastic counterparts.

Page 22

Example 1: Matrix Game

min_{x∈∆_n} max_{y∈∆_m} { f(x,y) := y^T A x },  ∆_k = {u ∈ R^k_+ : Σ_i u_i = 1}    (MG)
F(x,y) := [f′_x(x,y); −f′_y(x,y)] = [A^T y; −Ax]
𝒜 := max_{i,j} |A_{ij}|

• When A is large and dense, solving (MG) by polynomial time algorithms is prohibitively time consuming.
• The best deterministic First Order algorithms find an ε-solution to (MG) in O(1) ln(m+n)(𝒜/ε) steps, with O(1) computations of F(·) per step. The total effort is O(1) ln(m+n) mn (𝒜/ε) a.o.

♣ The matrix-vector multiplications ∆_m ∋ y ↦ A^T y, ∆_n ∋ x ↦ Ax required to compute F(x,y) are easy to randomize: to build an unbiased estimate of Bu = Σ_j u_j B_j (B_j: columns of B), draw at random a column index ȷ with Prob{ȷ = j} = |u_j|/‖u‖₁, and return ‖u‖₁ sign(u_ȷ) B_ȷ.
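The column-sampling estimator admits a direct sketch (the data below and the helper name random_matvec are ours); the averaging at the end illustrates unbiasedness.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_matvec(B, u):
    """Unbiased estimate of B @ u: draw a column index j with Prob{j} = |u_j| / ||u||_1
    and return ||u||_1 * sign(u_j) * B[:, j].  Cost: one column of B instead of all of B."""
    norm1 = np.abs(u).sum()
    p = np.abs(u) / norm1
    j = rng.choice(len(u), p=p)
    return norm1 * np.sign(u[j]) * B[:, j]

# Check unbiasedness by averaging many independent estimates of B @ u.
m, n = 4, 7
B = rng.standard_normal((m, n))
u = rng.standard_normal(n)
est = np.mean([random_matvec(B, u) for _ in range(50000)], axis=0)
```

Each estimate touches m entries of B rather than mn, which is the source of the sublinear-time behaviour discussed on the next slide.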

Page 23

Matrix Game (continued)

min_{x∈∆_n} max_{y∈∆_m} y^T A x,  𝒜 := max_{i,j}|A_{ij}|    (MG)

♣ Equipping (MG) with a Stochastic Oracle and applying RSA, we get, with confidence ≥ 1 − β, an ε-solution to (MG) in O(1) ln(mn/β)(𝒜/ε)² steps. A step reduces to extracting from A a randomly chosen row and column, plus O(m+n) a.o. overhead.
⇒ An ε-solution costs O(1) ln(mn/β)(m+n)(𝒜/ε)² a.o.

♠ With ε, 𝒜 fixed and m = O(n) large, RSA
• outperforms by orders of magnitude the best known deterministic alternatives;
• exhibits sublinear time behaviour: an ε-solution is found after inspecting just O(1) ln(mn/β)(m+n)(𝒜/ε)² of the total of mn data entries, cf. Grigoriadis & Khachiyan ’95.

Page 24

Application: ℓ₁ minimization with uniform fit

♣ Numerous sparsity-oriented Signal Processing problems reduce to (small series of) problems

min_u { ‖Au − b‖_p : ‖u‖₁ ≤ 1 }    [A : m × n]    (ℓ₁)

♠ When p = ∞, (ℓ₁) reduces to the Matrix Game

min_{x=[u;v]∈∆_{2n}} max_{y=[p;q]∈∆_{2m}} [p − q]^T [A − b1^T, −A − b1^T] [u; v]

⇒ RSA can be used to solve ℓ₁-minimization problems.

• When m, n are really large (like 10⁴ and more) and A is a general-type dense analytically given matrix, (ℓ₁) becomes prohibitively difficult for all known deterministic algorithms, and randomized algorithms become a kind of "last resort".

Page 25

How it works

Opt = min_u { ‖Au − b‖_∞ : ‖u‖₁ ≤ 1 }
[ A: dense analytically given m × n matrix; b: ‖Au∗ − b‖_∞ ≤ δ = 1.e-3 with 16-sparse u∗, ‖u∗‖₁ = 1 ]

                                  Errors                           CPU
m × n         Method   ‖u∗−u‖₁   ‖u∗−u‖₂   ‖u∗−u‖_∞    sec       Mult
2048×4096     DMP      0.0014    0.00052   0.00036     122.8     1770
              RSA      0.039     0.0079    0.0030      325.4     29.3
8192×32768    DMP      1.006     0.319     0.184       3141.9    5
              RSA      0.120     0.0196    0.00634     3000.5    4.7

• DMP: Deterministic Mirror Prox; u: resulting solution; last column: (equivalent) # of matrix-vector multiplications.

Page 26

How it works (continued)

[Figure: ℓ₁-recovery for the 2048×4096 (left) and 8192×32768 (right) problems. Blue: true signal; Cyan: MDSA recovery.]

Page 27

Example 2: ℓ₁-minimization with Least Squares fit

♣ The Least Squares ℓ₁ minimization problem
min_u { ‖Au − b‖₂ : ‖u‖₁ ≤ 1 }    [A : m × n]
reduces to the Saddle Point problem
min_{x∈∆_n} max_{‖y‖₂≤1} y^T[Ax − b]    [F(x,y) = [A^T y; b − Ax]]

♠ Randomizing the matrix-vector multiplications and applying RSA, we get, with confidence ≥ 1 − β, an ε-solution with an overall effort of

O(1) ln(mn/β)(√m C₂/ε)² a.o.    (∗)

• C₂: maximum of the ‖·‖₂-norms of the columns of C = [A, b]

Bad news: The factor √m in (∗) reflects the high potential noisiness of the unbiased estimate (y, ‖y‖₂ ≤ 1) ↦ ‖y‖₁ sign(y_ı)[A_{ı,:}]^T of A^T y: it may happen that

‖ ‖y‖₁ sign(y_ı)[A_{ı,:}]^T ‖_∞ ∼ √m C₂

Page 28

ℓ₁-minimization with Least Squares fit (continued)

min_{x∈∆_n} max_{‖y‖₂≤1} y^T[Ax − b]    [F(x,y) = [A^T y; b − Ax]]

Remedy: Randomized preprocessing [A,b] ↦ UD[A,b] with orthogonal U, |U_{ij}| ≤ O(m^{−1/2}), and random diagonal D with i.i.d. D_{ii} ∼ Uniform{−1; 1} results in an equivalent problem, preserves C₂, and reduces the noise in the estimate of A^T y: now, with confidence 1 − β,

‖y‖₂ ≤ 1 ⇒ ‖ ‖y‖₁ sign(y_ı)[A_{ı,:}]^T ‖_∞ ≤ O(1)√(ln(mn/β)) C₂

♠ After the preprocessing, RSA finds, with confidence ≥ 1 − β, an ε-solution to the problem at the cost of

O(1) ln²(mn/β)(C₂/ε)² a.o.

• With a properly chosen U (e.g., a Hadamard or DFT matrix), the preprocessing costs just O(1) mn ln(m) a.o.
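A minimal sketch of such a preprocessing, assuming m is a power of 2 and taking U as the normalized Walsh-Hadamard matrix (the function names fwht and preprocess are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform applied along the row axis of a,
    in O(m log m) operations per column; a.shape[0] must be a power of 2."""
    a = a.copy()
    h, m = 1, a.shape[0]
    while h < m:
        for i in range(0, m, 2 * h):
            x, y = a[i:i+h].copy(), a[i+h:i+2*h].copy()
            a[i:i+h], a[i+h:i+2*h] = x + y, x - y
        h *= 2
    return a

def preprocess(C):
    """C -> U D C with D = diag(+-1) i.i.d. and U = H_m / sqrt(m): an orthogonal
    matrix with |U_ij| = 1/sqrt(m), so column 2-norms (hence C_2) are preserved."""
    m = C.shape[0]
    D = rng.choice([-1.0, 1.0], size=m)
    return fwht(D[:, None] * C) / np.sqrt(m)

m, n = 8, 5
C = rng.standard_normal((m, n + 1))    # C = [A, b]
Cp = preprocess(C)
```

Since UD is orthogonal, ‖(UDA)u − UDb‖₂ = ‖Au − b‖₂ for every u, so the preprocessed problem is indeed equivalent to the original one.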

Page 29

Example 3: Minimizing the maximum of convex polynomials over a simplex/ℓ₁-ball

♣ Consider the optimization problem
min_{x∈∆_n} max_{1≤k≤m} p_k(x),  p_k(x) = Σ_{ℓ=0}^d A_{kℓ}[x,...,x]    (∗)
• A_{kℓ}[x¹,...,x^ℓ]: symmetric ℓ-linear forms; • p_k(·): convex.

♠ We can reduce (∗) to the Saddle Point problem
min_{x∈∆_n} max_{y∈∆_m} Σ_k y_k p_k(x)
and randomize, in the aforementioned manner, the computation of p_k(·), ∇p_k(·):
• To build unbiased estimates p̂_k(x), P̂_k(x) of p_k(x) and P_k(x) = ∇p_k(x), x ∈ ∆_n, we generate at random indices ȷ₁,...,ȷ_d with Prob{ȷ_ℓ = j} = x_j, and set

p̂_k = Σ_{ℓ=0}^d A_{kℓ}[e_{ȷ₁},...,e_{ȷ_ℓ}]
[P̂_k]_j = Σ_{ℓ=1}^d ℓ A_{kℓ}[e_{ȷ₁},...,e_{ȷ_{ℓ−1}}, e_j],  1 ≤ j ≤ n.
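For d = 2 (a single quadratic p(x) = A₀ + A₁[x] + A₂[x,x]) the randomization reads as the following sketch; the form coefficients and the test point are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Quadratic case d = 2: A0 = c (constant), A1 = b (linear), A2 = A (symmetric bilinear).
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
b = rng.standard_normal(n)
c = 0.7

def p(x):
    return c + b @ x + x @ A @ x

def p_hat(x):
    # draw indices j1, j2 independently with Prob{j = i} = x_i
    j1, j2 = rng.choice(n, size=2, p=x)
    val = c + b[j1] + A[j1, j2]        # A0 + A1[e_j1] + A2[e_j1, e_j2]
    grad = b + 2.0 * A[j1, :]          # [P_hat]_j = A1[e_j] + 2 A2[e_j1, e_j]
    return val, grad

x = rng.random(n); x /= x.sum()        # a point of the simplex
samples = [p_hat(x) for _ in range(50000)]
val_mean = np.mean([v for v, _ in samples])
grad_mean = np.mean([g for _, g in samples], axis=0)
# by unbiasedness, val_mean ≈ p(x) and grad_mean ≈ b + 2 A x
```

Each call to p_hat reads O(d) coefficients of the forms instead of evaluating the full polynomial, which is exactly the per-step cost accounting on the next slide.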

Page 30

Minimizing maxima of convex polynomials (continued)

min_{x∈∆_n} max_{1≤k≤m} Σ_{ℓ=0}^d A_{kℓ}[x,...,x]    (∗)
• 𝒜: maximum magnitude of the coefficients of the forms A_{kℓ}[·,...,·]

♠ Applying RSA, we find, with confidence ≥ 1 − β, an ε-solution to (∗) at the cost of O(1) ln(mn/β) d²(𝒜/ε)² steps, with a step reducing to
• extracting O(1)d(m+n) coefficients of the forms A_{kℓ}[·,...,·], given the "addresses" k, ℓ, ȷ₁,...,ȷ_ℓ of the coefficients, plus
• O(m + dn) a.o. overhead.
