Variance Reduction
Abstract
We consider multi-level composite optimization problems where each
mapping in the compo- sition is the expectation over a family of
randomly chosen smooth mappings or the sum of some finite number of
smooth mappings. We present a normalized proximal approximate
gradient (NPAG) method where the approximate gradients are obtained
via nested variance reduction. In order to find an approximate
stationary point where the expected norm of its gradient map- ping
is less than ε, the total sample complexity of our method is O(ε−3)
in the expectation case, and O(N +
√ Nε−2) in the finite-sum case where N is the total number of
functions across all
composition levels. In addition, the dependence of our total sample
complexity on the number of composition levels is polynomial,
rather than exponential as in previous work.
keywords: composite stochastic optimization, proximal gradient
method, variance reduction
1 Introduction
minimize x∈Rd
+ Ψ(x), (1)
where ξ1, . . . , ξm are independent random variables. We assume
that each realization of fi,ξi : Rdi−1 → Rdi is a smooth vector
mapping (with d0 = d and dm = 1) and the function Ψ is convex but
possibly nonsmooth. An interesting special case is when each ξi
follows the uniform distribution over a finite support {1, . . . ,
Ni}, i.e.,
minimize x∈Rd
minimize x∈Rd
†Work done at Microsoft Research, Redmond, WA 98052
(
[email protected], https://linxiaolx.github.io).
1
F (x) = fm fm−1 · · · f1(x). (4)
For problems (1) we use the definition fi(·) = Eξi [fi,ξi(·)] and
for problem (2), we have fi(·) =
(1/Ni) ∑Ni
j=1 fi,j(·). The formulations (1) and (2) cover a broader range of
applications beyond the classical stochastic
programming and empirical risk minimization (ERM) problems, which
are special cases of (1) and (2) when m = 1, respectively.
Important applications with m = 2 include policy evaluation in
reinforcement learning and Markov decision processes (e.g., [34,
7]), risk-averse optimization (e.g., [31, 33]), and stochastic
variational inequality (e.g., [17, 19] through a novel formulation
in [15]). Other applications in machine learning and statistics
include optimization of conditional random fields (CRF), Cox’s
partial likelihood and expectation-maximization; see formulations
in [6, Section 3.1-3.3]. Multi-level composite risk minimization
problems with m > 2 have been considered in, e.g., [10] and
[39]. A more recent application of the multi-level problem is
adversarial attacks and defenses in deep neural networks; see the
survey paper [40].
1.1 Related work
Existing work on composite stochastic optimization traces back to
[12]. Several recent progresses have been made for two-level (m =
2) problems, both for the general stochastic formulation (e.g.,
[35, 36, 39, 15]) and for the finite-sum formulation (e.g., [20,
43, 16, 41]).
In order to cite specific results, we first explain the measure of
efficiency for composite stochastic optimization. A well-adopted
measure is their sample complexity, i.e., the total number of times
that any component mapping fi,ξi(·) or Jacobian f ′i,ξi(·) is
evaluated, in order to find some x (output
of a randomized algorithm) such that E[G(x)2] ≤ ε for a given
precision ε > 0. Here G(x) is the proximal gradient mapping of
the objective function Φ ≡ F + Ψ at x (exact definition given in
Section 2). If Ψ ≡ 0, then G(x) = F ′(x) where F ′ denotes the
gradient of F and the criteria for ε-approximation becomes E[F
′(x)2] ≤ ε.
For problem (1) with two levels (m = 2) and Ψ ≡ 0, stochastic
composite gradient methods were developed in [35] with sample
complexities O(ε−4), O(ε−3.5) and O(ε−1.25) for the smooth
nonconvex case, smooth convex case and smooth strongly convex case
respectively. For nontrivial convex Ψ, improved sample complexities
of O(ε−2.25), O(ε−2) and O(ε−1) were obtained in [36] for the three
cases respectively. These algorithms employ two-level stochastic
updates in different time scales, i.e., the two step-size sequences
converge to zero at different rates. Similar algorithms for solving
multi-level problems were proposed in [39], with sample complexity
O(ε−(m+7)/4). A single time-scale method is proposed in [15] for
solving two-level problems, which achieves an O(ε−2) sample
complexity, matching the result for single-level stochastic
optimization [14].
For the composite finite-sum problem (2) with two levels (m = 2),
several recent works ap- plied stochastic variance-reduction
techniques to obtain improved sample complexity. Stochas- tic
variance-reduction techniques were first developed for one-level
finite-sum convex optimization problems (e.g., [32, 18, 8, 38, 4,
24]), and later extended to the one-level nonconvex setting (e.g.,
[27, 29, 3, 1, 2, 13, 37, 25, 26]) and the two-level finite-sum
nonconvex setting (e.g., [20, 43, 16, 41]). For the two-level
finite-sum case, the previous best known sample complexity is O(N +
N2/3ε−1) where N = N1 +N2, obtained in [16, 41].
Interestingly, better complexity bounds have been obtained recently
for the non-composite case of m = 1, specifically, O(ε−1.5) for the
expectation case and O(N1 +
√ N1ε
2
case [13, 37, 25, 26]. These results look to be even better than
the classical O(ε−2) complexity of stochastic gradient method for
single-level smooth nonconvex optimization (e.g., [14]), which is
consistent with the statistical lower bound for sample average
approximation [10]. Indeed, a stronger smoothness assumption is
responsible for the improvement: instead of the classical
assumption that the gradient of the expected function is Lipschitz
continuous, the improved results require a Mean- Square-Lipschitz
(MSL) condition (exact definition given in Section 3).
Different from the one-level case, a major difficulty in the
multi-level cases (m ≥ 2) is to deal with biased gradient
estimators for the composite function (e.g., [35, 36, 41]).
Unbiased gradient estimators for two-level problems are developed
in [6], but they do not seem to improve sample complexity. On the
other hand, it is shown in [42] that the improved sample complexity
of O
( min{ε−1.5,
√ N1ε
−1} )
can be obtained with biased gradient estimator for two-level
problems where the outer level is deterministic or without
finite-sum. This result indicates that similar improvements are
possible for multi-level problems as well.
1.2 Contributions and outline
In this paper, we propose stochastic gradient algorithms with
nested variance-reduction for solving problems (1) and (2) for any
m ≥ 1, and show that their sample complexity for finding x such
that E[G(x)] ≤ ε is O(ε−3) for the expectation case, and O(N
+
√ Nε−2) for the finite-sum
case where N = ∑m
i=1Ni. To compare with previous work listed in Section 1.1, we note
that a complexity O(ε−a) for E[G(x)2] ≤ ε (measure used in Section
1.1) implies an O(ε−2a) complexity for E[G(x)] ≤ ε (by Jensen’s
inequality, but not vice versa); in other words, ε = O(ε2). In this
sense, our results are of the same orders as the best known sample
complexities for solving one-level stochastic optimization problems
[13, 37, 25, 26], and they are strictly better than previous
results for solving two and multi-level problems.
The improvement in our results relies on a nested
variance-reduction scheme using mini-batches and a stronger uniform
Lipschitz assumption. Specifically, we assume that each realization
of the random mapping fi,ξi and its Jacobian are Lipschitz
continuous. This condition is stronger than the MSL condition
assumed in [13, 37, 25, 26] for obtaining the O
( min{ε−3,
√ N1ε
−2} )
sample complexity for one-level problems. Nevertheless, it holds
for many applications in machine learning (e.g., [7, 6, 41]),
especially when dom Ψ is bounded. The results in [39] also rely on
such a uniform smooth condition.
Our contributions can be summarized as follows. In Section 2, we
consider solving more general problems of the form (3) without
specifying any composition structure, and propose a Normalized
Proximal Approximate Gradient (NPAG) method. We show that so long
as the Mean Square Errors (MSEs) of the approximate gradients at
each iteration are uniformly bounded by ε2, then the iteration
complexity of NPAG for finding an x such that EG(x)] ≤ ε is
O(ε−2).
In Section 3, we discuss stochastic variance-reduction techniques
for constructing the approx- imate gradients in NPAG. In
particular, we present a simple and unified perspective on variance
reduction under the MSL condition. This perspective clearly
explains the reason for periodic restart, and allows us to derive
the optimal period length and mini-batch sizes without binding to
any particular optimization algorithm. When using the SARAH/Spider
estimator ([24] and [13] respectively) in NPAG to solve one-level
stochastic or finite-sum problems, we obtain a Prox- Spider method.
We also analyze the SVRG [18] and SAGA [8] estimators in the same
framework, and obtain the best-known complexities for them in the
literature for nonconvex optimization.
In Section 4, we present the multi-level Nested-Spider method and
show that under the uniform
3
Lipschitz condition (stronger than MSL), its sample complexity for
the expectation is O(m4ε−3) and for the finite-sum cases is O(N
+m4
√ Nε−1) where N =
∑m i=1Ni. In particular, these results
are polynomial in the number of levels m, which improve
significantly over the previous result O(ε−(m+7)/2) in [39], where
m appears in the exponent of ε.
In Section 5, we present numerical experiments to compare different
variance reduction methods on two applications, with one-level and
two-level composition structure respectively. Finally, we give some
concluding remarks in Section 6.
2 Normalized proximal approximate gradient method
In this section, we present a normalized proximal approximate
gradient (NPAG) method for solving problems of form (3), which we
repeat here for convenience:
minimize x∈Rd
} . (5)
Here we consider a much more general setting, without imposing any
compositional structure on F . Formally, we have the following
assumption.
Assumption 2.1. The functions F , Ψ and Φ in (5) satisfy:
(a) F : Rd → R is differentiable and its gradient F ′ is
L-Lipschitz continuous;
(b) Ψ : Rd → R ∪ {+∞} is convex and lower semi-continuous;
(c) Φ ≡ F + Ψ is bounded below, i.e., there exists Φ∗ such that
Φ(x) ≥ Φ∗ for all x ∈ dom Ψ.
Under Assumption 2.1, we define the proximal operater of Ψ, with
step size η, as
proxηΨ(x) , arg min y∈Rd
{ Ψ(y) +
1
2η y − x2
} , x ∈ dom Ψ. (6)
Since Ψ is convex, this operator is well defined and unique for any
x ∈ dom Ψ. Given the gradient of F at x, denoted as F ′(x), the
proximal gradient mapping of Φ at x is defined as
GηΦ(x) , 1
)) . (7)
When Ψ ≡ 0, we have GηΦ(x) = F ′(x) for any η > 0. Additional
properties of the proximal gradient mapping are given in [22].
Throughout this paper, we denote the proximal gradient mapping as
G(·) by omitting the subscript Φ and superscript η whenever they
are clear from the context.
Our goal is to find an ε-stationary point x such that G(x) ≤ ε, or
E [ G(x)
] ≤ ε if using a
stochastic or randomized algorithm (see [11] for justifications of
G(·) as a termination criterion). If the full gradient oracle F
′(·) is available, we can use the proximal gradient method
xt+1 = proxηΨ ( xt − ηF ′(xt)
) . (8)
By setting η = 1/L where L is the Lipschitz constant of F ′, this
algorithm finds an ε-stationary point within O(Lε−2) iterations;
see, e.g., [5, Theorem 10.15].
Here we focus on the case where the full gradient oracle F ′(·) is
not available; instead, we can compute at each iteration t an
approximate gradient vt. A straightforward approach is to
replace
4
Algorithm 1: Normalized Proximal Approximate Gradient (NPAG)
Method
input: initial point x0 ∈ Rd, proximal step size η > 0, a
sequence εt > 0,∀t. for t = 0, ..., T − 1 do
Find an approximate gradient vt at the point xt.
Compute xt+1 = proxηΨ(xt − ηvt). Update xt+1 = xt + γt(x
t+1 − xt) with γt = min {
η εt xt+1−xt , 1
} .
end output: x∈{x0, ..., xT−1}, with xt chosen w.p. pt :=
εt∑T−1
k=0 εk for 0 ≤ t ≤ T − 1.
F ′(xt) with vt in (8), and then analyze its convergence properties
based on certain measures of approximation quality such as E[vt − F
′(xt)2]. In Algorithm 1, we present a more sophisticated variant,
which first compute a tentative iterate xt+1 = proxηΨ(xt − ηvt),
and then move in the direction of xt+1 − xt with a step size γt,
i.e., xt+1 = xt + γt(x
t+1 − xt). The choice of step size
γt = min {
ηεt xt+1−xt , 1
} ensures that the step length xt+1 − xt is always bounded:
xt+1 − xt = γtxt+1 − xt ≤ ηεt, (9)
where εt is a pre-specified parameter. Algorithm 1 can also be
written in terms of the approximate gradient mapping, which we
define
as
) . (10)
Using this notation, the updates in Algorithm 1 become γt = min
{
εt G(xt) , 1
xt − ηεt G(xt)
G(xt) if G(xt)
≤ εt. Notice that when
G(xt) > εt, the algorithm uses the normalized proximal gradient
mapping with
a step length ηεt. Therefore, we call it the Normalized Proximal
Approximate Gradient (NPAG) method.
Next, we prove a general convergence result concerning Algorithm 1
without specifying how the approximate gradient vt is generated.
The only condition we impose is that the Mean-Square Error (MSE),
E[vt − F ′(xt)2], is sufficiently small for all t. We first state a
useful lemma.
Lemma 2.2. Suppose Assumption 2.1 hold. Then the sequence {xt}
generated by Algorithm 1 satisfies
Φ(xt+1) ≤ Φ(xt)− ( γt/η − Lγ2
t
5
Proof. According to the update rule for xt+1, we have
Φ(xt+1) = F (xt + γt(x t+1 − xt)) + Ψ(xt + γt(x
t+1 − xt))
⟩ + Lγ2
t
= F (xt) + Ψ(xt) + Lγ2
⟩ +γt
) , (11)
where the inequality is due to the L-smoothness of F (see, e.g.,
[23, Lemma 1.2.3]) and the convexity of Ψ. Since xt+1 is the
minimizer of a 1
η -strongly convex function, i.e.,
xt+1 = proxηΨ ( xt − ηvt
) = arg min
y∈Rd
2η xt+1 − xt2,
which implies ⟨ vt, xt+1 − xt
⟩ + Ψ(xt+1)−Ψ(xt) ≤ − 1
xt+1 − xt 2 .
Applying the above inequality to the last term in (11) yields
Φ(xt+1) ≤ Φ(xt)− (γt η − Lγ2
t
2
) xt+1 − xt2 + γtF ′(xt)− vt, xt+1 − xt
≤ Φ(xt)− (γt η − Lγ2
1
= Φ(xt)− (γt η − Lγ2
where the second inequality is due to the Cauchy-Schwarz
inequality.
We note this result is similar to that of [26, Lemma 3], but we
used a slightly looser bound for the inner product in the last
step. This bound and our choice of γt allow a simple convergence
analysis of the NPAG method, which we present next.
Theorem 2.3. Suppose Assumption 2.1 hold and we set η = 1/2L in
Algorithm 1. If it holds that
E [ vt − F ′(xt)2
] ≤ ε2t , ∀ t ≥ 0, (12)
for some sequence {εt}, then the output x of Algorithm 1
satisfies
E [ G(x)
. (13)
6
Proof. Using Lemma 2.2, the assumption (12), and the choice η =
1/2L, we obtain
E [ Φ(xt+1)
] + ε2t 2L , (14)
where the equality is due to (10) and the last inequality is due to
0 ≤ γt ≤ 1, which implies
2γt − γ2 t ≥ γt. Note that γt = min
{ εt
} ≥ εt
G(xt) − 1
4 ε2t ,
where the inequality is due to the fact min { |z|, z2
} ≥ |z| − 1
4 for all z ∈ R (here we take z =
ε−1 t G(xt)). Taking expectations and substituting the above
inequality into (14) gives
E [ Φ(xt+1)
. (15)
On the other hand, by (12) and Jessen’s inequality we obtain
E [ vt − F ′(xt)
] ≤ √ E [vt − F ′(xt)2] ≤ εt.
Denote xt+1 = proxηr(xt− ηF ′(xt)) and also note that xt+1 =
proxηr(xt− ηvt). Since the proximal operator is nonexpansive (see,
e.g., [30, Section 31]), we have
E [ xt+1 − xt+1
] ≤ ηE[vt − F ′(xt)] ≤ ηεt, ∀ t ≥ 0.
Therefore,
] ≤ 1
] +
] ≤ E
εt 4L
E [G(xt)
] ≤ E [ Φ(xt)
] + ε2t L . (16)
Adding the above inequality for k = 0, ..., T − 1 and rearranging
the terms, we obtain
E [G(x)
·E [G(xt)
+ 4 ∑T−1
.
7
Remark 2.4. In the simplest case, one can set εt ≡ ε for all t = 0,
1, ..., T − 1 and
T ≥ 4L ( Φ(x0)− Φ∗
E [G(x)
] ≤ 4L(Φ(x0)− Φ∗)
Tε ≤ 5ε.
Theorem 2.3 applies directly to the deterministic proximal gradient
method with vt = F ′(xt). In this case, with εt and T set according
to Remark 2.4, condition (12) is automatically satisfied and we
recover the O(Lε−2) complexity of proximal gradient method for
nonconvex optimization [5, Theorem 10.15].
The problem of using a small, constant εt is that we will have very
small step size γt in the beginning iterations of the algorithm.
Although the order of the sample complexity is optimal, it may lead
to slow convergence in the practice. We will see in Section 4 that
we can obtain the same orders of samples complexity using a
variable sequence {εt}, starting with large values and gradually
decreases to the target precision ε. These choices of variable
sequence {εt} lead to improved performance in practice and are
adopted in our numerical experiments in Section 5.
In general, for the stochastic problems, we do not require the
estimates vt to be unbiased. On the other hand, condition (12)
looks to be quite restrictive by requiring the MSE of approximate
gradient to be less than ε2t , where εt is some predetermined
precision. In the rest of this paper, we show how to ensure
condition (12) using stochastic variance-reduction techniques and
derive the total sample complexity required for solving problems
(1) and (2).
3 A general framework of stochastic variance reduction
In this section, we discuss stochastic variance-reduction
techniques for smooth nonconvex optimiza- tion. In order to prepare
for the multi-level compositional case, we proceed with a general
framework of constructing stochastic estimators that satisfies (12)
for Lipschitz-continuous vector or matrix mappings. To simplify
presentation, we restrict our discussion in this section to the
setting where εt ≡ ε in (12), where ε is the target precsion so
that the output x satisfies E[G(x)] ≤ O(ε). A more general scheme
with variable sequence {εt} is presented in Section 4.
Consider a family of mappings {φξ : Rd → Rp×q} where the index ξ is
a random variable. We assume that they satisfy a uniform Lipschitz
condition, i.e., there exists L > 0 such that
φξ(x)− φξ(y) ≤ Lx− y, ∀x, y ∈ Rd, ∀ ξ ∈ , (17)
where · denotes the matrix Frobenius norm and is the sample space
of ξ. This uniform Lipschitz condition implies the Mean-Squared
Lipschitz (MSL) property, i.e.,
Eξ [ φξ(x)− φξ(y)2
which is responsible for obtaining the O ( min{ε−3,
√ N1ε
−2} )
sample complexity for one-level prob- lems in [13, 37, 25, 26].
This property in turn implies that the average mapping, defined
as
φ(x) , Eξ[φξ(x)],
8
is L-Lipschitz. In addition, we make the standard assumption that
they have bounded variance everywhere, i.e.,
Eξ[φξ(x)− φ(x)2] ≤ σ2, ∀x ∈ Rd. (19)
In the context of solving structured optimization problems of form
(5), we can restrict the above assumptions to hold within dom Ψ
instead of Rd.
Suppose a stochastic algorithm generates a sequence {x0, x1, . . .
, } using some recursive rule
xt+1 = ψt(x t, vt),
where vt is a stochastic estimate of φ(xt). As an example, consider
using Algorithm 1 to solve problem (5) where the smooth part F (x)
= Eξ[fξ(x)]. In this case, we have φξ(x) = f ′ξ(x) and
φ(x) = F ′(x), and can use the simple estimator vt = f ′ξt(x t)
where ξt is a realization of the random
variable at time t. If ε ≥ σ, then assumption (19) directly implies
(12) with εt ≡ ε; thus Theorem 2.3 and Remark 2.4 show that the
sample complexity of Algorithm 1 for obtaining E[G(x)] ≤ ε is T =
O(Lε−2).
However, if the desired accuracy ε is smaller than the uncertainty
level σ, then condition (12) is not satisfied and we cannot use
Theorem 2.3 directly. A common remedy is to use mini-batches; i.e.,
at each iteration of the algorithm, we randomly pick a subset Bt of
ξ and set
vt = φBt(x t) ,
|Bt| ∑ ξ∈Bt
φξ(x t). (20)
In this case, we have E[vt] = φ(xt) and E[vt − φ(xt)2] = σ2/|Bt|.
In order to satisfy (12) with εt ≡ ε, we need |Bt| ≥ σ2/ε2 for all
t = 0, . . . , T − 1. Then we can apply Theorem 2.3 and Remark 2.4
to derive its sample complexity for obtaining E[G(x)] ≤ ε, which
amounts to T · σ2/ε2 = O(Lσ2ε−4).
Throughout this paper, we assume that the target precision ε <
σ. In this case (and assuming Ψ ≡ 0), the total sample complexity
O(Lε−2 + σ2ε−4) can be obtained without using mini-batches (see,
e.g., [14]). Therefore, using mini-batches alone seems to be
insufficient to improve sample complexity (although it can
facilitate parallel computing; see, e.g., [9]). Notice that the
simple mini-batch estimator (20) does not rely on any Lipschitz
condition, so it can also be used with non- Lipschitz mappings such
as subgradients of nonsmooth convex functions. Under the MSL condi-
tion (18), several variance-reduction techniques that can
significantly reduce the sample complexity of stochastic
optimization have been developed in the literature (e.g., [32, 18,
8, 38, 4, 24, 13].
In order to better illustrate how Lipschitz continuity can help
with efficient variance reduc- tion, we consider the amount of
samples required in constructing a number of, say τ , consecutive
estimates {v0, . . . , vτ−1} that satisfy
E [vt − φ(xt)
2] ≤ ε2, t = 0, 1, . . . , τ − 1. (21)
If the incremental step lengths xt−xt−1 for t = 1, . . . , τ−1 are
sufficiently small and the period τ is not too long, then one can
leverage the MSL condition (18) to reduce the total sample
complexity, to be much less than using the simple mini-batch
estimator (20) for all τ steps. In the rest of this section, we
first study the SARAH/Spider estimator proposed in [24] and [13]
respectively. Then we also analyze the SVRG [18] and SAGA [8]
estimators from the same general framework.
9
3.1 SARAH/SPIDER estimator for stochastic optimization
In order to construct τ consecutive estimates {v0, . . . , vτ−1},
this estimator uses the mini-batch estimator (20) for v0, and then
constructs v1 through vτ−1 using a recursion:
v0 = φB0(x0),
(22)
It was first proposed in [24] as a gradient estimator of SARAH
(StochAstic Recursive grAdient algoritHm) for finite-sum convex
optimization, and later extended to nonconvex and stochastic
optimization [25, 26]. The form of (22) for general Lipschitz
continuous mappings is proposed in [13], known as Spider
(Stochastic Path-Integrated Differential EstimatoR). The following
lemma is key to our analysis.
Lemma 3.1. [13, Lemma 1] Suppose the random mappings φξ satisfy
(18). Then the MSE of the estimator in (22) can be bounded as
E [vt − φ(xt)
] . (23)
If we can control the step lengths xt − xt−1 ≤ δ for t = 1, . . . ,
τ − 1, then
E [v0 − φ(x0)
|B0| + τ−1∑ t=1
L2δ2
|Bt| . (24)
To simplify notation, here we allow the batch sizes |Bt| to take
fraction values, which does not change the order of required sample
complexity. Now, in order to satisfy (21), it suffices to set
|B0| = 2σ2
ε2 , t = 1, . . . , τ − 1. (25)
In this case, the number of samples required over τ steps is
|B0|+ (τ − 1)b ≤ 2σ2
ε2 +
2τ2L2δ2
ε2 .
Therefore, it is possible to choose δ and τ to make the above
number smaller than τσ2/ε2, which is the number of samples required
for (21) with the simple mini-batching scheme (20).
To make further simplifications, we choose δ = ε/2L, which leads to
b = τ/2 and
|B0|+ (τ − 1)b ≤ 2σ2/ε2 + τ2/2.
The ratio between the numbers of samples required by Spider and
simple mini-batching is
ρ(τ) ,
( 2σ2
2σ2 .
In order to maximize the efficiency of variance reduction, we
should choose τ to minimize this ratio, which is done by
setting
τ? = 2σ/ε.
Algorithm 2: Prox-Spider Method
input: initial point x0 ∈ Rd, proximal step size η > 0, and
precision ε > 0; Spider epoch length τ , large batch size B and
small batch size b.
for t = 0, 1, ..., T − 1 do if t mod τ == 0 then
Randomly sample a batch Bt of ξ with |Bt| = B vt = f ′Bt(x
t)
else Randomly sample a batch Bt of ξ with |Bt| = b vt = vt−1 + f
′Bt(x
t)− f ′Bt(x t−1)
end xt+1 = proxηΨ(xt − ηvt) xt+1 = xt + γt(x
t+1 − xt) with γt = min {
η ε xt+1−xt , 1
} end output: x uniformly sampled from {x0, ..., xT−1}.
Using (25) and δ = ε/2L, the corresponding mini-batch size and
optimal ratio of saving are
b? = τ?/2 = σ/ε, ρ(τ?) = 2 ε/σ. (26)
Therefore, significant reduction in sample complexity can be
expected when ε σ. The above derivation gives an optimal length τ?
for variance reduction over consecutive estimates
{v0, . . . , vτ−1}. If the number of iterations required by the
algorithm is lager than τ?, then we need to restart the Spider
estimator every τ? iterations. This way, we can keep the optimal
ratio ρ(τ?) to obtain the best total sample complexity.
Algorithm 2 is an instantiation of the NPAG method (Algorithm 1)
that uses Spider to con- struct vt. It incorporates the periodic
restart scheme dictated by the analysis above. The Spider- SFO
method (Option II) in [13] can be viewed as a special case of
Algorithm 2 with Ψ ≡ 0:
xt+1 = xt − γtη vt, where η = 1
2L and γt = min
} . (27)
Several variants of Spider-SFO, including ones that use constant
step sizes and direct extensions of the form xt+1 = proxηΨ(xt−ηvt),
have been proposed in the literature [24, 13, 25, 26]. Algorithm 1
is very similar to ProxSARAH [26], but with different choices of
γt. In ProxSARAH, the whole sequence {γt}t≥0 is specified off-line
in terms of parameters such as L and number of steps to run. Our
choice of γt follows closely that of Spider-SFO in (27), which is
automatically adjusted in an online fashion, and ensures bounded
step length. This property allows us to extend the algorithm
further to work with more general gradient estimators, including
those for multi-level composite stochastic optimization, all within
the common framework of NPAG and Theorem 2.3.
The sample complexity of Algorithm 2 is given in the following
corollary.
Corollary 3.2. Consider problem (5) with F (x) = Eξ[fξ(x)]. Suppose
Assumption 2.1 holds and the gradient mapping φξ ≡ f ′ξ satisfies
(18) and (19) on dom Ψ (instead of Rd). If the parameters in
Algorithm 2 are chosen as
η = 1
2L , τ =
⌉ , (28)
11
where d·e denotes the integer ceiling, then the output x satisfies
E[G(x)] ≤ ε after O(Lε−2) iterations, and the total sample
complexity is O(Lσε−3 + σ2ε−2).
Proof. In Algorithm 2, the step lengths satisfy the same bound in
(9), i.e., xt+1−xt ≤ ηε = ε/2L. From the analysis following (25),
the parameters in (28) guarantee (12). Therefore we can apply
Theorem 2.3 and Remark 2.4 to conclude that E[G(x)] ≤ 5ε provided
that T ≥ 4L(Φ(x0)−Φ∗)ε
−2. To estimate the sample complexity, we can simply multiply the
number of samples required by
simple mini-batching, T · σ2/ε2, by the optimal ratio in (26) to
obtain
T · σ 2
) ,
where we used the assumption ε < σ. Considering that B =
O(σ2/ε2) samples are always needed at t = 0, we can include it in
the total sample complexity, which becomes O(Lσε−3 + σ2ε−2).
We note that for each t that is not a multiple of τ , we need to
evaluate f ′ξ twice for each ξ ∈ Bt, one at xt and the other at
xt−1. Thus the computational cost per iteration, measured by the
number of evaluations of fξ and f ′ξ, is twice the number of
samples. Therefore the overall computational cost is proportional
to the sample complexity. This holds for all of our results in this
paper.
3.2 SARAH/SPIDER estimator for finite-sum optimization
Now consider a finite number of mappings φi : Rd → Rp×q for i = 1,
. . . , n, and define φ(x) = (1/n)
∑n i=1 φi(x) for every x ∈ Rd. We assume that the MSL condition
(18) holds, but do not
require a uniform bound on the variance as in (19). Again, we would
like to consider the number of samples required to construct τ
consecutive estimates {v0, . . . , vτ−1} that satisfy condition
(21). Without the uniform variance bound (19), this cannot be
treated exactly as a special case of what we did in Section 3.1,
and thus deserves separate attention.
We use the same Sarah/Spider estimator (22) and choose B0 = {1, . .
. , n}. In this case, we have v0 = φ(x0) and E
[ v0 − φ(x0)2
] = 0. From (23), if we can control the step lengths
xt − xt−1 ≤ ε/2L and use a constant batch size Bt = b for t = 1, .
. . , τ − 1, then
E [v1 − φ(x1)
τ−1∑ t=1
4b .
In order to satisfy (21), it suffices to use b = τ/4. The number of
samples required for τ steps is
|B0|+ (τ − 1)b ≤ n+ τ2/4.
As before, we compare the above number of samples against the
simple mini-batching scheme of using vt = φBt(x
t) for all t = 0, . . . , τ − 1. However, without an assumption
like (19), we cannot guarantee (21) unless using Bt = {1, . . . ,
n} for all t in simple mini-batching, which becomes full- batching.
The ratio between the numbers of samples required by Spider and
full-batching is
ρ(τ) = n+ τ2/4
nτ =
1
τ +
τ
4n .
By minimizing this ratio, we obtain the optimal epoch length and
batch size:
τ? = √
2n/4.
12
In this case, the minimum ratio is ρ(τ?) = 1/ √ n. Hence we expect
significant saving in sample
complexity for large n. Similar to the expectation case, we should
restart the estimator every τ? =
√ 2n iterations to keep the optimal ratio of reduction.
As an example, we consider using the Prox-Spider method to solve
problem (5) when the smooth part of the objective is a finite-sum.
The resulting sample complexity is summarized in the following
corollary of Theorem 2.3. Its proof is similar to that of Corollary
3.2 and thus omitted.
Corollary 3.3. Consider problem (5) with F (x) = (1/n) ∑n
i=1 fi(x). Suppose Assumption 2.1 holds and the gradient f ′i
satisfies (18) on dom Ψ. If we set the parameters in Algorithm 2
as
η = 1/2L, τ = ⌈√
⌉ ,
and use Bt = {1, . . . , n} whenever t mod τ = 0, then the output
satisfies E[G(x)] ≤ ε after O(Lε−2) iterations, and the total
sample complexity is O
( n+ L
√ nε−2
3.3 SVRG estimator for stochastic optimization
As a general framework, NPAG may also incorporate the SVRG
estimator proposed in [18]. We illustrate how SVRG estimator can be
applied to construct τ consecutive estimates {v0, ..., vτ−1} that
satisfy (21) for stochastic optimization (the expectation case).
For the SVRG estimator, we have
v0 = φB0(x0),
(29)
E [vt − φ(xt)
] ,
see [27, Lemma 4]. If we bound the step length by δ, then by the
triangle inequality,
xt − x0 ≤ t∑
|Bt| .
To guarantee (21), we can choose δ = ε/2L and |B0| ≥ 2σ2/ε2 and for
t = 1, . . . , τ , and |Bt| = b ≥ 2τ2L2δ2/ε2 = τ2/2. Thus the total
sample complexity per epoch is
|B0|+ (τ − 1)b = 2σ2
2 .
We can minimize the ratio between the above quantity and τσ2/ε2 (of
the naive mini-batch scheme) by choosing τ = (
√ 2σ/ε)2/3 and total sample complexity per epoch becomes O(σ2/ε2),
which can
be much smaller than τσ2/ε2. By restarting the gradient estimation
every τ iterations, (12) is satisfied with εt ≡ ε. By Theorem 2.3
and Remark 2.4, getting a x s.t. E [G(x)] ≤ 5ε requires T = O(Lε−2)
iterations and a total complexity of O
( (T/τ) · (σ2/ε2)
) = O(Lσ4/3ε−10/3), which is
slightly worse than that of the SARAH/Spider estimator in the
expectation case. The finite-sum case is similar and is hence
omitted. It will have the same sample complexity as
the SAGA estimator as we discuss next.
13
3.4 SAGA estimator for finite-sum optimization
In this section, we discussion how the SAGA estimator proposed in
[8] can be incorporated into our NPAG framework. Note that the SAGA
estimator is only meaningful for the finite-sum case (sampling with
replacement). The SAGA estimator can be written as
v0 = u0 = 1
vt = ut−1 + 1
) .
After each iteration, we update αti by setting αti ← xt if i ∈ Bt
and αti ← αt−1 i if i /∈ Bt. Denote
α[n, t] = {αri } r=0,...,t i=1,...,n. Suppose |Bt| ≡ b for all t.
Then the MSE of the SAGA estimator is bounded
by
] ≤ L2
nb
2.
Note that the αt−1 i ’s are identically distributed, thus we
further have
E [ vt − φ(xt)2
1 2 ] . (30)
When we restrict the step lengths by xr − xr−1 ≤ δ, then we have
xt− xr ≤ |t− r|δ for all t, r. Note that all the batches Br are
independently and uniformly sampled and αt−1
1 = xt−1 if 1 ∈ Bt−1. That is, αt−1
1 = xt−1 with probability b/n. Similarly, for r < t − 1, we have
αt−1 1 = xr if 1 ∈ Br
and 1 /∈ ∪t−1 j=r+1Bj . Hence αt−1
1 = xr with probability (b/n)(1− b/n)t−r−1. Therefore, (30)
implies
E [ vt − φ(xt)2
b
n
] (31)
≤ L2
( 1− b
≤ 3n2L2δ2
b3 .
When we choose δ = ε/2L and b = n2/3, we have E [ vt−φ(xt)2
] ≤ ε2 for any t ≥ 0. Consequently,
Theorem 2.3 and Remark 2.4 implies a total sample complexity of
O(n2/3ε−2) to output a point x s.t. E
[ G(x)
] ≤ 5ε. This sample complexity is worse than that of the
SARAH/Spider estimator.
In this section, we analyzed the normalized-step SARAH/Spider, SVRG
and SAGA estimators. Specifically, we described SARAH/Spider and
SVRG from the perspective of efficient variance reduction over a
period of consecutive iterations in a stochastic algorithm. This
perspective allows us to derive the optimal period length and
mini-batch sizes of the estimator without binding with
14
any particular algorithm. Using these estimators in the NPAG
framework led to the Prox-Spider method, and we obtained its sample
complexity by simply combining the iteration complexity of NPAG
(Theorem 2.3) and the optimal period and mini-batch sizes of
SARAH/Spider. The complexities in these cases on one-level
stochastic optimization are not new (see, e.g., [13, 37, 26, 27,
28]). But the NPAG framework and the separate perspective on
variance reduction are more powerful and also work for multi-level
stochastic optimization, which we show next.
4 Multi-level nested SPIDER
In this section, we present results for the general composite
optimization problems (1) and (2) with m ≥ 2. To provide
convergence and complexity result, we only need to construct a
gradient estimator vt that is sufficiently close to F ′(xt). For
the ease of presentation, let us denote
Fk = fk fk−1 · · · f1, k = 1, . . . ,m.
In particular, we have F1 = f1, and Fm = F . Then, by chain
rule,
F ′(x) = [ f ′1(x)
]T · · · [f ′m−1(Fm−2(x)) ]T f ′m(Fm−1(x)). (32)
To approximate F ′(xt), we construct m− 1 estimators for the
mappings,
yt1 ≈ F1(xt), . . . , ytm−1 ≈ Fm−1(xt), (33)
and m estimators for the Jacobians,
zt1 ≈ f ′1(xt), zt2 ≈ f ′2(yt1), . . . , ztm ≈ f ′m(ytm−1).
(34)
(Notice that we do not need to estimate F (xt) = Fm(xt).) Then we
can construct
vt = (zt1)T (zt2)T · · · (ztm−1)T ztm. (35)
We use the SARAH/Spider estimator (22) for all m−1 mappings in (33)
and m Jacobians in (34), then use the above vt as the approximate
gradient in the NPAG method. This leads to the multi- level
Nested-Spider method shown in Algorithm 3. To simplify
presentation, we adopted the notation yt0 = xt for all t ≥ 0.
Each iteration t in Algorithm 3 is assigned a stage or epoch
indicator χ(t) ∈ {1, . . . ,K}, indicating the current
variance-reduction stage. Each stage k has length τk which is
determined by the variable precision εk. Note that all iterations
within the same stage k share the same εχ(t) = εk, which in turn is
used to determine the step size γt = min
{ η εχ(t)/xt+1 − xt, 1
} .
For convergence analysis, we make the following assumption.
Assumption 4.1. For the functions and mappings appearing in (1) and
(2), we assume:
(a) For each i = 1, . . . ,m and each realization of ξi, mapping
fi,ξi : Rdi−1→Rdi is `i-Lipschitz and its Jacobian f ′i,ξi :
R
di−1→Rdi×di−1 is Li-Lipschitz.
(b) For each i = 1, . . . ,m − 1, there exists δi such that Eξi [
fi,ξi(y) − fi(y)2
] ≤ δ2
15
Algorithm 3: Multi-level Nested-Spider Method
input: initial point x0 ∈ Rd, proximal step size η > 0, a
sequence of precisions {εk}Kk=1, stage lengths {τk}Kk=1, and batch
sizes {Bk
i , S k i , b
k i , s
initialize: y0 0 = x0, k = 1, s = 0
for t = 0, 1, ..., T − 1 do χ(t) = k. /* iteration t is in the k-th
stage */
if s == 0 then for i = 1, . . . ,m do
Randomly sample batch Sti of ξi with |Sti |=Ski , batch Bti with
|Bti |=Bk i
yti = fi,Sti (y t i−1) /* skip if i == m */
zti = f ′ i,Bti
else for i = 1, . . . ,m do
Randomly sample batch Sti of ξi with |Sti |=ski , batch Bti with
|Bti |=bki yti = yt−1
i + fi,Sti (y t i−1)− f ′
i,Sti (yt−1 i−1) /* skip if i == m */
zti = zt−1 i + f ′
i,Bti (yti−1)− f ′
end
end vt = (zt1)T (zt2)T · · · (ztm−1)T ztm xt+1 = proxηΨ(xt − ηvt)
xt+1 = xt + γt(x
t+1 − xt) with γt = min {
η εχ(t) xt+1−xt , 1
} yt+1
0 = xt+1, s← s+ 1 if s == τk then s← 0, k ← k + 1 /* reset s = 0
for next stage */
end
output: x from {x0, ..., x T−1} with xt sampled with
probability
εχ(t)∑K k=1 τkεk
.
(c) For each i = 1, . . . ,m, there exists σi such that Eξi [ f
′i,ξi(y) − f ′i(y)2
] ≤ σ2
i for any y ∈ dom fi.
As a result, the function F = fm · · · f1 and its gradient F ′ are
both Lipschitz continuous, and their Lipschitz constants are given
in the following lemma.
Lemma 4.2. Suppose Assumption 4.1.(a) holds. Then the composite
function F and its gradient F ′
are Lipschitz continuous, with respective Lipschitz constants
`F = m∏ i=1
Li
where we use the convention ∏0 r=1 `r = 1.
The proof of this lemma is presented in Appendix A. For the special
case m = 2,
`F = `1`2, LF = `2L1 + `21L2,
16
which have appeared in previous work on two-level problems [15,
41]. In addition, we define two constants:
σ2 F =
δ2 i . (37)
They will be convenient in characterizing the sample complexity of
Algorithm 3. As inputs to Algorithm 3, we need to choose the
variable precisions εk, epoch length τk and
batch sizes {Bk i , S
k i , b
k i , s
k i } k=1,...,K i=1,...,m to ensure E[vt − F ′(xt)|2] ≤ ε2χ(t).
This is done through the
following lemma.
Lemma 4.3. Suppose Assumption 4.1 holds. In Algorithm 3, if we set
η = 1 2LF
, τk = `F 2mεk
then E [ vt − F ′(xt)2
] ≤ ε2k for all k and all t ∈ {r : χ(r) = k}.
Proof. First, we bound vt − F ′(xt)2 in terms of the approximation
errors of the individual esti- mators in (33) and (34). Denoting
Fi(x
t) by F ti for simplicity, we have from (32) and (35),
E [ vt− F ′(xt)2
] = E
[[zt1]T [zt2]T · · · [ztm−1]Tztm − [f ′1(xt)]T [f ′2(F t1)]T · · ·
f ′m(F tm−1) 2 ] .
To bound this value, we add and subtract intermediate terms inside
the norm such that each adjacent pair of products differ at most in
one factor. Then we use the inequality
∑r i=1 ai−bi
2 ≤ r ∑r
i=1 ai − bi2 to split the bound. More concretely, we split vt − F
′(xt) into k = 2m− 1 pairs:
E [ vt − F ′(xt)2
] (40)
≤ (2m− 1)
( E [[zt1]T [zt2]T · · · [ztm−1]T ztm − [f ′1(xt)]T [zt2]T · · ·
[ztm−1]T ztm
2 ]
+E [[f ′1(xt)]T [zt2]T [zt3]T· · · [ztm−1]T ztm − [f ′1(xt)]T [f
′2(yt1)]T [zt3]T· · · [ztm−1]T ztm
2 ]
+E [[f ′1(xt)]T [f ′2(yt1)]T [zt3]T· · ·[ztm−1]Tztm−[f ′1(xt)]T [f
′2(F t1)]T [zt3]T· · ·[ztm−1]T ztm
2 ]
+ · · ·
+E [[f ′1(xt)]T · · · [f ′m−1(F tm−2)]T ztm − [f ′1(xt)]T · · · [f
′m−1(F tm−2)]T f ′m(ytm−1)
2 ]
+E [[f ′1(xt)]T· · ·[f ′m−1(F tm−2)]Tf ′m(ytm−1)−[f ′1(xt)]T· · ·[f
′m−1(F tm−2)]Tf ′m(F tm−1)
2 ])
≤ 2m
)( m∏
r=i+1 ztr2
) ( zti − f ′i(yti−1)2 + 1i≥2f ′i(yti−1)− f ′i(F ti−1)2
)] ≤ 2m
)( zti − f ′i(yti−1)2 + 1i≥2L
2 i yti−1 − F ti−12
)] ,
17
where we used the notation F t0 = yt0 = xt and 1i≥2 = 1 if i ≥ 2
and 0 otherwise. In the last inequality, we used f ′r(·) ≤ `r,
which is a consequence of fr being `r-Lipschitz.
Next, we first derive deterministic bounds on zti2 so that they can
be moved outside of the expectation in (40). Then we will bound
E
[ zti − f ′i(yti−1)2
] and E
] separately.
Without loss of generality, we focus on the first epoch with t = 0,
1, . . . , τ1 − 1, but the results hold for all the epochs. For the
ease of notation, we denote ε1, τ1 and B1
i , S 1 i , b
1 i , s
1 i by ε, τ and
Bi, Si, bi, si in this proof. Step 1: Bounding the temporal
differences yti−y
t−1 i . For i = 0, we have yt0 = xt and from (9),
xt − xt−1 ≤ ηε = ε/2LF ≤ ε/LF . When i = 1, for any t ≥ 1,
yt1 − yt−1 1 ≤
LF .
In general, for all 1 ≤ i ≤ m− 1, we have
yti−yt−1 i ≤
∏i r=1 `r LF
ε. (41)
∑t r=1 zri − z
r−1 i for all 1 ≤ i ≤ m. Moreover,
zri −zr−1 i ≤
r−1 i−1 )
( i−1∏ r=1
`r
) ε
LF ,
where the last inequality is due to (41). Consequently, for all 0 ≤
t ≤ τ − 1,
zti ≤ z0 i + t ·
( i−1∏ r=1
] and E
E [ zti − f ′i(yti−1)2
] ≤ E
i−1 2 ]
E [ yti − fi(yti−1)2
] ≤ δ2
i
Si +
( i∏
. (44)
Step 4: Bounding E[yti−Fi(xt)2]. We show that the following
inequality holds for i = 1, . . . ,m:
E [ yti − Fi(xt)2
] . (45)
18
(Here we use the convention ∏i j=i+1 `j = 1.) We prove this result
by induction. With the notations
F1 = f1 and yt0 = xt, the base case of i = 1 is the same as (44),
thus already proven. Suppose (45) holds for i = k, for i = k + 1,
we have
E [ ytk+1 − Fk+1(xt)2
] ≤ (1 + k)E
] + (1 + k−1)E
] + (1 + k−1)`2k+1E
[ ytk − Fk(xt)2
[ k∑ r=1
[( k∏ j=1
`2j
which completes the induction. Therefore (45) holds.
Now we go back to inequality (40) and first apply the bound on zti
in (42). Using
τ = `F
we have for r = 1, . . . ,m and all t ≥ 0,
ztr ≤ `r + τLr r−1∏ j=1
`j ε
LF = `r
2mLF
2m
) `r,
where the last inequality is due to the definition of LF in (36).
Consequently,( i−1∏ r=1
`2r
)( m∏
) ≤
∏ r 6=i `2r ,
where we applied the inequality (1 + a−1)a ≤ e ≤ 3 for a > 0.
Applying the above inequality to (40), we obtain
E [ vt − F ′(xt)2
] ≤ T1 + T2, (46)
] ,
( ∏ r 6=i+1
Next we bound the two terms T1 and T2 separately.
19
For T1, we use (43) and set Bi = B and bi = b for all i = 1, . . .
,m, which yields
T1 ≤ 6m
)) .
Now recall the definition of σ2 F in (37) and definition of LF in
(36), which implies
L2 F =
m+ 1 ε2,
where the equality is due to the choice of parameters in (38). For
T2, recalling our notation F ti = Fi(x
t), we use (45) to obtain
T2 ≤ 6m m−1∑ i=1
( ∏ r 6=i+1
δ2 r
`2j ·L2 i+1 ·
i∏ j=r+1
i+1 · m∏
δ2 r
δ2 r
i+1 · i∏
j=1 `4j
( δ2 rL
2 F
) ,
where in the second inequality we used i ≤ m and in the last
inequality we used (47). Now setting Sr = S and sr = s for r = 1, .
. . ,m− 1, and using the definition of δ2
F in (37), we have
T2 ≤ 6m2 δ 2 F
S + 6m3 τε
m+ 1 ε2,
where the last equality is due to the choice of parameters in (39).
Finally, with the above bounds on T1 and T2, we get from (46) that
for any t ≥ 0,
E[vt − F ′(xt)2] ≤ T1 + T2 ≤ 1
m+ 1 ε2 +
20
Theorem 4.4. Consider problem (1) with m ≥ 2, and suppose
Assumptions 2.1 and 4.1 hold. In Algorithm 3, if we set εk =
mθLF
k·`F with θ > 0 and set the other parameters as in Lemma 4.3,
then
the output x satisfies E[G(x)] ≤ ε after K = O ( mLF ε·`F
) epochs, and the total sample complexity
is O ( m4LF (σ2
) .
Proof. From Theorem 2.3 and Lemma 4.3, the output of Algorithm 3
satisfies
E [ G(x)
t=0 εχ(t)
+ 4 ∑T−1
. (48)
≤ θLF ln(K).
E [G(x)] ≤ 8mLF
) such that E [G(x)] ≤ ε. With the Bk
i , b k i , S
k i , s
K∑ k=1
m∑ i=1
Next, we present our results of the finite-sum optimization
problem.
Theorem 4.5. Consider problem (2) with m ≥ 2, and suppose
Assumptions 2.1 and 4.1.(a) hold. In addition, let Nmax = max{N1, .
. . , Nm} and assume the target precision ε satisfies
√ Nmax ≤ `F
2mε .
In Algorithm 3, we set η = 1/2LF . For k = 1, ...,K and i = 1,
...,m, we set εk = (
θLF k· √ Nmax
bki ≡ min {
} .
Moreover, for every t at the beginning of each epoch, let Bti = Sti
= {1, . . . , Ni} be the full batches for i = 1, . . . ,m. Then the
output x satisfies E[G(x)] ≤ ε after K = O
( LF√
i=1Ni +m4LF √ Nmaxε
−2 ) .
21
Proof. Through a similar line of proof of Lemma 4.3, E [ vt − F
′(xt)2
] ≤ ε2χ(t) still holds. Conse-
T−1∑ t=0
εχ(t) = K∑ k=1
τkεk = K∑ k=1
k · √ Nmax
≤ 2θLF ln(K).
Substitute the above inequalities into (48), we get E [G(x)] ≤ O (
( LF√
Nmax·K ) 1 2
LF√ Nmax·ε2
the total sample complexity is
K∑ k=1
m∑ i=1
( Ni+(τk−1)bki
If the condition that √ Nmax ≤ `F
2mε does not hold, then we can use the setting in Theorem 4.4 and
obtain the same order or slightly better sample complexity.
Remark 4.6. In Theorem 4.4, if we set εk ≡ ε, then the Algorithm
output x s.t. E[G(x)] ≤ ε after K = O(mLFε·`F ) epochs, and the
total sample complexity is O(m4LF (σ2
F + δ2 F + `2F )ε−3). Similarly,
in Theorem 4.5, if we set εk ≡ ε, then the Algorithm output x s.t.
E[G(x)] ≤ ε after K = O( LF√
Nmax·ε2 ) epochs, and the total sample complexity is O
(∑m i=1Ni +m4LF
−2 ) . Although
using constant εk ≡ ε improves the sample complexities of Theorem
4.4 and 4.5 by a logarithmic factor, the O(ε) step length may be
too consevative in the early stages. In practice, the adaptive
setting of εk in Theorem 4.4 and 4.5 can be much more
efficient.
5 Numerical experiments
In this section, we present numerical experiments to demonstrate
the effectiveness of the proposed algorithms and compare them with
related work. We first apply NPAG with different variance- reduced
estimators on a noncovex sparse classification problem. Since this
is a one-level finite-sum problem, we also compare it with
ProxSARAH [26]. Then we consider a sparse portfolio selection
problem, which has a two-level composition structure. We compare
the Nested-SPIDER method with several other methods for solving
two-level problems.
5.1 Sparse binary classification
min w
`(aTi x, bi) + βx1, (49)
where {(ai, bi)}Ni=1 are N labeled examples with feature vectors ai
∈ Rd and binary labels yi ∈ {−1, ,+1}. Specifically, we consider
two different loss functions: the logistic difference loss `ld(t,
b) =
22
# of samples 10 6
# of samples 10 6
# of samples 10 6
# of samples 10 6
ProxSARAH-vstp
ProxSARAH-cstp
ProxSPIDER
ProxSVRG
ProxSAGA
Figure 1: Experiments on sparse binary classification on mnist
dataset, β = 1/12691.
log ( 1 + e−bt
( 1− 1
)2 . For
the `1-norm penalty, we set β = 1/N . This is a standard one-level
finite-sum problem, corresponding to the case of m = 1 in
(2).
We compare the ProxSPIDER method (Algorithm 2) with the ProxSARAH
algorithm [26]. Prox- SARAH can be implemented with two schemes:
for the variable stepsize scheme (Theorem 5, [26]) we use the name
ProxSARAH-vstp; for the constant stepsize scheme (Theorem 6, [26])
we call it ProxSARAH-cstp. We also include the SVRG and SAGA
estimator with normalized stepsize rule as described in Section 3.3
and 3.4 respectivley, which we call ProxSVRG and ProxSAGA.
The parameter setting (batch size, epoch length, etc.) of
ProxSARAH-vstp and ProxSARAH- cstp are made to be consistent with
the ProxSARAH-A-v1 and ProxSARAH-v1 in [26] where each mini-batch
consists of only 1 sample, this is the parameter setting that
yields the best empirical performance in [26]. For ProxSPIDER, we
set both the batchsize and epoch length to be
⌈√ N ⌉ .
For ProxSVRG and ProxSAGA, we set the batchsize to be ⌈ N
2 3
⌈ N
⌉ .
Finally, for ProxSPIDER, ProxSVRG and ProxSAGA, we use a variable
sequence εk as in NPAG (Algorithm 1 and set εk = 10/
√ k.
First, we compare these methods over the mnist1 data sets. To fit
the binary classification
1http://yann.lecun.com/exdb/mnist/
23
ProxSARAH-vstp
ProxSARAH-cstp
ProxSPIDER
ProxSVRG
ProxSAGA
Figure 2: Experiments on sparse binary classification on rcv1
dataset, β = 1/20242.
problems, we extract the data of two arbitrary digits, which are 1
and 9 in our experiments. This subset of mnist consists N = 12691
data points. For the ProxSARAH-vstp and ProxSARAH- cstp, the
parameter to be tuned is the estimator of the Lipschtiz constant L.
The smaller L is, the larger the stepsize is. For the ProxSPIDER,
ProxSVRG and ProxSAGA, the parameter to be tuned is the stepsize η,
which corresponds to 1/L in this experiment. We choose L among
{0.01, 0.1, 1, 10} and it turns out that L = 1 works best for
ProxSARAH-vstp and ProxSARAH- cstp for both loss functions. We
choose η among {0.1, 1, 10, 100} and it turns out that η = 1 works
best for ProxSIDER, ProxSVRG and ProxSAGA. These choices are
consistent with the relationship η = 1/L.
Figure 1 shows the norm of the gradient mapping and objective value
gap of different methods. For all the compared methods, the curves
are averaged over 20 runs of the algorithm. It can be seen that
these different methods have similar performance, with ProxSVRG and
ProxSAGA slightly worse and ProxSPIDER slightly better than other
methods. In theory ProxSARAH and ProxSPIDER have the same sampling
complexity, and the empirical results are consistent with the
theory. Since the plots are very similar for the norm of gradient
mapping and the objective value gap, we will only present plots of
the gradient mapping norms for the rest of the experiments.
We also conduct our experiment on the rcv1.binary2 dataset, which
is a medium sized dataset with N = 20242 data points and each data
point has d = 47236 features. The parameters are chosen in a
similar way to the experiments on mnist dataset. In particular, L =
0.01 works best for ProxSARAH-vstp and ProxSARAH-cstp for both loss
functions and η = 100 works best for ProxSIDER, ProxSVRG and
ProxSAGA (again we have η = 1/L). Figure 2 shows the results of
comparison. In this particular case, ProxSAGA perform the best
despite a worse sample complexity in theory.
5.2 Sparse portfolio selection problem
In this section, we present the numerical results for a risk-averse
portfolio optimization problems, which is a common benchmark
example used in many stochastic composite optimization methds
(e.g., [16, 41, 42, 21]). Consider the scenario where d assets can
be invested during the time periods
2https://www.csie.ntu.edu.tw/
cjlin/libsvmtools/datasets/binary.html
# of samples 10 5
# of samples 10 6
Figure 3: Numerical experiments on the sparse portfolio
optimization problem.
{1, ..., N}. We denote the per unit payoff vector at time i by ri
:= [ri,1, ..., ri,d] T , with ri,j being
the per unit payoff of asset j at time period i. Let x ∈ Rd be the
decision variable, where the j-th component xj represents the
amount of investment to asset j throughout the N time period, for j
= 1, ..., d.
In the portfolio optimization problems, our aim is to maximize the
average payoff over N periods. However, a penalty is imposed on the
variance of the payoff across the N periods, which accounts for the
risk of the investment. In addition, we also apply an `1 penalty to
impose sparsity on the portfolio x, which gives the following
formulation:
maximize x∈Rd
E [ rξ, x
] − λVar(rξ, x)− βx1,
where the random variable ξ follows the uniform distribution over
{1, ..., N}. Using the identity that Var(X) = E[X2]−E[X]2, the
problem can be reformulated as
minimize x∈Rd
] − λ ·E
[ rξ, x
]2 + βx1. (50)
If we set f1,ξ(x) : Rd 7→ R2 = [rξ, x rξ, x2]T , f2,ξ2(y, z) : R2
7→ R ≡ −y − λy2 + λz, and Ψ(x) = βx1, then (50) becomes a special
case of (1) with m = 2.
In the experiments, we compare our method with the C-SAGA algorithm
[41], the CIVR and CIVR-adp algorithms [42], and the VRSC-PG
algorithm [16]. We test these algorithms with the real world
portfolio datasets from Keneth R. French Data Library3. The first
dataset is named Industrial-38, which contains 38 industrial assets
payoff over 2552 consecutive time periods; and the second dataset
is named Industrial-49, and contains 49 industrial assets over
24400 consecutive time periods. For both datasets, we set λ = 0.2
and β = 0.01 in problem (50). All the curves are averaged over 20
runs of the methods and are plotted against the number of samples
of the component function and gradient evaluations.
For these experiments, CIVR and Nested-SPIDER both take the batch
size of dN1/2e; CIVR- adp takes the adaptive batch size of Sk =
dmin{10k + 1, N1/2}e; VRSC-PG and C-SAGA both take the batch size
of dN2/3e. For all these methods, their stepsizes η are chosen as
the best from
3http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data
library.html
# of samples 10 5
Figure 4: Experiments on Industrial-38 datasets using different
step size parameter η.
{0.0001, 0.001, 0.01, 0.1, 1} by experiments. In addition, we set
εk = C/ √ k for Nested-SPIDER
with C chosen from {0.1, 1, 10, 50} by experiments. For the
Industrial-38 dataset, η = 0.01 works best for CIVR, CIVR-adp,
C-SAGA and VRSC-PG, and η = 0.1 and C = 1 work best for Nested-
SPIDER. For the Industrial-49 dataset, η = 0.0001 works best for
all five tested algorithms and C = 50 works best for the
Nested-SPIDER algorithm.
Figure 3 shows that for the Industrial-49 dataset, the CIVR-adp and
CIVR perform the best. However, Nested-SPIDER quickly catches up
with them after a slightly slower early stage, which is due to the
relatively restrictive stepsizes caused by normalization (in early
stages of the algorithm). On the contrary, Nested-SPIDER converges
much faster than all other methods in the Industrial- 38 dataset
because the normalization step enables us to exploit a
significantly larger stepsizes (in later stages of the algorithm),
under which all the other methods diverge. In Figure 4, we show the
following behavior of the Nested-SPIDER on Industrial-38. On the
left, we set η = 0.1 for all the algorithms and C = 1 for
Nested-SPIDER. All methods except for Nested-SPIDER diverge. On the
right, we set η = 0.01 for all the algorithms and C = 50 for
Nested-SPIDER. In this case, Nested-SPIDER behaves almost identical
to CIVR.
6 Conclusion
We have proposed a normalized proximal approximate gradient (NPAG)
method for solving multi- level composite stochastic optimization
problems. The approximate gradients at each iteration are obtained
via nested variance reduction using the SARAH/Spider estimator. In
order to find an approximate stationary point where the expected
norm of its gradient mapping is less than ε, the total sample
complexity of our method is O(ε−3) in the expectation case, and
O(N+
√ Nε−2) in the
finite-sum case where N is the total number of functions across all
composition levels. In addition, the dependence of the sample
complexity on the number of composition levels is polynomial,
rather than exponential as in previous work. Our results rely on a
uniform Lipschitz condition on the composite mappings and
functions, which is stronger than the mean-squared Lipschitz
condition required for obtaining similar complexities for
single-level stochastic optimization.
The NPAG method extends the Spider-SFO method [13] to the proximal
setting and for solving multi-level composite stochastic
optimization problems. In particular, requiring a uniform
bound
26
on the MSEs of the approximate gradients allows separate analysis
of the optimization algorithm and variance reduction. In addition,
the uniform bound on the step lengths at each iteration is the key
to enable our analysis of nested variance reduction. More flexible
rules for choosing the step sizes or step lengths have been
developed to obtain similar sample complexities for single- level
stochastic or finite-sum optimization problems, especially those
used in the Spider-Boost [37] and Prox-SARAH [26] methods. It is
likely that such step size rules may also be developed for
multi-level problems, although their complexity analysis may become
much more involved.
A Proof of Lemma 4.2
Proof. Under Assumption 4.1.(a), it is straightforward to show that
fi = Eξifi,ξi is `i-Lipschitz and its gradient f ′i = Eξif
′ i,ξi
is Li-Lipschitz. Recall the definitions Fi = fi fi−1 · · · f1 for i
= 1, . . . ,m. By the chain rule, we have
F ′(x)− F ′(y) =
[f ′1(x)]T · · · [f ′m−1(Fm−2(x))]T f ′m(Fm−1(x))
− [f ′1(y)]T · · · [f ′m−1(Fm−2(y))]T f ′m(Fm−1(y))
≤ [f ′1(x)]T· · · [f ′m−1(Fm−2(x))]Tf ′m(Fm−1(x))− [f ′1(y)]T [f
′2(F1(x))]T· · · f ′m(Fm−1(x))
+ · · · + [f ′1(y)]T· · ·[f ′m−1(Fm−2(y))]Tf ′m(Fm−1(x))− [f
′1(y)]T [f ′2(F1(y))]T· · ·f ′m(Fm−1(y))
≤ L1
`rFm(x)− Fm(y).
On the other hand, we have for any x and y that
Fi(x)− Fi(y) = fi(Fi−1(x))− fi(Fi−1(y))
≤ `iFi−1(x)− Fi−1(y) ≤ i∏
r=1 `rx− y.
Therefore Fi is `F -Lipschitz with `F = ∏m i=1 `i. Substituting
these Lipschitz constants into the
bound of F ′(x)− F ′(y) yields
F ′(x)− F ′(y) ≤
References
[1] Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic
optimization via strongly non- convex parameter. In Proceedings of
the 34th International Conference on Machine Learning (ICML),
volume 70 of Proceedings of Machine Learning Research, pages 89–97,
Sydney, Aus- tralia, 2017.
[2] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization
than SGD. In Advances in Neural Information Processing Systems 31,
pages 2675–2686. Curran Associates, Inc., 2018.
27
[3] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster
non-convex optimization. In Proceedings of the 33rd International
Conference on Machine Learning, pages 699–707, 2016.
[4] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for
non-strongly-convex or sum-of-non- convex objectives. In
Proceedings of the 33rd International Conference on International
Con- ference on Machine Learning (ICML), pages 1080–1089,
2016.
[5] Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series
on Optimization. SIAM, 2017.
[6] Jose Blanchet, Donald Goldfarb, Garud Iyengar, Fengpei Li, and
Chaoxu Zhou. Unbiased simulation for optimizing stochastic function
compositions. Preprint, arXiv:1711.07564, 2017.
[7] Christoph Dann, Gerhard Neumann, and Jan Peters. Policy
evaluation with temporal dif- ferences: a survey and comparison.
Journal of Machine Learning Research, 15(1):809–883, 2014.
[8] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A
fast incremental gradient method with support for non-strongly
convex composite objectives. In Advances in Neural Information
Processing Systems 27, pages 1646–1654, 2014.
[9] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao.
Optimal distributed online prediction using mini-batches. Jorunal
of Machine Learning Research, 13:165–202, 2012.
[10] Darinka Dentcheva, Spiridon Penev, and Andrzej Ruszczynski.
Statistical estimation of com- posite risk functionals and risk
optimization problems. Annals of the Institute of Statistical
Mathematics, 69(4):737–760, 2017.
[11] Dmitriy Drusvyatskiy and Adrian S. Lewis. Error bounds,
quadratic growth, and linear con- vergence of proximal methods.
Mathematics of Operations Research, 43(3):919–948, 2018.
[12] Y. Ermoliev. Methods of Stochastic Programming. Monographs in
Optimization and OR. Nauka, Moscow, 1976.
[13] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang.
Spider: Near-optimal non-convex optimization via stochastic
path-integrated differential estimator. In Advances in Neural In-
formation Processing Systems 31, pages 689–699. Curran Associates,
Inc., 2018.
[14] Saeed Ghadimi and Guanghui Lan. Stochastic first- and
zeroth-order methods for nonconvex stochastic programming. SIAM
Journal on Optimization, 23(4):2341–2368, 2013.
[15] Saeed Ghadimi, Andrzej Ruszczynski, and Mengdi Wang. A single
time-scale stochastic ap- proximation method for nested stochastic
optimization. Preprint, arXiv:1812.01094, 2018.
[16] Zhouyuan Huo, Bin Gu, Ji Jiu, and Heng Huang. Accelerated
method for stochastic composi- tion optimization with nonsmooth
regularization. In Proceedings of the 32nd AAAI Conference on
Artificial Intelligence, pages 3287–3294, 2018.
[17] A. N. Iusem, A. Jofre, R. I. Oliveira, and P. Phompson.
Extragradient method with variance reduction for stochastic
variational inequalities. SIAM Journal on Optimization, 27(2):686–
724, 2017.
28
[18] Rie Johnson and Tong Zhang. Accelerating stochastic gradient
descent using predictive vari- ance reduction. In Advances in
Neural Information Processing Systems 26, pages 315–323,
2013.
[19] J. Koshal, A. Nedic, and U. B. Shanbhag. Regularized iterative
stochastic approximation methods for stochastic variational
inequality problems. IEEE Transactions on Automatic Control,
58(3):594–609, 2013.
[20] Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum composition
optimization via variance reduced gradient descent. In Proceedings
of the 20th International Conference on Artificial Intelligence and
Statistics (AISTATS), pages 1159–1167, 2017.
[21] Tianyi Lin, Chenyou Fan, Mengdi Wang, and Michael I Jordan.
Improved oracle complexity for stochastic compositional variance
reduced gradient. Preprint, arXiv:1806.00458, 2018.
[22] Yurii Nesterov. Gradient methods for minimizing composite
functions. Mathematical Pro- gramming, 140(1):125–161, 2013.
[23] Yurii Nesterov. Lectures on Convex Optimization. Springer, 2nd
edition, 2018.
[24] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takac.
SARAH: A novel method for machine learning problems using
stochastic recursive gradient. In Proceedings of the 34th
International Conference on Machine Learning (ICML), volume 70 of
PMLR, pages 2613–2621, 2017.
[25] Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha
Nguyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth
optimization with SARAH. arXiv:1901.07648, 2019.
[26] Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc
Tran-Dinh. ProxSARAH: An efficient algorithmic framework for
stochastic composite nonconvex optimization. Preprint,
arXiv:1902.05679, 2019.
[27] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos,
and Alex Smola. Stochastic variance reduction for nonconvex
optimization. In Proceedings of The 33rd International Con- ference
on Machine Learning, volume 48 of Proceedings of Machine Learning
Research, pages 314–323, New York, New York, USA, 2016.
[28] Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alex Smola.
Fast incremental method for smooth nonconvex optimization. In 2016
IEEE 55th Conference on Decision and Control (CDC), pages
1971–1977. IEEE, 2016.
[29] Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J
Smola. Proximal stochastic methods for nonsmooth nonconvex
finite-sum optimization. In Advances in Neural Information
Processing Systems 29, pages 1145–1153, 2016.
[30] R. Tyrrell Rockafellar. Convex Analysis. Princeton University
Press, 1970.
[31] R. Tyrrell Rockafellar. Coherent approaches to risk in
optimization under uncertainty. IN- FORMS TutORials in Operations
Research, 2007.
29
[32] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A
stochastic gradient method with an exponential convergence rate for
finite training sets. In Advances in Neural Information Processing
Systems 25, pages 2663–2671. Curran Associates, Inc., 2012.
[33] Andrzej Ruszczynski. Advances in risk-averse optimization.
INFORMS TutORials in Operation Research, 2013.
[34] Richard S Sutton and Andrew G Barto. Reinforcement Learning:
An Introduction. MIT Press, Cambridge, MA, 1998.
[35] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic
compositional gradient descent: algo- rithms for minimizing
compositions of expected-value functions. Mathematical Programming,
161(1-2):419–449, 2017.
[36] Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic
composition optimization. Journal of Machine Learning Research,
18(105):1–23, 2017.
[37] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh.
SpiderBoost: A class of faster variance-reduced algorithms for
nonconvex optimization. Preprint, arXiv:1810.10690, 2018.
[38] Lin Xiao and Tong Zhang. A proximal stochastic gradient method
with progressive variance reduction. SIAM Journal on Optimization,
24(4):2057–2075, 2014.
[39] Shuoguang Yang, Mengdi Wang, and Ethan X. Fang. Multilevel
stochastic gradient methods for nested composition optimization.
SIAM Journal on Optimization, 29(1):616–659, 2019.
[40] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. Adversarial
examples: Attacks and defenses for deep learning. IEEE Transactions
on Neural Networks and Learning Systems, 30(9):2805– 2824,
2019.
[41] Junyu Zhang and Lin Xiao. A composite randomized incremental
gradient method. In Pro- ceedings of the 36th International
Conference on Machine Learning (ICML), number 97 in Proceedings of
Machine Learning Research (PMLR), Long Beach, California,
2019.
[42] Junyu Zhang and Lin Xiao. A stochastic composite gradient
method with incremental variance reduction. In Advances in Neural
Information Processing Systems 32, pages 9078–9088. Curran
Associates, Inc., 2019.
[43] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested
variance reduced gradient descent for nonconvex optimization. In
Advances in Neural Information Processing Systems 31, pages
3921–3932. Curran Associates, Inc., 2018.
30
3 A general framework of stochastic variance reduction
3.1 SARAH/SPIDER estimator for stochastic optimization
3.2 SARAH/SPIDER estimator for finite-sum optimization
3.3 SVRG estimator for stochastic optimization
3.4 SAGA estimator for finite-sum optimization
4 Multi-level nested SPIDER
6 Conclusion