
Escape saddle points by a simple gradient-descent based algorithm

Chenyi Zhang 1    Tongyang Li 2,3,4∗

1 Institute for Interdisciplinary Information Sciences, Tsinghua University, China
2 Center on Frontiers of Computing Studies, Peking University, China
3 School of Computer Science, Peking University, China
4 Center for Theoretical Physics, Massachusetts Institute of Technology, USA

Abstract

Escaping saddle points is a central research topic in nonconvex optimization. In this paper, we propose a simple gradient-based algorithm such that for a smooth function f : R^n → R, it outputs an ε-approximate second-order stationary point in Õ(log n/ε^1.75) iterations. Compared to the previous state-of-the-art algorithms by Jin et al. with Õ(log^4 n/ε^2) or Õ(log^6 n/ε^1.75) iterations, our algorithm is polynomially better in terms of log n and matches their complexities in terms of 1/ε. For the stochastic setting, our algorithm outputs an ε-approximate second-order stationary point in Õ(log^2 n/ε^4) iterations. Technically, our main contribution is an idea of implementing a robust Hessian power method using only gradients, which can find negative curvature near saddle points and achieve the polynomial speedup in log n compared to the perturbed gradient descent methods. Finally, we also perform numerical experiments that support our results.

1 Introduction

Nonconvex optimization is a central research area in optimization theory, since lots of modern machine learning problems can be formulated in models with nonconvex loss functions, including deep neural networks, principal component analysis, tensor decomposition, etc. In general, finding a global minimum of a nonconvex function is NP-hard in the worst case. Instead, many theoretical works focus on finding a local minimum instead of a global one, because recent works (both empirical and theoretical) suggested that local minima are nearly as good as global minima for a significant amount of well-studied machine learning problems; see e.g. [4, 11, 13, 14, 16, 17]. On the other hand, saddle points are major obstacles for solving these problems, not only because they are ubiquitous in high-dimensional settings where the directions for escaping may be few (see e.g. [5, 7, 10]), but also because saddle points can correspond to highly suboptimal solutions (see e.g. [18, 27]).

Hence, one of the most important topics in nonconvex optimization is to escape saddle points. Specifically, we consider a twice-differentiable function f : R^n → R such that

• f is ℓ-smooth: ‖∇f(x1) − ∇f(x2)‖ ≤ ℓ‖x1 − x2‖  ∀x1, x2 ∈ R^n,
• f is ρ-Hessian Lipschitz: ‖H(x1) − H(x2)‖ ≤ ρ‖x1 − x2‖  ∀x1, x2 ∈ R^n;

here H is the Hessian of f. The goal is to find an ε-approximate second-order stationary point xε:1

‖∇f(xε)‖ ≤ ε, λmin(H(xε)) ≥ −√ρε. (1)

∗Corresponding author. Email: [email protected]
1We can ask for an (ε1, ε2)-approx. second-order stationary point s.t. ‖∇f(x)‖ ≤ ε1 and λmin(∇²f(x)) ≥ −ε2 in general. The scaling in (1) was adopted as a standard in literature [1, 6, 9, 19, 20, 21, 25, 28, 29, 30].

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

arXiv:2111.14069v1 [math.OC] 28 Nov 2021


In other words, at any ε-approx. second-order stationary point xε, the gradient is small with norm at most ε, and the Hessian is close to positive semi-definite with all its eigenvalues ≥ −√(ρε).

Algorithms for escaping saddle points are mainly evaluated from two aspects. On the one hand, considering the enormous dimensions of machine learning models in practice, dimension-free or almost dimension-free (i.e., having poly(log n) dependence) algorithms are highly preferred. On the other hand, recent empirical discoveries in machine learning suggest that it is often feasible to tackle difficult real-world problems using simple algorithms, which can be implemented and maintained more easily in practice. On the contrary, algorithms with nested loops often suffer from significant overheads at large scales, or introduce concerns with the setting of hyperparameters and numerical stability (see e.g. [1, 6]), making them relatively hard to implement in practice.
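As a concrete illustration of condition (1) (our own sketch, not part of the paper's algorithms), the check below verifies whether a given point is an ε-approximate second-order stationary point when gradient and Hessian oracles are available; grad_f and hess_f are hypothetical user-supplied callables.

```python
import numpy as np

def is_approx_sosp(x, grad_f, hess_f, eps, rho):
    """Check condition (1): small gradient norm and almost-PSD Hessian.

    grad_f and hess_f are hypothetical oracles for the gradient and Hessian;
    this is only an illustrative check of the definition, not one of the
    paper's algorithms.
    """
    grad_ok = np.linalg.norm(grad_f(x)) <= eps
    lam_min = np.linalg.eigvalsh(hess_f(x))[0]   # smallest Hessian eigenvalue
    hess_ok = lam_min >= -np.sqrt(rho * eps)
    return grad_ok and hess_ok
```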

It is then natural to explore simple gradient-based algorithms for escaping from saddle points. The reason we do not assume access to Hessians is that their construction takes Ω(n²) cost in general, which is computationally infeasible when the dimension is large. A seminal work along this line was by Ge et al. [11], which found an ε-approximate second-order stationary point satisfying (1) using only gradients in O(poly(n, 1/ε)) iterations. This was later improved to an almost dimension-free Õ(log^4 n/ε^2) in the follow-up work [19],2 and the perturbed accelerated gradient descent algorithm [21] based on Nesterov's accelerated gradient descent [26] takes Õ(log^6 n/ε^1.75) iterations. However, these results still suffer from a significant overhead in terms of log n. In the other direction, Refs. [3, 24, 29] demonstrate that an ε-approximate second-order stationary point can be found using gradients in Õ(log n/ε^1.75) iterations. Their results are based on previous works [1, 6] using Hessian-vector products and the observation that the Hessian-vector product can be approximated via the difference of two gradient queries. Hence, their implementations contain nested-loop structures with relatively large numbers of hyperparameters. It has been an open question whether it is possible to keep both the merits of using only first-order information as well as being close to dimension-free using a simple, gradient-based algorithm without a nested-loop structure [22]. This paper answers this question in the affirmative.

Contributions. Our main contribution is a simple, single-loop, and robust gradient-based algorithm that can find an ε-approximate second-order stationary point of a smooth, Hessian Lipschitz function f : R^n → R. Compared to previous works [3, 24, 29] exploiting the idea of a gradient-based Hessian power method, our algorithm has a single-looped, simpler structure and better numerical stability. Compared to the previous state-of-the-art results with single-looped structures by [21] and [19, 20] using Õ(log^6 n/ε^1.75) or Õ(log^4 n/ε^2) iterations, our algorithm achieves a polynomial speedup in log n:

Theorem 1 (informal). Our single-looped algorithm finds an ε-approximate second-order stationary point in Õ(log n/ε^1.75) iterations.

Technically, our work is inspired by the perturbed gradient descent (PGD) algorithm in [19, 20] and the perturbed accelerated gradient descent (PAGD) algorithm in [21]. Specifically, PGD applies gradient descents iteratively until it reaches a point with small gradient, which can be a potential saddle point. Then PGD generates a uniform perturbation in a small ball centered at that point and then continues the GD procedure. It is demonstrated that, with an appropriate choice of the perturbation radius, PGD can shake the point off from the neighborhood of the saddle point and converge to a second-order stationary point with high probability. The PAGD in [21] adopts a similar perturbation idea, but the GD is replaced by Nesterov's AGD [26].

Our algorithm is built upon PGD and PAGD but with one main modification regarding the perturbation idea: it is more efficient to add a perturbation in the negative curvature direction near the saddle point, rather than the uniform perturbation in PGD and PAGD, which is a compromise since we generally cannot access the Hessian at the saddle due to its high computational cost. Our key observation lies in the fact that we do not have to compute the entire Hessian to detect the negative curvature. Instead, in a small neighborhood of a saddle point, gradients can be viewed as Hessian-vector products plus some bounded deviation. In particular, GD near the saddle with learning rate 1/ℓ is approximately the same as the power method of the matrix (I − H/ℓ). As a result, the most negative eigenvalues stand out in GD because they have leading exponents in the power method, and thus it approximately moves along the direction of the most negative curvature nearby the saddle point.

2The Õ notation omits poly-logarithmic terms, i.e., Õ(g) = O(g·poly(log g)).



Following this approach, we can escape the saddle points more rapidly than previous algorithms: for a constant ε, PGD and PAGD take O(log n) iterations to decrease the function value by Ω(1/log^3 n) and Ω(1/log^5 n) with high probability, respectively; on the contrary, we can first take O(log n) iterations to specify a negative curvature direction, and then add a larger perturbation in this direction to decrease the function value by Ω(1). See Proposition 3 and Proposition 5. After escaping the saddle point, similar to PGD and PAGD, we switch back to GD and AGD iterations, which are efficient in decreasing the function value when the gradient is large [19, 20, 21].
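To make the power-method picture concrete, the following small demonstration (our own illustration under the stated assumptions, not code from the paper) runs gradient descent with learning rate 1/ℓ on a purely quadratic saddle f(y) = ½yᵀHy and checks that the iterate aligns with the most negative curvature direction of H.

```python
import numpy as np

# On a quadratic saddle f(y) = 0.5 * y^T H y, a GD step with learning rate 1/ell
# maps y to (I - H/ell) y, i.e. one step of the power method on (I - H/ell).
rng = np.random.default_rng(0)
n, ell = 50, 4.0
eigvals = np.linspace(-1.0, 3.0, n)            # smallest eigenvalue is -1 (negative curvature)
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthonormal eigenbasis
H = Q @ np.diag(eigvals) @ Q.T

y = rng.normal(size=n)                         # small random perturbation
for _ in range(500):
    y = y - (H @ y) / ell                      # gradient step = (I - H/ell) @ y
    y /= np.linalg.norm(y)                     # renormalize for numerical stability

u_min = Q[:, 0]                                # eigenvector of the most negative eigenvalue
print(abs(y @ u_min))                          # close to 1: y aligned with negative curvature
```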

Our algorithm is also applicable to the stochastic setting where we can only access stochastic gradients, and the stochasticity is not under the control of our algorithm. We further assume that the stochastic gradients are Lipschitz (or equivalently, the underlying functions are gradient-Lipschitz, see Assumption 2), which is also adopted in most of the existing works; see e.g. [8, 19, 20, 34]. We demonstrate that a simple extended version of our algorithm takes Õ(log^2 n) iterations to detect a negative curvature direction using only stochastic gradients, and then obtain an Ω(1) function value decrease with high probability. On the contrary, the perturbed stochastic gradient descent (PSGD) algorithm in [19, 20], the stochastic version of PGD, takes Õ(log^10 n) iterations to decrease the function value by Ω(1/log^5 n) with high probability.

Theorem 2 (informal). In the stochastic setting, our algorithm finds an ε-approximate second-order stationary point using Õ(log^2 n/ε^4) iterations via stochastic gradients.

Our results are summarized in Table 1. Although the underlying dynamics in [3, 24, 29] and our algorithm have similarity, the main focus of our work is different. Specifically, Refs. [3, 24, 29] mainly aim at using novel techniques to reduce the iteration complexity for finding a second-order stationary point, whereas our work mainly focuses on reducing the number of loops and hyper-parameters of negative curvature finding methods while preserving their advantage in iteration complexity, since a much simpler structure accords with empirical observations and enables wider applications. Moreover, the choice of perturbation in [3] is based on the Chebyshev approximation theory, which may require additional nested-looped structures to boost the success probability. In the stochastic setting, there are also other results studying nonconvex optimization [15, 23, 31, 36, 12, 32, 35] from different perspectives than escaping saddle points, which are incomparable to our results.

Setting          Reference     Oracle                   Iterations           Simplicity
Non-stochastic   [1, 6]        Hessian-vector product   Õ(log n/ε^1.75)      Nested-loop
Non-stochastic   [19, 20]      Gradient                 Õ(log^4 n/ε^2)       Single-loop
Non-stochastic   [21]          Gradient                 Õ(log^6 n/ε^1.75)    Single-loop
Non-stochastic   [3, 24, 29]   Gradient                 Õ(log n/ε^1.75)      Nested-loop
Non-stochastic   this work     Gradient                 Õ(log n/ε^1.75)      Single-loop

Stochastic       [19, 20]      Gradient                 Õ(log^15 n/ε^4)      Single-loop
Stochastic       [9]           Gradient                 Õ(log^5 n/ε^3.5)     Single-loop
Stochastic       [3]           Gradient                 Õ(log^2 n/ε^3.5)     Nested-loop
Stochastic       [8]           Gradient                 Õ(log^2 n/ε^3)       Nested-loop
Stochastic       this work     Gradient                 Õ(log^2 n/ε^4)       Single-loop

Table 1: A summary of the state-of-the-art results on finding approximate second-order stationary points by the first-order (gradient) oracle. Iteration numbers are highlighted in terms of the dimension n and the precision ε.

It is worth highlighting that our gradient-descent based algorithm enjoys the following nice features:

• Simplicity: Some of the previous algorithms have nested-loop structures with the concern of practical impact when setting the hyperparameters. In contrast, our algorithm based on negative curvature finding only contains a single loop with two components: gradient descent (including AGD or SGD) and perturbation. As mentioned above, such a simple structure is preferred in machine learning, which increases the possibility of our algorithm finding real-world applications.

• Numerical stability: Our algorithm contains an additional renormalization step at each iteration when escaping from saddle points. Although in the theoretical aspect a renormalization step does not affect the output and the complexity of our algorithm, when finding negative curvature near saddle points it enables us to sample gradients in a larger region, which makes our algorithm more numerically stable against floating point error and other errors. The introduction of the renormalization step is enabled by the simple structure of our algorithm, which may not be feasible for more complicated algorithms [3, 24, 29].

• Robustness: Our algorithm is robust against adversarial attacks when evaluating the value of the gradient. Specifically, when analyzing the performance of our algorithm near saddle points, we essentially view the deflation from pure quadratic geometry as an external noise. Hence, the effectiveness of our algorithm is unaffected under external attacks as long as the adversary is bounded by deflations from the quadratic landscape.

Finally, we perform numerical experiments that support our polynomial speedup in log n. We perform our negative curvature finding algorithms using GD or SGD in various landscapes and general classes of nonconvex functions, and use comparative studies to show that our Algorithm 1 and Algorithm 3 achieve a higher probability of escaping saddle points using much fewer iterations than PGD and PSGD (typically less than 1/3 of the iteration number of PGD and 1/2 of the iteration number of PSGD, respectively). Moreover, we perform numerical experiments benchmarking the solution quality and iteration complexity of our algorithm against accelerated methods. Compared to PAGD [21] and even advanced optimization algorithms such as NEON+ [29], Algorithm 2 possesses better solution quality and iteration complexity in various landscapes given by more general nonconvex functions. With fewer iterations compared to PAGD and NEON+ (typically less than 1/3 of the iteration number of PAGD and 1/2 of the iteration number of NEON+, respectively), our Algorithm 2 achieves a higher probability of escaping from saddle points.

Open questions. This work leaves a couple of natural open questions for future investigation:

• Can we achieve the polynomial speedup in log n for more advanced stochastic optimization algorithms with complexity O(poly(log n)/ε^3.5) [2, 3, 9, 28, 30] or O(poly(log n)/ε^3) [8, 33]?

• How is the performance of our algorithms for escaping saddle points in real-world applications, such as tensor decomposition [11, 16], matrix completion [13], etc.?

Broader impact. This work focuses on the theory of nonconvex optimization, and as far as we see, we do not anticipate any potential negative societal impact. Nevertheless, it might have a positive impact for researchers who are interested in understanding the theoretical underpinnings of (stochastic) gradient descent methods for machine learning applications.

Organization. In Section 2, we introduce our gradient-based Hessian power method algorithm for negative curvature finding, and present how our algorithms provide a polynomial speedup in log n for both PGD and PAGD. In Section 3, we present the stochastic version of our negative curvature finding algorithm using stochastic gradients and demonstrate its polynomial speedup in log n for PSGD. Numerical experiments are presented in Section 4. We provide detailed proofs and additional numerical experiments in the supplementary material.

2 A simple algorithm for negative curvature finding

We show how to find negative curvature near a saddle point using a gradient-based Hessian power method algorithm, and extend it to a version with a faster convergence rate by replacing gradient descents with accelerated gradient descents. The intuition works as follows: in a small enough region near a saddle point, the gradient can be approximately expressed as a Hessian-vector product formula, and the approximation error can be efficiently upper bounded, see Eq. (6). Hence, using only gradient information, we can implement an accurate enough Hessian power method to find negative eigenvectors of the Hessian matrix, and further find the negative curvature near the saddle.

2.1 Negative curvature finding based on gradient descents

We first present an algorithm for negative curvature finding based on gradient descents. Specifically, for any x ∈ R^n with λmin(H(x)) ≤ −√(ρε), it finds a unit vector e such that eᵀH(x)e ≤ −√(ρε)/4.



Algorithm 1: Negative Curvature Finding(x, r, T).
1   y0 ← Uniform(Bx(r)), where Bx(r) is the ℓ2-norm ball centered at x with radius r;
2   for t = 1, ..., T do
3       yt ← yt−1 − (‖yt−1‖/(ℓr)) · (∇f(x + r·yt−1/‖yt−1‖) − ∇f(x));
4   Output yT/r.
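For concreteness, the following is a minimal Python sketch of Algorithm 1 (our own rendering under the stated assumptions, not the authors' code); grad_f is a hypothetical gradient oracle, and the per-iteration renormalization anticipates Remark 4 below, which notes that it does not affect the output.

```python
import numpy as np

def negative_curvature_finding(x, r, T, grad_f, ell, rng=None):
    """Sketch of Algorithm 1 (illustrative; grad_f is a hypothetical gradient oracle).

    The update y <- y - (||y||/(ell*r)) * (grad_f(x + r*y/||y||) - grad_f(x))
    emulates the power method on (I - H(x)/ell) using only gradient queries.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    # Line 1: sample the initial displacement uniformly from the radius-r ball.
    y = rng.normal(size=n)
    y *= r * rng.uniform() ** (1.0 / n) / np.linalg.norm(y)
    for _ in range(T):
        norm_y = np.linalg.norm(y)
        y = y - (norm_y / (ell * r)) * (grad_f(x + r * y / norm_y) - grad_f(x))
        y *= r / np.linalg.norm(y)      # renormalization from Remark 4 (optional)
    return y / r                        # Line 4: output y_T / r (a unit vector here)
```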

Proposition 3. Suppose the function f : R^n → R is ℓ-smooth and ρ-Hessian Lipschitz. For any 0 < δ0 ≤ 1, we specify our choice of parameters and constants we use as follows:

T = (8ℓ/√(ρε)) · log((ℓ/δ0)·√(n/(πρε))),    r = (ε/(8ℓ))·√(π/n)·δ0.    (2)

Suppose that x satisfies λmin(∇²f(x)) ≤ −√(ρε). Then with probability at least 1 − δ0, Algorithm 1 outputs a unit vector e satisfying

eᵀH(x)e ≤ −√(ρε)/4,    (3)

using O(T) = O(log n/√(ρε)) iterations, where H stands for the Hessian matrix of function f.

Proof. Without loss of generality we assume x = 0 by shifting R^n such that x is mapped to 0. Define a new n-dimensional function

hf(x) := f(x) − ⟨∇f(0), x⟩,    (4)

for the ease of our analysis. Since ⟨∇f(0), x⟩ is a linear function with Hessian being 0, the Hessian of hf equals the Hessian of f, and hf(x) is also ℓ-smooth and ρ-Hessian Lipschitz. In addition, note that ∇hf(0) = ∇f(0) − ∇f(0) = 0. Then for all x ∈ R^n,

∇hf(x) = ∫_{ξ=0}^{1} H(ξx)·x dξ = H(0)x + ∫_{ξ=0}^{1} (H(ξx) − H(0))·x dξ.    (5)

Furthermore, due to the ρ-Hessian Lipschitz condition of both f and hf, for any ξ ∈ [0, 1] we have ‖H(ξx) − H(0)‖ ≤ ρ‖x‖, which leads to

‖∇hf(x) − H(0)x‖ ≤ ρ‖x‖².    (6)

Observe that the Hessian matrix H(0) admits the following eigen-decomposition:

H(0) = Σ_{i=1}^{n} λi·ui·uiᵀ,    (7)

where the set {ui}_{i=1}^{n} forms an orthonormal basis of R^n. Without loss of generality, we assume the eigenvalues λ1, λ2, . . . , λn corresponding to u1, u2, . . . , un satisfy

λ1 ≤ λ2 ≤ · · · ≤ λn,    (8)

in which λ1 ≤ −√(ρε). If λn ≤ −√(ρε)/2, Proposition 3 holds directly. Hence, we only need to prove the case where λn > −√(ρε)/2, in which there exist some p ≥ 1, p′ ≥ 1 with

λp ≤ −√(ρε) ≤ λp+1,    λp′ ≤ −√(ρε)/2 < λp′+1.    (9)

We use S‖, S⊥ to separately denote the subspaces of R^n spanned by {u1, u2, . . . , up} and {up+1, up+2, . . . , un}, and use S′‖, S′⊥ to denote the subspaces of R^n spanned by {u1, u2, . . . , up′} and {up′+1, up′+2, . . . , un}. Furthermore, we define

yt,‖ := Σ_{i=1}^{p} ⟨ui, yt⟩ui,    yt,⊥ := Σ_{i=p+1}^{n} ⟨ui, yt⟩ui;    (10)

yt,‖′ := Σ_{i=1}^{p′} ⟨ui, yt⟩ui,    yt,⊥′ := Σ_{i=p′+1}^{n} ⟨ui, yt⟩ui    (11)


respectively to denote the components of yt in Line 3 in the subspaces S‖, S⊥, S′‖, S′⊥, and let αt := ‖yt,‖‖/‖yt‖. Observe that

Pr[α0 ≥ δ0√(π/n)] ≥ Pr[|y0,1|/r ≥ δ0√(π/n)],    (12)

where y0,1 := ⟨u1, y0⟩ denotes the component of y0 along u1. Consider the case where α0 ≥ δ0√(π/n), which can be achieved with probability

Pr[α0 ≥ √(π/n)·δ0] ≥ 1 − √(π/n)·δ0 · Vol(B0^{n−1}(1))/Vol(B0^{n}(1)) ≥ 1 − √(π/n)·δ0 · √(n/π) = 1 − δ0.    (13)

We prove that there exists some t0 with 1 ≤ t0 ≤ T such that

‖yt0,⊥′‖/‖yt0‖ ≤ √(ρε)/(8ℓ).    (14)

Assume the contrary, i.e., for any 1 ≤ t ≤ T we have ‖yt,⊥′‖/‖yt‖ > √(ρε)/(8ℓ). Then ‖yt,⊥′‖ satisfies the following recurrence formula:

‖yt+1,⊥′‖ ≤ (1 + √(ρε)/(2ℓ))‖yt,⊥′‖ + ‖∆⊥′‖ ≤ (1 + √(ρε)/(2ℓ) + ‖∆‖/‖yt,⊥′‖)‖yt,⊥′‖,    (15)

where

∆ := (‖yt‖/(rℓ)) · (∇hf(r·yt/‖yt‖) − H(0)·(r·yt/‖yt‖))    (16)

stands for the deviation from the pure quadratic approximation, and ‖∆‖/‖yt‖ ≤ ρr/ℓ due to (6). Hence,

‖yt+1,⊥′‖ ≤ (1 + √(ρε)/(2ℓ) + ‖∆‖/‖yt,⊥′‖)‖yt,⊥′‖ ≤ (1 + √(ρε)/(2ℓ) + 8ρr/√(ρε))‖yt,⊥′‖,    (17)

which leads to

‖yt,⊥′‖ ≤ ‖y0,⊥′‖(1 + √(ρε)/(2ℓ) + 8ρr/√(ρε))^t ≤ ‖y0,⊥′‖(1 + 5√(ρε)/(8ℓ))^t,  ∀t ∈ [T].    (18)

Similarly, we can have the recurrence formula for ‖yt,‖‖:

‖yt+1,‖‖ ≥ (1 + √(ρε)/(2ℓ))‖yt,‖‖ − ‖∆‖‖ ≥ (1 + √(ρε)/(2ℓ) − ‖∆‖/(αt‖yt‖))‖yt,‖‖.    (19)

Considering that ‖∆‖/‖yt‖ ≤ ρr/ℓ due to (6), we can further have

‖yt+1,‖‖ ≥ (1 + √(ρε)/(2ℓ) − ρr/(αtℓ))‖yt,‖‖.    (20)

Intuitively, ‖yt,‖‖ should have a faster increasing rate than ‖yt,⊥‖ in this gradient-based Hessian power method, ignoring the deviation from the quadratic approximation. As a result, the value αt = ‖yt,‖‖/‖yt‖ should be non-decreasing. It is demonstrated in Lemma 18 in Appendix B that, even if we count this deviation in, αt can still be lower bounded by some constant αmin:

αt ≥ αmin = (δ0/4)·√(π/n),  ∀1 ≤ t ≤ T,    (21)

by which we can further deduce that

‖yt,‖‖ ≥ ‖y0,‖‖(1 + √(ρε)/ℓ − ρr/(αminℓ))^t ≥ ‖y0,‖‖(1 + 7√(ρε)/(8ℓ))^t,  ∀1 ≤ t ≤ T.    (22)

Observe that

‖yT,⊥′‖/‖yT,‖‖ ≤ (‖y0,⊥′‖/‖y0,‖‖) · ((1 + 5√(ρε)/(8ℓ))/(1 + 7√(ρε)/(8ℓ)))^T ≤ (1/δ0)·√(n/π) · ((1 + 5√(ρε)/(8ℓ))/(1 + 7√(ρε)/(8ℓ)))^T ≤ √(ρε)/(8ℓ).    (23)

Since ‖yT,‖‖ ≤ ‖yT‖, we have ‖yT,⊥′‖/‖yT‖ ≤ √(ρε)/(8ℓ), a contradiction. Hence, there exists some t0 with 1 ≤ t0 ≤ T such that ‖yt0,⊥′‖/‖yt0‖ ≤ √(ρε)/(8ℓ). Consider the normalized vector e = yt0/r; we use e⊥′ and e‖′ to separately denote the components of e in S′⊥ and S′‖. Then ‖e⊥′‖ ≤ √(ρε)/(8ℓ) whereas ‖e‖′‖² ≥ 1 − ρε/(8ℓ)². Then,

eᵀH(0)e = (e⊥′ + e‖′)ᵀH(0)(e⊥′ + e‖′) = e⊥′ᵀH(0)e⊥′ + e‖′ᵀH(0)e‖′    (24)

since H(0)e⊥′ ∈ S′⊥ and H(0)e‖′ ∈ S′‖. Due to the ℓ-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value upper bounded by ℓ. Hence,

e⊥′ᵀH(0)e⊥′ ≤ ℓ‖e⊥′‖² ≤ ρε/(64ℓ).    (25)

Further, according to the definition of S′‖, we have

e‖′ᵀH(0)e‖′ ≤ −√(ρε)‖e‖′‖²/2.    (26)

Combining these two inequalities together, we obtain

eᵀH(0)e = e⊥′ᵀH(0)e⊥′ + e‖′ᵀH(0)e‖′ ≤ −√(ρε)‖e‖′‖²/2 + ρε/(64ℓ) ≤ −√(ρε)/4.    (27)

Remark 4. In practice, the value of ‖yt‖ can become large during the execution of Algorithm 1. To fix this, we can renormalize yt to have ℓ2-norm r at the end of each such iteration, and this does not influence the performance of the algorithm.

2.2 Faster negative curvature finding based on accelerated gradient descents

In this subsection, we replace the GD part in Algorithm 1 by AGD to obtain an accelerated negative curvature finding subroutine with a similar effect and faster convergence rate, based on which we further implement our Accelerated Gradient Descent with Negative Curvature Finding (Algorithm 2). Near any saddle point x ∈ R^n with λmin(H(x)) ≤ −√(ρε), Algorithm 2 finds a unit vector e such that eᵀH(x)e ≤ −√(ρε)/4.

Algorithm 2: Perturbed Accelerated Gradient Descent with Accelerated Negative Curvature Finding(x0, η, θ, γ, s, T′, r′)
1   tperturb ← 0, z0 ← x0, x ← x0, ζ ← 0;
2   for t = 0, 1, 2, ..., T do
3       if ‖∇f(xt)‖ ≤ ε and t − tperturb > T then
4           x ← xt;
5           xt ← Uniform(Bx(r′)), where Bx(r′) is the ℓ2-norm ball centered at x with radius r′; zt ← xt, ζ ← ∇f(x), tperturb ← t;
6       if t − tperturb = T′ then
7           e := (xt − x)/‖xt − x‖;
8           xt ← x − (f′e(x)/(4|f′e(x)|))·√(ε/ρ)·e, zt ← xt, ζ ← 0;
9       xt+1 ← zt − η(∇f(zt) − ζ);
10      vt+1 ← xt+1 − xt;
11      zt+1 ← xt+1 + (1 − θ)vt+1;
12      if tperturb ≠ 0 and t − tperturb < T′ then
13          zt+1 ← x + r′·(zt+1 − x)/‖zt+1 − x‖,  xt+1 ← x + r′·(xt+1 − x)/‖zt+1 − x‖;
14      else
15          if f(xt+1) ≤ f(zt+1) + ⟨∇f(zt+1), xt+1 − zt+1⟩ − (γ/2)‖zt+1 − xt+1‖² then
16              (xt+1, vt+1) ← NegativeCurvatureExploitation(xt+1, vt+1, s)3;
17              zt+1 ← xt+1 + (1 − θ)vt+1;

The following proposition exhibits the effectiveness of Algorithm 2 for finding negative curvatures near saddle points:

3This NegativeCurvatureExploitation (NCE) subroutine was originally introduced in [21, Algorithm 3] and is called when we detect that the current momentum vt coincides with a negative curvature direction of zt. In this case, we reset the momentum vt and decide whether to exploit this direction based on the value of ‖vt‖.



Proposition 5. Suppose the function f : R^n → R is ℓ-smooth and ρ-Hessian Lipschitz. For any 0 < δ0 ≤ 1, we specify our choice of parameters and constants we use as follows:

η := 1/(4ℓ),   θ := (ρε)^{1/4}/(4√ℓ),   T′ := (32√ℓ/(ρε)^{1/4})·log((ℓ/δ0)·√(n/(ρε))),   γ := θ²/η,   s := γ/(4ρ),   r′ := (δ0ε/32)·√(π/(ρn)).    (28)

Then for a point x satisfying λmin(∇²f(x)) ≤ −√(ρε), if running Algorithm 2 with the uniform perturbation in Line 5 being added at t = 0, the unit vector e in Line 7 obtained after T′ iterations satisfies:

Pr( eᵀH(x)e ≤ −√(ρε)/4 ) ≥ 1 − δ0.    (29)

The proof of Proposition 5 is similar to the proof of Proposition 3, and is deferred to Appendix B.2.

2.3 Escaping saddle points using negative curvature finding

In this subsection, we demonstrate that our Algorithm 1 and Algorithm 2, with the ability to find negative curvature near saddle points, can further escape saddle points of nonconvex functions. The intuition works as follows: we start with gradient descents or accelerated gradient descents until the gradient becomes small. At this position, we compute the negative curvature direction, described by a unit vector e, via Algorithm 1 or the negative curvature finding subroutine of Algorithm 2. Then, we add a perturbation along this direction of negative curvature and go back to gradient descents or accelerated gradient descents with an additional NegativeCurvatureExploitation subroutine (see Footnote 3). It has the following guarantee:

Lemma 6. Suppose the function f : R^n → R is ℓ-smooth and ρ-Hessian Lipschitz. Then for any point x0 ∈ R^n, if there exists a unit vector e satisfying eᵀH(x0)e ≤ −√(ρε)/4, where H stands for the Hessian matrix of function f, the following inequality holds:

f( x0 − (f′e(x0)/(4|f′e(x0)|))·√(ε/ρ)·e ) ≤ f(x0) − (1/384)·√(ε³/ρ),    (30)

where f′e stands for the gradient component of f along the direction of e.

Proof. Without loss of generality, we assume x0 = 0. We can also assume ⟨∇f(0), e⟩ ≤ 0; if this is not the case we can pick −e instead, which still satisfies (−e)ᵀH(x0)(−e) ≤ −√(ρε)/4. In practice, to figure out whether we should use e or −e, we apply both of them in (30) and choose the one with the smaller function value. Then, for any x = xe·e with some xe > 0, we have ∂²f/∂xe²(x) ≤ −√(ρε)/4 + ρxe due to the ρ-Hessian Lipschitz condition of f. Hence,

∂f/∂xe(x) ≤ f′e(0) − (√(ρε)/4)·xe + ρxe²,    (31)

by which we can further derive that

f(xe·e) − f(0) ≤ f′e(0)·xe − (√(ρε)/8)·xe² + (ρ/3)·xe³ ≤ −(√(ρε)/8)·xe² + (ρ/3)·xe³.    (32)

Setting xe = (1/4)·√(ε/ρ) gives (30).

We give the full algorithm details based on Algorithm 1 in Appendix C.1. Along this approach, we achieve the following:

Theorem 7 (informal, full version deferred to Appendix C.3). For any ε > 0 and a constant 0 < δ ≤ 1, Algorithm 2 satisfies that at least one of the iterations xt will be an ε-approximate second-order stationary point in

Õ( ((f(x0) − f*)/ε^1.75) · log n )    (33)

iterations, with probability at least 1 − δ, where f* is the global minimum of f.

Intuitively, the proof of Theorem 7 has two parts. The first part is similar to the proof of [21, Theorem 3], which shows that PAGD uses Õ(log^6 n/ε^1.75) iterations to escape saddle points. We show that there can be at most Õ(∆f/ε^1.75) iterations with the norm of the gradient larger than ε using almost the same techniques, but with slightly different parameter choices. The second part is based on the negative curvature part of Algorithm 2, our accelerated negative curvature finding algorithm. Specifically, at each saddle point we encounter, we can take Õ(log n/ε^{1/4}) iterations to find its negative curvature (Proposition 5), and add a perturbation in this direction to decrease the function value by Ω(ε^1.5) (Lemma 6). Hence, the iterations introduced by Algorithm 4 can be at most Õ((log n/ε^1.5)·(1/ε^0.25)) = Õ(log n/ε^1.75), which is simply an upper bound on the overall iteration number. The detailed proof is deferred to Appendix C.3.

Remark 8. Although Theorem 7 only demonstrates that our algorithms will visit some ε-approximate second-order stationary point during their execution with high probability, it is straightforward to identify one of them if we add a termination condition: once Negative Curvature Finding (Algorithm 1 or Algorithm 2) is applied, we record the position xt0 and the function value decrease due to the following perturbation. If the function value decrease is larger than (1/384)·√(ε³/ρ) as per Lemma 6, then the algorithms make progress. Otherwise, xt0 is an ε-approximate second-order stationary point with high probability.
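As a small illustration of this termination rule (our own sketch; f, x_t0, and x_after are hypothetical handles to the objective and to the iterates before and after the perturbation, and the threshold is the constant from Lemma 6):

```python
import numpy as np

def check_progress(f, x_t0, x_after, eps, rho):
    """Termination check sketched in Remark 8 (illustrative only).

    Returns "progress" if the perturbation along the found negative-curvature
    direction decreased f by more than sqrt(eps**3 / rho) / 384 (Lemma 6);
    otherwise x_t0 is reported as an eps-approximate second-order stationary
    point with high probability.
    """
    threshold = np.sqrt(eps ** 3 / rho) / 384.0
    if f(x_t0) - f(x_after) > threshold:
        return "progress"
    return "stationary point found"
```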

3 Stochastic setting

In this section, we present a stochastic version of Algorithm 1 using stochastic gradients, and demonstrate that it can also be used to escape saddle points and obtain a polynomial speedup in log n compared to the perturbed stochastic gradient descent (PSGD) algorithm in [20].

3.1 Stochastic negative curvature finding

In the stochastic gradient descent setting, the exact gradient oracle ∇f of function f cannot be accessed. Instead, we only have unbiased stochastic gradients g(x; θ) such that

∇f(x) = E_{θ∼D}[g(x; θ)],  ∀x ∈ R^n,    (34)

where D stands for the probability distribution followed by the random variable θ. Define

ζ(x; θ) := g(x; θ) − ∇f(x)    (35)

to be the error term of the stochastic gradient. Then, the expected value of the vector ζ(x; θ) at any x ∈ R^n equals 0. Further, we assume the stochastic gradient g(x; θ) also satisfies the following assumptions, which were also adopted in previous literature; see e.g. [8, 19, 20, 34].

Assumption 1. For any x ∈ R^n, the stochastic gradient g(x; θ) with θ ∼ D satisfies:

Pr[‖g(x; θ) − ∇f(x)‖ ≥ t] ≤ 2 exp(−t²/(2σ²)),  ∀t ∈ R.    (36)

Assumption 2. For any θ ∈ supp(D), g(x; θ) is ℓ̃-Lipschitz for some constant ℓ̃:

‖g(x1; θ) − g(x2; θ)‖ ≤ ℓ̃‖x1 − x2‖,  ∀x1, x2 ∈ R^n.    (37)

Assumption 2 emerges from the fact that the stochastic gradient g is often obtained as an exact gradient of some smooth function,

g(x; θ) = ∇f(x; θ).    (38)

In this case, Assumption 2 guarantees that for any θ ∼ D, the spectral norm of the Hessian of f(x; θ) is upper bounded by ℓ̃. Under these two assumptions, we can construct the stochastic version of Algorithm 1, as shown in Algorithm 3. Similar to the non-stochastic setting, Algorithm 3 can be used to escape from saddle points and obtain a polynomial speedup in log n compared to the PSGD algorithm in [20]. This is quantitatively shown in the following theorem:



Algorithm 3: Stochastic Negative Curvature Finding(x0, rs, Ts, m).
1   y0 ← 0, L0 ← rs;
2   for t = 1, ..., Ts do
3       Sample {θ^(1), θ^(2), · · · , θ^(m)} ∼ D;
4       g(yt−1) ← (1/m)·Σ_{j=1}^{m} ( g(x0 + yt−1; θ^(j)) − g(x0; θ^(j)) );
5       yt ← yt−1 − (1/ℓ)·(g(yt−1) + ξt/Lt−1),  ξt ∼ N(0, (rs²/d)·I);
6       Lt ← (‖yt‖/rs)·Lt−1 and yt ← yt·rs/‖yt‖;
7   Output yTs/rs.

Theorem 9 (informal, full version deferred to Appendix D.2). For any ε > 0 and a constant 0 < δ ≤ 1, our algorithm4 based on Algorithm 3 using only stochastic gradient descent satisfies that at least one of the iterations xt will be an ε-approximate second-order stationary point in

Õ( ((f(x0) − f*)/ε^4) · log^2 n )    (39)

iterations, with probability at least 1 − δ, where f* is the global minimum of f.
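For concreteness, here is a minimal Python rendering of Algorithm 3 (our own sketch under the stated assumptions, not the authors' code); stoch_grad(x, theta) is a hypothetical oracle returning g(x; θ), sample_theta draws θ ∼ D, and the noise dimension d in Line 5 is taken to be the ambient dimension n.

```python
import numpy as np

def stochastic_negative_curvature_finding(x0, r_s, T_s, m, stoch_grad, sample_theta, ell, rng=None):
    """Sketch of Algorithm 3 (illustrative only).

    Each iteration averages m paired stochastic gradient differences to emulate
    a Hessian-vector product, adds injected Gaussian noise, then renormalizes
    y_t to norm r_s while tracking the accumulated scale L_t.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x0.size
    y = np.zeros(n)          # Line 1
    L = r_s
    for _ in range(T_s):
        # Lines 3-4: averaged paired gradient differences, sharing theta^(j)
        g = np.zeros(n)
        for _ in range(m):
            theta = sample_theta(rng)
            g += stoch_grad(x0 + y, theta) - stoch_grad(x0, theta)
        g /= m
        # Line 5: descent step plus Gaussian noise scaled by the running factor L
        xi = rng.normal(scale=r_s / np.sqrt(n), size=n)
        y = y - (g + xi / L) / ell
        # Line 6: update the scale and renormalize y to norm r_s
        L *= np.linalg.norm(y) / r_s
        y *= r_s / np.linalg.norm(y)
    return y / r_s           # Line 7
```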

4 Numerical experiments

In this section, we provide numerical results that exhibit the power of our negative curvature finding algorithm for escaping saddle points. More experimental results can be found in Appendix E. All the experiments are performed on MATLAB R2019b on a computer with a Six-Core Intel Core i7 processor and 16GB memory, and their codes are given in the supplementary material.

Comparison between Algorithm 1 and PGD. We compare the performance of our Algorithm 1 with the perturbed gradient descent (PGD) algorithm in [20] on a test function f(x1, x2) = (1/16)x1⁴ − (1/2)x1² + (9/8)x2² with a saddle point at (0, 0). The advantage of Algorithm 1 is illustrated in Figure 1.
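The paper's experiments are implemented in MATLAB; as a language-agnostic reference (our own sketch, not the authors' code), the test landscape above and its gradient are:

```python
import numpy as np

# Test landscape of the Figure 1 experiment: a saddle at the origin with
# negative curvature along x1 (the Hessian at (0, 0) is diag(-1, 9/4)).
def f(x):
    x1, x2 = x
    return x1 ** 4 / 16.0 - x1 ** 2 / 2.0 + 9.0 * x2 ** 2 / 8.0

def grad_f(x):
    x1, x2 = x
    return np.array([x1 ** 3 / 4.0 - x1, 9.0 * x2 / 4.0])
```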

Figure 1: Run Algorithm 1 and PGD on the landscape f(x1, x2) = (1/16)x1⁴ − (1/2)x1² + (9/8)x2². Parameters: η = 0.05 (step length), r = 0.1 (ball radius in PGD and parameter r in Algorithm 1), M = 300 (number of samplings). Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 1 at time steps tNCGD = 15 and tNCGD = 30, and red points represent samplings of PGD at time steps tPGD = 45 and tPGD = 90. Algorithm 1 transforms an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PGD even when tNCGD ≪ tPGD. Right: A histogram of descent values obtained by Algorithm 1 and PGD, respectively. Set tNCGD = 30 and tPGD = 90. Although we run three times the iterations in PGD, there are still over 40% of gradient descent paths with function value decrease no greater than 0.9, while this ratio for Algorithm 1 is less than 5%.

4Our algorithm based on Algorithm 3 has similarities to the Neon2^online algorithm in [3]. Both algorithms find a second-order stationary point for stochastic optimization in Õ(log^2 n/ε^4) iterations, and both apply directed perturbations based on the results of negative curvature finding. Nevertheless, our algorithm enjoys simplicity by only having a single loop, whereas Neon2^online has a nested loop for boosting its confidence.



Comparison between Algorithm 3 and PSGD. We compare the performance of our Algorithm 3 with the perturbed stochastic gradient descent (PSGD) algorithm in [20] on a test function f(x1, x2) = (x1³ − x2³)/2 − 3x1x2 + (x1² + x2²)²/2. Compared to the landscape of the previous experiment, this function is more deflated from a quadratic field due to the cubic terms. Nevertheless, Algorithm 3 still possesses a notable advantage compared to PSGD, as demonstrated in Figure 2.
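For reference, this second landscape, its gradient, and one simple way to obtain an unbiased stochastic gradient oracle of the form required by Eq. (34) are sketched below (our own illustration; the paper's PSGD experiment may use a different noise model):

```python
import numpy as np

# Test landscape of the Figure 2 experiment; the origin is a saddle point
# (the Hessian there is [[0, -3], [-3, 0]], with negative curvature along (1, 1)).
def f(x):
    x1, x2 = x
    return (x1 ** 3 - x2 ** 3) / 2.0 - 3.0 * x1 * x2 + (x1 ** 2 + x2 ** 2) ** 2 / 2.0

def grad_f(x):
    x1, x2 = x
    r2 = x1 ** 2 + x2 ** 2
    return np.array([1.5 * x1 ** 2 - 3.0 * x2 + 2.0 * x1 * r2,
                     -1.5 * x2 ** 2 - 3.0 * x1 + 2.0 * x2 * r2])

def stoch_grad(x, theta):
    # Unbiased oracle satisfying Eq. (34): exact gradient plus a pre-sampled
    # zero-mean Gaussian vector theta (an assumed, illustrative noise model).
    return grad_f(x) + theta
```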

Figure 2: Run Algorithm 3 and PSGD on the landscape f(x1, x2) = (x1³ − x2³)/2 − 3x1x2 + (1/2)(x1² + x2²)². Parameters: η = 0.02 (step length), r = 0.01 (variance in PSGD and rs in Algorithm 3), M = 300 (number of samplings). Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 3 at time steps tSNCGD = 15 and tSNCGD = 30, and red points represent samplings of PSGD at time steps tPSGD = 30 and tPSGD = 60. Algorithm 3 transforms an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PSGD even when tSNCGD ≪ tPSGD. Right: A histogram of descent values obtained by Algorithm 3 and PSGD, respectively. Set tSNCGD = 30 and tPSGD = 60. Although we run two times the iterations in PSGD, there are still over 50% of SGD paths with function value decrease no greater than 0.6, while this ratio for Algorithm 3 is less than 10%.

Acknowledgements

We thank Jiaqi Leng for valuable suggestions and generous help on the design of numerical experiments. TL was supported by the NSF grant PHY-1818914 and a Samsung Advanced Institute of Technology Global Research Partnership.

References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma, Finding approximate local minima faster than gradient descent, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199, 2017, arXiv:1611.01146.
[2] Zeyuan Allen-Zhu, Natasha 2: Faster non-convex optimization than SGD, Advances in Neural Information Processing Systems, pp. 2675–2686, 2018, arXiv:1708.08694.
[3] Zeyuan Allen-Zhu and Yuanzhi Li, Neon2: Finding local minima via first-order oracles, Advances in Neural Information Processing Systems, pp. 3716–3726, 2018, arXiv:1711.06673.
[4] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro, Global optimality of local search for low rank matrix recovery, Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3880–3888, 2016, arXiv:1605.07221.
[5] Alan J. Bray and David S. Dean, Statistics of critical points of Gaussian fields on large-dimensional spaces, Physical Review Letters 98 (2007), no. 15, 150201, arXiv:cond-mat/0611023.
[6] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford, Accelerated methods for nonconvex optimization, SIAM Journal on Optimization 28 (2018), no. 2, 1751–1772, arXiv:1611.00756.
[7] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Advances in Neural Information Processing Systems, pp. 2933–2941, 2014, arXiv:1406.2572.
[8] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator, Advances in Neural Information Processing Systems, pp. 689–699, 2018, arXiv:1807.01695.
[9] Cong Fang, Zhouchen Lin, and Tong Zhang, Sharp analysis for nonconvex SGD escaping from saddle points, Conference on Learning Theory, pp. 1192–1234, 2019, arXiv:1902.00247.
[10] Yan V. Fyodorov and Ian Williams, Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity, Journal of Statistical Physics 129 (2007), no. 5-6, 1081–1116, arXiv:cond-mat/0702601.
[11] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan, Escaping from saddle points – online stochastic gradient for tensor decomposition, Proceedings of the 28th Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 40, pp. 797–842, 2015, arXiv:1503.02101.
[12] Rong Ge, Chi Jin, and Yi Zheng, No spurious local minima in nonconvex low rank problems: A unified geometric analysis, International Conference on Machine Learning, pp. 1233–1242, PMLR, 2017, arXiv:1704.00708.
[13] Rong Ge, Jason D. Lee, and Tengyu Ma, Matrix completion has no spurious local minimum, Advances in Neural Information Processing Systems, pp. 2981–2989, 2016, arXiv:1605.07272.
[14] Rong Ge, Jason D. Lee, and Tengyu Ma, Learning one-hidden-layer neural networks with landscape design, International Conference on Learning Representations, 2018, arXiv:1711.00501.
[15] Rong Ge, Zhize Li, Weiyao Wang, and Xiang Wang, Stabilized SVRG: Simple variance reduction for nonconvex optimization, Conference on Learning Theory, pp. 1394–1448, PMLR, 2019, arXiv:1905.00529.
[16] Rong Ge and Tengyu Ma, On the optimization landscape of tensor decompositions, Advances in Neural Information Processing Systems, pp. 3656–3666, Curran Associates Inc., 2017, arXiv:1706.05598.
[17] Moritz Hardt, Tengyu Ma, and Benjamin Recht, Gradient descent learns linear dynamical systems, Journal of Machine Learning Research 19 (2018), no. 29, 1–44, arXiv:1609.05191.
[18] Prateek Jain, Chi Jin, Sham Kakade, and Praneeth Netrapalli, Global convergence of non-convex gradient descent for computing matrix squareroot, Artificial Intelligence and Statistics, pp. 479–488, 2017, arXiv:1507.05854.
[19] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan, How to escape saddle points efficiently, Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1724–1732, 2017, arXiv:1703.00887.
[20] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan, On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points, Journal of the ACM (JACM) 68 (2021), no. 2, 1–29, arXiv:1902.04811.
[21] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, Conference on Learning Theory, pp. 1042–1085, 2018, arXiv:1711.10456.
[22] Michael I. Jordan, On gradient-based optimization: Accelerated, distributed, asynchronous and stochastic optimization, https://www.youtube.com/watch?v=VE2ITg%5FhGnI, 2017.
[23] Zhize Li, SSRGD: Simple stochastic recursive gradient descent for escaping saddle points, Advances in Neural Information Processing Systems 32 (2019), 1523–1533, arXiv:1904.09265.
[24] Mingrui Liu, Zhe Li, Xiaoyu Wang, Jinfeng Yi, and Tianbao Yang, Adaptive negative curvature descent with applications in non-convex optimization, Advances in Neural Information Processing Systems 31 (2018), 4853–4862.
[25] Yurii Nesterov and Boris T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming 108 (2006), no. 1, 177–205.
[26] Yurii E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k²), Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
[27] Ju Sun, Qing Qu, and John Wright, A geometric analysis of phase retrieval, Foundations of Computational Mathematics 18 (2018), no. 5, 1131–1198, arXiv:1602.06664.
[28] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, Advances in Neural Information Processing Systems, pp. 2899–2908, 2018, arXiv:1711.02838.
[29] Yi Xu, Rong Jin, and Tianbao Yang, NEON+: Accelerated gradient methods for extracting negative curvature for non-convex optimization, 2017, arXiv:1712.01033.
[30] Yi Xu, Rong Jin, and Tianbao Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, Advances in Neural Information Processing Systems, pp. 5530–5540, 2018, arXiv:1711.01944.
[31] Yaodong Yu, Pan Xu, and Quanquan Gu, Third-order smoothness helps: Faster stochastic optimization algorithms for finding local minima, Advances in Neural Information Processing Systems (2018), 4530–4540.
[32] Xiao Zhang, Lingxiao Wang, Yaodong Yu, and Quanquan Gu, A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery, International Conference on Machine Learning, pp. 5862–5871, PMLR, 2018.
[33] Dongruo Zhou and Quanquan Gu, Stochastic recursive variance-reduced cubic regularization methods, International Conference on Artificial Intelligence and Statistics, pp. 3980–3990, 2020, arXiv:1901.11518.
[34] Dongruo Zhou, Pan Xu, and Quanquan Gu, Finding local minima via stochastic nested variance reduction, 2018, arXiv:1806.08782.
[35] Dongruo Zhou, Pan Xu, and Quanquan Gu, Stochastic variance-reduced cubic regularized Newton methods, International Conference on Machine Learning, pp. 5990–5999, PMLR, 2018, arXiv:1802.04796.
[36] Dongruo Zhou, Pan Xu, and Quanquan Gu, Stochastic nested variance reduction for nonconvex optimization, Journal of Machine Learning Research 21 (2020), no. 103, 1–63, arXiv:1806.07811.



A Auxiliary lemmas

In this section, we introduce auxiliary lemmas that are necessary for our proofs.

Lemma 10 ([20, Lemma 19]). If f(·) is ℓ-smooth and ρ-Hessian Lipschitz and η = 1/ℓ, then the gradient descent sequence {xt} obtained by xt+1 := xt − η∇f(xt) satisfies:

f(xt+1) − f(xt) ≤ −η‖∇f(xt)‖²/2,    (40)

for any step t in which Negative Curvature Finding is not called.

Lemma 11 ([20, Lemma 23]). For an ℓ-smooth and ρ-Hessian Lipschitz function f with its stochastic gradient satisfying Assumption 1, there exists an absolute constant c such that, for any fixed t, t0, ι > 0, with probability at least 1 − 4e^{−ι}, the stochastic gradient descent sequence in Algorithm 8 satisfies:

f(xt0+t) − f(xt0) ≤ −(η/8)·Σ_{i=0}^{t−1} ‖∇f(xt0+i)‖² + c·(σ²/ℓ)·(t + ι),    (41)

if during t0 ∼ t0 + t, Stochastic Negative Curvature Finding has not been called.

Lemma 12 ([20, Lemma 29]). Denote α(t) := [Σ_{τ=0}^{t−1} (1 + ηγ)^{2(t−1−τ)}]^{1/2} and β(t) := (1 + ηγ)^t/√(2ηγ). If ηγ ∈ [0, 1], then we have

1. α(t) ≤ β(t) for any t ∈ N;
2. α(t) ≥ β(t)/√3, ∀t ≥ ln 2/(ηγ).

Lemma 13 ([20, Lemma 30]). Under the notation of Lemma 12 and Appendix D.1, letting −γ := λmin(H), for any 0 ≤ t ≤ Ts and ι > 0 we have

Pr( ‖qp(t)‖ ≤ β(t)·η·rs·√ι ) ≥ 1 − 2e^{−ι},    (42)

and

Pr( ‖qp(t)‖ ≥ (β(t)·η·rs/(10√n))·(δ/Ts) ) ≥ 1 − δ/Ts,    (43)

Pr( ‖qp(t)‖ ≥ (β(t)·η·rs/(10√n))·δ ) ≥ 1 − δ.    (44)

Definition 14 ([20, Definition 32]). A random vector X ∈ R^n is norm-subGaussian (or nSG(σ)) if there exists σ such that:

Pr(‖X − E[X]‖ ≥ t) ≤ 2e^{−t²/(2σ²)},  ∀t ∈ R.    (45)

Lemma 15 ([20, Lemma 37]). Given i.i.d. X1, . . . , Xτ ∈ R^n all being zero-mean nSG(σi) as defined in Definition 14, then for any ι > 0 and B > b > 0, there exists an absolute constant c such that, with probability at least 1 − 2n log(B/b)·e^{−ι}:

Σ_{i=1}^{τ} σi² ≥ B   or   ‖Σ_{i=1}^{τ} Xi‖ ≤ c·√( max{Σ_{i=1}^{τ} σi², b}·ι ).    (46)

The next two lemmas are frequently used in the large gradient scenario of the accelerated gradient descent method:

Lemma 16 ([21, Lemma 7]). Consider the setting of Theorem 23 and define a new parameter

T := (√ℓ/(ρε)^{1/4})·cA,    (47)

for some large enough constant cA. If ‖∇f(xτ)‖ ≥ ε for all τ ∈ [0, T], then there exists a large enough positive constant cA0 such that, if we choose cA ≥ cA0, by running Algorithm 2 we have ET − E0 ≤ −E, in which E = √(ε³/ρ)·cA^{−7}, and Eτ is defined as:

Eτ := f(xτ) + (1/(2η))·‖vτ‖².    (48)

Lemma 17 ([21, Lemma 4 and Lemma 5]). Assume that the function f is ℓ-smooth. Consider the setting of Theorem 23; for every iteration that is not within T′ steps after a uniform perturbation, we have:

Eτ+1 ≤ Eτ,    (49)

where Eτ is defined in Eq. (48) in Lemma 16.

B Proof details of negative curvature finding by gradient descents

B.1 Proof of Lemma 18

Lemma 18. Under the setting of Proposition 3, we use αt to denote

αt = ‖yt,‖‖/‖yt‖,    (50)

where yt,‖, defined in Eq. (10), is the component of yt in the subspace S‖ spanned by {u1, u2, . . . , up}. Then, during all the T iterations of Algorithm 1, we have αt ≥ αmin for

αmin = (δ0/4)·√(π/n),    (51)

given that α0 ≥ √(π/n)·δ0.

To prove Lemma 18, we first introduce the following lemma:

Lemma 19. Under the setting of Proposition 3 and Lemma 18, for any t > 0 and initial value y0, αt achieves its minimum possible value in the specific landscapes satisfying

λ1 = λ2 = · · · = λp = λp+1 = λp+2 = · · · = λn−1 = −√(ρε).    (52)

Proof. We prove this lemma by contradiction. Suppose that for some τ > 0 and initial value y0, ατ achieves its minimum value in some landscape g that does not satisfy Eq. (52). That is to say, there exists some k ∈ [n − 1] such that λk ≠ −√(ρε).

We first consider the case k > p. Since λn−1, . . . , λp+1 ≥ −√(ρε), we have λk > −√(ρε). We use yg,t to denote the iteration sequence of yt in this landscape.

Consider another landscape h with λk = −√(ρε) and all other eigenvalues being the same as in g, and we use yh,t to denote the iteration sequence of yt in this landscape. Furthermore, we assume that at each gradient query, the deviation from the pure quadratic approximation defined in Eq. (16), denoted ∆h,t, is the same as the corresponding deviation ∆g,t in the landscape g along all directions other than uk. As for its component ∆h,t,k along uk, we set |∆h,t,k| = |∆g,t,k| with its sign being the same as yt,k.

Under these settings, we have yh,τ,j = yg,τ,j for any j ∈ [n] with j ≠ k. As for the component along uk, we have |yh,τ,k| > |yg,τ,k|. Hence, by the definition of yt,‖ in Eq. (10), we have

‖yg,τ,‖‖ = (Σ_{j=1}^{p} yg,τ,j²)^{1/2} = (Σ_{j=1}^{p} yh,τ,j²)^{1/2} = ‖yh,τ,‖‖,    (53)

whereas

‖yg,τ‖ = (Σ_{j=1}^{n} yg,τ,j²)^{1/2} < (Σ_{j=1}^{n} yh,τ,j²)^{1/2} = ‖yh,τ‖,    (54)

indicating

αg,τ = ‖yg,τ,‖‖/‖yg,τ‖ > ‖yh,τ,‖‖/‖yh,τ‖ = αh,τ,    (55)

contradicting the supposition that ατ achieves its minimum possible value in g. Similarly, we can also obtain a contradiction for the case k ≤ p.



Equipped with Lemma 19, we are now ready to prove Lemma 18.

Proof. In this proof, we consider the worst case, where the initial value α0 = √(π/n)·δ0. Also, according to Lemma 19, we assume that the eigenvalues satisfy

λ1 = λ2 = · · · = λp = λp+1 = λp+2 = · · · = λn−1 = −√(ρε).    (56)

Moreover, we assume that each time we make a gradient call at some point x, the deviation term ∆ from the pure quadratic approximation,

∆ = ∇hf(x) − H(0)x,    (57)

lies in the direction that can make αt as small as possible. Then, the component ∆⊥ in S⊥ should be in the direction of x⊥, and the component ∆‖ in S‖ should be in the opposite direction to x‖, as long as ‖∆‖‖ ≤ ‖x‖‖. Hence in this case, ‖yt,⊥‖/‖yt‖ is non-decreasing. Also, it admits the following recurrence formula:

‖yt+1,⊥‖ = (1 + √(ρε)/ℓ)‖yt,⊥‖ + ‖∆⊥‖/ℓ    (58)
         ≤ (1 + √(ρε)/ℓ)‖yt,⊥‖ + ‖∆‖/ℓ    (59)
         ≤ (1 + √(ρε)/ℓ + ‖∆‖/(ℓ‖yt,⊥‖))·‖yt,⊥‖,    (60)

where the second inequality is due to the fact that ν can be an arbitrarily small positive number. Note that since ‖yt,⊥‖/‖yt‖ is non-decreasing in this worst-case scenario, we have

‖∆‖/‖yt,⊥‖ ≤ (‖∆‖/‖yt‖)·(‖y0‖/‖y0,⊥‖) ≤ 2‖∆‖/‖yt‖ ≤ 2ρr,    (61)

which leads to

‖yt+1,⊥‖ ≤ (1 + √(ρε)/ℓ + 2ρr/ℓ)·‖yt,⊥‖.    (62)

On the other hand, suppose that for some value t we have αk ≥ αmin for all 1 ≤ k ≤ t. Then,

‖yt+1,‖‖ = (1 + √(ρε)/ℓ)‖yt,‖‖ − ‖∆‖‖/ℓ    (63)
         ≥ (1 + √(ρε)/ℓ − ‖∆‖/(ℓ‖yt,‖‖))·‖yt,‖‖.    (64)

Note that since ‖yt,‖‖/‖yt‖ ≥ αmin, we have

‖∆‖/‖yt,‖‖ ≤ ‖∆‖/(αmin‖yt‖) ≤ ρr/αmin,    (65)

which leads to

‖yt+1,‖‖ ≥ (1 + √(ρε)/ℓ − ρr/(αminℓ))·‖yt,‖‖.    (66)

Then we can observe that

‖yt,‖‖/‖yt,⊥‖ ≥ (‖y0,‖‖/‖y0,⊥‖)·( (1 + √(ρε)/ℓ − ρr/(αminℓ))/(1 + √(ρε)/ℓ + 2ρr/ℓ) )^t,    (67)

where

(1 + √(ρε)/ℓ − ρr/(αminℓ))/(1 + √(ρε)/ℓ + 2ρr/ℓ) ≥ (1 + √(ρε)/ℓ − ρr/(αminℓ))·(1 − √(ρε)/ℓ − 2ρr/ℓ)    (68)
                                                  ≥ 1 − ρε/ℓ² − 2ρr/(αminℓ) ≥ 1 − 1/T,    (69)

by which we can derive that

‖yt,‖‖/‖yt,⊥‖ ≥ (‖y0,‖‖/‖y0,⊥‖)·(1 − 1/T)^t    (70)
             ≥ (‖y0,‖‖/‖y0,⊥‖)·exp(−t/(T − 1)) ≥ ‖y0,‖‖/(2‖y0,⊥‖),    (71)

indicating

αt = ‖yt,‖‖/√(‖yt,‖‖² + ‖yt,⊥‖²) ≥ ‖y0,‖‖/(4‖y0,⊥‖) ≥ αmin.    (72)

Hence, as long as αk ≥ αmin for all 1 ≤ k ≤ t − 1, we also have αt ≥ αmin if t ≤ T. Since α0 ≥ αmin, we can thus claim that αt ≥ αmin for any t ≤ T by recurrence.



B.2 Proof of Proposition 5

To make it easier to analyze the properties and running time of Algorithm 2, we introduce a new Algorithm 4, which has a more straightforward structure and has the same effect as Algorithm 2 near any saddle point x.

Algorithm 4: Accelerated Negative Curvature Finding without Renormalization(x, r′, T′).
1   x0 ← Uniform(B0(r′)), where B0(r′) is the ℓ2-norm ball centered at x with radius r′;
2   z0 ← x0;
3   for t = 1, ..., T′ do
4       xt+1 ← zt − η·( (‖zt − x‖/r′)·∇f(r′·(zt − x)/‖zt − x‖ + x) − ∇f(x) );
5       vt+1 ← xt+1 − xt;
6       zt+1 ← xt+1 + (1 − θ)·vt+1;
7   Output (xT′ − x)/‖xT′ − x‖.

Quantitatively, this is demonstrated in the following lemma:

Lemma 20. Under the setting of Proposition 5, suppose the perturbation in Line 5 of Algorithm 2 is added at t = 0. Then with the same values of r′, T′, x and x0, the output of Algorithm 4 is the same as the unit vector e in Algorithm 2 obtained T′ steps later. In other words, if we separately denote the iteration sequences of xt in Algorithm 2 and Algorithm 4 as

x1,0, x1,1, . . . , x1,T′,    x2,0, x2,1, . . . , x2,T′,    (73)

we have

(x1,T′ − x)/‖x1,T′ − x‖ = (x2,T′ − x)/‖x2,T′ − x‖.    (74)

Proof. Without loss of generality, we assume x = 0 and ∇f(x) = 0. We use recurrence to prove the desired relationship. Suppose the following identities hold for all k ≤ t with t being some natural number:

x2,k/‖x2,k‖ = x1,k/r′,    z2,k/‖x2,k‖ = z1,k/r′.    (75)

Then,

x2,t+1 = z2,t − η·(‖z2,t‖/r′)·∇f(z2,t·r′/‖z2,t‖) = (‖z2,t‖/r′)·(z1,t − η∇f(z1,t)).    (76)

Adopting the notation in Algorithm 2, we use x′1,t+1 and z′1,t+1 to separately denote the values of x1,t+1 and z1,t+1 before renormalization:

x′1,t+1 = z1,t − η∇f(z1,t),    z′1,t+1 = x′1,t+1 + (1 − θ)(x′1,t+1 − x1,t).    (77)

Then,

x2,t+1 = (‖z2,t‖/r′)·(z1,t − η∇f(z1,t)) = (‖z2,t‖/r′)·x′1,t+1,    (78)

which further leads to

z2,t+1 = x2,t+1 + (1 − θ)(x2,t+1 − x2,t) = (‖z2,t‖/r′)·z′1,t+1.    (79)

Note that z1,t+1 = (r′/‖z′1,t+1‖)·z′1,t+1; we thus have

z2,t+1/‖z2,t+1‖ = z1,t+1/r′.    (80)

Hence,

x2,t+1 = (‖z2,t‖/r′)·x′1,t+1 = (‖z2,t‖/r′)·(‖z1,t+1‖/‖z1,t‖)·x1,t+1 = (‖z2,t+1‖/r′)·x1,t+1.    (81)

Since (75) holds for k = 0, we can now claim that it also holds for k = T′.



Lemma 20 shows that Algorithm 4 also works in principle for finding the negative curvature near any saddle point x. However, Algorithm 4 may produce an exponentially large ‖xt‖ during execution, and it is hard to merge with the AGD algorithm for large-gradient scenarios. Hence, only Algorithm 2 is applicable in practical situations.

Use H(x) to denote the Hessian matrix of f at x. Observe that H(x) admits the following eigen-decomposition:

H(x) = Σ_{i=1}^{n} λi·ui·uiᵀ,    (82)

where the set {ui}_{i=1}^{n} forms an orthonormal basis of R^n. Without loss of generality, we assume the eigenvalues λ1, λ2, . . . , λn corresponding to u1, u2, . . . , un satisfy

λ1 ≤ λ2 ≤ · · · ≤ λn,    (83)

in which λ1 ≤ −√(ρε). If λn ≤ −√(ρε)/2, Proposition 5 holds directly, since no matter the value of e, we have f(xT′) − f(x) ≤ −√(ε³/ρ)/384. Hence, we only need to prove the case where λn > −√(ρε)/2, in which there exists some p ≥ 1 with

λp ≤ −√(ρε) ≤ λp+1.    (84)

We use S‖ to denote the subspace of R^n spanned by {u1, u2, . . . , up}, and use S⊥ to denote the subspace spanned by {up+1, up+2, . . . , un}. Then we have the following lemma:

Lemma 21. Under the setting of Proposition 5, we use α′t to denote

α′t = ‖xt,‖ − x‖‖/‖xt − x‖,    (85)

in which xt,‖ is the component of xt in the subspace S‖. Then, during all the T′ iterations of Algorithm 4, we have α′t ≥ α′min for

α′min = (δ0/8)·√(π/n),    (86)

given that α′0 ≥ √(π/n)·δ0.

Proof. Without loss of generality, we assume x = 0 and ∇f(x) = 0. In this proof, we consider the worst case, where the initial value α′0 = √(π/n)·δ0 and the component x0,n along un equals 0. In addition, by the same argument as in the proof of Lemma 19, we assume that the eigenvalues satisfy

λ1 = λ2 = λ3 = · · · = λp = λp+1 = λp+2 = · · · = λn−1 = −√(ρε).    (87)

For the same reason, we assume that each time we make a gradient call at point zt, the deviation term ∆ from the pure quadratic approximation,

∆ = (‖zt‖/r′)·( ∇f(zt·r′/‖zt‖) − H(0)·(r′/‖zt‖)·zt ),    (88)

lies in the direction that can make α′t as small as possible. Then, the component ∆‖ in S‖ should be in the opposite direction to v‖, and the component ∆⊥ in S⊥ should be in the direction of v⊥. Hence in this case, both ‖xt,⊥‖/‖xt‖ and ‖zt,⊥‖/‖zt‖ are non-decreasing. Also, the following recurrence formula holds:

‖xt+2,⊥‖ ≤ (1 + η√(ρε))·( ‖xt+1,⊥‖ + (1 − θ)(‖xt+1,⊥‖ − ‖xt,⊥‖) ) + η‖∆⊥‖.    (89)

Since ‖xt,⊥‖/‖xt‖ is non-decreasing in this worst-case scenario, we have

‖∆⊥‖/‖xt+1,⊥‖ ≤ (‖∆‖/‖xt+1‖)·(‖x0‖/‖x0,⊥‖) ≤ 2‖∆‖/‖xt+1‖ ≤ 2ρr′,    (90)

which leads to

‖xt+2,⊥‖ ≤ (1 + η√(ρε) + 2ηρr′)·( (2 − θ)‖xt+1,⊥‖ − (1 − θ)‖xt,⊥‖ ).    (91)

18

Page 19: based algorithm

On the other hand, suppose for some value t, we have α′k ≥ α′min with any 1 ≤ k ≤ t+ 1. Then,

‖xt+2,‖‖ ≥ (1 + η(√ρε− ν))

(‖xt+1,‖‖+ (1− θ)(‖xt+1,‖‖ − ‖xt,‖‖)

)+ η‖∆‖‖ (92)

≥ (1 + η√ρε)(‖xt+1,‖‖+ (1− θ)(‖xt+1,‖‖ − ‖xt,‖‖)

)− η‖∆‖. (93)

Note that since ‖xt+1,‖‖/‖xt‖ ≥ α′min for all t > 0, we also have

‖xt+1,‖‖‖xt‖

≥ α′min, ∀t ≥ 0, (94)

which further leads to

‖∆‖‖zt+1,‖‖

≥ ‖∆‖α′min‖zt+1‖

= ρr′/α′min, (95)

which leads to

‖xt+2,‖‖ ≥ (1 + η√ρε− ηρr′/α′min)

((2− θ)‖xt+1,‖‖ − (1− θ)‖xt,‖‖

). (96)

Consider the sequences with recurrence that can be written as

ξt+2 = (1 + κ)((2− θ)ξt+1 − (1− θ)ξt

)(97)

for some κ > 0. Its characteristic equation can be written as

x2 − (1 + κ)(2− θ)x+ (1 + κ)(1− θ) = 0, (98)

whose roots satisfy

x =1 + κ

2

((2− θ)±

√(2− θ)2 − 4(1− θ)

1 + κ

), (99)

indicating

ξt =(1 + κ

2

)t(C1(2− θ + µ)t + C2(2− θ − µ)t

), (100)

where µ :=√

(2− θ)2 − 4(1−θ)1+κ , for constants C1 and C2 being

C1 = −2− θ − µ2µ

ξ0 +1

(1 + κ)µξ1,

C2 =2− θ + µ

2µξ0 −

1

(1 + κ)µξ1.

(101)

Then by the inequalities (91) and (96), as long as α′k ≥ α′min for any 1 ≤ k ≤ t − 1, the values‖xt,⊥‖ and ‖xt,‖‖ satisfy

‖xt,⊥‖ ≤(− 2− θ − µ⊥

2µ⊥ξ0,⊥ +

1

(1 + κ⊥)µ⊥ξ1,⊥

)·(1 + κ⊥

2

)t· (2− θ + µ⊥)t (102)

+(2− θ + µ⊥

2µ⊥ξ0,⊥ −

1

(1 + κ⊥)µ⊥ξ1,⊥

)·(1 + κ⊥

2

)t· (2− θ − µ⊥)t, (103)

and

‖xt,‖‖ ≥(−

2− θ − µ‖2µ‖

ξ0,‖ +1

(1 + κ‖)µ‖ξ1,‖

)·(1 + κ‖

2

)t· (2− θ + µ‖)

t (104)

+(2− θ + µ‖

2µ‖ξ0,‖ −

1

(1 + κ‖)µ‖ξ1,‖

)·(1 + κ‖

2

)t· (2− θ − µ‖)t, (105)

where

κ⊥ = η√ρε+ 2ηρr′, ξ0,⊥ = ‖x0,⊥‖, ξ1,⊥ = (1 + κ⊥)ξ0,⊥, (106)

κ‖ = η√ρε− ηρr′/α′min, ξ0,‖ = ‖x0,‖‖, ξ1,‖ = (1 + κ‖)ξ0,‖. (107)

19

Page 20: based algorithm

Furthermore, we can derive that

‖xt,⊥‖ ≤(− 2− θ − µ⊥

2µ⊥ξ0,⊥ +

1

(1 + κ⊥)µ⊥ξ1,⊥

)·(1 + κ⊥

2

)t· (2− θ + µ⊥)t (108)

+(2− θ + µ⊥

2µ⊥ξ0,⊥ −

1

(1 + κ⊥)µ⊥ξ1,⊥

)·(1 + κ⊥

2

)t· (2− θ + µ⊥)t (109)

≤ ξ0,⊥ ·(1 + κ⊥

2

)t· (2− θ + µ⊥)t = ‖x0,⊥‖ ·

(1 + κ⊥2

)t· (2− θ + µ⊥)t, (110)

and

‖xt,‖‖ ≥(−

2− θ − µ‖2µ‖

ξ0,‖ +1

(1 + κ‖)µ‖ξ1,‖

)·(1 + κ‖

2

)t· (2− θ + µ‖)

t (111)

+(2− θ + µ‖

2µ‖ξ0,‖ −

1

(1 + κ‖)µ‖ξ1,‖

)·(1 + κ‖

2

)t· (2− θ − µ‖)t (112)

≥(−

2− θ − µ‖2µ‖

ξ0,‖ +1

(1 + κ‖)µ‖ξ1,‖

)·(1 + κ‖

2

)t· (2− θ + µ‖)

t (113)

=µ‖ + θ

2µ‖ξ0,‖ ·

(1 + κ‖

2

)t· (2− θ + µ‖)

t (114)

≥‖x0,‖‖

2·(1 + κ‖

2

)t· (2− θ + µ‖)

t. (115)

Then we can observe that

‖xt,‖‖‖xt,⊥‖

≥‖x0,‖‖

2‖x0,⊥‖·( 1 + κ‖

1 + κ⊥

)t·( 2− θ + µ‖

2− θ + µ⊥

)t, (116)

where

1 + κ‖

1 + κ⊥≥ (1 + κ‖)(1− κ⊥) (117)

≥ 1− (2 + 1/α′min)ηρr′ − κ‖κ⊥ (118)

≥ 1− 2ηρr′/α′min, (119)

and

2− θ + µ‖

2− θ + µ⊥=

1 + µ‖/(2− θ)1 + µ⊥/(2− θ)

(120)

=1 +

√1− 4(1−θ)

(1+κ‖)(2−θ)2

1 +√

1− 4(1−θ)(1+κ⊥)(2−θ)2

(121)

≥(

1 +1

2− θ

√θ2 + κ‖(2− θ)2

1 + κ‖

)(1− 1

2− θ

√θ2 + κ⊥(2− θ)2

1 + κ⊥

)(122)

≥ 1−2(κ⊥ − κ‖)

θ≥ 1− 3ηρr′

α′minθ, (123)

by which we can derive that

‖xt,‖‖‖xt,⊥‖

≥‖x0,‖‖

2‖x0,⊥‖·(

1− 4ρr′

α′minθ

)t(124)

≥‖x0,‖‖

2‖x0,⊥‖(1− 1/T ′)t (125)

≥‖x0,‖‖

2‖x0,⊥‖exp

(− t

T ′ − 1

)≥‖x0,‖‖

4‖x0,⊥‖, (126)

20

Page 21: based algorithm

indicating

α′t =‖xt,‖‖√

‖xt,‖‖2 + ‖xt,⊥‖2≥‖x0,‖‖

8‖x0,⊥‖≥ α′min. (127)

Hence, as long as α′k ≥ α′min for any 1 ≤ k ≤ t − 1, we can also have α′t ≥ α′min if t ≤ T ′.Since we have α′0 ≥ α′min and α′1 ≥ α′min, we can claim that α′t ≥ α′min for any t ≤ T ′ usingrecurrence.

Equipped with Lemma 21, we are now ready to prove Proposition 5.

Proof. By Lemma 20, the unit vector e in Line 7 of Algorithm 2 obtained after T ′ iterations equalsto the output of Algorithm 4 starting from x. Hence in this proof we consider the output of Algo-rithm 4 instead of the original Algorithm 2.

If λn ≤ −√ρε/2, Proposition 5 holds directly. Hence, we only need to prove the case where

λn > −√ρε/2, in which there exists some p′ with

λ′p ≤ −√ρε/2 < λp+1. (128)

We use S′‖, S′⊥ to denote the subspace of Rn spanned by u1,u2, . . . ,up′,up′+1,up+2, . . . ,un. Furthermore, we define xt,‖′ :=

∑p′

i=1 〈ui,xt〉ui, xt,⊥′ :=∑ni=p′ 〈ui,xt〉ui, vt,‖′ :=

∑p′

i=1 〈ui,vt〉ui, vt,⊥′ :=∑ni=p′ 〈ui,vt〉ui respectively to denote

the component of x′t and v′t in Algorithm 4 in the subspaces S′‖, S′⊥, and let α′t := ‖xt,‖‖/‖xt‖.

Consider the case where α′0 ≥√

πnδ0, which can be achieved with probability

Pr

α′0 ≥

√π

nδ0

≥ 1−

√π

nδ0 ·

Vol(Bn−10 (1))

Vol(Bn0 (1))≥ 1−

√π

nδ0 ·

√n

π= 1− δ0, (129)

we prove that there exists some t0 with 1 ≤ t0 ≤ T ′ such that

‖xt0,⊥′‖‖xt0‖

≤√ρε

8`. (130)

Assume the contrary, for any t with 1 ≤ t ≤ T ′, we all have ‖xt,⊥′‖‖xt‖ >√ρε

8` and ‖zt,⊥′‖‖zt‖ >√ρε

8` .Focus on the case where ‖xt,⊥′‖, the component of xt in subspace S′⊥, achieves the largest valuepossible. Then in this case, we have the following recurrence formula:

‖xt+2,⊥′‖ ≤ (1 + η√ρε/2)

(‖xt+1,⊥′‖+ (1− θ)(‖xt+1,⊥′‖ − ‖xt,⊥′‖)

)+ η‖∆⊥′‖. (131)

Since ‖zk,⊥′‖‖zk‖ ≥√ρε

8` for any 1 ≤ k ≤ t+ 1, we can derive that

‖∆⊥‖‖xt+1,⊥‖+ (1− θ)(‖xt+1,⊥‖ − ‖xt,⊥‖)

≤ ‖∆‖‖zt,⊥′‖

≤ 2ρr′√ρε, (132)

which leads to

‖xt+2,⊥′‖ ≤ (1 + η√ρε/2)

(‖xt+1,⊥′‖+ (1− θ)(‖xt+1,⊥′‖ − ‖xt,⊥′‖)

)+ η‖∆⊥′‖ (133)

≤ (1 + η√ρε/2 + 2ρr′/

√ρε)((2− θ)‖xt+1,⊥′‖ − (1− θ)‖xt,⊥′‖

). (134)

Using similar characteristic function techniques shown in the proof of Lemma 21, it can be furtherderived that

‖xt,⊥′‖ ≤ ‖x0,⊥′‖ ·(1 + κ⊥′

2

)t· (2− θ + µ⊥′)

t, (135)

for κ⊥′ = η√ρε/2 + 2ρr′/

√ρε and µ⊥′ =

√(2− θ)2 − 4(1−θ)

1+κ⊥′, given ‖xk,⊥′‖‖xk‖ ≥

√ρε

8` and‖zk,⊥′‖‖zk‖ ≥

√ρε

8` for any 1 ≤ k ≤ t− 1. Due to Lemma 21,

α′t ≥ α′min =δ08

√π

n, ∀1 ≤ t ≤ T ′. (136)

21

Page 22: based algorithm

and it is demonstrated in the proof of Lemma 21 that,

‖xt,‖‖ ≥‖x0,‖‖

2·(1 + κ‖

2

)t· (2− θ + µ‖)

t, ∀1 ≤ t ≤ T ′, (137)

for κ‖ = η√ρε− ηρr′/α′min and µ‖ =

√(2− θ)2 − 4(1−θ)

1+κ‖. Observe that

‖xT ′,⊥′‖‖xT ′,‖‖

≤ 2‖x0,⊥′‖‖x0,‖‖

·(1 + κ⊥′

1 + κ‖

)T ′

·(2− θ + µ⊥′

2− θ + µ‖

)T ′

(138)

≤ 2

δ0

√n

π

(1 + κ⊥′

1 + κ‖

)T ′

·(2− θ + µ⊥′

2− θ + µ‖

)T ′

, (139)

where1 + κ⊥′

1 + κ‖≤ 1

1 + (κ‖ − κ⊥′)= 1− 1

η√ρε/2 + ρr′(η/αmin′ + 2/

√ρε)≤ 1−

η√ρε

4, (140)

and

2− θ + µ⊥′

2− θ + µ‖=

1 +√

1− 4(1−θ)(1+κ⊥′ )(2−θ)2

1 +√

1− 4(1−θ)(1+κ‖)(2−θ)2

(141)

≤ 1

1 +(√

1− 4(1−θ)(1+κ⊥′ )(2−θ)2

−√

1− 4(1−θ)(1+κ‖)(2−θ)2

) (142)

≤ 1−κ‖ − κ⊥′

θ(143)

≤ 1−η√ρε

4θ= 1− (ρε)1/4

16√`. (144)

Hence,

‖xT ′,⊥′‖‖xT ′,‖‖

≤ 2

δ0

√n

π

(1− (ρε)1/4

16√`

)T ′

≤√ρε

8`. (145)

Since ‖xT ′,‖‖ ≤ ‖xT ′‖, we have ‖xT ′,⊥′‖‖xT ′‖

≤√ρε

8` , contradiction. Hence, there here exists some

t0 with 1 ≤ t0 ≤ T ′ such that‖xt0,⊥′‖‖xt0‖

≤√ρε

8` . Consider the normalized vector e = xt0/r, weuse e⊥′ and e‖′ to separately denote the component of e in S′⊥ and S′‖. Then, ‖e⊥′‖ ≤

√ρε/(8`)

whereas ‖e‖′‖ ≥ 1− ρε/(8`)2. Then,

eTH(0)e = (e⊥′ + e‖′)TH(0)(e⊥′ + e‖′), (146)

sinceH(0)e⊥′ ∈ S′⊥ andH(0)e‖′ ∈ S′‖, it can be further simplified to

eTH(0)e = eT⊥′H(0)e⊥′ + eT‖′H(0)e‖′ , (147)

Due to the `-smoothness of the function, all eigenvalue of the Hessian matrix has its absolute valueupper bounded by `. Hence,

eT⊥′H(0)e⊥′ ≤ `‖eT⊥′‖22 =ρε

64`2. (148)

Further according to the definition of S‖, we have

eT‖′H(0)e‖′ ≤ −√ρε

2‖e‖′‖2. (149)

Combining these two inequalities together, we can obtain

eTH(0)e = eT⊥H(0)e⊥′ + eT‖′H(0)e‖′ ≤ −√ρε

2‖e‖′‖2 +

ρε

64`2≤ −√ρε

4. (150)

22

Page 23: based algorithm

C Proof details of escaping from saddle points by negative curvature finding

C.1 Algorithms for escaping from saddle points using negative curvature finding

In this subsection, we first present algorithm for escaping from saddle points using Algorithm 1 asAlgorithm 5.

Algorithm 5: Perturbed Gradient Descent with Negative Curvature Finding1 Input: x0 ∈ Rn;2 for t = 0, 1, ..., T do3 if ‖∇f(xt)‖ ≤ ε then4 e←NegativeCurvatureFinding(xt, r,T ) ;

5 xt ← xt − f ′e(x0)4|f ′e(x0)|

√ερ · e;

6 xt+1 ← xt − 1`∇f(xt);

Observe that Algorithm 5 and Algorithm 2 are similar to perturbed gradient descent and perturbedaccelerated gradient descent but the uniform perturbation step is replaced by our negative curvaturefinding algorithms. One may wonder that Algorithm 5 seems to involve nested loops since negativecurvature finding algorithm are contained in the primary loop, contradicting our previous claim thatAlgorithm 5 only contains a single loop. But actually, Algorithm 5 contains only two operations:gradient descents and one perturbation step, the same as operations outside the negative curvaturefinding algorithms. Hence, Algorithm 5 is essentially single-loop algorithm, and we count theiriteration number as the total number of gradient calls.

C.2 Proof details of escaping saddle points using Algorithm 1

In this subsection, we prove:

Theorem 22. For any ε > 0 and 0 < δ ≤ 1, Algorithm 5 with parameters chosen in Proposition 3satisfies that at least 1/4 of its iterations will be ε-approximate second-order stationary point, using

O( (f(x0)− f∗)

ε2· log n

)iterations, with probability at least 1− δ, where f∗ is the global minimum of f .

Proof. Let the parameters be chosen according to (2), and set the total step number T to be:

T = max

8`(f(x0)− f∗)

ε2, 768(f(x0)− f∗) ·

√ρ

ε3

, (151)

similar to the perturbed gradient descent algorithm [20, Algorithm 4]. We first assume that for eachxt we apply negative curvature finding (Algorithm 1) with δ0 contained in the parameters be chosenas

δ0 =1

384(f(x0 − f∗)

√ε3

ρδ, (152)

we can successfully obtain a unit vector e with eTHe ≤ −√ρε/4, as long as λmin(H(xt)) ≤−√ρε. The error probability of this assumption is provided later.

Under this assumption, Algorithm 1 can be called for at most 384(f(x0)− f∗)√

ρε3 ≤

T2 times, for

otherwise the function value decrease will be greater than f(x0)− f∗, which is not possible. Then,the error probability that some calls to Algorithm 1 fails is upper bounded by

384(f(x0)− f∗)√ρ

ε3· δ0 = δ. (153)

23

Page 24: based algorithm

For the rest of iterations in which Algorithm 1 is not called, they are either large gradient steps,‖∇f(xt)‖ ≥ ε, or ε-approximate second-order stationary points. Within them, we know that thenumber of large gradient steps cannot be more than T/4 because otherwise, by Lemma 10 in Ap-pendix A:

f(xT ) ≤ f(x0)− Tηε2/8 < f∗,

a contradiction. Therefore, we conclude that at least T/4 of the iterations must be ε-approximatesecond-order stationary points, with probability at least 1− δ.

The number of iterations can be viewed as the sum of two parts, the number of iterations needed forgradient descent, denoted by T1, and the number of iterations needed for negative curvature finding,denoted by T2. with probability at least 1− δ,

T1 = T = O( (f(x0)− f∗)

ε2

). (154)

As for T2, with probability at least 1 − δ, Algorithm 1 is called for at most 384(f(x0) − f∗)√

ρε3

times, and by Proposition 3 it takes O(

logn√ρε

)iterations each time. Hence,

T2 = 384(f(x0)− f∗)√ρ

ε3· O( log n√ρε

)= O

( (f(x0)− f∗)ε2

· log n). (155)

As a result, the total iteration number T1 + T2 is

O( (f(x0)− f∗)

ε2· log n

). (156)

C.3 Proof details of escaping saddle points using Algorithm 2

We first present here the Negative Curvature Exploitation algorithm proposed in proposed in [21,Algorithm 3] appearing in Line 16 of Algorithm 2:

Algorithm 6: Negative Curvature Exploitation(xt,vt, s)1 if ‖vt‖ ≥ s then2 xt+1 ← xt;3 else4 ξ = s · vt/‖v‖t;5 xt ← argminx∈xt+ξ,xt−ξf(x);

6 Output (zt+1,0).

Now, we give the full version of Theorem 7 as follows:Theorem 23. Suppose that the function f is `-smooth and ρ-Hessian Lipschitz. For any ε > 0 anda constant 0 < δ ≤ 1, we choose the parameters appearing in Algorithm 2 as follows:

δ0 =δ

384∆f

√ε3

ρ, T ′ =

32√`

(ρε)1/4log( `δ0

√n

ρε

), ζ =

`√ρε, (157)

r′ =δ0ε

32

√π

ρn, η =

1

4`, θ =

1

4√ζ, (158)

E =

√ε3

ρ· c−7A , γ =

θ2

η, s =

γ

4ρ, (159)

where ∆f := f(x0) − f∗ and f∗ is the global minimum of f , and the constant cA is chosen largeenough to satisfy both the condition in Lemma 16 and cA ≥ (384)1/7. Then, Algorithm 2 satisfiesthat at least one of the iterations zt will be an ε-approximate second-order stationary point in

O( (f(x0)− f∗)

ε1.75· log n

)(160)

iterations, with probability at least 1− δ.

24

Page 25: based algorithm

Proof. Set the total step number T to be:

T = max

4∆f (T + T ′)

E, 768∆fT

′√ρ

ε3

= O

( (f(x0)− f∗)ε1.75

· log n), (161)

where T =√ζ · cA as defined in Lemma 16, similar to the perturbed accelerated gradient descent

algorithm [21, Algorithm 2]. We first assert that for each iteration xt that a uniform perturbationis added, after T ′ iterations we can successfully obtain a unit vector e with eTHe ≤ −√ρε/4, aslong as λmin(H(xt)) ≤ −

√ρε. The error probability of this assumption is provided later.

Under this assumption, the uniform perturbation can be called for at most 384(f(x0) − f∗)√

ρε3

times, for otherwise the function value decrease will be greater than f(x0) − f∗, which is notpossible. Then, the probability that at least one negative curvature finding subroutine after uniformperturbation fails is upper bounded by

384(f(x0)− f∗)√ρ

ε3· δ0 = δ. (162)

For the rest of steps which is not within T ′ steps after uniform perturbation, they are either large gra-dient steps, ‖∇f(xt)‖ ≥ ε, or ε-approximate second-order stationary points. Next, we demonstratethat at least one of these steps is an ε-approximate stationary point.

Assume the contrary. We use NT to denote the number of disjoint time periods with length largerthan T containing only large gradient steps and do not contain any step within T ′ steps afteruniform perturbation. Then, it satisfies

NT ≥T

2(T + T ′)− 384∆f

√ρ

ε3≥ (2c7A − 384)∆f

√ρ

ε3≥ ∆f

E. (163)

From Lemma 16, during these time intervals the Hamiltonian will decrease in total at leastNT ·E =∆f , which is impossible due to Lemma 17, the Hamiltonian decreases monotonically for everystep except for the T ′ steps after uniform perturbation, and the overall decrease cannot be greaterthan ∆f , a contradiction. Therefore, we conclude that at least one of the iterations must be anε-approximate second-order stationary point, with probability at least 1− δ.

D Proofs of the stochastic setting

D.1 Proof details of negative curvature finding using stochastic gradients

In this subsection, we demonstrate that Algorithm 3 can find a negative curvature efficiently. Specif-ically, we prove the following proposition:Proposition 24. Suppose the function f : Rn → R is `-smooth and ρ-Hessian Lipschitz. For any0 < δ < 1, we specify our choice of parameters and constants we use as follows:

Ts =8`√ρε· log

( `n

δ√ρε

), ι = 10 log

(nT 2s

δlog(√nηrs

)), (164)

rs =δ

480ρnTs

√ρε

ι, m =

160(`+ ˜)

δ√ρε

√Tsι, (165)

Then for any point x ∈ Rn satisfying λmin(H(x)) ≤ −√ρε, with probability at least 1 − 3δ,Algorithm 3 outputs a unit vector e satisfying

eTH(x)e ≤ −√ρε

4, (166)

whereH stands for the Hessian matrix of function f , using O(m ·Ts) = O(

log2 nδε1/2

)iteartions.

Similarly to Algorithm 1 and Algorithm 2, the renormalization step Line 6 in Algorithm 3 only guar-antees that the value ‖yt‖ would not scales exponentially during the algorithm, and does not affect

25

Page 26: based algorithm

Algorithm 7: Stochastic Negative Curvature Finding without Renormalization(x, rs,Ts,m).1 z0 ← 0;2 for t = 1, ...,Ts do3 Sample

θ(1), θ(2), · · · , θ(m)

∼ D;

4 g(zt−1)← ‖zt−1‖rs· 1m

∑mj=1

(g(x + rs

‖zt−1‖zt−1; θ(j))− g(x; θ(j))

);

5 zt ← zt−1 − 1` (g(zt−1) + ξt), ξt ∼ N

(0,

r2sd I)

;

6 Output zT /‖zT ‖.

the output. We thus introduce the following Algorithm 7, which is the no-renormalization versionof Algorithm 3 that possess the same output and a simpler structure. Hence in this subsection, weanalyze Algorithm 7 instead of Algorithm 3.

Without loss of generality, we assume x = 0 by shifting Rn such that x is mapped to 0. As arguedin the proof of Proposition 3,H(0) admits the following following eigen-decomposition:

H(0) =n∑i=1

λiuiuTi , (167)

where the set uini=1 forms an orthonormal basis of Rn. Without loss of generality, we assume theeigenvalues λ1, λ2, . . . , λn corresponding to u1,u2, . . . ,un satisfy

λ1 ≤ λ2 ≤ · · · ≤ λn, (168)

where λ1 ≤ −√ρε. If λn ≤ −

√ρε/2, Proposition 24 holds directly. Hence, we only need to prove

the case where λn > −√ρε/2, where there exists some p > 1 and p′ > 1 with

λp ≤ −√ρε ≤ λp+1, λp′ ≤ −

√ρε/2 < λp′+1. (169)

Notation: Throughout this subsection, let H := H(x). Use S‖, S⊥ to separately denote thesubspace of Rn spanned by u1,u2, . . . ,up, up+1,up+2, . . . ,un, and use S′‖, S

′⊥ to denote the

subspace of Rn spanned by u1,u2, . . . ,up′, up′+1,up+2, . . . ,un. Furthermore, define zt,‖ :=∑pi=1 〈ui, zt〉ui, zt,⊥ :=

∑ni=p 〈ui, zt〉ui, zt,‖′ :=

∑p′

i=1 〈ui, zt〉ui, zt,⊥′ :=∑ni=p′ 〈ui, zt〉ui

respectively to denote the component of zt in Line 5 of Algorithm 7 in the subspaces S‖, S⊥, S′‖,S′⊥, and let γ = −λ1.

To prove Proposition 24, we first introduce the following lemma:Lemma 25. Under the setting of Proposition 24, for any point x ∈ Rn satisfying λmin(∇2f(x) ≤−√ρε, with probability at least 1− 3δ, Algorithm 3 outputs a unit vector e satisfying

‖e⊥′‖ :=∥∥∥ n∑i=p′

〈ui, e〉ui∥∥∥ ≤ √ρε

8`(170)

using O(m ·Ts) = O(

log2 nδε1/2

)iteartions.

D.1.1 Proof of Lemma 25

In the proof of Lemma 25, we consider the worst case, where λ1 = −γ = −√ρε is the onlyeigenvalue less than −√ρε/2, and all other eigenvalues equal to −√ρε/2 + ν for an arbitrarilysmall constant ν. Under this scenario, the component zt,⊥′ is as small as possible at each time step.

The following lemma characterizes the dynamics of Algorithm 7:Lemma 26. Consider the sequence zi and let η = 1/`. Further, for any 0 ≤ t ≤ Ts we define

ζt := g(zt−1)− ‖zt‖rs

(∇f(x +

rs‖zt‖

zt

)−∇f(x)

), (171)

26

Page 27: based algorithm

to be the errors caused by the stochastic gradients. Then zt = −qh(t)− qsg(t)− qp(t), where:

qh(t) := η

t−1∑τ=0

(I − ηH)t−1−τ∆τ zτ , (172)

for ∆τ =∫ 1

0Hf(ψ rs‖zτ‖zτ

)dψ − H, and

qsg(t) := η

t−1∑τ=0

(I − ηH)t−1−τζτ , qp(t) := η

t−1∑τ=0

(I − ηH)t−1−τξτ . (173)

Proof. Without loss of generality we assume x = 0. The update formula for zt can be written as

zt+1 = zt − η(‖zt‖rs

(∇f( rs‖zt‖

zt

)−∇f(0)

)+ ζt + ξt

), (174)

where

‖zt‖rs

(∇f( rs‖zt‖

zt

)−∇f(0)

)=‖zt‖rs

∫ 1

0

Hf(ψrs‖zt‖

zt

) rs‖zt‖

ztdψ = (H+ ∆t)zt, (175)

indicating

zt+1 = (I − ηH)xt − η(∆tzt + ζt + ξt) (176)

= −ηt∑

τ=0

(I − ηH)t−τ (∆tzt + ζt + ξt), (177)

which finishes the proof.

At a high level, under our parameter choice in Proposition 24, qp(t) is the dominating term control-ling the dynamics, and qh(t)+qsg(t) will be small compared to qp(t). Quantitatively, this is shownin the following lemma:

Lemma 27. Under the setting of Proposition 24 while using the notation in Lemma 12 andLemma 26, we have

Pr(‖qh(t) + qsg(t)‖ ≤

β(t)ηrsδ

20√n·√ρε

16`, ∀t ≤ Ts

)≥ 1− δ, (178)

where −γ := λmin(H) = −√ρε.

Proof. Divide qp(t) into two parts:

qp,1(t) := 〈qp(t),u1〉u1, (179)

and

qp,⊥′(t) := qp(t)− qp,1(t). (180)

Then by Lemma 13, we have

Pr(‖qp,1(t)‖ ≤ β(t)ηrs√

n·√ι)≥ 1− 2e−ι, (181)

and

Pr(‖qp,1(t)‖ ≥ β(t)ηrs

20√n· δ)≥ 1− δ/4. (182)

Similarly,

Pr(‖qp,⊥′(t)‖ ≤ β⊥′(t)ηrs ·

√ι)≥ 1− 2e−ι, (183)

27

Page 28: based algorithm

and

Pr(‖qp,⊥′(t)‖ ≥

β⊥′(t)ηrs20

· δ)≥ 1− δ/4, (184)

where β⊥′(t) := (1+ηγ/2)t√ηγ . Set t⊥′ := logn

ηγ . Then for all τ ≤ t⊥′ , we have

β(τ)

β⊥′(τ)≤√n, (185)

which further leads to

Pr(‖qp,⊥′(τ)‖ ≤ 2β⊥′(t)ηrs ·

√ι)≥ 1− 2e−ι. (186)

Next, we use induction to prove that the following inequality holds for all t ≤ t⊥′ :

Pr(‖qh(τ) + qsg(τ)‖ ≤ β⊥′(τ)ηrs ·

δ

20, ∀τ ≤ t

)≥ 1− 10nt2 log

(√nηrs

)e−ι. (187)

For the base case t = 0, the claim holds trivially. Suppose it holds for all τ ≤ t for some t. Thendue to Lemma 13, with probability at least 1− 2t⊥′e

−ι, we have

‖zt‖ ≤ η‖qp(t)‖+ η‖qh(t) + qsg(t)‖ ≤ 3β⊥′(τ)ηrs ·√ι. (188)

By the Hessian Lipschitz property, ∆τ satisfies:

‖∆τ‖ ≤ ρrs. (189)

Hence,

‖qh(t+ 1)‖ ≤∥∥η t∑

τ=0

(I − ηH)t−τ∆τzτ∥∥ (190)

≤ ηρrst∑

τ=0

(I − ηH)t−τ‖zτ‖ (191)

≤ (ηρrsnTs) · (3β⊥′(t)ηrs) ·√ι (192)

≤ β⊥′(t+ 1)ηrs10√n

·δ√ρε

16`. (193)

As for qsg(t), note that ζτ |Fτ−1 satisfies the norm-subGaussian property defined in Definition 14.Specifically, ζτ |Fτ−1 ∼ nSG((` + ˜)‖zτ‖/

√m). By applying Lemma 15 with b = α2(t) · η2(` +

˜)2/m and b = α2(t)η2(`+ ˜)2η2r2s/(mn), with probability at least

1− 4n · log(√nηrs

)· e−ι, (194)

we have

‖qsg(t+ 1)‖ ≤ η(`+ ˜)√t

m· (β⊥(t)ηrs) ·

√ι ≤ β⊥′(t+ 1)ηrs

20·δ√ρε

8`. (195)

Then by union bound, with probability at least

1− 10n(t+ 1)2 log(√nηrs

)e−ι, (196)

we have

‖qh(t+ 1) + qsg(t+ 1)‖ ≤ β⊥′(t+ 1)ηrs ·δ

20·√ρε

8`, (197)

indicating that (187) holds. Then with probability at least

1− 10nt2⊥′ log(√nηrs

)e−ι − δ/4, (198)

28

Page 29: based algorithm

we have

‖qh(t⊥′) + qsg(t⊥′)‖ ≤ ‖qp,1(t⊥′)‖ ·√ρε

16`. (199)

Based on this, we prove that the following inequality holds for any t⊥′ ≤ t ≤ Ts:

Pr(‖qh(τ) + qsg(τ)‖ ≤ β(τ)ηrs

20√n·δ√ρε

16`, ∀t⊥′ ≤ τ ≤ t

)≥ 1− 10nt2 log

(√nηrs

)e−ι. (200)

We still use recurrence to prove it. Note that its base case τ = t⊥′ is guaranteed by (187). Supposeit holds for all τ ≤ t for some t. Then with probability at least 1− 2te−ι, we have

‖zt‖ ≤ η‖qp(t)‖+ η‖qh(t) + qsg(t)‖ (201)≤ 2‖qp,1(t)‖+ η‖qh(t) + qsg(t)‖ (202)

≤ 3β(τ)ηrs√n

·√ι. (203)

Then following a similar procedure as before, we can claim that

‖qh(t+ 1) + qsg(t+ 1)‖ ≤ β(t+ 1)ηrs√n

· δ20·√ρε

8`, (204)

holds with probability

1− 10n(t+ 1)2 log(√nηrs

)e−ι − δ

4, (205)

indicating that (200) holds. Then under our choice of parameters, the desired inequality

‖qh(t) + qsg(t)‖ ≤β(t)ηrsδ

20√n·√ρε

16`(206)

holds with probability at least 1− δ.

Equipped with Lemma 27, we are now ready to prove Lemma 25.

Proof. First note that under our choice of Ts, we have

Pr(‖qp,⊥′(Ts)‖‖qp,1(Ts)‖

≤√ρε

16`

)≥ 1− δ. (207)

Further by Lemma 27 and union bound, with probability at least 1− 2δ,

‖qh(Ts) + qsg(Ts)‖‖qp(Ts)‖

≤ ‖qh(Ts) + qsg(Ts)‖ ·20√n

δβ(t)ηrs≤√ρε

16`. (208)

For the output e, observe that its component e⊥′ = e − e1, since u1 is the only component insubspace S‖′ . Then with probability at least 1− 3δ,

‖e⊥′‖ ≤√ρε/(8`). (209)

D.1.2 Proof of Proposition 24

Based on Lemma 25, we present the proof of Proposition 24 as follows:

Proof. By Lemma 25, the component e⊥′ of output e satisfies

‖e⊥′‖ ≤√ρε

8`. (210)

Since e = e‖′ + e⊥′ , we can derive that

‖e‖′‖ ≥√

1− ρε

(8`)2≥ 1− ρε

(8`)2. (211)

29

Page 30: based algorithm

Note that

eT He = (e⊥′ + e‖′)T H(e⊥′ + e‖′), (212)

which can be further simplified to

eT He = eT⊥′He⊥′ + eT‖′He‖′ . (213)

Due to the `-smoothness of the function, all eigenvalue of the Hessian matrix has its absolute valueupper bounded by `. Hence,

eT⊥He⊥ ≤ `‖eT⊥‖22 =ρε

64`2, (214)

whereas

eT‖′He‖′ ≤ −√ρε

2‖e‖′‖2. (215)

Combining these two inequalities together, we can obtain

eT He = eT⊥′He⊥′ + eT‖′He‖′ ≤ −√ρε

2‖e‖′‖2 +

ρε

64`2≤ −√ρε

4. (216)

D.2 Proof details of escaping saddle points using Algorithm 3

In this subsection, we demonstrate that Algorithm 3 can be used to escape from saddle points inthe stochastic setting. We first present the explicit Algorithm 8, and then introduce the full versionTheorem 9 with proof.

Algorithm 8: Stochastic Gradient Descent with Negative Curvature Finding.1 Input: x0 ∈ Rn;2 for t = 0, 1, ..., T do3 Sample

θ(1), θ(2), · · · , θ(M)

∼ D;

4 g(xt) = 1M

∑Mj=1 g(xt; θ

(j));5 if ‖g(xt)‖ ≤ 3ε/4 then6 e←StochasticNegativeCurvatureFinding(xt, rs,Ts,m);

7 xt ← xt − f ′e(x0)4|f ′e(x0)|

√ερ · e;

8 Sampleθ(1), θ(2), · · · , θ(M)

∼ D;

9 g(xt) = 1M

∑Mj=1 g(xt; θ

(j));

10 xt+1 ← xt − 1`g(xt; θt);

Theorem 28 (Full version of Theorem 9). Suppose that the function f is `-smooth and ρ-HessianLipschitz. For any ε > 0 and a constant 0 < δs ≤ 1, we choose the parameters appearing inAlgorithm 8 as

δ =δs

2304∆f

√ε3

ρTs =

8`√ρε· log

( `n

δ√ρε

), ι = 10 log

(nT 2s

δlog(√nηrs

)), (217)

rs =δ

480ρnTs

√ρε

ι, m =

160(`+ ˜)

δ√ρε

√Tsι, M =

16`∆f

ε2(218)

where ∆f := f(x0) − f∗ and f∗ is the global minimum of f . Then, Algorithm 8 satisfies that atleast 1/4 of the iterations xt will be ε-approximate second-order stationary points, using

O( (f(x0)− f∗)

ε4· log2 n

)(219)

iterations, with probability at least 1− δs.

30

Page 31: based algorithm

Proof. Let the parameters be chosen according to (2), and set the total step number T to be:

T = max

8`(f(x0)− f∗)

ε2, 768(f(x0)− f∗) ·

√ρ

ε3

. (220)

We will show that the following two claims hold simultaneously with probability 1− δs:

1. At most T/4 steps have gradients larger than ε;

2. Algorithm 3 can be called for at most 384∆f

√ρε3 times.

Therefore, at least T/4 steps are ε-approximate secondary stationary points. We prove the twoclaims separately.

Claim 1. Suppose that within T steps, we have more than T/4 steps with gradients larger than ε.Then with probability 1− δs/2,

f(xT )− f(x0) ≤ −η8

T−1∑i=0

‖∇f(xi)‖2 + c · σ2

M`(T + log(1/δs)) ≤ f∗ − f(x0), (221)

contradiction.

Claim 2. We first assume that for each xt we apply negative curvature finding (Algorithm 3), wecan successfully obtain a unit vector e with eTH(xt)e ≤ −

√ρε/4, as long as λmin(H(xt)) ≤

−√ρε. The error probability of this assumption is provided later.

Under this assumption, Algorithm 3 can be called for at most 384(f(x0)− f∗)√

ρε3 ≤

T2 times, for

otherwise the function value decrease will be greater than f(x0)− f∗, which is not possible. Then,the error probability that some calls to Algorithm 3 fails is upper bounded by

384(f(x0)− f∗)√ρ

ε3· (3δ) = δs/2. (222)

The number of iterations can be viewed as the sum of two parts, the number of iterations needed inlarge gradient scenario, denoted by T1, and the number of iterations needed for negative curvaturefinding, denoted by T2. With probability at least 1− δs,

T1 = O(M · T ) = O( (f(x0)− f∗)

ε4

). (223)

As for T2, with probability at least 1 − δs, Algorithm 3 is called for at most 384(f(x0) − f∗)√

ρε3

times, and by Proposition 24 it takes O(

log2 nδ√ρε

)iterations each time. Hence,

T2 = 384(f(x0)− f∗)√ρ

ε3· O( log2 n

δ√ρε

)= O

( (f(x0)− f∗)ε4

· log2 n). (224)

As a result, the total iteration number T1 + T2 is

O( (f(x0)− f∗)

ε4· log2 n

). (225)

E More numerical experiments

In this section, we present more numerical experiment results that support our theoretical claimsfrom a few different perspectives compared to Section 4. Specifically, considering that previousexperiments all lies in a two-dimensional space, and theoretically our algorithms have a better de-pendence on the dimension of the problem n, it is reasonable to check the actual performance ofour algorithm on high-dimensional test functions, which is presented in Appendix E.1. Then inAppendix E.2, we introduce experiments on various landscapes that demonstrate the advantage ofAlgorithm 2 over PAGD [21]. Moreover, we compare the performance of our Algorithm 2 with theNEON+ algorithm [29] on a few test functions in Appendix E.3. To be more precise, we comparethe negative curvature extracting part of NEON+ with Algorithm 2 at saddle points in different typesof nonconvex landscapes.

31

Page 32: based algorithm

E.1 Dimension dependence

Recall that n is the dimension of the problem. We choose a test function h(x) = 12x

THx + 116x

41

where H is an n-by-n diagonal matrix: H = diag(−ε, 1, 1, ..., 1). The function h(x) has a saddlepoint at the origin, and only one negative curvature direction. Throughout the experiment, we setε = 1. For the sake of comparison, the iteration numbers are chosen in a manner such that thestatistics of Algorithm 1 and PGD in each category of the histogram are of similar magnitude.

Figure 3: Dimension dependence of Algorithm 1 and PGD. We set ε = 0.01, r = 0.1, n = 10p for p = 1, 2, 3.The iteration number of Algorithm 1 and PGD are separately set to be 30p and 20p2 + 10, and the sample sizeM = 100. As we can see, to maintain the same performance, the number of iterations in PGD grows fasterthan the number of iterations in Algorithm 1.

E.2 Comparison between Algorithm 2 and PAGD on various nonconvex landscapes

Quartic-type test function Consider the test function f(x1, x2) = 116x

41 − 1

2x21 + 9

8x22 with a

saddle point at (0, 0). The advantage of Algorithm 2 is illustrated in Figure 4.

Figure 4: Run Algorithm 2 and PAGD on landscape f(x1, x2) = 116x41 − 1

2x21 +

98x22. Parameters: η = 0.05

(step length), r = 0.08 (ball radius in PAGD and parameter r in Algorithm 2), M = 300 (number of sam-plings).Left: The contour of the landscape is placed on the background with labels being function values. Blue pointsrepresent samplings of Algorithm 2 at time step tANCGD = 10 and tANCGD = 20, and red points represent sam-plings of PAGD at time step tPAGD = 20 and tPAGD = 40. Similarly to Algorithm 1, Algorithm 2 transforms aninitial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature,and these two figures represent intermediate states of this process. It converges faster than PAGD even whentANCGD tPAGD.Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set tANCGD = 20 andtPAGD = 40. Although we run two times of iterations in PAGD, there are still over 20% of PAGD paths withfunction value decrease no greater than 0.9, while this ratio for Algorithm 2 is less than 5%.

32

Page 33: based algorithm

Triangle-type test function. Consider the test function f(x1, x2) = 12 cos(πx1) + 1

2

(x2 +

cos(2πx1)−12

)2− 1

2 with a saddle point at (0, 0). The advantage of Algorithm 2 is illustrated inFigure 5.

Figure 5: Run Algorithm 2 and PAGD on landscape f(x1, x2) = 12cos(πx1) +

12

(x2 +

cos(2πx1)−12

)2

− 12

.Parameters: η = 0.01 (step length), r = 0.1 (ball radius in PAGD and parameter r in Algorithm 2), M = 300(number of samplings).Left: The contour of the landscape is placed on the background with labels being function values. Blue pointsrepresent samplings of Algorithm 2 at time step tANCGD = 10 and tANCGD = 20, and red points representsamplings of PAGD at time step tPAGD = 40 and tPAGD = 80. Algorithm 2 converges faster than PAGD evenwhen tANCGD tPAGD.Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set tANCGD = 20 andtPAGD = 80. Although we run four times of iterations in PAGD, there are still over 20% of gradient descentpaths with function value decrease no greater than 0.9, while this ratio for Algorithm 2 is less than 5%.

Exponential-type test function. Consider the test function f(x1, x2) = 1

1+ex21

+ 12

(x2 −

x21e−x2

1

)2 − 1 with a saddle point at (0, 0). The advantage of Algorithm 2 is illustrated in Figure 6.

Figure 6: Run Algorithm 2 and PAGD on landscape f(x1, x2) = f(x1, x2) =1

1+ex21

+ 12

(x2−x21e−x

21)2−1.

Parameters: η = 0.03 (step length), r = 0.1 (ball radius in PAGD and parameter r in Algorithm 2), M = 300(number of samplings).Left: The contour of the landscape is placed on the background with labels being function values. Blue pointsrepresent samplings of Algorithm 2 at time step tANCGD = 10 and tANCGD = 20, and red points representsamplings of PAGD at time step tPAGD = 30 and tPAGD = 60. Algorithm 2converges faster than PAGD evenwhen tANCGD tPAGD.Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set tANCGD = 20 andtPAGD = 60. Although we run three times of iterations in PAGD, its performance is still dominated by ourAlgorithm 2.

Compared to the previous experiment on Algorithm 1 and PGD shown as Figure 1 in Section 4,these experiments also demonstrate the faster convergence rates enjoyed by the general family of"momentum methods". Specifically, using fewer iterations, Algorithm 2 and PAGD achieve largerfunction value decreases separately compared to Algorithm 1 and PGD.

33

Page 34: based algorithm

E.3 Comparison between Algorithm 2 and NEON+ on various nonconvex landscapes

Triangle-type test function. Consider the test function f(x1, x2) = 12 cos(πx1) + 1

2

(x2 +

cos(2πx1)−12

)2− 1

2 with a saddle point at (0, 0). The advantage of Algorithm 2 is illustrated inFigure 7.

Figure 7: Run Algorithm 2 and NEON+ on landscape f(x1, x2) = 12cos(πx1)+

12

(x2 +

cos(2πx1)−12

)2− 12

.Parameters: η = 0.04 (step length), r = 0.1 (ball radius in NEON+ and parameter r in Algorithm 2),M = 300(number of samplings).Left: The contour of the landscape is placed on the background with labels being function values. Red pointsrepresent samplings of NEON+ at time step tNEON = 20, and blue points represent samplings of Algorithm 2at time step tANCGD = 10. Algorithm 2 and the negative curvature extracting part of NEON+ both transform aninitial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature.Note that Algorithm 2 converges faster than NEON+ even when tANCGD tNEON.Right: A histogram of descent values obtained by Algorithm 2 and NEON+, respectively. Set tANCGD = 10and tNEON = 20. Although we run two times of iterations in NEON+, none of NEON+ paths has functionvalue decrease greater than 0.95, while this ratio for Algorithm 2 is larger than 90%.

Exponential-type test function. Consider the test function f(x1, x2) = 1

1+ex21

+ 12

(x2 −

x21e−x2

1

)2 − 1 with a saddle point at (0, 0). The advantage of Algorithm 2 is illustrated in Figure 8

Figure 8: Run Algorithm 2 and NEON+ on landscape f(x1, x2) = 1

1+ex21

+ 12

(x2−x21e−x

21)2−1. Parameters:

η = 0.03 (step length), r = 0.1 (ball radius in NEON+ and parameter r in Algorithm 2), M = 300 (numberof samplings).Left: The contour of the landscape is placed on the background with labels being function values. Red pointsrepresent samplings of NEON+ at time step tNEON = 40, and blue points represent samplings of Algorithm 2at time step tANCGD = 20. Algorithm 2 converges faster than NEON+ even when tANCGD tNEON.Right: A histogram of descent values obtained by Algorithm 2 and NEON+, respectively. Set tANCGD = 20and tNEON = 40. Although we run two times of iterations in NEON+, there are still over 20% of NEON+ pathswith function value decrease no greater than 0.9, while this ratio for Algorithm 2 is less than 10%.

Compared to the previous experiments on Algorithm 2 and PAGD in Appendix E.2, these two ex-periments also reveal the faster convergence rate of both NEON+ and Algorithm 2 against PAGD[21] at small gradient regions.

34