
Nonconvex Optimization Methods:Iteration Complexity and Applications

A DISSERTATION

SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

OF THE UNIVERSITY OF MINNESOTA

BY

Junyu Zhang

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Advisor: Prof. Shuzhong Zhang

Committee members:

Prof. Zhaosong Lu

Prof. Ankur Mani

Prof. Mingyi Hong

Department of Industrial & Systems Engineering

University of Minnesota

May, 2020


© Junyu Zhang 2020

ALL RIGHTS RESERVED


Acknowledgement

First, I would like to express my special acknowledgment and sincere gratitude to my advisor, Professor Shuzhong Zhang. Throughout my five years of Ph.D. study at the University of Minnesota, he has continuously provided me with valuable instruction and guidance. He helped me build up my research abilities, my taste in problems, and many other skills crucial for future research and studies. He is the best instructor I could imagine. Second, I would like to thank my summer internship mentor, Doctor Lin Xiao. During my two summers at the Machine Learning and Optimization Group at Microsoft Research, his guidance and support helped me broaden my research areas. Third, I would also like to thank my undergraduate advisor, Professor Zaiwen Wen of Peking University. He led me into the world of scientific research and guided me through the first piece of research in my life.

In addition, I would like to thank Professor Shiqian Ma from the University of California, Davis, Professor Mingyi Hong from the University of Minnesota, Professor Dmitriy Drusvyatskiy from the University of Washington, and many other collaborators who have worked with me during the last five years. I would also like to thank my colleagues Xiaobo Li, Xiang Gao, Shaozhe Tao, Yangyang Xu, Guanglin Xu, and Kevin Huang, from whom I have received great help.

Finally, I would like to thank my parents and grandparents for their endless love and support from my first day; this dissertation is dedicated to them.


Contents

1 Introduction

2 Primal-Dual Optimization Algorithms over Riemannian Manifolds: an Iteration Complexity Analysis
  2.1 Main Problem and Relevant Literature
  2.2 Optimality Conditions and ε-Stationary Point
    2.2.1 Relevant Concepts in Riemannian Optimization
    2.2.2 Optimality condition and the ε-stationary solution
  2.3 Proximal Gradient ADMM and Its Variants
    2.3.1 Nonconvex proximal gradient-based ADMM
    2.3.2 Nonconvex linearized proximal gradient-based ADMM
    2.3.3 Stochastic linearized proximal ADMM
    2.3.4 A feasible curvilinear line-search variant of ADMM
  2.4 Extending the Basic Model
    2.4.1 Relaxing the convexity requirement on nonsmooth regularizers
    2.4.2 Relaxing the condition on the last block variables
    2.4.3 The Jacobi-style updating rule
  2.5 Some Applications and Their Implementations
    2.5.1 Maximum bisection problem
    2.5.2 The ℓq-regularized sparse tensor PCA
    2.5.3 The community detection problem
  2.6 Numerical Results
    2.6.1 The maximum bisection problem
    2.6.2 The ℓq-regularized sparse tensor PCA
    2.6.3 The community detection problem

3 Adaptive Stochastic Variance Reduction for Subsampled Newton Method with Cubic Regularization
  3.1 The Problem Formulation
    3.1.1 Related works
  3.2 Adaptive variance reduction for cubic regularization
    3.2.1 Convergence analysis
    3.2.2 Bounding the Hessian sample complexity
    3.2.3 Analysis of sampling without replacement
  3.3 Analysis of non-adaptive SVRC Schemes
    3.3.1 The case of using full gradient
    3.3.2 The case of using subsampled gradient
  3.4 Discussion
  3.5 Discussion

4 A Sparse Completely Positive Relaxation of the Modularity Maximization for Community Detection
  4.1 Introduction and Related Literature
  4.2 Model Description
    4.2.1 The Degree-Correlated Stochastic Block Model
    4.2.2 A Sparse and Low-Rank Completely Positive Relaxation
    4.2.3 A Community Detection Framework
  4.3 An Asynchronous Proximal RBR Algorithm
    4.3.1 A Nonconvex Proximal RBR Algorithm
    4.3.2 An Asynchronous Parallel Proximal RBR Scheme
  4.4 Theoretical Error Bounds Under The DCSBM
  4.5 Numerical Results
    4.5.1 Solvers and Platform
    4.5.2 Results on Synthetic Data
    4.5.3 Results on Small-Scale Real-World Data
    4.5.4 Results on Large-Scale Data

Chapter 1

Introduction

Nowadays, optimization algorithms have become an essential tool for a wide variety of applications in many fields. Despite the inherent hardness of nonconvexity, a large number of these applications involve nonconvex optimization problems. In this thesis, we focus on several aspects of nonconvex optimization algorithms, including Riemannian optimization methods for solving manifold-constrained problems and a variance-reduced cubic-regularized Newton's method for computing second-order solutions of empirical risk minimization problems. In the end, we include an application of such methods to the problem of community detection.

First, let us consider the nonconvex optimization problem in the usual sense,

min_x f(x),  s.t.  g(x) ≥ 0.    (1.1)

Note that even for the simplest unconstrained case where g(x) ≡ 0, finding a local minimum of a smooth objective function is NP-hard. When the problems are constrained, the situation is more complicated. Although constraints generally cause considerable difficulties, there is one type of constraint that is relatively mild, namely the Riemannian manifold constraint,

min_x f(x),  s.t.  x ∈ M,


where the decision variable is required to belong to some Riemannian manifold M. Examples include spherical constraints, orthogonality constraints, and matrix low-rank constraints, among others. The special geometry of a Riemannian manifold indicates that optimization problems with only Riemannian manifold constraints are unconstrained in nature, which makes it possible to extend the existing unconstrained analysis to this class of constrained problems; see e.g. [9, 2, 1]. Much research has focused on problems with only Riemannian constraints. However, to the best of our knowledge, there was no prior work on "constrained" Riemannian optimization problems, i.e., problems with extra constraints beyond the Riemannian ones:

min_x f(x),  s.t.  x ∈ M,  g(x) ≥ 0.

In the first part of the thesis, we study linearly constrained problems with embedded submanifold constraint structures. We design an ADMM-type algorithm for such problems and establish the corresponding convergence results. Several applications with numerical experiments are included.

Besides the constraints, another important issue to handle is the sample complexity of stochastic algorithms. In machine learning and statistical learning, the following population risk minimization problem is often considered:

min_x E_ξ[f(x, ξ)].

Here ξ corresponds to a data point coming from some underlying data distribution, and the decision variable x corresponds to the model parameter to be optimized. In practice, however, the data distribution is unknown, so there is no way to compute the expectation. Instead, the following empirical risk minimization (ERM) surrogate problem is solved:

min_x F(x) := (1/n) Σ_{i=1}^n f(x, ξ_i),

where the ξ_i's are i.i.d. samples from the data distribution. For such problems, stochastic algorithms are quite popular due to their low per-iteration cost and scalability


to large datasets; see e.g. [31, 30, 29, 76]. For the ERM problem, a special technique called variance reduction has been developed to further reduce the gradient sample complexity of randomized first-order algorithms (see [47, 104, 92] for convex optimization and [5, 75] for nonconvex optimization). One should notice, however, that these algorithms typically produce a sequence of iterates with a subsequence whose gradient norm goes to zero, yet the limit points can still be saddle points. It is therefore desirable to use second-order information to find an ε-second-order stationary point x such that

‖∇F(x)‖ ≤ ε,   λ_min(∇²F(x)) ≥ −√ε,    (1.2)

where ε is a desired tolerance and λ_min(·) denotes the smallest eigenvalue of a symmetric matrix. For minimizing a general smooth nonconvex function F, Nesterov and Polyak [63] introduced a modified Newton method with cubic regularization (CR) that finds such points within O(ε^{−3/2}) iterations. This is better than purely gradient-based methods, which need O(ε^{−2}) iterations to reach a point x satisfying ‖∇F(x)‖ ≤ ε. Various variants have been developed to carry out cubic-regularized methods efficiently; see [12, 13, 14, 11, 4] for example. However, the computational cost per iteration of the cubic-regularized method can be much higher than that of gradient-based methods, especially when we need to compute n Hessian matrices for the ERM problem. Therefore, in the second part of the thesis, we present a method that combines the strengths of the variance-reduced subsampling method and the cubic-regularized method.
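To make the cubic-regularized step concrete, the following sketch (not from the thesis; the toy function, the inner solver, and the parameters σ and lr are illustrative assumptions) approximately minimizes the cubic model m(s) = gᵀs + ½sᵀHs + (σ/3)‖s‖³ by plain gradient descent and applies the resulting step. Nesterov and Polyak analyze the exact model minimizer; a simple gradient loop stands in for it here.

```python
import numpy as np

# One step of cubic regularization (CR): approximately minimize the model
#   m(s) = g^T s + (1/2) s^T H s + (sigma/3) ||s||^3
# by gradient descent on m (an illustrative inner solver, not the exact
# minimizer used in the Nesterov-Polyak analysis).
def cubic_step(g, H, sigma, lr=0.01, inner_iters=500):
    s = np.zeros_like(g)
    for _ in range(inner_iters):
        s -= lr * (g + H @ s + sigma * np.linalg.norm(s) * s)  # -lr * grad m(s)
    return s

# Toy nonconvex F(x) = (1/4)||x||^4 - (1/2)||x||^2: the origin is a stationary
# point with lambda_min(Hessian) = -1, and the minimizers form the unit sphere.
def gradF(x):
    return (x @ x) * x - x

def hessF(x):
    n = x.size
    return (x @ x) * np.eye(n) + 2.0 * np.outer(x, x) - np.eye(n)

x = np.array([0.1, -0.2])                  # start near the bad stationary point
for _ in range(20):
    x = x + cubic_step(gradF(x), hessF(x), sigma=1.0)
print(np.linalg.norm(x))                   # ≈ 1: an approximate second-order stationary point
```

Note that a first-order method started exactly at the origin would not move at all, since the gradient vanishes there, while the CR step exploits the negative curvature to escape.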

Finally, in the last part, we present our study of the community detection problem as an application of nonconvex optimization algorithms. In this work, we develop a scalable block-coordinate minimization algorithm for a nonconvex and nonsmooth relaxation of the modularity maximization problem. The efficiency and accuracy of our method are validated by comparison with several state-of-the-art methods.


Chapter 2

Primal-Dual Optimization Algorithms over Riemannian Manifolds: an Iteration Complexity Analysis

In this chapter, we study nonconvex and nonsmooth multi-block optimization over Euclidean embedded (smooth) Riemannian submanifolds with coupled linear constraints. By utilizing the embedding structure, we develop an ADMM-type primal-dual algorithm based on decoupled solvable subroutines, together with several variants of this method. First, we study the optimality conditions for this optimization model. Then, we define the notion of ε-stationary solutions for such problems. The main contribution of this chapter is to show that the proposed algorithms enjoy an iteration complexity of O(1/ε²) to reach an ε-stationary solution. Several variants, such as a sampling-based stochastic algorithm and a feasible curvilinear line-search variant of the algorithm, are also proposed. Finally, we show specifically how the algorithms can be implemented to solve a variety of practical problems, such as the


NP-hard maximum bisection problem, the ℓq-regularized sparse tensor principal component analysis, and the community detection problem.

2.1 Main Problem and Relevant Literature

In general, we consider the following model:

min  f(x_1, · · · , x_N) + Σ_{i=1}^{N−1} r_i(x_i)
s.t. Σ_{i=1}^{N} A_i x_i = b, with A_N = I,
     x_N ∈ R^{n_N},                              (2.1)
     x_i ∈ M_i,  i = 1, ..., N − 1,
     x_i ∈ X_i,  i = 1, ..., N − 1,

where f is a smooth function with L-Lipschitz continuous gradient, but is possibly nonconvex; the functions r_i(x_i) are convex but possibly nonsmooth; the M_i's are Riemannian manifolds, not necessarily compact, embedded in Euclidean spaces; and the additional constraint sets X_i are assumed to be closed and convex. As we shall see later, the restrictions that the r_i be convex and that A_N be the identity can all be relaxed after a reformulation. For the time being, however, let us focus on (2.1).

Related literature

There has been intensive recent research interest in studying optimization over a Riemannian manifold:

min_{x∈M} f(x),

where f is smooth; see [80, 59, 43, 2, 1, 9] and the references therein. Note that, viewed within the manifold itself, the problem is essentially unconstrained. Alongside deterministic algorithms, the stochastic gradient descent method (SGD) and the stochastic variance-reduced gradient method (SVRG) have also been extended to optimization over Riemannian


manifolds; see e.g. [102, 49, 101, 58, 45]. Compared to all these approaches, our proposed methods allow a nonsmooth objective, constraints x_i ∈ X_i, as well as coupling affine constraints. A key feature deviating from traditional Riemannian optimization is that we take advantage of the global solutions of decoupled proximal mappings instead of relying on a retraction mapping, although if a retraction mapping is available then it can be incorporated as well.

Properly dealing with the affine coupling constraints in problem (2.1) is the key to solving it. When the nonconvex set constraints and manifold constraints are absent, the Alternating Direction Method of Multipliers (ADMM) is a natural choice. This type of algorithm has attracted much research attention in the past few years. Convergence and iteration complexity results have been thoroughly studied in the convex setting, and recently the results have been extended to various nonconvex settings as well; see [39, 40, 57, 85, 89, 44, 96]. Among these results, [57, 96, 85, 89] show convergence to a stationary point without any iteration complexity guarantee. A closely related paper is [108], where the authors consider a multi-block nonconvex nonsmooth optimization problem on the Stiefel manifold with coupling linear constraints; an approximate augmented Lagrangian method is proposed to solve the problem and convergence to a KKT point is analyzed, but no iteration complexity result is given. Another related paper is [53], where the authors solve various manifold optimization problems with affine constraints by a two-block ADMM algorithm, though without convergence assurance. The current investigation is inspired by our previous work [44], which requires convexity of the constraint sets. In the current project, we drop this restriction, extend the results to the stochastic setting, and allow Riemannian manifold constraints.

Finally, we remark that for large-scale optimization problems such as tensor decomposition [52, 24, 66], black-box tensor approximation [67, 6], and worst-case input model estimation [32, 33], the cost of function or gradient evaluations is prohibitively expensive. Our stochastic approach considerably alleviates this computational burden.


2.2 Optimality Conditions and ε-Stationary Point

2.2.1 Relevant Concepts in Riemannian Optimization

In this section, we introduce the basics of optimization over manifolds. Suppose M is a differentiable manifold. Then for any x ∈ M, there exists a chart (U, ψ) in which U is an open set with x ∈ U ⊂ M and ψ is a homeomorphism between U and an open set ψ(U) in a Euclidean space. This coordinate transform enables us to locally treat a Riemannian manifold as a Euclidean space. Denote the tangent space of M at x ∈ M by T_xM; then M is a Riemannian manifold if it is equipped with a metric on the tangent space T_xM that is continuous in x.

Definition 2.2.1 (Tangent Space) Consider a Riemannian manifold M embedded in a Euclidean space. For any x ∈ M, the tangent space T_xM at x is a linear subspace consisting of the derivatives of all smooth curves on M passing through x; that is,

T_xM = {γ′(0) : γ(0) = x, γ([−δ, δ]) ⊂ M for some δ > 0, γ is smooth}.    (2.2)

The Riemannian metric, i.e., the inner product between u, v ∈ T_xM, is defined to be 〈u, v〉_x := 〈u, v〉, where the latter is the Euclidean inner product.

Define the set of all functions differentiable at x to be F_x. An alternative, more general way of defining the tangent space is to view a tangent vector v ∈ T_xM as an operator mapping f ∈ F_x to v[f] ∈ R with the following property: for any given f ∈ F_x, there exists a smooth curve γ on M with γ(0) = x and

v[f] = d(f(γ(t)))/dt |_{t=0}.

For submanifolds embedded in Euclidean spaces, we recover Definition 2.2.1 by defining v = γ′(0) and v[f] = 〈γ′(0), ∇f(x)〉, with 〈·, ·〉 being the standard Euclidean inner product in the embedding space. For example, when M is a sphere, T_xM is the tangent plane at x, translated so that the origin is included. When M = R^n, then T_xM = R^n = M. See Figure 2.1.

Figure 2.1: The tangent space for embedded submanifolds.

Definition 2.2.2 (Riemannian Gradient) For f ∈ F_x, the Riemannian gradient grad f(x) is the tangent vector in T_xM satisfying v[f] = 〈v, grad f(x)〉_x for any v ∈ T_xM.

If M is an embedded submanifold of a Euclidean space, we have

grad f(x) = Proj_{T_xM}(∇f(x)),

where Proj_{T_xM} is the Euclidean projection operator onto the subspace T_xM, which is a nonexpansive linear transformation. With the concept of the Riemannian gradient, the gradient descent method can be extended in parallel to the Riemannian gradient descent method; see Figure 2.2, where Retr is a mapping that maps a vector in the tangent space back to the manifold.
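For the unit sphere, both the tangent-space projection and a natural retraction have closed forms, so Riemannian gradient descent can be sketched in a few lines. This is an illustrative toy (the matrix A, the step size, and the iteration count are our own choices, not from the thesis):

```python
import numpy as np

# Riemannian gradient descent on the unit sphere M = {x : ||x|| = 1}.
# On the sphere, Proj_{T_x M}(v) = v - (x^T v) x, and renormalizing to
# unit length is a standard retraction.
A = np.diag([3.0, 2.0, 1.0])       # minimize f(x) = x^T A x over ||x|| = 1
grad_f = lambda x: 2.0 * A @ x     # Euclidean gradient of f

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
x /= np.linalg.norm(x)             # start on the manifold
for _ in range(200):
    g = grad_f(x)
    rgrad = g - (x @ g) * x        # grad f(x) = Proj_{T_x M}(∇f(x))
    x = x - 0.1 * rgrad            # step in the tangent space
    x /= np.linalg.norm(x)         # retract back onto the sphere
print(x @ A @ x)                   # ≈ 1.0, the smallest eigenvalue of A
```

The iterates stay feasible by construction, and the scheme converges to an eigenvector associated with the smallest eigenvalue of A, the minimizer of this quadratic over the sphere.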

Definition 2.2.3 (Differential) Let F : M → N be a smooth mapping between two Riemannian manifolds M and N. The differential (or push-forward) of F at x is the mapping DF(x) : T_xM → T_{F(x)}N defined by

(DF(x)[v])[f] = v[f ∘ F],  for all v ∈ T_xM and all f ∈ F_{F(x)}.

Suppose M is an m-dimensional embedded Riemannian submanifold of R^n, m ≤ n, and let (U, ψ) be a chart at x ∈ M; then ψ is a smooth mapping from U ⊂ M to

Figure 2.2: The Riemannian gradient and the Riemannian gradient descent method: the tangent step x̃^{k+1} = x^k − γ_k · grad f(x^k), followed by the retraction x^{k+1} = Retr(x̃^{k+1}).

ψ(U) ⊂ N = R^m. Given a basis {a_i}_{i=1}^m of T_xM, suppose v = Σ_{i=1}^m v̄_i a_i; then

v̄ := Dψ(x)[v] = (v̄_1, ..., v̄_m).

Clearly, this establishes a bijection between the tangent space T_xM and the tangent space T_{ψ(x)}ψ(U) = R^m. Following the notation in [97], we use ō to denote the Euclidean counterpart of an object o in M; e.g.,

f̄ = f ∘ ψ^{−1},  v̄ = Dψ(x)[v],  x̄ = ψ(x).

Finally, if we define the Gram matrix G_x(i, j) = 〈a_i, a_j〉_x, which is also known as the Riemannian metric, then 〈u, v〉_x = ū^⊤ G_x v̄.

Next, we present a few optimization concepts generalized to the manifold case. Let C be a subset of R^n and x ∈ C; the tangent cone T_C(x) and the normal cone N_C(x) of C at x are defined in accordance with [65]. Suppose S is a closed subset of the Riemannian manifold M and (U, ψ) is a chart at x ∈ S; then, using the coordinate

Figure 2.3: The chart and the differential of the chart.

transform (see also [97, 61]), the Riemannian tangent cone can be defined as

T_S(x) := [Dψ(x)]^{−1}[T_{ψ(S∩U)}(ψ(x))].    (2.3)

Consequently, the Riemannian normal cone can be defined as

N_S(x) := {u ∈ T_xM : 〈u, v〉_x ≤ 0, ∀v ∈ T_S(x)}.    (2.4)

By a rather standard argument (see [97]), the following proposition can be shown:

Proposition 2.2.4 N_S(x) = [Dψ(x)]^{−1}[G_x^{−1} N_{ψ(U∩S)}(ψ(x))].

A function f is said to be locally Lipschitz on M if for any x ∈ M there exists some L > 0 such that, in a neighborhood of x, f is L-Lipschitz in the sense of the Riemannian distance. When M is a compact manifold, a global L exists. When M is an embedded submanifold of R^n and f is locally Lipschitz on R^n, let f|_M be the function f restricted to M; then f|_M is also locally Lipschitz on M.

Definition 2.2.5 (The Clarke subdifferential on a Riemannian manifold [97, 42]) For a locally Lipschitz continuous function f on M, the Riemannian generalized directional derivative of f at x ∈ M in direction v ∈ T_xM is defined as

f°(x; v) = limsup_{y→x, t↓0} [f ∘ ψ^{−1}(ψ(y) + t Dψ(y)[v]) − f ∘ ψ^{−1}(ψ(y))] / t.    (2.5)

Then the Clarke subdifferential is defined as

∂f(x) = {ξ ∈ T_xM : 〈ξ, v〉_x ≤ f°(x; v), ∀v ∈ T_xM}.    (2.6)

Several remarks on the notion of the Riemannian Clarke subdifferential are in order. If M = R^n and ψ = id, then the above notion reduces to the original Clarke subdifferential [20]. In this case, suppose f is differentiable and r is locally Lipschitz; then we have

∂(f + r)(x) = ∇f(x) + ∂r(x),    (2.7)

where ∂r(x) is the Clarke subdifferential of r. Furthermore, if we have additional manifold constraints and r is convex, then from [97] we have

∂(f + r)|_M(x) = Proj_{T_xM}(∇f(x) + ∂r(x)).    (2.8)

The convexity of r is crucial in this property. If the nonsmooth part r_i(x_i) in our problem is also nonconvex, then we have to use additional variables and consensus constraints to decouple r_i, the manifold constraint, and the smooth component f, which will be discussed in Section 2.4. More importantly, we have the following result (see [97]):

Proposition 2.2.6 Suppose f is locally Lipschitz continuous in a neighborhood of x, and (U, ψ) is a chart at x. It holds that

∂f(x) = [Dψ(x)]^{−1}[G_x^{−1} ∂(f ∘ ψ^{−1})(ψ(x))].

2.2.2 Optimality condition and the ε-stationary solution

Consider the following optimization problem over a manifold:

min  f(x)    (2.9)
s.t. x ∈ S ⊂ M.

Suppose that x∗ is a local minimum and that (U, ψ) is a chart at x∗. Then x̄∗ := ψ(x∗) must also be a local minimum of the problem

min  f̄(x̄)    (2.10)
s.t. x̄ ∈ ψ(S ∩ U).

Therefore, problem (2.9) is transformed into a standard nonlinear programming problem (2.10) in Euclidean space. We will then find the optimality condition via (2.10) and map it back to that of (2.9) by using the differential operator.

Figure 2.4: Comparison between the first-order KKT condition of Euclidean optimization and Riemannian optimization for unconstrained problems. Panel (a): gradient = 0, i.e., ∇f(w∗) = 0 at a stationary point in Euclidean space. Panel (b): "Riemannian gradient" = 0, i.e., grad f(x∗) = Proj_{T_{x∗}M}(∇f(x∗)) = 0; the iterates are always feasible, so the interior of the manifold behaves like an "unconstrained problem".

Assume that both f and f̄ are locally Lipschitz. The optimality of x̄∗ yields (cf. [20])

0 ∈ ∂f̄(x̄∗) + N_{ψ(U∩S)}(x̄∗).

Applying the bijection [Dψ(x)]^{−1} ∘ G_x^{−1} on both sides, by Propositions 2.2.6 and 2.2.4, the first-order optimality condition for problem (2.9) follows:

0 ∈ ∂f(x∗) + N_S(x∗).    (2.11)

If f is differentiable, then (2.11) reduces to

−grad f(x∗) ∈ N_S(x∗).


To specify the set S in problem (2.1), let us consider an equality constrained problem:

min  f(x)
s.t. c_i(x) = 0, i = 1, ..., m,    (2.12)
     x ∈ M ∩ X.

Note that in the case of (2.1), the constraints c_i(x) = 0, i = 1, ..., m, represent the linear equality constraints. Define Ω := {x ∈ M : c_i(x) = 0, i = 1, ..., m} and S := Ω ∩ X.

Assuming the so-called Linear Independence Constraint Qualification (LICQ) condition on the Jacobian of c(x) at x∗, Corollary 4.1 in [97] implies

N_Ω(x∗) = { Σ_{i=1}^m λ_i grad c_i(x∗) : λ ∈ R^m } = −(T_Ω(x∗))∗,    (2.13)

where K∗ denotes the dual cone of a cone K. Therefore, (2.11) implies

∂f(x∗) ∩ (−N_S(x∗)) ≠ ∅.

We have

−(N_Ω(x∗) + N_X(x∗)) = (T_Ω(x∗))∗ + (T_X(x∗))∗
                     ⊆ cl((T_Ω(x∗))∗ + (T_X(x∗))∗)
                     = (T_Ω(x∗) ∩ T_X(x∗))∗
                     ⊆ (T_{Ω∩X}(x∗))∗.

The optimality condition is established as:

Proposition 2.2.7 Suppose that x∗ ∈ M ∩ X and c_i(x∗) = 0, i = 1, ..., m. If

∂f(x∗) ∩ (−N_Ω(x∗) − N_X(x∗)) ≠ ∅,

then x∗ is a stationary solution for problem (2.12).

By specifying the optimality condition in Proposition 2.2.7 to (2.1), we have:


Theorem 2.2.8 Consider problem (2.1), where f is smooth with Lipschitz gradient and the r_i's are convex and locally Lipschitz continuous. If there exists a Lagrange multiplier λ∗ such that

∇_N f(x∗) − A_N^⊤ λ∗ = 0,
Σ_{i=1}^N A_i x∗_i − b = 0,    (2.14)
Proj_{T_{x∗_i}M_i}(∇_i f(x∗) − A_i^⊤ λ∗ + ∂r_i(x∗_i)) + N_{X_i∩M_i}(x∗_i) ∋ 0,  i = 1, ..., N − 1,

then x∗ is a stationary solution for problem (2.1).

Hence, an ε-stationary solution of problem (2.1) can be naturally defined as follows:

Definition 2.2.9 (ε-stationary solution) Consider problem (2.1), where f is smooth with Lipschitz gradient and the r_i's are convex and locally Lipschitz continuous. A solution x∗ is said to be an ε-stationary solution if there exists a multiplier λ∗ such that

‖∇_N f(x∗) − A_N^⊤ λ∗‖ ≤ ε,
‖Σ_{i=1}^N A_i x∗_i − b‖ ≤ ε,
dist( Proj_{T_{x∗_i}M_i}(−∇_i f(x∗) + A_i^⊤ λ∗ − ∂r_i(x∗_i)), N_{X_i∩M_i}(x∗_i) ) ≤ ε,  i = 1, ..., N − 1.
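To make the three conditions of the definition concrete, the following sketch evaluates them on a hypothetical two-block instance (N = 2, A_N = I) in which r_1 ≡ 0 and X_1 is the whole space, so the normal-cone term vanishes and the distance reduces to the norm of a tangent-space projection. The instance and all data below are illustrative assumptions, not from the thesis.

```python
import numpy as np

# Residuals of the epsilon-stationarity conditions for a toy two-block problem:
#   min 0.5||x1 - c||^2 + 0.5||x2||^2   s.t.  A x1 + x2 = b,  ||x1|| = 1,
# with r1 = 0 and X1 = R^2, so N_{X1 ∩ M1} contributes nothing here.
def stationarity_residuals(x1, x2, lam, A, b, c):
    g1 = x1 - c                                   # ∇_1 f
    g2 = x2                                       # ∇_N f
    r_dual_N = np.linalg.norm(g2 - lam)           # ||∇_N f - A_N^T λ||, A_N = I
    r_feas = np.linalg.norm(A @ x1 + x2 - b)      # ||Σ A_i x_i - b||
    v = -g1 + A.T @ lam                           # -∇_1 f + A_1^T λ
    tangent = v - (x1 @ v) * x1                   # Proj onto T_{x1}(unit sphere)
    r_dual_1 = np.linalg.norm(tangent)
    return r_dual_N, r_feas, r_dual_1

A = np.eye(2); b = np.zeros(2); c = np.array([2.0, 0.0])
x1 = np.array([1.0, 0.0]); x2 = -x1               # feasible: x1 + x2 = 0 = b
lam = x2                                          # multiplier matching ∇_N f
print(stationarity_residuals(x1, x2, lam, A, b, c))   # (0.0, 0.0, 0.0): exactly stationary
```

All three residuals vanish at this point, so it is an ε-stationary solution for every ε ≥ 0.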

In the case where x∗ is generated by some randomized algorithm, the following adaptation is appropriate.

Definition 2.2.10 (ε-stationary solution in expectation) Suppose that x∗ and λ∗ are generated by some randomized process. Then x∗ and λ∗ are called an ε-stationary solution for problem (2.1) in expectation if the following holds:

E[ ‖∇_N f(x∗) − A_N^⊤ λ∗‖ ] ≤ ε,
E[ ‖Σ_{i=1}^N A_i x∗_i − b‖ ] ≤ ε,
E[ dist( Proj_{T_{x∗_i}M_i}(−∇_i f(x∗) + A_i^⊤ λ∗ − ∂r_i(x∗_i)), N_{X_i∩M_i}(x∗_i) ) ] ≤ ε,  i = 1, ..., N − 1.


2.3 Proximal Gradient ADMM and Its Variants

In [44], Jiang, Lin, Ma and Zhang proposed a proximal gradient-based variant of ADMM for nonconvex and nonsmooth optimization models with convex constraints. In this chapter, we extend the analysis to include nonconvex Riemannian submanifold constraints, motivated by the vast array of potential applications. Moreover, we propose to linearize the nonconvex function f, which significantly broadens the applicability and enables us to use stochastic gradient-based methods to reduce the computational cost for large-scale problems. As it turns out, the convergence result for this variant remains intact.

Concerning problem (2.1), we first make some assumptions on f and the r_i's. It is worth mentioning that all the decision variables x_1, ..., x_N live in the Euclidean spaces R^{n_i} in which the submanifolds M_i are embedded. This will remain true throughout the subsequent sections.

Assumption 2.3.1 f and r_i, i = 1, ..., N − 1, are all bounded from below in the feasible region. We denote the lower bounds by r∗_i = min_{x_i∈M_i∩X_i} r_i(x_i), i = 1, ..., N − 1, and f∗ = min_{x_i∈M_i∩X_i, i=1,...,N−1, x_N∈R^{n_N}} f(x_1, · · · , x_N).

Assumption 2.3.2 f is a smooth function with L-Lipschitz continuous gradient; i.e.,

‖∇f(x_1, . . . , x_N) − ∇f(x̂_1, . . . , x̂_N)‖ ≤ L‖(x_1 − x̂_1, . . . , x_N − x̂_N)‖,  ∀x, x̂.    (2.15)

Assumption 2.3.3 The proximal mappings required at Step 1 of Algorithms 1, 2 and 3 can be computed efficiently, with either a closed-form solution or an efficient iterative algorithm. (As we will see in Section 2.5, this assumption holds true for many practical applications.)
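As an illustration of the kind of closed-form maps this assumption refers to (a sketch of our own, not an example from the thesis; the actual Step 1 subproblems combine such pieces with the augmented Lagrangian), here are two standard ones: the proximal mapping of the ℓ1 norm and the nearest-point projection onto the unit sphere.

```python
import numpy as np

def prox_l1(z, t):
    """argmin_x t*||x||_1 + 0.5*||x - z||^2, i.e. soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proj_sphere(z):
    """argmin_{||x||=1} ||x - z||^2: the nearest point on the unit sphere."""
    n = np.linalg.norm(z)
    return z / n if n > 0 else np.eye(z.size)[0]   # any unit vector if z = 0

print(prox_l1(np.array([1.5, -0.2, 0.7]), 0.5))    # components 1.0, 0.0, 0.2
print(proj_sphere(np.array([3.0, 4.0])))           # [0.6, 0.8]
```

Both maps cost only O(n) per evaluation, which is what makes the decoupled subroutines of the proposed algorithms cheap in practice.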


2.3.1 Nonconvex proximal gradient-based ADMM

The augmented Lagrangian function for problem (2.1) is

L_β(x_1, x_2, · · · , x_N, λ) = f(x_1, · · · , x_N) + Σ_{i=1}^{N−1} r_i(x_i) − 〈Σ_{i=1}^N A_i x_i − b, λ〉 + (β/2)‖Σ_{i=1}^N A_i x_i − b‖²,    (2.16)

where λ is the Lagrange multiplier and β > 0 is a penalty parameter. Our proximal gradient-based ADMM for solving (2.1) is described in Algorithm 1, where for two symmetric matrices A and B, A ≻ B means that A − B is positive definite.

Algorithm 1: Nonconvex Proximal Gradient-Based ADMM on Riemannian Manifold

1: Given x_i^0 ∈ M_i ∩ X_i, i = 1, ..., N − 1, x_N^0 ∈ R^{n_N}, λ^0 ∈ R^m, β > 0, γ > 0, H_i ≻ 0, i = 1, ..., N − 1.
2: for k = 0, 1, ... do
3:   [Step 1] For i = 1, 2, ..., N − 1 and positive definite matrix H_i, compute
       x_i^{k+1} := argmin_{x_i∈M_i∩X_i} L_β(x_1^{k+1}, · · · , x_{i−1}^{k+1}, x_i, x_{i+1}^k, · · · , x_N^k, λ^k) + (1/2)‖x_i − x_i^k‖²_{H_i};
4:   [Step 2] x_N^{k+1} := x_N^k − γ∇_N L_β(x_1^{k+1}, · · · , x_{N−1}^{k+1}, x_N^k, λ^k);
5:   [Step 3] λ^{k+1} := λ^k − β(Σ_{i=1}^N A_i x_i^{k+1} − b).
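To illustrate the three updates, the following sketch runs the scheme on a hypothetical two-block instance; the objective, data, and parameters are our own illustrative choices (picked to satisfy the conditions (2.20)-(2.21) below with L = 1), not an example from the thesis. Because ‖x_1‖² is constant on the unit sphere, the quadratic terms of L_β are constant there and the Step 1 subproblem reduces to minimizing a linear function over the sphere, which has the closed form x_1 = −g/‖g‖.

```python
import numpy as np

# Toy instance of the proximal gradient-based ADMM:
#   min <c, x1> + 0.5*||x2||^2   s.t.  x1 + x2 = b,  ||x1|| = 1,
# with N = 2, A_1 = A_2 = I, and H_1 = eta * I.
c, b = np.array([1.0, 0.0]), np.zeros(2)
beta, gamma, eta = 3.0, 0.3, 2.5   # beta > 2.860*L, gamma in (12/42, 12/36), eta > 6L^2/beta
x1, x2, lam = np.array([0.0, 1.0]), np.zeros(2), np.zeros(2)

for k in range(500):
    # Step 1: x1 := argmin_{||x1||=1} L_beta(x1, x2, lam) + (eta/2)||x1 - x1_old||^2;
    # on the sphere this is <g, x1> + const, minimized at -g/||g||.
    g = c - lam + beta * (x2 - b) - eta * x1
    x1 = -g / np.linalg.norm(g)
    # Step 2: gradient step on the last (unconstrained) block
    grad2 = x2 - lam + beta * (x1 + x2 - b)
    x2 = x2 - gamma * grad2
    # Step 3: dual update
    lam = lam - beta * (x1 + x2 - b)

print(x1, x2, np.linalg.norm(x1 + x2 - b))   # x1 ≈ [-1, 0], x2 ≈ [1, 0], residual ≈ 0
```

On this instance the iterates converge to the stationary point x_1 = −c, x_2 = b − x_1, with the coupling constraint satisfied to machine precision.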

Before giving the main convergence result for Algorithm 1, we need the following lemmas. Note that Algorithm 1 is almost identical to Algorithm 2 in [44]; the only difference lies in the constraint sets in Step 1, while Steps 2 and 3 of the two algorithms are exactly the same. Inspecting the proofs of Lemma 3.9 in [44] (the counterpart of Lemma 2.3.4 in this chapter) and Lemma 3.11 in [44] (the counterpart of Lemma 2.3.6 in this chapter) shows that they rely only on Steps 2 and 3. Therefore, those proofs apply directly to Lemmas 2.3.4 and 2.3.6 here, and we omit them for succinctness.

Lemma 2.3.4 (Lemma 3.9 in [44]) Suppose that the sequence {x_1^k, ..., x_N^k, λ^k} is generated by Algorithm 1. Then,

‖λ^{k+1} − λ^k‖² ≤ 3(β − 1/γ)² ‖x_N^k − x_N^{k+1}‖² + 3[(β − 1/γ)² + L²] ‖x_N^{k−1} − x_N^k‖² + 3L² Σ_{i=1}^{N−1} ‖x_i^k − x_i^{k+1}‖².    (2.17)

Since Steps 2 and 3 in Algorithm 1 are the same as those in [44], this lemma remains valid here. In particular, Steps 2 and 3 directly yield

λ^{k+1} = (β − 1/γ)(x_N^k − x_N^{k+1}) + ∇_N f(x_1^{k+1}, . . . , x_{N−1}^{k+1}, x_N^k).    (2.18)

We define a potential function

$$ \Psi_G(x_1, \dots, x_N, \lambda, \bar{x}) = \mathcal{L}_\beta(x_1, \dots, x_N, \lambda) + \frac{3}{\beta}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \|\bar{x} - x_N\|^2. \quad (2.19) $$

With Lemma 2.3.4, the following monotonicity property can be established.

Lemma 2.3.5 Suppose the sequence $(x_1^k, \dots, x_N^k, \lambda^k)$ is generated by Algorithm 1. Assume that

$$ \beta > \Big(\frac{6 + 18\sqrt{3}}{13}\Big) L \approx 2.860L \quad \text{and} \quad H_i \succ \frac{6L^2}{\beta} I, \quad i = 1, \dots, N-1. \quad (2.20) $$

Then $\Psi_G(x_1^{k+1}, \dots, x_N^{k+1}, \lambda^{k+1}, x_N^k)$ is monotonically decreasing over $k$ if $\gamma$ lies in the following interval:

$$ \gamma \in \Bigg( \frac{12}{13\beta + \sqrt{13\beta^2 - 12\beta L - 72L^2}},\ \frac{12}{13\beta - \sqrt{13\beta^2 - 12\beta L - 72L^2}} \Bigg). \quad (2.21) $$

More specifically, we have

$$ \begin{aligned} & \Psi_G(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}, x_N^k) - \Psi_G(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k, x_N^{k-1}) \\ & \le \Big[\frac{\beta + L}{2} - \frac{1}{\gamma} + \frac{6}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{3L^2}{\beta}\Big] \|x_N^k - x_N^{k+1}\|^2 - \sum_{i=1}^{N-1} \|x_i^k - x_i^{k+1}\|^2_{\frac{1}{2}H_i - \frac{3L^2}{\beta} I} < 0. \end{aligned} \quad (2.22) $$

Proof. By the global optimality for the subproblems in Step 1 of Algorithm 1, we have

$$ \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k, \lambda^k) \le \mathcal{L}_\beta(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k) - \frac{1}{2} \sum_{i=1}^{N-1} \|x_i^k - x_i^{k+1}\|^2_{H_i}. \quad (2.23) $$

By Step 2 of Algorithm 1 we have

$$ \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^k) \le \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k, \lambda^k) + \Big(\frac{L + \beta}{2} - \frac{1}{\gamma}\Big) \|x_N^k - x_N^{k+1}\|^2. \quad (2.24) $$

By Step 3, directly substituting $\lambda^{k+1}$ into the augmented Lagrangian gives

$$ \mathcal{L}_\beta(x_1^{k+1}, \dots, x_N^{k+1}, \lambda^{k+1}) = \mathcal{L}_\beta(x_1^{k+1}, \dots, x_N^{k+1}, \lambda^k) + \frac{1}{\beta} \|\lambda^k - \lambda^{k+1}\|^2. \quad (2.25) $$

Summing up (2.23), (2.24), and (2.25), and applying Lemma 2.3.4, we obtain the following inequality:

$$ \begin{aligned} & \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}) - \mathcal{L}_\beta(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k) \\ & \le \Big[\frac{L + \beta}{2} - \frac{1}{\gamma} + \frac{3}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2\Big] \|x_N^k - x_N^{k+1}\|^2 + \frac{3}{\beta}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \|x_N^{k-1} - x_N^k\|^2 - \sum_{i=1}^{N-1} \|x_i^k - x_i^{k+1}\|^2_{\frac{1}{2}H_i - \frac{3L^2}{\beta} I}, \end{aligned} \quad (2.26) $$

which further indicates

$$ \begin{aligned} & \Psi_G(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}, x_N^k) - \Psi_G(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k, x_N^{k-1}) \\ & \le \Big[\frac{\beta + L}{2} - \frac{1}{\gamma} + \frac{6}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{3L^2}{\beta}\Big] \|x_N^k - x_N^{k+1}\|^2 - \sum_{i=1}^{N-1} \|x_i^k - x_i^{k+1}\|^2_{\frac{1}{2}H_i - \frac{3L^2}{\beta} I}. \end{aligned} \quad (2.27) $$

To ensure that the right-hand side of (2.22) is negative, we need to choose $H_i \succ \frac{6L^2}{\beta} I$, and ensure that

$$ \frac{\beta + L}{2} - \frac{1}{\gamma} + \frac{6}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{3L^2}{\beta} < 0. \quad (2.28) $$

This can be proved by viewing the left-hand side as a quadratic function of $z = \frac{1}{\gamma}$. To find some $z > 0$ such that

$$ p(z) = \frac{6}{\beta} z^2 - 13 z + \Big(\frac{L + \beta}{2} + 6\beta + \frac{3}{\beta} L^2\Big) < 0, $$

we need the discriminant to be positive, i.e.,

$$ \Delta(\beta) = \frac{1}{\beta^2}\big(13\beta^2 - 12\beta L - 72L^2\big) > 0. \quad (2.29) $$

It is easy to verify that (2.20) suffices to guarantee (2.29). Solving $p(z) = 0$, we find two positive roots

$$ z_1 = \frac{13\beta - \sqrt{13\beta^2 - 12\beta L - 72L^2}}{12}, \quad \text{and} \quad z_2 = \frac{13\beta + \sqrt{13\beta^2 - 12\beta L - 72L^2}}{12}. $$

Note that $\gamma$ defined in (2.21) satisfies $\frac{1}{z_2} < \gamma < \frac{1}{z_1}$ and thus guarantees (2.28). This completes the proof.
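As a quick numerical sanity check on these calculations (an illustration only, not part of the argument), one can verify that the threshold in (2.20) makes the discriminant (2.29) positive and that $p(z)$ is negative strictly between the roots $z_1$ and $z_2$; the value $\beta = 3L$ below is an arbitrary choice above the threshold.

```python
import math

# Numerical check of (2.20), (2.29) and the roots z1, z2 (take L = 1).
L = 1.0
beta_threshold = (6 + 18 * math.sqrt(3)) / 13 * L
assert abs(beta_threshold - 2.860 * L) < 1e-3   # the ~2.860 L threshold in (2.20)

beta = 3.0 * L                                   # any beta above the threshold
disc = 13 * beta**2 - 12 * beta * L - 72 * L**2
assert disc > 0                                  # (2.29) holds

z1 = (13 * beta - math.sqrt(disc)) / 12
z2 = (13 * beta + math.sqrt(disc)) / 12
p = lambda z: (6 / beta) * z**2 - 13 * z + ((L + beta) / 2 + 6 * beta + 3 * L**2 / beta)
assert p((z1 + z2) / 2) < 0                      # p negative between the roots
assert p(z1 / 2) > 0 and p(2 * z2) > 0           # and positive outside
print(z1, z2)
```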

Lemma 2.3.6 (Lemma 3.11 in [44]) Suppose that the sequence $\{x_1^k, \dots, x_N^k, \lambda^k\}$ is generated by Algorithm 1. It holds that

$$ \Psi_G(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}, x_N^k) \ge \sum_{i=1}^{N-1} r_i^* + f^*, \quad (2.30) $$

where $r_i^*$, $i = 1, \dots, N-1$, and $f^*$ are defined in Assumption 2.3.1.

Denote by $\sigma_{\min}(M)$ the smallest singular value of a matrix $M$. Now we are ready to present the main convergence result of Algorithm 1.

Theorem 2.3.7 Suppose that the sequence $\{x_1^k, \dots, x_N^k, \lambda^k\}$ is generated by Algorithm 1, and the parameters $\beta$ and $\gamma$ satisfy (2.20) and (2.21) respectively. Define the constants

$$ \kappa_1 := \frac{3}{\beta^2}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big], \quad \kappa_2 := \Big(\Big|\beta - \frac{1}{\gamma}\Big| + L\Big)^2, $$

$$ \kappa_3 := \Big(L + \beta\sqrt{N} \max_{1 \le i \le N} \|A_i\|_2^2 + \max_{1 \le i \le N-1} \|H_i\|_2\Big)^2, $$

$$ \tau := \min\Big\{ -\Big[\frac{\beta + L}{2} - \frac{1}{\gamma} + \frac{6}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{3L^2}{\beta}\Big],\ \min_{i=1,\dots,N-1}\Big[-\Big(\frac{3L^2}{\beta} - \frac{\sigma_{\min}(H_i)}{2}\Big)\Big] \Big\}. $$

Assume that $H_i \succ \frac{6L^2}{\beta} I$, $i = 1, \dots, N-1$. Suppose $x^{k^*}$ is the first iterate that satisfies

$$ \sum_{i=1}^{N} \big( \|x_i^{k^*} - x_i^{k^*-1}\|^2 + \|x_i^{k^*-1} - x_i^{k^*-2}\|^2 \big) \le \varepsilon^2 / \max\{\kappa_1, \kappa_2, \kappa_3\}. $$

Then we stop the algorithm and output $x^{k^*}$. It follows that $(x_1^{k^*}, \dots, x_N^{k^*}, \lambda^{k^*})$ is an $\varepsilon$-stationary solution of (2.1) as defined in Definition 2.2.9. Moreover, $k^* \le K$ with $K$ defined as

$$ K := \Bigg\lceil \frac{2\max\{\kappa_1, \kappa_2, \kappa_3\}}{\tau \varepsilon^2} \Big( \Psi_G(x_1^1, \dots, x_N^1, \lambda^1, x_N^0) - \sum_{i=1}^{N-1} r_i^* - f^* \Big) \Bigg\rceil. \quad (2.31) $$

Proof. For the ease of presentation, we denote

$$ \theta_k := \sum_{i=1}^{N} \big( \|x_i^k - x_i^{k+1}\|^2 + \|x_i^{k-1} - x_i^k\|^2 \big). \quad (2.32) $$

Summing (2.22) over $k = 1, \dots, K$ yields

$$ \Psi_G(x_1^1, \dots, x_N^1, \lambda^1, x_N^0) - \Psi_G(x_1^{K+1}, \dots, x_N^{K+1}, \lambda^{K+1}, x_N^K) \ge \tau \sum_{k=1}^{K} \sum_{i=1}^{N} \|x_i^k - x_i^{k+1}\|^2, \quad (2.33) $$

which implies

$$ \begin{aligned} \min_{2 \le k \le K+1} \theta_k &\le \frac{1}{\tau K}\Big[ 2\Psi_G(x_1^1, \dots, x_N^1, \lambda^1, x_N^0) - \Psi_G(x_1^{K+1}, \dots, x_N^{K+1}, \lambda^{K+1}, x_N^K) - \Psi_G(x_1^{K+2}, \dots, x_N^{K+2}, \lambda^{K+2}, x_N^{K+1}) \Big] \\ &\le \frac{2}{\tau K}\Big[ \Psi_G(x_1^1, \dots, x_N^1, \lambda^1, x_N^0) - f^* - \sum_{i=1}^{N-1} r_i^* \Big]. \end{aligned} \quad (2.34) $$

By (2.18) we have

$$ \begin{aligned} \|\lambda^{k+1} - \nabla_N f(x_1^{k+1}, \dots, x_N^{k+1})\|^2 &\le \Big( \Big|\beta - \frac{1}{\gamma}\Big| \|x_N^k - x_N^{k+1}\| + \|\nabla_N f(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k) - \nabla_N f(x_1^{k+1}, \dots, x_N^{k+1})\| \Big)^2 \\ &\le \Big( \Big|\beta - \frac{1}{\gamma}\Big| + L \Big)^2 \|x_N^k - x_N^{k+1}\|^2 \\ &\le \kappa_2 \theta_k. \end{aligned} \quad (2.35) $$

From Step 3 of Algorithm 1 and (2.17), we have

$$ \begin{aligned} \Big\| \sum_{i=1}^{N-1} A_i x_i^{k+1} + x_N^{k+1} - b \Big\|^2 &= \frac{1}{\beta^2} \|\lambda^k - \lambda^{k+1}\|^2 \\ &\le \frac{3}{\beta^2}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \|x_N^{k-1} - x_N^k\|^2 + \frac{3}{\beta^2}\Big(\beta - \frac{1}{\gamma}\Big)^2 \|x_N^{k+1} - x_N^k\|^2 + \frac{3L^2}{\beta^2} \sum_{i=1}^{N-1} \|x_i^{k+1} - x_i^k\|^2 \\ &\le \kappa_1 \theta_k. \end{aligned} \quad (2.36) $$

By the optimality conditions (e.g., (2.11)) for the subproblems in Step 1 of Algorithm 1, and using (2.8) and Step 3 of Algorithm 1, we can get

$$ \mathrm{Proj}_{T_{x_i^{k+1}} \mathcal{M}_i} \Big[ \nabla_i f(x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_N^k) - A_i^\top \lambda^{k+1} + \beta A_i^\top \sum_{j=i+1}^{N} A_j (x_j^k - x_j^{k+1}) + H_i(x_i^{k+1} - x_i^k) + g_i(x_i^{k+1}) \Big] + q_i(x_i^{k+1}) = 0, \quad (2.37) $$

for some $g_i(x_i^{k+1}) \in \partial r_i(x_i^{k+1})$ and $q_i(x_i^{k+1}) \in N_{X_i}(x_i^{k+1})$. Therefore,

$$ \begin{aligned} & \mathrm{dist}\Big( \mathrm{Proj}_{T_{x_i^{k+1}} \mathcal{M}_i}\big( -\nabla_i f(x^{k+1}) + A_i^\top \lambda^{k+1} - \partial r_i(x_i^{k+1}) \big),\ N_{X_i}(x_i^{k+1}) \Big) \\ &\le \Big\| \mathrm{Proj}_{T_{x_i^{k+1}} \mathcal{M}_i}\big( -\nabla_i f(x^{k+1}) + A_i^\top \lambda^{k+1} - g_i(x_i^{k+1}) - q_i(x_i^{k+1}) \big) \Big\| \\ &= \Big\| \mathrm{Proj}_{T_{x_i^{k+1}} \mathcal{M}_i}\Big( -\nabla_i f(x^{k+1}) + \nabla_i f(x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_N^k) + \beta A_i^\top \sum_{j=i+1}^{N} A_j (x_j^k - x_j^{k+1}) + H_i(x_i^{k+1} - x_i^k) \Big) \Big\| \\ &\le \Big\| -\nabla_i f(x^{k+1}) + \nabla_i f(x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_N^k) + H_i(x_i^{k+1} - x_i^k) + \beta A_i^\top \sum_{j=i+1}^{N} A_j (x_j^k - x_j^{k+1}) \Big\| \\ &\le \|\nabla_i f(x^{k+1}) - \nabla_i f(x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_N^k)\| + \|H_i(x_i^{k+1} - x_i^k)\| + \Big\| \beta A_i^\top \sum_{j=i+1}^{N} A_j (x_j^k - x_j^{k+1}) \Big\| \\ &\le \Big( L + \beta \max_{1 \le j \le N} \|A_j\|_2^2 \sqrt{N} \Big) \sqrt{ \sum_{j=i+1}^{N} \|x_j^{k+1} - x_j^k\|^2 } + \max_{1 \le j \le N-1} \|H_j\|_2\, \|x_i^k - x_i^{k+1}\| \\ &\le \sqrt{\kappa_3 \theta_k}. \end{aligned} \quad (2.38) $$

Combining (2.35), (2.36), (2.38) and (2.34) yields the desired result.

2.3.2 Nonconvex linearized proximal gradient-based ADMM

When modeling nonconvex and nonsmooth optimization with manifold constraints, it is often the case that computing the proximal mapping (in the presence of $f$) may be difficult, while optimizing a quadratic objective is still possible. This leads to a variant of ADMM which linearizes the function $f$. In particular, we define the following approximation to the augmented Lagrangian function:

$$ \mathcal{L}^i_\beta(x_i; \bar{x}_1, \dots, \bar{x}_N, \lambda) := f(\bar{x}_1, \dots, \bar{x}_N) + \langle \nabla_i f(\bar{x}_1, \dots, \bar{x}_N),\ x_i - \bar{x}_i \rangle + r_i(x_i) - \Big\langle \sum_{j=1, j \ne i}^{N} A_j \bar{x}_j + A_i x_i - b,\ \lambda \Big\rangle + \frac{\beta}{2} \Big\| \sum_{j=1, j \ne i}^{N} A_j \bar{x}_j + A_i x_i - b \Big\|^2, \quad (2.39) $$

where $\lambda$ is the Lagrange multiplier and $\beta > 0$ is a penalty parameter. It is worth noting that this approximation is defined with respect to a particular block of variables $x_i$. The linearized proximal gradient-based ADMM algorithm is described in Algorithm 2.

Algorithm 2: Nonconvex Linearized Proximal Gradient-Based ADMM

1 Given $(x_1^0, x_2^0, \dots, x_N^0) \in (\mathcal{M}_1 \cap X_1) \times (\mathcal{M}_2 \cap X_2) \times \cdots \times (\mathcal{M}_{N-1} \cap X_{N-1}) \times \mathbb{R}^{n_N}$, $\lambda^0 \in \mathbb{R}^m$, $\beta > 0$, $\gamma > 0$, $H_i \succeq 0$, $i = 1, \dots, N-1$.
2 for $k = 0, 1, \dots$ do
3 &nbsp;&nbsp;[Step 1] For $i = 1, 2, \dots, N-1$ and positive semidefinite matrix $H_i$, compute
$$ x_i^{k+1} := \mathop{\mathrm{argmin}}_{x_i \in \mathcal{M}_i \cap X_i}\ \mathcal{L}^i_\beta(x_i; x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k, \lambda^k) + \frac{1}{2}\|x_i - x_i^k\|^2_{H_i}; $$
4 &nbsp;&nbsp;[Step 2] $x_N^{k+1} := x_N^k - \gamma \nabla_N \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k, \lambda^k)$;
5 &nbsp;&nbsp;[Step 3] $\lambda^{k+1} := \lambda^k - \beta\big(\sum_{i=1}^{N} A_i x_i^{k+1} - b\big)$.

Essentially, instead of solving the subproblem involving the exact augmented Lagrangian defined by (2.16), we use the linearized approximation defined in (2.39). Note also that Steps 2 and 3 of Algorithm 2 are the same as those in Algorithm 1, and thus Lemmas 2.3.4 and 2.3.6 still hold, as they do not depend on Step 1 of the algorithms. As a result, we only need to establish the following lemma, which is a counterpart of Lemma 2.3.5.

Lemma 2.3.8 Suppose that the sequence $(x_1^k, \dots, x_N^k, \lambda^k)$ is generated by Algorithm 2. Let the parameters $\beta$ and $\gamma$ be defined according to (2.20) and (2.21), and $\Psi_G(x_1, \dots, x_N, \lambda, \bar{x})$ be defined according to (2.19). If we choose

$$ H_i \succ \Big( \frac{6L^2}{\beta} + L \Big) I, \quad \text{for } i = 1, \dots, N-1, $$

then $\Psi_G(x_1^{k+1}, \dots, x_N^{k+1}, \lambda^{k+1}, x_N^k)$ monotonically decreases. More specifically, we have

$$ \begin{aligned} & \Psi_G(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}, x_N^k) - \Psi_G(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k, x_N^{k-1}) \\ & \le \Big[\frac{\beta + L}{2} - \frac{1}{\gamma} + \frac{6}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{3L^2}{\beta}\Big] \|x_N^k - x_N^{k+1}\|^2 - \sum_{i=1}^{N-1} \|x_i^k - x_i^{k+1}\|^2_{\frac{1}{2}H_i - \frac{L}{2}I - \frac{3L^2}{\beta}I}. \end{aligned} \quad (2.40) $$

Note that the right-hand side of (2.40) is negative under the above conditions.


Proof. For the subproblem in Step 1 of Algorithm 2, since $x_i^{k+1}$ is the global minimizer, we have

$$ \begin{aligned} & \langle \nabla_i f(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k),\ x_i^{k+1} - x_i^k \rangle - \Big\langle \sum_{j=1}^{i} A_j x_j^{k+1} + \sum_{j=i+1}^{N} A_j x_j^k - b,\ \lambda^k \Big\rangle \\ & \qquad + \frac{\beta}{2}\Big\| \sum_{j=1}^{i} A_j x_j^{k+1} + \sum_{j=i+1}^{N} A_j x_j^k - b \Big\|^2 + \sum_{j=1}^{i} r_j(x_j^{k+1}) + \sum_{j=i+1}^{N-1} r_j(x_j^k) \\ & \le -\Big\langle \sum_{j=1}^{i-1} A_j x_j^{k+1} + \sum_{j=i}^{N} A_j x_j^k - b,\ \lambda^k \Big\rangle + \frac{\beta}{2}\Big\| \sum_{j=1}^{i-1} A_j x_j^{k+1} + \sum_{j=i}^{N} A_j x_j^k - b \Big\|^2 \\ & \qquad + \sum_{j=1}^{i-1} r_j(x_j^{k+1}) + \sum_{j=i}^{N-1} r_j(x_j^k) - \frac{1}{2}\|x_i^{k+1} - x_i^k\|^2_{H_i}. \end{aligned} $$

By the $L$-Lipschitz continuity of $\nabla_i f$, we have

$$ f(x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_N^k) \le f(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k) + \langle \nabla_i f(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k),\ x_i^{k+1} - x_i^k \rangle + \frac{L}{2}\|x_i^{k+1} - x_i^k\|^2. $$

Combining the above two inequalities and using the definition of $\mathcal{L}_\beta$ in (2.16), we have

$$ \mathcal{L}_\beta(x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_N^k, \lambda^k) \le \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k, \lambda^k) - \|x_i^k - x_i^{k+1}\|^2_{\frac{H_i}{2} - \frac{L}{2}I}. \quad (2.41) $$

Summing (2.41) over $i = 1, \dots, N-1$, we have the following inequality, which is the counterpart of (2.23):

$$ \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k, \lambda^k) \le \mathcal{L}_\beta(x_1^k, \dots, x_N^k, \lambda^k) - \sum_{i=1}^{N-1} \|x_i^k - x_i^{k+1}\|^2_{\frac{H_i}{2} - \frac{L}{2}I}. \quad (2.42) $$

Besides, since (2.24) and (2.25) still hold, by combining (2.42), (2.24) and (2.25) and applying Lemma 2.3.4, we establish the following two inequalities, which are respectively the counterparts of (2.26) and (2.22):

$$ \begin{aligned} & \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}) - \mathcal{L}_\beta(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k) \\ & \le \Big[\frac{L+\beta}{2} - \frac{1}{\gamma} + \frac{3}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2\Big] \|x_N^k - x_N^{k+1}\|^2 + \frac{3}{\beta}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \|x_N^{k-1} - x_N^k\|^2 - \sum_{i=1}^{N-1} \|x_i^k - x_i^{k+1}\|^2_{\frac{1}{2}H_i - \frac{L}{2}I - \frac{3L^2}{\beta}I}, \end{aligned} \quad (2.43) $$

and

$$ \begin{aligned} & \Psi_G(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}, x_N^k) - \Psi_G(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k, x_N^{k-1}) \\ & \le \Big[\frac{\beta+L}{2} - \frac{1}{\gamma} + \frac{6}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{3L^2}{\beta}\Big] \|x_N^k - x_N^{k+1}\|^2 - \sum_{i=1}^{N-1} \|x_i^k - x_i^{k+1}\|^2_{\frac{1}{2}H_i - \frac{L}{2}I - \frac{3L^2}{\beta}I}. \end{aligned} $$

From the proof of Lemma 2.3.5, it is easy to see that the right-hand side of the above inequality is negative if $H_i \succ \big(\frac{6L^2}{\beta} + L\big) I$ and $\beta$ and $\gamma$ are chosen according to (2.20) and (2.21).

We are now ready to present the main complexity result for Algorithm 2.

Theorem 2.3.9 Suppose the sequence $\{x_1^k, \dots, x_N^k, \lambda^k\}$ is generated by Algorithm 2. Let the parameters $\beta$ and $\gamma$ satisfy (2.20) and (2.21) respectively. Define $\kappa_1, \kappa_2, \kappa_3$ as in Theorem 2.3.7, and define

$$ \tau := \min\Big\{ -\Big[\frac{\beta+L}{2} - \frac{1}{\gamma} + \frac{6}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{3L^2}{\beta}\Big],\ \min_{i=1,\dots,N-1} \Big[-\Big(\frac{3L^2}{\beta} + \frac{L}{2} - \frac{\sigma_{\min}(H_i)}{2}\Big)\Big] \Big\}. $$

Assume $H_i \succ \big(\frac{6L^2}{\beta} + L\big) I$, and let

$$ K = \Bigg\lceil \frac{2\max\{\kappa_1, \kappa_2, \kappa_3\}}{\tau \varepsilon^2} \Big( \Psi_G(x_1^1, \dots, x_N^1, \lambda^1, x_N^0) - \sum_{i=1}^{N-1} r_i^* - f^* \Big) \Bigg\rceil, $$

and $k^* = \mathop{\mathrm{argmin}}_{2 \le k \le K+1} \sum_{i=1}^{N} \big( \|x_i^k - x_i^{k+1}\|^2 + \|x_i^{k-1} - x_i^k\|^2 \big)$. Then $(x_1^{k^*+1}, \dots, x_N^{k^*+1}, \lambda^{k^*+1})$ is an $\varepsilon$-stationary solution as defined in Definition 2.2.9.


The proof of this theorem is omitted because it closely parallels that of Theorem 2.3.7.

2.3.3 Stochastic linearized proximal ADMM

Here we provide two motivating examples where stochastic algorithms are in demand: empirical risk minimization (ERM) and tensor rank-one decomposition. ERM arises from supervised machine learning. Let $x_1, \dots, x_N$ be the model parameters. Given the data-label pairs $(w_1, y_1), \dots, (w_m, y_m)$ and a loss function $\ell$, the ERM problem is formulated as

$$ \min\ f(x_1, \dots, x_N) = \frac{1}{m} \sum_{i=1}^{m} f_i(x_1, \dots, x_N), $$

where $f_i(x_1, \dots, x_N) := \ell(x_1, \dots, x_N; w_i, y_i)$ corresponds to the training loss of the $i$th data point. In many applications, the sample size $m$ can be a very large number.

In tensor algebra, the rank-1 CP tensor decomposition is a very important problem; see [105]. In this problem, one seeks a tensor whose CP-rank is 1 that best approximates a given tensor. Here CP stands for CANDECOMP/PARAFAC, one of the most commonly used notions of tensor rank. We refer the interested reader to [52] for more information regarding tensor decompositions and the CP-rank. For a given tensor $\mathbf{T} \in \mathbb{R}^{n_1 \times \cdots \times n_N}$, its rank-1 CP decomposition problem can be formulated as

$$ \max\ f(x_1, \dots, x_N) = \langle \mathbf{T}, \otimes_{i=1}^{N} x_i \rangle, \quad \text{s.t. } \|x_i\|_2 = 1,\ x_i \in \mathbb{R}^{n_i},\ i = 1, \dots, N, $$

where $\langle \mathbf{T}, \otimes_{i=1}^{N} x_i \rangle = \sum_{1 \le i_1 \le n_1} \sum_{1 \le i_2 \le n_2} \cdots \sum_{1 \le i_N \le n_N} \mathbf{T}(i_1, i_2, \dots, i_N) \cdot x_1(i_1) \cdot x_2(i_2) \cdots x_N(i_N)$, $\mathbf{T}(i_1, i_2, \dots, i_N)$ denotes the $(i_1, i_2, \dots, i_N)$-th entry of $\mathbf{T}$, and $x_j(i_j)$ denotes the $i_j$-th entry of the vector $x_j$. When the dimension $N$ is high, which is often the case in black-box tensor or multivariate function approximation problems, each evaluation of the function value or of the partial gradients requires $O(\prod_{i=1}^{N} n_i)$ flops, which grows exponentially with $N$. However, an unbiased estimate of the partial gradients can be constructed easily by sampling the entries of $\mathbf{T}$.
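To illustrate the sampling idea on a small example (a sketch with hypothetical dimensions, not the general scheme of this section), the following estimates the partial gradient $\nabla_{x_1} \langle \mathbf{T}, x_1 \otimes x_2 \otimes x_3 \rangle$ from a single sampled fiber of $\mathbf{T}$; averaging the estimator over all index pairs with their sampling probabilities recovers the exact partial gradient, confirming unbiasedness.

```python
import numpy as np

# Rank-1 CP objective <T, x1 (x) x2 (x) x3> for a small 3rd-order tensor.
rng = np.random.default_rng(1)
n1, n2, n3 = 3, 4, 5
T = rng.standard_normal((n1, n2, n3))
x1, x2, x3 = (rng.standard_normal(n) for n in (n1, n2, n3))

value = np.einsum('ijk,i,j,k->', T, x1, x2, x3)      # full inner product
grad_x1 = np.einsum('ijk,j,k->i', T, x2, x3)         # exact partial gradient

def sampled_grad_x1(j, k):
    # Unbiased estimator using a single sampled fiber T[:, j, k]: with (j, k)
    # uniform over the n2*n3 pairs, the importance weight n2*n3 makes the
    # expectation equal to the exact partial gradient.
    return n2 * n3 * T[:, j, k] * x2[j] * x3[k]

# Averaging the estimator over all (j, k) with uniform probability 1/(n2*n3)
# reproduces grad_x1 exactly, which verifies unbiasedness.
avg = sum(sampled_grad_x1(j, k) for j in range(n2) for k in range(n3)) / (n2 * n3)
assert np.allclose(avg, grad_x1)
print(value)
```

Each sampled estimate touches only $n_1$ entries of $\mathbf{T}$ instead of all $n_1 n_2 n_3$ of them, which is the source of the computational savings described above.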


In such cases, the function evaluations in Algorithm 1 and the gradient evaluations in Algorithm 2 are prohibitively expensive. In this section, we propose a nonconvex linearized stochastic proximal gradient-based ADMM with mini-batches to resolve this problem. First, let us make the following assumption.

Assumption 2.3.10 For smooth $f$ and $i = 1, \dots, N$, there exists a stochastic first-order oracle that returns a noisy estimate $G_i(x_1, \dots, x_N, \xi_i)$ of the partial gradient of $f$ with respect to $x_i$, satisfying

$$ \mathbb{E}[G_i(x_1, \dots, x_N, \xi_i)] = \nabla_i f(x_1, \dots, x_N), \quad (2.44) $$

$$ \mathbb{E}\big[ \|G_i(x_1, \dots, x_N, \xi_i) - \nabla_i f(x_1, \dots, x_N)\|^2 \big] \le \sigma^2, \quad (2.45) $$

where the expectation is taken with respect to the random variable $\xi_i$.

Let $M$ be the size of the mini-batch, and denote

$$ G_i^M(x_1, \dots, x_N) := \frac{1}{M} \sum_{j=1}^{M} G_i(x_1, \dots, x_N, \xi_i^j), $$

where $\xi_i^j$, $j = 1, \dots, M$, are i.i.d. random variables. Clearly it holds that

$$ \mathbb{E}[G_i^M(x_1, \dots, x_N)] = \nabla_i f(x_1, \dots, x_N) $$

and

$$ \mathbb{E}\big[ \|G_i^M(x_1, \dots, x_N) - \nabla_i f(x_1, \dots, x_N)\|^2 \big] \le \sigma^2 / M. \quad (2.46) $$
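The $\sigma^2/M$ variance reduction in (2.46) can be illustrated with a small Monte Carlo sketch; the Gaussian oracle noise, the scalar "gradient", and the fixed seed below are assumptions for the demonstration.

```python
import numpy as np

# Empirical check of (2.46): averaging M i.i.d. oracle outputs with noise
# variance sigma^2 yields an estimator with variance about sigma^2 / M.
rng = np.random.default_rng(0)
sigma, M, trials = 1.0, 10, 20000
true_grad = 0.0  # scalar "partial gradient" for simplicity

# Each trial forms the mini-batch average G^M of M noisy oracle calls.
batch_means = rng.normal(true_grad, sigma, size=(trials, M)).mean(axis=1)
emp_var = batch_means.var()
print(emp_var)   # should be close to sigma^2 / M = 0.1
```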

Now, the stochastic linear approximation of the augmented Lagrangian function with respect to block $x_i$ at the point $(\bar{x}_1, \dots, \bar{x}_N)$ is defined as (note that $r_N \equiv 0$):

$$ \tilde{\mathcal{L}}^i_\beta(x_i; \bar{x}_1, \dots, \bar{x}_N, \lambda; M) := f(\bar{x}_1, \dots, \bar{x}_N) + \langle G_i^M(\bar{x}_1, \dots, \bar{x}_N),\ x_i - \bar{x}_i \rangle + r_i(x_i) - \Big\langle \sum_{j \ne i} A_j \bar{x}_j + A_i x_i - b,\ \lambda \Big\rangle + \frac{\beta}{2} \Big\| \sum_{j \ne i} A_j \bar{x}_j + A_i x_i - b \Big\|^2, \quad (2.47) $$


where $\lambda$ and $\beta > 0$ follow the previous definitions. Compared with (2.39), the full partial derivative $\nabla_i f$ is replaced by the sample average of stochastic first-order oracle outputs.

Algorithm 3: A Nonconvex Linearized Stochastic Proximal Gradient-Based ADMM Algorithm

1 Given $x_i^0 \in (\mathcal{M}_i \cap X_i)$, $i = 1, \dots, N-1$, $x_N^0 \in \mathbb{R}^{n_N}$, $\lambda^0 \in \mathbb{R}^m$, $\beta > 0$, $\gamma > 0$, $H_i \succeq 0$, $i = 1, \dots, N-1$, the batch sizes $M_k = ck$ with $c \in \mathbb{Z}_+$, and a positive integer $K$. Randomly choose $k^*$ from $\{2, \dots, K+1\}$.
2 for $k = 0, 1, \dots, k^*$ do
3 &nbsp;&nbsp;[Step 1] For $i = 1, 2, \dots, N-1$ and positive semidefinite matrix $H_i$, compute
$$ x_i^{k+1} = \mathop{\mathrm{argmin}}_{x_i \in \mathcal{M}_i \cap X_i}\ \tilde{\mathcal{L}}^i_\beta(x_i; x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k, \lambda^k; M_k) + \frac{1}{2}\|x_i - x_i^k\|^2_{H_i}; $$
4 &nbsp;&nbsp;[Step 2] $x_N^{k+1} = x_N^k - \gamma \nabla_N \tilde{\mathcal{L}}^N_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k, \lambda^k; M_k)$;
5 &nbsp;&nbsp;[Step 3] $\lambda^{k+1} = \lambda^k - \beta\big(\sum_{i=1}^{N} A_i x_i^{k+1} - b\big)$.
6 Output $x^{k^*+1}$.

The convergence analysis of this algorithm follows a similar logic to that of the previous two algorithms.

Lemma 2.3.11 The following inequality holds:

$$ \begin{aligned} \mathbb{E}[\|\lambda^{k+1} - \lambda^k\|^2] &\le 4\Big(\beta - \frac{1}{\gamma}\Big)^2 \mathbb{E}[\|x_N^k - x_N^{k+1}\|^2] + 4\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \mathbb{E}[\|x_N^{k-1} - x_N^k\|^2] \\ &\qquad + 4L^2 \sum_{i=1}^{N-1} \mathbb{E}[\|x_i^k - x_i^{k+1}\|^2] + \frac{8}{M_k}\sigma^2. \end{aligned} \quad (2.48) $$

In the stochastic setting, define the new potential function

$$ \Psi_S(x_1, \dots, x_N, \lambda, \bar{x}) = \mathcal{L}_\beta(x_1, \dots, x_N, \lambda) + \frac{4}{\beta}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \|\bar{x} - x_N\|^2. \quad (2.49) $$

Proof. For ease of notation, we denote

$$ G_i^{M_k}(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k) = \nabla_i f(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k) + \delta_i^k. \quad (2.50) $$

Note that $\delta_i^k$ is a zero-mean random variable. By Steps 2 and 3 of Algorithm 3 we obtain

$$ \lambda^{k+1} = \Big(\beta - \frac{1}{\gamma}\Big)(x_N^k - x_N^{k+1}) + \nabla_N f(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k) + \delta_N^k. \quad (2.51) $$

Applying (2.51) at iterations $k$ and $k-1$ and taking the difference, we get

$$ \begin{aligned} \|\lambda^{k+1} - \lambda^k\|^2 &= \Big\| \Big(\beta - \frac{1}{\gamma}\Big)(x_N^k - x_N^{k+1}) - \Big(\beta - \frac{1}{\gamma}\Big)(x_N^{k-1} - x_N^k) + (\delta_N^k - \delta_N^{k-1}) \\ &\qquad + \big( \nabla_N f(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k) - \nabla_N f(x_1^k, \dots, x_{N-1}^k, x_N^{k-1}) \big) \Big\|^2 \\ &\le 4\Big(\beta - \frac{1}{\gamma}\Big)^2 \|x_N^k - x_N^{k+1}\|^2 + 4\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \|x_N^{k-1} - x_N^k\|^2 \\ &\qquad + 4L^2 \sum_{i=1}^{N-1} \|x_i^k - x_i^{k+1}\|^2 + 4\|\delta_N^k - \delta_N^{k-1}\|^2. \end{aligned} $$

Taking expectation with respect to all random variables on both sides and using $\mathbb{E}[\langle \delta_N^k, \delta_N^{k-1} \rangle] = 0$ completes the proof.

Lemma 2.3.12 Suppose the sequence $(x_1^k, \dots, x_N^k, \lambda^k)$ is generated by Algorithm 3. Define $\Delta = 17\beta^2 - 16(L+1)\beta - 128L^2$, and assume that

$$ \beta \in \Bigg( \frac{8(L+1) + 8\sqrt{(L+1)^2 + 34L^2}}{17},\ +\infty \Bigg), \quad H_i \succ \Big( \frac{8L^2}{\beta} + L + 1 \Big) I, \quad i = 1, \dots, N-1, \quad (2.52) $$

$$ \gamma \in \Bigg( \frac{16}{17\beta + \sqrt{\Delta}},\ \frac{16}{17\beta - \sqrt{\Delta}} \Bigg). \quad (2.53) $$

Then it holds that

$$ \begin{aligned} & \mathbb{E}[\Psi_S(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}, x_N^k)] - \mathbb{E}[\Psi_S(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k, x_N^{k-1})] \\ & \le \Big[\frac{\beta+L}{2} - \frac{1}{\gamma} + \frac{8}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{4L^2}{\beta} + \frac{1}{2}\Big] \mathbb{E}[\|x_N^{k+1} - x_N^k\|^2] \\ & \quad - \sum_{i=1}^{N-1} \mathbb{E}\Big[ \|x_i^k - x_i^{k+1}\|^2_{\frac{1}{2}H_i - \frac{4L^2}{\beta}I - \frac{L+1}{2}I} \Big] + \Big( \frac{8}{\beta} + \frac{N}{2} \Big) \frac{\sigma^2}{M_k}, \end{aligned} \quad (2.54) $$

and the coefficient in front of $\mathbb{E}[\|x_N^{k+1} - x_N^k\|^2]$ is negative.


Proof. Similarly to (2.41), by further incorporating (2.50), we have

$$ \begin{aligned} & \mathcal{L}_\beta(x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_N^k, \lambda^k) - \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k, \lambda^k) \\ & \le -\|x_i^k - x_i^{k+1}\|^2_{\frac{H_i}{2} - \frac{L}{2}I} + \langle \delta_i^k,\ x_i^{k+1} - x_i^k \rangle \\ & \le -\|x_i^k - x_i^{k+1}\|^2_{\frac{H_i}{2} - \frac{L}{2}I} + \frac{1}{2}\|\delta_i^k\|^2 + \frac{1}{2}\|x_i^{k+1} - x_i^k\|^2. \end{aligned} $$

Taking expectation with respect to all random variables on both sides, summing over $i = 1, \dots, N-1$, and using (2.46), we obtain

$$ \mathbb{E}[\mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k, \lambda^k)] - \mathbb{E}[\mathcal{L}_\beta(x_1^k, \dots, x_N^k, \lambda^k)] \le -\sum_{i=1}^{N-1} \mathbb{E}\Big[ \|x_i^{k+1} - x_i^k\|^2_{\frac{1}{2}H_i - \frac{L+1}{2}I} \Big] + \frac{N-1}{2M_k}\sigma^2. \quad (2.55) $$

Note that by Step 2 of Algorithm 3 and the descent lemma we have

$$ \begin{aligned} 0 &= \Big\langle x_N^k - x_N^{k+1},\ \nabla_N f(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k) + \delta_N^k - \lambda^k + \beta\Big( \sum_{j=1}^{N-1} A_j x_j^{k+1} + x_N^k - b \Big) \Big\rangle - \Big\langle x_N^k - x_N^{k+1},\ \frac{1}{\gamma}(x_N^k - x_N^{k+1}) \Big\rangle \\ &\le f(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k) - f(x^{k+1}) + \Big( \frac{L+\beta}{2} - \frac{1}{\gamma} \Big) \|x_N^{k+1} - x_N^k\|^2 - \langle \lambda^k,\ x_N^k - x_N^{k+1} \rangle \\ &\quad + \frac{\beta}{2}\Big\| \sum_{j=1}^{N-1} A_j x_j^{k+1} + x_N^k - b \Big\|^2 - \frac{\beta}{2}\Big\| \sum_{j=1}^{N-1} A_j x_j^{k+1} + x_N^{k+1} - b \Big\|^2 + \langle \delta_N^k,\ x_N^k - x_N^{k+1} \rangle \\ &\le \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k, \lambda^k) - \mathcal{L}_\beta(x^{k+1}, \lambda^k) + \Big( \frac{L+\beta}{2} - \frac{1}{\gamma} + \frac{1}{2} \Big) \|x_N^k - x_N^{k+1}\|^2 + \frac{1}{2}\|\delta_N^k\|^2. \end{aligned} $$

Taking the expectation with respect to all random variables yields

$$ \mathbb{E}[\mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^k)] - \mathbb{E}[\mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k, \lambda^k)] \le \Big( \frac{L+\beta}{2} - \frac{1}{\gamma} + \frac{1}{2} \Big) \mathbb{E}[\|x_N^k - x_N^{k+1}\|^2] + \frac{\sigma^2}{2M_k}. \quad (2.56) $$

The following equality holds trivially from Step 3 of Algorithm 3:

$$ \mathbb{E}[\mathcal{L}_\beta(x_1^{k+1}, \dots, x_N^{k+1}, \lambda^{k+1})] - \mathbb{E}[\mathcal{L}_\beta(x_1^{k+1}, \dots, x_N^{k+1}, \lambda^k)] = \frac{1}{\beta} \mathbb{E}[\|\lambda^k - \lambda^{k+1}\|^2]. \quad (2.57) $$


Combining (2.55), (2.56), (2.57) and (2.48), we obtain

$$ \begin{aligned} & \mathbb{E}[\Psi_S(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}, x_N^k)] - \mathbb{E}[\Psi_S(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k, x_N^{k-1})] \\ & \le \Big[\frac{\beta+L}{2} - \frac{1}{\gamma} + \frac{8}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{4L^2}{\beta} + \frac{1}{2}\Big] \mathbb{E}[\|x_N^k - x_N^{k+1}\|^2] \\ & \quad - \sum_{i=1}^{N-1} \mathbb{E}\Big[ \|x_i^k - x_i^{k+1}\|^2_{\frac{1}{2}H_i - \frac{4L^2}{\beta}I - \frac{L+1}{2}I} \Big] + \Big( \frac{8}{\beta} + \frac{1}{2} + \frac{N-1}{2} \Big) \frac{\sigma^2}{M_k}. \end{aligned} \quad (2.58) $$

Choosing $\beta$ and $\gamma$ according to (2.52) and (2.53), and using arguments similar to those in the proof of Lemma 2.3.5, it is easy to verify that

$$ \frac{\beta+L}{2} - \frac{1}{\gamma} + \frac{8}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{4L^2}{\beta} + \frac{1}{2} < 0. $$

By further choosing $H_i \succ \big( \frac{8L^2}{\beta} + L + 1 \big) I$, the coefficients of the quadratic terms on the right-hand side of (2.58) are negative, and this completes the proof.

Lemma 2.3.13 Suppose the sequence $\{x_1^k, \dots, x_N^k, \lambda^k\}$ is generated by Algorithm 3. It holds that

$$ \mathbb{E}[\Psi_S(x_1^{k+1}, \dots, x_N^{k+1}, \lambda^{k+1}, x_N^k)] \ge \sum_{i=1}^{N-1} r_i^* + f^* - \frac{2\sigma^2}{\beta M_k} \ge \sum_{i=1}^{N-1} r_i^* + f^* - \frac{2\sigma^2}{\beta}. \quad (2.59) $$

Proof. From (2.51) and (2.15), we have

$$ \begin{aligned} & \mathcal{L}_\beta(x_1^{k+1}, \dots, x_N^{k+1}, \lambda^{k+1}) \\ &= \sum_{i=1}^{N-1} r_i(x_i^{k+1}) + f(x^{k+1}) - \Big\langle \sum_{i=1}^{N} A_i x_i^{k+1} - b,\ \nabla_N f(x^{k+1}) + \Big(\beta - \frac{1}{\gamma}\Big)(x_N^k - x_N^{k+1}) \\ &\qquad + \nabla_N f(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k) - \nabla_N f(x^{k+1}) + \delta_N^k \Big\rangle + \frac{\beta}{2}\Big\| \sum_{i=1}^{N} A_i x_i^{k+1} - b \Big\|^2 \\ &\ge \sum_{i=1}^{N-1} r_i(x_i^{k+1}) + f\Big(x_1^{k+1}, \dots, x_{N-1}^{k+1},\ b - \sum_{i=1}^{N-1} A_i x_i^{k+1}\Big) - \frac{4}{\beta}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \|x_N^k - x_N^{k+1}\|^2 \\ &\qquad + \Big( \frac{\beta}{2} - \frac{\beta}{8} - \frac{\beta}{8} - \frac{L}{2} \Big) \Big\| \sum_{i=1}^{N} A_i x_i^{k+1} - b \Big\|^2 - \frac{2}{\beta}\|\delta_N^k\|^2 \\ &\ge \sum_{i=1}^{N-1} r_i^* + f^* - \frac{4}{\beta}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \|x_N^k - x_N^{k+1}\|^2 - \frac{2}{\beta}\|\delta_N^k\|^2, \end{aligned} $$

where the first inequality is obtained by applying $\langle a, b \rangle \le \frac{1}{2}\big( \frac{1}{\eta}\|a\|^2 + \eta \|b\|^2 \big)$ to the terms $\big\langle \sum_{i=1}^{N} A_i x_i^{k+1} - b,\ (\beta - \frac{1}{\gamma})(x_N^k - x_N^{k+1}) \big\rangle$, $\big\langle \sum_{i=1}^{N} A_i x_i^{k+1} - b,\ \nabla_N f(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k) - \nabla_N f(x^{k+1}) \big\rangle$ and $\big\langle \sum_{i=1}^{N} A_i x_i^{k+1} - b,\ \delta_N^k \big\rangle$, with $\eta = \frac{8}{\beta}$, $\frac{8}{\beta}$ and $\frac{4}{\beta}$ respectively. Note that $\beta > 2L$ according to (2.52); thus $\big( \frac{\beta}{2} - \frac{\beta}{8} - \frac{\beta}{8} - \frac{L}{2} \big) > 0$ and the last inequality holds. Rearranging the terms and taking expectation with respect to all random variables completes the proof.

We are now ready to present the iteration complexity result for Algorithm 3.

Theorem 2.3.14 Suppose that the sequence $\{x_1^k, \dots, x_N^k, \lambda^k\}$ is generated by Algorithm 3. Let the parameters $\beta$ and $\gamma$ satisfy (2.52) and (2.53) respectively. Define the constants

$$ \kappa_1 := \frac{4}{\beta^2}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big], \quad \kappa_2 := 3\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big], \quad \kappa_4 := \frac{2}{\tau}\Big( \frac{8}{\beta} + \frac{N}{2} \Big), $$

$$ \kappa_3 := 2\Big( L + \beta\sqrt{N} \max_{1 \le i \le N} \|A_i\|_2^2 + \max_{1 \le i \le N-1} \|H_i\|_2 \Big)^2, $$

and

$$ \tau := \min\Big\{ -\Big( \frac{\beta+L}{2} - \frac{1}{\gamma} + \frac{8}{\beta}\Big[\beta - \frac{1}{\gamma}\Big]^2 + \frac{4L^2}{\beta} + \frac{1}{2} \Big),\ \min_{i=1,\dots,N-1} \Big[ -\Big( \frac{4L^2}{\beta} + \frac{L+1}{2} - \frac{\sigma_{\min}(H_i)}{2} \Big) \Big] \Big\}. $$

Assume $H_i \succ \big( \frac{8L^2}{\beta} + L + 1 \big) I$ and let $M_k = ck$ with $c \in \mathbb{Z}_+$. Then for any positive integer $K$ and a randomly chosen integer $k^*$ from $\{2, \dots, K+1\}$, Algorithm 3 returns an $\varepsilon_K$-stationary solution $(x_1^{k^*+1}, \dots, x_N^{k^*+1}, \lambda^{k^*+1})$ in expectation, following Definition 2.2.10, where

$$ \begin{aligned} \varepsilon_K :=\ & \sqrt{ \frac{2\max\{\kappa_1, \kappa_2, \kappa_3\}}{\tau K} \Big( \mathbb{E}[\Psi_S(x_1^1, \dots, x_N^1, \lambda^1, x_N^0)] - \sum_{i=1}^{N-1} r_i^* - f^* + \frac{2\sigma^2}{\beta} \Big) } \\ & + \sqrt{ \frac{\sigma^2 \log(K+1)}{cK} \max\Big\{ \kappa_1\kappa_4 + \frac{8}{\beta^2},\ \kappa_2\kappa_4 + 3,\ \kappa_3\kappa_4 + 2 \Big\} } \ =\ O\Bigg( \sqrt{\frac{\log K}{K}} \Bigg). \end{aligned} $$


Proof. Most parts of the proof are similar to that of Theorem 2.3.7; the only difference is that we need to carry the stochastic errors throughout the process. For simplicity, we shall highlight the key differences. First, we define $\theta_k$ according to (2.32) and then bound $\mathbb{E}[\theta_{k^*}]$ by

$$ \begin{aligned} \mathbb{E}[\theta_{k^*}] &= \frac{1}{K} \sum_{k=2}^{K+1} \mathbb{E}[\theta_k] \\ &\le \frac{2}{\tau K}\Big( \mathbb{E}[\Psi_S(x_1^1, \dots, x_N^1, \lambda^1, x_N^0)] - \sum_{i=1}^{N-1} r_i^* - f^* + \frac{2\sigma^2}{\beta} \Big) + \kappa_4 \sigma^2\, \frac{\sum_{k=2}^{K+1} \frac{1}{ck}}{K} \\ &\le \frac{2}{\tau K}\Big( \mathbb{E}[\Psi_S(x_1^1, \dots, x_N^1, \lambda^1, x_N^0)] - \sum_{i=1}^{N-1} r_i^* - f^* + \frac{2\sigma^2}{\beta} \Big) + \kappa_4 \sigma^2\, \frac{\log(K+1)}{cK}, \end{aligned} \quad (2.60) $$

where the last inequality is due to the fact that $\sum_{k=2}^{K+1} \frac{1}{k} \le \log(K+1)$.

Second, we have

$$ \mathbb{E}\big[ \|\lambda^{k^*+1} - \nabla_N f(x_1^{k^*+1}, \dots, x_N^{k^*+1})\|^2 \big] \le \kappa_2 \mathbb{E}[\theta_{k^*}] + \frac{3\sigma^2 \log(K+1)}{cK}, \quad (2.61) $$

$$ \mathbb{E}\Big[ \Big\| \sum_{i=1}^{N-1} A_i x_i^{k^*+1} + x_N^{k^*+1} - b \Big\|^2 \Big] \le \kappa_1 \mathbb{E}[\theta_{k^*}] + \frac{8\sigma^2 \log(K+1)}{\beta^2 cK}, \quad (2.62) $$

and

$$ \mathbb{E}\Big[ \mathrm{dist}\Big( \mathrm{Proj}_{T_{x_i^{k^*+1}} \mathcal{M}_i}\big( -\nabla_i f(x^{k^*+1}) + A_i^\top \lambda^{k^*+1} - \partial r_i(x_i^{k^*+1}) \big),\ N_{X_i}(x_i^{k^*+1}) \Big)^2 \Big] \le \kappa_3 \mathbb{E}[\theta_{k^*}] + \frac{2\sigma^2 \log(K+1)}{cK}. \quad (2.63) $$

Finally, applying Jensen's inequality $\mathbb{E}_\xi[\sqrt{\xi}] \le \sqrt{\mathbb{E}_\xi[\xi]}$ to the bounds (2.61), (2.62) and (2.63), together with the fact that $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$, shows that the $\varepsilon_K$-stationarity of Definition 2.2.10 holds in expectation.

2.3.4 A feasible curvilinear line-search variant of ADMM

We remark that the efficiency of the previous algorithms relies on the solvability of the subproblems in Step 1. Though these subproblems may be easily computable, as we shall see from the application examples in Section 2.5, there are also manifolds for which such solutions are not available even when the objective is linearized. As a remedy, we present in this subsection a feasible curvilinear line-search based variant of the ADMM. First, let us make a few additional assumptions.

Assumption 2.3.15 In problem (2.1), the submanifolds $\mathcal{M}_i$, $i = 1, \dots, N-1$, are compact; the nonsmooth regularizing functions $r_i(x_i)$ are absent; and $X_i = \mathbb{R}^{n_i}$ for $i = 1, \dots, N-1$.

Accordingly, the third part of the optimality condition (2.14) is simplified to

$$ \mathrm{Proj}_{T_{x_i^*} \mathcal{M}_i}\big( \nabla_i f(x^*) - A_i^\top \lambda^* \big) = 0, \quad i = 1, \dots, N-1. \quad (2.64) $$

Now let us introduce the retraction operation for a manifold $\mathcal{M}$. Suppose $R(x, \xi): T_x\mathcal{M} \to \mathcal{M}$ is a mapping from the tangent space $T_x\mathcal{M}$ to the manifold $\mathcal{M}$, for any $x \in \mathcal{M}$. Then we call it a retraction if $R(x, 0) = x$ and $\frac{d}{dt} R(x, t\xi)\big|_{t=0} = \xi$ for all $x \in \mathcal{M}$ and $\xi \in T_x\mathcal{M}$ (see [3] for examples). That is, if we let $R_i(\cdot, \cdot)$ be a retraction operator on the submanifold $\mathcal{M}_i$, then for any $x_i \in \mathcal{M}_i$ and $g \in T_{x_i}\mathcal{M}_i$, $Y_i(t) := R_i(x_i, tg)$ is a parametrized curve on the submanifold $\mathcal{M}_i \subset \mathbb{R}^{n_i}$, satisfying

$$ Y_i(0) = x_i \quad \text{and} \quad Y_i'(0) = g. \quad (2.65) $$
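As a concrete instance (a sketch for the unit sphere, one of the standard examples in [3]; the helper names `retract` and the chosen constants are illustrative), the normalization map $R(x, \xi) = (x + \xi)/\|x + \xi\|$ is a retraction, and its first-order agreement with the straight line $x + t\xi$, in the spirit of (2.66)–(2.67), can be checked numerically:

```python
import numpy as np

# Metric-projection retraction on the unit sphere: R(x, xi) = (x + xi)/||x + xi||.
def retract(x, xi):
    return (x + xi) / np.linalg.norm(x + xi)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
g = rng.standard_normal(4)
g -= (g @ x) * x          # project g onto the tangent space T_x S = {v : <v, x> = 0}

assert np.allclose(retract(x, 0 * g), x)          # R(x, 0) = x
# First-order agreement: ||Y(t) - x - t g|| decays like t^2 (cf. (2.67)),
# while ||Y(t) - x|| grows at most linearly in t (cf. (2.66)).
for t in (1e-2, 1e-3):
    Y = retract(x, t * g)
    assert np.linalg.norm(Y - x - t * g) <= 2.0 * t**2 * (g @ g)
    assert np.linalg.norm(Y - x) <= 2.0 * t * np.linalg.norm(g)
```

The constants 2.0 above play the role of the (instance-dependent) $L_1, L_2$; for the sphere one can verify analytically that $\|Y(t) - x - tg\| = t^2\|g\|^2/(1 + \sqrt{1 + t^2\|g\|^2}) \le t^2\|g\|^2/2$.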

Proposition 2.3.16 For retractions $Y_i(t) = R_i(x_i, tg)$, $i = 1, \dots, N-1$, if the submanifolds $\mathcal{M}_i$, $i = 1, \dots, N-1$, are compact, then there exist $L_1, L_2 > 0$ such that

$$ \|Y_i(t) - Y_i(0)\| \le L_1 t \|Y_i'(0)\|, \quad (2.66) $$

$$ \|Y_i(t) - Y_i(0) - t Y_i'(0)\| \le L_2 t^2 \|Y_i'(0)\|^2. \quad (2.67) $$

The above proposition states that the retraction curve stays approximately close to a straight line in Euclidean space (see Figure 2.5). The result was proved as a byproduct of Lemma 3 in [9] and was also adopted in [45].

Figure 2.5: A retraction.

Let the augmented Lagrangian function be defined by (2.16) (without the $r_i(x_i)$ terms) and denote

$$ \mathrm{grad}_{x_i} \mathcal{L}_\beta(x_1, \dots, x_N, \lambda) = \mathrm{Proj}_{T_{x_i} \mathcal{M}_i} \nabla_i \mathcal{L}_\beta(x_1, \dots, x_N, \lambda) $$


as the Riemannian partial gradient. We present the resulting algorithm as Algorithm 4.

Algorithm 4: A feasible curvilinear line-search-based ADMM

1 Given $(x_1^0, \dots, x_{N-1}^0, x_N^0) \in \mathcal{M}_1 \times \cdots \times \mathcal{M}_{N-1} \times \mathbb{R}^{n_N}$, $\lambda^0 \in \mathbb{R}^m$, $\beta, \gamma, \sigma > 0$, $s > 0$, $\alpha \in (0, 1)$.
2 for $k = 0, 1, \dots$ do
3 &nbsp;&nbsp;[Step 1] for $i = 1, 2, \dots, N-1$ do
4 &nbsp;&nbsp;&nbsp;&nbsp;Compute $g_i^k = \mathrm{grad}_{x_i} \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k, \lambda^k)$;
5 &nbsp;&nbsp;&nbsp;&nbsp;Initialize $t_i^k = s$. While
$$ \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{i-1}^{k+1}, R_i(x_i^k, -t_i^k g_i^k), x_{i+1}^k, \dots, x_N^k, \lambda^k) > \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, \dots, x_N^k, \lambda^k) - \frac{\sigma}{2}(t_i^k)^2 \|g_i^k\|^2, $$
&nbsp;&nbsp;&nbsp;&nbsp;shrink $t_i^k \leftarrow \alpha t_i^k$;
6 &nbsp;&nbsp;&nbsp;&nbsp;Set $x_i^{k+1} = R_i(x_i^k, -t_i^k g_i^k)$;
7 &nbsp;&nbsp;[Step 2] $x_N^{k+1} := x_N^k - \gamma \nabla_N \mathcal{L}_\beta(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k, \lambda^k)$;
8 &nbsp;&nbsp;[Step 3] $\lambda^{k+1} := \lambda^k - \beta\big(\sum_{i=1}^{N} A_i x_i^{k+1} - b\big)$.
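The line search in Step 1 (lines 4–6) can be sketched in isolation. The following applies the same backtracking rule to minimizing a quadratic form over the unit sphere, a single-block toy stand-in for $\mathcal{L}_\beta$ (with $\lambda$, $\beta$ and the other blocks absent); the sphere retraction, the matrix $Q$, and the parameter values are assumptions for the illustration.

```python
import numpy as np

# Curvilinear backtracking on the unit sphere for h(x) = x^T Q x,
# mirroring Step 1 of Algorithm 4: shrink t while
#   h(R(x, -t g)) > h(x) - (sigma/2) t^2 ||g||^2.
Q = np.diag([1.0, 2.0, 3.0])
retract = lambda x, xi: (x + xi) / np.linalg.norm(x + xi)

def riem_grad(x):
    e = 2 * Q @ x                      # Euclidean gradient of h
    return e - (e @ x) * x             # project onto the tangent space at x

sigma, s, alpha = 1.0, 1.0, 0.5        # line-search parameters as in Algorithm 4
x = np.ones(3) / np.sqrt(3.0)
h = lambda x: x @ Q @ x

vals = [h(x)]
for _ in range(500):
    g = riem_grad(x)
    if np.linalg.norm(g) < 1e-10:      # stationary: stop
        break
    t = s
    while h(retract(x, -t * g)) > h(x) - 0.5 * sigma * t**2 * (g @ g):
        t *= alpha                     # shrink t_i^k <- alpha * t_i^k
    x = retract(x, -t * g)
    vals.append(h(x))

print(vals[-1])
```

By construction the accepted step decreases $h$ by at least $(\sigma/2)t^2\|g\|^2$, so the objective values are monotonically nonincreasing; for this $Q$ the minimum of $h$ over the sphere is the smallest eigenvalue, $1.0$.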

For Steps 2 and 3, Lemma 2.3.4 and Lemma 2.3.6 still hold. Further using Proposition

2.3.16, Lemma 2.3.4 becomes

Lemma 2.3.17 Suppose that the sequence $\{x_1^k, \dots, x_N^k, \lambda^k\}$ is generated by Algorithm 4. Then,

$$ \|\lambda^{k+1} - \lambda^k\|^2 \le 3\Big(\beta - \frac{1}{\gamma}\Big)^2 \|t_N^k g_N^k\|^2 + 3\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2\Big] \|t_N^{k-1} g_N^{k-1}\|^2 + 3L^2 L_1^2 \sum_{i=1}^{N-1} \|t_i^k g_i^k\|^2, \quad (2.68) $$

where for simplicity we define $t_N^k = \gamma$ and $x_N^{k+1} = x_N^k + t_N^k g_N^k$ for all $k \ge 0$. Moreover, with $\Psi_G$ defined in (2.19), Lemma 2.3.5 remains true, whereas the amount of decrease becomes

$$ \begin{aligned} & \Psi_G(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^{k+1}, \lambda^{k+1}, x_N^k) - \Psi_G(x_1^k, \dots, x_{N-1}^k, x_N^k, \lambda^k, x_N^{k-1}) \\ & \le \Big[\frac{\beta+L}{2} - \frac{1}{\gamma} + \frac{6}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{3L^2}{\beta}\Big] \|t_N^k g_N^k\|^2 - \sum_{i=1}^{N-1} \Big( \frac{\sigma}{2} - \frac{3}{\beta} L^2 L_1^2 \Big) \|t_i^k g_i^k\|^2 < 0. \end{aligned} \quad (2.69) $$

Now we are in a position to present the iteration complexity result.

Theorem 2.3.18 Suppose that the sequence $\{x_1^k, \dots, x_N^k, \lambda^k\}$ is generated by Algorithm 4, and the parameters $\beta$ and $\gamma$ satisfy (2.20) and (2.21) respectively. Denote $A_{\max} = \max_{1 \le j \le N} \|A_j\|_2$. Define the constants

$$ \kappa_1 := \frac{3}{\beta^2}\Big[\Big(\beta - \frac{1}{\gamma}\Big)^2 + L^2 \cdot \max\{L_1^2, 1\}\Big], \quad \kappa_2 := \Big(\Big|\beta - \frac{1}{\gamma}\Big| + L\Big)^2, $$

$$ \kappa_3 := \Bigg( (L + \sqrt{N}\beta A_{\max}^2) \cdot \max\{L_1, 1\} + \frac{\sigma + 2L_2 C + (L + \beta A_{\max}^2)L_1^2}{2\alpha} + \beta A_{\max} \sqrt{\kappa_1} \Bigg)^2, $$

and

$$ \tau := \min\Big\{ -\Big[\frac{\beta+L}{2} - \frac{1}{\gamma} + \frac{6}{\beta}\Big(\beta - \frac{1}{\gamma}\Big)^2 + \frac{3L^2}{\beta}\Big],\ \frac{\sigma}{2} - \frac{3}{\beta}L^2 L_1^2 \Big\}, $$

where $C > 0$ is a constant that depends only on the first iterate and the initial point. Assume $\sigma > \max\{\frac{6}{\beta}L^2 L_1^2, \frac{2\alpha}{s}\}$. Define

$$ K := \Bigg\lceil \frac{3\max\{\kappa_1, \kappa_2, \kappa_3\}}{\tau \varepsilon^2} \big( \Psi_G(x_1^1, \dots, x_N^1, \lambda^1, x_N^0) - f^* \big) \Bigg\rceil, \quad (2.70) $$

and $k^* := \mathop{\mathrm{argmin}}_{2 \le k \le K+1} \sum_{i=1}^{N} \big( \|t_i^{k+1} g_i^{k+1}\|^2 + \|t_i^k g_i^k\|^2 + \|t_i^{k-1} g_i^{k-1}\|^2 \big)$. Then $(x_1^{k^*+1}, \dots, x_N^{k^*+1}, \lambda^{k^*+1})$ is an $\varepsilon$-stationary solution of (2.1).

Proof. Through a similar argument, one can easily obtain

$$ \|\lambda^{k+1} - \nabla_N f(x_1^{k+1}, \dots, x_N^{k+1})\|^2 \le \kappa_2 \theta_k \quad \text{and} \quad \Big\| \sum_{i=1}^{N-1} A_i x_i^{k+1} + x_N^{k+1} - b \Big\|^2 \le \kappa_1 \theta_k, $$

where $\theta_k = \sum_{i=1}^{N} \big( \|t_i^{k+1} g_i^{k+1}\|^2 + \|t_i^k g_i^k\|^2 + \|t_i^{k-1} g_i^{k-1}\|^2 \big)$. The only remaining task is to guarantee an $\varepsilon$ version of (2.64). First let us prove that

$$ \|g_i^{k+1}\| \le \frac{\sigma + 2L_2 C + (L + \beta A_{\max}^2) L_1^2}{2\alpha} \sqrt{\theta_k}. \quad (2.71) $$


Denote $h_i(x_i) = \mathcal{L}_\beta(x_1^{k+2}, \dots, x_{i-1}^{k+2}, x_i, x_{i+1}^{k+1}, \dots, x_N^{k+1}, \lambda^{k+1})$ and $Y_i(t) = R_i(x_i^{k+1}, -t g_i^{k+1})$; then it is not hard to see that $\nabla h_i(x_i)$ is Lipschitz continuous with parameter $L + \beta\|A_i\|_2^2 \le L_3 := L + \beta A_{\max}^2$. Consequently, it yields

$$ \begin{aligned} h_i(Y_i(t)) &\le h_i(Y_i(0)) + \langle \nabla h_i(Y_i(0)),\ Y_i(t) - Y_i(0) - tY_i'(0) + tY_i'(0) \rangle + \frac{L_3}{2}\|Y_i(t) - Y_i(0)\|^2 \\ &\le h_i(Y_i(0)) + t\langle \nabla h_i(Y_i(0)),\ Y_i'(0) \rangle + L_2 t^2 \|\nabla h_i(Y_i(0))\| \|Y_i'(0)\|^2 + \frac{L_3 L_1^2}{2} t^2 \|Y_i'(0)\|^2 \\ &= h_i(Y_i(0)) - \Big( t - L_2 t^2 \|\nabla h_i(Y_i(0))\| - \frac{L_3 L_1^2}{2} t^2 \Big) \|Y_i'(0)\|^2, \end{aligned} $$

where the last equality is due to $\langle \nabla h_i(Y_i(0)), Y_i'(0) \rangle = -\langle Y_i'(0), Y_i'(0) \rangle$. Also note the relationship

$$ \|Y_i'(0)\| = \|g_i^{k+1}\| = \big\| \mathrm{Proj}_{T_{x_i^{k+1}} \mathcal{M}_i} \nabla h_i(Y_i(0)) \big\| \le \|\nabla h_i(Y_i(0))\|. $$

Note that $\big\| \sum_{i=1}^{N-1} A_i x_i^{k+1} + x_N^{k+1} - b \big\| \le \sqrt{\kappa_1 \theta_k} \le \sqrt{\frac{\kappa_1}{\tau}\big( \Psi_G(x_1^1, \dots, x_N^1, \lambda^1, x_N^0) - f^* \big)}$. Because $\mathcal{M}_i$, $i = 1, \dots, N-1$, are all compact submanifolds, the iterates $x_i^{k+1}$, $i = 1, \dots, N-1$, are all bounded. Hence the whole sequence $\{x_N^k\}$ is also bounded. By (2.18) (which also holds in this case),

$$ \|\lambda^{k+1}\| \le \Big|\beta - \frac{1}{\gamma}\Big| \sqrt{\theta_k} + \|\nabla_N f(x_1^{k+1}, \dots, x_{N-1}^{k+1}, x_N^k)\|. $$

By the boundedness of $(x_1^k, \dots, x_N^k)$ and the continuity of $\nabla f(\cdot)$, the second term is bounded. Combining this with the boundedness of $\{\theta_k\}$, we know that the whole sequence $\{\lambda^k\}$ is bounded. Consequently, there exists a constant $C > 0$ such that $\|\nabla h_i(Y_i(0))\| \le C$, where

$$ \nabla h_i(Y_i(0)) = \nabla_i f(x_1^{k+2}, \dots, x_{i-1}^{k+2}, x_i^{k+1}, \dots, x_N^{k+1}) - A_i^\top \lambda^{k+1} + \beta A_i^\top \Big( \sum_{j=1}^{i-1} A_j x_j^{k+2} + \sum_{j=i}^{N} A_j x_j^{k+1} - b \Big). $$

Note that this constant $C$ depends only on the first two iterates $\{x_1^t, \dots, x_N^t, \lambda^t\}$, $t = 0, 1$, and on absolute constants such as $\|A_i\|_2$, $i = 1, \dots, N$. Therefore, when

$$ t \le \frac{2}{2L_2 C + \sigma + L_3 L_1^2} \le \frac{2}{2L_2 \|\nabla h_i(Y_i(0))\| + \sigma + L_3 L_1^2}, $$

it holds that

$$ h_i(Y_i(t)) \le h_i(x_i^{k+1}) - \frac{\sigma}{2} t^2 \|g_i^{k+1}\|^2. $$


Note that $\sigma > \frac{2\alpha}{s}$; by the terminating rule of the line-search step, we have

$$ t_i^k \ge \min\Big\{ s,\ \frac{2\alpha}{2L_2 C + \sigma + L_3 L_1^2} \Big\} = \frac{2\alpha}{2L_2 C + \sigma + L_3 L_1^2}. $$

Then by noting

$$ \frac{2\alpha \|g_i^{k+1}\|}{2L_2 C + \sigma + L_3 L_1^2} \le t_i^{k+1} \|g_i^{k+1}\| \le \sqrt{\theta_k}, $$

we have (2.71).

Now let us discuss the issue of (2.64). By definition,

$$ g_i^{k+1} = \mathrm{Proj}_{T_{x_i^{k+1}} \mathcal{M}_i}\Big[ \nabla_i f(x_1^{k+2}, \dots, x_{i-1}^{k+2}, x_i^{k+1}, \dots, x_N^{k+1}) - A_i^\top \lambda^{k+1} + \beta A_i^\top \Big( \sum_{j=1}^{i-1} A_j x_j^{k+2} + \sum_{j=i}^{N} A_j x_j^{k+1} - b \Big) \Big]. $$

Consequently, we obtain
\[
\begin{aligned}
&\Big\|\mathrm{Proj}_{T_{x_i^{k+1}}\mathcal{M}_i}\big[\nabla_i f(x^{k+1}) - A_i^\top\lambda^{k+1}\big]\Big\| \\
&\quad= \Big\|\mathrm{Proj}_{T_{x_i^{k+1}}\mathcal{M}_i}\Big[\nabla_i f(x^{k+1}) - \nabla_i f(x_1^{k+2},\cdots,x_{i-1}^{k+2},x_i^{k+1},\cdots,x_N^{k+1}) + g_i^{k+1} \\
&\qquad\qquad - \beta A_i^\top\Big(\sum_{j=1}^{N}A_j x_j^{k+1} - b\Big) + \beta A_i^\top\sum_{j=1}^{i-1}A_j(x_j^{k+1}-x_j^{k+2})\Big]\Big\| \\
&\quad\le \big\|\nabla_i f(x^{k+1}) - \nabla_i f(x_1^{k+2},\cdots,x_{i-1}^{k+2},x_i^{k+1},\cdots,x_N^{k+1})\big\| + \Big\|\beta A_i^\top\Big(\sum_{j=1}^{N}A_j x_j^{k+1}-b\Big)\Big\| \\
&\qquad\qquad + \|g_i^{k+1}\| + \Big\|\beta A_i^\top\Big(\sum_{j=1}^{i-1}A_j(x_j^{k+1}-x_j^{k+2})\Big)\Big\| \\
&\quad\le \big(L+\sqrt{N}\beta A_{\max}^2\big)\max\{L_1,1\}\sqrt{\theta_k} + \big(\sigma+2L_2C+(L+\beta A_{\max}^2)L_1^2\big)\sqrt{\theta_k} + \beta\|A_i\|_2\sqrt{\kappa_1\theta_k} \\
&\quad\le \sqrt{\kappa_3\theta_k}.
\end{aligned}
\]


2.4 Extending the Basic Model

Recall that for our basic model (2.1), a number of assumptions have been made; e.g., we assumed that \(r_i\), \(i=1,\ldots,N-1\), are convex, \(x_N\) is unconstrained, and \(A_N = I\). In this section we shall extend the model by relaxing some of these assumptions. We shall also extend our basic algorithmic scheme from a Gauss-Seidel style updating rule to a Jacobi style updating rule, in order to enable parallelization.

2.4.1 Relaxing the convexity requirement on nonsmooth regularizers

For problem (2.1), the nonsmooth parts \(r_i\) are actually not necessarily convex. For example, nonconvex and nonsmooth regularizations such as \(\ell_q\) regularization with \(0<q<1\) are very common in compressive sensing. To accommodate this change, the following adaptation is needed.

Proposition 2.4.1 Consider problem (2.1), where \(f\) is smooth with Lipschitz continuous gradient. Suppose that \(I_1, I_2\) form a partition of the index set \(\{1,\ldots,N-1\}\) such that for \(i\in I_1\) the \(r_i\)'s are nonsmooth but convex, and for \(i\in I_2\) the \(r_i\)'s are nonsmooth and nonconvex but locally Lipschitz continuous. If for the blocks \(x_i\), \(i\in I_2\), there are no manifold constraints, i.e., \(\mathcal{M}_i = \mathbb{R}^{n_i}\) for \(i\in I_2\), then Theorems 2.3.7, 2.3.9 and 2.3.14 remain true.

Recall that in the proofs of (2.37) and (2.38), we required the convexity of \(r_i\) to ensure (2.8). However, if \(\mathcal{M}_i = \mathbb{R}^{n_i}\), then we directly have (2.7), i.e., \(\partial_i(f+r_i) = \nabla_i f + \partial r_i\), instead of (2.8). The only difference is that \(\partial r_i\) becomes the Clarke subdifferential instead of the convex subdifferential, and the projection operator is no longer needed. In the subsequent complexity analysis, we just need to remove all the projection operators in (2.38) and (2.63). Hence the same convergence result follows.

Moreover, if for some blocks the \(r_i\)'s are nonsmooth and nonconvex while the constraint \(x_i\in\mathcal{M}_i \ne \mathbb{R}^{n_i}\) is still imposed, then we can solve the problem via the following equivalent


formulation:
\[
\begin{array}{ll}
\min & f(x_1,\ldots,x_N) + \sum_{i\in I_1\cup I_2} r_i(x_i) + \sum_{i\in I_3} r_i(y_i) \\[2pt]
\mathrm{s.t.} & \sum_{i=1}^{N} A_i x_i = b, \ \text{with } A_N = I, \\[2pt]
& x_N \in \mathbb{R}^{n_N}, \qquad (2.72)\\[2pt]
& x_i \in \mathcal{M}_i \cap X_i, \ i \in I_1\cup I_3, \\[2pt]
& x_i \in X_i, \ i \in I_2, \\[2pt]
& y_i = x_i, \ i \in I_3,
\end{array}
\]
where \(I_1\), \(I_2\) and \(I_3\) form a partition of \(\{1,\ldots,N-1\}\), with \(r_i\) convex for \(i\in I_1\), and nonconvex but locally Lipschitz continuous for \(i\in I_2\cup I_3\). The difference is that \(x_i\) is not required to satisfy the Riemannian manifold constraint for \(i\in I_2\).

2.4.2 Relaxing the condition on the last block variables

In the previous discussion, we limited our problem to the case where \(A_N = I\) and \(x_N\) is unconstrained. Actually, for the general case
\[
\begin{array}{ll}
\min & f(x_1,\cdots,x_N) + \sum_{i=1}^{N} r_i(x_i) \\[2pt]
\mathrm{s.t.} & \sum_{i=1}^{N} A_i x_i = b, \qquad (2.73)\\[2pt]
& x_i \in \mathcal{M}_i \cap X_i, \ i=1,\ldots,N,
\end{array}
\]


we can actually add an additional block \(x_{N+1}\) and slightly modify the objective to arrive at the modified problem
\[
\begin{array}{ll}
\min & f(x_1,\cdots,x_N) + \sum_{i=1}^{N} r_i(x_i) + \frac{\mu}{2}\|x_{N+1}\|^2 \\[2pt]
\mathrm{s.t.} & \sum_{i=1}^{N} A_i x_i + x_{N+1} = b, \\[2pt]
& x_{N+1} \in \mathbb{R}^m, \qquad (2.74)\\[2pt]
& x_i \in \mathcal{M}_i \cap X_i, \ i=1,\ldots,N.
\end{array}
\]

Following a similar line of proofs for Theorem 4.1 in [44], we have the following proposition.

Proposition 2.4.2 Consider the modified problem (2.74) with \(\mu = 1/\epsilon\) for some given tolerance \(\epsilon\in(0,1)\), and suppose the sequence \((x_1^k,\ldots,x_{N+1}^k,\lambda^k)\) is generated by Algorithm 1 (resp. Algorithm 2). Let \((x_1^{k^*+1},\ldots,x_N^{k^*+1},\lambda^{k^*+1})\) be an \(\epsilon\)-stationary solution of (2.74) as defined in Theorem 2.3.7 (resp. Theorem 2.3.9). Then \((x_1^{k^*+1},\ldots,x_N^{k^*+1},\lambda^{k^*+1})\) is an \(\epsilon\)-stationary point of the original problem (2.73).

Remark 2.4.3 We remark here that when \(\mu = 1/\epsilon\), the Lipschitz constant \(L\) of the objective function also depends on \(\epsilon\). As a result, the iteration complexity of Algorithms 1 and 2 becomes \(O(1/\epsilon^4)\).

2.4.3 The Jacobi-style updating rule

Parallel to (2.39), we define a new linearized approximation of the augmented Lagrangian as
\[
L_\beta^i(x_i;\bar{x}_1,\cdots,\bar{x}_N,\lambda) = f_\beta(\bar{x}_1,\cdots,\bar{x}_N) + \langle \nabla_i f_\beta(\bar{x}_1,\cdots,\bar{x}_N),\, x_i-\bar{x}_i\rangle - \Big\langle \sum_{j\ne i} A_j\bar{x}_j + A_i x_i - b,\ \lambda\Big\rangle + r_i(x_i), \qquad (2.75)
\]
where
\[
f_\beta(x_1,\cdots,x_N) = f(x_1,\cdots,x_N) + \frac{\beta}{2}\Big\|\sum_{j=1}^{N} A_j x_j - b\Big\|^2.
\]

42

Page 49: Nonconvex Optimization Methods: Iteration Complexity and

Compared with (2.39), in this case we linearize both the smooth coupling objective function and the augmented term.

In Step 1 of Algorithm 2, we have the Gauss-Seidel style updating rule
\[
x_i^{k+1} = \mathop{\mathrm{argmin}}_{x_i\in\mathcal{M}_i\cap X_i} L_\beta^i(x_i;\, x_1^{k+1},\cdots,x_{i-1}^{k+1},x_i^k,\cdots,x_N^k,\lambda^k) + \frac{1}{2}\|x_i-x_i^k\|_{H_i}^2.
\]
Now if we replace this with the Jacobi style updating rule
\[
x_i^{k+1} = \mathop{\mathrm{argmin}}_{x_i\in\mathcal{M}_i\cap X_i} L_\beta^i(x_i;\, x_1^{k},\cdots,x_{i-1}^{k},x_i^k,\cdots,x_N^k,\lambda^k) + \frac{1}{2}\|x_i-x_i^k\|_{H_i}^2, \qquad (2.76)
\]
then we end up with a new algorithm that updates all blocks in parallel instead of sequentially. When the number of blocks \(N\) is large, the Jacobi updating rule can be beneficial because the computation can be parallelized.
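As a toy illustration of the difference between the two sweeps (our sketch, not tied to the specific manifold subproblems above): a Gauss-Seidel sweep lets block \(i\) see the blocks already refreshed within the same sweep, while a Jacobi sweep computes every block from the previous iterate, so the per-block computations are independent and parallelizable.

```python
def gauss_seidel_sweep(x, update):
    """One sequential sweep: block i sees blocks 0..i-1 already updated."""
    for i in range(len(x)):
        x[i] = update(i, x)
    return x

def jacobi_sweep(x, update):
    """One parallel-style sweep: every block reads the old iterate x."""
    return [update(i, x) for i in range(len(x))]

# With the toy update(i, x) = sum(x) + i, starting from [1, 1]:
# Gauss-Seidel gives [2, 4]; Jacobi gives [2, 3].
```

In the Jacobi sweep, each call of `update` could be dispatched to a separate worker, which is exactly the benefit claimed above.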

The convergence analysis of this variant is very similar to that of Algorithms 1 and 2. All we need is a counterpart of (2.42), which is part of the proof of Lemma 2.3.5. That is, we need to show
\[
L_\beta(x_1^{k+1},\cdots,x_{N-1}^{k+1},x_N^k,\lambda^k) \le L_\beta(x_1^k,\cdots,x_N^k,\lambda^k) - \sum_{i=1}^{N-1}\|x_i^k-x_i^{k+1}\|^2_{\frac{H_i}{2}-\frac{\bar{L}}{2}I}, \qquad (2.77)
\]
for some \(\bar{L}>0\). Consequently, if we choose \(H_i \succ \bar{L}I\), then the convergence and complexity analysis goes through for Algorithm 2. Moreover, Algorithm 3 can also be adapted to the Jacobi-style updates. The proof of (2.77) is attached here.

Proof. First, we need to determine the Lipschitz constant of \(f_\beta\):
\[
\begin{aligned}
\|\nabla f_\beta(x)-\nabla f_\beta(y)\|
&\le L\|x-y\| + \beta\Big\|\Big[\Big(\sum_{j=1}^{N}A_j(x_j-y_j)\Big)^{\!\top} A_1,\ \cdots,\ \Big(\sum_{j=1}^{N}A_j(x_j-y_j)\Big)^{\!\top} A_N\Big]\Big\| \qquad (2.78)\\
&\le L\|x-y\| + \beta\sqrt{N}\max_{1\le i\le N}\|A_i\|_2\,\Big\|\sum_{j=1}^{N}A_j(x_j-y_j)\Big\| \\
&\le \Big(L+\beta N\max_{1\le i\le N}\|A_i\|_2^2\Big)\|x-y\|.
\end{aligned}
\]


So we define \(\bar{L} = L + \beta N\max_{1\le i\le N}\|A_i\|_2^2\) as the Lipschitz constant of the function \(f_\beta\). The global optimality of the subproblem (2.76) yields
\[
\langle \nabla_i f_\beta(x_1^k,\cdots,x_N^k),\, x_i^{k+1}-x_i^k\rangle - \langle\lambda^k, A_i x_i^{k+1}\rangle + r_i(x_i^{k+1}) + \frac{1}{2}\|x_i^{k+1}-x_i^k\|_{H_i}^2 \ \le\ r_i(x_i^k) - \langle\lambda^k, A_i x_i^k\rangle.
\]

By the descent lemma we have
\[
\begin{aligned}
L_\beta(x_1^{k+1},\cdots,x_{N-1}^{k+1},x_N^k,\lambda^k)
&= f_\beta(x_1^{k+1},\cdots,x_{N-1}^{k+1},x_N^k) - \Big\langle\lambda^k,\, \sum_{i=1}^{N}A_i x_i^{k+1}-b\Big\rangle + \sum_{i=1}^{N-1} r_i(x_i^{k+1}) \\
&\le f_\beta(x_1^k,\cdots,x_{N-1}^k,x_N^k) + \langle\nabla f_\beta(x_1^k,\cdots,x_{N-1}^k,x_N^k),\, x^{k+1}-x^k\rangle \\
&\qquad + \frac{\bar{L}}{2}\|x^{k+1}-x^k\|^2 - \Big\langle\lambda^k,\, \sum_{i=1}^{N}A_i x_i^{k+1}-b\Big\rangle + \sum_{i=1}^{N-1} r_i(x_i^{k+1}).
\end{aligned}
\]
Combining the above two inequalities yields (2.77).

2.5 Some Applications and Their Implementations

The applications of block optimization with manifold constraints are abundant. In this

section we shall present some typical examples. Our choices include the NP-hard maximum

bisection problem, the sparse multilinear principal component analysis, and the community

detection problem.

2.5.1 Maximum bisection problem

The maximum bisection problem is a variant of the well-known NP-hard maximum cut problem. Suppose we have a graph \(G=(V,E)\), where \(V=\{1,\ldots,n\}=:[n]\) denotes the set of nodes and \(E\) denotes the set of edges; each edge \(e_{ij}\in E\) is assigned a weight \(W_{ij}\ge 0\). For pairs \((i,j)\notin E\), define \(W_{ij}=0\). A bisection \((V_1,V_2)\) of \(V\) is defined by
\[
V_1\cup V_2 = V,\qquad V_1\cap V_2 = \emptyset,\qquad |V_1|=|V_2|.
\]


The maximum bisection problem is to find the bisection that maximizes the graph cut value:
\[
\begin{array}{ll}
\max_{V_1,V_2} & \sum_{i\in V_1}\sum_{j\in V_2} W_{ij} \\[2pt]
\mathrm{s.t.} & (V_1,V_2) \text{ is a bisection of } V.
\end{array}
\]
Note that if we relax the constraint \(|V_1|=|V_2|\), that is, we only require \((V_1,V_2)\) to be a partition of \(V\), then this problem becomes the maximum cut problem. Here, we propose to solve this problem by our method and compare our results with the two semidefinite programming (SDP) relaxations proposed in [28, 100].

First, we model the bisection \((V_1,V_2)\) by a binary assignment matrix \(U\in\{0,1\}^{n\times 2}\). Each node \(i\) is represented by the \(i\)-th row of \(U\). Denote this row by \(u_i^\top\), where \(u_i\in\{0,1\}^{2\times 1}\) is a column vector with exactly one entry equal to 1. Then \(u_i^\top=(1,0)\) or \((0,1)\) corresponds to \(i\in V_1\) or \(i\in V_2\), respectively, and the objective can be represented by
\[
\sum_{i\in V_1}\sum_{j\in V_2} W_{ij} = \sum_{i,j}(1-\langle u_i,u_j\rangle)W_{ij} = -\langle W, UU^\top\rangle + \mathrm{const}.
\]
The constraint \(|V_1|=|V_2|\) is characterized by the linear equality constraint
\[
\sum_{i=1}^{n}(u_i)_1 - \sum_{i=1}^{n}(u_i)_2 = 0.
\]
Consequently, we can develop the following nonconvex relaxation of the maximum bisection problem:
\[
\begin{array}{ll}
\min_U & \langle W, UU^\top\rangle \\[2pt]
\mathrm{s.t.} & \|u_i\|_2 = 1,\ u_i\ge 0,\ \text{for } i=1,\ldots,n, \qquad (2.79)\\[2pt]
& \sum_{i=1}^{n}(u_i)_1 - \sum_{i=1}^{n}(u_i)_2 = 0.
\end{array}
\]

After the relaxation is solved, each row is first rounded to an integer solution:
\[
u_i \leftarrow \begin{cases} (1,0)^\top, & \text{if } (u_i)_1 \ge (u_i)_2, \\ (0,1)^\top, & \text{otherwise.} \end{cases}
\]


Then a greedy algorithm is applied to adjust the current solution to a feasible bisection. Note that this greedy step is necessary both for our algorithm and for the SDP relaxations in [28, 100] to reach a feasible bisection.

The ADMM formulation of this problem will be shown in the numerical experiments, and the algorithmic details are omitted. Here we only need to mention that all the subproblems are of the following form:
\[
\begin{array}{ll}
\min_x & b^\top x \qquad (2.80)\\[2pt]
\mathrm{s.t.} & \|x\|_2 = 1,\ x\ge 0.
\end{array}
\]
This nonconvex constrained problem can actually be solved to global optimality in closed form; see Lemma 1 in [103]. For the sake of completeness, we present the lemma below.

Lemma 2.5.1 (Lemma 1 in [103]) Define \(b^+ = \max\{b,0\}\) and \(b^- = -\min\{b,0\}\), where the max and min are taken element-wise. Note that \(b^+\ge 0\), \(b^-\ge 0\), and \(b = b^+ - b^-\). The closed-form solution of problem (2.80) is
\[
x^* = \begin{cases} \dfrac{b^-}{\|b^-\|}, & \text{if } b^-\ne 0, \\[4pt] e_i, & \text{otherwise,} \end{cases} \qquad (2.81)
\]
where \(e_i\) is the \(i\)-th unit vector with \(i = \mathop{\mathrm{argmin}}_j b_j\).
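The closed-form solution in Lemma 2.5.1 is straightforward to implement; the sketch below (NumPy, with a function name of our choosing) returns the global minimizer of (2.80).

```python
import numpy as np

def min_linear_on_sphere_nonneg(b):
    """Globally minimize b^T x subject to ||x||_2 = 1, x >= 0 (Lemma 2.5.1)."""
    b = np.asarray(b, dtype=float)
    b_minus = -np.minimum(b, 0.0)              # b^- : element-wise negative part of b
    if np.any(b_minus > 0):
        return b_minus / np.linalg.norm(b_minus)  # x* = b^- / ||b^-||
    # b >= 0: put all mass on the coordinate with the smallest b_j
    x = np.zeros_like(b)
    x[np.argmin(b)] = 1.0
    return x

x = min_linear_on_sphere_nonneg([1.0, -3.0, -4.0])
# b^- = (0, 3, 4), so x* = (0, 0.6, 0.8) with objective value -5.
```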

2.5.2 The \(\ell_q\)-regularized sparse tensor PCA

As we discussed at the beginning of Section 2.1, tensor principal component analysis (or multilinear principal component analysis (MPCA)) has been a popular subject of study in recent years. Below, we shall discuss a sparse version of this problem.

Suppose that we are given a collection of order-\(d\) tensors \(\mathbf{T}^{(1)},\mathbf{T}^{(2)},\ldots,\mathbf{T}^{(N)}\in\mathbb{R}^{n_1\times n_2\times\cdots\times n_d}\).


The sparse MPCA problem can be formulated as (see also [95]):
\[
\begin{array}{ll}
\min & \sum_{i=1}^{N}\|\mathbf{T}^{(i)} - \mathbf{C}^{(i)}\times_1 U_1\times_2\cdots\times_d U_d\|_F^2 + \alpha_1\sum_{i=1}^{N}\|\mathbf{C}^{(i)}\|_p^p + \alpha_2\sum_{j=1}^{d}\|U_j\|_q^q \\[2pt]
\mathrm{s.t.} & \mathbf{C}^{(i)}\in\mathbb{R}^{m_1\times\cdots\times m_d},\ i=1,\ldots,N, \\[2pt]
& U_j\in\mathbb{R}^{n_j\times m_j},\ U_j^\top U_j = I,\ j=1,\ldots,d.
\end{array}
\]

In order to apply our developed algorithms, we consider the following variant of sparse MPCA:
\[
\begin{array}{ll}
\min & \sum_{i=1}^{N}\|\mathbf{T}^{(i)} - \mathbf{C}^{(i)}\times_1 U_1\times_2\cdots\times_d U_d\|_F^2 + \alpha_1\sum_{i=1}^{N}\|\mathbf{C}^{(i)}\|_p^p + \alpha_2\sum_{j=1}^{d}\|V_j\|_q^q + \frac{\mu}{2}\sum_{j=1}^{d}\|Y_j\|^2 \\[2pt]
\mathrm{s.t.} & \mathbf{C}^{(i)}\in\mathbb{R}^{m_1\times\cdots\times m_d},\ i=1,\ldots,N, \\[2pt]
& U_j\in\mathbb{R}^{n_j\times m_j},\ U_j^\top U_j = I,\ j=1,\ldots,d, \qquad (2.82)\\[2pt]
& U_j - V_j + Y_j = 0,\ j=1,\ldots,d.
\end{array}
\]

Note that, although our model and the ones in [55, 88] bear similar names, they are totally different and in fact unrelated models.

We denote by \(\mathbf{T}^{(i)}_{(j)}\) the mode-\(j\) unfolding of the tensor \(\mathbf{T}^{(i)}\). (For readers who are not familiar with tensor unfolding operations, we refer to [52] for more information.) We denote by \(\mathcal{C}\) the set of all core tensors \(\{\mathbf{C}^{(i)} : i=1,\ldots,N\}\). The augmented Lagrangian function of (2.82) is

\[
\begin{aligned}
L_\beta(\mathcal{C},U,V,Y,\Lambda) ={}& \sum_{i=1}^{N}\|\mathbf{T}^{(i)} - \mathbf{C}^{(i)}\times_1 U_1\times_2\cdots\times_d U_d\|_F^2 + \alpha_1\sum_{i=1}^{N}\|\mathbf{C}^{(i)}\|_p^p + \alpha_2\sum_{j=1}^{d}\|V_j\|_q^q \\
&+ \frac{\mu}{2}\sum_{j=1}^{d}\|Y_j\|^2 - \sum_{j=1}^{d}\langle U_j-V_j+Y_j,\ \Lambda_j\rangle + \frac{\beta}{2}\sum_{j=1}^{d}\|U_j-V_j+Y_j\|_F^2,
\end{aligned}
\]
where the tensor \(\ell_p\) and \(\ell_q\) norms are taken element-wise; namely, they are equivalent to the usual vector \(\ell_p\) and \(\ell_q\) norms if a tensor is regarded as a vector. An implementation of Algorithm 1 for solving (2.82) is shown below as Algorithm 5.

In Step 1 of Algorithm 5, the subproblem to be solved is
\[
U_j = \mathop{\mathrm{argmin}}_{U^\top U=I} -\langle 2B, U\rangle = \mathop{\mathrm{argmin}}_{U^\top U=I} \|B-U\|_F^2, \qquad (2.83)
\]


Algorithm 5: A typical iteration of Algorithm 1 for solving (2.82)

[Step 1] for \(j = 1,\ldots,d\) do
    Set \(B = \sum_{i=1}^{N}\mathbf{T}^{(i)}_{(j)}\big(U_d\otimes\cdots\otimes U_{j+1}\otimes U_{j-1}\otimes\cdots\otimes U_1\big)\big(\mathbf{C}^{(i)}_{(j)}\big)^\top + \frac{1}{2}\Lambda_j - \frac{\beta}{2}Y_j + \frac{\beta}{2}V_j + \frac{\sigma}{2}U_j\)
    \(U_j \leftarrow \mathop{\mathrm{argmin}}_{U^\top U=I}\ -\langle 2B, U\rangle\)

[Step 2] for \(j = 1,\ldots,d\) do
    For each component \(V_j(s)\), where \(s=(s_1,s_2)\) is a multilinear index,
    set \(b = \beta Y_j(s) + \beta U_j(s) - \Lambda_j(s) + \sigma V_j(s)\),
    \(V_j(s) \leftarrow \mathop{\mathrm{argmin}}_x\ \frac{\beta+\sigma}{2}x^2 + \alpha_2|x|^q - bx\)

[Step 3] for \(i = 1,\ldots,N\) do
    For each component \(\mathbf{C}^{(i)}(s)\), where \(s=(s_1,\ldots,s_d)\) is a multilinear index,
    set \(b = \sigma\mathbf{C}^{(i)}(s) - 2\big[(U_d^\top\otimes\cdots\otimes U_1^\top)\mathrm{vec}(\mathbf{T}^{(i)})\big](s)\),
    \(\mathbf{C}^{(i)}(s) \leftarrow \mathop{\mathrm{argmin}}_x\ \frac{2+\sigma}{2}x^2 + \alpha_1|x|^q - bx\)

[Step 4] for \(j = 1,\ldots,d\) do
    \(Y_j \leftarrow Y_j - \eta\big[(\beta+\mu)Y_j + \beta U_j - \beta V_j - \Lambda_j\big]\)

[Step 5] for \(j = 1,\ldots,d\) do
    \(\Lambda_j \leftarrow \Lambda_j - \beta\,(U_j - V_j + Y_j)\)


which is known as the nearest orthogonal matrix problem. Suppose the SVD of the matrix \(B\) is \(B = Q\Sigma P^\top\); then the globally optimal solution is \(U_j = QP^\top\). When \(B\) has full column rank, the solution is also unique.
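The SVD-based closed-form solution of (2.83) can be sketched as follows (NumPy; the function name is our choice):

```python
import numpy as np

def nearest_orthogonal(B):
    """Solve argmin_{U : U^T U = I} ||B - U||_F^2 via the SVD B = Q Sigma P^T."""
    Q, _, Pt = np.linalg.svd(B, full_matrices=False)  # thin SVD
    return Q @ Pt                                     # U = Q P^T

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
U = nearest_orthogonal(B)
# U has orthonormal columns: U^T U = I_3 up to rounding error.
```

A useful sanity check: the optimal value of \(\langle B, U\rangle\) over orthonormal \(U\) equals the sum of the singular values of \(B\), which the returned \(U\) attains.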

Steps 2 and 3 of Algorithm 5 actually solve a group of decoupled one-dimensional problems. Since no nonnegativity constraints are imposed, we can apply \(\ell_1\) regularization, for which soft-thresholding gives a closed-form solution to the subproblems. However, if we want to apply \(\ell_q\) regularization with \(0<q<1\), then the subproblem amounts to solving
\[
\min\ f(x) = ax^2 + bx + c|x|^q, \qquad (2.84)
\]
where \(0<q<1\), \(a>0\), \(c>0\). The function is nonconvex and nonsmooth at 0, with \(f(0)=0\). For \(x>0\), we can set the derivative to zero and obtain \(2ax + qcx^{q-1} + b = 0\), or equivalently
\[
2ax^{2-q} + bx^{1-q} + cq = 0.
\]
If \(q=\frac{1}{2}\), then setting \(z=\sqrt{x}\) leads to \(2az^3 + bz + cq = 0\). If \(q=\frac{2}{3}\), then setting \(z=x^{\frac{1}{3}}\) leads to \(2az^4 + bz + cq = 0\). In both cases, we have closed-form solutions. The same trick applies to the case \(x<0\). Suppose we find the roots \(x_1,\ldots,x_k\) and set \(x_0=0\); then the solution of (2.84) is \(x_{i^*}\) with \(i^* = \mathop{\mathrm{argmin}}_{0\le j\le k} f(x_j)\).
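For instance, for \(q=1/2\) the candidate minimizers of (2.84) can be enumerated from the cubic in \(z=\sqrt{x}\) above. A small numerical sketch (ours; the function name and tolerances are our choices) compares all real stationary points against \(x=0\):

```python
import numpy as np

def solve_scalar_lq(a, b, c, q=0.5):
    """Globally minimize f(x) = a*x^2 + b*x + c*|x|^q for q = 1/2 (a, c > 0)."""
    f = lambda x: a * x**2 + b * x + c * abs(x) ** q
    candidates = [0.0]                         # f is nonsmooth at 0, with f(0) = 0
    # x > 0 branch: substitute z = sqrt(x) in 2a x^{3/2} + b x^{1/2} + cq = 0.
    for z in np.roots([2 * a, 0.0, b, c * q]):
        if abs(z.imag) < 1e-10 and z.real > 0:
            candidates.append(z.real ** 2)
    # x < 0 branch: with y = -x > 0, stationarity reads 2a y^{3/2} - b y^{1/2} + cq = 0.
    for z in np.roots([2 * a, 0.0, -b, c * q]):
        if abs(z.imag) < 1e-10 and z.real > 0:
            candidates.append(-(z.real ** 2))
    return min(candidates, key=f)
```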

Remark 2.5.2 The \(\ell_q\) regularization is not locally Lipschitz at 0 when \(0<q<1\), which might cause problems. However, if we replace \(|x|^q\) with \(\min\{|x|^q, B|x|\}\) for \(B\gg 0\), then the new regularization is locally Lipschitz on \(\mathbb{R}\), and it differs from the original function only on \(\big(-B^{-1/(1-q)}, +B^{-1/(1-q)}\big)\). The closed-form solution can still be obtained by comparing the objective values at \(x_1^* = \mathop{\mathrm{argmin}}_x ax^2+bx+c|x|^q\) and \(x_2^* = \mathop{\mathrm{argmin}}_x ax^2+bx+cB|x| = \big(\frac{-cB-b}{2a}\big)_+\). Actually, due to the limited machine precision, the window \(\big(-B^{-1/(1-q)}, +B^{-1/(1-q)}\big)\) shrinks to the single point 0 when \(B\) is sufficiently large. Since this causes no numerical difficulties, we can simply handle \(\ell_q\) penalties by replacing them with this modified version.


2.5.3 The community detection problem

Given an undirected network, the community detection problem aims to identify the clusters, in other words the communities, of the network; see for example [17, 46, 106, 103]. A possible way to solve this problem is via symmetric orthogonal nonnegative matrix approximation. Suppose the adjacency matrix of the network is \(A\); then the method aims to solve
\[
\min_{X\in\mathbb{R}^{n\times k}} \|A-XX^\top\|_F^2, \quad \mathrm{s.t.}\ X^\top X = I_{k\times k},\ X\ge 0, \qquad (2.85)
\]
where \(n\) equals the number of nodes and \(k\) equals the number of communities. When the network is connected, the orthogonality and nonnegativity of the optimal solution \(X^*\) imply that there is exactly one positive entry in each row of \(X^*\). Therefore we can reconstruct the community structure by letting node \(i\) belong to community \(j\) if \(X^*_{ij}>0\).

In our framework, this problem can be naturally formulated as
\[
\begin{array}{ll}
\min_{X,Y,Z\in\mathbb{R}^{n\times k}} & \|A-XX^\top\|_F^2 + \frac{\mu}{2}\|Z\|_F^2 \\[2pt]
\mathrm{s.t.} & X^\top X = I_{k\times k},\ Y\ge 0, \qquad (2.86)\\[2pt]
& X - Y + Z = 0,
\end{array}
\]
where the orthogonal \(X\) is coupled to the nonnegative \(Y\), while a slack variable \(Z\) is added so that they do not need to be exactly equal. In the implementation of Algorithm 2, two subproblems, for the blocks \(X\) and \(Y\), need to be solved. For the orthogonal block \(X\), the subproblem is still of the form (2.83). For the nonnegative block \(Y\), the subproblem can be formulated as
\[
Y^* = \mathop{\mathrm{argmin}}_{Y\ge 0}\|Y-B\|_F^2 = B_+, \qquad (2.87)
\]
for some matrix \(B\), where \(B_+ = \max\{B, 0\}\) with the max taken elementwise.


2.6 Numerical Results

Throughout the numerical experiments in this section, the algorithmic parameters \(\gamma\), \(\beta\) and \(H_i=\sigma I\) are set by the following procedure. First, choose an estimate of \(L>0\). Second, according to Theorems 2.3.7 and 2.3.9, set
\[
\beta := 3L > 2.860L, \qquad \sigma := \frac{12L^2}{\beta} = \frac{6L^2}{\beta} + \frac{6L^2}{3L} > \max\Big\{\frac{6L^2}{\beta},\ \frac{6L^2}{\beta}+L\Big\}.
\]
Denoting
\[
\gamma_{1,2} = \frac{12}{13\beta \pm \sqrt{13\beta^2 - 12\beta L - 72L^2}}, \qquad \gamma_1 < \gamma_2,
\]
we set \(\gamma := \frac{\gamma_1+\gamma_2}{2}\in(\gamma_1,\gamma_2)\). Note that these formulas are fixed without any tuning; we only tune the estimate of the Lipschitz parameter \(L\), which is selected from the list \(\{0.01, 0.1, 1, 10, 100\}\).
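This recipe is easy to check numerically; the sketch below (ours) reproduces the reported settings, e.g. \(L=0.1\) gives \(\beta=0.3\), \(\sigma=0.4\) and \(\gamma\approx 3.09\), matching the values used later for the maximum bisection experiments.

```python
import math

def set_parameters(L):
    """Parameter recipe of Section 2.6: derive beta, sigma, gamma from an estimate of L."""
    beta = 3.0 * L                       # beta := 3L > 2.860L
    sigma = 12.0 * L**2 / beta           # sigma := 12L^2 / beta
    disc = math.sqrt(13.0 * beta**2 - 12.0 * beta * L - 72.0 * L**2)
    gamma1 = 12.0 / (13.0 * beta + disc)
    gamma2 = 12.0 / (13.0 * beta - disc)
    return beta, sigma, 0.5 * (gamma1 + gamma2)

beta, sigma, gamma = set_parameters(0.1)
# beta = 0.3, sigma = 0.4, gamma ≈ 3.095
```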

2.6.1 The maximum bisection problem

We apply our proposed algorithm to the following variant of the maximum bisection problem:
\[
\begin{array}{ll}
\min_{U,z,x} & \langle W, UU^\top\rangle + \frac{\mu}{2}\|z\|^2 \\[2pt]
\mathrm{s.t.} & \|u_i\|_2 = 1,\ u_i\ge 0,\ \text{for } i=1,\ldots,n, \\[2pt]
& \sum_{i=1}^{n} u_i - x\mathbf{1} + z = 0, \\[2pt]
& z\in\mathbb{R}^2 \text{ free},\quad \frac{n}{2}-\nu \le x \le \frac{n}{2}+\nu,
\end{array}
\]
where \(\nu\ge 0\) is a parameter that controls the tightness of the relaxation. In our experiments, we set \(\nu=1\). We choose five graphs from the maximum cut library Biq Mac [91] to test our algorithm, with the specifications given in Table 2.1.

Network  | g05_60.0 | g05_80.0 | g05_100.0 | pw01_100.0 | pw09_100.0
# nodes  | 60       | 80       | 100       | 100        | 100
# edges  | 885      | 1580     | 2475      | 495        | 4455

Table 2.1: The test graph information.


For the three tested algorithms, we denote the SDP relaxation proposed by Frieze et al. in [28] by SDP-F, the SDP relaxation proposed by Ye in [100] by SDP-Y, and our low-rank relaxation by LR. The SDP relaxations are solved by the interior point method embedded in CVX [34]. To solve the problem by our proposed Algorithm 1, the parameters are tuned to \(\mu=0.01\), \(\beta=0.3\), \(\gamma=3.09\), \(H_i=0.4I\). For all cases, the number of iterations is set to 50. For each graph, each algorithm is run 20 times and we compare the average cut values. The results are reported in Table 2.2.

Network    | avg LR cut | SD       | avg SDP-Y cut | ratio1 | avg SDP-F cut | ratio2
g05_60.0   | 1050.7     | 12.2479  | 1021.5        | 1.0286 | 1045.2        | 1.0053
g05_80.0   | 1819.7     | 18.9767  | 1791.9        | 1.0155 | 1804.5        | 1.0084
g05_100.0  | 2813.2     | 17.7278  | 2764.6        | 1.0176 | 2792.3        | 1.0075
pw01_100.0 | 3872.4     | 61.3415  | 3728.9        | 1.0385 | 3867.3        | 1.0013
pw09_100.0 | 26871      | 119.6812 | 26633         | 1.0090 | 26838         | 1.0012

Table 2.2: The column SD contains the standard deviations of the LR cut values over the 20 rounds; ratio1 = avg LR cut / avg SDP-Y cut, and ratio2 = avg LR cut / avg SDP-F cut.

It is interesting to see that in all tested cases, under the best tuned parameters, our proposed relaxation solved by Algorithm 1 outperforms the two SDP relaxations in [28, 100]. Moreover, our method is a first-order method, so it naturally enjoys computational advantages over the interior-point based methods used to solve the SDP relaxations.

Finally, in this application we test the performance of Algorithm 2 by comparing it to Algorithm 1. We keep the parameters \(\mu,\beta,\gamma,\nu\) unchanged for Algorithm 2, but we reset \(H_i=\sigma I\) according to its new bound in Theorem 2.3.9. For each graph, 20 instances are tested, and 50 iterations are performed for each algorithm. The objective measured is \(\langle W, UU^\top\rangle\). The results are shown in Table 2.3; it can be observed that in this case Algorithm 2 behaves similarly to Algorithm 1.

This result not only suggests that our relaxation may be a better model in practice, but also establishes the validity of our proposed algorithm.

           | Algorithm 1          | Algorithm 2
Network    | avg obj  | SD        | avg obj   | SD
g05_60.0   | 713      | 12.0462   | 719.6000  | 16.7279
g05_80.0   | 1333.6   | 15.1085   | 1332      | 10.4987
g05_100.0  | 2132.8   | 24.2249   | 2136.4    | 20.4135
pw01_100.0 | 1569.8   | 51.6006   | 1557.6    | 56.3860
pw09_100.0 | 22361    | 91.4389   | 22293     | 107.1041

Table 2.3: Numerical performance of Algorithm 2 for problem (2.79).

2.6.2 The \(\ell_q\)-regularized sparse tensor PCA

In this experiment, we synthesize a set of ground-truth Tucker-format tensors \(\mathbf{T}^{(i)}_{\mathrm{true}} = \mathbf{C}^{(i)}\times_1 U_1\times_2\cdots\times_d U_d\), where all the \(\mathbf{T}^{(i)}_{\mathrm{true}}\)'s share the same factors \(U_j\) while having different cores \(\mathbf{C}^{(i)}\). We test our methods in two cases: the first set of tensors has mode sizes \(30\times 30\times 30\) and core mode sizes \(5\times 5\times 5\); the second set has mode sizes \(42\times 42\times 42\) and core mode sizes \(7\times 7\times 7\). For both cases, we generate 100 instances. We add a componentwise Gaussian white noise \(\mathbf{T}^{(i)}_{\mathrm{noise}}\) with standard deviation 0.001 to each tensor; namely, the input data are \(\mathbf{T}^{(i)} = \mathbf{T}^{(i)}_{\mathrm{true}} + \mathbf{T}^{(i)}_{\mathrm{noise}}\), \(i=1,\ldots,100\). For all cases, the core elements are generated from the uniform distribution on \([-1,1]\). The sparsity level of each core \(\mathbf{C}^{(i)}\) is set to 0.3, i.e., we randomly set 70% of the elements of each core to zero. Finally, the orthogonal factors \(U_i\) are generated with sparsity level 1/6.

To solve (2.82), we use \(\ell_{2/3}\) penalties for the cores and \(\ell_1\) penalties for the factors; that is, \(q=2/3\) and \(p=1\) in (2.82). The sparse penalty parameters are set to \(\alpha_1=0.1\) and \(\alpha_2=0.01\). We set \(\mu=10^{-6}\), \(\beta=3\), \(\gamma=0.3089\), \(H_i=4I\).

Our numerical results show that it is indeed necessary to use different regularizations for the cores and the factors. In the output, the matrices \(U_i\) are definitely not sparse, but have plenty of entries very close to 0. The output \(V_i\)'s are very sparse but not orthogonal. We construct the final output from \(U_i\) by zeroing out all entries with absolute value less than 0.001; the resulting matrices \(U_i\) are then sparse and almost orthogonal. Finally, the relative error is measured using \(U_i\) and the underlying true tensors, i.e.,
\[
\frac{\sum_{i=1}^{100}\|\mathbf{T}^{(i)}_{\mathrm{true}}-\mathbf{T}^{(i)}_{\mathrm{out}}\|^2}{\sum_{i=1}^{100}\|\mathbf{T}^{(i)}_{\mathrm{true}}\|^2},
\]
where the \(\mathbf{T}^{(i)}_{\mathrm{out}}\)'s are constructed from the output of the algorithms. The orthogonality violation is measured by \(\frac{1}{3}\sum_{i=1}^{3}\|U_i^\top U_i - I\|_F\). In both cases, the iteration number is set to 100. For each case, the results are averaged over 20 randomly generated instances and reported in Table 2.4. The columns err1, SD, err2, spars1, and spars2 denote the average objective relative error, the standard deviation of the objective relative errors, the average orthogonality constraint violation, the average core sparsity level, and the average factor sparsity level, respectively.

       30×30×30, core 5×5×5                    |        42×42×42, core 7×7×7
err1  | SD    | err2       | spars1 | spars2   | err1  | SD     | err2   | spars1 | spars2
0.041 | 0.011 | 2.78×10⁻¹⁴ | 0.58   | 1/6      | 0.044 | 6×10⁻⁴ | 4×10⁻⁵ | 0.62   | 0.1663

Table 2.4: Numerical performance of Algorithm 1 for problem (2.82).

2.6.3 The community detection problem

For this problem, we test our algorithm on three real-world social networks with ground-truth information: the American political blogs network with 1222 nodes and 2 communities specified by political leaning, the Caltech Facebook network with 597 nodes and 8 communities specified by dorm number, and the Simmons College Facebook network with 1168 nodes and 4 communities specified by graduation year. Note that (2.86) is a very simple model, so we do not compare it with more sophisticated models such as [17, 103]. Instead, it is compared with the state-of-the-art spectral methods SCORE [46] and OCCAM [106].


In all tests on the three networks, the parameter \(\mu\) is set to 50 and \(L\) is set to 100. The other parameters are set to \(\beta=300\), \(\gamma=0.031\), \(H_i=400I\). For each network, every algorithm is run 40 times and the average misclassification rate is reported in Table 2.5, where the misclassification rate equals the number of misclustered nodes divided by the total number of nodes.

Network Name | Algorithm 2 | SCORE  | OCCAM
Polblogs     | 5.07%       | 4.75%  | 4.91%
Caltech      | 21.47%      | 30.03% | 35.52%
Simmons      | 19.44%      | 22.54% | 23.92%

Table 2.5: Numerical performance (misclassification rate) of Algorithm 2 for problem (2.86).

It can be observed from the numerical results that Algorithm 2 yields the best result on the Caltech and Simmons College networks, and is only slightly outperformed on the political blogs network, which shows the effectiveness of our method for this problem.


Chapter 3

Adaptive Stochastic Variance Reduction for Subsampled Newton Method with Cubic Regularization

The cubic regularized Newton method of Nesterov and Polyak has become increasingly popular for nonconvex optimization because of its capability of finding an approximate local solution with second-order guarantees. Several recent works extended this method to the setting of minimizing the average of \(N\) smooth functions by replacing the exact gradients and Hessians with subsampled approximations. It has been shown that the total Hessian sample complexity can be reduced to be sublinear in \(N\) per iteration by leveraging stochastic variance reduction techniques. We present an adaptive variance reduction scheme for the subsampled Newton method with cubic regularization, and show that the expected Hessian sample complexity is \(O(N + N^{2/3}\epsilon^{-3/2})\) for finding an \((\epsilon,\sqrt{\epsilon})\)-approximate local solution (in terms of first- and second-order guarantees, respectively). Moreover, we show that the same Hessian sample complexity is retained with fixed sample sizes if exact gradients are used. The techniques of our analysis are different from those of previous works in that we do not rely on high-probability bounds based on matrix concentration inequalities. Instead, we derive and utilize bounds on the 3rd and 4th order moments of the average of random matrices, which are of independent interest on their own.

3.1 The Problem Formulation

We consider the problem of minimizing the average of a large number of loss functions:
\[
\min_{x\in\mathbb{R}^d}\ F(x) := \frac{1}{N}\sum_{i=1}^{N} f_i(x), \qquad (3.1)
\]

where each \(f_i : \mathbb{R}^d\to\mathbb{R}\) is smooth but may be nonconvex. Such problems often arise in machine learning applications, where each \(f_i\) is the loss function associated with a training example and the number of training examples \(N\) can be very large. The necessary conditions for a point \(x^*\) to be a local minimum of \(F\) are
\[
\nabla F(x^*) = 0, \qquad \nabla^2 F(x^*) \succeq 0.
\]

Our goal is to find an approximate local solution \(x\) that satisfies
\[
\|\nabla F(x)\| \le \epsilon, \qquad \lambda_{\min}\big(\nabla^2 F(x)\big) \ge -\sqrt{\epsilon}, \qquad (3.2)
\]
where \(\epsilon\) is a desired tolerance and \(\lambda_{\min}(\cdot)\) denotes the smallest eigenvalue of a symmetric matrix.

For minimizing a general smooth nonconvex function \(F\), Nesterov and Polyak [63] introduced a modified Newton method with cubic regularization (CR). Specifically, each iteration of the CR method consists of the following updates:
\[
\begin{aligned}
\xi_k &= \mathop{\mathrm{argmin}}_{\xi}\ \Big\{\xi^\top g_k + \frac{1}{2}\xi^\top H_k\xi + \frac{\sigma}{6}\|\xi\|^3\Big\}, \qquad (3.3)\\
x_{k+1} &= x_k + \xi_k,
\end{aligned}
\]
where
\[
g_k = \nabla F(x_k), \qquad H_k = \nabla^2 F(x_k). \qquad (3.4)
\]
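To make the update concrete, here is a small sketch (ours, not from the dissertation) of the CR subproblem (3.3) solved approximately by gradient descent on the cubic model; a practical implementation would rather use the Lanczos-type solvers mentioned later.

```python
import numpy as np

def cubic_model_grad(xi, g, H, sigma):
    """Gradient of the cubic model m(xi) = xi^T g + 0.5 xi^T H xi + (sigma/6)||xi||^3."""
    return g + H @ xi + 0.5 * sigma * np.linalg.norm(xi) * xi

def cr_step(g, H, sigma, lr=0.01, iters=2000):
    """Approximately solve the CR subproblem (3.3) by gradient descent."""
    # start from a small step along -g, plus a tiny perturbation to leave xi = 0
    xi = -lr * g + 1e-6 * np.random.default_rng(0).standard_normal(g.shape)
    for _ in range(iters):
        xi -= lr * cubic_model_grad(xi, g, H, sigma)
    return xi
```

For \(\sigma=0\) and a positive definite \(H\), the model is a convex quadratic and the iterate converges to \(-H^{-1}g\), which gives an easy correctness check.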


Assuming the Hessian \(\nabla^2 F\) to be Lipschitz continuous, it is shown in [63] that the CR method finds an approximate solution satisfying (3.2) within \(O(\epsilon^{-3/2})\) iterations. This is better than purely gradient-based methods, which need \(O(\epsilon^{-2})\) iterations to reach a point \(x\) satisfying \(\|\nabla F(x)\|\le\epsilon\) [62, Section 1.2.3]. However, the computational cost per iteration of CR can be much higher than that of gradient-based methods.

Much recent effort has been devoted to improving the efficiency of CR by exploiting the finite-sum structure in (3.1); see, e.g., [12, 13, 51, 16, 4, 93, 107, 90, 77]. A natural approach is to replace \(\nabla F(x_k)\) and \(\nabla^2 F(x_k)\) by subsampled approximations:

approach is to replace ∇F (xk) and ∇2F (xk) by subsampled approximations:

gk =1

|Sk|∑i∈Sk

∇fi(xk), (3.5)

Hk =1

|Bk|∑i∈Bk

∇2fi(xk), (3.6)

where \(\mathcal{S}_k,\mathcal{B}_k\subseteq\{1,\ldots,N\}\) are two sets (or multisets, for sampling with replacement) of random indices at the \(k\)-th iteration. The cost of computing the Hessians \(\nabla^2 f_i\) usually dominates that of the gradients. Moreover, the cost of solving the CR subproblem (3.3) may grow quickly as the batch size \(|\mathcal{B}_k|\) increases, especially when using iterative methods such as gradient descent or the Lanczos method [12, 7, 11, 83, 4]. Therefore, an important measure of efficiency is the number of second-order oracle calls for \(\nabla^2 f_i\), i.e., the Hessian sample complexity.

In this chapter, we develop an adaptive subsampling CR method that requires \(O(N+N^{2/3}\epsilon^{-3/2})\) second-order oracle calls in expectation. Assuming that \(\epsilon\) is small enough, we often simply refer to this as \(O(N^{2/3}\epsilon^{-3/2})\). Notice that using the choices in (3.4) would require \(O(N\epsilon^{-3/2})\) Hessian samples; the proposed method is thus a significant improvement, especially when \(N\) is very large.
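The subsampled estimators (3.5) and (3.6) are simple averages over random index sets; a minimal sketch (ours, with toy per-sample oracles) reads:

```python
import numpy as np

def subsampled_grad_hess(x, grad_i, hess_i, N, sg, sb, rng):
    """Subsampled gradient (3.5) and Hessian (3.6): averages over random index sets."""
    S = rng.choice(N, size=sg, replace=False)   # gradient sample set S_k
    B = rng.choice(N, size=sb, replace=False)   # Hessian sample set B_k
    g = np.mean([grad_i(i, x) for i in S], axis=0)
    H = np.mean([hess_i(i, x) for i in B], axis=0)
    return g, H

# Toy check with f_i(x) = 0.5 * a_i * ||x||^2, so grad_i = a_i x and hess_i = a_i I.
rng = np.random.default_rng(1)
a = np.arange(1.0, 5.0)                          # N = 4 component losses
x0 = np.array([1.0, 2.0])
g, H = subsampled_grad_hess(
    x0, lambda i, x: a[i] * x, lambda i, x: a[i] * np.eye(len(x)), 4, 4, 4, rng
)
# Full-batch sampling (|S_k| = |B_k| = N) recovers the exact gradient and Hessian.
```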


3.1.1 Related works

It is shown in [12] that the order of the convergence rate of the CR method remains the same as long as \(g_k\) and \(H_k\) in (3.3) satisfy
\[
\|g_k - \nabla F(x_k)\| \le C_1\|\xi_k\|^2, \qquad (3.7)
\]
\[
\|H_k - \nabla^2 F(x_k)\| \le C_2\|\xi_k\|, \qquad (3.8)
\]

with \(\xi_k\) defined in (3.3) and \(C_1, C_2\) being positive constants. Here \(\|\cdot\|\) denotes the Euclidean norm for vectors and the spectral norm for matrices. In order to exploit the averaging structure in (3.1), a subsampled cubic regularization (SCR) method was proposed in [51], where \(g_k\) and \(H_k\) are calculated as in (3.5) and (3.6) (here we omit additional accept/reject steps based on trust-region methods). Matrix concentration inequalities are used in [51] to derive appropriate sample sizes \(|\mathcal{S}_k|\) and \(|\mathcal{B}_k|\) such that the conditions (3.7) and (3.8) hold with high probability. In particular, the matrix Bernstein inequality (e.g., [84, 51]) implies that with probability at least \(1-\delta\),
\[
\|H_k - \nabla^2 F(x_k)\| \le 4L\sqrt{\frac{\log(2d/\delta)}{|\mathcal{B}_k|}}, \qquad (3.9)
\]

where \(L\) is a uniform Lipschitz constant of \(\nabla f_i\) for all \(i\). Therefore, if we bound the right-hand side above by \(C_2\|\xi_k\|\), then (3.8) holds with probability at least \(1-\delta\) provided that
\[
|\mathcal{B}_k| \ge \frac{16L^2\log(2d/\delta)}{(C_2\|\xi_k\|)^2}. \qquad (3.10)
\]
A similar condition on \(|\mathcal{S}_k|\) is also derived in [51]. The overall gradient and Hessian sample complexities of SCR, with or without replacement, are summarized in Table 3.1 (based on the analysis in [51, 93, 90]). When \(\epsilon\le 1/N\), SCR can be much worse than the deterministic CR method.

In order to further reduce the Hessian sample complexity, several recent works [90, 107, 16] combine CR with stochastic variance reduction techniques. The stochastic variance-reduced gradient (SVRG) method was first proposed to reduce the gradient sample complexity of randomized first-order algorithms (see [47, 104, 92] for convex optimization and [5, 75] for nonconvex optimization). Two different SVRC (stochastic variance-reduced cubic regularization) methods were proposed in [90] and [107], respectively. Both employ the same variance-reduction technique to reduce the Hessian sample complexity. In particular, [107] incorporates additional second-order corrections in the gradient variance reduction, and therefore obtains a better gradient sample complexity, but with a slightly worse Hessian sample complexity than [90]. See Table 3.1 for a summary of these results. All of these methods require solving \(O(\epsilon^{-3/2})\) cubic regularized subproblems (3.3).

Algorithm   | Replacement | Gradient Samples                                           | Hessian Samples                          | Convergence Type
CR [63]     | all samples | \(O\big(\frac{N}{\epsilon^{3/2}}\big)\)                    | \(O\big(\frac{N}{\epsilon^{3/2}}\big)\)  | Deterministic
SCR [51]    | with        | \(O\big(\frac{1}{\epsilon^{7/2}}\big)\)                    | \(O\big(\frac{1}{\epsilon^{5/2}}\big)\)  | w.p. \(1-\delta_0\)
SCR [51]    | without     | \(O\big(\frac{\min\{N,\epsilon^{-2}\}}{\epsilon^{3/2}}\big)\) | \(O\big(\frac{\min\{N,\epsilon^{-1}\}}{\epsilon^{3/2}}\big)\) | w.p. \(1-\delta_0\)
SVRC [90]   | without     | Not provided                                               | \(O\big(N+\frac{N^{2/3}}{\epsilon^{3/2}}\big)\) | w.p. \(1-\delta_0\)
SVRC [107]  | with        | \(O\big(N+\frac{N^{4/5}}{\epsilon^{3/2}}\big)\)            | \(O\big(N+\frac{N^{4/5}}{\epsilon^{3/2}}\big)\) | In expectation
This paper  | without     | \(O\big(\frac{N+N^{2/3}\min\{N^{1/3},\epsilon^{-1}\}}{\epsilon^{3/2}}\big)\) | \(O\big(N+\frac{N^{2/3}}{\epsilon^{3/2}}\big)\) | In expectation
This paper  | with        | \(O\big(N+N^{2/3}\epsilon^{-5/2}\big)\)                    | \(O\big(N+\frac{N^{2/3}}{\epsilon^{3/2}}\big)\) | In expectation

Table 3.1: Comparison of gradient and Hessian sample complexities. The \(O(\cdot)\) notation hides poly-logarithmic terms such as \(\log(d/\epsilon\delta_0)\) for [51, 90] and \(\log(d)\) for [107].

Subsampled Newton methods without CR have been studied in, e.g., [27, 79, 94, 99], but

their convergence rates are worse than the ones that are based on CR.

Among the works on CR with stochastic variance reduction, most of them (e.g., [51, 90, 16])


rely on matrix concentration bounds such as (3.10) to set the sample sizes |B_k| and |S_k| according to ‖ξ_k‖ at each iteration k. The problem is that ξ_k is the solution to the CR sub-problem in (3.3), where g_k and H_k need to be obtained from the |S_k| and |B_k| samples in the first place. While one can assume that bounds like (3.10) hold in the complexity analysis, they do not yield practically implementable variance-reduction schemes.

In contrast, the sample sizes in [107] are set as constants across all iterations, depending only on N and log(d). The disadvantage of this approach is that it can be very conservative. According to concentration bounds such as (3.10), the sample sizes in the initial stage of the algorithm (when ‖ξ_k‖ is large) can be set much smaller than those required in the later stage (when ‖ξ_k‖ is very small). Therefore, with a constant sample size, many of the samples drawn in the initial stage are wasted.

There is another technicality in using matrix concentration bounds: we need inequalities such as (3.9) to hold for all iterations with high probability, say with probability at least 1 − δ₀. Then the per-iteration probability margin δ and δ₀ should satisfy (1 − δ)^T ≥ 1 − δ₀, where T = O(ε^{−3/2}) is the number of iterations for CR methods. Therefore δ = O(δ₀ε^{3/2}), and there is an additional O(log(d/εδ₀)) factor in the sampling complexities (see Table 3.1). The results in [107] are for convergence in expectation; nevertheless, they still have a log(d) factor and unusually large constants in their bounds.

3.2 Adaptive variance reduction for cubic regularization

In this section we first present the adaptive SVRC method, then analyze its convergence

rate and Hessian sample complexity. In particular, the gradient sample size |Sk| and

Hessian sample size |Bk| are chosen adaptively in each iteration to ensure the following

conditions hold in expectation (conditioned on x_k):

‖g_k − ∇F(x_k)‖ ≤ C′₁‖ξ_{k−1}‖²,  (3.11)
‖H_k − ∇²F(x_k)‖ ≤ C′₂‖ξ_{k−1}‖,  (3.12)


where C′₁ and C′₂ are some positive constants. The major difference from (3.7) and (3.8) is that here ‖ξ_{k−1}‖ is a known quantity, conditioned on x_k, before the sample sizes used to form the approximations g_k and H_k are chosen. Indeed, we choose |S_k| and |B_k| based on ‖ξ_{k−1}‖.

Such an adaptive scheme is readily implementable in practice.

The adaptive SVRC method is a multi-stage iterative method as described in Algorithm 6.

At the beginning of each stage k ≥ 1, we compute the full gradient gk−1 and the full

Hessian Hk−1 at xk−1, which is the result of the previous stage (for k = 1, x0 is given as

input). Each stage k has an inner loop of length m and the variables used in the inner

loop are indexed with a subscript t = 0, . . . ,m. At the beginning of each inner loop, we

set x^k_0 = x^{k−1} and ξ^k_{−1} = 0. During each inner iteration, we randomly sample two sets of indices S^k_t and B^k_t satisfying

|S^k_t| ≥ ‖x^k_t − x^{k−1}‖²/ε_g,  |B^k_t| ≥ ‖x^k_t − x^{k−1}‖²/ε_H,  (3.13)

where ε_g and ε_H are determined by ξ^k_{t−1} and the desired tolerance ε:

ε_g = max{‖ξ^k_{t−1}‖⁴, ε²},  ε_H = max{‖ξ^k_{t−1}‖², ε}.

Then we compute the subsampled gradient and Hessian as

g^k_t = (1/|S^k_t|) Σ_{i∈S^k_t} (∇f_i(x^k_t) − ∇f_i(x^{k−1})) + g^{k−1},  (3.14)

H^k_t = (1/|B^k_t|) Σ_{i∈B^k_t} (∇²f_i(x^k_t) − ∇²f_i(x^{k−1})) + H^{k−1}.  (3.15)

This construction follows the stochastic variance reduction scheme proposed in [47], which

has been adopted by recent works on subsampled Newton methods with cubic regularization

[83, 90, 107].
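A minimal sketch of the estimators (3.14)–(3.15) follows; the component oracles `grad_f(i, x)` and `hess_f(i, x)` and the toy quadratic finite sum are illustrative stand-ins, not part of the algorithm specification:

```python
import numpy as np

def svrc_estimates(grad_f, hess_f, x, x_ref, g_ref, H_ref, S_idx, B_idx):
    """Variance-reduced gradient (3.14) and Hessian (3.15) at x, anchored at x_ref.

    grad_f(i, x) / hess_f(i, x) return the i-th component gradient / Hessian;
    g_ref and H_ref are the full gradient and Hessian at x_ref."""
    g = np.mean([grad_f(i, x) - grad_f(i, x_ref) for i in S_idx], axis=0) + g_ref
    H = np.mean([hess_f(i, x) - hess_f(i, x_ref) for i in B_idx], axis=0) + H_ref
    return g, H

# Toy finite sum F = (1/N) sum_i f_i with f_i(x) = 0.5 x^T A_i x + b_i^T x.
rng = np.random.default_rng(0)
N, d = 8, 3
A_mats = [m @ m.T for m in rng.standard_normal((N, d, d))]
b_vecs = rng.standard_normal((N, d))
grad_f = lambda i, x: A_mats[i] @ x + b_vecs[i]
hess_f = lambda i, x: A_mats[i]

x_ref, x = rng.standard_normal(d), rng.standard_normal(d)
g_ref = np.mean([grad_f(i, x_ref) for i in range(N)], axis=0)
H_ref = np.mean([hess_f(i, x_ref) for i in range(N)], axis=0)

# With full batches S = B = {1,...,N} the estimates are exact, which
# illustrates the unbiasedness property stated later in Lemma 3.2.3.
g, H = svrc_estimates(grad_f, hess_f, x, x_ref, g_ref, H_ref, range(N), range(N))
```

With mini-batches smaller than N, the same construction is unbiased but has a variance that scales with ‖x − x_ref‖², which is exactly what the adaptive batch-size rule exploits.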

We make several remarks regarding the subsampling scheme in Algorithm 6. First, compared with the algorithms in [51, 90], which use large enough sample sizes to ensure (3.7) and (3.8) with high probability, Algorithm 6 aims to ensure (3.11) and (3.12) in expectation. Moreover, instead of depending on ξ^k_t, which is not available until after the current iteration, our sample sizes are determined by ξ^k_{t−1}, which is computed in the previous iteration. Second, compared with the constant sample size used in [107], our adaptive sampling scheme may use far fewer samples in the early stages, when ‖ξ^k_{t−1}‖ is relatively large. Our overall Hessian sample complexity is better than either of these two previous approaches.

An alternative approach that avoids the dependence of the sample sizes on ξ^k_t is to use the full gradient, i.e., let g^k_t = ∇F(x^k_t), and use (3.15) or (3.6) to compute the Hessian approximation [16]. Then condition (3.7) holds automatically, and one can replace (3.8) with

‖H^k_t − ∇²F(x^k_t)‖ ≤ C₂‖∇F(x^k_t)‖,  (3.16)

because it can be shown that ‖ξ^k_t‖ ≤ c‖∇F(x^k_t)‖ for some large enough constant c. Since the full gradient ∇F(x^k_t) can be computed before sampling B^k_t, this condition can be used to determine the sample size |B^k_t|. However, it can also be shown that, roughly, ‖∇F(x^k_t)‖ ≤ O(‖ξ^k_{t−1}‖²) plus some random noise; thus the mini-batch size required for (3.16) to hold can be much larger than that for (3.12).

3.2.1 Convergence analysis

We make the following assumption regarding the objective function in (3.1):

Assumption 3.2.1 The gradient and Hessian of each component function f_i are Lipschitz continuous, i.e., there exist positive constants L and ρ such that for i = 1, …, N and for all x, y ∈ R^d,

‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖,  (3.17)
‖∇²f_i(x) − ∇²f_i(y)‖_F ≤ ρ‖x − y‖.  (3.18)

Consequently, ∇F and ∇²F are L- and ρ-Lipschitz continuous, respectively.

This assumption is very similar to those adopted in subsampled Newton methods with cubic regularization [63, 51, 16, 90, 107]. The only difference is that here we use the Frobenius norm, instead of the spectral norm, for the Hessian smoothness assumption. The advantage of using the Frobenius norm is that it works well with the matrix inner product, which allows us to derive simple bounds on the 3rd and 4th order moments of averages of random matrices. On the other hand, the Lipschitz constant ρ is always larger than the one corresponding to the spectral norm, by up to a factor of √d in the worst case. However, when the Hessians ∇²f_i are low-rank or ill-conditioned, which is often the case in practice, the Lipschitz constants for the different norms are very close.

Algorithm 6: Adaptive SVRC Method.
1  Input: initial point x⁰, inner loop length m, CR parameter σ > 0, and tolerance ε > 0.
2  for k = 1, …, K do
3    Compute g^{k−1} = ∇F(x^{k−1}) and H^{k−1} = ∇²F(x^{k−1}).
4    Assign x^k_0 = x^{k−1} and ξ^k_{−1} = 0.
5    for t = 0, …, m − 1 do
6      Set ε_g = max{‖ξ^k_{t−1}‖⁴, ε²}, ε_H = max{‖ξ^k_{t−1}‖², ε}.
7      Sample index sets S^k_t and B^k_t with |S^k_t| ≥ ‖x^k_t − x^{k−1}‖²/ε_g and |B^k_t| ≥ ‖x^k_t − x^{k−1}‖²/ε_H.
8      Construct g^k_t and H^k_t according to (3.14) and (3.15), respectively.
9      ξ^k_t = argmin_ξ { ξᵀg^k_t + (1/2)ξᵀH^k_t ξ + (σ/6)‖ξ‖³ }.
10     x^k_{t+1} = x^k_t + ξ^k_t.
11   x^k = x^k_m.
12 Output: Option 1: find (k*, t*) = argmin_{1≤k≤K, 0≤t≤m−1} (‖ξ^k_t‖³ + ‖ξ^k_{t−1}‖³) and output x^{k*}_{t*+1}.
13 Option 2: randomly choose k* and t*, and output x^{k*}_{t*+1}.

Before presenting the main results, we first provide a few supporting lemmas. The first

one gives a simple bound on the 4th order moment of the average of i.i.d. random matrices

with zero mean.

Lemma 3.2.2 Let Z₁, …, Z_n be i.i.d. random matrices in R^{d×d} with E[Z₁] = 0 and E[‖Z₁‖⁴_F] < ∞. Then

E[‖(1/n) Σ_{i=1}^n Z_i‖⁴_F] ≤ (3/n²) E[‖Z₁‖⁴_F].  (3.19)

Proof. Let ⟨Z₁, Z₂⟩ = trace(Z₁ᵀZ₂) denote the inner product of two matrices, so that ‖Z₁‖²_F = ⟨Z₁, Z₁⟩. Using the assumption that E[Z_i] = 0 for all i and that Z₁, …, Z_n are independent and identically distributed, we have

E[‖(1/n) Σ_{i=1}^n Z_i‖⁴_F] = (1/n⁴) E[⟨Σ_{i=1}^n Z_i, Σ_{i=1}^n Z_i⟩²] = (1/n⁴)(T₁ + T₂ + ··· + T₇),  (3.20)

where

T₁ = n E[‖Z₁‖⁴_F],
T₂ = 4n(n−1) E[‖Z₁‖²_F ⟨Z₁, Z₂⟩] = 0,
T₃ = 2n(n−1) E[⟨Z₁, Z₂⟩²] ≤ 2n(n−1) E[‖Z₁‖⁴_F],
T₄ = n(n−1) E[‖Z₁‖²_F ‖Z₂‖²_F] ≤ n(n−1) E[‖Z₁‖⁴_F],
T₅ = 4n(n−1)(n−2) E[⟨Z₁, Z₂⟩⟨Z₂, Z₃⟩] = 0,
T₆ = 2n(n−1)(n−2) E[⟨Z₁, Z₂⟩‖Z₃‖²_F] = 0,
T₇ = n(n−1)(n−2)(n−3) E[⟨Z₁, Z₂⟩⟨Z₃, Z₄⟩] = 0.

In total, we have

E[‖(1/n) Σ_{i=1}^n Z_i‖⁴_F] ≤ ((3n² − 2n)/n⁴) E[‖Z₁‖⁴_F] ≤ (3/n²) E[‖Z₁‖⁴_F],

which is the desired result.
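Since (3.19) holds for any zero-mean i.i.d. distribution, it can be verified exactly on a small discrete example by enumerating all i.i.d. tuples (an illustrative check, not part of the proof):

```python
import itertools
import numpy as np

# A zero-mean distribution supported on three 2x2 matrices (uniform weights).
support = [np.array([[1.0, 0.0], [0.0, -1.0]]),
           np.array([[0.0, 2.0], [0.0, 0.0]]),
           np.array([[-1.0, -2.0], [0.0, 1.0]])]
assert np.allclose(sum(support), 0.0)        # E[Z_1] = 0

fro4 = lambda M: float(np.sum(M * M)) ** 2   # ||M||_F^4
EZ4 = np.mean([fro4(Z) for Z in support])    # E[||Z_1||_F^4]

# Exact E[|| (1/n) sum Z_i ||_F^4] by enumerating all 3^n i.i.d. tuples.
n = 4
lhs = np.mean([fro4(sum(tup) / n) for tup in itertools.product(support, repeat=n)])
```

The enumeration computes the left-hand side of (3.19) exactly, so the comparison with (3/n²)E[‖Z₁‖⁴_F] involves no sampling error.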

Based on the above lemma, we can bound the 2nd and 4th order variances of the gradient

and Hessian approximations.

Lemma 3.2.3 Let the variance-reduced gradient g^k_t and Hessian H^k_t be constructed according to (3.14) and (3.15). Then they satisfy the following equalities and inequalities:

E[H^k_t | x^k_t] = ∇²F(x^k_t),
E[‖H^k_t − ∇²F(x^k_t)‖²_F | x^k_t] ≤ (ρ²/|B^k_t|) ‖x^k_t − x^{k−1}‖²,
E[‖H^k_t − ∇²F(x^k_t)‖⁴_F | x^k_t] ≤ (33ρ⁴/|B^k_t|²) ‖x^k_t − x^{k−1}‖⁴,

and

E[g^k_t | x^k_t] = ∇F(x^k_t),
E[‖g^k_t − ∇F(x^k_t)‖² | x^k_t] ≤ (L²/|S^k_t|) ‖x^k_t − x^{k−1}‖².

Proof. First, we prove the bounds for the variance-reduced Hessian estimate. It is straightforward to show that E[H^k_t | x^k_t] = ∇²F(x^k_t). For the remaining two inequalities, let us first define

Z_j = ∇²f_j(x^k_t) − ∇²f_j(x^{k−1}) + ∇²F(x^{k−1}) − ∇²F(x^k_t),

where j is a uniform sample from {1, …, N}. Note that

E[∇²f_j(x^k_t) − ∇²f_j(x^{k−1}) | x^k_t] = ∇²F(x^k_t) − ∇²F(x^{k−1}).

Therefore, E[Z_j | x^k_t] = 0. For ease of notation, define z_j = ∇²f_j(x^k_t) − ∇²f_j(x^{k−1}), so that Z_j = z_j − E[z_j | x^k_t]. By the Lipschitz continuity conditions, ‖z_j‖_F ≤ ρ‖x^k_t − x^{k−1}‖. For the second moment, we have

E[‖Z_j‖²_F | x^k_t] = E[‖z_j − E[z_j | x^k_t]‖²_F | x^k_t] = E[‖z_j‖²_F | x^k_t] − ‖E[z_j | x^k_t]‖²_F ≤ E[‖z_j‖²_F | x^k_t] ≤ ρ²‖x^k_t − x^{k−1}‖².

Therefore,

E[‖H^k_t − ∇²F(x^k_t)‖²_F | x^k_t] = E[‖(1/|B^k_t|) Σ_{j∈B^k_t} Z_j‖²_F | x^k_t] = (1/|B^k_t|) E[‖Z₁‖²_F | x^k_t] ≤ (ρ²/|B^k_t|) ‖x^k_t − x^{k−1}‖².

For the fourth moment, we have

E[‖Z_j‖⁴_F | x^k_t] = E[‖z_j − E[z_j | x^k_t]‖⁴_F | x^k_t]
 = E[‖z_j‖⁴_F | x^k_t] + 4E[⟨z_j, E[z_j | x^k_t]⟩² | x^k_t] + 2E[‖z_j‖²_F | x^k_t] ‖E[z_j | x^k_t]‖²_F − 4⟨E[‖z_j‖²_F z_j | x^k_t], E[z_j | x^k_t]⟩ − 3‖E[z_j | x^k_t]‖⁴_F
 ≤ 11 E[‖z_j‖⁴_F | x^k_t] ≤ 11ρ⁴‖x^k_t − x^{k−1}‖⁴,  (3.21)

where we used ⟨z_j, E[z_j | x^k_t]⟩ ≤ ‖z_j‖_F ‖E[z_j | x^k_t]‖_F and ‖E[z_j | x^k_t]‖²_F ≤ E[‖z_j‖²_F | x^k_t]. Then by Lemma 3.2.2, we obtain

E[‖H^k_t − ∇²F(x^k_t)‖⁴_F | x^k_t] = E[‖(1/|B^k_t|) Σ_{j∈B^k_t} Z_j‖⁴_F | x^k_t] ≤ (3/|B^k_t|²) E[‖Z₁‖⁴_F | x^k_t] ≤ (33ρ⁴/|B^k_t|²) ‖x^k_t − x^{k−1}‖⁴.

Similarly, one can obtain the variance bounds for the gradient estimate. This part of the proof is omitted for simplicity.

As a result, we have the following corollary.

Corollary 3.2.4 Let H^k_t, g^k_t, ξ^k_t and the mini-batch index sets B^k_t and S^k_t be generated according to Algorithm 6. Then we have

E[‖H^k_t − ∇²F(x^k_t)‖_F | x^k_t] ≤ ρ(‖ξ^k_{t−1}‖ + ε^{1/2}),
E[‖H^k_t − ∇²F(x^k_t)‖²_F | x^k_t] ≤ ρ²(‖ξ^k_{t−1}‖² + ε),
E[‖H^k_t − ∇²F(x^k_t)‖³_F | x^k_t] ≤ 33^{3/4}ρ³(‖ξ^k_{t−1}‖³ + ε^{3/2}),

and

E[‖g^k_t − ∇F(x^k_t)‖ | x^k_t] ≤ L(‖ξ^k_{t−1}‖² + ε),
E[‖g^k_t − ∇F(x^k_t)‖^{3/2} | x^k_t] ≤ L^{3/2}(‖ξ^k_{t−1}‖³ + ε^{3/2}).


Proof. By Lemma 3.2.3 and the mini-batch size rule in Algorithm 6,

E[‖H^k_t − ∇²F(x^k_t)‖²_F | x^k_t] ≤ (ρ²/|B^k_t|)‖x^k_t − x^{k−1}‖² ≤ ρ²ε_H ≤ ρ²(‖ξ^k_{t−1}‖² + ε).

Due to the concavity of the square-root function, we can apply Jensen's inequality to obtain

E[‖H^k_t − ∇²F(x^k_t)‖_F | x^k_t] ≤ (E[‖H^k_t − ∇²F(x^k_t)‖²_F | x^k_t])^{1/2} ≤ ρ√ε_H ≤ ρ(‖ξ^k_{t−1}‖ + ε^{1/2}).

Similarly, we have

E[‖H^k_t − ∇²F(x^k_t)‖⁴_F | x^k_t] ≤ (33ρ⁴/|B^k_t|²)‖x^k_t − x^{k−1}‖⁴ ≤ 33ρ⁴ε_H².

Again, with the concavity of the function (·)^{3/4}, applying Jensen's inequality yields

E[‖H^k_t − ∇²F(x^k_t)‖³_F | x^k_t] ≤ (E[‖H^k_t − ∇²F(x^k_t)‖⁴_F | x^k_t])^{3/4} ≤ 33^{3/4}ρ³ε_H^{3/2} ≤ 33^{3/4}ρ³(‖ξ^k_{t−1}‖³ + ε^{3/2}).

Following a similar line of arguments, one can obtain the bounds for the gradient variances. We omit the details for simplicity.
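The Jensen steps above reduce to the elementary fact that E[X^{3/4}] ≤ (E[X])^{3/4} for a nonnegative random variable X, which a tiny discrete example confirms:

```python
# Discrete nonnegative "random variable" given by equally likely values.
vals = [0.2, 1.0, 3.0, 7.5]
mean = sum(vals) / len(vals)
jensen_lhs = sum(v ** 0.75 for v in vals) / len(vals)   # E[X^{3/4}]
jensen_rhs = mean ** 0.75                               # (E[X])^{3/4}
```

The same pattern (a concave power moved outside the expectation) is used again in the proofs of Theorem 3.2.7 and Theorem 3.2.9 below.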

Now we are ready to present the descent property of Algorithm 6.

Lemma 3.2.5 Suppose the sequence {x^k_t : k = 1, …, K, t = 0, …, m} is generated by Algorithm 6. Then the following descent property holds:

E[F(x^k_{t+1}) | x^k_t] ≤ F(x^k_t) − (σ/4 − ρ/2 − L/3) E[‖ξ^k_t‖³ | x^k_t] + (5ρ/2 + 2L/3)(‖ξ^k_{t−1}‖³ + ε^{3/2}).  (3.22)

As a remark, as long as σ/4 − ρ/2 − L/3 > 5ρ/2 + 2L/3, the variance term associated with ‖ξ^k_{t−1}‖³ in the current step can be dominated by the descent associated with ‖ξ^k_{t−1}‖³ in the previous step. At the first step of each stage the variance term is 0; hence we define ξ^k_{−1} = 0 for a unified expression.


Proof. Consider the cubic regularization subproblem in Algorithm 6:

ξ^k_t = argmin_ξ { ξᵀg^k_t + (1/2)ξᵀH^k_t ξ + (σ/6)‖ξ‖³ }.

Its optimality conditions imply (see [63])

g^k_t + H^k_t ξ^k_t + (σ/2)‖ξ^k_t‖ξ^k_t = 0,  (3.23)
H^k_t + (σ/2)‖ξ^k_t‖I ⪰ 0.  (3.24)

Then, by the Lipschitz continuity of the Hessian of the objective function, we have

F(x^k_{t+1}) ≤ F(x^k_t) + ∇F(x^k_t)ᵀξ^k_t + (1/2)(ξ^k_t)ᵀ∇²F(x^k_t)ξ^k_t + (ρ/6)‖ξ^k_t‖³
 = F(x^k_t) + ∇F(x^k_t)ᵀξ^k_t + (1/2)(ξ^k_t)ᵀ∇²F(x^k_t)ξ^k_t + (ρ/6)‖ξ^k_t‖³ − (ξ^k_t)ᵀ(g^k_t + H^k_t ξ^k_t + (σ/2)‖ξ^k_t‖ξ^k_t)
 = F(x^k_t) + (∇F(x^k_t) − g^k_t)ᵀξ^k_t + (1/2)(ξ^k_t)ᵀ(∇²F(x^k_t) − H^k_t)ξ^k_t − (σ/4 − ρ/6)‖ξ^k_t‖³ − (1/2)(ξ^k_t)ᵀ(H^k_t + (σ/2)‖ξ^k_t‖I)ξ^k_t
 ≤ F(x^k_t) + ‖ξ^k_t‖‖g^k_t − ∇F(x^k_t)‖ + (1/2)‖∇²F(x^k_t) − H^k_t‖_F‖ξ^k_t‖² − (σ/4 − ρ/6)‖ξ^k_t‖³,  (3.25)

where the first equality is due to (3.23) and the last inequality is due to (3.24). By the following variant of Young's inequality,

ab ≤ (aθ)^p/p + (b/θ)^q/q  for all a, b, p, q, θ > 0 with 1/p + 1/q = 1,

we have

‖ξ^k_t‖‖g^k_t − ∇F(x^k_t)‖ ≤ (L^{1/3}‖ξ^k_t‖)³/3 + (L^{−1/3}‖∇F(x^k_t) − g^k_t‖)^{3/2}/(3/2) = (L/3)‖ξ^k_t‖³ + (2/(3√L))‖∇F(x^k_t) − g^k_t‖^{3/2},

and

‖ξ^k_t‖²‖H^k_t − ∇²F(x^k_t)‖ ≤ (ρ^{2/3}‖ξ^k_t‖²)^{3/2}/(3/2) + (ρ^{−2/3}‖∇²F(x^k_t) − H^k_t‖)³/3 = (2ρ/3)‖ξ^k_t‖³ + (1/(3ρ²))‖∇²F(x^k_t) − H^k_t‖³.

Consequently,

E[F(x^k_{t+1}) | x^k_t] ≤ F(x^k_t) − (σ/4 − ρ/2 − L/3) E[‖ξ^k_t‖³ | x^k_t] + (1/(6ρ²)) E[‖∇²F(x^k_t) − H^k_t‖³_F | x^k_t] + (2/(3√L)) E[‖∇F(x^k_t) − g^k_t‖^{3/2} | x^k_t].  (3.26)

Since g^k_0 = ∇F(x^k_0) and H^k_0 = ∇²F(x^k_0) at the beginning of each stage, we have

E[F(x^k_1) | x^k_0] ≤ F(x^k_0) − (σ/4 − ρ/6) E[‖ξ^k_0‖³ | x^k_0].  (3.27)

By substituting the variance bounds in Corollary 3.2.4 into (3.26), we get

E[F(x^k_{t+1}) | x^k_t] ≤ F(x^k_t) − (σ/4 − ρ/2 − L/3) E[‖ξ^k_t‖³ | x^k_t] + (33^{3/4}ρ/6 + 2L/3)(‖ξ^k_{t−1}‖³ + ε^{3/2})
 ≤ F(x^k_t) − (σ/4 − ρ/2 − L/3) E[‖ξ^k_t‖³ | x^k_t] + (5ρ/2 + 2L/3)(‖ξ^k_{t−1}‖³ + ε^{3/2}).

This completes the proof.

Next, we give a lemma that connects E[‖∇F(x^k_{t+1})‖] and E[λ_min(∇²F(x^k_{t+1}))] with ξ^k_t and ξ^k_{t−1}.

Lemma 3.2.6 Let x^k_{t+1}, ξ^k_{t−1} and ξ^k_t be generated by Algorithm 6. Then the following relations hold:

E[‖∇F(x^k_{t+1})‖] ≤ (ρ + σ/2) E[‖ξ^k_t‖²] + (ρ/2 + L)(E[‖ξ^k_{t−1}‖²] + ε),
E[λ_min(∇²F(x^k_{t+1}))] ≥ −(ρ + σ/2) E[‖ξ^k_t‖] − ρ(E[‖ξ^k_{t−1}‖] + ε^{1/2}).

Proof. Using the triangle inequality, we have

‖∇F(x^k_{t+1})‖ ≤ ‖∇F(x^k_{t+1}) − ∇F(x^k_t) − ∇²F(x^k_t)ξ^k_t‖ + ‖∇²F(x^k_t)ξ^k_t − H^k_t ξ^k_t‖ + ‖∇F(x^k_t) − g^k_t‖ + ‖g^k_t + H^k_t ξ^k_t‖.

By the Lipschitz continuity of ∇²F and the fact that x^k_{t+1} = x^k_t + ξ^k_t,

‖∇F(x^k_{t+1}) − ∇F(x^k_t) − ∇²F(x^k_t)ξ^k_t‖ ≤ (ρ/2)‖ξ^k_t‖².

In addition, the optimality condition (3.23) implies

‖g^k_t + H^k_t ξ^k_t‖ = (σ/2)‖ξ^k_t‖².

Consequently,

‖∇F(x^k_{t+1})‖ ≤ ((σ + ρ)/2)‖ξ^k_t‖² + ‖H^k_t − ∇²F(x^k_t)‖_F‖ξ^k_t‖ + ‖∇F(x^k_t) − g^k_t‖
 ≤ (ρ + σ/2)‖ξ^k_t‖² + (1/(2ρ))‖H^k_t − ∇²F(x^k_t)‖²_F + ‖∇F(x^k_t) − g^k_t‖.  (3.28)

Taking expectation on both sides of the above inequality and applying Corollary 3.2.4 yields

E[‖∇F(x^k_{t+1})‖] ≤ (ρ + σ/2) E[‖ξ^k_t‖²] + (ρ/2 + L)(E[‖ξ^k_{t−1}‖²] + ε),

which is the first desired result.

For the second inequality, the optimality condition (3.24) implies

λ_min(H^k_t) ≥ −(σ/2)‖ξ^k_t‖.

Therefore, according to the Lipschitz continuity of ∇²F,

λ_min(∇²F(x^k_{t+1})) ≥ λ_min(∇²F(x^k_t)) − ρ‖ξ^k_t‖
 ≥ λ_min(H^k_t) − ρ‖ξ^k_t‖ − ‖∇²F(x^k_t) − H^k_t‖_F
 ≥ −(ρ + σ/2)‖ξ^k_t‖ − ‖∇²F(x^k_t) − H^k_t‖_F.  (3.29)

Taking expectation on both sides and applying Corollary 3.2.4 results in

E[λ_min(∇²F(x^k_{t+1}))] ≥ −(ρ + σ/2) E[‖ξ^k_t‖] − E[‖∇²F(x^k_t) − H^k_t‖_F] ≥ −(ρ + σ/2) E[‖ξ^k_t‖] − ρ(E[‖ξ^k_{t−1}‖] + ε^{1/2}),

which is the second desired inequality.

Finally we are ready to present the main result on iteration complexity.


Theorem 3.2.7 (Iteration Complexity) Choose the parameter σ > 13ρ + 4L and let (k*, t*) be given by either output option in Algorithm 6 after running for K stages. Then

E[‖ξ^{k*}_{t*}‖³ + ‖ξ^{k*}_{t*−1}‖³] ≤ 2(F(x⁰) − F*)/(Km(σ/4 − 3ρ − L)) + ((5ρ + 4L/3)/(σ/4 − 3ρ − L)) ε^{3/2},  (3.30)

where F* = min_x F(x). As a result,

E[‖∇F(x^{k*}_{t*+1})‖] ≤ C^g₁ (Km)^{−2/3} + C^g₂ ε,  (3.31)
E[λ_min(∇²F(x^{k*}_{t*+1}))] ≥ −C^H₁ (Km)^{−1/3} − C^H₂ ε^{1/2},  (3.32)

where the constants satisfy

C^g₁ = O((ρ + L)^{1/3}(F(x⁰) − F*)^{2/3}),  C^g₂ = O(ρ + L),
C^H₁ = O((ρ + L)^{2/3}(F(x⁰) − F*)^{1/3}),  C^H₂ = O(ρ + L).

Remark 3.2.8 If we choose K and m such that Km = O(ε^{−3/2}), then within O(ε^{−3/2}) iterations, Algorithm 6 reaches a point x^{k*}_{t*+1} such that

E[‖∇F(x^{k*}_{t*+1})‖] ≤ O(ε),  E[λ_min(∇²F(x^{k*}_{t*+1}))] ≥ −O(√ε).

In other words, the approximate optimality conditions in (3.2) are satisfied in expectation.

Proof. First, taking expectation over the whole history of random samples in (3.22) and (3.27) and summing them up for t = 0, …, m − 1 yields

E[F(x^k_0)] − E[F(x^k_m)] + m(5ρ/2 + 2L/3)ε^{3/2}
 ≥ (σ/4 − ρ/6) E[‖ξ^k_0‖³] + (σ/4 − 3ρ − L) Σ_{t=1}^{m−2} E[‖ξ^k_t‖³] + (σ/4 − ρ/2 − L/3) E[‖ξ^k_{m−1}‖³]
 ≥ (σ/4 − 3ρ − L) Σ_{t=0}^{m−1} E[‖ξ^k_t‖³].

Under the assumption σ > 13ρ + 4L, we have σ/4 − 3ρ − L > 0. Further summing over the stages k = 1, …, K, we obtain

(σ/4 − 3ρ − L) Σ_{k=1}^K Σ_{t=0}^{m−1} E[‖ξ^k_t‖³] ≤ F(x⁰) − F* + Km(5ρ/2 + 2L/3)ε^{3/2}.  (3.33)

For option 1 of the output rule, the concavity of the min function and Jensen's inequality give

E[‖ξ^{k*}_{t*}‖³ + ‖ξ^{k*}_{t*−1}‖³] = E[min_{k,t}(‖ξ^k_t‖³ + ‖ξ^k_{t−1}‖³)] ≤ min_{k,t} E[‖ξ^k_t‖³ + ‖ξ^k_{t−1}‖³]
 ≤ 2(F(x⁰) − F* + Km(5ρ/2 + 2L/3)ε^{3/2}) / (Km(σ/4 − 3ρ − L)),

which is precisely (3.30). For option 2, since k* and t* are chosen randomly,

E[‖ξ^{k*}_{t*}‖³ + ‖ξ^{k*}_{t*−1}‖³] = (2/(Km)) Σ_{k=1}^K Σ_{t=0}^{m−1} E[‖ξ^k_t‖³] − (1/(Km)) Σ_{k=1}^K E[‖ξ^k_{m−1}‖³]
 ≤ 2(F(x⁰) − F* + Km(5ρ/2 + 2L/3)ε^{3/2}) / (Km(σ/4 − 3ρ − L)).

Therefore, inequality (3.30) holds for both options.

Next, we derive guarantees for the approximate first- and second-order stationarity conditions. According to Lemma 3.2.6,

E[‖∇F(x^{k*}_{t*+1})‖] ≤ (ρ + σ/2) E[‖ξ^{k*}_{t*}‖²] + (ρ/2 + L) E[‖ξ^{k*}_{t*−1}‖²] + (ρ/2 + L)ε.  (3.34)

Consequently, we have

E[(ρ + σ/2)‖ξ^{k*}_{t*}‖² + (ρ/2 + L)‖ξ^{k*}_{t*−1}‖²]
 ≤ ((ρ + σ/2)³ + (ρ/2 + L)³)^{1/3} E[((‖ξ^{k*}_{t*−1}‖²)^{3/2} + (‖ξ^{k*}_{t*}‖²)^{3/2})^{2/3}]
 ≤ ((ρ + σ/2)³ + (ρ/2 + L)³)^{1/3} (E[‖ξ^{k*}_{t*−1}‖³ + ‖ξ^{k*}_{t*}‖³])^{2/3}
 ≤ ((ρ + σ/2)³ + (ρ/2 + L)³)^{1/3} (2(F(x⁰) − F*)/(Km(σ/4 − 3ρ − L)) + ((5ρ + 4L/3)/(σ/4 − 3ρ − L))ε^{3/2})^{2/3}
 ≤ ((ρ + σ/2) + (ρ/2 + L)) ((2(F(x⁰) − F*)/(σ/4 − 3ρ − L))^{2/3}(Km)^{−2/3} + ((5ρ + 4L/3)/(σ/4 − 3ρ − L))^{2/3}ε),

where the first inequality is due to Hölder's inequality, the second to Jensen's inequality, the third to (3.30), and the last to the fact that (a + b)^θ ≤ a^θ + b^θ for all a, b > 0 and 0 ≤ θ ≤ 1. Finally, substituting the above inequality into (3.34) yields the desired result in (3.31).

To bound the minimum eigenvalue of the Hessian, we have from Lemma 3.2.6 that

E[λ_min(∇²F(x^{k*}_{t*+1}))] ≥ −(ρ + σ/2) E[‖ξ^{k*}_{t*}‖] − ρ E[‖ξ^{k*}_{t*−1}‖] − ρε^{1/2}.  (3.35)

Similar to the arguments used for proving the first-order bound,

E[(ρ + σ/2)‖ξ^{k*}_{t*}‖ + ρ‖ξ^{k*}_{t*−1}‖]
 ≤ ((ρ + σ/2)^{3/2} + ρ^{3/2})^{2/3} E[(‖ξ^{k*}_{t*−1}‖³ + ‖ξ^{k*}_{t*}‖³)^{1/3}]
 ≤ ((ρ + σ/2)^{3/2} + ρ^{3/2})^{2/3} (E[‖ξ^{k*}_{t*−1}‖³ + ‖ξ^{k*}_{t*}‖³])^{1/3}
 ≤ ((ρ + σ/2)^{3/2} + ρ^{3/2})^{2/3} (2(F(x⁰) − F*)/(Km(σ/4 − 3ρ − L)) + ((5ρ + 4L/3)/(σ/4 − 3ρ − L))ε^{3/2})^{1/3}
 ≤ ((ρ + σ/2) + ρ) ((2(F(x⁰) − F*)/(σ/4 − 3ρ − L))^{1/3}(Km)^{−1/3} + ((5ρ + 4L/3)/(σ/4 − 3ρ − L))^{1/3}ε^{1/2}).

Combining the inequality above with (3.35) gives the desired bound in (3.32).

3.2.2 Bounding the Hessian sample complexity

Due to the adaptive mini-batch size rule, the total Hessian sample complexity is not given

explicitly. In this subsection, we provide a bound on the complexity of Hessian sampling.

Theorem 3.2.9 Let the total number of Hessian samples in Algorithm 6 be B_H. If we set the length of each stage and the number of stages to be

m = O(N^{1/3}),  K = ε^{−3/2}/m,  (3.36)

then the expected number of Hessian samples taken to reach a second-order ε-solution satisfies

E[B_H] ≤ O(N^{2/3}ε^{−3/2}).  (3.37)


Proof. According to (3.13), it suffices to use the following sample size for approximating the Hessian:

|B^k_t| = ‖x^k_t − x^{k−1}‖² / max{‖ξ^k_{t−1}‖², ε} ≤ ε^{−1}‖x^k_t − x^{k−1}‖².

Noticing that x^{k−1} = x^k_0 and x^k_t − x^{k−1} = Σ_{j=1}^{t}(x^k_j − x^k_{j−1}) = Σ_{j=0}^{t−1} ξ^k_j, we have

‖x^k_t − x^{k−1}‖² = ‖Σ_{j=0}^{t−1} ξ^k_j‖² ≤ (Σ_{j=0}^{t−1}‖ξ^k_j‖)² ≤ t Σ_{j=0}^{t−1}‖ξ^k_j‖²,

where the first inequality is due to the triangle inequality and the second to the Cauchy–Schwarz inequality. Summing over all k = 1, …, K and t = 0, …, m − 1, we get

Σ_{k=1}^K Σ_{t=0}^{m−1} |B^k_t| ≤ (1/ε) Σ_{k=1}^K Σ_{t=0}^{m−1} (t Σ_{j=0}^{t−1}‖ξ^k_j‖²) ≤ (1/ε) Σ_{k=1}^K Σ_{t=0}^{m−1} (m Σ_{j=0}^{m−1}‖ξ^k_j‖²) = (m²/ε) Σ_{k=1}^K Σ_{j=0}^{m−1}‖ξ^k_j‖².

Using Hölder's inequality, we have

Σ_{k=1}^K Σ_{j=0}^{m−1}‖ξ^k_j‖² ≤ (Σ_{k=1}^K Σ_{j=0}^{m−1} 1³)^{1/3} (Σ_{k=1}^K Σ_{j=0}^{m−1}(‖ξ^k_j‖²)^{3/2})^{2/3} = (Km)^{1/3} (Σ_{k=1}^K Σ_{j=0}^{m−1}‖ξ^k_j‖³)^{2/3}.

Therefore,

Σ_{k=1}^K Σ_{t=0}^{m−1} |B^k_t| ≤ ε^{−1}m²(Km)^{1/3} (Σ_{k=1}^K Σ_{j=0}^{m−1}‖ξ^k_j‖³)^{2/3} = ε^{−3/2}m² (Σ_{k=1}^K Σ_{j=0}^{m−1}‖ξ^k_j‖³)^{2/3},

where we used Km = ε^{−3/2} as specified in (3.36). Taking expectation on both sides gives

E[Σ_{k=1}^K Σ_{t=0}^{m−1} |B^k_t|] ≤ ε^{−3/2}m² E[(Σ_{k=1}^K Σ_{t=0}^{m−1}‖ξ^k_t‖³)^{2/3}]
 ≤ ε^{−3/2}m² (E[Σ_{k=1}^K Σ_{t=0}^{m−1}‖ξ^k_t‖³])^{2/3}
 ≤ ε^{−3/2}m² ((F(x⁰) − F*)/(σ/4 − 3ρ − L) + ((5ρ + 4L/3)/(σ/4 − 3ρ − L)) Kmε^{3/2})^{2/3}
 = ε^{−3/2}m² ((F(x⁰) − F*)/(σ/4 − 3ρ − L) + (5ρ + 4L/3)/(σ/4 − 3ρ − L))^{2/3},

where the second inequality is due to Jensen's inequality and the third to (3.33).

In addition to the sum of |B^k_t| over k and t, we also need to account for the N samples used to compute H^{k−1} at the beginning of each stage k = 1, …, K. Therefore, the total Hessian sample complexity is

E[B_H] = E[Σ_{k=1}^K Σ_{t=0}^{m−1} |B^k_t|] + KN ≤ Cε^{−3/2}m² + ε^{−3/2}N/m,

where we used Km = ε^{−3/2} and the constant C is defined as

C = ((F(x⁰) − F*)/(σ/4 − 3ρ − L) + (5ρ + 4L/3)/(σ/4 − 3ρ − L))^{2/3}.

By taking

m = (N/2C)^{1/3} = O(N^{1/3}),

we minimize the upper bound on E[B_H] and achieve

E[B_H] ≤ N^{2/3}ε^{−3/2}C^{1/3}(2^{−2/3} + 2^{1/3}) = O(N^{2/3}ε^{−3/2}),

which is the desired result.
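The final balancing step, minimizing C·m² + N/m over m, is easy to check numerically; N and C below are arbitrary illustrative values standing in for the problem-dependent quantities:

```python
def scaled_bound(m, N, C):
    """The epsilon^{-3/2}-scaled Hessian-sample bound C*m^2 + N/m from the proof."""
    return C * m**2 + N / m

N, C = 10_000, 2.0
m_star = (N / (2.0 * C)) ** (1.0 / 3.0)   # the minimizer m = (N / 2C)^{1/3}
at_star = scaled_bound(m_star, N, C)
best_integer = min(scaled_bound(m, N, C) for m in range(1, N + 1))
```

The continuous minimizer attains the closed-form value N^{2/3}C^{1/3}(2^{−2/3} + 2^{1/3}) and is never worse than any integer choice of m.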

Note that the above bound holds only when ε is small enough for (3.36) to make sense. To cover the case when ε is large, we can write

E[B_H] ≤ O(N + N^{2/3}ε^{−3/2}).

Through similar arguments, one can bound the total gradient sample complexity B_G by

E[B_G] ≤ O(N + N^{2/3}ε^{−5/2}).

Note that the complexity E[B_G] can be improved by sampling without replacement.

3.2.3 Analysis of sampling without replacement

In this section, we show that sampling without replacement does not change the sample complexity of Algorithm 6. Different Hessian sample complexity bounds for sampling with and without replacement are derived in [90] (see Table 3.1), and both of them are worse than the O(N^{2/3}ε^{−3/2}) complexity obtained in this paper. Again, we start from the variance estimation of the subsampled gradient and Hessian.

Lemma 3.2.10 Suppose X₁, …, X_N are matrices in R^{d×d} satisfying (1/N)Σ_{i=1}^N X_i = 0. Let Z₁, …, Z_n, where n ≤ N, be uniformly sampled from {X₁, …, X_N} without replacement. Then

E[‖(1/n) Σ_{i=1}^n Z_i‖²_F] = (1/n)(1 − (n−1)/(N−1)) E[‖Z₁‖²_F],  (3.38)

E[‖(1/n) Σ_{i=1}^n Z_i‖⁴_F] = (1/n⁴)(r₁ E[‖Z₁‖⁴_F] + r₂ E[⟨Z₁, Z₂⟩²] + r₃ E[‖Z₁‖²_F‖Z₂‖²_F]),  (3.39)

where ⟨Z_i, Z_j⟩ = trace(Z_iᵀZ_j) and

r₁ = n(1 − 4·(n−1)/(N−1) + 6·((n−1)(n−2))/((N−1)(N−2)) − 3·((n−1)(n−2)(n−3))/((N−1)(N−2)(N−3))),
r₂ = n(n−1)(2 − 4·(n−2)/(N−2) + 2·((n−2)(n−3))/((N−2)(N−3))),
r₃ = n(n−1)(1 − 2·(n−2)/(N−2) + ((n−2)(n−3))/((N−2)(N−3))).

Proof. Equation (3.38) is a standard result in the variance analysis of sampling without replacement; hence we omit its proof. To prove (3.39), we start with the expansion in (3.20) and calculate the terms T₁, …, T₇ for sampling without replacement. First, the following three terms do not change:

T₁ = n E[‖Z₁‖⁴_F],
T₃ = 2n(n−1) E[⟨Z₁, Z₂⟩²],
T₄ = n(n−1) E[‖Z₁‖²_F‖Z₂‖²_F].

For T₂, we have

T₂ = 4n(n−1) E[‖Z₁‖²_F⟨Z₁, Z₂⟩]
 = 4n(n−1) E[‖Z₁‖²_F⟨Z₁, E[Z₂ | Z₁]⟩]
 = 4n(n−1) Σ_{i=1}^N (1/N)‖X_i‖²_F ⟨X_i, Σ_{j≠i} X_j/(N−1)⟩
 = (4n(n−1)/(N(N−1))) Σ_{i=1}^N ‖X_i‖²_F ⟨X_i, Σ_{j=1}^N X_j − X_i⟩
 = −(4n(n−1)/(N−1)) Σ_{i=1}^N (1/N)‖X_i‖⁴_F
 = −(4n(n−1)/(N−1)) E[‖Z₁‖⁴_F],

where we used Σ_{j=1}^N X_j = 0. For T₅, we have

T₅ = 4n(n−1)(n−2) E[⟨Z₁, Z₂⟩⟨Z₂, Z₃⟩]
 = 4n(n−1)(n−2) E[⟨E[Z₁ | Z₂], Z₂⟩ · ⟨Z₂, E[Z₃ | Z₂, Z₁]⟩]
 = (4n(n−1)(n−2)/(N(N−1)(N−2))) Σ_{i=1}^N ⟨X_i, Σ_{j≠i} X_j ⟨X_i, Σ_{k≠i,j} X_k⟩⟩
 = (4n(n−1)(n−2)/(N(N−1)(N−2))) Σ_{i=1}^N ⟨X_i, Σ_{j≠i} X_j ⟨X_i, −X_i − X_j⟩⟩
 = (4n(n−1)(n−2)/(N(N−1)(N−2))) Σ_{i=1}^N ⟨X_i, Σ_{j≠i} X_j(−‖X_i‖²_F − ⟨X_i, X_j⟩)⟩
 = −(4n(n−1)(n−2)/(N−2)) Σ_{i=1}^N Σ_{j≠i} (1/(N(N−1)))(‖X_i‖²_F⟨X_i, X_j⟩ + ⟨X_i, X_j⟩²)
 = −(4n(n−1)(n−2)/(N−2))(E[‖Z₁‖²_F⟨Z₁, Z₂⟩] + E[⟨Z₁, Z₂⟩²])
 = (4n(n−1)(n−2)/((N−1)(N−2))) E[‖Z₁‖⁴_F] − (4n(n−1)(n−2)/(N−2)) E[⟨Z₁, Z₂⟩²],

where in the last equality we used the result for T₂. In similar ways, one can find the expressions for T₆ and T₇:

T₆ = 2n(n−1)(n−2) E[‖Z₃‖²_F⟨Z₁, Z₂⟩]
 = (2n(n−1)(n−2)/((N−1)(N−2))) E[‖Z₁‖⁴_F] − (2n(n−1)(n−2)/(N−2)) E[‖Z₁‖²_F‖Z₂‖²_F]

and

T₇ = n(n−1)(n−2)(n−3) E[⟨Z₁, Z₂⟩⟨Z₃, Z₄⟩]
 = −(3n(n−1)(n−2)(n−3)/((N−1)(N−2)(N−3))) E[‖Z₁‖⁴_F] + (2n(n−1)(n−2)(n−3)/((N−2)(N−3))) E[⟨Z₁, Z₂⟩²] + (n(n−1)(n−2)(n−3)/((N−2)(N−3))) E[‖Z₁‖²_F‖Z₂‖²_F].

Summing these terms up gives the desired equation (3.39).
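Identity (3.38) can be confirmed exactly on a tiny instance by enumerating all ordered without-replacement samples (the centered matrices X_i are arbitrary illustrative data):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, n = 5, 3
X = rng.standard_normal((N, 2, 2))
X -= X.mean(axis=0)                      # enforce (1/N) sum_i X_i = 0
fro2 = lambda M: float(np.sum(M * M))    # ||M||_F^2

# Exact E[|| (1/n) sum Z_i ||_F^2] over all ordered samples without replacement.
lhs = np.mean([fro2(sum(X[i] for i in idx) / n)
               for idx in itertools.permutations(range(N), n)])
rhs = (1.0 / n) * (1.0 - (n - 1) / (N - 1)) * np.mean([fro2(Xi) for Xi in X])
```

Because the enumeration is exhaustive (60 ordered triples for N = 5, n = 3), the two sides agree up to floating-point rounding.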

As a result of this lemma, we have the following corollary. The proof is similar to that of

Lemma 3.2.3, which we omit for simplicity.

Corollary 3.2.11 Let g^k_t and H^k_t be constructed by (3.14) and (3.15), respectively, with the mini-batches S^k_t and B^k_t sampled without replacement. Then

E[‖g^k_t − ∇F(x^k_t)‖² | x^k_t] ≤ (L²/|S^k_t|)(1 − (|S^k_t| − 1)/(N − 1))‖x^k_t − x^{k−1}‖²,

and

E[‖H^k_t − ∇²F(x^k_t)‖⁴_F | x^k_t] ≤ (11ρ⁴/|B^k_t|²)(3 − 6·(|B^k_t| − 2)/(N − 2) + 3·((|B^k_t| − 2)(|B^k_t| − 3))/((N − 2)(N − 3)))‖x^k_t − x^{k−1}‖⁴.

When N ≥ |B^k_t| ≫ 1, we have

(|B^k_t| − 2)/(N − 2) ≈ |B^k_t|/N,  ((|B^k_t| − 2)(|B^k_t| − 3))/((N − 2)(N − 3)) ≈ |B^k_t|²/N².

Thus, by Corollary 3.2.11, the following inequality holds approximately:

E[‖H^k_t − ∇²F(x^k_t)‖⁴_F | x^k_t] ≤ 33ρ⁴(1/|B^k_t| − 1/N)²‖x^k_t − x^{k−1}‖⁴.


We can derive the desired mini-batch sizes for Hessian and gradient sampling by requiring

E[‖H^k_t − ∇²F(x^k_t)‖³_F | x^k_t] ≤ 33^{3/4}ρ³(1/|B^k_t| − 1/N)^{3/2}‖x^k_t − x^{k−1}‖³ ≤ ρ³ max{‖ξ^k_{t−1}‖³, ε^{3/2}},

and

E[‖g^k_t − ∇F(x^k_t)‖^{3/2} | x^k_t] ≤ L^{3/2}(1/|S^k_t| − 1/N)^{3/4}‖x^k_t − x^{k−1}‖^{3/2} ≤ L^{3/2} max{‖ξ^k_{t−1}‖³, ε^{3/2}}.

Namely,

|B^k_t| ≥ 1 / (1/N + max{‖ξ^k_{t−1}‖², ε}/(√33‖x^k_t − x^{k−1}‖²)),
|S^k_t| ≥ 1 / (1/N + max{‖ξ^k_{t−1}‖⁴, ε²}/‖x^k_t − x^{k−1}‖²).

For the Hessian sample complexity, if we want a sublinear sample size N^α with α < 1, then we should expect |B^k_t| ≤ O(N^α) to hold during the iterations. This requires

max{‖ξ^k_{t−1}‖², ε}/(√33‖x^k_t − x^{k−1}‖²) ≥ Ω(N^{−α}) ≫ 1/N,

which further implies

|B^k_t| ≈ √33‖x^k_t − x^{k−1}‖²/max{‖ξ^k_{t−1}‖², ε}.

Notice that this Hessian batch size is of the same order as the one used by sampling with replacement. Therefore, no improvement should be expected in terms of the dependence on N. For gradient sampling, similar arguments lead to an upper bound of min{N, ‖x^k_t − x^{k−1}‖²/ε²}.
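A sketch comparing the two batch-size rules follows; the function names and test values are illustrative, and the without-replacement rule is the one derived above:

```python
import math

def batch_with_replacement(xi_prev, step, eps):
    """Hessian batch size from (3.13): ||x_t - x^{k-1}||^2 / max{||xi_{t-1}||^2, eps}."""
    return math.ceil(step**2 / max(xi_prev**2, eps))

def batch_without_replacement(xi_prev, step, eps, N):
    """The rule derived above: 1 / (1/N + max{||xi||^2, eps} / (sqrt(33) step^2))."""
    return math.ceil(1.0 / (1.0 / N + max(xi_prev**2, eps) / (math.sqrt(33) * step**2)))

# Away from the full-sample regime the two rules agree up to the sqrt(33)
# constant, and the without-replacement batch can never exceed N.
N = 10**6
b_with = batch_with_replacement(0.25, 1.0, 1e-4)
b_without = batch_without_replacement(0.25, 1.0, 1e-4, N)
```

This mirrors the conclusion in the text: sampling without replacement changes constants but not the order of the batch size in N.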

Based on the above analysis, we provide a corollary on the sample complexity bounds when

sampling without replacement is adopted.

Corollary 3.2.12 Consider Algorithm 6, where we sample S^k_t and B^k_t without replacement and set the length of each epoch and the number of epochs to be m = O(N^{1/3}) and K = ε^{−3/2}/m, respectively. Then the total number of Hessian samples B_H and the total number of gradient samples B_G required to reach a second-order ε-solution satisfy

E[B_H] ≤ O(N + N^{2/3}ε^{−3/2}),
E[B_G] ≤ O(N + N^{2/3}ε^{−3/2} min{N^{1/3}, ε^{−1}}).

3.3 Analysis of non-adaptive SVRC Schemes

In this section, we consider variants of the SVRC method that use fixed gradient and Hessian sample sizes across all iterations. The first variant uses full gradients but subsampled Hessians; in this case, we show that the total Hessian sample complexity is still O(N^{2/3}ε^{−3/2}). The second variant uses both approximate gradients and approximate Hessians, and adds a correction term to the gradient approximation based on second-order information. This variant was proposed in [107], where an O(N^{4/5}ε^{−3/2}) sample complexity is proved for both the gradient and Hessian approximations. We obtain the same order of sample complexity using the higher-moment bounds developed in this paper (instead of concentration inequalities), which avoids the poly(log d) factor and the excessively large constants in the results of [107].

3.3.1 The case of using full gradient

Algorithm 7 describes the SVRC method using the full gradient in each iteration, i.e., g^k_t = ∇F(x^k_t). One remark regarding output option 1: the best choice of γ, depending on the parameters B, m, σ and ρ, is given in Lemma 3.3.3. However, one does not need to know this value exactly, and any choice of γ = Θ(ρ) will not affect our complexity result.


Algorithm 7: Non-adaptive SVRC method using full gradients.
1  Input: an initial point x⁰, σ > 0, sample size B, inner loop length m, an estimate ρ of the Hessian Lipschitz constant, and a parameter γ = Θ(ρ).
2  for k = 1, …, K do
3    Compute H^{k−1} = ∇²F(x^{k−1}).
4    Assign x^k_0 = x^{k−1}.
5    for t = 0, …, m − 1 do
6      Randomly sample an index set B^k_t with constant cardinality |B^k_t| = B.
7      Construct H^k_t according to (3.15).
8      Solve ξ^k_t = argmin_ξ { ξᵀ∇F(x^k_t) + (1/2)ξᵀH^k_t ξ + (σ/6)‖ξ‖³ }.
9      x^k_{t+1} = x^k_t + ξ^k_t.
10   x^k = x^k_m.
11 Output: x^{k*}_{t*+1} with
12 Option 1: (k*, t*) = argmin_{1≤k≤K, 0≤t≤m−1} (‖ξ^k_t‖³ + ((ρ/γ)/(2B^{3/2}))‖x^k_t − x^{k−1}‖³).
13 Option 2: k* and t* chosen randomly.

Similar to the derivation of (3.25), we have the following result:

$$\begin{aligned}
\mathbb{E}\big[F(x_{t+1}^k)\big] &\le \mathbb{E}\big[F(x_t^k)\big] + \frac{1}{2}\mathbb{E}\Big[\big\|\nabla^2 F(x_t^k) - H_t^k\big\|_F \big\|\xi_t^k\big\|^2\Big] - \Big(\frac{\sigma}{4} - \frac{\rho}{6}\Big)\mathbb{E}\big[\|\xi_t^k\|^3\big] \\
&\le \mathbb{E}\big[F(x_t^k)\big] - \Big(\frac{\sigma}{4} - \frac{\rho}{2}\Big)\mathbb{E}\big[\|\xi_t^k\|^3\big] + \frac{1}{6\rho^2}\,\mathbb{E}\Big[\big\|\nabla^2 F(x_t^k) - H_t^k\big\|_F^3\Big] \\
&\le \mathbb{E}\big[F(x_t^k)\big] - \Big(\frac{\sigma}{4} - \frac{\rho}{2}\Big)\mathbb{E}\big[\|\xi_t^k\|^3\big] + \frac{1}{6\rho^2}\cdot\frac{33^{3/4}\rho^3}{|\mathcal{B}_t^k|^{3/2}}\,\mathbb{E}\big[\|x_t^k - x^{k-1}\|^3\big] \\
&\le \mathbb{E}\big[F(x_t^k)\big] - \Big(\frac{\sigma}{4} - \frac{\rho}{2}\Big)\mathbb{E}\big[\|\xi_t^k\|^3\big] + \frac{5\rho}{2|\mathcal{B}_t^k|^{3/2}}\,\mathbb{E}\big[\|x_t^k - x^{k-1}\|^3\big]. \qquad (3.40)
\end{aligned}$$

Note that the expectation is not taken over $|\mathcal{B}_t^k|$ since it is a predetermined constant, which we denote by $B$ from now on. Let us define a Lyapunov function

$$R_t^k := \mathbb{E}\Big[F(x_t^k) + c_t\big\|x_t^k - x^{k-1}\big\|^3\Big], \qquad (3.41)$$

where the coefficients $c_t$ are constructed recursively by setting $c_m = 0$ and

$$c_t = c_{t+1}\big(1 + 2\theta_1^{-3} + \theta_2^{-6}\big) + \frac{3\rho}{B^{3/2}}, \qquad t = 0, \ldots, m-1. \qquad (3.42)$$


Here $\theta_1, \theta_2$ are constants to be determined later. Next, we prove a monotone decreasing

property of this Lyapunov function over one epoch of Algorithm 7.

First, we note the following simple fact.

Lemma 3.3.1 For any $a, b, \theta_1, \theta_2 > 0$, the following inequality holds:

$$(a+b)^3 \le \big(1 + 2\theta_1^{-3} + \theta_2^{-6}\big)a^3 + \big(1 + \theta_1^6 + 2\theta_2^3\big)b^3. \qquad (3.43)$$

Proof. We expand $(a+b)^3$ and then use Young's inequality:

$$\begin{aligned}
(a+b)^3 &= a^3 + b^3 + 3a^2 b + 3b^2 a \\
&= a^3 + b^3 + 3(a/\theta_1)^2(b\theta_1^2) + 3(a/\theta_2^2)(b\theta_2)^2 \\
&\le a^3 + b^3 + 3\left(\frac{(a^2/\theta_1^2)^{3/2}}{3/2} + \frac{(b\theta_1^2)^3}{3}\right) + 3\left(\frac{(a/\theta_2^2)^3}{3} + \frac{(b^2\theta_2^2)^{3/2}}{3/2}\right) \\
&= \big(1 + 2\theta_1^{-3} + \theta_2^{-6}\big)a^3 + \big(1 + \theta_1^6 + 2\theta_2^3\big)b^3.
\end{aligned}$$

This completes the proof.
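Inequality (3.43) is also easy to check numerically. The following standalone script (the sampling ranges and tolerance are arbitrary choices, not part of the analysis) verifies it on random nonnegative inputs:

```python
import random

def lhs_rhs(a, b, t1, t2):
    """Return both sides of inequality (3.43) from Lemma 3.3.1."""
    lhs = (a + b) ** 3
    rhs = (1 + 2 * t1 ** -3 + t2 ** -6) * a ** 3 + (1 + t1 ** 6 + 2 * t2 ** 3) * b ** 3
    return lhs, rhs

random.seed(0)
for _ in range(10000):
    a, b = random.uniform(0.0, 5.0), random.uniform(0.0, 5.0)
    t1, t2 = random.uniform(0.1, 3.0), random.uniform(0.1, 3.0)
    lhs, rhs = lhs_rhs(a, b, t1, t2)
    assert lhs <= rhs + 1e-9
```

Since the proof above is a direct application of Young's inequality, the check never fails for any positive inputs.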

Consequently, by substituting $a = \|x_t^k - x^{k-1}\|$ and $b = \|x_{t+1}^k - x_t^k\|$ into Lemma 3.3.1 and then taking expectations on both sides, we get

$$\begin{aligned}
\mathbb{E}\big[\|x_{t+1}^k - x^{k-1}\|^3\big] &\le \mathbb{E}\Big[\big(\|x_{t+1}^k - x_t^k\| + \|x_t^k - x^{k-1}\|\big)^3\Big] \qquad (3.44) \\
&\le \big(1 + 2\theta_1^{-3} + \theta_2^{-6}\big)\mathbb{E}\big[\|x_t^k - x^{k-1}\|^3\big] + \big(1 + \theta_1^6 + 2\theta_2^3\big)\mathbb{E}\big[\|x_{t+1}^k - x_t^k\|^3\big].
\end{aligned}$$

Substituting (3.44) into (3.40) yields

$$\begin{aligned}
\mathbb{E}\Big[F(x_{t+1}^k) + c_{t+1}\big\|x_{t+1}^k - x^{k-1}\big\|^3\Big] \le\ & \mathbb{E}\big[F(x_t^k)\big] - \Big(\frac{\sigma}{4} - \frac{\rho}{2} - c_{t+1}\big(1 + \theta_1^6 + 2\theta_2^3\big)\Big)\mathbb{E}\big[\|x_{t+1}^k - x_t^k\|^3\big] \\
&+ \Big(c_{t+1}\big(1 + 2\theta_1^{-3} + \theta_2^{-6}\big) + \frac{5\rho}{2B^{3/2}}\Big)\mathbb{E}\big[\|x_t^k - x^{k-1}\|^3\big].
\end{aligned}$$

Now, using the definition of $R_t^k$ in (3.41) and the expression for $c_t$ in (3.42), we see that the above inequality leads to the following descent property of the Lyapunov function.


Lemma 3.3.2 Let the sequence $\{x_t^k\}$ be generated by Algorithm 7. Then for all $k$ and $0 \le t \le m-1$,

$$\Big(\frac{\sigma}{4} - \frac{\rho}{2} - c_{t+1}\big(1 + \theta_1^6 + 2\theta_2^3\big)\Big)\mathbb{E}\big[\|\xi_t^k\|^3\big] + \frac{\rho}{2B^{3/2}}\,\mathbb{E}\big[\|x_t^k - x^{k-1}\|^3\big] \le R_t^k - R_{t+1}^k.$$

Let us define a constant

$$\gamma = \min_{1 \le i \le m}\Big\{\frac{\sigma}{4} - \frac{\rho}{2} - c_{i+1}\big(1 + \theta_1^6 + 2\theta_2^3\big)\Big\}. \qquad (3.45)$$

Then Lemma 3.3.2 implies

$$\sum_{t=0}^{m-1}\mathbb{E}\bigg[\big\|\xi_t^k\big\|^3 + \frac{\rho/\gamma}{2B^{3/2}}\big\|x_t^k - x^{k-1}\big\|^3\bigg] \le \frac{R_0^k - R_m^k}{\gamma}.$$

Next, we prove that when the parameters are properly chosen, we have $\gamma = O(\rho)$.

Lemma 3.3.3 Suppose we set the batch size $B = \alpha N^{2/3}$, where $\alpha \ge 8$ is a constant. If we set $\sigma \ge 3\rho$, $m = \frac{1}{3}N^{1/3}$, $\theta_1 = N^{1/9}$ and $\theta_2 = N^{1/18}$, then

$$\gamma \ge \frac{1 - 8e\alpha^{-3/2}}{4}\,\rho = O(\rho).$$

Proof. By (3.42) and the values of $\theta_1$, $\theta_2$, $B$ and $m$, we have

$$c_t = \big(1 + 2N^{-1/3} + N^{-1/3}\big)c_{t+1} + 3\rho\alpha^{-3/2}N^{-1} = \big(1 + 3N^{-1/3}\big)c_{t+1} + 3\rho\alpha^{-3/2}N^{-1}.$$

Adding $\frac{\rho}{\alpha^{3/2}N^{2/3}}$ to both sides of the above equality, we obtain

$$c_t + \frac{\rho}{\alpha^{3/2}N^{2/3}} = \big(1 + 3N^{-1/3}\big)\left(c_{t+1} + \frac{\rho}{\alpha^{3/2}N^{2/3}}\right).$$

Thus,

$$\max_{0 \le t \le m} c_t = c_0 \le \left(c_m + \frac{\rho}{\alpha^{3/2}N^{2/3}}\right)\big(1 + 3N^{-1/3}\big)^m \le \frac{e\rho}{\alpha^{3/2}N^{2/3}},$$

where we used $c_m = 0$ and the fact that choosing $m = \frac{1}{3}N^{1/3}$ gives $3N^{-1/3} = 1/m$ and $(1 + 1/m)^m \le e$. In addition,

$$\gamma \ge \frac{\sigma}{4} - \frac{\rho}{2} - \big(1 + N^{2/3} + 2N^{1/6}\big)c_0 \ge \frac{1 - 8e\alpha^{-3/2}}{4}\,\rho.$$

Therefore, if $\alpha \ge 8 > (8e)^{2/3}$, then we have $\gamma = O(\rho)$.
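The key estimate $c_0 \le e\rho/(\alpha^{3/2}N^{2/3})$ can also be checked by unrolling the recursion directly. The sketch below assumes the parameter choices of Lemma 3.3.3; the values of $N$, $\alpha$ and $\rho$ are arbitrary test inputs:

```python
import math

def c0_from_recursion(N, alpha, rho):
    """Unroll c_t = (1 + 3 N^{-1/3}) c_{t+1} + 3 rho alpha^{-3/2} / N, with c_m = 0
    and m = N^{1/3} / 3, and return c_0."""
    m = max(1, round(N ** (1 / 3) / 3))
    c = 0.0  # c_m = 0
    for _ in range(m):
        c = (1 + 3 * N ** (-1 / 3)) * c + 3 * rho * alpha ** (-1.5) / N
    return c

N, alpha, rho = 10 ** 6, 8, 1.0
c0 = c0_from_recursion(N, alpha, rho)
bound = math.e * rho / (alpha ** 1.5 * N ** (2 / 3))  # bound from the lemma's proof
assert 0 < c0 <= bound
```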


Theorem 3.3.4 Suppose in Algorithm 7 we set $m = \frac{1}{3}N^{1/3}$, $|\mathcal{B}_t^k| = B = 8N^{2/3}$, $\sigma \ge 3\rho$, and let $\gamma$ be defined in (3.45). Let $(k^*, t^*)$ be given by either of the two output options in the algorithm. Then

$$\mathbb{E}\bigg[\big\|\xi_{t^*}^{k^*}\big\|^3 + \frac{\rho/\gamma}{2B^{3/2}}\big\|x_{t^*+1}^{k^*} - x^{k^*-1}\big\|^3\bigg] \le \frac{F(x^0) - F^\star}{Km\gamma}. \qquad (3.46)$$

Consequently, we have

$$\mathbb{E}\big[\big\|\nabla F(x_{t^*+1}^{k^*})\big\|\big] \le O\big((Km)^{-2/3}\big), \qquad \mathbb{E}\big[\lambda_{\min}\big(\nabla^2 F(x_{t^*+1}^{k^*})\big)\big] \ge -O\big((Km)^{-1/3}\big),$$

and the expected total number of Hessian samples required to reach a second-order $\varepsilon$-solution is $O(N^{2/3}\varepsilon^{-3/2})$.

Proof. We start with the definition of $R_t^k$ in (3.41). Since $x_0^k = x^{k-1}$ and $c_m = 0$, we have

$$R_0^k = \mathbb{E}[F(x_0^k)] = \mathbb{E}[F(x_m^{k-1})], \qquad R_m^k = \mathbb{E}[F(x_m^k)].$$

Then Lemma 3.3.2 and Lemma 3.3.3 immediately give

$$\sum_{k=1}^{K}\sum_{t=0}^{m-1}\mathbb{E}\bigg[\big\|\xi_t^k\big\|^3 + \frac{\rho/\gamma}{2B^{3/2}}\big\|x_t^k - x^{k-1}\big\|^3\bigg] \le \frac{F(x^0) - F^\star}{\gamma}.$$

Using similar arguments for output options 1 and 2 as in the proof of Theorem 3.2.7, the first bound (3.46) follows.

From (3.46), by Jensen's inequality, one simply gets

$$\mathbb{E}\big[\|\xi_{t^*}^{k^*}\|^2\big] \le \left(\frac{F(x^0) - F^\star}{\gamma Km}\right)^{2/3}, \qquad \mathbb{E}\big[\|\xi_{t^*}^{k^*}\|\big] \le \left(\frac{F(x^0) - F^\star}{\gamma Km}\right)^{1/3},$$

and

$$\mathbb{E}\big[\|x_{t^*+1}^{k^*} - x^{k^*-1}\|^2\big] \le \left(\frac{2B^{3/2}\big(F(x^0) - F^\star\big)}{Km\rho}\right)^{2/3}, \qquad \mathbb{E}\big[\|x_{t^*+1}^{k^*} - x^{k^*-1}\|\big] \le \left(\frac{2B^{3/2}\big(F(x^0) - F^\star\big)}{Km\rho}\right)^{1/3}.$$


Then, following the proof of Lemma 3.2.6, more specifically (3.28), we have

$$\begin{aligned}
\mathbb{E}\big[\big\|\nabla F(x_{t^*+1}^{k^*})\big\|\big] &\le \mathbb{E}\bigg[\Big(\rho + \frac{\sigma}{2}\Big)\big\|\xi_{t^*}^{k^*}\big\|^2 + \frac{1}{2\rho}\big\|H_{t^*}^{k^*} - \nabla^2 F(x_{t^*}^{k^*})\big\|_F^2\bigg] \\
&\le \mathbb{E}\bigg[\sqrt{2}\Big(\rho + \frac{\sigma}{2}\Big)\big\|\xi_{t^*}^{k^*}\big\|^2 + \frac{\rho}{2B}\big\|x_{t^*}^{k^*} - x^{k^*-1}\big\|^2\bigg] \\
&\le O\big((Km)^{-2/3}\big).
\end{aligned}$$

Similarly, using (3.28), one can bound the minimum eigenvalue of the Hessian as

$$\begin{aligned}
\mathbb{E}\big[\lambda_{\min}\big(\nabla^2 F(x_{t^*+1}^{k^*})\big)\big] &\ge -\mathbb{E}\bigg[\Big(\rho + \frac{\sigma}{2}\Big)\big\|\xi_{t^*}^{k^*}\big\| + \big\|\nabla^2 F(x_{t^*}^{k^*}) - H_{t^*}^{k^*}\big\|_F\bigg] \\
&\ge -\mathbb{E}\bigg[\Big(\rho + \frac{\sigma}{2}\Big)\big\|\xi_{t^*}^{k^*}\big\| + \frac{\rho}{B^{1/2}}\big\|x_{t^*+1}^{k^*} - x^{k^*-1}\big\|\bigg] \\
&\ge -O\big((Km)^{-1/3}\big).
\end{aligned}$$

Finally, by setting $B = 8N^{2/3}$, $m = \frac{1}{3}N^{1/3}$ and $K = \varepsilon^{-3/2}/m$, the total number of Hessian samples is

$$mKB + KN = O\big(N^{2/3}\varepsilon^{-3/2}\big).$$

This concludes the proof.

3.3.2 The case of using subsampled gradient

In this subsection, we analyze a non-adaptive SVRC scheme proposed in [107], which is shown in Algorithm 8. Similar to Algorithm 7, one need not know the best value of $\gamma$ given in Lemma 3.3.7; any choice of $\gamma = \Theta(\rho)$ is acceptable. In this scheme, the approximate gradients and Hessians are constructed as follows:

$$g_t^k = \frac{1}{|\mathcal{S}_t^k|}\sum_{j \in \mathcal{S}_t^k}\Big(\nabla f_j(x_t^k) - \nabla f_j(x^{k-1}) + g\Big) + \frac{1}{|\mathcal{S}_t^k|}\sum_{j \in \mathcal{S}_t^k}\Big(H^{k-1} - \nabla^2 f_j(x^{k-1})\Big)\big(x_t^k - x^{k-1}\big), \qquad (3.47)$$

$$H_t^k = \frac{1}{|\mathcal{B}_t^k|}\sum_{i \in \mathcal{B}_t^k}\Big(\nabla^2 f_i(x_t^k) - \nabla^2 f_i(x^{k-1})\Big) + H^{k-1}, \qquad (3.48)$$

where $g = \nabla F(x^{k-1})$ and $H^{k-1} = \nabla^2 F(x^{k-1})$. Compared with (3.14) and (3.15), the construction of $H_t^k$ is the same, but $g_t^k$ includes an additional second-order correction term.
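As an illustration of (3.47), the estimator can be sketched as below for component functions supplied as callables. This is a minimal NumPy sketch: the names svrc_gradient, grad_f and hess_f are illustrative, and the quadratic components used to exercise it are a toy example (for quadratics the second-order correction makes the estimate exact, which is a convenient sanity check):

```python
import numpy as np

def svrc_gradient(grad_f, hess_f, sample, x_t, x_anchor, g_anchor, H_anchor):
    """Gradient estimator (3.47): an SVRG-style difference estimator plus a
    second-order correction built from the anchor Hessian H_anchor."""
    d = x_t - x_anchor
    acc = np.zeros_like(x_t)
    for j in sample:  # indices in the sampled set S_t^k
        acc += grad_f(j, x_t) - grad_f(j, x_anchor) + (H_anchor - hess_f(j, x_anchor)) @ d
    return g_anchor + acc / len(sample)

# Toy quadratic components f_j(x) = 0.5 x^T A_j x + b_j^T x (illustrative only).
rng = np.random.default_rng(0)
n, dim = 5, 3
As = [(M + M.T) / 2 for M in rng.standard_normal((n, dim, dim))]
bs = rng.standard_normal((n, dim))
grad_f = lambda j, x: As[j] @ x + bs[j]
hess_f = lambda j, x: As[j]

x_a, x_t = rng.standard_normal(dim), rng.standard_normal(dim)
g_a = sum(grad_f(j, x_a) for j in range(n)) / n   # g = grad F(x^{k-1})
H_a = sum(As) / n                                 # H^{k-1} = hess F(x^{k-1})
est = svrc_gradient(grad_f, hess_f, [0, 2], x_t, x_a, g_a, H_a)
full = sum(grad_f(j, x_t) for j in range(n)) / n
assert np.allclose(est, full)  # exact for quadratics, regardless of the sample
```

The unbiasedness (3.49) follows because each summand has expectation $\nabla F(x_t^k) - \nabla F(x^{k-1}) + (H^{k-1} - \nabla^2 F(x^{k-1}))(x_t^k - x^{k-1})$, and the last term vanishes when $H^{k-1} = \nabla^2 F(x^{k-1})$.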


Algorithm 8: Non-adaptive SVRC method using subsampled gradients and Hessians

1  Input: an initial point $x^0$, $\sigma > 0$, sample sizes $S$ and $B$, and inner loop length $m$; an estimate of the Lipschitz constant $\rho$ and a parameter $\gamma = \Theta(\rho)$.
2  for $k = 1, \ldots, K$ do
3      Compute $H^{k-1} = \nabla^2 F(x^{k-1})$.
4      Assign $x_0^k = x^{k-1}$.
5      for $t = 0, \ldots, m-1$ do
6          Randomly sample index sets $\mathcal{S}_t^k$ and $\mathcal{B}_t^k$ with $|\mathcal{S}_t^k| = S$ and $|\mathcal{B}_t^k| = B$.
7          Construct $g_t^k$ according to (3.47) and $H_t^k$ according to (3.48).
8          Solve $\xi_t^k = \arg\min_{\xi}\ \big\{\xi^\top g_t^k + \frac{1}{2}\xi^\top H_t^k \xi + \frac{\sigma}{6}\|\xi\|^3\big\}$.
9          $x_{t+1}^k = x_t^k + \xi_t^k$.
10 Output: $x_{t^*+1}^{k^*}$ with
11     Option 1: $(k^*, t^*) = \arg\min_{1 \le k \le K,\ 0 \le t \le m-1} \big\{\|\xi_t^k\|^3 + \big(\frac{\rho/\gamma}{2B^{3/2}} + \frac{\rho/\gamma}{3\sqrt{2}S^{3/4}}\big)\|x_t^k - x^{k-1}\|^3\big\}$.
12     Option 2: $k^*$ and $t^*$ are chosen randomly.

Lemma 3.3.5 Let $g_t^k$ be constructed by (3.47) with $|\mathcal{S}_t^k| = S$. Then the following inequalities hold:

$$\mathbb{E}\big[g_t^k \,\big|\, x_t^k\big] = \nabla F(x_t^k), \qquad (3.49)$$

$$\mathbb{E}\Big[\big\|g_t^k - \nabla F(x_t^k)\big\|^2 \,\Big|\, x_t^k\Big] \le \frac{\rho^2}{4S}\big\|x_t^k - x^{k-1}\big\|^4. \qquad (3.50)$$

By the above lemma and a discussion similar to that of (3.25) and (3.40), we have

$$\begin{aligned}
\mathbb{E}\big[F(x_{t+1}^k) \,\big|\, x_t^k\big] \le\ & F(x_t^k) - \Big(\frac{\sigma}{4} - \frac{5\rho}{6}\Big)\mathbb{E}\big[\|\xi_t^k\|^3 \,\big|\, x_t^k\big] + \frac{1}{6\rho^2}\,\mathbb{E}\Big[\big\|\nabla^2 F(x_t^k) - H_t^k\big\|_F^3 \,\Big|\, x_t^k\Big] \\
&+ \frac{2}{3\rho^{1/2}}\,\mathbb{E}\Big[\big\|\nabla F(x_t^k) - g_t^k\big\|^{3/2} \,\Big|\, x_t^k\Big] \qquad (3.51) \\
\le\ & F(x_t^k) - \Big(\frac{\sigma}{4} - \frac{5\rho}{6}\Big)\mathbb{E}\big[\|\xi_t^k\|^3 \,\big|\, x_t^k\big] + \Big(\frac{5\rho}{2B^{3/2}} + \frac{\rho}{3\sqrt{2}S^{3/4}}\Big)\big\|x_t^k - x^{k-1}\big\|^3,
\end{aligned}$$

where the last inequality is due to Lemmas 3.2.3 and 3.3.5 and Jensen's inequality. In the rest of the analysis, we use the Lyapunov function $R_t^k$ defined in (3.41) but with a new set of parameters, setting $c_m = 0$ and

$$c_t = c_{t+1}\big(1 + 2\theta_1^{-3} + \theta_2^{-6}\big) + \frac{3\rho}{B^{3/2}} + \frac{\sqrt{2}\rho}{3S^{3/4}}, \qquad t = 0, \ldots, m-1. \qquad (3.52)$$

Again, the constants $\theta_1$ and $\theta_2$ will be determined later. We present the following results without proof, due to their similarity to the counterparts in the previous subsection.

Lemma 3.3.6 Suppose the sequence $\{x_t^k\}_{t=1,\ldots,m}^{k=1,\ldots,K}$ is generated by Algorithm 8. Then we have

$$\Big(\frac{\sigma}{4} - \frac{5\rho}{6} - c_{t+1}\big(1 + \theta_1^6 + 2\theta_2^3\big)\Big)\mathbb{E}\big[\|\xi_t^k\|^3\big] + \Big(\frac{\rho}{2B^{3/2}} + \frac{\rho}{3\sqrt{2}S^{3/4}}\Big)\mathbb{E}\big[\|x_t^k - x^{k-1}\|^3\big] \le R_t^k - R_{t+1}^k.$$

Define $\gamma = \min_{1 \le t \le m}\big\{\frac{\sigma}{4} - \frac{5\rho}{6} - c_{t+1}\big(1 + \theta_1^6 + 2\theta_2^3\big)\big\}$. Then for $k = 1, \ldots, K$,

$$\sum_{t=0}^{m-1}\mathbb{E}\bigg[\big\|\xi_t^k\big\|^3 + \Big(\frac{\rho/\gamma}{2B^{3/2}} + \frac{\rho/\gamma}{3\sqrt{2}S^{3/4}}\Big)\big\|x_t^k - x^{k-1}\big\|^3\bigg] \le \frac{R_0^k - R_m^k}{\gamma}. \qquad (3.53)$$

Lemma 3.3.7 If we set $\sigma \ge 4\rho$, $m = \frac{1}{3}N^{1/5}$, $B = \alpha N^{2/5}$, $S = \alpha^2 N^{4/5}$, $\theta_1 = N^{1/15}$ and $\theta_2 = N^{1/30}$, then we have $\gamma > 0$ as long as $\alpha \ge 12$, and

$$\gamma \ge \frac{\rho}{6}\Big(1 - 4\big(3 + \tfrac{\sqrt{2}}{3}\big)e\,\alpha^{-3/2}\Big) = O(\rho).$$
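The role of the requirement $\alpha \ge 12$ can be seen numerically: the lower bound on $\gamma$ is positive precisely when $4(3+\sqrt{2}/3)e\,\alpha^{-3/2} < 1$ (a quick check; slack is an illustrative helper name):

```python
import math

def slack(alpha):
    """The quantity 4 (3 + sqrt(2)/3) e alpha^{-3/2} from Lemma 3.3.7;
    the lower bound on gamma is (rho/6) * (1 - slack(alpha))."""
    return 4 * (3 + math.sqrt(2) / 3) * math.e * alpha ** (-1.5)

assert slack(12) < 1   # alpha = 12 makes the bound positive
assert slack(11) > 1   # alpha = 11 does not
```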

Theorem 3.3.8 For Algorithm 8, set

$$\sigma \ge 4\rho, \qquad m = \tfrac{1}{3}N^{1/5}, \qquad B = 12N^{2/5}, \qquad S = B^2.$$

Let $(k^*, t^*)$ be chosen by either of the two output options in the algorithm. Then

$$\mathbb{E}\bigg[\big\|\xi_{t^*}^{k^*}\big\|^3 + \Big(\frac{\rho/\gamma}{2B^{3/2}} + \frac{\rho/\gamma}{3\sqrt{2}S^{3/4}}\Big)\big\|x_{t^*+1}^{k^*} - x^{k^*-1}\big\|^3\bigg] \le \frac{F(x^0) - F^\star}{Km\gamma}.$$

Consequently, we have

$$\mathbb{E}\big[\big\|\nabla F(x_{t^*+1}^{k^*})\big\|\big] \le O\big((Km)^{-2/3}\big), \qquad \mathbb{E}\big[\lambda_{\min}\big(\nabla^2 F(x_{t^*+1}^{k^*})\big)\big] \ge -O\big((Km)^{-1/3}\big),$$

and the total Hessian and gradient sample complexity for finding a second-order $\varepsilon$-solution is $O(N^{4/5}\varepsilon^{-3/2})$.


3.4 Discussion

We considered the problem of minimizing the average of a large number of smooth and possibly nonconvex functions, $F(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$, using a subsampled Newton method with cubic regularization. We presented an adaptive variance reduction method that requires $O(N + N^{2/3}\varepsilon^{-3/2})$ Hessian samples for finding an approximate solution satisfying $\|\nabla F(x)\| \le \varepsilon$ and $\nabla^2 F(x) \succeq -\sqrt{\varepsilon}\,I$. This result holds for sampling both with and without replacement. Our analysis does not rely on high-probability bounds from matrix concentration inequalities; instead, we use bounds on the 3rd and 4th moments of the average of random matrices.

We have focused on the Hessian sample complexity by assuming that the solution to the cubic regularization (CR) subproblem (3.3) is available at each iteration. Nesterov and Polyak [63] showed that the CR subproblem is equivalent to a convex one-dimensional optimization problem, but this approach requires an eigenvalue decomposition of the approximate Hessian $H^k$, which can be very costly for high-dimensional problems. Several recent works propose to solve the CR subproblem using iterative algorithms such as gradient descent [11, 83] or the Lanczos method [12, 51], and an approximate trust-region solver [37] can also be used. However, the overall complexity of the combined methods may still be high.

An efficient approximate solver for the CR subproblem has been developed in [4], which leads to a total computational complexity of $O(N\varepsilon^{-3/2} + N^{3/4}\varepsilon^{-7/4})$ for minimizing the finite-average problem (3.1). This complexity is measured in terms of the total number of Hessian-vector products, i.e., multiplications by the component Hessians $\nabla^2 f_i(x^k)$. Similar results have also been obtained by [77]. For many machine learning problems, including generalized linear models and training neural networks, such Hessian-vector products can be computed in $O(d)$ time [71, 83]. We notice that the CR subproblem solved in [4] includes all $N$ component Hessians. Thus it is possible to further reduce the overall computational complexity by combining the efficient CR solver developed in [4] with the lower Hessian sample complexity obtained in this paper.


Chapter 4

A Sparse Completely Positive

Relaxation of the Modularity

Maximization for Community

Detection

In this chapter, we focus on an important application of nonconvex optimization techniques in statistics and machine learning: the community detection problem.

4.1 Introduction and Related Literature

The community detection problem aims to recover the underlying community/cluster structure of a network from observed node-connection data. This network-structure inference technique has been reinvigorated by modern applications involving large-scale networks, including social networks, energy distribution networks, and economic networks. The standard frameworks for studying community detection in networks are

the stochastic block model (SBM) [38] and the degree-correlated stochastic block model

(DCSBM) [48, 22]. Both of them are random graph models based on an underlying dis-

joint community structure. To solve the community detection problem, there have been a

variety of greedy methods which aim to maximize some quality function of the community

structure; see [87] and the references therein. Although these methods are numerically

efficient, their theoretical analysis is still largely missing.

An important class of methods for community detection are the so-called spectral clustering

methods, which exploit either the spectral decomposition of the adjacency matrix [60,

81, 46], or the normalized adjacency or graph Laplacian matrix [22, 21, 36, 78, 15, 73].

A significant advantage of the spectral clustering methods is that they are numerically

efficient and fully scalable. Although many of them are shown to be statistically consistent

with the SBM and the DCSBM under certain conditions, their numerical performance on

synthetic and real data does not fully validate this property. This can also be observed

in the numerical experiments where the spectral methods SCORE [46] and OCCAM [106]

are tested. It is worth mentioning that spectral clustering methods may fail to detect the

communities for large sparse networks.

Another closely connected class of methods are based on the nonnegative matrix factor-

ization (NMF) methods. In [25], an interesting equivalence is shown between some spe-

cific NMF-based methods, the kernel K-means method and a specific spectral clustering

method. There has been extensive research on graph clustering and community detection

problems; see e.g. [26, 50, 86, 72, 54, 98]. The clustering performance and scalability of

these methods are well appreciated, while their theoretical guarantees are still not fully

understood. In [70], the authors address this issue by showing the asymptotic consis-

tency of their NMF-based community detection algorithm under the SBM or the DCSBM.

However, a non-asymptotic error bound is still unknown.

In addition to the nonconvex methods, significant advances have also been made in convex approaches. In [68, 18, 19], the authors propose to decompose the adjacency matrix or its

variants into a “low-rank + sparse” form by using nuclear norm minimization methods.

In [17], an SDP relaxation of the modularity maximization followed by a doubly weighted

k-median clustering is proposed. This method, which enjoys nice theoretical guarantees under the SBM and the DCSBM, is demonstrated to be very competitive against the state-of-the-art methods on various datasets; see [10, 35] for more examples.

methods are numerically stable and statistically efficient under various model settings,

they are algorithmically not scalable. This is because solving these problems typically

involves expensive subroutines such as full eigenvalue decomposition or matrix inversion,

making these algorithms unsuitable for huge datasets.

In this project, a sparse and low-rank completely positive relaxation for the modularity

maximization problem [64] is proposed. Specifically, we define an assignment matrix Φ

for k communities among n nodes as an n × k binary matrix such that each row of Φ

corresponds to a particular node in the network and contains exactly one element equal

to one. The column index of each element indicates which community this node belongs

to. Consequently, we reformulate the modularity maximization as an optimization over

the set of low-rank assignment matrices. Then, we relax the binary constraints for each

row of the assignment matrix and impose the nonnegative and spherical constraints. The

cardinality constraints can be optionally added in order to handle large scale datasets.

To solve the resulting relaxation, we regard each row of the relaxed assignment matrix as

a block-variable and propose a row-by-row (RBR) proximal block coordinate descent ap-

proach. We show that a closed-form solution for these nonconvex subproblems is available

and an O(1/√T ) convergence rate to a stationary point can be obtained. One can then

proceed with either K-means or K-median clustering to retrieve the community structure

from the solution of the relaxed modularity maximization problem. Since these clustering

schemes themselves are expensive, a simple yet efficient rounding scheme is constructed

for the community retrieval purpose. Non-asymptotic high-probability bounds are established for the misclassification rate under the DCSBM. The properties of the SBM can be obtained as a special case of the DCSBM. We further propose an asynchronous parallel

computing scheme as well as an iterative rounding scheme to enhance the efficiency of

our algorithm. Finally, numerical experiments on both real world and synthetic networks

show that our method is competitive with the SCORE [46], OCCAM [106] and CMM [17]

algorithms regarding both numerical efficiency and clustering accuracy. We also perform

experiments on extremely large sparse networks with up to 50 million nodes and compare

the proposed method with the LOUVAIN algorithm [8]. It turns out that our method is at

least comparable to the LOUVAIN algorithm in terms of finding a community assignment

with a small number of communities and a high modularity.

4.2 Model Description

4.2.1 The Degree-Correlated Stochastic Block Model

In [38], Holland et al. proposed the so-called stochastic block model (SBM), where a non-overlapping true community structure is assumed. This structure is a partition $C_1^*, \ldots, C_k^*$ of the node set $[n] = \{1, \ldots, n\}$, that is,

$$C_a^* \cap C_b^* = \emptyset,\ \forall a \ne b, \qquad \text{and} \qquad \cup_{a=1}^k C_a^* = [n].$$

Under the SBM, the network is generated as a random graph characterized by a symmetric matrix $B \in \mathcal{S}^{k \times k}$ with all entries ranging between 0 and 1. Let $A \in \{0,1\}^{n \times n}$ be the adjacency matrix of the network with $A_{ii} = 0$ for all $i \in [n]$. Then for $i \in C_a^*$, $j \in C_b^*$, $i \ne j$,

$$A_{ij} = \begin{cases} 1, & \text{with probability } B_{ab}, \\ 0, & \text{with probability } 1 - B_{ab}. \end{cases}$$

One significant drawback of the SBM is that it oversimplifies the network structure: all nodes in the same community are homogeneously characterized. The DCSBM was proposed to alleviate this drawback by introducing degree heterogeneity parameters $\theta = (\theta_1, \ldots, \theta_n)$, one for each node; see [22, 48]. If one keeps the definitions of $B$ and $C_a^*$, $a = 1, \ldots, k$, and still lets $A$ be the adjacency matrix of the network with $A_{ii} = 0$ for all $i \in [n]$, then for $i \in C_a^*$, $j \in C_b^*$, $i \ne j$,

$$A_{ij} = \begin{cases} 1, & \text{with probability } B_{ab}\theta_i\theta_j, \\ 0, & \text{with probability } 1 - B_{ab}\theta_i\theta_j. \end{cases}$$

Hence, the heterogeneity of nodes is characterized by the vector $\theta$: a node $i$ with a larger $\theta_i$ tends to have more edges linking it to the other nodes.
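The DCSBM generation process above can be sketched in a few lines (an illustrative NumPy sketch; sample_dcsbm is a hypothetical helper name, and the loop-based sampler is written for clarity rather than speed):

```python
import numpy as np

def sample_dcsbm(labels, B, theta, rng):
    """Draw a DCSBM adjacency matrix: A_ij = 1 independently with probability
    B[c_i, c_j] * theta_i * theta_j for i != j, and A_ii = 0."""
    n = len(labels)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            p = B[labels[i], labels[j]] * theta[i] * theta[j]
            A[i, j] = A[j, i] = int(rng.random() < p)
    return A

rng = np.random.default_rng(0)
labels = [0] * 10 + [1] * 10
B = np.array([[0.9, 0.05], [0.05, 0.9]])
theta = np.ones(20)   # theta identically one recovers the plain SBM
A = sample_dcsbm(labels, B, theta, rng)
```

Setting all $\theta_i = 1$ recovers the SBM, matching the statement that the SBM is a special case of the DCSBM.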

4.2.2 A Sparse and Low-Rank Completely Positive Relaxation

For the community detection problem, one popular method is to maximize a community quality function called "modularity", proposed in [64]. We say that a binary matrix $X$ is a partition matrix if there exists a partition $S_1, \ldots, S_r$, $r \ge 2$, of the index/node set $[n]$ such that $X_{ij} = 1$ for $i, j \in S_t$, $t \in \{1, \ldots, r\}$, and $X_{ij} = 0$ otherwise. Then a given community structure of a network, not necessarily the true underlying structure, is fully characterized by its associated partition matrix. We denote the degree vector by $d$, where $d_i = \sum_j A_{ij}$, $i \in [n]$. Define the matrix

$$C = -(A - \lambda dd^\top), \qquad (4.1)$$

where $\lambda = 1/\|d\|_1$. The modularity of a community structure is then defined as $Q = -\langle C, X\rangle$. Let the set of all partition matrices of $n$ nodes with no more than $r$ subsets be denoted by $\mathcal{P}_n^r$. Then the modularity maximization can be formulated as

$$\min_X\ \langle C, X\rangle \quad \text{s.t.} \quad X \in \mathcal{P}_n^n. \qquad (4.2)$$

Assuming that the number of communities in the network is known to be $k$, the problem becomes

$$\min_X\ \langle C, X\rangle \quad \text{s.t.} \quad X \in \mathcal{P}_n^k. \qquad (4.3)$$
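For concreteness, the objective in (4.2)-(4.3) can be evaluated directly from a community label vector (a small NumPy sketch; modularity_value is an illustrative helper name). A classical property, useful as a sanity check, is that placing all nodes in a single community always gives $Q = 0$:

```python
import numpy as np

def modularity_value(A, labels):
    """Evaluate Q = -<C, X> with C = -(A - lambda d d^T), lambda = 1/||d||_1,
    where X is the partition matrix induced by the label vector."""
    d = A.sum(axis=1).astype(float)
    lam = 1.0 / d.sum()
    C = -(A - lam * np.outer(d, d))
    labels = np.asarray(labels)
    X = (labels[:, None] == labels[None, :]).astype(float)
    return -(C * X).sum()

# Two disjoint triangles: the true split has positive modularity,
# while the trivial single-community assignment gives Q = 0.
A = np.zeros((6, 6), dtype=int)
for (i, j) in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1
assert abs(modularity_value(A, [0] * 6)) < 1e-9
assert modularity_value(A, [0, 0, 0, 1, 1, 1]) > 0
```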


Although modularity optimization enjoys many desirable properties, it is an NP-hard

problem. A natural relaxation is the convex SDP relaxation considered in [17]:

$$\begin{array}{ll}
\min_{X \in \mathbb{R}^{n \times n}} & \langle C, X\rangle \\
\text{s.t.} & X_{ii} = 1, \quad i = 1, \ldots, n, \\
& 0 \le X_{ij} \le 1, \quad \forall i, j, \\
& X \succeq 0.
\end{array} \qquad (4.4)$$

This formulation exhibits desirable theoretical properties under the SBM or the DCSBM.

However, the computational cost is quite expensive when the network is large. Here we aim to formulate a model that retains nice theoretical properties while remaining computationally viable. Recall the definition of an assignment matrix in Section 4.1, and

suppose that the number of communities $k$ is no more than $r$. Then the true partition matrix $X^*$ can be decomposed as $X^* = \Phi^*(\Phi^*)^\top$, where $\Phi^* \in \{0,1\}^{n \times r}$ is the true assignment matrix. Let us denote the set of assignment matrices in $\mathbb{R}^{n \times r}$ by $\mathcal{A}^{n \times r}$, and define the notation

$$U = (u_1, \ldots, u_n)^\top, \qquad \Phi = (\phi_1, \ldots, \phi_n)^\top,$$

where the $u_i$'s and $\phi_i$'s are $r$-by-1 column vectors. By setting $r = n$ and $r = k$, respectively, problems (4.2) and (4.3) can be formulated as

$$\min_{\Phi}\ \big\langle C, \Phi\Phi^\top\big\rangle \quad \text{s.t.} \quad \Phi \in \mathcal{A}^{n \times r}. \qquad (4.5)$$

A matrix $X \in \mathbb{R}^{n \times n}$ is completely positive (CP) if there exists $U \in \mathbb{R}^{n \times r}$, $r \ge 1$, such that $U \ge 0$ and $X = UU^\top$. Let the set of all $n \times n$ CP matrices be denoted by $\mathcal{CP}_n$. Due to the reformulation (4.5), by constraining the CP rank of $X$, we naturally obtain a rank-constrained CP relaxation of problems (4.2) and (4.3):

$$\begin{array}{ll}
\min_X & \langle C, X\rangle \\
\text{s.t.} & X_{ii} = 1, \quad i = 1, \ldots, n, \\
& X \in \mathcal{CP}_n, \\
& \mathrm{rank}_{CP}(X) \le r,
\end{array} \qquad (4.6)$$

where $\mathrm{rank}_{CP}(X) := \min\{s : X = UU^\top,\ U \in \mathbb{R}_+^{n \times s}\}$. Note that the above problem is still NP-hard. By using a decomposition $X = UU^\top$, we propose an alternative formulation:

$$\begin{array}{ll}
\min_{U \in \mathbb{R}^{n \times r}} & \big\langle C, UU^\top\big\rangle \\
\text{s.t.} & \|u_i\|_2 = 1, \quad i = 1, \ldots, n, \\
& \|u_i\|_0 \le p, \quad i = 1, \ldots, n, \\
& U \ge 0,
\end{array} \qquad (4.7)$$

where the parameter λ in C can be set to a different value other than 1/‖d‖1 and 1 ≤ p ≤ r

is an integer. Note that the additional $\ell_0$ constraint is added to each row of $U$ to impose sparsity of the solution. When $p = r$, that is, when the $\ell_0$ constraints vanish, this problem is

exactly equivalent to the CP relaxation (4.6). Usually, the value of p is set to a small

number such that the total amount of the storage of U is affordable on huge scale datasets.

Even though (4.7) is still NP-hard, the decomposition and the structure of the problem

enable us to develop a computationally efficient method to reach at least a stationary

solution.

We summarize the relationships between the different formulations as

$$\left.\begin{array}{c} (4.3) \\ \Updownarrow \\ (4.5) \end{array}\right\}\ \Rightarrow\ (4.7)\ \Rightarrow\ (4.6)\ \Rightarrow\ (4.4),$$

where formulations (4.3) and (4.5) are equivalent forms of the NP-hard modularity maximization and problem (4.4) is the SDP relaxation. The "$\Rightarrow$" indicates that our relaxation is tighter than the convex SDP relaxation proposed in [17]: if $U$ is feasible for our problem (4.7), then $UU^\top$ is automatically feasible for the SDP relaxation (4.4).

4.2.3 A Community Detection Framework

Suppose that a solution $U$ has been computed from problem (4.7). We can then use one of the following three possible ways to retrieve the community structure from $U$.

i. Apply the K-means clustering algorithm directly to the rows of $U$ by solving

$$\min\ \|\Phi X_c - U\|_F^2 \quad \text{s.t.} \quad \Phi \in \mathcal{A}^{n \times k},\ X_c \in \mathbb{R}^{k \times k}, \qquad (4.8)$$

where the variable $\Phi$ contains the community/cluster assignment and $X_c$ contains the corresponding centers.

ii. Apply weighted K-means clustering to the rows of $U$; namely, we solve

$$\min\ \|D(\Phi X_c - U)\|_F^2 \quad \text{s.t.} \quad \Phi \in \mathcal{A}^{n \times k},\ X_c \in \mathbb{R}^{k \times k}, \qquad (4.9)$$

where $D = \mathrm{diag}\{d_1, \ldots, d_n\}$, and $\Phi$ and $X_c$ are interpreted the same as in (i).

iii. Apply a direct rounding scheme to generate the final community assignment $\Phi$. Letting $\phi_i^\top$ be the $i$th row of $\Phi$, set

$$\phi_i \leftarrow e_{j_0}, \qquad j_0 = \arg\max_j\ (u_i)_j, \qquad (4.10)$$

where $e_{j_0}$ is the $j_0$th unit vector in $\mathbb{R}^k$.
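The rounding scheme (4.10) amounts to a row-wise argmax (a minimal NumPy sketch; direct_rounding is an illustrative name, and ties are broken toward the smallest index as in NumPy's argmax):

```python
import numpy as np

def direct_rounding(U):
    """Rounding scheme (4.10): row i of the assignment matrix Phi is the unit
    vector e_{j0} with j0 = argmax_j (u_i)_j."""
    n, k = U.shape
    Phi = np.zeros((n, k))
    Phi[np.arange(n), U.argmax(axis=1)] = 1.0
    return Phi

U = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.2, 0.7]])
Phi = direct_rounding(U)
assert (Phi.sum(axis=1) == 1).all()  # every node gets exactly one community
```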

All three methods have theoretical guarantees under the SBM or the DCSBM as will be

shown in later sections. The overall numerical framework in this project is as follows:

Step 1. Solve problem (4.7) and obtain an approximate solution.


Step 2. Recover the community assignment via one of the three clustering methods.

Step 3. Repeat Steps 1 and 2 several times, and then pick the community assignment with the highest modularity.

4.3 An Asynchronous Proximal RBR Algorithm

4.3.1 A Nonconvex Proximal RBR Algorithm

In this subsection, we set up a basic proximal RBR algorithm, without parallelization, for (4.7). Let $r$ be chosen as the community number $k$. Define the feasible set for the block variable $u_i$ to be $\mathcal{U}_i = \mathbb{S}^{k-1} \cap \mathbb{R}_+^k \cap \mathcal{M}_p^k$, where $\mathbb{S}^{k-1}$ is the $(k-1)$-dimensional sphere and $\mathcal{M}_p^k = \{x \in \mathbb{R}^k : \|x\|_0 \le p\}$. Define $\mathcal{U} = \mathcal{U}_1 \times \cdots \times \mathcal{U}_n$. Then problem (4.7) can be rewritten as

$$\min_{U \in \mathcal{U}}\ f(U) \equiv \big\langle C, UU^\top\big\rangle. \qquad (4.11)$$

For the $i$th subproblem, we fix all except the $i$th row of $U$ and solve

$$u_i^+ = \arg\min_{x \in \mathcal{U}_i}\ f(u_1, \ldots, u_{i-1}, x, u_{i+1}, \ldots, u_n) + \frac{\sigma}{2}\|x - \bar{u}_i\|^2,$$

where $\bar{u}_i$ is some reference point and $\sigma > 0$. Note that the spherical constraint eliminates the second-order term $\|x\|^2$ and simplifies the subproblem to

$$u_i^+ = \arg\min_{x \in \mathcal{U}_i}\ b^\top x, \qquad (4.12)$$

where $b = 2C_{i,-i}U_{-i} - \sigma\bar{u}_i$; here $C_{i,-i}$ is the $i$th row of $C$ without the $i$th component, and $U_{-i}$ is the matrix $U$ without the $i$th row $u_i^\top$. For any positive integer $p$, a closed-form solution can be derived in the next lemma.

Lemma 4.3.1 For the problem $u = \arg\min\{b^\top x : x \in \mathbb{S}^{k-1} \cap \mathbb{R}_+^k \cap \mathcal{M}_p^k\}$, define $b^+ = \max\{b, 0\}$ and $b^- = \max\{-b, 0\}$, where the max is taken component-wise. Then the closed-form solution is given by

$$u = \begin{cases} \dfrac{b_p^-}{\|b_p^-\|}, & \text{if } b^- \ne 0, \\[1ex] e_{j_0} \text{ with } j_0 = \arg\min_j b_j, & \text{otherwise}, \end{cases} \qquad (4.13)$$

where $b_p^-$ is obtained by keeping the top $p$ components of $b^-$ and zeroing out the others. When $\|b^-\|_0 \le p$, $b_p^- = b^-$.

Proof. When $b^- \ne 0$, for any $u \in \mathbb{R}_+^k$ with $\|u\|_2 = 1$ and $\|u\|_0 \le p$, we obtain

$$b^\top u = (b^+ - b^-)^\top u \ge -(b^-)^\top u \ge -\|b_p^-\|,$$

where the equality is achieved at $u = \frac{b_p^-}{\|b_p^-\|}$.

When $b^- = 0$, i.e., $b \ge 0$, we have

$$b^\top u = \sum_i b_i u_i \ge \sum_i b_i u_i^2 \ge b_{\min}\sum_i u_i^2 = b_{\min},$$

where $b_{\min} = \min_j b_j$, and the equality is achieved at $u = e_{j_0}$, $j_0 = \arg\min_j b_j$.

Our proximal RBR algorithm is outlined in Algorithm 9.

Algorithm 9: A Row-By-Row (RBR) Algorithm

1  Given $U^0$ and $\sigma > 0$, set $t = 0$.
2  while not converged do
3      for $i = 1, \ldots, n$ do
4          $u_i^{t+1} = \arg\min_{x \in \mathcal{U}_i}\ f(u_1^{t+1}, \ldots, u_{i-1}^{t+1}, x, u_{i+1}^{t}, \ldots, u_n^{t}) + \frac{\sigma}{2}\|x - u_i^t\|^2$.
5      $t \leftarrow t + 1$.
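Combining the closed-form solution of Lemma 4.3.1 with the update rule of Algorithm 9, one sweep of the method can be sketched as follows (an illustrative NumPy sketch; solve_row_subproblem and rbr_sweep are hypothetical helper names, and ties in the top-$p$ selection are broken arbitrarily):

```python
import numpy as np

def solve_row_subproblem(b, p):
    """Closed-form solution (4.13) of min { b^T x : x on the unit sphere,
    x >= 0, ||x||_0 <= p }."""
    b_neg = np.maximum(-b, 0.0)
    if b_neg.any():
        u = np.zeros_like(b)
        top = np.argsort(b_neg)[-p:]   # keep the p largest entries of b^-
        u[top] = b_neg[top]
        return u / np.linalg.norm(u)
    u = np.zeros_like(b)
    u[np.argmin(b)] = 1.0              # b >= 0: put all mass on one coordinate
    return u

def rbr_sweep(C, U, sigma, p):
    """One pass of the proximal RBR method (Algorithm 9) over all rows of U,
    using the current row u_i as the proximal reference point."""
    n = U.shape[0]
    for i in range(n):
        # b = 2 C_{i,-i} U_{-i} - sigma u_i, computed without deleting row i:
        b = 2.0 * (C[i] @ U - C[i, i] * U[i]) - sigma * U[i]
        U[i] = solve_row_subproblem(b, p)
    return U
```

Because each subproblem is solved exactly with the proximal term, the objective $f(U) = \langle C, UU^\top\rangle$ is nonincreasing along the sweep, which matches the descent estimate (4.15) used in the proof of Theorem 4.3.2.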

Note that the objective function is smooth with a Lipschitz continuous gradient. When

$p = k$, the $\ell_0$ constraints hold trivially and thus can be removed. Then problem (4.7)


becomes a smooth nonlinear program. Therefore, the KKT conditions can be written as

$$\begin{cases}
2CU^* + 2\Lambda U^* - V = 0, \\
\|u_i^*\| = 1,\ u_i^* \ge 0,\ v_i \ge 0, & \text{for } i = 1, \ldots, n, \\
(u_i^*)_j (v_i)_j = 0, & \text{for } i = 1, \ldots, n \text{ and } j = 1, \ldots, k,
\end{cases}$$

where $\Lambda = \mathrm{diag}\{\lambda\}$, $V = (v_1, \ldots, v_n)^\top$, and $\lambda_i \in \mathbb{R}$, $v_i \in \mathbb{R}_+^k$ are the Lagrange multipliers associated with the constraints $\|u_i\|^2 = 1$ and $u_i \ge 0$, respectively. Then we have the following convergence result.

Theorem 4.3.2 Suppose that the sequence $U^0, U^1, \ldots, U^T$ is generated by Algorithm 9 with any chosen proximal parameter $\sigma > 0$ and $p = k$. Let

$$t^* := \arg\min_{1 \le t \le T}\ \|U^{t-1} - U^t\|_F^2. \qquad (4.14)$$

Then there exist Lagrange multipliers $\lambda_i \in \mathbb{R}$, $v_i \in \mathbb{R}_+^k$, $i = 1, \ldots, n$, such that

$$\begin{cases}
\big\|2CU^{t^*} + 2\Lambda U^{t^*} - V\big\|_F \le 2\sqrt{\dfrac{2\|C\|_1\big(4\|C\|_F^2 + \sigma^2\big)}{T\sigma}}, \\
\|u_i^{t^*}\| = 1,\ u_i^{t^*} \ge 0,\ v_i \ge 0, \quad \text{for } i = 1, \ldots, n, \\
(u_i^{t^*})_j (v_i)_j = 0, \quad \text{for } i = 1, \ldots, n \text{ and } j = 1, \ldots, k,
\end{cases}$$

where $\|C\|_1$ is the component-wise $\ell_1$ norm of $C$.

Proof. The proof consists of two main steps. First, we show that $\sum_{t=1}^T \|U^t - U^{t-1}\|_F^2$ is moderately bounded. By the optimality of the subproblems, we have, for $i = 1, \ldots, n$,

$$f(u_1^t, \ldots, u_{i-1}^t, u_i^{t-1}, \ldots, u_n^{t-1}) - f(u_1^t, \ldots, u_i^t, u_{i+1}^{t-1}, \ldots, u_n^{t-1}) \ge \frac{\sigma}{2}\|u_i^t - u_i^{t-1}\|^2.$$

Summing these inequalities over $i$ gives

$$f(U^{t-1}) - f(U^t) \ge \frac{\sigma}{2}\|U^t - U^{t-1}\|_F^2,$$

which yields

$$\sum_{t=1}^T \|U^t - U^{t-1}\|_F^2 \le \frac{2}{\sigma}\big(f(U^0) - f(U^T)\big). \qquad (4.15)$$


Note that $|\langle u_i, u_j\rangle| \le \|u_i\|\|u_j\| = 1$; hence, for any feasible $U$,

$$f(U) = \langle CU, U\rangle = \sum_{i,j} C_{ij}\langle u_i, u_j\rangle \in [-\|C\|_1, \|C\|_1].$$

Combining this with (4.15) and the definition of $t^*$ in (4.14), it holds that

$$\|U^{t^*} - U^{t^*-1}\|_F^2 \le \frac{4}{T\sigma}\|C\|_1. \qquad (4.16)$$

Second, also by the optimality of the subproblems, there exist KKT multipliers $\lambda_i \in \mathbb{R}$ and $v_i \in \mathbb{R}_+^k$ for each subproblem such that

$$\begin{cases}
\nabla_i f(u_1^{t^*}, \ldots, u_i^{t^*}, u_{i+1}^{t^*-1}, \ldots, u_n^{t^*-1}) + \sigma(u_i^{t^*} - u_i^{t^*-1}) + 2\lambda_i u_i^{t^*} - v_i = 0, \\
\|u_i^{t^*}\|^2 = 1,\ u_i^{t^*} \ge 0,\ v_i \ge 0, \\
(u_i^{t^*})_j (v_i)_j = 0, \quad \text{for } j = 1, \ldots, k.
\end{cases} \qquad (4.17)$$

Define U^{t*,i} = [u^{t*}_1, ..., u^{t*}_i, u^{t*−1}_{i+1}, ..., u^{t*−1}_n]^T. Then

    ‖∇_i f(U^{t*}) − ∇_i f(u^{t*}_1, ..., u^{t*}_i, u^{t*−1}_{i+1}, ..., u^{t*−1}_n)‖^2
        = ‖2C_i (U^{t*} − U^{t*,i})‖^2
        ≤ 4‖C_i‖^2 ‖U^{t*} − U^{t*,i}‖_F^2    (4.18)
        ≤ 4‖C_i‖^2 ‖U^{t*} − U^{t*−1}‖_F^2,

where C_i is the ith row of matrix C. Consequently,

    ‖2CU^{t*} + 2ΛU^{t*} − V‖_F^2
        = Σ_{i=1}^n ‖∇_i f(U^{t*}) + 2λ_i u^{t*}_i − v_i‖^2
        = Σ_{i=1}^n ‖∇_i f(U^{t*}) − ∇_i f(u^{t*}_1, ..., u^{t*}_i, u^{t*−1}_{i+1}, ..., u^{t*−1}_n) − σ(u^{t*}_i − u^{t*−1}_i)‖^2    (4.19)
        ≤ 2 Σ_{i=1}^n (‖∇_i f(U^{t*}) − ∇_i f(u^{t*}_1, ..., u^{t*}_i, u^{t*−1}_{i+1}, ..., u^{t*−1}_n)‖^2 + σ^2 ‖u^{t*}_i − u^{t*−1}_i‖^2)
        ≤ 2(4‖C‖_F^2 + σ^2) ‖U^{t*} − U^{t*−1}‖_F^2
        ≤ 8‖C‖_1 (4‖C‖_F^2 + σ^2) / (Tσ),

where the second equality is due to (4.17) and the last inequality is due to (4.16).
Combining the bounds in (4.19) and (4.17) proves the theorem.

4.3.2 An Asynchronous Parallel Proximal RBR Scheme

In this subsection, we briefly discuss the parallelization of the RBR Algorithm 9. The
overall parallel setup for the algorithm is a shared-memory model with many threads. The
variable U is kept in shared memory so that it can be accessed by all threads. Memory
locking is not imposed at any point in the process. Namely, even when a thread is updating
some row u_i of U, the other threads can still access U whenever needed.

Before explaining the parallel implementation, let us first make the sequential case clear.
The main computational cost of updating one row u_i by solving the subproblem (4.12) is
due to the computation of b = 2C_{i,−i}U_{−i} − σu_i, where u_i and U are the current iterates.
Note that the definition of C in (4.1) yields

    b^T = −2A_{i,−i}U_{−i} + 2λ d_i d^T_{−i} U_{−i} − σu_i,    (4.20)

where A_{i,−i} is the ith row of A without the ith component. By the sparsity of A, computing
−A_{i,−i}U_{−i} takes O(d_i p) flops. As for the term λ d_i d^T_{−i} U_{−i}, we can compute the
vector d^T U = Σ_{i=1}^{|V|} d_i u_i^T once and store it throughout all iterations. Each time a
row u_i is updated, this vector can be refreshed at a cost of O(p) flops. Hence, the total
computational cost of one full sweep of all rows is O(2(k + p)|V| + p|E|).
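The bookkeeping above can be sketched as follows. This is an illustration only: `A` is assumed to be a SciPy sparse adjacency matrix, and the row subproblem is simplified to minimizing b^T x over the plain nonnegative unit sphere, without the ℓ0 constraint of (4.12).

```python
import numpy as np
import scipy.sparse as sp

def sweep_once(A, d, lam, sigma, U):
    """One sequential RBR sweep.  dU caches d^T U and is refreshed in O(p)
    after each row update, so a full sweep costs O(2(k+p)|V| + p|E|)
    as discussed in the text.  (Simplified feasible set: x >= 0, ||x|| = 1.)"""
    n, p = U.shape
    dU = d @ U                                   # d^T U, computed once per sweep
    for i in range(n):
        ui_old = U[i].copy()
        # b^T = -2 A_{i,-i} U_{-i} + 2 lam d_i d^T_{-i} U_{-i} - sigma u_i
        # (A_ii = 0 for a simple graph, and d^T_{-i} U_{-i} = dU - d_i u_i)
        b = -2.0 * (A.getrow(i) @ U).ravel() \
            + 2.0 * lam * d[i] * (dU - d[i] * ui_old) \
            - sigma * ui_old
        # minimize b^T x over the simplified feasible set
        x = np.maximum(-b, 0.0)
        U[i] = x / np.linalg.norm(x) if x.any() else np.eye(p)[np.argmin(b)]
        dU += d[i] * (U[i] - ui_old)             # refresh the cache in O(p)
    return U
```

The key design point is that d^T U is never recomputed from scratch inside the sweep; only the O(p) correction d_i(u_i − ū_i) is applied after each row.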

Our parallel implementation is outlined in Algorithm 10, where many threads work at
the same time. The vector d^T U and the matrix U are stored in shared memory, and
they can be accessed and updated by all threads. Each thread picks up one row u_i at a
time and then reads U and the vector d^T U. Then an individual copy of the vector b^T is
calculated, i.e., b^T is owned privately and cannot be accessed by other threads. Thereafter,
the variable u_i is updated and d^T U is set as d^T U ← d^T U + d_i(u_i − ū_i)^T in shared
memory, where ū_i denotes the previous iterate of u_i.

Immediately, without waiting for other threads to finish their computation, it proceeds to
another row. Hence, unlike the situation in Algorithm 9, the other blocks of variables u_j, j ≠ i,
are not necessarily up to date when a thread is updating a row u_i and d^T U. Moreover,

if another thread is just modifying some row u_j or the vector d^T U when this thread is
reading these memory locations, the update of u_i will be made with partially updated data.
A possible way to avoid such conflicts is to impose memory locking when a thread is
changing U or d^T U. In this way, other threads cannot access these memory locations
and must wait until the update is finished. However, this process may leave many cores
idle, so that it provides only limited speedups or even slows the algorithm down. In our
implementation, memory locking is completely removed. Consequently, our method may
be able to provide near-linear speedups.

In order to increase the chance of finding a good solution, we perform the simple rounding
scheme (4.10) from Section 4.2.3 whenever necessary. By applying this technique, we
obtain a binary solution after each iteration as a valid community partition. Although
convergence is not guaranteed in this case, the scheme is numerically robust throughout our
experiments. Note that the rounding technique can also be asynchronously parallelized
because the updates of all rows are completely independent.

It is worth noting that our algorithm is very similar to the HOGWILD! algorithm [74],
except that HOGWILD! minimizes the objective function in a stochastic gradient descent
style while we adopt the exact block coordinate minimization style. Compared to
HOGWILD!, there are no conflicts in our method when updating the variable U, since each
row can only be modified by at most one thread. However, the issue of conflicts arises when
updating the intermediate vector d^T U. Another similar asynchronous parallel framework
is CYCLADES [69]. Consider a "variable conflict graph" whose nodes are blocks of
variables and in which an edge between two blocks of variables exists if they are coupled
in the objective function. CYCLADES carefully exploits the sparsity of the variable
conflict graph and designs a particular parallel updating order so that, with high probability,
no conflict emerges due to the asynchronism. This enables the algorithm to be
asynchronously implemented while keeping the sequence of iterates equivalent to that generated
by a sequential algorithm. However, because the block variables are all coupled together
in formulation (4.7), no variable sparsity can be exploited in our case.

Algorithm 10: Asynchronous parallel RBR algorithm

 1  Given U^0, set t = 0
 2  while not converged do
 3      for each row i, asynchronously, do
 4          Compute the vector b_i^T = −2A_{i,−i}U_{−i} + 2λ d_i d^T_{−i} U_{−i} − σu_i, and
            save the previous iterate ū_i = u_i in private memory.
 5          Update u_i ← arg min_{x ∈ U_i} b_i^T x in the shared memory.
 6          Update the vector d^T U ← d^T U + d_i(u_i − ū_i)^T in the shared memory.
 7      if rounding is activated then
 8          for each row i, asynchronously, do
 9              Set u_i = e_{j_0}, where j_0 = arg max_j (u_i)_j.
10              Compute and update d^T U.
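A minimal shared-memory sketch of the lock-free structure of Algorithm 10 is given below. This is illustrative only: the actual implementation is in C with OpenMP, Python threads are used here merely to show the structure, and the row subproblem is again simplified to the nonnegative unit sphere. Note that only the work queue is protected; U and the cached vector d^T U are read and written without any locking, exactly as described in the text.

```python
import threading
import numpy as np

def async_rbr(A, d, lam, sigma, U, n_threads=4):
    """Lock-free asynchronous RBR sweep: U and dU live in shared memory and
    are read/updated by all threads without locking, as in Algorithm 10.
    A is a dense adjacency matrix here for simplicity."""
    n, p = U.shape
    dU = d @ U                          # shared cache of d^T U
    rows = list(range(n))
    queue_lock = threading.Lock()       # only the work queue is protected

    def worker():
        nonlocal dU
        while True:
            with queue_lock:
                if not rows:
                    return
                i = rows.pop()
            ui_old = U[i].copy()        # previous iterate, kept privately
            # b^T = -2 A_{i,-i} U_{-i} + 2 lam d_i d^T_{-i} U_{-i} - sigma u_i
            b = -2.0 * A[i] @ U + 2.0 * lam * d[i] * (dU - d[i] * ui_old) \
                - sigma * ui_old
            x = np.maximum(-b, 0.0)     # minimize b^T x over {x >= 0, ||x|| = 1}
            U[i] = x / np.linalg.norm(x) if x.any() else np.eye(p)[np.argmin(b)]
            dU += d[i] * (U[i] - ui_old)   # unguarded shared update

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return U
```

Whatever the thread interleaving, each row ends up feasible (unit-norm and nonnegative); what may differ across runs is which stale copies of U and d^T U each thread reads, which is precisely the asynchronism the text discusses.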

4.4 Theoretical Error Bounds Under The DCSBM

In this section, we establish the theoretical guarantees of our method. Throughout the
discussion, we focus mainly on the DCSBM, as the corresponding assumptions and results
for the SBM can easily be recovered from those of the DCSBM as special cases. Define

    G_a = Σ_{i∈C*_a} θ_i,   J_a = Σ_{i∈C*_a} θ_i^2,   M_a = Σ_{i∈C*_a} θ_i^3,   n_a = |C*_a|,
    H_a = Σ_{b=1}^k B_{ab} G_b,   and   f_i = H_a θ_i  for i ∈ C*_a.

A direct computation gives E d_i = θ_i H_a − θ_i^2 B_{aa} ≈ f_i. For ease of notation, we define
n_max = max_{1≤a≤k} n_a, and define n_min, H_max, H_min, J_max, J_min, M_max, M_min in a similar
fashion. We adopt the density gap assumption in [17] to guarantee the model's identifiability.

Assumption 4.4.1 (Chen, Li and Xu, 2016, [17]) Under the DCSBM, the density gap
condition holds:

    max_{1≤a<b≤k} B_{ab} / (H_a H_b) < min_{1≤a≤k} B_{aa} / H_a^2.

Define ‖X‖_{1,θ} = Σ_{ij} |X_{ij}| θ_i θ_j as the weighted ℓ1 norm. Then we have the following
lemma, which bounds the difference between the approximate partition matrix U*(U*)^T and
the true partition matrix Φ*(Φ*)^T, where U* is the optimal solution to problem (4.7).

Lemma 4.4.2 Suppose that Assumption 4.4.1 holds and the tuning parameter λ satisfies

    max_{1≤a<b≤k} (B_{ab} + δ) / (H_a H_b) < λ < min_{1≤a≤k} (B_{aa} − δ) / H_a^2    (4.21)

for some δ > 0. Let U* be the global optimal solution to problem (4.7), and define Δ =
U*(U*)^T − Φ*(Φ*)^T. Then with probability at least 0.99 − 2(e/2)^{−2n}, we have

    ‖Δ‖_{1,θ} ≤ (C_0/δ) (1 + (max_{1≤a≤k} B_{aa}/H_a^2) ‖f‖_1) (√(n‖f‖_1) + n),

where C_0 > 0 is some absolute constant that does not depend on the problem scale or
parameter selection.

Proof. Lemma 4.4.2 follows directly from Theorem 1 in [17] for the convex SDP relaxation
(4.4). Here we only show the main steps of the proof and how they can be translated from
the original result of Chen, Li and Xu in [17]. By the optimality of U* and the feasibility
of Φ* for problem (4.7), we have

    0 ≤ 〈Δ, A − λdd^T〉 = 〈Δ, EA − λff^T〉 + λ〈Δ, ff^T − dd^T〉 + 〈Δ, A − EA〉 =: S_1 + S_2 + S_3,    (4.22)

where S_1, S_2 and S_3 denote the three terms on the right-hand side, respectively. The
remaining task is to bound the three terms.


First, we bound the term S_1. Since U*(U*)^T is feasible for the SDP problem (4.4), we still
have Δ_{ij} ≤ 0 for all i, j ∈ C*_a, a ∈ [k], and Δ_{ij} ≥ 0 for all i ∈ C*_a, j ∈ C*_b,
a ≠ b, a, b ∈ [k]. Combining this with (4.21), the proof in [17] remains valid and gives

    Δ_{ij} (EA − λff^T)_{ij} ≤ −δ θ_i θ_j |Δ_{ij}|.

Hence we have

    S_1 ≤ −δ ‖Δ‖_{1,θ}.

Similarly, the proof goes through for the bounds on S_2 and S_3. It can be shown that

    S_2 ≤ C (max_{1≤a≤k} B_{aa}/H_a^2) ‖f‖_1 (√(n‖f‖_1) + n)

holds with probability at least 0.99, where C > 0 is some absolute constant. Let K_G ≤
1.783 be the constant in the original Grothendieck inequality. Then

    S_3 ≤ 2K_G √(8n‖f‖_1) + (16K_G/3) n

holds with probability at least 1 − 2(e/2)^{−2n}. Combining the bounds for S_1, S_2, S_3 with
inequality (4.22) proves the theorem.

Lemma 4.4.2 indicates that U*(U*)^T is close to Φ*(Φ*)^T. Naturally, we want to bound the
difference between the factors U* and Φ* themselves. However, note that for any orthogonal
matrix Q, (U*Q)(U*Q)^T = U*(U*)^T. This implies that in general U* is not necessarily close
to Φ* unless it is multiplied by a proper orthogonal matrix. To establish such a result, we
introduce a matrix perturbation lemma from [41].

Lemma 4.4.3 (Corollary 6.3.8 in [41], page 407) Let A, E ∈ C^{n×n}. Assume that A
is Hermitian and A + E is normal. Let λ_1 ≤ λ_2 ≤ ··· ≤ λ_n be the eigenvalues of A, and
let λ̂_1, λ̂_2, ..., λ̂_n be the eigenvalues of A + E, arranged in the order Re(λ̂_1) ≤ Re(λ̂_2) ≤
··· ≤ Re(λ̂_n). Then

    Σ_{i=1}^n |λ̂_i − λ_i|^2 ≤ ‖E‖_F^2,

where ‖·‖_F is the matrix Frobenius norm.


In the real symmetric case, this lemma states that the difference between every eigenvalue
of A and the corresponding eigenvalue of the perturbed matrix A + E can be bounded,
provided the error term ‖E‖_F^2 is properly bounded.

Theorem 4.4.4 Let Θ = diag(θ_1, ..., θ_n), and let U*, Φ*, Δ be defined according to Lemma
4.4.2. Then there exist orthogonal matrices Q*_1 and Q*_2 such that

    ‖Θ(U*Q*_1 − Φ*)‖_F^2 ≤ 2√k (M_max/J_min) ‖Δ‖_{1,θ}^{1/2},
    ‖U*Q*_2 − Φ*‖_F^2 ≤ 2√k (n_max/n_min) ‖Δ‖_1^{1/2} ≤ 2√k (n_max/(θ_min n_min)) ‖Δ‖_{1,θ}^{1/2}.

Proof. (i) We first prove the bound for Q*_1. A direct calculation yields

    v(θ) := min_{Q^T Q=I} ‖Θ(U*Q − Φ*)‖_F^2
          = min_{Q^T Q=I} ‖ΘU*Q‖_F^2 + ‖ΘΦ*‖_F^2 − 2〈ΘU*Q, ΘΦ*〉    (4.23)
          = 2 Σ_{i=1}^n θ_i^2 − 2 max_{Q^T Q=I} 〈Q, (U*)^T Θ^2 Φ*〉
          = 2 (Σ_{i=1}^n θ_i^2 − ‖(U*)^T Θ^2 Φ*‖_*),

where ‖·‖_* denotes the nuclear norm of a matrix, and the last equality is due to the
following argument. Note that

    Q*_1 = arg max_{Q^T Q=I} 〈Q, (U*)^T Θ^2 Φ*〉 = arg min_{Q^T Q=I} ‖Q − (U*)^T Θ^2 Φ*‖_F^2.

Suppose that we have the singular value decomposition (U*)^T Θ^2 Φ* = RΣP^T; then Q*_1 =
RP^T solves the above nearest orthogonal matrix problem. Hence,

    〈Q*_1, (U*)^T Θ^2 Φ*〉 = Tr(Σ) = ‖(U*)^T Θ^2 Φ*‖_*.
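The nearest-orthogonal-matrix argument can be checked numerically. The sketch below (illustrative, with an arbitrary test matrix) recovers Q from the SVD and verifies that the resulting inner product equals the nuclear norm:

```python
import numpy as np

def best_orthogonal(M):
    """Return the orthogonal Q maximizing <Q, M>: with the SVD
    M = R Sigma P^T, the maximizer is Q = R P^T and the maximum
    value is the nuclear norm of M (the sum of singular values)."""
    R, s, Pt = np.linalg.svd(M)
    return R @ Pt, s.sum()

M = np.array([[3.0, 1.0],
              [0.0, 2.0]])
Q, nuc = best_orthogonal(M)
assert np.allclose(Q.T @ Q, np.eye(2))     # Q is orthogonal
assert np.isclose(np.trace(Q.T @ M), nuc)  # <Q, M> = ||M||_*
```

The identity follows because Q^T M = P Σ P^T for this choice of Q, whose trace is Tr(Σ).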

For any matrix Z, denote by σ_i(Z) its ith singular value and by λ_i(Z) its ith eigenvalue.
For a positive semidefinite matrix Z, σ_i(Z) = λ_i(Z). Then

    σ_i^2((U*)^T Θ^2 Φ*) = λ_i((Φ*)^T Θ^2 U*(U*)^T Θ^2 Φ*)
                         = λ_i((Φ*)^T Θ^2 Φ*(Φ*)^T Θ^2 Φ* + (Φ*)^T Θ^{1.5} Δ_Θ Θ^{1.5} Φ*),    (4.24)

where Δ_Θ = Θ^{1/2}(U*(U*)^T − Φ*(Φ*)^T)Θ^{1/2} = Θ^{1/2} Δ Θ^{1/2}. The theorem is proved by
the following three steps, applying Lemma 4.4.3 to bound every σ_i^2((U*)^T Θ^2 Φ*).

Step (a). We derive a tight estimate for every eigenvalue λ_i((Φ*)^T Θ^2 Φ*(Φ*)^T Θ^2 Φ*),
i = 1, ..., k. Let v_a, 1 ≤ a ≤ k, be the ath column of Φ*. Then we have

    v_a(t) = 1 if t ∈ C*_a,  and  v_a(t) = 0 if t ∉ C*_a.

Therefore, we obtain

    ((Φ*)^T Θ^2 Φ*)_{ab} = Σ_{t=1}^n θ_t^2 v_a(t) v_b(t) = 0 if a ≠ b,  and  J_a if a = b,

which yields

    ((Φ*)^T Θ^2 Φ*)^2 = diag(J_1^2, ..., J_k^2)  and  λ_i(((Φ*)^T Θ^2 Φ*)^2) = J_i^2.
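The diagonal structure of (Φ*)^T Θ^2 Φ* can be confirmed on a toy example (an illustrative sketch, with a hypothetical three-node, two-community assignment):

```python
import numpy as np

# toy assignment: nodes {0,1} in community 1, node {2} in community 2
Phi = np.array([[1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])
theta = np.array([0.5, 2.0, 1.5])
Theta2 = np.diag(theta**2)

M = Phi.T @ Theta2 @ Phi                  # should equal diag(J_1, J_2)
J = np.array([0.5**2 + 2.0**2, 1.5**2])   # J_a = sum of theta_i^2 over C*_a
assert np.allclose(M, np.diag(J))
# the eigenvalues of M^2 are exactly J_a^2, as in Step (a)
assert np.allclose(np.linalg.eigvalsh(M @ M), np.sort(J**2))
```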

Step (b). We bound the error term ‖(Φ*)^T Θ^{1.5} Δ_Θ Θ^{1.5} Φ*‖_F^2 in (4.24):

    ‖(Φ*)^T Θ^{1.5} Δ_Θ Θ^{1.5} Φ*‖_F^2 ≤ ‖Δ_Θ‖_F^2 ‖Θ^{1.5} Φ*(Φ*)^T Θ^{1.5}‖_2^2 ≤ ‖Δ‖_{1,θ} M_max^2,    (4.25)

where the second inequality is due to the following argument.

By the feasibility of U* and Φ*, it is not hard to see that max_{1≤i,j≤n} |Δ_{ij}| ≤ 1. Therefore

    ‖Δ_Θ‖_F^2 = Σ_{1≤i,j≤n} Δ_{ij}^2 θ_i θ_j ≤ Σ_{1≤i,j≤n} |Δ_{ij}| θ_i θ_j = ‖Δ‖_{1,θ}.

By properly permuting the row indices, the true assignment matrix Φ* can be written as a
block diagonal matrix diag(1_{n_1×1}, ..., 1_{n_k×1}), where 1_{n_i×1}, 1 ≤ i ≤ k, is an n_i × 1
vector of all ones. Consequently, Φ*(Φ*)^T = diag(1_{n_1×n_1}, ..., 1_{n_k×n_k}) is also a block
diagonal matrix, where 1_{n_i×n_i}, 1 ≤ i ≤ k, is an n_i × n_i all-one square matrix. The term
Θ^{1.5} Φ*(Φ*)^T Θ^{1.5} is therefore also block diagonal. Let θ_{(a)} = (θ_{i_1}, ..., θ_{i_{n_a}})^T
and Θ_{(a)} = diag(θ_{(a)}), where {i_1, ..., i_{n_a}} = C*_a. Then the ath block of
Θ^{1.5} Φ*(Φ*)^T Θ^{1.5} is

    Θ_{(a)}^{1.5} 1 1^T Θ_{(a)}^{1.5} = θ_{(a)}^{1.5} (θ_{(a)}^{1.5})^T,

whose eigenvalues are M_a and zeros. Hence, we have

    ‖Θ^{1.5} Φ*(Φ*)^T Θ^{1.5}‖_2^2 = max_{1≤a≤k} M_a^2 = M_max^2.

Therefore, inequality (4.25) is proved.

Step (c). We next bound the final objective v(θ) defined in (4.23). Note that both
matrices (Φ*)^T Θ^2 Φ*(Φ*)^T Θ^2 Φ* and (Φ*)^T Θ^{1.5} Δ_Θ Θ^{1.5} Φ* are real symmetric,
and hence Hermitian and normal, so we can apply Lemma 4.4.3. For ease of notation, let the
k singular values of (U*)^T Θ^2 Φ* be σ_1, ..., σ_k, and define δ_i = |σ_i^2 − J_i^2|. Lemma
4.4.3 implies that

    Σ_{i=1}^k δ_i^2 ≤ ‖(Φ*)^T Θ^{1.5} Δ_Θ Θ^{1.5} Φ*‖_F^2 ≤ ‖Δ‖_{1,θ} M_max^2.

Together with

    σ_i ≥ √(J_i^2 − δ_i) = J_i √(1 − δ_i/J_i^2) ≥ J_i (1 − δ_i/J_i^2) = J_i − δ_i/J_i,    (4.26)

we have

    v(θ) = 2 (Σ_{i=1}^n θ_i^2 − Σ_{i=1}^k σ_i) ≤ 2 (Σ_{i=1}^k J_i − Σ_{i=1}^k (J_i − δ_i/J_i))
         = 2 Σ_{i=1}^k δ_i/J_i ≤ 2 (Σ_{i=1}^k 1/J_i^2)^{1/2} (Σ_{i=1}^k δ_i^2)^{1/2}
         ≤ 2√k (M_max/J_min) ‖Δ‖_{1,θ}^{1/2}.

We need to mention that when J_i^2 − δ_i < 0 in (4.26), we have δ_i/J_i > J_i, so that
σ_i ≥ 0 > J_i − δ_i/J_i and the lower bound σ_i ≥ J_i − δ_i/J_i remains valid. Consequently,
the first inequality of Theorem 4.4.4 is established.

(ii) Applying the first inequality with Θ = I gives

    ‖U*Q*_2 − Φ*‖_F^2 ≤ 2√k (n_max/n_min) ‖Δ‖_1^{1/2}.

Together with

    ‖Δ‖_{1,θ} = Σ_{ij} |Δ_{ij}| θ_i θ_j ≥ θ_min^2 ‖Δ‖_1,

we prove the second inequality.

The bound on ‖U*Q*_2 − Φ*‖_F^2 implies that, after a proper orthogonal transform, U* is
close to the true community assignment matrix Φ*. Thus, if we apply K-means under the ℓ2
norm metric to the rows of U*, which is equivalent to clustering the rows of U*Q*_2, the
clustering result will intuitively be close to the true community structure. A similar intuition
applies to the weighted K-means method (4.9). Suppose that the solution to the K-means
clustering problem (4.8) is Φ. For an arbitrary permutation Π of the columns of Φ, we
define the set

    Err_Π(Φ) = {i : (ΦΠ)_i ≠ φ*_i},

where Π also stands for the corresponding permutation matrix and (ΦΠ)_i denotes the ith
row of the matrix ΦΠ. This set contains the indices of the rows of ΦΠ and Φ* that do not
match. Since applying any permutation Π to the columns of Φ does not change the cluster
structure at all, a fair error measure is

    min_Π |Err_Π(Φ)|,

where the minimization is performed over the set of all permutation matrices. Under this
measure, the following theorem guarantees the quality of our clustering result.
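The minimization over permutations can be computed without enumerating all k! permutations, by solving a maximum-weight bipartite matching between detected and true labels. A sketch using SciPy's Hungarian-algorithm routine (illustrative, with hypothetical label vectors):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def min_perm_errors(pred, truth, k):
    """min over permutations Pi of #{i : Pi(pred_i) != truth_i}, computed as
    n minus a maximum-weight matching on the k x k confusion matrix."""
    conf = np.zeros((k, k), dtype=int)
    for p, t in zip(pred, truth):
        conf[p, t] += 1
    row, col = linear_sum_assignment(-conf)   # negate to maximize matches
    return len(pred) - conf[row, col].sum()

pred  = [0, 0, 1, 1, 2, 2]
truth = [1, 1, 0, 0, 2, 1]
assert min_perm_errors(pred, truth, 3) == 1   # only one node is mismatched
```

The matching runs in O(k^3) time, so the measure remains tractable even for a moderately large community number k.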

Theorem 4.4.5 Suppose that Φ is generated by some K-means clustering algorithm with
the cluster number k equal to the community number. Then under the DCSBM,

    min_Π |Err_Π(Φ)| ≤ 8√k (n_max/(θ_min n_min)) (1 + n_max/n_min) (1 + α(k)) ‖Δ‖_{1,θ}^{1/2},

where α(k) is the approximation ratio of the polynomial-time K-means approximation
algorithm, and ‖Δ‖_{1,θ} is bounded with high probability by Lemma 4.4.2.

Proof. Consider the K-means problem (4.8). Suppose that the solution generated by some
polynomial-time approximation algorithm is denoted by Φ, X_c, and the global optimal
solution is denoted by Φ^opt, X_c^opt. Suppose that the approximation ratio of the algorithm
is α(k), where k is the cluster number in the algorithm. Then directly

    ‖ΦX_c − U*‖_F^2 ≤ α(k) ‖Φ^opt X_c^opt − U*‖_F^2.

By the optimality of Φ^opt, X_c^opt and the feasibility of Φ*, (Q*_2)^T, where Q*_2 is defined
in Theorem 4.4.4, we have

    ‖Φ^opt X_c^opt − U*‖_F^2 ≤ ‖Φ*(Q*_2)^T − U*‖_F^2.

Therefore,

    ‖ΦX_c − Φ*(Q*_2)^T‖_F^2 = ‖ΦX_c − U* + U* − Φ*(Q*_2)^T‖_F^2
        ≤ 2‖ΦX_c − U*‖_F^2 + 2‖Φ*(Q*_2)^T − U*‖_F^2
        ≤ 2α(k)‖Φ^opt X_c^opt − U*‖_F^2 + 2‖Φ*(Q*_2)^T − U*‖_F^2
        ≤ 2(1 + α(k))‖Φ*(Q*_2)^T − U*‖_F^2.

Denote by Z_i the ith row of a matrix Z. We define the sets that contain the potential
errors:

    E_a = {i ∈ C*_a : ‖(ΦX_c)_i − (Φ*(Q*_2)^T)_i‖^2 ≥ 1/2},
    E = ∪_{a=1}^k E_a,   S_a = C*_a \ E_a.

A straightforward bound on the cardinality of E gives

    |E| ≤ 2‖ΦX_c − Φ*(Q*_2)^T‖_F^2.

Now we define a partition of the community index set [k] = {1, ..., k}:

    B_1 = {a : S_a = ∅},
    B_2 = {a : ∀i, j ∈ S_a, (ΦX_c)_i = (ΦX_c)_j},
    B_3 = [k] \ (B_1 ∪ B_2).


For any i, j ∈ E^c = ∪_{a∈B_2∪B_3} S_a with i ∈ C*_a, j ∈ C*_b, a ≠ b,

    ‖(ΦX_c)_i − (ΦX_c)_j‖
        ≥ ‖(Φ*(Q*_2)^T)_i − (Φ*(Q*_2)^T)_j‖ − ‖(ΦX_c)_i − (Φ*(Q*_2)^T)_i‖
          − ‖(ΦX_c)_j − (Φ*(Q*_2)^T)_j‖
        > √2 − √2/2 − √2/2 = 0.

That is, (ΦX_c)_i ≠ (ΦX_c)_j, or equivalently Φ_i ≠ Φ_j. This implies that the sets S_a,
a ∈ B_2, all belong to different clusters. It further indicates that all nodes in ∪_{a∈B_2} S_a
are successfully classified under a proper permutation Π*. In other words,

    Err_{Π*}(Φ) ⊂ (∪_{a∈B_3} S_a) ∪ E.

The above argument also reveals that if a ≠ b and S_a, S_b ≠ ∅, then S_a and S_b correspond
to different rows of X_c. If a ∈ B_2, then S_a corresponds to only one row of X_c. If a ∈ B_3,
then S_a corresponds to at least two different rows of X_c. But note that X_c contains at
most k different rows. Therefore,

    |B_2| + 2|B_3| ≤ k = |B_1| + |B_2| + |B_3|,

which further implies

    |B_3| ≤ |B_1|.

Noting that ∪_{a∈B_1} C*_a ⊂ E, we have |B_1| n_min ≤ |E|, or equivalently

    |B_1| ≤ |E| / n_min.

Combining all previous results yields

    |Err_{Π*}(Φ)| ≤ |E| + |∪_{a∈B_3} S_a| ≤ |E| + |B_3| n_max
        ≤ (1 + n_max/n_min)|E|
        ≤ 4(1 + n_max/n_min)(1 + α(k))‖U*Q*_2 − Φ*‖_F^2
        ≤ 8√k (n_max/(θ_min n_min))(1 + n_max/n_min)(1 + α(k))‖Δ‖_{1,θ}^{1/2}.

This completes the proof.

Note that the factor 1/θ_min appears in the bound. This term vanishes under the SBM,
where all θ_i = 1, but it is undesirable under the DCSBM. To avoid the dependence on this
term, we propose to solve the weighted K-means problem (4.9). In this case, suppose that
the solution to problem (4.9) is Φ; then the fair error measure will be

    min_Π Σ_{i∈Err_Π(Φ)} θ_i^2,

where Π is a permutation matrix. This measure means that when a node has a smaller θ_i,
it potentially has fewer edges and provides less information for clustering, so misclassifying
this node is somewhat forgivable and incurs less penalty. In contrast, if a node has a large
θ_i, then misclassifying it should incur more penalty. We now present a theorem that
addresses the above issue.

Theorem 4.4.6 Suppose that Φ is generated by some weighted K-means clustering algorithm
with the cluster number k equal to the community number. Then under the DCSBM,

    min_Π Σ_{i∈Err_Π(Φ)} θ_i^2 ≤ 16√k (J_max/(H_min^2 J_min)) (C_H + H_max^2) (1 + α(k)) ‖Δ‖_{1,θ}^{1/2},

where α(k) is the approximation ratio of the polynomial-time approximation algorithm for
problem (4.9) and C_H is a quantity of the same order as H_max^2.

The proof of this theorem is similar in nature to that of Theorem 4.4.5, but is more
technically involved. For the sake of succinctness, we omit the proof here.

For large-scale networks, solving problem (4.7) can be very efficient when the ℓ0 constraint
parameter p is set to a moderate value and the asynchronous parallel scheme is applied.
However, performing the K-means or weighted K-means clustering becomes a bottleneck
when the community number is relatively large. In that case, the proposed direct rounding
scheme is desirable. In the following theorem, we discuss the theoretical guarantees for this
rounding scheme.


To invoke the analysis, the following assumption is needed.

Assumption 4.4.7 Define the set Err(Q, δ) = {i : ‖(U*Q)_i − φ*_i‖^2 ≥ δ} and define
T_a(Q, δ) = C*_a \ Err(Q, δ). With Q*_1, Q*_2 defined according to Theorem 4.4.4, we
assume that T_a(Q*_1, 1/(32p^2)) ≠ ∅ and T_a(Q*_2, 1/(32p^2)) ≠ ∅ for all a.

This assumption states that for each community C*_a, there is at least one node i ∈ C*_a
such that U*_i is close to the true assignment φ*_i after the rotation Q*_1 or Q*_2. In our
extensive numerical experiments, this assumption always held.

Theorem 4.4.8 Suppose that Assumption 4.4.7 holds and the community assignment Φ is
given by directly rounding U*. Then under the DCSBM, the error is bounded by

    min_Π |Err_Π(Φ)| ≤ C_2 p^2 √k (n_max/(n_min θ_min)) ‖Δ‖_{1,θ}^{1/2},
    min_Π Σ_{i∈Err_Π(Φ)} θ_i^2 ≤ C_2 p^2 √k (M_max/J_min) ‖Δ‖_{1,θ}^{1/2},

where C_2 > 0 is some absolute constant.

Proof. (i) We prove the first inequality by showing that all nodes in ∪_{a=1}^k T_a(Q*_2, 1/(32p^2))
are correctly classified under a proper permutation matrix Π. Define δ_0 = 1/(32p^2). By
Assumption 4.4.7, we have

    |Err(Q*_2, δ_0)| ≤ (1/δ_0) ‖U*Q*_2 − Φ*‖_F^2,  and  T_a(Q*_2, δ_0) ≠ ∅, 1 ≤ a ≤ k.

First, for any i, j ∈ T_a(Q*_2, δ_0) and some a ∈ [k], we have φ*_i = φ*_j. Recalling that
(u*_i)^T stands for the ith row of the solution U* to problem (4.7), we further have

    ‖u*_i − u*_j‖ = ‖(Q*_2)^T(u*_i − u*_j)‖
                  = ‖(Q*_2)^T u*_i − φ*_i + φ*_j − (Q*_2)^T u*_j‖
                  ≤ ‖(Q*_2)^T u*_i − φ*_i‖ + ‖(Q*_2)^T u*_j − φ*_j‖
                  < 2√δ_0.


Second, for any i ∈ T_a(Q*_2, δ_0) and j ∈ T_b(Q*_2, δ_0) with a ≠ b, we have φ*_i ≠ φ*_j.
Similarly,

    ‖u*_i − u*_j‖ = ‖(Q*_2)^T(u*_i − u*_j)‖
                  = ‖(Q*_2)^T u*_i − φ*_i + φ*_i − φ*_j + φ*_j − (Q*_2)^T u*_j‖
                  ≥ ‖φ*_i − φ*_j‖ − ‖(Q*_2)^T u*_i − φ*_i‖ − ‖(Q*_2)^T u*_j − φ*_j‖
                  > √2 − 2√δ_0.

That is,

    ‖u*_i − u*_j‖ < 2√δ_0  for all a and all i, j ∈ T_a(Q*_2, δ_0),  and
    ‖u*_i − u*_j‖ > √2 − 2√δ_0  for all a ≠ b, i ∈ T_a(Q*_2, δ_0), j ∈ T_b(Q*_2, δ_0).    (4.27)

From each set T_a(Q*_2, δ_0), we take an arbitrary representative v_a. Then we have k vectors
in R^k such that

    ‖v_i‖^2 = 1,  ‖v_i‖_0 ≤ p,  v_i ≥ 0,  for all 1 ≤ i ≤ k,
    ‖v_i − v_j‖ ≥ √2 − 2√δ_0,  for all i ≠ j,

which further implies that

    〈v_i, v_j〉 ≤ ε,  for all i ≠ j,

where ε = 1/(2p) ≥ 2(√(2δ_0) − δ_0). The remainder of the proof consists of three main steps.

Step (a). Define m(i) = arg max_{1≤j≤k} (v_i)_j, where (v_i)_j is the jth component of v_i.
Then for any i ≠ j, we prove m(i) ≠ m(j). Suppose there exist i ≠ j such that m(i) = m(j).
Then by the definition of m(i), it is straightforward that

    (v_i)_{m(i)} ≥ 1/√p.

Therefore, since v_i, v_j ≥ 0,

    〈v_i, v_j〉 ≥ (v_i)_{m(i)} · (v_j)_{m(j)} ≥ 1/p > 1/(2p) = ε,

which leads to a contradiction. We can then choose a proper permutation of indices Π
such that m(i) = i for 1 ≤ i ≤ k. This means that all k representatives v_1, ..., v_k are
correctly classified.


Step (b). Suppose that, after a proper permutation of indices, m(i) = i for 1 ≤ i ≤ k. In
this step, we prove that for all 1 ≤ i ≤ k, v_i is sufficiently close to e_i. First, by

    (v_1)_2 · (1/√p) ≤ (v_1)_2 · (v_2)_2 ≤ 〈v_1, v_2〉 ≤ 1/(2p),    (4.28)

we obtain

    (v_1)_2 ≤ 1/(2√p).    (4.29)

Repeating the same argument for (v_1)_3, ..., (v_1)_k shows that they are all less than or
equal to 1/(2√p). It follows from ‖v_1‖_0 ≤ p and ‖v_1‖^2 = 1 that

    (v_1)_1^2 = 1 − Σ_{j=2}^k (v_1)_j^2 ≥ 1 − (p − 1)/(4p) > 3/4.    (4.30)

By repeating the same argument for v_2, ..., v_k, we have, for 1 ≤ i ≤ k,

    (v_i)_i > √3/2,
    (v_i)_j < 1/(2√p),  for all j ≠ i.    (4.31)

Now, based on the new lower bounds on (v_i)_i, 1 ≤ i ≤ k, we repeat the process (4.28),
(4.30) and (4.31) again. We obtain that, for all 1 ≤ i ≤ k,

    (v_i)_i > √(1 − 1/(3p)),
    (v_i)_j < 1/(√3 p),  for all j ≠ i.

Step (c). We now argue that all nodes in ∪_{1≤a≤k} T_a(Q*_2, δ_0) are correctly classified.
For any other node s ∈ T_a(Q*_2, δ_0), due to its proximity to v_a by (4.27), we know that

    (u*_s)_a > √(1 − 1/(3p)) − √2/(4p),
    (u*_s)_j < 1/(√3 p) + √2/(4p),  for all j ≠ a.

For all p ≥ 2, (u*_s)_a > (u*_s)_j for j ≠ a. This means that all nodes in ∪_{1≤a≤k} T_a(Q*_2, δ_0)
are correctly classified. That is,

    min_Π |Err_Π(Φ)| ≤ |Err(Q*_2, δ_0)| ≤ (1/δ_0) ‖U*Q*_2 − Φ*‖_F^2
                     ≤ C_2 p^2 √k (n_max/(n_min θ_min)) ‖Δ‖_{1,θ}^{1/2},


where the last inequality follows from the result of Theorem 4.4.4.

(ii) The second inequality of the theorem follows directly from the first one:

    Σ_{i∈Err(Q*_1,δ_0)} θ_i^2 ≤ Σ_{i∈Err(Q*_1,δ_0)} θ_i^2 · (1/δ_0) · ‖(Q*_1)^T u*_i − φ*_i‖^2 ≤ (1/δ_0) ‖Θ(U*Q*_1 − Φ*)‖_F^2.

Then, following the bound in Theorem 4.4.4, the second inequality is proved.

4.5 Numerical Results

In this section, we evaluate the performance of our RBR method on both synthetic and
real datasets, where the parameter λ is set to 1/‖d‖_1 in all cases. Our RBR algorithms
with rounding and with K-means are referred to as RBR(r) and RBR(K), respectively. For
synthetic and small-sized real networks, we compare them with the CMM algorithm in
[17], the SCORE algorithm in [46] and the OCCAM algorithm in [106]. For large-sized real
datasets, we only compare the asynchronous parallel RBR algorithm with the LOUVAIN
algorithm in [8], an extremely fast algorithm for processing large networks.

4.5.1 Solvers and Platform

The algorithm RBR(r) is implemented in C, with multi-threading support using OpenMP.
Due to its use of K-means, the version RBR(K) is written in MATLAB. The CMM,
SCORE and OCCAM implementations for synthetic and small-sized real networks are also
written in MATLAB. For large-sized networks, we use the LOUVAIN solver hosted on
Google Sites1, which is well implemented in C++. The parameters in all methods are set
to their default values unless otherwise specified. Since k, the number of communities of
the synthetic and small-sized real networks, is known (and usually very small), the
parameter r, i.e., the number

of columns of U, is set to k. The parameter p in the ℓ0 constraints of our RBR algorithm
is also set to k in these tests. A few different values of p are tested on large-sized real
networks.

1 See https://sites.google.com/site/findcommunities/

All numerical experiments are performed on a workstation with two twelve-core Intel Xeon
E5-2680 v3 processors at 2.5 GHz and a total of 128 GB of shared memory. The efficiency
of OpenMP may be greatly affected by NUMA effects: it is much slower for a CPU to
access the memory attached to another CPU. To prevent such thread allocations, we bind
each program to one CPU, i.e., up to 12 physical cores are available per task.

Other programming environments are:

• gcc/g++ version 6.2.0, used for compiling our C programs and the LOUVAIN
executables.

• MATLAB version R2015b, used for running the CMM, SCORE, OCCAM and RBR
with K-means solvers.

• Intel MKL 2017 for BLAS routines, linked with libmkl_sequential.so.

We should mention that the Intel MKL library comes in threaded and sequential versions.
The threaded MKL library should be disabled in favor of our manual parallelization: when
updating one row, the threaded library would automatically utilize all available computing
threads for BLAS, which may interfere with the cores processing other rows. Besides, the
sequential library is usually faster than the parallel library running with one thread. The
reported runtimes are wall-clock times in seconds.

4.5.2 Results on Synthetic Data

To set up the experiments, we generate a synthetic graph of n = m × k nodes and split the
nodes into k groups C*_1, ..., C*_k, each with m nodes. These groups are then used as the
ground truth of the model. The edges of the graph are generated by sampling: each pair
of nodes i ∈ C*_a and j ∈ C*_b is connected with probability min{1, θ_i θ_j B_{ab}}, where

    B_{ab} = q if a = b,  and  0.3q if a ≠ b.    (4.32)

For each i, θ_i is sampled independently from a Pareto(α, β) distribution with probability
density function f(x; α, β) = (α β^α / x^{α+1}) 1_{x>β}. Here α and β are called the shape
and scale parameters, respectively. In our testing model, we first choose different shape
parameters α, and then select the scale parameter β such that E(θ_i) = 1 for all i. These
parameters determine the strength of the cluster structure of the graph: with larger q and
α, it is easier to detect the community structure; otherwise, all these algorithms may fail.
Since the ground truths of the synthetic networks are known, we compute the
misclassification rate to check the correctness of these algorithms. Suppose that C_1, ..., C_k
are the detected communities provided by an algorithm; then the misclassification rate can
be defined as

    err := 1 − (Σ_{i=1}^k max_j |C_i ∩ C*_j|) / n.    (4.33)
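The sampling procedure above can be sketched as follows. This is an illustration only; in particular, the Pareto scale is chosen as β = (α − 1)/α so that E(θ_i) = αβ/(α − 1) = 1, which assumes α > 1.

```python
import numpy as np

def sample_dcsbm(m, k, q, alpha, rng):
    """Sample a DCSBM graph with k equal groups of size m, connection
    matrix B_ab = q on the diagonal and 0.3q off-diagonal (cf. (4.32)),
    and Pareto(alpha, beta) degree weights normalized so E(theta_i) = 1."""
    n = m * k
    labels = np.repeat(np.arange(k), m)               # ground-truth groups
    beta = (alpha - 1.0) / alpha                      # makes E(theta) = 1
    theta = beta * (1.0 + rng.pareto(alpha, size=n))  # classical Pareto(alpha, beta)
    B = np.full((k, k), 0.3 * q) + np.diag([0.7 * q] * k)
    # pairwise connection probabilities min{1, theta_i theta_j B_ab}
    P = np.minimum(1.0, np.outer(theta, theta) * B[labels][:, labels])
    upper = rng.random((n, n)) < P                    # draw each pair once
    A = np.triu(upper, 1)
    A = (A + A.T).astype(float)                       # symmetric, zero diagonal
    return A, labels, theta
```

Note that NumPy's `rng.pareto` draws from the Lomax distribution, so the classical Pareto variate is obtained as β(1 + Lomax).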

We need to mention that permuting the columns of U does not change the objective function
value. Since it is impractical to enumerate all possible permutations and choose the best
one to match the detected communities to the ground truth when k is large, we still use
the definition of the misclassification rate in (4.33), which works in most cases in our
numerical experiments.
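Definition (4.33) translates directly into code (a sketch with hypothetical community sets; note that the greedy max over j may assign the same ground-truth group to several detected communities, which is why (4.33) is only a proxy for a true permutation matching, as the text explains):

```python
def misclassification_rate(detected, truth):
    """err = 1 - (sum_i max_j |C_i ∩ C*_j|) / n, following (4.33).
    `detected` and `truth` are lists of node-index sets."""
    n = sum(len(c) for c in truth)
    matched = sum(max(len(Ci & Cj) for Cj in truth) for Ci in detected)
    return 1.0 - matched / n

detected = [{0, 1, 2}, {3, 4, 5}]
truth    = [{0, 1, 3}, {2, 4, 5}]
# each detected community overlaps its best true group in 2 of 3 nodes
assert abs(misclassification_rate(detected, truth) - 1.0 / 3.0) < 1e-12
```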

The performance of our method is shown in Figure 4.1. The y-axis is the misclassification
rate, while the x-axis is the shape parameter α by which we generate the random graphs.
The solid line with circle markers is our RBR algorithm with rounding; the line with "x"
markers stands for the CMM algorithm. The q in the figure denotes the intra-cluster
connection probability; the inter-cluster connection probability is set to 0.3q. It is clear
that in all these cases our approach outperforms the CMM algorithm, especially when the
intra-cluster connection probability is small, i.e., the cluster structure is not very strong.
Tables 4.1 and 4.2 report the results under q = 0.1, α = 1.4 and q = 0.1, α = 1.8,
respectively. We can see that CMM and our method have similar performance, and both
are better than SCORE and OCCAM. If the shape parameter is increased from 1.4 to 1.8,
all four methods perform better, but our method still has the smallest misclassification
rate.

Table 4.1: Misclassification rate and runtime, q = 0.1, α = 1.4

            n = 200, k = 2    n = 450, k = 2    n = 200, k = 3    n = 200, k = 4
  solver    err     Time      err     Time      err     Time      err     Time
  CMM       0.011   5.10      0.000   22.05     0.003   10.43     0.003   18.37
  RBR(r)    0.009   0.01      0.000   0.03      0.003   0.03      0.002   0.07
  RBR(K)    0.011   1.35      0.000   4.33      0.003   3.00      0.003   4.37
  OCCAM     0.015   0.02      0.200   0.05      0.191   0.04      0.224   0.08
  SCORE     0.015   0.02      0.180   0.03      0.074   0.05      0.149   0.09

Table 4.2: Misclassification rate and runtime, q = 0.1, α = 1.8

            n = 200, k = 2    n = 450, k = 2    n = 200, k = 3    n = 200, k = 4
  solver    err     Time      err     Time      err     Time      err     Time
  CMM       0.003   5.20      0.000   22.77     0.001   10.44     0.001   18.66
  RBR(r)    0.003   0.01      0.000   0.04      0.001   0.03      0.001   0.06
  RBR(K)    0.002   1.41      0.000   4.76      0.001   3.24      0.001   5.08
  OCCAM     0.004   0.02      0.000   0.04      0.001   0.04      0.029   0.06
  SCORE     0.004   0.02      0.000   0.03      0.003   0.04      0.021   0.05

4.5.3 Results on Small-Scale Real-World Data

In this section, we test the empirical performance of CMM, SCORE, OCCAM, RBR(r)
and RBR(K) on the US political blog network dataset from [56], and on the Simmons
College and Caltech networks from the Facebook dataset. The properties of these networks
are shown in Table 4.3, where nc is the number of real communities.


Table 4.3: Small-scale real-world networks with ground truth

  Name       nodes    edges    nc    feature of community structure
  Polblogs   1222     16714    2     political leaning
  Simmons    1168     24449    4     graduation year between 2006 and 2009
  Caltech    597      12823    8     total of 8 dorm numbers

We run our RBR algorithms 10 times as a batch, each run starting from a different random initial point. The solution with the highest modularity is chosen as the result of the batch. This procedure is repeated 300 times. The histograms of the misclassification rates of RBR(r) are shown in Figure 4.2. A summary of the misclassification rates and the CPU times of all algorithms is presented in Table 4.4, where the results for RBR(r) and RBR(K) are averaged values.
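The batch-restart selection described above can be sketched as follows; `solve` and `score` are hypothetical placeholders standing in for the RBR solver and the modularity evaluation, not the thesis code.

```python
import random

def best_of_batch(solve, score, batch_size=10):
    """Run `solve` from `batch_size` random initial points and keep the
    candidate with the highest modularity `score` (a sketch of the
    batch-restart procedure; `solve` and `score` are placeholders)."""
    best, best_q = None, float("-inf")
    for _ in range(batch_size):
        candidate = solve(seed=random.randrange(2**31))  # random restart
        q = score(candidate)
        if q > best_q:
            best, best_q = candidate, q
    return best, best_q
```

Repeating this batch 300 times then yields the histogram of per-batch misclassification rates.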

Table 4.4: Overall comparison between the five methods

          Polblogs         Simmons          Caltech
          err      Time    err      Time    err      Time
CMM       4.99%    50.12   13.10%   53.74   21.94%   14.22
SCORE     4.75%    0.05    23.92%   0.11    27.18%   0.17
OCCAM     5.32%    0.10    24.36%   0.16    35.53%   0.10
RBR(K)    5.09%    1.42    12.56%   10.53   16.51%   1.08
RBR(r)    4.75%    0.04    14.00%   0.08    21.03%   0.06

On the Polblogs network, the smallest misclassification rate of the RBR(r) method is 4.4%, while its average is 4.75%. For the Caltech network, we use dorms as the ground-truth cluster labels according to [82]; the nodes are partitioned into 8 groups. The average misclassification rates of RBR(K) and RBR(r) are 16.54% and 21.03%, respectively, whereas CMM, SCORE and OCCAM have higher error rates of 21.94%, 27.18% and 35.53%, respectively. In particular, the smallest error achieved by RBR(K) is 15.55%. For the Simmons


College network, the graduation years, ranging from 2006 to 2009, are treated as cluster labels because of their strong correlation with the community structure observed in [82]. RBR(r), RBR(K), CMM, SCORE and OCCAM misclassify 14.00%, 12.56%, 13.10%, 23.92% and 24.36% of the nodes on average, respectively. The smallest error of RBR(K) reaches 11.5%.

Finally, the confusion matrices of CMM, SCORE and RBR(r) on these networks are presented in Figures 4.3, 4.4 and 4.5, respectively. The results for OCCAM and RBR(K) are not shown due to their similarity with those of SCORE and RBR(r), respectively. We can see that RBR(r) is indeed competitive at identifying good communities.

4.5.4 Results on Large-Scale Data

For large-scale networks, we choose ten undirected networks from the Stanford Large Network Dataset Collection [56] and the UF Sparse Matrix Collection [23]. A detailed description of these networks can be found in Table 4.5. The number of nodes varies from 36 thousand to 50 million. Due to the size of these networks, we run the RBR(r) algorithm on each of them only once.

Table 4.5: Large-scale networks

Name            nodes       edges       Description
amazon          334,863     925,872     Amazon product network
DBLP            317,080     1,049,866   DBLP collaboration network
email-Enron     36,692      183,831     Email communication network from Enron
loc-Gowalla     196,591     950,327     Gowalla location-based online social network
loc-Brightkite  58,228      214,078     Brightkite location-based online social network
youtube         1,134,890   2,987,624   Youtube online social network
LiveJournal     3,997,962   34,681,189  LiveJournal online social network
Delaunay n24    16,777,216  50,331,601  Delaunay triangulation of random points in the plane
road usa        23,947,347  28,854,312  Road network of USA
europe osm      50,912,018  54,054,660  Road network of Europe


For these large networks, we do not have the true partition. Therefore we employ three

metrics to evaluate the quality of the clustering.

• CC: The cluster coefficient (CC) is defined as

CC = \frac{1}{n_C} \sum_{i=1}^{n_C} \frac{1}{|C_i|} \sum_{v \in C_i} \frac{2\,|\{ e_{ts} \in E : v_t, v_s \in N(v) \cap C_i \}|}{d(v)(d(v)-1)},    (4.34)

where n_C is the total number of detected communities, |C_i| is the cardinality of the community C_i, and N(v) is the set of all neighbors of node v. A high CC thus means that the connections within each cluster are dense: it assumes that for any v within a community C, the neighborhood of v should be close to a complete graph. In some sparse graphs, such as road networks, the CC value may be small.

• S: The strength (S) reflects the idea that a community is valid if most nodes have the same label as their neighbors. Denote by d(v)_in and d(v)_out the degrees of v inside and outside the community C containing v. If d(v)_in > d(v)_out for all v in C, then C is a strong community; if \sum_{v \in C} d(v)_in > \sum_{v \in C} d(v)_out, then C is a weak community; otherwise C is not valid as a community. Consequently, S is defined as

S = \frac{1}{n_C} \sum_{i=1}^{n_C} score(C_i),    (4.35)

where score(C_i) = 1 if C_i is strong, 0.5 if C_i is weak, and 0 if C_i is invalid.

A clustering with a high S implies that neighbouring nodes tend to have the same label, which is the case for most graphs with community structure.

• Q: Modularity (Q) is a widely used measure in community detection. In our RBR algorithm, the modularity is the absolute value of the objective function value up to the constant factor 2|E|.
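The three metrics above can be sketched on an adjacency-set representation; this is a minimal illustration under assumed data layouts (`adj` maps each node to its neighbor set, communities are node sets), not the thesis implementation.

```python
def cluster_coefficient(adj, communities):
    """CC of Eq. (4.34): average, over communities and their nodes, of the
    edge density among the within-community neighbors of each node."""
    total = 0.0
    for C in communities:
        acc = 0.0
        for v in C:
            d = len(adj[v])
            if d < 2:
                continue  # fewer than two neighbors: no pairs to count
            inside = adj[v] & C
            # each edge among the within-community neighbors is seen twice
            links = sum(1 for t in inside for s in adj[t] if s in inside) / 2
            acc += 2.0 * links / (d * (d - 1))
        total += acc / len(C)
    return total / len(communities)

def strength(adj, communities):
    """S of Eq. (4.35): 1 per strong community, 0.5 per weak one, 0 otherwise."""
    total = 0.0
    for C in communities:
        d_in = {v: len(adj[v] & C) for v in C}
        d_out = {v: len(adj[v] - C) for v in C}
        if all(d_in[v] > d_out[v] for v in C):
            total += 1.0
        elif sum(d_in.values()) > sum(d_out.values()):
            total += 0.5
    return total / len(communities)

def modularity(adj, labels):
    """Newman's modularity Q: within-community edge fraction minus its
    expectation under a degree-preserving null model."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # equals 2|E|
    q = 0.0
    for i in adj:
        for j in adj:
            if labels[i] == labels[j]:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m
```

For example, on two triangles joined by a single edge, with each triangle as one community, both communities are strong (S = 1) and Q = 5/14.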


We first evaluate the performance of RBR(r) with respect to different values of k, i.e., the expected number of communities. Specifically, for each k ∈ {5, 10, 20, 30, 50, 100, 200}, we perform a fixed number of iterations and report the corresponding values of the metrics (CC, S, Q). The parameter p is set to 5 in all cases. The results are shown in Figure 4.6. The modularity Q increases as k becomes larger, but it is almost unchanged once k > 50. From the optimization perspective, a larger k means a larger feasible domain, so a better objective value is to be expected. On the other hand, the values of CC and S are almost the same for all k.

We next compare the performance of RBR(r) with respect to different values of p in the ℓ0 constraint of each subproblem. By taking advantage of the sparsity, our RBR algorithm is able to solve problems of very large scale, since the storage of U is O(np). For each p ∈ {1, 2, 5, 10, 20}, we set k = 100 and perform a fixed number of iterations. The results are shown in Figure 4.7. Ideally, the RBR method searches for solutions of the subproblem in a larger space when a larger p is applied. From the figure, we observe that the RBR algorithm performs better when a larger p is used. The metrics become similar once p > 5, but they are better than the results with p = 1 and p = 2.
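The per-row ℓ0 constraint is what keeps the storage of U at O(np): each row retains only its p largest-magnitude entries. A minimal sketch, assuming a sparse row stored as an index-to-value dict (a hypothetical layout, not the thesis data structure):

```python
import heapq

def keep_top_p(row, p):
    """Project one sparse row of U onto the l0 ball of radius p:
    retain the p entries of largest magnitude and drop the rest."""
    if len(row) <= p:
        return dict(row)  # already feasible
    return dict(heapq.nlargest(p, row.items(), key=lambda kv: abs(kv[1])))
```

With p fixed, each of the n rows stores at most p nonzeros, giving the O(np) memory footprint mentioned above.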

Finally, we compare our parallel RBR algorithm with the LOUVAIN algorithm, which runs extremely fast and usually provides good results. It is worth mentioning that the LOUVAIN algorithm can begin with different initial solutions. By default it starts from the trivial classification in which each node belongs to a community containing only itself, and it merges clusters in order to obtain a higher modularity. The LOUVAIN algorithm terminates when it is unable to increase the modularity or the improvement is insignificant. In this sense, there is no guarantee on the number of communities it finally produces. On the other hand, one can limit the maximal number of communities by letting the LOUVAIN algorithm start from a partition with no more than k communities. Since the LOUVAIN algorithm always reduces the number of communities by merging, it is possible to obtain a low-rank solution. However, the quality of the solution may depend heavily on the initial values. Without utilizing much prior knowledge of these networks, we set the initial


value of both algorithms to be a random partition with at most k labels.

A summary of the computational results is shown in Table 4.6. For each algorithm, three sets of parameters are tested. For example, RBR(100,1) denotes that the desired number of communities k is set to 100 and p is set to 1; LV(500) means that the LOUVAIN algorithm begins with a randomly assigned partition with at most 500 communities. LV(0), however, means that we launch the LOUVAIN algorithm with the trivial partition, in which case it may not produce low-rank solutions. In the table, k0 stands for the actual number of communities that the algorithm finally produces. If k is given, then k0 will not exceed k. The CPU time of each algorithm is also reported.

From the perspective of solution quality, LV(0) outperforms the other algorithms and settings: its CC, S and Q are high on most networks. In particular, on the last three large networks, the modularity Q is over 0.99. However, it is hard to control k0 in LV(0). This is acceptable on networks such as DBLP and Delaunay n24, but LV(0) may produce over one thousand communities, and k0 exceeds 28,000 on the youtube network. Although LV(500) and LV(2000) can return partitions with small k0, their modularity is not as good as that of LV(0).

We can see that RBR(100,5) almost always returns a good modularity; it is even better than LV(0) on email-Enron. Note that RBR(20,5) is also very competitive, providing low-rank solutions without sacrificing too much modularity. These facts indicate that our RBR algorithm is preferable to the LOUVAIN algorithm when a small number of communities is needed. Additionally, it can be observed that RBR(100,1) produces results similar to those of LV(500) and LV(2000). In fact, the procedures of these algorithms are similar in the sense of finding a new label for a given node so as to increase the modularity Q. The difference is that the LOUVAIN algorithm only considers the labels of the node's neighbours, while our RBR algorithm searches for the new label among all k possibilities. This means that, if modified properly, our RBR algorithm can also utilize the network structure.
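The contrast between the two label-search strategies can be sketched as follows; `gain(node, label)` is a hypothetical placeholder for the modularity-gain evaluation, not either method's actual implementation.

```python
def best_label(node, adj, labels, k, gain, neighbors_only=False):
    """Pick the community label for `node` that maximizes the modularity
    gain.  With neighbors_only=True, only the labels of the node's
    neighbors are scanned (LOUVAIN-style move); otherwise all k labels
    are scanned (RBR-style move)."""
    if neighbors_only:
        candidates = {labels[u] for u in adj[node]}  # network-guided
    else:
        candidates = range(k)  # exhaustive over all k labels
    return max(candidates, key=lambda c: gain(node, c))
```

Restricting the candidate set to neighbors' labels is exactly the modification that would let RBR exploit the network structure to cut the sorting cost.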

The LOUVAIN algorithm is faster in terms of the running time. Each step of the LOUVAIN

algorithm only considers at most dmax nodes, where dmax is the maximum of the vertex


Table 4.6: Comparison between RBR and LOUVAIN algorithm under different settings

            amazon                             youtube
method      k0    CC     S      Q      Time    k0     CC     S      Q      Time
RBR(100,1)  100   0.426  0.500  0.755    3.5   100    0.589  0.480  0.623   12.5
RBR(20,5)    20   0.458  0.500  0.827    3.0    20    0.600  0.500  0.714   10.2
RBR(100,5)  100   0.467  0.500  0.895    5.0    98    0.614  0.490  0.710   22.6
LV(0)       230   0.531  0.598  0.926    1.7   28514  0.930  0.159  0.723    8.7
LV(500)     165   0.427  0.500  0.757    1.7    95    0.593  0.495  0.646    7.0
LV(2000)    183   0.435  0.500  0.775    1.7    93    0.587  0.495  0.658    6.0

            DBLP                               LiveJournal
method      k0    CC     S      Q      Time    k0     CC     S      Q      Time
RBR(100,1)  100   0.585  0.500  0.665    3.1   100    0.425  0.500  0.668   39.2
RBR(20,5)    20   0.672  0.500  0.754    2.9    20    0.448  0.500  0.737   43.4
RBR(100,5)  100   0.679  0.500  0.801    5.6    98    0.497  0.490  0.754   64.5
LV(0)       187   0.742  0.618  0.820    2.9   2088   0.729  0.877  0.749  181.5
LV(500)     146   0.583  0.500  0.673    2.1   101    0.433  0.500  0.665   81.9
LV(2000)    155   0.578  0.500  0.681    2.3   113    0.447  0.500  0.687  172.7

            email-Enron                        loc-Brightkite
method      k0    CC     S      Q      Time    k0     CC     S      Q      Time
RBR(100,1)  100   0.589  0.400  0.559    0.3   100    0.485  0.360  0.596    0.5
RBR(20,5)    20   0.708  0.500  0.605    0.3    20    0.516  0.500  0.667    0.5
RBR(100,5)  100   0.782  0.585  0.605    0.6   100    0.552  0.510  0.675    0.8
LV(0)       1245  0.935  0.974  0.597    0.2   732    0.829  0.939  0.687    0.3
LV(500)      45   0.705  0.600  0.581    0.2    47    0.513  0.500  0.637    0.3
LV(2000)    331   0.894  0.899  0.604    0.2    79    0.653  0.671  0.661    0.4

            loc-Gowalla                        Delaunay n24
method      k0    CC     S      Q      Time    k0     CC     S      Q      Time
RBR(100,1)  100   0.439  0.470  0.637    1.7   100    0.298  0.500  0.730   190.7
RBR(20,5)    20   0.459  0.500  0.694    1.7    20    0.372  0.500  0.837   172.9
RBR(100,5)   98   0.497  0.475  0.696    4.3   100    0.365  0.500  0.868   252.5
LV(0)       850   0.592  0.761  0.705    1.2   354    0.430  0.500  0.990   189.1
LV(500)      60   0.442  0.500  0.658    1.3   500    0.287  0.500  0.717   792.7
LV(2000)     65   0.460  0.500  0.672    1.1   981    0.286  0.500  0.716  1390.5

            europe osm                         road usa
method      k0    CC     S      Q      Time    k0     CC     S      Q      Time
RBR(100,1)  100   0.042  0.500  0.790  689.9   100    0.216  0.500  0.736  274.3
RBR(20,5)    20   0.042  0.500  0.822  574.5    20    0.216  0.500  0.841  239.6
RBR(100,5)  100   0.042  0.500  0.894  717.4   100    0.216  0.500  0.908  297.4
LV(0)       3057  0.038  0.501  0.999  481.8   1569   0.225  0.510  0.998  303.2
LV(500)     500   0.041  0.500  0.663  243.7   500    0.216  0.500  0.663  153.6
LV(2000)    1999  0.041  0.500  0.663  337.3   1367   0.216  0.500  0.662  206.1


degrees, while our RBR algorithm has to sort all k possible labels and choose at most p best candidates. However, RBR(20,5) is still competitive on most graphs in terms of both speed and the ability to identify good communities. It is the fastest on networks such as LiveJournal and Delaunay n24. Of course, it is possible for our RBR algorithm to consider fewer labels, according to the structure of the network, in order to reduce the computational cost of sorting.

We should point out that the LOUVAIN algorithm is run in single-threaded mode since its multi-threaded version is not available. On the other hand, RBR(r) is an asynchronous parallel method. Although its theoretical properties are not clear, it works well in all of our experiments; thus RBR(r) is useful on multi-core machines. The speedup of RBR(r) when utilizing multiple computing threads is shown in Figure 4.8. We can see that our RBR algorithm enjoys an almost linear acceleration rate thanks to the asynchronous parallelization. The average speedup is close to 8 using 12 threads; in particular, it reaches a 10x speedup on the loc-Brightkite network.
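An asynchronous sweep of this kind can be sketched with lock-free worker threads drawing nodes from a shared queue and updating the iterate in place. This is a Hogwild-style illustration under our assumptions, not the thesis code; `update` is a hypothetical row-update callback.

```python
import queue
import threading

def async_sweep(nodes, update, num_threads=4):
    """Worker threads pop nodes from a shared queue and apply
    `update(node)` without locking the shared iterate, mirroring the
    lock-free parallelization used by RBR(r)."""
    work = queue.Queue()
    for v in nodes:
        work.put(v)

    def worker():
        while True:
            try:
                v = work.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            update(v)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because no lock protects the iterate, concurrent updates may read slightly stale values, which is exactly why the theoretical behavior is unclear even though the method performs well in practice.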


Figure 4.1: Numerical results on synthetic data: misclassification rate (err) of CMM and RBR(r) versus the shape parameter α for several values of q. Panels: (a) 2 communities, 200 nodes per community; (b) 2 communities, 450 nodes per community; (c) 3 communities, 200 nodes per community; (d) 4 communities, 200 nodes per community.


Figure 4.2: Histograms of the misclassification rates of RBR(r) on the Polblogs and Simmons networks (horizontal axis: misclassification rate; vertical axis: number of cases).

Figure 4.3: The confusion matrices of CMM, SCORE and RBR(r) on the Polblogs network (rows: political leaning, Lib/Con; columns: detected clusters 1–2).

Figure 4.4: The confusion matrices of CMM, SCORE and RBR(r) on the Caltech network (rows: dorms 1–8; columns: detected clusters 1–8).


Figure 4.5: The confusion matrices of CMM, SCORE and RBR(r) on the Simmons College network (rows: graduation years 2006–2009; columns: detected clusters 1–4).

Figure 4.6: Metrics (CC, S, Q) with respect to different values of k on (a) amazon, (b) DBLP, (c) youtube and (d) livejournal.


Figure 4.7: Metrics (CC, S, Q) with respect to different values of p on (a) amazon, (b) DBLP, (c) road usa and (d) Delaunay n24.


Figure 4.8: Acceleration rate of RBR(r) on large real data when using 1–12 computing threads. Panels: (e) amazon, (f) youtube, (g) dblp, (h) livejournal, (i) email-Enron, (j) brightkite.


Bibliography

[1] P. A. Absil, C. G. Baker, and K. A. Gallivan. Convergence analysis of Riemannian

trust-region methods. Technical Report, 2006. 2, 5

[2] P. A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian

manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007. 2, 5

[3] P. A. Absil and J. Malick. Projection-like retractions on matrix manifolds. SIAM

Journal on Optimization, 22(1):135–158, 2012. 34

[4] N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate

local minima faster than gradient descent. In Proceedings of the 49th Annual ACM

SIGACT Symposium on Theory of Computing, pages 1195–1199. ACM, 2017. 3, 58,

89, 91

[5] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization.

In Proceedings of the 33rd International Conference on International Conference on

Machine Learning (ICML), pages 699–707, 2016. 3, 60

[6] J. Ballani, L. Grasedyck, and M. Kluge. Black box approximation of tensors in

hierarchical Tucker format. Linear Algebra and its Applications, 438(2):639–657,

2013. 6


[7] T. Bianconcini, G. Liuzzi, B. Morini, and M. Sciandrone. On the use of iterative

methods in cubic regularization for unconstrained optimization. Computational Op-

timization and Applications, 60(1):35–57, 2015. 58

[8] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding

of communities in large networks. Journal of Statistical Mechanics: Theory and

Experiment, 10:10008, Oct. 2008. 95, 119

[9] N. Boumal, P. A. Absil, and C. Cartis. Global rates of convergence for nonconvex

optimization on manifolds. arXiv preprint arXiv:1605.08101, 2016. 2, 5, 34

[10] T. T. Cai, X. Li, et al. Robust and computationally feasible community detection in

the presence of arbitrary outlier nodes. The Annals of Statistics, 43(3):1027–1059,

2015. 94

[11] Y. Carmon and J. C. Duchi. Gradient descent efficiently finds the cubic-regularized

non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016. 3, 58, 89, 91

[12] C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for

unconstrained optimization. Part I: motivation, convergence and numerical results.

Mathematical Programming, 127(2):245–295, 2011. 3, 58, 59, 89, 91

[13] C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for

unconstrained optimization. Part II: worst-case function-and derivative-evaluation

complexity. Mathematical Programming, 130(2):295–319, 2011. 3, 58

[14] C. Cartis, N. I. M. Gould, and P. L. Toint. An adaptive cubic regularization algo-

rithm for nonconvex optimization with convex constraints and its function-evaluation

complexity. IMA Journal of Numerical Analysis, 32(4):1662–1695, 2012. 3

[15] K. Chaudhuri, F. C. Graham, and A. Tsiatas. Spectral clustering of graphs with

general degrees in the extended planted partition model. In COLT, volume 23, pages

35–1, 2012. 93


[16] X. Chen, B. Jiang, T. Lin, and S. Zhang. On adaptive cubic regularized New-

ton’s methods for convex optimization via random sampling. arXiv preprint

arXiv:1802.05426, 2018. 58, 59, 60, 63

[17] Y. Chen, X. Li, and J. Xu. Convexified modularity maximization for degree-corrected

stochastic block models. arXiv preprint arXiv:1512.08425, 2015. 50, 54, 94, 95, 97,

99, 107, 108, 119

[18] Y. Chen, S. Sanghavi, and H. Xu. Clustering sparse graphs. In Advances in neural

information processing systems, pages 2204–2212, 2012. 94

[19] Y. Chen and J. Xu. Statistical-computational phase transitions in planted models:

The high-dimensional setting. In ICML, pages 244–252, 2014. 94

[20] F. H. Clarke. Optimization and Nonsmooth Analysis. Wiley, New York, 1983. 11, 12

[21] A. Coja-Oghlan and A. Lanka. Finding planted partitions in random graphs with

general degree distributions. SIAM Journal on Discrete Mathematics, 23(4):1682–

1714, 2009. 93

[22] A. Dasgupta, J. E. Hopcroft, and F. McSherry. Spectral analysis of random graphs

with skewed degree distributions. In Foundations of Computer Science, 2004. Pro-

ceedings. 45th Annual IEEE Symposium on, pages 602–610. IEEE, 2004. 93, 96

[23] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. http://www.cise.ufl.edu/research/sparse/matrices, 2011. 124

[24] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value

decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–

1278, 2000. 6

[25] C. Ding, X. He, and H. D. Simon. On the equivalence of nonnegative matrix fac-

torization and spectral clustering. In Proceedings of the 2005 SIAM International

Conference on Data Mining, pages 606–610. SIAM, 2005. 93


[26] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-

factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international

conference on Knowledge discovery and data mining, pages 126–135. ACM, 2006. 93

[27] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled Newton meth-

ods. In Advances in Neural Information Processing Systems (NIPS) 28, pages 3052–

3060, 2015. 60

[28] A. Frieze and M. Jerrum. Improved approximation algorithms for max k-cut and max bisection. Algorithmica, 18(1):67–81, 1997. 45, 46, 52

[29] S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex

stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. 2

[30] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and

stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016. 2

[31] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods

for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-

2):267–305, 2016. 2

[32] S. Ghosh and H. Lam. Computing worst-case input models in stochastic simulation.

arXiv preprint arXiv:1507.05609, 2015. 6

[33] S. Ghosh and H. Lam. Mirror descent stochastic approximation for computing worst-

case stochastic input models. In Winter Simulation Conference, 2015, pages 425–436.

IEEE, 2015. 6

[34] M. Grant, S. Boyd, and Y. Ye. Cvx: Matlab software for disciplined convex pro-

gramming, 2008. 52

[35] O. Guedon and R. Vershynin. Community detection in sparse networks via Grothendieck's inequality. Probability Theory and Related Fields, 165(3-4):1025–1049, 2016. 94


[36] L. Gulikers, M. Lelarge, and L. Massoulie. A spectral method for community detec-

tion in moderately-sparse degree-corrected stochastic block models. arXiv preprint

arXiv:1506.08621, 2015. 93

[37] E. Hazan and T. Koren. A linear-time algorithm for trust region problems. Mathe-

matical Programming, 158(1):363–381, 2016. 89, 91

[38] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps.

Social networks, 5(2):109–137, 1983. 93, 95

[39] M. Hong. Decomposing linearly constrained nonconvex problems by a proximal

primal dual approach: Algorithms, convergence, and applications. arXiv preprint

arXiv:1604.00543, 2016. 6

[40] M. Hong, Z.-Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direc-

tion method of multipliers for a family of nonconvex problems. SIAM Journal on

Optimization, 26(1):337–364, 2016. 6

[41] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012.

108

[42] S. Hosseini and M. R. Pouryayevali. Generalized gradients and characterization of

epi-Lipschitz sets in Riemannian manifolds. Fuel and Energy Abstracts, 74(12):3884–

3895, 2011. 10

[43] K. Huper and J. Trumpf. Newton-like methods for numerical optimization on man-

ifolds. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-

Eighth Asilomar Conference, volume 1, pages 136–139. IEEE, 2004. 5

[44] B. Jiang, T. Lin, S. Ma, and S. Zhang. Structured nonconvex and nons-

mooth optimization: Algorithms and iteration complexity analysis. arXiv preprint

arXiv:1605.02408, 2016. 6, 15, 16, 17, 19, 42


[45] B. Jiang, S. Ma, A. M.-C. So, and S. Zhang. Vector transport-free SVRG with

general retraction for Riemannian optimization: Complexity analysis and practical

implementation. Preprint available at https://arxiv.org/abs/1705.09059, 2017. 6, 35

[46] J. Jin. Fast community detection by score. The Annals of Statistics, 43(1):57–89,

2015. 50, 54, 93, 95, 119

[47] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive

variance reduction. In Advances in Neural Information Processing Systems (NIPS)

26, pages 315–323. 2013. 3, 60, 62

[48] B. Karrer and M. E. Newman. Stochastic blockmodels and community structure in

networks. Physical Review E, 83(1):016107, 2011. 93, 96

[49] H. Kasai, H. Sato, and B. Mishra. Riemannian stochastic variance reduced gradient

on Grassmann manifold. arXiv preprint arXiv:1605.07367, 2016. 6

[50] J. Kim and H. Park. Sparse nonnegative matrix factorization for clustering. Technical

report, Georgia Institute of Technology, 2008. 93

[51] J. Kohler and A. Lucchi. Sub-sampled cubic regularization for non-convex optimiza-

tion. arXiv preprint arXiv:1705.05933, 2017. 58, 59, 60, 62, 63, 89, 91

[52] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM

Review, 51(3):455–500, 2009. 6, 26, 47

[53] A. Kovnatsky, K. Glashoff, and M. Bronstein. Madmm: a generic algorithm for

non-smooth optimization on manifolds. arXiv preprint arXiv:1505.07676, 2015. 6

[54] D. Kuang, C. Ding, and H. Park. Symmetric nonnegative matrix factorization for

graph clustering. In Proceedings of the 2012 SIAM international conference on data

mining, pages 106–117. SIAM, 2012. 93


[55] Z. Lai, Y. Xu, Q. Chen, J. Yang, and D. Zhang. Multilinear sparse principal com-

ponent analysis. IEEE Transactions on Neural Networks and Learning Systems,

25(10):1942–1950, 2014. 47

[56] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014. 122, 124

[57] G. Li and T. K. Pong. Global convergence of splitting methods for nonconvex com-

posite optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015. 6

[58] H. Liu, W. Wu, and A. M.-C. So. Quadratic optimization with orthogonality con-

straints: Explicit Lojasiewicz exponent and linear convergence of line-search meth-

ods. arXiv preprint arXiv:1510.01025, 2015. 6

[59] D. G. Luenberger. The gradient projection method along geodesics. Management

Science, 18(11):620–631, 1972. 5

[60] F. McSherry. Spectral partitioning of random graphs. In Foundations of Computer

Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 529–537. IEEE, 2001.

93

[61] D. Motreanu and N. H. Pavel. Quasi-tangent vectors in flow-invariance and op-

timization problems on Banach manifolds. Journal of Mathematical Analysis and

Applications, 88(1):116–132, 1982. 10

[62] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course.

Kluwer, Boston, 2004. 58

[63] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006. 3, 57, 58, 60, 63, 69, 89, 91

[64] M. E. Newman. Modularity and community structure in networks. Proceedings of

the national academy of sciences, 103(23):8577–8582, 2006. 94, 96


[65] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 1999. 9

[66] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing,

33(5):2295–2317, 2011. 6

[67] I. V. Oseledets and E. Tyrtyshnikov. TT-cross approximation for multidimensional

arrays. Linear Algebra and its Applications, 432(1):70–88, 2010. 6

[68] S. Oymak and B. Hassibi. Finding dense clusters via "low rank + sparse" decomposition. arXiv preprint arXiv:1104.5186, 2011. 94

[69] X. Pan, M. Lam, S. Tu, D. Papailiopoulos, C. Zhang, M. I. Jordan, K. Ramchandran,

C. Re, and B. Recht. Cyclades: Conflict-free asynchronous machine learning. arXiv

preprint arXiv:1605.09721, 2016. 105

[70] S. Paul and Y. Chen. Orthogonal symmetric non-negative matrix factorization under

the stochastic block model. arXiv preprint arXiv:1605.05349, 2016. 93

[71] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994. 89, 91

[72] I. Psorakis, S. Roberts, M. Ebden, and B. Sheldon. Overlapping community detection

using bayesian non-negative matrix factorization. Physical Review E, 83(6):066114,

2011. 93

[73] T. Qin and K. Rohe. Regularized spectral clustering under the degree-corrected

stochastic blockmodel. In Advances in Neural Information Processing Systems, pages

3120–3128, 2013. 93

[74] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to paral-

lelizing stochastic gradient descent. In Advances in Neural Information Processing

Systems, pages 693–701, 2011. 105


[75] S. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction

for nonconvex optimization. In Proceedings of the 33rd International Conference on

International Conference on Machine Learning (ICML), pages 314–323, 2016. 3, 60

[76] S. Reddi, S. Sra, B. Poczos, and A. Smola. Proximal stochastic methods for nons-

mooth nonconvex finite-sum optimization. In Advances in Neural Information Pro-

cessing Systems, pages 1145–1153, 2016. 2

[77] S. Reddi, M. Zaheer, S. Sra, B. Poczos, F. Bach, R. Salakhutdinov, and A. Smola.

A generic approach for escaping saddle points. In Proceedings of the Twenty-First

International Conference on Artificial Intelligence and Statistics, pages 1233–1242,

2018. 58, 89, 91

[78] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional

stochastic blockmodel. The Annals of Statistics, pages 1878–1915, 2011. 93

[79] F. Roosta-Khorasani and M. W. Mahoney. Sub-sampled Newton methods I: Globally

convergent algorithms. arXiv preprint arXiv:1601.04737, 2016. 60

[80] S. T. Smith. Optimization techniques on Riemannian manifolds. Fields Institute

Communications, 3(3):113–135, 1994. 5

[81] D. L. Sussman, M. Tang, D. E. Fishkind, and C. E. Priebe. A consistent adjacency spectral embedding for stochastic blockmodel graphs. Journal of the American Statistical Association, 107(499):1119–1128, 2012. 93

[82] A. L. Traud, E. D. Kelsic, P. J. Mucha, and M. A. Porter. Comparing community structure to characteristics in online collegiate social networks. arXiv e-prints, September 2008. 123, 124

[83] N. Tripuraneni, M. Stern, C. Jin, J. Regier, and M. I. Jordan. Stochastic cubic regularization for fast nonconvex optimization. arXiv preprint arXiv:1711.02838, 2017. 58, 62, 89, 91

[84] J. A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015. 59

[85] F. Wang, W. Cao, and Z. Xu. Convergence of multi-block Bregman ADMM for nonconvex composite problems. arXiv preprint arXiv:1505.03063, 2015. 6

[86] F. Wang, T. Li, X. Wang, S. Zhu, and C. Ding. Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery, 22(3):493–521, 2011. 93

[87] M. Wang, C. Wang, J. X. Yu, and J. Zhang. Community detection in social networks: an in-depth benchmarking study with a procedure-oriented framework. Proceedings of the VLDB Endowment, 8(10):998–1009, 2015. 93

[88] S. Wang, M. Sun, Y. Chen, E. Pang, and C. Zhou. STPCA: sparse tensor principal component analysis for feature extraction. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pages 2278–2281. IEEE, 2012. 47

[89] Y. Wang, W. Yin, and J. Zeng. Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324, 2015. 6

[90] Z. Wang, Y. Zhou, Y. Liang, and G. Lan. Sample complexity of stochastic variance-reduced cubic regularization for nonconvex optimization. arXiv preprint arXiv:1802.07372, 2018. 58, 59, 60, 62, 63, 77

[91] A. Wiegele. Biq Mac Library: a collection of Max-Cut and quadratic 0-1 programming instances of medium size. Preprint, 2007. 51

[92] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014. 3, 60

[93] P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2018. 58, 59

[94] P. Xu, J. Yang, F. Roosta-Khorasani, C. Re, and M. W. Mahoney. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems 29, pages 3000–3008, 2016. 60

[95] Y. Xu. Alternating proximal gradient method for sparse nonnegative Tucker decomposition. Mathematical Programming Computation, 7(1):39–70, 2015. 47

[96] L. Yang, T. K. Pong, and X. Chen. Alternating direction method of multipliers for a class of nonconvex and nonsmooth problems with applications to background/foreground extraction. SIAM Journal on Imaging Sciences, 10(1):74–110, 2017. 6

[97] W. H. Yang, L.-H. Zhang, and R. Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific Journal of Optimization, 10(2):415–434, 2014. 9, 10, 11, 13

[98] Z. Yang, T. Hao, O. Dikmen, X. Chen, and E. Oja. Clustering by nonnegative matrix factorization using graph random walk. In Advances in Neural Information Processing Systems, pages 1079–1087, 2012. 93

[99] H. Ye, L. Luo, and Z. Zhang. Approximate Newton methods and their local convergence. In Proceedings of the 34th International Conference on Machine Learning, pages 3931–3939, 2017. 60

[100] Y. Ye. A .699-approximation algorithm for max-bisection. Mathematical Programming, 90(1):101–111, 2001. 45, 46, 52

[101] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592–4600, 2016. 6

[102] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. arXiv preprint arXiv:1602.06053, 2016. 6

[103] J. Zhang, H. Liu, Z. Wen, and S. Zhang. A sparse completely positive relaxation of the modularity maximization for community detection. arXiv preprint arXiv:1708.01072, 2017. 46, 50, 54

[104] L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980–988, 2013. 3, 60

[105] T. Zhang and G. H. Golub. Rank-one approximation to high order tensors. SIAM Journal on Matrix Analysis and Applications, 23(2):534–550, 2001. 26

[106] Y. Zhang, E. Levina, and J. Zhu. Detecting overlapping communities in networks using spectral methods. arXiv preprint arXiv:1412.3432, 2014. 50, 54, 93, 95, 119

[107] D. Zhou, P. Xu, and Q. Gu. Stochastic variance-reduced cubic regularized Newton methods. In Proceedings of the 35th International Conference on Machine Learning, pages 5990–5999, 2018. 58, 59, 60, 61, 62, 63, 81, 86

[108] H. Zhu, X. Zhang, D. Chu, and L. Liao. Nonconvex and nonsmooth optimization with generalized orthogonality constraints: An approximate augmented Lagrangian method. Journal of Scientific Computing, pages 1–42, 2017. 6
