


On the Linear Convergence of Distributed

Optimization over Directed Graphs

Chenguang Xi, and Usman A. Khan†

Abstract

This paper develops a fast distributed algorithm, termed DEXTRA, to solve the optimization prob-

lem when N agents reach agreement and collaboratively minimize the sum of their local objective

functions over the network, where the communications between agents are described by a directed

graph, which is uniformly strongly-connected. Existing algorithms, including Gradient-Push (GP) and

Directed-Distributed Gradient Descent (D-DGD), solve the same problem restricted to directed graphs

with a convergence rate of O(ln k/√k), where k is the number of iterations. Our analysis shows that

DEXTRA converges at a linear rate O(ε^k) for some constant ε < 1, with the assumption that the

objective functions are strongly convex. The implementation of DEXTRA requires each agent to know only its local out-degree. Simulation examples illustrate our findings.

Index Terms

Distributed optimization; multi-agent networks; directed graphs.

I. INTRODUCTION

Distributed computation and optimization have gained great interest due to their wide applications in, e.g., large-scale machine learning, [1, 2], model predictive control, [3], cognitive networks, [4, 5], source localization, [6, 7], resource scheduling, [8], and message routing, [9]. All of the above applications can be reduced to variations of distributed optimization problems solved by a network of nodes when knowledge of the objective function is scattered over the network. In particular, we consider the problem of minimizing a sum of objectives, $\sum_{i=1}^{n} f_i(x)$, where fi : Rp → R is a private objective function at the ith agent of the network.

†C. Xi and U. A. Khan are with the Department of Electrical and Computer Engineering, Tufts University, 161 College Ave,

Medford, MA 02155; [email protected], [email protected]. This work has been partially supported by

an NSF Career Award # CCF-1350264.

October 23, 2021 DRAFT

arXiv:1510.02149v1 [math.OC] 7 Oct 2015


There have been many types of algorithms to solve the problem in a distributed manner.

The most common alternatives are Distributed Gradient Descent (DGD), [10, 11], Distributed

Dual Averaging (DDA), [12], and the distributed implementations of the Alternating Direction

Method of Multipliers (ADMM), [13–15]. DGD and DDA can be reduced to gradient-based

methods, where at each iteration a gradient-related step is calculated, followed by averaging

with neighbors in the network. The main advantage of these methods is computational simplicity.

However, they suffer from a slow convergence rate due to the slowly diminishing stepsize required to achieve exact convergence. The convergence rate of DGD and DDA, with a diminishing stepsize, is shown to be O(ln k/√k), [10]. With a constant stepsize, the algorithms accelerate to O(1/k), at the cost of converging inexactly to a neighborhood of the optimal

solutions, [11]. To overcome the difficulties, some alternatives include Nesterov-based methods,

e.g., the Distributed Nesterov Gradient (DNG), [16], whose convergence rate is O(log k/k) with a diminishing stepsize, and Distributed Nesterov gradient with Consensus iterations (DNC), [16], with a faster convergence rate of O(1/k²) at the cost of k inner consensus steps at the kth iteration. Some methods use second-order information, such as the Network Newton (NN) method, [17], whose convergence rate is at least linear. Both DNC and NN need multiple communication steps per iteration.

The distributed implementations of ADMM are based on augmented Lagrangians, where at each iteration the primal and dual variables are updated to minimize a Lagrangian-related function. These methods converge exactly to the optimal point at the faster rate of O(1/k) with a constant stepsize, and further achieve a linear convergence rate when the objective functions are strongly convex. The disadvantage, however, is the high computation burden: each agent needs to solve an optimization problem at each iteration. To resolve this issue, the Decentralized

Linearized ADMM (DLM), [18], and EXTRA, [19], are proposed, which can be considered

as first-order approximations of decentralized ADMM. DLM and EXTRA have been proven to converge at a linear rate when the local objective functions are strongly convex. All of these distributed algorithms, [10–19], assume the multi-agent network to be an undirected graph. In contrast, research on solving the problem over directed graphs is limited.

We now review the work on directed graphs. There are two types of alternatives, both gradient-based, that solve the problem over directed graphs. The first, the Gradient-Push (GP) method, [20, 21], combines gradient


descent and push-sum consensus. The push-sum algorithm, [22, 23], is first proposed in consensus

problems1 to achieve average-consensus given a column-stochastic matrix. The idea is based

on computing the stationary distribution of the Markov chain characterized by the multi-agent

network and canceling the imbalance by dividing with the left-eigenvector. GP follows a similar

spirit of push-sum consensus and can be regarded as a perturbed push-sum protocol converging

to the optimal solution. Directed-Distributed Gradient Descent (D-DGD), [28, 29], is the other

gradient-based alternative to solve the problem. The main idea follows Cai’s paper, [30], where a

new non-doubly-stochastic matrix is constructed to reach average-consensus. In each iteration, D-

DGD simultaneously constructs a row-stochastic matrix and a column-stochastic matrix instead

of only a doubly-stochastic matrix. The convergence of the new weight matrix, depending on

the row-stochastic and column-stochastic matrices, ensures that agents reach both consensus and optimality. Since both GP and D-DGD are gradient-based algorithms, they suffer from a slow convergence rate of O(ln k/√k) due to the diminishing stepsize. We summarize the existing first-order distributed algorithms for minimizing the sum of agents' local objective functions, and compare their convergence speeds, in Table I, covering both undirected and directed graphs. In Table I, 'I' indicates that DGD with a constant stepsize is an Inexact method; 'M' indicates that DNC needs to perform Multiple (k) consensus steps at the kth iteration, which is impractical as k goes to infinity; and 'C' indicates that DADMM has a much higher Computation burden compared to the other first-order methods.

In this paper, we propose a fast distributed algorithm, termed DEXTRA, to solve the problem over directed graphs by combining the push-sum protocol and EXTRA. The push-sum protocol has proven to be a good technique for optimization over digraphs, [20, 21], while EXTRA works well for optimization over undirected graphs with a fast convergence rate and a low computation complexity. By applying the push-sum technique to EXTRA, we show that DEXTRA converges exactly to the optimal solution at a linear rate, O(ε^k), even when the underlying network is directed. The fast convergence rate is guaranteed because DEXTRA uses a constant stepsize, compared with the diminishing stepsize used in GP or D-DGD. Currently, our formulation is restricted to strongly convex functions. We emphasize that, though DEXTRA is a combination of push-sum and EXTRA, the proof is not trivial by simply following

¹See [24–27] for additional information on average-consensus problems.


Algorithms   | General Convex | Strongly Convex
DGD (α_k)    | O(ln k/√k)     | O(ln k/k)
DDA (α_k)    | O(ln k/√k)     | O(ln k/k)
DGD (α)      | O(1/k) (I)     |
DNG (α_k)    | O(ln k/k)      |
DNC (α)      | O(1/k²) (M)    |
DADMM (α)    |                | O(ε^k) (C)
DLM (α)      |                | O(ε^k)
EXTRA (α)    | O(1/k)         | O(ε^k)
GP (α_k)     | O(ln k/√k)     | O(ln k/k)
D-DGD (α_k)  | O(ln k/√k)     | O(ln k/k)
DEXTRA (α)   |                | O(ε^k)

TABLE I: Convergence rates of first-order distributed optimization algorithms over undirected and directed graphs.

the proof of either algorithm. Since communication is more restricted than in the undirected case, many properties used in the proof of EXTRA no longer hold for DEXTRA; e.g., the proof of EXTRA requires the weighting matrix to be symmetric, which never happens for directed graphs.

The remainder of the paper is organized as follows. Section II describes, develops and interprets

the DEXTRA algorithm. Section III presents the appropriate assumptions and states the main

convergence results. In Section IV, we prove some key lemmas, which are used in the subsequent proof of the convergence rate of DEXTRA. The main proof of the convergence rate

of DEXTRA is provided in Section V. We show numerical results in Section VI and Section VII

contains the concluding remarks.

Notation: We use lowercase bold letters to denote vectors and uppercase italic letters to denote

matrices. We denote by [x]i the ith component of a vector x. For matrix A, we denote by [A]i

the ith row of the matrix, and by [A]ij the (i, j)th element of a matrix, A. The matrix, In,


represents the n × n identity, and 1n (0n) are the n-dimensional vector of all 1’s (0’s). The

inner product of two vectors x and y is 〈x, y〉. We define the A-matrix norm, ‖x‖²_A, of any vector x as

$$\|x\|_A^2 \triangleq \langle x, Ax\rangle = \langle x, A^{\top}x\rangle = \left\langle x, \frac{A+A^{\top}}{2}\,x\right\rangle,$$

for A + A^⊤ positive semi-definite.

If a symmetric matrix A is positive semi-definite, we write A ⪰ 0, while A ≻ 0 means A is positive definite. The largest and smallest eigenvalues of a matrix A are denoted by λ_max(A) and λ_min(A), and the smallest nonzero eigenvalue of a matrix A is denoted by λ̃_min(A). For any objective function f(x), we denote by ∇f(x) the gradient of f at x.

II. DEXTRA DEVELOPMENT

In this section, we formulate the optimization problem and describe the DEXTRA algorithm.

We give an informal but intuitive argument for DEXTRA, showing that the algorithm pushes agents to achieve consensus and reach the optimal solution. The EXTRA algorithm, [19], is briefly recapitulated in this section. We derive DEXTRA in a form similar to EXTRA, such that our algorithm can be viewed as an extension of EXTRA suited to the case of directed graphs.

This reveals the meaning of DEXTRA: Directed EXTRA. Formal convergence results and proofs

are left to Sections III, IV, and V.

Consider a strongly-connected network of n agents communicating over a directed graph, G =

(V, E), where V is the set of agents and E is the collection of ordered pairs, (i, j), i, j ∈ V, such that agent j can send information to agent i. Define N_i^in to be the collection of in-neighbors, i.e., the set of agents that can send information to agent i. Similarly, N_i^out is the out-neighborhood of agent i, i.e., the set of agents that can receive information from agent i. We allow both N_i^in and N_i^out to include the node i itself. Note that in a directed graph, (i, j) ∈ E does not imply (j, i) ∈ E. Consequently, N_i^in ≠ N_i^out, in general. We focus on solving a

convex optimization problem that is distributed over the above multi-agent network. In particular,

the network of agents cooperatively solves the following optimization problem:

$$\text{P1}: \qquad \min\ f(x) = \sum_{i=1}^{n} f_i(x),$$

where each f_i : R^p → R is convex, not necessarily differentiable, representing the local objective function at agent i.


A. EXTRA for undirected graphs

EXTRA is a fast exact first-order algorithm to solve Problem P1 in the case where the communication network is described by an undirected graph. At the kth iteration, each agent i performs the following update:

$$x_i^{k+1} = x_i^k + \sum_{j\in N_i} w_{ij}\,x_j^k - \sum_{j\in N_i} \tilde{w}_{ij}\,x_j^{k-1} - \alpha\left[\nabla f_i(x_i^k) - \nabla f_i(x_i^{k-1})\right], \qquad (1)$$

where the weights, w_{ij}, form a weighting matrix, W = {w_{ij}}, that is symmetric and doubly stochastic, and the collection W̃ = {w̃_{ij}} satisfies W̃ = θI_n + (1 − θ)W with some θ ∈ (0, 1/2]. The update in Eq. (1) ensures that x_i^k converges to the optimal solution of the problem for all i with convergence rate O(1/k), and converges linearly when the objective function

is further strongly convex. To better represent EXTRA and later compare with the DEXTRA

algorithm, we write Eq. (1) in matrix form. Let x^k, ∇f(x^k) ∈ R^{np} be the collections of all agents' states and gradients at time k, i.e., x^k ≜ [x_1^k; · · · ; x_n^k], ∇f(x^k) ≜ [∇f_1(x_1^k); · · · ; ∇f_n(x_n^k)], and let W, W̃ ∈ R^{n×n} be the weighting matrices collecting the weights w_{ij}, w̃_{ij}, respectively. Then Eq. (1) can be written in matrix form as:

$$x^{k+1} = \left[(I_n + W)\otimes I_p\right]x^k - (\widetilde{W}\otimes I_p)\,x^{k-1} - \alpha\left[\nabla f(x^k) - \nabla f(x^{k-1})\right], \qquad (2)$$

where the symbol ⊗ is the Kronecker product. We now state DEXTRA and derive it in a similar

form as EXTRA.
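To make the update concrete, here is a minimal numerical sketch of EXTRA, Eq. (2), with p = 1; the 3-agent graph, the quadratic objectives f_i(x) = (x − b_i)²/2, and the stepsize are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Minimal sketch of EXTRA, Eq. (2), with p = 1 on a 3-agent undirected
# graph. Local objectives f_i(x) = (x - b_i)^2 / 2, so the minimizer of
# sum_i f_i is the average of the b_i (here 3.0).
n = 3
b = np.array([1.0, 2.0, 6.0])
grad = lambda x: x - b                      # stacked gradient, one entry per agent

W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])             # symmetric, doubly stochastic
W_tilde = 0.5 * (np.eye(n) + W)             # W~ = theta*I + (1 - theta)*W, theta = 1/2
alpha = 0.3                                 # constant stepsize

x_prev = np.zeros(n)
x_curr = W @ x_prev - alpha * grad(x_prev)  # special first step (k = 0)
for _ in range(500):
    x_next = (x_curr + W @ x_curr - W_tilde @ x_prev
              - alpha * (grad(x_curr) - grad(x_prev)))
    x_prev, x_curr = x_curr, x_next

print(x_curr)   # every agent approaches the global minimizer mean(b) = 3.0
```

With W symmetric doubly stochastic and W̃ = (I + W)/2, the conditions stated above are satisfied, and the iterates converge linearly since each quadratic f_i is strongly convex.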

B. DEXTRA Algorithm

To solve Problem P1 over directed graphs, we now describe the DEXTRA algorithm, implemented in a distributed manner as follows. Each agent, j ∈ V, maintains two vector variables, x_j^k, z_j^k ∈ R^p, as well as a scalar variable, y_j^k ∈ R, where k is the discrete-time index. At the kth iteration, agent j weights its states, a_{ij}x_j^k, a_{ij}y_j^k, and ã_{ij}x_j^{k−1}, and sends these to each of its out-neighbors, i ∈ N_j^out, where the weights, a_{ij} and ã_{ij}, are such that:

$$a_{ij} = \begin{cases} > 0, & i \in N_j^{out}, \\ 0, & \text{otherwise}, \end{cases} \qquad \sum_{i=1}^{n} a_{ij} = 1,\ \forall j; \qquad \tilde{a}_{ij} = \begin{cases} \theta + (1-\theta)a_{ij}, & i = j, \\ (1-\theta)a_{ij}, & i \neq j, \end{cases} \quad \forall j,$$


where θ ∈ (0, 1/2]. Having received the information from its in-neighbors, j ∈ N_i^in, agent i computes the state, z_i^k, by dividing x_i^k by y_i^k, and updates the variables x_i^{k+1} and y_i^{k+1} as follows:

$$z_i^k = \frac{x_i^k}{y_i^k}, \qquad (3a)$$

$$x_i^{k+1} = x_i^k + \sum_{j\in N_i^{in}} a_{ij}\,x_j^k - \sum_{j\in N_i^{in}} \tilde{a}_{ij}\,x_j^{k-1} - \alpha\left[\nabla f_i(z_i^k) - \nabla f_i(z_i^{k-1})\right], \qquad (3b)$$

$$y_i^{k+1} = \sum_{j\in N_i^{in}} a_{ij}\,y_j^k, \qquad (3c)$$

where ∇f_i(z_i^k) and ∇f_i(z_i^{k−1}) are the gradients of f_i at z_i^k and z_i^{k−1}, respectively. The method is initiated with an arbitrary vector, x_i^0, and with y_i^0 = 1 for every agent i. The stepsize, α, is a positive number within a certain interval; we explicitly discuss the range of α in Section III. At the first iteration, i.e., k = 0, we use the following update instead of Eq. (3):

$$z_i^0 = \frac{x_i^0}{y_i^0}, \qquad x_i^1 = \sum_{j\in N_i^{in}} a_{ij}\,x_j^0 - \alpha\nabla f_i(z_i^0), \qquad y_i^1 = \sum_{j\in N_i^{in}} a_{ij}\,y_j^0. \qquad (4)$$

We note that the implementation of Eq. (3) requires each agent to know its out-neighbors (so that it can design the weights). In a more restricted setting, e.g., a broadcast application where it may not be possible to know the out-neighbors, we may use a_{ij} = |N_j^out|^{−1}; the implementation then only requires each agent to know its own out-degree; see, e.g., [20, 28] for similar assumptions.
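As a concrete illustration of this weight design, the sketch below builds a column-stochastic A from out-degrees only and forms Ã from it; the 3-node digraph is a hypothetical example, not one from the paper.

```python
import numpy as np

# Sketch of the broadcast weight rule a_ij = 1/|N_j^out| on a strongly
# connected 3-node digraph with self-loops. Column j of A distributes
# agent j's mass equally over its out-neighbors, so A is column
# stochastic while each agent only knows its own out-degree.
n, theta = 3, 0.5
out_neighbors = {0: [0, 1, 2], 1: [1, 2], 2: [0, 2]}   # includes self-loops

A = np.zeros((n, n))
for j, outs in out_neighbors.items():
    for i in outs:
        A[i, j] = 1.0 / len(outs)              # a_ij = 1/|N_j^out|

A_tilde = theta * np.eye(n) + (1 - theta) * A  # entrywise rule for a~_ij

print(A.sum(axis=0))        # [1. 1. 1.]: A is column stochastic
print(A_tilde.sum(axis=0))  # [1. 1. 1.]: A~ is column stochastic too
```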

To simplify the proof, we write DEXTRA, Eq. (3), in matrix form. Let A = {a_{ij}} ∈ R^{n×n} and Ã = {ã_{ij}} ∈ R^{n×n} collect the weights a_{ij} and ã_{ij}, respectively; let x^k, z^k, ∇f(x^k) ∈ R^{np} collect all agents' states and gradients at time k, i.e., x^k ≜ [x_1^k; · · · ; x_n^k], z^k ≜ [z_1^k; · · · ; z_n^k], ∇f(x^k) ≜ [∇f_1(x_1^k); · · · ; ∇f_n(x_n^k)]; and let y^k ∈ R^n collect the agent states, y_i^k, i.e., y^k ≜ [y_1^k; · · · ; y_n^k]. Note that at time k, y^k can be represented by the initial value, y^0:

$$y^k = Ay^{k-1} = A^k y^0 = A^k\cdot\mathbf{1}_n. \qquad (5)$$


Define the diagonal matrix, D^k ∈ R^{n×n}, for each k, such that the ith diagonal element of D^k is y_i^k, i.e.,

$$D^k = \operatorname{diag}\left(A^k\cdot\mathbf{1}_n\right). \qquad (6)$$

Then Eq. (3) is equivalent to the following matrix form:

$$z^k = \left([D^k]^{-1}\otimes I_p\right)x^k, \qquad (7a)$$

$$x^{k+1} = x^k + (A\otimes I_p)\,x^k - (\widetilde{A}\otimes I_p)\,x^{k-1} - \alpha\left[\nabla f(z^k) - \nabla f(z^{k-1})\right], \qquad (7b)$$

$$y^{k+1} = Ay^k, \qquad (7c)$$

where both weighting matrices, A and Ã, are column stochastic and satisfy the relationship Ã = θI_n + (1 − θ)A with some θ ∈ (0, 1/2]; when k = 0, Eq. (7b) becomes x^1 = (A ⊗ I_p)x^0 − α∇f(z^0). From Eq. (7a), we obtain

$$x^k = \left(D^k\otimes I_p\right)z^k, \quad \forall k. \qquad (8)$$

Therefore, it follows that DEXTRA can be represented as one single equation:

$$\left(D^{k+1}\otimes I_p\right)z^{k+1} = \left[(I_n + A)D^k\otimes I_p\right]z^k - \left(\widetilde{A}D^{k-1}\otimes I_p\right)z^{k-1} - \alpha\left[\nabla f(z^k) - \nabla f(z^{k-1})\right]. \qquad (9)$$

We refer to our algorithm as DEXTRA, since Eq. (9) has a form similar to the EXTRA algorithm, Eq. (2), and can be used to solve Problem P1 in the case of directed graphs. We state our main result in Section III, showing that, as time goes to infinity, the iteration in Eq. (9) pushes z^k to achieve consensus and reach the optimal solution. Our proof in this paper is based on the form, Eq. (9), of DEXTRA.
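For illustration, the following sketch runs the per-agent updates, Eq. (3), with p = 1 on a small strongly connected digraph; the graph, the quadratic objectives f_i(z) = (z − b_i)²/2, and the stepsize α = 0.3 are illustrative assumptions rather than values prescribed by the theory of Section III.

```python
import numpy as np

# Sketch of DEXTRA, Eqs. (3)/(7), with p = 1 on a strongly connected
# 3-node digraph. Local objectives f_i(z) = (z - b_i)^2 / 2, so the
# global minimizer is mean(b) = 3. The column-stochastic A uses
# a_ij = 1/|N_j^out|; the stepsize is an illustrative choice.
n, theta, alpha = 3, 0.5, 0.3
b = np.array([1.0, 2.0, 6.0])
grad = lambda z: z - b

A = np.array([[1/3, 0.0, 1/2],
              [1/3, 1/2, 0.0],
              [1/3, 1/2, 1/2]])
A_tilde = theta * np.eye(n) + (1 - theta) * A

x_prev, y_prev = np.zeros(n), np.ones(n)     # x^0 arbitrary, y^0 = 1_n
z_prev = x_prev / y_prev
x_curr = A @ x_prev - alpha * grad(z_prev)   # special first iteration, Eq. (4)
y_curr = A @ y_prev
for _ in range(1000):
    z_curr = x_curr / y_curr                                  # Eq. (3a)
    x_next = (x_curr + A @ x_curr - A_tilde @ x_prev
              - alpha * (grad(z_curr) - grad(z_prev)))        # Eq. (3b)
    x_prev, y_prev, z_prev = x_curr, y_curr, z_curr
    y_curr = A @ y_curr                                       # Eq. (3c)
    x_curr = x_next

print(y_curr)           # y^k -> n*pi = (1, 2/3, 4/3) for this A (push-sum)
print(x_curr / y_curr)  # z^k: dividing by y^k cancels the directed imbalance
```

The sequence y^k evolves independently of x^k and converges to n·π, so the ratio z^k = x^k/y^k implements the imbalance cancellation described in Section II-C.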

C. Interpretation of DEXTRA

In this section, we give an intuitive interpretation of DEXTRA's convergence to the optimal point before the formal proof, which appears in Sections IV and V. Note that the analysis given here is not a formal proof of DEXTRA; rather, it gives a snapshot of how DEXTRA works.

Since A is a column-stochastic matrix, the sequence of iterates, {y^k}, generated by Eq. (7c) converges to the span of A's right eigenvector corresponding to eigenvalue 1, i.e., y^∞ = u·π for some number u, where π is that right eigenvector. We assume that the sequences of iterates, {z^k} and {x^k}, generated by DEXTRA, Eq. (7) or (9), converge


to their respective limit points, z^∞ and x^∞ (an assumption that need not hold in general). According to the

updating rule in Eq. (7b), the limit point x∞ satisfies

$$x^{\infty} = x^{\infty} + (A\otimes I_p)\,x^{\infty} - (\widetilde{A}\otimes I_p)\,x^{\infty} - \alpha\left[\nabla f(z^{\infty}) - \nabla f(z^{\infty})\right], \qquad (10)$$

which implies that [(A − Ã) ⊗ I_p]x^∞ = 0_{np}, or x^∞ = π ⊗ v for some vector v ∈ R^p. Therefore,

it follows from Eq. (7a) that

$$z^{\infty} = \left([D^{\infty}]^{-1}\otimes I_p\right)(\pi\otimes v) \in \{\mathbf{1}_n\otimes v\}, \qquad (11)$$

where the consensus is reached. The above analysis reveals the idea of DEXTRA for overcoming the imbalance of agent states that occurs when the graph is directed: both x^∞ and y^∞ lie in the span of π, and dividing x^∞ by y^∞ cancels the imbalance.

By summing the updates in Eq. (7b) over k, we obtain

$$x^{k+1} = (A\otimes I_p)\,x^k - \alpha\nabla f(z^k) - \sum_{r=0}^{k-1}\left[\left(\widetilde{A} - A\right)\otimes I_p\right]x^r.$$

Considering x^∞ = π ⊗ v together with the preceding relation, it follows that the limit point z^∞ satisfies

$$\alpha\nabla f(z^{\infty}) = -\sum_{r=0}^{\infty}\left[\left(\widetilde{A} - A\right)\otimes I_p\right]x^r. \qquad (12)$$

Therefore, we obtain

$$\alpha\,(\mathbf{1}_n\otimes I_p)^{\top}\nabla f(z^{\infty}) = -\left[\mathbf{1}_n^{\top}\left(\widetilde{A} - A\right)\otimes I_p\right]\sum_{r=0}^{\infty}x^r = \mathbf{0}_p,$$

which is the optimality condition of Problem P1. Therefore, under the assumption that the sequences of DEXTRA iterates, {z^k} and {x^k}, have limit points, z^∞ and x^∞, the limit point z^∞ achieves consensus and reaches the optimal solution of Problem P1. In the next section, we state the main result of this paper; the formal proof is left to Sections IV and V.

III. ASSUMPTIONS AND MAIN RESULTS

With appropriate assumptions, our main result states that DEXTRA converges to the optimal

solution of Problem P1 linearly. In this paper, we assume that the agent graph, G, is strongly-

connected; each local function, fi : Rp → R, is convex and differentiable, ∀i ∈ V , and the

optimal solution of Problem P1 and the corresponding optimal value exist. Formally, we denote

the optimal solution by φ ∈ R^p and the optimal value by f*, i.e.,

$$f(\phi) = f^* = \min_{x\in\mathbb{R}^p} f(x). \qquad (13)$$


Besides the above assumptions, we emphasize some other assumptions regarding the objective

functions and weighting matrices, which are formally presented as follows.

Assumption A1 (Functions and Gradients). For all agent i, each function, fi, is differentiable

and satisfies the following assumptions.

(a) The function, f_i, has Lipschitz gradient with constant L_{f_i}, i.e., ‖∇f_i(x) − ∇f_i(y)‖ ≤ L_{f_i}‖x − y‖, ∀x, y ∈ R^p.

(b) The gradient, ∇f_i(x), is bounded, i.e., ‖∇f_i(x)‖ ≤ B_{f_i}, ∀x ∈ R^p.

(c) The function, f_i, is strongly convex with constant S_{f_i}, i.e., S_{f_i}‖x − y‖² ≤ 〈∇f_i(x) − ∇f_i(y), x − y〉, ∀x, y ∈ R^p.

Following Assumption A1, we have for any x,y ∈ Rnp,

‖∇f(x)−∇f(y)‖ ≤ Lf ‖x− y‖ , (14a)

‖∇f(x)‖ ≤ Bf , (14b)

Sf ‖x− y‖2 ≤ 〈∇f(x)−∇f(y),x− y〉 , (14c)

where the constants Bf = max_i{B_{f_i}}, Lf = max_i{L_{f_i}}, and Sf = min_i{S_{f_i}}. By combining

Eqs. (14b) and (14c), we obtain

$$\|x - y\| \leq \frac{1}{S_f}\|\nabla f(x) - \nabla f(y)\| \leq \frac{2B_f}{S_f},$$

which gives

$$\|x\| \leq \frac{2B_f}{S_f}, \qquad (15)$$

by letting y = 0_{np}. Eq. (15) reveals that the sequences updated by DEXTRA are bounded.

Recalling the definition of D^k in Eq. (6), we denote the limit of D^k by D^∞, i.e.,

$$D^{\infty} = \lim_{k\to\infty} D^k = \operatorname{diag}\left(A^{\infty}\cdot\mathbf{1}_n\right) = n\cdot\operatorname{diag}\left(\pi\right), \qquad (16)$$

where π is the normalized right eigenvector of A corresponding to eigenvalue 1. The next assumption concerns the weighting matrices, A and Ã, and D^∞.

Assumption A2 (Weighting matrices). The weighting matrices, A and Ã, used in DEXTRA, Eq. (7) or (9), satisfy the following.

(a) A is a column stochastic matrix.

(b) Ã is a column stochastic matrix and satisfies Ã = θI_n + (1 − θ)A, for some θ ∈ (0, 1/2].


(c) (D^∞)^{−1}Ã + Ã^⊤(D^∞)^{−1} ≻ 0.

To ensure (D^∞)^{−1}Ã + Ã^⊤(D^∞)^{−1} ≻ 0, agent i can assign a large value to its own weight, ã_{ii}, in the implementation of DEXTRA (so that the weighting matrices A and Ã are closer to diagonally-dominant matrices). Then all eigenvalues of (D^∞)^{−1}Ã + Ã^⊤(D^∞)^{−1} can be positive.
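The following numerical sketch illustrates this remark by checking the sign of the smallest eigenvalue of (D^∞)^{−1}Ã + Ã^⊤(D^∞)^{−1} for a small and a large self-weight. The 3-node digraph is a hypothetical choice, and the constructed column-stochastic matrix plays the role of Ã (which shares the eigenvector π with A).

```python
import numpy as np

# Numerical check of Assumption A2(c): increasing the self-weights makes
# (D_inf)^{-1} A~ + A~^T (D_inf)^{-1} positive definite. The digraph below
# is an illustrative choice, not one from the paper.
def smallest_eig(self_weight):
    # Column-stochastic matrix: column j puts `self_weight` on the diagonal
    # and splits the rest equally over j's other out-neighbors.
    outs = {0: [1, 2], 1: [2], 2: [0]}          # strongly connected digraph
    n = 3
    M = self_weight * np.eye(n)
    for j, lst in outs.items():
        for i in lst:
            M[i, j] = (1.0 - self_weight) / len(lst)
    # D_inf = n * diag(pi), pi the right eigenvector for eigenvalue 1.
    vals, vecs = np.linalg.eig(M)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi = pi / pi.sum()
    Dinv = np.diag(1.0 / (n * pi))
    S = Dinv @ M + M.T @ Dinv                   # symmetric by construction
    return np.linalg.eigvalsh(S).min()

print(smallest_eig(0.1))   # negative here: the condition fails for this graph
print(smallest_eig(0.9))   # positive: large self-weights satisfy A2(c)
```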

Before we present the main result of this paper, we define some auxiliary sequences. Let z* ∈ R^{np} be the collection of the optimal solution, φ, of Problem P1, i.e.,

$$z^* = \mathbf{1}_n\otimes\phi, \qquad (17)$$

and let q* ∈ R^{np} be some vector satisfying

$$\left[\left(\widetilde{A} - A\right)\otimes I_p\right]q^* + \alpha\nabla f(z^*) = \mathbf{0}_{np}. \qquad (18)$$

Introduce the auxiliary sequence

$$q^k = \sum_{r=0}^{k} x^r, \qquad (19)$$

and, for each k,

$$t^k = \begin{bmatrix} \left(D^k\otimes I_p\right)z^k \\ q^k \end{bmatrix}, \qquad t^* = \begin{bmatrix} \left(D^{\infty}\otimes I_p\right)z^* \\ q^* \end{bmatrix}, \qquad G = \begin{bmatrix} \left[\widetilde{A}^{\top}(D^{\infty})^{-1}\right]\otimes I_p & \mathbf{0} \\ \mathbf{0} & \left[(D^{\infty})^{-1}\left(\widetilde{A} - A\right)\right]\otimes I_p \end{bmatrix}. \qquad (20)$$

We are now ready to state the main result, Theorem 1, which makes explicit the rate at which the objective function converges to its optimal value.

Theorem 1. Let Assumptions A1, A2 hold. Denote P = I_n − A, L = Ã − A, N = (D^∞)^{−1}(Ã − A), M = (D^∞)^{−1}Ã, d_max = max_k{‖D^k‖}, d⁻_max = max_k{‖(D^k)^{−1}‖}, and d⁻_∞ = ‖(D^∞)^{−1}‖. Define

$$\rho_1 = 4(1-2\theta)^2\lambda_{\max}\left(P^{\top}P\right) + 8\left(\bar{\alpha}L_f d^{-}_{\max}\right)^2, \qquad \rho_2 = 4\lambda_{\max}\left(\widetilde{A}^{\top}\widetilde{A}\right) + 8\left(\bar{\alpha}L_f d^{-}_{\max}\right)^2,$$

where ᾱ is some constant regarded as a ceiling of α. Then, with a proper stepsize α ∈ [α_min, α_max], there exist δ > 0, Γ < ∞, and γ ∈ (0, 1), such that the sequence {t^k} generated by DEXTRA satisfies

$$\left\|t^k - t^*\right\|_G^2 \geq (1+\delta)\left\|t^{k+1} - t^*\right\|_G^2 - \Gamma\gamma^k. \qquad (21)$$


The lower bound, α_min, of α is

$$\alpha_{\min} = \frac{\dfrac{\eta}{2} + \dfrac{\sigma\lambda_{\max}\left(NN^{\top}\right)}{2\lambda_{\min}\left(L^{\top}L\right)}\,\rho_1}{\dfrac{S_f}{2(d_{\max})^2} - \dfrac{\eta}{2}},$$

and the upper bound, α_max, of α is

$$\alpha_{\max} = \frac{\lambda_{\min}\left(\dfrac{M+M^{\top}}{2}\right) - \sigma\lambda_{\max}\left(\dfrac{MM^{\top}}{2}\right) - \dfrac{\sigma\lambda_{\max}\left(NN^{\top}\right)}{2\lambda_{\min}\left(L^{\top}L\right)}\,\rho_2}{\dfrac{L_f^2}{\eta}\left(d^{-}_{\max}\,d^{-}_{\infty}\right)^2},$$

where η and σ are arbitrary constants chosen to ensure α_min > 0 and α_max > 0, i.e., η < S_f/(d_{\max})², and

$$\sigma < \frac{\lambda_{\min}\left(M+M^{\top}\right)}{2\lambda_{\max}\left(\dfrac{MM^{\top}}{2}\right) + \dfrac{\lambda_{\max}\left(NN^{\top}\right)}{\lambda_{\min}\left(L^{\top}L\right)}\,\rho_2}.$$

We prove Theorem 1 in Section V, with the help of Lemmas 1, 3, and 4 in Section IV. Note that the result of Theorem 1, Eq. (21), does not by itself imply the convergence of DEXTRA to the optimal solution. To prove that t^k → t^*, we further need to show that the matrix G satisfies ‖a‖²_G ≥ 0, ∀a ∈ R^{2np}. With Assumption A2(c), we already have ‖a‖²_{Ã^⊤(D^∞)^{−1}} ≥ 0, ∀a ∈ R^n. It remains to show that ‖a‖²_{(D^∞)^{−1}(Ã−A)} ≥ 0, ∀a ∈ R^n, which will be presented in Section IV. Assuming this, since γ ∈ (0, 1), we can say that ‖t^k − t^*‖²_G converges to 0 at the R-linear rate O((1 + δ̃)^{−k}) for some δ̃ ∈ (0, δ). Consequently, ‖z^k − z^*‖²_{Ã^⊤(D^∞)^{−1}} converges to 0 at the R-linear rate O((1 + δ̃)^{−k}) for some δ̃ ∈ (0, δ). Again with Assumption A2(c), it follows that z^k → z^* at an R-linear rate.

In Theorem 1, we have given the specific values of α_min and α_max. For simplicity, we write α_min = α^t_min/α^d_min and α_max = α^t_max/α^d_max. To ensure α_min < α_max, the objective functions need a strong-convexity constant S_f satisfying

$$S_f \geq 2(d_{\max})^2\left(\frac{\alpha^t_{\min}\,\alpha^d_{\max}}{\alpha^t_{\max}} + \frac{\eta}{2}\right). \qquad (22)$$

It follows from Eq. (22) that, to achieve a linear convergence rate for the DEXTRA algorithm, the objective function should not only be strongly convex but also have the constant S_f above some threshold. It is beyond the scope of this paper to design the weighting matrix, A, which influences the eigenvalues appearing in α_min and α_max, such that the threshold on S_f can be close to zero. In practice, numerical experiments illustrate that S_f can be quite small, i.e., DEXTRA has all agents converge to the optimal solution at an R-linear rate for almost all strongly convex functions.

Note that computing the values of α_min and α_max requires knowledge of the network topology, which is not available in a distributed manner. How to avoid this global knowledge when designing the value of α remains an open question. In numerical experiments, however, the admissible range of α is much wider than the theoretical bounds.

IV. BASIC RELATIONS

We will present the complete proof of Theorem 1 in Section V, which will rely on three

lemmas stated in this section. From now on, we assume that the sequences updated by DEXTRA have only one dimension, i.e., z_i^k, x_i^k ∈ R, ∀i, k. For p-dimensional vectors, x_i^k, z_i^k ∈ R^p, the proof is the same for every dimension by applying the results to each coordinate; thus, assuming p = 1 is without loss of generality. The advantage is that, with p = 1, we can drop the Kronecker products, which makes the equations easier to follow. Let p = 1, and rewrite DEXTRA, Eq. (9), as

$$D^{k+1}z^{k+1} = (I_n + A)D^k z^k - \widetilde{A}D^{k-1}z^{k-1} - \alpha\left[\nabla f(z^k) - \nabla f(z^{k-1})\right]. \qquad (23)$$

We now introduce the three main lemmas of this section. Recall the definitions of q^k, q* in Eqs. (19) and (18), and of D^k, D^∞ in Eqs. (6) and (16). Lemma 1 establishes the relations among D^k z^k, q^k, D^∞ z^*, and q^*.

Lemma 1. Let Assumptions A1, A2 hold. In DEXTRA, the quadruple sequence {D^k z^k, q^k, D^∞ z^*, q^*} obeys

$$\left(I_n + A - 2\widetilde{A}\right)\left(D^{k+1}z^{k+1} - D^{\infty}z^*\right) + \widetilde{A}\left(D^{k+1}z^{k+1} - D^k z^k\right) = -\left(\widetilde{A} - A\right)\left(q^{k+1} - q^*\right) - \alpha\left[\nabla f(z^k) - \nabla f(z^*)\right], \qquad (24)$$

for any k = 0, 1, · · · .

Proof. Summing DEXTRA, Eq. (23), over time from 0 to k, we obtain

$$D^{k+1}z^{k+1} = AD^k z^k - \alpha\nabla f(z^k) - \sum_{r=0}^{k-1}\left(\widetilde{A} - A\right)D^r z^r.$$

Writing the sum in terms of $q^{k+1} = \sum_{r=0}^{k+1} D^r z^r$, and rearranging the terms, it follows that

$$\left(I_n + A - 2\widetilde{A}\right)D^{k+1}z^{k+1} + \widetilde{A}\left(D^{k+1}z^{k+1} - D^k z^k\right) = -\left(\widetilde{A} - A\right)q^{k+1} - \alpha\nabla f(z^k). \qquad (25)$$

Since $(I_n + A - 2\widetilde{A})\,\pi = \mathbf{0}_n$, where π is the right eigenvector of A corresponding to eigenvalue 1, we have

$$\left(I_n + A - 2\widetilde{A}\right)D^{\infty}z^* = \mathbf{0}_n. \qquad (26)$$

By combining Eqs. (25) and (26), and noting that $(\widetilde{A} - A)\,q^* + \alpha\nabla f(z^*) = \mathbf{0}_n$, Eq. (18), we obtain the desired result.

Before presenting Lemma 3, we recall an existing result on the special structure of column-stochastic matrices, which will be useful in our proof. Specifically, we have the following property of the weighting matrices, A and Ã, used in DEXTRA.

Lemma 2. (Nedic et al. [20]) For any column stochastic matrix A ∈ R^{n×n}, we have:

(a) The limit lim_{k→∞} [A^k] exists and lim_{k→∞} [A^k]_{ij} = π_i, where π is the right eigenvector of A corresponding to eigenvalue 1.

(b) For all i ∈ {1, · · · , n}, the entries [A^k]_{ij} and π_i satisfy

$$\left|\left[A^k\right]_{ij} - \pi_i\right| < C\lambda^k, \quad \forall j,$$

where we can take C = 4 and λ = (1 − 1/n^n).
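A quick numerical illustration of Lemma 2: powers of a column-stochastic matrix on a strongly connected digraph converge geometrically, column-wise, to the right eigenvector π. The matrix below is a hypothetical example.

```python
import numpy as np

# Numerical illustration of Lemma 2: for a column-stochastic A on a
# strongly connected digraph, A^k converges to pi * 1^T, i.e., every
# column approaches the right eigenvector pi, at a geometric rate.
A = np.array([[1/3, 0.0, 1/2],
              [1/3, 1/2, 0.0],
              [1/3, 1/2, 1/2]])
n = A.shape[0]

vals, vecs = np.linalg.eig(A)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()                    # normalized right eigenvector, A pi = pi

Ak = np.eye(n)
for k in range(1, 31):
    Ak = Ak @ A
    err = np.abs(Ak - np.outer(pi, np.ones(n))).max()  # max_ij |[A^k]_ij - pi_i|

print(pi)    # pi = (1/3, 2/9, 4/9) for this particular A
print(err)   # after 30 steps the deviation is negligible
```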

The convergence properties of the weighting matrix, A, help to build the relation between ‖D^{k+1}z^{k+1} − D^k z^k‖ and ‖z^{k+1} − z^k‖. There is a similar relation between ‖D^{k+1}z^{k+1} − D^∞z^*‖ and ‖z^{k+1} − z^*‖.

Lemma 3. Let Assumptions A1, A2 hold. Denote d_max = max_k{‖D^k‖} and d⁻_max = max_k{‖(D^k)^{−1}‖}. The sequences ‖D^{k+1}z^{k+1} − D^∞z^*‖, ‖z^{k+1} − z^*‖, ‖D^{k+1}z^{k+1} − D^k z^k‖, and ‖z^{k+1} − z^k‖, generated by DEXTRA, satisfy the following relations.

(a) $\left\|z^{k+1} - z^k\right\| \leq d^{-}_{\max}\left\|D^{k+1}z^{k+1} - D^k z^k\right\| + \dfrac{4d^{-}_{\max}nCB_f}{S_f}\lambda^k.$

(b) $\left\|z^{k+1} - z^*\right\| \leq d^{-}_{\max}\left\|D^{k+1}z^{k+1} - D^{\infty}z^*\right\| + \dfrac{2d^{-}_{\max}nCB_f}{S_f}\lambda^k.$

(c) $\left\|D^{k+1}z^{k+1} - D^{\infty}z^*\right\| \leq d_{\max}\left\|z^{k+1} - z^*\right\| + \dfrac{2nCB_f}{S_f}\lambda^k.$

Proof. (a)

$$\begin{aligned} \left\|z^{k+1} - z^k\right\| &= \left\|\left(D^{k+1}\right)^{-1}\left(D^{k+1}\right)\left(z^{k+1} - z^k\right)\right\| \\ &\leq \left\|\left(D^{k+1}\right)^{-1}\right\|\left\|D^{k+1}z^{k+1} - D^k z^k + D^k z^k - D^{k+1}z^k\right\| \\ &\leq d^{-}_{\max}\left\|D^{k+1}z^{k+1} - D^k z^k\right\| + d^{-}_{\max}\left\|D^k - D^{k+1}\right\|\left\|z^k\right\| \\ &\leq d^{-}_{\max}\left\|D^{k+1}z^{k+1} - D^k z^k\right\| + \frac{4d^{-}_{\max}nCB_f}{S_f}\lambda^k, \end{aligned}$$

where the last inequality follows by applying Lemma 2 and the boundedness of the sequences updated by DEXTRA, Eq. (15). The result of part (b) follows similarly. The proof of part (c) is:

$$\begin{aligned} \left\|D^{k+1}z^{k+1} - D^{\infty}z^*\right\| &= \left\|D^{k+1}z^{k+1} - D^{k+1}z^* + D^{k+1}z^* - D^{\infty}z^*\right\| \\ &\leq d_{\max}\left\|z^{k+1} - z^*\right\| + \left\|D^{k+1} - D^{\infty}\right\|\left\|z^*\right\| \\ &\leq d_{\max}\left\|z^{k+1} - z^*\right\| + \frac{2nCB_f}{S_f}\lambda^k. \end{aligned}$$

Lemmas 1 and 3 will be used in the proof of Theorem 1 in Section V. We now analyze the property of the matrix (D^∞)^{−1}(Ã − A). As stated in Section III, the result of Theorem 1 by itself does not show that the sequence, {z^k}, generated by DEXTRA converges to the optimal solution, z^*, i.e., z^k → z^*. To prove this, we still need to show that ‖a‖²_{(D^∞)^{−1}(Ã−A)} ≥ 0, ∀a ∈ R^n. This property follows directly from the following lemma.

Lemma 4. (Chung [31]) The Laplacian, $\mathcal{L}$, of a directed graph, $\mathcal{G}$, is defined by
\[
\mathcal{L}(\mathcal{G}) = I_n - \frac{S^{1/2}RS^{-1/2}+S^{-1/2}R^\top S^{1/2}}{2}, \qquad (27)
\]
where $R$ is the transition probability matrix and $S = \operatorname{diag}(s)$, with $s$ the left eigenvector of $R$ corresponding to eigenvalue $1$. If $\mathcal{G}$ is strongly connected, then the eigenvalues of $\mathcal{L}(\mathcal{G})$ satisfy $0 = \lambda_0 < \lambda_1 \le \cdots \le \lambda_{n-1}$.

Since
\[
(D^\infty)^{-1}(I_n-A) + (I_n-A)^\top(D^\infty)^{-1} = (D^\infty)^{-1/2}\,\mathcal{L}(\mathcal{G})\,(D^\infty)^{-1/2},
\]
it follows from Lemma 4 that, for any $a \in \mathbb{R}^n$,
\[
\|a\|^2_{2(D^\infty)^{-1}(I_n-A)} = \|a\|^2_{(D^\infty)^{-1}(I_n-A)+(I_n-A)^\top(D^\infty)^{-1}} \ge 0.
\]
Noting that $\tilde{A}-A = \theta(I_n-A)$, $I_n+A-2\tilde{A} = (1-2\theta)(I_n-A)$, and $\theta \in (0, \frac{1}{2}]$, we also have the following relations:
\[
\|a\|^2_{(D^\infty)^{-1}(\tilde{A}-A)} \ge 0, \quad \forall a \in \mathbb{R}^n, \qquad (28)
\]
\[
\|a\|^2_{(D^\infty)^{-1}(I_n+A-2\tilde{A})} \ge 0, \quad \forall a \in \mathbb{R}^n, \qquad (29)
\]
and
\[
\lambda_{\min}\left((D^\infty)^{-1}(\tilde{A}-A)+(\tilde{A}-A)^\top(D^\infty)^{-1}\right) = 0,
\]
\[
\lambda_{\min}\left((D^\infty)^{-1}(I_n+A-2\tilde{A})+(I_n+A-2\tilde{A})^\top(D^\infty)^{-1}\right) = 0.
\]
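Relations (28)-(29) and the $\lambda_{\min}$ statements can be verified numerically. A minimal sketch, assuming a hypothetical 3-node column-stochastic $A$, $\tilde{A} = \theta I_n + (1-\theta)A$ with $\theta \in (0, \frac{1}{2}]$, and taking $D^\infty = n\cdot\operatorname{diag}(\pi)$ (one reading of the limiting weighting matrix):

```python
import numpy as np

n, theta = 3, 0.3
# Hypothetical column-stochastic A on a strongly connected 3-node digraph.
A = np.array([[0.5, 0.0, 0.3],
              [0.5, 0.6, 0.0],
              [0.0, 0.4, 0.7]])
A_tilde = theta * np.eye(n) + (1 - theta) * A

# pi: right eigenvector of A for eigenvalue 1; assume D_inf = n * diag(pi).
vals, vecs = np.linalg.eig(A)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
Dinf_inv = np.diag(1.0 / (n * pi))

# Smallest eigenvalues of the symmetrized matrices in Eqs. (28)-(29):
mins = []
for B in (A_tilde - A, np.eye(n) + A - 2 * A_tilde):
    S = Dinf_inv @ B + (Dinf_inv @ B).T
    mins.append(np.linalg.eigvalsh(S).min())
print(mins)  # both ~0: the quadratic forms are positive semidefinite
```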

Therefore, Eq. (28), together with the result of Theorem 1, Eq. (21), proves the convergence of DEXTRA to the optimal solution of Problem P1. What is left is to prove Theorem 1, which is presented in the next section.

V. PROOF OF THEOREM 1

With all the pieces in place, we are finally ready to prove Theorem 1. According to the strong

convexity assumption, A1(c), we have

\[
\begin{aligned}
2\alpha S_f\left\|z^{k+1}-z^*\right\|^2 \le{}& 2\alpha\left\langle D^\infty z^{k+1}-D^\infty z^*,\,(D^\infty)^{-1}\left[\nabla f(z^{k+1})-\nabla f(z^*)\right]\right\rangle \\
={}& 2\alpha\left\langle D^\infty z^{k+1}-D^{k+1}z^{k+1},\,(D^\infty)^{-1}\left[\nabla f(z^{k+1})-\nabla f(z^*)\right]\right\rangle \\
&+2\alpha\left\langle D^{k+1}z^{k+1}-D^\infty z^*,\,(D^\infty)^{-1}\left[\nabla f(z^{k+1})-\nabla f(z^k)\right]\right\rangle \\
&+2\left\langle D^{k+1}z^{k+1}-D^\infty z^*,\,(D^\infty)^{-1}\alpha\left[\nabla f(z^k)-\nabla f(z^*)\right]\right\rangle \qquad (30) \\
:={}& s_1+s_2+s_3,
\end{aligned}
\]
where $s_1$, $s_2$, $s_3$ denote the three terms on the RHS of Eq. (30). We discuss each term in sequence.

Denote $d_{\max} = \max_k\left\{\left\|D^k\right\|\right\}$, $d^-_{\max} = \max_k\left\{\left\|(D^k)^{-1}\right\|\right\}$, and $d^-_\infty = \left\|(D^\infty)^{-1}\right\|$, which are all bounded due to the convergence of the weighting matrix, $A$. For the first two terms, we have

\[
\begin{aligned}
s_1 &\le 2\alpha\left\|D^\infty-D^{k+1}\right\|\left\|z^{k+1}\right\|\left\|(D^\infty)^{-1}\right\|\left\|\nabla f(z^{k+1})-\nabla f(z^*)\right\| \\
&\le \frac{8\alpha nCB_f^2d^-_\infty}{S_f}\lambda^k; \qquad (31)
\end{aligned}
\]
\[
\begin{aligned}
s_2 &\le \alpha\eta\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2 + \frac{\alpha L_f^2}{\eta}\left\|(D^\infty)^{-1}\right\|^2\left\|z^{k+1}-z^k\right\|^2 \\
&\le \alpha\eta\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2 + \frac{2\alpha L_f^2(d^-_\infty d^-_{\max})^2}{\eta}\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2 \\
&\quad + \frac{32\alpha(nCL_fB_fd^-_\infty d^-_{\max})^2}{\eta S_f^2}\lambda^k. \qquad (32)
\end{aligned}
\]


Recalling Lemma 1, Eq. (24), we can represent $s_3$ as
\[
\begin{aligned}
s_3 ={}& \left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{-2(D^\infty)^{-1}(I_n+A-2\tilde{A})} \\
&+2\left\langle D^{k+1}z^{k+1}-D^\infty z^*,\,(D^\infty)^{-1}\tilde{A}\left(D^kz^k-D^{k+1}z^{k+1}\right)\right\rangle \\
&+2\left\langle D^{k+1}z^{k+1}-D^\infty z^*,\,(D^\infty)^{-1}\left(\tilde{A}-A\right)\left(q^*-q^{k+1}\right)\right\rangle \\
:={}& s_{3a}+s_{3b}+s_{3c}, \qquad (33)
\end{aligned}
\]
where
\[
s_{3b} = 2\left\langle D^{k+1}z^{k+1}-D^kz^k,\,\tilde{A}^\top(D^\infty)^{-1}\left(D^\infty z^*-D^{k+1}z^{k+1}\right)\right\rangle \qquad (34)
\]
and
\[
s_{3c} = 2\left\langle D^{k+1}z^{k+1},\,(D^\infty)^{-1}\left(\tilde{A}-A\right)\left(q^*-q^{k+1}\right)\right\rangle
= 2\left\langle q^{k+1}-q^k,\,(D^\infty)^{-1}\left(\tilde{A}-A\right)\left(q^*-q^{k+1}\right)\right\rangle. \qquad (35)
\]

The first equality in Eq. (35) holds because $(\tilde{A}-A)^\top(D^\infty)^{-1}D^\infty z^* = 0_n$. By substituting Eqs. (34) and (35) into (33), and recalling the definitions of $t^k$, $t^*$, and $G$ in Eq. (20), we obtain
\[
s_3 = \left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{-2(D^\infty)^{-1}(I_n+A-2\tilde{A})} + 2\left\langle t^{k+1}-t^k,\,G\left(t^*-t^{k+1}\right)\right\rangle. \qquad (36)
\]

By applying the algebraic identity
\[
\left\langle t^{k+1}-t^k,\,G\left(t^*-t^{k+1}\right)\right\rangle = \left\|t^k-t^*\right\|_G^2 - \left\|t^{k+1}-t^*\right\|_G^2 - \left\|t^{k+1}-t^k\right\|_G^2 - \left\langle G\left(t^{k+1}-t^k\right),\,t^*-t^{k+1}\right\rangle, \qquad (37)
\]

and noting that
\[
-\left\|t^{k+1}-t^k\right\|_G^2 = -\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{\tilde{A}^\top(D^\infty)^{-1}} - \left\|q^{k+1}-q^k\right\|^2_{(D^\infty)^{-1}(\tilde{A}-A)} \le -\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{\tilde{A}^\top(D^\infty)^{-1}}, \qquad (38)
\]
where the inequality holds due to Eq. (28), and

\[
\begin{aligned}
-\left\langle G\left(t^{k+1}-t^k\right),\,t^*-t^{k+1}\right\rangle ={}& -\left\langle \tilde{A}^\top(D^\infty)^{-1}\left(D^{k+1}z^{k+1}-D^kz^k\right),\,D^\infty z^*-D^{k+1}z^{k+1}\right\rangle \\
&-\left\langle D^{k+1}z^{k+1}-D^\infty z^*,\,\left(\tilde{A}-A\right)^\top(D^\infty)^{-1}\left(q^*-q^{k+1}\right)\right\rangle \\
\le{}& \frac{\sigma}{2}\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{(D^\infty)^{-1}\tilde{A}\tilde{A}^\top(D^\infty)^{-1}} + \frac{1}{\sigma}\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2 \\
&+\frac{\sigma}{2}\left\|q^*-q^{k+1}\right\|^2_{(D^\infty)^{-1}(\tilde{A}-A)(\tilde{A}-A)^\top(D^\infty)^{-1}}, \qquad (39)
\end{aligned}
\]

we are able to bound $s_3$ entirely in terms of matrix norms by combining Eqs. (36), (37), (38), and (39):
\[
\begin{aligned}
s_3 \le{}& \left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{-2(D^\infty)^{-1}(I_n+A-2\tilde{A})} + 2\left\|t^k-t^*\right\|_G^2 - 2\left\|t^{k+1}-t^*\right\|_G^2 \\
&+\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{-2\tilde{A}^\top(D^\infty)^{-1}+\sigma(D^\infty)^{-1}\tilde{A}\tilde{A}^\top(D^\infty)^{-1}} + \frac{2}{\sigma}\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2 \\
&+\sigma\left\|q^*-q^{k+1}\right\|^2_{(D^\infty)^{-1}(\tilde{A}-A)(\tilde{A}-A)^\top(D^\infty)^{-1}}. \qquad (40)
\end{aligned}
\]
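The identity in Eq. (37), used above, is purely algebraic and holds for any square matrix $G$, symmetric or not; a quick numerical spot-check with arbitrary stand-ins for $G$, $t^k$, $t^{k+1}$, and $t^*$:

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(4, 4))              # arbitrary, non-symmetric
tk, tk1, ts = (rng.normal(size=4) for _ in range(3))

def qnorm(x, Mat):
    # ||x||_Mat^2 = x^T Mat x, as used throughout this section
    return x @ Mat @ x

lhs = (tk1 - tk) @ (G @ (ts - tk1))
rhs = (qnorm(tk - ts, G) - qnorm(tk1 - ts, G) - qnorm(tk1 - tk, G)
       - (G @ (tk1 - tk)) @ (ts - tk1))
print(np.isclose(lhs, rhs))  # True
```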

It follows from Lemma 3(c) that
\[
\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2 \le 2d_{\max}^2\left\|z^{k+1}-z^*\right\|^2 + \frac{8(nCB_f)^2}{S_f^2}\lambda^k. \qquad (41)
\]

By combining Eqs. (41) and (30), we obtain
\[
\frac{\alpha S_f}{d_{\max}^2}\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2 \le s_1+s_2+s_3+\frac{8\alpha(nCB_f)^2}{S_fd_{\max}^2}\lambda^k. \qquad (42)
\]

Finally, by plugging the bounds on $s_1$, Eq. (31), $s_2$, Eq. (32), and $s_3$, Eq. (40), into Eq. (42), and rearranging terms, it follows that
\[
\begin{aligned}
\left\|t^k-t^*\right\|_G^2 - \left\|t^{k+1}-t^*\right\|_G^2 \ge{}& \left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)I_n-\frac{1}{\sigma}I_n+(D^\infty)^{-1}(I_n+A-2\tilde{A})} \\
&+\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{M^\top-\frac{\sigma}{2}MM^\top-\frac{\alpha L_f^2(d^-_\infty d^-_{\max})^2}{\eta}I_n} \\
&-\left[\frac{4\alpha(nCB_f)^2}{S_fd_{\max}^2}+\frac{8\alpha nCB_f^2d^-_\infty}{S_f}+\frac{16\alpha(nCL_fB_fd^-_\infty d^-_{\max})^2}{\eta S_f^2}\right]\lambda^k \\
&-\frac{\sigma}{2}\left\|q^*-q^{k+1}\right\|^2_{NN^\top}, \qquad (43)
\end{aligned}
\]
where we denote $M = (D^\infty)^{-1}\tilde{A}$ and $N = (D^\infty)^{-1}(\tilde{A}-A)$ for simplicity of representation.

In order to establish the result of Theorem 1, Eq. (21), i.e., $\left\|t^k-t^*\right\|_G^2 \ge (1+\delta)\left\|t^{k+1}-t^*\right\|_G^2 - \Gamma\gamma^k$, it remains to show that the RHS of Eq. (43) is no less than $\delta\left\|t^{k+1}-t^*\right\|_G^2 - \Gamma\gamma^k$. This is equivalent to showing that
\[
\begin{aligned}
&\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)I_n-\frac{1}{\sigma}I_n+(D^\infty)^{-1}(I_n+A-2\tilde{A})-\delta M^\top} \\
&+\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{M^\top-\frac{\sigma}{2}MM^\top-\frac{\alpha L_f^2(d^-_\infty d^-_{\max})^2}{\eta}I_n} \\
&-\left[\frac{4\alpha(nCB_f)^2}{S_fd_{\max}^2}+\frac{8\alpha nCB_f^2d^-_\infty}{S_f}+\frac{16\alpha(nCL_fB_fd^-_\infty d^-_{\max})^2}{\eta S_f^2}\right]\lambda^k \\
&\ge \left\|q^*-q^{k+1}\right\|^2_{\frac{\sigma}{2}NN^\top+\delta N} - \Gamma\lambda^k. \qquad (44)
\end{aligned}
\]

From Lemma 1, we have
\[
\begin{aligned}
\left\|q^*-q^{k+1}\right\|^2_{(\tilde{A}-A)^\top(\tilde{A}-A)} ={}& \left\|\left(\tilde{A}-A\right)\left(q^*-q^{k+1}\right)\right\|^2 \\
={}& \Big\|\left(I_n+A-2\tilde{A}\right)\left(D^{k+1}z^{k+1}-D^\infty z^*\right)+\alpha\left[\nabla f(z^{k+1})-\nabla f(z^*)\right] \\
&+\tilde{A}\left(D^{k+1}z^{k+1}-D^kz^k\right)+\alpha\left[\nabla f(z^k)-\nabla f(z^{k+1})\right]\Big\|^2 \\
\le{}& 4\left(\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{(1-2\theta)^2P^\top P}+\alpha^2L_f^2\left\|z^{k+1}-z^*\right\|^2\right) \\
&+4\left(\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{\tilde{A}^\top\tilde{A}}+\alpha^2L_f^2\left\|z^{k+1}-z^k\right\|^2\right) \\
\le{}& \left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{4(1-2\theta)^2P^\top P+8(\alpha L_fd^-_{\max})^2I_n} \\
&+\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{4\tilde{A}^\top\tilde{A}+8(\alpha L_fd^-_{\max})^2I_n} + 160\left(\frac{\alpha L_fd^-_{\max}nCB_f}{S_f}\right)^2\lambda^k, \qquad (45)
\end{aligned}
\]

where we denote $P = I_n-A$ for simplicity of representation. Considering the analysis in Eqs. (28) and (29), we get $\lambda\left(\frac{N+N^\top}{2}\right) \ge 0$. It is also true that $\lambda\left(NN^\top\right) \ge 0$ and $\lambda\left((\tilde{A}-A)^\top(\tilde{A}-A)\right) \ge 0$. Therefore, we have the relation
\[
\frac{\left\|q^*-q^{k+1}\right\|^2_{\frac{\sigma}{2}NN^\top+\delta N}}{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)} \le \left\|q^*-q^{k+1}\right\|^2 \le \frac{\left\|q^*-q^{k+1}\right\|^2_{(\tilde{A}-A)^\top(\tilde{A}-A)}}{\lambda_{\min}\left((\tilde{A}-A)^\top(\tilde{A}-A)\right)}, \qquad (46)
\]

where $\lambda_{\min}(\cdot)$ gives the smallest nonzero eigenvalue. We denote $L = \tilde{A}-A$. By considering Eqs. (44), (45), and (46), and letting $\Gamma$ in Eq. (44) be
\[
\begin{aligned}
\Gamma ={}& \left[\frac{4\alpha(nCB_f)^2}{S_fd_{\max}^2}+\frac{8\alpha nCB_f^2d^-_\infty}{S_f}+\frac{16\alpha(nCL_fB_fd^-_\infty d^-_{\max})^2}{\eta S_f^2}\right] \\
&+\frac{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)}\cdot 160\left(\frac{\alpha L_fd^-_{\max}nCB_f}{S_f}\right)^2,
\end{aligned}
\]


it follows that, in order to validate Eq. (44), the following relation needs to hold:
\[
\begin{aligned}
&\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)I_n-\frac{1}{\sigma}I_n+(D^\infty)^{-1}(I_n+A-2\tilde{A})-\delta M^\top} \\
&+\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{M^\top-\frac{\sigma}{2}MM^\top-\frac{\alpha L_f^2(d^-_\infty d^-_{\max})^2}{\eta}I_n} \\
&\ge \frac{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)} \\
&\quad\cdot\left(\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{4(1-2\theta)^2P^\top P+8(\alpha L_fd^-_{\max})^2I_n}+\left\|D^{k+1}z^{k+1}-D^kz^k\right\|^2_{4\tilde{A}^\top\tilde{A}+8(\alpha L_fd^-_{\max})^2I_n}\right),
\end{aligned}
\]

which can be ensured by letting
\[
\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)I_n-\frac{1}{\sigma}I_n-\delta\left(\frac{M+M^\top}{2}\right) \succeq \frac{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)}\cdot\left(4(1-2\theta)^2P^\top P+8(\alpha L_fd^-_{\max})^2I_n\right),
\]
\[
\frac{M+M^\top}{2}-\frac{\sigma}{2}MM^\top-\frac{\alpha L_f^2(d^-_\infty d^-_{\max})^2}{\eta}I_n \succeq \frac{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)}\cdot\left(4\tilde{A}^\top\tilde{A}+8(\alpha L_fd^-_{\max})^2I_n\right).
\]

Denote $\rho_1 = 4(1-2\theta)^2\lambda_{\max}\left(P^\top P\right)+8(\bar{\alpha}L_fd^-_{\max})^2$ and $\rho_2 = 4\lambda_{\max}\left(\tilde{A}^\top\tilde{A}\right)+8(\bar{\alpha}L_fd^-_{\max})^2$, where $\bar{\alpha}$ is some constant regarded as an upper bound (ceiling) on $\alpha$. Therefore, we have

\[
\delta \le \min\left\{
\frac{\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)-\frac{1}{\sigma}-\frac{\sigma\lambda_{\max}\left(NN^\top\right)}{2\lambda_{\min}\left(L^\top L\right)}\rho_1}
{\lambda_{\max}\left(\frac{M+M^\top}{2}\right)+\frac{\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)}\rho_1},\;
\frac{\lambda_{\min}\left(\frac{M+M^\top}{2}\right)-\frac{\sigma}{2}\lambda_{\max}\left(MM^\top\right)-\frac{\alpha L_f^2(d^-_\infty d^-_{\max})^2}{\eta}}
{\frac{\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)}\rho_2}
-\frac{\sigma\lambda_{\max}\left(NN^\top\right)}{2\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}
\right\}.
\]

To ensure $\delta > 0$, we set the parameters $\eta$ and $\sigma$ to satisfy
\[
\eta < \frac{S_f}{d_{\max}^2}, \qquad \sigma < \frac{\lambda_{\min}\left(M+M^\top\right)}{\lambda_{\max}\left(MM^\top\right)+\frac{\lambda_{\max}\left(NN^\top\right)}{\lambda_{\min}\left(L^\top L\right)}\rho_2},
\]
and also $\alpha \in (\alpha_{\min}, \alpha_{\max})$, where $\alpha_{\min}$ and $\alpha_{\max}$ are the values given in Theorem 1. Therefore, we obtain the desired result.
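All spectral quantities entering $\Gamma$ and the bound on $\delta$, such as $\lambda_{\max}(NN^\top)$, $\lambda_{\max}\left(\frac{N+N^\top}{2}\right)$, and $\lambda_{\min}(L^\top L)$, depend only on the weighting matrices, so they can be computed explicitly for a given topology. A minimal sketch, assuming a hypothetical 3-node column-stochastic $A$, $\tilde{A} = \theta I_n + (1-\theta)A$, and $D^\infty = n\cdot\operatorname{diag}(\pi)$:

```python
import numpy as np

n, theta = 3, 0.3
A = np.array([[0.5, 0.0, 0.3],
              [0.5, 0.6, 0.0],
              [0.0, 0.4, 0.7]])        # hypothetical column-stochastic matrix
A_tilde = theta * np.eye(n) + (1 - theta) * A

vals, vecs = np.linalg.eig(A)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
Dinf_inv = np.diag(1.0 / (n * pi))     # assumed form of (D^inf)^{-1}

M = Dinf_inv @ A_tilde                 # M = (D^inf)^{-1} A_tilde
N = Dinf_inv @ (A_tilde - A)           # N = (D^inf)^{-1} (A_tilde - A)
L = A_tilde - A

lam_max_NNt = np.linalg.eigvalsh(N @ N.T).max()
lam_max_symN = np.linalg.eigvalsh((N + N.T) / 2).max()
# lambda_min here is the smallest NONZERO eigenvalue: L pi = 0, so one
# eigenvalue of L^T L is exactly zero and is filtered out.
eigs = np.linalg.eigvalsh(L.T @ L)
lam_min_LtL = eigs[eigs > 1e-12].min()
print(lam_max_NNt, lam_max_symN, lam_min_LtL)  # all strictly positive
```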

VI. NUMERICAL EXPERIMENTS

This section provides numerical experiments to study the convergence rate of DEXTRA for a least squares problem over a directed graph. The local objective functions in the least squares problem are strongly convex. We compare the performance of DEXTRA with other algorithms suited to directed graphs: GP, as defined in [20, 21], and D-DGD, as defined in [32]. Our second experiment verifies the existence of $\alpha_{\min}$ and $\alpha_{\max}$, such that a proper stepsize should satisfy $\alpha \in [\alpha_{\min}, \alpha_{\max}]$. We also consider various network topologies to see how they affect the stepsize interval. Convergence is studied in terms of the residual
\[
r = \frac{1}{n}\sum_{i=1}^{n}\left\|z_i^k-\phi\right\|,
\]
where $\phi$ is the optimal solution. The distributed least squares problem is described as follows. Each agent $i$ owns a private measurement, $s_i = R_ix+n_i$, where $s_i \in \mathbb{R}^{m_i}$ and $R_i \in \mathbb{R}^{m_i\times p}$ are measured data, $x \in \mathbb{R}^p$ is the unknown state, and $n_i \in \mathbb{R}^{m_i}$ is random unknown noise. The goal is to estimate $x$, which can be formulated as the distributed optimization problem
\[
\min f(x) = \frac{1}{n}\sum_{i=1}^{n}\left\|R_ix-s_i\right\|^2.
\]
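As a concrete, hypothetical instance of this setup, the sketch below generates random $R_i$ and $s_i$, computes the optimal solution $\phi$ by solving the stacked least squares system centrally, and evaluates the residual $r$ at a dummy common iterate; the problem sizes and noise level are illustrative assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m_i = 10, 5, 3           # hypothetical sizes: 10 agents, x in R^5

x_true = rng.normal(size=p)
R = [rng.normal(size=(m_i, p)) for _ in range(n)]            # private R_i
s = [Ri @ x_true + 0.01 * rng.normal(size=m_i) for Ri in R]  # s_i = R_i x + n_i

# phi minimizes (1/n) * sum_i ||R_i x - s_i||^2, i.e. the ordinary least
# squares solution of the stacked system.
phi, *_ = np.linalg.lstsq(np.vstack(R), np.concatenate(s), rcond=None)

# Residual used in the experiments: r = (1/n) * sum_i ||z_i^k - phi||,
# evaluated here at a hypothetical common iterate z_i^k = 0.
z = np.zeros((n, p))
r = np.mean(np.linalg.norm(z - phi, axis=1))
print(r)  # equals ||phi|| here, since every z_i is zero
```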

We consider the network topologies given by the digraphs shown in Fig. 1.


Fig. 1: Three examples of strongly-connected but non-balanced digraphs.

Our first experiment compares several algorithms over the network Ga illustrated in Fig. 1. The comparison of DEXTRA, GP, D-DGD, and DGD with a row-stochastic weighting matrix is shown in Fig. 2. In this experiment, we set α = 0.1 for DEXTRA; its convergence rate is linear, as stated in Section III. GP and D-DGD apply a diminishing stepsize, α_k = α/√k; as a result, the convergence rate of both is sub-linear. We also consider the DGD algorithm, but with a row-stochastic weighting matrix, since for a general directed graph it is not possible to construct a doubly-stochastic matrix. As expected, DGD with a row-stochastic matrix does not converge to the exact optimal solution, while the other three algorithms are suited to directed graphs.


As stated in Theorem 1, DEXTRA converges to the optimal solution when the stepsize α lies in a proper interval. Fig. 3 illustrates this fact. The network topology is chosen to be Ga in Fig. 1. It is shown that when α = 0.01 or α = 1, DEXTRA does not converge. DEXTRA has a more restricted stepsize range compared with EXTRA, see [19], where the stepsize can be arbitrarily close to zero, i.e., α_min = 0, when a symmetric, doubly-stochastic matrix is applied. The relatively smaller stepsize interval is due to the fact that the weighting matrix applied in DEXTRA cannot be symmetric.

[Fig. 2 plots the residual r (log scale) versus the number of iterations k for DGD with a row-stochastic matrix, GP, D-DGD, and DEXTRA.]

Fig. 2: Comparison of different distributed optimization algorithms in a least squares problem. GP, D-DGD, and DEXTRA are proved to work when the network topology is described by digraphs. Moreover, DEXTRA has a much faster convergence rate than GP and D-DGD.

In the third experiment, we explore the relation between the stepsize range and the network topology. Fig. 4 shows the interval of stepsizes for which DEXTRA converges to the optimal solution when the network topology is chosen to be Ga, Gb, or Gc in Fig. 1. The interval of admissible stepsizes is related to the network topology, as stated in Theorem 1 in Section III. Comparing Figs. 4(b) and (c) with (a), a faster information diffusion over the network can enlarge the stepsize interval. Given a fixed network topology, e.g., Fig. 4(a), a larger stepsize increases the convergence speed. When we compare Fig. 4(b) with (a), it is interesting to see that with α = 0.3, DEXTRA takes about 190 iterations to converge with the


[Fig. 3 plots the residual r (log scale) versus the number of iterations k for DEXTRA with α = 0.01, α = 0.1, and α = 1.]

Fig. 3: There exists an interval of stepsizes for which DEXTRA converges to the optimal solution.

choice of Ga, while the same stepsize requires about 230 iterations when the topology changes to Gb. We conclude that faster information diffusion over the network speeds up convergence.

VII. CONCLUSIONS

We introduce DEXTRA, a distributed algorithm to solve multi-agent optimization problems

over directed graphs. We have shown that DEXTRA succeeds in driving all agents to the same

point, which is the optimal solution of the problem, given that the communication graph is

strongly-connected and the objective functions are strongly convex. Moreover, the algorithm

converges at a linear rate O(εk) for some constant ε < 1. The constant depends on the network

topology G, as well as the Lipschitz constant, strong convexity constant, and gradient norm of

objective functions. Numerical experiments are conducted for a least squares problem, where we show that DEXTRA is the fastest among the distributed algorithms suited to directed graphs, namely GP and D-DGD.

Our analysis also explicitly gives the interval of stepsizes for which DEXTRA converges to the optimal solution; determining this interval requires knowledge of the network topology. Although numerical experiments show that the practical interval is much larger, it is still an open question


[Fig. 4 plots the residual r (log scale) versus the number of iterations k for DEXTRA over a range of stepsizes on each topology: (a) Ga, α ∈ [0.03, 0.3]; (b) Gb, α ∈ [10^{-10}, 0.3]; (c) Gc, α ∈ [10^{-10}, 0.4].]

Fig. 4: There exists an interval of stepsizes for which DEXTRA converges to the optimal solution. The upper and lower bounds of the interval depend on the network topology, as well as the Lipschitz constant, strong convexity constant, and gradient norm of the objective functions. (a) Ga; (b) Gb; (c) Gc. Given a fixed topology, the convergence speed increases as α increases.


on how to remove the need for knowledge of the network topology, which is global information in a distributed setting.

REFERENCES

[1] V. Cevher, S. Becker, and M. Schmidt, “Convex optimization for big data: Scalable,

randomized, and parallel algorithms for big data analytics,” IEEE Signal Processing

Magazine, vol. 31, no. 5, pp. 32–43, 2014.

[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and

statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, Jan. 2011.

[3] I. Necoara and J. A. K. Suykens, “Application of a smoothing technique to decomposition

in convex optimization,” IEEE Transactions on Automatic Control, vol. 53, no. 11, pp.

2674–2679, Dec. 2008.

[4] G. Mateos, J. A. Bazerque, and G. B. Giannakis, “Distributed sparse linear regression,”

IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5262–5276, Oct. 2010.

[5] J. A. Bazerque and G. B. Giannakis, “Distributed spectrum sensing for cognitive radio

networks by exploiting sparsity,” IEEE Transactions on Signal Processing, vol. 58, no. 3,

pp. 1847–1862, March 2010.

[6] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in 3rd

International Symposium on Information Processing in Sensor Networks, Berkeley, CA,

Apr. 2004, pp. 20–27.

[7] U. A. Khan, S. Kar, and J. M. F. Moura, “Diland: An algorithm for distributed sensor

localization with noisy distance measurements,” IEEE Transactions on Signal Processing,

vol. 58, no. 3, pp. 1940–1947, Mar. 2010.

[8] C. Li and L. Li, “A distributed multiple dimensional QoS constrained resource scheduling optimization policy in computational grid,” Journal of Computer and System Sciences, vol. 72, no. 4, pp. 706–726, 2006.

[9] G. Neglia, G. Reina, and S. Alouf, “Distributed gradient optimization for epidemic routing:

A preliminary evaluation,” in 2nd IFIP in IEEE Wireless Days, Paris, Dec. 2009, pp. 1–6.

[10] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,”

IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.


[11] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,”

arXiv preprint arXiv:1310.7063, 2013.

[12] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed

optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic

Control, vol. 57, no. 3, pp. 592–606, Mar. 2012.

[13] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, and M. Puschel, “D-ADMM: A

communication-efficient distributed algorithm for separable optimization,” IEEE Trans-

actions on Signal Processing, vol. 61, no. 10, pp. 2718–2723, May 2013.

[14] E. Wei and A. Ozdaglar, “Distributed alternating direction method of multipliers,” in 51st

IEEE Annual Conference on Decision and Control, Dec. 2012, pp. 5445–5450.

[15] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750–1761, April 2014.

[16] D. Jakovetic, J. Xavier, and J. M. F. Moura, “Fast distributed gradient methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.

[17] A. Mokhtari, Q. Ling, and A. Ribeiro, “Network Newton,” http://www.seas.upenn.edu/∼aryanm/wiki/NN.pdf, 2014.

[18] Q. Ling and A. Ribeiro, “Decentralized linearized alternating direction method of

multipliers,” in IEEE International Conference on Acoustics, Speech and Signal Processing.

IEEE, 2014, pp. 5447–5451.

[19] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.

[20] A. Nedic and A. Olshevsky, “Distributed optimization over time-varying directed graphs,”

IEEE Transactions on Automatic Control, vol. PP, no. 99, pp. 1–1, 2014.

[21] K. I. Tsianos, The role of the Network in Distributed Optimization Algorithms: Convergence

Rates, Scalability, Communication/Computation Tradeoffs and Communication Delays,

Ph.D. thesis, Dept. Elect. Comp. Eng. McGill University, 2013.

[22] D. Kempe, A. Dobra, and J. Gehrke, “Gossip-based computation of aggregate information,”

in 44th Annual IEEE Symposium on Foundations of Computer Science, Oct. 2003, pp. 482–

491.

[23] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli, “Weighted gossip: Distributed

averaging using non-doubly stochastic matrices,” in IEEE International Symposium on


Information Theory, Jun. 2010, pp. 1753–1757.

[24] A. Jadbabaie, J. Lim, and A. S. Morse, “Coordination of groups of mobile autonomous

agents using nearest neighbor rules,” IEEE Transactions on Automatic Control, vol. 48,

no. 6, pp. 988–1001, Jun. 2003.

[25] R. Olfati-Saber and R. M. Murray, “Consensus problems in networks of agents with

switching topology and time-delays,” IEEE Transactions on Automatic Control, vol. 49,

no. 9, pp. 1520–1533, Sep. 2004.

[26] R. Olfati-Saber and R. M. Murray, “Consensus protocols for networks of dynamic agents,”

in IEEE American Control Conference, Denver, Colorado, Jun. 2003, vol. 2, pp. 951–956.

[27] L. Xiao, S. Boyd, and S. J. Kim, “Distributed average consensus with least-mean-square

deviation,” Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33 – 46,

2007.

[28] C. Xi, Q. Wu, and U. A. Khan, “Distributed gradient descent over directed graphs,”

http://www.eecs.tufts.edu/∼cxi01/paper/2015D-DGD.pdf, 2015.

[29] C. Xi and U. A. Khan, “Directed-distributed gradient descent,” in 53rd Annual Allerton

Conference on Communication, Control, and Computing, Monticello, IL, USA, Oct. 2015.

[30] K. Cai and H. Ishii, “Average consensus on general strongly connected digraphs,”

Automatica, vol. 48, no. 11, pp. 2750 – 2761, 2012.

[31] F. Chung, “Laplacians and the Cheeger inequality for directed graphs,” Annals of Combinatorics, vol. 9, no. 1, pp. 1–19, 2005.

[32] C. Xi, Q. Wu, and U. A. Khan, “Distributed gradient descent over directed graphs,”

http://www.eecs.tufts.edu/∼cxi01/paper/2015D-DGD.pdf, 2015.
