
Coordinate friendly structures, algorithms and applications∗

Zhimin Peng, Tianyu Wu, Yangyang Xu, Ming Yan, and Wotao Yin

This paper focuses on coordinate update methods, which are useful for solving problems involving large or high-dimensional datasets. They decompose a problem into simple subproblems, where each updates one, or a small block of, variables while fixing others. These methods can deal with linear and nonlinear mappings, smooth and nonsmooth functions, as well as convex and nonconvex problems. In addition, they are easy to parallelize.

The great performance of coordinate update methods depends on solving simple subproblems. To derive simple subproblems for several new classes of applications, this paper systematically studies coordinate friendly operators that perform low-cost coordinate updates.

Based on the discovered coordinate friendly operators, as well as operator splitting techniques, we obtain new coordinate update algorithms for a variety of problems in machine learning, image processing, as well as sub-areas of optimization. Several problems are treated with coordinate update for the first time. The obtained algorithms are scalable to large instances through parallel and even asynchronous computing. We present numerical examples to illustrate how effective these algorithms are.

Keywords and phrases: coordinate update, fixed point, operator splitting, primal-dual splitting, parallel, asynchronous.

∗This work is supported by NSF Grants DMS-1317602 and ECCS-1462398.


1. Introduction

This paper studies coordinate update methods, which reduce a large problem to smaller subproblems and are useful for solving large-sized problems. These methods handle both linear and nonlinear maps, smooth and nonsmooth functions, and convex and nonconvex problems. Common special cases of these methods are the Jacobi and Gauss-Seidel algorithms for solving a linear system of equations; coordinate update methods are also commonly used for solving differential equations (e.g., domain decomposition) and optimization problems (e.g., coordinate descent).

After coordinate update methods were initially introduced in each topic area, their evolution was slow until recently, when data-driven applications (e.g., in signal processing, image processing, and statistical and machine learning) imposed a strong demand for scalable numerical solutions; consequently, numerical methods with small footprints, including coordinate update methods, have become increasingly popular. These methods are generally applicable to many problems involving large or high-dimensional datasets.

Coordinate update methods generate simple subproblems that update one variable, or a small block of variables, while fixing others. The variables can be updated in cyclic, random, or greedy orders, which can be selected to adapt to the problem. The subproblems that perform coordinate updates also have different forms. Coordinate updates can be applied either sequentially on a single thread or concurrently on multiple threads, or even in an asynchronous parallel fashion. They have been demonstrated to give rise to very powerful and scalable algorithms.

Clearly, the strong performance of coordinate update methods relies on solving simple subproblems. The cost of each subproblem must be proportional to how many coordinates it updates. When there are in total m coordinates, the cost of updating one coordinate should not exceed the average per-coordinate cost of the full update (made to all the coordinates at once). Otherwise, coordinate update is not computationally worthy. For example, let f : R^m → R be a C² function, and consider the Newton update x^{k+1} ← x^k − (∇²f(x^k))^{−1} ∇f(x^k). Since updating each x_i (keeping others fixed) still requires forming the Hessian matrix ∇²f(x^k) (at least O(m²) operations) and factorizing it (O(m³) operations), there is little to save in computation compared to updating all the components of x at once; hence, Newton's method is generally not amenable to coordinate update.

The recent coordinate-update literature has introduced new algorithms. However, they are primarily applied to a few, albeit important, classes of problems that arise in machine learning. For many complicated problems, it remains open whether simple subproblems can be obtained. We provide positive answers to several new classes of applications and introduce their coordinate update algorithms. Therefore, the focus of this paper is to build a set of tools for deriving simple subproblems and extending coordinate updates to new territories of applications.

We will frame each application into an equivalent fixed-point problem

(1) x = T x

by specifying the operator T : H → H, where x = (x_1, . . . , x_m) ∈ H, and H = H_1 × · · · × H_m is a Hilbert space. In many cases, the operator T itself represents an iteration:

(2)    x^{k+1} = T x^k

such that the limit of the sequence {x^k} exists and is a fixed point of T, which is also a solution to the application or from which a solution to the application can be obtained. We call the scheme (2) a full update, as opposed to updating one x_i at a time. The scheme (2) has a number of interesting special cases including methods of gradient descent, gradient projection, proximal gradient, operator splitting, and many others.

We study the structures of T that make the following coordinate update algorithm computationally worthy:

(3)    x^{k+1}_i = x^k_i − η_k (x^k − T x^k)_i,

where η_k is a step size and i ∈ [m] := {1, . . . , m} is arbitrary. Specifically, the cost of performing (3) is roughly 1/m, or lower, of that of performing (2). We call such T a Coordinate Friendly (CF) operator, which we will formally define.

This paper will explore a variety of CF operators. Single CF operators include linear maps, projections to certain simple sets, proximal maps and gradients of (nearly) separable functions, as well as gradients of sparsely supported functions. There are many more composite CF operators, which are built from single CF and non-CF operators under a set of rules. The fact that some of these operators are CF is not obvious.

These CF operators let us derive powerful coordinate update algorithms for a variety of applications including, but not limited to, linear and second-order cone programming, variational image processing, support vector machine, empirical risk minimization, portfolio optimization, distributed computing, and nonnegative matrix factorization. For each application, we present an algorithm in the form of (2) so that its coordinate update (3) is efficient. In this way we obtain new coordinate update algorithms for these applications, some of which are treated with coordinate update for the first time.

The developed coordinate update algorithms are easy to parallelize. In addition, the work in this paper gives rise to parallel and asynchronous extensions to existing algorithms including the Alternating Direction Method of Multipliers (ADMM), primal-dual splitting algorithms, and others.

The paper is organized as follows. §1.1 reviews the existing frameworks of coordinate update algorithms. §2 defines the CF operator and discusses different classes of CF operators. §3 introduces a set of rules to obtain composite CF operators and applies the results to operator splitting methods. §4 is dedicated to primal-dual splitting methods with CF operators, where existing ones are reviewed and a new one is introduced. Applying the results of previous sections, §5 obtains novel coordinate update algorithms for a variety of applications, some of which have been tested with their numerical results presented in §6.

Throughout this paper, all functions f, g, h are proper closed convex and can take the extended value ∞, and all sets X, Y, Z are nonempty closed convex. The indicator function ι_X(x) returns 0 if x ∈ X, and ∞ elsewhere. For a positive integer m, we let [m] := {1, . . . , m}.

1.1. Coordinate Update Algorithmic Frameworks

This subsection reviews the sequential and parallel algorithmic frameworks for coordinate updates, as well as the relevant literature.

The general framework of coordinate update is

1. set k ← 0 and initialize x^0 ∈ H = H_1 × · · · × H_m;
2. while not converged do
3.    select an index i_k ∈ [m];
4.    update x^{k+1}_i for i = i_k while keeping x^{k+1}_i = x^k_i, ∀ i ≠ i_k;
5.    k ← k + 1.

Next, we review the index rules and the methods to update x_i.

1.1.1. Sequential Update. In this framework, there is a sequence of coordinate indices i_1, i_2, . . . chosen according to one of the following rules: cyclic, cyclic permutation, random, and greedy. At iteration k, only the i_k-th coordinate is updated:

    x^{k+1}_i = x^k_i − η_k (x^k − T x^k)_i,   i = i_k,
    x^{k+1}_i = x^k_i,   for all i ≠ i_k.


Sequential updates have been applied to many problems such as the Gauss-Seidel iteration for solving a linear system of equations, alternating projection [75, 4] for finding a point in the intersection of two sets, ADMM [31, 30] for solving monotropic programs, and Douglas-Rachford Splitting (DRS) [26] for finding a zero of the sum of two operators.

In optimization, coordinate descent algorithms, at each iteration, minimize the function f(x_1, . . . , x_m) by fixing all but one variable x_i. Let

    x_{i−} := (x_1, . . . , x_{i−1}),   x_{i+} := (x_{i+1}, . . . , x_m)

collect all but the ith coordinate of x. Coordinate descent solves one of the following subproblems:

(4a)    (T x^k)_i = argmin_{x_i} f(x^k_{i−}, x_i, x^k_{i+}),
(4b)    (T x^k)_i = argmin_{x_i} f(x^k_{i−}, x_i, x^k_{i+}) + (1/(2η_k)) ‖x_i − x^k_i‖²,
(4c)    (T x^k)_i = argmin_{x_i} ⟨∇_i f(x^k), x_i⟩ + (1/(2η_k)) ‖x_i − x^k_i‖²,
(4d)    (T x^k)_i = argmin_{x_i} ⟨∇_i f^{diff}(x^k), x_i⟩ + f^{prox}_i(x_i) + (1/(2η_k)) ‖x_i − x^k_i‖²,

which are called the direct update, proximal update, gradient update, and prox-gradient update, respectively. The last update applies to the function

    f(x) = f^{diff}(x) + Σ_{i=1}^m f^{prox}_i(x_i),

where f^{diff} is differentiable and each f^{prox}_i is proximable (its proximal map takes O(dim(x_i) polylog(dim(x_i))) operations to compute).

Sequential-update literature. Coordinate descent algorithms date back to the 1950s [35], when the cyclic index rule was used. Their convergence has been established under a variety of cases, for both convex and nonconvex objective functions; see [77, 85, 57, 33, 45, 70, 32, 72, 58, 8, 36, 78]. Proximal updates are studied in [32, 1] and developed into prox-gradient updates in [74, 73, 13] and mixed updates in [81].

The random index rule first appeared in [48] and then [61, 44]. Recently, [82, 80] compared the convergence speeds of cyclic and stochastic update orders. The gradient update has been relaxed to the stochastic gradient update for large-scale problems in [21, 83].

The greedy index rule leads to fewer iterations but is often impractical since it requires substantial effort to calculate scores for all the coordinates. However, there are cases where calculating the scores is inexpensive [11, 41, 79] and the savings in the total number of iterations significantly outweigh the extra calculation [74, 25, 55, 49].

A simple example. We present the coordinate update algorithms under different index rules for solving a simple least squares problem:

    minimize_x  f(x) := (1/2) ‖Ax − b‖²,

where A ∈ R^{p×m} and b ∈ R^p are Gaussian random. Our goal is to numerically demonstrate the advantages of coordinate updates over the full update of gradient descent:

    x^{k+1} = x^k − η_k A^T(Ax^k − b).

The four tested index rules are: cyclic, cyclic permutation, random, and greedy under the Gauss-Southwell rule (which selects i_k = argmax_i ‖∇_i f(x^k)‖). Note that because this example is very special, the comparisons of different index rules are far from conclusive.

In the full update, the step size η_k is set to the theoretical upper bound 2/‖A‖²₂, where ‖A‖₂ denotes the matrix operator norm and equals the largest singular value of A. For each coordinate update to x_i, the step size η_k is set to 1/(A^T A)_{ii}. All of the full and coordinate updates have the same per-epoch complexity, so we plot the objective errors in Figure 1.
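
This comparison can be reproduced with a short script. The following NumPy sketch is our own illustration of the four index rules and the full update on this least squares problem; the problem sizes, random seed, step sizes, and epoch count are assumptions, not the exact setup behind Figure 1.

    import numpy as np

    rng = np.random.default_rng(0)
    p, m = 300, 100                              # illustrative sizes (assumption)
    A = rng.standard_normal((p, m))
    b = rng.standard_normal(p)
    AtA, Atb = A.T @ A, A.T @ b                  # pre-computed, as in the text
    f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2

    def full_update(epochs=100):
        x = np.zeros(m)
        eta = 2 / np.linalg.norm(A, 2) ** 2      # step size 2 / ||A||_2^2
        for _ in range(epochs):
            x = x - eta * (AtA @ x - Atb)
        return x

    def coordinate_update(rule="cyclic", epochs=100):
        x = np.zeros(m)
        for _ in range(epochs):                  # one epoch = m coordinate updates
            order = rng.permutation(m) if rule == "cyclic permutation" else np.arange(m)
            for t in range(m):
                if rule == "greedy":             # Gauss-Southwell: needs all the scores
                    i = int(np.argmax(np.abs(AtA @ x - Atb)))
                elif rule == "random":
                    i = int(rng.integers(m))
                else:                            # cyclic or cyclic permutation
                    i = int(order[t])
                grad_i = AtA[i] @ x - Atb[i]     # i-th partial derivative, O(m)
                x[i] -= grad_i / AtA[i, i]       # coordinate step size 1/(A^T A)_{ii}
        return x

    for rule in ("cyclic", "cyclic permutation", "random", "greedy"):
        print(rule, f(coordinate_update(rule)))
    print("full update", f(full_update()))

The greedy rule above recomputes all scores at every step, which illustrates why it is impractical unless the scores can be maintained cheaply.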

1.1.2. Parallel Update. As one of their main advantages, coordinate update algorithms are easy to parallelize. In this subsection, we discuss both synchronous (sync) and asynchronous (async) parallel updates.

Sync-parallel (Jacobi) update specifies a sequence of index subsets I_1, I_2, . . . ⊆ [m], and at each iteration k, the coordinates in I_k are updated in parallel by multiple agents:

    x^{k+1}_i = x^k_i − η_k (x^k − T x^k)_i,   i ∈ I_k,
    x^{k+1}_i = x^k_i,   i ∉ I_k.

Synchronization across all agents ensures that all x_i, i ∈ I_k, are updated and also written to the memory before the next iteration starts.

Figure 1: Objective error f^k − f^* versus epochs for the full update of gradient descent and for the coordinate updates under the cyclic, cyclic permutation, random, and greedy index rules. The coordinate updates are faster than the full update since the former can take larger steps at each step.

Note that, if I_k = [m] for all k, then all the coordinates are updated and, thus, each iteration reduces to the full update: x^{k+1} = x^k − η_k(x^k − T x^k).

Async-parallel update. In this setting, a set of agents still perform parallel updates, but synchronization is eliminated or weakened. Hence, each agent continuously applies (5), which reads x from, and writes x_i back to, the shared memory (or through communicating with other agents without shared memory):

(5)    x^{k+1}_i = x^k_i − η_k ((I − T) x^{k−d_k})_i,   i = i_k,
       x^{k+1}_i = x^k_i,   for all i ≠ i_k.

Unlike before, k increases whenever any agent completes an update. The lack of synchronization often results in computation with out-of-date information. During the computation of the kth update, other agents make d_k updates to x in the shared memory; when the kth update is written, its input is already d_k iterations out of date. This number is referred to as the asynchronous delay. In (5), the agent reads x^{k−d_k} and commits the update to x^k_{i_k}. Here we have assumed consistent reading, i.e., x^{k−d_k} lies in the set {x^j}^k_{j=1}. This requires implementing a memory lock. Removing the lock can lead to inconsistent reading, which still has convergence guarantees; see [54, Section 1.2] for more details.
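
The sketch below (our own toy construction, not the paper's implementation) illustrates the bookkeeping of update (5) with Python threads on a fixed-point problem x = Tx, where Tx = Mx + c and M is scaled so that T is a contraction; the sizes, scaling, and step size are assumptions. Because of the interpreter lock it demonstrates only the mechanics of stale reads and coordinate writes, not speedup.

    import numpy as np
    import threading

    m = 50
    rng0 = np.random.default_rng(0)
    M = rng0.standard_normal((m, m)) / (3 * np.sqrt(m))   # ||M||_2 < 1, so T is a contraction
    c = rng0.standard_normal(m)
    x = np.zeros(m)                                       # shared iterate in "global memory"
    x_star = np.linalg.solve(np.eye(m) - M, c)            # fixed point of Tx = Mx + c

    def agent(seed, num_updates, eta=0.8):
        rng = np.random.default_rng(seed)
        for _ in range(num_updates):
            i = int(rng.integers(m))
            x_hat = x.copy()                  # read of shared memory; may be out of date
            Tx_i = M[i] @ x_hat + c[i]        # i-th coordinate of T applied to the stale read
            x[i] -= eta * (x_hat[i] - Tx_i)   # coordinate update (5), written back

    threads = [threading.Thread(target=agent, args=(s, 4000)) for s in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("distance to the fixed point:", np.linalg.norm(x - x_star))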

Synchronization across all agents means that all agents will wait for the last (slowest) agent to complete. Async-parallel updates eliminate such idle time, spread out memory access and communication, and thus often run much faster. However, async-parallel methods are more difficult to analyze because of the asynchronous delay.

Figure 2: Sync-parallel computing (left) versus async-parallel computing (right). On the left, all the agents must wait at idle (white boxes) until the slowest agent has finished.

Parallel-update literature. Async-parallel methods can be traced back to [17] for systems of linear equations. For function minimization, [12] introduced an async-parallel gradient projection method. Convergence rates are obtained in [69]. Recently, [14, 62] developed parallel randomized methods.

For fixed-point problems, async-parallel methods date back to [3] in 1978. In the pre-2010 methods [2, 10, 6, 27] and the review [29], each agent updates its own subset of coordinates. Convergence is established under the P-contraction condition and its variants [10]. Papers [6, 7] show convergence for async-parallel iterations with simultaneous reading and writing to the same set of components. Unbounded but stochastic delays are considered in [67].

Recently, random coordinate selection appeared in [19] for fixed-point problems. The works [47, 59, 43, 42, 37] introduced async-parallel stochastic methods for function minimization. For fixed-point problems, [54] introduced async-parallel stochastic methods, as well as several applications.

1.2. Contributions of This Paper

This paper systematically discusses the CF properties found in both single and composite operators underlying many interesting applications. We introduce approaches to recognize CF operators and develop coordinate-update algorithms based on them. We provide a variety of applications to illustrate our approaches. In particular, we obtain new coordinate-update algorithms for image deblurring, portfolio optimization, second-order cone programming, as well as matrix decomposition. Our analysis also provides guidance for the implementation of coordinate-update algorithms by specifying how to compute certain operators and maintain certain quantities in memory. We also provide numerical results to illustrate the efficiency of the proposed coordinate update algorithms.

This paper does not focus on the convergence aspects of coordinate update algorithms, though a convergence proof is provided in the appendix for a new primal-dual coordinate update algorithm. In general, in fixed-point algorithms, iterate convergence is ensured by the monotonic decrease of the distance between the iterates and the solution set, while in minimization problems, objective value convergence is ensured by the monotonic decrease of a certain energy function. The reader is referred to the existing literature for details.

The structural properties of operators discussed in this paper are unrelated to convergence-related properties such as nonexpansiveness (for an operator) or convexity (for a set or function). Hence, the developed algorithms can still be applied to nonconvex problems.

2. Coordinate Friendly Operators

2.1. Notation

For convenience, we do not distinguish a coordinate from a block of coordinates throughout this paper. We assume our variable x consists of m coordinates:

    x = (x_1, . . . , x_m) ∈ H := H_1 × · · · × H_m   and   x_i ∈ H_i,  i = 1, . . . , m.

For simplicity, we assume that H_1, . . . , H_m are finite-dimensional real Hilbert spaces, though most results hold for general Hilbert spaces. A function maps from H to R, the set of real numbers, and an operator maps from H to G, where the definition of G depends on the context.

Our discussion often involves two points x, x^+ ∈ H that differ over one coordinate: there exist an index i ∈ [m] and a point δ ∈ H supported on H_i such that

(6)    x^+ = x + δ.

Note that x^+_j = x_j for all j ≠ i. Hence, x^+ = (x_1, . . . , x_i + δ_i, . . . , x_m).

Definition 1 (number of operations). We let M[a ↦ b] denote the number of basic operations that it takes to compute the quantity b from the input a.


For example, M[x ↦ (T x)_i] denotes the number of operations to compute the ith component of T x given x. We explore the possibility of computing (T x)_i with far fewer operations than what is needed to first compute T x and then take its ith component.

2.2. Single Coordinate Friendly Operators

This subsection studies a few classes of CF operators and then formally defines the CF operator. We motivate the first class through an example.

In the example below, we let A_{i,:} and A_{:,j} be the ith row and jth column of a matrix A, respectively. Let A^T be the transpose of A and A^T_{i,:} be (A^T)_{i,:}, i.e., the ith row of the transpose of A.

Example 1 (least squares I). Consider the least squares problem

(7)    minimize_x  f(x) := (1/2) ‖Ax − b‖²,

where A ∈ R^{p×m} and b ∈ R^p. In this example, assume that m = Θ(p), namely, m and p are of the same order. We compare the full update of gradient descent to its coordinate update. (Although gradient descent is seldom used to solve least squares, it often appears as a part of first-order algorithms for problems involving a least squares term.) The full update is referred to as the iteration x^{k+1} = T x^k, where T is given by

(8)    T x := x − η∇f(x) = x − ηA^T Ax + ηA^T b.

Assuming that A^T A and A^T b are already computed, we have M[x ↦ T x] = O(m²). The coordinate update at the kth iteration performs

    x^{k+1}_{i_k} = (T x^k)_{i_k} = x^k_{i_k} − η∇_{i_k} f(x^k),

and x^{k+1}_j = x^k_j, ∀ j ≠ i_k, where i_k is some selected coordinate.

Since, for all i, ∇_i f(x^k) = (A^T(Ax^k − b))_i = (A^T A)_{i,:} · x^k − (A^T b)_i, we have M[x ↦ (T x)_i] = O(m) and thus M[x ↦ (T x)_i] = O((1/m) M[x ↦ T x]). Therefore, the coordinate gradient descent is computationally worthy.

The operator T in the above example is a special Type-I CF operator.

Definition 2 (Type-I CF). For an operator T : H → H, let M[x ↦ (T x)_i] be the number of operations for computing the ith coordinate of T x given x, and M[x ↦ T x] the number of operations for computing T x given x. We say T is Type-I CF (denoted as T ∈ F_1) if, for any x ∈ H and i ∈ [m], it holds that

    M[x ↦ (T x)_i] = O((1/m) M[x ↦ T x]).

Example 2 (least squares II). We can implement the coordinate update in Example 1 in a different manner by maintaining the result T x^k in the memory. This approach works when m = Θ(p) or p ≫ m. The full update (8) is unchanged. At each coordinate update, from the maintained quantity T x^k, we immediately obtain x^{k+1}_{i_k} = (T x^k)_{i_k}. But we need to update T x^k to T x^{k+1}. Since x^{k+1} and x^k differ only over the coordinate i_k, this update can be computed as

    T x^{k+1} = T x^k + x^{k+1} − x^k − η(x^{k+1}_{i_k} − x^k_{i_k})(A^T A)_{:,i_k},

which is a scalar-vector multiplication followed by vector additions, taking only O(m) operations. Computing T x^{k+1} from scratch involves a matrix-vector multiplication, taking O(M[x ↦ T x]) = O(m²) operations. Therefore,

    M[{x^k, T x^k, x^{k+1}} ↦ T x^{k+1}] = O((1/m) M[x^{k+1} ↦ T x^{k+1}]).

The operator T in the above example is a special Type-II CF operator.

Definition 3 (Type-II CF). An operator T is called Type-II CF (denoted as T ∈ F_2) if, for any i, x and x^+ := (x_1, . . . , (T x)_i, . . . , x_m), the following holds:

(9)    M[{x, T x, x^+} ↦ T x^+] = O((1/m) M[x^+ ↦ T x^+]).

The next example illustrates an efficient coordinate update that maintains a quantity other than T x.

Example 3 (least squares III). For the case p ≪ m, we should avoid pre-computing the relatively large matrix A^T A, and it is cheaper to compute A^T(Ax) than (A^T A)x. Therefore, we change the implementations of both the full and coordinate updates in Example 1. In particular, the full update

    x^{k+1} = T x^k = x^k − η∇f(x^k) = x^k − ηA^T(Ax^k − b)

pre-multiplies x^k by A and then by A^T. Hence, M[x^k ↦ T x^k] = O(mp).

We change the coordinate update to maintain the intermediate quantity Ax^k. In the first step, the coordinate update computes

    (T x^k)_{i_k} = x^k_{i_k} − η(A^T(Ax^k) − A^T b)_{i_k}

by pre-multiplying Ax^k by A^T_{i_k,:}. Then, the second step updates Ax^k to Ax^{k+1} by adding (x^{k+1}_{i_k} − x^k_{i_k}) A_{:,i_k} to Ax^k. Both steps take O(p) operations, so

    M[{x^k, Ax^k} ↦ {x^{k+1}, Ax^{k+1}}] = O(p) = O((1/m) M[x^k ↦ T x^k]).
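
The following NumPy sketch is our own illustration of the caching idea in Example 3 (the sizes and the exact-minimization step size 1/(A^T A)_{ii} are assumptions): keeping Ax in memory makes each coordinate gradient and each cache refresh cost O(p).

    import numpy as np

    rng = np.random.default_rng(1)
    p, m = 20, 2000                          # p << m, as in Example 3 (sizes are assumptions)
    A = rng.standard_normal((p, m))
    b = rng.standard_normal(p)
    x = np.zeros(m)
    Ax = A @ x                               # maintained quantity: Ax

    for k in range(20 * m):
        i = int(rng.integers(m))
        col = A[:, i]
        grad_i = col @ (Ax - b)              # (A^T(Ax - b))_i in O(p) using the cache
        new_xi = x[i] - grad_i / (col @ col) # step size 1/(A^T A)_{ii}
        Ax += (new_xi - x[i]) * col          # refresh the cache in O(p)
        x[i] = new_xi

    print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2)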

Combining Type-I and Type-II CF operators with the last example, we arrive at the following CF definition.

Definition 4 (CF operator). We say that an operator T : H → H is CF if, for any i, x and x^+ := (x_1, . . . , (T x)_i, . . . , x_m), the following holds:

(10)    M[{x, M(x)} ↦ {x^+, M(x^+)}] = O((1/m) M[x ↦ T x]),

where M(x) is some quantity maintained in the memory to facilitate each coordinate update and refreshed to M(x^+). M(x) can be empty, i.e., except for x, no other varying quantity is maintained.

The left-hand side of (10) measures the cost of performing one coordinate update (including the cost of updating M(x) to M(x^+)), while the right-hand side measures the average per-coordinate cost of updating all the coordinates together. When (10) holds, T is amenable to coordinate updates.

By definition, a Type-I CF operator T is CF without maintaining any quantity, i.e., M(x) = ∅.

A Type-II CF operator T satisfies (10) with M(x) = T x, so it is also CF. Indeed, given any x and i, we can compute x^+ by immediately letting x^+_i = (T x)_i (at O(1) cost) and keeping x^+_j = x_j, ∀ j ≠ i; then, by (9), we update T x to T x^+ at a low cost. Formally, letting M(x) = T x,

    M[{x, M(x)} ↦ {x^+, M(x^+)}] ≤ M[{x, T x} ↦ x^+] + M[{x, T x, x^+} ↦ T x^+]
                                 = O(1) + O((1/m) M[x^+ ↦ T x^+])    (by (9))
                                 = O((1/m) M[x ↦ T x]).


In general, the set of CF operators is much larger than the union of Type-I and Type-II CF operators.

Another important subclass of CF operators consists of operators T : H → H for which (T x)_i only depends on one, or a few, entries among x_1, . . . , x_m. Based on how many input coordinates they depend on, we partition them into three subclasses.

Definition 5 (separable operator). Consider T := {T | T : H → H}. We have the partition T = C_1 ∪ C_2 ∪ C_3, where

  • separable operator: T ∈ C_1 if, for any index i, there exists T_i : H_i → H_i such that (T x)_i = T_i x_i, that is, (T x)_i only depends on x_i.
  • nearly-separable operator: T ∈ C_2 if, for any index i, there exist T_i and an index set I_i such that (T x)_i = T_i({x_j}_{j∈I_i}) with |I_i| ≪ m, that is, each (T x)_i depends on a few coordinates of x.
  • non-separable operator: C_3 := T \ (C_1 ∪ C_2). If T ∈ C_3, there exists some i such that (T x)_i depends on many coordinates of x.

Throughout the paper, we assume the coordinate update of a (nearly-)separable operator costs roughly the same for all coordinates. Under this assumption, separable operators are both Type-I CF and Type-II CF, and nearly-separable operators are Type-I CF.³

2.3. Examples of CF Operators

In this subsection, we give examples of CF operators arising in different areas including linear algebra, optimization, and machine learning.

Example 4 ((block) diagonal matrix). Consider the diagonal matrix

    A = diag(a_{1,1}, . . . , a_{m,m}) ∈ R^{m×m}.

Clearly T : x ↦ Ax is separable.

³Not all nearly-separable operators are Type-II CF. Indeed, consider a sparse matrix A ∈ R^{m×m} whose non-zero entries are only located in the last column. Let T x = Ax and x^+ = x + δ_m. As x^+ and x differ over the last entry, T x^+ = T x + (x^+_m − x_m)A_{:,m} takes m operations. Therefore, we have M[{x, T x, x^+} ↦ T x^+] = O(m). Since T x^+ = x^+_m A_{:,m} takes m operations, we also have M[x^+ ↦ T x^+] = O(m). Therefore, (9) is violated, and there is no benefit from maintaining T x.


Example 5 (gradient and proximal maps of a separable function). Consider a separable function

    f(x) = Σ_{i=1}^m f_i(x_i).

Then, both ∇f and prox_{γf} are separable; in particular,

    (∇f(x))_i = ∇f_i(x_i)   and   (prox_{γf}(x))_i = prox_{γf_i}(x_i).

Here, prox_{γf}(x) (γ > 0) is the proximal operator that we define in Definition 10 in Appendix A.

Example 6 (projection to box constraints). Consider the "box" set B := {x : a_i ≤ x_i ≤ b_i, i ∈ [m]} ⊂ R^m. Then, the projection operator proj_B is separable. Indeed,

    (proj_B(x))_i = min(b_i, max(a_i, x_i)).

Example 7 (sparse matrices). If every row of the matrix A ∈ R^{m×m} is sparse, then T : x ↦ Ax is nearly-separable.

Examples of sparse matrices arise from various finite difference schemes for differential equations and from problems defined on sparse graphs. When most pairs of a set of random variables are conditionally independent, their inverse covariance matrix is sparse.

Example 8 (sum of sparsely supported functions). Let E be a class of index sets, with every e ∈ E a small subset of [m], |e| ≪ m. In addition, #{e : i ∈ e} ≪ #E for all i ∈ [m]. Let x_e := (x_i)_{i∈e} and

    f(x) = Σ_{e∈E} f_e(x_e).

The gradient map ∇f is nearly-separable.

An application of this example arises in wireless communication over a graph of m nodes. Let each x_i be the spectrum assignment to node i, each e be a neighborhood of nodes, and each f_e be a utility function. The input of f_e is x_e since the utility depends on the spectrum assignments in the neighborhood.

In machine learning, if each observation only involves a few features, then each function in the optimization objective will depend on a small number of components of x. This is the case when graphical models are used [64, 9].


Example 9 (squared hinge loss function). Consider, for a, x ∈ R^m and a scalar β,

    f(x) := (1/2) (max(0, 1 − β a^T x))²,

which is known as the squared hinge loss function. Consider the operator

(11)    T x := ∇f(x) = −β max(0, 1 − β a^T x) a.

Let us maintain M(x) = a^T x. For arbitrary x and i, let

    x^+_i := (T x)_i = −β max(0, 1 − β a^T x) a_i

and x^+_j := x_j, ∀ j ≠ i. Then, computing x^+_i from x and a^T x takes O(1) operations (as a^T x is maintained), and computing a^T x^+ from x^+_i − x_i and a^T x costs O(1). Formally, we have

    M[{x, a^T x} ↦ {x^+, a^T x^+}] ≤ M[{x, a^T x} ↦ x^+] + M[{a^T x, x^+_i − x_i} ↦ a^T x^+] = O(1) + O(1) = O(1).

On the other hand, M[x ↦ T x] = O(m). Therefore, (10) holds, and T defined in (11) is CF.
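
A small sketch of Example 9, written by us for illustration: the maintained scalar a^T x lets each coordinate of the gradient, and the cache refresh, be computed in O(1). Here we additionally use the coordinate inside a gradient descent step with a step size eta; the data, β, and eta are assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    m, beta = 1000, 1.0
    a = rng.standard_normal(m)
    x = rng.standard_normal(m)
    ax = a @ x                                   # maintained quantity M(x) = a^T x

    def coordinate_gradient_step(i, eta=0.05):
        """x_i <- x_i - eta * (grad f(x))_i, with the step and the cache refresh in O(1)."""
        global ax
        grad_i = -beta * max(0.0, 1.0 - beta * ax) * a[i]   # (Tx)_i from the cached a^T x
        new_xi = x[i] - eta * grad_i
        ax += (new_xi - x[i]) * a[i]                        # refresh a^T x in O(1)
        x[i] = new_xi

    for _ in range(5 * m):
        coordinate_gradient_step(int(rng.integers(m)))
    print("f(x) =", 0.5 * max(0.0, 1.0 - beta * (a @ x)) ** 2)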

3. Composite Coordinate Friendly Operators

Compositions of two or more operators arise in algorithms for problems that have composite functions, as well as in algorithms derived from operator splitting methods. To update the variable x^k to x^{k+1}, two or more operators are sequentially applied, and therefore the structures of all the operators determine whether the update is CF. This is where CF structures become less trivial but more interesting. This section studies composite CF operators. The exposition leads to the recovery of existing algorithms, as well as to powerful new algorithms.

3.1. Combinations of Operators

We start with an example that has numerous applications. It is a generalization of Example 9.


Example 10 (scalar map pre-composing affine function). Let a_j ∈ R^m, b_j ∈ R, and φ_j : R → R be differentiable functions, j ∈ [p]. Let

    f(x) = Σ_{j=1}^p φ_j(a_j^T x + b_j).

Assume that evaluating φ'_j costs O(1) for each j. Then, ∇f is CF. Indeed, let

    T_1 y := A^T y,   T_2 y := [φ'_1(y_1); . . . ; φ'_p(y_p)],   T_3 x := Ax + b,

where A = [a_1^T; a_2^T; . . . ; a_p^T] ∈ R^{p×m} and b = [b_1; b_2; . . . ; b_p] ∈ R^{p×1}. Then we have ∇f(x) = T_1 ∘ T_2 ∘ T_3 x. For any x and i ∈ [m], let x^+_i = ∇_i f(x) and x^+_j = x_j, ∀ j ≠ i, and let M(x) := T_3 x. We can first compute T_2 ∘ T_3 x from T_3 x in O(p) operations, then compute ∇_i f(x), and thus x^+, from {x, T_2 ∘ T_3 x} in O(p) operations, and finally update the maintained T_3 x to T_3 x^+ from {x, x^+, T_3 x} in another O(p) operations. Formally,

    M[{x, T_3 x} ↦ {x^+, T_3 x^+}]
        ≤ M[T_3 x ↦ T_2 ∘ T_3 x] + M[{x, T_2 ∘ T_3 x} ↦ x^+] + M[{x, T_3 x, x^+} ↦ T_3 x^+]
        = O(p) + O(p) + O(p) = O(p).

Since M[x ↦ ∇f(x)] = O(pm), ∇f = T_1 ∘ T_2 ∘ T_3 is CF.

If p = m, then T_1, T_2, T_3 all map from R^m to R^m, and it is easy to check that T_1 is Type-I CF, T_2 is separable, and T_3 is Type-II CF. The last property is crucial since not maintaining T_3 x would disqualify T from being CF. Indeed, to obtain (T x)_i, we must multiply the ith row of A^T with all the entries of T_2 ∘ T_3 x, which in turn needs all the entries of T_3 x; computing these from scratch would cost O(pm).
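
The sketch below is our own illustration of the composition T_1 ∘ T_2 ∘ T_3 in Example 10, with the assumption φ_j = log cosh (so φ'_j = tanh, an O(1) evaluation); the sizes and step size are also assumptions. Maintaining T_3 x = Ax + b keeps each coordinate gradient and each cache refresh at O(p).

    import numpy as np

    rng = np.random.default_rng(3)
    p, m = 200, 500                       # illustrative sizes (assumption)
    A = rng.standard_normal((p, m))
    b = rng.standard_normal(p)
    phi_prime = np.tanh                   # assumed phi_j' for every j
    x = np.zeros(m)
    y = A @ x + b                         # maintained quantity M(x) = T3 x = Ax + b

    def coordinate_update(i, eta=0.01):
        global y
        grad_i = A[:, i] @ phi_prime(y)   # (T1 T2 T3 x)_i = (A^T [phi_j'(y_j)]_j)_i in O(p)
        new_xi = x[i] - eta * grad_i
        y += (new_xi - x[i]) * A[:, i]    # refresh T3 x in O(p)
        x[i] = new_xi

    for _ in range(10 * m):
        coordinate_update(int(rng.integers(m)))
    print("f(x) =", np.log(np.cosh(y)).sum())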

There are general rules to preserve Type-I and Type-II CF. For example, T_1 ∘ T_2 is still Type-I CF, and T_2 ∘ T_3 is still CF, but there are counterexamples where T_2 ∘ T_3 is neither Type-I nor Type-II CF. Such properties are important for developing efficient coordinate update algorithms for complicated problems; we will formalize them in the following.

The operators T_2 and T_3 in the above example are prototypes of cheap and easy-to-maintain operators from H to G that arise in operator compositions.

Definition 6 (cheap operator). For a composite operator T = T_1 ∘ · · · ∘ T_p, an operator T_i : H → G is cheap if M[x ↦ T_i x] is less than or equal to the number of remaining coordinate-update operations, in order of magnitude.


    Case   T_1 ∈               T_2 ∈               (T_1 ∘ T_2) ∈
    1      C_1 (separable)     C_1, C_2, C_3       C_1, C_2, C_3, respectively
    2      C_2 (nearly-sep.)   C_1, C_3            C_2, C_3, resp.
    3      C_2                 C_2                 C_2 or C_3, case by case
    4      C_3 (non-sep.)      C_1 ∪ C_2 ∪ C_3     C_3

Table 1: T_1 ∘ T_2 inherits the weaker separability property from those of T_1 and T_2.

    Case   T_1 ∈         T_2 ∈      (T_1 ∘ T_2) ∈     Example
    5      C_1 ∪ C_2     F, F_1     F, F_1, resp.     Examples 11 and 13
    6      F, F_2        C_1        F, F_2, resp.     Example 10
    7      F_1           F_2        F                 Example 12
    8      cheap         F_2        F                 Example 13
    9      F_1           cheap      F_1               Examples 10 and 13

Table 2: Summary of how T_1 ∘ T_2 inherits CF properties from those of T_1 and T_2.

Definition 7 (easy-to-maintain operator). For a composite operator T = T_1 ∘ · · · ∘ T_p, the operator T_p : H → G is easy-to-maintain if, for any x, i, x^+ satisfying (6), M[{x, T_p x, x^+} ↦ T_p x^+] is less than or equal to the number of remaining coordinate-update operations, in order of magnitude, or belongs to O((1/dim G) M[x^+ ↦ T x^+]).

The splitting schemes in §3.2 below will be based on T_1 + T_2 or T_1 ∘ T_2, as well as on sequences of such combinations. If T_1 and T_2 are both CF, then T_1 + T_2 remains CF, but T_1 ∘ T_2 is not necessarily so. This subsection discusses how T_1 ∘ T_2 inherits the properties from T_1 and T_2. Our results are summarized in Tables 1 and 2 and explained in detail below.

The combination T_1 ∘ T_2 generally inherits the weaker property from T_1 and T_2.

The separability (C_1) property is preserved by composition: if T_1, . . . , T_n are separable, then T_1 ∘ · · · ∘ T_n is separable. However, combining nearly-separable (C_2) operators may not yield a nearly-separable operator, since composition introduces more dependence among the input entries. Therefore, a composition of nearly-separable operators can be either nearly-separable or non-separable.

Next, we discuss how T_1 ∘ T_2 inherits the CF properties from T_1 and T_2. For simplicity, we only use matrix-vector multiplication as examples to illustrate the ideas; more interesting examples will be given later.


• If T_1 is separable or nearly-separable (C_1 ∪ C_2), then as long as T_2 is CF (F), T_1 ∘ T_2 remains CF. In addition, if T_2 is Type-I CF (F_1), so is T_1 ∘ T_2.

Example 11. Let A ∈ R^{m×m} be sparse and B ∈ R^{m×m} dense. Then T_1 x = Ax is nearly-separable and T_2 x = Bx is Type-I CF. (For this example, one can of course pre-compute AB and claim that T_1 ∘ T_2 is Type-I CF. Our arguments keep A and B separate and only use the nearly-separability of T_1 and the Type-I CF property of T_2, so the result holds for any such composition even when T_1 and T_2 are nonlinear.) For any i, let I_i index the set of nonzeros on the ith row of A. We first compute (Bx)_{I_i}, which costs O(|I_i| m), and then a_{i,I_i}(Bx)_{I_i}, which costs O(|I_i|), where a_{i,I_i} is formed by the nonzero entries on the ith row of A. Assume O(|I_i|) = O(1), ∀ i. We have, from the above discussion, that M[x ↦ (T_1 ∘ T_2 x)_i] = O(m), while M[x ↦ T_1 ∘ T_2 x] = O(m²). Hence, T_1 ∘ T_2 is Type-I CF.

• Assume that T_2 is separable (C_1). It is easy to see that if T_1 is CF (F), then T_1 ∘ T_2 remains CF. In addition, if T_1 is Type-II CF (F_2), so is T_1 ∘ T_2; see Example 10. Note that, if T_2 is nearly-separable, we do not always have CF properties for T_1 ∘ T_2. This is because T_2 x and T_2 x^+ can be totally different (so updating T_2 x is expensive) even if x and x^+ only differ over one coordinate; see footnote 3.

• Assume that T_1 is Type-I CF (F_1). If T_2 is Type-II CF (F_2), then T_1 ∘ T_2 is CF (F).

Example 12. Let A, B ∈ R^{m×m} be dense. Then T_1 x = Ax is Type-I CF and T_2 x = Bx is Type-II CF (by maintaining Bx; see Example 2). For any x and i, let x^+ satisfy (6). Maintaining T_2 x, we can compute (T_1 ∘ T_2 x)_j in O(m) operations for any j and update to T_2 x^+ in O(m) operations. On the other hand, computing T_1 ∘ T_2 x^+ without maintaining T_2 x takes O(m²) operations.

• Assume that one of T_1 and T_2 is cheap. If T_2 is cheap, then as long as T_1 is Type-I CF (F_1), T_1 ∘ T_2 is Type-I CF. If T_1 is cheap, then as long as T_2 is Type-II CF (F_2), T_1 ∘ T_2 is CF (F); see Example 13.

3.2. Operator Splitting Schemes

We will apply our discussions above to operator splitting and obtain new algorithms. But first, we review several major operator splitting schemes and discuss their CF properties. We will encounter important concepts such as (maximal) monotonicity and cocoercivity, which are given in Appendix A. For a monotone operator A, the resolvent operator J_A and the reflective-resolvent operator R_A are also defined there, in (67) and (68), respectively.

Consider the following problem: given three operators A, B, C, possibly set-valued,

(12)    find x ∈ H such that 0 ∈ Ax + Bx + Cx,

where "+" is the Minkowski sum. This is a high-level abstraction of many problems or their optimality conditions. The study began in the 1960s, followed by a large number of algorithms and applications over the last fifty years. Next, we review a few basic methods for solving (12).

When A, B are maximally monotone (think of them as the subdifferential ∂f of a proper convex function f) and C is β-cocoercive (think of it as the gradient ∇f of a (1/β)-Lipschitz differentiable function f), a solution can be found by the iteration (2) with T = T_{3S}, introduced recently in [24], where

(13)    T_{3S} := I − J_{γB} + J_{γA} ∘ (2J_{γB} − I − γC ∘ J_{γB}).

Indeed, by setting γ ∈ (0, 2β), T_{3S} is (2β/(4β−γ))-averaged (think of this as a property weaker than being a Picard contraction; in particular, T may not have a fixed point). Following the standard convergence result (cf. the textbook [5]), provided that T has a fixed point, the sequence from (2) converges to a fixed point x* of T. Note that, instead of x*, J_{γB}(x*) is a solution to (12).

Following §3.1, T_{3S} is CF if J_{γA} is separable (C_1), J_{γB} is Type-II CF (F_2), and C is Type-I CF (F_1).

We give a few special cases of T_{3S} below, which have much longer histories. They all converge to a fixed point x* whenever a solution exists and γ is properly chosen. If B ≠ 0, then J_{γB}(x*), instead of x*, is a solution to (12).

Forward-Backward Splitting (FBS): Letting B = 0 yields J_{γB} = I. Then, T_{3S} reduces to FBS [52]:

(14)    T_{FBS} := J_{γA} ∘ (I − γC)

for solving the problem 0 ∈ Ax + Cx.

Backward-Forward Splitting (BFS): Letting A = 0 yields J_{γA} = I. Then, T_{3S} reduces to BFS:

(15)    T_{BFS} := (I − γC) ∘ J_{γB}


for solving the problem 0 ∈ Bx + Cx. When A = B, T_{FBS} and T_{BFS} apply the same pair of operators in opposite orders, and they solve the same problem. Iterations based on T_{BFS} are rarely used in the literature because they need an extra application of J_{γB} to return the solution, so T_{BFS} is seemingly an unnecessary variant of T_{FBS}. However, they become different for coordinate update; in particular, T_{BFS} is CF (but T_{FBS} is generally not) when J_{γB} is Type-II CF (F_2) and C is Type-I CF (F_1). Therefore, T_{BFS} is worth discussing on its own.

Douglas-Rachford Splitting (DRS): Letting C = 0, T_{3S} reduces to

(16)    T_{DRS} := I − J_{γB} + J_{γA} ∘ (2J_{γB} − I) = (1/2)(I + R_{γA} ∘ R_{γB}),

introduced in [26] for solving the problem 0 ∈ Ax + Bx. A more general splitting is the Relaxed Peaceman-Rachford Splitting (RPRS) with λ ∈ [0, 1]:

(17)    T_{RPRS} = (1 − λ) I + λ R_{γA} ∘ R_{γB},

which recovers T_{DRS} by setting λ = 1/2 and Peaceman-Rachford Splitting (PRS) [53] by letting λ = 1.

Forward-Douglas-Rachford Splitting (FDRS): Let V be a linear subspace, and let N_V and P_V be its normal cone and projection operator, respectively. The FDRS [15],

    T_{FDRS} = I − P_V + J_{γA} ∘ (2P_V − I − γ P_V ∘ C̃ ∘ P_V),

aims at finding a point x such that 0 ∈ Ax + C̃x + N_V x. If an optimal x exists, we have x ∈ V and N_V x = V^⊥, the orthogonal complement of V. Therefore, the problem is equivalent to finding x such that 0 ∈ Ax + P_V ∘ C̃ ∘ P_V x + N_V x. Thus, T_{3S} recovers T_{FDRS} by letting B = N_V and C = P_V ∘ C̃ ∘ P_V.

Forward-Backward-Forward Splitting (FBFS): Composing T_{FBS} with one more forward step gives T_{FBFS}, introduced in [71]:

(18)    T_{FBFS} = −γC + (I − γC) ∘ J_{γA} ∘ (I − γC).

T_{FBFS} is not a special case of T_{3S}. At the expense of one more application of (I − γC), T_{FBFS} relaxes the convergence condition of T_{FBS} from the cocoercivity of C to its monotonicity. (For example, a nonzero skew-symmetric matrix is monotone but not cocoercive.) From Table 2, we know that T_{FBFS} is CF if both C and J_{γA} are separable.


3.2.1. Examples in Optimization. Consider the optimization problem

(19)    minimize_{x∈X}  f(x) + g(x),

where X is the feasible set and f and g are objective functions. We present examples of the operator splitting methods discussed above.

Example 13 (proximal gradient method). Let X = R^m, f be differentiable, and g be proximable in (19). Setting A = ∂g and C = ∇f in (14) gives J_{γA} = prox_{γg} and reduces x^{k+1} = T_{FBS}(x^k) to the prox-gradient iteration:

(20)    x^{k+1} = prox_{γg}(x^k − γ∇f(x^k)).

A special case of (20) with g = ι_X is the projected gradient iteration:

(21)    x^{k+1} = P_X(x^k − γ∇f(x^k)).

If ∇f is CF and prox_{γg} is (nearly-)separable (e.g., g(x) = ‖x‖_1 or the indicator function of a box constraint), or if ∇f is Type-II CF and prox_{γg} is cheap (e.g., ∇f(x) = Ax − b and g = ‖x‖_2), then the FBS iteration (20) is CF. In the latter case, we can also apply the BFS iteration (15) (i.e., compute prox_{γg} and then perform the gradient update), which is also CF.
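
As a concrete sketch of a CF instance of (20), the script below (our own toy example, not from the paper) applies coordinate prox-gradient updates to f(x) = (1/2)‖Ax − b‖² with g = λ‖x‖_1: ∇f is made CF by maintaining Ax, and prox_{γg} is separable soft-thresholding. The sizes, λ, and the per-coordinate step size are assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    p, m, lam = 100, 400, 0.1            # sizes and lambda are assumptions
    A = rng.standard_normal((p, m))
    b = rng.standard_normal(p)
    x = np.zeros(m)
    Ax = A @ x                           # maintained quantity for the CF gradient of f

    def soft_threshold(v, t):            # prox of t*|.| (the separable prox of gamma*g)
        return np.sign(v) * max(abs(v) - t, 0.0)

    for _ in range(30 * m):
        i = int(rng.integers(m))
        col = A[:, i]
        gamma = 1.0 / (col @ col)        # per-coordinate step size (assumption)
        grad_i = col @ (Ax - b)          # gradient coordinate of f in O(p) using the cache
        new_xi = soft_threshold(x[i] - gamma * grad_i, gamma * lam)
        Ax += (new_xi - x[i]) * col      # refresh the cache in O(p)
        x[i] = new_xi

    print("objective:", 0.5 * np.linalg.norm(Ax - b) ** 2 + lam * np.abs(x).sum())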

Example 14 (ADMM). Setting X = R^m simplifies (19) to

(22)    minimize_{x,y}  f(x) + g(y),   subject to x − y = 0.

The ADMM method iterates:

(23a)    x^{k+1} = prox_{γf}(y^k − γs^k),
(23b)    y^{k+1} = prox_{γg}(x^{k+1} + γs^k),
(23c)    s^{k+1} = s^k + (1/γ)(x^{k+1} − y^{k+1}).

(The iteration can be generalized to handle the constraint Ax − By = b.) The dual problem of (22) is min_s f*(−s) + g*(s), where f* is the convex conjugate of f. Letting A = −∂f*(−·) and B = ∂g* in (16) recovers the iteration (23) through (see the derivation in Appendix B)

    t^{k+1} = T_{DRS}(t^k) = t^k − J_{γB}(t^k) + J_{γA} ∘ (2J_{γB} − I)(t^k).

From the results in §3.1, a sufficient condition for the above iteration to be CF is that J_{γA} is (nearly-)separable and J_{γB} is CF.
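
A minimal sketch of the full iteration (23), with our own choice of functions (not the paper's): f(x) = (1/2)‖x − c‖² and g(y) = λ‖y‖_1, both of which have closed-form proximal maps. The data, λ, γ, and the iteration count are assumptions.

    import numpy as np

    rng = np.random.default_rng(5)
    m, lam, gamma = 50, 0.1, 1.0
    c = rng.standard_normal(m)
    x, y, s = np.zeros(m), np.zeros(m), np.zeros(m)

    def prox_f(v, g):                    # prox_{g f} with f(x) = 0.5*||x - c||^2
        return (v + g * c) / (1 + g)

    def prox_g(v, g):                    # prox of g*lam*||.||_1: soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - g * lam, 0.0)

    for k in range(300):
        x = prox_f(y - gamma * s, gamma)            # (23a)
        y = prox_g(x + gamma * s, gamma)            # (23b)
        s = s + (x - y) / gamma                     # (23c)

    print("consensus gap ||x - y||:", np.linalg.norm(x - y))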


The above abstract operators and their CF properties will be applied in §5 to give interesting algorithms for several applications.

4. Primal-dual Coordinate Friendly Operators

We study how to solve the problem

(24)    minimize_{x∈H}  f(x) + g(x) + h(Ax)

with primal-dual splitting algorithms, as well as their coordinate update versions. Here, f is differentiable and A is a "p-by-m" linear operator from H = H_1 × · · · × H_m to G = G_1 × · · · × G_p. Problem (24) abstracts many applications in image processing and machine learning.

Example 15 (image deblurring/denoising). Let u^0 be an image, where u^0_i ∈ [0, 255], and let B be the blurring linear operator. Let ‖∇u‖_1 be the anisotropic total variation of u (see (49) for its definition; the generalization to the isotropic case is straightforward by grouping variables properly). Suppose that b is a noisy observation of Bu^0. Then, we can try to recover u^0 by solving

(25)    minimize_u  (1/2)‖Bu − b‖² + ι_{[0,255]}(u) + λ‖∇u‖_1,

which can be written in the form of (24) with f = (1/2)‖B · − b‖², g = ι_{[0,255]}, A = ∇, and h = λ‖·‖_1.

More examples with the formulation (24) will be given in §4.2. In general, primal-dual methods are capable of solving complicated problems involving constraints and compositions of proximable and linear maps like ‖∇u‖_1.

In many applications, although h is proximable, h ∘ A is generally non-proximable and non-differentiable. To avoid slow subgradient methods, we can apply primal-dual splitting approaches that separate h and A so that prox_h can be applied. For convex cases, (24) is equivalent to finding x such that

(26)    0 ∈ (∇f + ∂g + A^T ∘ ∂h ∘ A)(x).

Introducing the dual variable s ∈ G and applying the biconjugation property s ∈ ∂h(Ax) ⇔ Ax ∈ ∂h*(s) yields the equivalent condition

(27)    0 ∈ ( [∇f  0;  0  0] + [∂g  0;  0  ∂h*] + [0  A^T;  −A  0] ) [x; s],

where the first block matrix of operators plays the role of the operator A in (12) and the sum of the last two plays the role of the operator B. We shorten (27) as 0 ∈ Az + Bz, with z := (x, s) ∈ H × G =: F. Problem (27) can be solved by the Condat-Vu algorithm [20, 76]:

(28)    s^{k+1} = prox_{γh*}(s^k + γAx^k),
        x^{k+1} = prox_{ηg}(x^k − η(∇f(x^k) + A^T(2s^{k+1} − s^k))),

which explicitly applies A and A^T and updates s, x in a Gauss-Seidel style. (By the Moreau identity prox_{γh*} = I − γ prox_{(1/γ)h}(·/γ), one can compute prox_{(1/γ)h} instead of prox_{γh*}, which inherits the same separability properties from prox_{(1/γ)h}.) We introduce an operator T_CV : F → F and write

    iteration (28)  ⟺  z^{k+1} = T_CV(z^k).

Switching the order of x and s yields the following algorithm:

(29)    x^{k+1} = prox_{ηg}(x^k − η(∇f(x^k) + A^T s^k)),
        s^{k+1} = prox_{γh*}(s^k + γA(2x^{k+1} − x^k)),

written as z^{k+1} = T'_CV z^k.

It is known from [18, 23] that both (28) and (29) reduce to iterations of nonexpansive operators (under a special metric), i.e., T_CV is nonexpansive; see Appendix C for the reasoning.
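
The sketch below is our own toy instance of the full Condat-Vu update (28): f(x) = (1/2)‖x − c‖², g the indicator of [0, 1]^m, and h = λ‖·‖_1, so that prox_{γh*} is the projection onto [−λ, λ]^p. The data and the (conservative) step sizes are assumptions, chosen only to satisfy a sufficient step-size condition.

    import numpy as np

    rng = np.random.default_rng(6)
    p, m, lam = 60, 40, 0.5
    A = rng.standard_normal((p, m))
    c = rng.standard_normal(m)
    x, s = np.zeros(m), np.zeros(p)
    L = np.linalg.norm(A, 2)
    gamma, eta = 0.9 / L, 0.9 / (1.0 + L)     # conservative choices (assumption)

    for k in range(500):
        s_new = np.clip(s + gamma * (A @ x), -lam, lam)                  # prox_{gamma h*}
        grad_f = x - c
        x = np.clip(x - eta * (grad_f + A.T @ (2 * s_new - s)), 0, 1)    # prox_{eta g}
        s = s_new

    print("x range:", x.min(), x.max())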

Remark 1. Similar primal-dual algorithms can be used to solve other problems such as saddle point problems [40, 46, 16] and variational inequalities [68]. Our coordinate update algorithms below apply to these problems as well.

4.1. Primal-dual Coordinate Update Algorithms

In this subsection, we make the following assumption.

Assumption 1. Functions g and h* in the problem (24) are separable and proximable. Specifically,

    g(x) = Σ_{i=1}^m g_i(x_i)   and   h*(y) = Σ_{j=1}^p h*_j(y_j).

Furthermore, ∇f is CF.

Proposition 1. Under Assumption 1, the following hold:

(a) when p = O(m), the Condat-Vu operator T_CV in (28) is CF; more specifically,

    M[{z^k, Ax} ↦ {z^+, Ax^+}] = O((1/(m+p)) M[z^k ↦ T_CV z^k]);

(b) when m ≪ p and M[x ↦ ∇f(x)] = O(m), the Condat-Vu operator T'_CV in (29) is CF; more specifically,

    M[{z^k, A^T s} ↦ {z^+, A^T s^+}] = O((1/(m+p)) M[z^k ↦ T'_CV z^k]).

Proof. Computing z^{k+1} = T_CV z^k involves evaluating ∇f, prox_g, and prox_{h*}, applying A and A^T, and adding vectors. It is easy to see that M[z^k ↦ T_CV z^k] = O(mp + m + p) + M[x ↦ ∇f(x)], and M[z^k ↦ T'_CV z^k] is the same.

(a) We assume ∇f ∈ F_1 for simplicity; the other cases are similar.

1. If (T_CV z^k)_j = s^{k+1}_i, computing it involves adding s^k_i and γ(Ax^k)_i and evaluating prox_{γh*_i}. In this case, M[{z^k, Ax} ↦ {z^+, Ax^+}] = O(1).

2. If (T_CV z^k)_j = x^{k+1}_i, computing it involves evaluating the entire s^{k+1} for O(p) operations, (A^T(2s^{k+1} − s^k))_i for O(p) operations, prox_{ηg_i} for O(1) operations, and ∇_i f(x^k) for O((1/m) M[x ↦ ∇f(x)]) operations, as well as updating Ax^+ for O(p) operations. In this case, M[{z^k, Ax} ↦ {z^+, Ax^+}] = O(p + (1/m) M[x ↦ ∇f(x)]).

Therefore, M[{z^k, Ax} ↦ {z^+, Ax^+}] = O((1/(m+p)) M[z^k ↦ T_CV z^k]).

(b) When m ≪ p and M[x ↦ ∇f(x)] = O(m), following arguments similar to the above, we have M[{z^k, A^T s} ↦ {z^+, A^T s^+}] = O(1) + M[x ↦ ∇_i f(x)] if (T'_CV z^k)_j = x^{k+1}_i, and M[{z^k, A^T s} ↦ {z^+, A^T s^+}] = O(m) + M[x ↦ ∇f(x)] if (T'_CV z^k)_j = s^{k+1}_i. In both cases, M[{z^k, A^T s} ↦ {z^+, A^T s^+}] = O((1/(m+p)) M[z^k ↦ T'_CV z^k]).


4.2. Extended Monotropic Programming

We develop a primal-dual coordinate update algorithm for the extended monotropic program:

(30)    minimize_{x∈H}  g_1(x_1) + g_2(x_2) + · · · + g_m(x_m) + f(x),
        subject to  A_1 x_1 + A_2 x_2 + · · · + A_m x_m = b,

where x = (x_1, . . . , x_m) ∈ H = H_1 × · · · × H_m with the H_i being Euclidean spaces. It generalizes linear, quadratic, second-order cone, and semidefinite programs by allowing extended-valued objective functions g_i and f. It is a special case of (24) obtained by letting g(x) = Σ_{i=1}^m g_i(x_i), A = [A_1, · · · , A_m], and h = ι_{{b}}.

Example 16 (quadratic programming). Consider the quadratic program

(31)    minimize_{x∈R^m}  (1/2) x^T U x + c^T x,   subject to Ax = b,  x ∈ X,

where U is a symmetric positive semidefinite matrix and X = {x : x_i ≥ 0, ∀i}. Then, (31) is a special case of (30) with g_i(x_i) = ι_{·≥0}(x_i), f(x) = (1/2) x^T U x + c^T x, and h = ι_{{b}}.

Example 17 (Second Order Cone Programming (SOCP)). The SOCP

    minimize_{x∈R^m}  c^T x,   subject to Ax = b,  x ∈ X = Q_1 × · · · × Q_n

(where the number of cones n may not be equal to the number of blocks m) can be written in the form of (30): minimize_{x∈R^m} ι_X(x) + c^T x + ι_{{b}}(Ax).

Applying iteration (28) to problem (30) and eliminating s^{k+1} from the second row yields the Jacobi-style update (denoted as T_emp):

(32)    s^{k+1} = s^k + γ(Ax^k − b),
        x^{k+1} = prox_{ηg}(x^k − η(∇f(x^k) + A^T s^k + 2γA^T Ax^k − 2γA^T b)).

To the best of our knowledge, this update has not appeared in the literature. Note that x^{k+1} no longer depends on s^{k+1}, making it more convenient to perform coordinate updates.


Remark 2. In general, when the s update is affine, we can decouple s^{k+1} and x^{k+1} by plugging the s update into the x update. This is the case when h is affine or quadratic in problem (24).

A sufficient condition for T_emp to be CF is prox_g ∈ C_1, i.e., separable. Indeed, we have T_emp = T_1 ∘ T_2, where

    T_1 = [I  0;  0  prox_{ηg}],    T_2 [s; x] = [s + γ(Ax − b);  x − η(∇f(x) + A^T s + 2γA^T Ax − 2γA^T b)].

Following Case 5 of Table 2, T_emp is CF. When m = Θ(p), the separability condition on prox_g can be relaxed to prox_g ∈ F_1, since in this case T_2 ∈ F_2 and we can apply Case 7 of Table 2 (by maintaining ∇f(x), A^T s, Ax, and A^T Ax).
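
As a concrete sketch of the decoupled update (32), the script below (our own toy instance, not from the paper) applies it in full-update form to the quadratic program of Example 16, with g the indicator of the nonnegative orthant. The data generation and the conservative step sizes are assumptions.

    import numpy as np

    rng = np.random.default_rng(7)
    m, n_eq = 30, 10                                  # illustrative sizes (assumption)
    B = rng.standard_normal((m, m))
    U = B.T @ B / m + 0.1 * np.eye(m)                 # symmetric positive definite
    c = rng.standard_normal(m)
    A = rng.standard_normal((n_eq, m))
    b = A @ np.abs(rng.standard_normal(m))            # so a feasible x >= 0 exists
    x, s = np.zeros(m), np.zeros(n_eq)

    normA, normU = np.linalg.norm(A, 2), np.linalg.norm(U, 2)
    gamma = 1.0 / (2 * normA)                         # step sizes chosen so that
    eta = 1.0 / (normU / 2 + gamma * normA ** 2 + 1)  # 1/eta - gamma*||A||^2 >= beta/2

    for k in range(20000):
        s_new = s + gamma * (A @ x - b)                                  # s-update of (32)
        grad = U @ x + c + A.T @ s + 2 * gamma * (A.T @ (A @ x) - A.T @ b)
        x = np.maximum(x - eta * grad, 0.0)                              # prox of iota_{x>=0}
        s = s_new                                                        # x did not need s_new

    print("equality residual ||Ax - b||:", np.linalg.norm(A @ x - b))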

4.3. Overlapping-Block Coordinate Updates

In the coordinate update scheme based on (28), if we select x_i to update, then we must first compute s^{k+1}, because the variables x_i and s_j are coupled through the matrix A. However, once x^{k+1}_i is obtained, s^{k+1} is discarded; it is not used to update s or cached for further use. This subsection introduces ways to utilize this otherwise wasted computation.

We define, for each i, J(i) ⊂ [p] as the set of indices j such that A^T_{i,j} ≠ 0, and, for each j, I(j) ⊂ [m] as the set of indices i such that A^T_{i,j} ≠ 0. We also let m_j := |I(j)| and assume m_j ≠ 0, ∀ j ∈ [p], without loss of generality.

We arrange the coordinates of z = [x; s] into m overlapping blocks. The ith block consists of the coordinate x_i and all s_j's for j ∈ J(i). This way, each s_j may appear in more than one block. We propose a block coordinate update scheme based on (28). Because the blocks overlap, each s_j may be updated in multiple blocks, so the s_j update is relaxed with parameters ρ_{i,j} ≥ 0 (see (33) below) that satisfy Σ_{i∈I(j)} ρ_{i,j} = 1, ∀ j ∈ [p]. The aggregated effect is to update s_j without scaling. (Following the KM iteration [39], we can also assign a relaxation parameter η_k to the x_i update; then, the s_j update should be relaxed with ρ_{i,j} η_k.)

We propose the following update scheme, in which s̃^{k+1}_j and x̃^{k+1}_i denote intermediate quantities:

(33)    select i ∈ [m], and then compute
            s̃^{k+1}_j = prox_{γh*_j}(s^k_j + γ(Ax^k)_j),   for all j ∈ J(i),
            x̃^{k+1}_i = prox_{ηg_i}(x^k_i − η(∇_i f(x^k) + Σ_{j∈J(i)} A^T_{i,j}(2s̃^{k+1}_j − s^k_j))),
        update x^{k+1}_i = x^k_i + (x̃^{k+1}_i − x^k_i),
        update s^{k+1}_j = s^k_j + ρ_{i,j}(s̃^{k+1}_j − s^k_j),   for all j ∈ J(i).


Remark 3. The use of relaxation parameters ρ_{i,j} makes our scheme different from that in [56].

Following the assumptions and arguments in §4.1, if we maintain Ax, the cost of each block coordinate update is O(p) + M[x ↦ ∇_i f(x)], which is O((1/m) M[z ↦ T_CV z]). Therefore, the coordinate update scheme (33) is computationally worthy.

Typical choices of ρ_{i,j} include: (1) one of the ρ_{i,j}'s is 1 for each j, and the others all equal 0; this can be viewed as assigning the update of s_j solely to a block containing x_i. (2) ρ_{i,j} = 1/m_j for all i ∈ I(j); this approach spreads the update of s_j over all the related blocks.
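
The sketch below is our own toy instance of the overlapping-block scheme (33), using the relaxation choice ρ_{i,j} = 1/m_j: f(x) = (1/2)‖x − c‖², g = 0 (so its prox is the identity), h = λ‖·‖_1, and a randomly sparsified A. The data, sparsity level, and conservative step sizes are assumptions, not tuned parameters.

    import numpy as np

    rng = np.random.default_rng(8)
    p, m, lam = 40, 60, 0.5                                        # toy sizes (assumption)
    A = rng.standard_normal((p, m)) * (rng.random((p, m)) < 0.1)   # a sparse A
    c = rng.standard_normal(m)
    J = [np.nonzero(A[:, i])[0] for i in range(m)]                 # J(i): rows j with A_{j,i} != 0
    m_j = np.maximum((A != 0).sum(axis=1), 1)                      # m_j = |I(j)| (at least 1)
    x, s = np.zeros(m), np.zeros(p)
    Ax = A @ x                                                     # maintained quantity
    L = np.linalg.norm(A, 2)
    gamma, eta = 0.5 / L, 0.5 / (1.0 + L)                          # conservative steps (assumption)

    for k in range(100 * m):
        i = int(rng.integers(m))
        Ji = J[i]
        if Ji.size == 0:
            continue
        s_tilde = np.clip(s[Ji] + gamma * Ax[Ji], -lam, lam)       # prox_{gamma h_j^*}, j in J(i)
        grad_i = (x[i] - c[i]) + A[Ji, i] @ (2 * s_tilde - s[Ji])
        new_xi = x[i] - eta * grad_i                               # g = 0, so the prox is the identity
        Ax += (new_xi - x[i]) * A[:, i]                            # refresh the cache
        x[i] = new_xi
        s[Ji] += (s_tilde - s[Ji]) / m_j[Ji]                       # relaxed dual updates, rho = 1/m_j

    print("objective:", 0.5 * np.linalg.norm(x - c) ** 2 + lam * np.abs(A @ x).sum())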

Remark 4. The recent paper [28] proposes a different primal-dual coordinate update algorithm. The authors produce a new matrix Ā based on A, with only one nonzero entry in each row, i.e., m_j = 1 for each j. They also modify h to h̄ so that the problem

(34)    minimize_{x∈H}  f(x) + g(x) + h̄(Āx)

has the same solution as (24). Then they solve (34) by the scheme (33). Because they have m_j = 1, every dual variable coordinate is only associated with one primal variable coordinate. They create non-overlapping blocks of z by duplicating each dual variable coordinate s_j multiple times. The computational cost of each block coordinate update of their algorithm is the same as that of (33), but more memory is needed for the duplicated copies of each s_j.

4.4. Async-Parallel Primal-Dual Coordinate Update Algorithms and Their Convergence

In this subsection, we propose two async-parallel primal-dual coordinate update algorithms using the algorithmic framework of [54] and state their convergence results. When there is only one agent, all algorithms proposed in this section reduce to stochastic coordinate update algorithms [19], and their convergence is a direct consequence of Theorem 1. Moreover, our convergence analysis also applies to sync-parallel algorithms.


The two algorithms are based on §4.1 and §4.3, respectively.

Algorithm 1: Async-parallel primal-dual coordinate update algorithm using T_CV

Input: z^0 ∈ F, K > 0, a discrete distribution (q_1, . . . , q_{m+p}) with Σ_{i=1}^{m+p} q_i = 1 and q_i > 0, ∀i;
set the global iteration counter k = 0;
while k < K, every agent asynchronously and continuously do
    select i_k ∈ [m+p] with Prob(i_k = i) = q_i;
    perform an update to z_{i_k} according to (35);
    update the global counter k ← k + 1;

Whenever an agent updates a coordinate, the global iteration number k increases by one. The kth update is applied to z_{i_k}, with the i_k being independent random variables; here z_i = x_i when i ≤ m and z_i = s_{i−m} when i > m. Each coordinate update has the form:

(35)    z^{k+1}_{i_k} = z^k_{i_k} − (η_k / ((m+p) q_{i_k})) (ẑ^k_{i_k} − (T_CV ẑ^k)_{i_k}),
        z^{k+1}_i = z^k_i,   ∀ i ≠ i_k,

where η_k is the step size, z^k denotes the state of z in global memory just before the update (35) is applied, and ẑ^k is the result of z in global memory being read by an agent to its local cache (see [54, §1.2] for both the consistent and inconsistent cases). While (ẑ^k_{i_k} − (T_CV ẑ^k)_{i_k}) is being computed, asynchronous parallel computing allows other agents to make updates to z, introducing so-called asynchronous delays. Therefore, ẑ^k can be different from z^k. We refer the reader to [54, §1.2] for more details.

The async-parallel algorithm using the overlapping-block coordinate up-

date (33) is in Algorithm 2 (recall that the overlapping-block coordinate

Page 29: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 29

update is introduced to save computation).

Algorithm 2: Async-parallel primal-dual overlapping-block coordi-nate update algorithm using TCV

Input : z0 ∈ F, K > 0, a discrete distribution (q1, . . . , qm) with∑mi=1 qi = 1 and qi > 0,∀i,

set global iteration counter k = 0;while k < K, every agent asynchronously and continuously do

select ik ∈ [m] with Prob(ik = i) = qi;

Compute sk+1j ,∀j ∈ J(ik) and xk+1

ikaccording to (33);

update xk+1ik

= xkik + ηkmqik

(xk+1ik− xkik);

let xk+1i = xki for i 6= ik;

update sk+1j = skj + ρi,jηk

mqik(sk+1j − skj ), for all j ∈ J(ik);

let sk+1j = skj , for all j /∈ J(ik);

update the global counter k ← k + 1;

Here we still allow asynchronous delays, so xik and sk+1j are computed

using some zk.

Remark 5. If shared memory is used, it is recommended to set all but one

ρi,j’s to 0 for each i.

Theorem 1. Let Z∗ be the set of solutions to problem (24) and (zk)k≥0 ⊂ Fbe the sequence generated by Algorithm 1 or Algorithm 2 under the following

conditions:

(i) f, g, h∗ are closed proper convex functions, f is differentiable, and ∇fis Lipschitz continuous with constant β;

(ii) the delay for every coordinate is bounded by a positive number τ , i.e.

for every 1 ≤ i ≤ m+ p, zki = zk−dkii for some 0 ≤ dki ≤ τ ;

(iii) ηk ∈ [ηmin, ηmax] for certain ηmin, ηmax > 0.

Then (zk)k≥0 converges to a Z∗-valued random variable with probability 1.

The formulas for ηmin and ηmax, as well as the proof of Theorem 1, are

given in Appendix D along with additional remarks. The algorithms can be

applied to solve problem (24). A variety of examples are provided in §5.1

and §5.2.

Page 30: Coordinate friendly structures, algorithms and applications

30

5. Applications

In this section, we provide examples to illustrate how to develop coordinate

update algorithms based on CF operators. The applications are categorized

into five different areas. The first subsection discusses three well-known ma-

chine learning problems: empirical risk minimization, Support Vector Ma-

chine (SVM), and group Lasso. The second subsection discusses image pro-

cessing problems including image deblurring, image denoising, and Com-

puted Tomography (CT) image recovery. The remaining subsections provide

applications in finance, distributed computing as well as certain stylized op-

timization models. Several applications are treated with coordinate update

algorithms for the first time.

For each problem, we describe the operator T and how to efficiently

calculate (T x)i. The final algorithm is obtained after plugging the update

in a coordinate update framework in §1.1 along with parameter initialization,

an index selection rule, as well as some termination criteria.

5.1. Machine Learning

5.1.1. Empirical Risk Minimization (ERM). We consider the follow-

ing regularized empirical risk minimization problem

(36) minimizex∈Rm

1

p

p∑j=1

φj(a>j x) + f(x) + g(x),

where aj ’s are sample vectors, φj ’s are loss functions, and f + g is a regu-

larization function. We assume that f is differentiable and g is proximable.

Examples of (36) include linear SVM, regularized logistic regression, ridge

regression, and Lasso. Further information on ERM can be found in [34]. The

need for coordinate update algorithms arises in many applications of (36)

where the number of samples or the dimension of x is large.

We define A = [a>1 ; a>2 ; . . . ; a>p ] and h(y) := 1p

∑pj=1 φj(yj). Hence,

h(Ax) = 1p

∑pj=1 φj(a

>j x), and problem (36) reduces to form (24). We can

apply the primal-dual update scheme to solve this problem, for which we in-

troduce the dual variable s = (s1, ..., sp)>. We use p+ 1 coordinates, where

the 0th coordinate is x ∈ Rm and the jth coordinate is sj , j ∈ [p]. The

Page 31: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 31

operator T is given in (29). At each iteration, a coordinate is updated:

(37)

if x is chosen (the index 0), then computexk+1 = proxηg(x

k − η(∇f(xk) +A>sk)),if sj is chosen (an index j ∈ [p]), then compute

xk+1 = proxηg(xk − η(∇f(xk) +A>sk)),

and

sk+1j = 1

pproxpγφ∗j (pskj + pγa>j (2xk+1 − xk)).

We maintain A>s ∈ Rm in the memory. Depending on the structure of

∇f , we can compute it each time or maintain it. When proxg ∈ F1, we

can consider breaking x into coordinates xi’s and also select an index i to

update xi at each time.

5.1.2. Support Vector Machine. Given the training data {(ai, βi)}mi=1

with βi ∈ {+1,−1}, ∀i, the kernel support vector machine [65] is

(38)minimize

x,ξ,y

12‖x‖

22 + C

∑mi=1 ξi,

subject to βi(x>φ(ai)− y) ≥ 1− ξi, ξi ≥ 0, ∀i ∈ [m],

where φ is a vector-to-vector map, mapping each data ai to a point in a

(possibly) higher-dimensional space. If φ(a) = a, then (38) reduces to the

linear support vector machine. The model (38) can be interpreted as finding

a hyperplane {w : x>w− y = 0} to separate two sets of points {φ(ai) : βi =

1} and {φ(ai) : βi = −1}.The dual problem of (38) is

(39) minimizes

1

2s>Qs− e>s, subject to 0 ≤ si ≤ C, ∀i,

∑i

βisi = 0,

where Qij = βiβjk(ai, aj), k(·, ·) is a so-called kernel function, and e =

(1, ..., 1)>. If φ(a) = a, then k(ai, aj) = a>i aj .

Unbiased case. If y = 0 is enforced in (38), then the solution hyperplane

{w : x>w = 0} passes through the origin and is called unbiased. Conse-

quently, the dual problem (39) will no longer have the linear constraint∑i βisi = 0, leaving it with the coordinate-wise separable box constraints

0 ≤ si ≤ C. To solve (39), we can apply the FBS operator T defined by (14).

Page 32: Coordinate friendly structures, algorithms and applications

32

Let d(s) := 12s>Qs − e>s, A = prox[0,C], and C = ∇d. The coordinate up-

date based on FBS is

sk+1i = proj[0,C](s

ki − γi∇id(sk)),

where we can take γi = 1Qii

.

Biased (general) case. In this case, the mode (38) has y ∈ R, so thehyperplane {w : x>w − y = 0} may not pass the origin and is called biased.Then, the dual problem (39) retains the linear constraint

∑i βisi = 0. In this

case, we apply the primal-dual splitting scheme (28) or the three-operatorsplitting scheme (13).

The coordinate update based on the full primal-dual splitting scheme (28)is:

tk+1 = tk + γ

m∑i=1

βiski ,(40a)

sk+1i = proj[0,C]

(ski − η

(∇id(sk) + βi(2t

k+1 − tk))),(40b)

where t, s are the primal and dual variables, respectively. Note that we canlet w :=

∑mi=1 βisi and maintain it. With variable w and substituting (40a)

into (40b), we can equivalently write (40) into

(41)

if t is chosen (the index 0), then compute

tk+1 = tk + γwk,if si is chosen (an index i ∈ [m]), then compute

sk+1i = proj[0,C]

(ski − η

(q>i s

k − 1 + βi(2γwk + tk)

))wk+1 = wk + βi(s

k+1i − ski ).

We can also apply the three-operator splitting (13) as follows. Let D1 :=[0, C]m and D2 := {s :

∑mi=1 βisi = 0}. Let A = projD2

, B = projD1, and

C(x) = Qx− e, The full update corresponding to T = (I − ηk)I + ηkT3S is

sk+1 =projD2(uk),(42a)

uk+1 =uk + ηk

(projD1

(2sk+1 − uk − γ(Qsk+1 − e)

)− sk+1

),(42b)

where s is just an intermediate variable. Let β := β‖β‖2 and w := β>u. Then

projD2(u) = (I − ββ>)u. Hence, sk+1 = uk − wkβ. Plugging it into (42b)

Page 33: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 33

yields the following coordinate update scheme:if i ∈ [m] is chosen, then compute

sk+1i = uki − wkβi,uk+1i = uki + ηk

(proj[0,C]

(2sk+1i − uki − γ

(q>i u

k − wk(q>i β)− 1))− sk+1

i

)wk+1 = wk + βi(u

k+1i − uki ),

where wk is the maintained variable and sk is the intermediate variable.

5.1.3. Group Lasso. The group Lasso regression problem [84] is

(43) minimizex∈Rn

f(x) +

m∑i=1

λi‖xi‖2,

where f is a differentiable convex function, often bearing the form 12‖Ax−

b‖22, and xi ∈ Rni is a subvector of x ∈ Rn supported on Ii ⊂ [n], and

∪iIi = [n]. If Ii ∩ Ij = ∅, ∀i 6= j, it is called non-overlapping group Lasso,

and if there are two different groups Ii and Ij with a non-empty intersection,

it is called overlapping group Lasso. The model finds a coefficient vector x

that minimizes the fitting (or loss) function f(x) and that is group sparse:

all but a few xi’s are zero.

Let Ui be formed by the columns of the identity matrix I corresponding

to the indices in Ii, and let U = [U>1 ; . . . ;U>m] ∈ R(Σini)×n. Then, xi = U>i x.

Let hi(yi) = λi‖yi‖2, yi ∈ Rni for i ∈ [m], and h(y) =∑m

i=1 hi(yi) for

y = [y1; . . . ; ym] ∈ RΣini . In this way, (43) becomes

(44) minimizex

f(x) + h(Ux).

Non-overlapping case [84]. In this case, we have Ii ∩ Ij = ∅, ∀i 6= j,

and can apply the FBS scheme (14) to (44). Specifically, let T1 = ∂(h ◦ U)

and T2 = ∇f . The FBS full update is

xk+1 = JγT1 ◦ (I − γT2)(xk).

The corresponding coordinate update is the following

(45)

{if i ∈ [m] is chosen, then compute

xk+1i = arg minxi

12‖xi − x

ki + γi∇if(xk)‖22 + γihi(xi),

Page 34: Coordinate friendly structures, algorithms and applications

34

where ∇if(xk) is the partial derivative of f with respect to xi and the stepsize can be taken to be γi = 1

‖A:,i‖2 . When ∇f is either cheap or easy-to-

maintain, the coordinate update in (45) is inexpensive.

Overlapping case [38]. This case allows Ii ∩ Ij 6= ∅ for some i 6= j,causing the evaluation of JγT1 to be generally difficult. However, we canapply the primal-dual update (28) to this problem as

sk+1 = proxγh∗(sk + γUxk),(46a)

xk+1 =xk − η(∇f(xk) + U>(2sk+1 − sk)),(46b)

where s is the dual variable. Note that

h∗(s) =

{0, if ‖si‖2 ≤ λi, ∀i,+∞, otherwise,

is cheap. Hence, the corresponding coordinate update of (46) is(47)

if si is chosen for some i ∈ [m], then compute

sk+1i = projBλi (s

ki + γxki )

if xi is chosen for some i ∈ [m], then compute

xk+1i = xki − η

(UTi ∇f(xk) + UTi

∑j,UTi Uj 6=0 Uj(2projBλj (s

kj + γxkj )− skj )

),

where Bλ is the Euclidean ball of radius λ. When ∇f is easy-to-maintain,the coordinate update in (47) is inexpensive. To the best of our knowledge,the coordinate update method (47) is new.

5.2. Imaging

5.2.1. DRS for Image Processing in the Primal-dual Form [50].Many convex image processing problems have the general form

minimizex

f(x) + g(Ax),

where A is a matrix such as a dictionary, sampling operator, or finite differ-ence operator. We can reduce the problem to the system: 0 ∈ A(z) + B(z),where z = [x; s],

A(z) :=

[∂f(x)∂g∗(s)

], and B(z) :=

[0 A>

−A 0

] [xs

].

Page 35: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 35

(see Appendix C for the reduction.) The work [50] gives their resolvents

JγA =

[proxγf 0

0 proxγg∗

],

JγB = (I + γB)−1 =

[0 00 I

]+

[IγA

](I + γ2A>A)−1

[I−γA

]>,

where JγA is often cheap or separable and we can explicitly form JγB as

a matrix or implement it based on a fast transform. With the defined JγAand JγB, we can apply the RPRS method as zk+1 = TRPRSz

k. The result-

ing RPRS operator is CF when JγB is CF. Hence, we can derive a new

RPRS coordinate update algorithm. We leave the derivation to the read-

ers. Derivations of coordinate update algorithms for more specific image

processing problems are shown in the following subsections.

5.2.2. Total Variation Image Processing. We consider the following

Total Variation (TV) image processing model

(48) minimizex

λ‖x‖TV +1

2‖Ax− b‖2,

where x ∈ Rn is the vector representation of the unknown image, A is an

m × n matrix describing the transformation from the image to the mea-

surements b. Common A includes sampling matrices in MRI, CT, denois-

ing, deblurring, etc. Let (∇hi ,∇vi ) be the discrete gradient at pixel i and

∇x = (∇h1x,∇v1x, . . . ,∇hnx,∇vnx)>. Then the TV semi-norm ‖ · ‖TV in the

isotropic and anisotropic fashions are, respectively,

‖x‖TV =∑i

√(∇hi x)2 + (∇vi x)2,(49a)

‖x‖TV = ‖∇x‖1 =∑i

(|∇hi x|+ |∇vi x|

).(49b)

For simplicity, we use the anisotropic TV for analysis and in the numer-

ical experiment in § 6.2. It is slightly more complicated for the isotropic TV.

Introducing the following notation

B: =

(∇A

), h(p, q): = λ‖p‖1 +

1

2‖q − b‖2,

Page 36: Coordinate friendly structures, algorithms and applications

36

we can reformulate (48) as

minimizex

h(B x) = h(∇x,Ax),

which reduces to the form of (24) with f = g = 0. Based on its definition,

the convex conjugate of h(p, q) and its proximal operator are, respectively,

h∗(s, t) = ι‖·‖∞≤λ(s) +1

2‖t+ b‖2 − 1

2‖b‖2,(50)

proxγh∗(s, t) = proj‖·‖∞≤λ(s) +1

1 + γ(t− γb).(51)

Let s, t be the dual variables corresponding to ∇x and Ax respectively, then

using (51) and applying (29) give the following full update:

xk+1 = xk − η(∇>sk +A>tk),(52a)

sk+1 = proj‖·‖∞≤λ

(sk + γ∇(xk − 2η(∇>sk +A>tk))

),(52b)

tk+1 =1

1 + γ

(tk + γA(xk − 2η(∇>sk +A>tk))− γb

).(52c)

To perform the coordinate updates as described in §4, we can maintain∇>skand A>tk. Whenever a coordinate of (s, t) is updated, the corresponding

∇>sk (or A>tk) should also be updated. Specifically, we have the following

coordinate update algorithm

(53)

if xi is chosen for some i ∈ [n], then compute

xk+1i = xki − η(∇>sk +A>tk)i;

if si is chosen for some i ∈ [2n], then compute

sk+1i = proj‖·‖∞≤λ

(ski + γ∇i(xk − 2η(∇>sk +A>tk))

)and update ∇>sk to ∇>sk+1;

if ti is chosen for some i ∈ [m], then compute

tk+1i = 1

1+γ

(tki + γAi,:(x

k − 2η(∇>sk +A>tk))− γbi)

and update A>tk to A>tk+1.

5.2.3. 3D Mesh Denoising. Following an example in [60], we consider

a 3D mesh described by their nodes xi = (xXi , xYi , x

Zi ), i ∈ [n], and the

adjacency matrix A ∈ Rn×n, where Aij = 1 if nodes i and j are adjacent,

otherwise Aij = 0. We let Vi be the set of neighbours of node i. Noisy mesh

nodes zi, i ∈ [n], are observed. We try to recover the original mesh nodes by

Page 37: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 37

solving the following optimization problem [60]:

(54) minimizex

n∑i=1

fi(xi) +

n∑i=1

gi(xi) +∑i

∑j∈Vi

hi,j(xi − xj),

where fi’s are differentiable data fidelity terms, gi’s are the indicator func-

tions of box constraints, and∑

i

∑j∈Vi hi,j(xi−xj) is the total variation on

the mesh.

We introduce a dual variable s with coordinates si,j , for all ordered

pairs of adjacent nodes (i, j), and, based on the overlapping-block coordinate

updating scheme (33), perform coordinate update:

select i from [n], then compute

sk+1i,j = proxγh∗i,j (s

ki,j + γxki − γxkj ), ∀j ∈ Vi,

sk+1j,i = proxγh∗j,i(s

kj,i + γxkj − γxki ),∀j ∈ Vi,

and update

xik+1 = proxηgi(x

ki − η(∇fi(xki ) +

∑j∈Vi(2s

k+1i,j − 2sk+1

j,i − ski,j + skj,i))),

sk+1i,j = ski,j + 1

2(sk+1i,j − ski,j),∀j ∈ Vi,

sk+1j,i = skj,i + 1

2(sk+1j,i − skj,i),∀j ∈ Vi.

5.3. Finance

5.3.1. Portfolio Optimization. Assume that we have one unit of capital

and m assets to invest on. The ith asset has an expected return rate ξi ≥ 0.

Our goal is to find a portfolio with the minimal risk such that the expected

return is no less than c. This problem can be formulated as

minimizex

1

2x>Qx,

subject to x ≥ 0,

m∑i=1

xi ≤ 1,

m∑i=1

ξixi ≥ c,

where the objective function is a measure of risk, and the last constraint

imposes that the expected return is at least c. Let a1 = e/√m, b1 = 1/

√m,

a2 = ξ/‖ξ‖2, and b2 = c/‖ξ‖2, where e = (1, . . . , 1)>, ξ = (ξ1, . . . , ξm)>. The

above problem is rewritten as

(55) minimizex

1

2x>Qx, subject to x ≥ 0, a>1 x ≤ b1, a>2 x ≥ b2.

Page 38: Coordinate friendly structures, algorithms and applications

38

We apply the three-operator splitting scheme (13) to (55). Let f(x) =12x>Qx, D1 = {x : x ≥ 0}, D2 = {x : a>1 x ≤ b1, a

>2 x ≥ b2}, D21 =

{x : a>1 x = b1}, and D22 = {x : a>2 x = b2}. Based on (13), the full update is

yk+1 =projD2(xk),(56a)

xk+1 =xk + ηk(projD1

(2yk+1 − xk − γ∇f(yk+1))− yk+1),(56b)

where y is an intermediate variable. As the projection to D1 is simple, we

discuss how to evaluate the projection to D2. Assume that a1 and a2 are

neither perpendicular nor co-linear, i.e., a>1 a2 6= 0 and a1 6= λa2 for any

scalar λ. In addition, assume a>1 a2 > 0 for simplicity. Let a3 = a2 − 1a>1 a2

a1,

b3 = b2 − 1a>1 a2

b1, a4 = a1 − 1a>1 a2

a2, and b4 = b1 − 1a>1 a2

b2. Then we can

partition the whole space into four areas by the four hyperplanes a>i x = bi,

i = 1, . . . , 4. Let Pi = {x : a>i x ≤ bi, a>i+1x ≥ bi+1}, i = 1, 2, 3 and P4 = {x :

a>4 x ≤ b4, a>1 x ≥ b1}. Then

projD2(x) =

x, if x ∈ P1,projD22

(x), if x ∈ P2,projD21∩D22

(x), if x ∈ P3,projD21

(x), if x ∈ P4.

Let wi = a>i x − bi, i = 1, 2, and maintain w1, w2. Let a2 = a2−a1(a>1 a2)1−(a>1 a2)2

,

a1 = a1−a2(a>1 a2)1−(a>1 a2)2

. Then

projD21(x) = x− w1a1,

projD22(x) = x− w2a2,

projD21∩D22(x) = x− w1a1 − w2a2,

Page 39: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 39

Hence, the coordinate update of (56) is

xk ∈ P1 : xk+1i =(1− ηk)xki + ηk max(0, xki − γq>i xk),

(57a)

xk ∈ P2 : xk+1i =(1− ηk)xki + ηkw

k2(a2)i + ηk max (0,

xki − γq>i xk − wk2(2(a2)i − γq>i a2)),(57b)

xk ∈ P3 : xk+1i =(1− ηk)xki + ηk

(wk1(a1)i + wk2(a2)i

)+ ηk max (0,

xki − γq>i xk − wk1(2(a1)i − γq>i a1)− wk2(2(a2)i − γq>i a2)),(57c)

xk ∈ P4 : xk+1i =(1− ηk)xki + ηkw

k1(a1)i + ηk max (0,

xki − γq>i xk − wk1(2(a1)i − γq>i a1)),(57d)

where qi is the ith column of Q. At each iteration, we select i ∈ [m], andperform an update to xi according to (57) based on where xk is. We thenrenew wk+1

j = wkj +aij(xk+1i −xki ), j = 1, 2. Note that checking xk in some Pj

requires only O(1) operations by using w1 and w2, so the coordinate updatein (57) is inexpensive.

5.4. Distributed Computing

5.4.1. Network. Consider that m worker agents and one master agentform a star-shaped network, where the master agent at the center connectsto each of the worker agents. The m + 1 agents collaboratively solve theconsensus problem:

minimizex

m∑i=1

fi(x),

where x ∈ Rd is the common variable and each proximable function fi isheld privately by agent i. The problem can be reformulated as

minimizex1,...,xm,y∈Rd

F (x) :=

m∑i=1

fi(xi), subject to xi = y, ∀i ∈ [m],(58)

which has the KKT condition

(59) 0 ∈

∂F 0 00 0 00 0 0

︸ ︷︷ ︸operator A

xys

+

0 0 I0 0 −e>I −e 0

︸ ︷︷ ︸

operator C

xys

,

Page 40: Coordinate friendly structures, algorithms and applications

40

where s is the dual variable.

Applying the FBFS scheme (18) to (59) yields the following full update:

xk+1i = proxγfi(x

ki − γski ) + γ2xki − γ2yk − 2γski ,(60a)

yk+1 = (1 +mγ2)yk + 3γ

m∑j=1

skj − γ2m∑j=1

xkj ,(60b)

sk+1i = ski − 2γxki − γproxγfi(x

ki − γski ) + 3γyk + γ2

m∑j=1

skj ,(60c)

where (60a) and (60c) are applied to all i ∈ [m]. Hence, for each i, we

group xi and si together and assign them on agent i. We let the master

agent maintain∑

j sj and∑

j xj . Therefore, in the FBFS coordinate update,

updating any (xi, si) needs only y and∑

j sj from the master agent, and

updating y is done on the master agent. In synchronous parallel setting, at

each iteration, each worker agent i computes sk+1i , xk+1

i , then the master

agent collects the updates from all of the worker agents and then updates

y and∑

j sj . The above update can be relaxed to be asynchronous. In this

case, the master and worker agents work concurrently, the master agent

updates y and∑

j sj as soon as it receives the updated si and xi from any

of the worker agents. It also periodically broadcasts y back to the worker

agents.

5.5. Dimension Reduction

5.5.1. Nonnegative Matrix Factorization. Nonnegative Matrix Fac-

torization (NMF) is an important dimension reduction method for nonneg-

ative data. It was proposed by Paatero and his coworkers in [51]. Given a

nonnegative matrix A ∈ Rp×n+ , NMF aims at finding two nonnegative matri-

ces W ∈ Rp×r+ and H ∈ Rn×r+ such that WH> ≈ A, where r is user-specified

depending on the applications, and usually r � min(p, n). A widely used

model is

(61)minimize

W,HF (W,H) :=

1

2‖WH> −A‖2F ,

subject to W ∈ Rp×r+ , H ∈ Rn×r+ .

Page 41: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 41

Applying the projected gradient method (21) to (61), we have

W k+1 = max(0,W k − ηk∇WF (W k, Hk)

),(62a)

Hk+1 = max(0, Hk − ηk∇HF (W k, Hk)

).(62b)

In general, we do not know the Lipschitz constant of ∇F , so we have tochoose ηk by line search such that the Armijo condition is satisfied.

Partitioning the variables into 2r block coordinates: (w1, . . . , wr, h1, . . . , hr)where wi and hi are the ith columns of W and H, respectively, we can applythe coordinate update based on the projected-gradient method:

(63)

if wik is chosen for some ik ∈ [r], then compute

wk+1ik

= max(0, wkik − ηk∇wikF (W k, Hk)

);

if hik−r is chosen for some ik ∈ {r + 1, ..., 2r}, then compute

hk+1ik−r = max

(0, hkik−r − ηk∇hik−rF (W k, Hk)

).

It is easy to see that ∇wiF (W k, Hk) and ∇hiF (W k, Hk) are both Lipschitzcontinuous with constants ‖hki ‖22 and ‖wki ‖22 respectively. Hence, we can set

ηk =

{ 1‖hkik‖

22, if 1 ≤ ik ≤ r,

1‖wkik−r‖

22, if r + 1 ≤ ik ≤ 2r.

However, it is possible to have wki = 0 or hki = 0 for some i and k, andthus the setting in the above formula may have trouble of being dividedby zero. To overcome this problem, one can first modify the problem (61)by restricting W to have unit-norm columns and then apply the coordinateupdate method in (63). Note that the modification does not change theoptimal value since WH> = (WD)(HD−1)> for any r×r invertible diagonalmatrix D. We refer the readers to [82] for more details.

Note that ∇WF (W,H) = (WH>−A)H,∇HF (W,H) = (WH>−A)>Wand ∇wiF (W,H) = (WH> − A)hi,∇hiF (W,H) = (WH> − A)>wi, ∀i.Therefore, the coordinate updates given in (63) are computationally worthy(by maintaining the residual W k(Hk)> −A).

5.6. Stylized Optimization

5.6.1. Second-Order Cone Programming (SOCP). SOCP extendsLP by incorporating second-order cones. A second-order cone in Rn is

Q ={

(x1, x2, . . . , xn) ∈ Rn : ‖(x2, . . . , xn)‖2 ≤ x1

}.

Page 42: Coordinate friendly structures, algorithms and applications

42

Given a point v ∈ Rn, let ρv1 := ‖(v2, . . . , vn)‖2 and ρv2 := 12(v1 + ρv1). Then,

the projection of v to Q returns 0 if v1 < −ρv1, returns v if v1 ≥ ρv1, andreturns (ρv2,

ρv2ρv1· (v2, . . . , vn)) otherwise. Therefore, if we define the scalar

couple:

(ξv1 , ξv2) =

(0, 0), v1 < −ρv1,(1, 1), v1 ≥ ρv1,(ρv2,

ρv2ρv1

), otherwise,

then we have u = projQ(v) =(ξv1v1, ξ

v2 · (v2, . . . , vn)

). Based on this, we

have

Proposition 2. 1. Let v ∈ Rn and v+ := v + νei for any ν ∈ R. Then,given ρv1, ρ

v2, ξ

v1 , ξ

v2 defined above, it takes O(1) operations to obtain

ρv+

1 , ρv+

2 , ξv+

1 , ξv+

2 .2. Let v ∈ Rn and A = [a1 A2] ∈ Rm×n, where a1 ∈ Rm, A2 ∈ Rm×(n−1).

Given ρv1, ρv2, ξ

v1 , ξ

v2 , we have

A(2 · projQ(v)− v) = ((2ξv1 − 1)v1) · a1 + (2ξv2 − 1) ·A2(v2, . . . , vn)>.

By the proposition, if T1 is an affine operator, then in the compositionT1 ◦ projQ, the computation of projQ is cheap as long as we maintainρv1, ρ

v2, ξ

v1 , ξ

v2 .

Given x, c ∈ Rn, b ∈ Rm, and A ∈ Rm×n, the standard form of SOCP is

minimizex

c>x, subject to Ax = b,(64a)

x ∈ X = Q1 × · · · ×Qn,(64b)

where eachQi is a second-order cone, and n 6= n in general. The problem (64)is equivalent to

minimizex

(c>x+ ιA·=b(x)

)+ ιX(x),

to which we can apply the DRS iteration zk+1 = TDRS(zk) (see (16)), inwhich JγA = projX and TγB is a linear operator given by

JγB(x) = arg miny

c>y +1

2γ‖y − x‖2 subject to Ay = b.

Assume that the matrix A has full row-rank (otherwise, Ax = b has eitherredundant rows or no solution). Then, in (16), we have RγB(x) = Bx + d,where B := I − 2A>(AA>)−1A and d := 2A>(AA>)−1(b+ γAc)− 2γc.

Page 43: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 43

It is easy to apply coordinate updates to zk+1 = TDRS(zk) following

Proposition 2. Specifically, by maintaining the scalars ρv1, ρv2, ξ

v1 , ξ

v2 for each

v = xi ∈ Qi during coordinate updates, the computation of the projection

can be completely avoided. We pre-compute (AA>)−1 and cache the matrix

B and vector d. Then, TDRS is CF, and we have the following coordinate

update method

(65)

select i ∈ [n], then compute

yk+1i = Bix

k + dixk+1i = projQi(y

k+1i ) + 1

2(xki − yk+1i ),

where Bi ∈ Rni×n is the ith row block submatrix of B, and yk+1i is the

intermediate variable.

It is trivial to extend this method for SOCPs with a quadratic objective:

minimizex

c>x+1

2x>Cx, subject to Ax = b, x ∈ X = Q1 × · · · ×Qn,

because J2 is still linear. Clearly, this method applies to linear programs as

they are special SOCPs.

Note that many LPs and SOCPs have sparse matrices A, which deserve

further investigation. In particular, we may prefer not to form (AA>)−1 and

use the results in §4.2 instead.

6. Numerical Experiments

We illustrate the behavior of coordinate update algorithms for solving port-

folio optimization, image processing, and sparse logistic regression problems.

Our primary goal is to show the efficiency of coordinate update algorithms

compared to the corresponding full update algorithms. We will also illustrate

that asynchronous parallel coordinate update algorithms are more scalable

than their synchronous parallel counterparts.

Our first two experiments run on Mac OSX 10.9 with 2.4 GHz Intel Core

i5 and 8 Gigabytes of RAM. The experiments were coded in Matlab. The

sparse logistic regression experiment runs on 1 to 16 threads on a machine

with two 2.5 Ghz 10-core Intel Xeon E5-2670v2 (20 cores in total) and

64 Gigabytes of RAM. The experiment was coded in C++ with OpenMP

enabled. We use the Eigen library7 for sparse matrix operations.

7http://eigen.tuxfamily.org

Page 44: Coordinate friendly structures, algorithms and applications

44

6.1. Portfolio Optimization

In this subsection, we compare the performance of the 3S splitting scheme (56)with the corresponding coordinate update algorithm (57) for solving theportfolio optimization problem (55). In this problem, our goal is to dis-tribute our investment resources to all the assets so that the investmentrisk is minimized and the expected return is greater than c. This test usestwo datasets, which are summarized in Table 3. The NASDAQ dataset iscollected through Yahoo! Finance. We collected one year (from 10/31/2014to 10/31/2015) of historical closing prices for 2730 stocks.

Synthetic data NASDAQ data

Number of assets (N) 1000 2730Expected return rate 0.02 0.02

Asset return rate 3 * rand(N, 1) - 1 mean of 30 days return rateRisk covariance matrix + 0.01 · I positive definite matrix

Table 3: Two datasets for portfolio optimization

In our numerical experiments, for comparison purposes, we first obtain ahigh accurate solution by solving (55) with an interior point solver. For bothfull update and coordinate update, ηk is set to 0.8. However, we use differentγ. For 3S full update, we used the step size parameter γ1 = 2

‖Q‖2 , and for 3S

coordinate update, γ2 = 2max{Q11,...,QNN} . In general, coordinate update can

benefit from more relaxed parameters. The results are reported in Figure 3.We can observe that the coordinate update method converges much fasterthan the 3S method for the synthetic data. This is due to the fact that γ2

is much larger than γ1. However, for the NASDAQ dataset, γ1 ≈ γ2, so 3Scoordinate update is only moderately faster than 3S full update.

6.2. Computed Tomography Image Reconstruction

We compare the performance of algorithm (52) and its corresponding coor-dinate version on Computed Tomography (CT) image reconstruction. Wegenerate a thorax phantom of size 284 × 284 to simulate spectral CT mea-surements. We then apply the Siddon’s algorithm [66] to form the sinogramdata. There are 90 parallel beam projections and, for each projection, thereare 362 measurements. Then the sinogram data is corrupted with Gaussiannoise. We formulate the image reconstruction problem in the form of (48).The primal-dual full update corresponds to (52). For coordinate update, the

Page 45: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 45

0 20 40 60 80 100

10−16

10−14

10−12

10−10

10−8

10−6

10−4

10−2

3S full update vs 3S coordinate update

Epochs

fk−

f∗

3S full update

3S coordinate update

(a) Synthesis dataset

0 20 40 60 80 100

10−6

10−5

10−4

10−3

10−2

3S full update vs 3S coordinate update

Epochs

fk−

f∗

3S full update

3S coordinate update

(b) NASDAQ dataset

Figure 3: Compare the convergence of 3S full update with 3S coordinateupdate algorithms.

block size for x is set to 284, which corresponds to a column of the image.

The dual variables s, t are also partitioned into 284 blocks accordingly. A

block of x and the corresponding blocks of s and t are bundled together as

a single block. In each iteration, a bundled block is randomly chosen and

updated. The reconstruction results are shown in Figure 4. After 100 epochs,

the image recovered by the coordinate version is better than that by (52).

As shown in Figure 4d, the coordinate version converges faster than (52).

6.3. `1 Regularized Logistic Regression

In this subsection, we compare the performance of sync-parallel coordinate

update and async-parallel coordinate update for solving the sparse logistic

regression problem

(66) minimizex∈Rn

λ‖x‖1 +1

N

N∑j=1

log(1 + exp(−bj · a>j x)

),

where {(aj , bj)}Nj=1 is the set of sample-label pairs with bj ∈ {1,−1}, λ =

0.0001, and n and N represent the numbers of features and samples, respec-

Page 46: Coordinate friendly structures, algorithms and applications

46

(a) Phantom image (b) Recovered by PDS

(c) Recovered by PDS coord

0 20 40 60 80 100������

102

103

104

105

106

��

�����

PDS full updatePDS coordinate update

(d) Objective function value

Figure 4: CT image reconstruction.

tively. This test uses the datasets8: real-sim and news20, which are summa-rized in Table 4.

We let each coordinate hold roughly 50 features. Since the total numberof features is not divisible by 50, some coordinates have 51 features. We leteach thread draw a coordinate uniformly at random at each iteration. Westop all the tests after 10 epochs since they have nearly identical progressper epoch. The step size is set to ηk = 0.9, ∀k. Let A = [a1, . . . , aN ]> andb = [b1, ..., bN ]>. In global memory, we store A, b and x. We also store theproduct Ax in global memory so that the forward step can be efficientlycomputed. Whenever a coordinate of x gets updated, Ax is immediately

8http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Page 47: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 47

Name # samples # featuresreal-sim 72, 309 20, 958news20 19,996 1,355,191

Table 4: Two datasets for sparse logistic regression.

updated at a low cost. Note that if Ax is not stored in global memory, everycoordinate update will have to compute Ax from scratch, which involves theentire x and will be very expensive.

Table 5 gives the running times of the sync-parallel and async-parallelimplementations on the two datasets. We can observe that async-parallelachieves almost-linear speedup, but sync-parallel scales very poorly as weexplain below.

In the sync-parallel implementation, all the running threads have towait for the last thread to finish an iteration, and therefore if a thread has alarge load, it slows down the iteration. Although every thread is (randomly)assigned to roughly the same number of features (either 50 or 51 componentsof x) at each iteration, their ai’s have very different numbers of nonzeros,and the thread with the largest number of nonzeros is the slowest. (Sparsematrix computation is used for both datasets, which are very large.) Asmore threads are used, despite that they altogether do more work at eachiteration, the per-iteration time may increase as the slowest thread tendsto be slower. On the other hand, async-parallel coordinate update does notsuffer from the load imbalance. Its performance grows nearly linear with thenumber of threads.

Finally, we have observed that the progress toward solving (66) is mainlya function of the number of epochs and does not change appreciably whenthe number of threads increases or between sync-parallel and async-parallel.Therefore, we always stop at 10 epochs.

7. Conclusions

We have presented a coordinate update method for fixed-point iterations,which updates one coordinate (or a few variables) at every iteration and canbe applied to solve linear systems, optimization problems, saddle point prob-lems, variational inequalities, and so on. We proposed a new concept calledCF operator. When an operator is CF, its coordinate update is computation-ally worthy and often preferable over the full update method, in particular ina parallel computing setting. We gave examples of CF operators and also dis-cussed how the properties can be preserved by composing two or more suchoperators such as in operator splitting and primal-dual splitting schemes. In

Page 48: Coordinate friendly structures, algorithms and applications

48

# threadsreal-sim news20

time (s) speedup time (s) speedupasync sync async sync async sync async sync

1 81.6 82.1 1.0 1.0 591.1 591.3 1.0 1.02 45.9 80.6 1.8 1.0 304.2 590.1 1.9 1.04 21.6 63.0 3.8 1.3 150.4 557.0 3.9 1.18 16.1 61.4 5.1 1.3 78.3 525.1 7.5 1.116 7.1 46.4 11.5 1.8 41.6 493.2 14.2 1.2

Table 5: Running times of async-parallel and sync-parallel FBS implemen-tations for `1 regularized logistic regression on two datasets. Sync-parallelhas very poor speedup due to the large distribution of coordinate sparsityand thus the large load imbalance across threads.

addition, we have developed CF algorithms for problems arising in several

different areas including machine learning, imaging, finance, and distributed

computing. Numerical experiments on portfolio optimization, CT imaging,

and logistic regression have been provided to demonstrate the superiority

of CF methods over their counterparts that update all coordinates at every

iteration.

References

[1] Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternat-

ing minimization and projection methods for nonconvex problems: An

approach based on the Kurdyka-Lojasiewicz inequality. Mathematics

of Operations Research 35(2), 438–457 (2010)

[2] Bahi, J., Miellou, J.C., Rhofir, K.: Asynchronous multisplitting meth-

ods for nonlinear fixed point problems. Numerical Algorithms 15(3-4),

315–345 (1997)

[3] Baudet, G.M.: Asynchronous iterative methods for multiprocessors. J.

ACM 25(2), 226–244 (1978).

[4] Bauschke, H.H., Borwein, J.M.: On the convergence of von Neumann’s

alternating projection algorithm for two sets. Set-Valued Analysis 1(2),

185–212 (1993)

[5] Bauschke, H.H., Combettes, P.L.: Convex analysis and monotone op-

erator theory in Hilbert spaces. Springer Science & Business Media

(2011)

Page 49: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 49

[6] Baz, D.E., Frommer, A., Spiteri, P.: Asynchronous iterations with flex-ible communication: contracting operators. Journal of Computationaland Applied Mathematics 176(1), 91 – 103 (2005)

[7] Baz, D.E., Gazen, D., Jarraya, M., Spiteri, P., Miellou, J.: Flexiblecommunication for parallel asynchronous methods with application toa nonlinear optimization problem. In: E. D’Hollander, F. Peters, G. Jou-bert, U. Trottenberg, R. Volpel (eds.) Parallel Computing Fundamen-tals, Applications and New Directions, Advances in Parallel Computing,vol. 12, pp. 429 – 436. North-Holland (1998)

[8] Beck, A., Tetruashvili, L.: On the convergence of block coordinate de-scent type methods. SIAM Journal on Optimization 23(4), 2037–2060(2013)

[9] Bengio, Y., Delalleau, O., Le Roux, N.: Label propagation andquadratic criterion. In: Semi-Supervised Learning, pp. 193–216. MITPress (2006)

[10] Bertsekas, D.P.: Distributed asynchronous computation of fixed points.Mathematical Programming 27(1), 107–120 (1983)

[11] Bertsekas, D.P.: Nonlinear programming. Athena Scientific (1999)

[12] Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and distributed computation:numerical methods. Prentice hall Englewood Cliffs, NJ (1989)

[13] Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized min-imization for nonconvex and nonsmooth problems. Mathematical Pro-gramming 146(1-2), 459–494 (2014)

[14] Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinatedescent for l1-regularized loss minimization. In: Proceedings of the 28thInternational Conference on Machine Learning (ICML-11), pp. 321–328(2011)

[15] Briceno-Arias, L.M.: Forward-Douglas–Rachford splitting and forward-partial inverse method for solving monotone inclusions. Optimization64(5), 1239–1261 (2015)

[16] Briceno-Arias, L.M., Combettes, P.L.: Monotone operator methods forNash equilibria in non-potential games. In: Computational and Ana-lytical Mathematics, pp. 143–159. Springer (2013)

[17] Chazan, D., Miranker, W.: Chaotic relaxation. Linear Algebra and itsApplications 2(2), 199–222 (1969)

Page 50: Coordinate friendly structures, algorithms and applications

50

[18] Combettes, P.L., Condat, L., Pesquet, J.C., Vu, B.C.: A forward-

backward view of some primal-dual optimization methods in image re-

covery. In: Proceedings of the 2014 IEEE International Conference on

Image Processing (ICIP), pp. 4141–4145 (2014)

[19] Combettes, P.L., Pesquet, J.C.: Stochastic quasi-Fejer block-coordinate

fixed point iterations with random sweeping. SIAM Journal on Opti-

mization 25(2), 1221–1248 (2015).

[20] Condat, L.: A primal–dual splitting method for convex optimization

involving Lipschitzian, proximable and linear composite terms. Journal

of Optimization Theory and Applications 158(2), 460–479 (2013)

[21] Dang, C.D., Lan, G.: Stochastic block mirror descent methods for non-

smooth and stochastic optimization. SIAM Journal on Optimization

25(2), 856–881 (2015).

[22] Davis, D.: An o(n log(n)) algorithm for projecting onto the ordered

weighted `1 norm ball. arXiv preprint arXiv:1505.00870 (2015)

[23] Davis, D.: Convergence rate analysis of primal-dual splitting schemes.

SIAM Journal on Optimization 25(3), 1912–1943 (2015)

[24] Davis, D., Yin, W.: A three-operator splitting scheme and its optimiza-

tion applications. arXiv preprint arXiv:1504.01032 (2015)

[25] Dhillon, I.S., Ravikumar, P.K., Tewari, A.: Nearest neighbor based

greedy coordinate descent. In: Advances in Neural Information Pro-

cessing Systems, pp. 2160–2168 (2011)

[26] Douglas, J., Rachford, H.H.: On the numerical solution of heat conduc-

tion problems in two and three space variables. Transactions of the

American Mathematical Society 82(2), 421–439 (1956)

[27] El Baz, D., Gazen, D., Jarraya, M., Spiteri, P., Miellou, J.C.: Flexible

communication for parallel asynchronous methods with application to

a nonlinear optimization problem. Advances in Parallel Computing 12,

429–436 (1998)

[28] Fercoq, O., Bianchi, P.: A coordinate descent primal-dual algorithm

with large step size and possibly non separable functions. arXiv preprint

arXiv:1508.04625 (2015)

[29] Frommer, A., Szyld, D.B.: On asynchronous iterations. Journal of Com-

putational and Applied Mathematics 123(1-2), 201–216 (2000)

Page 51: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 51

[30] Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinearvariational problems via finite element approximation. Computers &Mathematics with Applications 2(1), 17–40 (1976)

[31] Glowinski, R., Marroco, A.: Sur l’approximation, par elements finisd’ordre un, et la resolution, par penalisation-dualite d’une classe deproblemes de dirichlet non lineaires. Revue francaise d’automatique,informatique, recherche operationnelle. Analyse numerique 9(2), 41–76(1975)

[32] Grippo, L., Sciandrone, M.: On the convergence of the block nonlinearGauss-Seidel method under convex constraints. Operations ResearchLetters 26(3), 127–136 (2000)

[33] Han, S.: A successive projection method. Mathematical Programming40(1), 1–14 (1988)

[34] Hastie, T., Tibshirani, R., Friedman, J., Franklin, J.: The elements ofstatistical learning: data mining, inference and prediction. The Math-ematical Intelligencer 27(2), 83–85 (2005)

[35] Hildreth, C.: A quadratic programming procedure. Naval Research Lo-gistics Quarterly 4(1), 79–85 (1957)

[36] Hong, M., Wang, X., Razaviyayn, M., Luo, Z.Q.: Iteration complex-ity analysis of block coordinate descent methods. arXiv preprintarXiv:1310.6957v2 (2015)

[37] Hsieh, C.j., Yu, H.f., Dhillon, I.: PASSCoDe: Parallel asynchronousstochastic dual co-ordinate descent. In: Proceedings of the 32nd Inter-national Conference on Machine Learning (ICML-15), pp. 2370–2379(2015)

[38] Jacob, L., Obozinski, G., Vert, J.P.: Group lasso with overlap and graphlasso. In: Proceedings of the 26th International Conference on MachineLearning (ICML-09), pp. 433–440. ACM (2009)

[39] Krasnosel’skii, M.A.: Two remarks on the method of successive approx-imations. Uspekhi Matematicheskikh Nauk 10(1), 123–127 (1955)

[40] Lebedev, V., Tynjanskiı, N.: Duality theory of concave-convex games.In: Soviet Math. Dokl, vol. 8, pp. 752–756 (1967)

[41] Li, Y., Osher, S.: Coordinate descent optimization for `1 minimizationwith application to compressed sensing; a greedy algorithm. InverseProblems and Imaging 3(3), 487–503 (2009)

Page 52: Coordinate friendly structures, algorithms and applications

52

[42] Liu, J., Wright, S.J.: Asynchronous stochastic coordinate descent: Par-

allelism and convergence properties. SIAM Journal on Optimization

25(1), 351–376 (2015)

[43] Liu, J., Wright, S.J., Re, C., Bittorf, V., Sridhar, S.: An asynchronous

parallel stochastic coordinate descent algorithm. Journal of Machine

Learning Research 16, 285–322 (2015)

[44] Lu, Z., Xiao, L.: On the complexity analysis of randomized block-

coordinate descent methods. Mathematical Programming 152(1-2),

615–642 (2015).

[45] Luo, Z.Q., Tseng, P.: On the convergence of the coordinate descent

method for convex differentiable minimization. Journal of Optimization

Theory and Applications 72(1), 7–35 (1992)

[46] McLinden, L.: An extension of Fenchel’s duality theorem to saddle

functions and dual minimax problems. Pacific Journal of Mathematics

50(1), 135–158 (1974)

[47] Nedic, A., Bertsekas, D.P., Borkar, V.S.: Distributed asynchronous in-

cremental subgradient methods. Studies in Computational Mathemat-

ics 8, 381–407 (2001)

[48] Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale

optimization problems. SIAM Journal on Optimization 22(2), 341–362

(2012)

[49] Nutini, J., Schmidt, M., Laradji, I., Friedlander, M., Koepke, H.: Co-

ordinate descent converges faster with the Gauss-Southwell rule than

random selection. In: Proceedings of the 32nd International Conference

on Machine Learning (ICML-15), pp. 1632–1641 (2015)

[50] O’Connor, D., Vandenberghe, L.: Primal-dual decomposition by oper-

ator splitting and applications to image deblurring. SIAM Journal on

Imaging Sciences 7(3), 1724–1754 (2014)

[51] Paatero, P., Tapper, U.: Positive matrix factorization: A non-negative

factor model with optimal utilization of error estimates of data values.

Environmetrics 5(2), 111–126 (1994)

[52] Passty, G.B.: Ergodic convergence to a zero of the sum of monotone

operators in Hilbert space. Journal of Mathematical Analysis and Ap-

plications 72(2), 383–390 (1979)

Page 53: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 53

[53] Peaceman, D.W., Rachford Jr, H.H.: The numerical solution of

parabolic and elliptic differential equations. Journal of the Society for

Industrial and Applied Mathematics 3(1), 28–41 (1955)

[54] Peng, Z., Xu, Y., Yan, M., Yin, W.: ARock: an algorithmic frame-

work for asynchronous parallel coordinate updates. ArXiv e-prints

arXiv:1506.02396 (2015)

[55] Peng, Z., Yan, M., Yin, W.: Parallel and distributed sparse optimiza-

tion. In: Proceedings of the 2013 Asilomar Conference on Signals, Sys-

tems and Computers, pp. 659–646 (2013)

[56] Pesquet, J.C., Repetti, A.: A class of randomized primal-dual algo-

rithms for distributed optimization. Journal of Nonlinear and Convex

Analysis 16(12), 2453–2490 (2015)

[57] Polak, E., Sargent, R., Sebastian, D.: On the convergence of sequential

minimization algorithms. Journal of Optimization Theory and Appli-

cations 12(6), 567–575 (1973)

[58] Razaviyayn, M., Hong, M., Luo, Z.Q.: A unified convergence analysis

of block successive minimization methods for nonsmooth optimization.

SIAM Journal on Optimization 23(2), 1126–1153 (2013)

[59] Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: A lock-free approach

to parallelizing stochastic gradient descent. In: Advances in Neural

Information Processing Systems, pp. 693–701 (2011)

[60] Repetti, A., Chouzenoux, E., Pesquet, J.C.: A random block-coordinate

primal-dual proximal algorithm with application to 3d mesh denoising.

In: Proceedings of the 2015 IEEE International Conference on Acous-

tics, Speech and Signal Processing (ICASSP), pp. 3561–3565 (2015)

[61] Richtarik, P., Takac, M.: Iteration complexity of randomized block-

coordinate descent methods for minimizing a composite function. Math-

ematical Programming 144(1-2), 1–38 (2014)

[62] Richtarik, P., Takac, M.: Parallel coordinate descent methods for

big data optimization. Mathematical Programming 156(1), 433–484

(2016).

[63] Rockafellar, R.T.: Convex analysis. Princeton University Press (1997)

[64] Rue, H., Held, L.: Gaussian Markov random fields: theory and applica-

tions. CRC Press (2005)

Page 54: Coordinate friendly structures, algorithms and applications

54

[65] Scholkopf, B., Smola, A.J.: Learning with kernels: support vector ma-chines, regularization, optimization, and beyond. MIT press (2001)

[66] Siddon, R.L.: Fast calculation of the exact radiological path for a three-dimensional CT array. Medical Physics 12(2), 252–255 (1985)

[67] Strikwerda, J.C.: A probabilistic analysis of asynchronous iteration.Linear Algebra and its Applications 349(1), 125–154 (2002)

[68] Tseng, P.: Applications of a splitting algorithm to decomposition inconvex programming and variational inequalities. SIAM Journal onControl and Optimization 29(1), 119–138 (1991)

[69] Tseng, P.: On the rate of convergence of a partially asynchronous gradi-ent projection algorithm. SIAM Journal on Optimization 1(4), 603–619(1991)

[70] Tseng, P.: Dual coordinate ascent methods for non-strictly convex min-imization. Mathematical Programming 59(1), 231–247 (1993)

[71] Tseng, P.: A modified forward-backward splitting method for maximalmonotone mappings. SIAM J. Control and Optimization 38(2), 431–446 (2000).

[72] Tseng, P.: Convergence of a block coordinate descent method for non-differentiable minimization. Journal of Optimization Theory and Ap-plications 109(3), 475–494 (2001)

[73] Tseng, P., Yun, S.: Block-coordinate gradient descent method for lin-early constrained nonsmooth separable optimization. Journal of Opti-mization Theory and Applications 140(3), 513–535 (2009)

[74] Tseng, P., Yun, S.: A coordinate gradient descent method for nons-mooth separable minimization. Mathematical Programming 117(1-2),387–423 (2009)

[75] Von Neumann, J.: On rings of operators. reduction theory. Annals ofMathematics 50(2), 401–485 (1949)

[76] Vu, B.C.: A splitting algorithm for dual monotone inclusions involvingcocoercive operators. Advances in Computational Mathematics 38(3),667–681 (2013)

[77] Warga, J.: Minimizing certain convex functions. Journal of the Societyfor Industrial and Applied Mathematics 11(3), 588–593 (1963)

[78] Wright, S.J.: Coordinate descent algorithms. Mathematical Program-ming 151(1), 3–34 (2015)

Page 55: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 55

[79] Wu, T.T., Lange, K.: Coordinate descent algorithms for lasso penalizedregression. The Annals of Applied Statistics 2(1), 224–244 (2008)

[80] Xu, Y.: Alternating proximal gradient method for sparse nonnegativeTucker decomposition. Mathematical Programming Computation 7(1),39–70 (2015).

[81] Xu, Y., Yin, W.: A block coordinate descent method for regularizedmulticonvex optimization with applications to nonnegative tensor fac-torization and completion. SIAM Journal on Imaging Sciences 6(3),1758–1789 (2013)

[82] Xu, Y., Yin, W.: A globally convergent algorithm for nonconvexoptimization based on block coordinate update. arXiv preprintarXiv:1410.1386 (2014)

[83] Xu, Y., Yin, W.: Block stochastic gradient iteration for convex andnonconvex optimization. SIAM Journal on Optimization 25(3), 1686–1716 (2015).

[84] Yuan, M., Lin, Y.: Model selection and estimation in regression withgrouped variables. Journal of the Royal Statistical Society: Series B(Statistical Methodology) 68(1), 49–67 (2006)

[85] Zadeh, N.: A note on the cyclic coordinate ascent method. ManagementScience 16(9), 642–644 (1970)

Appendix A. Some Key Concepts of Operators

In this section, we go over a few key concepts in monotone operator theoryand operator splitting theory.

Definition 8 (monotone operator). A set-valued operator T : H ⇒ H ismonotone if 〈x− y, u− v〉 ≥ 0, ∀x, y ∈ H, u ∈ T x, v ∈ T y. Furthermore, Tis maximally monotone if its graph Grph(T ) = {(x, u) ∈ H × H : u ∈ T x}is not strictly contained in the graph of any other monotone operator.

Example 18. An important maximally monotone operator is the subdiffer-ential ∂f of a closed proper convex function f .

Definition 9 (nonexpansive operator). An operator T : H → H is non-expansive if ‖T x − T y‖ ≤ ‖x − y‖, ∀x, y ∈ H. We say T is averaged,or α-averaged, if there is one nonexpansive operator R such that T =(1− α)I + αR for some 0 < α < 1. A 1

2 -averaged operator T is also calledfirmly-nonexpansive.

Page 56: Coordinate friendly structures, algorithms and applications

56

By definition, a nonexpansive operator is single-valued. Let T be aver-aged. If T has a fixed point, the iteration (2) converges to a fixed point;otherwise, the iteration diverges unboundedly. Now let T be nonexpansive.The damped update of T : xk+1 = xk−η(xk−T xk), is equivalent to applyingthe averaged operator (1− η)I + ηT .

Example 19. A common firmly-nonexpansive operator is the resolvent ofa maximally monotone map T , written as

(67) JA := (I +A)−1.

Given x ∈ H, JA(x) = {y : x ∈ y + Ay}. (By monotonicity of A, JA is asingleton, and by maximality of A, JA(x) is well defined for all x ∈ H. ) Areflective resolvent is

(68) RA := 2JA − I.

Definition 10 (proximal map). The proximal map for a function f is aspecial resolvent defined as:

(69) proxγf (y) = arg minx

{f(x) +

1

2γ‖x− y‖2

},

where γ > 0. The first-order variational condition of the minimization yieldsproxγf (y) = (I + γ∂f)−1; hence, proxγf is firmly-nonexpansive. Whenx ∈ Rm and proxγf can be computed in O(m) or O(m logm) operations,we call f proximable.

Examples of proximable functions include `1, `2, `∞-norms, several ma-trix norms, the owl-norm [22], (piece-wise) linear functions, certain quadraticfunctions, and many more.

Example 20. A special proximal map is the projection map. Let X be anonempty closed convex set, and ιS be its indicator function. MinimizingιS(x) enforces x ∈ S, so proxγιS reduces to the projection map projS forany γ > 0. Therefore, projS is also firmly nonexpansive.

Definition 11 (β-cocoercive operator). An operator T : H → H is β-cocoercive if 〈x− y, T x− T y〉 ≥ β‖T x− T y‖2, ∀x, y ∈ H.

Example 21. A special example of cocoercive operator is the gradient of asmooth function. Let f be a differentiable function. Then ∇f is β-Lipschitzcontinuous if and only if ∇f is 1

β -cocoercive [5, Corollary 18.16].

Page 57: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 57

Appendix B. Derivation of ADMM from the DRS Update

We derive the ADMM update in (23) from the DRS update

sk = JηB(tk),(70a)

tk+1 =

(1

2(2JηA − I) ◦ (2JηB − I) +

1

2I)

(tk),(70b)

where A = −∂f∗(−·) and B = ∂g∗.Note (70a) is equivalent to tk ∈ sk+η∂g∗(sk), i.e., there is a yk ∈ ∂g∗(sk)

such that tk = sk + ηyk, so

(71) tk − ηyk = sk ∈ ∂g(yk).

In addition, (70b) can be written as

tk+1 = JηA(2sk − tk) + tk − sk

= sk + (JηA − I)(2sk − tk)= sk + (I − (I + η∂f∗)−1)(tk − 2sk)

= sk + η(ηI + ∂f)−1(tk − 2sk)

= sk + η(ηI + ∂f)−1(ηyk − sk),(72)

where in the fourth equality, we have used the Moreau’s Identity [63]: (I +∂h)−1 + (I + ∂h∗)−1 = I for any closed convex function h. Let

(73) xk+1 = (ηI + ∂f)−1(ηyk − sk) = (I +1

η∂f)−1(yk − 1

ηsk).

Then (72) becomes

tk+1 = sk + ηxk+1,

and

(74) sk+1 (71)= tk+1 − ηyk+1 = sk + ηxk+1 − ηyk+1,

which together with sk+1 ∈ ∂g(yk+1) gives

(75) yk+1 = (ηI + ∂g)−1(sk + ηxk+1) = (I +1

η∂g)−1(xk+1 +

1

ηsk).

Hence, from (73), (74), and (75), the ADMM update in (23) is equivalentto the DRS update in (70) with η = 1

γ .

Page 58: Coordinate friendly structures, algorithms and applications

58

Appendix C. Representing the Condat-Vu Algorithm as aNonexpansive Operator

We show how to derive the Condat-Vu algorithm (28) by applying a forward-backward operator to the optimality condition (27):

(76) 0 ∈( [∇f 00 0

]︸ ︷︷ ︸

operator A

+

[∂g 00 ∂h∗

]+

[0 A>

−A 0

]︸ ︷︷ ︸

operator B

) [xs

]︸︷︷︸z

,

It can be written as 0 ∈ Az + Bz after we define z =

[xs

]. Let M be a

symmetric positive definite matrix, we have

0 ∈ Az + Bz⇔Mz −Az ∈Mz + Bz⇔z −M−1Az ∈ z +M−1Bz⇔z = (I +M−1B)−1 ◦ (I −M−1A)z.

Convergence and other results can be found in [23]. The last equivalentrelation is due to M−1B being a maximally monotone operator under thenorm induced by M . We let

M =

[1η I A>

A 1γ I

]� 0

and iterate

zk+1 = T zk = (I +M−1B)−1 ◦ (I −M−1A)zk.

We have Mzk+1 + Bzk+1 = Mzk −Azk:{1ηx

k +A>sk −∇f(xk) ∈ 1ηx

k+1 +A>sk+1 +A>sk+1 + ∂g(xk+1),1γ s

k +A xk ∈ 1γ s

k+1 +A xk+1 −A xk+1 + ∂h∗(sk+1),

which is equivalent to{sk+1 = proxγh∗(s

k + γAxk),

xk+1 = proxηg(xk − η(∇f(xk) +A>(2sk+1 − sk))).

Page 59: Coordinate friendly structures, algorithms and applications

Coordinate friendly structures, algorithms, and applications 59

Now we derived the Condat-Vu algorithm. With proper choices of η and γ,the forward-backward operator T = (I + M−1B)−1 ◦ (I −M−1A) can beshown to be α-averaged if we use the inner product 〈z1, z2〉M = z>1 Mz2 and

norm ‖z‖M =√z>Mz on the space of z =

[xs

]. More details can be found

in [23].

If we change the matrix M to

[1η I −A>

−A 1γ I

], the other algorithm (29)

can be derived similarly.

Appendix D. Proof of Convergence for Async-parallelPrimal-dual Coordinate Update Algorithms

Algorithms 1 and 2 differ from that in [54] in the following aspects:

1. the operator TCV is nonexpansive under a norm induced by a sym-metric positive definite matrix M (see Appendix C), instead of thestandard Euclidean norm;

2. the coordinate updates are no longer orthogonal to each other underthe norm induced by M ;

3. the block coordinates may overlap each other.

Because of these differences, we make two major modifications to the proofin [54, Section 3]: (i) adjusting parameters in [54, Lemma 2] and modifyits proof to accommodate for the new norm; (2) modify the inner productand induced norm used in [54, Theorem 2] and adjust the constants in [54,Theorems 2 and 3].

We assume the same inconsistent case as in [54], i.e., the relationshipbetween zk and zk is

(77) zk = zk +∑d∈J(k)

(zd − zd+1),

where J(k) ⊆ {k − 1, ..., k − τ} and τ is the maximum number of otherupdates to z during the computation of the update. Let S = I − TCV.Then the coordinate update can be rewritten as zk+1 = zk − ηk

(m+p)qikSik zk,

where Sizk = (zk1 , . . . , zki−1, (S zk)i, zki+1, . . . , z

km+p) for Algorithm 1. For Al-

gorithm 2, the update is

zk+1 = zk − ηkmqik

Sik zk,(78)

Page 60: Coordinate friendly structures, algorithms and applications

60

where

Sizk =

0. . .

0IHi

0. . .

0ρi,1IG1

. . .

ρi,pIGp

S zk.

Let $\lambda_{\max}$ and $\lambda_{\min}$ be the maximal and minimal eigenvalues of the matrix $M$, respectively, and let $\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$ be its condition number. Then we have the following lemma.

Lemma 1. For both Algorithms 1 and 2,

(79)    $\sum_i S_i \hat z^k = S\hat z^k,$

(80)    $\sum_i \|S_i \hat z^k\|_M^2 \le \kappa \|S\hat z^k\|_M^2,$

where $i$ runs from $1$ to $m+p$ for Algorithm 1 and from $1$ to $m$ for Algorithm 2.

Proof. The first part follows immediately from the definition of $S_i$ for both algorithms. For the second part, for Algorithm 1 we have

(81)    $\sum_i \|S_i\hat z^k\|_M^2 \le \sum_i \lambda_{\max}\|S_i\hat z^k\|^2 = \lambda_{\max}\|S\hat z^k\|^2 \le \frac{\lambda_{\max}}{\lambda_{\min}}\|S\hat z^k\|_M^2.$

For Algorithm 2, the equality in (81) is replaced by "$\le$".
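A numerical sketch of (80)-(81) for the Algorithm 1 selectors (each $S_i$ keeps a single coordinate of $S\hat z^k$), with a random positive definite $M$; the operator output is replaced by a random vector, which is all the inequality depends on.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
B = rng.standard_normal((n, n)); M = B @ B.T + n * np.eye(n)    # a random M > 0
lam = np.linalg.eigvalsh(M)
kappa = lam[-1] / lam[0]                                        # condition number of M
nrm2_M = lambda v: float(v @ M @ v)                             # squared M-norm

Sz = rng.standard_normal(n)                                     # stands in for S(z_hat)
lhs = sum(nrm2_M(np.eye(n)[i] * Sz[i]) for i in range(n))       # sum_i ||S_i z_hat||_M^2
assert lhs <= kappa * nrm2_M(Sz) + 1e-10                        # inequality (80)/(81)
```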

Finally, we define

(82)    $\bar z^{k+1} := z^k - \eta_k S\hat z^k,$

let $q_{\min} = \min_i q_i > 0$, and let $|J(k)|$ be the number of elements in $J(k)$. It is shown in [23] that, with proper choices of $\eta$ and $\gamma$, $T_{\mathrm{CV}}$ is nonexpansive under the norm induced by $M$. Then Lemma 2 shows that $S$ is $1/2$-cocoercive under the same norm.


Lemma 2. An operator $T: F \to F$ is nonexpansive under the norm induced by $M$ if and only if $S = I - T$ is $1/2$-cocoercive under the same norm, i.e.,

(83)    $\langle z - \tilde z,\, Sz - S\tilde z\rangle_M \ge \frac{1}{2}\|Sz - S\tilde z\|_M^2, \quad \forall\, z, \tilde z \in F.$

The proof is the same as that of [5, Proposition 4.33].
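The sketch below (not from the paper) illustrates Lemma 2 numerically: it builds a linear map that is nonexpansive in the $M$-norm, via the artificial construction $T = R^{-1} Q R$ with $M = R^\top R$ and $\|Q\|_2 \le 1$ (chosen only for the test), and checks inequality (83).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
B = rng.standard_normal((n, n)); M = B @ B.T + n * np.eye(n)    # a random M > 0
R = np.linalg.cholesky(M).T                                     # M = R^T R
Q = rng.standard_normal((n, n)); Q /= np.linalg.norm(Q, 2)      # ||Q||_2 = 1
T = np.linalg.solve(R, Q @ R)                                   # nonexpansive in the M-norm
S = np.eye(n) - T

dot_M = lambda u, v: float(u @ M @ v)
z, zt = rng.standard_normal(n), rng.standard_normal(n)
diff = S @ (z - zt)                                             # Sz - S zt (S is linear here)
assert dot_M(z - zt, diff) >= 0.5 * dot_M(diff, diff) - 1e-10   # inequality (83)
```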

We state the complete theorem for Algorithm 2. The theorem for Algorithm 1 is similar (we need to change $m$ to $m+p$ when necessary).

Theorem 2. Let $Z^*$ be the set of optimal solutions of (24) and let $(z^k)_{k\ge 0} \subset F$ be the sequence generated by Algorithm 2 (with proper choices of $\eta$ and $\gamma$ such that $T_{\mathrm{CV}}$ is nonexpansive under the norm induced by $M$), under the following conditions:

(i) $f$, $g$, $h^*$ are closed proper convex functions; in addition, $f$ is differentiable and $\nabla f$ is Lipschitz continuous with constant $\beta$;

(ii) $\eta_k \in [\eta_{\min}, \eta_{\max}]$ for certain $0 < \eta_{\max} < \frac{m q_{\min}}{2\tau\sqrt{\kappa q_{\min}} + \kappa}$ and any $0 < \eta_{\min} \le \eta_{\max}$.

Then $(z^k)_{k\ge 0}$ converges to a $Z^*$-valued random variable with probability 1.
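To get a feel for how restrictive condition (ii) is, here is a short evaluation of the step size cap for some hypothetical values of $m$, $q_{\min}$, $\tau$, and $\kappa$ (numbers chosen arbitrarily for illustration):

```python
m, q_min, tau, kappa = 100, 1.0 / 100, 4, 10.0                    # hypothetical values
eta_cap = m * q_min / (2 * tau * (kappa * q_min) ** 0.5 + kappa)  # bound in condition (ii)
print(f"admissible step sizes: 0 < eta_k < {eta_cap:.4f}")        # about 0.0798 here
```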

The proof directly follows [54, Section 3]. Here we only present the key modifications; interested readers are referred to [54] for the complete procedure.

The next lemma shows that the conditional expectation of the distance between $z^{k+1}$ and any $z^* \in \mathrm{Fix}\,T_{\mathrm{CV}} = Z^*$, given $Z^k = \{z^0, z^1, \cdots, z^k\}$, has an upper bound that depends on $Z^k$ and $z^*$ only.

Lemma 3. Let $(z^k)_{k\ge 0}$ be the sequence generated by Algorithm 2. Then for any $z^* \in \mathrm{Fix}\,T_{\mathrm{CV}}$, we have

(84)    $E\big(\|z^{k+1} - z^*\|_M^2 \,\big|\, Z^k\big) \le \|z^k - z^*\|_M^2 + \frac{\sigma}{m}\sum_{d\in J(k)} \|z^d - z^{d+1}\|_M^2 + \frac{1}{m}\Big(\frac{|J(k)|}{\sigma} + \frac{\kappa}{m q_{\min}} - \frac{1}{\eta_k}\Big)\|z^k - \bar z^{k+1}\|_M^2,$

where $E(\cdot \,|\, Z^k)$ denotes conditional expectation on $Z^k$ and $\sigma > 0$ (to be optimized later).


Proof. We have

(85)    $\begin{aligned} E\big(\|z^{k+1}-z^*\|_M^2 \,\big|\, Z^k\big) &\overset{(78)}{=} E\Big(\big\|z^k - \tfrac{\eta_k}{m q_{i_k}} S_{i_k}\hat z^k - z^*\big\|_M^2 \,\Big|\, Z^k\Big) \\ &= \|z^k - z^*\|_M^2 + E\Big(\tfrac{2\eta_k}{m q_{i_k}}\big\langle S_{i_k}\hat z^k,\, z^* - z^k\big\rangle_M + \tfrac{\eta_k^2}{m^2 q_{i_k}^2}\|S_{i_k}\hat z^k\|_M^2 \,\Big|\, Z^k\Big) \\ &= \|z^k - z^*\|_M^2 + \tfrac{2\eta_k}{m}\sum_{i=1}^m \big\langle S_i\hat z^k,\, z^* - z^k\big\rangle_M + \tfrac{\eta_k^2}{m^2}\sum_{i=1}^m \tfrac{1}{q_i}\|S_i\hat z^k\|_M^2 \\ &= \|z^k - z^*\|_M^2 + \tfrac{2\eta_k}{m}\big\langle S\hat z^k,\, z^* - z^k\big\rangle_M + \tfrac{\eta_k^2}{m^2}\sum_{i=1}^m \tfrac{1}{q_i}\|S_i\hat z^k\|_M^2, \end{aligned}$

where the third equality holds because the probability of choosing $i$ is $q_i$, and the last equality uses (79).

Note that

(86)    $\sum_{i=1}^m \tfrac{1}{q_i}\|S_i\hat z^k\|_M^2 \le \tfrac{1}{q_{\min}}\sum_{i=1}^m \|S_i\hat z^k\|_M^2 \overset{(80)}{\le} \tfrac{\kappa}{q_{\min}}\|S\hat z^k\|_M^2 \overset{(82)}{=} \tfrac{\kappa}{\eta_k^2 q_{\min}}\|z^k - \bar z^{k+1}\|_M^2,$

and

(87)    $\begin{aligned} \langle S\hat z^k,\, z^* - z^k\rangle_M &= \big\langle S\hat z^k,\, z^* - \hat z^k + \textstyle\sum_{d\in J(k)}(z^d - z^{d+1})\big\rangle_M \\ &\overset{(82)}{=} \langle S\hat z^k,\, z^* - \hat z^k\rangle_M + \tfrac{1}{\eta_k}\textstyle\sum_{d\in J(k)}\langle z^k - \bar z^{k+1},\, z^d - z^{d+1}\rangle_M \\ &\le \langle S\hat z^k - Sz^*,\, z^* - \hat z^k\rangle_M + \tfrac{1}{2\eta_k}\textstyle\sum_{d\in J(k)}\big(\tfrac{1}{\sigma}\|z^k - \bar z^{k+1}\|_M^2 + \sigma\|z^d - z^{d+1}\|_M^2\big) \\ &\overset{(83)}{\le} -\tfrac{1}{2}\|S\hat z^k\|_M^2 + \tfrac{1}{2\eta_k}\textstyle\sum_{d\in J(k)}\big(\tfrac{1}{\sigma}\|z^k - \bar z^{k+1}\|_M^2 + \sigma\|z^d - z^{d+1}\|_M^2\big) \\ &\overset{(82)}{=} -\tfrac{1}{2\eta_k^2}\|z^k - \bar z^{k+1}\|_M^2 + \tfrac{|J(k)|}{2\sigma\eta_k}\|z^k - \bar z^{k+1}\|_M^2 + \tfrac{\sigma}{2\eta_k}\textstyle\sum_{d\in J(k)}\|z^d - z^{d+1}\|_M^2, \end{aligned}$

where the first equality uses (77), and the first inequality follows from Young's inequality together with $Sz^* = 0$ (since $z^*$ is a fixed point of $T_{\mathrm{CV}}$). Plugging (86) and (87) into (85) gives the desired result.
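The Young's inequality step above carries the free parameter $\sigma$, which is the quantity optimized at the end of the proof of Theorem 3 below. A two-line numerical illustration of this inequality in the $M$-inner product (a sketch only, with arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 6
B = rng.standard_normal((n, n)); M = B @ B.T + n * np.eye(n)   # a random M > 0
a, b = rng.standard_normal(n), rng.standard_normal(n)
for sigma in (0.1, 1.0, 10.0):                                 # the free parameter
    lhs = float(a @ M @ b)                                     # <a, b>_M
    rhs = float(a @ M @ a) / (2 * sigma) + sigma * float(b @ M @ b) / 2
    assert lhs <= rhs + 1e-10                                  # Young's inequality in the M-metric
```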

Let $F^{\tau+1} = \prod_{i=0}^{\tau} F$ be a product space and $\langle\cdot\,|\,\cdot\rangle$ be the induced inner product:

$\langle (z^0, \ldots, z^\tau)\,|\,(\tilde z^0, \ldots, \tilde z^\tau)\rangle = \sum_{i=0}^{\tau} \langle z^i, \tilde z^i\rangle_M, \quad \forall\, (z^0, \ldots, z^\tau),\, (\tilde z^0, \ldots, \tilde z^\tau) \in F^{\tau+1}.$


Define a $(\tau + 1)\times(\tau + 1)$ matrix $U'$ by

$U' := \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} + \sqrt{\frac{q_{\min}}{\kappa}} \begin{bmatrix} \tau & -\tau & & & & \\ -\tau & 2\tau-1 & 1-\tau & & & \\ & 1-\tau & 2\tau-3 & 2-\tau & & \\ & & \ddots & \ddots & \ddots & \\ & & & -2 & 3 & -1 \\ & & & & -1 & 1 \end{bmatrix},$

and let $U = U' \otimes I_F$, where $\otimes$ represents the Kronecker product. For a given $(y^0, \cdots, y^\tau) \in F^{\tau+1}$, $(z^0, \cdots, z^\tau) = U(y^0, \cdots, y^\tau)$ is given by:

$z^0 = y^0 + \tau\sqrt{\tfrac{q_{\min}}{\kappa}}\,(y^0 - y^1),$

$z^i = \sqrt{\tfrac{q_{\min}}{\kappa}}\,\big((i-\tau-1)y^{i-1} + (2\tau-2i+1)y^i + (i-\tau)y^{i+1}\big), \quad \text{if } 1 \le i \le \tau-1,$

$z^\tau = \sqrt{\tfrac{q_{\min}}{\kappa}}\,(y^\tau - y^{\tau-1}).$

Then $U$ is a self-adjoint and positive definite linear operator, since $U'$ is symmetric and positive definite, and we define $\langle\cdot\,|\,\cdot\rangle_U = \langle\cdot\,|\,U\cdot\rangle$ as the $U$-weighted inner product and $\|\cdot\|_U$ the induced norm.
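The sketch below (not from the paper; the helper name is ours) assembles $U'$ for a given $\tau$, using the observation that the tridiagonal part can be written as $\sum_{j=0}^{\tau-1}(\tau-j)(e_j-e_{j+1})(e_j-e_{j+1})^\top$ in zero-based indexing, and confirms numerically that $U'$ is symmetric positive definite.

```python
import numpy as np

def build_U_prime(tau, qmin_over_kappa):
    """U' = e_0 e_0^T + sqrt(q_min/kappa) * D, with D the tridiagonal matrix displayed above."""
    D = np.zeros((tau + 1, tau + 1))
    for j in range(tau):                      # D = sum_j (tau - j)(e_j - e_{j+1})(e_j - e_{j+1})^T
        e = np.zeros(tau + 1); e[j], e[j + 1] = 1.0, -1.0
        D += (tau - j) * np.outer(e, e)
    U = np.sqrt(qmin_over_kappa) * D
    U[0, 0] += 1.0
    return U

Up = build_U_prime(tau=4, qmin_over_kappa=0.01)
assert np.allclose(Up, Up.T)                  # symmetric
assert np.all(np.linalg.eigvalsh(Up) > 0)     # positive definite
```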

Let

$\mathbf{z}_k = (z^k, z^{k-1}, \ldots, z^{k-\tau}) \in F^{\tau+1}, \quad k \ge 0, \qquad \mathbf{z}^* = (z^*, \ldots, z^*) \in \mathbf{Z}^* \subseteq F^{\tau+1},$

where $z^k = z^0$ for $k < 0$. With

(88)    $\xi_k(z^*) := \|\mathbf{z}_k - \mathbf{z}^*\|_U^2 = \|z^k - z^*\|_M^2 + \sqrt{\tfrac{q_{\min}}{\kappa}}\sum_{i=k-\tau}^{k-1}\big(i-(k-\tau)+1\big)\|z^i - z^{i+1}\|_M^2,$

we have the following fundamental inequality:

Theorem 3 (fundamental inequality). Let $(z^k)_{k\ge 0}$ be the sequence generated by Algorithm 2. Then for any $z^* \in Z^*$, it holds that

$E\big(\xi_{k+1}(z^*)\,\big|\,Z^k\big) \le \xi_k(z^*) + \frac{1}{m}\Big(\frac{2\tau\sqrt{\kappa}}{m\sqrt{q_{\min}}} + \frac{\kappa}{m q_{\min}} - \frac{1}{\eta_k}\Big)\|\bar z^{k+1} - z^k\|_M^2.$


Proof. Let $\sigma = m\sqrt{\frac{q_{\min}}{\kappa}}$. We have

$\begin{aligned} E\big(\xi_{k+1}(z^*)\,\big|\,Z^k\big) &\overset{(88)}{=} E\big(\|z^{k+1}-z^*\|_M^2\,\big|\,Z^k\big) + \sigma\sum_{i=k+1-\tau}^{k}\tfrac{i-(k-\tau)}{m}\, E\big(\|z^i - z^{i+1}\|_M^2\,\big|\,Z^k\big) \\ &\overset{(78)}{=} E\big(\|z^{k+1}-z^*\|_M^2\,\big|\,Z^k\big) + \tfrac{\sigma\tau}{m}\, E\Big(\tfrac{\eta_k^2}{m^2 q_{i_k}^2}\|S_{i_k}\hat z^k\|_M^2\,\Big|\,Z^k\Big) + \sigma\sum_{i=k+1-\tau}^{k-1}\tfrac{i-(k-\tau)}{m}\|z^i - z^{i+1}\|_M^2 \\ &\le E\big(\|z^{k+1}-z^*\|_M^2\,\big|\,Z^k\big) + \tfrac{\sigma\tau\kappa}{m^3 q_{\min}}\|z^k - \bar z^{k+1}\|_M^2 + \sigma\sum_{i=k+1-\tau}^{k-1}\tfrac{i-(k-\tau)}{m}\|z^i - z^{i+1}\|_M^2 \\ &\overset{(84)}{\le} \|z^k - z^*\|_M^2 + \tfrac{1}{m}\Big(\tfrac{|J(k)|}{\sigma} + \tfrac{\sigma\tau\kappa}{m^2 q_{\min}} + \tfrac{\kappa}{m q_{\min}} - \tfrac{1}{\eta_k}\Big)\|z^k - \bar z^{k+1}\|_M^2 \\ &\qquad + \tfrac{\sigma}{m}\sum_{d\in J(k)}\|z^d - z^{d+1}\|_M^2 + \sigma\sum_{i=k+1-\tau}^{k-1}\tfrac{i-(k-\tau)}{m}\|z^i - z^{i+1}\|_M^2 \\ &\le \|z^k - z^*\|_M^2 + \tfrac{1}{m}\Big(\tfrac{\tau}{\sigma} + \tfrac{\sigma\tau\kappa}{m^2 q_{\min}} + \tfrac{\kappa}{m q_{\min}} - \tfrac{1}{\eta_k}\Big)\|z^k - \bar z^{k+1}\|_M^2 \\ &\qquad + \tfrac{\sigma}{m}\sum_{i=k-\tau}^{k-1}\|z^i - z^{i+1}\|_M^2 + \sigma\sum_{i=k+1-\tau}^{k-1}\tfrac{i-(k-\tau)}{m}\|z^i - z^{i+1}\|_M^2 \\ &\overset{(88)}{=} \xi_k(z^*) + \tfrac{1}{m}\Big(\tfrac{2\tau\sqrt{\kappa}}{m\sqrt{q_{\min}}} + \tfrac{\kappa}{m q_{\min}} - \tfrac{1}{\eta_k}\Big)\|z^k - \bar z^{k+1}\|_M^2. \end{aligned}$

The first inequality follows from the computation of the conditional expectation on $Z^k$ and (86), the third inequality holds because $J(k) \subseteq \{k-1, k-2, \cdots, k-\tau\}$, and the last equality uses $\sigma = m\sqrt{\frac{q_{\min}}{\kappa}}$, which minimizes $\frac{\tau}{\sigma} + \frac{\sigma\tau\kappa}{m^2 q_{\min}}$ over $\sigma > 0$. Hence, the desired inequality holds.
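As a sanity check on the identity in (88), which both $\xi_k$ and $\xi_{k+1}$ rely on above, the following sketch (arbitrary dimensions and data, not from the paper) rebuilds $U'$, evaluates $\|\mathbf{z}_k - \mathbf{z}^*\|_U^2$ directly through the $U$-weighted inner product, and compares it with the expanded weighted sum on the right-hand side of (88).

```python
import numpy as np

rng = np.random.default_rng(6)
tau, n, w = 4, 5, 0.1                                   # w plays the role of sqrt(q_min/kappa)
B = rng.standard_normal((n, n)); M = B @ B.T + n * np.eye(n)      # a random M > 0

Up = np.zeros((tau + 1, tau + 1)); Up[0, 0] = 1.0       # rebuild U' as in the earlier sketch
for j in range(tau):
    e = np.zeros(tau + 1); e[j], e[j + 1] = 1.0, -1.0
    Up += w * (tau - j) * np.outer(e, e)

zs = [rng.standard_normal(n) for _ in range(tau + 1)]   # z^k, z^{k-1}, ..., z^{k-tau}
z_star = rng.standard_normal(n)
Y = np.array([z - z_star for z in zs])                  # rows: y_i = z^{k-i} - z^*

lhs = float(np.sum(Up * (Y @ M @ Y.T)))                 # ||z_k - z^*||_U^2 with U = U' (x) I_F
rhs = float(Y[0] @ M @ Y[0]) + w * sum(
    (tau - j + 1) * float((Y[j] - Y[j - 1]) @ M @ (Y[j] - Y[j - 1]))
    for j in range(1, tau + 1))                         # weighted sum on the right of (88)
assert np.isclose(lhs, rhs)
```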

Zhimin Peng
PO Box 951555
UCLA Math Department
Los Angeles, CA 90095
E-mail address: [email protected]

Tianyu Wu
PO Box 951555
UCLA Math Department
Los Angeles, CA 90095
E-mail address: [email protected]

Yangyang Xu
207 Church St SE
University of Minnesota, Twin Cities
Minneapolis, MN 55455
E-mail address: [email protected]


Ming Yan
Department of Computational Mathematics, Science and Engineering
Department of Mathematics
Michigan State University
East Lansing, MI 48824
E-mail address: [email protected]

Wotao Yin
PO Box 951555
UCLA Math Department
Los Angeles, CA 90095
E-mail address: [email protected]

Received January 1, 2016