
Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks

    Zhishuai Guo 1 Mingrui Liu 1 Zhuoning Yuan 1 Li Shen 2 Wei Liu 2 Tianbao Yang 1

    Abstract

In this paper, we study distributed algorithms for large-scale AUC maximization with a deep neural network as the predictive model. Although distributed learning techniques have been investigated extensively in deep learning, they are not directly applicable to stochastic AUC maximization with deep neural networks due to its striking differences from standard loss minimization problems (e.g., cross-entropy). Towards addressing this challenge, we propose and analyze a communication-efficient distributed optimization algorithm based on a non-convex concave reformulation of the AUC maximization, in which the communication of both the primal variable and the dual variable between each worker and the parameter server only occurs after multiple steps of gradient-based updates in each worker. Compared with the naive parallel version of an existing algorithm that computes stochastic gradients at individual machines and averages them for updating the model parameter, our algorithm requires far fewer communication rounds and still achieves a linear speedup in theory. To the best of our knowledge, this is the first work that solves the non-convex concave min-max problem for AUC maximization with deep neural networks in a communication-efficient distributed manner while still maintaining the linear speedup property in theory. Our experiments on several benchmark datasets show the effectiveness of our algorithm and also confirm our theory.

1 Department of Computer Science, University of Iowa, Iowa City, IA 52242, USA  2 Tencent AI Lab, Shenzhen, China. Correspondence to: Tianbao Yang.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

Large-scale distributed deep learning (Dean et al., 2012; Li et al., 2014) has achieved tremendous successes in various domains, including computer vision (Goyal et al., 2017), natural language processing (Devlin et al., 2018; Yang et al., 2019), generative modeling (Brock et al., 2018), reinforcement learning (Silver et al., 2016; 2017), etc. From the perspective of learning theory and optimization, most of them aim to minimize a surrogate loss of a specific error measure using parallel minibatch stochastic gradient descent (SGD). For example, in the image classification task, the surrogate loss is usually the cross entropy between the estimated probability distribution given by the output of the neural network and the vector encoding the ground-truth label (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016), which is a surrogate loss of the misclassification rate. Based on the surrogate loss, parallel minibatch SGD (Goyal et al., 2017) is employed to update the model parameters.

However, when the data for classification is imbalanced, AUC (short for Area Under the ROC Curve) is a more suitable measure (Elkan, 2001). AUC is defined as the probability that a positive sample receives a higher score than a negative sample (Hanley & McNeil, 1982; 1983). Despite the tremendous applications of distributed deep learning in different fields, studies on optimizing AUC with distributed deep learning technologies are rare. The commonly used parallel mini-batch SGD for minimizing a surrogate loss of AUC suffers from high communication costs in a distributed setting due to the non-decomposable nature of the AUC measure: the positive-negative data pairs that define a surrogate loss for AUC may sit on different machines. To the best of our knowledge, Liu et al. (2020b) is the only work that optimizes a surrogate loss of AUC with a deep neural network while explicitly tackling the non-decomposability of the AUC measure. Nevertheless, their algorithms are designed only for the single-machine setting and hence are far from sufficient when encountering a huge amount of data. Although a naive parallel version of the stochastic algorithms proposed in (Liu et al., 2020b) can be used for distributed AUC maximization with a deep neural network, it would still suffer from high communication overhead due to a large number of communication rounds.

In this paper, we bridge the gap between stochastic AUC maximization and distributed deep learning by proposing a communication-efficient distributed algorithm for stochastic AUC maximization with a deep neural network. The focus is to make the total number of communication rounds much smaller than the total number of iterative updates. We build our algorithm upon the nonconvex-concave min-max reformulation of the original problem. The key ingredient is to design a communication-efficient distributed algorithm for solving the regularized min-max subproblems using multiple machines. Specifically, we follow the proximal primal-dual algorithmic framework proposed by (Rafique et al., 2018; Liu et al., 2020b), i.e., we successively solve a sequence of quadratically regularized min-max saddle-point problems with periodically updated regularizers. The key difference is that the inner min-max problem solver is built on a distributed periodic model averaging technique, which consists of a fixed number of stochastic primal-dual updates on individual machines and a small number of averaging steps of the model parameters across machines. This mechanism can greatly reduce the communication cost and is similar in spirit to (Zhou & Cong, 2017; Stich, 2018; Yu et al., 2019b). However, their analyses cannot be applied to our case since they only cover convex or non-convex minimization problems. In contrast, our algorithm is designed for a particular non-convex concave min-max problem induced by the original AUC maximization problem. Our contributions are summarized as follows:

• We propose a communication-efficient distributed stochastic algorithm named CoDA for solving a nonconvex-concave min-max reformulation of AUC maximization with deep neural networks by local primal-dual updating and periodic global variable averaging. To our knowledge, this is the first communication-efficient distributed stochastic algorithm for learning a deep neural network by AUC maximization.

• We analyze the iteration complexity and communication complexity of the proposed algorithm under the commonly used Polyak-Łojasiewicz (PL) condition as in (Liu et al., 2020b). Compared with (Liu et al., 2020b), our theoretical result shows that the iteration complexity can be reduced by a factor of K (the number of machines) in a certain regime, while the communication complexity (the number of communication rounds) is much smaller than that of a naive distributed version of the stochastic algorithm proposed in (Liu et al., 2020b). The iteration and communication complexities are summarized in Table 1.

• We verify our theoretical claims by conducting experiments on several large-scale benchmark datasets. The experimental results show that our algorithm indeed exhibits good speedup performance in practice.

2. Related Work

Stochastic AUC Maximization. It is challenging to directly solve stochastic AUC maximization in the online learning setting since the objective function of AUC maximization depends on a sum of pairwise losses between samples from the positive and negative classes. Zhao et al. (2011) addressed this problem by maintaining a buffer to store representative data samples, employing the reservoir sampling technique to update the buffer, calculating gradient information based on the data in the buffer, and then performing a gradient-based update to the classifier. Gao et al. (2013) did not maintain a buffer; they instead maintained first-order and second-order statistics of the received data to update the classifier with gradient-based updates. Both of them are infeasible in big data scenarios since (Zhao et al., 2011) suffers from a large amount of training data and (Gao et al., 2013) is not suitable for high dimensional data. Ying et al. (2016) addressed these issues by introducing a min-max reformulation of the original problem and solving it by a primal-dual stochastic gradient method (Nemirovski et al., 2009), in which no buffer is needed and the per-iteration complexity is of the same order as the dimension of the feature vector. Natole et al. (2018) improved the convergence rate by adding a strongly convex regularizer to the original formulation. Based on the same saddle-point formulation as in (Ying et al., 2016), Liu et al. (2018) obtained an improved convergence rate by developing a multi-stage algorithm without adding the strongly convex regularizer. However, all of these studies focus on learning a linear model. Recently, Liu et al. (2020b) considered stochastic AUC maximization for learning a deep non-linear model, in which they designed a proximal primal-dual gradient-based algorithm under the PL condition and established non-asymptotic convergence results.

Communication Efficient Algorithms. There are multiple approaches for reducing the communication cost in distributed optimization, including skipping communication and compression techniques. Due to the limit of space, we mainly review the literature on skipping communication. For compression techniques, we refer the readers to (Jiang & Agrawal, 2018; Stich et al., 2018; Basu et al., 2019; Wangni et al., 2018; Bernstein et al., 2018) and the references therein. Skipping communication is realized by doing multiple local gradient-based updates on each worker before aggregating the local model parameters. One special case is the so-called one-shot averaging (Zinkevich et al., 2010; McDonald et al., 2010; Zhang et al., 2013), where each machine solves a local optimization problem and the solutions are averaged only at the last iterate. (Zhang et al., 2013; Shamir & Srebro, 2014; Godichon-Baggioni & Saadane, 2017; Jain et al., 2017; Koloskova et al., 2019; Koloskova* et al., 2020) considered one-shot averaging with one pass of the data and established statistical convergence, which is usually not able to guarantee the convergence of the training error. The scheme of local SGD updates on each worker with skipped communication has been analyzed for convex (Stich, 2018; Jaggi et al., 2014) and nonconvex problems (Zhou & Cong, 2017; Jiang & Agrawal, 2018; Wang & Joshi, 2018b; Lin et al., 2018b; Wang & Joshi, 2018a; Yu et al., 2019b;a; Basu et al., 2019; Haddadpour et al., 2019). There are also several empirical studies (Povey et al., 2014; Su & Chen, 2015; McMahan et al., 2016; Chen & Huo, 2016; Lin et al., 2018b; Kamp et al., 2018) showing that this scheme exhibits good empirical performance in distributed deep learning. However, all of these works only consider minimization problems and do not apply to the nonconvex-concave min-max formulation considered in this paper.

Table 1. Summary of Iteration and Communication Complexities, where K is the number of machines and µ ≤ 1. NP-PPD-SG denotes the naive parallel version of PPD-SG, which is also a special case of our algorithm and whose complexities can be derived following our analysis.

Alg.                         Setting       Iteration Compl.    Comm. Compl.
PPD-SG (Liu et al., 2020b)   Single        O(1/(µ²ε))          -
NP-PPD-SG                    Distributed   O(1/(Kµ²ε))         O(1/(Kµ²ε))
CoDA                         Distributed   O(1/(Kµ²ε))         O(1/(µ^{3/2}ε^{1/2}))

Nonconvex Min-max Optimization. Stochastic nonconvex min-max optimization has garnered increasing attention recently (Rafique et al., 2018; Lin et al., 2018a; Sanjabi et al., 2018; Lu et al., 2019; Lin et al., 2019; Jin et al., 2019; Liu et al., 2020a). Rafique et al. (2018) considered the case where the objective function is weakly convex and concave and proposed an algorithm in the spirit of the proximal point method (Rockafellar, 1976), in which a proximal subproblem with a periodically updated reference point is approximately solved by an appropriate stochastic algorithm. They established convergence to a nearly stationary point of the equivalent minimization problem. Under the same setting, Lu et al. (2019) designed a block-based algorithm and showed that it converges to a solution with a small stationarity gap, and Lin et al. (2019) considered solving the problem with vanilla stochastic gradient descent ascent and established its convergence to a stationary point under a smoothness assumption. There are also several papers (Lin et al., 2018a; Sanjabi et al., 2018; Liu et al., 2020a) that address non-convex non-concave min-max problems. Lin et al. (2018a) proposed an inexact proximal point method for solving a class of weakly-convex weakly-concave problems, which was proven to converge to a nearly stationary point. Sanjabi et al. (2018) exploited the PL condition for the inner maximization problem and designed a multi-step alternating optimization algorithm that converges to a stationary point. Liu et al. (2020a) considered solving a class of nonconvex-nonconcave min-max problems by designing an adaptive gradient method and established an adaptive complexity for finding a stationary point. However, none of them is particularly designed for the distributed stochastic AUC maximization problem with a deep neural network.

3. Preliminaries and Notations

The area under the ROC curve (AUC) on a population level for a scoring function h : X → R is defined as

AUC(h) = Pr(h(x) ≥ h(x′) | y = 1, y′ = −1),   (1)

where z = (x, y) and z′ = (x′, y′) are drawn independently from P. By employing the squared loss as the surrogate for the indicator function, as is commonly done in previous studies (Gao et al., 2013; Ying et al., 2016; Liu et al., 2018; 2020b), the deep AUC maximization problem can be formulated as

min_{w ∈ R^d} E_{z,z′}[(1 − h(w; x) + h(w; x′))² | y = 1, y′ = −1],

where h(w; x) denotes the prediction score for a data sample x made by a deep neural network parameterized by w. It was shown in (Ying et al., 2016) that the above problem is equivalent to the following min-max problem:

min_{w ∈ R^d, (a,b) ∈ R²} max_{α ∈ R} f(w, a, b, α) = E_z[F(w, a, b, α; z)],   (2)

where

F(w, a, b, α; z) = (1 − p)(h(w; x) − a)² I_[y=1] + p(h(w; x) − b)² I_[y=−1]
                 + 2(1 + α)(p h(w; x) I_[y=−1] − (1 − p) h(w; x) I_[y=1]) − p(1 − p)α²,

where p = Pr(y = 1) denotes the prior probability that an example belongs to the positive class and I denotes the indicator function. The above min-max reformulation allows us to decompose the expectation over all data into expectations over the data on individual machines.
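Although the text above does not spell it out, it is instructive to note where the auxiliary variables come from. Optimizing F over a, b and α for a fixed w (a standard calculation for this reformulation, added here only for intuition) gives

a*(w) = E[h(w; x) | y = 1],   b*(w) = E[h(w; x) | y = −1],   α*(w) = b*(w) − a*(w),

so the optimal dual variable is the gap between the mean scores of the negative and positive classes; this is exactly the quantity that line 7 of Algorithm 1 in Section 4 estimates from a fresh minibatch on each machine.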

In this paper, we consider the following distributed AUC maximization problem:

min_{w ∈ R^d, (a,b) ∈ R²} max_{α ∈ R} f(w, a, b, α) = (1/K) Σ_{k=1}^K f_k(w, a, b, α),   (3)

where K is the total number of machines, f_k(w, a, b, α) = E_{z_k}[F_k(w, a, b, α; z_k)], z_k = (x_k, y_k) ∼ P_k, P_k is the data distribution on machine k, and F_k(w, a, b, α; z_k) = F(w, a, b, α; z_k). Our goal is to utilize the K machines to jointly solve the optimization problem (3). We emphasize that the k-th machine can only access its own data z_k ∼ P_k. It is notable that our formulation covers both the batch-learning setting and the online learning setting. For the batch-learning setting, P_k represents the empirical distribution of the data on the k-th machine and p denotes the empirical positive ratio over all data. For the online learning setting, P_k = P, ∀k, represents the same population distribution of data and p denotes the positive ratio at the population level.
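For concreteness, the per-sample objective F_k can be written as an ordinary minibatch loss whose automatic-differentiation gradient with respect to (w, a, b) is the stochastic primal gradient used later in Algorithm 2. The following is a minimal PyTorch-style sketch (labels in {+1, −1}); it is illustrative only and not taken from the authors' implementation.

import torch

def auc_minmax_loss(scores, labels, a, b, alpha, p):
    # Minibatch estimate of F(w, a, b, alpha; z) from Eq. (2).
    # scores: h(w; x) for the minibatch, shape (m,); labels take values +1 / -1.
    pos = (labels == 1).float()
    neg = (labels == -1).float()
    per_example = ((1 - p) * (scores - a) ** 2 * pos
                   + p * (scores - b) ** 2 * neg
                   + 2 * (1 + alpha) * (p * scores * neg - (1 - p) * scores * pos)
                   - p * (1 - p) * alpha ** 2)
    return per_example.mean()

Backpropagating through this scalar yields ∇_v F with v = (w, a, b), while the dual gradient is available in closed form: ∂F/∂α = 2(p h(w; x) I_[y=−1] − (1 − p) h(w; x) I_[y=1]) − 2p(1 − p)α.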

Notations. We define the following notations:

v = (w^T, a, b)^T,   φ(v) = max_α f(v, α),   φ_s(v) = φ(v) + (1/(2γ))‖v − v_{s−1}‖²,
v*_φ = arg min_v φ(v),   v*_{φ_s} = arg min_v φ_s(v).

    We make the following assumption throughout this paper.

Assumption 1
(i) There exist v_0 and ∆_0 > 0 such that φ(v_0) − φ(v*_φ) ≤ ∆_0.
(ii) For any x, ‖∇h(w; x)‖ ≤ G_h.
(iii) φ(v) satisfies the µ-PL condition, i.e., µ(φ(v) − φ(v*_φ)) ≤ (1/2)‖∇φ(v)‖²; φ(v) is L_1-smooth, i.e., ‖∇φ(v_1) − ∇φ(v_2)‖ ≤ L_1‖v_1 − v_2‖.
(iv) For any x, h(w; x) is L_h-smooth, and h(w; x) ∈ [0, 1].

Remark: Assumptions (i), (ii), (iii) and the condition h(w; x) ∈ [0, 1] in (iv) are also assumed and justified in (Liu et al., 2020b). L-smoothness of the function h is a standard assumption in the optimization literature. Finally, it should be noted that µ is usually much smaller than 1 (Yuan et al., 2019), which is important for understanding our theoretical result later.

4. Main Result and Theoretical Analysis

In this section, we first describe our algorithm and then present its convergence result, followed by its analysis. For simplicity, we assume that the ratio p of data with a positive label is known. For the batch learning setting, p is indeed the empirical ratio of positive examples. For the online learning setting with an unknown distribution, we can follow the online estimation technique in (Liu et al., 2020b) to do the parameter update.

Algorithm 1 describes the proposed algorithm CoDA for optimizing AUC in a communication-efficient distributed manner. CoDA shares the same algorithmic framework as proposed in (Liu et al., 2020b). In particular, we employ a proximal-point algorithmic scheme that successively solves the following convex-concave problems approximately:

min_v max_α f(v, α) + (1/(2γ))‖v − v_0‖²,   (4)

where γ is an appropriate regularization parameter that makes the regularized function strongly convex and strongly concave. The reference point v_0 is periodically updated after a number of iterations. At the s-th stage our algorithm invokes a communication-efficient algorithm for solving the above strongly-convex strongly-concave subproblem. After obtaining a primal solution v_s at the s-th stage, we sample some data from the individual machines to obtain an estimate of the corresponding dual variable α_s.

Algorithm 1 CoDA
1: Initialization: v_0 = 0 ∈ R^{d+2}, α_0 = 0, γ.
2: for s = 1, ..., S do
3:   v_s = DSG(v_{s−1}, α_{s−1}, η_s, T_s, m_s, I_s, γ)
4:   Each machine k draws a minibatch {z_1^k, ..., z_{m_s}^k} of size m_s and computes:
5:     h_−^k = Σ_{i=1}^{m_s} h(v_s; x_i^k) I_[y_i^k = −1],   N_−^k = Σ_{i=1}^{m_s} I_[y_i^k = −1],
6:     h_+^k = Σ_{i=1}^{m_s} h(v_s; x_i^k) I_[y_i^k = 1],    N_+^k = Σ_{i=1}^{m_s} I_[y_i^k = 1],
7:   α_s = (1/K) Σ_{k=1}^K [h_−^k / N_−^k − h_+^k / N_+^k]   ▷ communicate
8: end for
9: Return v_S.
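Line 7 of Algorithm 1 estimates the dual variable from class-conditional mean scores. The following per-worker sketch is illustrative (not the authors' implementation); the small-count guard is a practical safeguard, whereas the theory chooses m_s large enough that both classes appear in the minibatch with high probability.

import torch

def estimate_alpha_local(scores, labels):
    # Per-worker part of lines 5-7 of Algorithm 1 on a fresh minibatch:
    # returns h_-/N_- - h_+/N_+; the communicated alpha_s is the average of
    # this quantity over the K workers.
    pos = labels == 1
    neg = labels == -1
    h_plus, n_plus = scores[pos].sum(), pos.sum().clamp(min=1)
    h_minus, n_minus = scores[neg].sum(), neg.sum().clamp(min=1)
    return h_minus / n_minus - h_plus / n_plus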

Our new contribution is the communication-efficient distributed algorithm for solving the above strongly-convex strongly-concave subproblems. The algorithm, referred to as DSG, is presented in Algorithm 2. Each machine makes a stochastic proximal-gradient update on the primal variable and a stochastic gradient update on the dual variable at each iteration. After every I iterations, all K machines communicate to compute an average of the local primal solutions v_t^k and the local dual solutions α_t^k. It is not difficult to show that when I = 1, our algorithm reduces to the naive parallel version of the PPD-SG algorithm proposed in (Liu et al., 2020b), i.e., averaging individual primal and dual gradients and then updating the primal-dual variables according to the averaged gradient.¹ Our novel analysis allows us to use I > 1 to skip communications, leading to far fewer communication rounds. The intuition is that, as long as the step size η_s is sufficiently small, we can control the distance between the individual solutions (v_t^k, α_t^k) and their global averages, which allows us to control the error term caused by the discrepancy between individual machines. We will provide more explanations as we present the analysis.

¹A tiny difference is that we use a proximal gradient update to handle the regularizer (1/(2γ))‖v − v_0‖², while they directly use the gradient update. Using the proximal gradient update allows us to remove the assumption that ‖v_t^k − v_0‖ is upper bounded.

Algorithm 2 DSG(v_0, α_0, η, T, I, γ)
Each machine does initialization: v_0^k = v_0, α_0^k = α_0
for t = 0, 1, ..., T − 1 do
  Each machine k updates its local solution in parallel:
    v_{t+1}^k = arg min_v [∇_v F_k(v_t^k, α_t^k; z_t^k)^T v + (1/(2η))‖v − v_t^k‖² + (1/(2γ))‖v − v_0‖²],
    α_{t+1}^k = α_t^k + η ∇_α F_k(v_t^k, α_t^k; z_t^k),
  if (t + 1) mod I = 0 then
    v_{t+1}^k = (1/K) Σ_{k=1}^K v_{t+1}^k,   ▷ communicate
    α_{t+1}^k = (1/K) Σ_{k=1}^K α_{t+1}^k,   ▷ communicate
  end if
end for
Return ṽ = (1/K) Σ_{k=1}^K (1/T) Σ_{t=1}^T v_t^k.
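The primal step of Algorithm 2 is a proximal-gradient update with a closed-form minimizer, v_{t+1}^k = (v_t^k/η + v_0/γ − g)/(1/η + 1/γ) for a stochastic gradient g. The sketch below shows one worker's inner loop with periodic averaging; the gradient oracles and the all-reduce are assumed placeholders (not from the paper's code), and the no-op average keeps the sketch self-contained.

import torch

def average_across_workers(x):
    # Placeholder for the communication step: a real implementation would
    # perform an all-reduce mean over the K workers (e.g., via torch.distributed).
    return x

def dsg_worker(v0, alpha0, grad_v, grad_alpha, eta, gamma, T, I):
    # grad_v(v, alpha) and grad_alpha(v, alpha) stand for the stochastic
    # gradients of F_k evaluated at a freshly sampled local example z_t^k.
    v, alpha = v0.clone(), float(alpha0)
    v_sum = torch.zeros_like(v0)
    for t in range(T):
        gv = grad_v(v, alpha)
        ga = grad_alpha(v, alpha)
        # Closed-form proximal-gradient step on v:
        #   argmin_v <gv, v> + ||v - v_t||^2/(2*eta) + ||v - v_0||^2/(2*gamma).
        v = (v / eta + v0 / gamma - gv) / (1.0 / eta + 1.0 / gamma)
        # Stochastic gradient ascent step on the dual variable.
        alpha = alpha + eta * ga
        # Periodic averaging of both primal and dual variables.
        if (t + 1) % I == 0:
            v = average_across_workers(v)
            alpha = average_across_workers(alpha)
        v_sum += v
    # Local time average; the returned v_tilde in Algorithm 2 additionally
    # averages this quantity over the K workers.
    return v_sum / T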

Below, we present the main theoretical result of CoDA. Note that in the following presentation, L_v, H, B, σ_v, σ_α are appropriate constants whose values are given in the proofs of Lemma 1 and Lemma 2 in the supplement.

Theorem 1 Set γ = 1/(2L_v), c = (µ/L_v)/(5 + µ/L_v), η_s = η_0 K exp(−(s − 1)c) ≤ O(1), T_s = (max(8, 8G_h²)/(L_v η_0 K)) exp((s − 1)c), I_s = max(1, 1/√(K η_s)), and m_s = max((1 + C η_{s+1}² T_{s+1})/(p²(1 − p)²), log(K)/log(1/p̃)), where C = 3p̃^{1/ln(1/p̃)}/(2 ln(1/p̃)) and p̃ = max(p, 1 − p). To return v_S such that E[φ(v_S) − φ(v*_φ)] ≤ ε, it suffices to choose

S ≥ ((5L_v + µ)/µ) max{ log(2∆_0/ε), log S + log[(2/(η_0 ε)) · (6HB² + 12(σ_v² + σ_α²))/5] }.

As a result, the number of iterations is at most T = Õ(max(∆_0/(µεη_0K), L_v/(µ²Kε))) and the number of communications is at most Õ(max(K/µ + ∆_0^{1/2}/(µ(η_0ε)^{1/2}), K/µ + L_v^{1/2}/(µ^{3/2}ε^{1/2}))), where Õ suppresses logarithmic factors and H, B, σ_v, σ_α are appropriate constants.

    We have the following remarks about Theorem 1.

• First, we can see that the step size η_s is reduced geometrically in a stagewise manner. This is due to the PL condition. We note that a stagewise geometrically decreasing step size is commonly used in practice in deep learning (Yuan et al., 2019). Second, by setting η_0 = O(1/K) we have I_s = Θ((1/√K) exp((s − 1)c/2)). This means two things: (i) the larger the number of machines, the smaller the value of I_s, i.e., the more frequently the machines need to communicate. This is reasonable since more machines create a larger discrepancy between the data on different machines; (ii) the value of I_s can be increased geometrically across stages. This is because the step size η_s is reduced geometrically, which causes one step of primal and dual updates on individual machines to diverge less from their averaged solutions. As a result, more communications can be skipped.

• Second, we can see that when K ≤ Θ(1/µ), the total iteration complexity is Õ(1/(µ²Kε)). Compared with the iteration complexity Õ(1/(µ²ε)) of the PPD-SG algorithm proposed in (Liu et al., 2020b), the proposed algorithm CoDA enjoys an iteration complexity that is reduced by a factor of K. This means that up to a certain large threshold Θ(1/µ) on the number K of machines, CoDA enjoys a linear speedup.

• Finally, let us compare CoDA with the naive parallel version of PPD-SG, which is CoDA with I = 1. Our analysis of the iteration complexity still applies in this case, and it is not difficult to show that the iteration complexity of the naive parallel version of PPD-SG is Õ(1/(µ²Kε)) when K ≤ 1/µ. As a result, its communication complexity is also Õ(1/(µ²Kε)). In contrast, CoDA's communication complexity is Õ(1/(µ^{3/2}ε^{1/2})) when K ≤ 1/µ ≤ 1/(µ^{1/2}ε^{1/2}) ≤ 1/ε according to Theorem 1.² Hence, our algorithm is more communication efficient, i.e., Õ(1/(µ^{3/2}ε^{1/2})) ≤ Õ(1/(µ²Kε)) when K ≤ 1/µ.³ This means that up to a certain large threshold Θ(1/µ) on the number K of machines, CoDA has a smaller communication complexity than the naive parallel version of PPD-SG.

²Assume ε is set to be smaller than µ.
³Indeed, K can be as large as 1/(µ^{1/2}ε^{1/2}) for CoDA to remain more communication-efficient.

    4.1. Analysis

Below, we present a sketch of the proof of Theorem 1 by providing some key lemmas. We first derive some useful properties regarding the random function F_k(v, α; z).

Lemma 1 Suppose that Assumption 1 holds and η ≤ min(1/(2p(1 − p)), 1/(2(1 − p)), 1/(2p)). Then there exist constants L_2, B_α, B_v, σ_v, σ_α such that

‖∇_v F_k(v_1, α; z) − ∇_v F_k(v_2, α; z)‖ ≤ L_2‖v_1 − v_2‖,
‖∇_v F_k(v, α; z)‖² ≤ B_v²,   |∇_α F_k(v, α; z)|² ≤ B_α²,
E[‖∇_v f_k(v, α) − ∇_v F_k(v, α; z)‖²] ≤ σ_v²,
E[|∇_α f_k(v, α) − ∇_α F_k(v, α; z)|²] ≤ σ_α².

Remark: We include the proofs of these properties in the Appendix. In the following, we denote B² = max(B_v², B_α²) and L_v = max(L_1, L_2).

Next, we introduce a key lemma, which is of vital importance for establishing the upper bound on the objective gap of the regularized subproblem.

Lemma 2 (One call of Algorithm 2) Let ψ(v) = max_α f(v, α) + (1/(2γ))‖v − v_0‖², let ṽ be the output of Algorithm 2, and let v*_ψ = arg min_v ψ(v), α*(ṽ) = arg max_α f(ṽ, α) + (1/(2γ))‖ṽ − v_0‖². By running Algorithm 2 with given input v_0, α_0 for T iterations, γ = 1/(2L_v), and η ≤ min{1/(L_v + 3G_α²/µ_α), 1/(L_α + 3G_v²/L_v), 3/(2µ_α), 1/(2p(1 − p)), 1/(2(1 − p)), 1/(2p)}, we have

E[ψ(ṽ) − min_v ψ(v)] ≤ (2‖v_0 − v*_ψ‖² + E[(α_0 − α*(ṽ))²])/(ηT) + Hη²I²B² I_[I>1] + η(2σ_v² + 3σ_α²)/(2K),

where µ_α = 2p(1 − p), L_α = 2p(1 − p), G_α = 2 max{p, 1 − p}, G_v = 2 max{p, 1 − p}G_h, and H = 6G_v²/µ_α + 6L_v + 6G_α²/L_v + 6L_α²/µ_α.

Remark: The above result is similar to Lemma 2 in (Liu et al., 2020b). The key difference lies in the second and third terms of the upper bound. The second term arises because of the discrepancy of updates between individual machines. The third term reflects the variance reduction obtained by using multiple machines, which is the key to establishing the linear speedup. It is easy to see that by setting I = 1/√(ηK), the second term and the third term have the same order. With the above lemma, the proof of Theorem 1 follows an analysis similar to that in (Liu et al., 2020b).
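As a quick check of the last claim (added here, not part of the original text): substituting I = 1/√(ηK) into the second term gives Hη²I²B² = HηB²/K, which is of the same order η/K as the third term η(2σ_v² + 3σ_α²)/(2K).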

Sketch of the Proof of Lemma 2. Below, we present a roadmap for the proof of the key Lemma 2. The main idea is to first bound the objective gap of the subproblem in Lemma 3 and then to bound every term on the RHS of Lemma 3 appropriately, which is realized by Lemma 4, Lemma 5 and Lemma 6. All detailed proofs of the lemmas can be found in the Appendix.

Lemma 3 Define v̄_t = (1/K) Σ_{k=1}^K v_t^k and ᾱ_t = (1/K) Σ_{k=1}^K α_t^k. Suppose Assumption 1 holds. By running Algorithm 2, we have

ψ(ṽ) − min_v ψ(v)
≤ (1/T) Σ_{t=1}^T [ ⟨∇_v f(v̄_{t−1}, ᾱ_{t−1}), v̄_t − v*_ψ⟩ + 2L_v⟨v̄_t − v_0, v̄_t − v*_ψ⟩   (=: A_1)
  + ⟨∇_α f(v̄_{t−1}, ᾱ_{t−1}), α* − ᾱ_t⟩   (=: A_2)
  + ((L_v + 3G_α²/µ_α)/2)‖v̄_t − v̄_{t−1}‖² + ((L_α + 3G_v²/L_v)/2)(ᾱ_t − ᾱ_{t−1})²   (=: A_3)
  + (2L_v/3)‖v̄_{t−1} − v*_ψ‖² − L_v‖v̄_t − v*_ψ‖² − (µ_α/3)(ᾱ_{t−1} − α*)² ].

Next, we bound A_1 and A_2 in Lemma 4 and Lemma 5, respectively. A_3 can be cancelled against similar terms in the following two lemmas. The remaining terms are left to form a telescoping sum with other similar terms in the following two lemmas.

Lemma 4 Define v̂_t = arg min_v ((1/K) Σ_{k=1}^K ∇_v f(v_{t−1}^k, α_{t−1}^k))^T v + (1/(2η))‖v − v̄_{t−1}‖² + (1/(2γ))‖v − v_0‖². We have

A_1 ≤ (3G_α²/(2L_v)) (1/K) Σ_{k=1}^K (ᾱ_{t−1} − α_{t−1}^k)² + (3L_v/2) (1/K) Σ_{k=1}^K ‖v̄_{t−1} − v_{t−1}^k‖²
  + η ‖(1/K) Σ_{k=1}^K [∇_v f_k(v_{t−1}^k, α_{t−1}^k) − ∇_v F_k(v_{t−1}^k, α_{t−1}^k; z_{t−1}^k)]‖²
  + (1/K) Σ_{k=1}^K ⟨∇_v f_k(v_{t−1}^k, α_{t−1}^k) − ∇_v F_k(v_{t−1}^k, α_{t−1}^k; z_{t−1}^k), v̂_t − v*_ψ⟩
  + (1/(2η))(‖v̄_{t−1} − v*_ψ‖² − ‖v̄_{t−1} − v̄_t‖² − ‖v̄_t − v*_ψ‖²) + (L_v/3)‖v̄_t − v*_ψ‖².

Lemma 5 Define α̂_t = ᾱ_{t−1} + (η/K) Σ_{k=1}^K ∇_α f_k(v_{t−1}^k, α_{t−1}^k), and

α̃_t = α̃_{t−1} + (η/K) Σ_{k=1}^K (∇_α F_k(v_{t−1}^k, α_{t−1}^k; z_{t−1}^k) − ∇_α f_k(v_{t−1}^k, α_{t−1}^k)).

We have

A_2 ≤ (3G_v²/(2µ_α)) (1/K) Σ_{k=1}^K ‖v̄_{t−1} − v_{t−1}^k‖² + (3L_α²/(2µ_α)) (1/K) Σ_{k=1}^K (ᾱ_{t−1} − α_{t−1}^k)²
  + (3η/2) ((1/K) Σ_{k=1}^K [∇_α f_k(v_{t−1}^k, α_{t−1}^k) − ∇_α F_k(v_{t−1}^k, α_{t−1}^k; z_{t−1}^k)])²
  + (1/K) Σ_{k=1}^K ⟨∇_α f_k(v_{t−1}^k, α_{t−1}^k) − ∇_α F_k(v_{t−1}^k, α_{t−1}^k; z_{t−1}^k), α̃_{t−1} − α̂_t⟩
  + (1/(2η))((ᾱ_{t−1} − α*(ṽ))² − (ᾱ_{t−1} − ᾱ_t)² − (ᾱ_t − α*(ṽ))²)
  + (µ_α/3)(ᾱ_t − α*(ṽ))² + (1/(2η))((α*(ṽ) − α̃_t)² − (α*(ṽ) − α̃_{t+1})²).

The first two terms in the upper bounds of A_1 and A_2 measure the differences between the individual solutions and their averages, the third term is the variance of the stochastic gradient, and the fourth term vanishes in expectation. The lemma below bounds the difference between the averaged solution and the individual solutions.

Lemma 6 If the K machines communicate every I iterations and update with step size η, then

(1/K) Σ_{k=1}^K E[‖v̄_t − v_t^k‖²] ≤ 4η²I²B_v² I_[I>1],
(1/K) Σ_{k=1}^K E[(ᾱ_t − α_t^k)²] ≤ 4η²I²B_α² I_[I>1].
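A brief intuition for these bounds (added here; the formal proof is in the Appendix): between two consecutive averaging rounds a worker takes at most I local steps, and by Lemma 1 each step can increase the gap between the local iterate and the average by at most 2ηB_v (respectively 2ηB_α), since the proximal term centered at v_0 is common to all machines and cancels in the comparison. Hence ‖v̄_t − v_t^k‖ ≤ 2ηIB_v and |ᾱ_t − α_t^k| ≤ 2ηIB_α, and squaring gives the stated bounds.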

Combining the results in Lemma 3, Lemma 4, Lemma 5 and Lemma 6, we can prove the key Lemma 2.

5. Experiments

In this section, we conduct experiments to verify our theory. In our experiments, one "machine" corresponds to one GPU. We use a cluster of 4 computing nodes, with each computing node having 4 GPUs, which gives a total of 16 "machines". We emphasize that even though 4 GPUs sit on one computing node, they only access different parts of the data. For the experiment with K = 1 GPU, we run one computing node using one GPU. For experiments with K = 4 GPUs, we run one computing node using all four GPUs, and for experiments with K = 16 GPUs, we use four computing nodes with all of their GPUs. We note that the communication cost among GPUs on one computing node might be lower than that among GPUs on different computing nodes. Hence, it should be kept in mind that when compared with K = 4 GPUs on different computing nodes, the margin of K = 16 GPUs over K = 4 GPUs should be larger than what we see in our experimental results. All algorithms are implemented in PyTorch (Paszke et al., 2019).

Data. We conduct experiments on three datasets: Cifar10, Cifar100 and ImageNet. For Cifar10, we split the original training data into two classes, i.e., the positive class contains 5 original classes and the negative class is composed of the other 5 classes. The Cifar100 dataset is split in a similar way, i.e., the positive class contains 50 original classes and the negative class is composed of the other 50 classes. The testing data for Cifar10 and Cifar100 are the same as in the original datasets. For the ImageNet dataset, we sample 1% of the original training data as testing data and use the remaining data as training data. The training data is split in a similar way as Cifar10 and Cifar100, i.e., the positive class contains 500 original classes and the negative class is composed of the other 500 classes. For each dataset, we create two versions of training data with different positive ratios. By keeping all examples in the positive and negative classes, we have p = 50% for all three datasets. In order to create imbalanced data, we drop some proportion of the negative data for each dataset and keep all the positive examples. In particular, by keeping all the positive data and 40% of the negative data we construct three datasets with positive ratio p = 71%. The training data are shuffled and evenly divided among the GPUs, i.e., each GPU has access to 1/K of the training data, where K is the number of GPUs. For all data, we use ResNet50 (He et al., 2016) as our neural network and initialize the model with the pretrained model from PyTorch. Due to the limit of space, we only report the results on datasets with p = 71% positive ratio; other results are included in the supplement.

Baselines and Parameter Setting. For baselines, we compare with the single-machine algorithm PPD-SG proposed in (Liu et al., 2020b), which is represented by K = 1 in our results, and with the naive parallel version of PPD-SG, which is denoted by K = X, I = 1 in our results. For all algorithms, we set T_s = T_0 · 3^s and η_s = η_0/3^s. T_0 and η_0 are tuned for PPD-SG and set to the same values for all other algorithms for a fair comparison. T_0 is tuned in {2000, 5000, 10000}, and η_0 is tuned in {0.1, 0.01, 0.001}. We fix the batch size on each GPU to 32. For simplicity, in our experiments we use a fixed value of I in order to study its trade-off with the number of machines K.

Results. We plot curves of testing AUC versus the number of iterations and versus running time. We note that evaluating the training objective value on all examples is very expensive, hence we use the testing AUC as our evaluation metric. This might cause some gap between our results and the theory; however, the trend is sufficient for our purpose of verifying that our distributed algorithm enjoys faster convergence in both the number of iterations and running time. We have the following observations.

• Varying K. By varying K and fixing the value of I, we aim to verify the parallel speedup. The results are shown in Figures 1(a), 2(a) and 3(a). They show that when K becomes larger, our algorithm requires fewer iterations to converge to the target AUC, which is consistent with the parallel speedup indicated by Theorem 1. In addition, CoDA with K = 16 machines is also the most time-efficient algorithm among all settings.

• Varying I. By varying I and fixing the value of K, we aim to verify that skipping communications up to a certain number of iterations does not hurt the iteration complexity of CoDA but can dramatically reduce the total communication cost. In particular, we fix K = 16 and vary I in the range {1, 8, 64, 512, 1024}. The results are shown in Figures 1(b), 2(b) and 3(b). They show that even when I becomes moderately large, our algorithm is still able to deliver performance comparable to the case I = 1 in terms of the number of iterations. The largest value of I that does not cause a dramatic performance drop compared with I = 1 is I = 1024, I = 64 and I = 64 on ImageNet, CIFAR100 and CIFAR10, respectively. Up to these thresholds, the running time of CoDA can be dramatically reduced compared with the naive parallel version with I = 1.

• Trade-off between I and K. Finally, we verify the trade-off between I and K indicated in Theorem 1. To this end, we conduct experiments by fixing K = 4 GPUs and varying the value of I, and compare the tolerable limits of I for K = 4 and K = 16. The results of using K = 4 on CIFAR100 and CIFAR10 are reported in Figure 4 and Figure 5. We observe that when K = 4 the upper limit of I that does not cause a dramatic performance drop compared with I = 1 is I = 512 for both datasets, which is larger than the upper limit of I = 64 for K = 16. This is consistent with Theorem 1.

Figure 1. ImageNet, positive ratio = 71%. (a) Fix I, vary K; (b) Fix K, vary I.
Figure 2. Cifar100, positive ratio = 71%. (a) Fix I, vary K; (b) Fix K, vary I.
Figure 3. Cifar10, positive ratio = 71%. (a) Fix I, vary K; (b) Fix K, vary I.
Figure 4. Cifar100, positive ratio = 71%, K = 4.
Figure 5. Cifar10, positive ratio = 71%, K = 4.

6. Conclusion

In this paper, we have designed a communication-efficient distributed stochastic deep AUC maximization algorithm, in which each machine is able to do multiple iterations of local updates before communicating with the central node. We have proved the linear speedup property and shown that the communication complexity can be dramatically reduced for multiple machines up to a large threshold number. Our empirical studies verify the theory and also demonstrate the effectiveness of the proposed distributed algorithm on benchmark datasets.

Acknowledgements

This work is partially supported by National Science Foundation CAREER Award 1844403.

References

Basu, D., Data, D., Karakus, C., and Diggavi, S. Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems, pp. 14668–14679, 2019.

Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Chen, K. and Huo, Q. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5880–5884. IEEE, 2016.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Elkan, C. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pp. 973–978. Lawrence Erlbaum Associates Ltd, 2001.

Gao, W., Jin, R., Zhu, S., and Zhou, Z.-H. One-pass AUC optimization. In ICML (3), pp. 906–914, 2013.

Godichon-Baggioni, A. and Saadane, S. On the rates of convergence of parallelized averaged stochastic gradient algorithms. arXiv preprint arXiv:1710.07926, 2017.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Haddadpour, F., Kamani, M. M., Mahdavi, M., and Cadambe, V. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems, pp. 11080–11092, 2019.

Hanley, J. A. and McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.

Hanley, J. A. and McNeil, B. J. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148(3):839–843, 1983.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Jaggi, M., Smith, V., Takác, M., Terhorst, J., Krishnan, S., Hofmann, T., and Jordan, M. I. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3068–3076, 2014.

Jain, P., Netrapalli, P., Kakade, S. M., Kidambi, R., and Sidford, A. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. The Journal of Machine Learning Research, 18(1):8258–8299, 2017.

Jiang, P. and Agrawal, G. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems, pp. 2525–2536, 2018.

Jin, C., Netrapalli, P., and Jordan, M. I. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.

Kamp, M., Adilova, L., Sicking, J., Hüger, F., Schlicht, P., Wirtz, T., and Wrobel, S. Efficient decentralized deep learning by dynamic model averaging. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 393–409. Springer, 2018.

Koloskova, A., Stich, S. U., and Jaggi, M. Decentralized stochastic optimization and gossip algorithms with compressed communication. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 3478–3487, 2019.

Koloskova*, A., Lin*, T., Stich, S. U., and Jaggi, M. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkgGCkrKvH.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583–598, 2014.

Lin, Q., Liu, M., Rafique, H., and Yang, T. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207, 2018a.

Lin, T., Stich, S. U., Patel, K. K., and Jaggi, M. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018b.

Lin, T., Jin, C., and Jordan, M. I. On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331, 2019.

Liu, M., Zhang, X., Chen, Z., Wang, X., and Yang, T. Fast stochastic AUC maximization with O(1/n)-convergence rate. In International Conference on Machine Learning, pp. 3195–3203, 2018.

Liu, M., Mroueh, Y., Ross, J., Zhang, W., Cui, X., Das, P., and Yang, T. Towards better understanding of adaptive gradient algorithms in generative adversarial nets. In International Conference on Learning Representations, 2020a. URL https://openreview.net/forum?id=SJxIm0VtwH.

Liu, M., Yuan, Z., Ying, Y., and Yang, T. Stochastic AUC maximization with deep neural networks. ICLR, 2020b.

Lu, S., Tsaknakis, I., Hong, M., and Chen, Y. Hybrid block successive approximation for one-sided non-convex min-max problems: Algorithms and applications. arXiv preprint arXiv:1902.08294, 2019.

McDonald, R., Hall, K., and Mann, G. Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 456–464. Association for Computational Linguistics, 2010.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

Natole, M., Ying, Y., and Lyu, S. Stochastic proximal algorithms for AUC maximization. In International Conference on Machine Learning, pp. 3707–3716, 2018.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nesterov, Y. E. Introductory Lectures on Convex Optimization - A Basic Course, volume 87 of Applied Optimization. Springer, 2004.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.

Povey, D., Zhang, X., and Khudanpur, S. Parallel training of DNNs with natural gradient and parameter averaging. arXiv preprint arXiv:1410.7455, 2014.

Rafique, H., Liu, M., Lin, Q., and Yang, T. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.

Rockafellar, R. T. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

Sanjabi, M., Razaviyayn, M., and Lee, J. D. Solving non-convex non-concave min-max games under Polyak-Łojasiewicz condition. arXiv preprint arXiv:1812.02878, 2018.

Shamir, O. and Srebro, N. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 850–857. IEEE, 2014.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4447–4458, 2018.

Su, H. and Chen, H. Experiments on parallel training of deep neural network using model averaging. arXiv preprint arXiv:1507.01239, 2015.

Wang, J. and Joshi, G. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv preprint arXiv:1810.08313, 2018a.

Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018b.

Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1299–1309, 2018.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pp. 5754–5764, 2019.

Ying, Y., Wen, L., and Lyu, S. Stochastic online AUC maximization. In Advances in Neural Information Processing Systems, pp. 451–459, 2016.

Yu, H., Jin, R., and Yang, S. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. arXiv preprint arXiv:1905.03817, 2019a.

Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693–5700, 2019b.

Yuan, Z., Yan, Y., Jin, R., and Yang, T. Stagewise training accelerates convergence of testing error over SGD. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 2604–2614, 2019.

Zhang, Y., Duchi, J. C., and Wainwright, M. J. Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research, 14(1):3321–3363, 2013.

Zhao, P., Jin, R., Yang, T., and Hoi, S. C. Online AUC maximization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 233–240, 2011.

Zhou, F. and Cong, G. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012, 2017.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. J. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595–2603, 2010.