
Client-Edge-Cloud Hierarchical Federated Learning

Lumin Liu*, Jun Zhang†, S.H. Song*, and Khaled B. Letaief*‡, Fellow, IEEE

*Dept. of ECE, The Hong Kong University of Science and Technology, Hong Kong
†Dept. of EIE, The Hong Kong Polytechnic University, Hong Kong
‡Peng Cheng Laboratory, Shenzhen, China
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—Federated Learning is a collaborative machine learning framework to train a deep learning model without accessing clients' private data. Previous works assume one central parameter server, either at the cloud or at the edge. The cloud server can access more data but with excessive communication overhead and long latency, while the edge server enjoys more efficient communications with the clients. To combine their advantages, we propose a client-edge-cloud hierarchical Federated Learning system, supported by a HierFAVG algorithm that allows multiple edge servers to perform partial model aggregation. In this way, the model can be trained faster and better communication-computation trade-offs can be achieved. Convergence analysis is provided for HierFAVG, and the effects of key parameters are also investigated, which leads to qualitative design guidelines. Empirical experiments verify the analysis and demonstrate the benefits of this hierarchical architecture in different data distribution scenarios. In particular, it is shown that by introducing intermediate edge servers, the model training time and the energy consumption of the end devices can be simultaneously reduced compared to cloud-based Federated Learning.

Index Terms—Mobile Edge Computing, Federated Learning, Edge Learning

I. INTRODUCTION

The recent development in deep learning has revolutionized many application domains, such as image processing, natural language processing, and video analytics [1]. So far, deep learning models are mainly trained at powerful computing platforms, e.g., a cloud datacenter, with centrally collected massive datasets. Nonetheless, in many applications, data are generated and distributed at end devices, such as smartphones and sensors, and moving them to a central server for model training conflicts with increasing privacy concerns. Thus, privacy-preserving distributed training has started to receive much attention. In 2017, Google proposed Federated Learning (FL) and a Federated Averaging (FAVG) algorithm [2] to train a deep learning model without centralizing the data at the data center. With this algorithm, local devices download a global model from the cloud server, perform several epochs of local training, and then upload the model weights to the server for model aggregation. The process is repeated until the model reaches a desired accuracy, as illustrated in Fig. 1.

FL enables fully distributed training by decomposing the process into two steps, i.e., parallel model updates based on local data at the clients and global model aggregation at the server. Its feasibility has been verified in a real-world implementation [3]. Since then, it has attracted great attention from both academia and industry [4]. While most initial studies of FL assumed a cloud as the parameter server, with the recent emergence of edge computing platforms [5], researchers have started investigating edge-based FL systems [6]–[8]. For edge-based FL, a proximate edge server acts as the parameter server, while the clients within its communication range collaborate to train a deep learning model.

Figure 1: Cloud-based, edge-based and client-edge-cloud hierarchical FL. The process of the FAVG algorithm is also illustrated.

While both the cloud-based and edge-based FL systems apply the same FAVG algorithm, there are some fundamental differences between the two systems, as shown in Fig. 1. In cloud-based FL, the total number of participating clients can reach millions [9], providing the massive datasets needed in deep learning. Meanwhile, communication with the cloud server is slow and unpredictable, e.g., due to network congestion, which makes the training process inefficient [2], [9]. Analysis has shown a trade-off between the communication efficiency and the convergence rate for FAVG [10]; specifically, less communication is required at the price of more local computation. On the contrary, in edge-based FL, the parameter server is placed at the proximate edge, such as a base station, so the latency of computation is comparable to that of communication with the edge parameter server. Thus, it is possible to pursue a better trade-off between computation and communication [7], [8]. Nevertheless, one disadvantage of edge-based FL is the limited number of clients each server can access, leading to inevitable training performance loss.


From the above comparison, we see a necessity in leveraging a cloud server to access the massive training samples, while each edge server enjoys quick model updates with its local clients. This motivates us to propose a client-edge-cloud hierarchical FL system, as shown on the right side of Fig. 1, to get the best of both systems. Compared with cloud-based FL, hierarchical FL significantly reduces the costly communication with the cloud, supplemented by efficient client-edge updates, thereby resulting in a significant reduction in both the runtime and the number of local iterations. On the other hand, as more data can be accessed by the cloud server, hierarchical FL outperforms edge-based FL in model training. These two aspects are clearly observed from Fig. 2, which gives a preview of the results to be presented in this paper. While the advantages can be intuitively explained, the design of a hierarchical FL system is nontrivial. First, by extending the FAVG algorithm to the hierarchical setting, will the new algorithm still converge? Given the two levels of model aggregation (one at the edge, one at the cloud), how often should the models be aggregated at each level? Moreover, by allowing frequent local updates, can a better latency-energy trade-off be achieved? In this paper, we address these key questions. First, a rigorous proof is provided to show the convergence of the training algorithm. Through the convergence analysis, some qualitative guidelines on picking the aggregation frequencies at the two levels are also given. Experimental results on the MNIST [11] and CIFAR-10 [12] datasets support our findings and demonstrate the advantage of achieving a better communication-computation trade-off compared to cloud-based systems.

II. FEDERATED LEARNING SYSTEMS

In this section, we first introduce the general learning problem in FL. The cloud-based and edge-based FL systems differ only in the communication and the number of participating clients, and they are identical in terms of architecture. Thus, we treat them as the same traditional two-layer FL system in this section and introduce the widely adopted FAVG algorithm [2]. We then present the proposed three-layer client-edge-cloud hierarchical FL system and its optimization algorithm, namely, HierFAVG.

A. Learning Problem

We focus on supervised Federated Learning. Denote D = {x_j, y_j}_{j=1}^{|D|} as the training dataset and |D| as the total number of training samples, where x_j is the j-th input sample and y_j is the corresponding label. w is a real vector that fully parametrizes the ML model. f(x_j, y_j, w), also denoted as f_j(w) for convenience, is the loss function of the j-th data sample, which captures the prediction error of the model on that sample. The training process minimizes the empirical loss F(w) based on the training dataset [13]:

$$F(w) = \frac{1}{|D|}\sum_{j=1}^{|D|} f(x_j, y_j, w) = \frac{1}{|D|}\sum_{j=1}^{|D|} f_j(w). \tag{1}$$

The loss function F(w) depends on the ML model and can be convex, e.g., logistic regression, or non-convex, e.g., neural networks. This learning problem is usually solved by gradient descent. Denote k as the index of the update step and η as the gradient descent step size; then the model parameters are updated as:

$$w(k) = w(k-1) - \eta \nabla F(w(k-1)).$$

Figure 2: Test accuracy w.r.t. the runtime on CIFAR-10.

In FL, the dataset is distributed over N clients as {D_i}_{i=1}^N, with ∪_{i=1}^N D_i = D, and these distributed datasets cannot be directly accessed by the parameter server. Thus, F(w) in Eq. (1), also called the global loss, cannot be computed directly, but only as a weighted average of the local loss functions F_i(w) on the local datasets D_i. Specifically, F(w) and F_i(w) are given by:

$$F(w) = \frac{\sum_{i=1}^{N} |D_i|\, F_i(w)}{|D|}, \qquad F_i(w) = \frac{\sum_{j \in D_i} f_j(w)}{|D_i|}.$$
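To make this decomposition concrete, the following minimal Python sketch computes the global loss and gradient as the |D_i|/|D|-weighted average of the local ones and runs centralized gradient descent on it. The quadratic loss and the synthetic clients are illustrative assumptions, not the models used in the paper.

```python
import numpy as np

def local_loss_and_grad(w, X, y):
    """Average loss F_i(w) and its gradient on one client's dataset (X, y)."""
    residual = X @ w - y                      # assumed quadratic loss 0.5*||Xw - y||^2 / |D_i|
    loss = 0.5 * np.mean(residual ** 2)
    grad = X.T @ residual / len(y)
    return loss, grad

def global_loss_and_grad(w, client_data):
    """Weighted average of the local losses/gradients, with weights |D_i| / |D|."""
    total = sum(len(y) for _, y in client_data)
    loss, grad = 0.0, np.zeros_like(w)
    for X, y in client_data:
        l_i, g_i = local_loss_and_grad(w, X, y)
        loss += len(y) / total * l_i
        grad += len(y) / total * g_i
    return loss, grad

# Illustrative usage: centralized gradient descent w(k) = w(k-1) - eta * grad F.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 10)), rng.normal(size=50)) for _ in range(4)]
w, eta = np.zeros(10), 0.1
for k in range(100):
    _, g = global_loss_and_grad(w, clients)
    w -= eta * g
```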

B. Traditional Two-Layer FL

In the traditional two-layer FL system, there are one central parameter server and N clients. To reduce the communication overhead, the FAVG algorithm [2] communicates and aggregates the local models only after every κ steps of gradient descent on each client. The process repeats until the model reaches a desired accuracy or the limited resources, e.g., the communication or time budget, run out.

Denote w_i(k) as the parameters of the local model on the i-th client. Then w_i(k) in FAVG evolves as follows:

$$w_i(k) = \begin{cases} w_i(k-1) - \eta_k \nabla F_i(w_i(k-1)), & k \bmod \kappa \neq 0, \\[4pt] \dfrac{\sum_{i=1}^{N} |D_i|\left[ w_i(k-1) - \eta_k \nabla F_i(w_i(k-1)) \right]}{|D|}, & k \bmod \kappa = 0. \end{cases}$$
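The sketch below implements one FAVG communication round under the update rule above, assuming each client exposes a local gradient oracle; the synthetic least-squares clients are illustrative placeholders.

```python
import numpy as np

def favg_round(w, grad_fns, sizes, eta, kappa):
    """One FAVG round: kappa local gradient steps per client, then weighted aggregation."""
    total = sum(sizes)
    local_ws = []
    for grad_fn in grad_fns:
        w_i = w.copy()
        for _ in range(kappa):                    # k mod kappa != 0: local updates
            w_i -= eta * grad_fn(w_i)
        local_ws.append(w_i)
    # k mod kappa == 0: weighted model aggregation at the parameter server
    return sum(n / total * w_i for n, w_i in zip(sizes, local_ws))

# Illustrative usage with synthetic least-squares clients.
rng = np.random.default_rng(0)
data = [(rng.normal(size=(40, 5)), rng.normal(size=40)) for _ in range(8)]
grad_fns = [lambda w, X=X, y=y: X.T @ (X @ w - y) / len(y) for X, y in data]
sizes = [len(y) for _, y in data]
w = np.zeros(5)
for rnd in range(20):
    w = favg_round(w, grad_fns, sizes, eta=0.1, kappa=10)
```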

C. Client-Edge-Cloud Hierarchical FL

In FAVG, the model aggregation step can be interpreted as a way to exchange information among the clients. Thus, aggregation at the cloud parameter server can incorporate many clients, but the communication cost is high. On the other hand, aggregation at an edge parameter server only incorporates a small number of clients, with a much cheaper communication cost. To combine their advantages, we consider a hierarchical FL system, which has one cloud server, L edge servers indexed by ℓ with disjoint client sets {C_ℓ}_{ℓ=1}^L, and N clients indexed by i, with distributed datasets {D_i^ℓ}_{i=1}^N. Denote D^ℓ as the aggregated dataset under edge ℓ. Each edge server aggregates the models from its clients.


Algorithm 1: Hierarchical Federated Averaging (HierFAVG)

1: procedure HierarchicalFederatedAveraging
2:   Initialize all clients with parameter w0
3:   for k = 1, 2, ..., K do
4:     for each client i = 1, 2, ..., N in parallel do
5:       w_i^ℓ(k) ← w_i^ℓ(k−1) − η∇F_i(w_i^ℓ(k−1))
6:     end for
7:     if k mod κ1 = 0 then
8:       for each edge ℓ = 1, ..., L in parallel do
9:         w^ℓ(k) ← EdgeAggregation(ℓ, {w_i^ℓ(k)}_{i∈C_ℓ})
10:        if k mod κ1κ2 ≠ 0 then
11:          for each client i ∈ C_ℓ in parallel do
12:            w_i^ℓ(k) ← w^ℓ(k)
13:          end for
14:        end if
15:      end for
16:    end if
17:    if k mod κ1κ2 = 0 then
18:      w(k) ← CloudAggregation({w^ℓ(k)}_{ℓ=1}^L)
19:      for each client i = 1, ..., N in parallel do
20:        w_i^ℓ(k) ← w(k)
21:      end for
22:    end if
23:  end for
24: end procedure

25: function EdgeAggregation(ℓ, {w_i^ℓ(k)}_{i∈C_ℓ})    // aggregate locally at edge ℓ
26:   w^ℓ(k) ← Σ_{i∈C_ℓ} |D_i^ℓ| w_i^ℓ(k) / |D^ℓ|
27:   return w^ℓ(k)
28: end function

29: function CloudAggregation({w^ℓ(k)}_{ℓ=1}^L)    // aggregate globally at the cloud
30:   w(k) ← Σ_{ℓ=1}^L |D^ℓ| w^ℓ(k) / |D|
31:   return w(k)
32: end function

With this new architecture, we extend FAVG to the HierFAVG algorithm. The key steps of HierFAVG proceed as follows. After every κ1 local updates on each client, each edge server aggregates its clients' models. Then, after every κ2 edge model aggregations, the cloud server aggregates all the edge servers' models, which means that communication with the cloud happens every κ1κ2 local updates. The comparison between FAVG and HierFAVG is illustrated in Fig. 3. Denote w_i^ℓ(k) as the local model parameters after the k-th local update, and K as the total number of local updates performed, which is assumed to be an integer multiple of κ1κ2. The details of HierFAVG are presented in Algorithm 1, and the local model parameters w_i^ℓ(k) evolve as follows:

$$w_i^\ell(k) = \begin{cases} w_i^\ell(k-1) - \eta_k \nabla F_i^\ell(w_i^\ell(k-1)), & k \bmod \kappa_1 \neq 0, \\[6pt] \dfrac{\sum_{i \in C_\ell} |D_i^\ell| \left[ w_i^\ell(k-1) - \eta_k \nabla F_i^\ell(w_i^\ell(k-1)) \right]}{|D^\ell|}, & k \bmod \kappa_1 = 0,\ k \bmod \kappa_1\kappa_2 \neq 0, \\[6pt] \dfrac{\sum_{i=1}^{N} |D_i^\ell| \left[ w_i^\ell(k-1) - \eta_k \nabla F_i^\ell(w_i^\ell(k-1)) \right]}{|D|}, & k \bmod \kappa_1\kappa_2 = 0. \end{cases}$$
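For concreteness, the following is a minimal runnable sketch of Algorithm 1, assuming per-client gradient oracles grad_fns[l][i] and dataset sizes sizes[l][i]; the synthetic least-squares clients are illustrative placeholders rather than the paper's experimental setup.

```python
import numpy as np

def hier_favg(w0, grad_fns, sizes, eta, k1, k2, rounds):
    """Run `rounds` cloud rounds; each cloud round is k1*k2 local updates per client."""
    L = len(grad_fns)
    w_cloud = w0.copy()
    for _ in range(rounds):
        # Every client starts the cloud round from the latest cloud model.
        w_clients = [[w_cloud.copy() for _ in grad_fns[l]] for l in range(L)]
        for _ in range(k2):                      # k2 edge aggregations per cloud round
            for l in range(L):
                for _ in range(k1):              # k1 local updates per edge round
                    for i, g in enumerate(grad_fns[l]):
                        w_clients[l][i] -= eta * g(w_clients[l][i])
                # Edge aggregation: weighted average over the clients of edge l.
                n_l = sum(sizes[l])
                w_edge = sum(n / n_l * w for n, w in zip(sizes[l], w_clients[l]))
                w_clients[l] = [w_edge.copy() for _ in grad_fns[l]]
        # Cloud aggregation: weighted average over the edges' models.
        n_tot = sum(sum(s) for s in sizes)
        w_cloud = sum(sum(sizes[l]) / n_tot * w_clients[l][0] for l in range(L))
    return w_cloud

# Illustrative usage: 2 edges, 3 synthetic least-squares clients each.
rng = np.random.default_rng(0)
data = [[(rng.normal(size=(30, 5)), rng.normal(size=30)) for _ in range(3)]
        for _ in range(2)]
grad_fns = [[lambda w, X=X, y=y: X.T @ (X @ w - y) / len(y) for X, y in edge]
            for edge in data]
sizes = [[len(y) for _, y in edge] for edge in data]
w = hier_favg(np.zeros(5), grad_fns, sizes, eta=0.1, k1=5, k2=4, rounds=10)
```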

III. CONVERGENCE ANALYSIS OF HIERFAVG

In this section, we prove the convergence of HierFAVG for both convex and non-convex loss functions. The analysis also reveals some key properties of the algorithm, as well as the effects of key parameters.

Figure 3: Comparison of FAVG and HierFAVG.

A. Definitions

Some essential definitions need to be explained before the analysis. The overall K local training iterations are divided into B cloud intervals, each of length κ1κ2, or equivalently Bκ2 edge intervals, each of length κ1. The local (edge) aggregation happens at the end of each edge interval, and the global (cloud) aggregation happens at the end of each cloud interval. We use [p] to represent the edge interval from (p−1)κ1 to pκ1, and {q} to represent the cloud interval from (q−1)κ1κ2 to qκ1κ2, so that {q} = ∪_p [p] for p = (q−1)κ2+1, (q−1)κ2+2, ..., qκ2.

• F^ℓ(w): the edge loss function at edge server ℓ, expressed as
$$F^\ell(w) = \frac{1}{|D^\ell|} \sum_{i \in C_\ell} |D_i^\ell|\, F_i(w).$$

• w(k): the weighted average of the w_i^ℓ(k), expressed as
$$w(k) = \frac{1}{|D|} \sum_{i=1}^{N} |D_i^\ell|\, w_i^\ell(k).$$

• u_{q}(k): the virtually centralized gradient descent sequence, defined in cloud interval {q} and synchronized with w(k) immediately after every cloud aggregation:
$$u_{\{q\}}((q-1)\kappa_1\kappa_2) = w((q-1)\kappa_1\kappa_2), \qquad u_{\{q\}}(k+1) = u_{\{q\}}(k) - \eta_k \nabla F(u_{\{q\}}(k)).$$

The key idea of the proof is to show that the true weights w(k) do not deviate much from the virtually centralized sequence u_{q}(k). Using the same method, the convergence of two-layer FL was analyzed in [7].

Lemma 1 (Convergence of FAVG [7]). For any i, assume f_i(w) is ρ-continuous, β-smooth, and convex, and let F_inf = F(w*). If the deviation of the distributed weights has an upper bound denoted as M, then for FAVG with a fixed step size η and an aggregation interval κ, after K = Bκ local updates, we have the following convergence upper bound:

$$F(w(K)) - F(w^*) \leq \frac{1}{B\left(\eta\varphi - \frac{\rho M}{\kappa \varepsilon^2}\right)},$$

when the following conditions are satisfied: 1) η ≤ 1/β; 2) ηϕ − ρM/(κε²) > 0; 3) F(v_b(bκ)) − F(w*) ≥ ε for b = 1, ..., K/κ; 4) F(w(K)) − F(w*) ≥ ε, for some ε > 0, where ω = min_b 1/‖F(v_b((b−1)κ)) − F(w*)‖ and ϕ = ω(1 − βη/2).

The unique non-Independent and Identically Distributed (non-IID) data distribution in FL is the key property that distinguishes FL from distributed learning in a datacenter. Since the data are generated separately by each client, the local data distributions may be unbalanced, and the model performance is heavily influenced by the non-IID data distribution [14]. In this paper, we adopt the same measurement as in [7] to quantify the two-level non-IIDness in our hierarchical system, i.e., at the client level and at the edge level.

Definition 1 (Gradient Divergence). For any weight parameter w, the gradient divergence between the local loss function of the i-th client and the edge loss function of the ℓ-th edge server is defined as an upper bound of ‖∇F_i^ℓ(w) − ∇F^ℓ(w)‖, denoted as δ_i^ℓ; the gradient divergence between the edge loss function of the ℓ-th edge server and the global loss function is defined as an upper bound of ‖∇F^ℓ(w) − ∇F(w)‖, denoted as Δ^ℓ. Specifically,

$$\|\nabla F_i^\ell(w) - \nabla F^\ell(w)\| \leq \delta_i^\ell, \qquad \|\nabla F^\ell(w) - \nabla F(w)\| \leq \Delta^\ell.$$

Define
$$\delta = \frac{\sum_{i=1}^{N} |D_i^\ell|\, \delta_i^\ell}{|D|}, \qquad \Delta = \frac{\sum_{\ell=1}^{L} |D^\ell|\, \Delta^\ell}{|D|} = \frac{\sum_{i=1}^{N} |D_i^\ell|\, \Delta^\ell}{|D|},$$
and we call δ the Client-Edge divergence and Δ the Edge-Cloud divergence.

A larger gradient divergence means the dataset distribution is more non-IID. δ reflects the non-IIDness at the client level, while Δ reflects the non-IIDness at the edge level.

B. Convergence

In this section, we prove that HierFAVG converges. The basic idea is to study how the real weights w(k) deviate from the virtually centralized sequence u_{q}(k) as the parameters of HierFAVG vary. In the following two lemmas, we prove an upper bound on the deviation of the distributed weights for both convex and non-convex loss functions.

Lemma 2 (Convex). For any i, assume f_i(w) is β-smooth and convex. Then for any cloud interval {q} with a fixed step size η_q and any k ∈ {q}, we have

$$\|w(k) - u_{\{q\}}(k)\| \leq G_c(k, \eta_q),$$

where
$$G_c(k, \eta_q) = h\big(k - (q-1)\kappa_1\kappa_2, \Delta, \eta_q\big) + h\big(k - ((q-1)\kappa_2 + p(k) - 1)\kappa_1, \delta, \eta_q\big) + \frac{\kappa_1}{2}\big(p^2(k) + p(k) - 2\big)\, h(\kappa_1, \delta, \eta_q),$$
$$h(x, \delta, \eta) = \frac{\delta}{\beta}\big((\eta\beta + 1)^x - 1\big) - \eta\delta x, \qquad p(x) = \left\lceil \frac{x}{\kappa_1} \right\rceil - (q-1)\kappa_2.$$

Remark 1. Note that when κ2 = 1, HierFAVG reduces to the FAVG algorithm. In this case, [p] is the same as {q}, p(k) = 1, κ1κ2 = κ1, and G_c(k) = h(k − (q−1)κ1, Δ + δ, η_q), which is consistent with the result in [7]. When κ1 = κ2 = 1, HierFAVG reduces to traditional gradient descent. In this case, G_c(κ1κ2) = 0, implying that the distributed weights iteration is the same as the centralized weights iteration.

Remark 2. The following upper bound on the weights deviation, G_c(k), increases as we increase either of the two aggregation intervals, κ1 and κ2:

$$G_c(k, \eta_q) \leq G_c(\kappa_1\kappa_2, \eta_q) = h(\kappa_1\kappa_2, \Delta, \eta_q) + \frac{1}{2}\big(\kappa_2^2 + \kappa_2 - 1\big)(\kappa_1 + 1)\, h(\kappa_1, \delta, \eta_q). \tag{2}$$

It is obvious that when δ = Δ = 0 (i.e., the client data distribution is IID), we have G_c(k) = 0, and the distributed weights iteration is the same as the centralized weights iteration.

When the client data are non-IID, there are two parts in the expression of the weights deviation upper bound G_c(κ1κ2, η_q). The first is caused by the Edge-Cloud divergence and is exponential in both κ1 and κ2. The second is caused by the Client-Edge divergence and is exponential only in κ1, but quadratic in κ2. From Lemma 1, we can see that a smaller model weight deviation leads to faster convergence. This gives us some qualitative guidelines for selecting the parameters of HierFAVG (a numerical sketch of the bound follows the list below):

1) When the product of κ1 and κ2 is fixed, which means the number of local updates between two cloud aggregations is fixed, a smaller κ1 with a larger κ2 results in a smaller deviation G_c(κ1, κ2). This is consistent with our intuition, namely, that frequent local model averaging reduces the number of local iterations needed.

2) When the edge datasets are IID, meaning Δ = 0, the first part of Eq. (2) becomes 0. The second part is dominated by κ1, which suggests that when the distribution of the edge datasets approaches IID, increasing κ2 will not push up the deviation upper bound much. This suggests that one way to further reduce the communication with the cloud is to make the edge datasets IID.
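To illustrate guideline 1 numerically, the short sketch below evaluates the upper bound in Eq. (2) for a fixed product κ1κ2 = 60 and different splits; the constants β, η, δ, and Δ are illustrative assumptions, not values from the paper.

```python
def h(x, div, eta, beta):
    """h(x, delta, eta) from Lemma 2."""
    return div / beta * ((eta * beta + 1) ** x - 1) - eta * div * x

def Gc_bound(k1, k2, delta, Delta, eta, beta):
    """G_c(kappa1*kappa2, eta) from Eq. (2)."""
    return (h(k1 * k2, Delta, eta, beta)
            + 0.5 * (k2 ** 2 + k2 - 1) * (k1 + 1) * h(k1, delta, eta, beta))

beta, eta, delta, Delta = 1.0, 0.01, 0.5, 0.2          # illustrative constants
for k1, k2 in [(60, 1), (30, 2), (15, 4), (6, 10)]:    # fixed k1 * k2 = 60
    print(k1, k2, round(Gc_bound(k1, k2, delta, Delta, eta, beta), 4))
```

Under these illustrative constants, the bound decreases monotonically as κ1 shrinks and κ2 grows, matching guideline 1.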

The result for non-convex loss functions is stated in the following lemma.

Lemma 3 (Non-convex). For any i, assume f_i(w) is β-smooth. Then for any cloud interval {q} with step size η_q, we have

$$\|w(k) - u_{\{q\}}(k)\| \leq G_{nc}(\kappa_1\kappa_2, \eta_q),$$

where
$$G_{nc}(\kappa_1\kappa_2, \eta_q) = h(\kappa_1\kappa_2, \Delta, \eta_q) + \kappa_1\kappa_2\, \frac{(1 + \eta_q\beta)^{\kappa_1\kappa_2} - 1}{(1 + \eta_q\beta)^{\kappa_1} - 1}\, h(\kappa_1, \delta, \eta_q) + h(\kappa_1, \delta, \eta_q),$$
$$h(x, \delta, \eta) = \frac{\delta}{\beta}\big((\eta\beta + 1)^x - 1\big) - \eta\delta x.$$

With the help of the weight deviation upper bound, we are now ready to prove the convergence of HierFAVG for both convex and non-convex loss functions.

Theorem 1 (Convex). For any i, assume f_i(w) is ρ-continuous, β-smooth, and convex, and denote F_inf = F(w*). Then after K local updates, we have the following convergence upper bound for HierFAVG with a fixed step size:

$$F(w(K)) - F(w^*) \leq \frac{1}{T\left(\eta\varphi - \frac{\rho\, G_c(\kappa_1\kappa_2, \eta)}{\kappa_1\kappa_2\, \varepsilon^2}\right)}.$$


Proof. By directly substituting M in Lemma 1 with G_c(κ1κ2) from Lemma 2, we prove Theorem 1.

Remark 3. Notice that in the condition ε² > ρG_c(κ1κ2)/(κ1κ2 ηϕ) of Lemma 1, ε does not decrease as K increases. Hence we cannot have F(w(K)) − F(w*) → 0 as K → ∞. This is because the variance in the gradients introduced by the non-IIDness cannot be eliminated by fixed-step-size gradient descent.

Remark 4. With diminishing step sizes {η_q} that satisfy Σ_{q=1}^∞ η_q = ∞ and Σ_{q=1}^∞ η_q² < ∞, the convergence upper bound for HierFAVG after K = Bκ1κ2 local updates is:

$$F(w(K)) - F(w^*) \leq \frac{1}{\sum_{q=1}^{B}\left(\eta_q\varphi_q - \frac{\rho\, G_c(\kappa_1\kappa_2, \eta_q)}{\kappa_1\kappa_2\, \varepsilon_q^2}\right)} \xrightarrow{B \to \infty} 0.$$

Now we consider non-convex loss functions, which appear in ML models such as neural networks.

Theorem 2 (Non-convex). For any i, assume that f_i(w) is ρ-continuous and β-smooth. Also assume that HierFAVG is initialized from w_0, that F_inf = F(w*), and that η_q is constant within each cloud interval {q}. Then, after K = Bκ1κ2 local updates, the expected weighted average of the squared gradient norms of F(w) is upper bounded as:

$$\frac{\sum_{k=1}^{K} \eta_q \|\nabla F(w(k))\|^2}{\sum_{k=1}^{K} \eta_q} \leq \frac{4\big[F(w_0) - F(w^*)\big]}{\sum_{k=1}^{K} \eta_q} + \frac{4\rho \sum_{q=1}^{B} G_{nc}(\kappa_1\kappa_2, \eta_q)}{\sum_{k=1}^{K} \eta_q} + \frac{2\beta^2 \sum_{q=1}^{B} \kappa_1\kappa_2\, G_{nc}^2(\kappa_1\kappa_2, \eta_q)}{\sum_{k=1}^{K} \eta_q}. \tag{3}$$

Remark 5. When the step size {η_q} is fixed, the weighted average norm of the gradients converges to some non-zero number. When the step size {η_q} satisfies Σ_{q=1}^∞ η_q = ∞ and Σ_{q=1}^∞ η_q² < ∞, the bound in (3) converges to zero as K → ∞.

IV. EXPERIMENTS

In this section, we present simulation results for HierFAVG to verify the observations from the convergence analysis and to illustrate the advantages of the hierarchical FL system. As shown in Fig. 2, the advantage over the edge-based FL system in terms of model accuracy is obvious. Hence, we focus on the comparison with the cloud-based FL system.

A. Settings

We consider a hierarchical FL system with 50 clients, 5 edge servers, and a cloud server, assuming each edge server serves the same number of clients with the same amount of training data. For the ML tasks, image classification is considered, and the standard MNIST and CIFAR-10 datasets are used. For the 10-class hand-written digit classification dataset MNIST, we use the Convolutional Neural Network (CNN) with 21,840 trainable parameters as in [2]. For the local computation of the training with MNIST on each client, we employ mini-batch Stochastic Gradient Descent (SGD) with batch size 20 and an initial learning rate of 0.01, which decays exponentially at a rate of 0.995 every epoch. For the CIFAR-10 dataset, we use a CNN with 3 convolutional blocks, which has 5,852,170 parameters and achieves 90% testing accuracy in centralized training. For the local computation of the training with CIFAR-10, mini-batch SGD is also employed, with a batch size of 20, an initial learning rate of 0.1, and an exponential learning rate decay of 0.992 every epoch. In the experiments, we also noticed that using SGD with momentum can speed up training and evidently improve the final accuracy, but the benefits of the hierarchical FL system persist with or without momentum. To be consistent with the analysis, we do not use momentum in the experiments.
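As a concrete reference, a minimal PyTorch sketch of the described local optimizer configuration for MNIST is given below; the tiny linear model is a placeholder, not the 21,840-parameter CNN of [2].

```python
import torch
from torch import nn

# Local optimizer configuration described above: mini-batch SGD (no momentum),
# batch size 20, initial learning rate 0.01, exponential decay 0.995 per epoch.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)       # no momentum
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)
loss_fn = nn.CrossEntropyLoss()

def local_epoch(loader):
    """One local training epoch on a client's DataLoader (batch size 20)."""
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    scheduler.step()    # decay the learning rate once per epoch
```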

The non-IID distribution of the client data is a key influencing factor in FL. In our proposed hierarchical FL system, there are two levels of non-IIDness. In addition to the most commonly used non-IID data partition [2], referred to as simple NIID, where each client owns samples of two classes and the clients are randomly assigned to the edge servers, we also consider the following two non-IID cases for MNIST (a partitioning sketch of the simple NIID case follows the list below):

1) Edge-IID: assign each client samples of one class, and assign each edge 10 clients with different classes. The datasets among edges are IID.

2) Edge-NIID: assign each client samples of one class, and assign each edge 10 clients with a total of 5 classes of labels. The datasets among edges are non-IID.
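For reference, the sketch below implements the simple NIID partition together with the random client-to-edge assignment, following the shard-based construction commonly used for [2]; the shard sizes and random seed are illustrative assumptions.

```python
import numpy as np

def simple_niid(labels, num_clients=50, num_edges=5, shards_per_client=2, seed=0):
    """Sort data by label, give each client two shards, assign clients to edges at random."""
    rng = np.random.default_rng(seed)
    sorted_idx = np.argsort(labels, kind="stable")           # group samples by class
    shards = np.array_split(sorted_idx, num_clients * shards_per_client)
    shard_order = rng.permutation(len(shards))
    # Each client receives two shards, so it holds samples of (at most) two classes.
    clients = [np.concatenate([shards[s] for s in
               shard_order[c * shards_per_client:(c + 1) * shards_per_client]])
               for c in range(num_clients)]
    # Clients are randomly assigned to edge servers.
    edges = np.array_split(rng.permutation(num_clients), num_edges)
    return clients, edges

# Usage with synthetic labels standing in for the MNIST training labels.
labels = np.repeat(np.arange(10), 6000)
clients, edges = simple_niid(labels)
```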

In the following, we describe the models for wireless communication and local computation [8]. We ignore possible heterogeneity in the communication conditions and computing resources of different clients. For the communication channel between a client and its edge server, clients upload the model through a wireless channel of B = 1 MHz bandwidth with channel gain h = 10^-8. The transmit power p is fixed at 0.5 W, and the noise power σ is 10^-10 W. For the local computation model, the number of CPU cycles to process one bit, c, is assumed to be 20 cycles/bit, the CPU cycle frequency f is 1 GHz, and the effective capacitance α is 2 × 10^-28. For the communication latency to the cloud, we assume it is 10 times larger than that to the edge. Assume the uploaded model size is M bits and one local iteration involves D bits of data. In this case, the latency and energy consumption of one model upload and one local iteration can be calculated with the following equations (specific parameters are shown in Table I):

$$T^{comp} = \frac{cD}{f}, \qquad E^{comp} = \frac{\alpha}{2}\, c\, D\, f^2, \tag{4}$$

$$T^{comm} = \frac{M}{B \log_2\!\left(1 + \frac{hp}{\sigma}\right)}, \qquad E^{comm} = p\, T^{comm}. \tag{5}$$
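A minimal sketch of Eqs. (4) and (5) with the stated channel and CPU parameters is given below; the model size M and per-iteration data volume D are illustrative assumptions rather than the exact values behind Table I.

```python
import math

B = 1e6          # bandwidth (Hz)
h = 1e-8         # channel gain
p = 0.5          # transmit power (W)
sigma = 1e-10    # noise power (W)
c = 20           # CPU cycles per bit
f = 1e9          # CPU frequency (Hz)
alpha = 2e-28    # effective capacitance

def one_iteration(D_bits):
    """Latency and energy of one local iteration over D_bits of data, Eq. (4)."""
    return c * D_bits / f, alpha / 2 * c * D_bits * f ** 2

def one_upload(M_bits):
    """Latency and energy of one model upload to the edge, Eq. (5)."""
    t = M_bits / (B * math.log2(1 + h * p / sigma))
    return t, p * t

M = 21840 * 32                      # assumed: 21,840-parameter model in 32-bit floats
print(one_iteration(D_bits=1.2e6))  # assumed per-iteration data volume
print(one_upload(M))
```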

To investigate the local energy consumption and training time in an FL system, we define the following two metrics:

1) Tα: the training time to reach a test accuracy level α;
2) Eα: the local energy consumption to reach a test accuracy level α.

B. Results

We first verify the two qualitative guidelines on the key parameters in HierFAVG, i.e., κ1 and κ2, from the convergence analysis. The experiments are done with the MNIST dataset under two non-IID scenarios, edge-IID and edge-NIID.


Table I: The latency and energy consumption parameters for the communication and computation of MNIST and CIFAR-10.

Dataset   | T^comp  | T^comm   | E^comp   | E^comm
MNIST     | 0.024 s | 0.1233 s | 0.0024 J | 0.0616 J
CIFAR-10  | 4 s     | 33 s     | 0.4 J    | 16.5 J

Figure 4: Test accuracy of the MNIST dataset w.r.t. training epochs. (a) Edge-IID. (b) Edge-NIID.

The first conclusion to verify is that more frequent communication with the edge (i.e., fewer local updates κ1) can speed up the training process when the communication frequency with the cloud is fixed (i.e., κ1κ2 is fixed). In Fig. 4a and Fig. 4b, we fix the communication frequency with the cloud server at 60 local iterations, i.e., κ1κ2 = 60, and change the value of κ1. For both kinds of non-IID data distribution, as we decrease κ1, the desired accuracy can be reached with fewer training epochs, which means fewer local computations are needed on the devices.

The second conclusion to verify is that when the datasets among edges are IID and the communication frequency with the edge server is fixed, decreasing the communication frequency with the cloud server does not slow down the training process. In Fig. 4a, the test accuracy curves with the same κ1 = 60 and different κ2 almost coincide with each other. But for edge-NIID in Fig. 4b, when κ1 = 60, increasing κ2 slows down the training process, which strongly supports our analysis. This property indicates that we may be able to further reduce the high-cost communication with the cloud under the edge-IID scenario with little performance loss.

Next, we investigate two critical quantities in collaborative training systems, namely, the training time and the energy consumption of the mobile devices. We compare cloud-based FL (κ2 = 1) and hierarchical FL in Table II, assuming a fixed κ1κ2. A close observation of the table shows that the training time to reach a given test accuracy decreases monotonically as we increase the communication frequency with the edge server (i.e., κ2) for both the MNIST and CIFAR-10 datasets. This demonstrates the great advantage of hierarchical FL over cloud-based FL in terms of training time. The local energy consumption first decreases and then increases as κ2 increases. This is because moderately increasing the client-edge communication frequency reduces the consumed energy, as fewer local computations are needed, while too frequent edge-client communication consumes extra energy for data transmission. If the target is to minimize the device energy consumption, we should carefully balance the computation and communication energy by adjusting κ1 and κ2.

Table II: Training time and local energy consumption.

(a) MNIST with edge-IID and edge-NIID distribution.

                 |      Edge-IID       |      Edge-NIID
                 | E0.85 (J) | T0.85 (s) | E0.85 (J) | T0.85 (s)
κ1 = 60, κ2 = 1  |   29.4    |   385.9   |   30.8    |   405.5
κ1 = 30, κ2 = 2  |   21.9    |   251.1   |   28.6    |   312.4
κ1 = 15, κ2 = 4  |   10.1    |   177.3   |   26.9    |   218.5
κ1 = 6,  κ2 = 10 |   19      |    97.7   |   28.9    |   148.4

(b) CIFAR-10 with simple NIID distribution.

                 | E0.70 (J) | T0.70 (s)
κ1 = 50, κ2 = 1  |  7117.5   |  109800
κ1 = 25, κ2 = 2  |  6731     |   75760
κ1 = 10, κ2 = 5  |  9635     |   65330
κ1 = 5,  κ2 = 10 | 13135     |   49350

V. CONCLUSIONS

In this paper, we proposed a client-edge-cloud hierarchical Federated Learning architecture, supported by a collaborative training algorithm, HierFAVG. A convergence analysis of HierFAVG was provided, leading to some qualitative design guidelines. Experiments showed that the proposed architecture can simultaneously reduce the model training time and the energy consumption of the end devices compared with traditional cloud-based FL. While our study revealed trade-offs in selecting the values of the key parameters of HierFAVG, further investigation is needed to fully characterize and optimize these critical parameters.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 2016.

[2] H. B. McMahan, E. Moore, D. Ramage, and S. Hampson, "Communication-efficient learning of deep networks from decentralized data," in Artificial Intelligence and Statistics, pp. 1273–1282, Apr. 2017.

[3] A. Hard, K. Rao, R. Mathews, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, "Federated learning for mobile keyboard prediction," arXiv preprint arXiv:1811.03604, 2018.

[4] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Trans. Intell. Syst. Technol. (TIST), vol. 10, no. 2, p. 12, 2019.

[5] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A survey on mobile edge computing: The communication perspective," IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 2017.

[6] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in IEEE ICC, May 2019.

[7] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, June 2019.

[8] N. H. Tran, W. Bao, A. Zomaya, N. Minh N.H., and C. S. Hong, "Federated learning over wireless networks: Optimization model design and analysis," in IEEE INFOCOM 2019, Apr. 2019, pp. 1387–1395.

[9] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: System design," in Proc. 2nd SysML Conference, Palo Alto, CA, USA, 2019.

[10] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," arXiv preprint arXiv:1907.02189, 2019.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.

[12] A. Krizhevsky et al., "Learning multiple layers of features from tiny images," Tech. Rep., 2009.

[13] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.

[14] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," arXiv preprint arXiv:1806.00582, 2018.