An Optimal Resource Allocator of Elastic Training for Deep Learning Jobs on Cloud

Liang Hu 1, Jiangcheng Zhu 2, Zirui Zhou 1, Ruiqing Cheng 2, Xiaolong Bai 2 and Yong Zhang 1

1 Huawei Technologies Canada  2 Huawei Cloud

E-mail: [email protected], {zhujiangcheng, zirui.zhou, chengruiqing, baixiaolong1, yong.zhang3}@huawei.com

arXiv:2109.03389v1 [eess.SY] 8 Sep 2021

Abstract

Cloud training platforms, such as Amazon Web Services and Huawei Cloud, provide users with computational resources to train their deep learning jobs. Elastic training is a service embedded in cloud training platforms that dynamically scales up or down the resources allocated to a job. The core technique of an elastic training system is to best allocate limited resources among heterogeneous jobs in terms of shorter queueing delay and higher training efficiency. This paper presents an optimal resource allocator for elastic training systems that leverages a mixed-integer programming (MIP) model to maximize the training progress of deep learning jobs. We take advantage of real-world job data obtained from ModelArts, the deep learning training platform of Huawei Cloud, and conduct simulation experiments to compare the optimal resource allocator with a greedy one as benchmark. Numerical results show that the proposed allocator can reduce queueing time by up to 32% and accelerate training efficiency by up to 24% relative to the greedy resource allocator, thereby greatly improving user experience with Huawei ModelArts and potentially enabling higher profits for the product. The optimal resource allocator is also fast in decision-making, taking merely 0.4 seconds on average.

Keywords: elastic training, resource allocation, optimization, deep learning, cloud computing, ETA

1 Introduction

Cloud training platforms, such as Amazon Web Services and Microsoft Azure, provide abundant computational resources for training deep learning (DL) models and charge users by usage.
Currently, when users submit a deep learning job to the cloud, they are required to specify the desired computational resources, e.g., the number of GPUs or nodes [1–3]. Cloud training platforms without elasticity will use the specified, and thus fixed, resources to train the job. However, there are at least two problems with using a fixed amount of resources to perform a training job.

First, a system using a fixed number of nodes or GPUs for any given training job may use its computational resources inefficiently. If the system is performing a small number of training jobs at a given time, it will leave many of its computational resources idle. In other words, each training job could have utilized more nodes or GPUs in order to complete sooner, instead of wasting the computational capacity of the idle resources. For example, a system with 100 nodes performing only a single training job, where the fixed number of nodes assigned to the training job is 8, is wasting 92 of its nodes.

Second, the system's computational resources are always limited by the size of its resource pool, e.g., the number of nodes available for allocation to training jobs. It is common for a system to receive multiple job profiles from users while a small number of computationally intensive training jobs are monopolizing the resource pool, requiring the system to hold the later job profiles in a job queue for a significant period of time while waiting for the computationally intensive jobs to complete. This introduces significant delays, even for small training jobs that could be completed quickly if any nodes were available. These delays are inefficient in terms of meeting the needs of the user base and tend to generate dissatisfaction in users who experience them.
Accordingly, cloud systems have been developed that perform elastic training of deep learning models (referred to herein as "elastic training systems") to address the limitations of systems using a fixed amount of resources for a given training job [4,5]. An elastic training system dynamically allocates computational resources (e.g., nodes) to training jobs based on the status of the system (e.g., how many nodes are in use, how many jobs are in the job queue) and job attributes (e.g., how computationally intensive a given training job is) to address the two problems described above. If the system



has abundant computational resources available (e.g., a large number of idle nodes), an elastic training system may scale up one or more ongoing training jobs, i.e., allocate more nodes or other computational resources to them. If an elastic training system is busy (e.g., all nodes are being used for ongoing training jobs), it scales down one or more of the ongoing jobs, i.e., releases some nodes or other computational resources so that new training jobs can use the released resources instead of waiting in the job queue. Fig. 1 is an example in which different jobs are training on different numbers of nodes. If a system has no elastic training service, the resource pool may develop fragments, meaning some nodes are wasted. With elastic training, jobs 5 and 7 may scale up to utilize more nodes, and job 6 may scale down to release part of its resources to job 8, which has just entered the pool.

Figure 1: Resource pool and job queue of an elastic training system.

The core of an elastic training system is its resource allocator. A resource allocator should optimally decide on the nodes or other computational resources assigned to each training job so that the elastic training system can (1) improve utilization of computational resources, (2) speed up the overall training time required to complete a given set of training jobs, (3) reduce queueing delay, and (4) improve the user experience when submitting a job profile to the system. By achieving one or more of these objectives, and providing the resulting benefits to users, an elastic training system may also realize higher profits in providing a paid deep learning product-as-a-service (PaaS) to users, through a combination of higher revenue due to improved service and lower overhead costs due to more efficient use of resources.

Existing works cover both the mechanisms of elastic training and resource management in elastic training systems. To achieve elasticity while training deep learning jobs, the industry has proposed software frameworks such as Google Kubernetes Engine (GKE) [6], Amazon Elastic Container Service (ECS) [7], and Red Hat OpenShift Container Platform [8]. Shen et al. [9] presented an automatic elastic resource scaling system, called CloudScale, for multi-tenant cloud computing infrastructures, though the infrastructures are not specifically for deep learning training. Beyond elasticity, resource management also plays a critical role in cloud system operations. One study [10] formulated the resource management problem for MapReduce jobs on the cloud as a constraint programming model, with the objective of reducing the energy consumption of MapReduce jobs. Liu and Xu [11] proposed an executor scheduler that can dynamically allocate and size resources for Spark jobs in order to minimize resource fragmentation. Another work by Javadi et al. [12] also presents a workload resource scheduler for Spark jobs that can significantly increase cloud resource usage.

In recent years, studies on resource allocation specifically for elastic training have been drawing more attention. Chen et al. [13] found that the parameter servers for deep neural network training could become performance bottlenecks due to imbalanced workload distribution among them. They designed a dynamic workload distribution scheme using an exploitation-exploration method to accelerate distributed model training. A similar study [14] also addressed the imbalanced workload distribution problem using a semi-dynamic load balancing approach, which accelerates distributed training by up to 54%. Saxena et al. [15] proposed a GPU-level resource allocator. The core idea is to find the best combination of batch size and number of GPUs to elastically train deep learning jobs. Their approach aims to maximize the total throughput of all jobs and searches for the optimal combination through dynamic programming. However, this approach only works for GPU-level resource allocation and is not transferable to the node level. GPU-level resource allocation is a form of process management that manages the execution of individual software processes by individual processor cores. Resource allocation at the node level, however, implements a form of cluster management or container management, both of which refer to the management of containers (i.e., a bundle of a software program and its data dependencies) and their execution by virtualized clusters of processing resources. In addition, the allocation decisions in that work do not ensure that the number of GPUs assigned to a job is a power of two, which may lower training accuracy [16]. Parallel computing typically requires computational operations to be split recursively by powers of two to avoid accuracy problems. Another practice in industry [16] tried to greedily utilize all computational resources at all times. Section 3 thoroughly reviews this greedy resource allocator, and we will use it as the benchmark for comparison with our work.

This paper presents a novel resource allocator for elastic training systems that takes as input the ETA (estimated time of arrival), also known as the estimated runtime, of deep learning jobs. We first formulate the resource allocation problem as a mixed-integer programming (MIP) model that maximizes the training progress of all jobs over a planning time horizon. The model also uses an innovative method to ensure that the number of nodes allocated to each job is 2^m (m ∈ N). Obtaining the optimal resource allocation decisions is very fast. The decisions can significantly reduce queueing delay and improve training efficiency. In addition, the proposed allocator can better handle jobs with heterogeneous ETAs and performs robustly under ETA disturbance and unexpected situations, such as jobs containing bugs and users terminating jobs themselves.

This paper is organized as follows. Section 2 introduces the architecture of the elastic training system of Huawei ModelArts. Section 3 reviews a previous practice of Huawei ModelArts, i.e., the greedy resource allocator, and its limitations. Section 4 explains the methodology behind the optimal resource allocator and how we linearize and simplify the MIP model to make it solvable and efficient. Section 5 takes advantage of real-world job data from Huawei ModelArts and conducts simulation experiments to compare the two resource allocators. Section 6 concludes the paper.

2 Architecture of Elastic Training System of Huawei ModelArts

Huawei ModelArts is a one-stop artificial intelligence development platform of Huawei Cloud. Thousands of deep learning jobs are trained on this platform every day. Elastic training is a service embedded in ModelArts that provides the capability to dynamically scale up or scale down distributed training jobs. Elastic training is critical to ModelArts since it holds the potential to improve computational resource utilization and accelerate users' training jobs, and as a result, realize higher profits. Fig. 2 shows a screenshot of the ModelArts elastic training service. When users submit a job profile to the platform, they need to specify the desired number of nodes required for performing the training job and the number of elastic nodes.

The architecture of the elastic training system of Huawei ModelArts consists of deep learning jobs, a job queue, a resource pool, a resource allocator, and other modules, as illustrated in Fig. 3. The resource allocator is the core module; it acts as the "brain" of the system to make the best resource allocation decisions.

A job is defined as a workload to be run on certain computational resources, i.e., GPUs; 1 node is made up of 8 GPUs. Jobs are trained distributedly on different numbers of nodes. The resource allocator can decide to scale up or scale down the computational resources of a certain job, i.e., its number of nodes. Jobs adapt to changes in their resources and keep on training without interruption.

To initiate a training job, users should first publish an algorithm to the Algorithm Management Module, including the algorithm code and the algorithm image. The code and image will be pulled by the resource pool after the job is allocated to certain nodes by the resource allocator.

Once users have submitted a training job to the elastic training system, the system transfers the application programming interface (API) body of the elastic training job into an API body that the cluster management module can process. The basic job profile information, including creation time, starting time, flavor, etc., is stored in the database. The cluster management module monitors the status of each workload and each node, and makes real-time job allocations to idle nodes. The computing node mounts the cloud storage, pulls the Docker image from the hub, downloads the code, and then starts the script.

The resource allocator is the core module of the system. It has access to all active jobs (queueing or training), each job's remaining ETA, and the total computational resources. Based on the elastic training frequency f, e.g., 5 minutes, the resource allocator activates, makes decisions on scaling up or scaling down each active training job, and determines how the nodes in the resource pool should be re-assigned. Finally, the decisions are sent back to the job queue and resource pool for execution.

If some nodes are idle, the first job in the queue starts training using as many nodes as possible under a cap (e.g., 16 nodes). If idle nodes still remain, the same procedure is followed to initiate the next queueing job until no queueing jobs or no idle nodes exist.

Figure 2: The elastic training service of Huawei ModelArts.


Figure 3: Architecture of the elastic training system for deep learning jobs on cloud.

3 Greedy Resource Allocator

One previous work [16] of Huawei Cloud proposed a greedy resource allocator for elastic training. This is a rule-based allocator that tries to greedily utilize as many nodes as possible. We use this allocator as the benchmark for comparison with our work. Its rules are explained as follows.

The greedy resource allocator allocates the resource pool of the elastic training system based on four different scenarios, in which every training job is allocated a node count within a range, such as 1 to 16 nodes.

In the first scenario, the system has at least one idle node and at least one training job in the job queue. The greedy allocator allocates as many nodes as possible to the training job at the front of the job queue. If there are still idle nodes and training jobs in the job queue, this procedure is repeated until all nodes are occupied or all training jobs have exited the job queue and are being performed.

In the second scenario, the system has at least one idle node and no training jobs in the job queue. The greedy resource allocator finds the training job with the shortest training time, and then scales up this training job by increasing its node count as much as possible. If there are still idle nodes, this procedure is repeated until all nodes have been occupied or all training jobs have scaled up.

In the third scenario, the system has no idle nodes and at least one training job in the job queue. Thus, the computational resources of the system have reached their limit. Some training jobs might be occupying all the nodes while many others have to wait in the job queue. The greedy resource allocator finds the training job with the longest training time, scales down that job by reducing its node count by half, and then allocates the released nodes to the training job at the front of the job queue.

In the fourth scenario, the system has no idle nodes and no training jobs in the job queue. This is the simplest scenario. All nodes are occupied and no training jobs are waiting. In this case, the elastic training system changes nothing about the current node allocation.
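The four scenarios can be sketched as follows. This is our own illustrative reconstruction of the rules, not the production implementation: the function `greedy_step` and its data structures are hypothetical, scale-up proceeds by doubling so that node counts stay powers of two (a requirement the paper imposes elsewhere), and only one scale-down is performed per round.

```python
def greedy_step(idle, queue, running, cap=16):
    """One decision round of a greedy allocator (illustrative sketch).

    idle: number of idle nodes; queue: FIFO list of waiting job ids;
    running: dict job_id -> {"nodes": int, "train_time": float}.
    """
    # Scenario 1: idle nodes and queued jobs -> start jobs front-first,
    # each with the largest power-of-two node count that fits under cap.
    while idle > 0 and queue:
        job = queue.pop(0)
        grant = 1
        while grant * 2 <= min(idle, cap):
            grant *= 2
        running[job] = {"nodes": grant, "train_time": 0.0}
        idle -= grant
    # Scenario 2: idle nodes, empty queue -> scale up the job with the
    # shortest training time by doubling its node count.
    while idle > 0 and running:
        job = min(running, key=lambda j: running[j]["train_time"])
        cur = running[job]["nodes"]
        if cur >= cap or cur > idle:
            break  # simplification: stop instead of trying other jobs
        running[job]["nodes"] = cur * 2
        idle -= cur
    # Scenario 3: no idle nodes, queued jobs -> halve the job with the
    # longest training time and hand the released nodes to the queue head.
    if idle == 0 and queue and running:
        job = max(running, key=lambda j: running[j]["train_time"])
        if running[job]["nodes"] > 1:
            released = running[job]["nodes"] // 2
            running[job]["nodes"] -= released
            new_job = queue.pop(0)
            running[new_job] = {"nodes": released, "train_time": 0.0}
    # Scenario 4 (no idle nodes, empty queue) changes nothing.
    return idle
```

Running this on the two Fig. 4 situations reproduces the described behavior: with 2 idle nodes and no queue, the shortest-running job doubles from 2 to 4 nodes; with no idle nodes and one queued job, the longest-running job is halved from 4 to 2 nodes and the released 2 nodes go to the queued job.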

Fig. 4 gives two examples that help better understand the rules of the greedy resource allocator. Assume the scaling range is 1–16 nodes. In the first example on the top, the elastic training system has 2 idle nodes but no queueing jobs. Four jobs are currently training and job 4 has the shortest training time, so the 2 idle nodes are allocated to it. In the second example at the bottom, job 8 is waiting but the system has no idle nodes. Job 5 has the longest training time, so the allocator scales down its node count from 4 to 2 and assigns the released 2 nodes to job 8.

Figure 4: Two examples of the greedy resource allocator.

The resource allocator is called "greedy" because it always tries to utilize the system's computational resources to their fullest extent, i.e., leave no nodes idle. The rules governing its behavior are simple and computationally fast. However, this simplicity results in several limitations.

First, while the greedy resource allocator keeps as many nodes working as possible, the allocation of nodes to training jobs may not be efficient or fair. For example, a greedy resource allocator may inefficiently allocate 99 nodes to job 1 and 1 node to job 2, instead of allocating 50 nodes to each. Although both allocations utilize all 100 nodes, the second one is more equitable and may result in greater overall efficiency.

Second, training time may not be a good metric for deciding which job should scale up or down. The greedy resource allocator scales up the job with the shortest training time, but if this job has a very small workload, one node may be sufficient; the additional nodes might be more effectively deployed to a larger training job. Similarly, the greedy resource allocator scales down the job with the longest training time, but this may result in computationally intensive training jobs having their node count reduced repeatedly, leading to unnecessarily long training times.

Third, the greedy allocator's decisions are short-sighted. The allocator only deals with what is currently happening in the elastic training system, with no consideration for the future. Because the system will face different computational demands in the future, it is necessary to look ahead and plan computational resource allocation accordingly.

4 Optimal Resource Allocator

This paper proposes an optimal resource allocator that overcomes the above limitations of the greedy one. This section presents the methodology behind it. Please refer to Table 1 for all notations. We adopt a rolling-horizon approach to plan resource allocation for the future. The optimization problem is first formulated as a mixed-integer non-linear programming model. We then linearize and simplify the model into a mixed-integer linear one to make it solvable.

4.1 A Rolling-horizon Approach

The optimal resource allocator adopts a rolling-horizon approach. This approach discretizes a planning time horizon into multiple time steps t ∈ T = {1, 2, 3, ...}, and the length of each time step equals the elastic training frequency f, e.g., 5 minutes. Therefore, this approach can look ahead and plan decisions for the future. We consider a job's remaining ETA as its computation demand. For each job i ∈ I at time step t ∈ T, we use n_i^t nodes to serve s_i^t out of job i's demand d_i.
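This discretization can be illustrated with a small helper (the function name `horizon_steps` is ours; the step-to-hours factor it returns corresponds to the time step adjustment parameter p defined in Table 1):

```python
def horizon_steps(horizon_minutes, f_minutes):
    # Split the planning horizon into time steps whose length equals the
    # elastic training frequency f; p converts one step to hours so that
    # served demand accumulates in node-hours per step (p = f / 60).
    steps = horizon_minutes // f_minutes
    p = f_minutes / 60.0
    return steps, p
```

For instance, a 4-hour look-ahead with a 30-minute elastic training frequency yields 8 time steps and p = 0.5, matching the setting of the example in Fig. 5.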

Fig. 5 is an example in which the optimal resource allocator looks 4 hours ahead at 0:00. If the elastic training frequency is 30 minutes, there will be 8 time steps. Job 1 is submitted at 0:00 and its ETA is 3 hours on a single node. We may use n_1^1 = 1 node to serve s_1^1 node·hr of demand out of the total 3 node·hr. We may use 2, 1, and 4 nodes for the next three time steps, respectively, so job 1 might be done in only 2 hours.

Table 1: Notations

Sets
  I            set of deep learning jobs i
  T            set of time steps t in a planning time horizon, i.e., {1, 2, 3, ...}
  K            set of legal numbers of nodes k, i.e., {1, 2, 4, 8, 16, 32, ...}

Parameters
  d_i          remaining ETA or computation demand of job i ∈ I
  n_{i,min}    minimum number of nodes for job i ∈ I
  n_{i,max}    maximum number of nodes for job i ∈ I
  N            number of nodes in the resource pool
  f            elastic training frequency (unit: minute)
  p            time step adjustment parameter
  M            the big-M, a sufficiently large number

Decision variables
  n_i^t        allocated number of nodes for job i ∈ I at time step t ∈ T, integer
  n_i          implemented allocation decision for job i ∈ I, integer
  s_i^t        served computation demand for job i ∈ I at time step t ∈ T
  y_i^t        whether job i ∈ I has finished training at time step t ∈ T, binary
  δ_{i,k}^{t,-}, δ_{i,k}^{t,+}   indicators for selecting k ∈ K nodes for job i ∈ I at time step t ∈ T, binary

Figure 5: The rolling-horizon approach.

4.2 Non-linear Formulation

The objective of the optimal resource allocator is to maximize the training progress of all the jobs I over the look-ahead time horizon T, as shown in (1). A job's training progress is defined as its total served demand during T, i.e., ∑_t s_i^t, over its demand d_i.

  max ∑_i ∑_t s_i^t / d_i    (1)

subject to

  s_i^t ≤ d_i    ∀i ∈ I, ∀t ∈ T    (2)

  s_i^1 ≤ p · n_i^1 · 0.8^{log2 n_i^1}    ∀i ∈ I    (3)

  s_i^t ≤ s_i^{t-1} + p · n_i^t · 0.8^{log2 n_i^t}    ∀i ∈ I, ∀t ∈ T, t ≥ 2    (4)

  n_i^t ≤ n_{i,max}    ∀i ∈ I, ∀t ∈ T    (5)

  ∑_i n_i^t ≤ N    ∀t ∈ T    (6)

  y_i^t ≤ s_i^t / d_i    ∀i ∈ I, ∀t ∈ T    (7)

  y_i^t ≥ 1 − M(1 − s_i^t / d_i)    ∀i ∈ I, ∀t ∈ T    (8)

  n_i^1 ≥ n_{i,min}    ∀i ∈ I    (9)

  n_i^t ≥ n_{i,min} − M · y_i^{t-1}    ∀i ∈ I, ∀t ∈ T, t ≥ 2    (10)

  n_i^t ≤ M(1 − s_i^{t-1} / d_i)    ∀i ∈ I, ∀t ∈ T, t ≥ 2    (11)

  n_i^t ≥ 0, integer    ∀i ∈ I, ∀t ∈ T    (12)

  s_i^t ≥ 0    ∀i ∈ I, ∀t ∈ T    (13)

  y_i^t ∈ {0, 1}    ∀i ∈ I, ∀t ∈ T    (14)

Constraint (2) ensures that the served demand s_i^t at time step t cannot exceed the job's demand. Constraints (3) and (4) model the training process, in which p is the time step adjustment parameter; for example, p = 30/60 = 0.5 if each time step is 30 minutes. Given n_i^t nodes to train job i at time step t, the served demand accumulates by p · n_i^t · 0.8^{log2 n_i^t}. This non-linear relationship between the number of nodes and training speed, shown in Fig. 6, was found from the historical data of Huawei ModelArts [3]. The ideal training speed is linearly related to the number of nodes; however, the actual training speed decreases by 20% on average each time the number of nodes doubles (i.e., the linear training speed is multiplied by 0.8 every time the number of nodes doubles). While a job can be trained faster using more nodes, the speed attenuation becomes larger as the node count increases. System-level training efficiency is likely to degrade if only a few jobs monopolize all the resources.
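The speed model can be written as a one-line function (a sketch of the stated empirical relationship; the function name is ours):

```python
import math

def training_speed(n):
    # Actual training speed with n nodes: the ideal linear speed n is
    # multiplied by 0.8 each time the node count doubles, i.e.,
    # n * 0.8**log2(n), equivalently n**(1 + log2(0.8)) ≈ n**0.678.
    return n * 0.8 ** math.log2(n)
```

With 1, 2, 4, and 8 nodes the speeds are 1, 1.6, 2.56, and 4.096: doubling the node count never doubles the speed, which is why spreading nodes across jobs serves more total demand than piling them onto one job.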

Constraints (5) and (6) are straightforward: every job has an upper bound on its node count, denoted n_{i,max}, and the total allocated nodes must not exceed the total resources, i.e., N nodes. The lower bound conditions, however, are trickier. On the one hand, we must allocate a minimum number of nodes n_{i,min} (usually 1 node) while job i is training. On the other hand, if job i has finished training at time step t, then its node count should return to zero starting from the next time step t+1. To satisfy both requirements, we introduce an extra binary variable y_i^t to indicate whether job i has finished training at time step t, together with a sufficiently large number M. If job i has not finished yet, y_i^t = 0; otherwise, y_i^t = 1, as enforced by constraints (7) and (8). Constraints (9) and (10) ensure that the minimum resource is allocated while a job is still training. Constraint (11) lets the node count return to zero after a job is done. Constraints (12)–(14) are the bounds of the decision variables.

Figure 6: Relationship between number of nodes and training speed.

However, the above formulation is a mixed-integer non-linear programming model and hard to solve. The training process constraints (3) and (4) are non-linear and not easy to handle. Constraints (7)–(11) contain many integer decision variables and constraints, which may make the model very slow to reach optimal solutions. Note that decision-making for our resource allocation problem should be real-time or near real-time. In addition, the allocated number of nodes for each job must be 2^m (m ∈ N) for training accuracy reasons, i.e., n_i^t ∈ K = {1, 2, 4, 8, 16, 32, ...}, but this formulation cannot meet this requirement.

4.3 Linearized and Simplified Formulation

We must linearize the non-linear constraints and simplify the above model formulation to make it solvable and fast. First, we drop the binary variables y_i^t and replace constraints (7)–(11) with a new constraint (15). This constraint allows the node count not to return to zero after a job has finished training. The impact is minimal because the optimal resource allocator adopts the rolling-horizon approach, in which only the decision in the first time step, i.e., n_i = n_i^1, is implemented. Taking Fig. 7 as an example: after the job has finished training at the second time step, we allow the minimum number of nodes (1 node) in the optimal solution, and we allocate 2 nodes (the decision of the first time step) to train this job. The problem size decreases significantly with this simplification; for a problem with 100 jobs and 5 time steps, we save 500 integer variables and 1800 constraints.

  n_i^t ≥ n_{i,min}    ∀i ∈ I, ∀t ∈ T    (15)

Figure 7: An example of the node count not returning to zero after a job has finished.

The powers-of-two requirement can be met by introducing two binary indicators, δ_{i,k}^{t,-} and δ_{i,k}^{t,+}, and constraints (16)–(18). We allocate n_i^t = k ∈ K nodes to job i at time step t if and only if δ_{i,k}^{t,-} = δ_{i,k}^{t,+} = 1. Constraints (16) and (17) add lower and upper bounds to the difference between n_i^t and k. Constraint (18) ensures that exactly |K| + 1 inequalities in (16) and (17) hold, so the value of n_i^t must be a selection from the set K. For example, if the range of elastic training is from 1 node to 16 nodes, i.e., k ∈ K = {1, 2, 4, 8, 16}, we may allocate n_1^1 = k = 4 nodes to job 1 at time step 1 if and only if δ_{1,4}^{1,-} = δ_{1,4}^{1,+} = 1. In this case, exactly |K| + 1 = 6 inequalities hold, i.e., n_1^1 ≥ 1, n_1^1 ≥ 2, n_1^1 ≤ 4, n_1^1 ≥ 4, n_1^1 ≤ 8, and n_1^1 ≤ 16, while the remaining 4 inequalities, i.e., n_1^1 ≤ 1, n_1^1 ≤ 2, n_1^1 ≥ 8, and n_1^1 ≥ 16, do not hold. By this method, the optimal value of each n_i^t must be 2^m (m ∈ N).

  (1 − δ_{i,k}^{t,-}) / M − M · δ_{i,k}^{t,-} ≤ n_i^t − k ≤ M · (1 − δ_{i,k}^{t,-})    ∀i ∈ I, ∀t ∈ T, ∀k ∈ K    (16)

  (1 − δ_{i,k}^{t,+}) / M − M · δ_{i,k}^{t,+} ≤ k − n_i^t ≤ M · (1 − δ_{i,k}^{t,+})    ∀i ∈ I, ∀t ∈ T, ∀k ∈ K    (17)

  ∑_k δ_{i,k}^{t,-} + ∑_k δ_{i,k}^{t,+} = |K| + 1    ∀i ∈ I, ∀t ∈ T    (18)
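A quick brute-force check (our own illustration, not part of the model) confirms that constraint (18) admits exactly the node counts in K: for each candidate n, count how many of the 2|K| indicator-activated inequalities n ≤ k and n ≥ k can hold.

```python
def selection_count(n, K):
    # delta-minus_k can be 1 only if n <= k holds; delta-plus_k only if
    # n >= k holds (constraints (16)-(17), with big-M relaxing the rest).
    return sum(n <= k for k in K) + sum(n >= k for k in K)

K = [1, 2, 4, 8, 16]
# Constraint (18) requires the count to equal |K| + 1 = 6, which happens
# exactly when n equals some k in K (that k contributes to both sums).
feasible = [n for n in range(1, 17) if selection_count(n, K) == len(K) + 1]
```

Here `feasible` comes out as [1, 2, 4, 8, 16], i.e., only powers of two survive.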

Since the binary variables δ_{i,k}^{t,-} and δ_{i,k}^{t,+} indicate whether the allocated number of nodes is a selection from the set K, we can replace constraints (3)-(4) with two linear constraints (19)-(20). If a k ∈ K is selected, the training speed will be exactly k · 0.8^{log2 k}, as shown by the actual training speed curve in Fig. 6.

st=1i ≤ p ·

∑k

k · 0.8log2 k · (δt=1,−i,k + δt=1,+

i,k − 1)

∀i ∈ I(19)

sti ≤ st−1i + p ·

∑k

k · 0.8log2 k · (δt,−i,k + δt,+i,k − 1)

∀i ∈ I,∀t ∈ T, t ≥ 2

(20)

Proposition. The binary variables δ_{i,k}^{t,−} and δ_{i,k}^{t,+} do not introduce estimation errors to actual training speeds.

Proof. If k* ∈ K is selected as the optimal solution for job i at time step t, then δ_{i,k*}^{t,−} = δ_{i,k*}^{t,+} = 1. By (18), for any k ∈ K with k ≠ k*, δ_{i,k}^{t,−} + δ_{i,k}^{t,+} = 1 must hold. Therefore, the training speed is Σ_k k·0.8^{log₂ k} · (δ_{i,k}^{t,−} + δ_{i,k}^{t,+} − 1) = 1·0.8^{log₂ 1}·(1 − 1) + ··· + k*·0.8^{log₂ k*}·(2 − 1) + ··· = k*·0.8^{log₂ k*}, which is exactly the actual speed shown in Fig. 6.
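The proposition can also be checked numerically. The sketch below evaluates the sum in (19)-(20) for every choice of k* (using the same indicator convention, δ⁻ = 1 iff n ≤ k and δ⁺ = 1 iff n ≥ k) and compares it with the actual speed k*·0.8^{log₂ k*}:

```python
import math

K = [1, 2, 4, 8, 16]

def actual_speed(k):
    # Training speed with k nodes, per the speed curve in Fig. 6.
    return k * 0.8 ** math.log2(k)

def recovered_speed(k_star):
    # Sum from (19)-(20), with delta^- = 1 <=> k_star <= k and
    # delta^+ = 1 <=> k_star >= k; only the k = k_star term survives.
    return sum(actual_speed(k) * ((k_star <= k) + (k_star >= k) - 1) for k in K)

for k_star in K:
    assert abs(recovered_speed(k_star) - actual_speed(k_star)) < 1e-9
```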

4.4 Final Formulation

In summary, we re-formulate the optimal resource allocator as a mixed-integer linear program (MILP) as follows. We use the open-source optimization solver CBC [17] to search for optimal solutions.

max_n  Σ_i Σ_t s_i^t / d_i

subject to
s_i^t ≤ d_i   ∀i ∈ I, ∀t ∈ T
n_i^t ≤ n_{i,max}   ∀i ∈ I, ∀t ∈ T
Σ_i n_i^t ≤ N   ∀t ∈ T
n_i^t ≥ n_{i,min}   ∀i ∈ I, ∀t ∈ T
(1 − δ_{i,k}^{t,−})/M − M·δ_{i,k}^{t,−} ≤ n_i^t − k ≤ M·(1 − δ_{i,k}^{t,−})   ∀i ∈ I, ∀t ∈ T, ∀k ∈ K
(1 − δ_{i,k}^{t,+})/M − M·δ_{i,k}^{t,+} ≤ k − n_i^t ≤ M·(1 − δ_{i,k}^{t,+})   ∀i ∈ I, ∀t ∈ T, ∀k ∈ K
Σ_k δ_{i,k}^{t,−} + Σ_k δ_{i,k}^{t,+} = |K| + 1   ∀i ∈ I, ∀t ∈ T
s_i^{t=1} ≤ p · Σ_k k·0.8^{log₂ k} · (δ_{i,k}^{t=1,−} + δ_{i,k}^{t=1,+} − 1)   ∀i ∈ I
s_i^t ≤ s_i^{t−1} + p · Σ_k k·0.8^{log₂ k} · (δ_{i,k}^{t,−} + δ_{i,k}^{t,+} − 1)   ∀i ∈ I, ∀t ∈ T, t ≥ 2
s_i^t ≥ 0   ∀i ∈ I, ∀t ∈ T
n_i^t ≥ 0, integer   ∀i ∈ I, ∀t ∈ T
δ_{i,k}^{t,−}, δ_{i,k}^{t,+} ∈ {0, 1}   ∀i ∈ I, ∀t ∈ T, ∀k ∈ K
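A MILP of this shape can be prototyped in PuLP, which ships with a CBC backend. The sketch below is a minimal illustration only: the two toy jobs, the pool size N = 20, p = 1, and M = 100 are our assumptions, not values from the paper, and this is not the production implementation.

```python
import math
import pulp

K = [1, 2, 4, 8, 16]            # allowed node counts (powers of two)
T = list(range(1, 6))           # 5 look-ahead time steps
jobs = {                        # i -> (d_i, n_i_min, n_i_max); toy data
    "job1": (100.0, 1, 8),
    "job2": (40.0, 1, 16),
}
N = 20                          # size of the resource pool (assumed)
p = 1.0                         # progress per unit speed (assumed)
M = 100                         # big-M constant (assumed)
J = list(jobs)

prob = pulp.LpProblem("elastic_training", pulp.LpMaximize)
n = pulp.LpVariable.dicts("n", (J, T), lowBound=0, cat="Integer")
s = pulp.LpVariable.dicts("s", (J, T), lowBound=0)
dm = pulp.LpVariable.dicts("dm", (J, T, K), cat="Binary")   # delta^-
dp = pulp.LpVariable.dicts("dp", (J, T, K), cat="Binary")   # delta^+

# Objective: total normalized training progress over the horizon.
prob += pulp.lpSum((1.0 / jobs[i][0]) * s[i][t] for i in J for t in T)

for i in J:
    d, nmin, nmax = jobs[i]
    for t in T:
        prob += s[i][t] <= d                  # progress capped by demand
        prob += n[i][t] <= nmax               # elastic range, upper
        prob += n[i][t] >= nmin               # constraint (15)
        for k in K:
            # (16): dm = 1  <=>  n <= k
            prob += (1 - dm[i][t][k]) * (1.0 / M) - M * dm[i][t][k] <= n[i][t] - k
            prob += n[i][t] - k <= M * (1 - dm[i][t][k])
            # (17): dp = 1  <=>  n >= k
            prob += (1 - dp[i][t][k]) * (1.0 / M) - M * dp[i][t][k] <= k - n[i][t]
            prob += k - n[i][t] <= M * (1 - dp[i][t][k])
        # (18): force n to be a power of two in K
        prob += pulp.lpSum(dm[i][t][k] + dp[i][t][k] for k in K) == len(K) + 1
        # (19)-(20): progress grows at the actual training speed
        gain = p * pulp.lpSum(
            k * 0.8 ** math.log2(k) * (dm[i][t][k] + dp[i][t][k] - 1) for k in K
        )
        prob += (s[i][t] <= gain) if t == 1 else (s[i][t] <= s[i][t - 1] + gain)

for t in T:
    prob += pulp.lpSum(n[i][t] for i in J) <= N   # pool capacity

prob.solve(pulp.PULP_CBC_CMD(msg=0))
alloc = {i: int(pulp.value(n[i][1])) for i in J}  # first-step decisions
```

Only the first-step decisions in `alloc` would be dispatched to the resource pool; the model is re-solved at the next scaling moment under the rolling-horizon scheme.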

5 Simulation

To compare the proposed optimal resource allocator with the greedy one that Huawei Cloud previously implemented, this section designs a system that simulates the training process of deep learning jobs on the cloud. Several experiments were conducted to examine each resource allocator's queueing delay, training efficiency, performance on heterogeneous-ETA jobs, robustness to ETA disturbance, and performance under scaling delay. In addition, this section reports the decision-making speed of the optimal resource allocator.

5.1 Simulation Setup

We set up a simulation framework to conduct experiments for the elastic training system. The system's allocation decisions are made either by the optimal resource allocator or by the greedy resource allocator. The simulation moves forward second by second. In each second, the system may keep training some jobs, initiate queueing jobs, or finish others, and the simulation updates the status of each job. The resource allocator makes decisions every 5 minutes, e.g., at 9:00:00, 9:05:00, and 9:10:00. The look-ahead time horizon is 25 minutes and the length of each time step equals the elastic training frequency, so there are 5 time steps. When the simulation arrives at a scaling moment, the resource allocator activates and tells the resource pool how to scale jobs up or down. During regular times, the simulation allocates maximum resources to queueing jobs, if any. For example, in Fig. 8 job 1 finishes training at 9:02:09 and releases 8 nodes. Job 2 has been in the queue since 9:01:02, so 1 second later the system allocates the 8 nodes to it.
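The per-second loop with 5-minute scaling moments can be sketched as follows. The `Job` fields and the toy even-split allocator are our illustrative assumptions, not the simulator's actual design:

```python
DECISION_PERIOD = 300   # allocator runs every 5 minutes (in seconds)

class Job:
    def __init__(self, work, arrival):
        self.work = work        # remaining work, in node-seconds
        self.arrival = arrival  # submission time, in seconds
        self.nodes = 0
        self.done_at = None

def even_split(active, total_nodes):
    # Toy allocator (illustrative only): split the pool evenly.
    share = max(1, total_nodes // max(1, len(active)))
    for j in active:
        j.nodes = share

def simulate(jobs, total_nodes, allocator):
    t = 0
    while any(j.done_at is None for j in jobs):
        active = [j for j in jobs if j.arrival <= t and j.done_at is None]
        if t % DECISION_PERIOD == 0:
            allocator(active, total_nodes)      # scaling moment
        else:
            # regular times: hand idle nodes to queueing (node-less) jobs
            idle = total_nodes - sum(j.nodes for j in active)
            for j in active:
                if j.nodes == 0 and idle > 0:
                    j.nodes, idle = idle, 0
        for j in active:                        # advance one second
            j.work -= j.nodes
            if j.work <= 0:
                j.done_at, j.nodes = t, 0
        t += 1
    return t
```

With two toy jobs on 8 nodes, the second job waits in the queue until the first finishes, then immediately receives the released nodes at the next second, mirroring the Fig. 8 example.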

Data used for the simulation come from deep learning jobs on Huawei ModelArts. There were 1252 jobs submitted from January 24, 2021 14:25:41 to January 26, 2021 16:07:19, spanning around 2 days and 2 hours. On average, a job spent 2.8 minutes in the queue, 84.3 minutes in training, and 87.1 minutes in total. Among these jobs, 447 had a training time of at least 5 minutes; for them, the average queueing time is 7.2 minutes, the average training time is 232.6 minutes, and the average total time is 239.9 minutes. We select these 447 jobs as the simulation baseline.

Figure 8: Example of resource allocation at regular times in the simulation.

5.2 Queueing Delay and Training Efficiency

We assume that the total computational resources range from 70 to 190 nodes in intervals of 20 for the baseline data and examine the average queueing time per job. In Fig. 9, queueing delay shows a downward trend as more nodes become available in the resource pool for both the greedy and the optimal resource allocator. The optimal one always reduces queueing time given the same resources; the largest decrease is 32% at 150 nodes.

Figure 9: Comparison of queueing time.

Total time is queueing time plus training time; it indicates how long users wait for training results after submitting a job. Fig. 10 shows that total time declines as more available nodes train jobs for both resource allocators. The optimal one always accelerates training given the same resources. The improvement is shown in Fig. 11: when the greedy resource allocator has trained 100 jobs, the optimal one can train more, and the additional trained jobs are illustrated by the blue bars. The improvement is up to 17.4 jobs given 110 nodes.


Note that the efficiency improvement is bell-shaped. When computational resources are relatively limited, jobs tend to occupy as few nodes as possible, regardless of the resource allocator. An extreme example is when the number of training jobs equals the number of nodes, so the optimal decision is to let each job occupy only one node. On the contrary, when computational resources are abundant, the best decision is to allocate as many nodes as possible to all jobs. Thus, at both extremes the gap between the two resource allocators becomes narrower.

Figure 10: Comparison of total time.

Figure 11: Additional jobs trained by the optimal resource allocator.

5.3 Impact of Heterogeneous-ETA Jobs

Small-ETA jobs, with training times of less than 5 minutes, account for 64% of our simulation dataset. Whether to elastically train small-ETA jobs remains a question. On the one hand, these jobs may finish training quickly even if allocated few nodes. On the other hand, small-ETA jobs make up a significant portion of the workload and may slow down the overall training efficiency if their resources are insufficient.

For this simulation, we let the elastic training system scale all 1252 jobs regardless of their ETA. Job ETA therefore becomes more heterogeneous than in the baseline scenario. Fig. 12 shows that up to 24.1 additional jobs can be completed by the optimal resource allocator, compared to the baseline of 17.4 jobs in Fig. 11. Therefore, the optimal resource allocator handles heterogeneous-ETA jobs better than the greedy one.

Figure 12: Additional trained jobs by the optimal resource allocator when ETA becomes more heterogeneous.

5.4 Robustness

It is almost impossible to predict job ETA with 100% accuracy [18–20]. We add ±10% disturbance to ETA in the baseline scenario. For example, if a job's actual runtime is 10 node·hr, its ETA is a random sample from the range 9∼11 node·hr. Fig. 13 shows the additional jobs trained by the optimal resource allocator under this disturbance. Compared to the no-disturbance baseline in Fig. 11, the additional trained jobs are only 0.6∼2.4 fewer.
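The disturbance can be reproduced with a one-line uniform sampler; the helper name below is ours, not part of the simulator:

```python
import random

def disturbed_eta(actual_runtime, pct=0.10):
    """Sample an ETA uniformly within +/-pct of the actual runtime (node-hr)."""
    return actual_runtime * random.uniform(1 - pct, 1 + pct)

random.seed(0)
eta = disturbed_eta(10.0)   # a 10 node-hr job gets an ETA in [9, 11]
```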

Figure 13: Robustness under ±10% ETA disturbance.

It is also common for users to submit jobs that contain bugs. Buggy jobs usually hang within a few minutes of starting training, and their ETA and actual training duration can differ significantly because bugs are almost unpredictable. Users may also terminate running jobs at any time for many reasons, e.g., losing patience. We conduct a simulation based on the baseline data in which 75% of jobs have ±10% ETA disturbance, 15% contain bugs and hang randomly within 5 minutes, and 10% may be terminated by users at any time. Fig. 14 shows that up to 15.0 additional jobs are still completed by the optimal resource allocator under such harsh conditions.

Figure 14: Robustness under ETA disturbance, buggy jobs, and user-terminated jobs.

The optimal resource allocator is robust because it adopts a rolling-horizon approach and makes allocation decisions every few minutes. Even when estimation errors, buggy jobs, and user-terminated jobs enlarge the difference between ETA and actual runtime, the negative impact on system training efficiency is small.

5.5 Impact of Scaling Delay

The above simulations assume no delay when a job initiates or scales. The time to scale down is usually negligible, while initiation or scaling up takes 10∼20 seconds on average. For this simulation, we take a 15-second delay into consideration: when a job starts training or the resource allocator decides to scale it up, the job stays on its current nodes for 15 seconds before executing the new decision. Fig. 15 shows the additional jobs trained by the optimal resource allocator with the 15-second delay. The differences from the baseline scenario are merely −0.9∼1.8 additional jobs. The likely reason is that during the 15-second delay some jobs are still executing previous allocation decisions that are no longer optimal, so the additional trained jobs become slightly fewer.

Figure 15: Additional trained jobs by the optimal resource allocator with a 15-second scaling delay.

5.6 Speed of the Optimal Resource Allocator

Resource allocation for elastic training should be real-time or near real-time. We recorded the time spent on every decision made by the optimal resource allocator in the simulation. The baseline scenario is provided with 70 nodes, and the simulation runs on a Linux server with an Intel i7-8700K CPU and 64 GB RAM. Fig. 16 shows the histogram of solution times: the optimal resource allocator is very fast in decision-making, spending merely 0.4 seconds on average. The median time is 0.24 seconds, the 95th percentile is 1.49 seconds, and the maximum is no more than 2.48 seconds.

Figure 16: Histogram of solution times for the optimal resource allocator.

6 Conclusion

This paper proposes an optimal resource allocator of elastic training for deep learning jobs on the cloud. The allocator adopts a rolling-horizon approach and maximizes the training progress of all jobs over a planning time horizon. The original model formulation contains non-linear constraints and many integer variables, which are hard to optimize. We simplify and linearize the formulation into a MILP that is smaller, easier to solve, and faster. We also introduce an innovative method to ensure that the allocated numbers of nodes meet the powers-of-two requirement.

We design a simulation framework to conduct experiments on elastic training systems with the optimal resource allocator and a greedy one as benchmark. In the baseline scenario, the optimal resource allocator reduces queueing time by up to 32% and accelerates training efficiency by up to 17.4%, which greatly improves user experience. The optimal resource allocator also handles heterogeneous-ETA jobs better than the greedy one, training up to 24.1 additional jobs. Robustness simulations show that the optimal allocator is very robust to ETA disturbance, buggy jobs, and user-terminated jobs. The impact of a 15-second scaling delay is also examined, and the additional trained jobs differ only marginally from the no-delay baseline. Finally, searching for optimal solutions is very fast, taking only 0.4 seconds on average.

References

[1] Amazon Web Services. Getting Started with AWS. Retrieved from https://aws.amazon.com/getting-started/.

[2] Microsoft Azure. Get to know Azure. Retrieved from https://azure.microsoft.com/en-ca/overview/.

[3] Huawei Cloud. ModelArts. Retrieved from https://www.huaweicloud.com/intl/en-us/product/modelarts.html.

[4] Alibaba Cloud. Overview of Auto Scaling. Retrieved from https://bit.ly/3aBlIWi.

[5] Lin, H., Zhang, H., Ma, Y., He, T., Zhang, Z., Zha, S. and Li, M., 2019. Dynamic mini-batch SGD for elastic distributed training: learning in the limbo of resources. arXiv preprint arXiv:1904.12043.

[6] Kubernetes. Production-Grade Container Orchestration. Retrieved from https://kubernetes.io/.

[7] Amazon Web Services. Amazon Elastic Container Service. Retrieved from https://aws.amazon.com/ecs/.

[8] Red Hat. Red Hat OpenShift Container Platform. Retrieved from https://www.openshift.com/products/container-platform.

[9] Shen, Z., Subbiah, S., Gu, X. and Wilkes, J., 2011, October. CloudScale: elastic resource scaling for multi-tenant cloud systems. In Proceedings of the 2nd ACM Symposium on Cloud Computing (pp. 1-14).

[10] Gregory, A. and Majumdar, S., 2016, March. A constraint programming based energy aware resource management middleware for clouds processing MapReduce jobs with deadlines. In Companion Publication for ACM/SPEC on International Conference on Performance Engineering (pp. 15-20).

[11] Liu, L. and Xu, H., 2018, October. Elasecutor: Elastic executor scheduling in data analytics systems. In Proceedings of the ACM Symposium on Cloud Computing (pp. 107-120).

[12] Javadi, S.A., Suresh, A., Wajahat, M. and Gandhi, A., 2019, November. Scavenger: A black-box batch workload resource manager for improving utilization in cloud environments. In Proceedings of the ACM Symposium on Cloud Computing (pp. 272-285).

[13] Chen, Y., Peng, Y., Bao, Y., Wu, C., Zhu, Y. and Guo, C., 2020, October. Elastic parameter server load distribution in deep learning clusters. In Proceedings of the 11th ACM Symposium on Cloud Computing (pp. 507-521).

[14] Chen, C., Weng, Q., Wang, W., Li, B. and Li, B., 2020, October. Semi-dynamic load balancing: efficient distributed learning in non-dedicated environments. In Proceedings of the 11th ACM Symposium on Cloud Computing (pp. 431-446).

[15] Saxena, V., Jayaram, K.R., Basu, S., Sabharwal, Y. and Verma, A., 2020, November. Effective elastic scaling of deep learning workloads. In 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) (pp. 1-8). IEEE.

[16] Zhu, J., Huang, Z., Wu, R., Bai, X., Yang, B., Li, Y., Zheng, H. and Dai, Z., 2020. A Design and Implementation Method for Elastic Distributed Training Systems. Chinese patent, 87068967CN02.

[17] Forrest, J. and Lougee-Heimer, R. CBC User Guide. Retrieved from https://www.coin-or.org/Cbc/cbcuserguide.html.

[18] Pham, T.P., Durillo, J.J. and Fahringer, T., 2017. Predicting workflow task execution time in the cloud using a two-stage machine learning approach. IEEE Transactions on Cloud Computing, 8(1), pp.256-268.

[19] Sidhanta, S., Golab, W. and Mukhopadhyay, S., 2016, May. OptEx: A deadline-aware cost optimization model for Spark. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (pp. 193-202). IEEE.

[20] Mustafa, S., Elghandour, I. and Ismail, M.A., 2018. A machine learning approach for predicting execution time of Spark jobs. Alexandria Engineering Journal, 57(4), pp.3767-3778.
