
TAPS: Software Defined Task-level Deadline-aware Preemptive Flow scheduling in Data Centers

Lili Liu∗†, Dan Li∗†, Jianping Wu∗†

∗Tsinghua National Laboratory for Information Science and Technology, †Department of Computer Science and Technology, Tsinghua University

Abstract—Many data center applications have deadline requirements, which pose a requirement of deadline-awareness in network transport. Completing within its deadline is a necessary condition for a flow to be useful. Transport protocols in current data centers try to share the network resources fairly and are deadline-agnostic. Recently several works try to address the problem by making as many flows meet deadlines as possible. However, for many data center applications, a task cannot be completed until its last flow finishes, which means the bandwidth consumed by completed flows is wasted if some flows in the task cannot meet their deadlines.

In this paper we design a task-level deadline-aware preemptive flow scheduling scheme (TAPS), which aims to make more tasks meet deadlines. We leverage software defined networking (SDN) technology and generalize SDN from flow-level awareness to task-level awareness. The scheduling algorithm runs on the SDN controller, which decides whether a flow should be accepted or discarded, pre-allocates the transmission time slices and computes the routing paths for accepted flows. Extensive flow-level simulations demonstrate that TAPS outperforms Varys, Baraat, PDQ (Preemptive Distributed Quick flow scheduling), D3 (Deadline-Driven Delivery control protocol) and Fair Sharing transport protocols in deadline-sensitive data center environments. A simple implementation on real systems also shows that TAPS makes highly effective utilization of the network bandwidth in data centers.

I. INTRODUCTION

Data centers, capable of running various cloud applications, are widely deployed around the world. Many cloud applications are interactive in nature and accordingly have soft-real-time requirements [1], [2]. For the sake of both user experience and provider revenue, sometimes seconds or even microseconds of latency matter significantly [3]. A cloud application (task) is usually executed in a distributed manner by a number of servers, and the data flows among the servers are bandwidth-hungry since they shuffle a large amount of data. As a result, the data center network has to provide very low latency to transmit the flows among servers and meet their deadlines. Unfortunately, traditional competition-based transport protocols in data centers, such as TCP, RCP [4], ICTCP [5] and DCTCP [2], adopt a philosophy of Fair Sharing and let flows compete for the available bandwidth of links. They consider neither meeting the deadline demands of the flows, nor

This work is supported by the National Key Basic Research Program of China (973 program) under Grant 2014CB347800, the National Natural Science Foundation of China under Grant No. 61170291 and No. 61432002, the National High-tech R&D Program of China (863 program) under Grant 2013AA013303, and the Tsinghua University Initiative Scientific Research Program.

minimizing the completion time of the flows. These deadline-agnostic transport protocols cannot make more flows complete within deadlines, and they waste link bandwidth. A recent study shows that the deadlines of 7.25% of flows were missed in three production data centers because of the ineffective utilization of link bandwidth [3].

To overcome the problem above, deadline-aware

transport protocols for data center networks have recently been proposed, such as D3 [3], PDQ [11] and D2TCP [6]. The basic idea is to introduce a bandwidth competition or allocation algorithm that makes more flows complete within deadlines, so that link bandwidth is utilized more effectively. A common goal of these deadline-aware transport protocols is to finish as many flows before their deadlines as possible. However, we argue that for cloud tasks such as financial services, online payment, or scientific computation, the computation results are useful if and only if all the servers finish their computations before the deadlines. Consequently, for all the flows of a single task, what really matters is that the last flow completes before the deadline. Otherwise, the task fails and the bandwidth consumed by all the completed flows is also wasted. Recently, some task-aware flow scheduling schemes have been proposed, e.g. Baraat [7] and Varys [8]. However, Baraat is deadline-agnostic and aims to reduce the overall task completion time, which results in low throughput in deadline-constrained cases. Varys is very sensitive to the task arrival order, which may make later-arrived but more urgent tasks miss their deadlines.

To address these problems, in this paper we propose a

task-level deadline-aware preemptive flow scheduling algorithm for data centers that tries to finish as many tasks, instead of flows, as possible before their deadlines. We call the protocol TAPS. The design of TAPS leverages the emerging software defined networking (SDN) technique, and further generalizes SDN from flow-level awareness to task-level awareness. The core of TAPS is a task-aware flow scheduling algorithm running on the SDN controller. When the scheduling request of a flow arrives, the SDN controller decides whether the flow should be accepted or discarded according to a reject rule. If the flow is accepted, the SDN controller pre-allocates its transmission time slices and computes the routing path for it. Although the allocation problem is proved to be NP-hard, we provide a heuristic solution which works well for arbitrary data center network topologies. Compared to PDQ, Baraat and Varys, TAPS has a near-optimal routing scheme and a better-defined



centralized routing algorithm, and thus can make the most of the bandwidth and let more tasks be completed before their deadlines. Apart from the controller modification, an additional module is added to the servers to maintain the states of local flows. The switches do not need any modification, which is consistent with the technology trend of employing low-end commodity switches in modern data centers.

We conduct extensive flow-level simulations and report the performance of TAPS, in both single-rooted and multi-rooted tree network topologies. For comparison, we also implement several existing flow-level and task-level protocols and report their performance on the same topologies. The results demonstrate that TAPS outperforms Baraat, Varys, PDQ, D3 and Fair Sharing protocols in deadline-sensitive data center network environments, in terms of the number and total size of tasks completed before deadlines. We also conduct testbed experiments based on an implementation of TAPS, whose results show that TAPS makes higher effective utilization of network bandwidth and fulfills many more tasks than Fair Sharing transport protocols.

II. BACKGROUND AND RELATED WORK

In this section, we briefly review some important properties of current data center applications, including the task-level and latency-aware properties, and the multi-path routing they have to consider, which inspire the design of TAPS.

Task-level. Current data center applications and distributed computing systems like MapReduce and Dryad employ a partition/aggregation pattern. They aim to achieve horizontal scalability by partitioning a task into many flows. The unit of these applications is the task, and each task contains a number of flows. Statistics indicate that for web search workloads each task contains at least 88 flows [2], for MapReduce workloads each task contains 30 to more than 50000 flows [9], and for Cosmos workloads most tasks contain 30-70 flows [10]. These statistics reveal that in data centers many applications generate multiple flows for a single task, and the task is the unit of processing.

The main task-level related works today are Baraat [7] and Varys [8].

Baraat: Baraat is a task-aware scheduling scheme. The priority of tasks obeys SJF and all the flows in a task have the same priority. The flow scheduling of Baraat is similar to that of PDQ [11] except for the flow priority. The main goal of Baraat is to minimize the average task completion time. However, ignoring deadline information makes Baraat perform poorly in deadline-sensitive cases.

Varys: Varys is a task-aware and deadline-aware scheduling scheme. The earlier-arrived task is scheduled first. The rate-allocation scheme is much like that of D3. Once a task is scheduled, it will not be rejected. If a more urgent task arrives later than a less urgent task, and the allocation of the less urgent task leaves insufficient bandwidth for the more urgent one, then the more urgent task is discarded by Varys. This makes Varys heavily dependent on the task arrival sequence.

Deadline-aware. Many data center applications are interactive and have a very strong requirement on latency. In this kind of application the response to a request must be produced very quickly; even a 100ms latency overhead would cause a big loss of the providers' revenue [3]. A previous study shows that applications need to complete the requests and respond to the users within an SLA of 200-300ms [3]. This tells us that in many data center applications flows usually have specific deadlines to meet, thus they should be completed fast and early.

The main deadline-aware related works are D3 [3], PDQ [11], DCTCP [2] and D2TCP [6].

D3: D3 performs a centralized bandwidth allocation in order to fulfill flows before deadlines. However, the FCFS scheduler used in D3 results in some performance issues, such as different flow-scheduling results depending on the order of flow arrival. Unfortunately, this allows large flows that arrived earlier to occupy the bottleneck bandwidth, while blocking small flows that arrive later. Furthermore, the task-agnostic feature of D3 makes the data center network lose more tasks within deadlines.

PDQ: PDQ is a deadline-aware protocol which employs explicit rate control like TAPS. Unlike D3 [3], PDQ allocates bandwidth to the most critical flows and allows flow preemption, reducing the mean FCT by 30% compared with D3 and fulfilling more flows within deadlines. However, distributed scheduling without the global knowledge of all passing tasks keeps PDQ far from optimal scheduling.

DeTail and D2TCP: DeTail and D2TCP are also deadline-aware protocols. DeTail aims to cut the FCT tail in data center networks. D2TCP improves DCTCP [2] to a deadline-aware version in order to accomplish more flows before their deadlines. However, the limitation of flow-level scheduling means they cannot minimize the number of deadline-missing tasks.

Multi-path routing. Generally a tree architecture is applied in traditional data center topologies. However, previous research [12] shows that the traditional tree topology cannot satisfy the requirements of current data centers. In order to achieve high network capacity, numerous data center architectures have been proposed to address this problem. These richly-connected architectures, such as Fat-Tree [12], BCube [13] and FiConn [14], employ multi-rooted tree topologies to improve network capacity. In current data center networks which use multi-rooted tree topologies, generalizing the routing protocol to multi-path routing is a fundamental but very important problem.

However, transport protocols in data centers employ TCP and emulate fair sharing, which tries to share the network resources fairly. The main related works are TCP, RCP [4], DCTCP [2], and HULL [15]. Previous studies [3] showed that TCP and RCP with priority queueing lose quite a number of flows due to missed deadlines and fall behind D3. DCTCP and HULL mainly target reducing the queue length with novel rate control and congestion detection mechanisms. Previous research showed that though DCTCP can alleviate the latency problem to a certain extent, it cannot achieve a high deadline-sensitive flow completion ratio in data center networks.

III. MOTIVATION AND DESIGN GOALS

In this section, we start by presenting some motivating examples to show the importance of task-level, preemptive and global scheduling in the design of current data center scheduling algorithms. Then, based on these properties, we present the design goals of TAPS.


A. Motivation Example

Consider the scenario shown in Fig. 1, where two concurrent tasks arrive simultaneously.

Fig. 1(a) flow size and deadline: task t1 contains flow f11 (size 2, deadline 4) and flow f12 (size 4, deadline 4); task t2 contains flow f21 (size 1, deadline 4) and flow f22 (size 3, deadline 4). Fig. 1(b)-(e) plot the allocation of the bottleneck link bandwidth over time for each scheme.

Fig. 1. Task-level scheduling vs flow-level scheduling. (a) shows the size and deadline of the 4 flows (in 2 tasks). (b)-(e) show scheduling results of Fair Sharing, D3, PDQ and Task-aware Scheduling, respectively. The X-axis is time and the Y-axis is the allocation of the bottleneck link bandwidth.

Task-level Scheduling. Fig. 1 presents an example to show that task-level scheduling has an advantage over flow-level scheduling. There are 2 tasks competing for one bottleneck link, and each task consists of 2 flows. Fig. 1(a) shows the size (expected transmission time) and deadline of each flow in each task. These four concurrent flows arrive simultaneously in the order f11, f12, f21, f22. Fig. 1(b)-(e) show the scheduling results of Fair Sharing, D3, PDQ [11], and task-aware scheduling, respectively.

With the Fair Sharing scheduling scheme, flows share the bottleneck link capacity equally. In the end 1 flow and 0 tasks are completed, as Fig. 1(b) shows.

With the D3 scheduling scheme, each flow requests a rate of r = s/d, where s is the flow size and d the deadline. In the 1st to 4th time units, f11 requests a rate of r = 2/4 and f12 requests a rate of r = 4/4. But f11 arrives earlier, so f11 gets a rate of r = 1/2, f12 gets a rate of r = 1/2 and the others get a rate of r = 0. In the 5th and 6th time units, f12 gets a rate of r = 1 and the others get r = 0. Afterwards, in the 6th time unit, f21 gets a rate of r = 1, and in the 7th to 10th time units, f22 gets a rate of r = 1. At last, only 1 flow and no task is completed, as Fig. 1(c) illustrates.

With the PDQ scheduling scheme, the descending priority order of these flows is f21, f11, f22, f12. For brevity in this example, Early Termination [11] is not employed. Each flow is transmitted at the rate of the link capacity. As a result, 2 flows are completed before their deadlines but neither task is completed within its deadline, as Fig. 1(d) shows.

Fig. 1(e) illustrates the result of a simple Task-aware Scheduling scheme. The priority of the tasks is ordered by EDF [16], and the priority of flows inside each task obeys EDF as well. The flow with the highest priority is transmitted first at the maximum rate, as in [11]. In this way, 2 flows and 1 task can be completed.

This example reveals that a task-aware flow scheduling scheme can complete one task, while both flow-granularity scheduling and deadline-agnostic task-granularity scheduling fail to complete any task. Therefore, compared to task-agnostic or deadline-agnostic scheduling methods, more tasks can be completed using task-aware and deadline-aware flow scheduling. Taking tasks and their deadlines into consideration does give better performance in task completion ratio.
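The arithmetic above can be replayed with a few lines of code. The following is a minimal sketch (our own illustration, not from the paper) that serially transmits the Fig. 1 flows over a unit-capacity bottleneck link in a given priority order and counts how many flows and tasks meet their deadlines; completion exactly at the deadline counts as on time, matching the tallies above.

# Flows from Fig. 1(a): (task_id, size, deadline); link capacity = 1, so
# transmission time equals size when a flow gets the link exclusively.
flows = {"f11": (1, 2, 4), "f12": (1, 4, 4), "f21": (2, 1, 4), "f22": (2, 3, 4)}

def serial_schedule(order):
    """Transmit flows back-to-back in the given order and report deadline hits."""
    clock, met = 0, {}
    for name in order:
        task, size, deadline = flows[name]
        clock += size                      # exclusive use of the bottleneck link
        met[name] = clock <= deadline      # finishing exactly at the deadline counts
    done_flows = sum(met.values())
    done_tasks = sum(all(met[f] for f, (t, _, _) in flows.items() if t == task_id)
                     for task_id in {t for t, _, _ in flows.values()})
    return done_flows, done_tasks

# PDQ-like flow-level order (EDF then SJF over individual flows): 2 flows, 0 tasks.
print(serial_schedule(["f21", "f11", "f22", "f12"]))
# Task-aware order as in Fig. 1(e), task t2 first: 2 flows, 1 task.
print(serial_schedule(["f21", "f22", "f11", "f12"]))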

Fig. 2(a) flow size and deadline: task t1 contains flow f11 (size 1, deadline 4) and flow f12 (size 1, deadline 4); task t2 contains flow f21 (size 1, deadline 2) and flow f22 (size 1, deadline 2). Fig. 2(b)-(d) plot the allocation of the bottleneck link bandwidth over time for each scheme.

Fig. 2. Existing task-level scheduling vs TAPS. (a) shows the size and deadline of the 4 flows (in 2 tasks). (b)-(d) show scheduling results of Baraat, Varys and our proposed TAPS, respectively. The X-axis is time and the Y-axis is the allocation of the bottleneck link bandwidth.

Preemptive Scheduling. Existing task-aware scheduling schemes such as Baraat [7] and Varys [8] still have room for improvement. They perform badly in some cases due to their lack of deadline or preemption consideration. Specifically, Baraat is task-aware but deadline-agnostic. Varys is task-aware and deadline-aware, but it obeys FIFO and does not support preemption, which leads some late-arrived but urgent tasks to miss their deadlines. These disadvantages motivate us to propose a task-level, deadline-aware and preemptive scheduling scheme (TAPS for short).

Fig. 2 presents an example to show our preemption motivation. There are 2 tasks competing for one bottleneck link, and each task consists of 2 flows. These four concurrent flows arrive simultaneously in the order f11, f12, f21, f22. Fig. 2(a) shows the size and deadline of the flows in each task. Fig. 2(b)-(d) show the scheduling results of Baraat, Varys, and our proposed TAPS, respectively.

With the Baraat [7] scheduling scheme, the earlier-arrived task has higher priority, so task t1 starts first. Since the priority of flows inside a task obeys SJF, flow f11 starts first, which results in the failure of flow f21. Though Baraat schedules flows at task granularity, it fails to complete all the tasks, as Fig. 2(b) shows.

With the Varys [8] scheduling scheme, the earliest-arrived task is scheduled first. Like PDQ and D3, Varys is another rate control protocol, and in a deadline-sensitive environment the rate of a flow is assigned as r = s/d. The flows of t1 are scheduled first. But after t1 is scheduled, there is not enough bandwidth left for t2 to transmit, so t2 is rejected in the first place. Varys completes 1 task in the end, as Fig. 2(c) shows.

The reason why Varys rejects t2 is that it arrives later than t1 and Varys does not support task preemption. To address this problem, the proposed scheduling scheme TAPS makes a global task-level optimization after each task arrives, and thus supports task preemption. Specifically, it has the same rate control protocol as PDQ, and it gives an accept/reject decision for a task after it arrives, based on an overall task-level scheduling optimization. When a task is accepted, all its flows are scheduled according to EDF and SJF, and the flow with the highest priority is transmitted at the rate of the link capacity. TAPS completes 2 tasks in the example, as shown in Fig. 2(d).

This example reveals that preemptive task-level scheduling can accomplish more tasks in deadline-sensitive situations.

Fig. 3(a) flows on transmission: f1 (size 1, deadline 1, source 1, destination 2); f2 (size 1, deadline 2, source 1, destination 4); f3 (size 1, deadline 2, source 3, destination 2); f4 (size 2, deadline 3, source 3, destination 4).

Fig. 3(b) optimal scheduling results: f1 transmits in interval (0,1); f2 in (1,2); f3 in (1,2); f4 in (0,1) and (2,3).

Fig. 3(c) shows the topology; the link capacity is 1Gbps.

Fig. 3. Global scheduling motivation example. Global scheduling can complete more flows before their deadlines, compared to PDQ.

Global Scheduling. Fig. 3 presents an example to show the effectiveness of global scheduling. There are 4 flows which go through different paths in the network. Fig. 3(a) shows the size, deadline, source ID and destination ID of the 4 flows. Fig. 3(b) shows the optimal scheduling results and Fig. 3(c) shows the topology. If we schedule the 4 flows with PDQ [11], the entire scheduling process is as follows. In the 1st time unit, no switch pauses f1, so f1 gets rate 1. f2 is paused at S1, because f1 is more critical than f2, and f3 is paused by S5. Assume that the flow list in S3 is full, so f4 is paused by S3. In the 2nd time unit, f1 is completed, so f2 and f3 get rate 1, while f4 is still paused by S3. In the 3rd time unit, f2 and f3 are completed, but f4 cannot be completed before its deadline, because it has 2 size units to transmit and only 1 time unit left. In the end f4 misses the deadline.

As we can see in the example, the links S3-S5 and S5-S4 are idle in the 1st time unit, but f4 cannot utilize them. If we schedule globally, we can make full use of these 2 links and let f4 transmit through them. The optimal scheduling results of each flow in this example are shown in Fig. 3(b), which successfully completes all 4 flows before their deadlines. Thus global scheduling completes all the flows, while PDQ can only complete 3 flows. This example reveals that in common situations, global scheduling has an advantage in completing more flows than PDQ.

B. Design Goals

In current data center networks a task always consists of many flows. As mentioned above, simply maximizing the number of completed flows is not enough, while making more tasks complete before their deadlines is more meaningful. We give the following design goals for TAPS:

Maximizing the number of tasks completed before deadlines: TAPS is designed for deadline-sensitive data center environments, which expect all flows in a task to be completed before their deadlines. Note that a task is useful if and only if all flows in the task are completed before their deadlines. It is therefore more crucial for data centers to fulfill more tasks. This goal requires the preemptive and global scheduling of TAPS. Under this goal, unnecessary bandwidth waste should also be greatly decreased.

Online response to tasks in dynamic data center networks: The traffic in data center networks is dynamic and changes frequently [17]. When a burst of tasks arrives, data center networks should selectively accept them according to the network's tolerance capacity. TAPS is designed to respond to tasks in data center networks online and dynamically.

Applicability to general data center network topologies: Nowadays most data center network topologies are multi-rooted trees. However, current latency-aware and task-aware transport protocols, such as D3, can only be applied to single-rooted tree topologies. We extend single-path routing to multi-path routing in TAPS so that TAPS can be applied to general data center topologies.

IV. TAPS DESIGN

In this section, we first give an overview of the overall architecture of TAPS. Then we discuss its core part, the centralized algorithm, in detail. After that we introduce the design of the controller, server and switch, respectively.

A. Architecture Overview

The basic idea of TAPS is to maximize the number of tasks completed before their deadlines. The priority of flows is decided by the deadline and flow size, using the scheduling disciplines EDF [16] and SJF [18]. Flows with higher priority are transmitted first. In order to minimize the mean flow completion time (FCT), there is at most one flow on transmission on each link at any time [11]. In other words, once a flow starts to send, it occupies the link capacity exclusively.

TAPS leverages the Software Defined Networking (SDN) [19] framework to enforce the flow scheduling mechanisms. Fig. 4 depicts the procedure of TAPS and the messages exchanged among the controller, servers, and switches. TAPS senders maintain a set of task-related variables, including flow deadline, expected transmission time and sending rate. When a new task arrives, the sender encapsulates the task-related information into a scheduling header added to a probe packet, and sends the packet directly to the controller for scheduling. When the SDN controller receives the probe packet, it first decides whether the task can be processed by the network according to a reject rule. If the task is accepted by the


Fig. 4 shows the SDN controller, core switches, aggregate switches and servers, together with the task information Task_i = {f_0^i, ..., f_{m_i-1}^i}, where f_j^i = {Src_j^i, Dst_j^i, s_j^i, d_j^i}, and the following protocol steps:

1. A new task arrives.
2. Servers send packets containing the task information to the controller.
3. The SDN controller computes whether to accept this task.
4. If the task is accepted, the controller (4A) installs forwarding entries on the corresponding switches and (4B) sends packets including the pre-allocated time slices to the senders.
5. Otherwise, the controller informs the senders to discard this task.

Fig. 4. TAPS protocol architecture

controller, the controller calculates the time slices specifying when to transfer the flows in the task, as well as the routing path for each flow. Afterwards the controller sends packets with the time slices included to the corresponding senders, and installs the routing entries on the corresponding switches along the routing path of each flow. The senders then monitor the time and decide when to send each flow at the assigned sending rate. Intermediate switches forward the packets according to the installed default routes. We present the TAPS details in the following sections.
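To make the exchange in Fig. 4 concrete, the following is a minimal sketch (our own illustration, not code from the paper) of the per-flow information a sender reports in the probe packet and of the controller's reply; the fields mirror the notation Src_j^i, Dst_j^i, s_j^i, d_j^i, A_j^i and L_j^i used in the paper, while the class and field names themselves are hypothetical.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FlowRequest:
    """Per-flow information carried in the probe packet sent to the controller."""
    task_id: int
    flow_id: int
    src: int         # Src_j^i: source server ID
    dst: int         # Dst_j^i: destination server ID
    size: float      # s_j^i: flow size (expected transmission time at link rate)
    deadline: float  # d_j^i: absolute deadline shared by all flows of the task

@dataclass
class ScheduleReply:
    """Controller's answer: either a rejection or pre-allocated slices and a path."""
    accepted: bool
    time_slices: List[Tuple[float, float]] = field(default_factory=list)  # A_j^i
    path: List[int] = field(default_factory=list)                         # L_j^i (link IDs)

# Example usage: a sender describing one flow of a newly arrived task.
probe = FlowRequest(task_id=3, flow_id=0, src=12, dst=47, size=2.0, deadline=40.0)
reply = ScheduleReply(accepted=True, time_slices=[(0.0, 2.0)], path=[5, 17, 9])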

B. Centralized Algorithm

TABLE I. LIST OF NOTATIONS USED IN THE PAPER

Ftrans: the set of all flows on transmission
Ftmp: a temporary set of flows
fnew: the newly coming flow
Src_j^i: source server ID of flow j in task i
Dst_j^i: destination server ID of flow j in task i
s_j^i: size of flow j in task i
d_j^i: deadline of flow j in task i
E_j^i: expected transmission time of flow j in task i
A_j^i: allocated time slices for flow j in task i
L_j^i: link set which flow j in task i goes through
O_x: the occupied time set for link x

The centralized algorithm is the core part of the TAPS architecture, and it runs on the SDN controller. The task is the unit of the accept-or-reject decision: we do not discard flows in tasks which have been accepted and are transmitting, and we do not waste bandwidth on any flow of a task we have decided to discard. Specifically, newly arrived tasks are accepted or discarded according to a rejecting policy, which decides whether a task needs to be discarded to save bandwidth for other tasks. The goal of the centralized algorithm is to check whether a newly arrived task can be handled by the network. In general, the centralized algorithm calculates the potential allocation of a newly arrived task and makes decisions according to the rejecting policy.

Problem formulation. We model the task scheduling problem in the following form. The full set of tasks to be scheduled is T = {t_0, ..., t_{n-1}}, and task t_i contains m_i flows {f_0^i, ..., f_{m_i-1}^i}. We suppose that the bandwidth of each link in the network is uniform, so each flow can always be transferred at its maximum rate. In this scenario, we do not need to care about the actual size of a flow, but only the transfer time required. Flow f_j^i has the tuple <d_j^i, E_j^i, A_j^i, L_j^i>, where d_j^i, E_j^i and A_j^i are the deadline, expected transfer time and allocated time slices of flow f_j^i, respectively, and L_j^i is the set of links that f_j^i will be transferred through. Note that flows in the same task have the same deadline, i.e. d_j^i = d^i for any j. The task scheduling problem is then to find a set of tasks Ttrans that contains as many tasks as the network can process. For each flow f_j^i of each task t_i in Ttrans, a scheduled transfer time slice set will be given. The objective is to finish as many tasks as possible, while ensuring that only flows within Ttrans are transferred and no bandwidth is wasted on tasks that would only be partially finished.

NP-hardness Proof. We have proved that this problem is NP-hard. To prove the NP-hardness, we reduce a well-known NP-hard problem, the Hamiltonian Circuit problem, to a special case of this problem. Suppose we have a graph G = <V, E>, in which V = {v_i, i = 0, ..., n-1} is the set of all vertices and E = {e_j, j = 0, ..., m-1} is the set of all edges. To find a Hamiltonian circuit in G is actually to find a set of n edges E' ⊆ E so that each vertex in V appears as the endpoint of some edge in E' exactly twice. The Hamiltonian circuit problem above can be reduced to a task-based flow scheduling problem on a single link. Specifically, on a particular link there are m tasks to be scheduled, and each task contains four flows, each of which has size 1/2 and starts at time zero. Each task corresponds to an edge in E from the Hamiltonian circuit problem. For an edge whose endpoints are v_{i1} and v_{i2}, the four flows of the corresponding task have deadlines i1+1, 2n-i1, i2+1 and 2n-i2. Therefore, there is a scheduling of the original problem in which n tasks can be completed if and only if a Hamiltonian circuit can be found in the corresponding graph.
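The reduction is mechanical, so a small sketch may help make it concrete. The code below (our own illustration) builds the single-link scheduling instance described above from a graph: one task of four half-unit flows per edge, with deadlines derived from the edge's endpoints.

def reduction_instance(num_vertices, edges):
    """Map a Hamiltonian-circuit instance to the single-link task scheduling instance.

    edges: list of (i1, i2) vertex-index pairs. Returns one task per edge, where a
    task is a list of (size, deadline) flows, all released at time zero and sharing
    one unit-capacity link. A schedule completing num_vertices tasks exists iff the
    graph has a Hamiltonian circuit (per the paper's argument).
    """
    n = num_vertices
    tasks = []
    for (i1, i2) in edges:
        flows = [(0.5, i1 + 1), (0.5, 2 * n - i1),   # deadlines tied to endpoint i1
                 (0.5, i2 + 1), (0.5, 2 * n - i2)]   # deadlines tied to endpoint i2
        tasks.append(flows)
    return tasks

# Example: a triangle on 3 vertices; every edge yields one 4-flow task.
print(reduction_instance(3, [(0, 1), (1, 2), (0, 2)]))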

Algorithm Detail. Next, we take a closer look at the centralized algorithm. Alg. 1 describes the whole process of TAPS. When a new flow fnew arrives, the algorithm adds fnew to a temporary flow set Ftmp and waits for the other flows of the same task for a time interval T. After adding all the flows on transmission Ftrans into Ftmp, we try to allocate all the flows in Ftmp onto the network, calculating the time slices and route for each of them (Alg. 2). Here we denote the task which fnew belongs to as tid. The new task tid is then accepted or discarded according to the reject rule. The reject rule means that if one of the following situations happens, all the flows belonging to tid are discarded and tid is added to the discarded task set Tdiscard: 1) if we accept fnew, flows of more than 1 task would miss their deadlines; 2) some flows inside tid have already missed their deadlines; 3) all the deadline-missing flows belong to 1 task, but that task is not tid and its completion ratio is not less than that of tid. On the other hand, if the deadline-missing task is


not tid but its completion ratio is less than that of tid, then we discard that deadline-missing task instead and add it to Tdiscard.

Alg. 2 is the whole process of allocating a routing path for each flow in F. Specifically, for each flow f_j^i, we aim to calculate its transmission time slices A_j^i and the links L_j^i it goes through. The calculation for f_j^i consists of three steps. First, we calculate an alternative path set P which contains all the paths f_j^i may go through. Second, we try to allocate time slices for every path p in P (Alg. 3). Finally, we find the optimal path in P, i.e. the one on which f_j^i can be completed the earliest. L_j^i is set to the links of the optimal path, and A_j^i to the time slice set of the optimal path.

Alg. 3 allocates time slices for a specific flow f_j^i when it goes through path p. For each link l_x, the time periods when it is occupied are recorded in O_x. We first compute the union Tocp of the occupied time sets O_x of all the links in p. The complement of Tocp is the set of times when all the links in p are idle. We allocate the transfer time slices timeSlice(p, f_i) as the first E_i (expected transfer time) idle time slices, and the flow completion time time(p, f_i) is then obtained.

C. TAPS Controller

The SDN controller performs the centralized algorithm when it receives a new flow, and determines whether to accept or discard it. The main functions of the controller are as follows.

Compute the route for each flow in Ftrans. In Alg. 2, the controller calculates the optimal path L_j^i for f_j^i. Then the controller informs the corresponding switches to install route entries. Since a fundamental constraint of SDN is that the flow table size of an SDN switch is very limited (usually less than 2000 entries), only the first 1k entries are installed on a particular switch, considering that there is at most one flow on a link at any time. When the controller receives an ACK indicating that the flow has been completed or has missed its deadline, it informs the corresponding switches to withdraw the route entries.

Pre-allocate time slices for each flow in Ftrans. In Alg. 3, the controller calculates the time slices A_j^i for f_j^i. After the calculation, the controller sends packets including the time slices for each flow to the senders.

Algorithm 1 Task-aware preemptive flow scheduling

1: if fnew arrives then
2:   tid ← fnew.id;
3:   if tid ∈ Tdiscard then
4:     reject fnew;
5:   end if
6:   Ftmp ← {fnew};
7:   Wait time T, and add all coming flows to Ftmp;
8:   Ftmp ← Ftmp ∪ Ftrans;
9:   sort Ftmp according to EDF and SJF;
10:  PathCalculation(Ftmp);
11:  accept or reject fnew according to the reject rule;
12: end if
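Line 11 of Alg. 1 relies on the reject rule described above. The sketch below is our own reading of that rule (not code from the paper): given the set of flows that would miss their deadlines under the tentative allocation, it returns which task should be dropped. The completion_ratio argument is a hypothetical helper representing, in our interpretation, the fraction of a task's flows that would still finish on time.

def apply_reject_rule(tid, missing_flows, completion_ratio):
    """Decide which task to discard after tentatively scheduling new task `tid`.

    missing_flows: list of (task_id, flow_id) pairs that would miss their deadlines
    under the tentative allocation.
    completion_ratio: dict task_id -> fraction of that task's flows that would still
    finish on time (our reading of the paper's "completion ratio").
    Returns the task_id to discard, or None if nothing must be dropped.
    """
    if not missing_flows:
        return None                            # everything fits: accept tid
    missing_tasks = {task for task, _ in missing_flows}
    if len(missing_tasks) > 1:                 # rule 1: several tasks would suffer
        return tid
    (victim,) = missing_tasks
    if victim == tid:                          # rule 2: tid itself cannot make it
        return tid
    # rule 3: exactly one other task misses deadlines; keep the task that is better off.
    if completion_ratio[victim] >= completion_ratio[tid]:
        return tid
    return victim                              # otherwise preempt the other task

# Example: only task 7 would miss deadlines, and it would finish a smaller fraction
# of its flows (0.4) than the new task 9 (1.0), so task 7 is the one discarded.
print(apply_reject_rule(9, [(7, 2)], {7: 0.4, 9: 1.0}))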

D. TAPS Server

In TAPS, we add some additional modules to the servers to complete the following functions:

Algorithm 2 PathCalculation(F)
1: for each flow f_i ∈ F do
2:   P ← ∅;
3:   add all the possible paths of f_i to P;
4:   for each path p ∈ P do
5:     TimeAllocation(p, f_i);
6:   end for
7:   T ← inf;
8:   for each path p ∈ P do
9:     if T > time(p, f_i) then
10:      T ← time(p, f_i);
11:      L_i ← p;
12:      A_i ← timeSlice(p, f_i);
13:    end if
14:  end for
15:  Update occupied set O_x for each link in L_i based on A_i;
16: end for

Algorithm 3 TimeAllocation(p, f_i)
1: Tocp ← ∅;
2: for each link l_x ∈ p do
3:   push O_x into Tocp;
4: end for
5: timeSlice(p, f_i) ← first E_i time slices in the complementary set of Tocp;
6: compute time(p, f_i);
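Algorithms 2 and 3 can be expressed compactly if time is discretized into unit slots and each link keeps the set of slots already reserved. The following is a minimal sketch along those lines (our own illustration, with a hypothetical candidate_paths argument standing in for the controller's topology knowledge); it allocates, for one flow, the earliest idle slots on the candidate path that finishes the flow soonest.

from collections import defaultdict
from itertools import count

occupied = defaultdict(set)  # O_x: link id -> set of reserved unit time slots

def time_allocation(path, demand):
    """Alg. 3 sketch: first `demand` slots in which every link of `path` is idle."""
    busy = set().union(*(occupied[link] for link in path)) if path else set()
    slots = []
    for t in count():                     # scan slots 0, 1, 2, ... until enough are found
        if t not in busy:
            slots.append(t)
            if len(slots) == demand:
                break
    return slots, slots[-1] + 1           # (timeSlice, completion time)

def path_calculation(flow_demand, candidate_paths):
    """Alg. 2 sketch: pick the candidate path with the earliest completion time."""
    best = None
    for path in candidate_paths:
        slots, finish = time_allocation(path, flow_demand)
        if best is None or finish < best[2]:
            best = (path, slots, finish)
    path, slots, finish = best
    for link in path:                     # commit the reservation on the chosen path
        occupied[link].update(slots)
    return path, slots, finish

# Example: link "a" is busy in slot 0; over ("a", "b") a 2-slot flow would land in
# slots 1-2, but path ("c",) finishes earlier, so it is chosen and reserved.
occupied["a"].update({0})
print(path_calculation(2, [("a", "b"), ("c",)]))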

Maintain the states of each flow. In TAPS, each sender maintains several state variables for flow f_j^i: the deadline d_j^i, the expected transmission time E_j^i, and the allocated time slices A_j^i.

Communicate with the controller. Once a new task arrives, the senders send a probe packet with the task information, including source ID Src_j^i, destination ID Dst_j^i, flow size s_j^i and deadline d_j^i (i is the task ID and j the flow ID), to the controller. Then the senders wait for the results from the controller. If the task is discarded, the senders will not transfer any flows in the task. Otherwise the senders maintain the pre-allocation information from the controller.

Monitor the time to send a flow. The senders monitor the time and keep in touch with the controller to ensure time consistency. The sender sends each flow at the allocated rate at the appropriate time. When a flow is completed, the sender sends a TERM packet to the controller and removes it from its maintained set.
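Putting the three sender-side functions together, a minimal event-loop sketch could look like the following. This is our own illustration, not the paper's implementation: now(), send_at_line_rate and send_to_controller are hypothetical placeholders for the testbed's synchronized clock and I/O paths.

import time

def now():
    """Hypothetical synchronized clock (the paper only says senders keep time consistency)."""
    return time.monotonic()

class TapsSender:
    def __init__(self, send_at_line_rate, send_to_controller):
        self.flows = {}                    # flow id -> dict(deadline, expected_time, slices)
        self.send_at_line_rate = send_at_line_rate
        self.send_to_controller = send_to_controller

    def on_schedule_reply(self, flow_id, accepted, deadline, expected_time, slices):
        """Store the pre-allocation from the controller; discarded flows are never sent."""
        if accepted:
            self.flows[flow_id] = {"deadline": deadline,
                                   "expected_time": expected_time,
                                   "slices": list(slices)}

    def tick(self):
        """Called periodically: transmit flows whose allocated slice has started."""
        t = now()
        for flow_id, state in list(self.flows.items()):
            if state["slices"] and state["slices"][0][0] <= t:
                start, end = state["slices"].pop(0)
                self.send_at_line_rate(flow_id, duration=end - start)
            if not state["slices"]:        # all slices used: flow finished, report TERM
                self.send_to_controller({"type": "TERM", "flow": flow_id})
                del self.flows[flow_id]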

E. TAPS Switch

The switches in TAPS do not need any modification or additional modules to allocate rates for flows, in contrast to switches in other latency-aware protocols which employ explicit rate control [4], e.g. PDQ [11], D3 [3] and Baraat [7]. In TAPS, switches only take charge of data forwarding: they forward packets according to the default entries installed by the SDN controller.


V. EVALUATION

In this section, we evaluate TAPS through simulation tests, and compare TAPS with state-of-the-art solutions. The simulation results indicate that TAPS outperforms Baraat, Varys, PDQ [11], D3 and Fair Sharing in terms of task completion ratio and wasted bandwidth in both single-path and multi-path situations. Furthermore, TAPS also outperforms the other 5 algorithms in terms of flow completion ratio when tasks are not taken into consideration. We first give the simulation setup, then look into the details of the simulation results.

A. Simulation Setup

The simulation runs on two different topology setups. The single-rooted tree topology is identical to the one used in Baraat and similar to those in D3 [3] and PDQ [11], as shown in Fig. 5; it is a three-level single-rooted tree. Each rack has 40 machines, and inside a rack a Top-of-Rack (ToR) switch connects these 40 machines with 1Gbps links. 30 ToR switches are connected to an aggregation switch and 30 aggregation switches are connected to a core switch. The single-rooted tree topology thus has 36,000 physical servers in 30 pods, and each pod comprises 30 racks. The multi-rooted tree topology is a 32-pod fat-tree [12] with 8192 servers and 1Gbps links.

Fig. 5. A single-rooted tree topology: 40 servers per rack, 30 ToR switches per aggregate switch, 30 aggregate switches under the core switch.

The simulation data is generated in the same way as the experiment setups in D3 [3] and PDQ [11], but with additional task-level information. Each group of simulation data contains 30 tasks. The arrival times of the tasks follow a Poisson arrival model with arrival rate λ, i.e. λ tasks arrive per second on average, and each task has μ flows on average. All flows within the same task arrive at the same time. When a task arrives in our simulation, the sending and receiving points of its flows are decided randomly. The deadline of each task is generated from an exponential distribution (default mean deadline = 40ms). Here the mean deadline is the mean flow deadline, i.e. the average expected completion time minus the start time of each flow. The sizes of the flows are generated from a normal distribution (default mean flow size = 200KB). Note that all flows in the same task have the same deadline. In the default setting, the mean number of flows per task is 1200 for single-rooted simulations and 1024 for multi-rooted simulations. A small sketch of this workload generator is given at the end of this subsection. We evaluate the following flow scheduling mechanisms

with a flow-level simulator written in C++:

TAPS: TAPS is implemented just as described in Sec. IV. Upon the arrival of each task, the algorithm decides whether the task should be accepted or declined. If a task is accepted, time slices are pre-allocated and the route is decided for each flow in the task.

Fair Sharing: We develop an ordinary version of Fair Sharing, which is totally agnostic about tasks and deadlines. Each flow that competes for a bottleneck link gets a fair share of the link capacity.

D3: The implementation of D3 includes the improvement introduced by [11].

PDQ: We simulate PDQ with the basic Early Termination (ET) function. Suppressed Probing (SP) and Early Start (ES) [11] take buffer occupancy into account and are not appropriate for our flow-level model.

Baraat: We simulate Baraat according to the algorithm in [7].

Varys: We mainly simulate Varys as in Pseudocode 1 and 2 of [8], adapted to the deadline-sensitive simulations.

Since these algorithms are not naturally designed for multi-rooted tree topologies, we use flow-level ECMP to extend them to make routing decisions in multi-rooted scenarios. Solutions that may start flows even when they are impossible to finish, namely D3 and Fair Sharing, will not send more packets from flows that have already missed their deadlines, so that useless transmission is avoided.

We generate multiple groups of simulation data, varying different arguments to mirror the impact of different reality factors: the mean deadline of flows for the task urgency, the mean size of flows for the task duration, and the mean number of flows per task and the number of tasks for the task diffusion. We evaluate the following three metrics. The task completion ratio is the percentage of tasks that are successfully finished before their deadlines; only tasks in which all flows meet their deadlines are counted as completed. As a contrast we also keep records of the flow completion ratio and the application flow throughput, which are the ratios of the total number and the total size, respectively, of flows finished before their deadlines, regardless of whether their tasks finish or not.

B. Impact of Task Urgency

In the first group of simulations, we vary the mean flow deadline from 20ms to 60ms.

Fig. 6 shows the experiment results for the single-rooted tree. The results indicate that the performance of each algorithm increases with the growing mean deadline of flows. It is intuitive that the larger the mean deadline is, the easier it is for the same amount of tasks to complete. TAPS outperforms the other 5 algorithms in terms of task completion ratio and application throughput. Fair Sharing behaves the worst because of its deadline-agnostic and task-agnostic properties. When deadlines are very urgent, the performance of D3, PDQ, Varys and Baraat is very similar, but as deadlines become larger, the differences begin to emerge. The performance of PDQ and Varys is very close. The reason that TAPS beats PDQ is mainly the rejecting policy, which prevents subsequent flows from interrupting prior flows. PDQ is task-agnostic, but can make flows complete more quickly and before deadlines. Although Varys is task-aware, it is restricted by the task arrival order. The reason why Baraat behaves like this is that Baraat is deadline-agnostic. In the deadline-sensitive


Fig. 6. Application throughput (a) and task completion ratio (b) when varying the deadline for the single-rooted tree. X-axis: deadline (20-60 ms); curves: Fair Sharing, D3, PDQ, Baraat, Varys, TAPS.

scenario, Baraat cannot accomplish more tasks than the other solutions except Fair Sharing. We can also see from the difference between Fig. 6(a) and Fig. 6(b) that most algorithms complete a larger fraction of tasks by number than by size. For D3 and Fair Sharing, the application flow throughput is higher than the flow completion ratio in the same situation, which reveals that these algorithms prefer to accomplish flows with larger sizes. However, PDQ, Varys, Baraat and TAPS show the opposite feature: they prefer to fulfill smaller flows, which benefits from the scheduling discipline SJF [18].

Fig. 7. Multi-rooted simulation results when varying the deadline: task completion ratio vs. deadline (20-60 ms) for Fair Sharing, D3, PDQ, Baraat, Varys and TAPS.

Fig. 7 shows the experiment results for the multi-rooted tree. It is similar to Fig. 6(b); the difference is that the growth trends of the curves are more obvious, while the general trend is like that of Fig. 6(b).

Fig. 8 shows the wasted bandwidth of these algorithms in the

Fig. 8. Wasted bandwidth when varying the deadline: (a) wasted bandwidth comparison; (b) wasted bandwidth comparison without Fair Sharing. X-axis: deadline (20-60 ms); Y-axis: wasted bandwidth ratio.

single-rooted tree. Wasted bandwidth refers to packets which have been transmitted successfully but whose flow eventually missed its deadline. The wasted bandwidth ratio denotes the size of such transmitted-but-wasted packets as a percentage of the total task size. Fig. 8(a) shows that Fair Sharing wastes the most bandwidth. Fig. 8(b) shows the detailed comparison of the other algorithms. The reason for the low wasted bandwidth of TAPS is the reject policy: a flow which may miss its deadline is not accepted and is never transmitted in the first place. The wasted bandwidth ratio of Baraat is also very high, which reveals that its deadline-agnostic property makes Baraat waste plenty of bandwidth. D3 and PDQ share similar results; they are both deadline-aware, but PDQ saves more bandwidth than D3. Varys saves the most bandwidth, benefitting from a reject policy similar to that of TAPS.
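Stated as a formula, and using our own symbols since the paper defines the metric only in words, the wasted bandwidth ratio over a set of flows F is

\text{wasted bandwidth ratio} = \frac{\sum_{f \in F_{\text{miss}}} b_f^{\text{sent}}}{\sum_{f \in F} s_f},

where F_{miss} ⊆ F is the set of flows that miss their deadlines, b_f^{sent} is the amount of data of flow f already transmitted, and s_f is the size of flow f, so the denominator is the total task size.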

C. Impact of Task Duration

In the second group of simulations, we vary the mean flow size from 60KB to 300KB in the single-rooted tree topology.

Fig. 9 indicates that the other algorithms can hardly complete tasks when the flow size is large, while TAPS achieves a higher completion ratio because of its task awareness, rejecting policy and deadline awareness. Though PDQ can accomplish more flows and more packets than D3 and Fair Sharing, without global scheduling and near-optimal routing its performance is much lower than that of TAPS. The results of Fig. 9 are similar to those of Sec. V-B in that TAPS outperforms the other algorithms in terms of task completion ratio. In contrast, the performance of D3 is very poor compared to the other algorithms when the flow size is very large.


Fig. 9. Application throughput (a) and task completion ratio (b) when varying the flow size. X-axis: flow size (60-300 kB); curves: Fair Sharing, D3, PDQ, Baraat, Varys, TAPS.


Fig. 10. Flow completion ratio when varying flow number. X-axis: flow size (60-300 kB); curves: Fair Sharing, D3, PDQ, Baraat, Varys, TAPS.

Fig. 10 is intended to demonstrate the near-optimal property of TAPS. The setup of this simulation is as in Sec. V-A, except that each task has only one flow, which means task and flow are equivalent here and the task completion ratio equals the flow completion ratio. There are 36,000 tasks in this simulation. We can see that TAPS still outperforms the other 5 algorithms in terms of flow completion ratio, but PDQ outperforms Varys more obviously in Fig. 10. The variation trends of the remaining algorithms are very similar to those in the figures above.

D. Impact of Task Diffusion

In the third group of simulations, we vary the mean number of flows per task from 400 to 2000 and the task number from 30 to 270 in the single-rooted tree topology.

Fig. 11. Task completion ratio when varying the flow number per task. X-axis: flow number per task (400-2000); curves: Fair Sharing, D3, PDQ, Baraat, Varys, TAPS.

Fig. 12. Task completion ratio when varying the task number. X-axis: task number (30-270); curves: Fair Sharing, D3, PDQ, Baraat, Varys, TAPS.

Fig. 11 and Fig. 12 indicate that with the increase of task number and flow number, the performance of each algorithm decreases. The advantage of TAPS comes mainly from its task awareness and the proper rejection of newly coming tasks. Thus, under different levels of task diffusion, task awareness plays the most important role in TAPS's performance.

VI. IMPLEMENTATION AND EXPERIMENTS

We deployed TAPS upon a software-based controller across a small-scale testbed with a partial Fat-tree topology, as shown in Fig. 13. The testbed includes 8 endhosts arranged across 4 racks and two pods. All the servers are desktops with a 2.8GHz Intel Core 2 Duo E7400 processor and 2GB of RAM, whose network cards are Intel 82574L Gigabit Ethernet cards. Each rack has a top-of-rack (ToR) switch which is connected to an aggregate switch. The aggregate switches are connected by core switches, composed of level-3 switches and configured dynamically by the controller, which instructs servers when to send flows and which flow to send. All the switches are H3C S5500-24P-SI series switches.

To evaluate the benefits of exploiting deadline and task information, we compare TAPS with Fair Sharing. For both scheduling approaches, a new flow is directed to the controller by the sender, which has a virtual switch inside, when the flow is generated by a virtual machine. Iperf [20] is used to generate 100 flows in our implementation. The average flow size is


100KB and the average deadline is 40ms, similar to Sec. V-A. The source and destination IDs are generated randomly.

We use the effective application throughput as the metric, which indicates the useful data packets transmitted per unit time. As Fig. 14 shows, TAPS achieves a high effective application throughput, which is almost close to 100%. However, Fair Sharing fails to achieve a stable effective application throughput and remains much lower than TAPS. Since a flow occupies the link bandwidth exclusively and there is no competition for link bandwidth among flows, the link bandwidth can be fully utilized by the flows under TAPS. In contrast, Fair Sharing only reaches a mean effective application throughput of about 60% due to flow competition and its deadline-agnostic feature. The tail of the TAPS curve descends little by little; this is because when a sender has finished all its flows, some bottleneck links become idle and the additional 1Gbps bandwidth no longer contributes to the throughput. The Fair Sharing curve, by contrast, shows throughput changing rapidly as different flows miss their deadlines.

Fig. 13. A partial Fat-tree: servers, edge switches, aggregate switches and core switches.

Fig. 14. Implementation results: effective application throughput (%) over time (ms) for TAPS and Fair Sharing.

The implementation results demonstrate that TAPS makes highly effective utilization of network bandwidth and fulfills many more tasks than the Fair Sharing transport protocol, which saves network bandwidth resources.

VII. CONCLUSION

We proposed TAPS, a task-level deadline-aware scheduling algorithm for data centers which aims to make more tasks, instead of flows, complete before their deadlines. We leverage SDN and further generalize SDN from flow-level awareness to task-level awareness. TAPS provides a centralized algorithm which runs on the SDN controller to decide whether a task should be accepted or discarded. If the task is accepted, the controller pre-allocates the transmission time slices for the flows of the task and computes the routing paths for the accepted flows. Extensive flow-level simulations demonstrate that TAPS outperforms PDQ [11], D3 [3], and Fair Sharing transport protocols in deadline-sensitive data center network environments, in both single-path and multi-path paradigms. A simple implementation on a real system also shows that TAPS achieves highly effective utilization of network bandwidth in data centers.

REFERENCES

[1] D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F. Wenisch, "Power management of online data-intensive services," in Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pp. 319-330, IEEE, 2011.

[2] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, "Data center tcp (dctcp)," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 63-74, 2011.

[3] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowtron, "Better never than late: Meeting deadlines in datacenter networks," in Proceedings of SIGCOMM '11, pp. 50-61, ACM, 2011.

[4] N. Dukkipati and N. McKeown, "Why flow-completion time is the right metric for congestion control," ACM SIGCOMM Computer Communication Review, vol. 36, no. 1, pp. 59-62, 2006.

[5] H. Wu, Z. Feng, C. Guo, and Y. Zhang, "Ictcp: incast congestion control for tcp in data-center networks," vol. 21, pp. 345-358, IEEE Press, 2013.

[6] B. Vamanan, J. Hasan, and T. Vijaykumar, "Deadline-aware datacenter tcp (d2tcp)," ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, pp. 115-126, 2012.

[7] F. R. Dogar, T. Karagiannis, H. Ballani, and A. Rowstron, "Decentralized task-aware scheduling for data center networks," 2013.

[8] M. Chowdhury, Y. Zhong, and I. Stoica, "Efficient coflow scheduling with varys," in Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM '14, pp. 443-454, 2014.

[9] J. Dean and S. Ghemawat, "Mapreduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

[10] A. Shieh, S. Kandula, A. G. Greenberg, C. Kim, and B. Saha, "Sharing the data center network," in NSDI, 2011.

[11] C.-Y. Hong, M. Caesar, and P. Godfrey, "Finishing flows quickly with preemptive scheduling," ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, pp. 127-138, 2012.

[12] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," in ACM SIGCOMM Computer Communication Review, vol. 38, pp. 63-74, ACM, 2008.

[13] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, "Bcube: a high performance, server-centric network architecture for modular data centers," ACM SIGCOMM Computer Communication Review, vol. 39, no. 4, pp. 63-74, 2009.

[14] D. Li, C. Guo, H. Wu, K. Tan, Y. Zhang, and S. Lu, "Ficonn: Using backup port for server interconnection in data centers," in INFOCOM 2009, pp. 2276-2285, IEEE, 2009.

[15] M. Alizadeh, A. Kabbani, T. Edsall, B. Prabhakar, A. Vahdat, and M. Yasuda, "Less is more: trading a little bandwidth for ultra-low latency in the data center," in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 19-19, USENIX Association, 2012.

[16] V. Sivaraman, F. M. Chiussi, and M. Gerla, "End-to-end statistical delay service under gps and edf scheduling: A comparison study," in INFOCOM 2001. Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, vol. 2, pp. 1113-1122, IEEE, 2001.

[17] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, "The nature of data center traffic: measurements & analysis," in Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, pp. 202-208, ACM, 2009.

[18] N. Bansal and M. Harchol-Balter, "Analysis of srpt scheduling: Investigating unfairness," vol. 29, ACM, 2001.

[19] "Software-defined networking (sdn)." https://www.opennetworking.org/.

[20] "Iperf." http://iperf.sourceforge.net/.
