Fair Scheduling in Cloud Datacenters with Multiple Resource Types
by
Wei Wang
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2015 by Wei Wang
Abstract
Fair Scheduling in Cloud Datacenters with Multiple Resource Types
Wei Wang
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2015
This dissertation focuses on algorithm design and prototype implementation of fair sharing policies
in cloud datacenters with multiple resource types. Specifically, it seeks to address two fundamental
resource management problems.
First, how should the computing resources of a large-scale cluster – such as CPU cores, memory,
and storage – be fairly shared among running applications? This problem has become even more com-
plicated in cloud datacenters, mainly due to their unprecedented heterogeneity and complexity. Cloud
computing clusters are likely to be constructed from a large number of commodity servers spanning mul-
tiple generations, with different computing capabilities, bandwidth, and storage capacities. On the other
hand, depending on the underlying applications, computing tasks may require vastly different amounts
of resources: some are CPU-intensive; some are memory- or bandwidth-bound. Despite these complex-
ities, existing resource sharing policies either result in significant resource fragmentation, or require a
homogeneous cluster where servers are of the same specification.
This dissertation proposes Dominant Resource Fairness in Heterogeneous systems (DRFH), a general
multi-resource sharing policy that preserves many of the axiomatically defined, highly desirable “fair” properties
provided by weighted fair sharing in the single-resource setting. DRFH eliminates
an application’s incentive to cheat the cluster scheduler for a larger allocation share by misreporting
its resource requirements, a strategic behaviour commonly observed in production clusters. Prototype
implementation and trace-driven simulations show that DRFH can be easily enforced in cluster systems,
with higher resource utilization and shorter job completion times than existing resource sharing
policies.
Second, how should active flows fairly share the network resources of middleboxes, software
routers, and other appliances that are widely deployed in datacenters? In middleboxes or software
routers, flows usually undergo deep packet inspection, which requires the support of multiple types of
resources, and may bottleneck on CPU, memory bandwidth, or link bandwidth. While there is
a rich literature on fair queueing for a single type of resource (i.e., link bandwidth), it remains unclear how
to schedule multiple resources in middleboxes to achieve fair sharing among flows. A similar problem
also arises in virtual machine (VM) scheduling inside a hypervisor, where different VMs may consume
different amounts of resources, and it is desirable to fairly multiplex their access to physical resources.
To address these challenges, this dissertation proposes Multi-Resource Round Robin (MR3), which serves
flows in rounds and achieves near-perfect fairness in O(1) time. MR3 serves as a foundation for a more
general fair scheduler, called Group Multi-Resource Round Robin (GMR3). GMR3 also runs in O(1)
time, yet provides weight-proportional packet latency when flows are assigned uneven weights.
This dissertation also identifies a new challenge that is unique to multi-resource scheduling: the
general tradeoff between fairness and efficiency. Such a tradeoff has never been a problem for traditional
fair sharing of link bandwidth. As long as the queueing algorithm is work conserving, the bandwidth is
always fully utilized, and fairness is the only concern. However, in the presence of multiple resource types,
fairness and efficiency are conflicting objectives that cannot be achieved at the same time. Motivated
by this problem, a new queueing algorithm is proposed and prototyped. It allows the network operator to
flexibly specify a tradeoff preference and implements the specified tradeoff by determining the right
packet scheduling order.
To my family
Acknowledgments
First and foremost, I am deeply indebted to my thesis advisors, Professor Baochun Li and Professor
Ben Liang, who have inspired me and guided me tirelessly throughout my PhD. They are both
brilliant researchers who always shared their sharp visions with me and provided invaluable feedback on
virtually every piece of work I have done. Without their guidance and training, I could not imagine
having come so far. I feel extremely privileged to have worked with both of them and to have benefited
from both of their perspectives.
There are also a number of people to whom I am especially thankful. The first one is easy: Dr. Chen
Feng. A good old friend and a close collaborator, Chen was always available and ready to help in various
ways, both technical and non-technical. Chapter 4 was joint work with him and my advisors, where he
helped simplify the analyses and provided the key ideas for the design of an O(log n) tracking
algorithm (Sec. 4.5). Also, I would like to express my sincere gratitude to Professor Jorg Liebeherr and
Professor Ding Yuan, who generously served on my thesis proposal committee and offered fantastic
advice on the preliminary work of this dissertation. I am grateful to Jun Li as well, for his great help
in the experiments of Chapter 2. Finally, I am greatly thankful to Professor Frank Kschischang, for his
comments on writing.
Beyond the direct contributors to my thesis project, many other people have generously offered help
in my graduate work. In particular, the close collaboration with Dr. Di Niu led to joint work in cloud
economics [1]. Dr. Zimu Liu has provided numerous great suggestions on prototype implementations.
Dr. Henry Xu and Zhifei Zhu were always available for inspiring and fruitful discussions.
I was also very fortunate to work with talented members in two fantastic research groups at the
University of Toronto: iQua and WCL. The list is in the order of seniority: Dr. Di Niu, Dr. Chen Feng,
Dr. Mahdi Hajiaghayi, Dr. Seyed Hossein Seyedmehdi, Dr. Henry Xu, Dr. Zimu Liu, Dr. Yuan Feng,
Dr. Boyang Wang, Dr. Zhi Wang, Yuefei Zhu, Yiwei Pu, Qiang Xiao, Juyang Xia, Honghao Ju, Wei Bao,
Sun Sun, Jun Li, Jaya Prakash Champati, Ali Ramezani-Kebrya, Meng-Hsi Chen, Yong Wang, Li Chen,
Liyao Xiang, Shuhao Liu, Yujie Xu, Sowndarya Sundar, Yinan Liu, and Mohammadmoein Soltanizadeh.
I would also like to thank friends in Bahen for their company and encouragement: Huiyuan Xiong,
Siyu Liu, Yuhan Zhou, Alice Gao, Dr. Yicheng Lin, Dr. Qi Zhang, Chu Pang, Weiwei Li, and Binbin
Dai.
Finally, my deepest gratitude goes to my family. My work would have never been possible without
the unwavering support and unflagging love from my parents and my wife. This dissertation is dedicated
to them.
Contents
Acknowledgments

1 Introduction
1.1 Resource Sharing Policies in Cloud Datacenters
1.2 Multi-Resource Scheduling
1.3 Summary of Contributions
1.3.1 Multi-Resource Fair Sharing for Cloud Systems with Heterogeneous Servers
1.3.2 Multi-Resource Fair Queueing for Network Flows
1.3.3 Fairness-Efficiency Tradeoff for Multi-Resource Scheduling
1.4 Thesis Organization

2 Multi-Resource Fair Scheduler for Datacenter Jobs
2.1 Motivation
2.2 DRF Scheduler and Its Limitations
2.3 System Model and Allocation Properties
2.3.1 Basic Settings
2.3.2 Resource Allocation
2.3.3 Allocation Mechanism and Desirable Properties
2.3.4 Naive DRF Extension and Its Inefficiency
2.4 DRFH Allocation and Its Properties
2.4.1 DRFH Allocation
2.4.2 Envy-Freeness
2.4.3 Pareto Optimality
2.4.4 Group Strategyproofness
2.4.5 Sharing Incentive
2.4.6 Other Important Properties
2.5 Practical Considerations
2.5.1 Weighted Users with a Finite Number of Tasks
2.5.2 Scheduling Tasks as Entities
2.6 Evaluation
2.6.1 Prototype Implementation
2.6.2 Trace-Driven Simulation
2.7 Related Work
2.8 Summary

3 Multi-Resource Fair Queueing for Network Flows
3.1 Motivation
3.2 Related Work
3.3 Preliminaries and Design Objectives
3.3.1 Packet Processing Time
3.3.2 Dominant Resource Fairness (DRF) over Time
3.3.3 Scheduling Delay
3.3.4 Scheduling Complexity
3.4 Multi-Resource Round Robin
3.4.1 Challenges of Round-Robin Extension
3.4.2 Basic Intuition
3.4.3 Algorithm Design
3.4.4 Performance Analysis
3.4.5 Simulation Results
3.4.6 MR3 for the Weighted Flows
3.5 Group Multi-Resource Round Robin
3.5.1 Basic Intuition
3.5.2 Flow Grouping
3.5.3 Inter-Group Scheduling
3.5.4 Intra-Group Scheduling
3.5.5 Handling New Packet Arrivals
3.5.6 Implementation and Complexity
3.5.7 Performance Analysis
3.5.8 Evaluation
3.6 Discussion and Future Work
3.7 Summary
3.8 Proofs
3.8.1 Fairness Analysis for MR3
3.8.2 Analysis of Startup Latency of MR3
3.8.3 Analysis of Scheduling Delay of MR3
3.8.4 Fairness Analysis of Weighted MR3
3.8.5 Delay Analysis of Weighted MR3

4 Fairness-Efficiency Tradeoff for Multi-Resource Scheduling
4.1 Motivation
4.2 Fairness and Efficiency
4.2.1 Dominant Resource Fairness
4.2.2 The Efficiency Measure
4.2.3 Tradeoff between Fairness and Efficiency
4.2.4 Challenges
4.3 Fairness, Efficiency, and Their Tradeoff in the Fluid Model
4.3.1 Fluid Relaxation
4.3.2 Fluid Schedule with Perfect Fairness
4.3.3 Fluid Schedule with Optimal Efficiency
4.3.4 Tradeoff between Fairness and Efficiency
4.4 Packet-by-Packet Tracking
4.4.1 Start-Time Tracking vs. Finish-Time Tracking
4.4.2 Performance Analysis
4.5 An O(log n) Implementation
4.5.1 Packet Profiling
4.5.2 Direct Implementation of Fluid Scheduling
4.5.3 Virtual Time Implementation of Fluid Scheduling
4.5.4 Start-Time Tracking and Complexity
4.6 Evaluation
4.6.1 Experimental Results
4.6.2 Trace-Driven Simulation
4.7 Related Work
4.8 Summary and Future Work
4.9 Proofs
4.9.1 Proof of Lemma 11
4.9.2 Proof of Lemma 12
4.9.3 Proof of Theorem 12
4.9.4 Proof of Theorem 13
4.9.5 Preliminaries for the Proofs of Theorems 14 and 15
4.9.6 Proof of Theorem 14
4.9.7 Proof of Theorem 15

5 Concluding Remarks
5.1 Conclusions
5.2 Future Directions

Bibliography
List of Figures
2.1 An example of a system consisting of two heterogeneous servers, in which user 1 can schedule at most two tasks, each demanding 1 CPU and 4 GB memory. The resources required to execute the two tasks are also highlighted in the figure.
2.2 An example of a system containing two heterogeneous servers shared by two users. Each computing task of user 1 requires 0.2 CPU time and 1 GB memory, while each computing task of user 2 requires 1 CPU time and 0.2 GB memory.
2.3 DRF allocation for the example shown in Fig. 2.2, where user 1 is allocated 5 tasks in server 1 and 1 in server 2, while user 2 is allocated 1 task in server 1 and 5 in server 2.
2.4 An alternative allocation with higher system utilization for the example of Fig. 2.2. Servers 1 and 2 are exclusively assigned to users 1 and 2, respectively. Both users schedule 10 tasks.
2.5 CPU, memory, and global dominant share of four applications running in a 50-node Amazon EC2 cluster.
2.6 Time series of CPU and memory utilization.
2.7 DRFH improvements on job completion times over the Slots scheduler.
2.8 Task completion ratio of users using Best-Fit DRFH and Slots schedulers, respectively. Each bubble's size is logarithmic to the number of tasks the user submitted.
2.9 Comparison of task completion ratios under DRFH and those obtained in dedicated clouds (DCs). Each circle's radius is logarithmic to the number of tasks submitted.
3.1 Illustration of a scheduling discipline that achieves DRF.
3.2 Illustration of a direct DRR extension. Each packet of flow 1 has processing times 〈7, 6.9〉, while each packet of flow 2 has processing times 〈1, 7〉.
3.3 Naive fix of the DRR extension shown in Fig. 3.2a by withholding the scheduling opportunity of every packet until its previous packet is completely processed on all resources.
3.4 Illustration of a schedule by MR3.
3.5 Illustration of the round-robin service and the sequence number.
3.6 Dominant services and packet throughput received by different flows under FCFS and MR3. Flows 1, 11, and 21 are ill-behaving.
3.7 Latency comparison between DRFQ and MR3.
3.8 MR3 can quickly adapt to traffic dynamics and achieve DRF across all 3 flows.
3.9 Fairness and delay sensitivity of MR3 in response to mixed packet sizes and arrival distributions.
3.10 The MR3 schedule fails to offer weight-proportional delay when flows are assigned uneven weights. $P^i_k$ denotes the $k$th packet of flow $i$.
3.11 An improved schedule over MR3 in the example of Fig. 3.10, where the scheduling delay is significantly reduced. $P^i_k$ denotes the $k$th packet of flow $i$.
3.12 An illustration of the scheduling rounds of flow groups, where $R^k_l$ denotes scheduling round $l$ of flow group $G_k$.
3.13 An illustration of the inter-group scheduler selecting flow groups in the example of Fig. 3.10. Within a group, the flow is determined in a round-robin manner by the intra-group scheduler.
3.14 The schedule determined by GMR3 in the example of Fig. 3.10, where $f^l_i$ denotes the packet processing for flow $i \in G_k$ in scheduling round $l$ of its flow group $G_k$. The slot axis is only for the accounting mechanism, while the time axis shows real time elapsed.
3.15 Illustration of the maximum number of scheduling rounds of flow $i$ on the last resource in $(t_1, t_2)$.
3.16 Illustration of a scenario where the scheduling delay $D_i(p)$ reaches its maximum. Here, $f^l_i$ denotes the processing of flow $i$ in scheduling round $l$ of its flow group.
3.17 Simulation results of the fairness and delay performance of GMR3, as compared to DRFQ and MR3. Subfigure (a) is dedicated to the fairness evaluation, while (b), (c), and (d) compare the scheduling delay of the three schedulers.
3.18 Illustration of the startup latency. In the figure, flow $i$ in round $k$ is denoted as $i_k$. Flow $n+1$ becomes active when flow $n$ has been served on resource 1 in round $k-1$, and will be served right after flow $n$ in round $k$.
3.19 Illustration of the packet latency, where flow $i$ in round $k$ is denoted as $i_k$. The figure shows the scenario under which the latency reaches its maximum value: a packet $p$ is pushed to the top of the input queue in one round but is scheduled in the next round because of the account deficit.
4.1 An example showing the tradeoff between fairness and efficiency for multi-resource packet scheduling. Packets that finish CPU processing are placed into a buffer in front of the output link. Flow 1 sends packets $p_1, p_2, \ldots$, each having a processing time vector 〈2, 3〉; flow 2 sends packets $q_1, q_2, \ldots$, each having a processing time vector 〈9, 1〉. Schedule (a) achieves DRF but is inefficient; schedule (b) is efficient but unfair.
4.2 The DRGPS fluid that implements perfect fairness in the example of Fig. 4.1. Flow 1 sends packets $p_1, p_2, \ldots$ and receives 〈3/5 CPU, 1/15 bandwidth〉; flow 2 sends packets $q_1, q_2, \ldots$ and receives 〈3/5 CPU, 1/15 bandwidth〉. Only 2/3 of the link bandwidth is utilized.
4.3 Overall resource utilization observed in Click. No packet drops.
4.4 Dominant share each flow receives per second in Click. No packet drops. The strict fair share is 2%.
4.5 Average resource utilization and dominant share each flow receives in Click at different fairness levels. The queue capacity is 200 packets. The measurement of resource utilization and dominant share is conducted every second over the entire schedule.
4.6 Mean per-packet latency of elephant flows (sending 20,000 pkts/s) and mice flows (sending 2 pkts/s) in Click. The error bar shows the standard deviation.
4.7 Resource utilization achieved at different fairness levels in the simulation, averaged over 10 runs.
4.8 The improvement of per-packet latency and packet drop rate due to the fairness tradeoff.
4.9 Per-packet latency against flow sizes in the first 20 s of simulation, feeding CPU-bound traffic, with and without the fairness tradeoff.
List of Tables
1.1 Configurations of servers in one of Google's datacenters [2, 3]. CPU and memory units are normalized to the maximum server (highlighted below).
2.1 Resource configurations of the 50 nodes in the launched Amazon EC2 cluster.
2.2 Details of tasks submitted by four artificial applications.
2.3 Resource utilization of the Slots scheduler with different slot sizes.
3.1 Performance comparison between MR3 and DRFQ, where L is the maximum packet processing time, m is the number of resources, and n is the number of active flows.
3.2 Linear model for CPU processing time in 3 middlebox modules. Model parameters are based on the measurement results reported in [4].
3.3 Summary of performance of GMR3 and existing schemes, where n is the number of flows, and m is the number of resources.
4.1 Main notations used in the fluid model. The superscript t is dropped when time can be clearly inferred from the context.
4.2 Schedule makespan observed in Click at different fairness levels. The queue capacity is infinite.
Chapter 1
Introduction
Cloud computing realizes the long-held ambition of “big data” processing. By pooling a large number
of commodity servers into a cluster – known as a datacenter – cloud computing allows “big data” appli-
cations, such as web search, machine learning, and social networks, to scale up to hundreds of servers,
accommodating their ever-growing demand for computing cycles. Building a computer system at such
a large scale poses some major technical challenges on resource management. Even a medium-sized
production datacenter consists of more than 10K servers, and can host over 2,000 applications from hun-
dreds of users [2, 3]. A fundamental system design problem facing a datacenter operator is: how should
the cluster resources, such as CPU, memory, and link bandwidth, be shared among applications fairly
and efficiently? Motivated by this question, my doctoral research has been on modeling, algorithms,
analyses, and prototype implementations that incorporate the following two themes:
(1) designing fundamental resource sharing policies for large-scale computer clusters and data analytic
systems, and
(2) analyzing fundamental scheduling problems for devices and systems, such as middleboxes, software
routers, and hypervisors, that are widely deployed in cloud datacenters.
1.1 Resource Sharing Policies in Cloud Datacenters
Unlike traditional application-specific clusters and grids, cloud computing systems distinguish themselves
with unprecedented server and workload heterogeneity. Modern datacenters are likely to be constructed
from a variety of server classes, with different configurations in terms of processing capabilities, memory
sizes, and storage spaces [5]. Asynchronous hardware upgrades, such as adding new servers and phasing
out existing ones, further aggravate such diversity, leading to a wide range of server specifications in
a cloud computing system [2, 6–9]. Table 1.1 illustrates the heterogeneity of servers in one of Google's
clusters [2, 3]. Similar server heterogeneity has also been observed in public clouds, such as Amazon EC2
and Rackspace [6, 7].

Table 1.1: Configurations of servers in one of Google's datacenters [2, 3]. CPU and memory units are normalized to the maximum server (highlighted below).

Number of servers   CPUs   Memory
6732                0.50   0.50
3863                0.50   0.25
1001                0.50   0.75
 795                1.00   1.00   (maximum server)
 126                0.25   0.25
  52                0.50   0.12
   5                0.50   0.03
   5                0.50   0.97
   3                1.00   0.50
   1                0.50   0.06
In addition to server heterogeneity, cloud computing systems also represent much higher diversity in
resource demand profiles. Depending on the underlying applications, the workload spanning multiple
cloud users may require vastly different amounts of resources (e.g., CPU, memory, and storage). For
example, numerical computing tasks are usually CPU intensive, while database operations typically
require high-memory support. The heterogeneity of both servers and workload demands poses significant
technical challenges on the resource allocation mechanism, giving rise to many delicate issues – notably
fairness and efficiency – that must be carefully addressed.
Despite the unprecedented heterogeneity in cloud computing systems, state-of-the-art computing
frameworks employ a rather simple abstraction that falls short. For example, Hadoop [10] and Dryad [11],
the two most widely deployed cloud computing frameworks, partition a server’s resources into bundles –
known as slots – that contain fixed amounts of different resources. The system then allocates resources
to users at the granularity of these slots. Such a single resource abstraction ignores the heterogeneity of
both server specifications and demand profiles, inevitably leading to a fairly inefficient allocation [12].
Towards addressing the inefficiency of the current allocation module, many recent works focus on
multi-resource allocation mechanisms. Notably, Ghodsi et al. [12] suggest a compelling alternative known
as the Dominant Resource Fairness (DRF) allocation, in which each user’s dominant share – the maxi-
mum ratio of any resource that the user has been allocated – is equalized. The DRF allocation possesses
a set of highly desirable fairness properties, and has quickly received significant attention in the lit-
erature [13–16]. While DRF and its subsequent works address the demand heterogeneity of multiple
resources, they all limit the discussion to a simplified model where resources are pooled in one place and
the entire resource pool is abstracted as one big server.¹ Such an all-in-one resource model not only
contrasts with the prevalent datacenter infrastructure – where resources are distributed to a large number of
servers – but also ignores the server heterogeneity: the allocations depend only on the total amount of
resources pooled in the system, irrespective of the underlying resource distribution of servers. In fact,
when servers are heterogeneous, even the definition of dominant resource is not so clear. Depending on
the underlying server configurations, a computing task may bottleneck on different resources in different
servers. We shall see in Sec. 2.3.4 that naive extensions, such as applying the DRF allocation to each
server separately, may lead to a highly inefficient allocation.

¹While [12] briefly touches on the case where resources are distributed to small servers (known as the discrete scenario), its coverage is rather informal.
1.2 Multi-Resource Scheduling
The need for multi-resource management goes beyond cloud clusters and extends to appliances and
systems that are widely deployed in datacenters, such as middleboxes, software routers, and hypervisors.
Recent studies report that the number of middleboxes deployed in enterprise and datacenter networks
is on par with that of traditional switches and routers [17, 18]. These middleboxes do more than just
packet forwarding. In addition, they perform filtering (e.g., firewalls), optimization (e.g., HTTP caching
and WAN optimization), and transformation (e.g., dynamic request routing) based on the underlying
traffic contents [18–20], which requires the support of multiple hardware resources such as CPU, memory
bandwidth, and link bandwidth [4,17]. A scheduling algorithm specifically designed for multiple resource
types is therefore needed for sharing these resources fairly and efficiently. Similar problems also arise
in virtual machine (VM) scheduling inside a hypervisor, where different VMs may consume different
amounts of resources, and it is desirable to fairly multiplex their access to physical resources.
While single-resource fair queueing for bandwidth sharing has been extensively studied for switches
and routers [21–26], multi-resource fair queueing imposes new scheduling challenges as flows are com-
peting for multiple resources and may have vastly different resource requirements. For example, flows
that require forwarding a large number of small packets congest the memory bandwidth of a software
router [27], while those that require IP security encryption (IPsec) need more CPU processing time [28].
Despite their heterogeneous resource requirements, flows are expected to receive predictable service isola-
tion to meet their Quality of Service (QoS) requirements. This requires a multi-resource packet scheduler
with the following desirable properties.
Fairness. The packet scheduler should provide some measure of service isolation across flows, so
that the damaging behaviour of rogue traffic will not affect the QoS of other regular flows. In particular,
each flow should receive service (i.e., throughput) at least at the level it would receive if every resource
were allocated in proportion to the flow's weight, irrespective of the behaviour of other traffic.
Bounded scheduling delay. Interactive Internet applications such as video streaming and online
games have stringent end-to-end delay requirements. It is hence important for a packet scheduler to offer
bounded scheduling delay. Such a delay bound should be a small constant, independent of the number
of flows.
Low complexity. As the volume of traffic through middleboxes surges [29, 30], it is important to
make scheduling decisions at high speed. Ideally, a packet scheduler should have time complexity that
is a small constant, independent of the number of flows. In addition, the scheduling algorithm should
be amenable to practical implementation.
While all three properties have been extensively studied for bandwidth sharing in traditional switches
and routers [21–23, 25, 31], multi-resource fair queueing remains a largely uncharted territory. The
recent work of Ghodsi et al. [4] suggests a promising alternative, known as Dominant Resource Fair
Queueing (DRFQ), that implements Dominant Resource Fairness (DRF) [12, 16] in the time domain.
While DRFQ provides nearly perfect service isolation, it is expensive to implement. Specifically, DRFQ
needs to sort packet timestamps [4] and requires O(log n) time complexity per packet, where n is the
number of backlogged flows. With a large number of flows, it is hard to implement DRFQ at high
speeds. This problem is further aggravated by the recent middlebox innovations, where software-defined
middleboxes deployed as VMs and processes are now replacing traditional network appliances built on
dedicated hardware [20, 32]. As more software-defined middleboxes are consolidated onto commodity and
cloud servers [17,18], a device will see an increasing amount of flows competing for multiple resources.
1.3 Summary of Contributions
This dissertation focuses on algorithm design and prototype implementation of the two multi-resource
sharing problems mentioned in the previous two sections. In particular, this dissertation makes the
following contributions.
1.3.1 Multi-Resource Fair Sharing for Cloud Systems with Heterogeneous
Servers
To address the problems of existing resource schedulers in computer clusters, the first contribution of
this dissertation is a rigorous study of a solution with provable operational benefits that bridges the gap
between the existing multi-resource allocation models and the state-of-the-art datacenter infrastructure.
We propose DRFH [33, 34], a DRF generalization in Heterogeneous environments where resources are
pooled by a large number of heterogeneous servers, representing different points in the configuration
space of resources such as processing, memory, and storage. DRFH generalizes the intuition of DRF by
seeking an allocation that equalizes every user’s global dominant share, which is the maximum ratio of
any resource the user has been allocated in the entire resource pool. We systematically analyze DRFH
and show that it retains most of the desirable properties that the all-in-one DRF model provides for a
single server [12]. Specifically, DRFH is Pareto optimal, where no user is able to increase its allocation
without decreasing other users’ allocations. Meanwhile, DRFH is envy-free in that no user prefers the
allocation of another one. More importantly, DRFH is group strategyproof in that whenever a coalition
of users collude with each other to misreport their resource demands, there is a member of the coalition
who cannot strictly gain. As a result, the coalition is better off not forming. In addition, DRFH offers
some level of service isolation by ensuring the sharing incentive property in a weak sense – it allows users
to execute more tasks than those under some “equal partition” where the entire resource pool is evenly
allocated among all users. DRFH also satisfies a set of other important properties, namely single-server
DRF, single-resource fairness, bottleneck fairness, and population monotonicity (details in Sec. 2.3.3).
As a direct application, we design a heuristic scheduling algorithm that implements DRFH in real-
world systems. We evaluate DRFH via both prototype implementation and trace-driven simulations.
Both implementation and simulation results show that compared with the traditional slot schedulers
adopted in prevalent cloud computing frameworks, the DRFH algorithm suitably matches demand het-
erogeneity to server heterogeneity, significantly improving the system's resource utilization and
substantially reducing job completion times.
1.3.2 Multi-Resource Fair Queueing for Network Flows
To address the complexity problem of the existing multi-resource fair queueing algorithms, this disser-
tation proposes two multi-resource fair schedulers that are complementary to each other.
Multi-Resource Round Robin. The second contribution of this dissertation is the design of a low
complexity multi-resource fair scheduler, called Multi-Resource Round Robin (MR3), that serves flows in
rounds and achieves near-perfect fairness in O(1) time [35]. MR3 can also provide bounded scheduling
delay for unweighted flows. The design of MR3 overcomes a series of newly emerged technical challenges
due to the presence of multiple resource types, and is very easy to implement in middleboxes or software
routers. We shall describe its detailed design in Sec. 3.4.
Group Multi-Resource Round Robin. Despite many desirable properties, MR3 fails to provide
weight-proportional delays when flows are assigned uneven weights. The third contribution of this
dissertation is an improved scheduling algorithm presented in Sec. 3.5, referred to as Group Multi-
Resource Round Robin (GMR3), that achieves all three desirable properties [36]. GMR3 groups flows
with similar weights into a small number of groups, each associated with a timestamp. The scheduling
decisions are made in a two-level hierarchy. At the higher level, GMR3 makes inter-group scheduling
decisions by choosing the group with the earliest timestamp, while at the lower level, the intra-group
scheduler serves flows within a group in a round-robin fashion. GMR3 is highly efficient, as it requires
only O(1) time per packet in almost all practical scenarios. In addition, we show that GMR3 achieves
near-perfect fairness across flows, with its scheduling delay bounded by a small constant. These desirable
properties are proven analytically and validated by experiments as well. To our knowledge, GMR3 is the
first fair queueing algorithm that offers near-perfect fairness with O(1) time complexity and a constant
scheduling delay bound.
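To make the two-level hierarchy concrete, the following minimal Python sketch shows only the selection logic: pick the group with the earliest timestamp, then rotate flows round-robin within it. The Group class, its fields, and the calling convention are illustrative assumptions, not the thesis prototype; timestamp updates, weight-based grouping, and per-resource accounting are deliberately omitted (see Sec. 3.5 for the actual algorithm).

```python
from collections import deque

# Hypothetical sketch of GMR3's two-level selection only: an inter-group
# scheduler picks the group with the earliest timestamp, and an intra-group
# scheduler serves that group's flows in round-robin order.

class Group:
    def __init__(self, gid, timestamp, flows=()):
        self.gid = gid
        self.timestamp = timestamp   # inter-group scheduling key
        self.flows = deque(flows)    # flows served round-robin within the group

def next_flow(groups):
    """Return the next flow to serve, or None if every group is empty."""
    active = [g for g in groups if g.flows]
    if not active:
        return None
    group = min(active, key=lambda g: g.timestamp)  # inter-group decision
    flow = group.flows.popleft()                    # intra-group round-robin
    group.flows.append(flow)
    return flow
```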
1.3.3 Fairness-Efficiency Tradeoff for Multi-Resource Scheduling
The last contribution of this dissertation is to identify a new challenge that is unique to multi-resource
scheduling – the general tradeoff between fairness and efficiency. Such a tradeoff has never been a problem
for traditional fair sharing of link bandwidth. As long as the queueing algorithm is work conserving,
the bandwidth is always fully utilized given a non-empty system, leaving fairness as the only concern.
However, this is no longer the case for multi-resource scheduling: different queueing algorithms, even
work conserving, may result in vastly different resource utilization. In general, fairness and efficiency
are conflicting objectives that cannot be achieved at the same time. For applications with loose fairness
requirements, trading off some degree of fairness for higher efficiency and higher throughput is well
justified.
Driven by this practical need, this dissertation proposes a new queueing algorithm that allows the network
operator to flexibly specify a tradeoff preference and implements the specified tradeoff by determining
the right packet scheduling order [37]. Both trace-driven simulation and prototype implementation show
that in many cases, trading off only a small degree of fairness is sufficient to improve the overall traffic
throughput to the point where the system capacity is almost saturated. We present a detailed
discussion in Chapter 4.
1.4 Thesis Organization
This dissertation is organized as follows. Chapter 2 presents our study of multi-resource fair sharing for
datacenter jobs. Chapter 3 presents the study of multi-resource fair scheduling for packet processing
in middleboxes and software routers that are widely deployed in cloud datacenters. Chapter 4 discusses
the tradeoff between fairness and efficiency for multi-resource scheduling. Chapter 5 provides possible
future work and concluding remarks.
Chapter 2
Multi-Resource Fair Scheduler for
Datacenter Jobs
2.1 Motivation
As we have explained in Sec. 1.1, one fundamental problem facing computer clusters is to fairly and
efficiently share resources among different applications. This problem has become even more complicated
in cloud computing systems, mainly due to the unprecedented heterogeneity in terms of both server
specifications and computing requirements. Cloud computing clusters are likely to be constructed from
a large number of commodity servers spanning multiple generations, with different configurations in
terms of processing capabilities, memory sizes, outgoing bandwidth, and storage spaces. On the other
hand, depending on the underlying applications, computing tasks may require vastly different amounts
of resources. For example, scientific computations are typically CPU-intensive; database applications
usually require high memory support; graph-parallel computations, such as PageRank, are bandwidth-
bound.
Despite these complexities, state-of-the-art cluster computing frameworks employ rather simple re-
source sharing policies that fall short. For example, many widely deployed data analytic frameworks,
such as Hadoop [10], Dryad [11], and Spark [38], schedule tasks at the granularity of slots, defined as
a bundle of resources that contains fixed amounts of memory and CPU cores. This inevitably results in
resource fragmentation, the magnitude of which increases with the number of tasks scheduled.
Mesos [39], deployed at Twitter, adopts the recently proposed DRF scheduler [12] to overcome this drawback. However, DRF
is unable to handle server heterogeneity and may lead to low resource utilization [40].
In this chapter, we design a multi-resource allocation mechanism, called DRFH, that generalizes the
notion of Dominant Resource Fairness (DRF) from a single server to multiple heterogeneous servers.
DRFH provides a number of highly desirable properties. With DRFH, no user prefers the allocation
of another user; no one can improve its allocation without decreasing that of the others; and more
importantly, no coalition behavior of misreporting resource demands can benefit all its members. DRFH
also ensures some level of service isolation among the users. As a direct application, we design a simple
heuristic that implements DRFH in real-world systems. Both prototype implementation and large-scale
trace-driven simulations show that DRFH significantly outperforms the traditional slot-based scheduler,
leading to much higher resource utilization with substantially shorter job completion times.
The remainder of this chapter is organized as follows. We briefly revisit the DRF allocation and
point out its limitations in heterogeneous environments in Sec. 2.2. We then formulate the allocation
problem with heterogeneous servers in Sec. 2.3, where a set of desirable allocation properties are also
defined. In Sec. 2.4, we propose DRFH and analyze its properties. Sec. 2.5 discusses some practical
issues in implementing DRFH. We evaluate the performance of DRFH via trace-driven simulations in
Sec. 2.6. We survey the related work in Sec. 2.7 and summarize the chapter in Sec. 2.8.
2.2 DRF Scheduler and Its Limitations
In this section, we briefly review the DRF allocation [12] and show that it may lead to an infeasible
allocation when a cloud system is composed of multiple heterogeneous servers.
In DRF, the dominant resource is defined for each user as the resource that requires the largest fraction
of the total availability. Consider an example given in [12]: suppose that a computing system has 9 CPUs
and 18 GB memory, and is shared by two users. User 1 wishes to launch a set of (divisible) tasks each
requiring 〈1 CPU, 4 GB〉, and user 2 has a set of (divisible) tasks each requiring 〈3 CPUs, 1 GB〉. In this
example, the dominant resource of user 1 is memory, as each of its tasks demands 1/9 of the total
CPU and 2/9 of the total memory. On the other hand, the dominant resource of user 2 is CPU, as each
of its tasks requires 1/3 of the total CPU and 1/18 of the total memory.
We now define the dominant share for each user as the fraction of the dominant resource the user
has been allocated. DRF is then defined as the max-min fairness¹ in terms of the dominant share. In
other words, the mechanism of DRF seeks a maximum allocation that equalizes each user’s dominant
share. Returning to the example above, the DRF mechanism allocates 〈3 CPUs, 12 GB〉 to user 1
and 〈6 CPU, 2 GB〉 to user 2, where user 1 launches three tasks and user 2 two. It is easy to verify that
both users receive the same dominant share (i.e., 2/3), and neither user can launch an additional task
with the remaining resources, although 2 GB of memory is left unused.

¹An allocation is said to be max-min fair if any attempt to increase the allocation of a user would decrease the allocation of another user with a smaller or equal share of resources.

Figure 2.1: An example of a system consisting of two heterogeneous servers – server 1 with 1 CPU and 14 GB memory, and server 2 with 8 CPUs and 4 GB memory – in which user 1 can schedule at most two tasks, each demanding 1 CPU and 4 GB memory. The resources required to execute the two tasks are also highlighted in the figure.
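The allocation in this example can be verified numerically. The following sketch (illustrative only; the variable names are hypothetical) computes the common dominant share $s$ as the largest value for which the implied task counts fit within the capacities, recovering three tasks for user 1 and two for user 2:

```python
# A small numerical check of the DRF example above (illustrative only).
# With divisible tasks, DRF equalizes both users' dominant shares at the
# largest feasible value s, then converts s back into task counts.

capacity = {"cpu": 9.0, "mem": 18.0}
demands = {1: {"cpu": 1.0, "mem": 4.0},   # user 1: <1 CPU, 4 GB> per task
           2: {"cpu": 3.0, "mem": 1.0}}   # user 2: <3 CPUs, 1 GB> per task

# Dominant share of one task = max fraction of any resource it consumes.
dom = {u: max(d[r] / capacity[r] for r in capacity) for u, d in demands.items()}

# User u schedules s / dom[u] tasks; each resource r must satisfy
# sum_u (s / dom[u]) * D_ur <= capacity[r], which bounds s.
s = min(capacity[r] / sum(demands[u][r] / dom[u] for u in demands)
        for r in capacity)

tasks = {u: s / dom[u] for u in demands}
print(s, tasks)  # s = 2/3; user 1 runs 3 tasks, user 2 runs 2 tasks
```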
The DRF allocation above is based on a simplified all-in-one resource model, where the entire system
is modeled as one big server. The allocation hence depends only on the total amount of resources pooled
in the system. In the example above, no matter how many servers the system has or what each server's
specification is, as long as the system has 9 CPUs and 18 GB memory in total, the DRF allocation will
always schedule three tasks for user 1 and two for user 2. However, this allocation may not be possible
to implement, especially when the system consists of heterogeneous servers. For example, suppose that
the resource pool is provided by two servers. Server 1 has 1 CPU and 14 GB memory, and server 2 has
8 CPUs and 4 GB memory. As shown in Fig. 2.1, even if both servers are allocated exclusively to user 1,
at most two tasks can be scheduled, one in each server. Moreover, even for server specifications
where the DRF allocation is feasible, the mechanism only gives the total amount of resources each user
should receive. It remains unclear how many resources a user should be allocated in each server. These
problems significantly limit the application of the DRF mechanism. In general, the allocation is valid
only when the system contains a single server or multiple homogeneous servers, which is rarely the case
under the prevalent datacenter infrastructure.
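The infeasibility claim for Fig. 2.1 is easy to check numerically. The snippet below (illustrative only) packs user 1's divisible tasks server by server and shows that at most two tasks fit, short of the three tasks prescribed by the all-in-one DRF allocation:

```python
# Quick check that the all-in-one DRF allocation (3 tasks for user 1) cannot
# be packed onto the two servers of Fig. 2.1 (illustrative only).

servers = [{"cpu": 1.0, "mem": 14.0}, {"cpu": 8.0, "mem": 4.0}]
task = {"cpu": 1.0, "mem": 4.0}  # user 1's per-task demand

# With divisible tasks, server l fits min over r of (c_lr / D_r) tasks.
fit = sum(min(s[r] / task[r] for r in task) for s in servers)
print(fit)  # 2.0 < 3: the DRF allocation is infeasible here
```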
Despite the limitation of the all-in-one resource model, DRF is shown to possess a set of highly
desirable allocation properties for cloud computing systems [12, 16]. A natural question is: how should
the DRF intuition be generalized to a heterogeneous environment to achieve similar properties? Note
that this is not an easy question to answer. In fact, with heterogeneous servers, even the definition of
dominant resource is not so clear. Depending on the server specifications, a resource most demanded in
one server (in terms of the fraction of the server’s availability) might be the least-demanded in another.
For instance, in the example of Fig. 2.1, user 1 demands CPU the most in server 1. But in server 2,
it demands memory the most. Should the dominant resource be defined separately for each server, or
should it be defined for the entire resource pool? How should the allocation be conducted? And what
properties do the resulting allocation preserve? We shall answer these questions in the following sections.
2.3 System Model and Allocation Properties
In this section, we model multi-resource allocation in a cloud computing system with heterogeneous
servers. We formalize a number of desirable properties that are deemed the most important for allocation
mechanisms in cloud computing environments.
2.3.1 Basic Settings
Let $S = \{1, \ldots, k\}$ be the set of heterogeneous servers a cloud computing system has in its resource pool. Let $R = \{1, \ldots, m\}$ be the set of $m$ hardware resources provided by each server, e.g., CPU, memory, storage, etc. Let $\mathbf{c}_l = (c_{l1}, \ldots, c_{lm})^T$ be the resource capacity vector of server $l \in S$, where each component $c_{lr}$ denotes the total amount of resource $r$ available in this server. Without loss of generality, we normalize the total availability of every resource to 1, i.e.,
$$\sum_{l \in S} c_{lr} = 1, \quad \forall r \in R.$$

Let $U = \{1, \ldots, n\}$ be the set of cloud users sharing the entire system. For every user $i$, let $\mathbf{D}_i = (D_{i1}, \ldots, D_{im})^T$ be its resource demand vector, where $D_{ir}$ is the amount of resource $r$ required by each instance of the task of user $i$. For simplicity, we assume positive demands, i.e., $D_{ir} > 0$ for all users $i$ and resources $r$. We say resource $r_i^*$ is the global dominant resource of user $i$ if
$$r_i^* \in \arg\max_{r \in R} D_{ir}.$$
In other words, resource $r_i^*$ is the most heavily demanded resource required by each instance of the task of user $i$, over the entire resource pool. For each user $i$ and resource $r$, we define
$$d_{ir} = D_{ir} / D_{i r_i^*}$$
as the normalized demand, and denote by $\mathbf{d}_i = (d_{i1}, \ldots, d_{im})^T$ the normalized demand vector of user $i$.
As a concrete example, consider Fig. 2.2, where the system consists of two heterogeneous servers. Server 1 is high-memory with 2 CPUs and 12 GB memory, while server 2 is high-CPU with 12 CPUs and 2 GB memory. Since the system has 14 CPUs and 14 GB memory in total, the normalized capacity vectors of servers 1 and 2 are $\mathbf{c}_1 = (\text{CPU share}, \text{memory share})^T = (1/7, 6/7)^T$ and $\mathbf{c}_2 = (6/7, 1/7)^T$, respectively. Now suppose that there are two users. User 1 has memory-intensive tasks each requiring 0.2 CPU time and 1 GB memory, while user 2 has CPU-heavy tasks each requiring 1 CPU time and 0.2 GB memory. The demand vector of user 1 is $\mathbf{D}_1 = (1/70, 1/14)^T$ and the normalized vector is $\mathbf{d}_1 = (1/5, 1)^T$, where memory is the global dominant resource. Similarly, user 2 has $\mathbf{D}_2 = (1/14, 1/70)^T$ and $\mathbf{d}_2 = (1, 1/5)^T$, and CPU is its global dominant resource.

Figure 2.2: An example of a system containing two heterogeneous servers shared by two users. Each computing task of user 1 requires 0.2 CPU time and 1 GB memory, while each computing task of user 2 requires 1 CPU time and 0.2 GB memory.
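These quantities are mechanical to compute. The following sketch (illustrative only) derives each user's global dominant resource and normalized demand vector for the Fig. 2.2 example:

```python
# Computing the normalized demand vectors of the Fig. 2.2 example
# (illustrative sketch; resources are normalized so system totals equal 1).

total = {"cpu": 14.0, "mem": 14.0}
raw = {1: {"cpu": 0.2, "mem": 1.0},   # user 1's per-task demand
       2: {"cpu": 1.0, "mem": 0.2}}   # user 2's per-task demand

for user, demand in raw.items():
    D = {r: demand[r] / total[r] for r in total}   # demand vector D_i
    r_star = max(D, key=D.get)                     # global dominant resource
    d = {r: D[r] / D[r_star] for r in D}           # normalized demand d_i
    print(user, r_star, d)
# user 1: mem dominant, d_1 = {'cpu': 0.2, 'mem': 1.0}
# user 2: cpu dominant, d_2 = {'cpu': 1.0, 'mem': 0.2}
```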
For now, we assume users have an infinite number of tasks to be scheduled, and all tasks are divisible
[12,14–16,41]. We shall discuss how these assumptions can be relaxed in Sec. 2.5.
2.3.2 Resource Allocation
For every user $i$ and server $l$, let $\mathbf{A}_{il} = (A_{il1}, \ldots, A_{ilm})^T$ be the resource allocation vector, where $A_{ilr}$ is the amount of resource $r$ allocated to user $i$ in server $l$. Let $\mathbf{A}_i = (\mathbf{A}_{i1}, \ldots, \mathbf{A}_{ik})$ be the allocation matrix of user $i$, and $\mathbf{A} = (\mathbf{A}_1, \ldots, \mathbf{A}_n)$ the overall allocation for all users. We say an allocation $\mathbf{A}$ is feasible if no server is required to supply more than its total amount of any resource, i.e.,
$$\sum_{i \in U} A_{ilr} \le c_{lr}, \quad \forall l \in S,\ r \in R.$$

For each user $i$, given allocation $\mathbf{A}_{il}$ in server $l$, let $N_{il}(\mathbf{A}_{il})$ be the maximum number of tasks (possibly fractional) it can schedule. We have
$$N_{il}(\mathbf{A}_{il}) D_{ir} \le A_{ilr}, \quad \forall r \in R.$$
As a result,
$$N_{il}(\mathbf{A}_{il}) = \min_{r \in R} \{A_{ilr} / D_{ir}\}.$$
The total number of tasks user $i$ can schedule under allocation $\mathbf{A}_i$ is hence
$$N_i(\mathbf{A}_i) = \sum_{l \in S} N_{il}(\mathbf{A}_{il}). \tag{2.1}$$
Intuitively, a user prefers an allocation that allows it to schedule more tasks.
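As an illustration of how $N_{il}$ and $N_i$ in Eq. (2.1) are evaluated, the sketch below (hypothetical helper functions) computes the task counts for user 1 of the Fig. 2.2 example under one possible allocation:

```python
# Evaluating N_il and N_i from Eq. (2.1) for a given allocation (sketch).

def tasks_in_server(alloc, demand):
    """N_il = min over r of A_ilr / D_ir: tasks user i can run in server l."""
    return min(alloc[r] / demand[r] for r in demand)

def total_tasks(allocs, demand):
    """N_i: sum of N_il over all servers."""
    return sum(tasks_in_server(a, demand) for a in allocs)

demand = {"cpu": 0.2, "mem": 1.0}     # user 1's per-task demand in Fig. 2.2
allocs = [{"cpu": 1.0, "mem": 5.0},   # example allocation in server 1
          {"cpu": 0.2, "mem": 1.0}]   # example allocation in server 2
print(total_tasks(allocs, demand))    # 5 + 1 = 6 tasks
```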
A well-justified allocation should never give a user more resources than it can actually use in a server.
Following the terminology used in the economics literature [42], we call such an allocation non-wasteful:
Definition 1. For user $i$ and server $l$, an allocation $\mathbf{A}_{il}$ is non-wasteful if reducing any resource decreases the number of tasks scheduled, i.e., for all $\mathbf{A}'_{il} \prec \mathbf{A}_{il}$,² we have
$$N_{il}(\mathbf{A}'_{il}) < N_{il}(\mathbf{A}_{il}).$$
Further, user $i$'s allocation $\mathbf{A}_i = (\mathbf{A}_{il})$ is non-wasteful if $\mathbf{A}_{il}$ is non-wasteful for all servers $l$, and allocation $\mathbf{A} = (\mathbf{A}_i)$ is non-wasteful if $\mathbf{A}_i$ is non-wasteful for all users $i$.

²For any two vectors $\mathbf{x}$ and $\mathbf{y}$, we say $\mathbf{x} \prec \mathbf{y}$ if $x_i \le y_i$ for all $i$ and for some $j$ we have strict inequality: $x_j < y_j$.
Note that one can always convert an allocation to non-wasteful by revoking those resources that are
allocated but have never been actually used, without changing the number of tasks scheduled for any
user. Unless otherwise specified, we limit the discussion to non-wasteful allocations.
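The revocation step just described admits a direct implementation. The sketch below (illustrative only) trims a per-server allocation down to exactly what its schedulable tasks consume:

```python
# Converting an allocation to a non-wasteful one (sketch): revoke whatever
# exceeds the exact needs of the N_il tasks actually schedulable.

def make_non_wasteful(alloc, demand):
    n = min(alloc[r] / demand[r] for r in demand)   # N_il under this allocation
    return {r: n * demand[r] for r in demand}       # keep only what tasks use

alloc = {"cpu": 1.0, "mem": 12.0}   # 12 GB granted, but CPU caps N_il at 5
demand = {"cpu": 0.2, "mem": 1.0}
print(make_non_wasteful(alloc, demand))  # {'cpu': 1.0, 'mem': 5.0}
```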
2.3.3 Allocation Mechanism and Desirable Properties
A resource allocation mechanism takes user demands as input and outputs the allocation result. In
general, an allocation mechanism should provide the following essential properties that are widely recog-
nized as the most important fairness and efficiency measures in both cloud computing systems [12,13,43]
and the economics literature [42,44].
Envy-freeness. An allocation mechanism is envy-free if no user prefers another user's allocation to its own, i.e.,
$$N_i(\mathbf{A}_i) \ge N_i(\mathbf{A}_j), \quad \forall i, j \in U.$$
This property essentially embodies the notion of fairness.
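As an illustration, envy-freeness can be checked directly from the definition; the sketch below (hypothetical helper functions, not part of the thesis prototype) tests whether any user could schedule strictly more tasks with another user's allocation:

```python
# Illustrative check of envy-freeness: user i envies user j if user j's
# allocation would let user i schedule strictly more tasks than its own.

def n_tasks(allocs, demand):
    """N_i: total tasks schedulable from a list of per-server allocations."""
    return sum(min(a[r] / demand[r] for r in demand) for a in allocs)

def is_envy_free(allocations, demands):
    """allocations[i]: user i's per-server allocation vectors; demands[i]: D_i."""
    return all(
        n_tasks(allocations[i], demands[i]) >= n_tasks(allocations[j], demands[i])
        for i in demands for j in demands)
```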
Pareto optimality. An allocation mechanism is Pareto optimal if it returns an allocation $\mathbf{A}$ such that for all feasible allocations $\mathbf{A}'$, if $N_i(\mathbf{A}'_i) > N_i(\mathbf{A}_i)$ for some user $i$, then there exists a user $j \ne i$ such that $N_j(\mathbf{A}'_j) < N_j(\mathbf{A}_j)$. In other words, allocation $\mathbf{A}$ cannot be further improved such that all users are at least as well off and at least one user is strictly better off. This property ensures allocation efficiency and is critical to achieving high resource utilization.
Group strategyproofness. An allocation mechanism is group strategyproof if, whenever a coalition of users misreport their resource demands (assuming a user's demand is its private information), there is a member of the coalition who would schedule fewer tasks and hence has no incentive to join the coalition. Specifically, let $M \subset U$ be the coalition of manipulators, in which each user $i \in M$ misreports its demand as $\mathbf{D}'_i \ne \mathbf{D}_i$. Let $\mathbf{A}'$ be the allocation returned. Also, let $\mathbf{A}$ be the allocation returned when all users truthfully report their demands. The allocation mechanism is group strategyproof if there exists a manipulator $i \in M$ who cannot schedule more tasks than by being truthful, i.e.,
$$N_i(\mathbf{A}'_i) \le N_i(\mathbf{A}_i).$$
In other words, user $i$ is better off quitting the coalition. Group strategyproofness is of special importance for a cloud computing system, as users in real-world systems commonly try to manipulate the scheduler for more allocations by lying about their resource demands [12, 43].
Sharing incentive is another critical property that has been frequently mentioned in the literature [12–14, 16]. It ensures that every user's allocation is no worse than the allocation obtained by evenly dividing the entire resource pool. While this property is well defined for a single server, it is not for a system containing multiple heterogeneous servers, as there is an infinite number of ways to evenly divide the resource pool among users, and it is unclear which one should be selected as the benchmark for comparison. We defer a detailed discussion to Sec. 2.4.5, where we consider two reasonable alternatives.
In addition to the four essential allocation properties above, we also consider four other important
properties as follows:
Single-server DRF. If the system contains only one server, then the resulting allocation should be
reduced to the DRF allocation.
Single-resource fairness. If there is a single resource in the system, then the resulting allocation
should be reduced to a max-min fair allocation.
Bottleneck fairness. If all users bottleneck on the same resource (i.e., having the same global
dominant resource), then the resulting allocation should be reduced to a max-min fair allocation for
that resource.
Population monotonicity. If a user leaves the system and relinquishes all its allocations, then the
remaining users will not see any reduction in the number of tasks scheduled.
To summarize, our objective is to design an allocation mechanism that guarantees all the properties defined above.
Figure 2.3: DRF allocation for the example shown in Fig. 2.2, where user 1 is allocated 5 tasks in server 1 and 1 task in server 2, while user 2 is allocated 1 task in server 1 and 5 tasks in server 2.
2.3.4 Naive DRF Extension and Its Inefficiency
It has been shown in [12, 16] that the DRF allocation satisfies all the desirable properties mentioned
above when the entire resource pool is modeled as one server. When resources are distributed to multiple
heterogeneous servers, a naive generalization is to separately apply the DRF allocation per server. For
instance, consider the example of Fig. 2.2. We first apply DRF in server 1. Because CPU is the dominant resource of both users there, it is equally divided between them, each receiving half of the server's CPU. As a result, user 1 schedules 5 tasks onto server 1, while user 2 schedules one. Similarly, in server 2, memory is the dominant resource of both users and is evenly allocated, leading to one task scheduled for user 1 and five for user 2. The resulting allocations in the two servers are illustrated in Fig. 2.3, where both users schedule 6 tasks.
Unfortunately, this allocation violates Pareto optimality and is highly inefficient. If we instead
allocate server 1 exclusively to user 1, and server 2 exclusively to user 2, then both users schedule 10
tasks, almost twice the number of tasks scheduled under the DRF allocation. In fact, a similar example
can be constructed to show that the per-server DRF may lead to arbitrarily low resource utilization.
The failure of the naive DRF extension to the heterogeneous environment necessitates an alternative
allocation mechanism, which is the main theme of the next section.
2.4 DRFH Allocation and Its Properties
In this section, we present DRFH, a generalization of DRF to a heterogeneous cloud computing system where resources are distributed across a number of heterogeneous servers. We analyze DRFH and show that
it provides all the desirable properties defined in Sec. 2.3.
2.4.1 DRFH Allocation
Instead of allocating separately in each server, DRFH jointly considers resource allocation across all
heterogeneous servers. The key intuition is to achieve the max-min fair allocation for the global dominant
resources. Specifically, given allocation A_il, let
\[ G_{il}(A_{il}) = N_{il}(A_{il})\, D_{i r_i^*} = \min_{r \in R} \{ A_{ilr} / d_{ir} \} \tag{2.2} \]
be the amount of global dominant resource user i is allocated in server l. Since the total availability of every resource is normalized to 1, we also refer to G_il(A_il) as the global dominant share user i receives in server l. Simply adding up G_il(A_il) over all servers gives the global dominant share user i receives under allocation A_i, i.e.,
\[ G_i(A_i) = \sum_{l \in S} G_{il}(A_{il}) = \sum_{l \in S} \min_{r \in R} \{ A_{ilr} / d_{ir} \}. \tag{2.3} \]
DRFH simply applies max-min fairness to the users' global dominant shares: it seeks an allocation that maximizes the minimum global dominant share among all users, subject to the per-server resource constraints, i.e.,
\[
\max_{A} \; \min_{i \in U} \; G_i(A_i)
\quad \text{s.t.} \quad \sum_{i \in U} A_{ilr} \le c_{lr}, \;\; \forall l \in S,\, r \in R. \tag{2.4}
\]
Recall that without loss of generality, we assume non-wasteful allocation A (see Sec. 2.3.2). We have
the following structural result.
Lemma 1. For user i and server l, an allocation A_il is non-wasteful if and only if there exists some g_il such that
\[ A_{il} = g_{il}\, d_i. \]
In particular, g_il is the global dominant share user i receives in server l under allocation A_il, i.e.,
\[ g_{il} = G_{il}(A_{il}). \]
Proof: (⇐) We first show that if A_il = g_il d_i for some g_il, then A_il is non-wasteful. For all resources r ∈ R, we have
\[ A_{ilr}/D_{ir} = g_{il}\, d_{ir}/D_{ir} = g_{il}/D_{i r_i^*}. \]
As a result,
\[ N_{il}(A_{il}) = \min_{r \in R} \{ A_{ilr}/D_{ir} \} = g_{il}/D_{i r_i^*}. \]
Now for any A′_il ≺ A_il, suppose that A′_{il r_0} < A_{il r_0} for some resource r_0. We have
\[ N_{il}(A'_{il}) = \min_{r \in R} \{ A'_{ilr}/D_{ir} \} \le A'_{il r_0}/D_{i r_0} < A_{il r_0}/D_{i r_0} = N_{il}(A_{il}). \]
Hence, by definition, allocation A_il is non-wasteful.
(⇒) We next show that if A_il is non-wasteful, then A_il = g_il d_i for some g_il. Since A_il is non-wasteful, for any two resources r_1, r_2 ∈ R, we must have
\[ A_{il r_1}/D_{i r_1} = A_{il r_2}/D_{i r_2}. \]
Otherwise, without loss of generality, suppose that A_{il r_1}/D_{i r_1} > A_{il r_2}/D_{i r_2}. There must exist some ε > 0 such that
\[ (A_{il r_1} - \varepsilon)/D_{i r_1} > A_{il r_2}/D_{i r_2}. \]
Now construct an allocation A′_il such that
\[ A'_{ilr} = \begin{cases} A_{il r_1} - \varepsilon, & r = r_1; \\ A_{ilr}, & \text{otherwise}. \end{cases} \tag{2.5} \]
Clearly, A′_il ≺ A_il. However, it is easy to see that
\[ N_{il}(A'_{il}) = \min_{r \in R} \{ A'_{ilr}/D_{ir} \} = \min_{r \ne r_1} \{ A'_{ilr}/D_{ir} \} = \min_{r \ne r_1} \{ A_{ilr}/D_{ir} \} = \min_{r \in R} \{ A_{ilr}/D_{ir} \} = N_{il}(A_{il}), \]
which contradicts the fact that A_il is non-wasteful. As a result, there exists some n_il such that for all resources r ∈ R, we have
\[ A_{ilr} = n_{il}\, D_{ir} = n_{il}\, D_{i r_i^*}\, d_{ir}. \]
Now letting g_il = n_il D_{i r_i^*}, we see A_il = g_il d_i. ∎
Intuitively, Lemma 1 indicates that under a non-wasteful allocation, resources are allocated in pro-
portion to the user’s demand. Lemma 1 immediately suggests the following relationship for every user i
and its non-wasteful allocation Ai:
\[ G_i(A_i) = \sum_{l \in S} G_{il}(A_{il}) = \sum_{l \in S} g_{il}. \]
Problem (2.4) can hence be equivalently written as
\[
\max_{\{g_{il}\}} \; \min_{i \in U} \; \sum_{l \in S} g_{il}
\quad \text{s.t.} \quad \sum_{i \in U} g_{il}\, d_{ir} \le c_{lr}, \;\; \forall l \in S,\, r \in R, \tag{2.6}
\]
where the constraints are derived from Lemma 1. Now let g = \min_{i} \sum_{l \in S} g_{il}. Via straightforward algebraic operations, we see that (2.6) is equivalent to the following problem:
\[
\max_{\{g_{il}\},\, g} \; g
\quad \text{s.t.} \quad \sum_{i \in U} g_{il}\, d_{ir} \le c_{lr}, \;\; \forall l \in S,\, r \in R, \qquad \sum_{l \in S} g_{il} = g, \;\; \forall i \in U. \tag{2.7}
\]
Note that the second constraint embodies the fairness in terms of an equalized global dominant share g.
Figure 2.4: An alternative allocation with higher system utilization for the example of Fig. 2.2. Servers 1 and 2 are exclusively assigned to users 1 and 2, respectively. Both users schedule 10 tasks.
By solving (2.7), DRFH allocates each user the maximum global dominant share g, under the constraints of both server capacity and fairness. The allocation received by each user i in server l is simply A_il = g_il d_i.
For example, Fig. 2.4 illustrates the resulting DRFH allocation in the example of Fig. 2.2. By solving
(2.7), DRFH allocates server 1 exclusively to user 1 and server 2 exclusively to user 2, allowing each user
to schedule 10 tasks with the maximum global dominant share g = 5/7.
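To make the optimization concrete, the following minimal sketch (not the thesis prototype) solves (2.7) as a linear program with SciPy; the two-user, two-server demand and capacity numbers are illustrative placeholders rather than the values of Fig. 2.2:

    import numpy as np
    from scipy.optimize import linprog

    d = np.array([[1.0, 0.5],    # normalized per-task demands d_i (CPU, memory),
                  [0.4, 1.0]])   # one row per user; the dominant entry equals 1
    c = np.array([[0.3, 0.2],    # server capacities c_l as fractions of the totals
                  [0.7, 0.8]])
    n, k, m = d.shape[0], c.shape[0], d.shape[1]

    # Decision variables: g_il for every (user, server) pair, followed by g.
    nvar = n * k + 1
    cost = np.zeros(nvar); cost[-1] = -1.0        # maximize g

    # Capacity constraints: sum_i g_il * d_ir <= c_lr for every server l, resource r.
    A_ub, b_ub = [], []
    for l in range(k):
        for r in range(m):
            row = np.zeros(nvar)
            for i in range(n):
                row[i * k + l] = d[i, r]
            A_ub.append(row); b_ub.append(c[l, r])

    # Fairness constraints: sum_l g_il - g = 0 for every user i.
    A_eq, b_eq = [], []
    for i in range(n):
        row = np.zeros(nvar)
        row[i * k: i * k + k] = 1.0
        row[-1] = -1.0
        A_eq.append(row); b_eq.append(0.0)

    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
    g = res.x[-1]
    g_il = res.x[:-1].reshape(n, k)
    # Each user i is then allocated A_il = g_il[i, l] * d[i] in server l.

The default non-negative variable bounds of linprog match the implicit requirement g_il ≥ 0.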
We next analyze the properties of DRFH allocation obtained by solving (2.7). Our analyses of DRFH
start with the four essential resource allocation properties, namely, envy-freeness, Pareto optimality,
group strategyproofness, and sharing incentive.
2.4.2 Envy-Freeness
We first show by the following proposition that under the DRFH allocation, no user prefers other’s
allocation to its own.
Proposition 1 (Envy-freeness). The DRFH allocation obtained by solving (2.7) is envy-free.
Proof: Let {g_il}, g be the solution to problem (2.7). For each user i, its DRFH allocation in server l is A_il = g_il d_i. To show N_i(A_j) ≤ N_i(A_i) for any two users i and j, it is equivalent to prove G_i(A_j) ≤ G_i(A_i). We have
\[ G_i(A_j) = \sum_{l} G_{il}(A_{jl}) = \sum_{l} \min_{r} \{ g_{jl}\, d_{jr}/d_{ir} \} \le \sum_{l} g_{jl} = g = G_i(A_i), \]
where the inequality holds because
\[ \min_{r} \{ d_{jr}/d_{ir} \} \le d_{j r_i^*}/d_{i r_i^*} = d_{j r_i^*} \le 1, \]
where r_i^* is user i's global dominant resource. ∎
2.4.3 Pareto Optimality
We next show that DRFH leads to an efficient allocation under which no user can improve its allocation
without decreasing that of the others.
Proposition 2 (Pareto optimality). The DRFH allocation obtained by solving (2.7) is Pareto optimal.
Proof: Let {g_il}, g be the solution to problem (2.7). For each user i, its DRFH allocation in server l is A_il = g_il d_i. Since (2.6) and (2.7) are equivalent, {g_il} is also a solution to (2.6), and g is the maximum value of the objective of (2.6).
Assume, by way of contradiction, that allocation A is not Pareto optimal, i.e., there exists some allocation A′ such that N_i(A′_i) ≥ N_i(A_i) for all users i, and for some user j we have strict inequality: N_j(A′_j) > N_j(A_j). Equivalently, this implies G_i(A′_i) ≥ G_i(A_i) for all users i, and G_j(A′_j) > G_j(A_j) for user j. Without loss of generality, let A′ be non-wasteful. By Lemma 1, for all users i and servers l, there exists some g′_il such that A′_il = g′_il d_i. We show that based on {g′_il}, one can construct some {ĝ_il} that is a feasible solution to (2.6), yet leads to a higher objective than g, contradicting the fact that {g_il}, g optimally solve (2.6).
To see this, consider user j. We have
\[ G_j(A_j) = \sum_{l} g_{jl} = g < G_j(A'_j) = \sum_{l} g'_{jl}. \]
For user j, there exists a server l_0 and some ε > 0 such that after reducing g′_{j l_0} to g′_{j l_0} − ε, the resulting global dominant share remains at least g, i.e., \sum_l g'_{jl} − ε ≥ g. This leaves at least ε d_j idle resources in server l_0. We construct {ĝ_il} by redistributing these idle resources to all users to increase their global dominant shares, therefore strictly improving the objective of (2.6).
Denote by {g′′_il} the dominant shares after reducing g′_{j l_0} to g′_{j l_0} − ε, i.e.,
\[ g''_{il} = \begin{cases} g'_{j l_0} - \varepsilon, & i = j,\ l = l_0; \\ g'_{il}, & \text{otherwise}. \end{cases} \]
The corresponding non-wasteful allocation is A′′_il = g′′_il d_i for all users i and servers l. Note that allocation A′′ is preferred to the original allocation A by all users, i.e., for all users i, we have
\[ G_i(A''_i) = \sum_{l} g''_{il} = \begin{cases} \sum_{l} g'_{jl} - \varepsilon \ge g = G_j(A_j), & i = j; \\ \sum_{l} g'_{il} = G_i(A'_i) \ge G_i(A_i), & \text{otherwise}. \end{cases} \]
We now construct {ĝ_il} by redistributing the ε d_j idle resources in server l_0 to all users, each increasing its global dominant share g′′_{i l_0} by δ = \min_r \{ \varepsilon d_{jr} / \sum_i d_{ir} \}, i.e.,
\[ \hat{g}_{il} = \begin{cases} g''_{i l_0} + \delta, & l = l_0; \\ g''_{il}, & \text{otherwise}. \end{cases} \]
It is easy to check that {ĝ_il} remains a feasible allocation. To see this, it suffices to check server l_0. For each of its resources r, we have
\[ \sum_{i} \hat{g}_{i l_0} d_{ir} = \sum_{i} (g''_{i l_0} + \delta) d_{ir} = \sum_{i} g'_{i l_0} d_{ir} - \varepsilon d_{jr} + \delta \sum_{i} d_{ir} \le c_{l_0 r} - \Big( \varepsilon d_{jr} - \delta \sum_{i} d_{ir} \Big) \le c_{l_0 r}, \]
where the first inequality holds because A′ is a feasible allocation, and the second holds by the choice of δ. On the other hand, for all users i ∈ U, we have
\[ \sum_{l} \hat{g}_{il} = \sum_{l} g''_{il} + \delta = G_i(A''_i) + \delta \ge G_i(A_i) + \delta > g. \]
This contradicts the premise that g is optimal for (2.6). ∎
2.4.4 Group Strategyproofness
So far, all our discussions have been based on a critical assumption that all users truthfully report their resource demands. However, in a real-world system, it is common to observe users attempting to manipulate the scheduler by misreporting their resource demands, so as to receive more allocation [12,43]. More often than not, these strategic behaviours would significantly hurt honest users and reduce the number of their tasks scheduled, inevitably leading to an inefficient allocation outcome. Fortunately, we show by the following proposition that DRFH is immune to these strategic behaviours, as reporting
the true demand is always the dominant strategy for all users, even if they form a coalition to misreport
together with others.
Proposition 3 (Group strategyproofness). The DRFH allocation obtained by solving (2.7) is group
strategyproof in that the coalition behavior of misreporting demands cannot strictly benefit every member.
Proof: Let M ⊂ U be the set of strategic users forming a coalition to misreport their normalized demand vectors d′_M = (d′_i)_{i∈M}, where d′_i ≠ d_i for all i ∈ M. Let d′ be the collection of normalized demand vectors submitted by all users, where d′_i = d_i for all i ∈ U\M. Let A′ be the resulting allocation obtained by solving (2.7) with demands d′. In particular, A′_il = g′_il d′_i for each user i and server l, and g′ = ∑_l g′_il, where {g′_il}, g′ solve (2.7). On the other hand, let A be the allocation returned when all users truthfully report their demands, and {g_il}, g the solution to (2.7) with the truthful d. Similarly, for each user i and server l, we have A_il = g_il d_i, and g = ∑_l g_il. We check the following two cases and show that there exists a user i ∈ M such that G_i(A′_i) ≤ G_i(A_i), which is equivalent to N_i(A′_i) ≤ N_i(A_i).
Case 1: g′ ≤ g. In this case, for each user i ∈ M, let ρ_i = \min_r \{ d'_{ir}/d_{ir} \}. Clearly,
\[ \rho_i = \min_{r} \{ d'_{ir}/d_{ir} \} \le d'_{i r_i^*}/d_{i r_i^*} = d'_{i r_i^*} \le 1, \]
where r_i^* is the dominant resource of user i. We then have
\[ G_i(A'_i) = \sum_{l} G_{il}(A'_{il}) = \sum_{l} G_{il}(g'_{il}\, d'_i) = \sum_{l} \min_{r} \{ g'_{il}\, d'_{ir}/d_{ir} \} = \rho_i\, g' \le g = G_i(A_i). \]
Case 2: g′ > g. We first consider users that are not manipulators. Since they truthfully report their demands, we have
\[ G_j(A'_j) = g' > g = G_j(A_j), \quad \forall j \in U \setminus M. \tag{2.8} \]
Now for the manipulators, there must be a user i ∈ M such that G_i(A′_i) < G_i(A_i). Otherwise, allocation A′ would be weakly preferred to allocation A by all users and, by (2.8), strictly preferred by those in U \ M, which contradicts the facts that A is a Pareto optimal allocation and A′ is a feasible allocation. ∎
2.4.5 Sharing Incentive
In addition to the aforementioned three properties, sharing incentive is another critical allocation prop-
erty that has been frequently mentioned in the literature, e.g., [12–14,16,43]. The property ensures that
every user can schedule at least the number of tasks it would schedule if the entire resource pool were evenly partitioned among all users. The property thus provides service isolation among the users.
While the sharing incentive property is well defined in the all-in-one resource model, it is not for
the system with multiple heterogeneous servers. In the former case, since the entire resource pool is
abstracted as a single server, evenly dividing every resource of this big server would lead to a unique
allocation. However, when the system consists of multiple heterogeneous servers, there are many different
ways to evenly divide these servers, and it is unclear which one should be used as a benchmark for
comparison. For instance, in the example of Fig. 2.2, two users share a system with 14 CPUs and 14 GB
memory in total. The following two allocations both allocate each user 7 CPUs and 7 GB memory: (a)
User 1 is allocated 1/2 resources of server 1 and 1/2 resources of server 2, while user 2 is allocated the
rest; (b) user 1 is allocated (1.5 CPUs, 5.5 GB) in server 1 and (5.5 CPUs, 1.5 GB) in server 2, while
user 2 is allocated the rest. It is easy to verify that the two allocations lead to different numbers of tasks
scheduled for the same user, and can be used as two different allocation benchmarks. In fact, one can
construct many other allocations that evenly divide all resources among the users.
Despite the general ambiguity explained above, in the next two subsections, we consider two defini-
tions of the sharing incentive property, strong and weak, depending on the choice of the benchmark for
equal partitioning of resources.
Strong Sharing Incentive
Among various allocations that evenly divide all servers, perhaps the most straightforward approach is
to evenly partition each server’s availability cl among all n users. The strong sharing incentive property
is defined by using this per-server partitioning as a benchmark.
Definition 2 (Strong sharing incentive). Allocation A satisfies the strong sharing incentive property if each user schedules at least as many tasks under A as it would by evenly partitioning each server among all users, i.e.,
\[ N_i(A_i) = \sum_{l \in S} N_{il}(A_{il}) \ge \sum_{l \in S} N_{il}(c_l/n), \quad \forall i \in U. \]
Before we proceed, it is worth mentioning that the per-server partitioning above cannot be directly
implemented in practice. With a large number of users, in each server, everyone will be allocated a very
small fraction of the server’s availability. In practice, such a small slice of resources usually cannot be
used to run any computing task. However, per-server partitioning may be interpreted as follows. Since a cloud system is constructed by pooling hundreds of thousands of servers [2, 5], the number of users is typically far smaller than the number of servers [12,43], i.e., k ≫ n. An equal partition could randomly allocate k/n servers to each user, which is equivalent to randomly allocating each server to a user with probability 1/n. It is easy to see that the mean number of tasks scheduled for each user under this random allocation is ∑_{l∈S} N_{il}(c_l/n), the same as that obtained under the per-server partitioning.
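The equivalence between the random assignment and the per-server partitioning follows from the fact that N_il(·) scales linearly with its argument:
\[
\mathbb{E}[\text{tasks of user } i] = \sum_{l \in S} \frac{1}{n}\, N_{il}(c_l)
= \sum_{l \in S} \frac{1}{n} \min_{r \in R} \frac{c_{lr}}{D_{ir}}
= \sum_{l \in S} \min_{r \in R} \frac{c_{lr}/n}{D_{ir}}
= \sum_{l \in S} N_{il}(c_l/n).
\]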
Unfortunately, the following proposition shows that DRFH may violate the sharing incentive property
in the strong sense. The proof gives a counterexample.
Proposition 4. DRFH does not satisfy the property of strong sharing incentive.
Proof: Consider a system consisting of two servers. Server 1 has 1 CPU and 2 GB memory; server
2 has 4 CPUs and 3 GB memory. There are two users. Each instance of the task of user 1 demands
1 CPU and 1 GB memory; each of user 2’s tasks demands 3 CPUs and 2 GB memory. In this case,
we have c1 = (1/5, 2/5)T , c2 = (4/5, 3/5)T , D1 = (1/5, 1/5)T , D2 = (3/5, 2/5)T , d1 = (1, 1)T , and
d2 = (1, 2/3)T . It is easy to verify that under DRFH, the global dominant share both users receive
is 12/25. On the other hand, under the per-server partitioning, the global dominant share that user 2
receives is 1/2, higher than that received under DRFH. ∎
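For concreteness, the per-server benchmark figure quoted for user 2 can be checked directly from the numbers above: user 2 receives c_1/2 = (1/10, 1/5)^T in server 1 and c_2/2 = (2/5, 3/10)^T in server 2, so it schedules
\[
N_2 = \min\!\Big( \frac{1/10}{3/5}, \frac{1/5}{2/5} \Big) + \min\!\Big( \frac{2/5}{3/5}, \frac{3/10}{2/5} \Big) = \frac{1}{6} + \frac{2}{3} = \frac{5}{6}
\]
tasks, for a global dominant share of N_2 D_{2,\mathrm{CPU}} = (5/6)(3/5) = 1/2.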
While DRFH may violate the strong sharing incentive property, we shall show via trace-driven
simulations in Sec. 2.6 that this only happens in rare cases.
Weak Sharing Incentive
The strong sharing incentive property is defined by choosing the per-server partitioning as a benchmark,
which is only one of many different ways to evenly divide the total availability. In general, any equal
partition that allocates an equal share of every resource can be used as a benchmark. This allows us to
relax the sharing incentive definition. We first define an equal partition as follows.
Definition 3 (Equal partition). Allocation A is an equal partition if it divides every resource evenly among all users, i.e.,
\[ \sum_{l \in S} A_{ilr} = 1/n, \quad \forall r \in R,\ i \in U. \]
It is easy to verify that the aforementioned per-server partition is an equal partition. We are now
ready to define the weak sharing incentive property as follows.
Definition 4 (Weak sharing incentive). Allocation A satisfies the weak sharing incentive property if there exists an equal partition A′ under which each user schedules no more tasks than it does under A, i.e.,
\[ N_i(A_i) \ge N_i(A'_i), \quad \forall i \in U. \]
In other words, the weak sharing incentive property only requires the allocation to be no worse than some equal partition, without specifying its specific form. It is hence a more relaxed requirement than the strong sharing incentive property.
The following proposition shows that DRFH satisfies the sharing incentive property in the weak
sense. The proof is constructive.
Proposition 5 (Weak sharing incentive). DRFH satisfies the property of weak sharing incentive.
Proof: Let g be the global dominant share each user receives under a DRFH allocation A, and g_il the global dominant share user i receives in server l. We construct an equal partition A′ under which each user schedules no more tasks than it does under A.
Case 1: g ≥ 1/n. In this case, let A′ be any equal partition. We show that each user schedules no more tasks under A′ than under A. To see this, consider the DRFH allocation A. Since it is non-wasteful, the number of tasks user i schedules is
\[ N_i(A_i) = g/D_{i r_i^*} \ge 1/(n D_{i r_i^*}). \]
On the other hand, the number of tasks user i schedules under A′ is at most
\[ N_i(A'_i) = \sum_{l \in S} \min_{r} \{ A'_{ilr}/D_{ir} \} \le \sum_{l \in S} A'_{il r_i^*}/D_{i r_i^*} = 1/(n D_{i r_i^*}) \le N_i(A_i). \]
Case 2: g < 1/n. In this case, no resource has been fully allocated under A, i.e.,
\[ \sum_{i \in U} \sum_{l \in S} A_{ilr} = \sum_{i \in U} \sum_{l \in S} g_{il}\, d_{ir} \le \sum_{i \in U} \sum_{l \in S} g_{il} = \sum_{i \in U} g = ng < 1 \]
for all resources r ∈ R. Let
\[ L_{lr} = c_{lr} - \sum_{i \in U} A_{ilr} \]
be the amount of resource r left unallocated in server l. Further, let
\[ L_r = \sum_{l \in S} L_{lr} = 1 - \sum_{i \in U} \sum_{l \in S} A_{ilr} \]
be the total amount of resource r left unallocated.
We are now ready to construct an equal partition A′ based on A. Since A′ should allocate each user 1/n of the total availability of every resource r, the additional amount of resource r that user i needs to obtain is
\[ u_{ir} = 1/n - \sum_{l \in S} A_{ilr}. \]
It is easy to see that u_ir > 0 for all i ∈ U, r ∈ R. The fraction of the unallocated resource r demanded by user i is
\[ f_{ir} = u_{ir}/L_r. \]
As a result, we can construct A′ by reallocating the leftover resources in each server to users, in proportion to the additional shares they need, i.e.,
\[ A'_{ilr} = A_{ilr} + L_{lr}\, f_{ir}, \quad \forall i \in U,\ l \in S,\ r \in R. \]
It is easy to verify that A′ is an equal partition, i.e.,
\[ \sum_{l \in S} A'_{ilr} = \sum_{l \in S} L_{lr}\, f_{ir} + \sum_{l \in S} A_{ilr} = (u_{ir}/L_r) \sum_{l \in S} L_{lr} + \sum_{l \in S} A_{ilr} = u_{ir} + \sum_{l \in S} A_{ilr} = 1/n, \quad \forall i \in U,\ r \in R. \]
We now compare the number of tasks scheduled for each user under both allocations A and A′.
Because A′ allocates more resources to each user than A does, we have Ni(A′i) ≥ Ni(Ai) for all i.
On the other hand, by the Pareto optimality of allocation A, no user can schedule more tasks without
decreasing the number of tasks scheduled for others. Therefore, we must have Ni(A′i) = Ni(Ai) for all
i. ∎
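For concreteness, the equal-partition construction used in Case 2 amounts to a few array operations (an illustrative numpy sketch, assuming allocations and capacities are expressed as fractions of each resource's total availability; it is not part of the proof):

    import numpy as np

    def equal_partition_from(A, c):
        # A: n x k x m allocation (user, server, resource); c: k x m capacities,
        # with each resource's capacities summing to 1 across servers.
        n = A.shape[0]
        L_lr = c - A.sum(axis=0)          # leftover of each resource in each server
        L_r = L_lr.sum(axis=0)            # total leftover of each resource
        u_ir = 1.0 / n - A.sum(axis=1)    # extra share of r that user i still needs
        f_ir = u_ir / L_r                 # user i's fraction of the leftover of r
        return A + L_lr[None, :, :] * f_ir[:, None, :]   # A'_ilr = A_ilr + L_lr * f_ir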
Discussion
Strong sharing incentive provides more predictable service isolation than weak sharing incentive does.
It assures a user a priori that it can schedule at least the number of tasks it would schedule when every server is evenly allocated among all users. This gives users a concrete idea of the worst Quality of Service (QoS) they may receive,
allowing them to accurately predict their computing performance. While weak sharing incentive also
provides some degree of service isolation, a user cannot infer the guaranteed number of tasks it can
schedule a priori from this weaker property, and therefore cannot predict the computing performance.
We note that the root cause of such degradation of service isolation is the heterogeneity among servers. When all servers have the same hardware specification, DRFH reduces to DRF, and strong sharing incentive is guaranteed. This is also the case for schedulers adopting the single-resource abstraction. For example, in Hadoop, each server is divided into several slots (e.g., map and reduce slots). Hadoop Fair Scheduler [45] allocates these slots evenly to all users. We see that predictable
service isolation is achieved: each user receives at least ks/n slots, where ks is the number of slots, and
n is the number of users.
In general, one can view weak sharing incentive as the price paid by DRFH to achieve high resource
utilization. In fact, naively applying DRF allocation separately to each server retains strong sharing
incentive: in each server, the DRF allocation ensures that a user can schedule at least the number of tasks it would schedule if the server's resources were evenly allocated [12, 16]. However, as we have seen in Sec. 2.3.4, such a naive DRF extension may lead to unacceptably low resource utilization. A similar problem also exists for traditional schedulers adopting single-resource abstractions. By artificially dividing servers
into slots, these schedulers cannot match computing demands to available resources at a fine granularity,
resulting in poor resource utilization in practice [12]. For these reasons, we believe that slightly trading
off the degree of service isolation for much higher resource utilization is well justified. We shall use
trace-driven simulation to show in Sec. 2.6.2 that DRFH only violates strong sharing incentive in rare
cases in the Google cluster.
2.4.6 Other Important Properties
In addition to the four essential properties shown in the previous subsection, DRFH also provides a num-
ber of other important properties. First, since DRFH generalizes DRF to heterogeneous environments,
it naturally reduces to the DRF allocation when there is only one server contained in the system, where
the global dominant resource defined in DRFH is exactly the same as the dominant resource defined in
DRF.
Proposition 6 (Single-server DRF). DRFH leads to the same allocation as DRF when all resources
are concentrated in one server.
Next, by definition, we see that both single-resource fairness and bottleneck fairness trivially hold
for the DRFH allocation. We hence omit the proofs of the following two propositions.
Proposition 7 (Single-resource fairness). The DRFH allocation satisfies single-resource fairness.
Proposition 8 (Bottleneck fairness). The DRFH allocation satisfies bottleneck fairness.
Finally, we see that when a user leaves the system and relinquishes all its allocations, the remaining
users will not see any reduction of the number of tasks scheduled. Formally,
Proposition 9 (Population monotonicity). The DRFH allocation satisfies population monotonicity.
Proof: Let A be the resulting DRFH allocation. Then for all users i and servers l, A_il = g_il d_i and G_i(A_i) = g, where {g_il} and g solve (2.7). Suppose that user j leaves the system, changing the resulting DRFH allocation to A′. By DRFH, for all users i ≠ j and servers l, we have A′_il = g′_il d_i and G_i(A′_i) = g′, where {g′_il}_{i≠j} and g′ solve the following optimization problem:
\[
\max_{\{g'_{il}\}_{i \ne j},\, g'} \; g'
\quad \text{s.t.} \quad \sum_{i \ne j} g'_{il}\, d_{ir} \le c_{lr}, \;\; \forall l \in S,\, r \in R, \qquad \sum_{l \in S} g'_{il} = g', \;\; \forall i \ne j. \tag{2.9}
\]
To show N_i(A′_i) ≥ N_i(A_i) for all users i ≠ j, it is equivalent to prove G_i(A′_i) ≥ G_i(A_i). It is easy to verify that g, {g_il}_{i≠j} satisfy all the constraints of (2.9) and are hence feasible for (2.9). As a result, g′ ≥ g, which is exactly G_i(A′_i) ≥ G_i(A_i). ∎
2.5 Practical Considerations
So far, all our discussions have been based on several assumptions that may not hold in a real-world system. In this section, we relax these assumptions and discuss how DRFH can be implemented in practice.
2.5.1 Weighted Users with a Finite Number of Tasks
In the previous sections, users are assumed to be assigned equal weights and have infinite computing
demands. Both assumptions can be easily removed with some minor modifications of DRFH.
When users are assigned uneven weights, let w_i be the weight associated with user i. DRFH seeks an allocation that achieves weighted max-min fairness across users. Specifically, we maximize the minimum weight-normalized global dominant share of all users under the same resource constraints as in (2.4), i.e.,
\[
\max_{A} \; \min_{i \in U} \; G_i(A_i)/w_i
\quad \text{s.t.} \quad \sum_{i \in U} A_{ilr} \le c_{lr}, \;\; \forall l \in S,\, r \in R.
\]
When users have a finite number of tasks, the DRFH allocation is computed iteratively. In each
round, DRFH increases the global dominant share allocated to all active users, until one of them has
all its tasks scheduled, after which the user becomes inactive and will no longer be considered in the
following allocation rounds. DRFH then starts a new iteration and repeats the allocation process above,
until no user is active or no more resources could be allocated to users. Because each iteration saturates
at least one user’s resource demand, the allocation will be accomplished in at most n rounds, where n
is the number of users.³ Our analysis presented in Sec. 2.4 also extends to weighted users with a finite
number of tasks.
2.5.2 Scheduling Tasks as Entities
Until now, we have assumed that all tasks are divisible. In a real-world system, however, fractional
tasks may not be accepted. To schedule tasks as entities, one can apply progressive filling as a simple
implementation of DRFH.⁴ That is, whenever there is a scheduling opportunity, the scheduler always
accommodates the user with the lowest global dominant share. To do this, it picks the first server whose
remaining resources are sufficient to accommodate the request of the user’s task. While this First-Fit
algorithm offers a fairly good approximation to DRFH, we propose another simple heuristic that can
lead to a better allocation with higher resource utilization.
Similar to First-Fit, the heuristic also chooses user i with the lowest global dominant share to serve.
However, instead of simply picking the first server that fits, the heuristic chooses the “best” one whose remaining resources most suitably match the demand of user i’s tasks, and is hence referred to as the Best-Fit
DRFH. Specifically, for user i with resource demand vector Di = (Di1, . . . , Dim)T and a server l with
available resource vector cl = (cl1, . . . , clm)T , where clr is the share of resource r remaining available in
server l, we define the following heuristic function to quantitatively measure the fitness of the task for server l:
\[ H(i, l) = \| D_i/D_{i1} - c_l/c_{l1} \|_1, \tag{2.10} \]
where ‖·‖₁ is the L1-norm. Intuitively, the smaller H(i, l) is, the more similar the resource demand vector D_i appears to the server's available resource vector c_l, and the better fit user i's task is for server l. Best-Fit DRFH schedules user i's tasks onto the server l with the least H(i, l).
³For medium- and large-sized cloud clusters, n is on the order of thousands [2, 3].
⁴Progressive filling has also been used to implement the DRF allocation [12]. However, when the system consists of multiple heterogeneous servers, progressive filling will lead to a DRFH allocation.
As an illustrative example, suppose that only two types of resources are concerned, CPU and memory.
A CPU-heavy task of user i with resource demand vector Di = (1/10, 1/30)T is to be scheduled, meaning
that the task requires 1/10 of the total CPU availability and 1/30 of the total memory availability of the
system. Only two servers have sufficient remaining resources to accommodate this task. Server 1 has the
available resource vector c1 = (1/5, 1/15)T ; Server 2 has the available resource vector c2 = (1/8, 1/4)T .
Intuitively, because the task is CPU-bound, it is more fit for Server 1, which is CPU-abundant. This is
indeed the case as H(i, 1) = 0 < H(i, 2) = 5/3, and Best-Fit DRFH places the task onto Server 1.
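A minimal sketch of how these pieces fit together (an assumed illustration, not the Mesos prototype of Sec. 2.6.1): the scheduler repeatedly serves the user with the least global dominant share and places its next task on the feasible server minimizing H(i, l). Demands and capacities are assumed strictly positive and expressed as fractions of the total availability of each resource.

    import numpy as np

    def best_fit_server(demand, remaining):
        # Return the index of the feasible server minimizing H(i, l) in (2.10).
        best_l, best_h = None, float("inf")
        for l, c_l in enumerate(remaining):
            if np.any(c_l < demand):                  # cannot fit one more task
                continue
            h = np.sum(np.abs(demand / demand[0] - c_l / c_l[0]))
            if h < best_h:
                best_l, best_h = l, h
        return best_l

    def progressive_fill(D, capacities):
        # D: n x m per-task demands; capacities: k x m server capacities.
        remaining = capacities.astype(float).copy()
        dominant = D.max(axis=1)                      # D_{i r_i*} for each user
        share = np.zeros(len(D))                      # global dominant shares
        placements = [[] for _ in D]
        active = set(range(len(D)))
        while active:
            i = min(active, key=lambda u: share[u])   # most "unfairly" treated user
            l = best_fit_server(D[i], remaining)
            if l is None:                             # no server fits another task
                active.remove(i)
                continue
            remaining[l] -= D[i]
            share[i] += dominant[i]
            placements[i].append(l)
        return placements

    # For the CPU-heavy task above, best_fit_server(np.array([1/10, 1/30]),
    # np.array([[1/5, 1/15], [1/8, 1/4]])) returns 0, i.e., Server 1.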
Both First-Fit and Best-Fit DRFH can be easily implemented by searching all k servers in O(k)
time, which is fast enough for small- and medium-sized clusters. For a large cluster containing tens of
thousands of servers, this computation can be quickly approximated by adapting the power-of-two-choices load balancing technique [46]. Instead of scanning through all servers, the scheduler randomly probes
two servers and places the task on the server that fits the task better.
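A corresponding sketch of this sampling variant (again an assumption about one possible implementation rather than a prescribed design), reusing the fitness measure of (2.10):

    import random
    import numpy as np

    def two_choices_server(demand, remaining):
        # Probe two random servers (assumes at least two) and keep the better fit.
        l1, l2 = random.sample(range(len(remaining)), 2)
        feasible = [l for l in (l1, l2) if np.all(remaining[l] >= demand)]
        if not feasible:
            return None                   # neither probed server can fit the task
        return min(feasible, key=lambda l: np.sum(
            np.abs(demand / demand[0] - remaining[l] / remaining[l][0])))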
It is worth mentioning that the definition of the heuristic function (2.10) is not unique. In fact, one can use a more complex heuristic function than (2.10) to measure the fitness of a task for a server, e.g., cosine similarity [47]. However, as we shall show in the next section, Best-Fit DRFH with (2.10) as its heuristic function already improves the utilization to a level where the system capacity is almost saturated. Therefore, the benefit of using a more complex fitness measure is very limited, at least for the
Google cluster traces [3].
2.6 Evaluation
In this section, we evaluate the performance of DRFH via both prototype implementation and extensive
simulations driven by Google cluster-usage traces [3].
2.6.1 Prototype Implementation
We implemented First-Fit DRFH as a new pluggable allocation module in Apache Mesos 0.18.1 [39,48].
Mesos is an open-source cluster resource management system that has been widely adopted by industry
Table 2.1: Resource configurations of the 50 nodes in the launched Amazon EC2 cluster.
# Cores    Memory (MB)    # Nodes
2          200            7
1          800            10
1          400            9
1          200            9
2          800            8
2          400            7
Table 2.2: Details of tasks submitted by four artificial applications.
App. ID    # Tasks    # Cores per Task    Memory per Task (MB)    Task Runtime (s)
1          6000       0.2                 20                      25
2          5000       0.1                 40                      60
3          7500       0.1                 40                      25
4          8000       0.2                 20                      25
companies such as Twitter and Google. Our implementation maintains a priority queue for all active applications: the one with the least global dominant share is deemed the most unfairly treated and will receive the highest priority for task scheduling. Whenever resources become available in some server, our DRFH allocator offers them to the application with the highest scheduling priority.
We set up a cluster in Amazon EC2 with a total of 50 nodes (t2.medium). In order to emulate
the machine heterogeneity in a large cluster, we configured Mesos so that each node contributes only a
partial amount of resources (i.e., CPU and memory) to the cluster. Table 2.1 summarizes the resource
configuration of each node.
To evaluate the allocation fairness of our DRFH allocator in the presence of users dynamically joining
and departing, we wrote four artificial Mesos applications and launched them sequentially over time.
Table 2.2 summarizes the number of tasks submitted by all four applications, the resource demands of
each task, and task runtime.
Throughout the experiment, we measure the CPU, memory, and global dominant share of the four
applications every 10 s. Fig. 2.5 shows the measured allocation shares over time with DRFH. We see
that initially, only App1 is active and is allocated all the CPU cores of the cluster. This allocation
remains for 300 s, after which App2 joins the system. Both applications now compete for the cluster
resources, leading to a DRFH allocation with both applications receiving the same global dominant
share of 50%. At around 560 s, App1 finishes running and returns all the allocated resources. App2 then
occupies the entire cluster, and is allocated 70% of the total CPU cores and 80% of cluster memory. We
note that not all memory is allocated to App2 because a certain amount of memory has been reserved
for Mesos Slave in each node [48]. At around 800 s, both App3 and App4 become active and start to launch computing tasks.
Figure 2.5: CPU, memory, and global dominant share of four applications running in a 50-node Amazon EC2 cluster.
The DRFH allocator ensures the same global dominant share of 33% for all
three applications until App2 finishes at 1350 s. After that the entire cluster is shared by App3 and
App4. Since App3 (resp., App4) has the same per-task resource demands as App2 (resp., App1) in Table 2.2, the allocation shares received by App3 (resp., App4) are similar to those received by App2 (resp., App1)
from 300 s to 560 s. Finally, we see that App4 increases its CPU share to 100% after App3 departs
the system. To summarize, whenever more than one application shares the cluster, our Mesos implementation quickly ensures that all applications receive the same global dominant share, achieving the precise DRFH allocation at all times.
Our prototype implementation in a small-sized Amazon EC2 cluster suggests that naive First-Fit
DRFH is sufficient to achieve near-perfect fairness. While Best-Fit DRFH cannot do better than First-
Fit in terms of fairness, we shall see in the next subsection that the former outperforms the latter in
terms of other metrics – notably resource utilization – when the cluster is of a large scale.
Table 2.3: Resource utilization of the Slots scheduler with different slot sizes.
Number of Slots          CPU Utilization    Memory Utilization
10 per maximum server    35.1%              23.4%
12 per maximum server    42.2%              27.4%
14 per maximum server    43.9%              28.0%
16 per maximum server    45.4%              24.2%
20 per maximum server    40.6%              20.0%
2.6.2 Trace-Driven Simulation
We now turn to a system with a much larger scale and compare the two DRFH implementations via ex-
tensive simulations driven by Google cluster-usage traces [3]. The traces contain resource demand/usage
information of over 900 users (i.e., Google services and engineers) on a cluster of 12K servers. The server
configurations are summarized in Table 1.1, where the CPUs and memory of each server are normalized
so that the maximum server is 1. Each user submits computing jobs, divided into a number of tasks, each
requiring a set of resources (i.e., CPU and memory). From the traces, we extract the computing demand
information – the required amount of resources and task running time – and use it as the demand input
of the allocation algorithms for evaluation.
Resource Utilization
Our first evaluation focuses on the resource utilization. We take the 24-hour computing demand data
from the Google traces and simulate it on a smaller cloud computing system of 2,000 servers so that
fairness becomes relevant. The server configurations are randomly drawn from the distribution of Google
cluster servers in Table 1.1. We compare Best-Fit DRFH with two other benchmarks: the traditional Slots scheduler that schedules tasks onto slots of servers (e.g., Hadoop Fair Scheduler [45]), and First-Fit DRFH, which chooses the first server that fits the task. For the former, we try different slot sizes and choose the one with the highest CPU and memory utilization. Table 2.3 summarizes our
observations, where dividing the maximum server (1 CPU and 1 memory in Table 1.1) into 14 slots
leads to the highest overall utilization.
Fig. 2.6 depicts the time series of CPU and memory utilization of the three algorithms. We see that
the two DRFH implementations significantly outperform the traditional Slots scheduler with much higher
resource utilization, mainly because the latter ignores the heterogeneity of both servers and workload.
This observation is consistent with findings in the homogeneous environment where all servers are of
the same hardware configurations [12]. As for the DRFH implementations, we see that Best-Fit DRFH
leads to uniformly higher resource utilization than the First-Fit alternative at all times.
Figure 2.6: Time series of CPU and memory utilization for Best-Fit DRFH, First-Fit DRFH, and Slots.
The high resource utilization of Best-Fit DRFH naturally translates to shorter job completion times
shown in Fig. 2.7a, where the CDFs of job completion times for both Best-Fit DRFH and Slots scheduler
are depicted. Fig. 2.7b offers a more detailed breakdown, where jobs are classified into 5 categories based on the number of their computing tasks, and for each category, the mean completion time reduction
is computed. While DRFH shows no improvement over Slots scheduler for small jobs, a significant
completion time reduction has been observed for those containing more tasks. Generally, the larger
the job is, the more improvement one may expect. Similar observations have also been found in the
homogeneous environments [12].
Fig. 2.7 does not account for partially completed jobs and focuses only on those having all tasks
finished in both Best-Fit and Slots. As a complementary study, Fig. 2.8 computes the task completion
ratio – the number of tasks completed over the number of tasks submitted – for every user using Best-Fit
DRFH and Slots schedulers, respectively. The radius of each circle is scaled logarithmically with the number of tasks the user submitted. We see that Best-Fit DRFH leads to a higher task completion ratio for almost all users. Around 20% of users have all their tasks completed under Best-Fit DRFH but not under Slots.
Sharing Incentive
Our final evaluation is on the sharing incentive property of DRFH. While we have shown in Sec. 2.4.5
that DRFH may not satisfy the property in the strong sense, it remains unclear how often this property
would be violated in practice. Does it happen frequently, or only in rare cases? We answer this question in this subsection.
Figure 2.7: DRFH improvements on job completion times over the Slots scheduler. (a) CDF of job completion times. (b) Mean job completion time reduction by job size: −1% for jobs of 1−50 tasks, 2% for 51−100, 25% for 101−500, 43% for 501−1000, and 62% for >1000 tasks.
We first compute the task completion ratio under the benchmark per-server partition. To do this,
we randomly choose ⌈k/n⌉ servers, where k is the number of servers and n is the number of users in the traces. We then allocate these ⌈k/n⌉ servers to a user and schedule its tasks onto them. These ⌈k/n⌉
servers form a dedicated cloud exclusive for this user. We compute the task completion ratio obtained
in this dedicated cloud and compare it with the one obtained under the DRFH allocation. Fig. 2.9
illustrates the comparison results for all users. We see that most users would prefer DRFH allocation as
compared with running tasks in a dedicated cloud. In particular, only 2% users see fewer tasks finished
under the DRFH allocation. Even for these users, the task completion ratio decreases only slightly, as
shown in Fig. 2.9. As a result, we see that DRFH only violates the property of strong sharing incentive
in rare cases in the Google traces.
Figure 2.8: Task completion ratio of users under the Best-Fit DRFH and Slots schedulers, respectively. Each bubble's size is logarithmic in the number of tasks the user submitted.
Figure 2.9: Comparison of task completion ratios under DRFH and those obtained in dedicated clouds. Each circle's radius is logarithmic in the number of tasks submitted.
2.7 Related Work
Despite the extensive computing system literature on fair resource allocation, many existing works
limit their discussions to the allocation of a single resource type, e.g., CPU time [49, 50] and link
bandwidth [51–55]. Various fairness notions have also been proposed throughout the years, ranging
from application-specific allocations [56,57] to general fairness measures [51,58,59].
As for multi-resource allocation, state-of-the-art cloud computing systems employ naive single re-
source abstractions. For example, the two fair sharing schedulers currently supported in Hadoop [45,60]
partition a node into slots with fixed fractions of resources, and allocate resources jointly at the slot
granularity. Quincy [61], a fair scheduler developed for Dryad [11], models the fair scheduling problem
as a min-cost flow problem to schedule jobs into slots. The recent work [43] takes the job placement
constraints into consideration, yet it still uses a slot-based single resource abstraction.
Ghodsi et al. [12] are the first in the literature to present a systematic investigation on the multi-
resource allocation problem in cloud computing systems. They proposed DRF to equalize the dominant
share of all users, and show that a number of desirable fairness properties are guaranteed in the resulting
allocation. DRF has quickly attracted a substantial amount of attention and has been generalized to
many dimensions. Notably, Joe-Wong et al. [13] generalized the DRF measure and incorporated it into
a unifying framework that captures the trade-offs between allocation fairness and efficiency. Dolev et
al. [14] suggested another notion of fairness for multi-resource allocation, known as Bottleneck-Based
Fairness (BBF), under which two fairness properties that DRF possesses are also guaranteed. Gutman
and Nisan [15] considered another setting of DRF with a more general domain of user utilities, and
showed their connections to the BBF mechanism. Parkes et al. [16], on the other hand, extended DRF
in several ways, including the presence of zero demands for certain resources, weighted user endowments,
and in particular the case of indivisible tasks. They also studied the loss of social welfare under the
DRF rules. Kash et al. [41] extended the DRF model to allow users to join the system over time but never leave. Bhattacharya et al. [62] generalized DRF to a hierarchical scheduler that offers service isolation in a computing system with a hierarchical structure. All these works assume, explicitly or
implicitly, that resources are either concentrated into one super computer, or are distributed to a set of
homogeneous servers with exactly the same resource configuration.
However, server heterogeneity has been widely observed in today’s cloud computing systems. Specif-
ically, Ahmad et al. [8] noted that datacenter clusters usually consist of both high-performance servers
and low-power nodes with different hardware architectures. Reiss et al. [2,3] illustrated a wide range of
server specification in Google clusters. As for public clouds, Farley et al. [6] and Ou et al. [7] observed
significant hardware diversity among Amazon EC2 servers that may lead to substantially different per-
formance across supposedly equivalent VM instances. Ou et al. [7] also pointed out that such server
heterogeneity is not limited to EC2 only, but generally exists in long-lasting public clouds such as
Rackspace.
To our knowledge, the very recent paper [63] is the only work that studied allocation properties
in the presence of server heterogeneity, where a randomized allocation algorithm, called Constrained-
DRF (CDRF), is proposed to schedule discrete jobs. While CDRF possesses all the desirable properties
discussed in this chapter, it is too complex for a job scheduler, and an efficient algorithm remains an
open problem [63]. More recently, Grandl et al. [64] proposed an efficient heuristic algorithm for multi-
resource scheduling in heterogeneous computer clusters. Their work mainly focuses on designing a good
heuristic algorithm, not studying the allocation properties, and is therefore orthogonal to our work.
Other related works include fair-division problems in the economics literature, in particular the
egalitarian division under Leontief preferences [42] and the cake-cutting problem [44]. These works also
assume the all-in-one resource model, and hence cannot be directly applied to cloud computing systems
with heterogeneous servers.
Finally, we note that many bandwidth allocation problems in the networking literature, although
focusing on a single resource type, can be naturally converted to a multi-resource sharing problem.
For example, in the traditional flow control problem (described in Sec. 6.5.2 of [65]), each flow has its route determined already and is associated with a fixed path in the network. The flow control problem requires us to determine the traffic rate of each flow such that the entire network bandwidth is fairly shared. To see how this problem connects to a multi-resource sharing problem, we view each link as a type of resource, whose availability is the link's bandwidth capacity. Moreover, we view each flow as an aggregation of many sub-flows, each traversing the same path as the flow at a rate of 1 bit per second.
We hence construct an equivalent multi-resource sharing problem where the equivalent of a user is a
flow and the equivalent of a task is a sub-flow. DRF can therefore be applied as a fair sharing solution:
The dominant resource of a flow is the link with the smallest bandwidth capacity among all links the
flow traverses. It turns out that the resulting DRF allocation reduces to the well-known max-min fair
flow control mechanism [65] (Sec. 6.5.2). It is worth emphasizing that while the flow control problem
can be interpreted as a multi-resource fair sharing problem, the converse is generally not true. This is
because in a network, a flow always consumes the same amount of bandwidth in all links it traverses.
In contrast, in a multi-resource sharing problem, a task may require different amounts of different resources. It is this key difference that makes max-min fair flow control inapplicable when extended to a general multi-resource setting.
2.8 Summary
In this chapter, we have studied a multi-resource allocation problem in a heterogeneous cloud computing
system where the resource pool is composed of a large number of servers with different configurations
in terms of resources such as processing, memory, and storage. The proposed multi-resource allocation
mechanism, known as DRFH, equalizes the global dominant share allocated to each user, and hence
generalizes the DRF allocation from a single server to multiple heterogeneous servers. We have analyzed
DRFH and shown that it retains almost all the desirable properties that DRF provides in the single-server
scenario. Notably, DRFH is envy-free, Pareto optimal, and group strategyproof. It also offers the sharing
incentive in a weak sense. We have implemented DRFH as a new pluggable resource allocator in Apache
Mesos. Experimental results show that our implementation leads to a precise DRFH allocation in a
real cluster. Large-scale simulations driven by Google cluster traces further show that, compared to the
traditional single-resource abstraction such as a slot scheduler, DRFH achieves significant improvements
in resource utilization, leading to much shorter job completion times.
Chapter 3
Multi-Resource Fair Queueing for
Network Flows
3.1 Motivation
In Chapter 2, we have studied how multiple types of computing resources – such as CPU cores, memory,
IP addresses, and storage spaces – should be shared for datacenter jobs. In addition to computing
resources, network is another important concern. Modern datacenters typically deploy a large number
of network appliances or “middleboxes”. According to recent studies [17,18], the number of middleboxes
deployed in enterprise and datacenter networks is already on par with the traditional switches and
routers. These middleboxes do more than just packet forwarding. They perform a variety of critical
network functions that require deep packet inspection based on the payload of packets, such as IP
security encryption, WAN optimization, and intrusion detection. Performing these complex network
functions requires the support of multiple types of resources, and may bottleneck on either CPU or link
bandwidth [4,28]. For example, flows that require basic forwarding may congest the link bandwidth [4],
while those that require IP security encryption need more CPU processing time [28]. In all cases, link
bandwidth is no longer the only network resource shared by flows. A scheduling algorithm specifically
designed for multiple resource types is therefore needed for sharing these resources fairly and efficiently.
Starting from this chapter, we shall focus on the design and implementation of multi-resource queueing
algorithms for middleboxes. The queueing algorithm determines the order in which packets in various
independent flows are processed, and serves as a fundamental mechanism for allocating resources in
a middlebox. Ideally, a queueing algorithm – also known as a packet scheduler – should provide the
following desirable properties.
Fairness. The middlebox scheduler should provide some measure of service isolation to allow com-
peting flows to have a fair share of middlebox resources. In particular, each flow should receive service
(i.e., throughput) at least at the level where every resource is equally allocated (assuming flows are
equally weighted). Moreover, this service isolation should not be compromised by strategic behaviours
of other flows.
Low complexity. With the ever growing line rate and the increasing volume of traffic passing
through middleboxes [29, 30], it is critical to schedule packets at high speeds. This requires low time
complexity when making scheduling decisions. In particular, it is desirable that this complexity is a small
constant, independent of the number of traffic flows. Equally importantly, the scheduling algorithm
should also be amenable to practical implementations.
Bounded scheduling delay. Interactive network and data analytic applications have stringent
end-to-end delay requirements. It is hence important for a packet scheduler to offer bounded scheduling
delay. Such a delay bound should be a small constant, independent of the number of flows.
Despite recent advances in multi-resource fair queueing (e.g., [4]), how a multi-resource packet sched-
uler is to be designed to satisfy all three desirable properties remains an open challenge. Existing designs
are expensive to implement at high speed. In particular, DRFQ [4], the first multi-resource fair queueing
algorithm that implements Dominant Resource Fairness (DRF) [12], associates packets with timestamps,
and schedules the one with the earliest timestamp. It suffers from a sorting bottleneck with scheduling
complexity logarithmic in the number of flows. A similar problem also exists in DRWF2Q [66].
In this chapter, we design two new schedulers that take O(1) time to schedule a packet while achieving
near-perfect fairness with the scheduling delay bounded by a small constant. The first scheduler, referred
to as Multi-Resource Round Robin (MR3), is designed for flows with equal weights. While round-robin
schedulers have found successful applications to fairly share a single type of resource, e.g., outgoing
bandwidth in switches and routers [25, 67, 68], we observe that directly applying them to schedule
multiple types of resources may lead to arbitrary unfairness. Nonetheless, we show, analytically, that
near-perfect fairness could be achieved by simply deferring the scheduling opportunity of a packet until
the progress gap between two resources falls below a small threshold. We then implement this principle
in a multi-resource packet scheduling design similar to Elastic Round Robin [68], which we show to be
the most suitable variant for the middlebox environment in the design space of round robin algorithms.
Both theoretical analyses and extensive simulation demonstrate that compared with DRFQ, the price
we pay is a slight increase in packet latency. MR3 is the first multi-resource fair queueing algorithm that
offers near-perfect fairness in O(1) time.
Despite its low complexity and ease of implementation, MR3 may incur large scheduling delays for flows with uneven weights, and is hence unsuitable for applications with stringent delay requirements. To address this problem, we design an improved round-robin scheduler, called Group Multi-Resource Round Robin (GMR3), that achieves all three desirable scheduling properties. GMR3 groups flows with similar weights into a small number of groups, each associated with a timestamp. The
scheduling decisions are made in a two-level hierarchy. At the higher level, GMR3 makes inter-group
scheduling decisions by choosing the group with the earliest timestamp, while at the lower level, the
intra-group scheduler serves flows within a group in a round-robin fashion. GMR3 is highly efficient, as
it requires only O(1) time per packet in almost all practical scenarios. In addition, we show that GMR3
achieves near-perfect fairness across flows, with its scheduling delay bounded by a small constant. These
desirable properties are proven analytically and validated by simulation. To our knowledge, GMR3
represents the first fair queueing algorithm that offers near-perfect fairness with O(1) time complexity
and a constant scheduling delay bound. GMR3 is amenable to practical implementations, and may find
a variety of applications in other multi-resource scheduling contexts such as VM scheduling inside a
hypervisor.
The remainder of this chapter is organized as follows. After a brief survey of the existing literature
in Sec. 3.2, we introduce the preliminaries and design objectives in Sec. 3.3, and discuss the challenges of
extending round-robin algorithms to the multi-resource scenario in Sec. 3.4. Our design of MR3 is given in
Sec. 3.4.3 and is evaluated via both theoretical analyses and simulation experiments in Sec. 3.4.4 and Sec. 3.4.5,
respectively. Sec. 3.5 presents the GMR3 scheduler for weighted flows. Sec. 3.7 concludes the chapter.
3.2 Related Work
Unlike switches and routers where the output bandwidth is the only shared resource, middleboxes handle
a variety of hardware resources and require a more complex packet scheduler. Many recent measure-
ments, such as [4, 27, 28], report that packet processing in a middlebox may bottleneck on any of CPU,
memory bandwidth, and link bandwidth, depending on the network functions applied to the traffic flow.
Such a multi-resource setting significantly complicates the scheduling algorithm. As pointed out in [4],
simply applying traditional fair queueing schemes [21–26] per resource (i.e., per-resource fairness) or on
the bottleneck resource of the middlebox (i.e., bottleneck fairness [69]) fails to offer service isolation.
Furthermore, by strategically claiming some resources that are not needed, a flow may increase its service
share at the price of other flows.
Ghodsi et al. [4] have proposed a multi-resource fair queueing algorithm, termed DRFQ, that imple-
ments DRF in the time domain, i.e., it schedules packets in a way such that each flow receives roughly the
same processing time on its most congested resource. It is shown to achieve service isolation and guaran-
tee that no rational user would misreport the amount of resources it requires. Following this intuition, we
have previously extended the idealized GPS model [21,22] to Dominant Resource GPS (DRGPS), which
implements strict DRF at all times [66]. By emulating DRGPS, well-known fair queueing algorithms,
such as WFQ [21] and WF2Q [26], can have direct extensions in the multi-resource setting. However,
while all these algorithms achieve nearly perfect service isolation, they are timestamp-based schedulers
and are expensive to implement. They all require selecting a packet with the earliest timestamp among
n active flows, taking O(log n) time per packet processing. With a large n, these algorithms are hard to
implement at high speeds.
Reducing the scheduling complexity is a major concern in many existing studies on high-speed net-
working. When there is only a single resource to schedule, round-robin schedulers [25, 67,68] have been
proposed to multiplex the output bandwidth of switches and routers, in which flows are served in cir-
cular order. These algorithms eliminate the sorting bottleneck associated with the timestamp-based
schedulers, and achieve O(1) time complexity per packet. Due to their extreme simplicity, round-robin
schemes have been widely implemented in high-speed routers such as Cisco GSR [70].
Despite the successful application of round-robin algorithms in traditional L2/L3 devices, we observe
in this study that they may lead to arbitrary unfairness when naively applied to multiple resources.
Therefore, it remains unclear whether their attractiveness, i.e., the implementation simplicity and low
time complexity, extends to the multi-resource environment, and if it does, how a round-robin scheduler
should be designed and implemented in middleboxes. We study answers to these questions in the
following sections.
3.3 Preliminaries and Design Objectives
In this section, we introduce some basic concepts and clarify the detailed design objectives of a multi-
resource scheduler.
3.3.1 Packet Processing Time
Depending on the network functions applied to a flow, processing a packet of the flow may consume
different amounts of middlebox resources. Following [4], we define the packet processing time as a metric
for the resource requirements of a packet. Specifically, for packet p, its packet processing time on resource
r, denoted τr(p), is defined as the time required to process the packet on resource r, normalized to the
middlebox’s processing capacity of resource r. For example, a packet may require 10 µs to process using
one CPU core. A middlebox with 2 CPU cores can process 2 such packets in parallel. As a result, the packet processing time of this packet on the CPU resource of this middlebox is 5 µs.

[Figure 3.1: Illustration of a scheduling discipline that achieves DRF. (a) The scheduling discipline. (b) The processing time received on the dominant resources.]
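As a small aside, this normalization amounts to dividing the single-unit processing time by the number of parallel units; a tiny illustrative sketch (the helper name is ours, not part of the simulator described in Sec. 3.4.5):

#include <iostream>

// Normalized processing time on one resource: the time a packet needs on a single
// processing unit, divided by the number of parallel units (e.g., CPU cores) the
// middlebox has for that resource. Hypothetical helper, for illustration only.
double normalizedProcessingTime(double singleUnitTimeUs, int numUnits) {
    return singleUnitTimeUs / numUnits;
}

int main() {
    // The example above: 10 us of work on one core, 2 cores => 5 us on the CPU resource.
    std::cout << normalizedProcessingTime(10.0, 2) << " us\n";
}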
3.3.2 Dominant Resource Fairness (DRF) over Time
Fairness is the primary design objective for a packet scheduler. We have seen in the previous chapter
that the recently proposed Dominant Resource Fairness [12, 16] serves as a promising notion of fairness
for sharing multiple types of resources. DRF generalizes max-min fairness to the dominant resource in
the space domain, and can also be extended to resource multiplexing over time. In our context, the
dominant resource of a packet is the one that requires the most packet processing time. Specifically,
let m be the number of resources concerned. For a packet p, its dominant resource, denoted r∗(p), is
defined as
r*(p) = argmax_{1 ≤ r ≤ m} τ_r(p).    (3.1)
A schedule is said to implement DRF over time if flows receive the same processing time on the dominant
resource of their packets in all backlogged periods. For example, consider two flows in Fig. 3.1a. Flow
1 sends packets P1, P2, . . ., while flow 2 sends packets Q1, Q2, . . .. Each packet of flow 1 requires 1
time unit for CPU processing and 3 time units for link transmission, and thus has the processing times
〈1, 3〉. Each packet of flow 2, on the other hand, requires the processing times 〈3, 1〉. In this case, the
dominant resource of packets of flow 1 is link bandwidth, while the dominant resource of packets of flow
2 is CPU. We see that the scheduling scheme shown in Fig. 3.1a achieves DRF, under which both flows
receive the same processing time on dominant resources (see Fig. 3.1b).
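To make the definition concrete, the following minimal sketch computes a packet's dominant resource from its per-resource processing times, using the two flows of Fig. 3.1 as input (the names are illustrative and not taken from the simulator):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

// Per-resource processing times tau_r(p) of a packet, indexed by resource id
// (here 0 = CPU, 1 = link bandwidth). Illustrative representation only.
using ProcessingTimes = std::vector<double>;

// Dominant resource r*(p) = argmax_r tau_r(p), as in Eq. (3.1).
int dominantResource(const ProcessingTimes& tau) {
    return static_cast<int>(
        std::distance(tau.begin(), std::max_element(tau.begin(), tau.end())));
}

int main() {
    ProcessingTimes flow1Packet{1.0, 3.0};  // <1, 3>: dominant resource is the link (index 1)
    ProcessingTimes flow2Packet{3.0, 1.0};  // <3, 1>: dominant resource is the CPU (index 0)
    std::cout << dominantResource(flow1Packet) << " "
              << dominantResource(flow2Packet) << "\n";  // prints "1 0"
}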
It has been shown in [4,66] that a scheduler that achieves strict DRF at all times offers the following
highly desirable properties:
1. Predictable service isolation. For each flow i, the received service (i.e., throughput) is at least at
the level where every resource is equally allocated.
2. Truthfulness. No flow can receive better service (finish faster) by misreporting the amount of
resources it requires.
3. Work conservation. No resource is left idle if it could be used to increase the throughput of
a backlogged flow.
Therefore, we adopt DRF as the notion of fairness for multi-resource scheduling.
To measure how well a packet scheduler approximates DRF, the following Relative Fairness Bound
(RFB) is used as a fairness metric [4, 66]:
Definition 5. For any packet arrivals, let Ti(t1, t2) be the packet processing time flow i receives on the
dominant resources of its packets in the time interval (t1, t2). Ti(t1, t2) is referred to as the dominant
service flow i receives in (t1, t2). Let wi be the weight associated with flow i, and B(t1, t2) the set of
flows that are backlogged in (t1, t2). The Relative Fairness Bound (RFB) is defined as
RFB = sup_{t1, t2; i, j ∈ B(t1, t2)} | T_i(t1, t2)/w_i − T_j(t1, t2)/w_j | .    (3.2)
We require a scheduling scheme to have a small RFB, such that the difference between the normalized
dominant service received by any two flows i and j, over any backlogged time period (t1, t2), is bounded
by a small constant.
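The RFB is a supremum over all backlogged intervals and flow pairs; in a simulation one typically tracks each flow's cumulative dominant service and reports the largest normalized gap over the interval of interest. A minimal bookkeeping sketch follows (illustrative names, not the simulator's API):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Per-flow bookkeeping over a fixed measurement interval (t1, t2): the cumulative
// dominant service T_i and the weight w_i. Names are illustrative only.
struct FlowService {
    double dominantService;  // T_i(t1, t2)
    double weight;           // w_i
};

// Largest pairwise gap |T_i/w_i - T_j/w_j| among the given flows. For one fixed
// interval this is the quantity that the RFB in Eq. (3.2) takes the supremum of.
double maxNormalizedGap(const std::vector<FlowService>& flows) {
    double gap = 0.0;
    for (std::size_t i = 0; i < flows.size(); ++i)
        for (std::size_t j = i + 1; j < flows.size(); ++j)
            gap = std::max(gap,
                           std::fabs(flows[i].dominantService / flows[i].weight -
                                     flows[j].dominantService / flows[j].weight));
    return gap;
}

int main() {
    // Three backlogged flows with normalized dominant services 20, 16 and 22;
    // the largest gap here is 6.
    std::vector<FlowService> flows{{10.0, 0.5}, {4.0, 0.25}, {5.5, 0.25}};
    std::cout << maxNormalizedGap(flows) << "\n";
}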
3.3.3 Scheduling Delay
In addition to fairness, scheduling delay is another important concern for a packet scheduler. The
scheduling delay is defined as the time that elapses between the instant a packet reaches the head of its
queue, and the instant the packet is completely processed on all resources. The delay is introduced by
the scheduling algorithm and is also referred to as the single packet delay in the fair queueing literature
[68, 71–73]. Intuitively, flows with larger weights are expected to experience smaller delay. In the ideal
case, the scheduling delay SDi(p) experienced by any packet p of flow i should be within a small constant
amount that is inversely proportional to the deserved scheduling weight of the flow, i.e.,
SD_i(p) ≤ C/w_i ,    (3.3)
where C is a constant, independent of the number of flows.
3.3.4 Scheduling Complexity
To handle a large volume of traffic at high speeds, the scheduler must operate with low scheduling
complexity, defined as the time required to make a packet scheduling decision. Ideally, this complexity
should be a small constant, independent of the number of flows.
In summary, a good packet scheduler should offer near-perfect fairness and a constant delay bound
that is inversely proportional to the flow’s weight, while operating in O(1) time complexity as well.
3.4 Multi-Resource Round Robin
In this section, we present a new scheduler, called Multi-Resource Round Robin (MR3), that achieves all
the aforementioned scheduling properties when flows are assigned the same weights, i.e., w_i = 1 for every
flow i. We shall extend this design to a more general scheduler for flows with uneven weights in Sec. 3.5.
We start by revisiting round-robin algorithms in the traditional fair queueing literature and discussing
the challenges of extending them to the multi-resource setting.
3.4.1 Challenges of Round-Robin Extension
As mentioned in Sec. 3.2, among various scheduling schemes, round-robin algorithms are particularly
attractive for practical implementation due to their extreme simplicity and constant time complexity.
To extend them to the multi-resource setting with DRF, a natural approach is to directly apply them on the
dominant resources of a flow’s packets, such that in each round, flows receive roughly the same dominant
services. Such a general extension can be applied to many well-known round-robin algorithms. However,
such a naive extension may lead to arbitrary unfairness.
Take the well-known Deficit Round Robin (DRR) [25] as an example. When there is a single resource,
DRR assigns some predefined quantum size to each flow. Each flow maintains a deficit counter, whose
value is the currently unused transmission quota. In each round, DRR polls every backlogged flow and
transmits its packets up to an amount of data equal to the sum of its quantum and deficit counter.
The unused transmission quota will be carried over to the next round as the value of the flow’s deficit
counter. Similar to the single-resource case, one can apply DRR [25] on the dominant resources of a
flow’s packets as follows.
Initially, the algorithm assigns a predefined quantum size to each flow, which is also the amount
of dominant service the flow is allowed to receive in one round. Each flow maintains a deficit counter
that measures the currently unused portion of the allocated dominant service. Packets are scheduled in rounds, and in each round, each backlogged flow schedules as many packets as it has, as long as the dominant service consumed does not exceed the sum of its quantum and deficit counter. The unused portion of this amount is carried over to the next round as the new value of the deficit counter.

[Figure 3.2: Illustration of a direct DRR extension. Each packet of flow 1 has processing times 〈7, 6.9〉, while each packet of flow 2 has processing times 〈1, 7〉. (a) Direct application of DRR to schedule multiple resources. (b) The dominant services received by the two flows.]
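For concreteness, a compact sketch of one round of this naive extension is given below, assuming every queued packet already carries its dominant processing time; the types and the round-driving loop are illustrative, not taken from the thesis prototype. The worked example that follows then shows why this scheduler is arbitrarily unfair.

#include <deque>
#include <vector>

// Sketch of the naive DRR-on-dominant-resources extension described above:
// a fixed per-round quantum plus a per-flow deficit counter, charged by the
// dominant processing time of each packet. Types and names are illustrative only.
struct Packet { double dominantTime; };   // processing time on the packet's dominant resource

struct Flow {
    std::deque<Packet> queue;
    double deficit = 0.0;                  // unused dominant service carried across rounds
};

// One round of the naive extension: every backlogged flow may consume up to
// quantum + deficit worth of dominant service; the unused part is carried over.
void naiveDrrRound(std::vector<Flow>& flows, double quantum) {
    for (Flow& f : flows) {
        if (f.queue.empty()) { f.deficit = 0.0; continue; }
        double budget = f.deficit + quantum;
        while (!f.queue.empty() && f.queue.front().dominantTime <= budget) {
            budget -= f.queue.front().dominantTime;   // "schedule" the head-of-line packet
            f.queue.pop_front();
        }
        f.deficit = budget;   // remaining budget becomes next round's deficit counter
    }
}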
As an example, consider two flows where flow 1 sends P1, P2, . . . , while flow 2 sends Q1, Q2, . . . .
Each packet of flow 1 has processing times 〈7, 6.9〉, i.e., it requires 7 time units for CPU processing and
6.9 time units for link transmission. Each packet of flow 2 requires processing times 〈1, 7〉. Fig. 3.2a
illustrates the resulting schedule of the above naive DRR extension, where the quantum size assigned
to both flows is 7. In round 1, both flows receive a quantum of 7, and can process 1 packet each, which
consumes all the quantum awarded on the dominant resources in this round. Such a process repeats in
the subsequent rounds. As a result, packets of the two flows are scheduled alternately. At the end of
each round, the received quantum is always used up, and the deficit counter remains zero.
Similar to single-resource DRR, the extension above schedules packets in O(1) time¹. However, such
an extension fails to provide fair services in terms of DRF. Instead, it may lead to arbitrary unfairness
with an unbounded RFB. Fig. 3.2b depicts the dominant services received by the two flows. We see
that flow 1 receives nearly two times the dominant service flow 2 receives. With more packets being
scheduled, the service gap increases, eventually leading to an unbounded RFB.
It is to be emphasized that the problem of arbitrary unfairness is not limited to DRR extension only,
but generally extends to all round-robin variants. For example, one can extend Surplus Round Robin
(SRR) [67] and Elastic Round Robin (ERR) [68] to the multi-resource setting in a similar way (more
details can be found in Sec. 3.4.3). It is easy to verify that running the example above will give exactly the same schedule shown in Fig. 3.2a, with an unbounded RFB². In fact, due to the heterogeneous packet processing times on different resources, the work progress on one resource may be far ahead of that on the other. For example, in Fig. 3.2a, when CPU starts to process packet P6, the transmission of packet P3 remains unfinished. It is such a progress mismatch that leads to a significant gap between the two flows' dominant services.

¹The O(1) time complexity is conditioned on the quantum size being at least the maximum packet processing time.

[Figure 3.3: Naive fix of the DRR extension shown in Fig. 3.2a by withholding the scheduling opportunity of every packet until its previous packet is completely processed on all resources.]
In summary, directly applying round-robin algorithms on the dominant resources would fail to provide
fair services. A new design is therefore required. We preview its basic idea in the next subsection.
3.4.2 Basic Intuition
The key reason that direct round-robin extensions fail is that they cannot track the flows' dominant
services in real time. Take the DRR extension as an example. In Fig. 3.2a, after packet Q1 is completely
processed on CPU, flow 2’s deficit counter is updated to 0, meaning that flow 2 has already used up the
quantum allocated for dominant services (i.e., link transmission) in round 1. This allows packet P2 to
be processed, but erroneously so: the actual consumption of this quantum occurs only when packet Q1 is
transmitted on the link, after the transmission of packet P1.
To circumvent this problem, a naive fix is to withhold the scheduling opportunity of every packet
until its previous packet is completely processed on all resources, which allows the scheduler to track the
dominant services accurately. Fig. 3.3 depicts the resulting schedule when applying this fix to the DRR
extension shown in Fig. 3.2a. We see that the difference between the dominant services received by two
flows is bounded by a small constant. However, such a fairness improvement is achieved at the expense
of significantly lower resource utilization. Even though multiple packets can be processed in parallel on
different resources, the scheduler serves only one packet at a time, leading to poor resource utilization
and high packet latency. As a result, this simple fix cannot meet the demand of high-speed networks.
To strike a balance between fairness and latency, packets should not be deferred as long as the
difference of two flows’ dominant services is small. This can be achieved by bounding the progress
gap on different resources by a small amount. In particular, we may serve flows in rounds as follows.

²In either SRR or ERR extension, by scheduling 1 packet, each flow uses up all the quantum awarded in each round. As a result, packets of the two flows are scheduled alternately, the same as that in Fig. 3.2a.

[Figure 3.4: Illustration of a schedule by MR3. (a) Schedule by MR3. (b) The dominant services received by two flows.]
Whenever a packet p of a flow i is ready to be processed on the first resource (usually CPU) in round k,
the scheduler checks the work progress on the last resource (usually the link bandwidth). If flow i has
already received services on the last resource in the previous round k − 1, or it has newly arrived, then
packet p is scheduled immediately. Otherwise, packet p is withheld until flow i starts to receive service
on the last resource in round k − 1. As an example, Fig. 3.4a depicts the resulting schedule with the
same input traffic as that in the example of Fig. 3.2a. In round 1, both packets P1 and Q1 are scheduled
without delay because both flows are new arrivals. In round 2, packet P2 (resp., Q2) is also scheduled
without delay, because when it is ready to be processed, flow 1 (resp., flow 2) has already started its
service on the link bandwidth in round 1. In round 3, while packet P3 is ready to be processed right
after packet Q2 is completely processed on CPU, it has to wait until the transmission of P2 starts. A
similar process repeats for all subsequent packets.
We will show later in Sec. 3.4.4 that such a simple idea leads to nearly perfect fairness across flows,
without incurring high packet latency. In fact, the schedule in Fig. 3.4a incurs the same packet latency
as that in Fig. 3.2a, but is much fairer. As we see from Fig. 3.4b, the difference between dominant
services received by the two flows is bounded by a small constant.
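A minimal sketch of this deferral test is given below (the field names are illustrative; the actual mechanism, based on per-flow sequence numbers, is developed in Sec. 3.4.3):

// Illustrative deferral test for a packet of flow i that is ready for the first
// resource in round k: schedule it immediately if the flow is newly arrived or
// has already started receiving service on the last resource in round k - 1;
// otherwise withhold it. Field names are for illustration only.
struct FlowProgress {
    bool newlyArrived;               // flow just became backlogged
    int lastResourceServiceRound;    // latest round in which the flow started
                                     // service on the last resource (0 if none yet)
};

bool canScheduleNow(const FlowProgress& flow, int currentRound) {
    return flow.newlyArrived || flow.lastResourceServiceRound >= currentRound - 1;
}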
3.4.3 Algorithm Design
While the general idea introduced in the previous section is simple, implementing it as a concrete round-
robin algorithm is nontrivial. We next explore the design space of round robin algorithms and identify
Elastic Round Robin [68] as the most suitable variant for multi-resource fair queueing in middleboxes.
The resulting algorithm is referred to as Multi-Resource Round Robin (MR3).
Design Space of Round-Robin Algorithms
Many round-robin variants have been proposed in the traditional fair queueing literature. While all
these variants achieve similar performance and are all feasible for the single-resource scenario, not all of
them are suitable to implement the aforementioned idea in a middlebox. We investigate three typical
variants, i.e., Deficit Round Robin [25], Surplus Round Robin [67], and Elastic Round Robin [68], and
discuss their implementation issues in middleboxes as follows.
Deficit Round Robin (DRR): We have introduced the basic idea of DRR in Sec. 3.4.1. As an
analogy, one can view each flow as maintaining a bank account. In each round, a
predefined quantum is deposited into a flow’s account, tracked by the deficit counter. The balance of
the account (i.e., the value of the deficit counter) represents the dominant service the flow is allowed to
receive in the current round. Scheduling a packet is analogous to withdrawing the corresponding packet
processing time on the dominant resource from the account. As long as there is sufficient balance to
withdraw from the account, packet processing is allowed.
However, DRR is not amenable to implementation in middleboxes for the following two reasons.
First, to ensure that a flow has sufficient account balance to schedule a packet, the processing time
required on the dominant resource has to be known before packet processing. However, it is hard to
know what middlebox resources are needed and how much processing time is required until the packet
is processed. Also, the O(1) time complexity of DRR is conditioned on the quantum size being at least
the maximum packet processing time, which may not be easy to obtain in a real system.
If this condition is not satisfied, the time complexity could be as high as O(N) [68].
Surplus Round Robin (SRR): SRR [67] allows a flow to consume more processing time on the
dominant resources of its packets in one round than it has in its account. As a compensation, the
excessive consumption, tracked by a surplus counter, will be deducted from the quantum awarded in
the future rounds. In SRR, as long as the account balance (i.e., surplus counter) is positive, the flow
is allowed to schedule packets, and the corresponding packet processing time is withdrawn from the
account after the packet is processed on its dominant resource. In this case, the packet processing time
is only needed after the packet has been processed.
While SRR does not require knowing packet processing time beforehand, its O(1) time remains
conditioned on the predefined quantum size that is at least the same as the maximum packet processing
time. Otherwise, the time complexity could be as high as O(N) [68]. SRR is hence not amenable to
implementation in middleboxes for the same reason mentioned above.
Elastic Round Robin (ERR): Similar to SRR, ERR [68] does not require knowing the processing
time before the packet is processed.

[Figure 3.5: Illustration of the round-robin service and the sequence number.]

It allows flows to overdraw their permitted processing time in one
round on the dominant resource, with the excessive consumption deducted from the quantum received
in the next round. The difference is that instead of depositing a predefined quantum with a fixed size, in
ERR, the quantum size in one round is dynamically set as the maximum excessive consumption incurred
in the previous round. This ensures that each flow will always have a positive balance in its account at
the beginning of each round, and can schedule at least one packet. In this case, ERR achieves O(1) time
complexity without knowing the maximum packet processing time a priori, and is the most suitable for
implementation in middleboxes at high speeds.
MR3 Design
While ERR serves as a promising round-robin variant for extension to the multi-resource case, there
remain several challenges to implement the idea of scheduling deferral presented in Sec. 3.4.2. How
can the scheduler quickly track the work progress gap of two resources and decide when to withhold a
packet? To ensure efficiency, such a progress comparison must be completed within O(1) time. Note
that simply comparing the numbers of packets that have been processed on two resources does not give
any clue about the progress gap: due to traffic dynamics, each round may consist of a different number
of packets.
To circumvent this problem, we associate each flow i with a sequence number SeqNum_i, which starts from 0 and records the scheduling order of the flow. We use a global variable NextSeqNum to record the next sequence number that will be assigned to a flow. The value of NextSeqNum is initialized to 0 and increased by 1 every time a flow is processed. Each flow i also records its sequence number in the previous round, tracked by PreviousRoundSeqNum_i. For example, consider Fig. 3.5. Initially, flows
1, 2 and 3 are backlogged and are served in sequence in round 1, with sequence numbers 1, 2 and 3,
respectively. While flow 2 is being served in round 1, flow 4 becomes active. Flow 4 is therefore scheduled
right after flow 1 in round 2, with a sequence number 5. After round 2, flow 1 has no packet to serve
and becomes inactive. As a result, only flows 2, 3 and 4 are serviced in round 3, where their sequence
numbers in the previous round are 6, 7 and 5, respectively.
We use sequence numbers to track the work progress on a resource. Whenever a packet p is scheduled
to be processed, it is stamped with its flow’s sequence number. By checking the sequence number stamped
to the packet that is being processed on a resource, the scheduler knows exactly the work progress on
that resource.
Besides sequence numbers, the following important variables/functions are also used in the algorithm.
Active list: The algorithm maintains an ActiveFlowList to track backlogged flows. Flows are served
in a round-robin fashion. The algorithm always serves the flow at the head of the list, and after the
service, this flow, if remaining active, will be moved to the tail of the list for service in the next round.
A newly arriving flow is always appended to the tail of the list, and will be served in the next round. We
also use RoundRobinCounter to track the number of flows that have not yet been served in the current
round. Initially, ActiveFlowList is empty and RoundRobinCounter is 0.
Excess counter: Each flow i maintains an excess counter ECi, recording the excessive dominant
service flow i incurred in one round. The algorithm also uses two variables, MaxEC and PreviousRound-
MaxEC, to track the maximum excessive consumption incurred in the current and the previous round,
respectively. Initially, all these variables are set to 0.
DominantProcessingTime(p): this function profiles a packet p and returns the (estimated) processing
time on its dominant resource. We will discuss how a packet is profiled in Sec. 3.4.5.
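Collecting the state described above, one possible in-memory layout is sketched below; the field names mirror the variables introduced in this subsection, while the container choices (std::deque, std::list) and the Packet fields are our assumptions:

#include <cstdint>
#include <deque>
#include <list>

// Hypothetical packet record; the dominant processing time is only known
// (estimated) after CPU processing, as discussed in Sec. 3.4.5.
struct Packet {
    uint64_t seqNum = 0;      // stamped with the flow's sequence number when scheduled
    // payload, per-resource processing times, etc. omitted
};

// Per-flow state used by MR3.
struct Flow {
    std::deque<Packet> queue;           // input queue of the flow
    uint64_t seqNum = 0;                // SeqNum_i: scheduling order in the current round
    uint64_t previousRoundSeqNum = 0;   // PreviousRoundSeqNum_i
    double excessCounter = 0.0;         // EC_i: excess dominant service in the last round
};

// Global scheduler state used by MR3.
struct Mr3State {
    std::list<Flow*> activeFlowList;    // backlogged flows, served head to tail
    int roundRobinCounter = 0;          // flows not yet served in the current round
    uint64_t nextSeqNum = 0;            // next sequence number to assign
    double maxEC = 0.0;                 // max excess consumption in the current round
    double previousRoundMaxEC = 0.0;    // ... in the previous round (next round's quantum)
};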
Our algorithm, referred to as MR3, consists of two functional modules, PacketArrival (Algorithm 1),
which handles packet arrival events, and Scheduler (Algorithm 2), which decides which packet should be
processed next.
PacketArrival: This module is invoked upon the arrival of a packet. It enqueues the packet to
the input queue of the flow to which the packet belongs. If this flow is previously inactive, it is then
appended to the tail of the active list and will be served in the next round. The sequence number of the
flow is also updated, as shown in Algorithm 1 (line 3 to line 5).
Algorithm 1 MR3 PacketArrival
1: Let i be the flow to which the packet belongs
2: if ActiveFlowList.Contains(i) == FALSE then
3:     PreviousRoundSeqNum_i = SeqNum_i
4:     NextSeqNum = NextSeqNum + 1
5:     SeqNum_i = NextSeqNum
6:     ActiveFlowList.AppendToTail(i)
7: end if
8: Enqueue the packet to queue i
Scheduler: This module decides which packet should be processed next. The scheduler first checks
the value of RoundRobinCounter to see how many flows have not yet been served in the current round.
If the value is 0, then a new round starts. The scheduler sets RoundRobinCounter to the length of the
active list (line 3), and updates PreviousRoundMaxEC as the maximum excessive consumption incurred
in the round that has just passed (line 4), while MaxEC is reset to 0 for the new round (line 5).
Algorithm 2 MR3 Scheduler
 1: while TRUE do
 2:     if RoundRobinCounter == 0 then
 3:         RoundRobinCounter = ActiveFlowList.Length()
 4:         PreviousRoundMaxEC = MaxEC
 5:         MaxEC = 0
 6:     end if
 7:     Flow i = ActiveFlowList.RemoveFromHead()
 8:     B_i = PreviousRoundMaxEC − EC_i
 9:     while B_i ≥ 0 and QueueIsNotEmpty(i) do
10:         Let q be the packet being processed on the last resource
11:         WaitUntil(q.SeqNum ≥ PreviousRoundSeqNum_i)
12:         Packet p = Dequeue(i)
13:         p.SeqNum = SeqNum_i
14:         ProcessPacket(p)
15:         B_i = B_i − DominantProcessingTime(p)
16:     end while
17:     if QueueIsNotEmpty(i) then
18:         ActiveFlowList.AppendToTail(i)
19:         NextSeqNum = NextSeqNum + 1
20:         PreviousRoundSeqNum_i = SeqNum_i
21:         SeqNum_i = NextSeqNum
22:         EC_i = −B_i
23:     else
24:         EC_i = 0
25:     end if
26:     MaxEC = Max(MaxEC, EC_i)
27:     RoundRobinCounter = RoundRobinCounter − 1
28: end while
The scheduler then serves the flow at the head of the active list. Let flow i be such a flow. Flow i
receives a quantum equal to the maximum excessive consumption incurred in the previous round, and
has its account balance Bi equal to the difference between the quantum and the excess counter, i.e.,
Bi = PreviousRoundMaxEC− ECi. Since PreviousRoundMaxEC ≥ ECi, we have Bi ≥ 0.
Flow i is allowed to schedule packets (if any) as long as its balance is positive (i.e., Bi ≥ 0). To
ensure a small work progress gap between two resources, the scheduler keeps checking the sequence
number stamped to the packet that is being processed on the last resource3 and compares it with flow
i’s sequence number in the previous round. The scheduler waits until the former exceeds the latter, at
which time the progress gap between any two resources is within 1 round. The scheduler then dequeues
a packet from the input queue of flow i, stamps the sequence number of flow i, and performs deep
packet processing on CPU, which is also the first middlebox resource required by the packet. After CPU
3If no packet is being processed, we take the sequence number of the packet that has recently been processed.
processing, the scheduler knows exactly how the packet should be processed next and what resources
are required. The packet is then pushed to the buffer of the next resource for processing. Meanwhile,
the packet processing time on each resource can also be accurately estimated after CPU processing, e.g.,
using some simple packet profiling technique introduced in [4]. The scheduler then deducts the dominant
processing time of the packet from flow i’s balance. The service for flow i continues until flow i has no
packet to process or its balance becomes negative.
If flow i is no longer active after service in the current round, its excess counter will be reset to 0.
Otherwise, flow i is appended to the tail of the active list for service in the next round. In this case, a
new sequence number is associated with flow i. The excess counter ECi is also updated as the account
deficit of flow i. Finally, before serving the next flow, the scheduler updates MaxEC and decreases
RoundRobinCounter by 1, indicating that one flow has already finished service in the current round.
3.4.4 Performance Analysis
In this section, we analyze the performance of MR3 by deriving its time complexity, fairness, and delay
bound. We also compare MR3 with the existing multi-resource fair queueing design, e.g., DRFQ [4].
Complexity
MR3 is highly efficient as compared with DRFQ. One can verify that under MR3, at least one packet is
scheduled for each flow in one round. Formally, we have
Theorem 1. MR3 makes the scheduling decisions in O(1) time per packet.
Proof: We prove the theorem by showing that both enqueuing a packet (Module 1) and scheduling
a packet (Module 2) finish within O(1) time.
In Module 1, determining the flow to which a new packet belongs is an O(1) operation. By maintaining
the per-flow state, the scheduler knows if the flow is contained in the active list (line 2) within O(1)
time. Also, updating the sequence number (line 3 to 5), appending the flow to the active list (line 6),
and enqueueing a packet (line 8) are all of O(1) time complexity.
We now analyze the time complexity of scheduling a packet. In Module 2, since the quantum deposited
into each flow’s account is the maximum excessive consumption incurred in the previous round, a flow
will always have a positive balance at the beginning of each round, i.e., Bi ≥ 0 in line 15. As a
result, at least one packet is scheduled for each flow in one round. The time complexity of scheduling
a packet is therefore no more than the time complexity of all the operations performed during each
service opportunity. These operations include determining the next flow to be served, removing the flow
from the head of the active list and possibly adding it back at the tail, all of which are O(1) operations
if the active list is implemented as a linked list. Additional operations include updating the sequence
number, MaxEC, PreviousRoundMaxEC, RoundRobinCounter, and dequeuing a packet. All of them are
also executed within O(1) time. □
Fairness
In addition to the low scheduling complexity, MR3 provides near-perfect fairness across flows. To see
this, let EC_i^k be the excess counter of flow i after round k, and MaxEC^k the maximum EC_i^k over all flows i. Let D_i^k be the dominant service flow i receives in round k. Also, let L_i be the maximum packet processing time of flow i across all resources. Finally, let L be the maximum packet processing time across all flows, i.e., L = max_i {L_i}. We show that the following lemmas and corollaries hold throughout
the execution of MR3. The proofs are given in Sec. 3.8.1.
Lemma 2. EC_i^k ≤ L_i for all flow i and round k.
Corollary 1. MaxEC^k ≤ L for all round k.
Lemma 3. For all flow i and round k, we have
D_i^k = MaxEC^{k−1} − EC_i^{k−1} + EC_i^k ,    (3.4)

where EC_i^0 = 0 and MaxEC^0 = 0.
Corollary 2. D_i^k ≤ 2L for all flow i and round k.
We are now ready to bound the difference of dominant services received by two flows under MR3.
Recall that Ti(t1, t2) denotes the dominant service flow i receives in time interval (t1, t2). We have the
following theorem.
Theorem 2. Under MR3, the following relationship holds for any two flows i, j that are backlogged in
any time interval (t1, t2):
|T_i(t1, t2) − T_j(t1, t2)| ≤ 6L.
Proof: Suppose at time t1 (resp., t2), the work progress of MR3 on resource 1 is in round R1 (resp.,
R2). For any flow i, we upper bound Ti(t1, t2) as follows. At time t1, the work progress on the last
resource must be at least in round R1 − 1, as the progress gap between the first and the last resource is
bounded by 1 round under MR3. Also note that at time t2, the progress on the last resource is at most
in round R2. As a result, flow i receives dominant services at most in rounds R1 − 1, R1, . . . , R2, i.e.,
T_i(t1, t2) ≤ Σ_{k=R1−1}^{R2} D_i^k
            = Σ_{k=R1−1}^{R2} MaxEC^{k−1} − EC_i^{R1−2} + EC_i^{R2}
            ≤ Σ_{k=R1−1}^{R2} MaxEC^{k−1} + L ,    (3.5)
where the equality is derived from Lemma 3, and the last inequality is derived from Lemma 2.
We now derive the lower bound of Ti(t1, t2). Note that at time t1, the algorithm has not yet started
round R1 + 1, while at time t2, the algorithm must have finished all processing of round R2 − 2 on the
last resource. As a result, all packets that belong to rounds R1 + 1, . . . , R2 − 2 have been processed on
all resources within (t1, t2). We then have
T_i(t1, t2) ≥ Σ_{k=R1+1}^{R2−2} D_i^k
            = Σ_{k=R1+1}^{R2−2} MaxEC^{k−1} − EC_i^{R1} + EC_i^{R2−2}
            ≥ Σ_{k=R1+1}^{R2−2} MaxEC^{k−1} − L
            ≥ Σ_{k=R1−1}^{R2} MaxEC^{k−1} − 5L ,    (3.6)
where the last inequality is derived from Corollary 1.
For notational simplicity, let X = Σ_{k=R1−1}^{R2} MaxEC^{k−1}. By (3.5) and (3.6), we have

X − 5L ≤ T_i(t1, t2) ≤ X + L .    (3.7)

Note that this inequality also holds for flow j, i.e.,

X − 5L ≤ T_j(t1, t2) ≤ X + L .    (3.8)
Taking the difference between (3.7) and (3.8) leads to the statement. □
Corollary 3. The RFB of MR3 is 6L.
Based on Theorem 2 and Corollary 3, we see that MR3 bounds the difference between (normalized)
dominant services received by two backlogged flows in any time interval by a small constant. Note that
the interval (t1, t2) may be arbitrarily large. MR3 therefore achieves nearly perfect DRF across all active
flows, irrespective of their traffic patterns.
We note that this is a stronger fairness guarantee than the one provided by existing multi-resource
fair queueing schemes, e.g., DRFQ, which require that all packets of a flow have the same dominant
resource throughout each backlogged period (referred to as the resource monotonicity assumption in [4]).
Delay Analysis
In addition to the complexity and fairness, latency is also an important concern for a packet scheduling
algorithm. Recall that in Sec. 3.3.3, the scheduling delay is defined as the latency from the time when a
packet reaches the head of the input queue to the time when the scheduler finishes processing the packet
on all resources. Before we proceed to the general analysis for all packets, let us focus on the scheduling
delay of the very first packet of a flow, referred to as the startup latency. In particular, let m be the
number of resources concerned, and n the number of backlogged flows. We have the following theorem.
The proof is given in Sec. 3.8.2.
Theorem 3. When flows are assigned the same weights, the startup latency SL of any newly backlogged
flow is bounded by
SL ≤ 2(m+ n− 1)L . (3.9)
Based on the analysis of the startup latency, we next derive the upper bound of the scheduling delay.
The proof is given in Sec. 3.8.3.
Theorem 4. When flows are assigned the same weights, for any packet p, the scheduling delay D(p) is
bounded by
D(p) ≤ (4m+ 4n− 2)L . (3.10)
Table 3.1 summarizes the derived performance of MR3, in comparison with DRFQ [4]. We see that
MR3 significantly reduces the scheduling complexity per packet, while providing near-perfect fairness in
a more general case with arbitrary traffic patterns. The price we pay, however, is longer startup latency
for newly active flows. Since the number of middlebox resources is typically much smaller than the
number of active flows, i.e., m� n, the startup latency bound of MR3 is approximately two times that
of DRFQ, i.e.,
2(m+ n− 1)L ≈ 2nL.
Table 3.1: Performance comparison between MR3 and DRFQ, where L is the maximum packet processing time, m is the number of resources, and n is the number of active flows.

Performance        | MR3              | DRFQ
Complexity         | O(1)             | O(log n)
Fairness* (RFB)    | 6L               | 2L
Startup Latency    | 2(m + n − 1)L    | nL
Scheduling Delay   | (4m + 4n − 2)L   | Unknown

* The fairness analysis of DRFQ requires the resource monotonicity assumption [4]. Under the same assumption, the RFB of MR3 is 4L [74].
Also note that, since the scheduling delay is usually hard to analyze, no analytical delay bound is given
in [4]. We shall experimentally compare the latency performance of MR3 and DRFQ later in Sec. 3.4.5.
3.4.5 Simulation Results
As a complementary study of the above theoretical analysis, we evaluate the performance of MR3 via
extensive simulations. First, we would like to confirm experimentally that MR3 offers predictable service
isolation and is superior to the naive first-come-first-served (FCFS) scheduler, as the theory indicates.
Second, we want to confirm that MR3 can quickly adapt to traffic dynamics and achieve nearly perfect
DRF across flows. Third, we compare the latency performance of MR3 with DRFQ [4] to see if the
extremely low time complexity of MR3 is achieved at the expense of significant packet delay. Fourth, we
investigate how sensitive the performance of MR3 is when packet size distributions and arrival patterns
change. Finally, we evaluate the scenario where flows are assigned uneven weights, under which we
compare the delay performance of weighted MR3 with DRFQ.
General Setup
All simulation results are based on our event-driven packet simulator, written in 3,000 lines of C++
code. We assume resources are consumed serially, with CPU processing first, followed by link trans-
mission. We implement 3 schedulers, FCFS, DRFQ and MR3. The last two inspect the flows’ input
queues and decide which packet should be processed next. By default, packets follow Poisson arrivals,
unless otherwise stated. The simulator models resource consumption of packet processing in 3 typical
middlebox modules, each corresponding to one of the flow types: basic forwarding, per-flow statistical
monitoring, and IPSec encryption. The first two modules are bandwidth-bound, with statistical mon-
itoring consuming slightly more CPU resources than basic forwarding, while IPSec is CPU intensive.
For direct comparison, we set the packet processing times required for each middlebox module the same
as those in [4], which are based on real measurements. In particular, the CPU processing time of each
module is observed to follow a simple linear model based on packet size. Table 3.2 summarizes the detailed parameters, based on the measurement results reported in [4]. The link transmission time is proportional to the packet size, and the output bandwidth of the middlebox is set to 200 Mbps, the same as the experiment environment in [4]. Each simulation experiment spans 30 s.

Table 3.2: Linear model for CPU processing time in 3 middlebox modules. Model parameters are based on the measurement results reported in [4].

Module                 | CPU processing time (µs)
Basic Forwarding       | 0.00286 × PacketSizeInBytes + 6.2
Statistical Monitoring | 0.0008 × PacketSizeInBytes + 12.1
IPSec Encryption       | 0.015 × PacketSizeInBytes + 84.5

[Figure 3.6: Dominant services and packet throughput received by different flows under FCFS and MR3. Flows 1, 11 and 21 are ill-behaving. (a) Dominant service received. (b) Packet throughput of flows.]
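For reference, the linear model of Table 3.2 maps directly to a small per-module cost function; the sketch below mirrors those coefficients (the enum and function name are ours, not taken from the simulator):

// CPU processing time (in microseconds) as a linear function of packet size,
// using the per-module parameters of Table 3.2 (taken from [4]).
enum class Module { BasicForwarding, StatisticalMonitoring, IPSecEncryption };

double cpuProcessingTimeUs(Module m, int packetSizeInBytes) {
    switch (m) {
        case Module::BasicForwarding:       return 0.00286 * packetSizeInBytes + 6.2;
        case Module::StatisticalMonitoring: return 0.0008  * packetSizeInBytes + 12.1;
        case Module::IPSecEncryption:       return 0.015   * packetSizeInBytes + 84.5;
    }
    return 0.0;  // unreachable
}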
Service Isolation
We start off by confirming that MR3 offers nearly perfect service isolation, which naive FCFS fails to
provide. We initiate 30 flows that send 1400-byte UDP packets. Flows 1 to 10 undergo basic forwarding;
11 to 20 undergo statistical monitoring; 21 to 30 undergo IPSec encryption. We generate 3 rogue flows,
i.e., 1, 11 and 21, each sending 10,000 pkts/s. All other flows behave normally, each sending 1,000
pkts/s. Fig. 3.6a shows the dominant services received by different flows under FCFS and MR3. We see
that under FCFS, rogue flows grab an arbitrary share of middlebox resources, while under MR3, flows
receive fair services on their dominant resources. This result is further confirmed in Fig. 3.6b: Under
FCFS, the presence of rogue flows squeezes normal traffic to almost zero. In contrast, MR3 ensures that
all flows receive deserved, though uneven, throughput based on their dominant resource requirements,
irrespective of the presence and (mis)behaviour of other traffic.
[Figure 3.7: Latency comparison between DRFQ and MR3. (a) Startup latency of flows. (b) CDF of the scheduling delay.]
Latency
We next evaluate the latency penalty MR3 pays for its extremely low time complexity, in comparison
with DRFQ [4]. We implement DRFQ and measure the startup latency as well as the single packet delay
of both algorithms. In particular, 150 flows with UDP packets are initiated sequentially, where flow 1
becomes active at time 0, followed by flow 2 at time 0.2 second, and flow 3 at time 0.3 second, and so on.
A flow is randomly assigned to one of the three types. To congest the middlebox resources, the packet
arrival rate of each flow is set to 500 pkts/s, and the packet size is uniformly drawn from 200 bytes
to 1400 bytes. Fig. 3.7a depicts the per-flow startup latency using both DRFQ and MR3. Clearly, the
dense and sequential flow starting times in this example represent a worst-case scenario for a round-robin
scheduler. We see that under MR3, flows joining the system later see larger startup latency, while under
DRFQ, the startup latency is relatively consistent. This is because under MR3, a newly active flow will
have to wait for a whole round before being served. The more active flows, the more time is required to
finish serving one round. As a result, the startup latency is linearly dependent on the number of active
flows. While this is also true for DRFQ in the worst-case analysis (see Table 3.1), our simulation results
show that on average, the startup latency of DRFQ is smaller than MR3. However, we see next that
this advantage of DRFQ comes at the expense of highly uneven single packet delays.
Compared with the startup latency, single packet delay is a much more important delay metric.
As we see from Fig. 3.7b, MR3 exhibits more consistent packet delay performance, with all packets
delayed less than 15 ms. In contrast, the latency distribution of DRFQ is observed to have a long tail:
90% packets are delayed less than 5 ms while the rest 10% are delayed from 5 ms to 50 ms. Further
investigation reveals that these 10% packets are uniformly distributed among all flows. All results above
indicate that the low time complexity and near-perfect fairness of MR3 are achieved at the expense of
only a slight increase in packet latency.
Dynamic Allocation
We further investigate if the DRF allocation achieved by MR3 can quickly adapt to traffic dynamics. To
congest middlebox resources, we initiate 3 UDP flows each sending 20,000 1400-byte packets per second.
Flow 1 undergoes basic forwarding and is active in time interval (0, 15). Flow 2 undergoes statistical
monitoring and is active in two intervals (3, 10) and (20, 30). Flow 3 undergoes IPSec encryption and
is active in (5, 25). The input queue of each flow can cache up to 1,000 packets. Fig. 3.8 shows the
resource share allocated to each flow over time. Since flow 1 is bandwidth-bound and is the only active
flow in (0, 3), it receives 20% CPU share and all bandwidth. In (3, 5), both flows 1 and 2 are active.
They equally share the bandwidth on which both flows bottleneck. Later, when flow 3 becomes active
at time 5, all three flows are backlogged in (5, 10). Because flow 3 is CPU-bound, it grabs only 10%
bandwidth share from flows 1 and 2, respectively, yet is allocated 40% CPU share. Similar DRF allocation is
also observed in subsequent time intervals. We see that MR3 quickly adapts to traffic dynamics, leading
to nearly perfect DRF across flows.
Sensitivity
We also evaluate the performance sensitivity of MR3 under a mixture of different packet size distributions
and arrival patterns. The simulator generates 24 UDP flows with arrival rate 10,000 pkts/s each. Flows 1
to 8 undergo basic forwarding; 9 to 16 undergo statistical monitoring; 17 to 24 undergo IPSec encryption.
The 8 flows passing through the same middlebox module are further divided into 4 groups. Flows in group
1 send large packets of 1400 bytes; flows in group 2 send small packets of 200 bytes; flows in group
3 send bimodal packets that alternate between small and large; flows in group 4 send packets with random
sizes uniformly drawn from 200 bytes to 1400 bytes. Each group contains exactly 2 flows, with exponential
and constant packet interarrival times, respectively. The input queue of each flow can cache up to 1,000
packets. Fig. 3.9a shows the dominant services received by all 24 flows, where no particular pattern
is observed in response to changes in the packet size and arrival distributions. Figs. 3.9b, 3.9c and 3.9d
show the average single packet delay observed in three middlebox modules, respectively. We find that
while the latency performance is highly consistent under different arrival patterns, it is affected by the
distribution of packet size. In general, flows with small packets are slightly preferred and will see smaller
latency than those with large packets. Similar preference for small-packet flows has also been observed
in our experiments with DRFQ.
[Figure 3.8: MR3 can quickly adapt to traffic dynamics and achieve DRF across all 3 flows. The three panels show the CPU share, bandwidth share, and dominant share of flows 1–3 over time.]
3.4.6 MR3 for the Weighted Flows
So far, we have assumed that all flows have the same weight and are expected to receive the same level
of service. For the purpose of service differentiation, flows may be assigned different weights, based on
their respective Quality of Service (QoS) requirements. A scheduling algorithm should therefore provide
weighted fairness across flows. Specifically, the packet processing time a flow receives on the dominant
resources of its packets should be in proportion to its weight, over any backlogged periods.
A minor modification to the original MR3 is sufficient to achieve weighted fairness. In line 15 of
Algorithm 2, after packet processing, the scheduler deducts the dominant processing time of the packet,
normalized to its flow’s weight, from that flow’s balance, i.e.,
B_i = B_i − DominantProcessingTime(p)/w_i ,    (3.11)
where wi is the weight associated with flow i. In other words, balance Bi now tracks the normalized
dominant processing time available to flow i. All the other parts of the algorithm remain the same as its unweighted version. The resulting scheduler is referred to as weighted MR3.

[Figure 3.9: Fairness and delay sensitivity of MR3 in response to mixed packet sizes and arrival distributions. (a) Dominant service received. (b) Basic forwarding. (c) Statistical monitoring. (d) IPSec encryption. Panels (b)–(d) report the average single packet delay.]
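Concretely, the only change relative to the unweighted scheduler is this charging step; a one-line sketch of Eq. (3.11), with illustrative local names:

// Charging step of weighted MR3 (cf. Algorithm 2, line 15, and Eq. (3.11)): the
// dominant processing time is normalized by the flow's weight before being
// deducted from the flow's balance. Local names are illustrative only.
inline double chargeWeighted(double balance, double dominantTime, double weight) {
    return balance - dominantTime / weight;
}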
All the analyses presented in Sec. 3.4.4 can also be extended to weighted MR3. Without loss of
generality, we assume the flow weights (wi’s) are normalized such that
Σ_{i=1}^{n} w_i = 1.
It is easy to verify that having flows assigned uneven weights will have no impact on the scheduling
complexity of the algorithm. Formally, we have
Theorem 5. Weighted MR3 makes scheduling decisions in O(1) time per packet.
Moreover, the following theorem ensures that weighted MR3 provides near-perfect weighted fairness
across flows, as the difference between the normalized dominant services received by two flows is bounded
by a small constant. The proof is similar to the unweighted scenario and is briefly outlined in Sec. 3.8.4.
Theorem 6. Under weighted MR3, the following relationship holds for any two flows i, j that are
backlogged in any time interval (t1, t2):
| T_i(t1, t2)/w_i − T_j(t1, t2)/w_j | ≤ 6 max_{1 ≤ i ≤ n} L_i/w_i .
Finally, to analyze the delay performance of weighted MR3, let
W = max_{1 ≤ i, j ≤ n} w_i/w_j .    (3.12)
The following two theorems bound the startup latency and scheduling delay for weighted MR3, respec-
tively. Their proofs resemble the unweighted case and are briefly described in Sec. 3.8.5.
[Figure 3.10: An MR3 schedule fails to offer weight-proportional delay when flows are assigned uneven weights. P^i_k denotes the kth packet of flow i.]
Theorem 7. Under weighted MR3, the startup latency is bounded by
SL ≤ (1 +W )(m+ n− 1)L .
Theorem 8. Under weighted MR3, for any flow i, the scheduling delay for its packet p is bounded by
D_i(p) ≤ 2(m + W)^2 L/w_i .
By Theorem 8, we see that the delay bound of weighted MR3 critically depends on the weight
distributions. When flows are assigned significantly uneven weights (i.e., W ≫ 1), the delay bound may
become very large. While our analysis is based on the worst case, the following example shows that MR3
fails to offer a weight-proportional delay bound.
Suppose there are 6 flows competing for both middlebox CPU and link bandwidth. Each packet of
flow 1 requires 1 time unit for CPU processing and 2 for link transmission. Each packet of other flows
requires 2 time units for CPU processing and 1 for link transmission. Flow 1 has weight 1/2, while flows 2
to 6 each have weight 1/10. The amount of credits flow 1 receives in one round is hence 5 times that given
to the other flows. Fig. 3.10 illustrates an MR3 schedule, where P^i_k denotes the kth packet of flow i. We
see that in every round, flow 1 schedules five packets while each of the other flows schedules only one.
Such a schedule offers weight-proportional services (the dominant service flow 1 receives is 5 times
that of each other flow) but not weight-proportional delay. The maximum packet scheduling delay flow
1 experiences is 13 time units (e.g., packet P^1_6 is ready for service at time 5 but finishes service at time
18), more than half of that experienced by the other flows (e.g., packet P^2_2 has been delayed by 20 time
units).
In addition to Theorem 8 and the example above, our simulation results presented in Sec. 3.5.8
further confirm that MR3 may incur large delays under uneven flow weights. MR3 is hence unsuitable
to schedule weighted flows.
To summarize, we have seen two alternative approaches used to design a multi-resource scheduler,
the timestamp-based DRFQ scheduler [4] and the round-robin variant of MR3. However, none of these
schedulers can achieve all the desirable scheduling properties introduced in Sec. 3.3. We note that sim-
ilar problems have also been a major challenge in the long evolution of single-resource fair queueing
algorithms for bandwidth sharing, where timestamp-based schemes and round robin are the two de-
sign approaches. The former provides tight delay bounds, but requires high complexity to sort packet
timestamps (e.g., [21–23, 26, 31]). The latter approach, on the other hand, has O(1) complexity, yet
incurs high scheduling delays (e.g., [25, 67, 68]). To achieve the best of both worlds, one approach is to
combine the fairness and delay properties of timestamp-based algorithms with the low time complexity
of round-robin schemes [71–73, 75, 76]. This is typically done by grouping flows into a small number
of classes. The scheduler then uses the timestamp-based algorithm to determine which class to serve.
Within a class, the scheduling resembles a round-robin scheme. While this strategy turns out to be an
effective approach for single-resource bandwidth sharing, generalizing it to schedule multiple resource
types imposes non-trivial technical challenges. As we shall show in the later sections, despite that flows
may have different dominant resources with different processing progresses, the scheduler has to maintain
a weight-proportional dominant service level across all flows. We shall describe our solution to answer
this challenge in the next section.
3.5 Group Multi-Resource Round Robin
In this section, we present an improved round-robin scheduler, referred to as Group Multi-Resource
Round Robin (GMR3), that addresses the delay problem of MR3. GMR3 not only provides a weight-
proportional delay bound, but also achieves near-perfect fairness in O(1) time.
3.5.1 Basic Intuition
While round robin may incur high delay in a general scenario, Theorem 8 indicates that it provides
a good delay bound when flows are of similar weights. In other words, if we group flows with similar
weights to a flow group, then within a group, round robin serves as an excellent scheduler. The challenge
is to schedule inter-group flows with different weights.
We have observed that in MR3, flows are always served in a “burst” mode under uneven flow weights
(see Sec. 3.4.3). For example, in Fig. 3.10, flow 1 schedules 5 packets in a row in round 1, and has to
wait for an entire round to schedule its next packet P^1_6 in round 2, resulting in a long scheduling delay of
that packet. Instead of serving flows in a “burst” mode, a better strategy is to spread their scheduling
opportunities over time, in proportion to their respective weights. Fig. 3.11 illustrates an improved
schedule over MR3 in Fig. 3.10, where the scheduling opportunities of flow 1 are interleaved between
those of other flows. Compared to the MR3 schedule in Fig. 3.10, the maximum scheduling delay of flow
1 is reduced from 13 to 5 (shown in Fig. 3.11), and the delay of other flows is also reduced from 20 to 16.
Note that such a delay improvement is achieved without compromising fairness: the dominant services
each flow receives under this schedule remain proportional to its assigned weight.

Figure 3.11: An improved schedule over MR3 in the example of Fig. 3.10, where the scheduling delay is significantly reduced. P^i_k denotes the kth packet of flow i.
Our design follows exactly this intuition. The algorithm aggregates flows with similar weights into a
flow group, and makes scheduling decisions in a two-level hierarchy. At the higher level, the algorithm
makes inter-group scheduling decisions to determine a flow group, with the objective of distributing the
scheduling opportunities over time, in proportion to the approximate weights of flows. Within a group,
the intra-group scheduler serves flows in a round-robin fashion. We shall show in Sec. 3.5.7 that this
simple combination leads to remarkable performance guarantees. For now, we focus on the detailed
design in the following subsections.
3.5.2 Flow Grouping
Suppose there are n backlogged flows sharing m middlebox resources. Without loss of generality, let the
flow weights wi be normalized such that

∑_{i=1}^{n} wi = 1 .
The scheduler collects flows with similar weights into a flow group. Specifically, flow group Gk is defined
as
Gk = {i : 2^{−k} ≤ wi < 2^{−k+1}},   k = 1, 2, . . .      (3.13)
Thus, the weights of any two flows belonging to the same flow group are within a factor of 2 of each
other.
A similar grouping strategy has also been adopted in many single-resource fair queueing designs [71–73].
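For example, under this rule a flow with weight wi = 1/2 satisfies 2^{−1} ≤ wi < 2^0 and therefore falls in G1, whereas a flow with weight wi = 1/10 satisfies 2^{−4} ≤ wi < 2^{−3} and falls in G4; these are exactly the groups used for the flows in the running example of Sec. 3.5.3.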
Figure 3.12: An illustration of the scheduling rounds of flow groups, where R^k_l denotes the scheduling round l of flow group Gk.
The significance of this grouping strategy is that it leads to a small number of flow groups ng, bounded
by ng ≤ log_2 W. For practical flow weight distributions, ng ≤ 40, and it can hence be safely treated as
a small constant. This significantly reduces the complexity of the inter-group scheduling.
3.5.3 Inter-Group Scheduling
The inter-group scheduler determines a flow group to potentially schedule a flow. Each group is associated
with a timestamp, and the one with the earliest timestamp is selected. With appropriate timestamps, the
scheduling opportunities of a flow group would be weight-proportionally distributed over time. Given a
small number of flow groups ng, the complexity of sorting the group timestamps is also a small constant
O(log ng). Among various timestamp-based algorithms, we find that [71] is particularly attractive for
multi-resource extension, due to its simple timestamp computation. In contrast, extending other inter-
group scheduling algorithms (e.g., [72, 73]) to multiple resources would have required referring to the
idealized fluid DRGPS model [66], incurring relatively high computational complexity.
The scheduler maintains an accounting mechanism consisting of a sequence of virtual slots, indexed
by 0, 1, 2, . . . . Each slot is exclusively assigned to one flow, and is the scheduling opportunity of this
flow. Each flow group Gk is associated with a set of scheduling rounds, each spanning 2^k contiguous
slots. The first scheduling round of flow group Gk, denoted R^k_1, starts at slot 0 and ends at slot 2^k − 1,
while the second scheduling round, denoted R^k_2, starts at slot 2^k and ends at slot 2^{k+1} − 1, and so on.
Fig. 3.12 gives an example. Note that the scheduling rounds of different flow groups overlap by design.
The scheduler assigns each backlogged flow i ∈ Gk exactly one slot per scheduling round of flow group
Gk. This allows flow i to receive one scheduling opportunity every 2^k slots, roughly matching the flow's
weight (i.e., 2^{−k} ≤ wi < 2^{−k+1}). The scheduling opportunities of flows are hence weight-proportionally
distributed over time.
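To make the slot-and-round accounting concrete, the sketch below (an illustrative C++ fragment; the struct and function names are ours and not part of the thesis prototype) computes, for a group Gk and the current virtual slot, the index of the current scheduling round together with its last slot, which is exactly the group timestamp used by the inter-group scheduler described next.

    // Illustrative C++ sketch (not the thesis prototype): scheduling-round
    // accounting for flow group G_k, whose rounds each span 2^k virtual slots.
    #include <cstdint>

    struct RoundInfo {
        uint64_t round;       // 1-indexed round number, i.e., R^k_round
        uint64_t start_slot;  // first slot of the round
        uint64_t end_slot;    // last slot of the round; used as the group timestamp
    };

    RoundInfo CurrentRound(int k, uint64_t slot) {
        uint64_t span  = uint64_t{1} << k;   // 2^k slots per round
        uint64_t index = slot / span;        // 0-indexed round number
        return { index + 1, index * span, (index + 1) * span - 1 };
    }
    // Example: for G_1 (k = 1) at slot 0, the round is R^1_1 spanning slots 0-1
    // (timestamp 1); for G_4 (k = 4) it is R^4_1 spanning slots 0-15 (timestamp 15).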
Following the terminology used in [71], a flow group is called active if it contains at least one backlogged
flow. A backlogged flow i ∈ Gk is called pending if it has not yet been assigned a slot in the current
scheduling round of Gk. A flow group is called pending if it contains at least one pending flow.

Figure 3.13: An illustration of the inter-group scheduler selecting flow groups in the example of Fig. 3.10. Within a group, the flow is determined in a round-robin manner by the intra-group scheduler.
For every virtual slot t, the inter-group scheduler chooses among all pending flow groups the one with
the earliest timestamp, defined as the ending slot of the current scheduling round of that flow group. Ties
are broken arbitrarily. From the selected flow group, the intra-group scheduler then chooses a pending
flow and assigns it the current slot t (with details to be described in Sec. 3.5.4). A flow temporarily
ceases to be pending once it has been assigned a slot in the current scheduling round of its flow group,
and will become pending again at the beginning of the next scheduling round, if it remains backlogged.
If no group is pending in slot t, the slot is skipped. Algorithm 3 summarizes this inter-group scheduling
process.
Algorithm 3 InterGroupScheduling
1: t = 0
2: P = {flow groups that are pending in slot 0}
3: while TRUE do
4:   Choose Gk ∈ P, where Gk has the earliest timestamp
5:   IntraGroupScheduling(Gk)
6:   P = P − Gk if Gk is no longer pending
7:   if P = ∅ then
8:     Do nothing until there is a backlogged flow
9:     Advance t to the next slot with pending flows
10:  else
11:    t = t + 1
12:  end if
13:  P = P ∪ {flow groups that become pending in slot t}
14: end while
Fig. 3.13 illustrates an example of the inter-group scheduler assigning slots to groups and flows in
the example of Fig. 3.4a. Flow 1 belongs to G1 as its weight is 1/2, while flows 2 to 6 are grouped to
G4 as each of them weighs 1/10. At slot 0, both G1 and G4 are pending, with the end of their current
scheduling rounds (i.e., R^1_1 and R^4_1 in Fig. 3.13) at slot 1 and slot 15, respectively. The inter-group
scheduler hence picks G1, from which the intra-group scheduler selects flow 1 as it is the only backlogged
flow in G1. Flow 1 then schedules its packets for processing and ceases to be pending in the current
scheduling round R^1_1. As a result, at slot 1, only flow group G4 is pending and is hence selected, from
which the intra-group scheduler selects flow 2 and assigns it the slot. Flow 1 becomes pending again in
slot 2 as a new scheduling round of its flow group G1 starts (i.e., R^1_2 in Fig. 3.13), and is selected because
its current scheduling round R^1_2 ends earlier than R^4_1, the current scheduling round of G4. Flow groups
G1 and G4 are hence selected alternately in the following slots until slot 9, after which all flows of G4
have been assigned slots in the current scheduling round R^4_1 and cease to be pending in the remaining slots
of R^4_1. For this reason, slots 11, 13, and 15 are not assigned to any flow, as no flow is pending at those
slots. All these slots are simply skipped by the scheduler. It is worth mentioning that the schedule shown
in Fig. 3.13 only gives the scheduling order of flows. The corresponding packet-level schedule is further
determined by the intra-group scheduler and is shown in Fig. 3.14, where f^l_i denotes the actual packet
processing for flow i in scheduling round l of its flow group.

Figure 3.14: The schedule determined by GMR3 in the example of Fig. 3.10, where f^l_i denotes the packet processing for flow i ∈ Gk in scheduling round l of its flow group Gk. The slot axis is only for the accounting mechanism, while the time axis shows elapsed real time.
Inter-group scheduling distributes flows’ scheduling slots over time, in proportion to their approximate
weights. While this achieves coarse-grained fairness for single-resource scheduling [71–73], it is not the
case in the multi-resource setting, as the flows may not receive dominant services in their assigned
slots. For example, in Fig. 3.14, flow 1 is assigned slot 0 in time interval [0, 1), but receives dominant
services (i.e., link transmission) later in [1, 3). Flow 2, on the other hand, always receives dominant
services (i.e., CPU processing) in its assigned slots. Due to this service asynchronicity, distributing
flows’ scheduling slots weight-proportionally over time does not ensure that the opportunities of receiving
dominant services are also distributed weight-proportionally. Without appropriate control, the potential
service asynchronicity may lead to poor fairness with extremely unbalanced resource utilization. This
is also the key challenge of multi-resource scheduling as compared with its single-resource counterpart
(e.g., [71–73, 76]). We show in the next subsection that this challenge can be effectively addressed by a
carefully designed intra-group scheduler.
3.5.4 Intra-Group Scheduling
Once the flow group is determined, the intra-group scheduler chooses a pending flow from that group
in a round-robin manner. Compared to round robin for bandwidth sharing (e.g., [25,67,68,71–73,76]),
the intra-group scheduler operates with two important differences. First, for the purpose of DRF,
the scheduler maintains a credit system to keep track of the dominant service a flow receives, rather than the
number of bits a flow transmits. Second, the scheduler employs a progress control mechanism to reinforce
a relatively consistent work progress across resources, so as to eliminate the adverse effects caused by
the aforementioned service asynchronicity.
Credit System: Every time a flow i is assigned a slot, it receives a credit ci (whose size is given in
(3.14) below), which is the time given to the flow for packet processing on its dominant resource in the
current scheduling round. As long as there are available credits, flow i is allowed to schedule a packet
for processing, and the corresponding packet processing time on the dominant resource is deducted from
its total credit. A flow i can overdraw the processing time by scheduling at most one more packet than
those allowed by the available credits. The excessive consumption of dominant services is tracked by the
excess counter ei, and will be deducted from the credit given in the next scheduling round as a penalty
of overconsumption.
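The bookkeeping this credit system requires per flow is minimal. The following C++ sketch is purely illustrative (field and function names are ours, not those of the thesis prototype) and mirrors the description above: a fixed credit is granted when the flow is assigned a slot, the dominant processing time of each scheduled packet is deducted, and any overdraft is carried into the next round.

    // Illustrative per-flow credit bookkeeping for GMR3 (not the thesis prototype).
    struct FlowCreditState {
        double balance = 0.0;  // remaining credit (time on the dominant resource)
        double excess  = 0.0;  // overdraft carried over from the previous round
    };

    // Called when flow i is assigned a slot: grant the fixed credit c_i of (3.14),
    // minus the overdraft carried over from the previous scheduling round.
    void GrantCredit(FlowCreditState& f, double c_i) {
        f.balance = c_i - f.excess;
    }

    // Deduct the dominant processing time of each scheduled packet; scheduling
    // continues while the balance is non-negative, so the flow may overdraw by at
    // most one packet. A negative final balance becomes next round's excess.
    void ChargePacket(FlowCreditState& f, double dominant_time) {
        f.balance -= dominant_time;
    }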
While MR3 adopts a similar credit system in its design (see Sec. 3.4.3), the intra-group scheduler of
GMR3 operates with an important difference. Every time a flow i is assigned a slot, instead of receiving
an elastic amount of credits in different rounds, it is given a fixed-size credit that is proportional to its
weight wi. Specifically, for flow i ∈ Gk, the given credit ci is
ci = 2^k L wi ,      (3.14)
where L is the maximum packet processing time. The motivation for defining credit in this manner is
two-fold.
To begin with, even if two flows i, j belong to the same group Gk, flow i’s weight wi may be up
to twice as large as wj . Despite their weight difference, both flows are assigned exactly one slot per
scheduling round of Gk. Therefore, to ensure weight-proportional dominant services, the given credits
as shown in (3.14) are proportional to their respective weights.
Moreover, for each flow i ∈ Gk, since 2^{−k} ≤ wi < 2^{−k+1}, the scaling factor 2^k L in (3.14) ensures that

L ≤ ci < 2L .      (3.15)
Because the given credits are larger than the maximum packet processing time, they can always com-
pensate for the overconsumption of dominant service flow i incurs in the previous scheduling round. As
a result, flow i will always have available credits when assigned a slot, and can schedule at least one
packet. In addition, by (3.15), the given credits are roughly the same across all flow groups. This is
important because flow i ∈ Gk is already assigned slots in proportion to its approximate weight 2^{−k}, so that
in each slot, the scheduler should allocate approximately the same amount of dominant service to every flow.
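As a quick numerical check with the weights of the running example: a flow with wi = 1/2 belongs to G1 and is granted ci = 2^1 · L · (1/2) = L per scheduling round, while a flow with wi = 1/10 belongs to G4 and is granted ci = 2^4 · L · (1/10) = 1.6L; both values indeed lie in the interval [L, 2L) of (3.15).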
Progress Control Mechanism: In addition to the credit system, the scheduler also employs a
progress control mechanism to reinforce a relatively consistent processing rate across resources. Specifi-
cally, whenever a flow i ∈ Gk is assigned a slot t in the scheduling round l of Gk, the scheduler checks the
work progress on the last resource (usually the link bandwidth). If flow i has already received services
on the last resource in the previous scheduling round l − 1, or flow i is a new arrival, then its packet is
scheduled immediately. Otherwise, the scheduler defers packet scheduling until flow i starts to receive
service on the last resource in the previous scheduling round l − 1 of Gk. For example, as shown in
Fig. 3.14, in slot 12, the packet processing for flow 1 (i.e., f^7_1) is withheld in round 7 of G1 until the
packet processed in round 6 (i.e., f^6_1) starts transmission. Similar deferrals can also be seen in slots
14 and 16.
Intuitively, this progress control mechanism ensures that the work progress on one resource is never
ahead of that on another by more than one round, hence achieving an approximately consistent processing
rate across resources, in spite of the potential service asynchronicity. This progress control mechanism
is essential to the fairness and delay guarantee of GMR3, as shown in our analysis in Sec. 3.5.7.
To summarize, Algorithm 4 gives the detailed design of the intra-group scheduling. Every flow group Gk
maintains an ActiveFlowList[k] for its backlogged flows. It also uses RoundRobinCounter[k] and Round[k]
to keep track of the current scheduling round. Every time flow group Gk is selected, the intra-group
scheduler chooses the flow i ∈ Gk at the head of ActiveFlowList[k]. Flow i is given a credit to compensate
for its overdraft in the previous round, and schedules packets until no credit remains or no packet is
backlogged (lines 6 to 15). After that, the flow ceases to be pending and is appended to the tail of the
active list if it remains backlogged. Flow group Gk ceases to be pending when all its backlogged flows
are serviced in the current scheduling round. If no flow is backlogged, flow group Gk becomes inactive.
Algorithm 4 IntraGroupScheduling(Gk)
1: if RoundRobinCounter[k] == 0 then
2:   RoundRobinCounter[k] = ActiveFlowList[k].Length()
3:   Round[k] += 1                      ▷ The current scheduling round of Gk
4: end if
5: Flow i = ActiveFlowList[k].RemoveFromHead()
6: bi = 2^k L wi − ei                   ▷ bi tracks the available credit of flow i
7: while IsBacklogged(i) and bi ≥ 0 do
8:   while FlowProgressOnLastResource[i] < Round[k] − 1 do
9:     Withhold the scheduling opportunity of flow i
10:  end while
11:  Packet p = Queue[i].Dequeue()
12:  p.SchedulingRound = Round[k]
13:  ProcessPacket(p)                   ▷ Schedule for CPU processing
14:  bi = bi − DominantProcessingTime(p)
15: end while
16: if IsBacklogged(i) then
17:   ei = −bi                          ▷ ei tracks the overdraft of credits of flow i
18:   ActiveFlowList[k].AppendToTail(i)
19: else
20:   ei = 0
21: end if
22: RoundRobinCounter[k] −= 1
23: if RoundRobinCounter[k] == 0 then
24:   Flow group Gk ceases to be pending
25: end if
26: if ActiveFlowList[k] == ∅ then
27:   Deactivate(Gk)                    ▷ Flow group Gk ceases to be active
28: end if

3.5.5 Handling New Packet Arrivals

In addition to the inter- and intra-group scheduling, the GMR3 scheduler also needs to handle new packet
arrivals. Algorithm 5 gives the detailed procedure. In addition to enqueueing the newly arrived packet p
to the queue of flow i ∈ Gk to which the packet belongs, the scheduler also appends flow i to the active
list of its flow group Gk if flow i was previously inactive. Flow group Gk is also set to active accordingly.
Algorithm 5 PacketArrival(p)
1: Let i be the flow to which the newly arrived packet p belongs
2: Queue[i].Enqueue(p)
3: Let Gk be the flow group to which flow i belongs
4: if ActiveFlowList[k].Contains(i) == FALSE then
5:   ActiveFlowList[k].AppendToTail(i)
6:   if IsActive(Gk) == FALSE then
7:     Activate(Gk)                     ▷ Flow group Gk becomes active
8:   end if
9: end if
3.5.6 Implementation and Complexity
So far, we have described the design of GMR3. We next show that appropriate implementations allow
GMR3 to make packet scheduling decisions in O(1) time for almost all practical scenarios.
Flow Grouping: To identify the flow group Gk of flow i, it suffices to locate the most significant
bit of wi that is set to 1, as 2^{−k} ≤ wi < 2^{−k+1}. Direct computation requires only O(log W), which is a
small constant for almost all practical weight distributions (i.e., logW ≤ 40). In fact, with the support
of a standard priority encoder, this operation can be accomplished in a few bitwise operations [71, 73],
and is strictly O(1).
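In software, the same effect can be obtained with a count-leading-zeros instruction. The fragment below is only an illustration and assumes the normalized weight is stored as a 32-bit fixed-point value (round(wi · 2^32)), which is one possible representation; the thesis prototype may store weights differently.

    // Illustrative O(1) group lookup (not the thesis prototype), assuming the
    // weight is stored as a 32-bit fixed-point value fixed = round(w * 2^32),
    // with fixed > 0.
    #include <cstdint>

    int GroupIndexFixedPoint(uint32_t fixed) {
        // The most significant set bit of `fixed` sits at position 32 - k, so the
        // group index equals the number of leading zeros plus one.
        return __builtin_clz(fixed) + 1;   // GCC/Clang builtin; a priority encoder in hardware
    }
    // Example: w = 1/2  -> fixed = 0x80000000 -> clz = 0 -> k = 1;
    //          w = 1/10 -> fixed ~ 0x1999999A -> clz = 3 -> k = 4.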
Inter-Group Scheduling: There are three important operations in Algorithm 3, i.e., choosing a
flow group (line 4), advancing to the earliest slot with pending groups (line 9), and updating the pending
set P (line 13). Given a small number of flow groups ng, all these operations can be accomplished in
O(1) time using the simple methods described in [71], which we briefly mention in the following.
The scheduler uses two bitmaps a = a_{ng} . . . a_2 a_1 and g = g_{ng} . . . g_2 g_1 to track the active and pending
flow groups, respectively. Bit a_k is set to 1 if flow group Gk is active, and 0 otherwise. Similarly, bit g_k is 1 if group
Gk is pending, and 0 otherwise.
• Choosing a flow group: It is easy to check that, in every slot t, the scheduling round of flow group Gk
ends no later than those of all flow groups Gk′ with k′ > k (see Fig. 3.12). Flow group Gk hence
has a higher priority to be chosen than Gk′. As a result, the chosen group Gk can be identified by
locating the rightmost bit g_k of bitmap g that is set to 1. Such an operation can be done in O(1)
time by a standard priority encoder [71].
• Advancing to the earliest slot with pending groups: Because the start of a scheduling round of any
group Gk′ is also the start of a scheduling round of every group Gk with k < k′ (see Fig. 3.12), the
scheduler should advance to the start of the next scheduling round of the lowest-numbered flow
group that is active. This group can be identified by locating the rightmost bit a_k that is set to 1, and
the new slot is the smallest multiple of 2^k greater than the current slot t. With the support of a
priority encoder, all these operations are done in O(1) time.
• Updating the pending set: At slot t, an active flow group Gk becomes pending if 2^k divides t. To
identify all these groups, it is sufficient to locate the least significant bit of t that is set to 1. Let this
bit be in position k, where bit positions are counted from zero. Then all active flow groups Gk′ with k′ ≤ k become pending
at t, and can be found via some simple bit operations in O(1) time [71].
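The three operations above can be sketched in a few lines of C++. The fragment below is illustrative only: it assumes the group bitmaps fit in a single 64-bit word (which holds since ng ≤ 40) and uses the GCC/Clang builtin __builtin_ctzll as a software stand-in for the priority encoder; the function names are ours.

    // Illustrative bitmap helpers for the inter-group scheduler (not the thesis
    // prototype). Bit k-1 of `g` (resp. `a`) is 1 iff group G_k is pending (resp. active).
    #include <cstdint>

    // Choosing a flow group: the pending group whose round ends earliest is the
    // one with the smallest index, i.e., the rightmost set bit of g.
    int ChooseGroup(uint64_t g) {
        return __builtin_ctzll(g) + 1;           // assumes g != 0
    }

    // Advancing to the earliest slot with pending groups: the lowest-numbered
    // active group G_k becomes pending at the next multiple of 2^k after slot t.
    uint64_t NextPendingSlot(uint64_t a, uint64_t t) {
        int k = __builtin_ctzll(a) + 1;          // assumes a != 0
        uint64_t span = uint64_t{1} << k;
        return (t / span + 1) * span;
    }

    // Updating the pending set at slot t: every active group G_k' with 2^k' | t
    // becomes pending; for t > 0 these are the groups whose index does not exceed
    // the position of the least significant set bit of t.
    uint64_t GroupsBecomingPending(uint64_t a, uint64_t t) {
        if (t == 0) return a;                    // at slot 0 every active group is pending
        uint64_t mask = (uint64_t{1} << __builtin_ctzll(t)) - 1;  // bits below the LSB of t
        return a & mask;
    }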
Intra-Group Scheduling: In Algorithm 4, an essential operation is to track the work progress on
the last resource of the selected flow i (line 8 to 10) to determine if the scheduling opportunity of flow i
should be withheld. For the purpose of efficient implementation, a packet p of flow i, upon scheduling,
is associated with a tag recording the current scheduling round of flow group Gk to which flow i belongs
(line 12 of Algorithm 4). Whenever packet p starts service on the last resource m, the progress of flow
i on that resource is updated to the scheduling round tagged to packet p, which will be used later to
determine the timing of withholding packet processing of flow i (line 8). All these operations can be
done in O(1) time.
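A minimal sketch of this tagging mechanism is shown below (illustrative C++; the type and function names are ours). Each packet carries the scheduling round in which it was dispatched, the flow's last-resource progress is advanced when that packet starts service on the last resource, and the check of line 8 of Algorithm 4 reduces to a single comparison.

    // Illustrative progress tracking for the intra-group scheduler (not the
    // thesis prototype). Used to decide when to withhold a flow's next packet.
    struct Packet {
        int scheduling_round = 0;      // round of the flow's group when scheduled (line 12)
        // ... payload, size, etc.
    };

    struct FlowProgress {
        int last_resource_round = 0;   // latest round whose packet has started service
    };                                 // on the last resource (usually the link)

    // Called when a packet p of the flow starts service on the last resource m.
    void OnLastResourceStart(FlowProgress& f, const Packet& p) {
        f.last_resource_round = p.scheduling_round;
    }

    // Line 8 of Algorithm 4: the flow's next packet is withheld until its
    // last-resource progress has reached the previous scheduling round.
    bool MustWithhold(const FlowProgress& f, int current_round) {
        return f.last_resource_round < current_round - 1;
    }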
Another operation that may introduce additional complexity is obtaining the packet processing time
on the dominant resource (line 14). Note that such information is required only after the packet has
been processed by the CPU. At that time the scheduler knows exactly how the packet should be processed
next and what resources are required. The packet processing time on each of the subsequent resources can
hence be accurately inferred via simple packet profiling techniques in O(1) time. For example, a
simple linear model based on the packet size has been shown to be sufficiently accurate for estimation [4].
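As an illustration of such a profiling step, the sketch below estimates per-resource processing times from the packet size with a linear model; the coefficient names stand in for the per-module parameters (αk, βk) of Table 3.2, which is not reproduced here, and the code is not taken from the thesis prototype.

    // Illustrative packet profiling (not the thesis prototype): estimate the
    // processing time of a packet on each resource from its size, using a linear
    // model t = alpha * size + beta per middlebox module. The coefficients are
    // placeholders, not measured values.
    #include <algorithm>

    struct ModuleProfile {
        double cpu_alpha, cpu_beta;   // CPU-time model (per byte, constant)
        double link_alpha;            // transmission time is proportional to size
    };

    double DominantProcessingTime(const ModuleProfile& m, int packet_size_bytes) {
        double cpu  = m.cpu_alpha * packet_size_bytes + m.cpu_beta;
        double link = m.link_alpha * packet_size_bytes;
        return std::max(cpu, link);   // the dominant resource requires the maximum time
    }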
To conclude, with appropriate implementations mentioned above, both inter-group and intra-group
scheduling decisions can be made in O(1) time per packet, making GMR3 a highly efficient multi-resource
scheduler for middleboxes.
3.5.7 Performance Analysis
In this section, we analyze the properties of GMR3 and show that it achieves near-perfect fairness with
scheduling delays bounded by a small constant.
Fairness
For the purpose of fairness analysis, we derive the RFB of GMR3 defined in Sec. 3.3. We start by
bounding the dominant services a flow receives in any backlogged period (t1, t2) as follows.
Lemma 4. Let Ti(t1, t2) be the dominant service a backlogged flow i receives in a time interval (t1, t2).
We have
xLwi − 9L ≤ Ti(t1, t2) ≤ xLwi + 9L , (3.16)
where x is the number of slots, complete and partial, that have been assigned to flows in (t1, t2).
Proof: Let xi be the number of slots assigned to flow i ∈ Gk in (t1, t2). We show that flow i receives
services on its dominant resource at least in xi − 2 scheduling rounds, and at most in xi + 2 scheduling
rounds. To see this, let l1 (resp. l2) be the scheduling round of flow group Gk to which the first (resp. last)
slot assigned to flow i belongs, in (t1, t2). Clearly, we have xi = l2 − l1 + 1. By Algorithm 4, for any
flow, the progress gap between any two resources is upper bounded by one scheduling round. Therefore,
at time t1, the processing for flow i on the last resource m should progress to at least scheduling round
l1 − 2, as illustrated in Fig. 3.15. Also, by time t2, the work progress of flow i on the last resource m
is at most in scheduling round l2 (because the work progress of resource m is always behind that of
resource 1). As a result, in (t1, t2), flow i receives dominant services in at most l2 − (l1 − 2) + 1 = xi + 2
scheduling rounds.

Figure 3.15: Illustration of the maximum number of scheduling rounds of flow i on the last resource in (t1, t2).
With a similar argument, we see that the work progress of flow i on the last resource m is at most
in scheduling round l1 at time t1, and is at least in scheduling round l2 − 2 at time t2. Flow i hence
receives its dominant services in at least xi − 2 scheduling rounds in (t1, t2).
Therefore, the dominant services flow i receives are at least (xi−2)ci−L and are at most (xi+2)ci+L,
where ci = 2^k L wi is the credit given to flow i in each scheduling round, i.e.,
(xi − 2)ci − L ≤ Ti(t1, t2) ≤ (xi + 2)ci + L . (3.17)
Also note that for resource 1, the number of scheduling rounds of flow group Gk contained in (t1, t2)
is at least xi − 2, and is at most xi + 2. Because each scheduling round of Gk spans exactly 2^k slots, we
have 2^k(xi − 2) ≤ x ≤ 2^k(xi + 2), which is equivalent to

2^{−k} x − 2 ≤ xi ≤ 2^{−k} x + 2 .      (3.18)
Substituting (3.18) into (3.17), we derive

Ti(t1, t2) ≤ (xi + 2) ci + L
          ≤ (2^{−k} x + 4) ci + L
          = 2^{−k} x · 2^k L wi + 4 ci + L
          ≤ x L wi + 9L ,      (3.19)

where the last inequality uses ci < 2L from (3.15). Similarly, we have

Ti(t1, t2) ≥ (xi − 2) ci − L ≥ x L wi − 9L .      (3.20)
Combining (3.19) and (3.20) leads to the statement. □
We are now ready to derive the RFB of GMR3 as follows.
Theorem 9. For any time interval (t1, t2) and any two flows i, j that are backlogged, we have

| Ti(t1, t2)/wi − Tj(t1, t2)/wj | ≤ 9L (1/wi + 1/wj) .
Proof: For any flow i, applying Lemma 4 and dividing both sides of (3.16) by wi, we have
xL − 9L/wi ≤ Ti(t1, t2)/wi ≤ xL + 9L/wi .      (3.21)

A similar inequality also holds for flow j, i.e.,

xL − 9L/wj ≤ Tj(t1, t2)/wj ≤ xL + 9L/wj .      (3.22)

Combining (3.21) and (3.22) leads to the statement. □
Theorem 9 indicates that GMR3 bounds the difference between the normalized dominant services
received by two flows in any backlogged period by a small constant. GMR3 hence provides near-perfect
fairness across flows, irrespective of their traffic patterns. Furthermore, we note that the fairness guaran-
tees provided by existing multi-resource fair queueing schemes, e.g., [4], all assume flows do not change
their dominant resources throughout the backlogged periods (a.k.a., the resource monotonicity assump-
tion [4]), while Theorem 9 does not require such an assumption.
Scheduling Delay
In addition to the fairness guarantees, we show that GMR3 ensures that the scheduling delay is bounded
by a small constant that is inversely proportional to the flow’s weight. To see this, the following two
lemmas are needed in the analysis.
Lemma 5. Let d^l_i be the dominant service flow i ∈ Gk receives in scheduling round l of Gk. We have

0 ≤ d^l_i ≤ 3L .      (3.23)
Proof: Let e^l_i be the excessive consumption of dominant services flow i ∈ Gk incurs in scheduling
round l of Gk. By Algorithm 4, flow i may overdraw the processing time by scheduling at most one
more packet than allowed by its available credits. Therefore,

0 ≤ e^l_i ≤ L .      (3.24)

Also note that at the beginning of scheduling round l, flow i has an available credit of ci − e^{l−1}_i. After round
l, all these credits are used, with an excessive consumption e^l_i. The dominant service flow i receives in
round l is hence

d^l_i = ci − e^{l−1}_i + e^l_i ≤ ci + e^l_i ≤ 3L ,

where the last inequality is derived from (3.15) and (3.24). □
Lemma 6. For flow i ∈ Gk and scheduling round l of Gk, let t0 be the time when flow i is completely
processed on resource 1 in round l of Gk, and t1 the time when flow i is completely processed on the last
resource m in the same round l. We have
t1 − t0 < 12mL/wi .
Proof: By Algorithm 4, the progress gap between any two resources is bounded by 1 round. There-
fore, at time t0, flow i must have already received its packet processing on the last resource m in round
l − 1 of Gk. Because there are at most 2^{k+1} slots assigned to flows in rounds l − 1 and l of Gk, and each
slot is assigned to at most one flow, the number of flows served on the last resource in (t0, t1), denoted
nf, is upper bounded by 2^{k+1}, i.e.,

nf ≤ 2^{k+1} .
Let these flows be j1, . . . , j_{nf}, operating in scheduling rounds l1, . . . , l_{nf} of their respective flow groups
on the last resource. In particular, j_{nf} = i and l_{nf} = l. By Algorithm 4, flow j1 starts service on
resource 1 in round l1 of its flow group no later than the time when its previously served flow has been
processed on the last resource m, which is no later than t0. Therefore, flow j1 is completely processed
on the last resource in round l1 of its flow group no later than

t0 + m · d^{l1}_{j1} ≤ t0 + 3Lm ,      (3.25)
where the inequality is derived from Lemma 5. Similarly, flow j2 starts its service on resource 1 in round
l2 no later than the time its previous flow j1 is completely processed on the last resource m in round
l1, which is no later than t0 + 3Lm by (3.25). As a result, flow j2 is completely processed on the last
resource in round l2 of its flow group no later than

t0 + 3Lm + m · d^{l2}_{j2} ≤ t0 + 6Lm .

Figure 3.16: The illustration of a scenario where the scheduling delay Di(p) reaches the maximum. Here, f^l_i denotes the processing of flow i in scheduling round l of its flow group.
By induction, flow ju is completely processed on the last resource in round lu of its flow group no later
than
t0 + 3uLm ,   u = 1, 2, . . . , nf .

Now letting u = nf and noting that nf ≤ 2^{k+1} and j_{nf} = i, we have

t1 − t0 ≤ 3 nf Lm ≤ 3Lm · 2^{k+1} ≤ 12mL/wi ,

where the last inequality holds because 2^{−k} ≤ wi < 2^{−k+1}, which implies 2^{k+1} ≤ 4/wi. □
We now bound the scheduling delay of GMR3 as follows.
Theorem 10. For all flow i, the scheduling delay of its packet p is bounded by
SDi(p) < 24mL/wi ,
where m is the number of resources.
Proof: For any flow i ∈ Gk, the scheduling delay of its packet p reaches its maximum when p reaches
the head of the queue in scheduling round l of Gk, but is not processed until the next round l+ 1 of Gk.
Since there are at most 2^{k+1} slots in between and each slot is assigned to at most one flow, the number of flows
that have been assigned slots during this time, denoted nf, is upper bounded by 2^{k+1}, i.e.,

nf ≤ 2^{k+1} .
Let these flows be j1, . . . , j_{nf}, with their assigned slots in their current scheduling rounds l1, . . . , l_{nf} of
their respective flow groups. In particular, j_{nf} = i and l_{nf} = l + 1. By Algorithm 4, flow j1 starts service
on resource 1 no later than the time its previous flow i is completely processed on the last resource m in
round l of Gk. Similarly, flow j2 starts its service on resource 1 no later than the time when its previous
flow j1 is completely processed on the last resource m in round l1 of its flow group, and so on. Fig. 3.16
illustrates this scenario, where tu is the latest time instant that flow ju receives service on resource 1 in
round lu of its flow group, u = 1, 2, . . . . We then have
t_{u+1} − t_u ≤ m · d^{lu}_{ju} ≤ 3Lm ,   u = 1, 2, . . . ,      (3.26)
where the second inequality is derived from Lemma 5. In other words, the time span of processing flow
ju on all resources in round lu reaches its maximum when the processing time is maximized on every
resource.
Now let t0 be the time when packet p reaches the head of the queue in scheduling round l of its flow
group, which is also the time when flow i is completely processed on resource 1 in the same round l (see
Fig. 3.16). By Lemma 6, we have
t1 − t0 ≤ 12mL/wi . (3.27)
With (3.26) and (3.27), we bound the scheduling delay SDi(p) as follows:
SDi(p) ≤ ∑_{u=1}^{nf} (t_u − t_{u−1})
       < 12mL/wi + 3Lm · nf
       ≤ 12mL/wi + 3Lm · 2^{k+1}
       ≤ 24mL/wi ,

where the last inequality holds because 2^{−k} ≤ wi < 2^{−k+1}, which implies 2^{k+1} ≤ 4/wi. □
Theorem 10 gives a strictly weight-proportional scheduling delay bound that is independent of the
number of flows. This implies that a flow is guaranteed to be scheduled within a small constant amount
of time that is inversely proportional to the processing rate (weight) the flow deserves, irrespective of
the behaviours of other flows. To our knowledge, this is the first multi-resource packet scheduler that
offers this property.
To conclude, Table 3.3 compares the performance of GMR3 with the two existing multi-resource fair
queueing schemes, i.e., DRFQ [4] and MR3. We see that GMR3 is the only scheduler that provides
Table 3.3: Summary of performance of GMR3 and existing schemes, where n is the number of flows, and m is the number of resources.

Scheme      Complexity    Fairness^4 (RFB)       Scheduling Delay
DRFQ [4]    O(log n)      L(1/wi + 1/wj)         Unknown
MR3         O(1)          2L(1/wi + 1/wj)        2(m + W)^2 L/wi
GMR3        O(1)          9L(1/wi + 1/wj)        24mL/wi

^4 The fairness analysis of DRFQ requires that flows do not change their dominant resources throughout the backlogged periods [4].
provably good performance guarantees on fairness and delay in O(1) time.
3.5.8 Evaluation
For complementary study to our theoretical analysis, we experimentally evaluate the fairness and delay
performance of GMR3 via simulations.
General Setup
All simulation results are based on our event-driven packet simulator written with 3,000 lines of C++
code. Packets follow Poisson arrivals and are processed serially on resources, with CPU processing first,
followed by link transmission. In addition to GMR3, we also implement DRFQ [4] and MR3 for the
purpose of comparison. The simulator simulates packet processing in 3 typical middlebox modules, i.e.,
basic forwarding (Basic), statistical monitoring (Stat. Mon.), and IP security encryption (IPsec). The
first two modules are bandwidth intensive, with monitoring consuming slightly more CPU resources,
while IPsec is CPU intensive. According to the measurement results reported in [4], the CPU processing
time required by each middlebox module follows a simple linear model based on packet size x, and is
α_k x + β_k, where α_k and β_k are parameters of module k and are summarized in Table 3.2 (see Sec. 3.4.5).
The link transmission time is proportional to the packet size, and the output bandwidth of the middlebox
is 200 Mbps, the same as [4].
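As a rough illustration of this setup (a minimal sketch under the stated assumptions, not the actual 3,000-line simulator), Poisson packet arrivals for a flow can be generated by drawing exponential inter-arrival gaps:

    // Illustrative Poisson arrival generation for one flow (not the actual
    // simulator): inter-arrival times are exponential with mean 1/rate.
    #include <random>
    #include <vector>

    std::vector<double> GenerateArrivals(double rate_pkts_per_s, double duration_s,
                                         unsigned seed) {
        std::mt19937 rng(seed);
        std::exponential_distribution<double> gap(rate_pkts_per_s);
        std::vector<double> arrivals;
        for (double t = gap(rng); t < duration_s; t += gap(rng)) {
            arrivals.push_back(t);   // arrival timestamps in seconds
        }
        return arrivals;
    }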
Fairness
We confirm experimentally that GMR3 provides near-perfect service isolation across flows, irrespective of
their traffic behaviours. The simulator generates 30 traffic flows that send 1300-byte UDP packets for 30
seconds. Flows 1 to 10 pass through the Basic module; flows 11 to 20 undergo statistical monitoring; while
flows 21 to 30 require IPsec encryption. Among all these flows, flows 1, 11, and 21 carry rogue traffic, each
sending 30,000 pkts/s. All other flows behave normally, each sending 3,000 pkts/s. Flows are assigned
random weights uniformly drawn from 1 to 1000. Fig. 3.17a depicts the dominant services, in seconds,
received by different flows under GMR3, normalized to their respective weights. We see that despite
the presence of ill-behaving traffic, GMR3 allows flows through different modules to receive weight-
proportional dominant services, enforcing service isolation. Similar results have also been observed
using DRFQ and MR3, and are not shown in the figure.

Figure 3.17: Simulation results of the fairness and delay performance of GMR3, as compared to DRFQ and MR3: (a) normalized dominant service; (b) CDF of the scheduling delay; (c) mean scheduling delay; (d) maximum scheduling delay. Figure (a) is dedicated to the fairness evaluation, while (b), (c), and (d) compare the scheduling delay of the three schedulers.
Scheduling Delay
We next confirm experimentally that GMR3 significantly improves the packet scheduling delay, as com-
pared to existing multi-resource scheduling alternatives. The simulator generates 150 UDP flows with
flow weights uniformly drawn from 1 to 1000. A flow randomly chooses one of the three middlebox
modules to pass through. To congest the middlebox resources, the flow rate is set to 500 pkts/s, with
packet sizes uniformly drawn from 200 B to 1400 B, which are the typical settings for Ethernet. For
each processed packet, we record its scheduling delay, using DRFQ, MR3, and GMR3, respectively. The
simulation spans 30 seconds.
Fig. 3.17b shows the CDF of the scheduling delay a packet experiences, from which we see the
significance of GMR3 on delay improvement: using GMR3, over 95% of packets are scheduled within
20 ms, which is roughly the minimum time a packet has to wait under DRFQ and MR3. A detailed
statistical breakdown is given in Figs. 3.17c and 3.17d. Fig. 3.17c shows the mean scheduling delay a flow
experiences with respect to its weight. We see that GMR3 consistently leads to a smaller mean delay
than the other two schedulers for almost all flows, especially for those with large weights. This delay
improvement is not limited to the average case. Fig. 3.17d gives the maximum delay a flow experiences
with respect to its weight. We see that both GMR3 and DRFQ offer a weight-proportional delay bound.
While DRFQ achieves a smaller delay bound for flows with smaller weights, GMR3 is generally better
for more important flows with medium to large weights. MR3, on the other hand, fails to provide service
differentiation among flows. Intuitively, since flows are served in rounds, in the worst case, a packet has
to wait for the entire scheduling round until it is processed, incurring a worst-case delay that is as long
as the span of an entire round. GMR3 avoids this problem by distributing the scheduling opportunities
over time, in proportion to the flows’ weights.
3.6 Discussion and Future Work
While both analysis and simulation show that GMR3 offers provably good performance on fairness,
delay, and complexity, we point out that it does not outperform existing schemes in every aspect, and
is by no means a “final solution” to multi-resource fair queueing. In particular, as shown in Table 3.3,
GMR3 offers a relatively looser fairness guarantee with a larger RFB than those of DRFQ and MR3.
This means that GMR3 is not as good as the other two schemes in terms of the short-term fairness, and
is not preferred by short-lived flows (a.k.a. mice flows). However, for long-lived flows (a.k.a. elephant
flows), all three schemes are similar as their RFBs are all bounded by a small constant. As a result,
GMR3 is not a perfect fit when strict short-term fairness is of particular importance to the system.
Also, even setting aside this slight disadvantage in short-term fairness, GMR3 is not an all-around
improvement over MR3, for the following two reasons. To begin with, while both MR3 and GMR3 operate
at O(1) complexity, the former has a simpler mechanism and is easier to implement than the latter.
Unlike GMR3, flows are simply served in rounds under MR3; no flow grouping or inter-group scheduling
strategy is needed.
which is not the case for GMR3. Specifically, in each round, MR3 awards each flow an elastic amount
of credits based on its excessive credit consumption in the previous round, such that the flow is always
given sufficient credits to process at least one packet (see Sec. 3.4.3). Such an elastic credit-awarding
mechanism enables MR3 to operate without a priori knowledge of the maximum packet processing time
L. In contrast, this information is required by GMR3 to compute the amount of credits awarded to a
flow in each round (see (3.14) and line 6 of Algorithm 4). We also note that the elastic credit-awarding
mechanism adopted by MR3 cannot be applied to GMR3: it only respects the intra-group fairness for
flows in one flow group, yet violates the inter-group fairness among flow groups.
In summary, GMR3 can be viewed as a balanced design in terms of fairness, delay, and complexity. It
remains open to investigate if and how these performance metrics can be further improved. Furthermore,
what is the ultimate performance tradeoff that one can expect? Given that similar questions have been
answered for single-resource fair queueing [77,78], investigations into these open questions in the multi-
resource setting may present a promising future direction.
3.7 Summary
The potential congestion of packet processing with respect to multiple resource types in a middlebox
complicates the design of packet scheduling algorithms. Previously proposed multi-resource fair queueing
schemes require O(log n) complexity per packet and are expensive to implement at high speeds. In this
chapter, we present two new schedulers both operating in O(1) time.
Our first scheduler, MR3, was designed for unweighted flows. MR3 serves flows in rounds. At the
same time, it keeps track of the work progress on each resource and withholds the scheduling opportunity
of a packet until the progress gap between any two resources falls below one round. Theoretical analyses
have indicated that MR3 implements near-perfect DRF across flows. MR3 is very easy to implement,
and is the first multi-resource fair scheduler that offers near-perfect fairness with O(1) time complexity.
Our second scheduler, GMR3, was designed for weighted flows. GMR3 collects flows with similar
weights into the same flow group, and makes scheduling decisions in a two-level hierarchy. The inter-
group scheduler determines a flow group, from which the intra-group scheduler picks a flow in a round-
robin manner. Through this design, GMR3 eliminates the sorting bottlenecks suffered by existing multi-
resource scheduling alternatives such as DRFQ, and is able to handle a large volume of traffic at high
speeds. More importantly, we have shown, both analytically and experimentally, that GMR3 ensures
a constant scheduling delay bound that is inversely proportional to the flow weight, hence offering
predictable delay guarantees for individual flows. To our knowledge, GMR3 is the first multi-resource
fair queueing algorithm that offers near-perfect fairness with a constant scheduling delay bound in O(1)
complexity.
We believe that both MR3 and GMR3 may find general applications in other multi-resource scheduling
contexts where jobs must be scheduled as entities, such as multi-tenant scheduling in deep software stacks
and VM scheduling inside a hypervisor.
3.8 Proofs
3.8.1 Fairness Analysis for MR3
In this subsection, we give detailed proofs of lemmas and corollaries that are used in the fairness analysis
presented in Sec. 3.4.4. We start by proving Lemma 2.
Proof of Lemma 2: If flow i has no packets to serve (i.e., the input queue is empty) after round
k, then EC^k_i = 0 (line 24 in Module 2), and the statement holds. Otherwise, let packet p be the last
packet of flow i served in round k. Let B′_i be its account balance before packet p is processed. We have

EC^k_i = DominantProcessingTime(p) − B′_i ≤ Li ,

where the inequality holds because B′_i ≥ 0 and DominantProcessingTime(p) ≤ Li. □
We next show Lemma 3.
Proof of Lemma 3: At the beginning of round k, flow i has an account balance Bi = MaxEC^{k−1} − EC^{k−1}_i.
After round k, all this amount of processing time has been consumed on the dominant resources
of the processed packets, with an excessive consumption EC^k_i. The dominant service flow i received in
round k is therefore Di = Bi + EC^k_i, which is the RHS of (3.4). □
Finally, we prove Corollary 2.
Proof of Corollary 2: By Lemma 3 and noticing that EC^{k−1}_i ≥ 0, we have

D^k_i ≤ MaxEC^{k−1} + EC^k_i ≤ 2L ,

where the last inequality holds because of Lemma 2 and Corollary 1. □
3.8.2 Analysis of Startup Latency of MR3
In this subsection, we bound the startup latency of MR3 by proving Theorem 3.
Proof of Theorem 3: Without loss of generality, suppose n flows are backlogged in round k − 1,
with flow 1 being served first, followed by flow 2, and so on. After flow n has been served on resource 1,
flow n + 1 becomes active and is appended to the tail of the active list. Flow n + 1 is therefore served
right after flow n in round k. In particular, suppose flow n + 1 becomes active at time 0. Let S^k_{i,r} be the
time when flow i starts to receive service in round k on resource r, and let F^k_{i,r} be the time when the
scheduler finishes serving flow i in round k on resource r. Specifically,

F^{k−1}_{n,1} = 0 .
As shown in Fig. 3.18, the startup latency of flow n+ 1 is
SL = F^k_{n,1} .      (3.28)

Figure 3.18: Illustration of the startup latency. In the figure, flow i in round k is denoted as i^k. Flow n+1 becomes active when flow n has been served on resource 1 in round k−1, and will be served right after flow n in round k.

The following two relationships are useful in the analysis. For all flows i = 1, . . . , n and resources
r = 1, . . . , m, we have

F^k_{i,r} ≤ S^k_{i,r} + D^k_i ≤ S^k_{i,r} + 2L ,      (3.29)
where the last inequality holds because of Corollary 2. Further, for all flows i = 1, . . . , n, we have
S^k_{i,1} = max{ S^{k−1}_{1,m}, F^{k−1}_{n,1} }   for i = 1, and
S^k_{i,1} = max{ S^{k−1}_{i,m}, F^k_{i−1,1} }   for i = 2, . . . , n .      (3.30)
That is, flow i is scheduled in round k right after its previous-round service has started on the last
resource m (in this case, the progress gap on two resources is no more than 1 round) and its previous
flow has been served on resource 1.
The following lemma is also required in the analysis.
Lemma 7. For all flows i = 1, . . . , n and all resources r = 1, . . . ,m, we have
S^{k−1}_{i,r} ≤ 2(i + r − 2)L ,
F^{k−1}_{i,r} ≤ 2(i + r − 1)L .      (3.31)
Proof of Lemma 7: We observe the following relationship for all resources r = 2, . . . ,m:
S^{k−1}_{i,r} ≤ max{ F^{k−2}_{n,r}, F^{k−1}_{1,r−1} }   for i = 1, and
S^{k−1}_{i,r} ≤ max{ F^{k−1}_{i−1,r}, F^{k−1}_{i,r−1} }   for i = 2, . . . , n .      (3.32)
That is, flow i starts to receive service on resource r no later than the time when it has been served on
resource r− 1 and the time when its previous flow has been served on the same resource (see Fig. 3.18).
To see the statement, we apply induction to r and i. First, when r = 1, the statement trivially holds
because
S^{k−1}_{i,1} ≤ F^{k−1}_{i,1} ≤ F^{k−1}_{n,1} = 0 ,   i = 1, . . . , n .      (3.33)
When r = 2, i = 1, we have

S^{k−1}_{1,2} ≤ max{ F^{k−2}_{n,2}, F^{k−1}_{1,1} } ≤ 2L ,
F^{k−1}_{1,2} ≤ S^{k−1}_{1,2} + 2L ≤ 4L .      (3.34)
This is because
F^{k−2}_{n,2} ≤ F^{k−2}_{n,m}
            ≤ S^{k−2}_{n,m} + 2L      (by (3.29))
            ≤ S^{k−1}_{1,1} + 2L      (by the MR3 algorithm)
            ≤ 2L ,      (by (3.33))      (3.35)
and
S^{k−1}_{1,2} ≤ max{ F^{k−2}_{n,2}, F^{k−1}_{1,1} } ≤ max{2L, 0} = 2L .      (3.36)
Now assume for some r, i and r − 1, i+ 1, the statement holds. Note that for r, i+ 1, we have
S^{k−1}_{i+1,r} ≤ max{ F^{k−1}_{i,r}, F^{k−1}_{i+1,r−1} }      (by (3.32))
             ≤ 2(i + r − 1)L ,      (by induction)      (3.37)

and

F^{k−1}_{i+1,r} ≤ S^{k−1}_{i+1,r} + 2L ≤ 2(i + r)L .      (3.38)
Therefore, the statement holds for r, i = 1, . . . , n. We then consider the case of r + 1, 1. We have
S^{k−1}_{1,r+1} ≤ max{ F^{k−2}_{n,r+1}, F^{k−1}_{1,r} }      (by (3.32))
             ≤ max{ F^{k−2}_{n,m}, 2rL }
             ≤ max{ 2L, 2rL }      (by (3.35))
             = 2rL ,      (3.39)

and

F^{k−1}_{1,r+1} ≤ S^{k−1}_{1,r+1} + 2L ≤ 2(r + 1)L .      (3.40)
Therefore, by induction, the statement holds. □
Lemma 7 leads to the following lemma.
Lemma 8. The following relationship holds for all flows i = 1, . . . , n:
S^k_{i,1} ≤ 2(m + i − 2)L ,
F^k_{i,1} ≤ 2(m + i − 1)L .      (3.41)
Proof of Lemma 8: For i = 1, we have
S^k_{1,1} = max{ S^{k−1}_{1,m}, F^{k−1}_{n,1} }      (by (3.30))
         ≤ max{ 2(m − 1)L, 0 }      (by Lemma 7)
         = 2(m − 1)L ,      (3.42)

and

F^k_{1,1} ≤ S^k_{1,1} + 2L ≤ 2mL .      (3.43)
Assume the statement holds for some i. Then for flow i+ 1, we have
S^k_{i+1,1} = max{ S^{k−1}_{i+1,m}, F^k_{i,1} }      (by (3.30))
           ≤ 2(i + m − 1)L ,      (by Lemma 7 and (3.41))

and

F^k_{i+1,1} ≤ S^k_{i+1,1} + 2L ≤ 2(i + m)L .      (3.44)
Hence by induction, the statement holds. □
We are now ready to prove Theorem 3. By (3.28), it is equivalent to show
SL = F^k_{n,1} ≤ 2(m + n − 1)L ,

which is a direct result of Lemma 8. □
3.8.3 Analysis of Scheduling Delay of MR3
We next analyze the single packet delay of MR3 by proving Theorem 4.
Proof of Theorem 4: For any packet p, let a(p) be the time when packet p reaches the head of
the input queue and is ready for service. Let d(p) be the time when packet p has been processed on all
resources and leaves the system. The scheduling delay of packet p is defined as
SD(p) = d(p)− a(p) . (3.45)
Figure 3.19: Illustration of the packet latency, where flow i in round k is denoted as i^k. The figure shows the scenario under which the latency reaches its maximal value: a packet p is pushed to the top of the input queue in one round but is scheduled in the next round because of the account deficit.
Without loss of generality, assume packet p belongs to flow n, and is pushed to the top of the input
queue at time 0 in round k − 1. The delay SD(p) reaches its maximal value when packet p is scheduled
in the next round k, as shown in Fig. 3.19.
We use the same notations as those in the proof of Theorem 3. Let S^k_{i,r} be the time when flow i
starts to receive service in round k on resource r, and let F^k_{i,r} be the time when flow i has been served
in round k on resource r. As shown in Fig. 3.19, the scheduling delay is

SD(p) = d(p) ≤ F^k_{n,m} .      (3.46)
We claim the following relationships for all flows i = 1, . . . , n and resources r = 1, . . . , m, with which
the statement holds:

S^k_{i,r} ≤ 2(m + n + i + r − 2)L ,
F^k_{i,r} ≤ 2(m + n + i + r − 1)L .      (3.47)
To see this, we extend (3.32) to round k by replacing k − 1 in (3.32) with k:
S^k_{i,r} ≤ max{ F^{k−1}_{n,r}, F^k_{1,r−1} }   for i = 1, and
S^k_{i,r} ≤ max{ F^k_{i−1,r}, F^k_{i,r−1} }   for i = 2, . . . , n .      (3.48)
We now show (3.47) by induction. First, by Lemma 8, (3.47) holds when r = 1. Also, for r = 2, i = 1,
we have
S^k_{1,2} ≤ max{ F^{k−1}_{n,2}, F^k_{1,1} }      (by (3.48))
         ≤ max{ 2(n + 1)L, 2mL }      (by (3.31), (3.41))
         ≤ 2(m + n + 1)L ,

and

F^k_{1,2} ≤ S^k_{1,2} + 2L ≤ 2(m + n + 2)L .
Now assume for some r, i and r − 1, i+ 1, (3.47) holds. Note that for r, i+ 1, we have
S^k_{i+1,r} ≤ max{ F^k_{i,r}, F^k_{i+1,r−1} }      (by (3.48))
           ≤ 2(m + n + i + r − 1)L ,      (by induction)

and

F^k_{i+1,r} ≤ S^k_{i+1,r} + 2L ≤ 2(m + n + i + r)L .
Therefore, by induction, (3.47) holds for r, i = 1, . . . , n. We then consider the case of r + 1, 1. We have
S^k_{1,r+1} ≤ max{ F^{k−1}_{n,r+1}, F^k_{1,r} }      (by (3.48))
           ≤ max{ 2(n + r)L, 2(m + n + r)L }      (by Lemma 7 and the induction hypothesis)
           = 2(m + n + r)L ,

and

F^k_{1,r+1} ≤ S^k_{1,r+1} + 2L ≤ 2(m + n + r + 1)L .
Hence by induction, (3.47) holds. □
3.8.4 Fairness Analysis of Weighted MR3
To prove Theorem 6, we require the following relationships, which are natural extensions to those given
in Sec. 4.2.1.
Lemma 9. Under weighted MR3, for every flow i and round k, we have

EC^k_i ≤ Li / wi .

Corollary 4. Under weighted MR3, for every round k, we have

MaxEC^k ≤ max_{1≤i≤n} Li / wi .
Lemma 10. Under weighted MR3, for every flow i and round k, we have

D^k_i / wi = MaxEC^{k−1} − EC^{k−1}_i + EC^k_i ,

where EC^0_i = 0 and MaxEC^0 = 0.
Corollary 5. Under weighted MR3, for every flow i and round k, the following relationship holds:

D^k_i ≤ (1 + W) L .

Proof: By Lemma 10, we derive

D^k_i ≤ wi · MaxEC^{k−1} + wi · EC^k_i
      ≤ wi · max_{1≤j≤n} Lj/wj + Li
      ≤ (1 + W) L ,

where the second inequality is derived from Lemma 9 and Corollary 4. □
With all these relationships in place, Theorem 6 can be proved in a manner similar to Theorem 2.
Proof sketch of Theorem 6: Following the notation and the analysis given in the proof of Theorem 15, we bound the normalized dominant service flow i receives in (t1, t2) as follows:

∑_{k=R1+1}^{R2−2} D^k_i / wi ≤ Ti(t1, t2) / wi ≤ ∑_{k=R1−1}^{R2} D^k_i / wi .
Now applying Lemmas 9 and 10, we have

X − 5Li/wi ≤ Ti(t1, t2)/wi ≤ X + Li/wi ,      (3.49)

where X = ∑_{k=R1−1}^{R2} MaxEC^{k−1}. A similar inequality also holds for flow j, i.e.,

X − 5Lj/wj ≤ Tj(t1, t2)/wj ≤ X + Lj/wj .      (3.50)

Taking the difference between (3.49) and (3.50) leads to the statement. □
3.8.5 Delay Analysis of Weighted MR3
We now briefly outline how the analysis of startup latency presented in Sec. 3.8.3 extends to weighted
MR3.
Proof sketch of Theorem 7: Following the notations used in the proof of Theorem 3, it is equivalent
to show
SL = F^k_{n,1} ≤ (1 + W)(m + n − 1)L .      (3.51)
By Corollary 5, we rewrite (3.29) as
F^k_{i,r} ≤ S^k_{i,r} + D^k_i ≤ S^k_{i,r} + (1 + W)L .      (3.52)
Combining (3.30), (3.52) and following the analysis of Lemma 7, one can easily show that for all flows
i = 1, . . . , n and resources r = 1, . . . , m, we must have

S^{k−1}_{i,r} ≤ (1 + W)(i + r − 2)L ,
F^{k−1}_{i,r} ≤ (1 + W)(i + r − 1)L .
This leads to an extension of Lemma 8, where the following relationship holds for all flows i = 1, . . . , n:
S^k_{i,1} ≤ (1 + W)(m + i − 2)L ,
F^k_{i,1} ≤ (1 + W)(m + i − 1)L .
Taking i = n, we see that (3.51) holds. □
Similarly, the analysis for the scheduling delay also extends to weighted MR3.
Proof sketch of Theorem 8: We can easily extend (3.47) to the following inequalities, following
exactly the same induction steps given in the proof of Theorem 4:
S^k_{i,r} ≤ (1 + W)(m + n + i + r − 2)L ,
F^k_{i,r} ≤ (1 + W)(m + n + i + r − 1)L .      (3.53)

As a result, we have

SDi(p) = d(p) ≤ F^k_{n,m} ≤ (1 + W)(2m + 2n − 1)L < 2(1 + W)(m + n)L .      (3.54)
Also notice that the flow weights are normalized such that

∑_{j=1}^{n} wj = 1 .

Dividing both sides by wi and noting that wj/wi ≥ 1/W, we have

n ≤ W/wi .
Substituting this inequality into (3.54), we have

SDi(p) < 2(1 + W)(m + n)L
       ≤ 2(1 + W)(m + W/wi)L
       ≤ 2(m + W)^2 L / wi ,

where the last inequality holds because m ≥ 1 and wi ≤ 1. □
Chapter 4
Fairness-Efficiency Tradeoff for
Multi-Resource Scheduling
4.1 Motivation
In the previous chapter, we have mainly focused on fairness in the design of a queueing algorithm. Here,
fairness means predictable service isolation among flows, and is embodied by Dominant Resource Fairness
(DRF) in the presence of multiple types of resources. However, fairness is just one side of the story.
In addition to fairness, resources should also be shared efficiently. In the context of packet scheduling,
efficiency measures the resource utilization achieved by a queueing algorithm. High resource utilization
naturally translates into high traffic throughput. This is of particular importance to enterprise networks,
given the surging volume of traffic passing through middleboxes [18,30].
Both fairness and efficiency can be achieved at the same time in traditional single-resource fair queue-
ing, where bandwidth is the only concern. As long as the schedule is work conserving [24], bandwidth
utilization is 100% given a non-empty system. That leaves fairness as an independent objective to
optimize.
However, in the presence of multiple resource types, fairness is often a conflicting objective against
efficiency. To see this, consider two schedules shown in Fig. 4.1 with two flows whose packets need CPU
processing before transmission. Packets that finish CPU processing are placed into a buffer in front of
the output link. Each packet in Flow 1 has a processing time vector 〈2, 3〉, meaning that it requires 2
time units for CPU processing and 3 time units for transmission; each packet in Flow 2 has a processing
time vector 〈9, 1〉. The dominant resource of Flow 1 is link bandwidth, as it takes more time to transmit
a packet than processing it using CPU; similarly, the dominant resource of Flow 2 is CPU. To achieve
DRF, the transmission time Flow 1 receives should be approximately equal to the CPU processing time
Flow 2 receives. In this sense, Flow 1 should schedule three packets whenever Flow 2 schedules one,
so that each flow receives 9 time units of service on its dominant resource, as shown in Fig. 4.1a. This
schedule, though fair, leads to poor bandwidth utilization: the link is idle for 1/3 of the time. On the
other hand, Fig. 4.1b shows a schedule that achieves 100% CPU and bandwidth utilization by serving
eight packets of Flow 1 and one packet of Flow 2 alternately. The schedule, though efficient, violates
DRF. While Flow 1 receives 24/25 of the link bandwidth, Flow 2 receives only 9/25 of the CPU time.

Figure 4.1: An example showing the tradeoff between fairness and efficiency for multi-resource packet scheduling. Packets that finish CPU processing are placed into a buffer in front of the output link. Flow 1 sends packets p1, p2, . . . , each having a processing time vector 〈2, 3〉; Flow 2 sends packets q1, q2, . . . , each having a processing time vector 〈9, 1〉. Schedule (a) achieves DRF but is inefficient; schedule (b) is efficient but unfair.
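To check the arithmetic behind these claims: in schedule (a), the repeating pattern of three packets of Flow 1 and one packet of Flow 2 keeps the CPU busy for 3 × 2 + 9 = 15 time units but the link busy for only 3 × 3 + 1 = 10, so the link idles 5 out of every 15 time units, i.e., 1/3 of the time. In schedule (b), eight packets of Flow 1 and one packet of Flow 2 occupy the CPU for 8 × 2 + 9 = 25 time units and the link for 8 × 3 + 1 = 25 time units, so both resources stay fully utilized, with Flow 1 receiving 24 of every 25 time units of link time and Flow 2 receiving 9 of every 25 time units of CPU time.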
The fairness-efficiency tradeoff shown in the example above generally exists for multi-resource packet
scheduling. However, existing multi-resource queueing algorithms focus solely on fairness, e.g., DRFQ [4],
MR3 (Sec. 3.4), and GMR3 (Sec. 3.5). On the other hand, for applications having a loose fairness
requirement, trading off a modest degree of fairness for higher efficiency and higher throughput is well
justified. In general, depending on the underlying applications, a network operator may weigh fairness
and efficiency differently. Ideally, a multi-resource queueing algorithm should allow network operators to
flexibly specify their tradeoff preference and implement the specified tradeoff by determining the “right”
packet scheduling order.
However, designing such a queueing algorithm is non-trivial. It remains to be seen how efficiency can
be quantitatively defined. Further, it remains open how the tradeoff requirement should be appropriately
specified. But most importantly, given a specific tradeoff requirement, how can the scheduling decision
be correctly made to implement it?
This chapter represents the first attempt to address these challenges. We define the efficiency measure
to be the schedule makespan, which is the completion time of the last flow. We show that achieving
a flexible tradeoff between fairness and efficiency is generally NP-hard. We hence limit our discussion
to a typical scenario where CPU and link bandwidth are the two types of resources required for packet
processing, which is usually the case in middleboxes. We show that the fairness-efficiency tradeoff can
be strictly enforced by a GPS-like (Generalized Processor Sharing [21, 22]) fluid model, where packets
are served in arbitrarily small increments on both resources. To implement the idealized fluid schedule in the
real world, we design a packet-by-packet tracking algorithm, using an approach similar to the virtual
time implementation of Weighted Fair Queueing (WFQ) [21, 22, 79]. We have prototyped our tradeoff
algorithm in the Click modular router [80]. Both our prototype implementation and trace-driven simu-
lation show that a 15% ∼ 20% fairness tradeoff is sufficient to achieve the optimal efficiency, leading to
a nearly 20% improvement in bandwidth throughput with a significantly higher resource utilization.
4.2 Fairness and Efficiency
Before discussing the tradeoff between fairness and efficiency, we shall first clarify how the notion of
fairness is to be defined, and how efficiency is to be measured quantitatively. We model packet processing
as going through a resource pipeline, where the first resource is consumed to process the packet first,
followed by the second, and so on. A packet is not available for the downstream resource until the
processing on the upstream resource finishes. For example, a packet cannot be transmitted (which
consumes link bandwidth) before it has been processed by CPU.
4.2.1 Dominant Resource Fairness
As we have seen in Chapter 3, fairness is one of the primary design objectives for a queueing algorithm.
A fair schedule offers service isolation among flows by allowing each flow to receive the throughput at
least at the level when every resource is evenly allocated. The notion of Dominant Resource Fairness
(DRF) embodies this isolation property by achieving the max-min fairness on the dominant resources of
packets in their respective flows. In Chapter 3, we have defined the dominant resource of a packet as the
one that requires the maximum packet processing time. In particular, let τr(p) be the time required to
process packet p on resource r. The dominant resource of packet p is rp = arg maxr τr(p). Given a packet
schedule, let Di(t1, t2) be the time flow i receives to process the dominant resources of its packets in a
backlogged period (t1, t2). The function Di(t1, t2) is referred to as the dominant service flow i receives
in (t1, t2). A schedule is said to strictly implement DRF if for all flows i and j, and for any period (t1, t2)
they backlog, we have
Di(t1, t2) = Dj(t1, t2). (4.1)
In other words, a strict DRF schedule allows each flow to receive the same dominant service in any
backlogged period.
However, because packets are scheduled as separate entities and are transmitted in sequence, strictly
implementing DRF at all times may not be possible in practice. For this reason, a practical fair schedule
only requires flows to receive approximately the same dominant services over time [4], as shown in the
previous example of Fig. 4.1a.
4.2.2 The Efficiency Measure
In addition to fairness, efficiency is another important concern for a multi-resource scheduling algorithm,
but has received no significant attention before. Even the definition of efficiency needs clarification.
Perhaps the most widely adopted efficiency measure is system throughput, whose conventional def-
inition is the rate of completions [81], computed as the processed workload divided by the elapsed
time (e.g., bits per second). While this performance metric is well defined for single-resource systems,
extending its definition to multiple types of resources leads to a throughput vector, where each compo-
nent is the throughput of one type of resource (e.g., 10 CPU instruction completions per second and
5 bits transmitted through the output link per second), and different throughput vectors may not be
comparable.
Another possible efficiency measure is resource utilization given a non-empty system, or simply resource
utilization in the remainder of this chapter.1 However, in a middlebox, different resources may see
different levels of utilization. The question is: how should the “system utilization” be properly defined?
One possible definition is to add up the utilization rates of all resources. This definition implicitly
assumes exchangeable resources, say, 1% CPU usage is equivalent to 1% bandwidth consumption, which
may not be well justified in many circumstances, especially when one type of resource is scarce in the
system and is valued more than the other.
In this chapter, we measure efficiency with the schedule makespan. Given input flows with a finite
number of packets, the makespan of a schedule is defined as the time elapsed from the arrival of the first
packet to the time when all packets finish processing on all resources. One can also view makespan as
1 This definition is different from that of queueing theory, where the utilization is defined as the fraction of time a device is busy [81]. Under this definition, high utilization usually means a high congestion level with a large queue backlog and long delays [82], and is usually not desired.
the completion time of the last flow. Intuitively, given a finite traffic input, the shorter the makespan is,
the faster the input traffic is processed, and the more efficient the schedule is.2
4.2.3 Tradeoff between Fairness and Efficiency
With the precise measure of efficiency, we are curious to know how much efficiency is sacrificed for fair
queueing. To answer this question, we first generalize the definition of work conserving schedules from
traditional single-resource fair queueing to multiple resources. In particular, we say a schedule is work
conserving if at least one resource is fully utilized for packet processing when there is a backlogged flow.
In other words, a work conserving schedule does not allow resources to sit idle if they can be
used to process a backlogged packet. Existing multi-resource fair queueing algorithms (e.g., DRFQ [4],
MR3, and GMR3) use the goal of achieving work conservation as an indication of efficiency. However,
in the theorem below, we observe that such an approach is ineffective.
Theorem 11. Let m be the number of resource types concerned. Given any traffic input I, let Tσ(I)
be the makespan of a work conserving schedule σ, and T ∗(I) the minimum makespan of an optimal
schedule. We have
Tσ(I) ≤ mT ∗(I). (4.2)
Proof. Given a traffic input I, let the work conserving schedule σ consist of $n_b$ busy periods. A busy
period is a time interval during which at least one type of resource is used for packet processing. When
the system is empty and a new packet arrives, a new busy period starts. The busy period ends when
the system becomes empty again. We consider the following two cases.
Case 1: $n_b = 1$. Let traffic input I consist of N packets, ordered based on their arrival times, where
packet 1 arrives first. For packet i, let $\tau_r^{(i)}$ be its packet processing time on resource r. It is easy to
check that the following inequality holds for the optimal schedule with the minimum makespan:
\[
T^*(I) \ge \max_r \sum_{i=1}^{N} \tau_r^{(i)}. \tag{4.3}
\]
On the other hand, for work conserving schedule σ, its makespan reaches the maximum when packet
2 In general, makespan is not the only efficiency measure that one can define. For example, we can also measure efficiency with the average flow completion time. We choose makespan as the efficiency measure in this chapter because it leads to tractable analysis. More importantly, makespan closely relates to “system utilization” and is conceptually easy to understand. The discussion of other possible efficiency measures is out of the scope of this chapter.
processing does not overlap in time, across all resources, i.e.,
\[
T_\sigma(I) \le \sum_{i=1}^{N} \sum_{r=1}^{m} \tau_r^{(i)}. \tag{4.4}
\]
This leads to the following inequalities:
\[
T_\sigma(I) \le \sum_{i=1}^{N} \sum_{r=1}^{m} \tau_r^{(i)} \le \sum_{i=1}^{N} m \max_r \tau_r^{(i)} \le m\, T^*(I). \tag{4.5}
\]
Case 2: $n_b > 1$. Given traffic input I, let $I(t^+)$ be the packets that arrive on or after time t. For
schedule σ, let $t_0$ be the time when its second-to-last busy period (the $(n_b-1)$-th) ends, and $t_1$ the time when the
last busy period (the $n_b$-th) starts. Because schedule σ is work conserving, no packet arrives between $t_0$ and
t1. We have
\[
T_\sigma(I) = t_1 + T_\sigma(I(t_1^+)), \tag{4.6}
\]
and
\[
T^*(I) = t_1 + T^*(I(t_1^+)). \tag{4.7}
\]
Note that given traffic input $I(t_1^+)$, schedule σ consists of only one busy period. By the discussion of
Case 1, we have
\[
\begin{aligned}
T_\sigma(I) &= t_1 + T_\sigma(I(t_1^+)) \\
&\le t_1 + m\, T^*(I(t_1^+)) \\
&\le m\, T^*(I),
\end{aligned} \tag{4.8}
\]
where the last inequality is derived from (4.7) and the fact that m ≥ 1. □
We make the following three observations from Theorem 11. First, the tradeoff between fairness
and efficiency is a unique challenge facing multi-resource scheduling. When the system consists of only
one type of resource (i.e., m = 1), work conservation is sufficient to achieve the minimum makespan,
leaving fairness as the only concern. For this reason, efficiency has never been a problem for traditional
single-resource fair queueing. Second, while work conservation also provides some efficiency guarantee
for multi-resource scheduling, the more types of resources, the weaker the guarantee. Third, even
with a small number of resource types, the efficiency loss could be quite significant. Since bandwidth
throughput is inversely proportional to the schedule makespan, Theorem 11 implies that solely relying
on work conservation may incur up to 50% loss of bandwidth throughput when there are two types of
resources. While this is based on the worst case, as we shall see later in Sec. 4.6, our experiments confirm
that a throughput loss of as much as 20% is introduced by the existing fair queueing algorithms. Trading
off some degree of fairness for higher efficiency is therefore well justified, especially for applications with
loose fairness requirements.
4.2.4 Challenges
Unfortunately, striking a desired balance between fairness and efficiency in a multi-resource system
is technically non-trivial. Even minimizing the makespan without regard to fairness – a special case
of fairness-efficiency tradeoff – is NP-hard. In particular, we note that minimizing the makespan of
a packet schedule can be modeled as a multi-stage flow shop problem [83–85] studied in operations
research, where the equivalent of a packet is a job, and the equivalent of a type of resource is a machine.
However, flow shop scheduling is a notoriously hard problem, even in its offline setting where the entire
input is known beforehand. Specifically, when all jobs (packets) are available at the very beginning,
finding the minimum makespan is strongly NP-hard when the number of machines (resources) is greater
than two [86].
Given the hardness results above, in this chapter, we limit our discussion to two types of resources,
CPU and link bandwidth, as these are the two middlebox resources of most concern [4, 17]. We note that
even with two types of resources, minimizing the schedule makespan remains a hard problem. Because
packets arrive dynamically over time, the problem resembles a 2-machine online flow shop scheduling
problem where jobs (packets) do not reveal their information until they arrive. For this problem, only a
limited number of negative results are known [83–85, 87, 88]. Specifically, no online algorithm can ensure a
makespan within a factor of 1.349 of the optimum in all cases [89]. We also notice that no existing work
gives a concrete solution, even a heuristic algorithm, that jointly considers both makespan and fairness.
4.3 Fairness, Efficiency, and Their Tradeoff in the Fluid Model
The difficulty of makespan minimization is mainly introduced by the combinatorial nature of multi-
resource scheduling. One approach to circumvent this problem is to consider a fluid relaxation, where
packets are served in arbitrarily small increments on all resources. For each packet, this is equivalent
to processing it simultaneously on all resources with the same progress, and head-of-line packets of
backlogged flows can also be served in parallel, at (potentially) different processing rates. Such a parallel
processing fluid model eliminates the need for discussing the scheduling orders of flows. Instead, it
allows us to focus on the resource shares allocated to flows, hence relaxing a combinatorial optimization
problem to a simpler dynamic resource allocation problem. While in general, optimally solving such
a dynamic problem requires knowing future packet arrivals, we show in this section that, under some
practical assumptions, a greedy algorithm gives an optimal online schedule with the minimum makespan.
We can then strike a balance between efficiency and fairness by imposing some fairness constraints to
the fluid schedule. We shall discuss later in Sec. 4.4 and Sec. 4.5 how this fluid schedule is implemented
in practice with a packet-by-packet tracking algorithm at acceptable complexity.
4.3.1 Fluid Relaxation
In the fluid model, a flow is relaxed to a fluid where each of its packets is served simultaneously on all
resources with the same progress. Packets of different flows are also served in parallel. The schedule
needs to decide, at each time, the resource share allocated to each backlogged flow. In particular, let Bt
be the set of flows that are backlogged at time t. Let ati,r be the fraction (share) of resource r allocated
to flow i at time t. The fluid schedule determines, at each time t, the resource allocation ati,r for each
backlogged flow i and each resource r.
Two constraints must be satisfied when making resource allocation decisions. First, we must ensure
that no resource is allocated more than its total availability:
\[
\sum_{i\in B_t} a_{i,r}^t \le 1, \quad r = 1, 2. \tag{4.9}
\]
The second constraint ensures that a packet is processed at a consistent rate across resources. In
particular, for a backlogged flow i and its head-of-line packet at time t, let $\tau_{i,r}^t$ be its packet processing
time on resource r, and
\[
r_i = \arg\max_r\, \tau_{i,r}^t \tag{4.10}
\]
be its dominant resource. The processing rate that this packet receives on resource r is computed as
the ratio between the resource share allocated and the processing time required: $a_{i,r}^t/\tau_{i,r}^t$. To ensure a
consistent processing rate, we have
\[
a_{i,r}^t/\tau_{i,r}^t = a_{i,r'}^t/\tau_{i,r'}^t, \quad \text{for all } r \text{ and } r'.
\]
Setting $r' = r_i$ above, we see a linear relation between the allocation share of resource r and that
of the dominant resource:
\[
a_{i,r}^t = \frac{\tau_{i,r}^t}{\tau_{i,r_i}^t}\, a_{i,r_i}^t = \bar{\tau}_{i,r}^t\, d_i^t, \tag{4.11}
\]
Table 4.1: Main notations used in the fluid model. The superscript t is dropped when time can be clearly inferred from the context.

  Notation                    Explanation
  n                           maximum number of flows that are concurrently backlogged
  α                           fairness knob specified by the network operator
  B (or B^t)                  set of flows that are currently backlogged (at time t)
  d_i (or d_i^t)              dominant share allocated to flow i (at time t)
  d (or d^t)                  fair dominant share (at time t), given by (4.16)
  τ_{i,r} (or τ_{i,r}^t)      packet processing time on resource r required by the head-of-line packet of flow i (at time t)
  τ̄_{i,r} (or τ̄_{i,r}^t)      normalized τ_{i,r} (or τ_{i,r}^t), defined by (4.12)
where
\[
\bar{\tau}_{i,r}^t = \tau_{i,r}^t/\tau_{i,r_i}^t \tag{4.12}
\]
is the normalized packet processing time on resource r, and
\[
d_i^t = a_{i,r_i}^t \tag{4.13}
\]
is the dominant share allocated to flow i at time t. Plugging (4.11) into (4.9), we combine the two
constraints into one feasibility constraint of a fluid schedule:
\[
\sum_{i\in B_t} \bar{\tau}_{i,r}^t\, d_i^t \le 1, \quad r = 1, 2. \tag{4.14}
\]
Before we discuss the tradeoff between fairness and efficiency, we first consider two special cases,
where either fairness or efficiency is the only objective to optimize in the fluid model. For ease of
presentation, we drop the superscript t when time can be clearly inferred from the context. Table 4.1
summarizes the main notations used in the fluid model.
4.3.2 Fluid Schedule with Perfect Fairness
We first consider the fairness objective. To achieve perfect DRF, the fluid schedule enforces strict max-
min fairness on flows’ dominant shares, under the feasibility constraint.

Figure 4.2: The DRGPS fluid schedule that implements perfect fairness in the example of Fig. 4.1. Flow 1 sends packets p1, p2, ..., and receives 〈2/5 CPU, 3/5 bandwidth〉; Flow 2 sends packets q1, q2, ..., and receives 〈3/5 CPU, 1/15 bandwidth〉. Only 2/3 of the link bandwidth is utilized.

Specifically, the fluid schedule solves the following DRF allocation problem [12, 16] at each time t:
\[
\begin{aligned}
\max_{d_i}\ & \min_{i\in B}\ d_i \\
\text{s.t.}\ & \sum_{i\in B} \bar{\tau}_{i,r}\, d_i \le 1, \quad r = 1, 2.
\end{aligned} \tag{4.15}
\]
Let n be the number of backlogged flows. The optimal solution, denoted by d = (d1, . . . , dn), allocates
each backlogged flow the same dominant share, i.e.,
\[
d_i = d = 1 \Big/ \max\Big\{ \sum_i \bar{\tau}_{i,1},\ \sum_i \bar{\tau}_{i,2} \Big\}. \tag{4.16}
\]
In any backlogged periods, because flows are allocated the same dominant shares, they receive the same
dominant services, achieving strict DRF at all times. The resulting fluid schedule is also known as
DRGPS [66], a multi-resource generalization of the well-known GPS [21, 22].
Any discrete fair schedule is essentially a packet-by-packet approximation to DRGPS. For instance,
applying DRGPS to the example of Fig. 4.1 leads to a fluid schedule shown in Fig. 4.2, where the
normalized packet processing times of Flow 1 and Flow 2 are $\langle\bar{\tau}_{1,1}, \bar{\tau}_{1,2}\rangle = \langle 2/3, 1\rangle$ and $\langle\bar{\tau}_{2,1}, \bar{\tau}_{2,2}\rangle = \langle 1, 1/9\rangle$, respectively. By (4.16), both flows are allocated the same dominant share d = 3/5. Specifically,
Flow 1 receives 〈2/5 CPU, 3/5 bandwidth〉; Flow 2 receives 〈3/5 CPU, 1/15 bandwidth〉. In total, only
2/3 of the bandwidth is utilized, the same as the discrete fair schedule shown in Fig. 4.1a.
4.3.3 Fluid Schedule with Optimal Efficiency
We next discuss the efficiency objective. While there are some schedules proposed in the operations
research literature that can achieve the minimum makespan for a flow shop problem, none of them
applies in the context of packet scheduling: they either assume no packet arrivals (e.g., [90]) or require
full knowledge of future information (e.g., [91]). We propose a simple greedy fluid schedule as follows.
For a given time instant, we define the system’s instantaneous dominant throughput as the sum of
the dominant shares allocated, i.e., $\sum_{i\in B} d_i$. Intuitively, by maximizing $\sum_{i\in B} d_i$ at all times, one would
expect a high average dominant throughput $\sum_{i\in B} D_i / T$, where T is the schedule makespan and $D_i$ is
the total dominant service (processing time) required by flow i. Given the dominant workload $\sum_{i\in B} D_i$,
maximizing the average dominant throughput is equivalent to minimizing the schedule makespan T.
Following this intuition, we propose a greedy fluid schedule that solves the following resource allocation
problem to maximize the instantaneous dominant throughput at every time:
\[
\begin{aligned}
\max_{d_i \ge 0}\ & \sum_{i\in B} d_i \\
\text{s.t.}\ & \sum_{i\in B} \bar{\tau}_{i,r}\, d_i \le 1, \quad r = 1, 2.
\end{aligned} \tag{4.17}
\]
In case the optimal solution, denoted $d^* = (d_1^*, \dots, d_n^*)$, is not unique, the schedule chooses the one
with the maximum overall utilization:
\[
\max_{d_i^*} \sum_r \sum_{i\in B} \bar{\tau}_{i,r}\, d_i^*. \tag{4.18}
\]
In the example of Fig. 4.1, solving (4.17) allocates Flow 1 the dominant share $d_1^* = 24/25$ and Flow 2 the
dominant share $d_2^* = 9/25$. It is easy to check that both CPU and link bandwidth are fully utilized.
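These values can be verified directly: since both resources are fully utilized here, the two capacity constraints of (4.17) hold with equality for the normalized demands 〈2/3, 1〉 and 〈1, 1/9〉, i.e.,
\[
\tfrac{2}{3}\, d_1^* + d_2^* = 1 \ \ \text{(CPU)}, \qquad d_1^* + \tfrac{1}{9}\, d_2^* = 1 \ \ \text{(link)},
\]
whose unique solution is $d_1^* = 24/25$ and $d_2^* = 9/25$; the resulting dominant throughput $d_1^* + d_2^* = 33/25$ exceeds the value $2d = 6/5$ achieved under perfect fairness (4.16).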
Compared to those schedules proposed in the operations research literature, the greedy schedule
defined by (4.17) is particularly attractive for packet scheduling due to the following three properties.
First, it is an online algorithm without any a priori knowledge of future packet arrivals. Further, among
all packets that are backlogged, only the information regarding head-of-line packets is required. This
suggests that the schedule only needs to maintain a very simple per-flow state. Most importantly, the
greedy schedule is more than a simple heuristic. Below we show that under some practical assumptions,
greedily maximizing the dominant throughput gives the minimum makespan. Our analysis requires the
following lemma, where we show that the schedule will not leave any resource idle, unless all flows
bottleneck on the same resource, in which case the other resource cannot be fully utilized anyway. The
proof is given in Sec. 4.9.1.
Lemma 11. The fluid schedule defined by (4.17) fully utilizes both resources if there are two head-of-line
packets with different dominant resources, i.e., there exist two flows j and l, such that $\bar{\tau}_{j,1} = 1 > \bar{\tau}_{j,2}$ and
$\bar{\tau}_{l,1} < \bar{\tau}_{l,2} = 1$.
With Lemma 11, we analyze the makespan of the fluid schedule defined by (4.17). Following [4], we
say a flow is dominant-resource monotonic if it does not change its dominant resource during backlogged
periods. To make the analysis tractable, we assume that flows are dominant-resource monotonic. This
is often true in practice as packets in the same flow usually undergo the same processing, and hence
have the same dominant resource. The following lemma, whose proof can be found in Sec. 4.9.2, states
the optimality of the fluid schedule in a static scenario without dynamic packet arrivals.
Lemma 12. For dominant-resource monotonic flows, the fluid schedule defined by (4.17) gives the
minimum makespan if all packets are available at the beginning.
We now extend the results of Lemma 12 to an online case where packets dynamically arrive over
time. The following theorem gives the optimality condition of the fluid schedule. The proof can be found
in Sec. 4.9.3.
Theorem 12. For dominant-resource monotonic flows, the fluid schedule defined by (4.17) gives the
minimum makespan among all schedules, if after the system has two flows with different dominant
resources, whenever a new flow arrives, there exist two backlogged flows with different dominant resources.
The optimality conditions required by Theorem 12 can be easily met in practice. Because the number
of backlogged flows is usually large, we can almost always find two flows with different dominant
resources. In fact, even in the unfortunate case where all flows bottleneck on the same resource, the
greedy fluid schedule does not deviate far from the optimum: no matter what fluid schedule is used, the
bottleneck resource is always fully utilized when the system is non-empty; hence all schedules see the
same backlog on that resource, which is a dominant factor in determining the schedule makespan.
The significance of Theorem 12 is that it connects makespan, a measure defined in the time domain,
to the instantaneous dominant throughput, a measure defined in the space domain. More importantly,
it shows that minimizing the former is, in a practical sense, equivalent to maximizing the latter at all
times, without the need to know future packet arrivals. We shall use this intuition to strike a balance
between fairness and efficiency in the next subsection.
4.3.4 Tradeoff between Fairness and Efficiency
When both fairness and efficiency are considered, we express the tradeoff between the two conflicting
objectives as a constrained optimization problem – minimizing makespan under some specified fairness
requirements.3 Recall that when perfect fairness is enforced, all flows receive the same dominant share
d computed by (4.16), i.e., di = d for all i. When fairness is not a strict requirement, we introduce
a fairness knob α ∈ [0, 1] to specify the fairness degradation. In particular, an allocation d is called
α-portion fair if di ≥ αd for every backlogged flow i. In other words, each flow receives at least an α-portion
3 In general, the formulation of the tradeoff is not unique. However, we found that our formulation is particularly attractive: it leads to both tractable analysis and a practical implementation, as we shall show later in this chapter.
of its fair dominant share d. A fluid schedule is called α-portion fair if it achieves the α-portion fair
allocation at all times.
By choosing different values for α, a network operator can precisely control the fairness degradation.
As two extreme cases, setting α = 0 means that fairness is not considered at all; setting α = 1 means
that perfect fairness must be enforced at all times.
Given the specified fairness knob α, the fluid schedule tries to minimize makespan under the corre-
sponding α-portion fairness constraints. Since minimizing makespan is, in a practical sense, equivalent
to maximizing the system’s dominant throughput, we obtain a simple tradeoff heuristic that maximizes
the dominant throughput, subject to the required α-portion fairness at every time t:
\[
\begin{aligned}
\max_{d_i}\ & \sum_{i\in B} d_i \\
\text{s.t.}\ & \sum_{i\in B} \bar{\tau}_{i,r}\, d_i \le 1, \quad r = 1, 2, \\
& d_i \ge \alpha d, \quad \forall i \in B,
\end{aligned} \tag{4.19}
\]
where the fair share d is given by (4.16). We see that the fluid schedule captures both DRGPS and the
greedy schedule defined by (4.17) as special cases with α = 1 and 0, respectively.
Special Solution Structure. The tradeoff problem (4.19) has a closed-form solution, based on
which the tradeoff schedule can be easily computed. We first allocate each flow its guaranteed portion
of dominant share αd. We then denote
\[
\hat{d}_i = d_i - \alpha d \tag{4.20}
\]
as the bonus dominant share allocated to flow i. Substituting (4.20) into (4.19), we equivalently rewrite
(4.19) as a problem of determining the bonus dominant share received by each flow:
\[
\begin{aligned}
\max_{\hat{d}_i \ge 0}\ & \sum_{i\in B} \hat{d}_i + |B|\,\alpha d \\
\text{s.t.}\ & \sum_{i\in B} \bar{\tau}_{i,r}\, \hat{d}_i \le \mu_r, \quad r = 1, 2,
\end{aligned} \tag{4.21}
\]
where
\[
\mu_r = 1 - \alpha d \sum_{i\in B} \bar{\tau}_{i,r}, \quad r = 1, 2, \tag{4.22}
\]
is the remaining share of resource r after each flow receives its guaranteed dominant share αd.
Without loss of generality, we sort all the backlogged flows based on the processing demands on the two
types of resources required by their head-of-line packets as follows:
\[
\bar{\tau}_{1,1}/\bar{\tau}_{1,2} \ \ge\ \cdots\ \ge\ \bar{\tau}_{n,1}/\bar{\tau}_{n,2}. \tag{4.23}
\]
The following theorem shows that at most two flows are awarded the bonus share at a time. Its proof
is given in Sec. 4.9.4.
Theorem 13. There exists an optimal solution $\hat{d}^*$ to (4.21) where $\hat{d}_i^* = 0$ for all $2 \le i \le n - 1$. In
particular, $\hat{d}^*$ is given in the following three cases:

Case 1: $\mu_1/\mu_2 < \bar{\tau}_{n,1}/\bar{\tau}_{n,2}$. In this case, resource 1 is fully utilized, with $\hat{d}_n^* = \mu_1/\bar{\tau}_{n,1}$ and $\hat{d}_i^* = 0$ for all $i < n$.

Case 2: $\mu_1/\mu_2 > \bar{\tau}_{1,1}/\bar{\tau}_{1,2}$. In this case, resource 2 is fully utilized, with $\hat{d}_1^* = \mu_2/\bar{\tau}_{1,2}$ and $\hat{d}_i^* = 0$ for all $i > 1$.

Case 3: $\bar{\tau}_{n,1}/\bar{\tau}_{n,2} \le \mu_1/\mu_2 \le \bar{\tau}_{1,1}/\bar{\tau}_{1,2}$. In this case, both resources are fully utilized, and we have
\[
\hat{d}_i^* =
\begin{cases}
(\mu_1\bar{\tau}_{n,2} - \mu_2\bar{\tau}_{n,1})/(\bar{\tau}_{1,1}\bar{\tau}_{n,2} - \bar{\tau}_{1,2}\bar{\tau}_{n,1}), & i = 1; \\
(\mu_2\bar{\tau}_{1,1} - \mu_1\bar{\tau}_{1,2})/(\bar{\tau}_{1,1}\bar{\tau}_{n,2} - \bar{\tau}_{1,2}\bar{\tau}_{n,1}), & i = n; \\
0, & \text{otherwise.}
\end{cases}
\]
Once the optimal bonus dominant share has been determined as shown above, the optimal solution
d∗ to (4.19), which is the dominant share allocated to each flow, can be easily computed as the sum of
the bonus share and the guaranteed share:
\[
d_i^* = \hat{d}_i^* + \alpha d, \quad \text{for all } i. \tag{4.24}
\]
We give an intuitive explanation of Theorem 13 as follows. The first two cases of Theorem 13
correspond to the scenario where after each flow receives its guaranteed share, the remaining amounts of
the two types of resources are unbalanced and cannot be fully utilized simultaneously. In this case, the
schedule awards the bonus share to the flow (either Flow 1 or Flow n) whose processing demands can
better utilize the remaining resources. The third case covers the scenario where the remaining amounts
of the two types of resources are balanced, and can be fully utilized when the system is non-empty. In
this case, they are allocated to two flows (Flow 1 and Flow n) with complementary resource demands
as their bonus shares.
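To make the closed form concrete, the following minimal sketch (not part of our prototype; all struct and function names are hypothetical) shows how Theorem 13, together with (4.16), (4.22), and (4.24), might be evaluated for two resources. It assumes the flows are already sorted in the order (4.23) and that all normalized processing times are strictly positive.

#include <algorithm>
#include <vector>

// Normalized head-of-line demands of one backlogged flow (Table 4.1):
// tau[0] and tau[1] are the processing times on the two resources, each
// divided by the dominant (larger) one, so max(tau[0], tau[1]) = 1.
struct Flow { double tau[2]; };

// Dominant share d_i allocated to every flow by the tradeoff schedule (4.19),
// computed via the closed form of Theorem 13. Flows must be sorted by
// tau[0]/tau[1] in non-increasing order, as in (4.23).
std::vector<double> tradeoffShares(const std::vector<Flow>& flows, double alpha) {
    const int n = static_cast<int>(flows.size());
    double sum[2] = {0.0, 0.0};
    for (const Flow& f : flows) { sum[0] += f.tau[0]; sum[1] += f.tau[1]; }

    // Fair dominant share (4.16) and leftover capacities (4.22).
    const double d = 1.0 / std::max(sum[0], sum[1]);
    const double mu[2] = {1.0 - alpha * d * sum[0], 1.0 - alpha * d * sum[1]};

    std::vector<double> bonus(n, 0.0);   // \hat{d}_i in (4.20)
    const Flow& first = flows.front();   // "flow 1" in order (4.23)
    const Flow& last = flows.back();     // "flow n" in order (4.23)
    if (n == 1) {
        // Degenerate case: a single backlogged flow takes whatever is left.
        bonus[0] = std::min(mu[0] / first.tau[0], mu[1] / first.tau[1]);
    } else if (mu[0] * last.tau[1] < mu[1] * last.tau[0]) {
        // Case 1: resource 1 is the scarcer leftover; favor flow n.
        bonus[n - 1] = mu[0] / last.tau[0];
    } else if (mu[0] * first.tau[1] > mu[1] * first.tau[0]) {
        // Case 2: resource 2 is the scarcer leftover; favor flow 1.
        bonus[0] = mu[1] / first.tau[1];
    } else {
        // Case 3: both leftovers are exhausted by flows 1 and n together.
        const double det = first.tau[0] * last.tau[1] - first.tau[1] * last.tau[0];
        if (det == 0.0) {
            bonus[0] = mu[0] / first.tau[0];  // all ratios equal: any split works
        } else {
            bonus[0]     = (mu[0] * last.tau[1] - mu[1] * last.tau[0]) / det;
            bonus[n - 1] = (mu[1] * first.tau[0] - mu[0] * first.tau[1]) / det;
        }
    }

    std::vector<double> share(n);
    for (int i = 0; i < n; ++i) share[i] = bonus[i] + alpha * d;  // (4.24)
    return share;
}

For instance, with α = 0 and the two head-of-line packets of Fig. 4.1 (sorted demands 〈1, 1/9〉 and 〈2/3, 1〉), the sketch returns 9/25 and 24/25, matching the greedy allocation above; with α = 1 it returns 3/5 for both flows, matching DRGPS.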
Theorem 13 reveals an important structure: at most two flows are allocated more dominant
shares than others. We refer to these flows as the favored flows and all the others as the regular flows.
We shall show in Sec. 4.5 that this structure leads to an efficient O(log n) implementation of the fluid
schedule.
4.4 Packet-by-Packet Tracking
So far, all our discussions are based on an idealized fluid model. In practice, however, packets are
processed as separate entities. In this section, we present a discrete tracking algorithm that implements
the fluid schedule as a packet-by-packet schedule in practice. We show that the discrete schedule is
asymptotically close to the fluid schedule, in terms of both fairness and efficiency. We start with a
comparison between two typical tracking approaches.
4.4.1 Start-Time Tracking vs. Finish-Time Tracking
Two common tracking algorithms may be used to implement a fluid schedule in practice, start-time
tracking and finish-time tracking. The former tracks the order of packet start times—among all packets
that have already started in the fluid schedule, the one that starts the earliest is scheduled first. Finish-
time tracking, on the other hand, assigns the highest scheduling priority to the packet that completes
service the earliest in the fluid schedule. In traditional single-resource fair queueing, FQS [79] uses the
former approach to track GPS, while WFQ [21,22,26] adopts the latter approach.
While both algorithms closely track the fluid schedule of fair queueing, only start-time tracking is
well defined for the tradeoff schedule given by (4.19). This is due to the fact that, in the tradeoff
schedule, future traffic arrivals may lead to a different allocation of packet processing rates and may
subsequently change the packet finish times of current packets. As a result, determining the order of
finish times requires future traffic arrival information and hence is unrealistic.4 Start-time tracking
avoids this problem as packets are scheduled only after they start in the fluid schedule.
For this reason, we use start-time tracking to implement the fluid schedule. We say a discrete
schedule and a fluid schedule correspond to each other if the former tracks the latter by the packet start
time. Specifically, we maintain the fluid schedule in the background. Whenever there is a scheduling
opportunity, among all head-of-line packets that have already started in the fluid schedule, the one that
starts the earliest is chosen. Below we show that this discrete schedule is asymptotically close to its
corresponding fluid schedule.
4 This is not a problem of single-resource fair queueing, as different flows are allocated the same processing rate, so that future traffic arrivals will not affect the order of finish times of current packets.
4.4.2 Performance Analysis
To analyze the performance of start-time tracking, we introduce the following notations. Let τmax be
the maximum packet processing time required by any packet on any resource. Let n be the maximum
number of flows that are concurrently backlogged. Let TF be the makespan of the fluid schedule, and
TD the makespan of its corresponding discrete schedule. All proofs are given in Secs. 4.9.5, 4.9.6, and
4.9.7.
The following theorem bounds the difference between the makespan of the fluid schedule and that of its
corresponding discrete schedule.
Theorem 14. For the fluid schedule with α > 0 and its corresponding discrete schedule, we have
TD ≤ TF + nτmax. (4.25)
The error bound nτmax can be intuitively explained as the total packet processing time required by
all n concurrent flows, each sending only one packet. In practice, the number of packets a flow sends
is usually significantly larger than one. As a result, the traffic makespan is significantly larger than the
error bound, i.e., $T_F \gg n\tau_{\max}$. Theorem 14 essentially indicates that in terms of makespan, the two
schedules are asymptotically close to each other.
We next analyze the fairness performance of the discrete schedule by comparing the dominant services
a flow receives under both schedules. In particular, let DFi (0, t) be the dominant services flow i receives
in (0, t) under the fluid schedule, and DDi (0, t) the dominant services flow i receives in (0, t) under the
corresponding discrete schedule. The following theorem shows that flows receive approximately the same
dominant services under both schedules.
Theorem 15. For the fluid schedule with α > 0 and its corresponding discrete schedule, the following
inequality holds for any flow i and any time t:
\[
D_i^F(0, t) - 2(n-1)\tau_{\max} \le D_i^D(0, t) \le D_i^F(0, t) + \tau_{\max}. \tag{4.26}
\]
In other words, the difference between the dominant services a flow receives under the two corre-
sponding schedules is bounded by a constant amount, irrespective of the time t. Over the long run, the
discrete schedule achieves the same α-portion fairness as its corresponding fluid schedule. To summarize,
start-time tracking retains both the efficiency and fairness properties of its corresponding fluid schedule
in the asymptotic regime.
4.5 An O(log n) Implementation
To implement the aforementioned start-time tracking algorithm, two modules are required: packet pro-
filing and fluid scheduling. The former estimates the packet processing time on both CPU and link
bandwidth; the latter maintains the fluid schedule as a reference system based on the packet profiling
results. We show in this section that packet profiling can be quickly accomplished in O(1) time using
a simple approach proposed in [4]. The main challenge comes from the complexity of maintaining the
fluid schedule, where direct implementation requires O(n) time. Here, n is the number of backlogged
flows. We give an O(log n) implementation based on an approach similar to virtual time. We shall show
in Sec. 4.6 that the implementation can be easily prototyped in the Click modular router [80].
4.5.1 Packet Profiling
As pointed out by Ghodsi et al. [4], any multi-resource fair queueing algorithm, including our fluid
schedule, requires knowledge of the packet processing time on each resource. Fortunately, as shown
in [4], CPU processing time can be accurately estimated as a linear function of packet size. Specifically,
for a packet of size l, the CPU processing time is estimated as al + b, where a and b are the coefficients
depending on the type of packet processing (e.g., IPsec). We have validated this linear model through
an upfront experiment using Click [80]. For each type of packet processing, we measure the exact CPU
processing time required by packets of different sizes. This allows us to determine the coefficients a and
b. We fit such a linear model to the scheduler and use it to estimate the CPU processing time required
by a packet. As for the packet transmission time, the estimation is simply the packet size divided by
the outgoing bandwidth, which is known a priori.
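For illustration, a least-squares fit of this linear model from offline (packet size, measured CPU time) samples might be sketched as follows; the function names and data layout are ours, not those of the Click prototype, and at least two samples with distinct packet sizes are assumed.

#include <utility>
#include <vector>

// Ordinary least-squares fit of the per-module linear model tau_cpu = a * l + b
// over offline (packet size l, measured CPU time) samples.
std::pair<double, double> fitCpuModel(const std::vector<std::pair<double, double>>& samples) {
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (const auto& s : samples) {
        sx += s.first;             sy += s.second;
        sxx += s.first * s.first;  sxy += s.first * s.second;
    }
    const double n = static_cast<double>(samples.size());
    const double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double b = (sy - a * sx) / n;
    return {a, b};  // estimated CPU time of a packet of size l is a * l + b
}

// The transmission time needs no profiling: it is the packet size divided by
// the (known) outgoing link bandwidth.
double transmissionTime(double sizeBits, double bandwidthBitsPerSec) {
    return sizeBits / bandwidthBitsPerSec;
}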
4.5.2 Direct Implementation of Fluid Scheduling
Based on the packet profiling results, the fluid schedule is constructed and is maintained by the fluid
scheduler. In particular, we need to determine the next packet that starts in the fluid schedule. This
requires tracking the work progress of all n flows. Below we give a direct implementation that will be
used later in our virtual time implementation.
For each flow i, we record d∗i , which is the dominant share the flow receives in the fluid schedule at
the current time and is computed by (4.24). We also record Ri, the remaining dominant processing time
required by the head-of-line packet of the flow at the current time.
For flow i, its head-of-line packet will finish in Ri/d∗i time if no event occurs in the meantime. An event is either a
packet departure or a packet being the new head-of-line in the fluid schedule. Either of them may change
the head-of-line packet of a flow, leading to different coefficients of the tradeoff problem (4.19). With d∗i
and Ri, we can accurately track the work progress of flow i on an event-driven basis. Specifically, upon
the occurrence of an event, let ∆t be the time elapsed since the last update. If ∆t < Ri/d∗i , meaning
that the event occurs before the head-of-line packet finishes, we update Ri ← Ri− d∗i∆t. If ∆t = Ri/d∗i ,
meaning that the event occurs at the time when the head-of-line packet finishes, we check if flow i has
a next packet p to process. If it does, then packet p becomes the new head-of-line and should start in
the fluid schedule. We update Ri as the dominant processing time required by p. Otherwise, we reset
Ri ← 0, and flow i leaves the fluid system. We also recompute d∗i after Ri is updated. (Note that it is
impossible to have ∆t > Ri/d∗i .)
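A minimal sketch of this per-flow, event-driven bookkeeping is given below; the names are ours, and recomputing d∗i by re-solving (4.19) for the new set of head-of-line packets is left as an external step (for example, via the closed form of Theorem 13).

#include <deque>

// Per-flow state kept by the direct implementation (Sec. 4.5.2).
struct FlowState {
    double share = 0.0;        // d_i^*: dominant share under the current allocation
    double remaining = 0.0;    // R_i: remaining dominant work of the head-of-line packet
    std::deque<double> queue;  // dominant processing times of the queued packets
};

// Advance one backlogged flow by dt seconds of fluid service. Returns true if its
// head-of-line packet finishes at this event, in which case the next packet
// (if any) becomes head-of-line and starts in the fluid schedule; otherwise
// R_i is simply decremented by d_i^* * dt.
bool advanceFlow(FlowState& f, double dt) {
    const double served = f.share * dt;
    if (served < f.remaining) {
        f.remaining -= served;            // event occurs before the packet finishes
        return false;
    }
    f.queue.pop_front();                  // head-of-line packet departs
    f.remaining = f.queue.empty() ? 0.0   // flow leaves the fluid system
                                  : f.queue.front();
    return true;
}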
However, purely relying on the approach above to track the work progress of all n flows is highly
inefficient. Whenever an event occurs, each flow must be updated individually, which requires at least
O(n) time per event and is too expensive. We next introduce a more efficient implementation that
requires the above procedure for at most two flows.
4.5.3 Virtual Time Implementation of Fluid Scheduling
To avoid the high complexity required by the direct implementation above, we have noted, by The-
orem 13, that at most two flows are favored and are allocated more dominant shares than others.
Therefore, it suffices to maintain at most three dominant shares at a time—two for the favored flows
and one for the other regular flows. For regular flows, we track their work progress using an approach
similar to the virtual time implementation of GPS [21, 22]. Our intuition is that, by Theorem 13, all
the regular flows are allocated the same dominant share, and their scheduling resembles fair queueing.
For favored flows, since there are at most two of them, we track their work progress directly, using the
direct implementation above. Our approach is detailed below.
Identifying Favored and Regular Flows
We first discuss how favored and regular flows can be quickly identified upon the occurrence of an
event. By Theorem 13, it suffices to sort flows in order (4.23) and examine the three cases. Flows
that receive the bonus share (i.e., $\hat{d}_i^* > 0$) are favored. Note that the entire computation requires only
information regarding the head-of-line packets of the first and the last flows in order (4.23) ($\bar{\tau}_{1,r}$ and
$\bar{\tau}_{n,r}$, their normalized processing times). We store all the head-of-line packets in a double-ended
priority queue maintained by a min-max heap for fast retrieval, where the packet order is defined by
(4.23). This allows us to apply Theorem 13 and identify the favored and regular flows in O(log n) time.
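As an aside, the double-ended priority queue need not be a hand-written min-max heap; for illustration, an ordered set keyed by the ratio in (4.23) offers the same O(log n) insertions and removals with both ends accessible, as in the following sketch (identifiers are ours).

#include <set>

// Head-of-line entries ordered by the ratio tau_{i,1}/tau_{i,2}, as in (4.23);
// ties are broken by flow id to obtain a strict ordering.
struct HolEntry {
    double ratio;   // normalized tau_{i,1} / tau_{i,2} of the head-of-line packet
    int flowId;
    bool operator<(const HolEntry& other) const {
        return ratio != other.ratio ? ratio > other.ratio : flowId < other.flowId;
    }
};

// Both ends of the order ("flow 1" and "flow n" in Theorem 13) are reachable in
// O(1); insertions and removals take O(log n).
using HolOrder = std::multiset<HolEntry>;

inline HolEntry firstFlow(const HolOrder& q) { return *q.begin(); }
inline HolEntry lastFlow(const HolOrder& q)  { return *q.rbegin(); }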
Tracking Favored Flows
For favored flows, because there are at most two of them, we track their work progress using the direct
implementation mentioned in Sec. 4.5.2, where we record d∗i and Ri for each favored flow i. It is easy to
see that the update complexity is dominated by the computation of d∗i . As mentioned in the previous
discussion, this can be done in O(log n) time by Theorem 13. Also, since there are at most two favored
flows, the overall tracking complexity remains O(log n) per event.
Tracking Regular Flows
For regular flows, since they receive the same dominant share, their scheduling resembles fair queueing.
We hence track their work progress using virtual time [21, 22, 26]. Specifically, we define virtual time
V (t) as a function of real time t evolving as follows:
\[
\begin{aligned}
V(0) &= 0, \\
V'(t) &= \alpha d^t, \quad t > 0.
\end{aligned} \tag{4.27}
\]
Here, $d^t$ is the fair dominant share computed by (4.16) at time t, and is fixed between two consecutive
events; $\alpha d^t$ is the dominant share each regular flow receives.5 Thus, V can be interpreted as increasing
at the marginal rate at which regular flows receive dominant services. Each regular flow i also maintains
virtual finish time Fi, indicating the virtual time at which its head-of-line packet finishes in the fluid
schedule. The virtual finish time Fi is updated as follows when flow i has a new head-of-line packet p
at time t:
Fi = V (t) + τ∗(p), (4.28)
where τ∗(p) is the dominant packet processing time required by p. Among all the regular flows, the
one with the smallest Fi has its head-of-line packet finishing first in the fluid schedule. Unless some
event occurs in between, at time t, the next packet departure for the regular flows would be in $t_N = (\min_i F_i - V(t))/(\alpha d)$ time.
Using virtual time defined by (4.27), we can accurately track the work progress of regular flows on
an event-driven basis. Specifically, upon the occurrence of an event at time t, let t0 be the time of the
last update, and ∆t = t− t0 the time elapsed since the last update. If ∆t < tN , meaning that the event
occurs before the next packet departure of regular flows, we simply update the virtual time following
5We restore the superscript t here to emphasize that the fair dominant share computed by (4.16) may change over time.
(4.27):
V (t) = V (t0) + αd∆t. (4.29)
If ∆t = tN , then the event occurs at the time when a packet of a regular flow, say flow i, finishes in the
fluid schedule. In addition to updating the virtual time, we check to see if flow i has a next packet p to
process. If it does, meaning that the packet p should start in the fluid schedule, we update its virtual
finish time Fi following (4.28). Otherwise, flow i departs the system. We also recompute d by (4.16).
The tracking complexity is dominated by the computation of the minimum virtual finish time, i.e.,
mini Fi. By storing Fi’s in a priority queue maintained by a heap, we see that the tracking complexity
is O(log n) per event.
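The virtual-time bookkeeping for regular flows might be sketched as follows (hypothetical names; α > 0 is assumed, and the fair share d is recomputed by (4.16) whenever the set of head-of-line packets changes).

#include <functional>
#include <queue>
#include <vector>

// Virtual-time state shared by the regular flows (Sec. 4.5.3); assumes alpha > 0.
struct RegularFlows {
    double V = 0.0;           // virtual time (4.27)
    double lastUpdate = 0.0;  // real time of the last update
    double alphaD = 0.0;      // alpha * d: dominant share of each regular flow
    // Min-heap of virtual finish times F_i (4.28); the per-flow heap indices used
    // for identity switching are omitted from this sketch.
    std::priority_queue<double, std::vector<double>, std::greater<double>> finishTimes;

    // Advance the virtual time to real time t, per (4.29).
    void advance(double t) {
        V += alphaD * (t - lastUpdate);
        lastUpdate = t;
    }
    // A regular flow gets a new head-of-line packet with dominant time tauStar.
    void newHeadOfLine(double t, double tauStar) {
        advance(t);
        finishTimes.push(V + tauStar);   // (4.28)
    }
    // Real-time delay until the next regular-flow packet departs, barring events.
    double nextDeparture() const {
        return finishTimes.empty() ? -1.0 : (finishTimes.top() - V) / alphaD;
    }
};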
Handling Identity Switching
We note that the identity of a flow is not fixed: upon the occurrence of an event, a favored flow may
switch to a regular flow, and vice versa. We show that such identity switching can also be easily handled
in O(log n) time.
We first consider a favored flow i switching to a regular one at time t, which requires the computation
of the virtual finish time Fi. Recall that we have recorded Ri, the remaining dominant processing time
required by the head-of-line packet, for flow i as it is previously favored. By definition, the virtual finish
time Fi can be simply computed as
Fi = V (t) +Ri. (4.30)
Adding Fi to the heap takes at most O(log n) time.
We next consider a regular flow i switching to a favored one at time t, which requires the computation
of Ri. Recall that we have recorded the virtual finish time Fi for flow i. By definition, the remaining
dominant processing time required by its head-of-line packet is simply
Ri = Fi − V (t), (4.31)
which is a dual of (4.30).
We also need to remove the virtual finish time, Fi, from the heap. To do so, we maintain an index
for each regular flow, recording the location of its virtual finish time stored in the heap. Following this
index, we can easily locate the position of Fi and delete it from the heap, followed by some standard
“trickle-down” operations to preserve the heap property in O(log n) time.
To summarize, our approach maintains the fluid schedule by identifying favored and regular flows,
tracking their work progress, and handling the potential identity switching. We show that any of these
operations can be accomplished in O(log n) time. As a result, maintaining the fluid schedule takes
O(log n) time per event.
4.5.4 Start-Time Tracking and Complexity
With the fluid schedule maintained as a reference system, the implementation of start-time tracking is
straightforward. Whenever a packet starts in the fluid schedule, it is added to a FIFO queue. Upon a
scheduling opportunity, the scheduler polls the queue and retrieves a packet to schedule. This ensures
that packets are scheduled in order of their start times in the fluid schedule. To minimize the update
frequency, the scheduler lazily updates the fluid schedule only when the FIFO queue is empty.
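The outer scheduling loop then reduces to a sketch of the following form (types and names are illustrative only).

#include <queue>

// Reference fluid schedule maintained in the background (Secs. 4.5.2 and 4.5.3).
struct FluidSchedule {
    // Advance the fluid schedule to the current time and append the id of every
    // packet that has newly started in it, in order of fluid start time.
    void catchUp(std::queue<int>& started) { /* fluid bookkeeping omitted */ }
};

// Called at each scheduling opportunity; returns the id of the next packet to
// serve, or -1 if no packet has started in the fluid schedule yet.
int nextPacket(FluidSchedule& fluid, std::queue<int>& started) {
    if (started.empty())
        fluid.catchUp(started);   // lazy update: only when the FIFO runs dry
    if (started.empty())
        return -1;
    const int id = started.front();
    started.pop();
    return id;
}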
We now analyze the scheduling complexity of the aforementioned implementation. The scheduling
decisions are made by updating the fluid schedule on an event-driven basis. For each event, the update
takes O(log n) time, where n is the number of backlogged flows. Note that there are only two types
of events in the fluid schedule, new head-of-line and packet departure. Because a packet served in the
fluid schedule triggers exactly these two events over the entire scheduling period, scheduling N packets
triggers 2N updates in the fluid schedule, with the overall complexity O(2N log n). On average, the
scheduling decision is made in O(2 log n) time per packet, the same order as that of DRFQ [4].
4.6 Evaluation
We evaluate the tradeoff algorithm via both our prototype implementation and trace-driven simulation.
We use a prototype implementation to investigate the detailed functioning of the algorithm, in a micro-
scopic view. We then take a macroscopic view to evaluate the algorithm using trace-driven simulation,
where flows dynamically join and depart the system.
4.6.1 Experimental Results
We have prototyped our tradeoff algorithm as a new scheduler in the Click modular router [80], based
on the O(log n) implementation given in the previous section. The scheduler classifies packets to flows
(based on the IP prefix and port number) and identifies the types of packet processing based on the port
number specified by a flow class table. The scheduler also exposes an interface that allows the operator
to dynamically configure the fairness knob α. Our implementation consists of roughly 1,000 lines of
C++ code.
Table 4.2: Schedule makespan observed in Click at different fairness levels. The queue capacity is infinite.
  Fairness knob α    Makespan (s)    Normalized makespan
  1.00               55.68           100.00%
  0.95               52.50            94.28%
  0.90               48.97            87.95%
  0.85               47.17            84.72%
  0.70               47.13            84.64%
  0.60               47.07            84.54%
  0.50               47.07            84.54%
We run our Click implementation in user mode on a Dell PowerEdge server with an Intel Xeon
3.0 GHz processor and 1 Gbps Ethernet interface. To make fairness relevant, we throttle the outgoing
bandwidth to 200 Mbps while keeping the inbound bandwidth as is. We also throttle the Click module
to use only 20% CPU so that CPU could also be a bottleneck.6 We configure three packet processing
modules in Click to emulate a multi-functioning middlebox: packet checking, statistical monitoring, and
IPsec. The former two modules are bandwidth-bound, though statistical monitoring requires more CPU
processing time than packet checking does. The IPsec module encrypts packets using AES (128-bit key
length) and is CPU-bound. We configure another server as a traffic source, initiating 60 UDP flows each
sending 2000 800-byte packets per second to the Click router. The first 20 flows pass through the packet
checking module; the next 20 flows pass through the statistical monitoring module; and the last 20 flows
pass through the IPsec module.
Fairness-Efficiency Tradeoff
We first evaluate the achieved tradeoff between schedule fairness and makespan. To fairly compare the
makespan at different fairness levels, it is critical to ensure the same traffic input when running the
algorithm with different values of fairness knob α. Therefore, we initially consider an idealized scenario
where each flow queue has infinite capacity and never drops packets. Table 4.2 lists the observed
makespans with various fairness requirements, in an experiment where each flow keeps sending packets
for 10 seconds. We see that, as expected, trading off some level of fairness leads to a shorter makespan
and higher efficiency. Furthermore, the marginal improvement of efficiency is decreasing. This suggests
that one does not need to compromise too much fairness in order to achieve high efficiency. In our
experiment, trading off 15% of fairness shortens the makespan by 15.3% from the strictly fair schedule
(α = 1), which is equivalent to an 18.1% bandwidth throughput enhancement and is near-optimal, as seen
in Table 4.2. Fig. 4.3 gives a detailed look into the achieved resource utilization over time, at four fairness levels.
6 We configured Click to run in user mode, which does not support high-speed networking. We therefore throttle the CPU cycles so as to match the processing speed with the transmission speed.
Figure 4.3: Overall resource utilization observed in Click: CPU utilization (%) and bandwidth utilization (%) over time (s), at α = 0.85, 0.90, 0.95, and 1.00. No packet drops.
Figure 4.4: Dominant share (%) each flow receives per second in Click, at fairness levels α from 50% to 100%. No packet drops. The strict fair share is 2%.
We see that strictly fair queueing (α = 1) wastes 30% of CPU cycles, leaving the bandwidth as
the bottleneck at the beginning. This situation remains until bandwidth-bound flows finish, at which
time the bottleneck shifts to CPU. By relaxing fairness, CPU-bound flows receive more services, leading
to a steady increase of CPU utilization up to 100%. Meanwhile, bandwidth-bound flows experience
slightly longer completion times due to the fairness tradeoff.
We now verify the fairness guarantee. We run the scheduler at various fairness levels. At each level,
for each flow, we measure its received dominant share every second for the first 20 seconds, during which
all flows are backlogged. Fig. 4.4 shows the results, where each cross (“x”) corresponds to the dominant
share of a flow measured in one second. As expected, under strict fairness (α = 1), all flows receive the
same dominant share (around 2%). As α decreases, the fairness requirement relaxes. Some flows are
hence favored and are allocated more dominant share, while others receive less. However, the minimum
Figure 4.5: (a) Average resource utilization (CPU and bandwidth), (b) mean dominant share of each flow, and (c) per-packet latency CDF observed in Click at different fairness levels. The queue capacity is 200 packets. The measurement of resource utilization and dominant share is conducted every second over the entire schedule.
dominant share a flow receives is lower bounded by the α-portion of the fair share, shown as the solid
line in Fig. 4.4. This shows that the algorithm is correctly operating at the desired fairness level.
We next extend the experiment to a more practical setup, where each flow queue has a limited
capacity and drops packets when it is full. We set the queue size to 200 packets for each flow and repeat
the previous experiments. In this case, comparing makespan is inappropriate as the scheduler may drop
different packets when running at different fairness levels. We instead measure the resource utilization
achieved every second over the entire scheduling period. Fig. 4.5a illustrates the average utilization of
both CPU and bandwidth, where the error bar shows one standard deviation. Similar to the previous
experiments, a fairness degradation of 15% is sufficient to achieve the optimal efficiency, enhancing the
CPU utilization from 71% to 100%. Further trading off fairness is not well justified. As shown in
Fig. 4.5b, the increased CPU throughput is mainly used to process those CPU-bound flows (Flows 41 to
60), doubling their dominant shares. Meanwhile, the dominant share received by all the other flows is
at least 85% of the fair share, as promised by the algorithm. We also depict the per-packet latency CDF
in Fig. 4.5c. We see that trading off fairness for efficiency significantly improves the tail latency, usually
caused by flows that finish the last. On the other hand, flows whose shares have been traded off see
slightly longer delays of their packets. Fortunately, these latency penalties are strictly bounded—thanks
to the fairness guarantee—and are compensated by the significant latency improvement of favored flows.
We have also measured the scheduling overhead in the experiments. In particular, we configure the
tradeoff scheduler for strict fair queueing by setting α = 1. We then compare the incurred CPU overhead
with that of MR3, a low complexity fair scheduler introduced in Sec. 3.4.3. Our measurement shows
that the tradeoff scheduler introduces 1% CPU overhead compared with MR3.
Figure 4.6: Mean per-packet latency of (a) elephant flows (sending 20,000 pkts/s) and (b) mice flows (sending 2 pkts/s) in Click, at α = 0.8, 0.9, and 1.0. The error bar shows the standard deviation.
Service Isolation
We next examine the impact of fairness tradeoff on service isolation. We initiate 6 UDP flows sending
800-byte packets. Flows 1 to 3 are elephant flows, each sending 20,000 packets per second, and undergo
the checking, monitoring, and IPsec modules, respectively. Flows 4 to 6 are mice flows, each sending 2
packets per second, and undergo the checking, monitoring, and IPsec modules, respectively. The queue
capacity is set to 200 packets. Fig. 4.6 shows the per-packet latency of each flow at different fairness
levels. We see that the tradeoff mainly affects those high-rate flows. For mice flows, even if they may
receive less resource share when α < 1, the guaranteed share is sufficient to accommodate their low-rate
traffic. As a result, their packets are scheduled almost immediately upon arrival, with two orders of
magnitude lower latency than the elephant flows.
We also compare our tradeoff scheduler against other fair queueing algorithms. In particular, we have
implemented MR3 and GMR3 as two other round-robin O(1) schedulers in Click, and conducted the
same experiments mentioned above. We find that they achieve almost the same makespan and resource
utilization as that of the tradeoff scheduler running with the strict fairness requirement (α = 1). This
should come as no surprise, as all the existing multi-resource fair queueing algorithms are essentially
different approximations to the fluid schedule with perfect DRF. These results are omitted to avoid
redundancy.
4.6.2 Trace-Driven Simulation
Next, we use trace-driven simulation to further evaluate the proposed algorithm from a macroscopic
perspective. We have written a packet-level simulator consisting of 3,000 lines of C++ code and fed it
with real-world traces [92] captured in a university switch. The traces are dominated by UDP packets.
Figure 4.7: (a) CPU utilization and (b) bandwidth utilization achieved at different fairness levels in the simulation, for CPU-bound, balanced, and bandwidth-bound traffic, averaged over 10 runs.
[Figure: four panels. Panels (a) CPU-bound, (b) balanced, and (c) bandwidth-bound show per-packet latency (ms) CDFs for α = 0.8, α = 1.0, and MR3; panel (d) shows the packet drop rate (%) of the three traffic patterns under MR3, α = 1, α = 0.9, and α = 0.8.]
Figure 4.8: The improvement of per-packet latency and packet drop rate due to the fairness tradeoff.
Based on the IP prefix and port number, we classify packets in the traces into nearly 3,000 flows and
synthesize the input traffic by randomly assigning each flow to one of three middlebox modules: basic
forwarding, statistical monitoring, and IPsec. The CPU processing time of each module follows a linear
model based on the measurement results of [4]. The flow queue size is set to 200 packets, and the
outgoing bandwidth is set to 200 Mbps. We linearly scale up the traffic by 5× to simulate a heavy
load. Depending on the total resource consumption, the synthesized traffic is classified into the following
three patterns: CPU-bound traffic where the CPU processing time exceeds 1.2× the transmission time,
bandwidth-bound traffic where the transmission time exceeds 1.2× the CPU time, and balanced traffic
otherwise.
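As a concrete illustration of this setup, the sketch below shows one way the synthesized traffic could be labeled with one of the three patterns. The per-module CPU-cost coefficients and helper names are hypothetical placeholders (the thesis uses the linear model measured in [4]); only the 200 Mbps link and the 1.2× threshold come from the text above.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// A synthesized packet: its size and the middlebox module its flow is assigned to.
struct Packet {
    enum class Module { Forwarding, Monitoring, IPsec };
    Module module;
    uint32_t bytes;
};

// Hypothetical linear CPU-cost model (microseconds = a + b * bytes); the
// coefficients below are placeholders, not the measured values of [4].
static double cpu_us(const Packet& p) {
    switch (p.module) {
        case Packet::Module::Forwarding: return 2.0 + 0.002 * p.bytes;
        case Packet::Module::Monitoring: return 3.0 + 0.006 * p.bytes;
        case Packet::Module::IPsec:      return 8.0 + 0.015 * p.bytes;
    }
    return 0.0;  // unreachable
}

// Transmission time on a 200 Mbps outgoing link, in microseconds.
static double tx_us(const Packet& p) {
    return p.bytes * 8.0 / 200.0;  // 200 Mbps = 200 bits per microsecond
}

// Classify a synthesized trace into the three traffic patterns of Sec. 4.6.2.
std::string classify(const std::vector<Packet>& trace) {
    double cpu = 0.0, tx = 0.0;
    for (const auto& p : trace) {
        cpu += cpu_us(p);
        tx  += tx_us(p);
    }
    if (cpu > 1.2 * tx) return "CPU-bound";
    if (tx > 1.2 * cpu) return "bandwidth-bound";
    return "balanced";
}
```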
Fig. 4.7 shows the mean utilization achieved at various fairness levels, where each data point is
averaged over 10 runs under the corresponding traffic pattern. The error bar shows one standard
deviation. We observe similar trends in all three patterns: trading off fairness leads to higher utilization of both resources. Similar to our Click implementation results, we see that the marginal improvement in utilization is decreasing, with little additional utilization gained by trading off more than 15% of fairness. Among all three patterns, traffic with balanced resource consumption has the least incentive to trade off fairness, as flows have complementary resource demands and can dovetail one
[Figure: two panels, (a) α = 1 and (b) α = 0.8, showing mean per-packet latency (ms) against flow size (B, from 10^2 to 10^9).]
Figure 4.9: Per-packet latency against flow sizes in the first 20 s of simulation, feeding CPU-bound traffic, with and without the fairness tradeoff.
another [4]. In this case, fair queueing is sufficient to realize high efficiency. We have also simulated the
other two multi-resource fair queueing algorithms, DRFQ [4] and MR3, and observed almost the same
performance as that of the strictly fair queueing (α = 1). We omit these results to avoid redundancy.
We examine in Fig. 4.8 the tradeoff impact on other measures relevant to efficiency. Specifically,
we depict the per-packet latency CDF of three scheduling algorithms – tradeoff with α = 0.8, complete
fairness (α = 1.0), and MR3 – in Figs. 4.8a, 4.8b, and 4.8c, for each of the three traffic patterns. In
general, the enhanced resource utilization due to the fairness tradeoff translates into shorter latencies
in all three traffic patterns. The improvements are mainly attributed to the shortened packet latency
of favored flows. Furthermore, the packet drop rates are compared in Fig. 4.8d. We observe an average
of 15% to 20% decrease in the packet drop rate under all three traffic patterns, suggesting that higher
bandwidth throughput is achieved.
Finally, we investigate the tradeoff impact on service isolation in a dynamic environment. Ideally, we
would like to see that compromising a small percentage of fairness will not affect the per-packet latency
of mice flows as their guaranteed resource share, even when traded off, is sufficient to support their low
packet rate. Fig. 4.9 confirms this isolation property with CPU-bound traffic, where we depict the mean
packet latency for each flow in the first 20 seconds of the simulation, running with complete fairness and
80% of fairness, respectively. We see that compared to strictly fair queueing, trading off fairness has
no impact on the latency of small flows, but it affects the medium and large ones: some see shorter latency while others experience longer delay, depending on whether they are favored flows or not. We have
similar observations in the other two traffic patterns.
4.7 Related Work
Ghodsi et al. [4] identified the need for multi-resource fair queueing for deep packet inspection in middleboxes. They compared a set of queueing alternatives and proposed DRFQ, the first multi-resource fair queueing algorithm, which implements DRF in the time domain. We have proposed two follow-up queue-
ing algorithms in Chapter 3 with lower scheduling complexity and bounded scheduling delay. All these
works focus solely on fairness, and use the goal of achieving work conservation as the only indication of
efficiency, similar to traditional single-resource fair queueing [21,22].
However, as we have shown in this chapter, unlike single-resource scheduling, there is a general
tradeoff between fairness and efficiency when packet processing requires multiple types of resources. We
have briefly mentioned this problem in our previous position paper [93], where we shared our vision of several possible directions that may lead to a concrete solution. We have materialized part of this vision
in this chapter by formally characterizing the tradeoff problem and proposing an implementable queueing
algorithm to achieve a flexible balance between fairness and efficiency in the two-resource setting.
While the tradeoff problem has received little attention in the fair queueing literature, striking a
balance between allocation fairness and efficiency has been a focus of many recent works in both net-
working and operations research. Specifically, Danna et al. [94] have presented an efficient bandwidth
allocation algorithm to achieve a flexible tradeoff between fairness and throughput for traffic engineer-
ing. Joe-Wong et al. [13] have proposed a unifying framework with fairness and efficiency requirements
specified by two parameters for a given multi-resource allocation problem. Discussions on the tradeoff
between fairness and performance have also been given in the context of P2P networks [95, 96]. In the
literature of operations research, Bertsimas et al. [97] have derived a tight bound to characterize the
efficiency loss under proportional fairness and max-min fairness, respectively. They have later developed
a more general framework to characterize the fairness-efficiency tradeoff in a family of “α-fair” welfare
functions [98]. All these works focus on one-shot resource allocation in the space domain. In contrast,
our focus in this chapter is a packet scheduling problem where resources are shared in the time domain.
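For reference, the family of "α-fair" welfare functions studied in [98] is usually written in the standard form below (note that this α from the fairness literature is a different parameter from the fairness knob α used in this chapter):
\[
W_\alpha(x_1, \dots, x_n) =
\begin{cases}
\displaystyle \sum_{i=1}^{n} \frac{x_i^{\,1-\alpha}}{1-\alpha}, & \alpha \ge 0,\ \alpha \ne 1,\\[1.5ex]
\displaystyle \sum_{i=1}^{n} \log x_i, & \alpha = 1,
\end{cases}
\]
so that α = 0 corresponds to throughput maximization, α = 1 to proportional fairness, and α → ∞ to max-min fairness.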
As explained in Sec. 4.2.4, our tradeoff problem captures the flow shop problem [83–86, 88, 89, 91,
99, 100] as a special case when efficiency is the only concern. Our approach borrows the idea of WFQ
[21, 22, 26], in relaxing a discrete packet flow to an idealized fluid and tracking the fluid schedule based
on virtual time. However, for single-resource fair queueing, the main challenge is to design a packet-
by-packet tracking algorithm, because the fluid schedule, GPS [21, 22, 26], is fairly straightforward and
easy to compute. Our problem is more complex, requiring more careful modeling of the fluid schedule,
packet-by-packet tracking with multiple resources, and tradeoff analysis.
4.8 Summary and Future Work
Middleboxes perform complex network functions whose packet processing requires the support of multiple
types of hardware resources. A multi-resource packet scheduling algorithm is therefore needed. Unlike
traditional single-resource fair queueing, where bandwidth is the only concern, there exists a general
tradeoff between fairness and efficiency in the presence of multiple resources. Ideally, we would like
to achieve flexible tradeoff to meet QoS requirements while maintaining the system at a high resource
utilization level. We show the difficulty of the general problem and limit our discussion to a common
scenario where CPU and link bandwidth are the two resources required for packet processing. We
propose an efficient scheduling algorithm by tracking an idealized fluid schedule. We show through both
our Click implementation and trace-driven simulation that our algorithm achieves a flexible tradeoff
between fairness and efficiency in various scenarios.
Despite the initial progress made by this dissertation, many challenges remain open. First, even with our optimizations, the current implementation requires O(log n) time to schedule a packet, which may be a concern given a large number of flows. A simpler scheduler with lower complexity may be desired. Given the extensive techniques developed for low-complexity schedulers in the fair queueing literature, it would
be interesting to see if and how these techniques extend to the multi-resource setting. Also, as shown by
Theorem 11, in general, the more types of resources a system has, the more salient the fairness-efficiency
tradeoff would be. For a system with more than two types of resources, we believe the intuition and
technique developed in this chapter may still be applied. We could define a similar fluid schedule that
maximizes the dominant throughput under the specified fairness constraint, and use start-time tracking
to implement it in practice.
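To make this direction slightly more concrete, one natural candidate (a sketch only, under the assumption that the two-resource formulation of this chapter carries over simply by adding one capacity constraint per resource) is a fluid schedule that, at each scheduling instant, chooses head-of-line dominant processing rates $d = (d_1, \dots, d_n)$ solving
\[
\max_{d \ge 0} \ \sum_i d_i
\quad \text{subject to} \quad
\sum_i \tau_{i,r}\, d_i \le \mu_r, \qquad r = 1, \dots, m,
\]
where $\tau_{i,r}$ is the normalized processing time of flow $i$'s head-of-line packet on resource $r$ and the caps $\mu_r$ encode the fairness level $\alpha$, as in the two-resource problem (4.21). Whether start-time tracking of this fluid remains near-optimal as $m$ grows is precisely the open question.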
4.9 Proofs
4.9.1 Proof of Lemma 11
Suppose there are $n$ backlogged flows at time $t$, where flow $i$ has head-of-line packet $p_i$. Let $d^* = (d^*_1, \dots, d^*_n)$ be the optimal solution to (4.17). It is equivalent to show that $d^*$ leads to tight constraints of (4.17) if the stated condition is met. Let
\[
A = \{p_i \mid \tau_{i,1} = 1 > \tau_{i,2}\}
\]
be the set of head-of-line packets whose dominant resource is resource 1, and let
\[
B = \{p_i \mid \tau_{i,1} < \tau_{i,2} = 1\}
\]
be the set of head-of-line packets whose dominant resource is resource 2. We know that $A, B \neq \emptyset$ according to the stated condition.

We first claim that for the optimal solution $d^*$, there exist $p_j \in A$ and $p_l \in B$ such that $d^*_j > 0$ and $d^*_l > 0$. To prove this claim, let us assume the opposite and see what happens. Without loss of generality, suppose that for all $d^*_i > 0$, we have either $p_i \in A$ or $p_i \notin A \cup B$ (i.e., $\tau_{i,1} = \tau_{i,2} = 1$). This implies $\sum_i d^*_i = 1$. We show a contradiction by constructing a feasible allocation that leads to a dominant throughput higher than 1. Consider two packets $p_j \in A$ and $p_l \in B$. We construct the following allocation:
\[
d_i =
\begin{cases}
\min\{1 - \tau_{l,1}/2,\; 1/(2\tau_{j,2})\}, & i = j,\\
1/2, & i = l,\\
0, & \text{otherwise.}
\end{cases}
\]
To see that the allocation is feasible, we substitute $d_i$ into the constraints of (4.17) and have
\[
\sum_i \tau_{i,1} d_i = d_j + \tau_{l,1} d_l = \min\Big\{1 - \frac{\tau_{l,1}}{2},\; \frac{1}{2\tau_{j,2}}\Big\} + \frac{\tau_{l,1}}{2} \le 1
\]
for resource 1 and
\[
\sum_i \tau_{i,2} d_i = \tau_{j,2} d_j + d_l \le 1
\]
for resource 2. To see that the allocation leads to a dominant throughput higher than 1, we first have
\[
d_j = \min\Big\{1 - \frac{\tau_{l,1}}{2},\; \frac{1}{2\tau_{j,2}}\Big\} > \frac{1}{2}
\]
by noting that $\tau_{j,2}, \tau_{l,1} < 1$ (because $p_j \in A$ and $p_l \in B$). We then have
\[
\sum_i d_i = d_j + d_l > 1 = \sum_i d^*_i,
\]
contradicting the fact that $d^*$ optimally solves (4.17).
We are now ready to prove the statement of the lemma. We assume the opposite, namely that at least one resource is not fully utilized under allocation $d^*$, hoping to show a contradiction. Since (4.17) is a linear program, one constraint must be tight for the optimal solution. Without loss of generality, we assume that only resource 1 is fully utilized under allocation $d^*$. That is, for resource 1 we have
\[
\sum_i \tau_{i,1} d^*_i = 1,
\]
but for resource 2 we have
\[
\sum_i \tau_{i,2} d^*_i = 1 - \Delta < 1
\]
for some $\Delta \in (0, 1)$. We construct an allocation $d$ with the same dominant throughput as that of $d^*$, but which does not fully utilize either of the two resources. This suggests that a higher dominant throughput can be achieved, leading to a contradiction.

By our previous claim, there exist $p_j \in A$ and $p_l \in B$ such that $d^*_j > 0$ and $d^*_l > 0$. We construct the following allocation:
\[
d_i =
\begin{cases}
d^*_j - \delta, & i = j,\\
d^*_l + \delta, & i = l,\\
d^*_i, & \text{otherwise,}
\end{cases}
\]
where $\delta = \min\{d^*_j,\; \Delta / (2(1 - \tau_{j,2}))\}$. Clearly, the constructed allocation $d$ leads to the same dominant throughput as that of $d^*$, i.e., $\sum_i d_i = \sum_i d^*_i$, but does not fully utilize either of the two resources. In particular, we have
\[
\sum_i \tau_{i,1} d_i = \sum_i \tau_{i,1} d^*_i - (\tau_{j,1} - \tau_{l,1})\delta = 1 - (1 - \tau_{l,1})\delta < 1
\]
for resource 1 and
\[
\sum_i \tau_{i,2} d_i = \sum_i \tau_{i,2} d^*_i + (\tau_{l,2} - \tau_{j,2})\delta = 1 - \Delta + (1 - \tau_{j,2})\delta \le 1 - \Delta/2 < 1
\]
for resource 2. This suggests that a strictly higher dominant throughput can be achieved (by increasing some $d_i$ by a sufficiently small $\varepsilon > 0$). We hence have a contradiction. □
4.9.2 Proof of Lemma 12
It suffices to consider the following two cases.
Case 1: All flows have the same dominant resource. Without loss of generality, assume that resource
1 is the dominant resource of all flows. In this case, the constraint of resource 2 in (4.17) is ineffective,
and the constraint of resource 1 is tight under the optimal solution of (4.17). Resource 1 is hence fully
utilized until the end of the schedule. The makespan of the schedule is equal to the total amount of time
required to process all packets on resource 1, and is the minimum.
Case 2: There are two flows with different dominant resources. By Lemma 11, the fluid schedule fully utilizes both resources until some flows complete service and all the remaining backlogged flows have the same dominant resource, say, resource 1. From then on, the fluid schedule fully utilizes resource 1 until the end of the schedule. Therefore, resource 1 is fully utilized throughout the entire schedule. The makespan is equal to the total amount of time required to process all packets on resource 1, and is the minimum. □
4.9.3 Proof of Theorem 12
Let $t_0$ be the first time when the system has two flows with different dominant resources. A flow is called an early flow if it arrives before $t_0$. All early flows have the same dominant resource. Without loss of generality, let resource 1 be their dominant resource. Flows that arrive at or after $t_0$ are called late flows. Let $W_r$ be the total amount of time required to process all packets on resource $r$. For an arbitrary schedule $\sigma$, let $P^\sigma_r(t)$ be the total amount of time that resource $r$ is busy in $(0, t)$. The makespan of the schedule $\sigma$ is lower bounded as follows:
\[
T^\sigma \ge t_0 + \max_{r=1,2} \{W_r - P^\sigma_r(t_0)\}, \tag{4.32}
\]
where $W_r - P^\sigma_r(t_0)$ is the remaining amount of packet processing time required on resource $r$.

Now for the fluid schedule $\rho$, since all the early flows have resource 1 as their dominant resource, any vector $d^*$ with $\sum_i d^*_i = 1$ optimally solves (4.17). By the fluid schedule model, this implies that resource 1 is fully utilized while the utilization of resource 2 is also maximized, at all times before $t_0$. Therefore, we have
\[
P^\rho_1(t_0) = t_0, \qquad P^\rho_2(t_0) = \max_{\sigma'} P^{\sigma'}_2(t_0). \tag{4.33}
\]
Let $t_1$ be the last flow arrival time after $t_0$. (If no flow arrives after $t_0$, let $t_1 = t_0$.) One can always find two backlogged flows with different dominant resources in $(t_0, t_1)$. By Lemma 11, both resources are fully utilized in $(t_0, t_1)$, i.e.,
\[
P^\rho_r(t_1) - P^\rho_r(t_0) = t_1 - t_0. \tag{4.34}
\]
After $t_1$, since there are no flow arrivals, no newly arrived packet can become the head-of-line packet. Therefore, in the fluid schedule, packets are scheduled as if they were all available at $t_1$. By Lemma 12, the schedule fully utilizes one resource until the end of the makespan, i.e.,
\[
T^\rho - t_1 = \max_{r=1,2} \{W_r - P^\rho_r(t_1)\}. \tag{4.35}
\]
Plugging (4.33) and (4.34) into (4.35), we have
\[
T^\rho = t_0 + \max_{r=1,2} \{W_r - P^\rho_r(t_0)\} = t_0 + \max\bigl\{W_1 - t_0,\; W_2 - \max_{\sigma'} P^{\sigma'}_2(t_0)\bigr\}. \tag{4.36}
\]
Now for any schedule $\sigma$, we have
\[
P^\sigma_1(t_0) \le t_0 = P^\rho_1(t_0), \qquad P^\sigma_2(t_0) \le \max_{\sigma'} P^{\sigma'}_2(t_0) = P^\rho_2(t_0). \tag{4.37}
\]
Substituting (4.37) into (4.36), we have
\[
T^\rho \le t_0 + \max_{r=1,2} \{W_r - P^\sigma_r(t_0)\} \le T^\sigma, \tag{4.38}
\]
where the last inequality follows from (4.32). Since (4.38) holds for any schedule $\sigma$, we see that the fluid schedule $\rho$ is optimal. □
4.9.4 Proof of Theorem 13
We first show that, given any optimal solution $d^*$ with $d^*_j > 0$ for some $1 < j < n$, we can convert it into another optimal solution $d$ with $d_j = 0$ and $d_i = d^*_i$ for all $i \ne 1, j, n$. It is easy to check that there exists some $k$ such that
\[
1 = \tau_{1,1} = \cdots = \tau_{k,1} > \tau_{k+1,1} \ge \cdots \ge \tau_{n,1},
\]
\[
\tau_{1,2} \le \cdots \le \tau_{k,2} \le \tau_{k+1,2} = \cdots = \tau_{n,2} = 1.
\]
In particular, if $j \le k$, we let
\[
d_i =
\begin{cases}
d^*_1 + d^*_j, & i = 1,\\
0, & i = j,\\
d^*_i, & \text{otherwise.}
\end{cases}
\]
We show that $d$ is a feasible solution to (4.21). Consider the constraint of resource 1 in (4.21). Because $\tau_{1,1} = \tau_{j,1} = 1$ for $j \le k$, we have
\[
\sum_i \tau_{i,1} d_i = \sum_{i \ne j} \tau_{i,1} d^*_i + \tau_{1,1} d^*_j = \sum_i \tau_{i,1} d^*_i \le \mu_1.
\]
For the constraint of resource 2 in (4.21), because $\tau_{1,2} \le \tau_{j,2}$ when $j \le k$, we have
\[
\sum_i \tau_{i,2} d_i = \sum_{i \ne j} \tau_{i,2} d^*_i + \tau_{1,2} d^*_j \le \sum_i \tau_{i,2} d^*_i \le \mu_2.
\]
Also noting that $\sum_i d_i = \sum_i d^*_i$, we see that $d$ is an optimal solution to (4.21).

If $j > k$, we let
\[
d_i =
\begin{cases}
0, & i = j,\\
d^*_n + d^*_j, & i = n,\\
d^*_i, & \text{otherwise.}
\end{cases}
\]
With similar arguments, we see that $d$ optimally solves (4.21).

Repeatedly applying the approach above to all the non-zero components of an optimal solution, except the first and the last, we see that the statement holds. Because only $d_1, d_n > 0$, problem (4.21) reduces to a simple linear program with two variables and two constraints. An exhaustive case analysis leads to the results in all three cases. □
4.9.5 Preliminaries for the Proofs of Theorems 14 and 15
Because the discrete schedule tracks the fluid schedule based on the packet start times, the packet scheduling orders of both schedules are the same. Let the order be $p_1, \dots, p_N$, where $p_i$ is the $i$th packet scheduled. For the discrete schedule, let $s^D_{i,r}$ be the start time of packet $p_i$ on resource $r$, and $f^D_{i,r}$ the finish time of $p_i$ on resource $r$. For the fluid schedule, since each packet starts (completes) processing at the same time on both resources, let $s^F_i$ and $f^F_i$ be the start and finish times of packet $p_i$, respectively. Let $\tau_r(p_i)$ be the packet processing time required by $p_i$ on resource $r$, and $\tau_{\max} = \max_{i,r} \tau_r(p_i)$ the maximum packet processing time required by any packet on any resource. The following lemma holds for any discrete schedule.
Lemma 13. For any packet $p_i$, there exists some pivotal packet $p_k$, $1 \le k \le i$, such that the processing of $p_k$ on resource 2 follows immediately after its processing on resource 1 completes, and the processing of $p_k, \dots, p_i$ is continuous on resource 2, i.e.,
\[
f^D_{i,2} = f^D_{k,1} + \sum_{j=k}^{i} \tau_2(p_j). \tag{4.39}
\]
Proof: For the pivotal packet $p_k$, the processing on resource 2 starts immediately after its processing on resource 1 completes, i.e.,
\[
f^D_{k,1} = s^D_{k,2}. \tag{4.40}
\]
To find the pivotal packet, we search from packet $p_i$ and check if it satisfies (4.40). If it does, then $p_i$ is the pivotal packet and the search stops. Otherwise, the processing of packet $p_i$ on resource 2 is delayed for a certain amount of time after its processing completes on resource 1. This implies that the processing of $p_i$ on resource 2 starts immediately after its previous packet $p_{i-1}$ completes processing on resource 2:
\[
f^D_{i,1} < s^D_{i,2} = f^D_{i-1,2}. \tag{4.41}
\]
We continue the search at $p_{i-1}$ and check if it satisfies (4.40). If it does, then $p_{i-1}$ is the pivotal packet, because the processing of $p_{i-1}$ and $p_i$ is continuous on resource 2, as suggested by (4.41), and the search stops. Otherwise, we must have
\[
f^D_{i-1,1} < s^D_{i-1,2} = f^D_{i-2,2}, \tag{4.42}
\]
for the same reason as (4.41). We then continue the search at $p_{i-2}$. Note that the search is guaranteed to stop at packet $p_1$, because it is the first packet scheduled and there is no delay between its processing on the two resources. In that case, $p_1$ is the pivotal packet. □
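The backward search used in this proof is constructive and can be read as a tiny procedure. The sketch below, with hypothetical array names, is only an illustration of that search given the resource-1 finish times and resource-2 start times of a discrete schedule; it is not part of the thesis prototype.

```cpp
#include <cstddef>
#include <vector>

// Backward search for the pivotal packet of packet i (0-indexed), following
// the proof of Lemma 13. f1[k] is p_k's finish time on resource 1 and s2[k]
// its start time on resource 2, with s2[k] >= f1[k] by construction.
std::size_t pivotal_packet(const std::vector<double>& f1,
                           const std::vector<double>& s2,
                           std::size_t i) {
    std::size_t k = i;
    // If p_k's processing on resource 2 was delayed (started strictly after its
    // resource-1 finish), it ran back-to-back with p_{k-1} on resource 2, so we
    // keep walking backwards. The first packet (k == 0) always stops the search.
    while (k > 0 && s2[k] > f1[k]) {
        --k;
    }
    return k;  // exact time comparison is fine in this idealized model
}
```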
Lemma 14. For any fluid schedule with $\alpha > 0$ and its corresponding discrete schedule, we have
\[
s^D_{i,1} \le s^F_i + (n - 1)\tau_{\max}, \quad \text{for all } i, \tag{4.43}
\]
where $n$ is the maximum number of concurrent flows in the fluid schedule.

Proof: For any packet $p_i$, let $p_j$ be the earliest-scheduled packet such that $p_j, p_{j+1}, \dots, p_i$ are continuously processed on resource 1 in the discrete schedule, i.e.,
\[
j = \min\bigl\{\, j' \le i \;:\; s^D_{l,1} = f^D_{l-1,1} \ \text{for all } j' < l \le i \,\bigr\}.
\]
The reason that $p_j$ is not processed right after its previous packet completes processing on resource 1 is that it has not yet started processing in the corresponding fluid schedule. Packet $p_j$ is hence scheduled in the discrete schedule immediately upon its start in the fluid counterpart, i.e.,
\[
s^D_{j,1} = s^F_j. \tag{4.44}
\]
Now for the fluid schedule, consider the time interval $[s^F_j, s^F_i)$, during which packets $p_j, \dots, p_{i-1}$ start processing. When packet $p_i$ starts processing at time $s^F_i$, there are at most $n - 1$ other packets that have not yet completed processing, all of which must have started earlier than $p_i$. Therefore, the length of the resource-1 busy period of the fluid schedule in $[s^F_j, s^F_i)$ is at least
\[
\sum_{l=j}^{i-1} \tau_1(p_l) - (n - 1)\tau_{\max} \le s^F_i - s^F_j = s^F_i - s^D_{j,1}, \tag{4.45}
\]
where the equality is derived from (4.44). We rewrite (4.45) as
\[
s^F_i \ge s^D_{j,1} + \sum_{l=j}^{i-1} \tau_1(p_l) - (n - 1)\tau_{\max} = s^D_{i,1} - (n - 1)\tau_{\max}, \tag{4.46}
\]
where the equality holds because packets $p_j, \dots, p_i$ are processed continuously on resource 1 in the discrete schedule. □
With Lemma 13 and Lemma 14, we give proofs of Theorem 14 and Theorem 15 in the following two
subsections.
4.9.6 Proof of Theorem 14
By Lemma 13, for the discrete schedule there exists a pivotal packet $p_k$ such that
\[
f^D_{N,2} = f^D_{k,1} + \sum_{i=k}^{N} \tau_2(p_i) = s^D_{k,1} + \tau_1(p_k) + \sum_{i=k}^{N} \tau_2(p_i). \tag{4.47}
\]
Since packets $p_k, \dots, p_N$ are all processed after $s^F_k$ in the corresponding fluid schedule, we have
\[
\sum_{i=k}^{N} \tau_2(p_i) \le T^F - s^F_k. \tag{4.48}
\]
On the other hand, Lemma 14 suggests that
\[
s^D_{k,1} \le s^F_k + (n - 1)\tau_{\max}. \tag{4.49}
\]
Plugging (4.48) and (4.49) into (4.47), we have
\[
f^D_{N,2} \le T^F + (n - 1)\tau_{\max} + \tau_1(p_k) \le T^F + n\tau_{\max}.
\]
By noting that $f^D_{N,2} = T^D$, we see that the statement holds. □
4.9.7 Proof of Theorem 15
Without loss of generality, we limit the discussion to a busy period of the fluid schedule. We first prove the lower bound of $D^D_i(0, t)$. This requires the following lemma.

Lemma 15. For any fluid schedule with $\alpha > 0$ and its corresponding discrete schedule, we have
\[
f^D_{i,2} \le f^F_i + (2n - 1)\tau_{\max}, \quad \text{for all } i.
\]
Proof: For any packet $p_i$, by Lemma 13, there exists a pivotal packet $p_k$ such that
\[
f^D_{i,2} = f^D_{k,1} + \sum_{j=k}^{i} \tau_2(p_j) = \tau_1(p_k) + s^D_{k,1} + \sum_{j=k}^{i} \tau_2(p_j). \tag{4.50}
\]
For the fluid schedule, consider the time interval $[s^F_k, f^F_i)$, during which packets $p_k, \dots, p_i$ start processing. Right before packet $p_i$ completes processing at time $f^F_i$, there are at most $n - 1$ other packets processed in parallel, all of which start earlier than $p_i$. Therefore, the workload processed by the fluid schedule during $[s^F_k, f^F_i)$ is at least
\[
\sum_{l=k}^{i} \tau_2(p_l) - (n - 1)\tau_{\max} \le f^F_i - s^F_k.
\]
We then have
\begin{align*}
f^F_i &\ge s^F_k + \sum_{l=k}^{i} \tau_2(p_l) - (n - 1)\tau_{\max}\\
&\ge s^D_{k,1} + \sum_{l=k}^{i} \tau_2(p_l) - 2(n - 1)\tau_{\max} && \text{(by Lemma 14)}\\
&= f^D_{i,2} - \tau_1(p_k) - 2(n - 1)\tau_{\max} && \text{(by (4.50))}\\
&\ge f^D_{i,2} - (2n - 1)\tau_{\max}. && \Box
\end{align*}
We are now ready to establish the lower bound of $D^D_i(0, t)$.

Lemma 16. For any fluid schedule with $\alpha > 0$ and its corresponding discrete schedule, at any time $t$, we have
\[
D^D_i(0, t) \ge D^F_i(0, t) - 2(n - 1)\tau_{\max}, \quad \text{for all } i.
\]
Proof: For any flow $i$, because $0 \le \frac{d}{dt} D^F_i(0, t) \le 1$ and $\frac{d}{dt} D^D_i(0, t) \in \{0, 1\}$, the difference $D^F_i(0, t) - D^D_i(0, t)$ reaches its maximum at some time $t$ when a packet $p$ of flow $i$ starts processing on its dominant resource in the discrete schedule. Packet $p$ completes its dominant processing at time $t + \tau^*$, where $\tau^*$ is the packet processing time required by $p$ on its dominant resource. Let $f^D$ be the time when packet $p$ completes processing (on resource 2) in the discrete schedule. We have
\[
f^D \ge t + \tau^*. \tag{4.51}
\]
Let $f^F$ be the time when packet $p$ completes processing in the fluid schedule. We have
\[
D^F_i(0, f^F) = D^D_i(0, t + \tau^*) = D^D_i(0, t) + \tau^*. \tag{4.52}
\]
By Lemma 15 and (4.51), we have
\[
f^F \ge f^D - (2n - 1)\tau_{\max} \ge t + \tau^* - (2n - 1)\tau_{\max}.
\]
This immediately suggests that
\[
D^F_i(0, t + \tau^* - (2n - 1)\tau_{\max}) \le D^F_i(0, f^F) = D^D_i(0, t) + \tau^*. \tag{4.53}
\]
On the other hand, we have
\[
D^F_i(0, t + \tau^* - (2n - 1)\tau_{\max}) \ge D^F_i(0, t) + \tau^* - (2n - 1)\tau_{\max}. \tag{4.54}
\]
Combining (4.53) and (4.54), we see that the statement holds. □
We next prove the upper bound of $D^D_i(0, t)$ using the following lemma, hence completing the proof of Theorem 15.

Lemma 17. For any fluid schedule with $\alpha > 0$ and its corresponding discrete schedule, at any time $t$, we have
\[
D^D_i(0, t) \le D^F_i(0, t) + \tau_{\max}, \quad \text{for all } i.
\]
Proof: Let $p_{(i,k)}$ be the $k$th packet of flow $i$. Let $s^D_{(i,k)}$ be the time when $p_{(i,k)}$ starts processing (on resource 1) in the discrete schedule, and $f^D_{(i,k)}$ the time when $p_{(i,k)}$ completes processing (on resource 2) in the discrete schedule. Let $s^F_{(i,k)}$ and $f^F_{(i,k)}$ be similarly defined for the fluid schedule. We note the following two facts. First, a packet does not start processing in the discrete schedule until it starts in the fluid schedule, i.e.,
\[
s^D_{(i,k)} \ge s^F_{(i,k)}, \quad \text{for all } p_{(i,k)}. \tag{4.55}
\]
Second, because the packets of a flow are processed in sequence in both the fluid schedule and the discrete schedule, the following must hold for all $p_{(i,k)}$:
\[
D^D_i(0, s^D_{(i,k)}) = D^F_i(0, s^F_{(i,k)}). \tag{4.56}
\]
Without loss of generality, we assume $s^F_{(i,k)} \le t \le s^F_{(i,k+1)}$ for some $k$. It suffices to consider the following two cases.

Case 1: $s^D_{(i,k)} \ge t$. In this case, we have
\[
D^D_i(0, t) \le D^D_i(0, s^D_{(i,k)}) = D^F_i(0, s^F_{(i,k)}) \le D^F_i(0, t).
\]

Case 2: $s^D_{(i,k)} < t$. Let $\tau^*_{(i,k)}$ be the packet processing time required by $p_{(i,k)}$ on its dominant resource. Because the dominant service flow $i$ receives in $[s^D_{(i,k)}, t)$ in the discrete schedule is limited by both the dominant processing time of $p_{(i,k)}$ and the processing rate, we have
\[
D^D_i(s^D_{(i,k)}, t) \le \min\{\tau^*_{(i,k)},\; t - s^D_{(i,k)}\} \le \min\{\tau^*_{(i,k)},\; t - s^F_{(i,k)}\}, \tag{4.57}
\]
where the second inequality holds because of (4.55). Now consider the dominant service flow $i$ receives in the fluid schedule in $[s^F_{(i,k)}, t)$. Let $\beta$ be the minimum dominant processing rate flow $i$ receives in $[s^F_{(i,k)}, t)$. We have $0 < \beta \le 1$ and
\[
D^F_i(s^F_{(i,k)}, t) \ge \min\{\tau^*_{(i,k)},\; (t - s^F_{(i,k)})\beta\}. \tag{4.58}
\]
By (4.56), (4.57), and (4.58), we have
\begin{align*}
D^D_i(0, t) - D^F_i(0, t)
&= D^D_i(0, s^D_{(i,k)}) + D^D_i(s^D_{(i,k)}, t) - D^F_i(0, s^F_{(i,k)}) - D^F_i(s^F_{(i,k)}, t)\\
&= D^D_i(s^D_{(i,k)}, t) - D^F_i(s^F_{(i,k)}, t)\\
&\le \min\{\tau^*_{(i,k)},\; t - s^F_{(i,k)}\} - \min\{\tau^*_{(i,k)},\; (t - s^F_{(i,k)})\beta\}\\
&=
\begin{cases}
(t - s^F_{(i,k)})(1 - \beta), & t - s^F_{(i,k)} < \tau^*_{(i,k)},\\
\tau^*_{(i,k)} - (t - s^F_{(i,k)})\beta, & (t - s^F_{(i,k)})\beta < \tau^*_{(i,k)} \le t - s^F_{(i,k)},\\
0, & \tau^*_{(i,k)} \le (t - s^F_{(i,k)})\beta.
\end{cases}
\end{align*}
It is easy to check that in each of the cases above, we have
\[
D^D_i(0, t) - D^F_i(0, t) \le \tau^*_{(i,k)} \le \tau_{\max}. \qquad \Box
\]
Chapter 5
Concluding Remarks
5.1 Conclusions
This dissertation studies several fundamental fair sharing problems with multiple resource types in cloud
datacenters, and makes contributions in both algorithmic design and prototype implementation.
We started by studying the multi-resource allocation problem in cloud computing systems where the
resource pool is constructed from a large number of heterogeneous servers, representing different points in
the configuration space of resources such as processing, memory, and storage. We have designed a multi-
resource sharing policy, called DRFH, that generalizes the notion of Dominant Resource Fairness (DRF)
from a single server to multiple heterogeneous servers. DRFH provides a number of highly desirable
“fair” properties. In particular, with DRFH, no user prefers the allocation of another user; no one can
improve its allocation without decreasing that of the others; and more importantly, no coalition behavior
of misreporting resource demands can benefit all its members. DRFH also ensures some level of service
isolation among the users. We have prototyped DRFH as a pluggable resource allocator in Apache Mesos,
an open-source cluster management system widely adopted by industry. Experimental studies show that
our implementation can lead to accurate DRFH allocation at all times. Large-scale simulations driven
by Google cluster traces show that DRFH significantly outperforms the traditional slot-based scheduler,
leading to much higher resource utilization with substantially shorter job completion times.
Multi-resource fair sharing is not only a fundamental system design problem for large computer
clusters, but also a new challenge for middleboxes, software routers, and other appliances that are
widely deployed in datacenters. These appliances perform a wide range of important network functions,
including WAN optimizations, intrusion detection systems, network and application level firewalls, etc.
Depending on the underlying applications, packet processing for different traffic flows may consume vastly
different amounts of hardware resources (e.g., CPU and link bandwidth). Multi-resource fair queueing
allows each traffic flow to receive a fair share of multiple middlebox resources. Previous schemes for
multi-resource fair queueing, however, are expensive to implement at high speeds. Specifically, the time
complexity to schedule a packet is O(log n), where n is the number of backlogged flows. We have designed
Multi-Resource Round Robin (MR3), a new fair queueing scheme for unweighted flows that schedules
packets in a manner similar to Elastic Round Robin. MR3 requires only O(1) work to schedule a packet
and is simple enough to implement in practice. It serves as a foundation for a more generalized fair
scheduler, called Group Multi-Resource Round Robin (GMR3), that also runs in O(1) time, yet provides
weight-proportional delay bounds for flows with uneven weights. We have shown, both analytically and experimentally, that both MR3 and GMR3 can achieve nearly perfect Dominant Resource Fairness.
We have also identified a new challenge that is unique to multi-resource scheduling. Unlike tradi-
tional fair queueing where bandwidth is the only concern, we have shown in Chapter 4 that fairness and
efficiency are conflicting objectives that cannot be achieved simultaneously in the presence of multiple
resource types. Ideally, a scheduling algorithm should allow network operators to flexibly specify their
fairness and efficiency requirements, so as to meet the Quality of Service demands while keeping the sys-
tem at a high utilization level. Yet, existing multi-resource scheduling algorithms focus on fairness only,
and may lead to poor resource utilization. Driven by this problem, we have proposed a new scheduling
algorithm to achieve a flexible tradeoff between fairness and efficiency for packet processing, consuming
both CPU and link bandwidth. Experimental results based on both real-world implementation and
trace-driven simulation suggested that trading off a modest level of fairness can potentially improve the
efficiency to the point where the system capacity is almost saturated.
5.2 Future Directions
Our work on resource scheduling in cloud systems is by no means complete. Large-scale datacenter
and analytic systems are undergoing fundamental shifts, from application-specific to general-purpose,
from batch processing to interactive analysis, from a small number of long-lasting tasks with low fan-
out to a large number of short-lived tasks that are highly parallelized. All these shifts pose significant
scheduling challenges on the system availability, scalability, and responsiveness. These challenges provide
rich research problems requiring both analytical studies and new implementations.
Availability. Production computing tasks typically have placement requirements, some of which
are hard constraints that must be enforced (e.g., some tasks can only run on nodes with public IPs),
while others are preferences that could be violated at the expense of performance degradation (e.g.,
data locality). In addition to placement requirements, tasks may have dynamic resource demands. For
example, a Spark task is usually CPU-intensive in the mapping stage, but will soon become bandwidth-
bound in the reducing stage. All these complexities make resource availability a severe problem, giving
rise to many delicate issues that must be carefully addressed. In particular, how should tasks be scheduled
under complex placement requirements while ensuring basic fairness and high efficiency? How should the
system quickly respond to these resource dynamics, and make task scheduling adjustment accordingly?
These questions are arising in Hadoop, Mesos, and Spark deployments, and require both new system
abstractions and analytical work.
Scalability. Driven by demand for lower-latency interactive data analytics, large-scale clusters are
shifting towards shorter tasks with higher degrees of parallelism. The foreseeable trend of adopting sub-
second tasks in new-generation analytic frameworks (e.g., Dremel, Spark, Impala) would further increase
the number of tasks by two orders of magnitude. Scheduling such a large number of tasks in a short
period of time would easily congest the centralized scheduler. Moreover, having much higher degrees of
parallelism will further aggravate the already severe Incast problem in clusters. Some promising ideas
are to parallelize the computation of task scheduling via distributed schedulers and to reduce traffic load
in the reduce stage via in-network aggregation. However, these ideas are far from maturity and require
further analytical investigations, engineering innovations, and implementation efforts.
Responsiveness. The demand for lower-latency interactive data analytics requires large-scale clus-
ters to provide fluid responsiveness, which naturally translates into a short tail of latency distribution.
While many techniques have been proposed to reduce the latency tail, most of them are based on con-
ventional parallel computing frameworks such as MapReduce. It remains open to see if and how these
techniques extend to new-generation data analytic frameworks that support in-memory computing and
data streaming (e.g., Spark and Shark). It is also unclear how these techniques can be implemented
in distributed scheduling given the scalability problem suffered by centralized scheduling. Answering
these questions requires new insights derived from extensive measurement studies in the production
data analytic clusters, based on which new techniques can be proposed and implemented.
Bibliography
[1] W. Wang, D. Niu, B. Li, and B. Liang, “Dynamic cloud resource reservation via cloud brokerage,”
in Proc. IEEE ICDCS, 2013.
[2] C. Reiss, A. Tumanov, G. Ganger, R. Katz, and M. Kozuch, “Heterogeneity and dynamicity of
clouds at scale: Google trace analysis,” in Proc. ACM SoCC, 2012.
[3] C. Reiss, J. Wilkes, and J. L. Hellerstein, “Google Cluster-Usage Traces,” http://code.google.com/
p/googleclusterdata/.
[4] A. Ghodsi, V. Sekar, M. Zaharia, and I. Stoica, “Multi-resource fair queueing for packet process-
ing,” in Proc. ACM SIGCOMM, 2012.
[5] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson,
A. Rabkin, I. Stoica, and M. Zaharia, “A view of cloud computing,” Commun. ACM, vol. 53,
no. 4, pp. 50–58, 2010.
[6] B. Farley, A. Juels, V. Varadarajan, T. Ristenpart, K. D. Bowers, and M. M. Swift, “More for
your money: Exploiting performance heterogeneity in public clouds,” in Proc. ACM SoCC, 2012.
[7] Z. Ou, H. Zhuang, A. Lukyanenko, J. Nurminen, P. Hui, V. Mazalov, and A. Yla-Jaaski, “Is the
same instance type created equal? exploiting heterogeneity of public clouds,” IEEE Trans. Cloud
Computing, vol. 1, no. 2, pp. 201–214, 2013.
[8] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. Vijaykumar, “Tarazu: Optimizing mapre-
duce on heterogeneous clusters,” in Proc. ACM ASPLOS, 2012.
[9] R. Nathuji, C. Isci, and E. Gorbatov, “Exploiting platform heterogeneity for power efficient data
centers,” in Proc. USENIX ICAC, 2007.
[10] “Apache Hadoop,” http://hadoop.apache.org.
[11] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs
from sequential building blocks,” in Proc. EuroSys, 2007.
[12] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, “Dominant resource
fairness: Fair allocation of multiple resource types,” in Proc. USENIX NSDI, 2011.
[13] C. Joe-Wong, S. Sen, T. Lan, and M. Chiang, “Multi-resource allocation: Fairness-efficiency trade-
offs in a unifying framework,” in Proc. IEEE INFOCOM, 2012.
[14] D. Dolev, D. Feitelson, J. Halpern, R. Kupferman, and N. Linial, “No justified complaints: On
fair sharing of multiple resources,” in Proc. ACM ITCS, 2012.
[15] A. Gutman and N. Nisan, “Fair allocation without trade,” in Proc. AAMAS, 2012.
[16] D. Parkes, A. Procaccia, and N. Shah, “Beyond dominant resource fairness: Extensions, limita-
tions, and indivisibilities,” in Proc. ACM EC, 2012.
[17] V. Sekar, N. Egi, S. Ratnasamy, M. Reiter, and G. Shi, “Design and implementation of a consoli-
dated middlebox architecture,” in Proc. USENIX NSDI, 2012.
[18] J. Sherry, S. Hasan, C. Scott, A. Krishnamurthy, S. Ratnasamy, and V. Sekar, “Making middle-
boxes someone else’s problem: Network processing as a cloud service,” in Proc. ACM SIGCOMM,
2012.
[19] A. Greenhalgh, F. Huici, M. Hoerdt, P. Papadimitriou, M. Handley, and L. Mathy, “Flow pro-
cessing and the rise of commodity network hardware,” ACM SIGCOMM Comput. Commun. Rev.,
vol. 39, no. 2, pp. 20–26, 2009.
[20] J. Anderson, R. Braud, R. Kapoor, G. Porter, and A. Vahdat, “xOMB: Extensible open middle-
boxes with commodity servers,” in Proc. ACM/IEEE ANCS, 2012.
[21] A. Demers, S. Keshav, and S. Shenker, “Analysis and simulation of a fair queueing algorithm,” in
Proc. ACM SIGCOMM, 1989.
[22] A. Parekh and R. Gallager, “A generalized processor sharing approach to flow control in integrated
services networks: The single-node case,” IEEE/ACM Trans. Netw., vol. 1, no. 3, pp. 344–357,
1993.
[23] S. Golestani, “A self-clocked fair queueing scheme for broadband applications,” in Proc. IEEE
INFOCOM, 1994.
[24] H. Zhang, “Service disciplines for guaranteed performance service in packet-switching networks,”
Proc. IEEE, vol. 83, no. 10, pp. 1374–1396, 1995.
[25] M. Shreedhar and G. Varghese, “Efficient fair queuing using deficit round-robin,” IEEE/ACM
Trans. Netw., vol. 4, no. 3, pp. 375–385, 1996.
[26] J. Bennett and H. Zhang, “WF2Q: Worst-case fair weighted fair queueing,” in Proc. IEEE INFO-
COM, 1996.
[27] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, F. Huici, and L. Mathy, “Towards high performance
virtual routers on commodity hardware,” in Proc. ACM CoNEXT, 2008.
[28] H. Dreger, A. Feldmann, V. Paxson, and R. Sommer, “Predicting the resource consumption of
network intrusion detection systems,” in Recent Advances in Intrusion Detection (RAID), vol.
5230. Springer, 2008, pp. 135–154.
[29] M. Honda, Y. Nishida, C. Raiciu, A. Greenhalgh, M. Handley, and H. Tokuda, “Is it still possible
to extend TCP?” in Proc. ACM IMC, 2011.
[30] Z. Wang, Z. Qian, Q. Xu, Z. Mao, and M. Zhang, “An untold story of middleboxes in cellular
networks,” in Proc. SIGCOMM, 2011.
[31] P. Goyal, H. Vin, and H. Cheng, “Start-time fair queueing: A scheduling algorithm for integrated
services packet switching networks,” IEEE/ACM Trans. Netw., vol. 5, no. 5, pp. 690–704, 1997.
[32] A. Gember, P. Prabhu, Z. Ghadiyali, and A. Akella, “Toward software-defined middlebox network-
ing,” in Proc. ACM Hotnets, 2012.
[33] W. Wang, B. Li, and B. Liang, “Dominant resource fairness in cloud computing systems with
heterogeneous servers,” in Proc. IEEE INFOCOM, 2014.
[34] W. Wang, B. Liang, and B. Li, “Multi-resource fair allocation in heterogeneous cloud computing
systems,” IEEE Trans. Parallel Distrib. Syst., (to appear).
[35] W. Wang, B. Li, and B. Liang, “Multi-resource round robin: A low complexity packet scheduler
with dominant resource fairness,” in Proc. IEEE ICNP, 2013.
[36] W. Wang, B. Liang, and B. Li, “Low complexity multi-resource fair queueing with bounded delay,”
in Proc. IEEE INFOCOM, 2014.
[37] W. Wang, C. Feng, B. Li, and B. Liang, “On the fairness-efficiency tradeoff for packet processing
with multiple resources,” in Proc. ACM CoNEXT, 2014.
[38] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker,
and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster
computing,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and
Implementation, ser. NSDI’12. Berkeley, CA, USA: USENIX Association, 2012, pp. 2–2. [Online].
Available: http://dl.acm.org/citation.cfm?id=2228298.2228301
[39] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and
I. Stoica, “Mesos: A platform for fine-grained resource sharing in the data center,” in Proc.
USENIX NSDI, 2011.
[40] W. Wang, B. Li, and B. Liang, “Dominant resource fairness in cloud computing systems with
heterogeneous servers,” in Proc. IEEE INFOCOM, 2014.
[41] I. Kash, A. Procaccia, and N. Shah, “No agent left behind: Dynamic fair division of multiple
resources,” in Proc. AAMAS, 2013.
[42] J. Li and J. Xue, “Egalitarian division under Leontief preferences,” Econ. Theory, vol. 54, no. 3,
pp. 597–622, 2013.
[43] A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica, “Choosy: Max-min fair sharing for datacenter
jobs with constraints,” in Proc. ACM EuroSys, 2013.
[44] A. D. Procaccia, “Cake cutting: Not just child’s play,” Commun. ACM, 2013.
[45] “Hadoop Fair Scheduler,” http://hadoop.apache.org/docs/r0.20.2/fair_scheduler.html.
[46] M. Mitzenmacher, “The power of two choices in randomized load balancing,” IEEE Trans. Parallel
Distrib. Syst., vol. 12, no. 10, pp. 1094–1104, 2001.
[47] A. Singhal, “Modern information retrieval: A brief overview,” IEEE Data Eng. Bull., vol. 24,
no. 4, pp. 35–43, 2001.
[48] “Apache Mesos,” http://mesos.apache.org.
[49] S. Baruah, J. Gehrke, and C. Plaxton, “Fast scheduling of periodic tasks on multiple resources,”
in Proc. IEEE IPPS, 1995.
[50] S. Baruah, N. Cohen, C. Plaxton, and D. Varvel, “Proportionate progress: A notion of fairness in
resource allocation,” Algorithmica, vol. 15, no. 6, pp. 600–625, 1996.
[51] F. Kelly, A. Maulloo, and D. Tan, “Rate control for communication networks: Shadow prices,
proportional fairness and stability,” J. Oper. Res. Soc., vol. 49, no. 3, pp. 237–252, 1998.
[52] J. Mo and J. Walrand, “Fair end-to-end window-based congestion control,” IEEE/ACM Trans.
Networking, vol. 8, no. 5, pp. 556–567, 2000.
[53] J. Kleinberg, Y. Rabani, and E. Tardos, “Fairness in routing and load balancing,” in Proc. IEEE
FOCS, 1999.
[54] J. Blanquer and B. Ozden, “Fair queuing for aggregated multiple links,” in Proc. ACM SIGCOMM,
2001.
[55] Y. Liu and E. Knightly, “Opportunistic fair scheduling over multiple wireless channels,” in Proc.
IEEE INFOCOM, 2003.
[56] C. Koksal, H. Kassab, and H. Balakrishnan, “An analysis of short-term fairness in wireless media
access protocols,” in Proc. ACM SIGMETRICS (poster session), 2000.
[57] M. Bredel and M. Fidler, “Understanding fairness and its impact on quality of service in IEEE
802.11,” in Proc. IEEE INFOCOM, 2009.
[58] R. Jain, D. Chiu, and W. Hawe, A quantitative measure of fairness and discrimination for re-
source allocation in shared computer system. Eastern Research Laboratory, Digital Equipment
Corporation, 1984.
[59] T. Lan, D. Kao, M. Chiang, and A. Sabharwal, “An axiomatic theory of fairness in network
resource allocation,” in Proc. IEEE INFOCOM, 2010.
[60] “Hadoop Capacity Scheduler,” http://hadoop.apache.org/docs/r0.20.2/capacity_scheduler.html.
[61] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, “Quincy: Fair
scheduling for distributed computing clusters,” in Proc. ACM SOSP, 2009.
[62] A. A. Bhattacharya, D. Culler, E. Friedman, A. Ghodsi, S. Shenker, and I. Stoica, “Hierarchical
scheduling for diverse datacenter workloads,” in Proc. ACM SoCC, 2013.
[63] E. Friedman, A. Ghodsi, and C.-A. Psomas, “Strategyproof allocation of discrete jobs on multiple
machines,” in Proc. ACM EC, 2014.
[64] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella, “Multi-resource packing for
cluster schedulers,” in Proc. ACM SIGCOMM, 2014.
[65] D. Bertsekas and R. Gallager, Data Networks. Prentice-Hall, 2004.
[66] W. Wang, B. Liang, and B. Li, “Multi-resource generalized processor sharing for packet process-
ing,” in Proc. ACM/IEEE IWQoS, 2013.
[67] S. Floyd and V. Jacobson, “Link-sharing and resource management models for packet networks,”
IEEE/ACM Trans. Netw., vol. 3, no. 4, pp. 365–386, 1995.
[68] S. Kanhere, H. Sethu, and A. Parekh, “Fair and efficient packet scheduling using elastic round
robin,” IEEE Trans. Parallel Distrib. Syst., vol. 13, no. 3, pp. 324–336, 2002.
[69] N. Egi, A. Greenhalgh, M. Handley, G. Iannaccone, M. Manesh, L. Mathy, and S. Ratnasamy,
“Improved forwarding architecture and resource management for multi-core software routers,”
Proc. IFIP NPC, 2009.
[70] “Cisco GSR,” http://www.cisco.com/.
[71] S. Ramabhadran and J. Pasquale, “Stratified round robin: A low complexity packet scheduler with
bandwidth fairness and bounded delay,” in Proc. ACM SIGCOMM, 2003.
[72] B. Caprita, J. Nieh, and W. C. Chan, “Group round robin: Improving the fairness and complexity
of packet scheduling,” in Proc. ACM ANCS, 2005.
[73] X. Yuan and Z. Duan, “Fair round-robin: A low-complexity packet scheduler with proportional
and worst-case fairness,” IEEE Trans. Comput., 2009.
[74] W. Wang, D. Niu, B. Li, and B. Liang, “Dynamic cloud resource reservation via cloud brokerage,”
in Proc. IEEE ICDCS, 2013.
[75] S. Y. Cheung and C. S. Pencea, “BSFQ: Bin sort fair queueing,” in Proc. IEEE INFOCOM, 2002.
[76] C. Guo, “SRR: An O(1) time complexity packet scheduler for flows in multi-service packet networks,”
in Proc. ACM SIGCOMM, 2001.
[77] D. Stiliadis and A. Varma, “Rate-proportional servers: A design methodology for fair queueing
algorithms,” IEEE/ACM Trans. Netw., vol. 6, no. 2, pp. 164–174, 1998.
[78] ——, “Latency-rate servers: A general model for analysis of traffic scheduling algorithms,”
IEEE/ACM Trans. Netw., vol. 6, no. 5, pp. 611–624, 1998.
[79] A. Greenberg and N. Madras, “How fair is fair queuing,” J. ACM, vol. 39, no. 3, pp. 568–598,
1992.
[80] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek, “The Click modular router,”
ACM Trans. Comput. Sys., vol. 18, no. 3, pp. 263–297, 2000.
[81] M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in
Action. Cambridge University Press, 2013.
[82] W. Whitt, “Understanding the efficiency of multi-server service systems,” Management Sci.,
vol. 38, no. 5, pp. 708–723, 1992.
[83] B. Chen, C. N. Potts, and G. J. Woeginger, “A review of machine scheduling: Complexity, al-
gorithms and approximability,” in Handbook of Combinatorial Optimization, D.-Z. Du and P. M.
Pardalos, Eds. Springer, 1999, pp. 1493–1641.
[84] J. Y. Leung, Handbook of Scheduling: Algorithms, Models, and Performance Analysis. CRC
Press, 2004.
[85] M. L. Pinedo, Scheduling: Theory, Algorithms, and Systems. Springer, 2012.
[86] M. R. Garey, D. S. Johnson, and R. Sethi, “The complexity of flowshop and jobshop scheduling,”
Math. Oper. Res., vol. 1, no. 2, pp. 117–129, 1976.
[87] J. Sgall, “On-line scheduling,” in Online Algorithms, A. Fiat and G. J. Woeginger, Eds. Springer,
1998.
[88] A. P. A. Vestjens, “On-line machine scheduling,” Ph.D. dissertation, Technische Universiteit Eind-
hoven, 1997.
[89] S. S. Seiden, “A guessing game and randomized online algorithms,” in Proc. ACM STOC, 2000.
[90] J. B. Sidney, “The two-machine maximum flow time problem with series parallel precedence rela-
tions,” Oper. Res., vol. 27, no. 4, pp. 782–791, 1979.
[91] D. Bertsimas and J. Sethuraman, “From fluid relaxations to practical algorithms for job shop
scheduling: the makespan objective,” Math. Program., vol. 92, no. 1, pp. 61–102, 2002.
[92] T. Benson, “Data set for IMC 2010 data center measurement,” http://pages.cs.wisc.edu/~tbenson/IMC_DATA/univ2_trace.tgz, 2010.
[93] W. Wang, B. Liang, and B. Li, “On fairness-efficiency tradeoffs for multi-resource packet process-
ing,” in Proc. IEEE ICDCS Workshop on Data Center Performance (DCPerf), 2013.
[94] E. Danna, S. Mandal, and A. Singh, “A practical algorithm for balancing the max-min fairness
and throughput objectives in traffic engineering,” in Proc. IEEE INFOCOM, 2012.
[95] B. Fan, D.-M. Chiu, and J. C. Lui, “The delicate tradeoffs in bittorrent-like file sharing protocol
design,” in Proc. IEEE ICNP, 2006.
[96] B. Zhang, S. C. Borst, and M. I. Reiman, “Optimal server scheduling in hybrid P2P networks,”
Perform. Eval., vol. 67, no. 11, 2010.
[97] D. Bertsimas, V. F. Farias, and N. Trichakis, “The price of fairness,” Oper. Res., vol. 59, no. 1,
pp. 17–31, 2011.
[98] ——, “On the efficiency-fairness trade-off,” Management Sci., vol. 58, no. 12, pp. 2234–2250, 2012.
[99] J. G. Dai and G. Weiss, “A fluid heuristic for minimizing makespan in job shops,” Oper. Res.,
vol. 50, no. 4, pp. 692–707, 2002.
[100] T. Gonzalez and S. Sahni, “Flowshop and jobshop schedules: Complexity and approximation,”
Oper. Res., vol. 26, no. 1, pp. 26–52, 1978.