Is Co-scheduling Too Expensive for SMP VMs?
Orathai Sukwong and Hyong S. Kim ({osukwong, kim}@ece.cmu.edu)
Carnegie Mellon University, PA, USA
EuroSys 2011 - April 12, 2011
Cloud & Virtualization
• Virtualization
  – Allows multiple servers to share the same physical machine
  – Achieves higher utilization of physical machines
  – Eases infrastructure management
Virtualization
• A symmetric multiprocessing (SMP) VM/guest
– A VM with > 1 virtual CPU (vCPU)
– Each vCPU behaves identically
• vCPU siblings = vCPUs belonging to the same VM
[Figure: virtualization stack. Applications A…N run on an OS inside each VM; the hypervisor (examples: Xen, VMware ESX, KVM) multiplexes the VMs onto the physical machine's hardware.]
VM Scheduling
[Figure: a 4-CPU physical machine with one runqueue per CPU (RQ0–RQ3). The vCPUs of VM 1 (vCPU0–vCPU3) and VM 2 (vCPU0–vCPU1) are spread across the runqueues; application threads A–D and X–Z run on the vCPUs.]
• Each vCPU can execute on any CPU; each thread can run on any vCPU (a toy model of this placement freedom follows)
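To make the per-CPU runqueue model concrete, here is a toy C sketch (our own illustration; the structures are simplified stand-ins, not the kernel's data structures). Note that nothing in this model prevents two sibling vCPUs from landing on the same runqueue:

#include <stdio.h>

#define NCPUS 4
#define QLEN  8

struct vcpu { int vm_id; int vcpu_id; };

/* One runqueue per physical CPU; any vCPU may sit on any runqueue. */
struct runqueue {
    struct vcpu *q[QLEN];
    int len;
} rq[NCPUS];

static void enqueue(int cpu, struct vcpu *v)
{
    rq[cpu].q[rq[cpu].len++] = v;   /* no bounds check: demo only */
}

int main(void)
{
    struct vcpu vm1[4] = { {1,0}, {1,1}, {1,2}, {1,3} };
    struct vcpu vm2[2] = { {2,0}, {2,1} };

    enqueue(0, &vm1[0]);
    enqueue(0, &vm1[1]);   /* siblings stacked on CPU0: nothing forbids it */
    enqueue(1, &vm1[2]);
    enqueue(2, &vm1[3]);
    enqueue(3, &vm2[0]);
    enqueue(3, &vm2[1]);   /* VM 2's siblings stacked on CPU3 */

    for (int c = 0; c < NCPUS; c++) {
        printf("CPU%d:", c);
        for (int i = 0; i < rq[c].len; i++)
            printf(" VM%d.vCPU%d", rq[c].q[i]->vm_id, rq[c].q[i]->vcpu_id);
        printf("\n");
    }
    return 0;
}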
Synchronization in SMP VMs
[Figure: threads A and B on vCPU0 and vCPU1. Independent: A and B run with no interaction. Dependent: one thread waits on a message from, or a resource held by, the other.]
• Assuming that vCPU0 runs A and vCPU1 runs B
• If A and B are dependent, vCPU0 and vCPU1 are also dependent (see the sketch below)
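To make the dependent case concrete, here is a minimal user-space sketch (our illustration, not code from the paper): two threads, one per vCPU, contend on a test-and-set spinlock, so each thread's progress depends on when the current lock holder gets to run. Compile with -pthread:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        /* Spin until the lock is free; if the holder's vCPU is
         * descheduled by the hypervisor, this loop burns CPU time. */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;
        counter++;                       /* critical section */
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;                      /* think: thread A on vCPU0, B on vCPU1 */
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);  /* 2000000 once both finish */
    return 0;
}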
Synchronization Latency Problem
• Recall: each vCPU can run on any CPU
[Figure: two scheduling timelines. Assume vCPU0 acquires the lock before time T0 and releases it at T1.
– Schedule vCPU0 and vCPU1 simultaneously: vCPU1 waits < (T1 - T0) for the lock.
– Schedule vCPU1 before vCPU0 (stacked vCPUs): vCPU1 spins for the full (T1 - T0).]
• Synchronization latency can increase significantly, depending on the scheduling order
How Often Does Scheduler Stack vCPUs?
# VMs | ≥ 2 vCPU siblings stacked on the same CPU
  1   |  5.564%
  2   | 43.127%
  3   | 45.932%
• Run 4-vCPU VMs on a 4-CPU physical host
• Run the CPU-bound workload inside the VMs (a sketch of such a workload follows)
  – 100% utilization on each vCPU
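The slides only characterize this workload as 100% utilization per vCPU; a minimal sketch of such a CPU-bound workload (our assumption of its shape) is one busy-spinning thread per online CPU, with no I/O and no synchronization. Compile with -pthread:

#include <pthread.h>
#include <unistd.h>

static void *spin(void *arg)
{
    (void)arg;
    volatile unsigned long x = 0;
    for (;;)
        x++;            /* pure computation: no I/O, no locks */
    return NULL;
}

int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);   /* one thread per (v)CPU */
    for (long i = 0; i < n; i++) {
        pthread_t t;
        pthread_create(&t, NULL, spin, NULL);
    }
    pause();            /* spin until the process is killed */
    return 0;
}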
[Figure: example placement. VM 1's vCPUs share the four CPUs with other vCPUs; where siblings stack, each vCPU receives only a fraction of a CPU (annotations: 25%, 50%, 75%).]
What if Running a Non-concurrent Application?
• A VM runs both applications and OS
[Figure: virtualization stack. Applications run on the guest OS inside each VM, above the hypervisor and hardware.]
• OS/kernel:
  – Concurrently services applications
  – Requires synchronization (e.g., spinlocks)
• Even when the applications inside the VM are synchronization-free, the VM may still encounter the synchronization latency problem because of synchronization in the guest OS.
Previous Work: Co-scheduling
• Schedule vCPU siblings simultaneously
• Drawback: CPU fragmentation
  – Significantly lowers utilization and delays vCPU execution
[Figure: timeline of co-scheduling a 2-vCPU VM on CPU0 and CPU1. Because vCPU0 and vCPU1 must start together, CPUs sit idle whenever only one of them could otherwise run.]
[Figure: the same co-scheduling timeline with a 4-vCPU VM on four CPUs. Fragmentation worsens: even more CPU time is left idle while the scheduler waits to start all four siblings together.]
Related Work
• Co-scheduling
– VMware => Relax co-scheduling to mitigate CPU fragmentation
• Strict co-scheduling (ESX 2.x)
• Relaxed co-scheduling (ESX 3.x)
• Further relaxed co-scheduling (ESX 4.x)
– Xen => Selectively apply co-scheduling to concurrent VMs
• Weng2009, Bai2010
• Affinity-based scheduling
  – Statically bind a vCPU to a set of CPUs (see the sketch below)
  – Carefully bind vCPUs to avoid overloading particular CPUs
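As an illustration of static binding, here is a hedged user-space sketch. It assumes the hypervisor exposes each vCPU as a Linux thread (as KVM does) and that the operator supplies that thread's TID; on Linux, sched_setaffinity() applied to a TID pins just that thread:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin one vCPU thread (given by TID) to a single physical CPU. */
static int pin_vcpu(pid_t tid, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(tid, sizeof(set), &set);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <vcpu-tid> <cpu>\n", argv[0]);
        return 1;
    }
    pid_t tid = (pid_t)atol(argv[1]);
    if (pin_vcpu(tid, atoi(argv[2])) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}

The binding is fixed up front, which is exactly why careless placement can overload one CPU; balance scheduling instead recomputes placement constraints at each scheduling decision.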
Our Balance Scheduling
• Simple idea: balance vCPU siblings across CPUs
  – Never put any two vCPU siblings into the same RQ
  – No need to force vCPU siblings to be scheduled simultaneously
• Causes no CPU fragmentation, and improves the performance of SMP VMs as well as co-scheduling does
• Easy to implement
  – Modify each vCPU's cpus_allowed field before selecting a RQ (sketch below)
[Figure: vCPU0 is already in CPU0's runqueue, so before vCPU1 is enqueued its cpus_allowed is set to {CPU1, CPU2, CPU3}.]
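Here is a minimal user-space sketch of that rule (our illustration; bookkeeping such as sibling_cpu[] is made up, and the real change manipulates cpus_allowed inside the kernel scheduler): mask out every CPU whose runqueue already holds a sibling, then let the scheduler pick freely among the rest:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#define NCPUS  4
#define NVCPUS 4

/* Which CPU each sibling currently occupies (-1 = not enqueued).
 * Here vCPU0 is on CPU0, matching the figure above. */
static int sibling_cpu[NVCPUS] = { 0, -1, -1, -1 };

/* Build the allowed-CPU mask for one vCPU: all CPUs minus those
 * already holding one of its siblings. */
static void balance_mask(int vcpu, cpu_set_t *allowed)
{
    CPU_ZERO(allowed);
    for (int c = 0; c < NCPUS; c++)
        CPU_SET(c, allowed);
    for (int s = 0; s < NVCPUS; s++)
        if (s != vcpu && sibling_cpu[s] >= 0)
            CPU_CLR(sibling_cpu[s], allowed);
}

int main(void)
{
    cpu_set_t allowed;
    balance_mask(1, &allowed);           /* about to place vCPU1 */
    printf("vCPU1 cpus_allowed =");
    for (int c = 0; c < NCPUS; c++)
        if (CPU_ISSET(c, &allowed))
            printf(" CPU%d", c);
    printf("\n");                        /* prints: CPU1 CPU2 CPU3 */
    return 0;
}

Because the mask constrains only placement, not timing, no sibling is forced to start at the same instant, so there is no CPU fragmentation.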
Synchronization Latency Improvement by Balance Scheduling
[Figure: latency improvement by balance scheduling (%, 0–80) versus runqueue size (0–30), plotted for 2, 4, 16, and 64 CPUs]
• The improvement decreases as the runqueue size grows
• The empirical results show that the runqueue size is ≤ 6 on average
Evaluate Scheduling Algorithms
• Completely Fair Scheduler (CFS)
  – Default scheduler in KVM
  – Treats each vCPU the same
• Affinity-based algorithm (Aff)
  – Statically binds each vCPU to a CPU before running an experiment
    • # vCPUs per physical CPU is kept roughly the same
    • No two vCPU siblings are assigned to the same physical CPU
• Co-scheduling algorithm (Co)
  – Implemented on top of CFS
  – No CPU fragmentation problem, but may incur additional context switching
• Our balance scheduling algorithm (Bal)
Runqueue Size
[Figure: maximum and average runqueue size (0–30) for CFS, Aff, Bal, and Co in three experiment groups; host CPU utilization per bar: 72% 72% 73% 73%, 84% 89% 89% 90%, 87% 93% 93% 94%]
• Runqueue size is about 4-6 on average
• Run 14 VMs on a 4-CPU physical machine
• Expect 56 vCPU threads + I/O QEMU threads
TPC-W
[Figure: average and 90th-percentile response time (ms, 0–1800) for Default (CFS), CPU Affinity, Balance, and Coschedule]
[Figure: CDF of spinlock latency in microseconds (0.992–1.0 quantile range), with (mean, max, standard deviation) in µs: Default/CFS (2.1654, 2029.481, 19.2872); CPU Affinity (1.4797, 1319.962, 6.2344); Balance (1.5205, 186.506, 2.8366); Coschedule (1.5245, 422.401, 8.3666)]
• Spinlock latency ↑ → TCP retransmission ↑ → response time ↑
• Run 3 four-vCPU VMs, for a proxy server, an application server, and a database server, on a 4-CPU host
Different Applications
[Figure: speedup (%, 0–90) compared to CFS for CPU Affinity, Balance, and Coschedule, across multi-threaded and single-threaded applications: Pi, HackBench, X264, Compile, DVDStore, BZip2, Untar, TTCP]
• Improvement depends on the degree of synchronization in the VMs
• Setup: run 2 VMs on the host
  – One 4-vCPU VM running the application
  – One 2-vCPU VM running the CPU-bound workload
• Balance scheduling can improve application performance as much as co-scheduling
Bonnie++
• Run Bonnie++ in a 4-vCPU VM, with a 2-vCPU VM running the CPU-bound workload
• Bonnie++ => single-threaded, I/O-intensive
[Figure: I/O speedup (%) compared to CFS (about -50 to +10) for CPU Affinity, Balance, and Coschedule on disk statistics: read/write I/O (requests), merges (requests), sectors, ticks (ms), io_ticks (ms), time_in_queue (ms)]
[Figure: speedup (%) relative to CFS (about -20 to +430) for CPU Affinity, Spread, and Coschedule on Bonnie++ tests: SeqOutput PerChr/Block/Rewrite, SeqInput PerChr/Block, Random Seeks, SeqCreate and RandomCreate Create/Read/Delete]
• Running a single-threaded application can also benefit from co-scheduling, balance scheduling, and affinity-based scheduling
X264 & Ping
[Figure: aggregated frames/s (0–30) of the 1st, 2nd, and 3rd X.264 VMs, plus ping jitter (0.4–1 ms) measured as the standard deviation of ping, for CFS, Aff, Bal, and Co with one, two, and three X.264 VMs]
• 3 four-vCPU VMs run X.264 and 1 one-vCPU VM runs Ping
• With affinity-based scheduling, the Ping vCPU may get stuck on the busiest CPU; with balance scheduling, it can choose the idlest CPU
Different Hypervisors
[Figure: average completion time (seconds, 0–100) per hypervisor/scheduler for a synchronization-free workload (multiple independent processes) and a synchronization-intensive one (HackBench)]
Xen = Xen 4.0.1
ESX = VMware ESXi 4.1
Bal = Balance + KVM (kernel 2.6.33)
K33 = CFS + KVM (kernel 2.6.33)
K35 = CFS + KVM (kernel 2.6.35)
• Our balance scheduling works well with both synchronization-intensive and synchronization-free applications
Discussion
• What are most applications: synchronization-intensive or synchronization-free?
• Many legacy applications are still single-threaded
• A good number of applications are concurrent programs written to take advantage of multi-core architectures
• A rule of thumb in concurrent programming is to use minimal synchronization to promote parallelism
• We believe that future parallel programs will lean toward minimal use of synchronization
  => Our balance scheduling should be the way to go!
Conclusion
• The synchronization latency problem can significantly degrade application performance
  – Even when running synchronization-free applications, because of synchronization in the guest OS
• Co-scheduling can be too expensive for SMP VMs with minimal synchronization
  – CPU fragmentation reduces host-CPU utilization and delays vCPU execution
• Our balance scheduling can
  – Perform similarly to co-scheduling given concurrent SMP VMs
  – Work well with minimal-synchronization VMs