Is Co-scheduling Too Expensive for SMP VMs?
Orathai Sukwong and Hyong S. Kim ({osukwong, kim}@ece.cmu.edu)
Carnegie Mellon University, PA, USA
EuroSys 2011 - April 12, 2011
Cloud & Virtualization
• Virtualization
  – Allows multiple servers to share the same physical machine
  – Achieves higher utilization of physical machines
  – Eases infrastructure management
Virtualization
• A symmetric multiprocessing (SMP) VM/guest
– A VM with > 1 virtual CPU (vCPU)
– Each vCPU behaves identically
• vCPU siblings = vCPUs belonging to the same VM
[Figure: virtualization stack. Applications A…N run on an OS inside each VM; the hypervisor (examples: Xen, VMware ESX, KVM) multiplexes the VMs onto the physical machine's hardware.]
VM Scheduling
[Figure: a 4-CPU physical machine with one runqueue per CPU (RQ0–RQ3). The vCPUs of VM 1 (vCPU0–vCPU3) and VM 2 (vCPU0–vCPU1) are spread across the runqueues; application threads A–D and X–Z run on the vCPUs.]
• Each vCPU can execute on any CPU; each thread can run on any vCPU (a toy model of this placement freedom follows)
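To make the per-CPU runqueue model concrete, here is a toy C sketch (our own illustration; the structures are simplified stand-ins, not the kernel's data structures). Note that nothing in this model prevents two sibling vCPUs from landing on the same runqueue:

#include <stdio.h>

#define NCPUS 4
#define QLEN  8

struct vcpu { int vm_id; int vcpu_id; };

/* One runqueue per physical CPU; any vCPU may sit on any runqueue. */
struct runqueue {
    struct vcpu *q[QLEN];
    int len;
} rq[NCPUS];

static void enqueue(int cpu, struct vcpu *v)
{
    rq[cpu].q[rq[cpu].len++] = v;   /* no bounds check: demo only */
}

int main(void)
{
    struct vcpu vm1[4] = { {1,0}, {1,1}, {1,2}, {1,3} };
    struct vcpu vm2[2] = { {2,0}, {2,1} };

    enqueue(0, &vm1[0]);
    enqueue(0, &vm1[1]);   /* siblings stacked on CPU0: nothing forbids it */
    enqueue(1, &vm1[2]);
    enqueue(2, &vm1[3]);
    enqueue(3, &vm2[0]);
    enqueue(3, &vm2[1]);   /* VM 2's siblings stacked on CPU3 */

    for (int c = 0; c < NCPUS; c++) {
        printf("CPU%d:", c);
        for (int i = 0; i < rq[c].len; i++)
            printf(" VM%d.vCPU%d", rq[c].q[i]->vm_id, rq[c].q[i]->vcpu_id);
        printf("\n");
    }
    return 0;
}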
Synchronization in SMP VMs
[Figure: threads A and B on vCPU0 and vCPU1. Independent: A and B run with no interaction. Dependent: one thread waits on a message from, or a resource held by, the other.]
• Assuming that vCPU0 runs A and vCPU1 runs B
• If A and B are dependent, vCPU0 and vCPU1 are also dependent (see the sketch below)
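To make the dependent case concrete, here is a minimal user-space sketch (our illustration, not code from the paper): two threads, one per vCPU, contend on a test-and-set spinlock, so each thread's progress depends on when the current lock holder gets to run. Compile with -pthread:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        /* Spin until the lock is free; if the holder's vCPU is
         * descheduled by the hypervisor, this loop burns CPU time. */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;
        counter++;                       /* critical section */
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;                      /* think: thread A on vCPU0, B on vCPU1 */
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);  /* 2000000 once both finish */
    return 0;
}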
Synchronization Latency Problem
• Recall: each vCPU can run on any CPU
[Figure: two scheduling timelines. Assume vCPU0 acquires the lock before time T0 and releases it at T1.
– Schedule vCPU0 and vCPU1 simultaneously: vCPU1 waits < (T1 - T0) for the lock.
– Schedule vCPU1 before vCPU0 (stacked vCPUs): vCPU1 spins for the full (T1 - T0).]
• Synchronization latency can increase significantly, depending on the scheduling order
How Often Does Scheduler Stack vCPUs?
# VMs | ≥ 2 vCPU siblings stacked on the same CPU
  1   |  5.564%
  2   | 43.127%
  3   | 45.932%
• Run 4-vCPU VMs on a 4-CPU physical host
• Run the CPU-bound workload inside the VMs (a sketch of such a workload follows)
  – 100% utilization on each vCPU
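The slides only characterize this workload as 100% utilization per vCPU; a minimal sketch of such a CPU-bound workload (our assumption of its shape) is one busy-spinning thread per online CPU, with no I/O and no synchronization. Compile with -pthread:

#include <pthread.h>
#include <unistd.h>

static void *spin(void *arg)
{
    (void)arg;
    volatile unsigned long x = 0;
    for (;;)
        x++;            /* pure computation: no I/O, no locks */
    return NULL;
}

int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);   /* one thread per (v)CPU */
    for (long i = 0; i < n; i++) {
        pthread_t t;
        pthread_create(&t, NULL, spin, NULL);
    }
    pause();            /* spin until the process is killed */
    return 0;
}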
[Figure: example placement. VM 1's vCPUs share the four CPUs with other vCPUs; where siblings stack, each vCPU receives only a fraction of a CPU (annotations: 25%, 50%, 75%).]
What if Running a Non-concurrent Application?
• A VM runs both applications and OS
[Figure: virtualization stack. Applications run on the guest OS inside each VM, above the hypervisor and hardware.]
• OS/kernel:
  – Concurrently services applications
  – Requires synchronization (e.g., spinlocks)
• Even when the applications inside the VM are synchronization-free, the VM may still encounter the synchronization latency problem because of synchronization in the guest OS.
Previous Work: Co-scheduling
• Schedule vCPU siblings simultaneously
• Drawback: CPU fragmentation
  – Significantly lowers utilization and delays vCPU execution
[Figure: timeline of co-scheduling a 2-vCPU VM on CPU0 and CPU1. Because vCPU0 and vCPU1 must start together, CPUs sit idle whenever only one of them could otherwise run.]
[Figure: the same co-scheduling timeline with a 4-vCPU VM on four CPUs. Fragmentation worsens: even more CPU time is left idle while the scheduler waits to start all four siblings together.]
Related Work
• Co-scheduling
– VMware => Relax co-scheduling to mitigate CPU fragmentation
• Strict co-scheduling (ESX 2.x)
• Relaxed co-scheduling (ESX 3.x)
• Further relaxed co-scheduling (ESX 4.x)
– Xen => Selectively apply co-scheduling to concurrent VMs
• Weng2009, Bai2010
• Affinity-based scheduling
  – Statically bind a vCPU to a set of CPUs (see the sketch below)
  – Carefully bind vCPUs to avoid overloading particular CPUs
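As an illustration of static binding, here is a hedged user-space sketch. It assumes the hypervisor exposes each vCPU as a Linux thread (as KVM does) and that the operator supplies that thread's TID; on Linux, sched_setaffinity() applied to a TID pins just that thread:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin one vCPU thread (given by TID) to a single physical CPU. */
static int pin_vcpu(pid_t tid, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(tid, sizeof(set), &set);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <vcpu-tid> <cpu>\n", argv[0]);
        return 1;
    }
    pid_t tid = (pid_t)atol(argv[1]);
    if (pin_vcpu(tid, atoi(argv[2])) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}

The binding is fixed up front, which is exactly why careless placement can overload one CPU; balance scheduling instead recomputes placement constraints at each scheduling decision.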
Our Balance Scheduling
• Simple idea: balance vCPU siblings across CPUs
  – Never put any two vCPU siblings into the same RQ
  – No need to force vCPU siblings to be scheduled simultaneously
• Causes no CPU fragmentation, and improves the performance of SMP VMs as well as co-scheduling does
• Easy to implement
  – Modify each vCPU's cpus_allowed field before selecting a RQ (sketch below)
[Figure: vCPU0 is already in CPU0's runqueue, so before vCPU1 is enqueued its cpus_allowed is set to {CPU1, CPU2, CPU3}.]
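Here is a minimal user-space sketch of that rule (our illustration; bookkeeping such as sibling_cpu[] is made up, and the real change manipulates cpus_allowed inside the kernel scheduler): mask out every CPU whose runqueue already holds a sibling, then let the scheduler pick freely among the rest:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#define NCPUS  4
#define NVCPUS 4

/* Which CPU each sibling currently occupies (-1 = not enqueued).
 * Here vCPU0 is on CPU0, matching the figure above. */
static int sibling_cpu[NVCPUS] = { 0, -1, -1, -1 };

/* Build the allowed-CPU mask for one vCPU: all CPUs minus those
 * already holding one of its siblings. */
static void balance_mask(int vcpu, cpu_set_t *allowed)
{
    CPU_ZERO(allowed);
    for (int c = 0; c < NCPUS; c++)
        CPU_SET(c, allowed);
    for (int s = 0; s < NVCPUS; s++)
        if (s != vcpu && sibling_cpu[s] >= 0)
            CPU_CLR(sibling_cpu[s], allowed);
}

int main(void)
{
    cpu_set_t allowed;
    balance_mask(1, &allowed);           /* about to place vCPU1 */
    printf("vCPU1 cpus_allowed =");
    for (int c = 0; c < NCPUS; c++)
        if (CPU_ISSET(c, &allowed))
            printf(" CPU%d", c);
    printf("\n");                        /* prints: CPU1 CPU2 CPU3 */
    return 0;
}

Because the mask constrains only placement, not timing, no sibling is forced to start at the same instant, so there is no CPU fragmentation.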
Synchronization Latency Improvement by Balance Scheduling
[Figure: latency improvement by balance scheduling (%, 0–80) versus runqueue size (0–30), plotted for 2, 4, 16, and 64 CPUs]
• The improvement decreases as the runqueue size grows
• The empirical results show that the runqueue size is ≤ 6 on average
Evaluate Scheduling Algorithms
• Completely Fair Scheduler (CFS)
  – Default scheduler in KVM
  – Treats each vCPU the same
• Affinity-based algorithm (Aff)
  – Statically binds each vCPU to a CPU before running an experiment
    • # vCPUs per physical CPU is kept roughly the same
    • No two vCPU siblings are assigned to the same physical CPU
• Co-scheduling algorithm (Co)
  – Implemented on top of CFS
  – No CPU fragmentation problem, but may incur additional context switching
• Our balance scheduling algorithm (Bal)
Runqueue Size
[Figure: maximum and average runqueue size (0–30) for CFS, Aff, Bal, and Co in three experiment groups; host CPU utilization per bar: 72% 72% 73% 73%, 84% 89% 89% 90%, 87% 93% 93% 94%]
• Runqueue size is about 4-6 on average
• Run 14 VMs on a 4-CPU physical machine
• Expect 56 vCPU threads + I/O QEMU threads
TPC-W
[Figure: average and 90th-percentile response time (ms, 0–1800) for Default (CFS), CPU Affinity, Balance, and Coschedule]
[Figure: CDF of spinlock latency in microseconds (0.992–1.0 quantile range), with (mean, max, standard deviation) in µs: Default/CFS (2.1654, 2029.481, 19.2872); CPU Affinity (1.4797, 1319.962, 6.2344); Balance (1.5205, 186.506, 2.8366); Coschedule (1.5245, 422.401, 8.3666)]
• Spinlock latency ↑ → TCP retransmission ↑ → response time ↑
• Run 3 four-vCPU VMs, for a proxy server, an application server, and a database server, on a 4-CPU host
Different Applications
[Figure: speedup (%, 0–90) compared to CFS for CPU Affinity, Balance, and Coschedule, across multi-threaded and single-threaded applications: Pi, HackBench, X264, Compile, DVDStore, BZip2, Untar, TTCP]
• Improvement depends on the degree of synchronization in the VMs
• Setup: run 2 VMs on the host
  – One 4-vCPU VM running the application
  – One 2-vCPU VM running the CPU-bound workload
• Balance scheduling can improve application performance as much as co-scheduling
Bonnie++
• Run Bonnie++ in a 4-vCPU VM, with a 2-vCPU VM running the CPU-bound workload
• Bonnie++ => single-threaded, I/O-intensive
[Figure: I/O speedup (%) compared to CFS (about -50 to +10) for CPU Affinity, Balance, and Coschedule on disk statistics: read/write I/O (requests), merges (requests), sectors, ticks (ms), io_ticks (ms), time_in_queue (ms)]
[Figure: speedup (%) relative to CFS (about -20 to +430) for CPU Affinity, Spread, and Coschedule on Bonnie++ tests: SeqOutput PerChr/Block/Rewrite, SeqInput PerChr/Block, Random Seeks, SeqCreate and RandomCreate Create/Read/Delete]
• Running a single-threaded application can also benefit from co-scheduling, balance scheduling, and affinity-based scheduling
X264 & Ping
[Figure: aggregated frames/s (0–30) of the 1st, 2nd, and 3rd X.264 VMs, plus ping jitter (0.4–1 ms) measured as the standard deviation of ping, for CFS, Aff, Bal, and Co with one, two, and three X.264 VMs]
• 3 four-vCPU VMs run X.264 and 1 one-vCPU VM runs Ping
• With affinity-based scheduling, the Ping vCPU may get stuck on the busiest CPU; with balance scheduling, it can choose the idlest CPU
Different Hypervisors
[Figure: average completion time (seconds, 0–100) per hypervisor/scheduler for a synchronization-free workload (multiple independent processes) and a synchronization-intensive one (HackBench)]
Xen = Xen 4.0.1
ESX = VMware ESXi 4.1
Bal = Balance + KVM (kernel 2.6.33)
K33 = CFS + KVM (kernel 2.6.33)
K35 = CFS + KVM (kernel 2.6.35)
• Our balance scheduling works well with both synchronization-intensive and synchronization-free applications
Discussion
• What are most applications: synchronization-intensive or synchronization-free?
• Many legacy applications are still single-threaded
• A good number of applications are concurrent programs written to take advantage of multi-core architectures
• A rule of thumb in concurrent programming is to use minimal synchronization to promote parallelism
• We believe that future parallel programs will lean toward minimal use of synchronization
  => Our balance scheduling should be the way to go!
Conclusion
• The synchronization latency problem can significantly degrade application performance
  – Even when running synchronization-free applications, because of synchronization in the guest OS
• Co-scheduling can be too expensive for SMP VMs with minimal synchronization
  – CPU fragmentation reduces host-CPU utilization and delays vCPU execution
• Our balance scheduling can
  – Perform similarly to co-scheduling given concurrent SMP VMs
  – Work well with minimal-synchronization VMs