vSnoop: Improving TCP Throughput in Virtualized Environments
via Acknowledgement Offload
Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu
Department of Computer Science, Purdue University
Cloud Computing and HPC
Background and Motivation
- Virtualization: a key enabler of cloud computing
  - Amazon EC2, Eucalyptus
- Increasingly adopted in other real systems
  - High performance computing: NERSC's Magellan system
  - Grid/cyberinfrastructure computing: In-VIGO, Nimbus, Virtuoso
- Multiple VMs hosted by one physical host
  - Multiple VMs sharing the same core
  - Flexibility, scalability, and economy
VM Consolidation: A Common Practice
[Figure: multiple VMs (VM 1 - VM 4) consolidated on one virtualization layer over shared hardware, receiving traffic from a sender]
Key observation: VM consolidation negatively impacts network performance!
Investigating the Problem
Q1: How does CPU sharing affect RTT?
[Figure: a client measures RTT to a server VM (VM 1 - VM 3) consolidated on one host behind the virtualization layer]
[Plot: RTT (ms, 40-180) vs. number of consolidated VMs (2-5), with US East - West, US East - Europe, and US West - Australia RTTs shown for comparison]
Observation: the RTT increase is proportional to the VM scheduling slice (30ms)
Q2: What is the cause of the RTT increase?
[Figure: packets from the sender pass through the device driver in the driver domain (dom0), then wait in per-VM buffers (VM 1, VM 2, VM 3) until the target VM is scheduled in its 30ms slice]
[CDF: dom0 processing time vs. wait time in the buffer]
VM scheduling latency dominates virtualization overhead!
Connection to the VM is much slower than connection to dom0!
Q3: What is the impact on TCP throughput?
[Plot comparing throughput of connections terminating at dom0 vs. at the VM]
Our Solution: vSnoop
- Alleviates the negative effect of VM scheduling on TCP throughput
- Implemented within the driver domain to accelerate TCP connections
- Does not require any modifications to the VM
- Does not violate end-to-end TCP semantics
- Applicable across a wide range of VMMs: Xen, VMware, KVM, etc.
TCP Connection to a VM
Sender establishes a TCP connection to VM1
[Timeline: the sender's SYN is placed in the VM1 buffer by the driver domain and is not processed until VM1's next scheduling turn (VM1, VM2, VM3 share the core), so the SYN,ACK and every subsequent RTT include the VM scheduling latency]
Key Idea: Acknowledgement Offload
[Timeline with vSnoop: the driver domain acknowledges in-order packets on behalf of VM1 as soon as they fit in the shared buffer, so the sender no longer waits a full VM scheduling latency per round trip]
Faster progress during TCP slow start
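To make "acknowledging on behalf of the VM" concrete, here is a minimal C sketch (field and function names are assumptions for illustration, not vSnoop's code) of fabricating an early ACK in the driver domain for an in-order data segment that fits in the per-VM buffer.

#include <stdint.h>

/* Minimal TCP-like header fields needed for the sketch (illustrative;
 * a real implementation would use the kernel's skb/tcphdr machinery). */
struct tcp_seg {
    uint16_t src_port, dst_port;
    uint32_t seq, ack;
    uint16_t window;
    uint8_t  flags;                        /* 0x10 = ACK */
};

/* Build an early acknowledgement for an in-order data segment destined
 * for the VM: direction is reversed, the ACK number covers the data just
 * buffered, and the advertised window reflects the buffer space left. */
static void build_early_ack(const struct tcp_seg *data, uint32_t payload_len,
                            uint32_t buf_room, struct tcp_seg *ack)
{
    ack->src_port = data->dst_port;        /* reply flows back to sender   */
    ack->dst_port = data->src_port;
    ack->seq      = data->ack;             /* VM side's current seq number */
    ack->ack      = data->seq + payload_len;   /* next byte expected       */
    ack->window   = (uint16_t)(buf_room > 0xFFFF ? 0xFFFF : buf_room);
    ack->flags    = 0x10;                  /* ACK                          */
}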
vSnoop's Impact on TCP Flows
- TCP slow start
  - Early acknowledgements help connections progress faster
  - Most significant benefit for short transfers, which are prevalent in data centers [Kandula IMC'09], [Benson WREN'09] (see the sketch below)
- TCP congestion avoidance and fast retransmit
  - Large flows in steady state can also benefit from vSnoop
  - Benefit is not as large as for slow start
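As a rough illustration of why slow start dominates short transfers, the following C sketch (an illustrative model, not from the paper; the 1460-byte MSS, initial window of 2 segments, and loss-free doubling per RTT are assumptions) estimates the time to move a 100KB file when each round trip costs the ~30ms VM scheduling latency versus a sub-millisecond RTT with early acknowledgements.

#include <stdio.h>

/* Illustrative slow-start model: window doubles every RTT, no losses. */
static double slow_start_time_ms(double transfer_bytes, double rtt_ms)
{
    const double mss = 1460.0;  /* assumed segment size */
    double cwnd = 2.0;          /* assumed initial window, in segments */
    double sent = 0.0;          /* bytes delivered so far */
    double time_ms = 0.0;

    while (sent < transfer_bytes) {
        sent += cwnd * mss;     /* one window per round trip */
        time_ms += rtt_ms;
        cwnd *= 2.0;            /* exponential growth in slow start */
    }
    return time_ms;
}

int main(void)
{
    double size = 100.0 * 1024.0;   /* 100KB transfer, as in the evaluation */
    printf("RTT 30 ms (VM scheduling): %.1f ms\n", slow_start_time_ms(size, 30.0));
    printf("RTT 0.5 ms (early ACKs):   %.1f ms\n", slow_start_time_ms(size, 0.5));
    return 0;
}

With these assumptions the 100KB transfer needs six round trips, i.e. roughly 180ms with scheduling-inflated RTTs versus a few milliseconds with early acknowledgements, which is consistent in spirit with the order-of-magnitude gains reported below.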
Challenges
- Challenge 1: Out-of-order and special packets (SYN, FIN)
  - Solution: let the VM handle these packets
- Challenge 2: Packet loss after vSnoop
  - Solution: vSnoop acknowledges a packet only if there is room in the buffer
- Challenge 3: ACKs generated by the VM
  - Solution: suppress/rewrite the VM's ACKs for packets vSnoop has already acknowledged
- Challenge 4: Throttling the receive window to keep vSnoop online
  - Solution: adjust the advertised window according to the buffer size (see the sketch below)
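A minimal sketch of how Challenges 3 and 4 could be handled on the VM-to-sender path (illustrative only; the structure and names are assumptions, not vSnoop's code): ACKs the VM generates for data vSnoop has already acknowledged are suppressed, and any forwarded ACK has its advertised window clamped to the free space in the per-VM buffer.

#include <stdbool.h>
#include <stdint.h>

struct flow_state {
    uint32_t acked_by_vsnoop;  /* highest ACK number vSnoop has already sent */
    uint32_t buf_capacity;     /* per-VM buffer size in bytes                */
    uint32_t buf_used;         /* bytes currently queued for the VM          */
};

/* Handle an ACK generated by the VM on its way to the sender.
 * Returns false if the ACK is redundant (vSnoop already acknowledged at
 * least this much) and should be suppressed; otherwise rewrites the
 * advertised window so the sender cannot outrun the buffer.
 * (Wraparound-safe sequence comparison is omitted for brevity.) */
static bool vsnoop_handle_vm_ack(struct flow_state *fs,
                                 uint32_t ack_no, uint16_t *window)
{
    if (ack_no <= fs->acked_by_vsnoop)
        return false;                      /* suppress duplicate ACK  */

    uint32_t room = fs->buf_capacity - fs->buf_used;
    if (*window > room)
        *window = (uint16_t)room;          /* throttle receive window */
    return true;                           /* forward rewritten ACK   */
}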
State Machine Maintained Per-Flow
- States: Start, Active (online), No buffer (offline), Unexpected sequence
- Active: early acknowledgements for in-order packets
- No buffer: don't acknowledge
- Unexpected sequence: pass out-of-order packets to the VM
- Transitions: an in-order packet with buffer space available moves the flow to Active; an in-order packet with no buffer space moves it to No buffer; an out-of-order packet moves it to Unexpected sequence (see the sketch below)
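The per-flow state machine above can be rendered as a small C sketch (an illustrative reconstruction from the slide; state and function names are assumptions): in-order packets with buffer space are acknowledged early, in-order packets without buffer space are left for the VM to acknowledge, and out-of-order packets are passed to the VM untouched.

#include <stdbool.h>
#include <stdint.h>

/* Per-flow states from the slide (illustrative reconstruction). */
enum vsnoop_state {
    VS_START,        /* no packet seen yet                   */
    VS_ACTIVE,       /* online: early-ACK in-order packets   */
    VS_NO_BUFFER,    /* offline: in-order but buffer is full */
    VS_UNEXPECTED    /* out-of-order: let the VM handle it   */
};

struct vsnoop_flow {
    enum vsnoop_state state;
    uint32_t expected_seq;   /* next in-order sequence number   */
    uint32_t buf_free;       /* free bytes in the per-VM buffer */
};

/* Process one incoming packet; return true if vSnoop should send an
 * early acknowledgement on behalf of the VM. */
static bool vsnoop_recv(struct vsnoop_flow *f, uint32_t seq, uint32_t len)
{
    if (seq != f->expected_seq) {
        f->state = VS_UNEXPECTED;     /* pass out-of-order pkt to VM */
        return false;
    }
    if (f->buf_free < len) {
        f->state = VS_NO_BUFFER;      /* don't acknowledge           */
        return false;
    }
    f->state = VS_ACTIVE;             /* early-ACK in-order packet   */
    f->expected_seq += len;
    f->buf_free -= len;
    return true;
}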
vSnoop Implementation in Xen
[Figure: vSnoop sits in the driver domain (dom0) between the bridge and each VM's netback, in front of the per-VM buffer; each guest (VM1, VM2, VM3) keeps its unmodified netfront]
Tuning Netfront
Evaluation
- Overheads of vSnoop
- TCP throughput speedup
- Application speedup
  - Multi-tier web service (RUBiS)
  - MPI benchmarks (Intel MPI Benchmark, High-Performance Linpack)
Evaluation – Setup
- VM hosts: 3.06GHz Intel Xeon CPUs, 4GB RAM; only one core/CPU enabled; Xen 3.3 with Linux 2.6.18 for the driver domain (dom0) and the guest VMs
- Client machine: 2.4GHz Intel Core 2 Quad CPU, 2GB RAM; Linux 2.6.19
- Gigabit Ethernet switch
vSnoop Overhead
- Profiling per-packet vSnoop overhead using Xenoprof [Menon VEE'05]
- Minimal aggregate CPU overhead

Per-packet CPU overhead for vSnoop routines in dom0:

  Routine                  Single Stream          Multiple Streams
                           Cycles     CPU %       Cycles     CPU %
  vSnoop_ingress()         509        3.03        516        3.05
  vSnoop_lookup_hash()     74         0.44        91         0.51
  vSnoop_build_ack()       52         0.32        52         0.32
  vSnoop_egress()          104        0.61        104        0.61
TCP Throughput Improvement
- 3 VMs consolidated, 1000 transfers of a 100KB file
- Configurations: Vanilla Xen, Xen+tuning, Xen+tuning+vSnoop
[Plot: per-transfer throughput distribution; median throughput 0.192MB/s (Vanilla Xen), 0.778MB/s (Xen+tuning), 6.003MB/s (Xen+tuning+vSnoop)]
30x improvement in median throughput
TCP Throughput: 1 VM/Core
[Plot: normalized TCP throughput vs. transfer size (100MB down to 50KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
TCP Throughput: 2 VMs/Core
[Plot: normalized TCP throughput vs. transfer size (100MB down to 50KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
TCP Throughput: 3 VMs/Core
[Plot: normalized TCP throughput vs. transfer size (100MB down to 50KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
TCP Throughput: 5 VMs/Core
[Plot: normalized TCP throughput vs. transfer size (100MB down to 50KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
vSnoop's benefit rises with higher VM consolidation
TCP Throughput: Other Setup Parameters
- CPU load for the VMs
- Number of TCP connections to the VM
- Driver domain on a separate core
- Sender being a VM
vSnoop consistently achieves significant TCP throughput improvement
Application-Level Performance: RUBiS
[Figure: RUBiS client threads on a client machine drive a multi-tier service (Apache and MySQL) hosted in guest VMs (dom1, dom2) on two servers, each running vSnoop in dom0]
RUBiS Results

  RUBiS Operation          Count w/o vSnoop   Count w/ vSnoop   % Gain
  Browse                   421                505               19.9%
  BrowseCategories         288                357               23.9%
  SearchItemsInCategory    3498               4747              35.7%
  BrowseRegions            128                141               10.1%
  ViewItem                 2892               3776              30.5%
  ViewUserInfo             732                846               15.6%
  ViewBidHistory           339                398               17.4%
  Others                   3939               4815              22.2%
  Total                    12237              15585             27.4%
  Average Throughput       29 req/s           37 req/s          27.5%
Application-level Performance – MPI Benchmarks
- Intel MPI Benchmark: network intensive
- High-Performance Linpack: CPU intensive
[Figure: MPI nodes run in guest VMs (dom1, dom2) across four servers, each running vSnoop in dom0]
Intel MPI Benchmark Results: Broadcast
[Plot: normalized execution time vs. message size (8MB down to 64KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
40% improvement
Intel MPI Benchmark Results: All-to-All
[Plot: normalized execution time vs. message size (8MB down to 64KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
40% improvement
HPL Benchmark Results
[Plot: Gflops for Xen and Xen+tuning+vSnoop across problem size and block size combinations (N,NB) from (4K,2) to (8K,16)]
Related Work
- Optimizing the virtualized I/O path: Menon et al. [USENIX ATC'06,'08; ASPLOS'09]
- Improving intra-host VM communications: XenSocket [Middleware'07], XenLoop [HPDC'08], Fido [USENIX ATC'09], XWAY [VEE'08], IVC [SC'07]
- I/O-aware VM scheduling: Govindan et al. [VEE'07], DVT [SoCC'10]
Conclusions
- Problem: VM consolidation degrades TCP throughput
- Solution: vSnoop
  - Leverages acknowledgement offloading
  - Does not violate end-to-end TCP semantics
  - Is transparent to applications and the OS in VMs
  - Is generically applicable to many VMMs
- Results
  - 30x improvement in median TCP throughput
  - About 30% improvement in the RUBiS benchmark
  - 40-50% reduction in execution time for the Intel MPI Benchmark
Thank you.
For more information: http://friends.cs.purdue.edu/dokuwiki/doku.php?id=vsnoop
Or Google "vSnoop Purdue"
TCP Benchmarks (cont.)
Testing different scenarios:
a) 10 concurrent connections
b) Sender also subject to VM scheduling
c) Driver domain on a separate core
[Plots for scenarios a), b), and c)]
TCP Benchmarks (cont.)
Varying CPU load for 3 consolidated VMs:
[Plots for 40%, 60%, and 80% CPU load]