vSnoop: Improving TCP Throughput in Virtualized Environments
via Acknowledgement Offload
Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu
Department of Computer Science, Purdue University
Cloud Computing and HPC
Background and Motivation
- Virtualization: a key enabler of cloud computing
  - Amazon EC2, Eucalyptus
- Increasingly adopted in other real systems
  - High performance computing: NERSC's Magellan system
  - Grid/cyberinfrastructure computing: In-VIGO, Nimbus, Virtuoso
- Multiple VMs hosted by one physical host
  - Multiple VMs sharing the same core
  - Flexibility, scalability, and economy
VM Consolidation: A Common Practice
[Figure: multiple VMs (VM 1 - VM 4) consolidated on one virtualization layer over shared hardware, receiving traffic from a sender]
Key observation: VM consolidation negatively impacts network performance!
Investigating the Problem
Q1: How does CPU sharing affect RTT?
[Figure: a client measures RTT to a server VM (VM 1 - VM 3) consolidated on one host behind the virtualization layer]
[Plot: RTT (ms, 40-180) vs. number of consolidated VMs (2-5), with US East - West, US East - Europe, and US West - Australia RTTs shown for comparison]
Observation: the RTT increase is proportional to the VM scheduling slice (30ms)
Q2: What is the cause of the RTT increase?
[Figure: packets from the sender pass through the device driver in the driver domain (dom0), then wait in per-VM buffers (VM 1, VM 2, VM 3) until the target VM is scheduled in its 30ms slice]
[CDF: dom0 processing time vs. wait time in the buffer]
VM scheduling latency dominates virtualization overhead!
Connection to the VM is much slower than connection to dom0!
Q3: What is the impact on TCP throughput?
[Plot comparing throughput of connections terminating at dom0 vs. at the VM]
Our Solution: vSnoop
- Alleviates the negative effect of VM scheduling on TCP throughput
- Implemented within the driver domain to accelerate TCP connections
- Does not require any modifications to the VM
- Does not violate end-to-end TCP semantics
- Applicable across a wide range of VMMs: Xen, VMware, KVM, etc.
TCP Connection to a VM
Sender establishes a TCP connection to VM1
[Timeline: the sender's SYN is placed in the VM1 buffer by the driver domain and is not processed until VM1's next scheduling turn (VM1, VM2, VM3 share the core), so the SYN,ACK and every subsequent RTT include the VM scheduling latency]
Key Idea: Acknowledgement Offload
[Timeline with vSnoop: the driver domain acknowledges in-order packets on behalf of VM1 as soon as they fit in the shared buffer, so the sender no longer waits a full VM scheduling latency per round trip]
Faster progress during TCP slow start
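To make "acknowledging on behalf of the VM" concrete, here is a minimal C sketch (field and function names are assumptions for illustration, not vSnoop's code) of fabricating an early ACK in the driver domain for an in-order data segment that fits in the per-VM buffer.

#include <stdint.h>

/* Minimal TCP-like header fields needed for the sketch (illustrative;
 * a real implementation would use the kernel's skb/tcphdr machinery). */
struct tcp_seg {
    uint16_t src_port, dst_port;
    uint32_t seq, ack;
    uint16_t window;
    uint8_t  flags;                        /* 0x10 = ACK */
};

/* Build an early acknowledgement for an in-order data segment destined
 * for the VM: direction is reversed, the ACK number covers the data just
 * buffered, and the advertised window reflects the buffer space left. */
static void build_early_ack(const struct tcp_seg *data, uint32_t payload_len,
                            uint32_t buf_room, struct tcp_seg *ack)
{
    ack->src_port = data->dst_port;        /* reply flows back to sender   */
    ack->dst_port = data->src_port;
    ack->seq      = data->ack;             /* VM side's current seq number */
    ack->ack      = data->seq + payload_len;   /* next byte expected       */
    ack->window   = (uint16_t)(buf_room > 0xFFFF ? 0xFFFF : buf_room);
    ack->flags    = 0x10;                  /* ACK                          */
}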
vSnoop's Impact on TCP Flows
- TCP slow start
  - Early acknowledgements help connections progress faster
  - Most significant benefit for short transfers, which are prevalent in data centers [Kandula IMC'09], [Benson WREN'09] (see the sketch below)
- TCP congestion avoidance and fast retransmit
  - Large flows in steady state can also benefit from vSnoop
  - Benefit is not as large as for slow start
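As a rough illustration of why slow start dominates short transfers, the following C sketch (an illustrative model, not from the paper; the 1460-byte MSS, initial window of 2 segments, and loss-free doubling per RTT are assumptions) estimates the time to move a 100KB file when each round trip costs the ~30ms VM scheduling latency versus a sub-millisecond RTT with early acknowledgements.

#include <stdio.h>

/* Illustrative slow-start model: window doubles every RTT, no losses. */
static double slow_start_time_ms(double transfer_bytes, double rtt_ms)
{
    const double mss = 1460.0;  /* assumed segment size */
    double cwnd = 2.0;          /* assumed initial window, in segments */
    double sent = 0.0;          /* bytes delivered so far */
    double time_ms = 0.0;

    while (sent < transfer_bytes) {
        sent += cwnd * mss;     /* one window per round trip */
        time_ms += rtt_ms;
        cwnd *= 2.0;            /* exponential growth in slow start */
    }
    return time_ms;
}

int main(void)
{
    double size = 100.0 * 1024.0;   /* 100KB transfer, as in the evaluation */
    printf("RTT 30 ms (VM scheduling): %.1f ms\n", slow_start_time_ms(size, 30.0));
    printf("RTT 0.5 ms (early ACKs):   %.1f ms\n", slow_start_time_ms(size, 0.5));
    return 0;
}

With these assumptions the 100KB transfer needs six round trips, i.e. roughly 180ms with scheduling-inflated RTTs versus a few milliseconds with early acknowledgements, which is consistent in spirit with the order-of-magnitude gains reported below.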
Challenges
- Challenge 1: Out-of-order and special packets (SYN, FIN)
  - Solution: let the VM handle these packets
- Challenge 2: Packet loss after vSnoop
  - Solution: vSnoop acknowledges a packet only if there is room in the buffer
- Challenge 3: ACKs generated by the VM
  - Solution: suppress/rewrite the VM's ACKs for packets vSnoop has already acknowledged
- Challenge 4: Throttling the receive window to keep vSnoop online
  - Solution: adjust the advertised window according to the buffer size (see the sketch below)
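A minimal sketch of how Challenges 3 and 4 could be handled on the VM-to-sender path (illustrative only; the structure and names are assumptions, not vSnoop's code): ACKs the VM generates for data vSnoop has already acknowledged are suppressed, and any forwarded ACK has its advertised window clamped to the free space in the per-VM buffer.

#include <stdbool.h>
#include <stdint.h>

struct flow_state {
    uint32_t acked_by_vsnoop;  /* highest ACK number vSnoop has already sent */
    uint32_t buf_capacity;     /* per-VM buffer size in bytes                */
    uint32_t buf_used;         /* bytes currently queued for the VM          */
};

/* Handle an ACK generated by the VM on its way to the sender.
 * Returns false if the ACK is redundant (vSnoop already acknowledged at
 * least this much) and should be suppressed; otherwise rewrites the
 * advertised window so the sender cannot outrun the buffer.
 * (Wraparound-safe sequence comparison is omitted for brevity.) */
static bool vsnoop_handle_vm_ack(struct flow_state *fs,
                                 uint32_t ack_no, uint16_t *window)
{
    if (ack_no <= fs->acked_by_vsnoop)
        return false;                      /* suppress duplicate ACK  */

    uint32_t room = fs->buf_capacity - fs->buf_used;
    if (*window > room)
        *window = (uint16_t)room;          /* throttle receive window */
    return true;                           /* forward rewritten ACK   */
}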
State Machine Maintained Per-Flow
- States: Start, Active (online), No buffer (offline), Unexpected sequence
- Active: early acknowledgements for in-order packets
- No buffer: don't acknowledge
- Unexpected sequence: pass out-of-order packets to the VM
- Transitions: an in-order packet with buffer space available moves the flow to Active; an in-order packet with no buffer space moves it to No buffer; an out-of-order packet moves it to Unexpected sequence (see the sketch below)
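The per-flow state machine above can be rendered as a small C sketch (an illustrative reconstruction from the slide; state and function names are assumptions): in-order packets with buffer space are acknowledged early, in-order packets without buffer space are left for the VM to acknowledge, and out-of-order packets are passed to the VM untouched.

#include <stdbool.h>
#include <stdint.h>

/* Per-flow states from the slide (illustrative reconstruction). */
enum vsnoop_state {
    VS_START,        /* no packet seen yet                   */
    VS_ACTIVE,       /* online: early-ACK in-order packets   */
    VS_NO_BUFFER,    /* offline: in-order but buffer is full */
    VS_UNEXPECTED    /* out-of-order: let the VM handle it   */
};

struct vsnoop_flow {
    enum vsnoop_state state;
    uint32_t expected_seq;   /* next in-order sequence number   */
    uint32_t buf_free;       /* free bytes in the per-VM buffer */
};

/* Process one incoming packet; return true if vSnoop should send an
 * early acknowledgement on behalf of the VM. */
static bool vsnoop_recv(struct vsnoop_flow *f, uint32_t seq, uint32_t len)
{
    if (seq != f->expected_seq) {
        f->state = VS_UNEXPECTED;     /* pass out-of-order pkt to VM */
        return false;
    }
    if (f->buf_free < len) {
        f->state = VS_NO_BUFFER;      /* don't acknowledge           */
        return false;
    }
    f->state = VS_ACTIVE;             /* early-ACK in-order packet   */
    f->expected_seq += len;
    f->buf_free -= len;
    return true;
}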
vSnoop Implementation in Xen
[Figure: vSnoop sits in the driver domain (dom0) between the bridge and each VM's netback, in front of the per-VM buffer; each guest (VM1, VM2, VM3) keeps its unmodified netfront]
Tuning Netfront
Evaluation
- Overheads of vSnoop
- TCP throughput speedup
- Application speedup
  - Multi-tier web service (RUBiS)
  - MPI benchmarks (Intel MPI Benchmark, High-Performance Linpack)
Evaluation – Setup
- VM hosts: 3.06GHz Intel Xeon CPUs, 4GB RAM; only one core/CPU enabled; Xen 3.3 with Linux 2.6.18 for the driver domain (dom0) and the guest VMs
- Client machine: 2.4GHz Intel Core 2 Quad CPU, 2GB RAM; Linux 2.6.19
- Gigabit Ethernet switch
vSnoop Overhead
- Profiling per-packet vSnoop overhead using Xenoprof [Menon VEE'05]
- Minimal aggregate CPU overhead

Per-packet CPU overhead for vSnoop routines in dom0:

  Routine                  Single Stream          Multiple Streams
                           Cycles     CPU %       Cycles     CPU %
  vSnoop_ingress()         509        3.03        516        3.05
  vSnoop_lookup_hash()     74         0.44        91         0.51
  vSnoop_build_ack()       52         0.32        52         0.32
  vSnoop_egress()          104        0.61        104        0.61
TCP Throughput Improvement
- 3 VMs consolidated, 1000 transfers of a 100KB file
- Configurations: Vanilla Xen, Xen+tuning, Xen+tuning+vSnoop
[Plot: per-transfer throughput distribution; median throughput 0.192MB/s (Vanilla Xen), 0.778MB/s (Xen+tuning), 6.003MB/s (Xen+tuning+vSnoop)]
30x improvement in median throughput
TCP Throughput: 1 VM/Core
[Plot: normalized TCP throughput vs. transfer size (100MB down to 50KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
TCP Throughput: 2 VMs/Core
[Plot: normalized TCP throughput vs. transfer size (100MB down to 50KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
TCP Throughput: 3 VMs/Core
[Plot: normalized TCP throughput vs. transfer size (100MB down to 50KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
TCP Throughput: 5 VMs/Core
[Plot: normalized TCP throughput vs. transfer size (100MB down to 50KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
vSnoop's benefit rises with higher VM consolidation
TCP Throughput: Other Setup Parameters
- CPU load for the VMs
- Number of TCP connections to the VM
- Driver domain on a separate core
- Sender being a VM
vSnoop consistently achieves significant TCP throughput improvement
Application-Level Performance: RUBiS
[Figure: RUBiS client threads on a client machine drive a multi-tier service (Apache and MySQL) hosted in guest VMs (dom1, dom2) on two servers, each running vSnoop in dom0]
RUBiS Results

  RUBiS Operation          Count w/o vSnoop   Count w/ vSnoop   % Gain
  Browse                   421                505               19.9%
  BrowseCategories         288                357               23.9%
  SearchItemsInCategory    3498               4747              35.7%
  BrowseRegions            128                141               10.1%
  ViewItem                 2892               3776              30.5%
  ViewUserInfo             732                846               15.6%
  ViewBidHistory           339                398               17.4%
  Others                   3939               4815              22.2%
  Total                    12237              15585             27.4%
  Average Throughput       29 req/s           37 req/s          27.5%
Application-level Performance – MPI Benchmarks
- Intel MPI Benchmark: network intensive
- High-Performance Linpack: CPU intensive
[Figure: MPI nodes run in guest VMs (dom1, dom2) across four servers, each running vSnoop in dom0]
Intel MPI Benchmark Results: Broadcast
[Plot: normalized execution time vs. message size (8MB down to 64KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
40% improvement
Intel MPI Benchmark Results: All-to-All
[Plot: normalized execution time vs. message size (8MB down to 64KB) for Xen, Xen+tuning, and Xen+tuning+vSnoop]
40% improvement
HPL Benchmark Results
[Plot: Gflops for Xen and Xen+tuning+vSnoop across problem size and block size combinations (N,NB) from (4K,2) to (8K,16)]
Related Work
- Optimizing the virtualized I/O path: Menon et al. [USENIX ATC'06,'08; ASPLOS'09]
- Improving intra-host VM communications: XenSocket [Middleware'07], XenLoop [HPDC'08], Fido [USENIX ATC'09], XWAY [VEE'08], IVC [SC'07]
- I/O-aware VM scheduling: Govindan et al. [VEE'07], DVT [SoCC'10]
Conclusions
- Problem: VM consolidation degrades TCP throughput
- Solution: vSnoop
  - Leverages acknowledgement offloading
  - Does not violate end-to-end TCP semantics
  - Is transparent to applications and the OS in VMs
  - Is generically applicable to many VMMs
- Results
  - 30x improvement in median TCP throughput
  - About 30% improvement in the RUBiS benchmark
  - 40-50% reduction in execution time for the Intel MPI Benchmark
Thank you.
For more information: http://friends.cs.purdue.edu/dokuwiki/doku.php?id=vsnoop
Or Google "vSnoop Purdue"
TCP Benchmarks (cont.)
Testing different scenarios:
a) 10 concurrent connections
b) Sender also subject to VM scheduling
c) Driver domain on a separate core
[Plots for scenarios a), b), and c)]
TCP Benchmarks (cont.)
Varying CPU load for 3 consolidated VMs:
[Plots for 40%, 60%, and 80% CPU load]