© 2014 IBM Corporation
On the cost of tunnel endpoint processing in overlay virtual networks
J. Weerasinghe & F. Abel, IBM Research – Zurich Laboratory
J. Weerasinghe; NVSDN2014, London; 8th December 2014
Outline
Motivation
Overlay Virtual Networks
– background
– tunnel endpoint
Cost of Tunnel Endpoint Processing
Proposal & Implementation of Acceleration
Measurements
Conclusions
Motivation
Rack- & Blade-Servers
– discrete NICs (PCIe- or CPU-attached)
Hyper-scale Servers / Micro Servers
– integrated NIC (iNIC, inside the CPU)
Deployment of OVNs in integrated NICs (iNICs)
– area and power restricted
[Figure: rack and blade servers with discrete NICs; a micro server with an integrated NIC]
Overlay Virtual Networks (1/2)
Packet Encapsulation
– millions of virtual L2 networks (overlay networks)
– carried over a single physical network (underlay network, both L2 & L3)
[Figure: encapsulation prepends an outer header to each packet, multiplexing millions of virtual networks onto one physical network]
Overlay Virtual Networks (2/2)
Many Flavors – different encapsulation protocols
– STT: IP/TCP-encap
– NVGRE: IP/GRE-encap
– VXLAN, GENEVE, DOVE: IP/UDP-encap
Tunnel Endpoint (TEP)
Place where packets are encapsulated and de-capsulated
Usually an IP interface (L4 depends on encap protocol)
Can be implemented in:
(a) HW switch
(b) NIC
(c) SW
[Figure: a TEP on each side encapsulates/decapsulates packets between the virtual network and the physical network]
a) TEP in HW Switch: Pros & Cons
Performance is good
But not scalable
• isolation of traffic between App and TEP
• MAC table explosion
[Figure: TEPs sit in the physical switches; each server (SW) runs an App over a NIC (HW), and packets are encapsulated only on the physical-network side]
b) TEP in NIC: Pros & Cons
Performance is good
But not scalable
• isolation of traffic between App and TEP
• MAC table explosion
[Figure: TEPs sit in the NICs (HW); encapsulated packets traverse the physical network between the two servers]
c) TEP in SW: Pros & Cons
Scales well
But longer code-path in SW degrades performance
[Figure: TEPs sit in server software; the NICs (HW) carry already-encapsulated packets over the physical network]
Approach to Improving SW TEP Performance
OVN – VXLAN
Application Protocol – TCP/IP
• a widely used application protocol
Rx Path – critical part of packet processing
• critical because the receiver has no prior knowledge of incoming packets
SW TEP – analyze Linux implementation
– assess the cost
Identify Functions to be Accelerated
Accelerate the Identified Functions
VXLAN-encapsulated TCP/IP packet:
MAC | IP | UDP | VXLAN | MAC | IP | TCP | PAYLOAD
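The outer VXLAN header in the format above can be sketched in Python. This is an illustrative sketch of the RFC 7348 header layout, not the kernel code analyzed in the talk; the outer MAC/IP/UDP headers are left to the kernel's UDP tunnel socket.

```python
import struct

def vxlan_header(vni: int) -> bytes:
    # VXLAN header (RFC 7348), 8 bytes:
    # byte 0: flags, with the I-bit (0x08) set to mark a valid VNI;
    # bytes 1-3: reserved; bytes 4-6: 24-bit VNI; byte 7: reserved.
    assert 0 <= vni < (1 << 24)
    return struct.pack("!B3s3sB", 0x08, b"\x00" * 3,
                       vni.to_bytes(3, "big"), 0)

def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    # The inner Ethernet frame (MAC | IP | TCP | PAYLOAD) is carried
    # verbatim behind the 8-byte VXLAN header.
    return vxlan_header(vni) + inner_frame
```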
VXLAN-encapsulated TCP/IP Packet
• Path in the Linux Network Stack
– each packet has to travel twice in the stack
– outer stack: MAC | IP | UDP | VXLAN
– inner stack: MAC | IP | TCP | PAYLOAD
Assessing the Cost of TEP Processing (1/4)
Experimental Setup
– Linux on bare metal
– sender and receiver
• netperf-based
• connected back-to-back
Measurements
– clock cycles (for TEP processing)
– BW
– latency
[Figure: netperf sender (Tx) and receiver (Rx), each with a VXLAN TEP on eth0 and a NIC, connected back-to-back]
Assessing the Cost of TEP Processing (2/4)
Clock Cycle Measurement
– fine-grained instrumentation of the Linux kernel network source code using the time stamp counter
Procedure:
M1 = Measure()
M2 = Measure()
M3 = Measure()
<code segment of interest>
M4 = Measure()
Measurement overhead = M2 – M1
Clock cycles spent on executing the code w/ measurement overhead = M4 – M3
Clock cycles spent on executing the code = (M4 – M3) – (M2 – M1)
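The procedure above can be sketched as follows; `time.perf_counter_ns` stands in for the x86 time-stamp counter (rdtsc) used in the actual kernel instrumentation.

```python
import time

def measure() -> int:
    # Stand-in for reading the CPU time-stamp counter (rdtsc);
    # here we read a monotonic nanosecond clock instead.
    return time.perf_counter_ns()

def timed(code, *args):
    # M1/M2: two back-to-back reads calibrate the measurement overhead.
    m1 = measure()
    m2 = measure()
    overhead = m2 - m1
    # M3/M4: bracket the code segment of interest, then subtract
    # the calibrated overhead from the raw reading.
    m3 = measure()
    result = code(*args)
    m4 = measure()
    return result, (m4 - m3) - overhead

result, ns = timed(sum, range(1000))  # ns ≈ cost of the sum() call
```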
Assessing the Cost of TEP Processing (4/4)
CPU Clock Cycles for VXLAN Packet Processing
Number of Clock Cycles Spent on Outer- and Inner-Stack Processing

Format of VXLAN-encapsulated TCP/IP packet: MAC | IP | UDP | VXLAN | MAC | IP | TCP | PAYLOAD

Stack | Layer/Function | Sub-Function | Clock Cycles | Total | % of MTU-Size Packet
Outer | Net Core       |              | 120          | 1224  | 21%
Outer | L3 (IP)        |              | 668          |       |
Outer | L4 (UDP)       |              | 180          |       |
Outer | VXLAN          |              | 256          |       |
Inner | Net Core       |              | 92           | 1604  | 27%
Inner | L3 (IP)        | Checksum     | 24           |       |
Inner | L3 (IP)        | Other        | 228          |       |
Inner | L4 (TCP)       | Checksum     | 608          |       |
Inner | L4 (TCP)       | Other        | 652          |       |

Figure annotations: 21% overhead, 10% overhead
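The per-layer counts in the table can be cross-checked against the reported per-stack totals:

```python
# Clock-cycle counts from the table above.
outer = {"net core": 120, "L3 (IP)": 668, "L4 (UDP)": 180, "VXLAN": 256}
inner = {"net core": 92, "L3 checksum": 24, "L3 other": 228,
         "L4 (TCP) checksum": 608, "L4 (TCP) other": 652}

assert sum(outer.values()) == 1224   # outer-stack total
assert sum(inner.values()) == 1604   # inner-stack total
```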
Assessing the Cost of TEP Processing (3/4)
Bandwidth
– Netperf TCP_STREAM
– BW performance = data rate (Gbps) / CPU utilization
Latency
– Netperf TCP_RR
– latency = RTT/2
Results: 31.8% BW drop, 32.5% latency increase
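The BW-performance metric above normalizes throughput by CPU cost. A minimal sketch, with hypothetical utilization numbers (not from the talk) chosen only to illustrate how a 31.8% relative drop can arise even at the same line rate:

```python
def bw_performance(data_rate_gbps: float, cpu_utilization: float) -> float:
    # BW performance as defined on the slide:
    # achieved data rate divided by the CPU utilization spent to reach it.
    return data_rate_gbps / cpu_utilization

# Hypothetical numbers: same line rate, but the SW TEP burns more CPU,
# so the normalized metric drops.
baseline = bw_performance(9.4, 0.50)    # non-VXLAN
sw_tep   = bw_performance(9.4, 0.733)   # VXLAN through the SW TEP
drop = 1 - sw_tep / baseline            # ≈ 0.318, i.e. a 31.8% drop
```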
Proposed Acceleration (1/3)
SW TEP → Hybrid (SW & HW) TEP
– part of the Rx-path SW TEP processing moved to NIC HW & driver
– Tx-path TEP processing not changed
[Figure: left — SW TEP: the VXLAN TEP runs entirely in the server above eth0; right — hybrid TEP: TEP functionality split between the NIC and the driver]
Proposed Acceleration (2/3)
Stack Acceleration
– outer stack acceleration (OSA)
• (a) packet de-capsulation in NIC
– inner stack acceleration (ISA)
• (b) checksum verification in NIC
• (c) direct access to L4
[Figure: (a) the NIC decapsulates the packet and (b) verifies the inner packet checksum, then (c) extracts inner-stack information into an Rx descriptor (VNI | CS | STACK INFO); the driver places the inner packet directly in the L4 (TCP) layer]
VNI: VXLAN Network ID, CS: Checksum
Proposed Acceleration (3/3)
• Accelerated Path in the Linux Network Stack
– (a) each packet travels only once in the stack
– (b) L2 and L3 are bypassed
Outer stack: MAC | IP | UDP | VXLAN; inner stack: MAC | IP | TCP | PAYLOAD
Implementation
@Tx driver
– meta-data is generated (in a real implementation this happens in the Rx NIC)
– and prepended to the packet
@Rx driver
– meta-data is removed
– and the packet is decapsulated (in a real implementation this happens in the Rx NIC)
[Figure: netperf sender (Tx) and receiver (Rx) back-to-back; the Tx driver prepends VNI | CS | STACK INFO meta-data to the VXLAN-encapsulated packet, and the Rx driver strips it before handing the inner packet up the stack]
VNI: VXLAN Network ID, CS: Checksum
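A minimal sketch of the prototype's meta-data handling. The 6-byte layout here (3-byte VNI, 1-byte checksum-verified flag, 2-byte inner L4 offset) is a hypothetical choice for illustration; the slides do not specify the actual descriptor format.

```python
import struct

# Hypothetical meta-data layout prepended to each packet:
# 3-byte VNI, 1-byte checksum-verified flag, 2-byte inner L4 offset.
META_FMT = "!3sBH"
META_LEN = struct.calcsize(META_FMT)   # 6 bytes

def prepend_meta(inner_frame: bytes, vni: int, cs_ok: bool, l4_off: int) -> bytes:
    # In the prototype this runs in the Tx driver, emulating what the
    # Rx NIC would do in a real implementation.
    meta = struct.pack(META_FMT, vni.to_bytes(3, "big"), int(cs_ok), l4_off)
    return meta + inner_frame

def strip_meta(pkt: bytes):
    # The Rx driver removes the meta-data; the recorded offset lets it
    # hand the packet directly to the L4 (TCP) layer, bypassing L2/L3.
    vni, cs_ok, l4_off = struct.unpack(META_FMT, pkt[:META_LEN])
    return int.from_bytes(vni, "big"), bool(cs_ok), pkt[META_LEN + l4_off:]
```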
Results
OSA: Outer Stack Acceleration
ISA: Inner Stack Acceleration
Bandwidth – 97.1% of the BW performance achieved
Latency – 94.4% of the latency performance achieved
Conclusion
SW Tunnel Endpoint
– supports all OVN requirements
– but performance is degraded
– cost:
• VXLAN adds 21% of CPU cycles to the processing of an MTU-size packet
• BW performance drops by 31.8%
• latency increases by 32.5%
Accelerated Tunnel Endpoint
– light-weight stack acceleration
– achieved performance:
• 97.1% of BW
• 94.4% of latency
Future Work
– add OVN support to integrated NICs
[Figure: bar chart, "Performance of Accelerated TEP (higher the better)" — BW and latency as percentages of the non-VXLAN and non-accelerated-VXLAN baselines]
THANKS