Stephen Hemminger
Agenda
● DPDK
  – Background
  – Examples
● Performance
  – Lessons learned
● Open Issues
Packet time vs size
[Chart: per-packet time (0–2000 ns) vs packet size (60–1460 bytes), comparing "Server Packet" and "Network Infrastructure"]
Demo Applications
● Examples
  – L2fwd → Bridge/Switch
  – L3fwd → Router
  – L3fwd-acl → Firewall
  – Load_balancer
  – qos_sched → Quality of Service
L3fwd
[Diagram: multiple Receive queues feed a Route Lookup stage that fans out to multiple Transmit queues]
Forwarding thread
[Diagram: loop — Read Burst → Process Burst → repeat; per-thread statistics: received packets, transmitted packets, iterations, packets processed]
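In code, that loop is the standard DPDK burst poll. A minimal sketch, assuming the stock rte_eth_rx_burst()/rte_eth_tx_burst() calls; BURST_SIZE, queue 0, and the single tx_port are illustrative:

	#include <rte_ethdev.h>
	#include <rte_mbuf.h>

	#define BURST_SIZE 32

	static void forwarding_loop(uint16_t rx_port, uint16_t tx_port)
	{
		struct rte_mbuf *pkts[BURST_SIZE];

		for (;;) {
			/* Read Burst: poll up to BURST_SIZE packets from RX queue 0 */
			uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, pkts, BURST_SIZE);
			if (nb_rx == 0)
				continue;

			/* Process Burst: route lookup / header rewrite would go here */

			/* Transmit; free anything the NIC did not accept */
			uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, pkts, nb_rx);
			while (nb_tx < nb_rx)
				rte_pktmbuf_free(pkts[nb_tx++]);
		}
	}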
Time Budget
● Packet
  – 67.2 ns = 201 cycles @ 3 GHz
● Cache
  – L3 = 8 ns
  – L2 = 4.3 ns
● Atomic operations
  – Lock = 8.25 ns
  – Lock/Unlock = 16.1 ns
Source: "Network stack challenges at increasing speeds", Jesper Dangaard Brouer, LCA 2015
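A quick check of where the 67.2 ns budget comes from, assuming 10 GbE at minimum frame size:

	64 B frame + 20 B preamble/inter-frame gap = 84 B = 672 bits
	10 Gbit/s ÷ 672 bits ≈ 14.88 Mpps
	1 ÷ 14.88 Mpps ≈ 67.2 ns per packet ≈ 201 cycles @ 3 GHz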
Architecture choices
● Legacy
  – Existing proprietary code
● BSD clone
● DIY
  – Build forwarding engine from scratch
Test fixture
[Diagram: FreeBSD (netmap) host ↔ router under test ↔ Linux desktop; 10G fibre link on 192.18.1.0/27, 10G Cat6 link on 192.18.0.0/27, plus a separate management network]
Dataplane CPU activity
Core  Interface  RX Rate  TX Rate  Idle
--------------------------------------------------------
 1    p1p1       14.9M    0
 2    p1p1       0        250
 3    p33p1      0        250
 4    p33p1      1        250
 5    p1p1       0        250
 6    p33p1      11.9M    1
Internal Instrumentation
Linux perf
● perf tool is part of the kernel source
● Can look at both kernel and application
  – Drill down if symbols are available
Perf – active thread

Samples: 16K of event 'cycles', Event count (approx.): 11763536471
14.93% dataplane [.] ip_input
10.04% dataplane [.] ixgbe_xmit_pkts
7.69% dataplane [.] ixgbe_recv_pkts
7.05% dataplane [.] T.240
6.82% dataplane [.] fw_action_in
6.61% dataplane [.] fifo_enqueue
6.44% dataplane [.] flow_action_fw
6.35% dataplane [.] fw_action_out
3.92% dataplane [.] ip_hash
3.69% dataplane [.] cds_lfht_lookup
2.45% dataplane [.] send_packet
2.45% dataplane [.] bit_reverse_ulong
Original model
[Diagram: run-to-completion model — each of cores 1–4 polls one RX queue (Rx0.0, Rx0.1, Rx1.0, Rx1.1) and owns its own TX queue on each port (Tx0.0–Tx0.3, Tx1.0–Tx1.3)]
Split thread model
[Diagram: split model — cores 1–4 run RX threads (Rx0.0, Rx0.1, Rx1.0, Rx1.1); cores 5 and 6 run dedicated TX threads, one per port]
Speed killers
● I/O
● VM exits
● System calls
● PCI access
● HPET
● TSC
● Floating point
● Cache misses
● CPU pipeline stalls
TSC counter

	while (1) {
		cur_tsc = rte_rdtsc();
		diff_tsc = cur_tsc - prev_tsc;

		/* flush queued TX packets every drain_tsc cycles */
		if (unlikely(diff_tsc > drain_tsc)) {
			for (portid = 0; portid < RTE_MAX_ETHPORTS; portid++)
				send_burst(qconf,
					   qconf->tx_mbufs[portid].len,
					   portid);
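Reading the TSC on every iteration is itself on the speed-killer list. A common mitigation, sketched here (not the talk's code; TSC_CHECK_INTERVAL is illustrative), is to amortize the clock read over many loop iterations:

	#define TSC_CHECK_INTERVAL 64	/* hypothetical; tune to packet rate */

	uint32_t iter = 0;
	while (1) {
		/* ... rx/process/tx burst work ... */
		if (++iter % TSC_CHECK_INTERVAL == 0) {
			cur_tsc = rte_rdtsc();	/* TSC read on 1-in-64 iterations */
			if (unlikely(cur_tsc - prev_tsc > drain_tsc)) {
				/* drain TX queues here */
				prev_tsc = cur_tsc;
			}
		}
	}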
CPU stall

Heisenberg: observing performance slows it down

fw_action_in:
      │      struct ip_fw_args fw_args = {
      │              .m = m,
      │              .client = client,
      │              .oif = NULL };
 1.54 │1d:   movzbl %sil,%esi
 0.34 │      mov    %rsp,%rdi
 0.04 │      mov    $0x13,%ecx
 0.16 │      xor    %eax,%eax
57.66 │      rep stos %rax,%es:(%rdi)
 4.68 │      mov    %esi,0x90(%rsp)
20.45 │      mov    %r9,(%rsp)

The designated initializer zeroes the entire on-stack struct — the rep stos of 0x13 quadwords (152 bytes) — and that single instruction takes over half the samples.
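A hedged sketch of the usual cure (not taken from the talk): skip the whole-struct zero initialization and write only the fields the firewall path reads, assuming the remaining fields are genuinely don't-care:

	struct ip_fw_args fw_args;

	/* no rep stos over the whole struct; only touched fields are
	   written (assumes callee never reads the other fields) */
	fw_args.m = m;
	fw_args.client = client;
	fw_args.oif = NULL;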
Why is QoS slow?

	static inline void
	rte_sched_port_time_resync(struct rte_sched_port *port)
	{
		uint64_t cycles = rte_get_tsc_cycles();
		uint64_t cycles_diff = cycles - port->time_cpu_cycles;
		double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;

		/* Advance port time */
		port->time_cpu_cycles = cycles;
		port->time_cpu_bytes += (uint64_t) bytes_diff;
	}
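This resync does a TSC read and a double-precision divide on the hot path — both on the speed-killer list. A fixed-point alternative is sketched below; bytes_per_cycle_fp and SCHED_TIME_SHIFT are illustrative names for a value precomputed at port-configuration time, not DPDK API:

	#define SCHED_TIME_SHIFT 8	/* hypothetical fixed-point scale */

	/* precomputed once at config time: (bytes << SHIFT) / cycles */
	uint64_t cycles_diff = cycles - port->time_cpu_cycles;

	/* integer multiply + shift replaces the double divide */
	port->time_cpu_bytes +=
		(cycles_diff * port->bytes_per_cycle_fp) >> SCHED_TIME_SHIFT;
	port->time_cpu_cycles = cycles;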
The value of idle
● All CPUs are not equal
  – Hyperthreading
  – Cache sharing
  – Thermal
If one core is idle, others can go faster
Idle sleep
● 100% poll → 100% CPU
  – CPU power limits
  – No turbo boost
  – PCI bus overhead
● Small sleeps (see the sketch below)
  – 0–250 µs
  – Based on activity
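A minimal sketch of activity-based sleeping, assuming the 0–250 µs range above; idle_backoff() and the doubling policy are illustrative, not the talk's code:

	#include <stdint.h>
	#include <unistd.h>

	#define IDLE_SLEEP_MAX_US 250	/* matches the 0-250 us range */

	static unsigned int idle_sleep_us;

	/* call once per poll iteration; nb_rx = packets the last burst saw */
	static void idle_backoff(uint16_t nb_rx)
	{
		if (nb_rx > 0) {
			idle_sleep_us = 0;	/* traffic: poll flat out */
			return;
		}
		/* idle: back off exponentially up to the cap */
		idle_sleep_us = idle_sleep_us ? 2 * idle_sleep_us : 1;
		if (idle_sleep_us > IDLE_SLEEP_MAX_US)
			idle_sleep_us = IDLE_SLEEP_MAX_US;
		usleep(idle_sleep_us);
	}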
Memory Layout
● Cache killers
  – Linked lists
  – Poor memory layout
  – Global statistics (see the sketch below)
  – Atomic operations
● Doesn't help
  – Prefetching
  – Inlining
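For the global-statistics case, the standard fix is one counter block per core. A sketch, assuming DPDK's rte_lcore_id() and RTE_MAX_LCORE; the pkt_stats layout itself is illustrative:

	#include <stdint.h>
	#include <rte_lcore.h>
	#include <rte_memory.h>

	/* one cache-line-aligned counter block per core: updates stay
	   core-local, so no atomics and no cache-line ping-pong */
	struct pkt_stats {
		uint64_t rx, tx, dropped;
	} __rte_cache_aligned;

	static struct pkt_stats stats[RTE_MAX_LCORE];

	static inline void count_rx(void)
	{
		stats[rte_lcore_id()].rx++;	/* plain increment, no lock */
	}

A reader sums across all cores only when statistics are actually requested.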
Mutual Exclusion
● Locking
  – A reader/writer lock is expensive
  – A read lock has more overhead than a spin lock
● Userspace RCU (see the sketch below)
  – Don't modify; create and destroy
  – Impacts the thread model
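A sketch of the "don't modify, create and destroy" pattern with userspace RCU (liburcu); the config struct is illustrative, and reader threads must have called rcu_register_thread() at startup:

	#include <urcu.h>
	#include <stdlib.h>

	struct config { int ttl; };	/* illustrative shared state */
	static struct config *cfg;	/* RCU-protected pointer */

	static int get_ttl(void)
	{
		rcu_read_lock();
		int ttl = rcu_dereference(cfg)->ttl;	/* lock-free read */
		rcu_read_unlock();
		return ttl;
	}

	static void set_ttl(int ttl)
	{
		struct config *newc = malloc(sizeof(*newc));
		struct config *oldc = cfg;

		*newc = *oldc;			/* create a modified copy */
		newc->ttl = ttl;
		rcu_assign_pointer(cfg, newc);	/* publish atomically */
		synchronize_rcu();		/* wait out readers of the old copy */
		free(oldc);			/* destroy the old version */
	}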
Longest Prefix Match
[Diagram: lookup of 1.1.1.1 walks the /24 table entry for 1.1.1.X to a next-hop record (if = dp0p9p1, gw = 2.2.33.5); 1.1.3.6 falls outside the prefix]
LPM issues
● A prefix maps to only an 8-bit next hop (see the sketch below)
● Missing memory barriers
● Rule update
● Fixed-size /8 table
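A hedged sketch of the rte_lpm API as it stood at the time of the talk, showing the 8-bit limit; the route values are illustrative:

	#include <stdio.h>
	#include <rte_ip.h>
	#include <rte_lpm.h>

	/* next hops are uint8_t here, so at most 256 distinct ids; the
	   real next-hop data (interface, gateway) lives in a separate table */
	static void lpm_example(void)
	{
		struct rte_lpm *lpm = rte_lpm_create("rt", SOCKET_ID_ANY, 1024, 0);
		uint8_t nh;

		rte_lpm_add(lpm, IPv4(1, 1, 1, 0), 24, 42);	/* 1.1.1.0/24 → id 42 */

		if (rte_lpm_lookup(lpm, IPv4(1, 1, 1, 1), &nh) == 0)
			printf("next hop id %u\n", nh);	/* 1.1.3.6 would miss */
	}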
PCI passthrough
● I/O TLB size
  – Hypervisor uses the IOMMU to map the guest
  – The IOMMU has a small TLB cache
  – Guest I/O exceeds the TLB
● Solution
  – 1G hugepages on the KVM host
  – Put the guest in hugepages
  – KVM only; requires manual configuration
DPDK Issues
● Static configuration
  – Features
  – CPU architecture
  – Table sizes
● Machine-specific initialization
  – Number of cores, memory channels
● Poor device model
  – Works for Intel E1000-like devices
Conclusion
● DPDK can be used to build a fast router
  – 12M pps per core
● Lots of ways to go slow
  – Fewer ways to go fast
Q & A