Stephen Hemminger
Agenda
● DPDK
  – Background
  – Examples
● Performance
  – Lessons learned
● Open Issues
Packet time vs size
[Chart: per-packet time (0–2000 ns) vs packet size (60–1460 bytes), comparing "Server Packet" and "Network Infrastructure"]
Demo Applications
● Examples
  – L2fwd → Bridge/Switch
  – L3fwd → Router
  – L3fwd-acl → Firewall
  – Load_balancer
  – qos_sched → Quality of Service
L3fwd
[Diagram: multiple Receive queues feed a Route Lookup stage that fans out to multiple Transmit queues]
Forwarding thread
[Diagram: loop — Read Burst → Process Burst → repeat; per-thread statistics: received packets, transmitted packets, iterations, packets processed]
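In code, that loop is the standard DPDK burst poll. A minimal sketch, assuming the stock rte_eth_rx_burst()/rte_eth_tx_burst() calls; BURST_SIZE, queue 0, and the single tx_port are illustrative:

	#include <rte_ethdev.h>
	#include <rte_mbuf.h>

	#define BURST_SIZE 32

	static void forwarding_loop(uint16_t rx_port, uint16_t tx_port)
	{
		struct rte_mbuf *pkts[BURST_SIZE];

		for (;;) {
			/* Read Burst: poll up to BURST_SIZE packets from RX queue 0 */
			uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, pkts, BURST_SIZE);
			if (nb_rx == 0)
				continue;

			/* Process Burst: route lookup / header rewrite would go here */

			/* Transmit; free anything the NIC did not accept */
			uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, pkts, nb_rx);
			while (nb_tx < nb_rx)
				rte_pktmbuf_free(pkts[nb_tx++]);
		}
	}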
Time Budget
● Packet
  – 67.2 ns = 201 cycles @ 3 GHz
● Cache
  – L3 = 8 ns
  – L2 = 4.3 ns
● Atomic operations
  – Lock = 8.25 ns
  – Lock/Unlock = 16.1 ns
Source: "Network stack challenges at increasing speeds", Jesper Dangaard Brouer, LCA 2015
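A quick check of where the 67.2 ns budget comes from, assuming 10 GbE at minimum frame size:

	64 B frame + 20 B preamble/inter-frame gap = 84 B = 672 bits
	10 Gbit/s ÷ 672 bits ≈ 14.88 Mpps
	1 ÷ 14.88 Mpps ≈ 67.2 ns per packet ≈ 201 cycles @ 3 GHz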
Architecture choices
● Legacy
  – Existing proprietary code
● BSD clone
● DIY
  – Build forwarding engine from scratch
Test fixture
[Diagram: FreeBSD (netmap) host ↔ router under test ↔ Linux desktop; 10G fibre link on 192.18.1.0/27, 10G Cat6 link on 192.18.0.0/27, plus a separate management network]
Dataplane CPU activity
Core  Interface  RX Rate  TX Rate  Idle
--------------------------------------------------------
 1    p1p1       14.9M    0
 2    p1p1       0        250
 3    p33p1      0        250
 4    p33p1      1        250
 5    p1p1       0        250
 6    p33p1      11.9M    1
Internal Instrumentation
Linux perf
● perf tool is part of the kernel source
● Can look at both kernel and application
  – Drill down if symbols are available
Perf – active thread

Samples: 16K of event 'cycles', Event count (approx.): 11763536471
14.93% dataplane [.] ip_input
10.04% dataplane [.] ixgbe_xmit_pkts
7.69% dataplane [.] ixgbe_recv_pkts
7.05% dataplane [.] T.240
6.82% dataplane [.] fw_action_in
6.61% dataplane [.] fifo_enqueue
6.44% dataplane [.] flow_action_fw
6.35% dataplane [.] fw_action_out
3.92% dataplane [.] ip_hash
3.69% dataplane [.] cds_lfht_lookup
2.45% dataplane [.] send_packet
2.45% dataplane [.] bit_reverse_ulong
Original model
[Diagram: run-to-completion model — each of cores 1–4 polls one RX queue (Rx0.0, Rx0.1, Rx1.0, Rx1.1) and owns its own TX queue on each port (Tx0.0–Tx0.3, Tx1.0–Tx1.3)]
Split thread model
[Diagram: split model — cores 1–4 run RX threads (Rx0.0, Rx0.1, Rx1.0, Rx1.1); cores 5 and 6 run dedicated TX threads, one per port]
Speed killers
● I/O
● VM exits
● System calls
● PCI access
● HPET
● TSC
● Floating point
● Cache misses
● CPU pipeline stalls
TSC counter

	while (1) {
		cur_tsc = rte_rdtsc();
		diff_tsc = cur_tsc - prev_tsc;

		/* flush queued TX packets every drain_tsc cycles */
		if (unlikely(diff_tsc > drain_tsc)) {
			for (portid = 0; portid < RTE_MAX_ETHPORTS; portid++)
				send_burst(qconf,
					   qconf->tx_mbufs[portid].len,
					   portid);
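Reading the TSC on every iteration is itself on the speed-killer list. A common mitigation, sketched here (not the talk's code; TSC_CHECK_INTERVAL is illustrative), is to amortize the clock read over many loop iterations:

	#define TSC_CHECK_INTERVAL 64	/* hypothetical; tune to packet rate */

	uint32_t iter = 0;
	while (1) {
		/* ... rx/process/tx burst work ... */
		if (++iter % TSC_CHECK_INTERVAL == 0) {
			cur_tsc = rte_rdtsc();	/* TSC read on 1-in-64 iterations */
			if (unlikely(cur_tsc - prev_tsc > drain_tsc)) {
				/* drain TX queues here */
				prev_tsc = cur_tsc;
			}
		}
	}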
CPU stall

Heisenberg: observing performance slows it down

fw_action_in:
      │      struct ip_fw_args fw_args = {
      │              .m = m,
      │              .client = client,
      │              .oif = NULL };
 1.54 │1d:   movzbl %sil,%esi
 0.34 │      mov    %rsp,%rdi
 0.04 │      mov    $0x13,%ecx
 0.16 │      xor    %eax,%eax
57.66 │      rep stos %rax,%es:(%rdi)
 4.68 │      mov    %esi,0x90(%rsp)
20.45 │      mov    %r9,(%rsp)

The designated initializer zeroes the entire on-stack struct — the rep stos of 0x13 quadwords (152 bytes) — and that single instruction takes over half the samples.
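A hedged sketch of the usual cure (not taken from the talk): skip the whole-struct zero initialization and write only the fields the firewall path reads, assuming the remaining fields are genuinely don't-care:

	struct ip_fw_args fw_args;

	/* no rep stos over the whole struct; only touched fields are
	   written (assumes callee never reads the other fields) */
	fw_args.m = m;
	fw_args.client = client;
	fw_args.oif = NULL;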
Why is QoS slow?

	static inline void
	rte_sched_port_time_resync(struct rte_sched_port *port)
	{
		uint64_t cycles = rte_get_tsc_cycles();
		uint64_t cycles_diff = cycles - port->time_cpu_cycles;
		double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;

		/* Advance port time */
		port->time_cpu_cycles = cycles;
		port->time_cpu_bytes += (uint64_t) bytes_diff;
	}
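This resync does a TSC read and a double-precision divide on the hot path — both on the speed-killer list. A fixed-point alternative is sketched below; bytes_per_cycle_fp and SCHED_TIME_SHIFT are illustrative names for a value precomputed at port-configuration time, not DPDK API:

	#define SCHED_TIME_SHIFT 8	/* hypothetical fixed-point scale */

	/* precomputed once at config time: (bytes << SHIFT) / cycles */
	uint64_t cycles_diff = cycles - port->time_cpu_cycles;

	/* integer multiply + shift replaces the double divide */
	port->time_cpu_bytes +=
		(cycles_diff * port->bytes_per_cycle_fp) >> SCHED_TIME_SHIFT;
	port->time_cpu_cycles = cycles;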
The value of idle
● All CPUs are not equal
  – Hyperthreading
  – Cache sharing
  – Thermal
If one core is idle, others can go faster
Idle sleep
● 100% poll → 100% CPU
  – CPU power limits
  – No turbo boost
  – PCI bus overhead
● Small sleeps (see the sketch below)
  – 0–250 µs
  – Based on activity
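A minimal sketch of activity-based sleeping, assuming the 0–250 µs range above; idle_backoff() and the doubling policy are illustrative, not the talk's code:

	#include <stdint.h>
	#include <unistd.h>

	#define IDLE_SLEEP_MAX_US 250	/* matches the 0-250 us range */

	static unsigned int idle_sleep_us;

	/* call once per poll iteration; nb_rx = packets the last burst saw */
	static void idle_backoff(uint16_t nb_rx)
	{
		if (nb_rx > 0) {
			idle_sleep_us = 0;	/* traffic: poll flat out */
			return;
		}
		/* idle: back off exponentially up to the cap */
		idle_sleep_us = idle_sleep_us ? 2 * idle_sleep_us : 1;
		if (idle_sleep_us > IDLE_SLEEP_MAX_US)
			idle_sleep_us = IDLE_SLEEP_MAX_US;
		usleep(idle_sleep_us);
	}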
Memory Layout
● Cache killers
  – Linked lists
  – Poor memory layout
  – Global statistics (see the sketch below)
  – Atomic operations
● Doesn't help
  – Prefetching
  – Inlining
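For the global-statistics case, the standard fix is one counter block per core. A sketch, assuming DPDK's rte_lcore_id() and RTE_MAX_LCORE; the pkt_stats layout itself is illustrative:

	#include <stdint.h>
	#include <rte_lcore.h>
	#include <rte_memory.h>

	/* one cache-line-aligned counter block per core: updates stay
	   core-local, so no atomics and no cache-line ping-pong */
	struct pkt_stats {
		uint64_t rx, tx, dropped;
	} __rte_cache_aligned;

	static struct pkt_stats stats[RTE_MAX_LCORE];

	static inline void count_rx(void)
	{
		stats[rte_lcore_id()].rx++;	/* plain increment, no lock */
	}

A reader sums across all cores only when statistics are actually requested.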
Mutual Exclusion
● Locking
  – A reader/writer lock is expensive
  – A read lock has more overhead than a spin lock
● Userspace RCU (see the sketch below)
  – Don't modify; create and destroy
  – Impacts the thread model
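A sketch of the "don't modify, create and destroy" pattern with userspace RCU (liburcu); the config struct is illustrative, and reader threads must have called rcu_register_thread() at startup:

	#include <urcu.h>
	#include <stdlib.h>

	struct config { int ttl; };	/* illustrative shared state */
	static struct config *cfg;	/* RCU-protected pointer */

	static int get_ttl(void)
	{
		rcu_read_lock();
		int ttl = rcu_dereference(cfg)->ttl;	/* lock-free read */
		rcu_read_unlock();
		return ttl;
	}

	static void set_ttl(int ttl)
	{
		struct config *newc = malloc(sizeof(*newc));
		struct config *oldc = cfg;

		*newc = *oldc;			/* create a modified copy */
		newc->ttl = ttl;
		rcu_assign_pointer(cfg, newc);	/* publish atomically */
		synchronize_rcu();		/* wait out readers of the old copy */
		free(oldc);			/* destroy the old version */
	}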
Longest Prefix Match
[Diagram: lookup of 1.1.1.1 walks the /24 table entry for 1.1.1.X to a next-hop record (if = dp0p9p1, gw = 2.2.33.5); 1.1.3.6 falls outside the prefix]
LPM issues
● A prefix maps to only an 8-bit next hop (see the sketch below)
● Missing memory barriers
● Rule update
● Fixed-size /8 table
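A hedged sketch of the rte_lpm API as it stood at the time of the talk, showing the 8-bit limit; the route values are illustrative:

	#include <stdio.h>
	#include <rte_ip.h>
	#include <rte_lpm.h>

	/* next hops are uint8_t here, so at most 256 distinct ids; the
	   real next-hop data (interface, gateway) lives in a separate table */
	static void lpm_example(void)
	{
		struct rte_lpm *lpm = rte_lpm_create("rt", SOCKET_ID_ANY, 1024, 0);
		uint8_t nh;

		rte_lpm_add(lpm, IPv4(1, 1, 1, 0), 24, 42);	/* 1.1.1.0/24 → id 42 */

		if (rte_lpm_lookup(lpm, IPv4(1, 1, 1, 1), &nh) == 0)
			printf("next hop id %u\n", nh);	/* 1.1.3.6 would miss */
	}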
PCI passthrough
● I/O TLB size
  – Hypervisor uses the IOMMU to map the guest
  – The IOMMU has a small TLB cache
  – Guest I/O exceeds the TLB
● Solution
  – 1G hugepages on the KVM host
  – Put the guest in hugepages
  – KVM only; requires manual configuration
DPDK Issues
● Static configuration
  – Features
  – CPU architecture
  – Table sizes
● Machine-specific initialization
  – Number of cores, memory channels
● Poor device model
  – Works for Intel E1000-like devices
Conclusion
● DPDK can be used to build a fast router
  – 12M pps per core
● Lots of ways to go slow
  – Fewer ways to go fast
Q & A