42
Understanding DPDK Description of techniques used to achieve high throughput on a commodity hardware

Understanding DPDK

Embed Size (px)

Citation preview

Page 1: Understanding DPDK

Understanding

DPDKDescription of techniques used to achieve

high throughput on a commodity hardware

Page 2: Understanding DPDK

How fast SW has to work?

14.88 millions of 64 byte packets per second on 10G interface

1.8 GHz -> 1 cycle = 0,55 ns

1 packet -> 67.2 ns = 120 clock cycles

IFGPream

ble

DST

MAC

SRC

MAC

SRC

MACType Payload CRC

84 Bytes

412 8 60

Page 3: Understanding DPDK

Comparative speed values

CPU to memory speed = 6-8 GBytes/s

PCI-Express x16 speed = 5 GBytes/s

Access to RAM = 200 ns

Access to L3 cache = 4 ns

Context switch ~= 1000 ns (3.2 GHz)

Page 4: Understanding DPDK

Packet processing in Linux

User space

Kernel space

NIC

App

Driver

RX/TX queues

SocketRing

buffers

Page 5: Understanding DPDK

Linux kernel overhead

System calls

Context switching on blocking I/O

Data copying from kernel to user space

Interrupt handling in kernel

Page 6: Understanding DPDK

Expense of sendto

Function Activity Time (ns)

sendto system call 96

sosend_dgram lock sock_buff, alloc mbuf, copy in 137

udp_output UDP header setup 57

ip_output route lookup, ip header setup 198

ether_otput MAC lookup, MAC header setup 162

ixgbe_xmit device programming 220

Total 950

Page 7: Understanding DPDK

Packet processing with DPDK

User space

Kernel space

NIC

App DPDKRing

buffers

UIO driver

RX/TX

queues

Page 8: Understanding DPDK

Kernel space

Updating a register in Linux

User space

HW

ioctl()

Register

syscall

VFS

copy_from_user()

iowrite()

Page 9: Understanding DPDK

Updating a register with DPDK

User space

HW

assign

Register

Page 10: Understanding DPDK

What is used inside DPDK?

Processor affinity (separate cores)

Huge pages (no swap, TLB)

UIO (no copying from kernel)

Polling (no interrupts overhead)

Lockless synchronization (avoid waiting)

Batch packets handling

SSE, NUMA awareness

Page 11: Understanding DPDK

Linux default scheduling

Core 0

Core 1

Core 2

Core 3

t1 t4t3t2

Page 12: Understanding DPDK

How to isolate a core for a process

To diagnose use top“top” , press “f” , press “j”

Before boot use isolcpus“isolcpus=2,4,6”

After boot - use cpuset“cset shield -c 1-3”, “cset shield -k on”

Page 13: Understanding DPDK

Core 2Core 1

Run-to-completion model

RX/TX

thread

RX/TX

thread

Port 1 Port 2

Page 14: Understanding DPDK

Core 2Core 1

Pipeline model

RX

thread

TX

thread

Port 1 Port 2

Ring

Page 15: Understanding DPDK

Page tables tree

Linux paging model

cr3

Page

Page

Global

Directory

Page

Table

Page

Middle

Directory

Page 16: Understanding DPDK

TLB

TLBPage

Table

RAM

OffsetVirtual page

Physical Page Offset

Page 17: Understanding DPDK

TLB characteristics

$ cpuid | grep -i tlb

size: 12–4,096 entries

hit time: 0.5–1 clock cycle

miss penalty: 10–100 clock cycles

miss rate: 0.01–1%

It is very expensive resource!

Page 18: Understanding DPDK

Solution - Hugepages

Benefit: optimized TLB usage, no swap

Hugepage size = 2M

Usage:

mount hugetlbfs /mnt/huge

mmap

Library - libhugetlbfs

Page 19: Understanding DPDK

Lockless ring design

Writer can preempt writer and reader

Reader can not preempt writer

Reader and writer can work simultaneously on

different cores

Barrier

CAS operation

Bulk queue/dequeue

Page 20: Understanding DPDK

Lockless ring (Single Producer)

1

cons_head

cons_tail

prod_head

prod_tail

prod_next 2

cons_head

cons_tail

prod_head

prod_next

prod_tail

3

cons_head

cons_tail

prod_head

prod_tail

Page 21: Understanding DPDK

Lockless ring (Single Consumer)

1

cons_head

cons_tail

prod_head

prod_tail

cons_next 2

cons_tail prod_head

prod_tail

cons_next

cons_head

3

cons_head

cons_tail

prod_head

prod_tail

Page 22: Understanding DPDK

Lockless ring (Multiple Producers)

1

cons_head

cons_tail

prod_head

prod_tail

prod_next1

prod_next2 3

cons_head

cons_tail

prod_head

2

cons_head

cons_tail

prod_head

prod_next2

prod_tail

prod_next1

4

cons_head

cons_tail

5

cons_head

cons_tail

prod_head

prod_tail

prod_tail

prod_headprod_tail

prod_next1

prod_next2

prod_next1

prod_next2

Page 23: Understanding DPDK

Kernel space network driver

App

IP stack

Driver

NIC

Data

Desc

Config

Data

User space

Kernel space

Interrupts

Page 24: Understanding DPDK

UIO

“The most important devices can’t be handled

in user space, including, but not

limited to, network interfaces and block

devices.” - LDD3

Page 25: Understanding DPDK

UIO

User space

Kernel space

Interfacesysfs /dev/uioX

App

US driver epoll()

mmap()

UIO framework

driver

Page 26: Understanding DPDK

NIC User space

Access to device from user space

BAR0 (Mem)

BAR1

BAR2 (IO)

BAR5

BAR4

BAR3

Vendor Id

Device Id

Command

Revision Id

Status

...

Configuration

registers

I/O and memory

regions

/sys/class/uio/uioX/maps/mapX

/sys/class/uio/uioX/portio/portX

/dev/uioX -> mmap (offset)

/sys/bus/pci/devices

Page 27: Understanding DPDK

Host memory NIC memory

DMA RX

Update RDT

DMA descriptor(s)

RX queue RX FIFO

DMA packet

Descriptor ringMemory

DMA descriptors

Page 28: Understanding DPDK

Host memory NIC memory

DMA TX

Update TDT

DMA descriptor(s)

TX queue TX FIFO

DMA packet

Descriptor ringMemory

DMA descriptors

Page 29: Understanding DPDK

Receive from SW side

DD DD DDDD

RDT

DDmbuf1

addrDD

mbuf2

addr

RDT

RDH = 1

RDT = 5

RDBA = 0

RDLEN = 6

mbuf1RDH

RDH

mbuf2

Page 30: Understanding DPDK

Transmit from SW side

DD DD DDDD

TDT

DDmbuf1

addrDD

mbuf2

addr

TDT

TDH = 1

TDT = 5

TDBA = 0

TDLEN = 6

mbuf1TDH

TDH

mbuf2

Page 31: Understanding DPDK

NUMA

CPU 0

Cores

Me

mo

ry

co

ntro

ller

I/O controller

Me

mo

ry

PCI-E PCI-E

CPU 1

Cores

Me

mo

ry

co

ntr

olle

r

I/O controller

Me

mo

ry

PCI-E PCI-E

QPI

Socket 0 Socket 1

Page 32: Understanding DPDK

RSS (Receive Side Scaling)

Hash

functionQueue 0 CPU N

...

Queue N

Incoming traffic Indirection

table

Page 33: Understanding DPDK

Flow director

Queue 0 CPU N

...

Queue N

Incoming trafficFilter table

Hash

function

Outgoing traffic

Drop Route

Page 34: Understanding DPDK

Virtualization - SR-IOV

NIC

VMM

VM1

VF driver

VM2

VF driver

PF driver

VF

Virtual bridge

VF PF

Page 35: Understanding DPDK

NIC

Slow path using bifurcated driver

Kernel DPDK

VF

Virtual bridge

PF Filter table

Page 36: Understanding DPDK

Slow path using TAP

User space

Kernel space

NIC

App DPDKRing

buffers

TAP device

RX/TX

queues

TCP/IP

stack

Page 37: Understanding DPDK

Slow path using KNI

User space

Kernel space

NIC

App DPDKRing

buffers

KNI device

RX/TX

queues

TCP/IP

stack

Page 38: Understanding DPDK

x86 HW

Application 1 - Traffic generator

User space

Streams generator

DUT

Traffic analyzer

Page 39: Understanding DPDK

x86 HW

Application 2 - Router

Kernel

User space

Routing table

Routing table cacheDUT1 DUT2

Page 40: Understanding DPDK

x86 HW

Application 3 - Middlebox

User space

DPIDUT1 DUT2

Page 42: Understanding DPDK

My blog

Learning Network Programming