Flow Classification Optimizations in DPDK · 2020-03-21 · Flow Classification Optimizations in DPDK Sameh Gobriel & Charlie Tai - Intel DPDK US Summit - San Jose - 2016

Flow Classification Optimizations in DPDK

Sameh Gobriel & Charlie Tai - Intel

DPDK US Summit - San Jose - 2016

Agenda

Flow Classification in DPDK

Cuckoo Hashing for Optimized Flow Table Design

Using Intel Transactional Synchronization Extensions (TSX) for scaling Insert performance

Using Intel AVX instructions for scaling lookup performance

Research Proof of Concept: 2 level lookup for OVS Megaflow Cache

Flow Classification on Network Appliances vs General Purpose Server H/W

3

Flow In Out

Flow In Out

Flow In Out

Flow In Out

Flow Classification Implemented on General Purpose

Processors

Flow Target

Flow Target

Flow Target

Flow Target

Flow Action

Flow Action

Flow Action

Flow Action

Mem

ory

C

$ LLC

$ Lx

C

$ Lx

Mem

ory

C

$ LLC

$ Lx

C

$ Lx

NIC NIC

• General purpose processors with Cache/memory hierarchy can support much larger flow tables.

• Multicores architecture provide a scalable competitive flow classification performance.

TEM/OEM Proprietary OS

ASIC, DSP, FPGA, ASSP

• Network appliances use purpose-built H/W & ASICs (e.g., TCAM) for flow classification

• Cost & power consumption are limiting factors to support large number of flows

Monolithic Purpose-built Boxes

NFV

Hypervisor (e.g. ESXi, KVM,.. etc.)

Networking VMs on Standard Servers

Flow Target

Flow Target

Flow Target

Flow Target

Flow Action

Flow Action

Flow Action

Flow Action

Flow In Out

Flow In Out

Flow In Out

Flow In Out

Metrics for Good Flow Table Design

1. Higher Lookup Rate = Better throughput

& latency

2. Higher Insert Rate = Better Flow update & Table Initialization

3. Efficient Table Utilization = More Flows

Packet Header Payload

Flow Key

Fields of the packet are used to form a flow Key

H(..)Hash function is used to create a flow table index

Key 1 Action 1 Key 2 Action 2

Key x Action x Key y Action y Key z Action z

Key N Action N

Flow Table

Hash value used to index Flow table

Key x Key z

Match

Key y

Flow Key

Retrieved keys are matched with input key

Action

DPDK Framework

User Space

KNI IGB_UIO VFIO

EAL

MBUF

MEMPOOL

RING

TIMER

KernelUIO_PCI_GENERIC

FM10K

IXGBE

VMXNET3IGB

E1000

I40E

XENVIRT PCAP

MLX4

MLX5

ETHDEV

RING

NULL

AF_PKT

BONDING

VIRTIOENIC

CXGBE

BNX2X

PMDs: Native & Virtual

SZEDATA2

NFP

MPIPE

HASH

LPM

ACL

JOBSTAT

DISTRIB

IP FRAGKNI

REORDER POWER

VHOST

IVSHMEM

SCHED

METER

PIPELINE

PORT TABLE

Network Functions (Cloud, Enterprise, Comms)

CRYPTO

QAT

AESNI MB

Future

TBD

AcceleratorsCore

Classify Extensions QoS Pkt Framework

ENA

AESNI GCM

SNOW 3G

NULL

VHOST

RTE-Hash Exact Match Library

Cuckoo Hashing

IP A IP H

IP P

IP J IP B

IP Q

IP D IP W

Traditional Exact Match

Library

IP Y

IP Z

IP X

1

2

3

H1 H2

Traditional Exact Match Table library:

• relies on a “sparse” hash table implementation

• Simple exact match implementation

• Significant performance degradation with increased table sizes.

Cuckoo Hashing – Better Scalability:

• Denser tables fit in cache.

• Can scale to millions of entries.

Available since DPDK v2.2

7

Cuckoo HashingHigh Level Overview

H1(x) H2(x)

X

1

H1(x) H2(x)

X

2

H1(x) H2(x)

X

Y

3

H1(x) H2(x)

X

Y

4

H1(x) H2(x)

X

Y

Z

5

H1(x) H2(x)

X

Y

Z

6

Cuckoo Hashing Performance Benefits

Cuckoo Hashing allows for more flows to be inserted in the flow table

RTE-hash can be used to support flow table with millions of keys (e.g. 64M – 5 tuple keys) that fits in the CPU cache.

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

100.00%

Traditional Exact Match Cuckoo Hashing

Tab

le L

oad

Table Load at First Key Insertion Failure

Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHzHyper-Threading: disabled

Code Snippet for RTE-hash API

struct rte_hash *rte_hash_create (const struct rte_hash_parameters *params)

int rte_hash_add_key_data (const struct rte_hash *h, const void *key, void *data)

int rte_hash_lookup_data (const struct rte_hash *h, const void *key, void **data)

int rte_hash_lookup_bulk_data (const struct rte_hash *h, const void **keys, uint32_t num_keys, uint64_t *hit_mask, void *data[])

Reference: http://dpdk.org/doc/api/rte__hash_8h.html

10

a * * *

e * * *

s * * *

x * * *k * * *

f * * *d * * *

t * * *

* ∅

One Insert may move a lot of items

especially at high table occupancy

cuckoo path: a➝e➝s➝x➝k➝f➝d➝t➝∅ (9 writes)

Insert y

←collision

Collision happens when multiple writers

have intersecting Cuckoo Paths

Long Cuckoo Paths & Multiple Concurrent Writers

11

Flow-Table Insert Performance Optimizations

Traditional Locks

• Limited Concurrency

• Threads are serialized in critical section

TSX Hardware Concurrency

Detect

Roll Back

• Hardware monitors cache lines.

• When data conflict is detected, execution is rolled back

Insert Performance Optimizations

Make Use of IA

Hardware FeaturesMinimize Critical

Section

Make Use of IA

Hardware Features

Build

12

Depth First Search

Breadth First Search

Split Path Search from Keys Movement

a * * *

e * * *

s * * *

x * * *

k * * *

f * * *

d * * *

t * * *

* Ø

a * * *

e * * * c * * *

s * * * * Ø * * * *

TSX Unlock

TSX Lock

Move Keys

Cuckoo Path Search

TSX Unlock

TSX LockMove Keys

Cuckoo Path Search

Insert Performance Optimizations

Make Use of IA

Hardware FeaturesMinimize Critical

SectionMinimize Critical

Section

1 2

Flow-Table Insert Performance Optimizations

Summary of Insert Optimizations Results

Insert Performance Linearly Scalable with Number of Cores

-

10.00

20.00

30.00

40.00

50.00

60.00

70.00

80.00

90.00

100.00

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

Inse

rt R

ate

(Mill

ion

ins/

sec

)

Number of Cores

Insert Optimized

Original11 X


Code Snippet for RTE-hash with TSX (DPDK V16.07)

#define RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD 0x02

/* Default behavior of insertion, single writer/multi writer */

struct rte_hash_parameters {

...

uint8_t extra_flag;

};

rte_hash_parameters.extra_flag |=

(RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT

| RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD);

To enjoy TSX enabled multiwriter.

Reference: http://dpdk.org/doc/api/rte__hash_8h.html

15

Flow-Table Lookup Performance Optimizations

Use AVX Instructions Minimize Overhead

1. Prefetching of Keys in

Cache.

2. Inline Functions

3. Lookup Pipelining

Lookup Performance Optimizations

Make Use of IA

Hardware FeaturesMinimize Implementation

Overhead

1 2

Pkt

H(x)

lookup

Pkt 1 Pkt 2 Pkt 3 Pkt 4

HAVX(x)

LookupAVX

Summary of Lookup Optimizations Results

~35% Improved Lookup Throughput

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

16.00

18.00

1M 2M 4M 8M 16M 32M

Throughput vs. # of Flows

rte_hash

Cuckoo

Default

Lookup Optimized

Number of Flows

Thro

ug

hp

ut

(Mill

ion

s Lo

ok

up

s/C

ore

/Se

c)

35%

Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

Hyper-Threading: disabled

Code Snippet for RTE-hash with AVX (Targeting DPDK V16.11)

DPDK Framework

User Space

KNI IGB_UIO VFIO

EAL

MBUF

MEMPOOL

RING

TIMER

KernelUIO_PCI_GENERIC

FM10K

IXGBE

VMXNET3IGB

E1000

I40E

XENVIRT PCAP

MLX4

MLX5

ETHDEV

RING

NULL

AF_PKT

BONDING

VIRTIOENIC

CXGBE

BNX2X

PMDs: Native & Virtual

SZEDATA2

NFP

MPIPE

HASH

LPM

ACL

JOBSTAT

DISTRIB

IP FRAGKNI

REORDER POWER

VHOST

IVSHMEM

SCHED

METER

PIPELINE

PORT TABLE

Network Functions (Cloud, Enterprise, Comms)

CRYPTO

QAT

AESNI MB

Future

TBD

AcceleratorsCore

Classify Extensions QoS Pkt Framework

ENA

AESNI GCM

SNOW 3G

NULL

VHOST

Supporting Wild Card Flow

Classification and Variable Key Size

19

POC: Open vSwitch Flow Lookup

Mask N

1xxx xxxx0xxx xxxx

Flow Mask Mask L

01xx xxxx 10xx xxxx

1110 0000

110x xxxx 101x xxxx 111x xxxx 011x xxxx

1111 0000

1010 xxxx 0011 xxxx 1011 xxxx

Rules Match

Packet Header

1. Set of disjoint sub-table

2. Rule is only inserted into one sub-table (lookup terminates after first match)

3. Lookup is done by sequentially search each sub-table until a match is found

Instead of L sequential lookups What if we know which sub-table to hit

20

OVS with Two Layer Lookup

Mask N

1xxx xxxx0xxx xxxx

Flow Mask Mask L

01xx xxxx 10xx xxxx

1110 0000


1111 0000


Rules Match

Packet Header

1st Level of Indirection

21

Bloom Filter as 1st Level of Indirection

Mask N

1xxx xxxx0xxx xxxx

Mask L

01xx xxxx 10xx xxxx

1110 0000


1111 0000


Match

Packet Header

1st Level of Indirection

Mask 1 BF Mask 2 BF Mask L BF Mask N BF

L Lookups L Bloom Filters + 1 lookup

22

2 Level Lookup Preliminary Performance Results

0

1000

2000

3000

4000

5000

6000

7000

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Cyc

les

Number of Subtables Traversed

OVS - Hit OVS - Miss bloom - Hit bloom - Miss


Legal Disclaimers

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

© 2016 Intel Corporation. Intel, the Intel logo, Intel. Experience What’s Inside, and the Intel. Experience What’s Inside logo are trademarks of Intel. Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Questions?Sameh Gobriel

[email protected]

Documents

Flow Classification Optimizations in DPDK · 2020-03-21 · Flow Classification Optimizations in DPDK Sameh Gobriel & Charlie Tai - Intel DPDK US Summit - San Jose - 2016