Upload
others
View
33
Download
0
Embed Size (px)
Citation preview
Flow Classification Optimizations in DPDK
Sameh Gobriel & Charlie Tai - Intel
DPDK US Summit - San Jose - 2016
Agenda
Flow Classification in DPDK
Cuckoo Hashing for Optimized Flow Table Design
Using Intel Transactional Synchronization Extensions (TSX) for scaling Insert performance
Using Intel AVX instructions for scaling lookup performance
Research Proof of Concept: 2 level lookup for OVS Megaflow Cache
Flow Classification on Network Appliances vs General Purpose Server H/W
3
Flow In Out
Flow In Out
Flow In Out
Flow In Out
Flow Classification Implemented on General Purpose
Processors
Flow Target
Flow Target
Flow Target
Flow Target
Flow Action
Flow Action
Flow Action
Flow Action
Mem
ory
C
$ LLC
$ Lx
C
$ Lx
Mem
ory
C
$ LLC
$ Lx
C
$ Lx
NIC NIC
• General purpose processors with Cache/memory hierarchy can support much larger flow tables.
• Multicores architecture provide a scalable competitive flow classification performance.
TEM/OEM Proprietary OS
ASIC, DSP, FPGA, ASSP
• Network appliances use purpose-built H/W & ASICs (e.g., TCAM) for flow classification
• Cost & power consumption are limiting factors to support large number of flows
Monolithic Purpose-built Boxes
NFV
Hypervisor (e.g. ESXi, KVM,.. etc.)
Networking VMs on Standard Servers
Flow Target
Flow Target
Flow Target
Flow Target
Flow Action
Flow Action
Flow Action
Flow Action
Flow In Out
Flow In Out
Flow In Out
Flow In Out
Metrics for Good Flow Table Design
1. Higher Lookup Rate = Better throughput
& latency
2. Higher Insert Rate = Better Flow update & Table Initialization
3. Efficient Table Utilization = More Flows
Packet Header Payload
Flow Key
Fields of the packet are used to form a flow Key
H(..)Hash function is used to create a flow table index
Key 1 Action 1 Key 2 Action 2
Key x Action x Key y Action y Key z Action z
Key N Action N
Flow Table
Hash value used to index Flow table
Key x Key z
Match
Key y
Flow Key
Retrieved keys are matched with input key
Action
DPDK Framework
User Space
KNI IGB_UIO VFIO
EAL
MBUF
MEMPOOL
RING
TIMER
KernelUIO_PCI_GENERIC
FM10K
IXGBE
VMXNET3IGB
E1000
I40E
XENVIRT PCAP
MLX4
MLX5
ETHDEV
RING
NULL
AF_PKT
BONDING
VIRTIOENIC
CXGBE
BNX2X
PMDs: Native & Virtual
SZEDATA2
NFP
MPIPE
HASH
LPM
ACL
JOBSTAT
DISTRIB
IP FRAGKNI
REORDER POWER
VHOST
IVSHMEM
SCHED
METER
PIPELINE
PORT TABLE
Network Functions (Cloud, Enterprise, Comms)
CRYPTO
QAT
AESNI MB
Future
TBD
AcceleratorsCore
Classify Extensions QoS Pkt Framework
ENA
AESNI GCM
SNOW 3G
NULL
VHOST
RTE-Hash Exact Match Library
Cuckoo Hashing
IP A IP H
IP P
IP J IP B
IP Q
IP D IP W
Traditional Exact Match
Library
IP Y
IP Z
IP X
1
2
3
H1 H2
Traditional Exact Match Table library:
• relies on a “sparse” hash table implementation
• Simple exact match implementation
• Significant performance degradation with increased table sizes.
Cuckoo Hashing – Better Scalability:
• Denser tables fit in cache.
• Can scale to millions of entries.
Available since DPDK v2.2
7
Cuckoo HashingHigh Level Overview
H1(x) H2(x)
X
1
H1(x) H2(x)
X
2
H1(x) H2(x)
X
Y
3
H1(x) H2(x)
X
Y
4
H1(x) H2(x)
X
Y
Z
5
H1(x) H2(x)
X
Y
Z
6
Cuckoo Hashing Performance Benefits
Cuckoo Hashing allows for more flows to be inserted in the flow table
RTE-hash can be used to support flow table with millions of keys (e.g. 64M – 5 tuple keys) that fits in the CPU cache.
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
80.00%
90.00%
100.00%
Traditional Exact Match Cuckoo Hashing
Tab
le L
oad
Table Load at First Key Insertion Failure
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHzHyper-Threading: disabled
Code Snippet for RTE-hash API
struct rte_hash *rte_hash_create (const struct rte_hash_parameters *params)
int rte_hash_add_key_data (const struct rte_hash *h, const void *key, void *data)
int rte_hash_lookup_data (const struct rte_hash *h, const void *key, void **data)
int rte_hash_lookup_bulk_data (const struct rte_hash *h, const void **keys, uint32_t num_keys, uint64_t *hit_mask, void *data[])
Reference: http://dpdk.org/doc/api/rte__hash_8h.html
10
a * * *
e * * *
s * * *
x * * *k * * *
f * * *d * * *
t * * *
* ∅
One Insert may move a lot of items
especially at high table occupancy
cuckoo path: a➝e➝s➝x➝k➝f➝d➝t➝∅ (9 writes)
Insert y
←collision
Collision happens when multiple writers
have intersecting Cuckoo Paths
Long Cuckoo Paths & Multiple Concurrent Writers
11
Flow-Table Insert Performance Optimizations
Traditional Locks
• Limited Concurrency
• Threads are serialized in critical section
TSX Hardware Concurrency
Detect
Roll Back
• Hardware monitors cache lines.
• When data conflict is detected, execution is rolled back
Insert Performance Optimizations
Make Use of IA
Hardware FeaturesMinimize Critical
Section
Make Use of IA
Hardware Features
Build
12
Depth First Search
Breadth First Search
Split Path Search from Keys Movement
a * * *
e * * *
s * * *
x * * *
k * * *
f * * *
d * * *
t * * *
* Ø
a * * *
e * * * c * * *
s * * * * Ø * * * *
TSX Unlock
TSX Lock
Move Keys
Cuckoo Path Search
TSX Unlock
TSX LockMove Keys
Cuckoo Path Search
Insert Performance Optimizations
Make Use of IA
Hardware FeaturesMinimize Critical
SectionMinimize Critical
Section
1 2
Flow-Table Insert Performance Optimizations
Summary of Insert Optimizations Results
Insert Performance Linearly Scalable with Number of Cores
-
10.00
20.00
30.00
40.00
50.00
60.00
70.00
80.00
90.00
100.00
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Inse
rt R
ate
(Mill
ion
ins/
sec
)
Number of Cores
Insert Optimized
Original11 X
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHzHyper-Threading: disabled
Code Snippet for RTE-hash with TSX (DPDK V16.07)
#define RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD 0x02
/* Default behavior of insertion, single writer/multi writer */
struct rte_hash_parameters {
...
uint8_t extra_flag;
};
rte_hash_parameters.extra_flag |=
(RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT
| RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD);
To enjoy TSX enabled multiwriter.
Reference: http://dpdk.org/doc/api/rte__hash_8h.html
15
Flow-Table Lookup Performance Optimizations
Use AVX Instructions Minimize Overhead
1. Prefetching of Keys in
Cache.
2. Inline Functions
3. Lookup Pipelining
Lookup Performance Optimizations
Make Use of IA
Hardware FeaturesMinimize Implementation
Overhead
1 2
Pkt
H(x)
lookup
Pkt 1 Pkt 2 Pkt 3 Pkt 4
HAVX(x)
LookupAVX
Summary of Lookup Optimizations Results
~35% Improved Lookup Throughput
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
18.00
1M 2M 4M 8M 16M 32M
Throughput vs. # of Flows
rte_hash
Cuckoo
Default
Lookup Optimized
Number of Flows
Thro
ug
hp
ut
(Mill
ion
s Lo
ok
up
s/C
ore
/Se
c)
35%
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Hyper-Threading: disabled
Code Snippet for RTE-hash with AVX (Targeting DPDK V16.11)
DPDK Framework
User Space
KNI IGB_UIO VFIO
EAL
MBUF
MEMPOOL
RING
TIMER
KernelUIO_PCI_GENERIC
FM10K
IXGBE
VMXNET3IGB
E1000
I40E
XENVIRT PCAP
MLX4
MLX5
ETHDEV
RING
NULL
AF_PKT
BONDING
VIRTIOENIC
CXGBE
BNX2X
PMDs: Native & Virtual
SZEDATA2
NFP
MPIPE
HASH
LPM
ACL
JOBSTAT
DISTRIB
IP FRAGKNI
REORDER POWER
VHOST
IVSHMEM
SCHED
METER
PIPELINE
PORT TABLE
Network Functions (Cloud, Enterprise, Comms)
CRYPTO
QAT
AESNI MB
Future
TBD
AcceleratorsCore
Classify Extensions QoS Pkt Framework
ENA
AESNI GCM
SNOW 3G
NULL
VHOST
Supporting Wild Card Flow
Classification and Variable Key Size
19
POC: Open vSwitch Flow Lookup
Mask N
1xxx xxxx0xxx xxxx
Flow Mask Mask L
01xx xxxx 10xx xxxx
1110 0000
110x xxxx 101x xxxx 111x xxxx 011x xxxx
1111 0000
1010 xxxx 0011 xxxx 1011 xxxx
Rules Match
Packet Header
1. Set of disjoint sub-table
2. Rule is only inserted into one sub-table (lookup terminates after first match)
3. Lookup is done by sequentially search each sub-table until a match is found
Instead of L sequential lookups What if we know which sub-table to hit
20
OVS with Two Layer Lookup
Mask N
1xxx xxxx0xxx xxxx
Flow Mask Mask L
01xx xxxx 10xx xxxx
1110 0000
110x xxxx 101x xxxx 111x xxxx 011x xxxx
1111 0000
1010 xxxx 0011 xxxx 1011 xxxx
Rules Match
Packet Header
1st Level of Indirection
21
Bloom Filter as 1st Level of Indirection
Mask N
1xxx xxxx0xxx xxxx
Mask L
01xx xxxx 10xx xxxx
1110 0000
110x xxxx 101x xxxx 111x xxxx 011x xxxx
1111 0000
1010 xxxx 0011 xxxx 1011 xxxx
Match
Packet Header
1st Level of Indirection
Mask 1 BF Mask 2 BF Mask L BF Mask N BF
L Lookups L Bloom Filters + 1 lookup
22
2 Level Lookup Preliminary Performance Results
0
1000
2000
3000
4000
5000
6000
7000
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Cyc
les
Number of Subtables Traversed
OVS - Hit OVS - Miss bloom - Hit bloom - Miss
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHzHyper-Threading: disabled
Legal Disclaimers
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
© 2016 Intel Corporation. Intel, the Intel logo, Intel. Experience What’s Inside, and the Intel. Experience What’s Inside logo are trademarks of Intel. Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Questions?Sameh Gobriel