Windows PV Network Performance
Paul Durrant
Senior Principal Software Engineer, Citrix Systems
Windows PV Driver Community Lead
Xen Project Developer Summit 2016
Agenda
• Background
• The netif protocol
• Windows RSS
• Protocol Extensions
• Performance Measurements
• Q & A
Background
The netif protocol
• Canonical header: xen/include/public/io/netif.h
• Usual split driver model:

  [Diagram: frontend places requests on a shared ring; backend consumes them and places responses]

• But…
The netif protocol
• Duplicated for RX and TX:

  [Diagram: two request/response rings between backend and frontend, one for TX and one for RX]
• RX requests still come from frontend so ring needs to be ‘pre-filled’
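
Both rings are instances of the standard Xen shared-ring machinery; netif.h builds them with the generic ring macros from xen/include/public/io/ring.h:

  /* From xen/include/public/io/netif.h: typed request/response unions
   * and front/back ring accessors come from the generic ring macros. */
  DEFINE_RING_TYPES(netif_tx, struct netif_tx_request, struct netif_tx_response);
  DEFINE_RING_TYPES(netif_rx, struct netif_rx_request, struct netif_rx_response);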
The netif protocol
• TX packet fragments (requests):

  [Diagram: a packet occupies consecutive ring slots — Frag 1, Extra 1 … Extra n, Frag 2 … Frag n]

• Data specified by grant_ref, offset and size
• size of ‘Frag 1’ is the total size of the packet, not just the fragment
• id field is echoed in the corresponding response
• ‘Extra’ fragments have no room for an id. How are responses matched?
  • They’re not, but…
  • an ‘Extra’ response carries the magic NETIF_RSP_NULL status
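
For reference, the relevant slot layouts, abridged from netif.h:

  struct netif_tx_request {
      grant_ref_t gref;   /* reference to the buffer page */
      uint16_t offset;    /* offset within the buffer page */
      uint16_t flags;     /* NETTXF_* (e.g. NETTXF_more_data, NETTXF_extra_info) */
      uint16_t id;        /* echoed in the corresponding response */
      uint16_t size;      /* Frag 1: total packet size; later frags: fragment size */
  };

  /* An ‘Extra’ segment reuses a ring slot but has a different layout —
   * note there is no id field: */
  struct netif_extra_info {
      uint8_t type;       /* XEN_NETIF_EXTRA_TYPE_* */
      uint8_t flags;      /* XEN_NETIF_EXTRA_FLAG_* */
      union {
          struct {
              uint16_t size;
              uint8_t type;
              uint8_t pad;
              uint16_t features;
          } gso;
          uint16_t pad[3];
      } u;
  };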
The netif protocol
• RX packet fragments (responses):

  [Diagram: as on the TX side, a packet spans consecutive ring slots — Frag 1, Extra 1 … Extra n, Frag 2 … Frag n]

• Data specified by offset
• No size field: a positive status value is the fragment size
• grant_ref is in the request, so an id is needed to find the right data, but…
• ‘Extra’ fragments have no room for an id. How are responses matched?
• Responses must be in the same ring slot as the corresponding request, so the id isn’t actually needed!
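
Again abridged from netif.h — the grant reference travels in the request, and the size travels back as the status:

  struct netif_rx_request {
      uint16_t id;        /* echoed in the response (same slot anyway) */
      uint16_t pad;
      grant_ref_t gref;   /* reference to the pre-granted buffer page */
  };

  struct netif_rx_response {
      uint16_t id;
      uint16_t offset;    /* offset of the data within the granted page */
      uint16_t flags;     /* NETRXF_* */
      int16_t  status;    /* >= 0: fragment size; < 0: NETIF_RSP_* error */
  };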
The netif protocol
• Performance issues:
  • Single event channel for RX and TX completion
    • Fixed by feature-split-event-channels
  • Single ring (therefore single vCPU) for RX and TX processing
    • Fixed by multi-queue… (both fixes are negotiated via xenstore; see the sketch below)
  • Single-page ring
    • Still an open question…
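
A minimal sketch of the frontend side of that xenstore negotiation, in Linux-netfront style. The key names are the ones defined in netif.h; the function and variable names are illustrative, not the actual netfront code:

  #include <xen/xenbus.h>

  /* Illustrative: advertise split event channels and a queue count. The
   * backend must first have advertised "feature-split-event-channels"
   * and "multi-queue-max-queues". With more than one queue, the event
   * channel keys move under per-queue "queue-N/" prefixes, as on the
   * later slides. */
  static int advertise_features(struct xenbus_transaction xbt,
                                struct xenbus_device *dev,
                                unsigned int tx_evtchn,
                                unsigned int rx_evtchn,
                                unsigned int num_queues)
  {
      int err;

      /* Separate channels for TX and RX completions. */
      err = xenbus_printf(xbt, dev->nodename, "event-channel-tx",
                          "%u", tx_evtchn);
      if (err)
          return err;

      err = xenbus_printf(xbt, dev->nodename, "event-channel-rx",
                          "%u", rx_evtchn);
      if (err)
          return err;

      /* One ring pair (and one vCPU's worth of processing) per queue. */
      return xenbus_printf(xbt, dev->nodename, "multi-queue-num-queues",
                           "%u", num_queues);
  }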
Windows RSS
• Relies on NIC functionality (which most implement):

  [Diagram: incoming PACKET → Toeplitz hash, keyed by a HASH KEY → indirection TABLE → MSI-X interrupt → CPU0, CPU1 … CPUn. The hash key and indirection table are set by the Windows network stack.]
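
The Toeplitz hash itself is simple: for every set bit of the input, XOR in a sliding 32-bit window of the key. A self-contained sketch (illustrative; in the Microsoft RSS specification the key is 40 bytes, and a real NIC computes this in hardware):

  #include <stdint.h>
  #include <stddef.h>

  /* Toeplitz hash over 'len' input bytes; 'key' must be at least
   * len + 4 bytes long. */
  uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t len)
  {
      /* Prime the 32-bit window with the first four key bytes. */
      uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                        ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
      uint32_t hash = 0;

      for (size_t i = 0; i < len; i++) {
          for (unsigned int bit = 0; bit < 8; bit++) {
              /* Mix in the current key window for every set input bit. */
              if (data[i] & (0x80 >> bit))
                  hash ^= window;
              /* Slide the window one bit further along the key. */
              window <<= 1;
              if (key[i + 4] & (0x80 >> bit))
                  window |= 1;
          }
      }
      return hash;
  }

The NIC hashes the flow fields (addresses and ports), uses the low bits of the result to index the indirection table, and the selected entry names the MSI-X vector — and hence the CPU — to interrupt.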
Windows RSS
“So how do we do this with PV drivers?”
Windows RSS
• This bit needs to be in the frontend:

  [Diagram: queue-0/event-channel-rx, queue-1/event-channel-rx … queue-n/event-channel-rx, each delivered to its own CPU via HVMOP_set_evtchn_upcall_vector and EVTCHNOP_bind_vcpu]

ALREADY DONE
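
A sketch of the EVTCHNOP_bind_vcpu step (illustrative, using the Linux-style hypercall wrapper; the Windows drivers do the equivalent through their own interfaces, and HVMOP_set_evtchn_upcall_vector additionally registers a per-vCPU upcall vector):

  #include <xen/interface/event_channel.h>
  #include <asm/xen/hypercall.h>

  /* Steer a queue's RX event channel at a particular vCPU, so that
   * queue's interrupts (and processing) land on that CPU. */
  static int bind_queue_evtchn_to_vcpu(evtchn_port_t port, unsigned int vcpu)
  {
      struct evtchn_bind_vcpu bind = {
          .port = port,
          .vcpu = vcpu,
      };

      return HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind);
  }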
Windows RSS
• This bit needs to be in the backend:

  [Diagram: incoming PACKET → Toeplitz hash (HASH KEY) → indirection TABLE → QUEUE: queue-0, queue-1 … queue-n. The key and table are set by the Windows network stack.]

HOW?
Protocol Extensions
Protocol Extensions
• Need some way to…
• Specify hash algorithm
• Specify hash key and flags
• Specify indirection table
…in the backend
Protocol Extensions
• Introduce netif control ring:

  [Diagram: a third shared ring, CTRL, alongside TX and RX, carrying requests from the frontend and responses from the backend]
Requests:
  XEN_NETIF_CTRL_TYPE_GET_HASH_FLAGS
  XEN_NETIF_CTRL_TYPE_SET_HASH_FLAGS
  XEN_NETIF_CTRL_TYPE_SET_HASH_KEY
  XEN_NETIF_CTRL_TYPE_GET_HASH_MAPPING_SIZE
  XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING_SIZE
  XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING
  XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM
Responses:
  XEN_NETIF_CTRL_STATUS_SUCCESS
  XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED
  XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER
  XEN_NETIF_CTRL_STATUS_BUFFER_OVERFLOW
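
The control ring carries small fixed-size messages, abridged from netif.h:

  struct xen_netif_ctrl_request {
      uint16_t id;        /* echoed in the response */
      uint16_t type;      /* XEN_NETIF_CTRL_TYPE_* */
      uint32_t data[3];   /* type-specific parameters */
  };

  struct xen_netif_ctrl_response {
      uint16_t id;
      uint16_t type;
      uint32_t status;    /* XEN_NETIF_CTRL_STATUS_* */
      uint32_t data;      /* type-specific result */
  };

Selecting Toeplitz hashing is then a single message: type = XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM with data[0] = XEN_NETIF_CTRL_HASH_ALGORITHM_TOEPLITZ.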
Protocol Extensions
• xen-netback implementation:
• New ndo_select_queue op (overrides default):

  /* Map the packet's flow hash through the frontend-supplied table: */
  unsigned int size = vif->hash.mapping_size;

  xenvif_set_skb_hash(vif, skb);   /* compute and store the packet's hash */
  return vif->hash.mapping[skb_get_hash_raw(skb) % size];

  (Toeplitz implementation is actually in netif.h)
Protocol Extensions
• xen-netback implementation:
• New debugfs node:
  root@brixham:~# ls /sys/kernel/debug/xen-netback/vif1.1
  ctrl  io_ring_q0  io_ring_q1  io_ring_q2  io_ring_q3

  root@brixham:~# cat /sys/kernel/debug/xen-netback/vif1.1/ctrl
  Hash Algorithm: TOEPLITZ

  Hash Flags:
  - IPv4
  - IPv4 + TCP
  - IPv6
  - IPv6 + TCP
…
Protocol Extensions
“What about the hash values?”
Protocol Extensions
• New ‘Extra’ frag type:
XEN_NETIF_EXTRA_TYPE_HASH
  struct {
      uint8_t type;
      uint8_t algorithm;
      uint8_t value[4];
  } hash;
• Windows passes the RX flow hash on the TX side too, so the correct queue can be chosen for each flow.
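
A sketch of how a frontend might fill such a slot on the TX path (illustrative; the constants are from netif.h, and the use of a bit-index hash-type value in the type field is an assumption based on the xen-netback implementation):

  #include <string.h>

  /* Pass the packet's Toeplitz hash to the backend in an 'Extra' slot,
   * so the backend can pick the same queue for this flow. */
  static void fill_hash_extra(struct netif_extra_info *extra, uint32_t hash)
  {
      extra->type = XEN_NETIF_EXTRA_TYPE_HASH;
      extra->flags = 0;
      extra->u.hash.algorithm = XEN_NETIF_CTRL_HASH_ALGORITHM_TOEPLITZ;
      extra->u.hash.type = _XEN_NETIF_CTRL_HASH_TYPE_IPV4_TCP; /* e.g. IPv4 + TCP */
      memcpy(extra->u.hash.value, &hash, sizeof(hash));
  }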
Performance Measurements
Performance Measurements
• Hardware:
  • Gigabyte Brix i7-4770R
  • 32GB RAM
  • 200GB SATA SSD
Performance Measurements
• Software:
  • 2 x Windows 10 32-bit domU
    • 4 vCPUs
    • 4GB RAM
    • 8.2.0 (master) PV drivers
  • Xen 4.7.0
    • Upstream QEMU
  • Linux 4.7.0
    • debugfs patch
  • IXIA Chariot
    • TCP throughput
Performance Measurements
• Single Pair:

  [Throughput chart]
Performance Measurements
• Two Pairs:

  [Throughput chart]
Performance Measurements
• Four Pairs (one per CPU):

  [Throughput chart]
Performance Measurements
“Does RSS make a difference over basic multi-queue?”
Performance Measurements
• Four Pairs (multi-queue, no RSS):

  [Throughput chart]

Unbalanced throughput, because flows compete for the same CPU
Performance Measurements
“What if all flows compete for the same CPU?”
Performance Measurements
• Four Pairs (RSS forced to a single queue):

  [Throughput chart]

Worst case is bad… down ~6Gbps from the best case.
Performance Measurements
• Conclusions
• Multi-queue works best when queues are targeted at different CPUs
• RSS allows the guest to control the TCP-flow-to-queue mapping, and hence get the best from multi-queue
Q & A