Windows PV Network Performance
Paul Durrant
Senior Principal Software Engineer, Citrix Systems
Windows PV Driver Community Lead
Xen Project Developer Summit 2016
Agenda
• Background
• The netif protocol
• Windows RSS
• Protocol Extensions
• Performance Measurements
• Q & A
Background
The netif protocol
• Canonical header: xen/include/public/io/netif.h
• Usual split driver model:

  [Diagram: frontend places requests on a shared ring; backend consumes them and places responses]

• But…
The netif protocol
• Duplicated for RX and TX:

  [Diagram: two request/response rings between backend and frontend, one for TX and one for RX]
• RX requests still come from frontend so ring needs to be ‘pre-filled’
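
Both rings are instances of the standard Xen shared-ring machinery; netif.h builds them with the generic ring macros from xen/include/public/io/ring.h:

  /* From xen/include/public/io/netif.h: typed request/response unions
   * and front/back ring accessors come from the generic ring macros. */
  DEFINE_RING_TYPES(netif_tx, struct netif_tx_request, struct netif_tx_response);
  DEFINE_RING_TYPES(netif_rx, struct netif_rx_request, struct netif_rx_response);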
The netif protocol
• TX packet fragments (requests):

  [Diagram: a packet occupies consecutive ring slots — Frag 1, Extra 1 … Extra n, Frag 2 … Frag n]

• Data specified by grant_ref, offset and size
• size of ‘Frag 1’ is the total size of the packet, not just the fragment
• id field is echoed in the corresponding response
• ‘Extra’ fragments have no room for an id. How are responses matched?
  • They’re not, but…
  • an ‘Extra’ response carries the magic NETIF_RSP_NULL status
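
For reference, the relevant slot layouts, abridged from netif.h:

  struct netif_tx_request {
      grant_ref_t gref;   /* reference to the buffer page */
      uint16_t offset;    /* offset within the buffer page */
      uint16_t flags;     /* NETTXF_* (e.g. NETTXF_more_data, NETTXF_extra_info) */
      uint16_t id;        /* echoed in the corresponding response */
      uint16_t size;      /* Frag 1: total packet size; later frags: fragment size */
  };

  /* An ‘Extra’ segment reuses a ring slot but has a different layout —
   * note there is no id field: */
  struct netif_extra_info {
      uint8_t type;       /* XEN_NETIF_EXTRA_TYPE_* */
      uint8_t flags;      /* XEN_NETIF_EXTRA_FLAG_* */
      union {
          struct {
              uint16_t size;
              uint8_t type;
              uint8_t pad;
              uint16_t features;
          } gso;
          uint16_t pad[3];
      } u;
  };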
The netif protocol
• RX packet fragments (responses):

  [Diagram: as on the TX side, a packet spans consecutive ring slots — Frag 1, Extra 1 … Extra n, Frag 2 … Frag n]

• Data specified by offset
• No size field: a positive status value is the fragment size
• grant_ref is in the request, so an id is needed to find the right data, but…
• ‘Extra’ fragments have no room for an id. How are responses matched?
• Responses must be in the same ring slot as the corresponding request, so the id isn’t actually needed!
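
Again abridged from netif.h — the grant reference travels in the request, and the size travels back as the status:

  struct netif_rx_request {
      uint16_t id;        /* echoed in the response (same slot anyway) */
      uint16_t pad;
      grant_ref_t gref;   /* reference to the pre-granted buffer page */
  };

  struct netif_rx_response {
      uint16_t id;
      uint16_t offset;    /* offset of the data within the granted page */
      uint16_t flags;     /* NETRXF_* */
      int16_t  status;    /* >= 0: fragment size; < 0: NETIF_RSP_* error */
  };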
The netif protocol
• Performance issues:
  • Single event channel for RX and TX completion
    • Fixed by feature-split-event-channels
  • Single ring (therefore single vCPU) for RX and TX processing
    • Fixed by multi-queue… (both fixes are negotiated via xenstore; see the sketch below)
  • Single-page ring
    • Still an open question…
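
A minimal sketch of the frontend side of that xenstore negotiation, in Linux-netfront style. The key names are the ones defined in netif.h; the function and variable names are illustrative, not the actual netfront code:

  #include <xen/xenbus.h>

  /* Illustrative: advertise split event channels and a queue count. The
   * backend must first have advertised "feature-split-event-channels"
   * and "multi-queue-max-queues". With more than one queue, the event
   * channel keys move under per-queue "queue-N/" prefixes, as on the
   * later slides. */
  static int advertise_features(struct xenbus_transaction xbt,
                                struct xenbus_device *dev,
                                unsigned int tx_evtchn,
                                unsigned int rx_evtchn,
                                unsigned int num_queues)
  {
      int err;

      /* Separate channels for TX and RX completions. */
      err = xenbus_printf(xbt, dev->nodename, "event-channel-tx",
                          "%u", tx_evtchn);
      if (err)
          return err;

      err = xenbus_printf(xbt, dev->nodename, "event-channel-rx",
                          "%u", rx_evtchn);
      if (err)
          return err;

      /* One ring pair (and one vCPU's worth of processing) per queue. */
      return xenbus_printf(xbt, dev->nodename, "multi-queue-num-queues",
                           "%u", num_queues);
  }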
Windows RSS
• Relies on NIC functionality (which most implement):

  [Diagram: incoming PACKET → Toeplitz hash, keyed by a HASH KEY → indirection TABLE → MSI-X interrupt → CPU0, CPU1 … CPUn. The hash key and indirection table are set by the Windows network stack.]
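
The Toeplitz hash itself is simple: for every set bit of the input, XOR in a sliding 32-bit window of the key. A self-contained sketch (illustrative; in the Microsoft RSS specification the key is 40 bytes, and a real NIC computes this in hardware):

  #include <stdint.h>
  #include <stddef.h>

  /* Toeplitz hash over 'len' input bytes; 'key' must be at least
   * len + 4 bytes long. */
  uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t len)
  {
      /* Prime the 32-bit window with the first four key bytes. */
      uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                        ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
      uint32_t hash = 0;

      for (size_t i = 0; i < len; i++) {
          for (unsigned int bit = 0; bit < 8; bit++) {
              /* Mix in the current key window for every set input bit. */
              if (data[i] & (0x80 >> bit))
                  hash ^= window;
              /* Slide the window one bit further along the key. */
              window <<= 1;
              if (key[i + 4] & (0x80 >> bit))
                  window |= 1;
          }
      }
      return hash;
  }

The NIC hashes the flow fields (addresses and ports), uses the low bits of the result to index the indirection table, and the selected entry names the MSI-X vector — and hence the CPU — to interrupt.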
Windows RSS
“So how do we do this with PV drivers?”
Windows RSS
• This bit needs to be in the frontend:

  [Diagram: queue-0/event-channel-rx, queue-1/event-channel-rx … queue-n/event-channel-rx, each delivered to its own CPU via HVMOP_set_evtchn_upcall_vector and EVTCHNOP_bind_vcpu]

ALREADY DONE
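
A sketch of the EVTCHNOP_bind_vcpu step (illustrative, using the Linux-style hypercall wrapper; the Windows drivers do the equivalent through their own interfaces, and HVMOP_set_evtchn_upcall_vector additionally registers a per-vCPU upcall vector):

  #include <xen/interface/event_channel.h>
  #include <asm/xen/hypercall.h>

  /* Steer a queue's RX event channel at a particular vCPU, so that
   * queue's interrupts (and processing) land on that CPU. */
  static int bind_queue_evtchn_to_vcpu(evtchn_port_t port, unsigned int vcpu)
  {
      struct evtchn_bind_vcpu bind = {
          .port = port,
          .vcpu = vcpu,
      };

      return HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind);
  }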
Windows RSS
• This bit needs to be in the backend:

  [Diagram: incoming PACKET → Toeplitz hash (HASH KEY) → indirection TABLE → QUEUE: queue-0, queue-1 … queue-n. The key and table are set by the Windows network stack.]

HOW?
Protocol Extensions
Protocol Extensions
• Need some way to…
• Specify hash algorithm
• Specify hash key and flags
• Specify indirection table
…in the backend
Protocol Extensions
• Introduce netif control ring:

  [Diagram: a third shared ring, CTRL, alongside TX and RX, carrying requests from the frontend and responses from the backend]
Requests:
  XEN_NETIF_CTRL_TYPE_GET_HASH_FLAGS
  XEN_NETIF_CTRL_TYPE_SET_HASH_FLAGS
  XEN_NETIF_CTRL_TYPE_SET_HASH_KEY
  XEN_NETIF_CTRL_TYPE_GET_HASH_MAPPING_SIZE
  XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING_SIZE
  XEN_NETIF_CTRL_TYPE_SET_HASH_MAPPING
  XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM
Responses:
  XEN_NETIF_CTRL_STATUS_SUCCESS
  XEN_NETIF_CTRL_STATUS_NOT_SUPPORTED
  XEN_NETIF_CTRL_STATUS_INVALID_PARAMETER
  XEN_NETIF_CTRL_STATUS_BUFFER_OVERFLOW
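
The control ring carries small fixed-size messages, abridged from netif.h:

  struct xen_netif_ctrl_request {
      uint16_t id;        /* echoed in the response */
      uint16_t type;      /* XEN_NETIF_CTRL_TYPE_* */
      uint32_t data[3];   /* type-specific parameters */
  };

  struct xen_netif_ctrl_response {
      uint16_t id;
      uint16_t type;
      uint32_t status;    /* XEN_NETIF_CTRL_STATUS_* */
      uint32_t data;      /* type-specific result */
  };

Selecting Toeplitz hashing is then a single message: type = XEN_NETIF_CTRL_TYPE_SET_HASH_ALGORITHM with data[0] = XEN_NETIF_CTRL_HASH_ALGORITHM_TOEPLITZ.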
Protocol Extensions
• xen-netback implementation:
• New ndo_select_queue op (overrides default):

  /* Map the packet's flow hash through the frontend-supplied table: */
  unsigned int size = vif->hash.mapping_size;

  xenvif_set_skb_hash(vif, skb);   /* compute and store the packet's hash */
  return vif->hash.mapping[skb_get_hash_raw(skb) % size];

  (Toeplitz implementation is actually in netif.h)
Protocol Extensions
• xen-netback implementation:
• New debugfs node:
  root@brixham:~# ls /sys/kernel/debug/xen-netback/vif1.1
  ctrl  io_ring_q0  io_ring_q1  io_ring_q2  io_ring_q3

  root@brixham:~# cat /sys/kernel/debug/xen-netback/vif1.1/ctrl
  Hash Algorithm: TOEPLITZ

  Hash Flags:
  - IPv4
  - IPv4 + TCP
  - IPv6
  - IPv6 + TCP
…
Protocol Extensions
“What about the hash values?”
Protocol Extensions
• New ‘Extra’ frag type:
XEN_NETIF_EXTRA_TYPE_HASH
  struct {
      uint8_t type;
      uint8_t algorithm;
      uint8_t value[4];
  } hash;
• Windows passes the RX flow hash on the TX side too, so the correct queue can be chosen for each flow.
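
A sketch of how a frontend might fill such a slot on the TX path (illustrative; the constants are from netif.h, and the use of a bit-index hash-type value in the type field is an assumption based on the xen-netback implementation):

  #include <string.h>

  /* Pass the packet's Toeplitz hash to the backend in an 'Extra' slot,
   * so the backend can pick the same queue for this flow. */
  static void fill_hash_extra(struct netif_extra_info *extra, uint32_t hash)
  {
      extra->type = XEN_NETIF_EXTRA_TYPE_HASH;
      extra->flags = 0;
      extra->u.hash.algorithm = XEN_NETIF_CTRL_HASH_ALGORITHM_TOEPLITZ;
      extra->u.hash.type = _XEN_NETIF_CTRL_HASH_TYPE_IPV4_TCP; /* e.g. IPv4 + TCP */
      memcpy(extra->u.hash.value, &hash, sizeof(hash));
  }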
Performance Measurements
Performance Measurements
• Hardware:
  • Gigabyte Brix i7-4770R
  • 32GB RAM
  • 200GB SATA SSD
Performance Measurements
• Software:
  • 2 x Windows 10 32-bit domU
    • 4 vCPUs
    • 4GB RAM
    • 8.2.0 (master) PV drivers
  • Xen 4.7.0
    • Upstream QEMU
  • Linux 4.7.0
    • debugfs patch
  • IXIA Chariot
    • TCP throughput
Performance Measurements
• Single Pair:

  [Throughput chart]
Performance Measurements
• Two Pairs:

  [Throughput chart]
Performance Measurements
• Four Pairs (one per CPU):

  [Throughput chart]
Performance Measurements
“Does RSS make a difference over basic multi-queue?”
Performance Measurements
• Four Pairs (multi-queue, no RSS):

  [Throughput chart]

Unbalanced throughput, because flows compete for the same CPU
Performance Measurements
“What if all flows compete for the same CPU?”
Performance Measurements
• Four Pairs (RSS forced to a single queue):

  [Throughput chart]

Worst case is bad… down ~6Gbps from the best case.
Performance Measurements
• Conclusions
• Multi-queue works best when queues are targeted at different CPUs
• RSS allows the guest to control the TCP-flow-to-queue mapping, and hence get the best from multi-queue
Q & A