73
Copyright © 2016 Solarflare Communications, Inc. All rights reserved. Much Faster Networking David Riddoch [email protected]

Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Copyright © 2016 Solarflare Communications, Inc. All rights reserved.

Much Faster Networking

David [email protected]

Page 2: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

What is kernel bypass?

Page 3: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

The standard receive path

Page 4: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

The standard receive path

Page 5: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

The standard receive path

Page 6: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

The standard receive path

Page 7: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Kernel-bypass receive

Page 8: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Kernel-bypass transmit

Page 9: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Kernel-bypass transmit

Page 10: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Kernel-bypass transmit

Page 11: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Kernel-bypass transmit – even faster

Page 12: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

What does all of thiscleverness achieve?

Page 13: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Better performance!

• Fewer CPU instructions for network operations

• Better cache locality

• Faster response (lower latency)

• Higher throughput (higher bandwidth/message rate)

• Reduced contention between threads

• Better core scaling

• Reduced latency jitter

Page 14: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for
Page 15: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Solarflare’s OpenOnload

• Sockets acceleration using kernel bypass

• Standard Ethernet, IP, TCP and UDP

• Standard BSD sockets API

• Binary compatible with existing applications

Page 16: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

OpenOnload intercepts network calls

Page 17: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Single thread throughput and latency

Page 18: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

UDP receive throughput (small messages)

Page 19: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Kernel stack OpenOnload

Page 20: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

A much more challenging application

Page 21: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

HAProxy performance and scaling

0

10

20

30

40

50

1 2 3 4 5 6 7 8 9

Co

nn

ect

ion

s/se

c (0

00

s)

CPU Cores

100K Byte

Solarflare Net Driver

Intel Net Driver

Solarflare OpenOnload

0

100

200

300

400

500

600

1 2 3 4 5 6 8 10 12

Co

nn

ect

ion

s/se

c (0

00

s)

CPU Cores

1K Byte

Solarflare Net Driver

Intel Net Driver

Solarflare OpenOnload

Solarflare kernel

Other kernel

OpenOnload

Solarflare kernel

Other kernel

OpenOnload

1 KiB message size

100 KiB message size

Co

nn

ecti

on

s/s

(10

00

s)C

on

nec

tio

ns/

s (1

00

0s)

Page 22: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Why doesn’t performance scale when using the kernel stack?

Page 23: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Better question:

How come it scales aswell as it does?

Page 24: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

CPU cores

Page 25: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Received packet is delivered intomemory (or L3 cache)

Page 26: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Interrupt triggers packethandling on this CPU core

Page 27: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Applicationcalls recv()

Page 28: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Socket state pulled from onecache to another; Inefficient

Single core bottleneckfor interrupt handling

Page 29: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Multiple receive channels(up to one per core)

channel_id = hash(4tuple) % n_cores;

Page 30: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Multiple flows

Page 31: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Hopefully!

Page 32: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Usually

Page 33: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

channel_id, n = lookup(4tuple);

if( n > 1 )

channel_id += hash(4tuple) % n_cores;

Page 34: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

SO_REUSEPORT to the rescue

• Multiple listening sockets on the same TCP port• One listening socket per worker thread

• Each gets a subset of incoming connection requests

• New connections go to the worker running on the core that the flow hashes to

• Connection establishment scales with the number of workers

• Received packets are delivered to the ‘right’ core

Page 35: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Problem solved?

Page 36: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for
Page 37: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for
Page 38: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for
Page 39: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

0

100

200

300

400

500

600

1 2 3 4 5 6 8 10 12

Co

nn

ect

ion

s/se

c (0

00

s)

CPU Cores

1K Byte

Solarflare Net Driver

Intel Net Driver

Solarflare OpenOnload

Solarflare kernel

Other kernel

OpenOnload

1 KiB message size

Co

nn

ecti

on

s/s

(10

00

s)

Number of CPU cores

Page 40: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

So much for sockets

Page 41: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Let’s get closer to the metal…

Page 42: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Layer-2 APIs

1-6 Mpps/coreMany protocolsEnd-host applications

1-60 Mpps/coreAll protocolsAny network function

Page 43: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

60 million pkt/s?

Page 44: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

17 ns

Page 45: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Some tips for achievingreally fast networking…

Page 46: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Tip 1. The faster your application is already, the more speedup you’ll get from kernel bypass

(This is just Ahmdal’s law)

Page 47: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Tip 2. NUMA locality applies doubly to I/O devices

Page 48: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for
Page 49: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for
Page 50: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for
Page 51: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for
Page 52: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for
Page 53: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

• DMA transfers will use the L3 cache if:

• The targeted cache line is resident

• Or if not then up to 10% of L3 is available for write-allocate

• Therefore

• If you want consistent high performance, DMA buffers must be resident in L3 cache

• To achieve that

• Small set of DMA buffers recycled quickly

• (Even if that means doing an extra copy)

Page 54: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Tip 3.Queue management is critical

Page 55: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Queues exist mostly tohandle mismatch betweenarrival rate and service rate

Page 56: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

• Buffers in switches and routers

• Descriptor rings in network adapters

• Socket send and receive buffers

• Shared memory queues, locks etc.

• Run queue in the kernel task scheduler

Page 57: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

What happens when queues start to fill?

Page 58: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Service rate < arrival rate

Queue fill level increases(latency++)

Working-set sizeincreases

(efficiency--)

Service rate drops

Page 59: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Service rate < arrival rate

Queue fills

DROP!

Page 60: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Drops are bad, m’kay

Make SO_RCVBUF bigger!

Page 61: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Dilemma!

Small buffers:

Necessary for stable performance when overloaded

Large buffers:

Necessary for absorbing bursts without loss

Page 62: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

For stable performance when overloaded

Limit working set size

• Limit the sizes of pools, queues, socket buffers etc.

Shed excess load early (Tip 3.1)

• Before you’ve wasted time on requests you’re going to have to drop anyway

Page 63: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Interrupt moves packets fromdescriptor ring to sockets

App thread consumesfrom socket buffer

Page 64: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Drop newest data(only at very high rates)

Drop newest data

Do work for every packet,whether dropped or not

Page 65: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Drop newest data

Page 66: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Tip 3.2 Drop old data for better response time

Page 67: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Last example: Cache locality

Page 68: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

recv()

OpenOnload stack; includes sockets, DMA buffers,TX and RX rings, control plane etc.

Page 69: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

recv()

Page 70: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Problem: Send a small (eg. 200 bytes) reply from a different thread

Page 71: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

send()

send()

Or…

Page 72: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

send()

• Passing a small message to another thread:

• A few cache misses

• send() on a socket last accessed on another core:

• Dozens of cache misses

Page 73: Much Faster Networking - QCon London · descriptor ring to sockets App thread consumes from socket buffer. Drop newest data (only at very high rates) Drop newest data Do work for

Thank you!