
Introduction to NetXtreme® Congestion Control (NCC)

by Moshe Voloshin, Technical Director, Broadcom Limited


White Paper

Broadcom NCC-WP101
September 20, 2017

Overview

By their nature, data communication networks create the potential for directing more traffic to a link than it can transport. The instantaneous overfeeding of a link is called contention. Contention is resolved by momentarily storing (queuing) some contending traffic in network elements until it can eventually be sent. For example, if two 25GbE links simultaneously deliver a packet destined to the same 25GbE link, one packet must be queued until the other is sent. Sustained overfeeding, possibly until the network element storage is exhausted, is called congestion.

Congestion can only be resolved by detecting it and then reducing the data delivery rate. Congestion control schemes combine a variety of mechanisms to detect congestion and reduce the data arrival rate in response.

The TCP/IP transmission control protocol (TCP) contains the best known, most enduring, and most widely deployed congestion control scheme. TCP's congestion management is a remarkable accomplishment in delivering dynamic and sustained application-layer throughput on heterogeneous networks spanning the following conditions:

Links from kilobits per second to terabits per second

Round-trip times from microseconds to tens of seconds

Participating and unaware network elements

Moreover, TCP's congestion management handles these hugely varied conditions simultaneously. A TCP endpoint can communicate at gigabits per second over a specialized data center network while simultaneously communicating at kilobits per second over multiple satellite hops, maintaining expected performance levels for all conversations.

Key figures of merit for congestion management:

Utilization – Application data delivered from source to destination, also known as goodput.

Waste – Transmission resources consumed in the operation of the congestion management scheme, such as congestion control packets, embedded congestion control information, and discarded data.

Buffer consumption – The amount of memory consumed by transit data in network elements. Buffer consumption impacts application latency linearly as a result of queuing backlog, and nonlinearly when total buffer consumption exceeds available storage, causing packet drops (with their slow-paced recovery) or engagement of brute-force flow control.

Latency – The time for application data to be delivered from source to destination.

Fairness – The balance of service delivered across equally competing peers.

TCP inseparably combines its congestion management scheme and a byte-stream transport into a monolithic protocol.

Development of increasingly fast and increasingly dense networks over the past 20 years has created two tidal forces on TCP:

A desire for even better performance in specific network scenarios; for example, dense, regularly structured networks with high link bandwidths, short round-trip times, and minimally buffered participating network elements (as in modern data centers).

A desire for transport models that lower application-to-application latency and increase endpoint data-handling efficiency.



The performance itch is being scratched with new congestion management schemes while Remote Direct Memory Access (RDMA) protocols offer low-latency kernel-bypassed messaging with highly efficient zero-copy transport.

Broadcom's NetXtreme NIC architecture powers four generations of data center connectivity products with speeds from 10 Gb/s to 100 Gb/s, and 200 Gb/s on the horizon. NetXtreme-based products (including Whitney+) implement the industry-leading RDMA over Converged Ethernet (RoCE) protocol standard [1] and the unique, high-performance NetXtreme Congestion Control (NCC) scheme.

NCC draws its inspiration from the Data Center TCP (DCTCP) congestion management protocol [2], translating DCTCP mechanics from a stream-oriented context to the message-based RoCE protocol.

This document introduces NCC, including measured performance of the Whitney+ implementation.

RDMA Over Converged Ethernet

RoCE is the most widely deployed RDMA protocol, with over 1 million RoCE endpoints in service. RoCE supports three primary types of data exchange:

Messages – RDMA send

Remote memory access – RDMA write and RDMA read

Atomic remote memory access – RDMA atomics

RoCE is defined with the specific goal of enabling highly efficient hardware implementations in RoCE NICs (RNICs). A debate has continued for decades over whether network transport protocols should be implemented in hardware. Whatever the case, RNICs have proved an enduring technology, offering the following advantages:

Zero-copy transfer among arbitrary, application-supplied buffers. The stream nature of TCP requires applications to either accept implementation-provided buffers, or, more commonly, endure a data copy between transport and application that incurs significant additional CPU consumption and latency.

Kernel-bypassed operation. Kernel bypass allows userland applications to directly drive the network interface, with absolute minimum application-to-application latency.

Extension of a random-access memory (RAM) model to the network. While a RAM model can be built on a CPU-based stream transport, hardware RNICs deliver an access cost close enough to that of local memory that existing RAM-based algorithms can be scaled out efficiently.

RoCE's design center is on point-to-point connections, wherein arbitrary-sized sends and RDMAs are transported by RNIC hardware, including segmentation, reassembly, reliability, sequencing, and error recovery.

RoCEv1 is a simple adaptation of the InfiniBand (IB) system-area network RDMA protocol to Ethernet [3]. RoCEv2 mapped RoCE onto UDP/IP [3], as shown in Figure 1.

1. InfiniBand Trade Association. InfiniBand Architecture Specification Release 1.2.1, Annex A16: RoCE. 13 April 2010.

2. M. Alizadeh, A. Greenberg, D. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data Center TCP (DCTCP). In SIGCOMM, 2010.

3. InfiniBand Trade Association. InfiniBand Architecture Specification Release 1.2.1, Annex A17: RoCEv2. 2 September 2014.


Figure 1: RoCEv1 and RoCEv2 Packet Formats

RoCEv2 provides three advantages:

Operation on routed networks ubiquitous in large data centers

IP QoS – The DiffServ code point (DSCP)

IP congestion – The explicit congestion notification (ECN) signal

The intermediate UDP layer is present only to provide a connection identifier in the form of a UDP/IP socket address quintuple (da, sa, protocol, dport, sport) that is visible to existing hardware and software network management mechanisms, including ECMP-based load balancing and network instrumentation.
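For illustration only, a minimal sketch of how an ECMP-capable switch might hash that quintuple to select an equal-cost path follows. The hash choice, addresses, and source port are assumptions made up for this example; only the UDP protocol number (17) and the RoCEv2 destination port (4791) are standard values.

```python
import hashlib

def ecmp_link(da: str, sa: str, protocol: int, dport: int, sport: int, n_links: int) -> int:
    """Pick one of n_links equal-cost paths from the UDP/IP 5-tuple.

    Because RoCEv2 rides on UDP/IP, ordinary ECMP hashing like this can
    load-balance RoCE connections without understanding RoCE itself.
    """
    key = f"{da}|{sa}|{protocol}|{dport}|{sport}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links

# A hypothetical RoCEv2 flow: UDP (protocol 17), the RoCEv2 port 4791,
# and made-up addresses and source port.
print(ecmp_link("10.0.0.7", "10.0.1.9", 17, 4791, 52433, n_links=8))
```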

RoCEv2 also defines a Congestion Notification Packet (CNP), shown in Table 1. RNICs send CNPs in response to ECN Congestion Experienced (CE) markings to indicate that the transmission rate should be reduced. CNPs are associated with RoCE connections, providing fine-grained, per-connection congestion notification information. RoCEv2 only specifies the mechanism for marking packets when congestion is experienced and the format of the CNP response. It leaves the particular congestion control algorithm unspecified, including the following information:

When packets are marked ECN CE

When CNPs are generated in response to ECN CE

How sending is adjusted in response to CNPs

Table 1: RoCEv2 CNP Format

MAC Header

IPv4/IPv6 Header

UDP Header

BTH – DestQP set to the QPN for which the RoCEv2 CNP is generated; Opcode set to b'10000001; PSN set to 0; SE set to 0; M set to 0; P_KEY set to the same value as in the BTH of the ECN-marked packet.

Reserved (16 bytes) – MUST be set to 0 by sender; ignored by receiver.

ICRC

FCS

[Figure 1 shows the two packet formats. RoCEv1: Eth L2 Header / IB GRH / IB BTH+ (L4 Hdr) / IB Payload / ICRC / FCS, with the EtherType indicating that the next header is the IB GRH. RoCEv2: Eth L2 Header / IP Header / UDP Header / IB BTH+ (L4 Hdr) / IB Payload / ICRC / FCS, with the EtherType indicating IP, ip.protocol_number indicating UDP, and the UDP dport indicating that the next header is IB.BTH.]


Ethernet flow control and priority flow control (PFC) are the only fully standardized mechanisms to prevent RoCE senders from overfeeding a network. Unfortunately, as explained below, the performance of Ethernet flow control on a loaded network is quite poor in every metric except waste.

Ethernet Priority Flow Control

Priority flow control (PFC) is a mechanism for restricting flow on a particular traffic class (up to eight classes, associated with the eight Ethernet VLAN tag priorities) on a link [4]. Careful configuration of flow control thresholds can ensure that no packet is sent across a link only to be dropped for lack of buffering. In other words, PFC can be used to implement lossless traffic classes. This, in turn, prevents endpoints from overfeeding the network.
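As a rough sketch (not taken from the PFC standard) of how such thresholds behave, the following hysteresis logic pauses and resumes one traffic class based on ingress buffer fill; the threshold values, names, and callback interface are illustrative assumptions.

```python
from dataclasses import dataclass

# Illustrative thresholds only; a real switch sizes xOff so enough headroom
# remains to absorb the data already in flight while the PAUSE propagates.
XOFF_BYTES = 96 * 1024
XON_BYTES = 64 * 1024

@dataclass
class PriorityClassState:
    buffered_bytes: int = 0
    paused: bool = False

def update_pfc(state: PriorityClassState, send_pause, send_resume) -> None:
    """Assert or release PFC for one traffic class based on buffer fill."""
    if not state.paused and state.buffered_bytes >= XOFF_BYTES:
        send_pause()   # halts ALL traffic of this class on the link, not just the hot flow
        state.paused = True
    elif state.paused and state.buffered_bytes <= XON_BYTES:
        send_resume()
        state.paused = False
```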

The technical challenge in using PFC is that, when active, it restrains all traffic for the class whether or not it is destined for an overfed path. Furthermore, a single overfed path can lead to traffic being restrained on all network paths. This is called congestion spreading.

Figure 2 provides a simple example of congestion spreading:

1. NICs B, C, D, and F overfeed NIC G.

2. The queue to NIC G fills.

3. Switch 2 stops flow on all ports to prevent admission of packets potentially destined for NIC G.

4. The queue from Switch 1 to Switch 2 fills.

5. Switch 1 stops flow on all ports to prevent admission of packets potentially destined to Switch 2.

6. The path from NIC A to NIC E becomes blocked even though it is not overfed.

Figure 2: PFC Congestion Spreading

PFC degrades network performance in multiple ways:

The latency through filled queues is very high

Flow control turnaround latency compounds the already high queuing delay

Network utilization plummets and fairness evaporates as uninvolved flows become victims of involved flows

Modern switches implement complex buffer and flow-control strategies to mitigate the impact of filled queues and congestion spreading, but all schemes eventually succumb at sustained high network load.

4. IEEE 802.1Qbb, Priority-based Flow Control, 2011.

[Figure 2 depicts the scenario above: NICs A through F attached to Switch 1 and Switch 2, NIC G as the overfed destination, xOff flow control propagating from Switch 2 back through Switch 1, and link utilization percentages annotated on the affected paths.]


PFC-controlled networks only thrive at low sustained loads. In effect, PFC only provides a safety net to prevent packet loss that would precipitate high-cost recovery events. PFC cannot effectively contain network latency nor can it deliver high utilization.

Data Center TCP

Modern data center switches, such as those built on Broadcom's Trident and Tomahawk® family of chips, offer tremendous bandwidth density but with relatively modest buffering. Data Center TCP (DCTCP) is a stream transport protocol with a congestion management scheme that provides utilization and efficiency similar to TCP but with very little buffer consumption. This means DCTCP exhibits much lower steady-state latency and eliminates latency spikes from flow control or packet drops when switch buffers are exhausted.

DCTCP preserves the TCP method of controlling transmission rates by adjusting the transmit window size as a proxy for the actual transmission rate. The transmit window size is the amount of unacknowledged data allowed in the network. When the transmit window is at its minimum, one segment (~1500 bytes) of data is sent per network round-trip-time (RTT). Note that this RTT includes the endpoint processing as well as queuing and transmission delays. When the window grows to bandwidth × RTT of the path, the full bandwidth of the path is used.
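As a quick check of the window sizing just described, the bandwidth-delay product gives the window needed to fill a path; the 25 Gb/s link speed and 10 μs RTT below are illustrative assumptions, not figures from this paper.

```python
def window_for_full_rate(link_gbps: float, rtt_us: float) -> float:
    """Bandwidth-delay product: the window (in bytes) needed to keep the path full."""
    bits_in_flight = link_gbps * 1e9 * rtt_us * 1e-6
    return bits_in_flight / 8

# A 25 Gb/s path with a 10 us RTT needs roughly a 31 KB window to run at line
# rate, while the minimum window of one ~1500-byte segment per RTT yields only
# about 1.2 Gb/s.
print(window_for_full_rate(25, 10))   # ~31250 bytes
print(1500 * 8 / 10e-6 / 1e9)         # ~1.2 (Gb/s)
```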

The transmit window mechanism was chosen in the initial invention of TCP to relieve endpoints of the substantial CPU burden of maintaining a transmit timer for every connection. TCP is self-timed—the network path itself is the timer.

The DCTCP congestion control scheme is quite simple:

Network elements mark packets added to queues longer than a fixed, configured threshold. This differs from traditional TCP active queue management (AQM):

– Instantaneous queue length is used, where traditional AQM uses average queue length.

– Every packet of an overfilled queue is marked, where traditional AQM marks packets probabilistically as a function of average queue length.

DCTCP receivers send an ECN-Echo indication for every ECN CE packet (subject to delayed ACKs). In the traditional ECN scheme, the receiver repeats ECN-Echo indications in response to a single ECN CE mark until the sender acknowledges the ECN-Echo. ECN CE marking is expected to be a rare event in traditional AQM, whereas it is common in DCTCP.

DCTCP senders increase transmit window size in the same way as TCP (slow start and additive increase).

DCTCP senders use a moving average of the ratio of ECN-Echo indications received to packets sent to approximate the probability of congestion, CP, and scale the transmit window by (1 – CP / 2).

DCTCP's mechanisms ensure the window is not reduced when no congestion is experienced, and the window is cut in half when all packets experience congestion, as in TCP. DCTCP reduces the current window gingerly and continuously between these two extremes, quite unlike TCP.
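A minimal sketch of the scheme described above, assuming one update per window of acknowledged data at the sender and a single instantaneous-threshold test at the switch; the threshold, the gain value, and the variable names are illustrative assumptions rather than Broadcom's or the DCTCP authors' implementation.

```python
MARK_THRESHOLD_BYTES = 20 * 1024   # illustrative fixed marking threshold
G = 1 / 16                         # illustrative estimation gain

def switch_should_mark(instantaneous_queue_bytes: int) -> bool:
    # DCTCP marking keys on instantaneous queue depth; traditional AQM would
    # use an averaged depth and a marking probability instead.
    return instantaneous_queue_bytes > MARK_THRESHOLD_BYTES

def dctcp_window_update(cwnd: float, cp: float,
                        acked_packets: int, ecn_echo_packets: int):
    """One round of DCTCP's congestion-fraction estimate and window adjustment."""
    fraction_marked = ecn_echo_packets / max(acked_packets, 1)
    cp = (1 - G) * cp + G * fraction_marked   # moving average of marking ratio
    if ecn_echo_packets:
        cwnd = cwnd * (1 - cp / 2)            # gentle cut, half at worst
    return cwnd, cp
```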

DCTCP's immediate, universal congestion notification ensures all congesting senders are notified and react extremely quickly. This is the core property that makes DCTCP so effective in minimizing buffer consumption.


NetXtreme Congestion Control

The NetXtreme Congestion Control (NCC) scheme translates the strong performance of DCTCP into the context of the packet-based RoCE protocol. DCTCP's minimal buffer consumption minimizes network queuing delays and thus end-to-end latency, conferring great benefit to many RoCE applications.

The major differences between DCTCP and RoCE are:

DCTCP is stream-based while RoCE is message-based

DCTCP implementations are software-based while RoCE is implemented in hardware

RoCE's hardware implementation precipitates reexamination of several CPU-saving DCTCP (and TCP) design decisions:

DCTCP's self-clocking transmit window rate control scheme

DCTCP's simple slow-start and additive increase rate increase schemes

DCTCP's window scheme is nearly as problematic as it is clever. When TCP was invented, network paths were long and RTT variation from queuing delay was a small portion of total RTT. In other words, RTT was reasonably well-damped. However, in >10 Gb/s data center networks with a small number of hops, the queuing delay ranges from near 0 to orders of magnitude more than the intrinsic path delay. This can lead to less-than-optimal performance, low utilization, and large latency spikes in dynamic circumstances.

The transmit window scheme also permits delivery of large flights of data (for example, entire windows) into the network in bursts. Unfortunately, this behavior is not uncommon, and it creates wildly varying queue depths with the potential for flow control or packet drop events. This would not occur if network load were steady or only mildly varying, but that is rarely the case. Applications alternate between computation and communication, creating greatly fluctuating load.

The RoCE transport is implemented in hardware, meaning pure transmit rate control can be used instead of windows. This opens the door to eliminating the suboptimal behavior of the DCTCP transmit window mechanism. Thus, NCC uses hardware-based pure transmit rate control.

NCC, like most congestion control schemes, involves two separate rate control mechanisms: one to increase the rate when no congestion is experienced and another to reduce the rate in response to congestion.

NCC rate management is expressed in terms of two variables:

RC – Current transmit rate

RT – Target transmit rate

These variables are updated periodically based on previous values, congestion indications, and internal algorithmic state. The NCC update period is configurable, but the current default is 80 μs and behavior is relatively insensitive to variation of this period.


NCC Rate Increase

DCTCP's and TCP's slow-start and additive-increase mechanisms are computationally efficient, but a bit too gradual to maximize utilization during the startup and recovery phases. On the other hand, this slightly lower utilization is a small price to pay to avoid the entire Internet bouncing like the Tacoma Narrows Bridge.

NCC builds on pure rate control rather than a transmit window, meaning it can also adopt the modern, proven, data-center-oriented rate increase behavior of the rate-based IEEE 802.1Qau Quantized Congestion Notification (QCN) scheme [5].

Figure 3 illustrates NCC's rate increase mechanism. The NCC rate increase phase can be understood as starting after receiving a CNP and continuing until the next CNP. The NCC rate increase has two phases:

Fast Recovery – Rapidly increasing RC towards the RT captured before the last CNP was received by periodically halving the distance to RT:

RT = RT

RC = (RT + RC) / 2

NCC performs a configurable number of Fast Recovery steps, three by default, before switching to Active Increase.

Active Increase – Periodically increasing RT by a fixed minimum increment RAI and increasing RC by half the distance to RT until congestion is reported or the full rate of the link is achieved:

RT = RT + RAI

RC = (RT + RC) / 2

Figure 3: NCC Rate Increase Diagram

These rules ensure NCC quickly approaches a new set point after congestion is reported, and constantly probes for additional bandwidth after recovery.
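A sketch of these two phases as a single per-period update follows. The three-step Fast Recovery default and the halving rules come from the description above; the RAI increment, link rate, and function packaging are illustrative assumptions.

```python
FAST_RECOVERY_STEPS = 3   # default number of Fast Recovery periods
R_AI_GBPS = 0.5           # illustrative Active Increase increment
LINK_GBPS = 25.0          # illustrative link rate

def ncc_rate_increase_step(rc: float, rt: float, periods_since_cnp: int):
    """One periodic NCC update while no new CNP has arrived."""
    if periods_since_cnp <= FAST_RECOVERY_STEPS:
        # Fast Recovery: RT is held; RC closes half the remaining gap to RT.
        rc = (rt + rc) / 2
    else:
        # Active Increase: RT is nudged up by RAI, then RC closes half the gap.
        rt = min(rt + R_AI_GBPS, LINK_GBPS)
        rc = min((rt + rc) / 2, LINK_GBPS)
    return rc, rt
```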

5. IEEE 802.1Qau, Congestion Notification, 2010.

[Figure 3 plots rate versus time: the current rate RC and target rate RT, showing the rate reduction when a CNP is received, Fast Recovery steps that halve the remaining gap to RT, and the subsequent Active Increase in RAI increments.]


NCC Rate Decrease

NCC directly transliterates DCTCP's window decrease into a rate decrease. When CNPs are received, the rate is periodically reduced by an amount proportional to CP, an approximation of the probability that congestion is experienced:

F = No. of CNPs / No. of TX messages

CP = (1 – g) × CP + F × g

RT = RC

RC = RC × (1 – CP / 2)

The Estimation Gain (g) is configurable, but defaults to 1/32.

The intuition behind these equations:

CP is a moving average of the probability a packet experiences congestion.

The target rate is set to the current rate when congestion is experienced.

If CP is minimal (one packet experiences congestion), the rate is minimally reduced.

If CP is maximal (all packets experience congestion), the rate is halved.
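The same decrease rules in sketch form, mirroring the equations above; the Estimation Gain default of 1/32 is from the text, while the function packaging and the guard against a zero message count are illustrative assumptions.

```python
G = 1 / 32   # NCC Estimation Gain default

def ncc_rate_decrease_step(rc: float, rt: float, cp: float,
                           cnps: int, tx_messages: int):
    """One periodic NCC update for a period in which CNPs arrived."""
    f = cnps / max(tx_messages, 1)   # fraction of transmitted messages drawing a CNP
    cp = (1 - G) * cp + G * f        # moving average of congestion probability
    rt = rc                          # remember the rate at which congestion appeared
    rc = rc * (1 - cp / 2)           # scale down: barely at cp ~ 0, half at cp = 1
    return rc, rt, cp
```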

NCC Performance

The behavior of NCC has been extensively simulated, and the simulations have been validated by measurement of the NCC implementations. Figure 4 shows NCC in a 50-to-1 (five QPs from each of ten machines) incast on 40 Gb/s links. Note the following:

The left scale is queue depth in bytes

The right scale is bandwidth in Gb/s

The green and black spiked traces at the bottom are average and instantaneous queue depths, respectively

The remaining traces near the bottom are individual incasting flow rates

Figure 4: NCC 40G 50-to-1 Simulation

[Figure 4 plots switch queue size (bytes, left axis) and bandwidth (Gb/s, right axis) versus time (μs) for the 40 Gb/s incast simulation (ten sources with five QPs each over two switch stages, ECN threshold 20 KB, 100% marking probability, PFC off). Total utilization stays near 40 Gb/s, the individual flow utilizations are equal and fair, and queue depth averages 25 KB with an 80 KB peak.]


As shown in Figure 4, utilization, fairness, buffer consumption, and associated latency are excellent. The peak queuing backlog of 80 KB introduces only 16 μs of latency.

Figure 5 shows measured behavior for a comparable 25G 50-to-1 incast on a Whitney+ 2x 25G NIC that corroborates the NCC simulations:

Utilization = 22.12 Gb/s

Waste = 0.3 Gb/s (CNP bandwidth consumption)

Peak queue depth = 41.55 KB

Peak queuing latency = 14 μs
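As a sanity check on the queuing-latency figures quoted here and for Figure 4, queue depth converts to delay simply by dividing by the link drain rate; a small helper, using only depths and link rates stated in the text:

```python
def queue_delay_us(queue_kb: float, link_gbps: float) -> float:
    """Time (in microseconds) to drain a queue of queue_kb KB at link_gbps."""
    return queue_kb * 1024 * 8 / (link_gbps * 1e9) * 1e6

print(round(queue_delay_us(80.0, 40.0), 1))    # ~16.4 us: Figure 4's peak backlog
print(round(queue_delay_us(41.55, 25.0), 1))   # ~13.6 us: Figure 5's peak queue
```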

Figure 5: NCC 25G 50-to-1 Measurements

[Figure 5 plots switch queue size (KB, left axis) and bandwidth (Gb/s, right axis) versus time (μs) for the measured 25G incast (ten senders with five QPs each), showing per-sender and per-QP bandwidths, the egress bandwidth to the receiver, and the CNP traffic returned by the receiver.]


Table 2 lists the measurements of NCC on Whitney+ 25G ports under incasts from 2 to 10 machines and 2 to 50 flows. Performance is high across the board.

The measurements in Table 2 show that NCC on Whitney+ achieves its design objectives of high performance with especially low buffer consumption and latency. Peak queue depths less than 50 KB in this case offer two compelling benefits in addition to minimal latency:

Packets are never dropped and flow control is never engaged, even with minimally buffered switches; under these conditions RoCE performs well even on a lossy network.

The vast majority of switch buffer remains available for TCP flows that require it, even in the presence of a high RoCE load. If 48 ports of a Tomahawk switch with 16 MB of buffer are overfed with RoCE, NCC ensures RoCE buffer consumption of ~2.3 MB, leaving more than 13 MB available for TCP.
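A quick arithmetic check of that headroom claim, assuming the sub-50 KB peak queue depth cited above on each of the 48 overfed ports:

```python
ports = 48
per_port_peak_kb = 50     # upper bound on NCC peak queue depth cited above
switch_buffer_mb = 16     # Tomahawk-class shared packet buffer

roce_mb = ports * per_port_peak_kb / 1024
print(round(roce_mb, 2))                      # ~2.34 MB consumed by RoCE
print(round(switch_buffer_mb - roce_mb, 2))   # ~13.66 MB left for TCP
```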

Table 2: Whitney+ NCC 25G 2-50-to-1 Measurements

(Bandwidths in Gb/s; queue peaks in KB.)

Incast    Flows  QPs per Sender  Nodes Min. BW  Nodes Max. BW  QPs Min. BW  QPs Max. BW  Receiver BW  CNP BW (Receiver Side)  Transient Q Peak  SS Q Peak
2 to 1        2        1             10.61          11.92        10.6117      11.9168       22.53            0.02                 27.23            27.23
2 to 1        4        2             11.35          11.53         5.4652       5.8858       22.88            0.04                 28.62            27.2
2 to 1       10        5             11.56          11.65         2.2675       2.4005       23.21            0.07                 30.57            29.87
2 to 1       20       10             11.38          11.46         1.1041       1.198        22.84            0.12                 35.84            35.84
2 to 1       50       25             10.82          11.05         0.4211       0.4561       21.86            0.29                 47.69            47.69
3 to 1        3        1              7.42           7.76         7.4171       7.7602       22.91            0.05                 26.93            26.32
3 to 1        6        2              7.53           7.94         3.649        4.0864       23.01            0.07                 42.4             28.53
3 to 1       15        5              7.46           7.72         1.3977       1.6443       22.9             0.14                 33.91            33.57
3 to 1       48       16              7.21           7.63         0.4315       0.4911       22.05            0.38                 47.84            47.84
4 to 1        4        1              5.64           6.05         5.6403       6.0546       23.3             0.06                 31.8             27.58
4 to 1        8        2              5.66           5.92         2.8001       2.9685       23.22            0.08                 33.71            29.56
4 to 1       20        5              5.64           5.97         1.0931       1.2694       23.02            0.2                  33.11            33.11
4 to 1       48       12              5.31           5.76         0.427        0.5008       22.13            0.45                 45.71            45.71
5 to 1        5        1              4.43           4.95         4.4252       4.9484       23.11            0.07                 28.07            28
5 to 1       10        2              4.58           4.8          2.2142       2.4452       23.37            0.12                 32.34            27.91
5 to 1       25        5              4.59           4.7          0.9064       0.9596       23.21            0.22                 33.55            31.44
5 to 1       50       10              4.21           4.58         0.3971       0.4754       22.1             0.5                  44.4             44.4
10 to 1      10        1              2.28           2.39         2.281        2.391        23.35            0.07                 30.65            25.82
10 to 1      20        2              2.21           2.39         1.0909       1.2152       22.98            0.12                 32.65            32.65
10 to 1      50        5              2.06           2.3          0.3967       0.4685       22.12            0.3                  41.55            41.55


Comparison with Data Center Quantized Congestion Notification

Data Center Quantized Congestion Notification (DCQCN) [6, 7] is a well-known alternative to NCC. Table 3 compares a simulation of DCQCN with 40 Gb/s ports (as described in the DCQCN publications referenced below) to a simulation of NCC.

DCQCN exhibits a 4.5x larger buffer consumption and queuing delay compared to NCC. DCQCN wastes less bandwidth and delivers slightly higher utilization at low load compared to NCC, but neither difference is material.

The main difference between DCQCN and NCC is that DCQCN uses probabilistic marking, similar to traditional AQM. Thus, DCQCN takes considerably longer to notify all congesting flows and, consequently, longer to reduce the rate to a conforming level, meaning additional queuing backlog and latency develop (as shown in Table 3).
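To make the contrast concrete, the following sketch places the two marking styles side by side; the threshold and probability-ramp parameters are illustrative assumptions, not values from either scheme's specification.

```python
import random

def dctcp_style_mark(instantaneous_queue_kb: float, threshold_kb: float = 20.0) -> bool:
    # NCC/DCTCP style: every packet enqueued above the threshold is marked, so
    # every congesting sender hears about congestion within about one round trip.
    return instantaneous_queue_kb > threshold_kb

def probabilistic_mark(avg_queue_kb: float, k_min: float = 20.0,
                       k_max: float = 200.0, p_max: float = 0.1) -> bool:
    # Traditional-AQM style (as DCQCN assumes): the marking probability ramps
    # with the averaged queue, so only a sample of senders is notified per RTT.
    if avg_queue_kb <= k_min:
        return False
    if avg_queue_kb >= k_max:
        return True
    return random.random() < p_max * (avg_queue_kb - k_min) / (k_max - k_min)
```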

The DCQCN design anticipates the use of firmware for rate computations. One implication is that the number of ECN markings and CNP packets must be contained at a level manageable by firmware. By contrast, NCC implements rate computations in hardware, enabling the DCTCP-style scheme of deterministically marking all packets experiencing congestion and adjusting the rate based on a potentially continuous flow of CNP packets.

DCQCN's restricted congestion management communication does result in significantly lower waste than NCC, but the absolute waste of both NCC and DCQCN is well below a material level.

Computing flow rates in firmware also affects the total number of flows that can use congestion management. NCC is capable of managing every RDMA queue in the RNIC (1M in Whitney+), where DCQCN implementations are reported to simultaneously manage fewer than 1000 queues.

Summary

NCC is a high-performance congestion management solution for RoCE that delivers fairness, high utilization, dramatically lower buffer consumption and network latency, and greater connection scaling than any other known solution. NCC ensures Broadcom's Whitney+ and NetXtreme family of products deliver the best possible RoCE performance, and especially the lowest possible end-to-end latency, in the industry.

6. Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang. Congestion Control for Large-Scale RDMA Deployments. In SIGCOMM, 2015.

7. C. Guo, H. Wu, Z. Deng, G. Soni, J. Ye, J. Padhye, M. Lipshteyn. RDMA over Commodity Ethernet at Scale. In SIGCOMM, 2016.

Table 3: 40G NCC and DCQCN Comparison

No. of    Broadcom NCC                                                    DCQCN
Flows     Util. (%)  Peak Q Depth (KB)  Peak Q Delay (μs)  CNP Util.      Util. (%)  Peak Q Depth (KB)  Peak Q Delay (μs)  CNP Util.
2         97.3       33                 7                  0.49           99.3       95                 19                 0.01
6         96.9       42                 9                  0.8            99.5       190                40                 0.02
10        94.1       51                 10                 0.96           97.6       232                48                 0.04
20        93.6       59                 12                 1.23           90.7       280                57                 0.1
50        92.4       85                 17                 1.88           86.9       370                76                 0.33


Broadcom, the pulse logo, Connecting everything, Avago Technologies, Avago, NetXtreme, Tomahawk, and the A logo are among the trademarks of Broadcom and/or its affiliates in the United States, certain other countries and/or the EU.

Copyright © 2017 by Broadcom. All Rights Reserved.

The term “Broadcom” refers to Broadcom Limited and/or its subsidiaries. For more information, please visit www.broadcom.com.

Broadcom reserves the right to make changes without further notice to any products or data herein to improve reliability, function, or design. Information furnished by Broadcom is believed to be accurate and reliable. However, Broadcom does not assume any liability arising out of the application or use of this information, nor the application or use of any product or circuit described herein, neither does it convey any license under its patent rights nor the rights of others.