Design of TCP for Data Centers


Cloud Computing

The cloud computing architecture comprises two significant parts:

The front end is the side that the user of the computer, or the client, is able to access. It involves the client's network or computer and the program(s) that the client uses to access the database or the servers that contain all the data.

The back end is the cloud itself, which is the collection of all related information saved on the servers that the client wishes to have access to.

These two ends of the cloud computing architecture are connected through a network, usually the Internet, which provides remote access to all the users of the cloud.

Benefits of Cloud Computing

For large-scale businesses, cloud computing eliminates the need to buy additional hardware and storage devices, since all the data needed is easily accessible from the cloud through the employees' individual computers.

Installing software on every computer is no longer necessary, because the cloud computing platform can do the job.

Cloud hosting services provide managed hosting for all server configurations with dedicated 24/7 availability. Cloud software ranges from sales applications to custom applications, depending on the users' choice.

The above benefits can provide a great deal of profit to many businesses and can also improve customer satisfaction.

Cloud Data Centers

Cloud computing services give the users of the cloud better management of their information. This can save a company on expenses, since the company will not need to hire a large IT team for its own technical support.

There is a lot of cloud computing software available today that offers provision of cloud computing applications. Running applications and storing data on the cloud has proven to be economical and efficient for many businesses.

Cloud data centers host diverse applications, mixing workloads that require small, predictable latency with others requiring large, sustained throughput.

Data Center Bubble

Numerous companies are already providing cloud services, including Amazon, Google, Yahoo, Microsoft, HP, IBM, Cisco, etc.

Data centers range in size from "edge" facilities to megascale data centers (100K to 1M servers).

Data centers are located in many countries, e.g., USA, India, Singapore, Germany, etc.

There is a push for Green Data Centers that use wind/solar energy, efficient floor layouts, recycling of waste material, environment-friendly material/paint, and green-rated power equipment.

Fastest growing sectors in data centers: telecom, foreign hosting companies, global information management, and business connectivity.

Data Center Closures & Consolidation

US Government Data Centers: In November 2012, the US Government closed an additional 64 data centers, bringing the total number of closed facilities to 381. The closures are part of the Federal Data Center Consolidation Initiative for streamlining government IT operations. The ultimate goal is to close 40 percent of the US Federal Government's data centers (i.e., close 1,200 of the nearly 2,900 identified data centers) by 2015.

Commercial Data Centers: Data center consolidation is a trend in industry. For example, HP has been replacing its 85 data centers around the world with only 6 newly-built, larger facilities in Austin, Atlanta, and Houston.

Example: HP Cloud Services (Hewlett-Packard Development Company)

Example: Amazon Data Centers

Amazon data centers serve four regions in the US and three regions in Europe and Asia. Another data center in the US was opened in July 2011 in the state of Oregon to serve the Pacific Northwest region.

In December 2011, Amazon announced it is opening a data center in Sao Paulo, Brazil, its first in South America.

In November 2012, Amazon announced it is adding a ninth region by opening a data center in Sydney, Australia.

The data centers support all Amazon Web Services (AWS), including Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

Example: Amazon Data Centers

The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users with the ability to execute their applications in Amazon's computing environment.

To use Amazon EC2:
Create an Amazon Machine Image (AMI) containing all the software, including the operating system.
Upload this AMI to Amazon S3 (Amazon Simple Storage Service).
Register to get an AMI ID.
Use this AMI ID and the Amazon EC2 web service APIs to run, monitor, and terminate as many instances of this AMI as required.

EC2 pricing policy: pay as you go, no minimal fee. The prices are based on the Region in which the application instance is running.
http://aws.amazon.com/ec2/pricing
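As a rough illustration of the run/monitor/terminate workflow listed above, the following sketch uses the boto3 Python SDK; the AMI ID, region, and instance type are placeholders, not values from the slides.

# Hypothetical sketch of the EC2 workflow described above, using the boto3 SDK.
# The AMI ID, region, and instance type below are placeholders, not real values.
import boto3

ec2 = boto3.resource("ec2", region_name="us-west-2")   # region chosen only for illustration

# Launch instances from a previously registered AMI (step: "run")
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID obtained after registration
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=2,
)

# Monitor the instances (step: "monitor")
for inst in instances:
    inst.wait_until_running()
    inst.reload()
    print(inst.id, inst.state["Name"])

# Terminate when done (step: "terminate")
for inst in instances:
    inst.terminate()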

Data Center Services

Example: Colocation Services of Cogent
http://www.cogentco.com/en

Cogent is a multinational Tier 1 Internet Service Provider. Companies can colocate their business-critical equipment in one of 43 of Cogent's secure, state-of-the-art data centers that connect directly to a Tier-1 IP network. The data centers have extensive power backup systems and complete fire detection and suppression plans to ensure the safety and security of equipment.

Cogent Data Center Features:
http://www.cogentco.com/en/products-and-services/colocation-services

Colocation Data Centers and Cloud Servers
http://www.datacentermap.com/datacenters.html
http://www.datacentermap.com/cloud.html

Example: AtlanticNet Orlando Data Center
http://www.atlantic.net/orlando-colocation-florida.html

Data Center TCP (DCTCP)

M. Alizadeh, A. Greenberg, D. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan

Microsoft Research & Stanford University

ACM SIGCOMM, September 2010


Rack Servers with Commodity Switches

Performance Impairments of Shallow-buffered Switches
1. TCP Incast Collapse

Many applications generate barrier-synchronized requests, in which the client cannot make forward progress until the responses from every server for the current request have been received. An example of such applications is a web search query (e.g., a Google search) sent to a large number of nodes, with results returned to the parent node to be sorted.

Barrier-synchronized requests can result in packets overfilling the shallow buffers on the client's port on the switch. In other words, these requests create many flows that converge on the same interface of a switch over a short period of time. The response packets create a long queue and may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses and throughput collapse.
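For a rough sense of scale (the numbers here are illustrative, not from the paper): if 40 workers each return a 2 KB response through the same switch port within one round trip, about 80 KB arrives nearly simultaneously at that port, while a shallow-buffered commodity switch may have only a few hundred KB of packet memory shared across all ports, so a handful of such synchronized bursts can exhaust the buffer and force drops.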

1. TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the Partition/Aggregate workflow pattern, which is the foundation of many large-scale web applications. Requests from higher layers of the application are broken into pieces and farmed out to workers in lower layers. The responses of these workers are aggregated to produce a result. Web searches, social network content composition, and advertisement selection are based around the Partition/Aggregate design pattern.

In a multi-layer partition/aggregate workflow, lags at one layer delay the initiation of others. Further, answering a request may require iteratively invoking the pattern, with an aggregator making serial requests to the workers below it to prepare a response (1 to 4 iterations are typical, though as many as 20 may occur).

The propagation of the request down to the leaves and of the responses back up to the root must be completed within the deadline.

In other publications this pattern is referred to as the Scatter/Gather pattern.

The partition/aggregate design pattern

[Figure: a request enters at the top-level aggregator, which fans it out to lower-level aggregators, which in turn fan it out to workers; the request-latency deadline shrinks at each level, e.g., 250 ms at the root, 50 ms at the aggregators, 10 ms at the workers.]

The total permissible latency for a request is limited, and the "backend" part of the application is typically allocated between 230-300 ms. This limit is called the all-up SLA.

Example: in web search, a query might be sent to many aggregators and workers, each responsible for a different part of the index. Based on the replies, an aggregator might refine the query and send it out again to improve the relevance of the result. Lagging instances of partition/aggregate can thus add up to threaten the all-up SLAs for queries.

A high-level aggregator (HLA) partitions queries to a large number of mid-level aggregators (MLAs) that in turn partition each query over the other servers in the same rack as the MLA. Servers act as both MLAs and workers, so each server will be acting as an aggregator for some queries and as a worker for other queries.
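To make the fan-out-with-deadline behavior concrete, here is a minimal sketch in Python asyncio; the worker delays, worker count, and deadline value are invented for illustration and are not from the slides. The aggregator sends the query to all workers, waits only until the deadline, and aggregates whatever responses arrived in time.

# Illustrative sketch of one partition/aggregate step (not from the original slides):
# an aggregator fans a query out to its workers, waits at most `deadline` seconds,
# and aggregates only the responses that arrived in time.
import asyncio
import random

async def worker(worker_id: int, query: str) -> str:
    # Simulated per-worker processing time (invented for illustration).
    await asyncio.sleep(random.uniform(0.005, 0.060))
    return f"result-from-worker-{worker_id}"

async def aggregate(query: str, n_workers: int = 4, deadline: float = 0.050) -> list:
    tasks = [asyncio.create_task(worker(i, query)) for i in range(n_workers)]
    done, pending = await asyncio.wait(tasks, timeout=deadline)
    for task in pending:          # late workers: their results are sacrificed
        task.cancel()
    return [task.result() for task in done]

print(asyncio.run(aggregate("web search query")))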

A TCP Incast Event

[Figure: an aggregator sends a query to worker 1, worker 2, worker 3, and worker 4; their responses converge on the aggregator's switch port. The response from worker 3 is lost due to incast and is retransmitted, and acknowledged, only after a timeout.]

Incast Scenario: packets from many flows arriving at the same port at the same time.

Incast Collapse Summary

In other publications the incast scenario is referred to as the fan-in burst at the parent node. This incast is a key reason for increased network delay, and it occurs when all the children (e.g., workers at the leaf level) of a parent node face the same deadline and are likely to respond nearly at the same time, causing a fan-in burst at the parent node.

Performance Impairments of Shallow-buffered Switches
2. Queue Buildup

When long and short flows traverse the same queue, there is a queue buildup impairment: the short flows experience increased latency because they are queued behind packets from the large flows. Since every worker in the cluster handles both query traffic and background traffic (large flows needed to update the data structures on the workers), this traffic pattern occurs very frequently.

This indicates that query flows can experience queuing delays because of long-lived, greedy TCP flows. Further, answering a request can require multiple iterations, which magnifies the impact of this delay.

Performance Impairments of Shallow-buffered Switches
3. Buffer Pressure

Given the mix of long and short flows in a data center, it is very common for short flows on one port to be impacted by activity on other ports. The loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports.

The long, greedy TCP flows build up queues on their interfaces. Since the switch is shallow-buffered and the buffer space is a shared resource, the queue buildup reduces the amount of buffer space available to absorb bursts of Partition/Aggregate traffic. This impairment is called buffer pressure. The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.

Flow Interactions in Shallow-buffered Switches

Incast Scenario: multiple short flows on the same port.
Queue Buildup: short and long flows on the same port.
Buffer Pressure: short flows on one port and long flows on another port.

Legacy TCP Congestion Control

[Figure: congestion window (0 to 20 segments) versus round-trip times, showing slow start, congestion avoidance, segment losses handled by fast retransmit, a time-out, and the thresholds ss_thresh = 16, cwnd = 20, ss_thresh = 10.]

Fast retransmission: ssthresh = cwnd/2 = cwnd × (1 - 0.5); cwnd = ssthresh.
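A minimal sketch of the reactions named in the figure (Python; simplified whole-segment arithmetic, since real TCP stacks differ in many details):

# Simplified sketch of the legacy TCP reactions in the figure above.
def on_fast_retransmit(cwnd: int) -> tuple[int, int]:
    ssthresh = max(cwnd // 2, 2)      # multiplicative decrease: cwnd * (1 - 0.5)
    return ssthresh, ssthresh         # (new ssthresh, new cwnd)

def on_timeout(cwnd: int) -> tuple[int, int]:
    ssthresh = max(cwnd // 2, 2)
    return ssthresh, 1                # restart from cwnd = 1 in slow start

print(on_fast_retransmit(20))         # -> (10, 10), matching the figure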

The Need for a Data Center TCP

The data center environment is significantly different from wide area networks:

o Round-trip times (RTTs) can be less than 250 μs in the absence of queuing.
o Applications need extremely high bandwidths and very low latencies.
o There is little statistical multiplexing: a single flow can dominate a particular path.
o The network is largely homogeneous and under a single administrative control.
o Traffic flowing in switches is mostly internal. Connectivity to the external Internet is typically managed through load balancers and application proxies that effectively separate internal traffic from external.

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long flows. The measurements by the authors reveal that 99.91% of traffic in the data center is TCP traffic. The traffic consists of query traffic (2 KB to 20 KB in size), delay-sensitive short messages (100 KB to 1 MB), and throughput-sensitive long flows (1 MB to 100 MB). These applications require three things from the data center network:

o low latency for short flows
o high burst tolerance
o high utilization for long flows

Because of the impairments of shallow-buffered commodity switches, legacy TCP protocols fall short of satisfying the above requirements.

See paper for details of the workload characterization in cloud data centers.

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput with commodity, shallow-buffered switches. DCTCP uses the concept of ECN (Explicit Congestion Notification).

DCTCP achieves these goals primarily by reacting to congestion in proportion to the extent of congestion.

DCTCP uses a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold.

The DCTCP source reacts by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor. This is different from standard TCP, which cuts its window size by a factor of 2 when it receives an ECN notification.

DCTCP - Simple Marking at the Switch

DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.

An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival. Marking is based on the instantaneous value of the queue, not the average value as in RED routers.

The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.

The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.
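A minimal sketch of this single-threshold decision (Python; the function and variable names are ours, and limit stands for the interface's buffer limit):

# Sketch of DCTCP's single-threshold marking on the instantaneous queue length.
def handle_arrival(queue_len_pkts: int, K: int, limit: int) -> str:
    if queue_len_pkts <= K:
        return "accept"                 # below threshold: no congestion signal
    elif queue_len_pkts <= limit:
        return "accept_and_mark_CE"     # mark based on the instantaneous queue, not an average
    else:
        return "drop"                   # buffer exhausted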


DCTCP - ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.

When delayed ACKs are used (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See paper for details of the delayed ACK scheme.
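The slides defer the delayed-ACK handling to the paper; the sketch below is one plausible reading of such a two-state receiver (an illustrative assumption, not the paper's authoritative state machine): whenever the CE marking of arriving packets changes, the receiver immediately ACKs the packets seen so far with the old ECN-Echo value, so the exact run of marks is still conveyed.

# Hedged sketch of a two-state DCTCP receiver with delayed ACKs (one ACK per m packets).
class DctcpReceiver:
    def __init__(self, m: int = 2):
        self.m = m
        self.last_ce = False     # state: CE bit of the most recent packet
        self.unacked = 0

    def on_packet(self, ce_marked: bool) -> list:
        acks = []
        if ce_marked != self.last_ce and self.unacked > 0:
            # CE state changed: immediately ACK the packets seen so far,
            # echoing the old CE state, so marks are conveyed exactly.
            acks.append({"ecn_echo": self.last_ce, "covers": self.unacked})
            self.unacked = 0
        self.last_ce = ce_marked
        self.unacked += 1
        if self.unacked == self.m:
            acks.append({"ecn_echo": self.last_ce, "covers": self.unacked})
            self.unacked = 0
        return acks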

DCTCP - Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:

α = (1 - g) × α + g × F

where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.

Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
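A minimal sketch of this sender-side bookkeeping (Python; the function names are ours, and g = 1/16 is just an illustrative weight):

# Sketch of the DCTCP sender: update the EWMA of marked packets once per window,
# then cut the window in proportion to alpha instead of always halving.
def update_alpha(alpha: float, marked: int, total: int, g: float = 1.0 / 16) -> float:
    F = marked / total                  # fraction of marked packets in the last window
    return (1.0 - g) * alpha + g * F

def react_to_marks(cwnd: float, alpha: float) -> float:
    return cwnd * (1.0 - alpha / 2.0)   # alpha = 1 reduces to the legacy halving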


RED Router

[Figure: RED marking/dropping behavior versus average queue size: packets are accepted below THmin, discarded or marked with increasing probability between THmin and THmax, and discarded above THmax, up to the capacity C.]

RED Router:
Update the value of the average queue size: avg = (1 - wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
  calculate probability Pa
  with probability Pa, discard or mark the packet
  otherwise, with probability 1 - Pa, accept the packet
else if (avg > THmax) discard packet

DCTCP Switch

[Figure: packets are accepted without marking while the instantaneous queue is below K, and accepted with marking between K and the buffer limit.]

DCTCP Switch:
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender:
Update α = (1 - g) × α + g × F
Reaction to a marked ACK in a new window: ssthresh = cwnd × (1 - α/2); cwnd = ssthresh

Legacy TCP Sender:
Reaction to a marked ACK in a new window: ssthresh = cwnd/2; cwnd = ssthresh

Benefits of DCTCP

Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.

Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.

Benefits of DCTCP (continued)

Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.

However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.

DCTCP Performance

The paper has more details on:

Guidelines for choosing parameters and estimating gain

An analytical model for the steady-state behavior of DCTCP

The benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP

Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation

D3 TCP

"Better Never Than Late: Meeting Deadlines in Datacenter Networks"

C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron

Microsoft Research

ACM SIGCOMM, August 2011

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.

Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.

Rule: a flow is useful if and only if it satisfies its deadline.

D3 TCP: Basic Idea of Deadline Awareness

[Figure: two flows (f1, f2) with different deadlines (d1, d2) plotted against time, under DCTCP and under D3 TCP; the thickness of a flow line represents the rate allocated to it.]

DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines.

D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.


Challenges

Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.

Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.

Most flows are very short (<50 KB) and RTTs are minimal (~300 μs). Consequently, reaction time-scales are short, and centralized, heavy-weight (complex) mechanisms to reserve bandwidth for flows are impractical.

Basic Design Idea

D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic into the network.

D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges.

Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.

D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.

Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
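As a rough, simplified illustration of the rate-request and greedy allocation idea (Python; the function names, the desired-rate formula shown, and the example numbers are our assumptions, not the full D3 protocol):

# Simplified sketch of deadline-driven rate requests and greedy FCFS allocation.
def desired_rate(bytes_remaining: float, time_to_deadline: float) -> float:
    # A deadline flow asks for just enough rate to finish on time (assumption for illustration).
    return bytes_remaining / time_to_deadline if time_to_deadline > 0 else float("inf")

def allocate_fcfs(requests: list[float], capacity: float) -> list[float]:
    grants, remaining = [], capacity
    for r in requests:                      # requests are granted in arrival order
        g = min(r, remaining)
        grants.append(g)
        remaining -= g
    return grants

# Example: three requests on a 10-unit link are granted in arrival order.
print(allocate_fcfs([4.0, 5.0, 3.0], capacity=10.0))   # -> [4.0, 5.0, 1.0]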

D2 TCP

"Deadline-Aware Datacenter TCP"

B. Vamanan, J. Hasan, T. Vijaykumar

Purdue University & Google Inc.

ACM SIGCOMM, August 2012

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:

does not handle fan-in bursts well,
introduces priority inversion at fan-in bursts (see next slide),
does not co-exist with TCP,
requires custom silicon (i.e., switches).

Priority Inversion in D3 TCP

[Figure: bandwidth requests arriving at a switch are granted first-come-first-served; a request with a far deadline that arrives slightly earlier is granted, while a request with a near deadline is paused.]

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
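For illustration (the numbers are ours, not from the paper): on a 10-unit link, if a far-deadline request for 8 units arrives just before a near-deadline request for 6 units, FCFS granting gives the far-deadline flow 8 units and leaves only 2 for the near-deadline flow, which then misses its deadline even though granting the near-deadline request first would have let it finish on time.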

D2 TCP's Contributions

Deadline-aware and handles fan-in bursts well.

Elegant: uses gamma-correction for congestion avoidance (far deadline → back off more; near deadline → back off less).

Reactive and decentralized.

Does not hinder long-lived (non-deadline) flows.

Coexists with TCP → incrementally deployable.

No change to switch hardware → deployable today.

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.

OnLine Data Intensive (OLDI) Applications

OLDI applications operate under soft real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as web search, online retail, and advertisement.

Example: a typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.

OLDI Applications

OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.

[Figure: a root node fans a user query out to parent nodes, which fan it out to leaf nodes; the OLDI response returns to the user within ~250 ms.]

Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue

D2 TCP

Deadline-aware and handles fan-in bursts.

Key idea: vary the sending rate based on both the deadline and the extent of congestion.

Built on top of DCTCP. Distributed: uses per-flow state at end hosts. Reactive: senders react to congestion, with no knowledge of other flows.

D2 TCP: Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 - g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.

We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as gamma correction.

D2 TCP: Adjusting the Congestion Window

The congestion window W is adjusted as follows:

W = W × (1 - p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, so the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.

Note: the larger p is, the smaller the window.

D2 TCP: Basic Formulas

After determining p, we resize the congestion window W as follows:

W = W × (1 - p/2)   for f > 0, where p = α^d

and d is the deadline imminence factor:

d = Tc / D

Tc = the flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows; d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).
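A minimal sketch of this window update (Python; the function and variable names are ours):

# Sketch of the D2TCP congestion-window update built on top of the DCTCP alpha.
# d is the deadline imminence factor (d = 1 recovers DCTCP behavior).
def d2tcp_window(cwnd: float, alpha: float, d: float, marked: bool) -> float:
    if not marked:                      # f = 0: no CE marks in the last window
        return cwnd + 1.0               # grow by one segment, as in TCP
    p = alpha ** d                      # gamma-correction penalty
    return cwnd * (1.0 - p / 2.0)       # p = 1 halves the window, as in TCP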

Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines.

[Figure: the penalty p = α^d plotted against α for d = 1 (DCTCP behavior), d < 1 (far deadline), and d > 1 (near deadline).]

W := W × (1 - p/2), with p = α^d

• d < 1 → p > α for far-deadline flows; p large → shrink window
• d > 1 → p < α for near-deadline flows; p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior

Key insight: near-deadline flows back off less, while far-deadline flows back off more.
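For illustration, take α = 0.5: a far-deadline flow with d = 0.5 gets p = 0.5^0.5 ≈ 0.71 and cuts its window by about 35% (p/2), a near-deadline flow with d = 2 gets p = 0.5^2 = 0.25 and cuts it by only 12.5%, and a flow with d = 1 gets p = α = 0.5 and cuts it by 25%, exactly DCTCP's behavior.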

D2 TCP: Computing α

α is calculated by aggregating ECN marks, as in DCTCP: switches mark packets when the instantaneous queue length exceeds the threshold K, and the sender computes the fraction of marked packets, averaged over time.

Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

[Figure: switch buffer with the marking threshold K and the buffer limit; packets are accepted without marking below K and accepted with marking between K and the limit.]

Sender (update once every RTT):
α = (1 - g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data.

D2 TCP: Computing the Deadline Imminence Factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc needed to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior.

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); the window drops from W to W/2 upon congestion detection (p = 1) and then grows again; the time axis is in RTTs, the flow completes at Tc (here Tc > L, where L is the length of one sawtooth), and the deadline is at D.]

D = the time remaining until the deadline expires
W = the flow's current window size
B = the bytes remaining to fully transmit the message
Tc = the time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide.

D2 TCP: Computing the Deadline Imminence Factor d (continued)

For the case Tc ≥ L, summing the bytes sent over the sawtooth pattern gives

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ... + (W/2 + L - 1) ] × (Tc / L)

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); the window oscillates between W/2 and W, the time axis is in RTTs, the flow completes at Tc (Tc > L), and the deadline is at D.]

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

Since the value of B is known by the application and L - 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives

B = (0.75) W × Tc   (in bytes)

Analysis continued on the next slide.

D2 TCP: Computing the Deadline Imminence Factor d (continued)

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); window between W/2 and W, time in RTTs, flow completion at Tc (Tc > L), deadline at D.]

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D, with Tc ≈ B / (0.75 W) from the approximation above.

D2 TCP: The Deadline Imminence Factor d

What if Tc < L? In this case the partial sawtooth pattern is as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ... + (W/2 + Tc - 1)

[Figure: a partial sawtooth for deadline-agnostic behavior (similar to DCTCP); the window grows from W/2 toward W, but the flow completes at Tc < L, before the deadline D.]

Since the value of B is known by the application, the value Tc can be computed. The value d is then given by

d = Tc / D
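Putting the two cases together, a small sketch (Python; the names, units, and guard conditions are our assumptions) of how a sender could compute d from B, W, and D:

# Sketch of computing the deadline imminence factor d = Tc / D from the sawtooth model.
# B: bytes remaining, W: current window in bytes per RTT, rtt: round-trip time (s),
# D: time remaining until the deadline (s).
from typing import Optional

def completion_time(B: float, W: float, rtt: float) -> float:
    # Deadline-agnostic estimate: the average window over a sawtooth is about 0.75 W,
    # so B = 0.75 * W * (Tc / rtt), which gives Tc = B * rtt / (0.75 * W).
    return (B * rtt) / (0.75 * W)

def imminence_factor(B: float, W: float, rtt: float, D: Optional[float]) -> float:
    if D is None:                    # long flow with no deadline: behave like DCTCP (d = 1)
        return 1.0
    Tc = completion_time(B, W, rtt)
    return Tc / D if D > 0 else float("inf")   # d > 1: tight deadline; d < 1: loose deadline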

D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 2: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 259

Cloud ComputingThe cloud computing architecture is comprised of two significant

parts

The front end is the side at which the user of the computer orthe client himself is able to access This involves the clientrsquos

network or his computer and the program(s) that the client uses

to access the database or the servers that contain all the data

The back end is the cloud itself which is the collection of allrelated information saved on the servers that the client wishes to

have access to

These two ends of the cloud computing architecture are connectedthrough a network usually the Internet which provides remote

access to all the users of the cloud

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 359

Benefits of Cloud Computing

For large-scale businesses the cloud computing technology

eliminates the need to buy an additional number of hardware and

storage devices since all data needed would be easily accessible

from the cloud through the individual computers of the

employees

Installing software in every computer would no longer be

necessary because the cloud computing platform would be ableto do the job

Cloud hosting services provide managed hosting for all server

configurations on a dedicated 247 availability Cloud software

ranges from sales applications to custom applications dependingon the usersrsquo choice

The above benefits can provide a great deal of profit to many

businesses and can also improve customer satisfaction

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 459

Cloud Data Centers Cloud computing services provide the users of the Cloud

better management of their information This might save a

company on expenses since the company will not need to hirea large IT team for its own technical support

There is a lot of cloud computing software available today that

offers provision of cloud computing applications Running

applications and storing data on the cloud has proven to beeconomical and efficient for many businesses

Cloud data centers host diverse applications mixing

workloads that require small predictable latency with othersrequiring large sustained throughput

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 559

Data Center Bubble

Numerous companies are already providing cloud services

including Amazon Google Yahoo Microsoft HP IBM Cisco

etc

Data Centers range in size from ldquoedgerdquo facilities to megascale datacenters (100K to 1M servers)

Data centers are located in many countries eg USA India

Singapore Germany etcThere is a push for Green Data Centers that use windsolar energy

efficient floor layout recycling of waste material environment

friendly materialpaint green rated power equipment

Fastest growing sectors in data centers Telecom Foreign hosting

companies Global information management Business

connectivity

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 659

Data Center Closures amp ConsolidationUS Government Data Centers

In November 2012 the US Government has closed an additional

64 data centers bringing the total number of closed facilities to

381 The closures are part of the Federal Data CenterConsolidation Initiative for streamlining government IT

operations

The ultimate goal is to close 40 percent of the US Federal

Governmentrsquos data centers (ie close 1200 of the nearly 2900identified data centers) by 2015

Commercial Data CentersData center consolidation is a trend in industry For example HP

has been replacing its 85 data centers around the world with only 6newly-built larger facilities in Austin Atlanta and Houston

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 759

Hewlett-Packard Development Company

Example HP Cloud Services

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 859

Example Amazon Data CentersAmazon data centers serve four regions in

the US and three regions in Europe and

Asia Another data center in the US was

opened July 2011 in the state of Oregon to

serve the Pacific Northwest region

In December 2011 Amazon announced it is

opening a data center in Sao Paulo Brazil

its first in South America

In November 2012 Amazon announced it

is adding a ninth region by opening a data

center in Sydney Australia

The data centers support all Amazon Web

Services (AWS) including Amazon Elastic

Compute Cloud (EC2) and Amazon

Simple Storage Service (S3)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 959

Example Amazon Data Centers

The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users

with the ability to execute their applications in Amazons computing environment

To use Amazon EC2 Create an Amazon Machine Image (AMI) containing all the software including

the operating system

Upload this AMI to the Amazon S3 (Amazon Simple Storage Service)

Register to get an AMI ID Use this AMI ID and the Amazon EC2 web service APIs to run monitor and

terminate as many instances of this AMI as required

EC2 Pricing Policy pay as you go no minimal fee The prices are based on theRegion in which the application instance is running

httpawsamazoncomec2pricing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1059

Data Center Services

Exampe Colocation Services of Cogent

httpwwwcogentcocomen

Cogent is a multinational Tier 1 Internet Service Provider

Companies can colocate their business critical equipment in one of

43 Cogents secure state-of-the-art data centers that connect directly

to a Tier-1 IP network The data centers have extensive powerbackup systems complete fire detection and suppression plans to

ensure the safety and security of equipment

Cogent Data Center Features

httpwwwcogentcocomenproducts-and-servicescolocation-

services

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1159

Colocation Data Centers and Cloud Servers

httpwwwdatacentermapcomdatacentershtml

httpwwwdatacentermapcomcloudhtml

Example AtlanticNet

httpwwwatlanticnetorlando-colocation-floridahtml

Orlando Data Center

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1259

Data Center TCP (DCTCP)

M Alizadehzy A Greenbergy D Maltzy J Padhyey P

Pately B Prabhakarz S Senguptay M Sridharan

983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161

ACM SIGCOMM September 2010

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1359

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1459

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K, and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.

Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming (smart routing).
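As a concrete illustration, a minimal sketch of the per-window update, with g = 1/16 as an assumed (commonly cited) weight:

    def update_alpha(alpha, marked, total, g=1.0 / 16):
        # Called once per window of data (roughly once per RTT).
        # F is the fraction of ACKs in that window carrying ECN-Echo.
        F = marked / total if total else 0.0
        return (1 - g) * alpha + g * F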


RED Router

[Figure: RED drop/mark profile. Packets are accepted while the average queue size is below THmin, discarded or marked with increasing probability between THmin and THmax, and discarded above THmax (queue capacity C).]

RED Router

Update the value of the average queue size:
avg = (1 − wq) × avg + wq × q

if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa
    with probability Pa: discard or mark packet
    otherwise (with probability 1 − Pa): accept packet
else if (avg > THmax) discard packet

[Figure: DCTCP switch queue. Packets are accepted without marking while the queue is below K, and accepted with marking between K and the buffer limit.]

DCTCP Switch

if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender

Update α = (1 − g) × α + g × F
Reaction to marked ACK in a new window:
ssthresh = cwnd × (1 − α/2)
cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new window: ssthresh = cwnd/2; cwnd = ssthresh
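A worked example with hypothetical numbers ties the two sender rules together. Suppose cwnd = 20 segments and the latest window produced α = 0.4:

    cwnd, alpha = 20, 0.4
    dctcp_ssthresh  = cwnd * (1 - alpha / 2)   # 20 * 0.8 = 16 segments
    legacy_ssthresh = cwnd / 2                 # 10 segments

The milder cut leaves DCTCP with enough window to keep the link busy while the queue drains back below K.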


Benefits of DCTCP

Queue buildup: DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure: a congested port's queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports


Benefits of DCTCP (continued)

Incast: the incast scenario where a large number of synchronized small flows hit the same queue is the most difficult to handle. If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst, then there isn't much that DCTCP or any congestion control scheme can do to

avoid packet drops

However, in practice, each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on the instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer

overflows and resulting timeouts


DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation.


D3 TCP

C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron

ACM SIGCOMM August 2011

Better Never Than Late: Meeting Deadlines in Datacenter Networks

Microsoft Research


Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule: a flow is useful if and only if it satisfies its deadline.


D3 TCP Basic Idea of Deadline Awareness

[Figure: two flow timelines (f1 and f2, with deadlines d1 and d2) shown under DCTCP and under D3 TCP; line thickness indicates the rate allocated to each flow.]

Two flows (f1, f2) with different deadlines (d1, d2). The thickness of a flow line represents the rate allocated to it. DCTCP is not aware of deadlines and treats all flows equally; it can easily cause some flows to miss their deadline. D3 TCP allocates bandwidth to flows based on their deadline. Awareness of deadlines can be used in D3 TCP to ensure they are met.


Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.

Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.

Most flows are very short (<50KB) and RTTs are minimal (about 300 microseconds). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.

Challenges


D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.

D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination.

Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible (a simple rate-request sketch appears below).

D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses
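As a rough sketch of the end-host side of this idea (the function and parameter names are illustrative, not from the paper), each RTT the host asks the routers on its path for the rate that would just finish the flow by its deadline:

    def requested_rate(bytes_remaining, time_to_deadline):
        # "Just in time" rate for a deadline flow; the request is refreshed every RTT,
        # and routers grant such requests greedily in the order they arrive.
        return bytes_remaining / max(time_to_deadline, 1e-6)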


D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP


Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)


[Figure: bandwidth requests arriving at a switch that grants requests FCFS; a request with a near deadline is paused while an earlier-arriving request with a far deadline is granted.]

Priority Inversion in D3 TCP

The D3 TCP greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
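A toy illustration of the race (hypothetical numbers, deadlines in ms): whichever request reaches the switch first is granted first, even when the other request's deadline is much closer.

    requests = [("far", 200), ("near", 20)]   # the far-deadline request arrives slightly earlier

    fcfs_order = [name for name, _ in requests]                               # ['far', 'near']
    edf_order  = [name for name, _ in sorted(requests, key=lambda r: r[1])]   # ['near', 'far']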


Deadline-aware and handles fan-in bursts well
Elegant: uses gamma correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
Reactive, decentralized
Does not hinder long-lived (non-deadline) flows
Coexists with TCP → incrementally deployable
No change to switch hardware → deployable today

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.

D2 TCP's Contributions


OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.

Example:

A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees


Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-to-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

[Figure: partition-aggregate tree. A user query enters at the root, is partitioned across parent aggregators and then across leaf servers, and the OLDI response is aggregated back within ~250 ms.]


D2 TCP

Deadline-aware and handles fan-in bursts
Key idea: vary the sending rate based on both the deadline and the extent of congestion
Built on top of DCTCP
Distributed: uses per-flow state at end hosts
Reactive: senders react to congestion
No knowledge of other flows


D2 TCP Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 − g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data and 0 < g < 1 is the weight given to new samples.

We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as gamma correction.


The congestion window W is adjusted as follows

W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For intermediate values between 0 and 1, the window size is modulated by p.

D2 TCP Adjusting Congestion Window

Note: larger p ⇒ smaller window
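Putting the penalty and the resize rule together, a minimal sketch (the function and variable names are illustrative):

    def d2tcp_window(cwnd, alpha, d, f):
        # f: fraction of packets marked in the last window of data
        # d: deadline imminence factor (d > 1 near deadline, d < 1 far, d = 1 no deadline)
        if f > 0:
            p = alpha ** d                 # gamma correction
            return cwnd * (1 - p / 2)      # larger p -> smaller window
        return cwnd + 1                    # no marks: grow by one segment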


After determining p we resize the congestion window W as follows

W = W × (1 − p/2)   if f > 0

where d = deadline imminence factor:

d = Tc / D

Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = α^d

Gamma Correction Function


Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines.

[Figure: the penalty p = α^d plotted against α for d = 1, d < 1 (far deadline), and d > 1 (near deadline).]

W := W × (1 − p/2), with p = α^d

Key insight: near-deadline flows back off less, while far-deadline flows back off more.

• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior


[Figure: switch buffer. Packets are accepted without marking while the queue is below K, and accepted with marking between K and Buffer_limit.]

Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

Sender (update once every RTT):
α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data

D2 TCP Computing α

α is calculated by aggregating ECN (like DCTCP)
Switches mark packets if queue_length > threshold
The sender computes the fraction of marked packets averaged over time


D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth deadline-agnostic congestion behavior.

[Figure: sawtooth congestion-window evolution under the deadline-agnostic behavior (similar to DCTCP), for the case Tc > L. The window repeatedly grows from W/2 back up to W over L RTTs, with W → W/2 upon congestion detection; the deadline D lies on the time axis.]

D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = the time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide.

D2 TCP Computing the deadline imminence factor d


B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

[Figure: the same sawtooth pattern (Tc > L), with the time axis in RTTs and the deadline D marked.]

Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives

B = (0.75) × W × Tc   (in bytes)

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

Analysis continued on the next slide.

D2 TCP Computing the deadline imminence factor d


[Figure: sawtooth pattern for the deadline-agnostic behavior (Tc > L), repeated for reference; the time axis is in RTTs and the deadline D is marked.]

It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D

Here Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.

Using the approximation from the previous slide, Tc = B / (0.75 × W).

D2 TCP the deadline imminence factor d


What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have:

[Figure: partial sawtooth for the case Tc < L. The window grows from W/2 for Tc RTTs without completing a full wave before the flow finishes transmitting.]

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value Tc can be computed. The value of d is again given by

d = Tc / D
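A small sketch that puts the pieces together, using the 0.75 × W approximation from the earlier slides (which sidesteps the Tc > L versus Tc < L case split); B and W are in bytes, D is in RTTs, and the names are illustrative:

    def deadline_imminence(B, W, D_rtts):
        # Tc under the 0.75*W-per-RTT approximation, expressed in RTTs
        Tc = B / (0.75 * W)
        if D_rtts <= 0:
            return 1.0            # no deadline specified (assumed encoding): behave like DCTCP
        return Tc / D_rtts        # d > 1: tight deadline, d < 1: slack deadline

A real implementation would presumably also bound d to a modest range so that extreme deadlines do not drive the window correction to extremes.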

D2 TCP Summary


D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN, which is true of today's

datacenter switches

Page 3: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 359

Benefits of Cloud Computing

For large-scale businesses the cloud computing technology

eliminates the need to buy an additional number of hardware and

storage devices since all data needed would be easily accessible

from the cloud through the individual computers of the

employees

Installing software in every computer would no longer be

necessary because the cloud computing platform would be ableto do the job

Cloud hosting services provide managed hosting for all server

configurations on a dedicated 247 availability Cloud software

ranges from sales applications to custom applications dependingon the usersrsquo choice

The above benefits can provide a great deal of profit to many

businesses and can also improve customer satisfaction

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 459

Cloud Data Centers Cloud computing services provide the users of the Cloud

better management of their information This might save a

company on expenses since the company will not need to hirea large IT team for its own technical support

There is a lot of cloud computing software available today that

offers provision of cloud computing applications Running

applications and storing data on the cloud has proven to beeconomical and efficient for many businesses

Cloud data centers host diverse applications mixing

workloads that require small predictable latency with othersrequiring large sustained throughput

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 559

Data Center Bubble

Numerous companies are already providing cloud services

including Amazon Google Yahoo Microsoft HP IBM Cisco

etc

Data Centers range in size from ldquoedgerdquo facilities to megascale datacenters (100K to 1M servers)

Data centers are located in many countries eg USA India

Singapore Germany etcThere is a push for Green Data Centers that use windsolar energy

efficient floor layout recycling of waste material environment

friendly materialpaint green rated power equipment

Fastest growing sectors in data centers Telecom Foreign hosting

companies Global information management Business

connectivity

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 659

Data Center Closures amp ConsolidationUS Government Data Centers

In November 2012 the US Government has closed an additional

64 data centers bringing the total number of closed facilities to

381 The closures are part of the Federal Data CenterConsolidation Initiative for streamlining government IT

operations

The ultimate goal is to close 40 percent of the US Federal

Governmentrsquos data centers (ie close 1200 of the nearly 2900identified data centers) by 2015

Commercial Data CentersData center consolidation is a trend in industry For example HP

has been replacing its 85 data centers around the world with only 6newly-built larger facilities in Austin Atlanta and Houston

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 759

Hewlett-Packard Development Company

Example HP Cloud Services

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 859

Example Amazon Data CentersAmazon data centers serve four regions in

the US and three regions in Europe and

Asia Another data center in the US was

opened July 2011 in the state of Oregon to

serve the Pacific Northwest region

In December 2011 Amazon announced it is

opening a data center in Sao Paulo Brazil

its first in South America

In November 2012 Amazon announced it

is adding a ninth region by opening a data

center in Sydney Australia

The data centers support all Amazon Web

Services (AWS) including Amazon Elastic

Compute Cloud (EC2) and Amazon

Simple Storage Service (S3)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 959

Example Amazon Data Centers

The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users

with the ability to execute their applications in Amazons computing environment

To use Amazon EC2 Create an Amazon Machine Image (AMI) containing all the software including

the operating system

Upload this AMI to the Amazon S3 (Amazon Simple Storage Service)

Register to get an AMI ID Use this AMI ID and the Amazon EC2 web service APIs to run monitor and

terminate as many instances of this AMI as required

EC2 Pricing Policy pay as you go no minimal fee The prices are based on theRegion in which the application instance is running

httpawsamazoncomec2pricing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1059

Data Center Services

Exampe Colocation Services of Cogent

httpwwwcogentcocomen

Cogent is a multinational Tier 1 Internet Service Provider

Companies can colocate their business critical equipment in one of

43 Cogents secure state-of-the-art data centers that connect directly

to a Tier-1 IP network The data centers have extensive powerbackup systems complete fire detection and suppression plans to

ensure the safety and security of equipment

Cogent Data Center Features

httpwwwcogentcocomenproducts-and-servicescolocation-

services

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1159

Colocation Data Centers and Cloud Servers

httpwwwdatacentermapcomdatacentershtml

httpwwwdatacentermapcomcloudhtml

Example AtlanticNet

httpwwwatlanticnetorlando-colocation-floridahtml

Orlando Data Center

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1259

Data Center TCP (DCTCP)

M Alizadehzy A Greenbergy D Maltzy J Padhyey P

Pately B Prabhakarz S Senguptay M Sridharan

983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161

ACM SIGCOMM September 2010

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1359

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1459

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches


Page 5: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 559

Data Center Bubble

Numerous companies are already providing cloud services

including Amazon Google Yahoo Microsoft HP IBM Cisco

etc

Data Centers range in size from ldquoedgerdquo facilities to megascale datacenters (100K to 1M servers)

Data centers are located in many countries eg USA India

Singapore Germany etcThere is a push for Green Data Centers that use windsolar energy

efficient floor layout recycling of waste material environment

friendly materialpaint green rated power equipment

Fastest growing sectors in data centers Telecom Foreign hosting

companies Global information management Business

connectivity

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 659

Data Center Closures amp ConsolidationUS Government Data Centers

In November 2012 the US Government has closed an additional

64 data centers bringing the total number of closed facilities to

381 The closures are part of the Federal Data CenterConsolidation Initiative for streamlining government IT

operations

The ultimate goal is to close 40 percent of the US Federal

Governmentrsquos data centers (ie close 1200 of the nearly 2900identified data centers) by 2015

Commercial Data CentersData center consolidation is a trend in industry For example HP

has been replacing its 85 data centers around the world with only 6newly-built larger facilities in Austin Atlanta and Houston

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 759

Hewlett-Packard Development Company

Example HP Cloud Services

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 859

Example Amazon Data CentersAmazon data centers serve four regions in

the US and three regions in Europe and

Asia Another data center in the US was

opened July 2011 in the state of Oregon to

serve the Pacific Northwest region

In December 2011 Amazon announced it is

opening a data center in Sao Paulo Brazil

its first in South America

In November 2012 Amazon announced it

is adding a ninth region by opening a data

center in Sydney Australia

The data centers support all Amazon Web

Services (AWS) including Amazon Elastic

Compute Cloud (EC2) and Amazon

Simple Storage Service (S3)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 959

Example Amazon Data Centers

The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users

with the ability to execute their applications in Amazons computing environment

To use Amazon EC2 Create an Amazon Machine Image (AMI) containing all the software including

the operating system

Upload this AMI to the Amazon S3 (Amazon Simple Storage Service)

Register to get an AMI ID Use this AMI ID and the Amazon EC2 web service APIs to run monitor and

terminate as many instances of this AMI as required

EC2 Pricing Policy pay as you go no minimal fee The prices are based on theRegion in which the application instance is running

httpawsamazoncomec2pricing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1059

Data Center Services

Exampe Colocation Services of Cogent

httpwwwcogentcocomen

Cogent is a multinational Tier 1 Internet Service Provider

Companies can colocate their business critical equipment in one of

43 Cogents secure state-of-the-art data centers that connect directly

to a Tier-1 IP network The data centers have extensive powerbackup systems complete fire detection and suppression plans to

ensure the safety and security of equipment

Cogent Data Center Features

httpwwwcogentcocomenproducts-and-servicescolocation-

services

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1159

Colocation Data Centers and Cloud Servers

httpwwwdatacentermapcomdatacentershtml

httpwwwdatacentermapcomcloudhtml

Example AtlanticNet

httpwwwatlanticnetorlando-colocation-floridahtml

Orlando Data Center

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1259

Data Center TCP (DCTCP)

M Alizadehzy A Greenbergy D Maltzy J Padhyey P

Pately B Prabhakarz S Senguptay M Sridharan

983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161

ACM SIGCOMM September 2010

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1359

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1459

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation


D3 TCP

C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron

ACM SIGCOMM August 2011

Better Never Than Late: Meeting Deadlines in Datacenter Networks

Microsoft Research


Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.

Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.

Rule: a flow is useful if and only if it satisfies its deadline.


D3 TCP Basic Idea of Deadline Awareness

[Figure: two flow-rate timelines (Flow vs. Time) for flows f1 and f2 with deadlines d1 and d2, one under DCTCP and one under D3 TCP.]

Two flows (f1, f2) with different deadlines (d1, d2); the thickness of a flow line represents the rate allocated to it. DCTCP is not aware of deadlines and treats all flows equally, so it can easily cause some flows to miss their deadlines. D3 TCP allocates bandwidth to flows based on their deadlines; this awareness of deadlines can be used in D3 TCP to ensure they are met.


Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.

Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.

Most flows are very short (<50KB) and RTTs are minimal (300 μsec). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.

Challenges


D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.

D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.

D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses
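As a rough sketch of the router-side idea described above (not the actual D3 algorithm, which also distributes spare capacity and refreshes rates every RTT; see the paper), a router that grants each arriving request the rate needed to meet its deadline while capacity remains could look like the following Python fragment. The request format and names are made up for illustration.

def d3_style_allocate(requests, link_capacity):
    # requests: list of (flow_id, bytes_remaining, seconds_to_deadline), in arrival order.
    # Returns a dict {flow_id: granted rate in bytes/sec}.
    grants = {}
    remaining = link_capacity
    for flow_id, size, deadline in requests:   # FCFS: arrival order, not deadline order
        desired = size / deadline              # rate needed to just meet the deadline
        granted = min(desired, remaining)
        grants[flow_id] = granted
        remaining -= granted
    return grants

# A far-deadline request that arrives first consumes capacity the near-deadline flow needs.
print(d3_style_allocate([("far", 5e6, 0.20), ("near", 5e6, 0.05)], link_capacity=1.0e8))

Because grants are made in arrival order rather than deadline order, a far-deadline request arriving slightly earlier can starve a near-deadline request; this is the priority-inversion problem discussed a few slides later.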


D2 TCP

B. Vamanan, J. Hasan, T. Vijaykumar

Purdue University & Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP


Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (i.e., switches)


[Figure: bandwidth requests arriving at a switch that grants requests FCFS; legend: request paused, request granted, request with near deadline, request with far deadline.]

Priority Inversion in D3 TCP

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.


D2 TCP's Contributions

Deadline-aware and handles fan-in bursts well
Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
Reactive, decentralized
Does not hinder long-lived (non-deadline) flows
Coexists with TCP → incrementally deployable
No change to switch hardware → deployable today

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.


OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.

Example:

A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.


OLDI Applications

Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

[Figure: partition-aggregate tree with a root, parents, and leaves; a user query enters at the root, fans out to the leaf servers, and the OLDI response returns within ≈250 ms.]


D2 TCP

Deadline-aware and handles fan-in bursts.

Key Idea: vary the sending rate based on both the deadline and the extent of congestion.

Built on top of DCTCP. Distributed: uses per-flow state at end hosts. Reactive: senders react to congestion, with no knowledge of other flows.


D2 TCP Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 − g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data and 0 < g < 1 is the weight given to new samples.

We now define d as the deadline imminence factor; a larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1 and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction.
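The two formulas above translate directly into a few lines of Python; this is a sketch in the slides' notation (alpha, f, g, d, p), with g = 1/16 used only as an example weight, not code from the D2 TCP implementation.

def update_alpha(alpha, f, g=1.0 / 16):
    # EWMA of the fraction f of packets CE-marked in the latest window of data.
    return (1 - g) * alpha + g * f

def penalty(alpha, d):
    # Gamma-correction penalty p = alpha ** d (d > 1 near deadline, d < 1 far deadline).
    return alpha ** d

alpha = update_alpha(0.0, f=0.25)   # some marks seen in this window
print(penalty(alpha, d=0.5))        # far deadline: larger p, back off more
print(penalty(alpha, d=2.0))        # near deadline: smaller p, back off less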


D2 TCP Adjusting Congestion Window

The congestion window W is adjusted as follows:

W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.

Note: larger p ⇒ smaller window.
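A minimal sketch of this adjustment rule, assuming W is measured in segments and p has already been computed as above:

def d2tcp_resize_window(W, p, f):
    # f > 0: some packets were CE-marked, so shrink in proportion to the penalty p.
    if f > 0:
        return W * (1 - p / 2)   # p = 1 halves the window; a small p barely shrinks it
    return W + 1                 # no marks: grow by one segment, as in TCP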


D2 TCP Basic Formulas

After determining p, we resize the congestion window W as follows:

W = W × (1 − p/2)   if f > 0

where p = α^d and d is the deadline imminence factor:

d = Tc / D

Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows;
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).


Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines.

[Figure: the penalty p = α^d plotted against α for d = 1, d < 1 (far deadline), and d > 1 (near deadline), together with the resize rule W := W × (1 − p/2).]

Key insight: near-deadline flows back off less while far-deadline flows back off more.

• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
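A quick numeric illustration of this key insight, using an example congestion estimate of α = 0.25 and a 100-segment window (values chosen only for illustration):

alpha, W = 0.25, 100.0
for label, d in [("far deadline (d = 0.5)", 0.5), ("no deadline (d = 1)", 1.0), ("near deadline (d = 2)", 2.0)]:
    p = alpha ** d
    print(label, "p =", round(p, 4), "new W =", round(W * (1 - p / 2), 1))
# far deadline:  p = 0.5    -> W = 75.0  (shrink the window)
# no deadline:   p = 0.25   -> W = 87.5  (DCTCP behavior)
# near deadline: p = 0.0625 -> W = 96.9  (retain most of the window)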


D2 TCP Computing α

[Figure: switch buffer; packets are accepted without marking below K and accepted with marking between K and Buffer_limit.]

Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

Sender: update once every RTT
α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data

α is calculated by aggregating ECN (like DCTCP): switches mark packets if queue_length > threshold, and the sender computes the fraction of marked packets averaged over time.


D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior.

[Figure: sawtooth congestion-window waves for deadline-agnostic behavior (similar to DCTCP) for the case Tc > L; the window drops from W to W/2 upon congestion detection (p = 1) and grows back over L round-trip times, and D marks the deadline.]

D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior

We want Tc ≤ D.

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d


B = [W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1)] × (Tc / L)

[Figure: sawtooth congestion-window waves for deadline-agnostic behavior (similar to DCTCP) for the case Tc > L; the window oscillates between W/2 and W, time is in RTTs, and D marks the deadline.]

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives:

B ≈ (0.75 W) × Tc, with W in bytes

Analysis continued on the next slide


D2 TCP Computing the deadline imminence factor d

[Figure: sawtooth congestion-window waves for deadline-agnostic behavior (similar to DCTCP) for the case Tc > L; the window oscillates between W/2 and W, time is in RTTs, and D marks the deadline.]

Tc = B / (0.75 W)   (approximation)

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc ≅ D), then d = 1 is appropriate.

It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore, we compute d as:

d = Tc / D
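Putting the approximation together, a small Python sketch of the deadline imminence factor; B and W are in bytes, and D must be expressed in the same time unit as Tc (round-trip times here), following the slides' notation:

def deadline_imminence(B, W, D):
    # Tc is approximately B / (0.75 * W) RTTs under the sawtooth approximation; d = Tc / D.
    Tc = B / (0.75 * W)
    return Tc / D

print(deadline_imminence(B=300_000, W=30_000, D=20))   # ~0.67: far deadline, back off more
print(deadline_imminence(B=300_000, W=30_000, D=10))   # ~1.33: near deadline, back off less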


D2 TCP the deadline imminence factor d

What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have:

[Figure: partial sawtooth wave for deadline-agnostic behavior (DCTCP) for the case Tc < L; the window grows from W/2 toward W over Tc round-trip times, and D marks the deadline.]

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value Tc can be computed. The value d is given by:

d = Tc / D
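For completeness, a small sketch that computes Tc by accumulating the growing window of the partial sawtooth, which covers the Tc < L case above (B and W are taken in segments here purely to keep the illustration simple):

def completion_time_partial_sawtooth(B, W):
    # Count RTTs, starting at a window of W/2 and growing by one segment per RTT,
    # until B segments have been sent.
    sent, window, rtts = 0, W / 2, 0
    while sent < B:
        sent += window
        window += 1
        rtts += 1
    return rtts

Tc = completion_time_partial_sawtooth(B=120, W=40)   # 120 segments left, current window of 40
print(Tc, Tc / 8.0)                                   # Tc in RTTs, and d for a deadline D = 8 RTTs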


D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner: when congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 7: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 759

Hewlett-Packard Development Company

Example HP Cloud Services

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 859

Example Amazon Data CentersAmazon data centers serve four regions in

the US and three regions in Europe and

Asia Another data center in the US was

opened July 2011 in the state of Oregon to

serve the Pacific Northwest region

In December 2011 Amazon announced it is

opening a data center in Sao Paulo Brazil

its first in South America

In November 2012 Amazon announced it

is adding a ninth region by opening a data

center in Sydney Australia

The data centers support all Amazon Web

Services (AWS) including Amazon Elastic

Compute Cloud (EC2) and Amazon

Simple Storage Service (S3)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 959

Example Amazon Data Centers

The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users

with the ability to execute their applications in Amazons computing environment

To use Amazon EC2 Create an Amazon Machine Image (AMI) containing all the software including

the operating system

Upload this AMI to the Amazon S3 (Amazon Simple Storage Service)

Register to get an AMI ID Use this AMI ID and the Amazon EC2 web service APIs to run monitor and

terminate as many instances of this AMI as required

EC2 Pricing Policy pay as you go no minimal fee The prices are based on theRegion in which the application instance is running

httpawsamazoncomec2pricing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1059

Data Center Services

Exampe Colocation Services of Cogent

httpwwwcogentcocomen

Cogent is a multinational Tier 1 Internet Service Provider

Companies can colocate their business critical equipment in one of

43 Cogents secure state-of-the-art data centers that connect directly

to a Tier-1 IP network The data centers have extensive powerbackup systems complete fire detection and suppression plans to

ensure the safety and security of equipment

Cogent Data Center Features

httpwwwcogentcocomenproducts-and-servicescolocation-

services

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1159

Colocation Data Centers and Cloud Servers

httpwwwdatacentermapcomdatacentershtml

httpwwwdatacentermapcomcloudhtml

Example AtlanticNet

httpwwwatlanticnetorlando-colocation-floridahtml

Orlando Data Center

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1259

Data Center TCP (DCTCP)

M Alizadehzy A Greenbergy D Maltzy J Padhyey P

Pately B Prabhakarz S Senguptay M Sridharan

983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161

ACM SIGCOMM September 2010

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1359

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1459

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

The Partition/Aggregate design pattern

[Figure: a request fans out from a top-level aggregator to lower-level aggregators and then to workers. Request latency budget: total deadline 250 ms, aggregator-level deadline 50 ms, worker-level deadline 10 ms.]

The total permissible latency for a request is limited, and the "backend" part of the application is typically allocated between 230 and 300 ms. This limit is called the all-up SLA.

Example: in web search, a query might be sent to many aggregators and workers, each responsible for a different part of the index. Based on the replies, an aggregator might refine the query and send it out again to improve the relevance of the result. Lagging instances of Partition/Aggregate can thus add up to threaten the all-up SLAs for queries.

A high-level aggregator (HLA) partitions queries to a large number of mid-level aggregators (MLAs), which in turn partition each query over the other servers in the same rack as the MLA. Servers act as both MLAs and workers, so each server will be acting as an aggregator for some queries and as a worker for other queries.

A TCP Incast Event

[Figure: an aggregator sends a query to workers 1-4; each worker returns a response, which the aggregator acknowledges. The response from worker 3 is lost due to incast and is retransmitted only after a timeout.]

Incast Collapse Summary

Incast scenario: packets from many flows arriving at the same port at the same time.

In other publications the incast scenario is referred to as the fan-in burst at the parent node. This incast is a key reason for increased network delay and occurs when all the children (e.g., workers at the leaf level) of a parent node face the same deadline and are likely to respond nearly at the same time, causing a fan-in burst at the parent node.

Performance impairments of Shallow-buffered Switches: 2. Queue Buildup

When long and short flows traverse the same queue, there is a queue buildup impairment: the short flows experience increased latency because they are queued behind packets from the large flows. Since every worker in the cluster handles both query traffic and background traffic (large flows needed to update the data structures on the workers), this traffic pattern occurs very frequently.

This indicates that query flows can experience queuing delays because of long-lived, greedy TCP flows. Further, answering a request can require multiple iterations, which magnifies the impact of this delay.

Performance impairments of Shallow-buffered Switches: 3. Buffer Pressure

Given the mix of long and short flows in a data center, it is very common for short flows on one port to be impacted by activity on other ports. The loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports.

The long, greedy TCP flows build up queues on their interfaces. Since the switch is shallow-buffered and the buffer space is a shared resource, the queue buildup reduces the amount of buffer space available to absorb bursts of traffic from the Partition/Aggregate traffic. This impairment is called buffer pressure. The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.

Flow Interactions in Shallow-buffered Switches

Incast scenario: multiple short flows converging on the same port.
Queue buildup: short and long flows on the same port.
Buffer pressure: short flows on one port and long flows on another port.

Legacy TCP Congestion Control

[Figure: congestion window (segments) versus round-trip times for legacy TCP, showing slow start up to ss_thresh = 16, congestion avoidance up to cwnd = 20, a time-out after a segment loss, and a later segment loss recovered by fast retransmit with ss_thresh = 10.]

Fast retransmission: ssthresh = cwnd/2 = cwnd × (1 - 0.5); cwnd = ssthresh.

The Need for a Data Center TCP

The data center environment is significantly different from wide area networks:

o Round-trip times (RTTs) can be less than 250 μs in the absence of queuing.
o Applications need extremely high bandwidths and very low latencies.
o There is little statistical multiplexing; a single flow can dominate a particular path.
o The network is largely homogeneous and under a single administrative control.
o Traffic flowing in switches is mostly internal. Connectivity to the external Internet is typically managed through load balancers and application proxies that effectively separate internal traffic from external traffic.

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long flows. The measurements by the authors reveal that 99.91% of traffic in the data center is TCP traffic. The traffic consists of query traffic (2 KB to 20 KB in size), delay-sensitive short messages (100 KB to 1 MB), and throughput-sensitive long flows (1 MB to 100 MB). These applications require three things from the data center network:

o low latency for short flows
o high burst tolerance
o high utilization for long flows

Because of the impairments of shallow-buffered commodity switches, legacy TCP protocols fall short of satisfying the above requirements.

See the paper for details of workload characterization in cloud data centers.

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput with commodity shallow-buffered switches.

DCTCP uses the concept of ECN (Explicit Congestion Notification).

DCTCP achieves these goals primarily by reacting to congestion in proportion to the extent of congestion.

DCTCP uses a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold.

The DCTCP source reacts by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor. This is different from standard TCP, which cuts its window size by a factor of 2 when it receives an ECN notification.

DCTCP - Simple Marking at the Switch

DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.

An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival.

Marking is based on the instantaneous value of the queue, not the average value as in RED routers.

The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.

The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.
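
As a rough illustration of the marking rule above, the following Python sketch models a single egress queue that sets the CE codepoint whenever the instantaneous occupancy exceeds K. The class name, packet representation, and the example values (K = 20 packets, limit = 100 packets) are illustrative assumptions, not values taken from the paper.

from collections import deque

class DctcpQueue:
    # Minimal model of a DCTCP-style egress queue (illustrative only).

    def __init__(self, k_threshold, buffer_limit):
        self.k = k_threshold        # marking threshold K (packets)
        self.limit = buffer_limit   # total buffer space (packets)
        self.queue = deque()

    def enqueue(self, pkt):
        if len(self.queue) >= self.limit:
            return "dropped"        # buffer exhausted
        # Mark on the instantaneous queue length, not an average (unlike RED).
        if len(self.queue) > self.k:
            pkt["ce"] = True        # set the Congestion Experienced codepoint
        self.queue.append(pkt)
        return "marked" if pkt.get("ce") else "accepted"

# Example: arrivals that find more than K packets already queued get CE-marked.
q = DctcpQueue(k_threshold=20, buffer_limit=100)
results = [q.enqueue({"seq": i, "ce": False}) for i in range(30)]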


DCTCP - ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.

For receivers that use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See the paper for details of the delayed ACK scheme.
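
For reference, the following Python sketch is one possible interpretation of that two-state logic: when the CE value of an arriving packet differs from the stored state, the receiver immediately ACKs the packets seen so far (echoing the old state); otherwise it follows the normal delayed-ACK rule. The class name and the parameter m are illustrative, and the details should be checked against the paper (or RFC 8257), which gives the authoritative state machine.

class DctcpReceiver:
    # Sketch of the two-state delayed-ACK ECE logic (an interpretation of the
    # scheme summarized above, not a verbatim transcription of the paper).

    def __init__(self, m=2):
        self.m = m              # ACK every m packets while the CE state is unchanged
        self.ce_state = False   # last observed CE value
        self.unacked = 0

    def on_packet(self, ce_marked):
        acks = []
        if ce_marked != self.ce_state and self.unacked > 0:
            # CE state changed: immediately ACK the packets received so far,
            # echoing the old CE state, so the sender sees the exact marked run.
            acks.append({"ece": self.ce_state, "covers": self.unacked})
            self.unacked = 0
        self.ce_state = ce_marked
        self.unacked += 1
        if self.unacked == self.m:
            # Normal delayed ACK, echoing the current CE state.
            acks.append({"ece": self.ce_state, "covers": self.unacked})
            self.unacked = 0
        return acks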

DCTCP - Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:

α = (1 - g) × α + g × F

where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K, and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.

Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
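
A minimal sketch of this sender-side bookkeeping is shown below; the gain g = 1/16 is just an example value, and the per-window accounting is simplified for illustration.

class DctcpSender:
    # Illustrative DCTCP sender state: EWMA of the marked fraction plus window cut.

    def __init__(self, cwnd, g=1.0 / 16):
        self.cwnd = cwnd    # congestion window (segments)
        self.alpha = 0.0    # running estimate of the marked fraction
        self.g = g          # EWMA gain, 0 < g < 1

    def on_window_acked(self, acked, ecn_echoed):
        # Call once per window of data (roughly once per RTT).
        f = ecn_echoed / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if ecn_echoed > 0:
            # Scale the cut by the congestion level instead of always halving.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1  # additive increase, as in standard TCP

# Example: light marking (F = 0.1) trims the window only slightly.
s = DctcpSender(cwnd=100)
s.on_window_acked(acked=100, ecn_echoed=10)   # alpha ~ 0.006, cwnd ~ 99.7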

RED Router

[Figure: RED marking profile. Packets are accepted below THmin, discarded or marked with increasing probability between THmin and THmax, and discarded above THmax (buffer capacity C).]

Update the value of the average queue size: avg = (1 - wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa;
    with probability Pa, discard or mark the packet;
    otherwise, with probability 1 - Pa, accept the packet
else if (avg > THmax) discard packet

DCTCP Switch

[Figure: DCTCP marking profile. Packets are accepted without marking up to the threshold K, accepted and marked between K and the buffer limit, and discarded beyond the limit.]

if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender
Update α = (1 - g) × α + g × F
Reaction to a marked ACK in a new window: ssthresh = cwnd × (1 - α/2); cwnd = ssthresh

Legacy TCP Sender
Reaction to a marked ACK in a new window: ssthresh = cwnd/2; cwnd = ssthresh
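
To make the difference between the two sender reactions concrete, here is a small worked comparison under the formulas above (the numbers are illustrative only):

def dctcp_cut(cwnd, alpha):
    return cwnd * (1 - alpha / 2)

def legacy_tcp_cut(cwnd):
    return cwnd / 2

cwnd = 100  # segments
for alpha in (0.05, 0.5, 1.0):
    print(f"alpha={alpha:.2f}: DCTCP -> {dctcp_cut(cwnd, alpha):.1f}, "
          f"legacy TCP -> {legacy_tcp_cut(cwnd):.1f}")
# alpha=0.05: DCTCP -> 97.5, legacy TCP -> 50.0
# alpha=0.50: DCTCP -> 75.0, legacy TCP -> 50.0
# alpha=1.00: DCTCP -> 50.0, legacy TCP -> 50.0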

Benefits of DCTCP

Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.

Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.

Benefits of DCTCP (continued)

Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even one packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much that DCTCP (or any congestion control scheme) can do to avoid packet drops.

However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on the instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.

DCTCP Performance

The paper has more details on:

o Guidelines for choosing parameters and estimating gain
o An analytical model for the steady-state behavior of DCTCP
o The benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP
o Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation

D3 TCP

"Better Never Than Late: Meeting Deadlines in Datacenter Networks"
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
ACM SIGCOMM, August 2011

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.

Unfortunately, DCTCP is a deadline-agnostic protocol that throttles all flows equally, irrespective of whether their deadlines are near or far.

Rule: a flow is useful if and only if it satisfies its deadline.

D3 TCP: Basic Idea of Deadline Awareness

[Figure: two flows (f1, f2) with different deadlines (d1, d2) under DCTCP and under D3 TCP; the thickness of a flow line represents the rate allocated to it.]

DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines.

D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.

Challenges

Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.

Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.

Most flows are very short (< 50 KB) and RTTs are minimal (about 300 μs). Consequently, reaction time scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.

Basic Design Idea

D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic into the network.

D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.

D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.

Details of the D3 TCP scheme can be found in the paper posted on Webcourses.

D2 TCP

"Deadline-Aware Datacenter TCP"
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012

Pros and Cons of DCTCP and D3 TCP

Results reported in the D3 TCP work show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:

o does not handle fan-in bursts well
o introduces priority inversion at fan-in bursts (see next slide)
o does not co-exist with TCP
o requires custom silicon (i.e., switches)

Priority Inversion in D3 TCP

[Figure: bandwidth requests arriving at a switch. The switch grants requests FCFS, so a request with a far deadline that arrives slightly earlier is granted while a request with a near deadline is paused.]

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24-33% of requests.

D2 TCP's Contributions

Deadline-aware and handles fan-in bursts well.
Elegant: uses gamma correction for congestion avoidance (far-deadline flows back off more, near-deadline flows back off less).
Reactive and decentralized.
Does not hinder long-lived (non-deadline) flows.
Coexists with TCP, so it is incrementally deployable.
No change to switch hardware, so it is deployable today.

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.

OnLine Data Intensive (OLDI) Applications

OLDI applications operate under soft real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.

Example: a typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting the resources consumed in creating parts of a page that a user never sees.

OLDI Applications

OLDI applications employ tree-based divide-and-conquer algorithms where every query operates on data spanning thousands of servers.

[Figure: a root node fans a user query out to parent nodes, which fan it out to leaf nodes; the OLDI response returns to the user in roughly 250 ms.]

Features:
o Deadline bound
o Handle large data
o Partition-aggregate pattern
o Tree-like structure
o Deadline budget split: total = 300 ms; parent-to-leaf RPC = 50 ms
o Missed deadlines lead to incomplete responses
o Incomplete responses affect user experience and revenue

D2 TCP

Deadline-aware and handles fan-in bursts.

Key idea: vary the sending rate based on both the deadline and the extent of congestion.

Built on top of DCTCP. Distributed: uses per-flow state at end hosts. Reactive: senders react to congestion, with no knowledge of other flows.

D2 TCP: Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average that quantitatively measures the extent of congestion:

α = (1 - g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past.

We now define d as the deadline imminence factor; a larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma correction.

D2 TCP: Adjusting the Congestion Window

The congestion window W is adjusted as follows:

W = W × (1 - p/2)   if f > 0 (some packets marked)
W = W + 1           if f = 0 (no packets marked)

o When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
o When all packets are CE-marked (the case of heavy congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
o For p between 0 and 1, the window size is modulated by p.

Note: the larger p is, the smaller the window becomes.
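
A minimal sketch of the gamma-corrected update described above, assuming α has already been computed as in DCTCP; the example values of α, d, and f are illustrative only.

def d2tcp_window_update(cwnd, alpha, d, f):
    # Gamma-corrected window update: p = alpha**d, then a p-scaled backoff.
    #   cwnd  - current congestion window (segments)
    #   alpha - EWMA of the marked fraction (0..1)
    #   d     - deadline imminence factor (d > 1 near deadline, d < 1 far, d = 1 none)
    #   f     - fraction of packets marked in the latest window
    if f == 0:
        return cwnd + 1                 # no congestion: additive increase
    p = alpha ** d                      # near-deadline flows get a smaller penalty
    return max(1.0, cwnd * (1 - p / 2))

# Example with alpha = 0.36: a near-deadline flow (d = 2) backs off less
# than a far-deadline flow (d = 0.5).
for d in (0.5, 1.0, 2.0):
    print(d, d2tcp_window_update(cwnd=100, alpha=0.36, d=d, f=0.3))
# d = 0.5: 100 * (1 - 0.60/2)   = 70.0
# d = 1.0: 100 * (1 - 0.36/2)   = 82.0
# d = 2.0: 100 * (1 - 0.1296/2) = 93.5 (approximately)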

D2 TCP: Basic Formulas

After determining p, we resize the congestion window W as follows:

p = α^d
W = W × (1 - p/2)   if f > 0

where d is the deadline imminence factor:

d = Tc / D

Tc = the flow completion time achieved with the current sending rate
D  = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows; d = 1 for long flows that do not specify deadlines (in this case D2 TCP behaves like DCTCP).

Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines.

[Figure: the penalty p = α^d plotted against α for d = 1, d < 1 (far deadline), and d > 1 (near deadline), together with the update W = W × (1 - p/2).]

Key insight: near-deadline flows back off less, while far-deadline flows back off more.

p = α^d
o d < 1 gives p > α for far-deadline flows: p is large, so the window shrinks.
o d > 1 gives p < α for near-deadline flows: p is small, so the window is retained.
o d = 1 gives p = α for long-lived flows: DCTCP behavior.

D2 TCP: Computing α

α is calculated by aggregating ECN marks, exactly as in DCTCP: switches mark packets when the instantaneous queue length exceeds the threshold K, and the sender computes the fraction of marked packets, averaged over time.

[Figure: a switch buffer with the marking threshold K and the Buffer_limit.]

Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

Sender (update once every RTT):
α = (1 - g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data.

D2 TCP: Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc needed to complete transmitting the message (flow), D2 TCP assumes a pessimistic, sawtooth-shaped, deadline-agnostic congestion behavior (similar to DCTCP): the window drops from W to W/2 upon congestion detection and then grows back linearly.

[Figure: sawtooth waves of the congestion window between W/2 and W over time, with L RTTs per wave; the case shown has Tc > L, and the deadline D is marked on the time axis.]

D  = the time remaining until the deadline expires
W  = the flow's current window size
B  = the bytes remaining to fully transmit the message
Tc = the time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior; we want Tc ≤ D.

The analysis continues on the next slide.

D2 TCP: Computing the deadline imminence factor d (continued)

[Figure: sawtooth waves of the congestion window between W/2 and W (similar to DCTCP), with Tc > L and the deadline D marked on the time axis.]

Over one full sawtooth wave the window grows from W/2 back up to W over L round-trip times, and Tc/L is the number of sawtooth waves needed to complete transmitting the message. Hence:

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ... + (W/2 + L - 1) ] × (Tc / L)

Since the value of B is known by the application, and L - 1 = W/2 for the sawtooth pattern, the value of Tc can be computed.

An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives:

B = (0.75 × W) × Tc, with W in bytes.

The analysis continues on the next slide.

D2 TCP: Computing the deadline imminence factor d (continued)

[Figure: the same sawtooth pattern, with Tc > L and the deadline D marked on the time axis.]

Tc is the time needed for a flow to complete transmitting all of its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.

It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as:

d = Tc / D,   with Tc ≈ B / (0.75 × W) under the approximation above.

D2 TCP: the deadline imminence factor d when Tc < L

What if Tc < L? In this case only a partial sawtooth is needed, as shown in the figure, and we have:

B = W/2 + (W/2 + 1) + (W/2 + 2) + ... + (W/2 + Tc - 1)

[Figure: a partial sawtooth wave (Tc < L) of the congestion window between W/2 and W.]

Since the value of B is known by the application, the value of Tc can be computed. The value of d is then given by:

d = Tc / D
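
Putting the two cases together, the sketch below shows how a sender might estimate Tc and then d from B, W, and D. The MSS-based unit handling, the function name, and the example numbers are assumptions made for illustration; they are not part of the original derivation.

def deadline_imminence(bytes_remaining, cwnd_bytes, mss, deadline_rtts):
    # Estimate d = Tc / D under the sawtooth model sketched above.
    #   bytes_remaining - B, bytes left in the message
    #   cwnd_bytes      - W, current window size in bytes
    #   mss             - segment size in bytes (assumed unit of window growth)
    #   deadline_rtts   - D, time to the deadline, expressed in RTTs
    L = cwnd_bytes / (2 * mss) + 1      # RTTs per sawtooth wave (L - 1 = W/2 segments)

    # Exact partial-sawtooth case: accumulate one RTT at a time until B is sent.
    sent, rtts = 0.0, 0
    while sent < bytes_remaining and rtts < L:
        sent += cwnd_bytes / 2 + rtts * mss
        rtts += 1
    if sent >= bytes_remaining:
        tc = rtts                                    # finished within one partial wave
    else:
        tc = bytes_remaining / (0.75 * cwnd_bytes)   # average-window approximation

    return tc / deadline_rtts           # d > 1 indicates a tight (near) deadline

# Example (illustrative numbers): 600 KB left, 100 KB window, 10 RTTs to deadline.
d = deadline_imminence(600_000, 100_000, mss=1_500, deadline_rtts=10)   # d ~ 1.1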

D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.


8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1359

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1459

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines.

p = α^d

W := W × (1 − p/2)

[Figure: the penalty p plotted against α, with curves for d < 1 (far deadline), d = 1, and d > 1 (near deadline)]

Key insight: Near-deadline flows back off less while far-deadline flows back off more.

• d < 1 → p > α for far-deadline flows; p is large → shrink window
• d > 1 → p < α for near-deadline flows; p is small → retain window
• d = 1 → p = α for long-lived flows → DCTCP behavior
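A hedged numeric illustration (values chosen for exposition, not taken from the paper): with α = 0.5, a far-deadline flow with d = 0.5 gets p ≈ 0.71 and multiplies its window by 1 − p/2 ≈ 0.65, a no-deadline flow (d = 1) gets the DCTCP factor 0.75, and a near-deadline flow with d = 2 gets p = 0.25 and keeps 0.875 of its window:

# Gamma-correction penalty for one value of alpha and several deadline factors.
alpha = 0.5
for d in (0.5, 1.0, 2.0):   # far-deadline, no-deadline, near-deadline
    p = alpha ** d
    print(f"d={d}: p={p:.2f}, window multiplier={1 - p / 2:.3f}")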


D2 TCP Computing α

α is calculated by aggregating ECN marks (like DCTCP):
• Switches mark packets if queue_length > threshold
• The sender computes the fraction of marked packets, averaged over time

[Switch buffer diagram: packets are accepted without marking while the queue is below K, accepted and marked between K and Buffer_limit, and discarded beyond Buffer_limit]

Switch:
if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit)
    discard packet

Sender (update once every RTT):
α = (1 − g) × α + g × f,   where f is the fraction of packets that were marked in the latest window of data
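A runnable sketch of the switch-side rule above (the parameter names K and buffer_limit are illustrative; in practice the same effect can be obtained by re-purposing a switch's RED thresholds, as noted for DCTCP earlier in these slides):

def switch_action(q: int, K: int, buffer_limit: int) -> str:
    # Decide the fate of an arriving packet from the instantaneous queue length q.
    if q <= K:
        return "accept"         # below the marking threshold: no congestion signal
    if q <= buffer_limit:
        return "accept+mark"    # set the CE codepoint to signal congestion
    return "drop"               # buffer exhausted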


D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior.

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); the window drops W → W/2 upon congestion detection (as if p = 1) and then grows back, with Tc > L and the deadline D marked on the time axis]

D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide


B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); each wave lasts L RTTs, with the window growing from W/2 back to W, repeated until the message completes at Tc, before the deadline D]

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., assuming Tc is an integer multiple of L). This gives

B = (0.75) W × Tc   (B in bytes)

Analysis continued on the next slide


[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), as on the previous slide, with Tc > L and the deadline D]

It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.

B = (0.75) W × Tc   (approximation)
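A minimal sketch of this computation under the 0.75·W approximation; the function names are illustrative, and treating B and W in segments while converting RTTs to seconds is an assumption made here for exposition:

def deadline_imminence(B: float, W: float, rtt: float, D: float) -> float:
    # B: segments remaining, W: current window in segments,
    # rtt: round-trip time in seconds, D: time remaining until the deadline in seconds.
    Tc = (B / (0.75 * W)) * rtt    # completion time under the sawtooth approximation
    return Tc / D                  # d > 1 signals a tight (near) deadline

def penalty(alpha: float, d: float) -> float:
    return alpha ** d              # gamma-corrected penalty, p = alpha^d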


What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have

[Figure: partial sawtooth for deadline-agnostic behavior (DCTCP); the window grows from W/2 for only Tc RTTs (Tc < L) before the flow completes]

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value Tc can be computed. The value of d is given by

d = Tc / D
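For completeness, a sketch (illustrative assumptions: B and W in segments, the window grows by one segment per RTT) that finds the smallest Tc satisfying the partial-sawtooth sum above and then forms d = Tc/D:

def completion_rtts(B: float, W: float) -> int:
    # Accumulate W/2 + (W/2 + 1) + ... until B segments have been covered.
    sent, win, tc = 0.0, W / 2.0, 0
    while sent < B:
        sent += win
        win += 1.0
        tc += 1
    return tc

def imminence(B: float, W: float, rtt: float, D: float) -> float:
    return (completion_rtts(B, W) * rtt) / D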


D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.

D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 10: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1059

Data Center Services

Exampe Colocation Services of Cogent

httpwwwcogentcocomen

Cogent is a multinational Tier 1 Internet Service Provider

Companies can colocate their business critical equipment in one of

43 Cogents secure state-of-the-art data centers that connect directly

to a Tier-1 IP network The data centers have extensive powerbackup systems complete fire detection and suppression plans to

ensure the safety and security of equipment

Cogent Data Center Features

httpwwwcogentcocomenproducts-and-servicescolocation-

services

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1159

Colocation Data Centers and Cloud Servers

httpwwwdatacentermapcomdatacentershtml

httpwwwdatacentermapcomcloudhtml

Example AtlanticNet

httpwwwatlanticnetorlando-colocation-floridahtml

Orlando Data Center

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1259

Data Center TCP (DCTCP)

M Alizadehzy A Greenbergy D Maltzy J Padhyey P

Pately B Prabhakarz S Senguptay M Sridharan

983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161

ACM SIGCOMM September 2010

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1359

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1459

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th-percentile of the network latency by 29%.

Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.

Rule: a flow is useful if and only if it satisfies its deadline.


D3 TCP: Basic Idea of Deadline Awareness

[Figure: two panels, DCTCP vs. D3 TCP, each showing flows f1 and f2 over time against their deadlines d1 and d2; the thickness of a flow line represents the rate allocated to it]

Two flows (f1, f2) with different deadlines (d1, d2). The thickness of a flow line represents the rate allocated to it.

DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines.

D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.


Challenges

Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.

Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.

Most flows are very short (<50KB) and RTTs are minimal (300 microseconds). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.


Basic Design Idea

D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic into the network.

D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination.

Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.

D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.

Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
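As a rough illustration of the greedy allocation idea only (this is not the actual D3 TCP request/grant protocol; the flow sizes, deadlines, and link capacity below are made up), a router could grant each flow the rate it needs to finish by its deadline, in the order the requests arrive, until the capacity is used up:

def greedy_allocate(requests, capacity):
    # requests: list of (flow_id, bytes_left, seconds_to_deadline), in arrival order
    grants, remaining = {}, capacity
    for flow_id, bytes_left, seconds_to_deadline in requests:
        desired = bytes_left / seconds_to_deadline   # rate that just meets the deadline
        grants[flow_id] = min(desired, remaining)
        remaining -= grants[flow_id]
    return grants

# Two flows sharing a 1.25e9 B/s (10 Gbps) link, served in arrival order.
print(greedy_allocate([("f1", 5_000_000, 0.100), ("f2", 5_000_000, 0.010)], capacity=1.25e9))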


D2 TCP

B. Vamanan, J. Hasan, T. Vijaykumar

Purdue University & Google Inc.

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP


Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in & tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:

does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (i.e., switches)


Priority Inversion in D3 TCP

[Figure: bandwidth requests arriving at a switch that grants requests FCFS; the legend distinguishes requests with near deadlines from requests with far deadlines, and paused requests from granted requests]

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
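The race can be made concrete with a toy per-RTT simulation (my own construction, not D3 TCP's actual mechanism; capacities, sizes, and deadlines are invented). Each flow re-requests remaining_bytes / rtts_left every RTT, and the switch grants requests in a fixed order until capacity runs out; serving a far-deadline request just ahead of a near-deadline one makes the near-deadline flow miss:

def simulate(order, flows, capacity, horizon):
    remaining = {f: flows[f]["size"] for f in flows}
    missed = []
    for t in range(1, horizon + 1):
        spare = capacity
        for f in order:                      # grant in this fixed order (FCFS-like)
            if remaining[f] <= 0:
                continue
            rtts_left = max(flows[f]["deadline"] - t + 1, 1)
            granted = min(remaining[f] / rtts_left, spare)
            spare -= granted
            remaining[f] -= granted
        missed += [f for f in flows if t == flows[f]["deadline"] and remaining[f] > 1e-9]
    return missed

flows = {"far": {"size": 20, "deadline": 10},   # plenty of time
         "near": {"size": 18, "deadline": 2}}   # tight deadline
print(simulate(["far", "near"], flows, capacity=10, horizon=10))  # -> ['near'] missed
print(simulate(["near", "far"], flows, capacity=10, horizon=10))  # -> [] both meet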


D2 TCP's Contributions

Deadline-aware and handles fan-in bursts well
Elegant: uses gamma-correction for congestion avoidance
(far-deadline → back off more; near-deadline → back off less)
Reactive, decentralized
Does not hinder long-lived (non-deadline) flows
Coexists with TCP → incrementally deployable
No change to switch hardware → deployable today

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.


OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.

Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.


OLDI Applications

OLDI applications employ tree-based divide-and-conquer algorithms where every query operates on data spanning thousands of servers.

[Figure: a root node fanning out to parent nodes, which fan out to leaf nodes; a user query enters at the root and the OLDI response returns in ≈ 250 ms]

Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue


D2 TCP

Deadline-aware and handles fan-in bursts.

Key Idea: Vary the sending rate based on both the deadline and the extent of congestion.

Built on top of DCTCP. Distributed: uses per-flow state at end hosts.

Reactive: senders react to congestion. No knowledge of other flows.


D2 TCP: Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average that quantitatively measures the extent of congestion:

α = (1 − g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.

We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α is a fraction ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction.
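A two-line numerical sketch of the penalty (illustrative α and d values only) shows how the same congestion level is weighted by deadline imminence:

alpha = 0.4                                   # estimated fraction of marked packets
for label, d in (("far (d=0.5)", 0.5), ("none (d=1.0)", 1.0), ("near (d=2.0)", 2.0)):
    p = alpha ** d                            # gamma-correction penalty
    print(f"{label}: p = {p:.2f}, window scaled by {1 - p / 2:.2f}")

A far deadline (d < 1) inflates the penalty; a near deadline (d > 1) deflates it.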


D2 TCP: Adjusting the Congestion Window

The congestion window W is adjusted as follows:

W = W × (1 − p/2),    f > 0 (case of packets marked)
W = W + 1,            f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For values of α between 0 and 1, the window size is modulated by the penalty p.

Note: Larger p ⇒ smaller window.
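The whole update rule fits in a few lines of Python; this is a sketch of the slide's formulas (window in segments, parameter values invented), not the authors' implementation:

def d2tcp_window_update(w, alpha, d, f):
    # f: fraction of packets marked in the latest window; alpha: EWMA of f; d: imminence factor
    if f > 0:
        p = alpha ** d                # gamma-correction penalty
        return w * (1 - p / 2)        # back off in proportion to p
    return w + 1                      # no marks: grow by one segment, as in TCP

w = 40.0
print(d2tcp_window_update(w, 0.5, 0.5, f=0.3))   # far deadline: larger cut
print(d2tcp_window_update(w, 0.5, 2.0, f=0.3))   # near deadline: smaller cut
print(d2tcp_window_update(w, 0.5, 1.0, f=0.0))   # no marks: additive increase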


D2 TCP: Basic Formulas

After determining p, we resize the congestion window W as follows:

W = W × (1 − p/2),    f > 0

where p = α^d (the gamma-correction function) and d = the deadline imminence factor:

d = Tc / D

Tc = the flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)


Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines:

p = α^d

[Figure: plot of the penalty p versus α for d < 1 (far deadline), d = 1, and d > 1 (near deadline); both axes run from 0 to 1.0]

W := W × (1 − p/2)

Key insight: Near-deadline flows back off less, while far-deadline flows back off more.

• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior


D2 TCP: Computing α

α is calculated by aggregating ECN marks (like DCTCP): switches mark packets if queue_length > threshold, and the sender computes the fraction of marked packets, averaged over time.

Switch (buffer with marking threshold K; packets are accepted without marking below K, and accepted with marking between K and Buffer_limit):

if (q ≤ K) accept packet without marking
else if ( K < q ≤ Buffer_limit ) accept and mark packet
else if ( q > Buffer_limit ) discard packet

Sender (update once every RTT):

α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data
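A compact sketch of the two halves above (the threshold, buffer limit, and g are illustrative values, not the paper's recommended settings):

K, BUFFER_LIMIT, G = 20, 100, 1 / 16      # marking threshold, buffer limit (packets), EWMA weight g

def switch_action(q):
    # Marking decision based on the instantaneous queue length q
    if q <= K:
        return "accept"
    return "accept+mark" if q <= BUFFER_LIMIT else "drop"

def update_alpha(alpha, f):
    # Once-per-RTT sender update: alpha = (1 - g)*alpha + g*f
    return (1 - G) * alpha + G * f

alpha = 0.0
for f in (0.0, 0.2, 0.6, 1.0):            # marked fractions seen in successive windows
    alpha = update_alpha(alpha, f)
print(round(alpha, 3), switch_action(5), switch_action(50), switch_action(150))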


D2 TCP: Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection (p = 1).

[Figure: sawtooth congestion-window waves for the case Tc > L, oscillating between W/2 and W over time, with the deadline D marked]

D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = the time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide.


B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L),    Tc > L

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

[Figure: the sawtooth window pattern for Tc > L, oscillating between W/2 and W, with time measured in RTTs and the deadline D marked]

Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value of Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives

B = (0.75 W) × Tc    (B in bytes), i.e., Tc = B / (0.75 W)

Analysis continued on the next slide.
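The two estimates of Tc can be written out directly; the sketch below is mine (with B expressed in the same units as the per-RTT window, and Tc rounded up to whole sawtooth waves, as the note above implies), comparing the exact sum with the 0.75·W approximation:

def tc_exact(B, W):
    # One wave lasts L = W/2 + 1 RTTs (since L - 1 = W/2) and carries W/2 + (W/2+1) + ... + W
    L = W // 2 + 1
    per_wave = sum(W // 2 + i for i in range(L))
    waves = -(-B // per_wave)             # ceiling division: whole waves needed
    return waves * L                      # Tc in RTTs

def tc_approx(B, W):
    # Average window of 0.75*W per RTT
    return B / (0.75 * W)

B, W = 3000, 40
print(tc_exact(B, W), round(tc_approx(B, W), 1))   # e.g. 105 vs 100.0 RTTs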


[Figure: the same sawtooth window pattern for Tc > L, with W, W/2, and the deadline D shown on a time axis in RTTs]

It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,    with Tc = B / (0.75 W) as the approximation

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.


D2 TCP: the deadline imminence factor d

What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have

[Figure: a partial sawtooth wave for the case Tc < L; the window grows from W/2 and the transfer completes before the wave ends]

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value of Tc can be computed. The value of d is given by

d = Tc / D
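Putting both cases together, one can walk the sawtooth one RTT at a time instead of in whole waves (a simplification of the two formulas above, not the paper's reference code; all times are in RTTs and B is in window units):

def deadline_imminence(B, W, D_rtts):
    # Returns d = Tc / D; d < 1 means far deadline, d > 1 means near deadline
    L = W // 2 + 1
    sent, rtts = 0, 0
    while sent < B:
        sent += W // 2 + (rtts % L)       # window during this RTT of the current wave
        rtts += 1                         # rtts ends up as Tc
    return rtts / D_rtts

print(deadline_imminence(B=300, W=40, D_rtts=8))    # Tc = 12 RTTs, D = 8  -> d = 1.5 (near)
print(deadline_imminence(B=300, W=40, D_rtts=24))   # Tc = 12 RTTs, D = 24 -> d = 0.5 (far)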


D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 11: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1159

Colocation Data Centers and Cloud Servers

httpwwwdatacentermapcomdatacentershtml

httpwwwdatacentermapcomcloudhtml

Example AtlanticNet

httpwwwatlanticnetorlando-colocation-floridahtml

Orlando Data Center

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1259

Data Center TCP (DCTCP)

M Alizadehzy A Greenbergy D Maltzy J Padhyey P

Pately B Prabhakarz S Senguptay M Sridharan

983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161

ACM SIGCOMM September 2010

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1359

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1459

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 12: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1259

Data Center TCP (DCTCP)

M Alizadehzy A Greenbergy D Maltzy J Padhyey P

Pately B Prabhakarz S Senguptay M Sridharan

983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161

ACM SIGCOMM September 2010

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1359

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1459

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification


DCTCP - Simple Marking at the Switch
DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.
An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival.
Marking is based on the instantaneous value of the queue, not the average value as in RED routers.
The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.
The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.
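For illustration, a rough sketch of the per-packet marking decision, assuming a switch that exposes its instantaneous queue length; the function and parameter names are ours, not from the paper:

def dctcp_mark(queue_len, K, buffer_limit, packet):
    # Instantaneous-queue marking: no averaging, a single threshold K.
    # (The same behavior can be approximated by configuring RED with
    #  low threshold = high threshold = K on the instantaneous queue.)
    if queue_len > buffer_limit:
        return "drop"            # buffer exhausted
    if queue_len > K:
        packet["CE"] = 1         # set the Congestion Experienced codepoint
    return "accept"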


DCTCP - ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.
When delayed ACKs are used (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See the paper for details of the delayed ACK scheme; a rough sketch of the idea follows.
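The following Python sketch is only our rough reading of that idea (the exact two-state machine is specified in the DCTCP paper): when the CE marking of an arriving packet differs from that of the previous one, the receiver immediately ACKs the packets seen so far with the old ECN-Echo value, so the marked/unmarked boundary is conveyed exactly even with delayed ACKs.

class DctcpReceiver:
    def __init__(self, m=2):
        self.m = m              # delayed-ACK factor (assumed value)
        self.last_ce = 0        # CE state of the previous packet
        self.unacked = 0        # packets received since the last ACK

    def on_packet(self, ce_marked):
        acks = []
        if ce_marked != self.last_ce and self.unacked > 0:
            # CE state changed: flush an immediate ACK carrying the old state.
            acks.append({"ece": self.last_ce, "count": self.unacked})
            self.unacked = 0
        self.last_ce = ce_marked
        self.unacked += 1
        if self.unacked == self.m:
            # Normal delayed ACK, echoing the current CE state.
            acks.append({"ece": self.last_ce, "count": self.unacked})
            self.unacked = 0
        return acks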


DCTCP - Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:
α = (1 - g) × α + g × F
where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K, and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.
Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
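As a quick worked example (the weight g = 1/16 here is just an illustrative choice, not a value taken from these slides): starting from α = 0.2, a window in which every packet was marked (F = 1) gives
α = (1 - 1/16) × 0.2 + (1/16) × 1 = 0.1875 + 0.0625 = 0.25
so sustained marking moves α gradually toward 1, while unmarked windows (F = 0) decay it back toward 0.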


RED Router
(Diagram: queue occupancy from 0 to capacity C; packets are accepted below THmin, discarded or marked with increasing probability between THmin and THmax, and discarded above THmax.)
RED router pseudocode:
  Update the value of the average queue size: avg = (1 - wq) × avg + wq × q
  if (avg < THmin) accept packet
  else if (THmin ≤ avg ≤ THmax)
      calculate probability Pa;
      with probability Pa, discard or mark the packet;
      otherwise (with probability 1 - Pa), accept the packet
  else if (avg > THmax) discard packet

DCTCP Switch
(Diagram: queue occupancy up to the buffer limit; packets are accepted without marking below K and accepted with marking between K and the limit.)
DCTCP switch pseudocode:
  if (q ≤ K) accept packet
  else if (K < q ≤ limit) accept and mark packet
  else if (q > limit) discard packet
DCTCP Sender
  Update α = (1 - g) × α + g × F
  Reaction to marked ACKs in a new window: ssthresh = cwnd × (1 - α/2); cwnd = ssthresh
Legacy TCP Sender
  Reaction to marked ACKs in a new window: ssthresh = cwnd/2; cwnd = ssthresh
A runnable sketch contrasting the two sender reactions is given below.
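The following minimal Python sketch (illustrative only; the function names and the fixed g value are ours) contrasts how a DCTCP sender and a legacy ECN-capable TCP sender react to the marks echoed over one window of data:

def dctcp_window_update(cwnd, alpha, frac_marked, g=1.0/16):
    # Update the running estimate of the fraction of marked packets.
    # g is a configuration parameter; 1/16 is just an illustrative default.
    alpha = (1 - g) * alpha + g * frac_marked
    if frac_marked > 0:
        # Cut the window in proportion to the estimated level of congestion.
        cwnd = cwnd * (1 - alpha / 2)
    else:
        cwnd += 1                      # additive increase, one segment per RTT
    return cwnd, alpha

def legacy_window_update(cwnd, frac_marked):
    # Any ECN mark in the window halves cwnd, regardless of how many.
    return cwnd / 2 if frac_marked > 0 else cwnd + 1

# Example: 10% of the packets in the last window were marked.
print(dctcp_window_update(cwnd=100, alpha=0.0, frac_marked=0.1))  # gentle cut
print(legacy_window_update(cwnd=100, frac_marked=0.1))            # cut to 50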


Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.


Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.
However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.


DCTCP Performance
The paper has more details on:
o guidelines for choosing parameters and estimating gain
o an analytical model for the steady-state behavior of DCTCP
o the benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP
o results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation


D3 TCP
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
"Better Never Than Late: Meeting Deadlines in Datacenter Networks"
Microsoft Research, ACM SIGCOMM, August 2011


Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.


D3 TCP Basic Idea of Deadline Awareness
(Figure: two flow-vs-time charts, one for DCTCP and one for D3 TCP, showing flows f1 and f2 with deadlines d1 and d2; the thickness of a flow line represents the rate allocated to it.)
Two flows (f1, f2) with different deadlines (d1, d2). The thickness of a flow line represents the rate allocated to it. DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines. D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.


Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (< 50KB) and RTTs are minimal (~300 μs). Consequently, reaction time-scales are short, and centralized, heavy-weight (complex) mechanisms to reserve bandwidth for flows are impractical.


Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines. A simplified sketch of the rate-request idea follows.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
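As a rough illustration of the request side only (the router-side greedy allocation is more involved; see the D3 paper), an end host could ask for just enough bandwidth to finish its remaining bytes before the deadline. The function below is our own simplification, not code from the paper:

def d3_rate_request(remaining_bytes, time_to_deadline_s, base_rate=0.0):
    # A deadline flow asks for at least remaining_bytes / remaining_time;
    # flows without a deadline (time_to_deadline_s is None) ask only for a
    # base rate and take whatever spare capacity the routers hand out.
    if time_to_deadline_s is None or time_to_deadline_s <= 0:
        return base_rate
    return max(base_rate, remaining_bytes / time_to_deadline_s)

# Example: 500 KB left and 20 ms to the deadline -> 25 MB/s requested.
print(d3_rate_request(500e3, 0.020))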


D2 TCP
B. Vamanan, J. Hasan, T. Vijaykumar
"Deadline-Aware Datacenter TCP"
Purdue University & Google Inc., ACM SIGCOMM, August 2012


Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
o does not handle fan-in bursts well
o introduces priority inversion at fan-in bursts (see next slide)
o does not co-exist with TCP
o requires custom silicon (i.e., switches)


Priority Inversion in D3 TCP
(Figure: bandwidth requests arriving at a switch that grants requests FCFS; a request with a far deadline arriving slightly earlier is granted, while a request with a near deadline is paused.)
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.


D2 TCP's Contributions
o Deadline-aware and handles fan-in bursts well
o Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
o Reactive, decentralized
o Does not hinder long-lived (non-deadline) flows
o Coexists with TCP → incrementally deployable
o No change to switch hardware → deployable today
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively.


OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components, generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.


OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.
Features:
o Deadline bound
o Handle large data
o Partition-aggregate pattern
o Tree-like structure
o Deadline budget split: total = 300 ms; parent-to-leaf RPC = 50 ms
o Missed deadlines → incomplete responses
o Affect user experience & revenue
(Figure: a root node fans out a user query to parent nodes, which fan out to leaf nodes; the OLDI response returns back up the tree within ~250 ms.)


D2 TCP
Deadline-aware and handles fan-in bursts.
Key Idea: vary the sending rate based on both the deadline and the extent of congestion.
o Built on top of DCTCP
o Distributed: uses per-flow state at end hosts
o Reactive: senders react to congestion; no knowledge of other flows


D2 TCP Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as gamma-correction.


D2 TCP Adjusting Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 - p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
o When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
o When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
o For p between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.


D2 TCP Basic Formulas
After determining p, we resize the congestion window W as follows:
W = W × (1 - p/2),   f > 0,   where p = α^d
and d is the deadline imminence factor:
d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows;
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).
A short sketch of the resulting window update is given below.
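A minimal Python sketch of the gamma-corrected window update (our own illustration; variable names are not from the slides):

def d2tcp_window_update(cwnd, alpha, d, frac_marked):
    # alpha: EWMA of the fraction of CE-marked packets (0..1)
    # d: deadline imminence factor, d = Tc / D (d = 1 for no-deadline flows)
    if frac_marked == 0:
        return cwnd + 1              # no congestion: additive increase
    p = alpha ** d                   # gamma-correction penalty
    return cwnd * (1 - p / 2)

# Example with moderate congestion (alpha = 0.5):
print(d2tcp_window_update(100, 0.5, d=0.5, frac_marked=1))  # far deadline: ~64.6
print(d2tcp_window_update(100, 0.5, d=2.0, frac_marked=1))  # near deadline: 87.5

Note how, for the same α, the far-deadline flow (d < 1) gets the larger penalty p and hence the smaller window, exactly the back-off behavior described above.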


Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
(Figure: plot of the penalty p = α^d versus α, with curves for d = 1, d < 1 (far deadline), and d > 1 (near deadline); the window is resized as W := W × (1 - p/2).)
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
o d < 1 → p > α for far-deadline flows: p is large → shrink the window
o d > 1 → p < α for near-deadline flows: p is small → retain the window
o d = 1 → p = α for long-lived flows: DCTCP behavior

D2 TCP Computing α


(Diagram: switch buffer with a region "accept without marking" up to K and a region "accept with marking" from K to Buffer_limit.)
α is calculated by aggregating ECN marks (like DCTCP): switches mark packets if queue_length > threshold, and the sender computes the fraction of marked packets, averaged over time.
Switch:
  if (q ≤ K) accept packet without marking
  else if (K < q ≤ Buffer_limit) accept and mark packet
  else if (q > Buffer_limit) discard packet
Sender (update once every RTT):
  α = (1 - g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data

D2 TCP Computing the deadline imminence factor d


As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP), in which W → W/2 upon congestion detection (i.e., p = 1).
(Figure: sawtooth of the window size between W/2 and W over time, with L the length of one sawtooth wave in RTTs; case Tc > L, with D the time remaining until the deadline.)
D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.

D2 TCP Computing the deadline imminence factor d


(Figure: sawtooth of the window size between W/2 and W versus time in RTTs; case Tc > L, with D the time remaining until the deadline.)
Under the sawtooth, the window grows from W/2 by one segment per RTT, so the bytes sent satisfy
B = (Tc / L) × [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L - 1) ],   Tc ≥ L
Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application, and L - 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives
B = Tc × (0.75) W, with W in bytes.
Analysis continued on the next slide.


D2 TCP Computing the deadline imminence factor d


(Figure: the same sawtooth of the window size between W/2 and W versus time in RTTs, case Tc > L.)
It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as
d = Tc / D
where Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
Tc ≈ B / (0.75 W)   (using the approximation from the previous slide)
A small numeric sketch follows.
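For illustration only (the function name, units, and numbers are ours): with the 0.75 W approximation, estimating d reduces to a couple of divisions.

def deadline_imminence(bytes_remaining, window_bytes, rtt_s, deadline_s):
    # Tc: RTTs needed at an average window of 0.75 W, converted to seconds.
    tc_rtts = bytes_remaining / (0.75 * window_bytes)
    tc_s = tc_rtts * rtt_s
    if deadline_s is None:           # no deadline: behave like DCTCP
        return 1.0
    return tc_s / deadline_s         # d > 1 means a tight (near) deadline

# Example: 600 KB left, W = 64 KB, RTT = 300 microseconds, 20 ms to deadline.
print(deadline_imminence(600e3, 64e3, 300e-6, 20e-3))   # 0.1875, a far deadline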

D2 TCP the deadline imminence factor d


What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure, and we have
B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc - 1)
(Figure: a single partial sawtooth wave of the window size between W/2 and W versus time, case Tc < L.)
Since the value of B is known by the application, the value Tc can be computed. The value d is again given by
d = Tc / D

D2 TCP Summary


D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.


Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 14: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1459

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

A TCP Incast Event

[Figure: an aggregator sends a query to workers 1-4 and collects their responses (with ACKs). The response from worker 3 is lost due to incast and is retransmitted only after a timeout.]


Incast Collapse Summary

Incast scenario: packets from many flows arriving at the same port at the same time.

In other publications the incast scenario is referred to as the fan-in burst at the parent node. This incast is a key reason for increased network delay and occurs when all the children (e.g., workers at the leaf level) of a parent node face the same deadline and are likely to respond nearly at the same time, causing a fan-in burst at the parent node.


Performance Impairments of Shallow-Buffered Switches: 2. Queue Buildup

When long and short flows traverse the same queue, there is a queue buildup impairment: the short flows experience increased latency as they are queued behind packets from the large flows. Since every worker in the cluster handles both query traffic and background traffic (large flows needed to update the data structures on the workers), this traffic pattern occurs very frequently.

This indicates that query flows can experience queuing delays because of long-lived, greedy TCP flows. Further, answering a request can require multiple iterations, which magnifies the impact of this delay.


Performance Impairments of Shallow-Buffered Switches: 3. Buffer Pressure

Given the mix of long and short flows in a data center, it is very common for short flows on one port to be impacted by activity on other ports. The loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports.

The long, greedy TCP flows build up queues on their interfaces. Since the switch is shallow-buffered and the buffer space is a shared resource, the queue buildup reduces the amount of buffer space available to absorb bursts of traffic from the Partition/Aggregate traffic. This impairment is called buffer pressure. The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.


Flow Interactions in Shallow-buffered Switches

Incast scenario: multiple short flows on the same port.
Queue buildup: short and long flows on the same port.
Buffer pressure: short flows on one port and long flows on another port.


Legacy TCP Congestion Control

[Figure: congestion window (in segments) versus round-trip times, showing slow start up to ss_thresh = 16, congestion avoidance up to cwnd = 20, a segment loss triggering fast retransmit with ss_thresh reset to 10, and a later time-out returning the connection to slow start.]

Fast retransmission: ssthresh = cwnd/2 = cwnd × (1 − 0.5); cwnd = ssthresh
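As a reference point for the DCTCP and D2 TCP reactions discussed later, here is a minimal per-RTT sketch (ours, in Python) of the legacy behavior in the figure above, with the window measured in segments.

def legacy_tcp_step(cwnd, ssthresh, event=None):
    # Returns the (cwnd, ssthresh) pair after one RTT.
    if event == "loss":                      # fast retransmit on segment loss
        ssthresh = cwnd * (1 - 0.5)          # ssthresh = cwnd/2
        return ssthresh, ssthresh            # cwnd = ssthresh
    if event == "timeout":                   # retransmission time-out
        return 1, cwnd / 2                   # restart from slow start
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh), ssthresh   # slow start: double per RTT
    return cwnd + 1, ssthresh                # congestion avoidance: +1 per RTT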


The Need for a Data Center TCP

The data center environment is significantly different from wide area networks:

o round-trip times (RTTs) can be less than 250 µs in the absence of queuing
o applications need extremely high bandwidths and very low latencies
o there is little statistical multiplexing; a single flow can dominate a particular path
o the network is largely homogeneous and under a single administrative control
o traffic flowing in switches is mostly internal; connectivity to the external Internet is typically managed through load balancers and application proxies that effectively separate internal traffic from external traffic


The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long flows. The measurements by the authors reveal that 99.91% of traffic in the data center is TCP traffic. The traffic consists of query traffic (2 KB to 20 KB in size), delay-sensitive short messages (100 KB to 1 MB), and throughput-sensitive long flows (1 MB to 100 MB). These applications require three things from the data center network:

o low latency for short flows
o high burst tolerance
o high utilization for long flows

Because of the impairments of shallow-buffered commodity switches, legacy TCP protocols fall short of satisfying the above requirements.

See the paper for details of the workload characterization in cloud data centers.


The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput with commodity shallow-buffered switches.

DCTCP uses the concept of ECN (Explicit Congestion Notification). DCTCP achieves these goals primarily by reacting to congestion in proportion to the extent of congestion.

DCTCP uses a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold.

The DCTCP source reacts by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor. This is different from standard TCP, which cuts its window size by a factor of 2 when it receives an ECN notification.


DCTCP - Simple Marking at the Switch

DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.

An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival. Marking is based on the instantaneous value of the queue, not the average value as in RED routers. The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.

The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on the instantaneous instead of the average queue length.
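A minimal sketch of the marking rule just described, written by us in Python; the threshold of 65 packets is only an illustrative setting, not a value given on this slide.

K = 65   # marking threshold in packets (illustrative assumption)

def on_packet_arrival(queue_length_packets, packet):
    # DCTCP marking: compare the instantaneous queue occupancy against the
    # single threshold K; no averaging as in RED.
    if queue_length_packets > K:
        packet["ce"] = True     # set the Congestion Experienced codepoint
    return packet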


DCTCP - ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.

When delayed ACKs are used (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See the paper for details of the delayed ACK scheme.


DCTCP - Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:

α = (1 − g) × α + g × F

where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K, and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.

Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
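A small sketch of this estimator as we read it (ours, in Python); the weight g = 1/16 is an assumed value for illustration, not one stated on this slide.

G = 1.0 / 16     # weight given to new samples (assumed value)

def update_alpha(alpha, marked_acks, total_acks, g=G):
    # Runs once per window of data (roughly once per RTT).
    F = marked_acks / total_acks          # fraction marked in the latest window
    return (1 - g) * alpha + g * F        # exponentially weighted moving average

# e.g. update_alpha(0.0, 3, 40) is about 0.0047 after one lightly marked window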


RED Router

[Figure: RED accept/mark/discard profile versus the average queue size: accept below THmin, discard or mark with increasing probability between THmin and THmax, discard above THmax (C is the buffer capacity).]

Update the value of the average queue size: avg = (1 − wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa;
    with probability Pa, discard or mark the packet;
    otherwise, with probability 1 − Pa, accept the packet
else if (avg > THmax) discard packet

DCTCP Switch

[Figure: DCTCP accept/mark/discard profile versus the instantaneous queue size: accept without marking up to K, accept with marking between K and the buffer limit, discard beyond the limit.]

if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender
Update α = (1 − g) × α + g × F
Reaction to a marked ACK in a new window: ssthresh = cwnd × (1 − α/2); cwnd = ssthresh

Legacy TCP Sender
Reaction to a marked ACK in a new window: ssthresh = cwnd/2; cwnd = ssthresh
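The two sender rules above, restated by us as a runnable Python comparison; cwnd is in segments and the reaction is applied once per window that carried ECN marks.

def dctcp_reaction(cwnd, alpha):
    # DCTCP: shrink in proportion to the estimated extent of congestion.
    ssthresh = cwnd * (1 - alpha / 2)
    return ssthresh, ssthresh               # (new ssthresh, new cwnd)

def legacy_reaction(cwnd):
    # Legacy TCP: always halve when a window carries ECN marks.
    ssthresh = cwnd / 2
    return ssthresh, ssthresh

# Under mild congestion (alpha = 0.2), DCTCP keeps 90% of its window while
# legacy TCP drops to 50%.
print(dctcp_reaction(100, 0.2))   # (90.0, 90.0)
print(legacy_reaction(100))       # (50.0, 50.0)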


Benefits of DCTCP

Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.

Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.


Benefits of DCTCP (continued)

Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even one packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much that DCTCP, or any congestion control scheme, can do to avoid packet drops.

However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.


DCTCP Performance

The paper has more details on:

Guidelines for choosing parameters and estimating gain.
An analytical model for the steady-state behavior of DCTCP.
The benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP.
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation.


D3 TCP

C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron (Microsoft Research)
"Better Never Than Late: Meeting Deadlines in Datacenter Networks"
ACM SIGCOMM, August 2011


Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.

Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.

Rule: a flow is useful if and only if it satisfies its deadline.


D3 TCP: Basic Idea of Deadline Awareness

[Figure: two flows (f1, f2) with different deadlines (d1, d2) plotted against time; the thickness of a flow line represents the rate allocated to it. Under DCTCP the two flows get equal rates; under D3 TCP the rates differ according to the deadlines.]

DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines. D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.


Challenges

Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.

Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.

Most flows are very short (<50 KB) and RTTs are minimal (~300 µs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.


Basic Design Idea

D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.

D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.

D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.

Details of the D3 TCP scheme can be found in the paper posted on Webcourses.


D2 TCP

B. Vamanan, J. Hasan, T. Vijaykumar (Purdue University & Google Inc.)
"Deadline-Aware Datacenter TCP"
ACM SIGCOMM, August 2012


Pros and Cons of DCTCP and D3 TCP

Results reported in the D3 TCP paper show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:

does not handle fan-in bursts well;
introduces priority inversion at fan-in bursts (see next slide);
does not co-exist with TCP;
requires custom silicon (i.e., switches).


Priority Inversion in D3 TCP

[Figure: bandwidth requests arriving at a switch that grants requests FCFS; a request with a far deadline that arrives slightly earlier is granted, while a request with a near deadline is paused.]

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24-33% of requests.
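A toy rendering of that race, written by us purely to illustrate FCFS granting; the flow names and deadlines are made up.

from collections import deque

# Two bandwidth requests: the far-deadline one arrives an instant earlier.
arrivals = deque([
    {"flow": "f_far",  "deadline_ms": 200},
    {"flow": "f_near", "deadline_ms": 20},
])

while arrivals:
    granted = arrivals.popleft()      # FCFS: grant in arrival order, ignoring deadlines
    print("granted:", granted["flow"], "deadline:", granted["deadline_ms"], "ms")
# The near-deadline request waits behind the far-deadline one: a priority inversion.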


D2 TCP's Contributions

Deadline-aware and handles fan-in bursts well.
Elegant: uses gamma correction for congestion avoidance (far deadline → back off more; near deadline → back off less).
Reactive and decentralized.
Does not hinder long-lived (non-deadline) flows.
Coexists with TCP → incrementally deployable.
No change to switch hardware → deployable today.

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.


OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing, high-revenue online services such as Web search, online retail, and advertisement.

Example: a typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.


OLDI Applications

Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms, parents-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue

OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.

[Figure: a root node fans the user query out to parent nodes, which fan out to leaf nodes; the OLDI response returns to the user within ~250 ms.]


D2 TCP

Deadline-aware and handles fan-in bursts.
Key idea: vary the sending rate based on both the deadline and the extent of congestion.
Built on top of DCTCP. Distributed: uses per-flow state at the end hosts.
Reactive: senders react to congestion; no knowledge of other flows.


D2 TCP: Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 − g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.

We now define d as the deadline imminence factor; a larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α is a fraction ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as gamma correction.


D2 TCP: Adjusting the Congestion Window

The congestion window W is adjusted as follows:

W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating the absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, so the window size gets halved, similar to TCP.
• For α between 0 and 1, the window size is modulated by p.

Note: a larger p means a smaller window.


D2 TCP: Basic Formulas

After determining p, we resize the congestion window W as follows:

W = W × (1 − p/2)   for f > 0,   where p = α^d (the gamma correction function)

d = deadline imminence factor, d = Tc / D
Tc = the flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows; d = 1 for long flows that do not specify deadlines (in this case D2 TCP behaves like DCTCP).
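The resizing rule above, written out by us as a small Python sketch; cwnd is in segments, α comes from the ECN estimate, and d = Tc/D (with d = 1 when no deadline is given). The example values are only illustrative.

def d2tcp_window(cwnd, alpha, d, marked=True):
    if not marked:                  # f = 0: no CE marks in the last window
        return cwnd + 1
    p = alpha ** d                  # gamma-correction penalty, p = alpha^d
    return cwnd * (1 - p / 2)       # f > 0: shrink in proportion to p

# Same congestion level (alpha = 0.5), different deadline imminence:
print(d2tcp_window(100, 0.5, 0.5))   # far deadline:  p ~ 0.71 -> ~64.6 segments
print(d2tcp_window(100, 0.5, 1.0))   # no deadline:   p = 0.50 -> 75.0 (DCTCP-like)
print(d2tcp_window(100, 0.5, 2.0))   # near deadline: p = 0.25 -> 87.5 segments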


Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines:

W := W × (1 − p/2),   with p = α^d

[Figure: p = α^d plotted against α for d = 1, d < 1 (far deadline), and d > 1 (near deadline).]

Key insight: near-deadline flows back off less, while far-deadline flows back off more.

• d < 1 → p > α for far-deadline flows; p is large → shrink the window
• d > 1 → p < α for near-deadline flows; p is small → retain the window
• d = 1 → p = α for long-lived flows; DCTCP behavior


D2 TCP: Computing α

α is calculated by aggregating ECN feedback, as in DCTCP: switches mark packets when the queue length exceeds the threshold K, and the sender computes the fraction of marked packets, averaged over time.

[Figure: switch buffer with marking threshold K and Buffer_limit.]

Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

Sender (update once every RTT):
α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data.


D2 TCP: Computing the Deadline Imminence Factor d

As in D3 TCP, the application knows the deadline D for a message and the size of the message, and passes this information to the transport layer in the request to send.

To estimate the time Tc needed to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection (i.e., p = 1), after which the window grows again.

[Figure: sawtooth waves between W/2 and W versus time; here Tc > L, where L is the length of one sawtooth wave in RTTs, and D is the time remaining until the deadline.]

D = the time remaining until the deadline expires
W = the flow's current window size
B = the bytes remaining to fully transmit the message
Tc = the time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior; we want Tc ≤ D.

Analysis continued on the next slide.


Under the sawtooth pattern (case Tc > L), the bytes transmitted satisfy

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message. Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value of Tc can be computed.

An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives

B = (0.75) × W × Tc   (with W in bytes)

Analysis continued on the next slide.


It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D

Here Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. Using the approximation above, Tc ≈ B / (0.75 × W).


D2 TCP: the Deadline Imminence Factor d (case Tc < L)

What if Tc < L? In this case the flow finishes within a partial sawtooth wave, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

[Figure: a partial sawtooth wave between W/2 and W versus time, with Tc < L.]

Since the value of B is known by the application, the value of Tc can be computed. The value of d is again given by

d = Tc / D
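Putting the pieces together, a short Python sketch (ours) of how d could be computed from the 0.75 W approximation; the units and example numbers are assumptions.

def deadline_imminence(B_bytes, W_bytes_per_rtt, D_rtts):
    # d = Tc / D, with Tc estimated from the average-window approximation.
    if D_rtts is None:                       # flow with no deadline
        return 1.0                           # behave like DCTCP
    Tc = B_bytes / (0.75 * W_bytes_per_rtt)  # estimated RTTs to finish the flow
    return Tc / D_rtts                       # d > 1: tight deadline, d < 1: slack

# Example (assumed numbers): 600 KB left, W = 64 KB per RTT, 10 RTTs to deadline.
print(deadline_imminence(600_000, 64_000, 10))   # ~1.25 -> near-deadline flow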


D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner: when congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 15: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1559

Rack Servers with Commodity Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1659

Performance impairments of Shallow-buffered

Switches1 TCP Incast Collapse

Many applications generate barrier-synchronized requests in which the

client cannot make forward progress until the responses from every

server for the current request have been received An Example of these

applications is a web search query (eg a Google search) sent to a large

number of nodes with results returned to the parent node to be sorted

Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these

requests create many flows that converge on the same interface of a

switch over a short period of time The response packets create a long

queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and

throughput collapse

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d


[Figure: same sawtooth as above (window between W/2 and W, time in RTTs,
Tc > L, deadline D).]

It also follows that if Tc > D then we should set d > 1 to indicate a
tight deadline, and vice versa. Therefore we compute d as

d = Tc / D

where Tc is the time needed for a flow to complete transmitting all its
data under the deadline-agnostic behavior, and D is the time remaining
until its deadline expires. If the flow can just meet its deadline under
the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is
appropriate.

Tc = B / (0.75 × W)   (approximation)

D2 TCP the deadline imminence factor d


What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and
we have

[Figure: partial sawtooth for the case Tc < L; the window grows from W/2
toward W under the deadline-agnostic (DCTCP-like) behavior, with time in
RTTs on the x-axis.]

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value Tc can be
computed. The value d is then given by

d = Tc / D
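Putting the two cases (Tc > L and Tc < L) together, here is a compact sketch (Python; the function name, the segments-per-RTT units, and the sample numbers below are illustrative assumptions) that computes Tc by replaying the pessimistic sawtooth and returns d = Tc / D.

    def deadline_imminence(B, W, D):
        # B: segments remaining to transmit
        # W: current window in segments (sawtooth runs from W/2 up to W); assumes W >= 2
        # D: RTTs remaining until the deadline expires
        sent, tc, w = 0, 0, W // 2
        while sent < B:
            sent += w                           # one RTT worth of data at window w
            tc += 1
            w = W // 2 if w >= W else w + 1     # wrap at the top of a sawtooth
        return tc / D                           # d = Tc / D

For example, deadline_imminence(300, 40, 8) replays 12 RTTs of the sawtooth and returns d = 12/8 = 1.5 (a near-deadline flow that backs off little), while with D = 20 RTTs remaining the same flow gets d = 0.6 and backs off more; the closed-form approximation gives Tc ≈ 300 / (0.75 × 40) = 10 RTTs, in the same ballpark.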

D2 TCP Summary


D2 TCP adjusts the congestion window size in a deadline-aware manner.
When congestion occurs, far-deadline flows back off aggressively while
near-deadline flows back off only a little or not at all. With such
deadline-aware congestion management, not only can the number of missed
deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that
the switches support ECN, which is true of today's datacenter switches.



Page 17: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1759

1 TCP Incast Collapse (continued)

Barrier-synchronized requests exhibit the PartitionAggregate workflow

pattern which is the foundation of many large scale web applications

Requests from higher layers of the application are broken into pieces and

farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content

composition and advertisement selection are based around the

PartitionAggregate design pattern

In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require

iteratively invoking the pattern with an aggregator making serial requests

to the workers below it to prepare a response (1 to 4 iterations are typical

though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up

to the root must be completed within the deadline

In other publications this pattern is referred to as the ScatterGather pattern

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1859

983137983143983143983154983141983143983137983156983151983154

983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154

The partitionaggregate design pattern

Request Latency deadline 250 ms

deadline 50 ms

deadline 10 ms

The total permissible latency for a request is limited and the ldquobackendrdquo part of the

application is typically allocated between 230-300 ms This limit is called the all-up SLA

Example in web search a query might be sent to many aggregators and workers each

responsible for a different part of the index Based on the replies an aggregator might

refine the query and send it out again to improve the relevance of the result Lagging

instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries

A high-level aggregator

(HLA) partitions queries to

a large number of mid-level

aggregators (MLAs) that in

turn partition each query

over the other servers in the

same rack as the MLA

Servers act as both MLAs

and workers so each server

will be acting as an

aggregator for some queries

and as a worker for other

queries

HLA

MLAMLA

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

D2 TCP Basic Formulas

After determining p, we resize the congestion window W as follows:

p = α^d
W = W × (1 - p/2)   if f > 0

where d = deadline imminence factor:

d = Tc / D

Tc = flow completion time achieved with the current sending rate
D  = the time remaining until the deadline expires

d < 1 for far-deadline flows, d > 1 for near-deadline flows, and d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).


Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines:

p = α^d,   W := W × (1 - p/2)

[Figure: plot of the gamma-correction function p = α^d, showing the far-deadline curve (d < 1), the d = 1 line, and the near-deadline curve (d > 1)]

• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior

Key insight: near-deadline flows back off less while far-deadline flows back off more.
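As a quick numerical illustration (the numbers are assumed here, not taken from the paper): suppose α = 0.5 after a congested window. A far-deadline flow with d = 0.5 gets p = 0.5^0.5 ≈ 0.71 and multiplies its window by 1 - p/2 ≈ 0.65; a deadline-agnostic flow with d = 1 gets p = 0.5 and the familiar DCTCP factor of 0.75; a near-deadline flow with d = 2 gets p = 0.25 and retains 0.875 of its window.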


D2 TCP Computing α

[Figure: switch buffer with marking threshold K and Buffer_limit; packets are accepted without marking below K and accepted with marking between K and Buffer_limit]

Switch:
if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit)
    discard packet

Sender: update once every RTT:
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data.

α is calculated by aggregating ECN marks (as in DCTCP):
• Switches mark packets if queue_length > threshold.
• The sender computes the fraction of marked packets, averaged over time.
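The per-flow aggregation on the sender side can be sketched as follows; this is a hedged illustration (the class and method names are assumptions, and real stacks keep this state inside the TCP implementation). The default gain g = 1/16 is a commonly cited DCTCP setting, also an assumption here.

class EcnAggregator:
    def __init__(self, g=1.0 / 16):
        self.g = g            # EWMA gain, 0 < g < 1
        self.alpha = 0.0      # smoothed fraction of marked packets
        self.acked = 0        # ACKs seen in the current window of data
        self.marked = 0       # of those, ACKs carrying ECN-Echo

    def on_ack(self, ce_echo):
        # Count one ACK; ce_echo is True if it echoes a CE mark.
        self.acked += 1
        if ce_echo:
            self.marked += 1

    def end_of_window(self):
        # Called roughly once per RTT: fold f into alpha and reset counters.
        f = self.marked / self.acked if self.acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        self.acked = self.marked = 0
        return self.alpha, f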


D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior.

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), case Tc > L; the window oscillates between W/2 and W, with W → W/2 upon congestion detection; L is the length of one sawtooth wave in RTTs and D marks the deadline]

D  = the time remaining until the deadline expires
W  = flow's current window size
B  = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide.


D2 TCP Computing the deadline imminence factor d

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L - 1) ] × (Tc / L)

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), case Tc > L; the window oscillates between W/2 and W over time measured in RTTs, and D marks the deadline]

Since the value of B is known by the application and L - 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives:

B = (0.75 W) × Tc, with W in bytes.

Analysis continued on the next slide.
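As a hedged worked example (the numbers are assumed): if B corresponds to 60 segments and the current window is W = 10 segments, the approximation gives Tc ≈ B / (0.75 W) = 60 / 7.5 = 8 RTTs. With D = 4 RTTs remaining until the deadline, d = Tc / D = 2, i.e., a near-deadline flow that should back off less.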


D2 TCP Computing the deadline imminence factor d

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), case Tc > L; the window oscillates between W/2 and W over time measured in RTTs, and D marks the deadline]

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as:

d = Tc / D,   with Tc ≈ B / (0.75 W) under the approximation above.


D2 TCP the deadline imminence factor d

What if Tc < L? In this case the partial sawtooth pattern is as shown in the figure, and we have:

[Figure: partial sawtooth wave for deadline-agnostic behavior (DCTCP), case Tc < L; the window grows from W/2 and the transfer completes before the window reaches W]

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc - 1)

Since the value of B is known by the application, the value Tc can be computed. The value d is given by:

d = Tc / D
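The two cases (Tc ≥ L and Tc < L) can be folded into one small computation. The following Python sketch is illustrative only, under the assumption that B and W are expressed in the same unit (e.g., MSS-sized segments) and that time is counted in RTTs; it is not the authors' code.

def completion_time_rtts(B, W):
    # Smallest Tc such that the pessimistic sawtooth, whose window runs
    # W/2, W/2 + 1, ..., W and then wraps back to W/2, ships at least B units.
    half = W // 2
    sent, tc, step = 0, 0, 0
    while sent < B:
        sent += half + step                            # window during this RTT
        tc += 1
        step = 0 if half + step >= W else step + 1     # wrap after reaching W
    return tc

def deadline_imminence(B, W, D):
    # d = Tc / D; d > 1 flags a tight (near) deadline.
    return completion_time_rtts(B, W) / float(D)

For example, completion_time_rtts(60, 10) returns 9 RTTs, close to the ≈ 8 RTTs obtained from the 0.75 W approximation in the earlier worked example.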


D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.

D2 TCP requires no changes to the switch hardware: it only requires that the switches support ECN, which is true of today's datacenter switches.



Page 19: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 1959

aggregator

worker 1worker 2worker 3worker 4

query

response

Ack

A TCP Incast Event

Response from worker 3 is lost due to incast and is

retransmitted after a timeout

timeout

983137983143983143983154983141983143983137983156983151983154

983159983151983154983147983141983154

983090

983159983151983154983147983141983154

983089

983159983151983154983147983141983154

983091

983159983151983154983147983141983154

983091

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2059

IncastScenario

Packets from many

flows arriving to

the same port at

the same time

Incast Collapse Summary

In other publications the incast scnario

is referred to as the fan-in burst at the

parent node This incast is a key reason

for increased network delay and occurswhen all the children (eg workers at

the leaf level) of a parent node face the

same deadline and are likely to respond

nearly at the same time causing a fan-

in burst at the parent node

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d


Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines:

p = α^d
W := W × (1 − p/2)

[Figure: plot of the penalty p versus α (both from 0 to 1.0) for d = 1, d < 1 (far deadline), and d > 1 (near deadline).]

Key insight: near-deadline flows back off less, while far-deadline flows back off more.
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
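A small numerical illustration of this asymmetry (values chosen only for illustration): at the same congestion level α = 0.5, the three cases of d give very different penalties.

alpha = 0.5                                   # observed congestion level
for label, d in [("far-deadline  (d = 0.5)", 0.5),
                 ("no deadline   (d = 1.0)", 1.0),
                 ("near-deadline (d = 2.0)", 2.0)]:
    p = alpha ** d                            # gamma-correction penalty
    scale = 1 - p / 2                         # window multiplier W * (1 - p/2)
    print(f"{label}: p = {p:.2f}, window scaled by {scale:.2f}")
# Far-deadline:  p ~ 0.71, window scaled by ~0.65 (backs off more).
# No deadline:   p = 0.50, window scaled by 0.75 (DCTCP behavior).
# Near-deadline: p = 0.25, window scaled by ~0.88 (backs off less).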


D2 TCP Computing α

α is calculated by aggregating ECN marks (as in DCTCP):
• switches mark packets if queue_length > threshold
• the sender computes the fraction of marked packets, averaged over time

Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

Sender, updated once every RTT:
α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data.

[Figure: switch buffer — packets are accepted without marking below occupancy K, accepted and marked with CE between K and Buffer_limit, and dropped beyond Buffer_limit.]
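A runnable sketch of both halves of this loop (not from the slides; the threshold, buffer limit, and weight g below are assumed example values):

K = 20                  # marking threshold in packets (assumed value)
BUFFER_LIMIT = 100      # per-port buffer limit in packets (assumed value)
G = 1.0 / 16            # EWMA weight g, a typical DCTCP-style setting

def switch_enqueue(q):
    # Decision taken by the switch for an arriving packet, given the
    # instantaneous queue length q.
    if q <= K:
        return "accept"           # below threshold: no CE mark
    elif q <= BUFFER_LIMIT:
        return "accept+mark"      # set the CE codepoint
    else:
        return "drop"             # buffer exhausted

def update_alpha(alpha, marked, total):
    # Sender side, once per RTT: fold the latest marked fraction f into alpha.
    f = marked / total if total else 0.0
    return (1 - G) * alpha + G * f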

D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc needed to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): the window grows by one segment per RTT and W → W/2 upon congestion detection.

[Figure: sawtooth of the window between W/2 and W over time, one wave of length L, with Tc > L and the deadline D marked on the time axis.]

D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior; we want Tc ≤ D.

Analysis continued on the next slide.

D2 TCP Computing the deadline imminence factor d (continued)

Under the sawtooth pattern the window grows from W/2 back to W over one wave of L RTTs, so the bytes transmitted satisfy

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L),   for Tc ≥ L

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message. Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed.

An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives

B = (0.75 W) × Tc   (W in bytes)

[Figure: the same sawtooth of the window between W/2 and W versus time in RTTs, with Tc > L and the deadline D marked.]

Analysis continued on the next slide.

D2 TCP Computing the deadline imminence factor d (continued)

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. It follows that if Tc > D we should set d > 1 to indicate a tight deadline, and vice versa. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. Therefore we compute d as

d = Tc / D

with Tc ≈ B / (0.75 W) from the approximation on the previous slide.

[Figure: the sawtooth window pattern between W/2 and W versus time in RTTs, with Tc > L, annotated with Tc and the deadline D.]
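With the 0.75 W approximation, d reduces to a few lines. The sketch below is illustrative only: it assumes Tc and D are both expressed in RTTs, W is in bytes, and the clamp range is a made-up safeguard rather than a value taken from the slides.

def deadline_imminence(B, W, D_rtts, d_min=0.5, d_max=2.0):
    # B      : bytes remaining in the message
    # W      : current window size in bytes
    # D_rtts : time remaining until the deadline, in RTTs (None/0 if no deadline)
    if not D_rtts or D_rtts <= 0:
        return 1.0                       # no deadline: behave like DCTCP (d = 1)
    Tc = B / (0.75 * W)                  # completion time in RTTs (approximation)
    return min(max(Tc / D_rtts, d_min), d_max)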

D2 TCP the deadline imminence factor d: what if Tc < L?

In this case the flow finishes within a partial sawtooth wave, as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value Tc can be computed. The value d is again given by

d = Tc / D

[Figure: a partial sawtooth (DCTCP-like) of the window between W/2 and W versus time, with Tc < L.]
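For completeness, a sketch that evaluates the sawtooth sums exactly, covering both the Tc ≥ L and the Tc < L cases above (helper names are made up; the window and B are counted in segments, Tc in RTTs):

def sawtooth_completion_time(B_segments, W):
    # Smallest Tc (in RTTs) such that a sawtooth starting at W/2 and growing
    # by one segment per RTT has sent B_segments; after L = W/2 + 1 RTTs the
    # window reaches W, halves, and the pattern repeats.
    half = max(W // 2, 1)
    L = half + 1
    per_wave = sum(half + i for i in range(L))   # segments sent in one full wave
    full_waves, remaining = divmod(B_segments, per_wave)
    Tc = full_waves * L
    sent = 0
    for i in range(L):                           # partial wave (the Tc < L case)
        if remaining <= sent:
            break
        sent += half + i
        Tc += 1
    return Tc

def imminence_factor(B_segments, W, D_rtts):
    # d = Tc / D; flows with no deadline fall back to d = 1 (DCTCP behavior).
    if not D_rtts:
        return 1.0
    return sawtooth_completion_time(B_segments, W) / D_rtts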

D2 TCP Summary


D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.


Page 21: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2159

Performance impairments of Shallow-buffered

Switches2 Queue Buildup

When long and short flows traverse the same queue there is a queue

buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every

worker in the cluster handles both query traffic and background

traffic (large flows needed to update the data structures on the

workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays

because of long-lived greedy TCP flows Further answering a

request can require multiple iterations which magnifies the impact of

this delay

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2259

Performance impairments of Shallow-buffered

Switches3 Buffer Pressure

Given the mix of long and short flows in a data center it is very

common for short flows on one port to be impacted by activity on

other ports The loss rate of short flows in this traffic pattern depends

on the number of long flows traversing other ports

The long greedy TCP flows build up queues on their interfaces

Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space

available to absorb bursts of traffic from the PartitionAggregate

traffic This impairment is called buffer pressure The result is packet

loss and timeouts as in incast but without requiring synchronizedflows

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2359

Buffer

Pressure

Short flows on oneport and long flows

on another port

Incast

Scenario

Multiple shortflows on the same

port

Queue

Buildup

Short and longflows on the same

port

Flow Interactions in Shallow-buffered Switches

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d


B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

[Figure: sawtooth window waves for the deadline-agnostic behavior (similar to DCTCP), case Tc > L; each wave grows from W/2 to W, the x-axis is time in RTTs, and D marks the deadline]

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., when Tc is an integer multiple of L). This gives

B ≈ (0.75 W) × Tc,  i.e.,  Tc ≈ B / (0.75 W)   (W in bytes)

Analysis continued on the next slide


D2 TCP Computing the deadline imminence factor d


[Figure: sawtooth window waves for the deadline-agnostic behavior (similar to DCTCP), case Tc > L, with the deadline D marked on the time axis (in RTTs)]

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. It follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. Therefore we compute d as

d = Tc / D,   with Tc ≈ B / (0.75 W)   (approximation)
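A quick illustrative calculation (numbers chosen for illustration): with B = 150 KB remaining, a current window W = 20 KB, and an RTT of roughly 300 µs, the approximation gives Tc ≈ B / (0.75 W) = 10 RTTs ≈ 3 ms. If the deadline is 6 ms away, d = Tc / D = 0.5 (far deadline, back off more); if only 2 ms remain, d = 1.5 (near deadline, back off less).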

D2 TCP the deadline imminence factor d


What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

[Figure: partial sawtooth for the deadline-agnostic (DCTCP-like) behavior, case Tc < L; the window grows from W/2 for only Tc round trips]

Since the value of B is known by the application, the value Tc can be computed. The value d is then given by

d = Tc / D
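Putting the two cases together, a minimal Python sketch of the d computation might look as follows (illustrative only; it assumes B and W are expressed in segments and D in RTTs, and it walks the sawtooth one RTT at a time instead of using the closed-form 0.75 W approximation):

    def deadline_imminence(B, W, D):
        # B: segments left to send, W: current window in segments, D: RTTs until the deadline
        half = max(1, W // 2)            # sawtooth restarts from W/2 (guard tiny windows in this sketch)
        sent, tc, w = 0, 0, half
        while sent < B:                  # advance the deadline-agnostic sawtooth one RTT at a time
            sent += w
            tc += 1
            w = half if w >= W else w + 1    # window halves again after reaching W, else grows by one
        return tc / float(D)             # d = Tc / D; d > 1 signals a tight deadline

This loop covers both Tc < L and Tc ≥ L; the closed-form alternative on the previous slides, Tc ≈ B / (0.75 W), avoids the loop at the cost of the average-window approximation.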

D2 TCP Summary


D2 TCP adjusts the congestion window size in a deadline-aware manner: when congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 24: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2459

C o n g e s t i o n

w i n d o w

10

5

15

20

0

Round-trip times

Slow

start

Congestionavoidance

Time-out

Legacy TCP Congestion Control

983155983155983135983156983144983154983141983155983144 983101983089983094

983139983159983150983140 983101983090983088

983155983155983135983156983144983154983141983155983144 983101983089983088

Segment loss

Segment loss

FastRetransmit

Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

DCTCP - Simple Marking at the Switch

DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.

An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival.

Marking is based on the instantaneous value of the queue, not the average value as in RED routers.

The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.

The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.


DCTCP - ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.

For senders that use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See paper for details of the delayed ACK scheme.
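A rough illustration of that two-state logic follows (a sketch assuming m = 2 delayed ACKs; send_ack and the per-connection state fields are hypothetical stand-ins, see the paper for the exact algorithm):

def send_ack(ece):
    # Hypothetical stand-in for the real ACK transmission path.
    print("ACK sent with ECN-Echo =", ece)

M = 2  # delayed-ACK factor assumed for this sketch

def on_data_packet(rcv, pkt_has_ce):
    """rcv is a per-connection dict: {'ce_state': bool, 'pending': int}."""
    if pkt_has_ce != rcv['ce_state']:
        # CE state changed: immediately ACK what was received so far, echoing
        # the previous state, then switch states.
        if rcv['pending'] > 0:
            send_ack(ece=rcv['ce_state'])
        rcv['ce_state'] = pkt_has_ce
        rcv['pending'] = 1
    else:
        rcv['pending'] += 1
    if rcv['pending'] >= M:
        # Normal delayed ACK covering up to M packets; ECN-Echo mirrors the
        # current CE state so the sender can count marked packets accurately.
        send_ack(ece=rcv['ce_state'])
        rcv['pending'] = 0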

DCTCP - Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:

α = (1 − g) × α + g × F

where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K, and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.

Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
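A minimal sketch of this per-window update and the corresponding window cut (the variable names and the choice g = 1/16 are illustrative assumptions, not prescribed values):

def dctcp_sender_window_update(cwnd, alpha, marked, total, g=1.0 / 16):
    """Run once per window of data (roughly once per RTT)."""
    F = marked / total if total else 0.0   # fraction of CE-marked packets
    alpha = (1 - g) * alpha + g * F        # EWMA estimate of congestion extent
    if marked:
        cwnd = cwnd * (1 - alpha / 2)      # cut in proportion to congestion,
                                           # instead of always halving
    return cwnd, alpha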


RED Router

[Figure: RED queue of capacity C with regions "Accept" (below THmin), "Discard or mark with increasing probability" (between THmin and THmax), and "Discard" (above THmax).]

RED Router:
Update the value of the average queue size: avg = (1 − wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa
    with probability Pa, discard or mark packet
    otherwise, with probability 1 − Pa, accept packet
else if (avg > THmax) discard packet

DCTCP Switch

[Figure: DCTCP queue with regions "Accept without marking" (up to K) and "Accept with marking" (between K and the buffer limit).]

DCTCP Switch:
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender:
Update α = (1 − g) × α + g × F
Reaction to marked ACK in a new window: ssthresh = cwnd × (1 − α/2); cwnd = ssthresh

Legacy TCP Sender:
Reaction to marked ACK in a new window: ssthresh = cwnd/2; cwnd = ssthresh
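For concreteness, here is a small runnable sketch of the two marking disciplines above (the linear Pa ramp and all parameter values are illustrative assumptions, not vendor defaults):

import random

def red_enqueue(q, avg, wq=0.002, th_min=20, th_max=60, max_p=0.1):
    """Classic RED: mark/drop based on the EWMA of the queue length."""
    avg = (1 - wq) * avg + wq * q
    if avg < th_min:
        action = "accept"
    elif avg <= th_max:
        pa = max_p * (avg - th_min) / (th_max - th_min)   # illustrative Pa ramp
        action = "mark" if random.random() < pa else "accept"
    else:
        action = "discard"
    return action, avg

def dctcp_enqueue(q, K=20, limit=100):
    """DCTCP: mark on the instantaneous queue length crossing a single threshold K."""
    if q <= K:
        return "accept"
    if q <= limit:
        return "accept_and_mark"    # CE codepoint set
    return "discard"                # buffer exhausted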

Benefits of DCTCP

Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.

Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.

Benefits of DCTCP (continued)

Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP (or any congestion control scheme) can do to avoid packet drops.

However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.

DCTCP Performance

The paper has more details on:

Guidelines for choosing parameters and estimating gain

Analytical model for the steady-state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP

Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (w/ SACK) implementation

D3 TCP

C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
ACM SIGCOMM, August 2011

"Better Never Than Late: Meeting Deadlines in Datacenter Networks"

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.

Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.

Rule: a flow is useful if and only if it satisfies its deadline.

D3 TCP: Basic Idea of Deadline Awareness

[Figure: two flows (f1, f2) with different deadlines (d1, d2) plotted against time, once under DCTCP and once under D3 TCP; the thickness of a flow line represents the rate allocated to it.]

DCTCP is not aware of deadlines and treats all flows equally. DCTCP can easily cause some flows to miss their deadline.

D3 TCP allocates bandwidth to flows based on their deadline. Awareness of deadlines can be used in D3 TCP to ensure they are met.


Challenges

Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.

Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.

Most flows are very short (<50 KB) and RTTs are minimal (~300 μs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.

Basic Design Idea

D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.

D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.

D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.

Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
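As a hedged illustration of the rate-request step (the actual D3 header fields and router-side allocation logic are described in the paper; the simple formula and function name below are assumptions made for this sketch):

def d3_requested_rate(bytes_remaining, seconds_to_deadline):
    """Rate (bytes/s) a sender would ask routers for so the flow finishes on time."""
    if seconds_to_deadline is None or seconds_to_deadline <= 0:
        return 0.0   # deadline-free (or already late) flows request no reserved rate
    return bytes_remaining / seconds_to_deadline

# Example: 200 KB left and 10 ms to the deadline -> request 20 MB/s.
print(d3_requested_rate(200e3, 0.010))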

D2 TCP

B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012

"Deadline-Aware Datacenter TCP"

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in & tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:

does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (i.e., switches)

Priority Inversion in D3 TCP

[Figure: bandwidth requests arriving at a switch that grants requests FCFS; a request with a far deadline is granted while a request with a near deadline is paused.]

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.

D2 TCP's Contributions

Deadline-aware and handles fan-in bursts well
Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
Reactive, decentralized
Does not hinder long-lived (non-deadline) flows
Coexists with TCP → incrementally deployable
No change to switch hardware → deployable today

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively.

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.

Example:

A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.

OLDI Applications

OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.

Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms, parents-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue

[Figure: a partition-aggregate tree with a root, parent aggregators, and leaf servers; a user query fans out down the tree and the OLDI response is assembled within ~250 ms.]
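To make the budget-split figure concrete, a toy calculation (the tree depth, helper name, and mixing-overhead term are assumptions made for illustration; only the 300 ms total and 50 ms per-level RPC budgets come from the slide):

TOTAL_BUDGET_MS = 300          # end-to-end soft deadline from the slide
PER_LEVEL_RPC_MS = 50          # parent-to-leaf RPC budget from the slide

def leaf_budget_ms(levels=2, mixing_overhead_ms=50):
    """Time left for the leaves after each aggregation level and the final mix."""
    return TOTAL_BUDGET_MS - levels * PER_LEVEL_RPC_MS - mixing_overhead_ms

print(leaf_budget_ms())        # 150 ms left for leaf work in this toy split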

D2 TCP

Deadline-aware and handles fan-in bursts.

Key idea: vary the sending rate based on both the deadline and the extent of congestion.

Built on top of DCTCP.
Distributed: uses per-flow state at end hosts.
Reactive: senders react to congestion.
No knowledge of other flows.

D2 TCP: Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average that quantitatively measures the extent of congestion:

α = (1 − g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.

We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction.

D2 TCP: Adjusting the Congestion Window

The congestion window W is adjusted as follows:

W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, so the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.

Note: larger p ⇒ smaller window.
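A minimal sketch of this update rule (a per-RTT helper with illustrative names; d = 1 recovers plain DCTCP):

def d2tcp_window_update(W, f, alpha, d):
    """W: congestion window (segments); f: fraction of CE-marked packets in the
    last window; alpha: EWMA of f; d: deadline imminence factor."""
    if f > 0:
        p = alpha ** d          # gamma-correction penalty, p <= 1
        W = W * (1 - p / 2)     # near-deadline (d > 1) flows back off less
    else:
        W = W + 1               # no marks: additive increase as in TCP
    return W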

D2 TCP: Basic Formulas

After determining p, we resize the congestion window W as follows:

W = W × (1 − p/2)   if f > 0

where

p = α^d

d = deadline imminence factor = Tc / D

Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows.
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).
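For a concrete feel for these formulas (the numbers are chosen only for illustration): with α = 0.25, a far-deadline flow with d = 0.5 gets p = 0.25^0.5 = 0.5 and shrinks its window to W × (1 − 0.5/2) = 0.75 W; a near-deadline flow with d = 2 gets p = 0.25^2 ≈ 0.06 and keeps roughly 0.97 W; a deadline-free flow (d = 1) gets p = α = 0.25 and the usual DCTCP window of 0.875 W.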

Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines.

[Figure: the penalty p = α^d plotted against α (both axes from 0 to 1.0), with curves for d = 1, d < 1 (far deadline), and d > 1 (near deadline); the window is then set as W := W × (1 − p/2).]

Key insight: near-deadline flows back off less, while far-deadline flows back off more.

p = α^d

• d < 1 → p > α for far-deadline flows; p large → shrink window
• d > 1 → p < α for near-deadline flows; p small → retain window
• d = 1 → p = α for long-lived flows; DCTCP behavior

D2 TCP: Computing α

α is calculated by aggregating ECN marks (like DCTCP): switches mark packets if the queue length exceeds the threshold, and the sender computes the fraction of marked packets averaged over time.

[Figure: switch buffer with regions "Accept without marking" (up to K) and "Accept with marking" (between K and Buffer_limit).]

Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

Sender (update once every RTT):
α = (1 − g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data.

D2 TCP: Computing the Deadline Imminence Factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP), in which the window is halved, W → W/2, upon congestion detection.

[Figure: sawtooth waves of the window between W/2 and W over time, with L the length of one wave in RTTs, for the case Tc > L; the deadline D is marked on the time axis.]

D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide.

D2 TCP: Computing the Deadline Imminence Factor d (continued)

Under the sawtooth pattern the window grows by one segment per RTT from W/2 back to W, so one wave lasts L RTTs and Tc/L is the number of sawtooth waves needed to complete transmitting the message:

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

[Figure: the same sawtooth between W/2 and W, time in RTTs, case Tc > L, with the deadline D marked.]

Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives:

B = (0.75 W) × Tc, with W in bytes, i.e., Tc = B / (0.75 W).

Analysis continued on the next slide.
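As a quick illustrative calculation (numbers invented for this sketch): with B = 300 KB remaining and W = 40 KB, Tc ≈ 300 / (0.75 × 40) = 10 RTTs; at a 300 μs RTT that is about 3 ms, so a deadline 2 ms away yields d = 1.5 (near deadline), while a deadline 6 ms away yields d = 0.5 (far deadline).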

D2 TCP: Computing the Deadline Imminence Factor d (continued)

[Figure: the sawtooth between W/2 and W, time in RTTs, case Tc > L, with the deadline D marked.]

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior (using the approximation Tc = B / (0.75 W)), and D is the time remaining until its deadline expires. It follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. Therefore, we compute d as:

d = Tc / D

D2 TCP: The Deadline Imminence Factor d, Case Tc < L

What if Tc < L? In this case the flow finishes within a partial sawtooth wave, as shown in the figure, and we have:

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

[Figure: a partial sawtooth starting at W/2, time on the horizontal axis, case Tc < L, deadline-agnostic (DCTCP-like) behavior.]

Since the value of B is known by the application, the value Tc can be computed. The value of d is again given by:

d = Tc / D
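Putting the two cases together, a hedged sketch of the whole d computation (the units, the MSS constant, and the "no deadline ⇒ d = 1" convention are assumptions of this sketch):

def deadline_imminence(B, W, D_rtts, mss=1460):
    """B: bytes remaining; W: current window in bytes; D_rtts: RTTs until the
    deadline. Returns d = Tc / D (d = 1 when no deadline is given)."""
    L = W / (2 * mss) + 1              # RTTs for the window to grow from W/2 back to W
    if B >= 0.75 * W * L:              # at least one full sawtooth wave: Tc >= L
        Tc = B / (0.75 * W)            # approximation from the earlier slide
    else:                              # partial wave: add W/2, W/2 + mss, ... until B is covered
        sent, Tc, w = 0.0, 0, W / 2
        while sent < B:
            sent += w
            w += mss
            Tc += 1
    if D_rtts is None or D_rtts <= 0:
        return 1.0                     # deadline-free flows behave like DCTCP
    return Tc / D_rtts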

D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 25: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2559

The Need for a Data Center TCP

The data center environment is significantly

different from wide area networks

o round trip times (RTTs) can be less than 250 ms in absence ofqueuing

o Applications need extremely high bandwidths and very low

latencies

o little statistical multiplexing a single flow can dominate a

particular path

o The network is largely homogeneous and under a single

administrative controlo Traffic flowing in switches is mostly internal Connectivity to the

external Internet is typically managed through load balancers and

application proxies that effectively separate internal traffic from

external

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 26: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2659

The Need for a Data Center TCP (continued)

Data center applications generate a diverse mix of short and long

flows The measurements by the authors reveal that 9991 of

traffic in the data center is TCP traffic The traffic consists of query

traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to

100MB) These applications require three things from the data

center network

o low latency for short flows

o high burst tolerance

o high utilization for long flows

Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above

requirements

See paper for details of workload

characterization in cloud data centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP


Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (i.e., switches)


Priority Inversion in D3 TCP

[Figure: bandwidth requests with near and far deadlines arriving at a switch; the switch grants requests FCFS, so a near-deadline request can be paused while a far-deadline request is granted.]

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24-33% of requests.
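
A small numeric sketch of this race (the link capacity, flow sizes, and deadlines are invented for illustration, not taken from the paper):

    link = 10e6                      # bytes/sec available at the switch
    far_rate  = 8e6 / 1.0            # far-deadline flow: 8 MB due in 1.0 s  -> asks for 8 MB/s
    near_rate = 3e6 / 0.25           # near-deadline flow: 3 MB due in 0.25 s -> asks for 12 MB/s
    # The far-deadline request arrives slightly earlier, so FCFS grants it first:
    grant_far = min(far_rate, link)
    link -= grant_far
    grant_near = min(near_rate, link)
    print(grant_far, grant_near)     # 8 MB/s vs. 2 MB/s: the near-deadline flow now needs 1.5 s and misses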


D2 TCP's Contributions

Deadline-aware and handles fan-in bursts well

Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)

Reactive and decentralized

Does not hinder long-lived (non-deadline) flows

Coexists with TCP → incrementally deployable

No change to switch hardware → deployable today

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.


OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.

Example: A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.


OLDI Applications

OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.

Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms, parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue

[Figure: a root node fans out a user query to parent nodes, which fan out to leaf servers; the OLDI response returns to the user in ~250 ms.]
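
The partition-aggregate pattern with a deadline budget can be sketched as below. This is a toy Python sketch; the leaf latencies, the thread-per-leaf fan-out, and the function names are assumptions, with only the 50 ms per-RPC budget taken from the slide.

    import concurrent.futures, random, time

    def query_leaf(i):
        time.sleep(random.uniform(0.01, 0.08))        # some leaves exceed the 50 ms budget
        return f"result-{i}"

    def aggregate(num_leaves=8, budget_s=0.05):
        with concurrent.futures.ThreadPoolExecutor(num_leaves) as pool:
            futures = [pool.submit(query_leaf, i) for i in range(num_leaves)]
            done, _ = concurrent.futures.wait(futures, timeout=budget_s)
            return [f.result() for f in done]         # the response only includes on-time answers

    print(len(aggregate()), "of 8 leaves answered within the deadline")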


D2 TCP

Deadline-aware and handles fan-in bursts

Key idea: vary the sending rate based on both the deadline and the extent of congestion

Built on top of DCTCP; distributed: uses per-flow state at end hosts

Reactive: senders react to congestion, with no knowledge of other flows


D2 TCP Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 - g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.

We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction.
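
A minimal sketch of the penalty function (Python; the variable names are assumed for illustration):

    def penalty(alpha, d):
        """Gamma-correction penalty p = alpha ** d, with 0 <= alpha <= 1."""
        return alpha ** d

    # For the same congestion level, a far-deadline flow (d < 1) sees a larger penalty
    # than a near-deadline flow (d > 1):
    alpha = 0.25
    print(penalty(alpha, 0.5), penalty(alpha, 1.0), penalty(alpha, 2.0))   # 0.5  0.25  0.0625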


D2 TCP Adjusting Congestion Window

The congestion window W is adjusted as follows:

W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.

Note: larger p ⇒ smaller window.
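
Putting the two cases together, a hedged sketch of the per-window update (not the authors' code; cwnd is in segments and the function name is assumed):

    def d2tcp_window_update(cwnd, alpha, d, f):
        """One D2 TCP-style update, applied once per window of data."""
        if f > 0:                       # some packets were CE-marked in the last window
            p = alpha ** d              # gamma-correction penalty
            return cwnd * (1 - p / 2)   # near-deadline (d > 1) -> small p -> mild backoff
        return cwnd + 1                 # no marks: additive increase, as in TCP/DCTCP

    # Example: same congestion (alpha = 0.5), different deadlines
    print(d2tcp_window_update(100, 0.5, 0.5, f=0.5))   # far deadline: ~64.6 segments
    print(d2tcp_window_update(100, 0.5, 2.0, f=0.5))   # near deadline: 87.5 segments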


D2 TCP Basic Formulas

After determining p, we resize the congestion window W as follows:

W = W × (1 − p/2)   if f > 0,   where p = α^d

and d is the deadline imminence factor:

d = Tc / D

Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows; d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).


Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines:

p = α^d,   W := W × (1 − p/2)

[Figure: plot of the penalty p = α^d versus α for far-deadline (d < 1), d = 1, and near-deadline (d > 1) flows.]

Key insight: near-deadline flows back off less, while far-deadline flows back off more.

• d < 1 → p > α for far-deadline flows; p large → shrink window
• d > 1 → p < α for near-deadline flows; p small → retain window
• d = 1 → p = α for long-lived flows; DCTCP behavior


D2 TCP Computing α

[Figure: switch buffer with marking threshold K and Buffer_limit; packets are accepted without marking below K and accepted with marking above K.]

Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

Sender (update once every RTT):
α = (1 - g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data

α is calculated by aggregating ECN (like DCTCP): switches mark packets if queue_length > threshold, and the sender computes the fraction of marked packets averaged over time.
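
The two halves of this computation can be sketched as follows (a sketch only; the threshold K, the weight g = 1/16, and the function names are assumptions):

    K = 20  # marking threshold, in packets (deployment-dependent)

    def switch_on_packet(queue_len, buffer_limit):
        """Per-packet decision at the switch: accept, mark, or drop."""
        if queue_len > buffer_limit:
            return "drop"
        return "mark" if queue_len > K else "accept"

    def sender_update_alpha(alpha, marked, total, g=1 / 16):
        """Once per RTT at the sender: EWMA over the fraction f of marked packets."""
        f = marked / total if total else 0.0
        return (1 - g) * alpha + g * f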


D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection.

[Figure: sawtooth of the window size between W/2 and W over time, with the deadline D and the completion time Tc > L marked.]

D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide.

D2 TCP Computing the deadline imminence factor d


B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L),   for Tc > L

[Figure: sawtooth of the window size (W/2 to W) versus time in RTTs, with the completion time Tc > L and the deadline D marked; sawtooth waves for deadline-agnostic behavior (similar to DCTCP).]

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives

B = (0.75 W) Tc,   with W in bytes.

Analysis continued on the next slide.

D2 TCP Computing the deadline imminence factor d

[Figure: sawtooth of the window size (W/2 to W) versus time in RTTs, with the completion time Tc > L and the deadline D marked; sawtooth waves for deadline-agnostic behavior (similar to DCTCP).]

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.

It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,   with Tc = B / (0.75 W) under the approximation above.


D2 TCP the deadline imminence factor d

What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

[Figure: partial sawtooth of the window size growing from W/2 toward W versus time, with Tc < L; sawtooth waves for deadline-agnostic behavior (DCTCP).]

Since the value of B is known by the application, the value Tc can be computed. The value d is given by

d = Tc / D.
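
A compact sketch that ties these pieces together under the 0.75 W approximation (illustrative only; the helper name and the unit conventions are assumptions):

    def deadline_imminence(B, W, D_rtts):
        """d = Tc / D, with Tc approximated as B / (0.75 * W).
        B and W are in bytes; D_rtts is the time remaining until the deadline, in RTTs."""
        Tc = B / (0.75 * W)       # RTTs needed under the sawtooth approximation
        return Tc / D_rtts        # d > 1: tight deadline, d < 1: slack deadline

    # Example: 300 KB left, 64 KB window, 6 RTTs until the deadline -> d ~ 1.04 (slightly tight)
    print(deadline_imminence(B=300e3, W=64e3, D_rtts=6))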


D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 27: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2759

The Data Center TCP (DCTCP) Algorithm

The goal of DCTCP is to achieve high burst tolerance low latency

and high throughput with commodity shallow buffered switches

DCTCP uses the concept of ECN (Explicit Congestion Notification)

DCTCP achieves these goals primarily by reacting to congestion in

proportion to the extent of congestion

DCTCP uses a simple marking scheme at switches that sets the

Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold

The DCTCP source reacts by reducing the window by a factor that

depends on the fraction of marked packets the larger the fraction the

bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN

notification

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 28: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2859

DCTCP- Simple Marking at the Switch

DCTCP employs a simple active queue management scheme There

is only a single parameter the marking threshold K as opposed to

two parameters THmin and THmax in RED routers

An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival

Marking is based on the instantaneous value of the queue not the

average value as in RED routers

The DCTCP scheme ensures that sources are quickly notified of the

queue overshoot

The RED marking scheme implemented by most modern switches

can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of

average queue length

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 2959

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3059

DCTCP- ECN Echo at the Receiver

RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the

congestion notification has been received The DCTCP receiver however tries to

accurately convey the exact sequence of marked packets back to the sender This is

done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK

For senders that use delayed ACKs (one cumulative ACK for every m

consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the

delayed ACK scheme

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 31: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3159

DCTCP- Control at the Sender

The sender maintains an estimate of the fraction of packets that are marked

called α which is updated once for every window of data (roughly once

every one RTT) as follows

αααα= (1 - g)

timestimestimestimes αααα+ g

timestimestimestimesF

where F is the fraction of packets that were marked in the latest window

of data and 0 lt g lt 1 is the weight given to new samples against the past in

the estimation of α Given that the sender receives marks for every packet

when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α

estimates the probability that the queue size is greater than K The higher the

value of α the higher the level of congestion

Notice that the above equation uses the exponentially weighted average

formula used in many applications eg estimating the average queue size

in RED routers estimating RTO in a TCP connection and flow traffic

prediction in online multihoming smart routing

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

[Figure: switch buffer with marking threshold K and Buffer_limit; packets are accepted without marking below K and accepted with marking between K and Buffer_limit.]

Switch:
if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit)
    discard packet

Sender (update once every RTT):
α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data

α is calculated by aggregating ECN marks (like DCTCP):
• Switches mark packets if queue_length > threshold.
• The sender computes the fraction of marked packets, averaged over time.
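A minimal sketch of this logic in Python (K, Buffer_limit, and g are deployment parameters; g = 1/16 is shown only as a commonly cited choice):

    def switch_enqueue(q, K, buffer_limit):
        # Per-packet decision at the switch, based on the instantaneous queue length q.
        if q <= K:
            return "accept"              # below the marking threshold
        elif q <= buffer_limit:
            return "accept_and_mark"     # set the ECN CE codepoint
        else:
            return "drop"                # buffer exhausted

    def update_alpha(alpha, f, g=1.0 / 16):
        # Sender side, run once per RTT: EWMA of the marked fraction f.
        return (1 - g) * alpha + g * f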

D2 TCP Computing the deadline imminence factor d


As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send. To estimate the time Tc needed to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior.

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); the window oscillates between W/2 and W (W → W/2 upon congestion detection, p = 1), with sawtooth length L, completion time Tc > L, and the deadline D marked on the time axis.]

D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = the time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior; we want Tc ≤ D

Analysis continued on the next slide.

D2 TCP Computing the deadline imminence factor d


B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), window between W/2 and W, time in RTTs, with Tc > L and deadline D.]

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message. Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives

B = (0.75 W) × Tc (in bytes), i.e., Tc = B / (0.75 W)

Analysis continued on the next slide.
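As a sketch of how a sender might evaluate both forms (assuming B and W are expressed in whole segments and Tc is measured in RTTs; the names are illustrative):

    def tc_exact(B, W):
        # One sawtooth wave lasts L RTTs, with the window growing from W/2 up to W.
        L = W // 2 + 1
        per_wave = sum(W // 2 + i for i in range(L))   # W/2 + (W/2+1) + ... + W
        return (B / per_wave) * L                      # (number of waves) x L

    def tc_approx(B, W):
        # Average window over Tc taken as 0.75 W, so B = 0.75 * W * Tc.
        return B / (0.75 * W)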


D2 TCP Computing the deadline imminence factor d


[Figure: the same sawtooth pattern (window between W/2 and W, time in RTTs), with Tc > L and deadline D.]

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,   with Tc = B / (0.75 W) (approximation)
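As a rough illustration (the numbers are only an example): if B = 150 segments remain, the current window is W = 20 segments, and D = 15 RTTs remain until the deadline, then Tc ≈ 150 / (0.75 × 20) = 10 RTTs and d = Tc / D ≈ 0.67 < 1. The deadline is comfortably far, p = α^0.67 ≥ α, and the flow backs off more than DCTCP would.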

D2 TCP the deadline imminence factor d


What if Tc < L? In this case the flow finishes within a partial sawtooth wave, as shown in the figure, and we have

[Figure: a partial sawtooth wave (deadline-agnostic behavior, as in DCTCP) with the window growing from W/2 and Tc < L.]

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value Tc can be computed. The value d is again given by

d = Tc / D
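Putting the two cases together, a sender-side helper might look like the following sketch (same assumptions as before: B and W in segments, Tc and D in RTTs; the function name is illustrative):

    def deadline_imminence(B, W, D):
        L = W // 2 + 1
        sent, tc = 0, 0
        # Walk the partial sawtooth: W/2, W/2 + 1, ... until B segments are covered.
        while sent < B and tc < L:
            sent += W // 2 + tc
            tc += 1
        if sent < B:                 # more than one full wave needed: Tc > L
            tc = B / (0.75 * W)      # fall back to the 0.75 W approximation
        return tc / D                # d > 1 signals a tight deadline

For instance, deadline_imminence(150, 20, 15) returns about 0.73, close to the far-deadline example above.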

D2 TCP Summary


D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 32: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3259

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 33: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3359

RED Router

983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144

983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161

983105983139983139983141983152983156

983088 983124983112983149983145983150 983124983112983149983137983160 983107

RED Router

Update the value of the average queue size

avg = (1- wq ) times avg + wq times q

if (avg lt THmin) accept packet

else if (THmin le avg le THmax)

calculate probability Pa

with probability Pa

discard or mark packet

otherwise with probability 1 ndash Paaccept packet

else if (avg gt THmax) discard packet

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

983148983145983149983145983156

DCTCP Switch

DCTCP Switch

if (q le K) accept packet

else if ( K lt q le limit )

accept and mark packet

else if ( q gt limit) discard packet

DCTCP Sender

Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window

ssthresh = cwnd times (1- α 2)

cwnd = ssthresh

Legacy TCP Sender

Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 34: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3459

Benefits of DCTCP

Queue buildup DCTCP senders start reacting as soon as the

queue length on an interface exceeds K This reduces queuing

delays on congested switch ports which minimizes the impact

of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses

that can lead to timeouts

Buffer pressure a congested portrsquos queue length does not

grow exceedingly large Therefore in shared memory

switches a few congested ports will not exhaust the buffer

resources for flows passing through other ports

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Note that Tc/L is the number of sawtooth waves needed to complete transmitting the message.

Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives

B = (0.75) × W × Tc (in bytes)

Analysis continued on the next slide.



It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D

where Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.

Using the approximation above, Tc = B / (0.75 × W).
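Combining the two formulas, a sender-side computation of d might look like the following Python sketch (names are illustrative; the fallback d = 1 mirrors the long-flow case described earlier):

def deadline_imminence(B, W, D):
    """Estimate d = Tc / D using the approximation Tc ~= B / (0.75 * W).

    B : bytes remaining in the message
    W : current congestion window, in bytes
    D : time remaining until the deadline, in RTTs (None if no deadline)
    """
    if D is None or D <= 0:
        return 1.0                 # no (or expired) deadline: behave like DCTCP
    Tc = B / (0.75 * W)            # estimated completion time, in RTTs
    # an implementation may additionally bound d so the penalty p = alpha**d
    # stays well-behaved for extreme deadlines
    return Tc / D                  # d > 1: tight deadline, d < 1: slack deadline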

D2 TCP the deadline imminence factor d


What if Tc < L?

In this case the flow completes within a partial sawtooth wave, as shown in the figure:

[Figure: partial sawtooth wave for deadline-agnostic behavior (DCTCP); the window grows from W/2 toward W but the flow finishes after Tc < L RTTs]

In this case we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value Tc can be computed. The value d is again given by

d = Tc / D
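For completeness, a small Python sketch that computes Tc exactly by accumulating the per-RTT window sizes of the sawtooth, covering both the Tc ≥ L and the partial-wave Tc < L sums above (W and B are expressed in the same units, e.g. segments; the function name and cap are illustrative):

def completion_time_rtts(B, W, max_rtts=10_000):
    """Count RTTs needed to send B under the pessimistic sawtooth:
    each wave starts at W/2 and grows by one per RTT until it reaches W."""
    sent, rtts = 0, 0
    w = W // 2
    while sent < B and rtts < max_rtts:
        sent += w                            # one window's worth per RTT
        rtts += 1
        w = W // 2 if w >= W else w + 1      # start a new wave after reaching W
    return rtts                              # Tc in RTTs; then d = Tc / D

For instance, with W = 10 and B = 24 the loop returns Tc = 4, matching the partial sum 5 + 6 + 7 + 8.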

D2 TCP Summary


D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 35: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3559

Benefits of DCTCP (continued)

Incast the incast scenario where a large number of synchronized

small flows hit the same queue is the most difficult to handle If the

number of small flows is so high that even 1 packet from each flow is

sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to

avoid packet drops

However in practice each flow has several packets to transmit and

their windows build up over multiple RTTs It is often bursts in

subsequent RTTs that lead to drops Because DCTCP starts marking

early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two

RTTs to tame the size of follow up bursts This prevents buffer

overflows and resulting timeouts

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 36: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3659

DCTCP Performance

The paper has more details on

Guidelines for choosing parameters and estimating gain

Analytical model for the steady state behavior of DCTCP

Benchmark traffic and the micro-benchmark experiments

used to evaluate DCTCP

Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New

Reno (w SACK) implementation

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 37: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3759

D3 TCP

983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150

ACM SIGCOMM August 2011

Better Never Than LateMeeting Deadlines in Datacenter Networks

Microsoft Research

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 38: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3859

Pros and Cons of DCTCP

DCTCP is an elegant proposal that targets the tail-end latency by

gracefully throttling flows in proportion to the extent of

congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found

to reduce the 99th-percentile of the network latency by 29

Unfortunately DCTCP is a deadline-agnostic protocol that

equally throttles all flows irrespective of whether their deadlines

are near or far

Rule a flow is useful if and only if it satisfies its deadline

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

D2 TCP Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 − g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.

We now define d as the deadline imminence factor; a larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as gamma-correction.
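As an unofficial illustration of these two formulas, here is a small Python sketch of the α update and the gamma-correction penalty; the function names and the choice of g = 1/16 are illustrative assumptions, not taken from the slides.

def update_alpha(alpha: float, marked: int, sent: int, g: float = 1.0 / 16.0) -> float:
    """Per-RTT update: alpha = (1 - g) * alpha + g * f, where f is the fraction
    of packets marked (CE) in the latest window of data."""
    f = marked / sent if sent else 0.0
    return (1.0 - g) * alpha + g * f

def penalty(alpha: float, d: float) -> float:
    """Gamma-correction penalty p = alpha^d, where d is the deadline imminence factor."""
    return alpha ** d

With d = 1 the penalty reduces to p = α, i.e., DCTCP's behavior.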

D2 TCP Adjusting Congestion Window

The congestion window W is adjusted as follows:

W = W × (1 − p/2)   if f > 0 (some packets were marked)
W = W + 1           if f = 0 (no packets were marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (the case of heavy congestion), α = 1 and therefore p = 1, so the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.

Note: larger p ⇒ smaller window.
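Continuing the same illustrative Python sketch (same caveats about naming), the window update described above can be written as:

def adjust_window(W: float, f: float, alpha: float, d: float) -> float:
    """Resize the congestion window W given the marked fraction f of the last
    window, the congestion estimate alpha, and the deadline imminence factor d."""
    if f > 0:                       # congestion observed: gamma-corrected backoff
        p = alpha ** d              # p = alpha^d
        return W * (1.0 - p / 2.0)  # W = W * (1 - p/2)
    return W + 1.0                  # no marks: grow by one segment, as in TCP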

D2 TCP Basic Formulas

After determining p, we resize the congestion window W as follows:

W = W × (1 − p/2)   if f > 0
p = α^d

where d is the deadline imminence factor:

d = Tc / D

Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows;
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).

Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines:

p = α^d
W := W × (1 − p/2)

• d < 1 (far deadline) → p > α for far-deadline flows; p large → shrink window
• d > 1 (near deadline) → p < α for near-deadline flows; p small → retain window
• d = 1 → p = α for long-lived flows; DCTCP behavior

[Figure: p plotted against α (both on a 0–1.0 scale) for the far-deadline, d = 1, and near-deadline cases.]

Key insight: near-deadline flows back off less, while far-deadline flows back off more.
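To make the asymmetry concrete, a tiny worked example (numbers chosen by me, not from the slides): at a fixed congestion level α = 0.5, the window multiplier (1 − p/2) differs sharply with d.

# Window multiplier (1 - p/2) at alpha = 0.5 for three deadline-imminence values.
alpha = 0.5
for label, d in [("far deadline  (d = 0.5)", 0.5),
                 ("no deadline   (d = 1.0)", 1.0),
                 ("near deadline (d = 2.0)", 2.0)]:
    p = alpha ** d
    print(f"{label}: p = {p:.2f}, multiplier = {1 - p / 2:.2f}")
# far:  p ≈ 0.71 -> multiplier ≈ 0.65 (backs off more)
# d=1:  p = 0.50 -> multiplier = 0.75 (DCTCP behavior)
# near: p = 0.25 -> multiplier ≈ 0.88 (backs off less)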

D2 TCP Computing α

α is calculated by aggregating ECN marks (like DCTCP):
• Switches mark packets if queue_length > threshold.
• The sender computes the fraction of marked packets, averaged over time.

[Figure: switch buffer with marking threshold K and capacity Buffer_limit; packets are accepted without marking while q ≤ K and accepted with marking while K < q ≤ Buffer_limit.]

Switch:
if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit)
    discard packet

Sender (update once every RTT):
α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data.
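A minimal Python sketch of the two roles above, assuming illustrative thresholds (K = 20 packets, Buffer_limit = 100 packets) and helper names of my own; it is not DCTCP or D2 TCP code, just the marking and averaging logic from this slide.

from enum import Enum

class Verdict(Enum):
    ACCEPT = 1            # q <= K: accept without marking
    ACCEPT_AND_MARK = 2   # K < q <= Buffer_limit: accept and set the CE mark
    DROP = 3              # q > Buffer_limit: discard

def switch_action(q: int, K: int = 20, buffer_limit: int = 100) -> Verdict:
    """ECN-style marking decision for a packet arriving to a queue of length q."""
    if q <= K:
        return Verdict.ACCEPT
    if q <= buffer_limit:
        return Verdict.ACCEPT_AND_MARK
    return Verdict.DROP

def update_alpha_from_feedback(alpha: float, verdicts: list, g: float = 1.0 / 16.0) -> float:
    """Sender side, once per RTT: f = fraction of delivered packets that were marked."""
    delivered = [v for v in verdicts if v is not Verdict.DROP]
    f = sum(v is Verdict.ACCEPT_AND_MARK for v in delivered) / len(delivered) if delivered else 0.0
    return (1.0 - g) * alpha + g * f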

D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc needed to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection (p = 1).

[Figure: sawtooth window oscillating between W/2 and W over time; the flow spans Tc, here with Tc > L (one sawtooth wave lasts L RTTs), against the deadline D.]

D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior; we want Tc ≤ D.

Analysis continued on the next slide.

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

[Figure: the same sawtooth window pattern (similar to DCTCP), with the window between W/2 and W, time in RTTs, Tc > L, and deadline D.]

Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value of Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives:

B = (0.75 W) × Tc   (in bytes)

Analysis continued on the next slide.

D2 TCP Computing the deadline imminence factor d

[Figure: the same sawtooth window pattern (similar to DCTCP), with Tc > L and deadline D.]

It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as:

d = Tc / D,   with Tc = B / (0.75 W)   (approximation)

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc ≅ D), then d = 1 is appropriate.

D2 TCP The deadline imminence factor d

What if Tc < L? In this case the partial sawtooth pattern is as shown in the figure, and we have:

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

[Figure: a partial sawtooth (DCTCP-like) that spans only Tc RTTs, with Tc < L.]

Since the value of B is known by the application, the value of Tc can be computed. The value of d is then given by:

d = Tc / D
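The two cases (Tc > L on the previous slides and Tc < L here) can be folded into one small Python sketch; the helper names and units are illustrative assumptions (B in segments, W in segments per RTT, D and Tc in RTTs), and the 0.75·W shortcut is the approximation mentioned two slides back.

def completion_time_approx(B: float, W: float) -> float:
    """Tc ≈ B / (0.75 * W): RTTs to send B assuming an average window of 0.75*W."""
    return B / (0.75 * W)

def completion_time_exact(B: int, W: int) -> int:
    """Exact RTT count obtained by replaying the pessimistic sawtooth
    (window W/2, W/2+1, ..., W, then halve again); also covers the Tc < L case."""
    sent, w, rtts = 0, max(W // 2, 1), 0
    while sent < B:
        sent += w                                    # segments sent in this RTT
        w = max(W // 2, 1) if w >= W else w + 1      # halve at the top of a wave
        rtts += 1
    return rtts

def imminence_factor(B: float, W: float, D: float) -> float:
    """d = Tc / D; flows with no deadline (modeled here as D <= 0) use d = 1."""
    return completion_time_approx(B, W) / D if D > 0 else 1.0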

D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.

Page 39: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 3959

d1 d2

f1

f2

Time

Flow

D3 TCP Basic Idea of Deadline Awareness

d1 d2

f1

f2

Time

FlowD3 TCPDCTCP

Two flows (f1 f2) with different deadlines (d1 d2) The

thickness of a flow line represents the rate allocated to it

DCTCP is not aware of deadlines and treat all flows equally

DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their

deadline Awareness of deadlines can be used in D3 TCP to

ensure they are met

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 40: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4059

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 41: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4159

Deadlines are associated with flows not packets All packets of a

flow need to arrive before the deadline

Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of

deadlines (including some that do not have a deadline) Further

datacenters host multiple services with diverse traffic patterns

Most flows are very short (lt50KB) and RTTs are minimal

(300microsec) Consequently reaction time-scales are short and

centralized heavy weight (complex) mechanisms to reserve

bandwidth for flows are impractical

Challenges

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 42: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4259

D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce

traffic in the network

D3 TCP uses a Deadline-Driven Delivery control protocol that

addresses the aforementioned challenges Each application knows the deadline for a message and the size

of the message and pass this information to the transport layer

in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination

Routers allocate sending rates to flows to greedily satisfy as

many deadlines as possible

D3

TCP tries to ensure that the largest possible fraction offlows meet their deadlines

Basic Design Idea

Details of the D3 TCP scheme can be found in the paper

posted on Webcourses

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4359

D2 TCP

B Vamanan J Hasan T Vijaykumar

Purdue University amp Google Inc

ACM SIGCOMM August 2012

Deadline-Aware Datacenter TCP

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines:

W := W × (1 − p/2),   p = α^d

[Figure: gamma-correction curves of p = α^d for d = 1, d < 1 (far deadline), and d > 1 (near deadline)]

Key insight: near-deadline flows back off less, while far-deadline flows back off more.

• d < 1 → p > α for far-deadline flows (p large → shrink window)
• d > 1 → p < α for near-deadline flows (p small → retain window)
• d = 1 → p = α for long-lived flows (DCTCP behavior)
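As a worked example (numbers chosen purely for illustration): suppose α = 0.6. A far-deadline flow with d = 0.5 gets p = 0.6^0.5 ≈ 0.77 and shrinks its window to W × (1 − 0.77/2) ≈ 0.61 W; a near-deadline flow with d = 2 gets p = 0.6^2 = 0.36 and retains W × (1 − 0.36/2) = 0.82 W; a flow with no deadline (d = 1) backs off to W × (1 − 0.6/2) = 0.70 W, exactly as DCTCP would.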


D2 TCP Computing α

α is calculated by aggregating ECN marks (as in DCTCP): switches mark packets if queue_length > threshold, and the sender computes the fraction of marked packets averaged over time.

[Figure: switch buffer with marking threshold K and Buffer_limit — packets are accepted without marking below K and accepted with marking between K and Buffer_limit]

Switch:
if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit)
    discard packet

Sender (update once every RTT):
α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data.
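The switch-side decision above can be sketched in Python as a simplified model (not real switch firmware; q is the instantaneous queue length, K and buffer_limit are the thresholds from the figure):

    def handle_packet(q, K, buffer_limit):
        if q <= K:
            return "accept"             # below the marking threshold: no ECN mark
        elif q <= buffer_limit:
            return "accept_and_mark"    # set the ECN CE codepoint to signal congestion
        else:
            return "drop"               # buffer exhausted: discard the packet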


D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior: W → W/2 upon congestion detection (p = 1).

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP) — the window oscillates between W/2 and W over time, with Tc > L and the deadline D marked on the time axis]

D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide.


For Tc > L,

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP) — window between W/2 and W, time in RTTs, deadline D marked]

Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives

B ≈ (0.75 W) × Tc   (B in bytes)

Analysis continued on the next slide.


D2 TCP Computing the deadline imminence factor d (continued)

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP) — window between W/2 and W, time in RTTs, Tc > L, deadline D marked]

Using the approximation above, Tc = B / (0.75 W).

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. Therefore we compute d as

d = Tc / D
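A minimal sketch of this estimate (illustrative; B and W in bytes, D expressed in RTTs, and D ≤ 0 used here to mean "no deadline specified"):

    def deadline_imminence(B, W, D_rtts):
        # Completion time under the deadline-agnostic sawtooth, using the
        # average-window approximation avg(W) ~= 0.75 * W
        Tc = B / (0.75 * W)
        if D_rtts <= 0:
            return 1.0               # no deadline: behave like DCTCP (d = 1)
        return Tc / D_rtts           # d < 1: far deadline, d > 1: near deadline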


D2 TCP the deadline imminence factor d: what if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

[Figure: partial sawtooth wave for deadline-agnostic behavior (DCTCP) — the window grows from W/2 toward W over Tc < L RTTs]

Since the value of B is known by the application, the value Tc can be computed. The value d is then given by

d = Tc / D
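One way to compute Tc exactly from these series (covering both the Tc < L and Tc ≥ L cases) is to walk the sawtooth one RTT at a time; a rough sketch, not from the slides, with W and B in segments:

    def completion_time_rtts(B, W):
        # Deadline-agnostic sawtooth: the window starts at W/2, grows by one
        # segment per RTT, and halves back to W/2 once it reaches W.
        sent, w, rtts = 0, W // 2, 0
        while sent < B:
            sent += w
            rtts += 1
            w = W // 2 if w >= W else w + 1
        return rtts                  # Tc in RTTs; d = Tc / D as before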


D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.


D2 TCP: Deadline-Aware Datacenter TCP
B. Vamanan, J. Hasan, and T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012


Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
• does not handle fan-in bursts well
• introduces priority inversion at fan-in bursts (see next slide)
• does not co-exist with TCP
• requires custom silicon (i.e., switches)


Priority Inversion in D3 TCP

[Figure: bandwidth requests arriving at a switch that grants requests FCFS — a request with a far deadline is granted while a request with a near deadline is paused]

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24–33% of requests.


D2 TCP's Contributions

• Deadline-aware and handles fan-in bursts well
• Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
• Reactive, decentralized
• Does not hinder long-lived (non-deadline) flows
• Coexists with TCP → incrementally deployable
• No change to switch hardware → deployable today

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively.

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 44: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4459

Pros and Cons of DCTCP and D3 TCP

Results reported in D3 TCP show that as much as 7 of flows

may miss their deadlines with DCTCP Our results show

DCTCP with 25 missed deadlines at high fan-in amp tight

deadlines

D3 TCP tackles missed deadlines by pioneering the idea of

incorporating deadline awareness into the network While D3

TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP

does not handle fan-in bursts well

introduces priority inversion at fan-in bursts (see next slide)

does not co-exist with TCP

requires custom silicon (ie switches)

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 45: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4559

983123983159983145983156983139983144

Switch grants

requests FCFS

Bandwidth requests arriving at switch

request paused request granted

Request with near deadline

Request with far deadline

Priority Inversion in D3 TCP

D3 TCP greedy approach may allocate bandwidth to far-deadline

requests arriving slightly ahead of near-deadline requests Due to this

race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts

the priority of 24-33 requests

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 46: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4659

Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion

avoidance far-deadline rarr back off morenear-deadline rarr back off less

Reactive decentralized

Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today

D2 TCP achieves 75 and 50 fewer

missed deadlines than DCTCP and D3

D

2

TCPrsquos Contributions

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 47: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4759

OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (eg 300

ms latency) OLDI applications cane be found in the growing high-

revenue online services such as Web search online retail and

advertisementExample

A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable

by the user and her friends cascade of friend event notifications a chat

application listing friends currently on-line and advertisements This

Facebook page is made up of many components generated by independent

subsystems and ldquomixedrdquo together to provide rich presentation

The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are

delayed Alternatively it must present what it has at the deadline sacrificing

page quality and wasting resources consumed in creating parts of a page that

a user never sees

O A i i

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4859

Features

bull Deadline bound

bull Handle large data

bull Partition-aggregate patternbull Tree-like structure

bull Deadline budget splittotal = 300 ms

parents-leaf RPC = 50 ms

bull Missed deadlines rarr incomplete responses

bull Affect user experience amp revenue

OLDI Applications

OLDI applications employ tree-based divide-and-conquer

algorithms where every query operates on data spanning thousands

of servers

parent parent

leaf leaf leaf leaf

root

bull bull bull bull bull bull bull bull

bull bull bull bullbull bull bull

User

queryOLDI response

simsimsimsim250 ms

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines:

W := W × (1 − p/2),  with  p = α^d

• d < 1 (far deadline) → p > α for far-deadline flows: p is large → shrink the window.
• d > 1 (near deadline) → p < α for near-deadline flows: p is small → retain the window.
• d = 1 → p = α for long-lived flows: DCTCP behavior.

Key insight: Near-deadline flows back off less while far-deadline flows back off more.

[Figure: penalty p versus congestion level α, both axes from 0 to 1.0, showing the curves for far-deadline (d < 1), d = 1, and near-deadline (d > 1) flows.]
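For instance, with illustrative numbers: if α = 0.25, a far-deadline flow with d = 0.5 gets p = 0.25^0.5 = 0.5 and shrinks its window to W × (1 − 0.5/2) = 0.75 W; a near-deadline flow with d = 2 gets p = 0.25^2 = 0.0625 and keeps about 0.97 W; a deadline-less flow (d = 1) backs off to 0.875 W, exactly as DCTCP would.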

D2TCP Computing α

α is calculated by aggregating ECN marks (as in DCTCP).

Switch: marks packets if queue_length > threshold. For a queue of length q, marking threshold K, and buffer size Buffer_limit:

if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

[Figure: switch buffer with the marking threshold K and Buffer_limit; packets are accepted without marking below K and accepted with marking between K and Buffer_limit.]

Sender: computes the fraction f of marked packets, averaged over time. Update once every RTT:

α = (1 − g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data.
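A sketch of the switch-side decision above (K and Buffer_limit are the slide's parameters; the function name and return labels are illustrative):

    def switch_action(q, K, buffer_limit):
        # Per-packet decision based on the instantaneous queue length q
        if q <= K:
            return "accept"              # below threshold: forward unmarked
        if q <= buffer_limit:
            return "accept_and_mark"     # congestion building: set the CE codepoint
        return "drop"                    # buffer full: discard the packet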

D2TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc needed to complete transmitting the message (flow), D2TCP assumes a sawtooth, deadline-agnostic congestion behavior: W → W/2 upon congestion detection (as if p = 1), after which the window grows again by one segment per RTT.

[Figure: sawtooth window waves between W/2 and W for deadline-agnostic behavior (similar to DCTCP); one wave spans L RTTs, the flow finishes at Tc > L, and the deadline D is marked on the time axis.]

D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

(Analysis continued on the next slide.)

D2TCP Computing the deadline imminence factor d (continued)

For Tc ≥ L, the bytes remaining satisfy

B = (Tc / L) × [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ]

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

[Figure: the same sawtooth between W/2 and W (similar to DCTCP), with time measured in RTTs.]

Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives

B ≈ 0.75 × W × Tc   (with W in bytes and Tc in RTTs)

(Analysis continued on the next slide.)
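As a quick sanity check on this approximation (a sketch assuming W is an even number of segments and time is counted in RTTs), summing one full sawtooth of L = W/2 + 1 round trips gives exactly 0.75 × W × L segments:

    def segments_per_sawtooth(W):
        # Window grows from W/2 back to W, one segment per RTT, over L = W/2 + 1 RTTs
        L = W // 2 + 1
        return sum(W // 2 + i for i in range(L))

    # Example: W = 20 -> L = 11 and 165 segments per wave, which equals 0.75 * 20 * 11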

D2TCP Computing the deadline imminence factor d (continued)

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.

It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,  with  Tc ≈ B / (0.75 W)  (approximation)

[Figure: same sawtooth diagram as on the previous slide.]

D2TCP the deadline imminence factor d when Tc < L

What if Tc < L? In this case only a partial sawtooth is needed, as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

[Figure: a single partial sawtooth between W/2 and W (deadline-agnostic behavior, as in DCTCP) that ends at Tc < L.]

Since the value of B is known by the application, the value Tc can be computed. The value d is again given by

d = Tc / D
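Putting both cases together, a hedged sketch of how a sender could estimate Tc and d from B (segments remaining), W (current window in segments), and D (RTTs until the deadline); the function names are illustrative and the closed forms follow the two sums above:

    def estimate_Tc(B, W):
        # RTTs needed to send B segments under the deadline-agnostic sawtooth
        L = W // 2 + 1
        full_wave = sum(W // 2 + i for i in range(L))
        if B >= full_wave:
            return B / (0.75 * W)        # Tc >= L: average window is about 0.75 W
        sent, Tc = 0, 0                  # Tc < L: walk the partial sawtooth
        while sent < B:
            sent += W // 2 + Tc
            Tc += 1
        return Tc

    def imminence_factor(B, W, D):
        # d = Tc / D; d > 1 flags a tight (near) deadline, d < 1 a loose (far) one
        return estimate_Tc(B, W) / D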

D2TCP Summary

D2TCP adjusts the congestion window size in a deadline-aware manner: when congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.


Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 49: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 4959

Deadline-aware and handles fan-in bursts

Key Idea Vary sending rate based on both deadlineand extent of congestion

Built on top of DCTCP Distributed uses per-flow state at end hosts

Reactive senders react to congestion

No knowledge of other flows

D

2

TCP

D2 TCP G C ti

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 50: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5059

Like DCTCP D2 TCP maintains a weighted average that quantitatively

measures the extent of congestion

α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f

where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples

We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty

function applied to the window size as follows

D2 TCP Gamma Correction

p = d

Note that being a fraction le 983089 and therefore le 1 The above

function is known in computer graphics as the gamma-correction

D2 TCP Adj ti C ti Wi d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 51: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5159

The congestion window W is adjusted as follows

= times (1 minus 2) f gt0 case of packets marked

= + 1 f=0 case of no packets marked

bull When 983142 is zero (ie no CE-marked packets indicating absence of

congestion) the window size is grown by one segment similar toTCP

bull When all packets are CE-marked (case of congestion) 983101983089 and

therefore 983101983089 then the window size gets halved similar to TCP

bull For between 0 and 1 the window size is modulated by

D2 TCP Adjusting Congestion Window

Note Larger p rArrrArrrArrrArr smaller window

D2 TCP Basic Form las

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 52: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5259

After determining p we resize the congestion window W as follows

= times (1 minus 2) f gt0

whereWhere d = deadline imminence factor

d = T c D

Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires

d lt 1 for far-deadline flows d gt 1 for near-deadline flows

d = 1 for long flows that do not specify deadlines (ie in this

case D2 TCP behaves like DCTCP)

D2 TCP Basic Formulas

p = d

Gamma Correction Function

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 53: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5359

Gamma correction elegantly combinescongestion and deadlines

Gamma Correction Function

983140 983101 983089

983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)

983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)

Key insight Near-deadline flows back off lesswhile far-deadline flows back off more

983127 983098983101 983127 983082 983080 983089 983085991251

983152 983087 983090 983081

10

983152

10

983142983137983154

983140 983101 983089

983150983141983137983154

983152 983101 α983134983140

bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159

bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155

983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159

bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155

983108983107983124983107983120 983138983141983144983137983158983145983151983154

D2 TCP Computing αααα

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 54: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5459

983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115

Buffer_l 983145983149983145983156

Switch Buffer

Switch

if (q le K)

accept packet without marking

else if ( K lt q le Buffer_limit )

accept and mark packetelse if ( q gt Buffer_limit) discard packet

SenderUpdate once every RTT

α = (1 - g) times α + g times f f is the fraction of packets that were

marked in the latest window of data

D2 TCP Computing αααα

α is calculated by

aggregating ECN (like

DCTCP)

Switches mark packets if

queue_length gt threshold

Sender computes thefraction of marked packets

averaged over time

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5659

= 2 +

2 +1 +

2 +2 +

⋯ +

2 + L-1 Tc L

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

Since the value of B is known by the application and L -1 = W2 for the

sawtooth pattern the value Tc can be computed An alternative reasonable

approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155

Note that Tc L is the number of

sawtooth waves needed to completetransmitting the message

= (075) W in bytes

Analysis continued on the next slide

D TCP Computing the deadline imminence factor d

D2 TCP Computing the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5759

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W

W2

Time in RTT

D

It also follows that if gt then we should set gt1 to indicate a tight

deadline and vice versa Therefore we compute d as

is the time needed for a flow to

complete transmitting all its data

under the deadline-agnosticbehavior and D is the time

remaining until its deadline

expires If the flow can just meet

its deadline under the deadline-agnostic congestion behavior (ie

cong) then d = 1 is appropriate

= (075) approximation

D TCP Computing the deadline imminence factor d

=

D2 TCP the deadline imminence factor d

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5859

D TCP the deadline imminence factor d

What if Tc lt L

In this case the partial

sawtooth pattern is as shown

in the figure In this case wehave

Tc

L

Tc lt L

Sawtooth waves for deadline-agnostic

behavior (DCTCP)

W

W2

time

= 2 +

2 +1 + 2 +2 +

⋯ +

2 + Tc-1

Since the value of B is known by the application the value Tc canbe computed The value d is given by

=

D2 TCP Summary

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5959

D TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware

manner When congestion occurs far-deadline flows back off

aggressively while near-deadline flows back off only a little or

not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but

also tighter deadlines can be met

D2 TCP requires no changes to the switch hardware and only

requires that the switches support ECN which is true of todayrsquos

datacenter switches

Page 55: TCP for Data Centers

8132019 TCP for Data Centers

httpslidepdfcomreaderfulltcp-for-data-centers 5559

D TCP Computing the deadline imminence factor d

As in D3 TCP The applicationknows the deadline D for a

message and pass this information

to the transport layer in the request

to send

To estimate the time Tc to complete

transmitting the message (flow) D2

TCP uses a sawtooth deadline-agnostic congestion behavior

Tc

L

Tc gt L

Sawtooth waves for deadline-agnostic

behavior (similar to DCTCP)

W rarrW2 upon congestion detection p=

W

W2

time

D

D = the time remaining until the deadline expires

W= flowrsquos current window size

B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic

sawtooth transmission behavior We want Tc le D

Analysis continued on the next slide


Page 56: TCP for Data Centers


D2 TCP: Computing the deadline imminence factor d

For the case Tc > L shown in the figure below, the bytes transmitted over Tc RTTs are

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); the window oscillates between W/2 and W; x-axis is time in RTTs; Tc > L, with the deadline D marked.]

Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives

B = (0.75 W) × Tc   (W in bytes), i.e., Tc = B / (0.75 W)

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.
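As an illustration of the two ways of obtaining Tc described above, here is a minimal sketch in Python (not part of the original slides; function and variable names are mine, and the exact version counts in segments rather than bytes):

# Sketch: estimating Tc, the time (in RTTs) to transmit the remaining
# B data under the pessimistic sawtooth behavior.

def tc_approx(B_bytes, W_bytes):
    """Slide's approximation: the average window over Tc is 0.75*W,
    so Tc = B / (0.75 * W)."""
    return B_bytes / (0.75 * W_bytes)

def tc_exact(B_segments, W_segments):
    """RTT-by-RTT simulation of the pessimistic sawtooth: the window
    starts at W/2 right after a halving, grows by one segment per RTT
    up to W, then halves again."""
    w = W_segments // 2
    sent, rtts = 0, 0
    while sent < B_segments:
        sent += w
        rtts += 1
        w = W_segments // 2 if w >= W_segments else w + 1
    return rtts

# Example (illustrative numbers): 600 KB left with a 64 KB window
# needs about 12.5 RTTs under the approximation.
print(tc_approx(600_000, 64_000))   # 12.5
print(tc_exact(600, 64))            # exact RTT count, counted in segments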

Analysis continued on the next slide


Page 57: TCP for Data Centers


D2 TCP: Computing the deadline imminence factor d

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); the window oscillates between W/2 and W; x-axis is time in RTTs; Tc > L, with the deadline D marked.]

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,   with Tc = B / (0.75 W)   (approximation)
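A minimal sketch in Python of the resulting deadline imminence factor (not from the original slides; names are mine, and the clamping range is illustrative, along the lines of the original D2 TCP paper, rather than something these slides specify):

def deadline_imminence(B_bytes, W_bytes, D_rtts, lo=0.5, hi=2.0):
    """d = Tc / D with Tc = B / (0.75 * W), the slide's approximation.
    d > 1 flags a tight deadline (back off less), d < 1 a loose one
    (back off more).  The [lo, hi] clamp is illustrative."""
    Tc = B_bytes / (0.75 * W_bytes)
    return max(lo, min(hi, Tc / D_rtts))

# Tight deadline: ~12.5 RTTs of data but only 8 RTTs remain -> d ≈ 1.56
print(deadline_imminence(600_000, 64_000, 8))
# Loose deadline: 40 RTTs remain -> raw d ≈ 0.31, raised to the 0.5 floor
print(deadline_imminence(600_000, 64_000, 40))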


Page 58: TCP for Data Centers


D2 TCP: the deadline imminence factor d

What if Tc < L? In this case the partial sawtooth pattern is as shown in the figure below, and we have:

[Figure: partial sawtooth for deadline-agnostic behavior (DCTCP); the window grows from W/2 but the flow finishes before the window reaches W; here Tc < L.]

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value Tc can be computed. The value d is then given by

d = Tc / D
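Since the partial sum above is an arithmetic series, Tc can also be obtained in closed form. A small sketch in Python (my own names; working in segments) under the assumption that B = Tc·W/2 + Tc·(Tc − 1)/2:

import math

def tc_partial(B_segments, W_segments):
    """Partial-sawtooth case (Tc < L): solve
    B = Tc*W/2 + Tc*(Tc-1)/2, i.e. Tc^2 + (W-1)*Tc - 2*B = 0,
    and keep the positive root, rounded up to whole RTTs."""
    W = W_segments
    Tc = (-(W - 1) + math.sqrt((W - 1) ** 2 + 8 * B_segments)) / 2
    return math.ceil(Tc)

# Example: W = 40 segments and 130 segments left -> Tc = 6 RTTs, which is
# indeed below L = W/2 + 1 = 21, so the partial-sawtooth case applies.
print(tc_partial(130, 40))   # 6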


Page 59: TCP for Data Centers


D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.
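To tie the pieces together, here is a hedged end-to-end sketch in Python of the window adjustment proposed in the original D2 TCP paper (the combination p = alpha^d and W ← W·(1 − p/2) is taken from that paper, not from these slides): near-deadline flows (d > 1) get a smaller penalty and back off gently, while far-deadline flows (d < 1) get a larger penalty and back off aggressively.

def d2tcp_window_update(W, alpha, d, congestion_detected):
    """One D2 TCP window update (sketch, per the D2 TCP paper):
    alpha in [0, 1] is the DCTCP-style smoothed fraction of ECN-marked
    packets, d is the deadline imminence factor, and p = alpha ** d is
    the resulting backoff penalty."""
    if not congestion_detected:
        return W + 1                 # additive increase: one segment per RTT
    p = alpha ** d
    return W * (1 - p / 2)           # deadline-aware multiplicative decrease

# With alpha = 0.5: a near-deadline flow (d = 2.0) keeps 87.5% of its window,
# while a far-deadline flow (d = 0.5) keeps only about 64.6% of it.
print(d2tcp_window_update(100, 0.5, 2.0, True))   # 87.5
print(d2tcp_window_update(100, 0.5, 0.5, True))   # ~64.64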