Design of TCP for Data Centers
Cloud Computing
The cloud computing architecture is comprised of two significant parts:
The front end is the side that the user of the computer, or the client himself, is able to access. This involves the client's network or computer and the program(s) that the client uses to access the database or the servers that contain all the data.
The back end is the cloud itself, which is the collection of all related information saved on the servers that the client wishes to have access to.
These two ends of the cloud computing architecture are connected through a network, usually the Internet, which provides remote access to all the users of the cloud.
Benefits of Cloud Computing
For large-scale businesses, cloud computing technology eliminates the need to buy additional hardware and storage devices, since all needed data is easily accessible from the cloud through the individual computers of the employees.
Installing software on every computer would no longer be necessary, because the cloud computing platform would be able to do the job.
Cloud hosting services provide managed hosting for all server configurations with dedicated 24/7 availability. Cloud software ranges from sales applications to custom applications, depending on the users' choice.
The above benefits can provide a great deal of profit to many businesses and can also improve customer satisfaction.
Cloud Data Centers
Cloud computing services provide the users of the Cloud better management of their information. This can save a company on expenses, since the company will not need to hire a large IT team for its own technical support.
A lot of cloud computing software is available today that offers provision of cloud computing applications. Running applications and storing data on the cloud has proven to be economical and efficient for many businesses.
Cloud data centers host diverse applications, mixing workloads that require small, predictable latency with others requiring large, sustained throughput.
Data Center Bubble
Numerous companies are already providing cloud services, including Amazon, Google, Yahoo, Microsoft, HP, IBM, Cisco, etc.
Data centers range in size from "edge" facilities to megascale data centers (100K to 1M servers).
Data centers are located in many countries, e.g., USA, India, Singapore, Germany, etc.
There is a push for Green Data Centers that use wind/solar energy, efficient floor layout, recycling of waste material, environment-friendly material/paint, and green-rated power equipment.
Fastest growing sectors in data centers: Telecom, foreign hosting companies, global information management, business connectivity.
Data Center Closures & Consolidation
US Government Data Centers
In November 2012 the US Government closed an additional 64 data centers, bringing the total number of closed facilities to 381. The closures are part of the Federal Data Center Consolidation Initiative for streamlining government IT operations.
The ultimate goal is to close 40 percent of the US Federal Government's data centers (i.e., close 1,200 of the nearly 2,900 identified data centers) by 2015.
Commercial Data Centers
Data center consolidation is a trend in industry. For example, HP has been replacing its 85 data centers around the world with only 6 newly-built, larger facilities in Austin, Atlanta, and Houston.
Example: HP Cloud Services (Hewlett-Packard Development Company)
Example Amazon Data Centers
Amazon data centers serve four regions in the US and three regions in Europe and Asia. Another data center in the US was opened in July 2011 in the state of Oregon to serve the Pacific Northwest region.
In December 2011, Amazon announced it is opening a data center in Sao Paulo, Brazil, its first in South America.
In November 2012, Amazon announced it is adding a ninth region by opening a data center in Sydney, Australia.
The data centers support all Amazon Web Services (AWS), including Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
Example Amazon Data Centers
The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users with the ability to execute their applications in Amazon's computing environment.
To use Amazon EC2:
Create an Amazon Machine Image (AMI) containing all the software, including the operating system.
Upload this AMI to Amazon S3 (Amazon Simple Storage Service).
Register to get an AMI ID.
Use this AMI ID and the Amazon EC2 web service APIs to run, monitor, and terminate as many instances of this AMI as required.
EC2 Pricing Policy: pay as you go, no minimal fee. The prices are based on the Region in which the application instance is running.
http://aws.amazon.com/ec2/pricing
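For illustration only (not part of the original slides), the same run/monitor/terminate workflow looks roughly like this with the boto3 Python SDK; the region, AMI ID, and instance type are placeholder assumptions:

# Hypothetical EC2 run/monitor/terminate sketch using boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Launch one instance from a previously registered AMI (placeholder ID).
resp = ec2.run_instances(ImageId="ami-0123456789abcdef0",
                         InstanceType="t2.micro",
                         MinCount=1, MaxCount=1)
instance_id = resp["Instances"][0]["InstanceId"]

# Monitor: poll the instance state.
desc = ec2.describe_instances(InstanceIds=[instance_id])
print(desc["Reservations"][0]["Instances"][0]["State"]["Name"])

# Terminate when done.
ec2.terminate_instances(InstanceIds=[instance_id])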
Data Center Services
Example: Colocation Services of Cogent
http://www.cogentco.com/en
Cogent is a multinational Tier 1 Internet Service Provider.
Companies can colocate their business-critical equipment in one of 43 of Cogent's secure, state-of-the-art data centers that connect directly to a Tier-1 IP network. The data centers have extensive power backup systems and complete fire detection and suppression plans to ensure the safety and security of equipment.
Cogent Data Center Features:
http://www.cogentco.com/en/products-and-services/colocation-services
Colocation Data Centers and Cloud Servers
http://www.datacentermap.com/datacenters.html
http://www.datacentermap.com/cloud.html
Example: Atlantic.Net
http://www.atlantic.net/orlando-colocation-florida.html
Orlando Data Center
Data Center TCP (DCTCP)
M. Alizadeh, A. Greenberg, D. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan
Microsoft Research & Stanford University
ACM SIGCOMM, September 2010
Rack Servers with Commodity Switches
Performance impairments of Shallow-buffered Switches
1. TCP Incast Collapse
Many applications generate barrier-synchronized requests, in which the client cannot make forward progress until the responses from every server for the current request have been received. An example of these applications is a web search query (e.g., a Google search) sent to a large number of nodes, with results returned to the parent node to be sorted.
Barrier-synchronized requests can result in packets overfilling the shallow buffers on the client's port on the switch. In other words, these requests create many flows that converge on the same interface of a switch over a short period of time. The response packets create a long queue and may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses and throughput collapse.
1. TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the Partition/Aggregate workflow pattern, which is the foundation of many large-scale web applications. Requests from higher layers of the application are broken into pieces and farmed out to workers in lower layers. The responses of these workers are aggregated to produce a result. Web searches, social network content composition, and advertisement selection are based around the Partition/Aggregate design pattern.
In a multi-layer partition/aggregate pattern workflow, lags at one layer delay the initiation of others. Further, answering a request may require iteratively invoking the pattern, with an aggregator making serial requests to the workers below it to prepare a response (1 to 4 iterations are typical, though as many as 20 may occur).
The propagation of the request down to the leaves, and of the responses back up to the root, must be completed within the deadline.
In other publications this pattern is referred to as the Scatter/Gather pattern; a small sketch of the pattern follows.
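The pattern itself is easy to picture in code. The sketch below (our own illustration, not from the paper) shows an aggregator fanning a query out to workers and shipping whatever answers arrive before its deadline:

# Partition/aggregate (scatter/gather) with a deadline: late workers are dropped.
import asyncio, random

async def worker(query: str, i: int) -> str:
    await asyncio.sleep(random.uniform(0.001, 0.020))   # simulated work, 1-20 ms
    return f"result-{i} for {query!r}"

async def aggregator(query: str, n_workers: int = 8, deadline_s: float = 0.010):
    tasks = [asyncio.create_task(worker(query, i)) for i in range(n_workers)]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for t in pending:                 # the response ships incomplete rather than late
        t.cancel()
    return [t.result() for t in done]

print(asyncio.run(aggregator("tcp for data centers")))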
The partition/aggregate design pattern
[Figure: a request fans out from a top-level aggregator to lower-level aggregators and then to workers; typical request latency deadlines are 250 ms overall, 50 ms at the mid level, and 10 ms at the leaf workers.]
The total permissible latency for a request is limited, and the "backend" part of the application is typically allocated between 230-300 ms. This limit is called the all-up SLA.
Example: in web search, a query might be sent to many aggregators and workers, each responsible for a different part of the index. Based on the replies, an aggregator might refine the query and send it out again to improve the relevance of the result. Lagging instances of partition/aggregate can thus add up to threaten the all-up SLAs for queries.
A high-level aggregator (HLA) partitions queries to a large number of mid-level aggregators (MLAs) that in turn partition each query over the other servers in the same rack as the MLA. Servers act as both MLAs and workers, so each server will be acting as an aggregator for some queries and as a worker for other queries.
A TCP Incast Event
[Figure: an aggregator sends a query to workers 1-4; each worker returns a response, which is ACKed. The response from worker 3 is lost due to incast and is retransmitted after a timeout.]
Incast Collapse Summary
Incast scenario: packets from many flows arriving to the same port at the same time.
In other publications the incast scenario is referred to as the fan-in burst at the parent node. This incast is a key reason for increased network delay and occurs when all the children (e.g., workers at the leaf level) of a parent node face the same deadline and are likely to respond nearly at the same time, causing a fan-in burst at the parent node.
Performance impairments of Shallow-buffered Switches
2. Queue Buildup
When long and short flows traverse the same queue, there is a queue buildup impairment: the short flows experience increased latency as they are queued behind packets from the large flows. Since every worker in the cluster handles both query traffic and background traffic (large flows needed to update the data structures on the workers), this traffic pattern occurs very frequently.
This indicates that query flows can experience queuing delays because of long-lived, greedy TCP flows. Further, answering a request can require multiple iterations, which magnifies the impact of this delay.
Performance impairments of Shallow-buffered Switches
3. Buffer Pressure
Given the mix of long and short flows in a data center, it is very common for short flows on one port to be impacted by activity on other ports. The loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports.
The long, greedy TCP flows build up queues on their interfaces. Since the switch is shallow-buffered and the buffer space is a shared resource, the queue buildup reduces the amount of buffer space available to absorb bursts of traffic from the Partition/Aggregate traffic. This impairment is called buffer pressure. The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.
Flow Interactions in Shallow-buffered Switches
Incast Scenario: multiple short flows on the same port.
Queue Buildup: short and long flows on the same port.
Buffer Pressure: short flows on one port and long flows on another port.
Legacy TCP Congestion Control
[Figure: congestion window (segments, 0-20) versus round-trip times, showing slow start, congestion avoidance, a segment loss triggering fast retransmit, a time-out, and the thresholds ss_thresh = 16, cwnd = 20, ss_thresh = 10.]
Fast Retransmission: ssthresh = cwnd/2 = cwnd × (1 - 0.5); cwnd = ssthresh
The Need for a Data Center TCP
The data center environment is significantly different from wide area networks:
o round trip times (RTTs) can be less than 250 μs in the absence of queuing
o applications need extremely high bandwidths and very low latencies
o little statistical multiplexing: a single flow can dominate a particular path
o the network is largely homogeneous and under a single administrative control
o traffic flowing in switches is mostly internal; connectivity to the external Internet is typically managed through load balancers and application proxies that effectively separate internal traffic from external
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long flows. The measurements by the authors reveal that 99.91% of traffic in the data center is TCP traffic. The traffic consists of query traffic (2KB to 20KB in size), delay-sensitive short messages (100KB to 1MB), and throughput-sensitive long flows (1MB to 100MB). These applications require three things from the data center network:
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow-buffered commodity switches, legacy TCP protocols fall short of satisfying the above requirements.
See paper for details of workload characterization in cloud data centers.
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput with commodity, shallow-buffered switches. DCTCP uses the concept of ECN (Explicit Congestion Notification). DCTCP achieves these goals primarily by reacting to congestion in proportion to the extent of congestion.
DCTCP uses a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold.
The DCTCP source reacts by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor. This is different from standard TCP, which cuts its window size by a factor of 2 when it receives ECN notification.
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.
An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival. Marking is based on the instantaneous value of the queue, not the average value as in RED routers.
The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.
The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.
For senders that use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See paper for details of the delayed ACK scheme; a sketch of the idea follows.
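A minimal sketch of that two-state logic (ours, based on the description above; the names are not from the paper's reference code, and ack() stands in for sending a cumulative ACK with the given ECN-Echo bit):

# Delayed-ACK ECN-Echo state machine for a DCTCP receiver: the state tracks the
# CE codepoint of the most recent packet; a change of state triggers an
# immediate ACK so the sender learns the exact boundary of marked packets.
class DctcpReceiver:
    def __init__(self, m: int = 2):
        self.m = m             # delayed-ACK factor: one cumulative ACK per m packets
        self.ce_state = 0      # CE codepoint carried by the last packet received
        self.unacked = 0       # packets received but not yet acknowledged

    def on_packet(self, ce_marked: bool, ack) -> None:
        ce = 1 if ce_marked else 0
        if ce != self.ce_state:
            if self.unacked:                    # CE changed: flush an immediate ACK
                ack(ecn_echo=self.ce_state)     # echoing the previous state
                self.unacked = 0
            self.ce_state = ce
        self.unacked += 1
        if self.unacked >= self.m:              # normal delayed ACK
            ack(ecn_echo=self.ce_state)
            self.unacked = 0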
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:
α = (1 − g) × α + g × F
where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.
Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
RED Router
[Figure: RED drop/mark probability versus average queue size: accept below THmin, discard or mark with increasing probability between THmin and THmax, discard above THmax (queue capacity C).]
RED Router
Update the value of the average queue size:
 avg = (1 − wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
 calculate probability Pa;
 with probability Pa: discard or mark packet;
 otherwise, with probability 1 − Pa: accept packet
else if (avg > THmax) discard packet

DCTCP Switch
[Figure: DCTCP switch buffer: accept without marking below K, accept with marking between K and the buffer limit.]
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender
Update α = (1 − g) × α + g × F
Reaction to marked ACK in a new window:
 ssthresh = cwnd × (1 − α/2); cwnd = ssthresh

Legacy TCP Sender
Reaction to marked ACK in a new window:
 ssthresh = cwnd/2; cwnd = ssthresh
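The pseudocode above can be made concrete with a short, runnable sketch (ours; g, the initial window, and the marked fractions F are illustrative values only):

# Sender reactions, one window of data at a time: DCTCP cuts cwnd in proportion
# to the estimated congestion level alpha, legacy TCP halves it on any mark.
def dctcp_react(cwnd: float, alpha: float, F: float, g: float = 1.0 / 16):
    alpha = (1 - g) * alpha + g * F             # EWMA over the marked fraction F
    if F > 0:                                   # some packets were marked
        cwnd = cwnd * (1 - alpha / 2)           # proportional decrease
    return cwnd, alpha

def legacy_tcp_react(cwnd: float, F: float) -> float:
    return cwnd / 2 if F > 0 else cwnd          # any ECN mark halves the window

cwnd_dctcp, alpha = 100.0, 0.0
cwnd_tcp = 100.0
for F in [0.1, 0.1, 0.1]:                       # three windows with 10% of packets marked
    cwnd_dctcp, alpha = dctcp_react(cwnd_dctcp, alpha, F)
    cwnd_tcp = legacy_tcp_react(cwnd_tcp, F)
print(round(cwnd_dctcp, 1), round(cwnd_tcp, 1)) # about 98.2 vs 12.5: DCTCP keeps its throughput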
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.
However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (w/ SACK) implementation
D3 TCP
Better Never Than Late: Meeting Deadlines in Datacenter Networks
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
ACM SIGCOMM, August 2011
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th-percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP Basic Idea of Deadline Awareness
[Figure: two flows (f1, f2) with different deadlines (d1, d2) shown over time, under DCTCP and under D3 TCP; the thickness of a flow line represents the rate allocated to it.]
DCTCP is not aware of deadlines and treats all flows equally. DCTCP can easily cause some flows to miss their deadlines.
D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50KB) and RTTs are minimal (~300 μs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges.
Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
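As a rough sketch of this idea (our own simplification, not the paper's algorithm or API): an end host asks for rate = remaining bytes / remaining deadline, and a router grants requests greedily, in arrival order, from its residual capacity.

# Simplified sketch of deadline-driven rate requests and greedy router allocation.
# Function names, units (Gb, Gb/s, seconds), and numbers are our own assumptions.
def desired_rate(bytes_remaining: float, deadline_s: float) -> float:
    return bytes_remaining / deadline_s if deadline_s > 0 else 0.0

def greedy_allocate(requests, capacity):
    """requests: list of (flow_id, Gb_remaining, seconds_to_deadline), in arrival order."""
    grants = {}
    for flow_id, b, d in requests:
        r = min(desired_rate(b, d), capacity)   # grant what is asked, up to what is left
        grants[flow_id] = r
        capacity -= r
    return grants

print(greedy_allocate([("f1", 0.6, 1.0), ("f2", 0.8, 1.0)], capacity=1.0))
# f1 gets its full 0.6 Gb/s; f2 gets only the remaining 0.4 Gb/s and may miss its deadline.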
D2 TCP
Deadline-Aware Datacenter TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in & tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP
[Figure: bandwidth requests arriving at a switch that grants requests FCFS; a request with a far deadline arriving slightly ahead is granted, while a request with a near deadline is paused.]
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
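A tiny illustration (our own numbers, not from the paper) of this FCFS race: the far-deadline request that arrives first is granted most of the port's bandwidth, so the near-deadline request that follows cannot finish in time.

# FCFS grant of rate requests at a switch port, as in D3 TCP. Numbers are
# hypothetical: (name, Gb remaining, seconds until deadline) in arrival order.
capacity = 1.0                                    # Gb/s available on the port
requests = [("far-deadline", 0.9, 1.2),           # arrives slightly earlier
            ("near-deadline", 0.9, 1.0)]          # tighter deadline, arrives later

for name, gb, deadline in requests:
    granted = min(gb / deadline, capacity)        # grant the requested rate if it fits
    capacity -= granted
    finish = gb / granted if granted > 0 else float("inf")
    print(f"{name}: granted {granted:.2f} Gb/s, finishes in {finish:.1f}s (deadline {deadline}s)")
# The near-deadline flow finishes long after its deadline: a priority inversion.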
D2 TCP's Contributions
Deadline-aware and handles fan-in bursts well.
Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less). Reactive and decentralized.
Does not hinder long-lived (non-deadline) flows.
Coexists with TCP → incrementally deployable. No change to switch hardware → deployable today.
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively.
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example: a typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms where every query operates on data spanning thousands of servers.
[Figure: a user query enters at the root, fans out to parent nodes and then to leaf nodes, and the OLDI response returns to the user within ~250 ms.]
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
D2 TCP
Deadline-aware and handles fan-in bursts.
Key idea: vary the sending rate based on both the deadline and the extent of congestion.
Built on top of DCTCP. Distributed: uses per-flow state at end hosts.
Reactive: senders react to congestion; no knowledge of other flows.
D2 TCP Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:
α = (1 − g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction.
D2 TCP Adjusting Congestion Window
The congestion window W is adjusted as follows:
 W = W × (1 − p/2)   if f > 0 (case of packets marked)
 W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
D2 TCP Basic Formulas
After determining p, we resize the congestion window W as follows:
 W = W × (1 − p/2)   if f > 0
where
 p = α^d
 d = the deadline imminence factor, d = Tc / D
 Tc = flow completion time achieved with the current sending rate
 D = the time remaining until the deadline expires
 d < 1 for far-deadline flows; d > 1 for near-deadline flows
 d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)
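A compact sketch (ours, with illustrative parameter values) of how the penalty p = α^d modulates the window per the formulas above:

# D2 TCP window adjustment: p = alpha^d; W = W*(1 - p/2) when packets were
# marked (f > 0), W + 1 otherwise. The values below are illustrative only.
def d2tcp_adjust(W: float, alpha: float, d: float, f: float) -> float:
    if f == 0:                  # no packets marked in the last window of data
        return W + 1
    p = alpha ** d              # gamma-corrected penalty
    return W * (1 - p / 2)

alpha = 0.4                     # observed extent of congestion
for label, d in [("far deadline (d<1)", 0.5), ("no deadline (d=1)", 1.0), ("near deadline (d>1)", 2.0)]:
    print(label, round(d2tcp_adjust(100.0, alpha, d, f=0.4), 1))
# Far-deadline flows back off the most; near-deadline flows retain most of their window.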
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
[Figure: the gamma-correction function p = α^d plotted for d = 1, d < 1 (far deadline), and d > 1 (near deadline); p on the vertical axis and α on the horizontal axis, both from 0 to 1.0.]
W := W × (1 − p/2), with p = α^d
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
D2 TCP Computing α
[Figure: switch buffer with marking threshold K; packets are accepted without marking below K and accepted with marking between K and Buffer_limit.]
Switch:
 if (q ≤ K) accept packet without marking
 else if (K < q ≤ Buffer_limit) accept and mark packet
 else if (q > Buffer_limit) discard packet
Sender: update once every RTT:
 α = (1 − g) × α + g × f
 where f is the fraction of packets that were marked in the latest window of data.
α is calculated by aggregating ECN (like DCTCP): switches mark packets if queue_length > threshold, and the sender computes the fraction of marked packets, averaged over time.
D2 TCP Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP), in which W → W/2 upon congestion detection.
[Figure: sawtooth waves for deadline-agnostic behavior; the window oscillates between W/2 and W over time, shown for the case Tc > L against the deadline D.]
 D = the time remaining until the deadline expires
 W = the flow's current window size
 B = bytes remaining to fully transmit the message
 Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), case Tc > L; window between W/2 and W, time in RTTs, deadline D.]
Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives
 B = (0.75 W) × Tc, with W in bytes, so Tc ≈ B / (0.75 W).
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), case Tc > L, shown against the deadline D.]
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as
 d = Tc / D
with Tc ≈ B / (0.75 W) under the approximation above.
D2 TCP Computing the deadline imminence factor d (continued)
What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure, and we have
[Figure: a partial sawtooth wave for deadline-agnostic behavior (DCTCP), case Tc < L; window between W/2 and W over time.]
 B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)
Since the value of B is known by the application, the value Tc can be computed. The value d is then given by
 d = Tc / D
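The computation above can be summarized in a small sketch (our own illustration, using the 0.75·W approximation and the units defined on these slides; the function name and example numbers are hypothetical):

# Computing the deadline imminence factor d from the quantities defined above:
# B (bytes remaining), W (current window, in bytes), D (RTTs until the deadline).
def imminence_factor(B: float, W: float, D_rtts: float) -> float:
    Tc = B / (0.75 * W)        # estimated completion time, in RTTs (approximation above)
    return Tc / D_rtts         # d > 1: tight deadline; d < 1: loose deadline

# Example: a flow with 300 KB left, a 30 KB window, and 10 RTTs to its deadline:
print(round(imminence_factor(300_000, 30_000, 10), 2))   # ~1.33, a tight deadline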
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 259
Cloud ComputingThe cloud computing architecture is comprised of two significant
parts
The front end is the side at which the user of the computer orthe client himself is able to access This involves the clientrsquos
network or his computer and the program(s) that the client uses
to access the database or the servers that contain all the data
The back end is the cloud itself which is the collection of allrelated information saved on the servers that the client wishes to
have access to
These two ends of the cloud computing architecture are connectedthrough a network usually the Internet which provides remote
access to all the users of the cloud
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 359
Benefits of Cloud Computing
For large-scale businesses the cloud computing technology
eliminates the need to buy an additional number of hardware and
storage devices since all data needed would be easily accessible
from the cloud through the individual computers of the
employees
Installing software in every computer would no longer be
necessary because the cloud computing platform would be ableto do the job
Cloud hosting services provide managed hosting for all server
configurations on a dedicated 247 availability Cloud software
ranges from sales applications to custom applications dependingon the usersrsquo choice
The above benefits can provide a great deal of profit to many
businesses and can also improve customer satisfaction
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 459
Cloud Data Centers Cloud computing services provide the users of the Cloud
better management of their information This might save a
company on expenses since the company will not need to hirea large IT team for its own technical support
There is a lot of cloud computing software available today that
offers provision of cloud computing applications Running
applications and storing data on the cloud has proven to beeconomical and efficient for many businesses
Cloud data centers host diverse applications mixing
workloads that require small predictable latency with othersrequiring large sustained throughput
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 559
Data Center Bubble
Numerous companies are already providing cloud services
including Amazon Google Yahoo Microsoft HP IBM Cisco
etc
Data Centers range in size from ldquoedgerdquo facilities to megascale datacenters (100K to 1M servers)
Data centers are located in many countries eg USA India
Singapore Germany etcThere is a push for Green Data Centers that use windsolar energy
efficient floor layout recycling of waste material environment
friendly materialpaint green rated power equipment
Fastest growing sectors in data centers Telecom Foreign hosting
companies Global information management Business
connectivity
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 659
Data Center Closures amp ConsolidationUS Government Data Centers
In November 2012 the US Government has closed an additional
64 data centers bringing the total number of closed facilities to
381 The closures are part of the Federal Data CenterConsolidation Initiative for streamlining government IT
operations
The ultimate goal is to close 40 percent of the US Federal
Governmentrsquos data centers (ie close 1200 of the nearly 2900identified data centers) by 2015
Commercial Data CentersData center consolidation is a trend in industry For example HP
has been replacing its 85 data centers around the world with only 6newly-built larger facilities in Austin Atlanta and Houston
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 759
Hewlett-Packard Development Company
Example HP Cloud Services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 859
Example Amazon Data CentersAmazon data centers serve four regions in
the US and three regions in Europe and
Asia Another data center in the US was
opened July 2011 in the state of Oregon to
serve the Pacific Northwest region
In December 2011 Amazon announced it is
opening a data center in Sao Paulo Brazil
its first in South America
In November 2012 Amazon announced it
is adding a ninth region by opening a data
center in Sydney Australia
The data centers support all Amazon Web
Services (AWS) including Amazon Elastic
Compute Cloud (EC2) and Amazon
Simple Storage Service (S3)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 959
Example Amazon Data Centers
The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users
with the ability to execute their applications in Amazons computing environment
To use Amazon EC2 Create an Amazon Machine Image (AMI) containing all the software including
the operating system
Upload this AMI to the Amazon S3 (Amazon Simple Storage Service)
Register to get an AMI ID Use this AMI ID and the Amazon EC2 web service APIs to run monitor and
terminate as many instances of this AMI as required
EC2 Pricing Policy pay as you go no minimal fee The prices are based on theRegion in which the application instance is running
httpawsamazoncomec2pricing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1059
Data Center Services
Exampe Colocation Services of Cogent
httpwwwcogentcocomen
Cogent is a multinational Tier 1 Internet Service Provider
Companies can colocate their business critical equipment in one of
43 Cogents secure state-of-the-art data centers that connect directly
to a Tier-1 IP network The data centers have extensive powerbackup systems complete fire detection and suppression plans to
ensure the safety and security of equipment
Cogent Data Center Features
httpwwwcogentcocomenproducts-and-servicescolocation-
services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1159
Colocation Data Centers and Cloud Servers
httpwwwdatacentermapcomdatacentershtml
httpwwwdatacentermapcomcloudhtml
Example AtlanticNet
httpwwwatlanticnetorlando-colocation-floridahtml
Orlando Data Center
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1259
Data Center TCP (DCTCP)
M Alizadehzy A Greenbergy D Maltzy J Padhyey P
Pately B Prabhakarz S Senguptay M Sridharan
983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161
ACM SIGCOMM September 2010
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1359
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1459
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1559
Rack Servers with Commodity Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1659
Performance impairments of Shallow-buffered
Switches1 TCP Incast Collapse
Many applications generate barrier-synchronized requests in which the
client cannot make forward progress until the responses from every
server for the current request have been received An Example of these
applications is a web search query (eg a Google search) sent to a large
number of nodes with results returned to the parent node to be sorted
Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these
requests create many flows that converge on the same interface of a
switch over a short period of time The response packets create a long
queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and
throughput collapse
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1759
1 TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the PartitionAggregate workflow
pattern which is the foundation of many large scale web applications
Requests from higher layers of the application are broken into pieces and
farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content
composition and advertisement selection are based around the
PartitionAggregate design pattern
In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require
iteratively invoking the pattern with an aggregator making serial requests
to the workers below it to prepare a response (1 to 4 iterations are typical
though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up
to the root must be completed within the deadline
In other publications this pattern is referred to as the ScatterGather pattern
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1859
983137983143983143983154983141983143983137983156983151983154
983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154
The partitionaggregate design pattern
Request Latency deadline 250 ms
deadline 50 ms
deadline 10 ms
The total permissible latency for a request is limited and the ldquobackendrdquo part of the
application is typically allocated between 230-300 ms This limit is called the all-up SLA
Example in web search a query might be sent to many aggregators and workers each
responsible for a different part of the index Based on the replies an aggregator might
refine the query and send it out again to improve the relevance of the result Lagging
instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries
A high-level aggregator
(HLA) partitions queries to
a large number of mid-level
aggregators (MLAs) that in
turn partition each query
over the other servers in the
same rack as the MLA
Servers act as both MLAs
and workers so each server
will be acting as an
aggregator for some queries
and as a worker for other
queries
HLA
MLAMLA
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1959
aggregator
worker 1worker 2worker 3worker 4
query
response
Ack
A TCP Incast Event
Response from worker 3 is lost due to incast and is
retransmitted after a timeout
timeout
983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154
983090
983159983151983154983147983141983154
983089
983159983151983154983147983141983154
983091
983159983151983154983147983141983154
983091
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
IncastScenario
Packets from many
flows arriving to
the same port at
the same time
Incast Collapse Summary
In other publications the incast scnario
is referred to as the fan-in burst at the
parent node This incast is a key reason
for increased network delay and occurswhen all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time causing a fan-
in burst at the parent node
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
[Figure: RED marking regions as a function of the average queue size: Accept below THmin; Discard or Mark with increasing probability between THmin and THmax; Discard above THmax (queue capacity C)]
Update the value of the average queue size:
avg = (1 - wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa
    with probability Pa: discard or mark packet
    otherwise, with probability 1 - Pa: accept packet
else if (avg > THmax) discard packet

DCTCP Switch
[Figure: DCTCP marking regions as a function of the instantaneous queue size q: Accept without marking below K; Accept with marking between K and the buffer limit; Discard above the limit]
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender
Update α = (1 - g) × α + g × F
Reaction to marked ACK in a new window:
    ssthresh = cwnd × (1 - α/2)
    cwnd = ssthresh

Legacy TCP Sender
Reaction to marked ACK in a new window:
    ssthresh = cwnd/2
    cwnd = ssthresh
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.
However, in practice, each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (w/ SACK) implementation
D3 TCP
Better Never Than Late: Meeting Deadlines in Datacenter Networks
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
ACM SIGCOMM, August 2011
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP Basic Idea of Deadline Awareness
[Figure: rate over time for two flows f1 and f2 with deadlines d1 and d2, under DCTCP and under D3 TCP]
Two flows (f1, f2) with different deadlines (d1, d2). The thickness of a flow line represents the rate allocated to it.
DCTCP is not aware of deadlines and treats all flows equally. DCTCP can easily cause some flows to miss their deadlines.
D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50KB) and RTTs are minimal (~300 microseconds). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic into the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
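As a rough illustration of the end-host side of this idea (the full request-and-allocation protocol is in the paper), the rate an end host asks for each RTT can be derived from the remaining bytes and the time left until the deadline; the function below is an illustrative sketch, not the paper's exact formulation:

# Illustrative sketch: rate an end host might request so the flow just meets its deadline.
def requested_rate(bytes_remaining, seconds_to_deadline):
    if seconds_to_deadline is None or seconds_to_deadline <= 0:
        return None                   # no deadline (or already past it): handled separately
    return bytes_remaining / seconds_to_deadline

# Example: 1 MB left and 200 ms to the deadline -> request about 5 MB/s along the path.
print(requested_rate(1_000_000, 0.2))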
D2 TCP
Deadline-Aware Datacenter TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP
[Figure: bandwidth requests arriving at a switch that grants requests FCFS; a request with a far deadline is granted while a request with a near deadline is paused]
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
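The race condition can be seen with a toy example (illustrative numbers, not from the paper): a switch that grants requests strictly FCFS serves a far-deadline request that arrived marginally earlier, and the near-deadline request behind it misses its deadline.

# Toy illustration of FCFS granting causing priority inversion (illustrative values).
requests = [
    {"name": "far-deadline flow",  "arrival": 0.0, "size": 8.0, "deadline": 30.0},
    {"name": "near-deadline flow", "arrival": 0.1, "size": 8.0, "deadline": 10.0},
]
link_rate = 1.0                                            # data units per time unit
t = 0.0
for req in sorted(requests, key=lambda r: r["arrival"]):   # FCFS order
    t = max(t, req["arrival"]) + req["size"] / link_rate   # time the request is fully served
    verdict = "meets" if t <= req["deadline"] else "MISSES"
    print(req["name"], "finishes at t =", t, "and", verdict, "its deadline")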
D2 TCP's Contributions
Deadline-aware and handles fan-in bursts well.
Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less). Reactive, decentralized.
Does not hinder long-lived (non-deadline) flows.
Coexists with TCP → incrementally deployable.
No change to switch hardware → deployable today.
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms where every query operates on data spanning thousands of servers.
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parents-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
[Figure: a root node fans a user query out to parent aggregators, which fan it out to leaf servers; the OLDI response returns to the user within ~250 ms]
D2 TCP
Deadline-aware and handles fan-in bursts.
Key idea: vary the sending rate based on both the deadline and the extent of congestion.
Built on top of DCTCP. Distributed: uses per-flow state at end hosts.
Reactive: senders react to congestion. No knowledge of other flows.
D2 TCP Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1 and therefore p ≤ 1. The above function is known in computer graphics as gamma-correction.
D2 TCP Adjusting Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 - p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For α between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
D2 TCP Basic Formulas
After determining p, we resize the congestion window W as follows:
W = W × (1 - p/2)   if f > 0,   where p = α^d
and d = deadline imminence factor:
d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)
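The formulas above can be pulled together in a short Python sketch; the function and variable names are illustrative, and the example values are made up for demonstration:

# Illustrative sketch of the D2 TCP gamma-corrected window update.
def d2tcp_window(cwnd, alpha, d, f):
    """p = alpha**d; back off by (1 - p/2) when packets were marked, else grow by one segment."""
    if f > 0:                          # some packets were CE-marked in the last window
        p = alpha ** d                 # d < 1 (far deadline): p > alpha, back off more
        return cwnd * (1 - p / 2)      # d > 1 (near deadline): p < alpha, back off less
    return cwnd + 1

# Same congestion level (alpha = 0.5), different deadline imminence:
print(d2tcp_window(100.0, alpha=0.5, d=0.5, f=0.2))   # far deadline  -> about 64.6
print(d2tcp_window(100.0, alpha=0.5, d=1.0, f=0.2))   # DCTCP-like    -> 75.0
print(d2tcp_window(100.0, alpha=0.5, d=2.0, f=0.2))   # near deadline -> 87.5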
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
[Figure: plot of p = α^d versus α, showing curves for d = 1, d < 1 (far deadline), and d > 1 (near deadline)]
Key insight: near-deadline flows back off less while far-deadline flows back off more.
W := W × (1 - p/2),   p = α^d
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
D2 TCP Computing α
α is calculated by aggregating ECN marks (like DCTCP):
• Switches mark packets if queue_length > threshold
• The sender computes the fraction of marked packets, averaged over time
[Figure: switch buffer with thresholds K and Buffer_limit; accept without marking below K, accept with marking between K and Buffer_limit]
Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet
Sender (update once every RTT):
α = (1 - g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data
D2 TCP Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection.
[Figure: sawtooth waves oscillating between W/2 and W with period L, over a duration Tc > L; the deadline D is marked on the time axis]
D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d (continued)
Under the sawtooth pattern, the bytes transmitted satisfy
B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L - 1) ] × Tc / L,   for Tc ≥ L
Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.
[Figure: sawtooth waves between W/2 and W with period L (time in RTTs), over a duration Tc > L, with deadline D]
Since the value of B is known by the application and L - 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives
Tc = B / (0.75 W)   (W in bytes)
Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d (continued)
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as
d = Tc / D,   with Tc = B / (0.75 W) as the approximation
[Figure: sawtooth waves between W/2 and W (time in RTTs), over a duration Tc > L, with deadline D]
D2 TCP the deadline imminence factor d
What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure, and we have
B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc - 1)
Since the value of B is known by the application, the value Tc can be computed. The value d is then given by
d = Tc / D
[Figure: a partial sawtooth: the window grows from W/2 for Tc RTTs and the flow completes before the window reaches W, since Tc < L]
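Putting the two cases together, here is a small Python sketch of how Tc and d could be computed from B, W, and D; the function names, the unit conventions (windows in segments, times in RTTs), and the exact-sum variant are illustrative choices that follow the slides' formulas rather than a specific implementation:

# Illustrative sketch of the deadline-imminence computation described above.
def tc_approx(B, W):
    """Tc in RTTs, assuming an average window of 0.75*W over the transfer (B, W in segments)."""
    return B / (0.75 * W)

def tc_exact(B, W):
    """Count RTTs under the sawtooth: the window grows from W/2 by one segment per RTT
    and is halved once it reaches W (covers both the Tc >= L and Tc < L cases)."""
    sent, w, rtts = 0.0, W / 2, 0
    while sent < B:
        sent += w
        w = W / 2 if w >= W else w + 1
        rtts += 1
    return rtts

def imminence_factor(B, W, D_rtts):
    """d = Tc / D: d > 1 means a tight (near) deadline, d < 1 a far one."""
    return tc_approx(B, W) / D_rtts

# Example: 60 segments remaining, current window 10 segments, deadline 10 RTTs away.
print(tc_approx(60, 10), tc_exact(60, 10), imminence_factor(60, 10, 10))   # 8.0  9  0.8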
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 459
Cloud Data Centers Cloud computing services provide the users of the Cloud
better management of their information This might save a
company on expenses since the company will not need to hirea large IT team for its own technical support
There is a lot of cloud computing software available today that
offers provision of cloud computing applications Running
applications and storing data on the cloud has proven to beeconomical and efficient for many businesses
Cloud data centers host diverse applications mixing
workloads that require small predictable latency with othersrequiring large sustained throughput
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 559
Data Center Bubble
Numerous companies are already providing cloud services
including Amazon Google Yahoo Microsoft HP IBM Cisco
etc
Data Centers range in size from ldquoedgerdquo facilities to megascale datacenters (100K to 1M servers)
Data centers are located in many countries eg USA India
Singapore Germany etcThere is a push for Green Data Centers that use windsolar energy
efficient floor layout recycling of waste material environment
friendly materialpaint green rated power equipment
Fastest growing sectors in data centers Telecom Foreign hosting
companies Global information management Business
connectivity
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 659
Data Center Closures amp ConsolidationUS Government Data Centers
In November 2012 the US Government has closed an additional
64 data centers bringing the total number of closed facilities to
381 The closures are part of the Federal Data CenterConsolidation Initiative for streamlining government IT
operations
The ultimate goal is to close 40 percent of the US Federal
Governmentrsquos data centers (ie close 1200 of the nearly 2900identified data centers) by 2015
Commercial Data CentersData center consolidation is a trend in industry For example HP
has been replacing its 85 data centers around the world with only 6newly-built larger facilities in Austin Atlanta and Houston
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 759
Hewlett-Packard Development Company
Example HP Cloud Services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 859
Example Amazon Data CentersAmazon data centers serve four regions in
the US and three regions in Europe and
Asia Another data center in the US was
opened July 2011 in the state of Oregon to
serve the Pacific Northwest region
In December 2011 Amazon announced it is
opening a data center in Sao Paulo Brazil
its first in South America
In November 2012 Amazon announced it
is adding a ninth region by opening a data
center in Sydney Australia
The data centers support all Amazon Web
Services (AWS) including Amazon Elastic
Compute Cloud (EC2) and Amazon
Simple Storage Service (S3)
Example Amazon Data Centers
The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users
with the ability to execute their applications in Amazon's computing environment.
To use Amazon EC2:
Create an Amazon Machine Image (AMI) containing all the software, including the operating system
Upload this AMI to Amazon S3 (Amazon Simple Storage Service)
Register to get an AMI ID
Use this AMI ID and the Amazon EC2 web service APIs to run, monitor, and
terminate as many instances of this AMI as required
EC2 Pricing Policy: pay as you go, no minimum fee. The prices are based on the Region in which the application instance is running
http://aws.amazon.com/ec2/pricing
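As a rough illustration of the run/monitor/terminate workflow above, a minimal sketch using the AWS SDK for Python (boto3) might look as follows; the region, AMI ID, and instance type are placeholders, not values from these slides.

```python
import boto3

# Placeholders: substitute a real region, a registered AMI ID, and an instance type.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Run one instance of the registered AMI.
resp = ec2.run_instances(ImageId="ami-00000000000000000",
                         InstanceType="t2.micro", MinCount=1, MaxCount=1)
instance_ids = [inst["InstanceId"] for inst in resp["Instances"]]

# Monitor the instance state.
desc = ec2.describe_instances(InstanceIds=instance_ids)
for reservation in desc["Reservations"]:
    for inst in reservation["Instances"]:
        print(inst["InstanceId"], inst["State"]["Name"])

# Terminate the instance when it is no longer needed (pay-as-you-go pricing).
ec2.terminate_instances(InstanceIds=instance_ids)
```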
Data Center Services
Example: Colocation Services of Cogent
http://www.cogentco.com/en
Cogent is a multinational Tier 1 Internet Service Provider.
Companies can colocate their business critical equipment in one of
Cogent's 43 secure, state-of-the-art data centers that connect directly
to a Tier-1 IP network. The data centers have extensive power backup systems and complete fire detection and suppression plans to
ensure the safety and security of equipment.
Cogent Data Center Features:
http://www.cogentco.com/en/products-and-services/colocation-services
Colocation Data Centers and Cloud Servers
http://www.datacentermap.com/datacenters.html
http://www.datacentermap.com/cloud.html
Example: AtlanticNet
http://www.atlantic.net/orlando-colocation-florida.html
Orlando Data Center
Data Center TCP (DCTCP)
M. Alizadeh, A. Greenberg, D. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan
Microsoft Research & Stanford University
ACM SIGCOMM September 2010
Rack Servers with Commodity Switches
Performance impairments of Shallow-buffered Switches
1. TCP Incast Collapse
Many applications generate barrier-synchronized requests in which the
client cannot make forward progress until the responses from every
server for the current request have been received. An example of these
applications is a web search query (eg a Google search) sent to a large
number of nodes with results returned to the parent node to be sorted
Barrier-synchronized requests can result in packets overfilling the shallow buffers on the client's port on the switch. In other words, these
requests create many flows that converge on the same interface of a
switch over a short period of time. The response packets create a long
queue and may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses and
throughput collapse.
1. TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the Partition/Aggregate workflow
pattern, which is the foundation of many large scale web applications.
Requests from higher layers of the application are broken into pieces and
farmed out to workers in lower layers. The responses of these workers are aggregated to produce a result. Web searches, social network content
composition, and advertisement selection are based around the
Partition/Aggregate design pattern.
In a multi-layer partition/aggregate pattern workflow, lags at one layer delay the initiation of others. Further, answering a request may require
iteratively invoking the pattern, with an aggregator making serial requests
to the workers below it to prepare a response (1 to 4 iterations are typical,
though as many as 20 may occur). The propagation of the request down to leaves and of the responses back up
to the root must be completed within the deadline.
In other publications this pattern is referred to as the Scatter/Gather pattern.
The partition/aggregate design pattern
[Figure: an aggregator fans out requests to lower-level aggregators, which fan out to workers; request latency deadlines shrink at each level (250 ms, 50 ms, 10 ms).]
The total permissible latency for a request is limited, and the "backend" part of the
application is typically allocated between 230-300 ms. This limit is called the all-up SLA.
Example: in web search, a query might be sent to many aggregators and workers, each
responsible for a different part of the index. Based on the replies, an aggregator might
refine the query and send it out again to improve the relevance of the result. Lagging
instances of partition/aggregate can thus add up to threaten the all-up SLAs for queries.
A high-level aggregator (HLA) partitions queries to a large number of mid-level
aggregators (MLAs) that in turn partition each query over the other servers in the
same rack as the MLA. Servers act as both MLAs and workers, so each server
will be acting as an aggregator for some queries and as a worker for other queries.
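As a toy sketch of this pattern (not taken from the slides), an aggregator with a 50 ms budget can fan a query out to its workers and aggregate only the responses that arrive before the budget expires; the worker latencies here are simulated.

```python
import asyncio, random

async def worker(shard_id: int) -> str:
    # Simulated lookup on one index shard with a random 0-60 ms latency.
    await asyncio.sleep(random.uniform(0.0, 0.060))
    return f"result-from-shard-{shard_id}"

async def aggregator(num_workers: int, budget_s: float) -> list:
    # Fan out, then keep only the answers that arrive within this level's budget.
    tasks = [asyncio.create_task(worker(i)) for i in range(num_workers)]
    done, pending = await asyncio.wait(tasks, timeout=budget_s)
    for task in pending:
        task.cancel()            # late responses are dropped, not waited for
    return [task.result() for task in done]

async def main():
    results = await aggregator(num_workers=40, budget_s=0.050)
    print(f"aggregated {len(results)}/40 responses within the 50 ms budget")

asyncio.run(main())
```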
[Figure: A TCP Incast Event. An aggregator sends a query to workers 1-4 and receives their responses; the response from worker 3 is lost due to incast and is retransmitted only after a timeout.]
Incast Scenario: packets from many flows arriving to the same port at the same time.
Incast Collapse Summary
In other publications the incast scenario
is referred to as the fan-in burst at the
parent node. This incast is a key reason
for increased network delay and occurs when all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time, causing a fan-
in burst at the parent node.
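The arithmetic behind the collapse can be illustrated with a deliberately simplified model (our own, not from the paper): N synchronized responses of R packets each arrive at one port whose shallow buffer holds B packets and which can forward only a limited number of packets during the burst.

```python
def incast_drops(num_workers: int, pkts_per_response: int,
                 buffer_pkts: int, forwarded_pkts: int) -> int:
    """Packets dropped when synchronized responses converge on one shallow-buffered port."""
    arriving = num_workers * pkts_per_response     # all responses land in the same window
    absorbed = buffer_pkts + forwarded_pkts        # what the port can queue or forward
    return max(0, arriving - absorbed)

# Example: 40 workers x 10-packet responses against a 100-packet buffer that can
# forward 50 packets during the burst -> 250 packets are dropped, and the affected
# flows recover only after TCP retransmission timeouts.
print(incast_drops(40, 10, 100, 50))
```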
Performance impairments of Shallow-buffered Switches
2. Queue Buildup
When long and short flows traverse the same queue, there is a queue
buildup impairment: the short flows experience increased latency as they are in queue behind packets from the large flows. Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers), this traffic pattern occurs very frequently.
This indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows. Further, answering a
request can require multiple iterations, which magnifies the impact of
this delay.
Performance impairments of Shallow-buffered Switches
3. Buffer Pressure
Given the mix of long and short flows in a data center, it is very
common for short flows on one port to be impacted by activity on
other ports. The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports.
The long greedy TCP flows build up queues on their interfaces.
Since the switch is shallow-buffered and the buffer space is a shared resource, the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the Partition/Aggregate
traffic. This impairment is called buffer pressure. The result is packet
loss and timeouts, as in incast, but without requiring synchronized flows.
Flow Interactions in Shallow-buffered Switches
Incast Scenario: multiple short flows on the same port
Queue Buildup: short and long flows on the same port
Buffer Pressure: short flows on one port and long flows on another port
Legacy TCP Congestion Control
[Figure: congestion window (in segments) versus round-trip times, showing slow start up to ss_thresh = 16, congestion avoidance up to cwnd = 20, a time-out followed by a new slow start with ss_thresh = 10, segment losses, and fast retransmit.]
Fast Retransmission: ssthresh = cwnd/2 = cwnd × (1 - 0.5), cwnd = ssthresh
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 μs in the absence of queuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing: a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative control
o Traffic flowing in switches is mostly internal. Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external.
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows. The measurements by the authors reveal that 99.91% of
traffic in the data center is TCP traffic. The traffic consists of query
traffic (2KB to 20KB in size), delay sensitive short messages (100KB to 1MB), and throughput sensitive long flows (1MB to
100MB). These applications require three things from the data
center network:
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodity switches, legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold.
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets: the larger the fraction, the
bigger the decrease factor. This is different from standard TCP, which cuts its window size by a factor of 2 when it receives ECN
notification.
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival.
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of
average queue length.
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received. The DCTCP receiver, however, tries to
accurately convey the exact sequence of marked packets back to the sender. This is
done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets), the DCTCP receiver uses a state-machine with two states to determine whether to set the ECN-Echo bit. See paper for details of the
delayed ACK scheme.
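A sketch of that two-state logic is shown below; it follows the behavior described in the DCTCP paper (an immediate ACK whenever the CE marking changes, otherwise one delayed ACK per m packets), and the class and method names are ours.

```python
class DctcpReceiver:
    """Sketch of DCTCP's two-state ECN-Echo logic for delayed ACKs."""
    def __init__(self, m: int = 2):
        self.m = m              # packets per delayed ACK
        self.ce_state = 0       # CE codepoint of the most recently received packet
        self.pending = 0        # packets received but not yet acknowledged

    def on_packet(self, ce_marked: bool, send_ack) -> None:
        ce = 1 if ce_marked else 0
        if ce != self.ce_state:
            # Marking changed: immediately ACK the packets seen so far, echoing the
            # old state, so the sender learns the exact sequence of marked packets.
            if self.pending:
                send_ack(ece=self.ce_state)
            self.pending = 0
            self.ce_state = ce
        self.pending += 1
        if self.pending == self.m:
            send_ack(ece=self.ce_state)     # normal delayed ACK
            self.pending = 0

# Usage sketch: receiver.on_packet(ce_marked=True, send_ack=lambda ece: print("ACK, ECE =", ece))
```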
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
α = (1 - g) × α + g × F
where F is the fraction of packets that were marked in the latest window
of data and 0 < g < 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any marks
when the queue length is below K, the above equation implies that α
estimates the probability that the queue size is greater than K. The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
RED Router
[Figure: RED marking profile. Packets are accepted below THmin, discarded or marked with increasing probability between THmin and THmax, and discarded above THmax (queue capacity C).]
Update the value of the average queue size: avg = (1 - wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa
    with probability Pa discard or mark packet
    otherwise, with probability 1 - Pa, accept packet
else if (avg > THmax) discard packet

DCTCP Switch
[Figure: DCTCP marking profile. Packets are accepted without marking below K, accepted and marked between K and the buffer limit, and discarded above the limit.]
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender
Update α = (1 - g) × α + g × F
Reaction to marked ACK in a new window:
ssthresh = cwnd × (1 - α/2); cwnd = ssthresh

Legacy TCP Sender
Reaction to marked ACK in a new window:
ssthresh = cwnd/2; cwnd = ssthresh
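Putting the switch rule and the sender rule above together, a minimal sketch (illustrative parameter values; real stacks work on bytes and per-ACK state) looks like this:

```python
def dctcp_switch(queue_len: int, K: int, limit: int) -> str:
    """Instantaneous-queue marking rule of the DCTCP switch."""
    if queue_len <= K:
        return "accept"
    if queue_len <= limit:
        return "accept+mark"
    return "drop"

class DctcpSender:
    """DCTCP sender reaction, applied once per window of data."""
    def __init__(self, cwnd: float = 10.0, g: float = 1.0 / 16):
        self.cwnd, self.alpha, self.g = cwnd, 0.0, g

    def on_window_acked(self, marked_acks: int, total_acks: int) -> None:
        F = marked_acks / max(total_acks, 1)            # fraction marked in the last window
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked_acks > 0:
            # Cut in proportion to congestion instead of the legacy fixed 1/2.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1                              # additive increase as usual
```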
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the
queue length on an interface exceeds K. This reduces queuing
delays on congested switch ports, which minimizes the impact
of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses
that can lead to timeouts.
Buffer pressure: a congested port's queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized
small flows hit the same queue, is the most difficult to handle. If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to
avoid packet drops.
However, in practice each flow has several packets to transmit, and
their windows build up over multiple RTTs. It is often bursts in
subsequent RTTs that lead to drops. Because DCTCP starts marking
early (and aggressively, based on instantaneous queue length),
DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts. This prevents buffer
overflows and resulting timeouts
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New
Reno (with SACK) implementation
D3 TCP
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
ACM SIGCOMM August 2011
Better Never Than Late: Meeting Deadlines in Datacenter Networks
Microsoft Research
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found
to reduce the 99th-percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that
equally throttles all flows, irrespective of whether their deadlines
are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP Basic Idea of Deadline Awareness
[Figure: two flows (f1, f2) with different deadlines (d1, d2) under DCTCP and under D3 TCP; the thickness of a flow line represents the rate allocated to it.]
DCTCP is not aware of deadlines and treats all flows equally;
DCTCP can easily cause some flows to miss their deadline. D3 TCP allocates bandwidth to flows based on their
deadline. Awareness of deadlines can be used in D3 TCP to
ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a
flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline). Further,
datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50KB) and RTTs are minimal
(~300 μs). Consequently, reaction time-scales are short, and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce
traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges. Each application knows the deadline for a message and the size
of the message, and passes this information to the transport layer
in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination.
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses.
D2 TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows
may miss their deadlines with DCTCP. Our results show
DCTCP with 25% missed deadlines at high fan-in & tight
deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network. While D3
TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
Priority Inversion in D3 TCP
[Figure: bandwidth requests arriving at a switch, which grants requests FCFS; a request with a far deadline is granted while a request with a near deadline is paused.]
D3 TCP's greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests. Due to this
race condition, D3 TCP causes frequent priority inversions which contribute to missed deadlines. Our results show that D3 TCP inverts
the priority of 24-33% of requests.
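A toy demonstration of the race (ours, not the actual D3 allocation logic): when the switch hands out the residual capacity strictly in arrival order, a far-deadline request that shows up a moment earlier locks out a near-deadline request.

```python
def grant_fcfs(capacity_gbps: float, requests: list) -> list:
    """Grant rate requests strictly in arrival order, as in the figure above."""
    granted = []
    for arrival_s, name, rate_gbps, deadline_ms in sorted(requests):
        if rate_gbps <= capacity_gbps:
            capacity_gbps -= rate_gbps
            granted.append(name)
    return granted

requests = [
    # (arrival time in s, flow name, requested rate in Gbps, deadline in ms)
    (0.000, "far-deadline flow", 0.8, 200),
    (0.001, "near-deadline flow", 0.8, 10),
]
# Only the far-deadline flow is granted; the near-deadline flow is paused and
# will very likely miss its 10 ms deadline -> priority inversion.
print(grant_fcfs(1.0, requests))
```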
D2 TCP's Contributions
Deadline-aware and handles fan-in bursts well
Elegant: uses gamma-correction for congestion avoidance
(far-deadline → back off more; near-deadline → back off less)
Reactive, decentralized
Does not hinder long-lived (non-deadline) flows
Coexists with TCP → incrementally deployable
No change to switch hardware → deployable today
D2 TCP achieves 75% and 50% fewer
missed deadlines than DCTCP and D3 TCP, respectively.
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency). OLDI applications can be found in the growing high-
revenue online services such as Web search, online retail, and
advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable
by the user and her friends, a cascade of friend event notifications, a chat
application listing friends currently on-line, and advertisements. This
Facebook page is made up of many components generated by independent
subsystems and "mixed" together to provide rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are
delayed. Alternatively, it must present what it has at the deadline, sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees.
OLDI Applications
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms, parents-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers.
[Figure: a tree of servers with a root, parent aggregators, and leaf workers; a user query enters at the root and the OLDI response is returned within ~250 ms.]
D2 TCP
Deadline-aware and handles fan-in bursts.
Key Idea: vary the sending rate based on both the deadline and the extent of congestion.
Built on top of DCTCP.
Distributed: uses per-flow state at end hosts.
Reactive: senders react to congestion; no knowledge of other flows.
D2 TCP Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively
measures the extent of congestion:
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty
function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1 and therefore p ≤ 1. The above
function is known in computer graphics as the gamma-correction.
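For example, with α = 0.5, a far-deadline flow with d = 0.5 gets p = 0.5^0.5 ≈ 0.71, a deadline-free flow with d = 1 gets p = 0.5, and a near-deadline flow with d = 2 gets p = 0.5^2 = 0.25; under the window update on the next slide, the near-deadline flow therefore keeps the largest share of its window.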
D2 TCP Adjusting Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 - p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (ie no CE-marked packets, indicating absence of
congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and
therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
D2 TCP Basic Formulas
After determining p, we resize the congestion window W as follows:
W = W × (1 - p/2)   for f > 0, where p = α^d
d = Tc / D (the deadline imminence factor), where
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows, d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
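A compact sketch of the resulting sender update (the class name, the g value, and the per-window form are ours; a real implementation tracks bytes and per-ACK ECN feedback):

```python
class D2tcpSender:
    """Deadline-aware window adjustment: p = alpha^d, W = W * (1 - p/2)."""
    def __init__(self, cwnd: float = 10.0, g: float = 1.0 / 16):
        self.cwnd, self.alpha, self.g = cwnd, 0.0, g

    def on_window_acked(self, frac_marked: float, d: float) -> None:
        # d = Tc / D is the deadline imminence factor; d = 1 for deadline-free flows.
        self.alpha = (1 - self.g) * self.alpha + self.g * frac_marked
        if frac_marked > 0:
            p = self.alpha ** d                 # gamma correction of the congestion estimate
            self.cwnd = max(1.0, self.cwnd * (1 - p / 2))
        else:
            self.cwnd += 1                      # no marks: grow by one segment, as in TCP

# For the same alpha, a near-deadline flow (d > 1) computes a smaller p than a
# far-deadline flow (d < 1), so it gives up less of its window.
```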
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
[Figure: the penalty p = α^d as a function of α for d = 1, d < 1 (far deadline), and d > 1 (near deadline), together with the window update W := W × (1 - p/2).]
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
D2 TCP Computing α
α is calculated by aggregating ECN marks (like DCTCP): switches mark packets when the queue length exceeds the threshold, and the sender computes the fraction of marked packets averaged over time.
Switch:
[Figure: switch buffer with marking threshold K and Buffer_limit; packets are accepted without marking below K and accepted with marking between K and Buffer_limit.]
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet
Sender (update once every RTT):
α = (1 - g) × α + g × f, where f is the fraction of packets that were
marked in the latest window of data.
D2 TCP Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a
message and passes this information
to the transport layer in the request
to send.
To estimate the time Tc to complete
transmitting the message (flow), D2
TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP).
[Figure: sawtooth congestion window for deadline-agnostic behavior; upon congestion detection the window drops from W to W/2 and grows back over L round-trip times; here Tc > L and the flow must finish within the remaining time D.]
D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d
B = [W/2 + (W/2 + 1) + (W/2 + 2) + ... + (W/2 + L - 1)] × (Tc / L)
[Figure: the same sawtooth pattern, with Tc spanning an integer number of sawtooth waves of L round-trip times each.]
Note that Tc / L is the number of
sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application, and L - 1 = W/2 for the
sawtooth pattern, the value Tc can be computed. An alternative reasonable
approximation is to assume that the average window size over the duration of Tc is 0.75 W (ie Tc is an integer multiple of L). This gives
Tc = B / (0.75 × W)   (W in bytes)
Analysis continued on the next slide.
[Figure: the same sawtooth pattern, with window W, sawtooth length L round-trip times, and remaining time D.]
It also follows that if Tc > D then we should set d > 1 to indicate a tight
deadline, and vice versa. Therefore we compute d as
d = Tc / D
where Tc is the time needed for a flow to
complete transmitting all its data
under the deadline-agnostic behavior, and D is the time
remaining until its deadline
expires. If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
Tc = D), then d = 1 is appropriate.
Tc = B / (0.75 × W)   (approximation)
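The estimate of Tc and the resulting d can be sketched as below, using the 0.75 W approximation from the previous slide; treating B and W in bytes and D in round-trip times is our reading of the slides' units.

```python
def deadline_imminence(B_bytes: float, W_bytes: float, D_rtts: float) -> float:
    """d = Tc / D, with Tc approximated via the 0.75 * W average sawtooth window."""
    Tc_rtts = B_bytes / (0.75 * W_bytes)   # completion time under deadline-agnostic behavior
    return Tc_rtts / D_rtts

# Example: 150 KB remaining, a 20 KB window, and 20 RTTs until the deadline:
# Tc = 10 RTTs, so d = 0.5 (a far-deadline flow that can afford to back off more).
print(deadline_imminence(B_bytes=150_000, W_bytes=20_000, D_rtts=20))
```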
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner. When congestion occurs, far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all. With such deadline-aware congestion management,
not only can the number of missed deadlines be reduced, but
also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN, which is true of today's
datacenter switches.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 559
Data Center Bubble
Numerous companies are already providing cloud services
including Amazon Google Yahoo Microsoft HP IBM Cisco
etc
Data Centers range in size from ldquoedgerdquo facilities to megascale datacenters (100K to 1M servers)
Data centers are located in many countries eg USA India
Singapore Germany etcThere is a push for Green Data Centers that use windsolar energy
efficient floor layout recycling of waste material environment
friendly materialpaint green rated power equipment
Fastest growing sectors in data centers Telecom Foreign hosting
companies Global information management Business
connectivity
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 659
Data Center Closures amp ConsolidationUS Government Data Centers
In November 2012 the US Government has closed an additional
64 data centers bringing the total number of closed facilities to
381 The closures are part of the Federal Data CenterConsolidation Initiative for streamlining government IT
operations
The ultimate goal is to close 40 percent of the US Federal
Governmentrsquos data centers (ie close 1200 of the nearly 2900identified data centers) by 2015
Commercial Data CentersData center consolidation is a trend in industry For example HP
has been replacing its 85 data centers around the world with only 6newly-built larger facilities in Austin Atlanta and Houston
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 759
Hewlett-Packard Development Company
Example HP Cloud Services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 859
Example Amazon Data CentersAmazon data centers serve four regions in
the US and three regions in Europe and
Asia Another data center in the US was
opened July 2011 in the state of Oregon to
serve the Pacific Northwest region
In December 2011 Amazon announced it is
opening a data center in Sao Paulo Brazil
its first in South America
In November 2012 Amazon announced it
is adding a ninth region by opening a data
center in Sydney Australia
The data centers support all Amazon Web
Services (AWS) including Amazon Elastic
Compute Cloud (EC2) and Amazon
Simple Storage Service (S3)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 959
Example Amazon Data Centers
The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users
with the ability to execute their applications in Amazons computing environment
To use Amazon EC2 Create an Amazon Machine Image (AMI) containing all the software including
the operating system
Upload this AMI to the Amazon S3 (Amazon Simple Storage Service)
Register to get an AMI ID Use this AMI ID and the Amazon EC2 web service APIs to run monitor and
terminate as many instances of this AMI as required
EC2 Pricing Policy pay as you go no minimal fee The prices are based on theRegion in which the application instance is running
httpawsamazoncomec2pricing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1059
Data Center Services
Exampe Colocation Services of Cogent
httpwwwcogentcocomen
Cogent is a multinational Tier 1 Internet Service Provider
Companies can colocate their business critical equipment in one of
43 Cogents secure state-of-the-art data centers that connect directly
to a Tier-1 IP network The data centers have extensive powerbackup systems complete fire detection and suppression plans to
ensure the safety and security of equipment
Cogent Data Center Features
httpwwwcogentcocomenproducts-and-servicescolocation-
services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1159
Colocation Data Centers and Cloud Servers
httpwwwdatacentermapcomdatacentershtml
httpwwwdatacentermapcomcloudhtml
Example AtlanticNet
httpwwwatlanticnetorlando-colocation-floridahtml
Orlando Data Center
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1259
Data Center TCP (DCTCP)
M Alizadehzy A Greenbergy D Maltzy J Padhyey P
Pately B Prabhakarz S Senguptay M Sridharan
983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161
ACM SIGCOMM September 2010
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1359
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1459
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1559
Rack Servers with Commodity Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1659
Performance impairments of Shallow-buffered
Switches1 TCP Incast Collapse
Many applications generate barrier-synchronized requests in which the
client cannot make forward progress until the responses from every
server for the current request have been received An Example of these
applications is a web search query (eg a Google search) sent to a large
number of nodes with results returned to the parent node to be sorted
Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these
requests create many flows that converge on the same interface of a
switch over a short period of time The response packets create a long
queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and
throughput collapse
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1759
1 TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the PartitionAggregate workflow
pattern which is the foundation of many large scale web applications
Requests from higher layers of the application are broken into pieces and
farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content
composition and advertisement selection are based around the
PartitionAggregate design pattern
In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require
iteratively invoking the pattern with an aggregator making serial requests
to the workers below it to prepare a response (1 to 4 iterations are typical
though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up
to the root must be completed within the deadline
In other publications this pattern is referred to as the ScatterGather pattern
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1859
983137983143983143983154983141983143983137983156983151983154
983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154
The partitionaggregate design pattern
Request Latency deadline 250 ms
deadline 50 ms
deadline 10 ms
The total permissible latency for a request is limited and the ldquobackendrdquo part of the
application is typically allocated between 230-300 ms This limit is called the all-up SLA
Example in web search a query might be sent to many aggregators and workers each
responsible for a different part of the index Based on the replies an aggregator might
refine the query and send it out again to improve the relevance of the result Lagging
instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries
A high-level aggregator
(HLA) partitions queries to
a large number of mid-level
aggregators (MLAs) that in
turn partition each query
over the other servers in the
same rack as the MLA
Servers act as both MLAs
and workers so each server
will be acting as an
aggregator for some queries
and as a worker for other
queries
HLA
MLAMLA
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1959
aggregator
worker 1worker 2worker 3worker 4
query
response
Ack
A TCP Incast Event
Response from worker 3 is lost due to incast and is
retransmitted after a timeout
timeout
983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154
983090
983159983151983154983147983141983154
983089
983159983151983154983147983141983154
983091
983159983151983154983147983141983154
983091
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
IncastScenario
Packets from many
flows arriving to
the same port at
the same time
Incast Collapse Summary
In other publications the incast scnario
is referred to as the fan-in burst at the
parent node This incast is a key reason
for increased network delay and occurswhen all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time causing a fan-
in burst at the parent node
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in & tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (i.e., switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
Priority Inversion in D3 TCP
[Figure: bandwidth requests arriving at a switch; the switch grants requests FCFS, so a request with a far deadline can be granted while a request with a near deadline is paused.]
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
D2 TCP's Contributions
Deadline-aware and handles fan-in bursts well
Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
Reactive, decentralized
Does not hinder long-lived (non-deadline) flows
Coexists with TCP → incrementally deployable
No change to switch hardware → deployable today
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms where every query operates on data spanning thousands of servers.
[Figure: a user query enters a tree of servers (root, parents, leaves) and the OLDI response returns to the user within ~250 ms.]
Features:
bull Deadline bound
bull Handle large data
bull Partition-aggregate pattern; tree-like structure
bull Deadline budget split: total = 300 ms; parent-to-leaf RPC = 50 ms
bull Missed deadlines → incomplete responses
bull Affect user experience & revenue
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
D2 TCP
Deadline-aware and handles fan-in bursts
Key Idea: Vary sending rate based on both deadline and extent of congestion
Built on top of DCTCP
Distributed: uses per-flow state at end hosts
Reactive: senders react to congestion
No knowledge of other flows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
D2 TCP: Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
D2 TCP: Adjusting the Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
bull When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
bull When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
bull For p between 0 and 1, the window size is modulated by p.
Note: Larger p ⇒ smaller window.
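A minimal sketch of this sender-side rule, with illustrative names and an example weight g (the slides do not fix a value for g here):

# Minimal sketch of the D2 TCP sender reaction, per the formulas above.
# One update per window of data (roughly one RTT); names are illustrative.
def update_alpha(alpha, f, g=1.0 / 16):
    # EWMA of the fraction f of CE-marked packets in the last window.
    return (1 - g) * alpha + g * f

def d2tcp_window_update(cwnd, alpha, d, f):
    # Resize window W given marked fraction f and deadline imminence d.
    if f > 0:
        p = alpha ** d          # gamma-correction penalty, p <= 1
        return cwnd * (1 - p / 2)
    return cwnd + 1             # no marks: grow by one segment, like TCP

# Example: heavy congestion (alpha = 0.8), near deadline (d = 2) vs far (d = 0.5).
cwnd, alpha = 20.0, 0.8
print(d2tcp_window_update(cwnd, alpha, d=2.0, f=0.8))   # backs off less
print(d2tcp_window_update(cwnd, alpha, d=0.5, f=0.8))   # backs off more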
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
D2 TCP: Basic Formulas
After determining p, we resize the congestion window W as follows:
W = W × (1 − p/2)   if f > 0
where
p = α^d
d = deadline imminence factor, d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
p = α^d
W := W × (1 − p/2)
[Figure: plots of p = α^d for d = 1, d < 1 (far deadline), and d > 1 (near deadline).]
Key insight: Near-deadline flows back off less, while far-deadline flows back off more.
bull d < 1 → p > α for far-deadline flows; p large → shrink window
bull d > 1 → p < α for near-deadline flows; p small → retain window
bull d = 1 → p = α for long-lived flows; DCTCP behavior
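As a small worked example of the gamma correction (numbers chosen only for illustration): suppose α = 0.5. A far-deadline flow with d = 0.5 gets p = 0.5^0.5 ≈ 0.71 and shrinks its window to W × (1 − 0.71/2) ≈ 0.65 W; a near-deadline flow with d = 2 gets p = 0.5^2 = 0.25 and keeps W × (1 − 0.25/2) ≈ 0.88 W; a flow with no deadline (d = 1) gets p = α = 0.5 and shrinks to 0.75 W, exactly the DCTCP behavior.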
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
D2 TCP: Computing α
α is calculated by aggregating ECN marks (like DCTCP). Switches mark packets if queue_length > threshold K. The sender computes the fraction of marked packets, averaged over time.
[Figure: switch buffer with marking threshold K; packets are accepted without marking below K, accepted and marked between K and Buffer_limit, and discarded above Buffer_limit.]
Switch:
if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit) discard packet
Sender (update once every RTT):
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D2 TCP: Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior.
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); the window oscillates between W/2 and W, dropping to W/2 upon congestion detection (p = 1); L is the length of one sawtooth wave, and Tc > L here.]
D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
D2 TCP: Computing the deadline imminence factor d
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); the window grows from W/2 to W over L round-trip times per wave, with Tc > L.]
B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × Tc / L
Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives
B = (0.75 W) × Tc     (B in bytes)
Analysis continued on the next slide.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
D2 TCP: Computing the deadline imminence factor d
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), window oscillating between W/2 and W, with Tc > L.]
It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as
d = Tc / D
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
Tc = B / (0.75 W)     (approximation)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D2 TCP: the deadline imminence factor d
What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure, and we have
B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)
[Figure: a partial sawtooth wave (deadline-agnostic, DCTCP-like behavior) in which the window grows from W/2 for only Tc round-trip times, with Tc < L.]
Since the value of B is known by the application, the value Tc can be computed. The value d is given by
d = Tc / D
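Pulling the last few slides together, a small illustrative sketch of how a sender could compute d with the 0.75 W approximation (variable names and units are mine, not from the paper):

# Sketch: deadline imminence factor d = Tc / D, using Tc ~ B / (0.75 W).
def deadline_imminence(B_bytes, W_bytes, rtt_s, D_s):
    if D_s is None:
        return 1.0                         # no deadline: behave like DCTCP
    Tc_rtts = B_bytes / (0.75 * W_bytes)   # RTTs needed at ~0.75 W average window
    Tc_s = Tc_rtts * rtt_s
    return Tc_s / D_s                      # > 1 means a tight deadline

# Example: 1 MB left, 64 KB window, 300 microsecond RTT, 10 ms to deadline.
print(deadline_imminence(B_bytes=1e6, W_bytes=64e3, rtt_s=300e-6, D_s=10e-3))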
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D2 TCP: Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 759
Hewlett-Packard Development Company
Example HP Cloud Services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 859
Example: Amazon Data Centers
Amazon data centers serve four regions in the US and three regions in Europe and Asia. Another data center in the US was opened in July 2011 in the state of Oregon to serve the Pacific Northwest region.
In December 2011, Amazon announced it is opening a data center in Sao Paulo, Brazil, its first in South America.
In November 2012, Amazon announced it is adding a ninth region by opening a data center in Sydney, Australia.
The data centers support all Amazon Web Services (AWS), including Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 959
Example: Amazon Data Centers
The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users with the ability to execute their applications in Amazon's computing environment.
To use Amazon EC2:
Create an Amazon Machine Image (AMI) containing all the software, including the operating system.
Upload this AMI to Amazon S3 (Amazon Simple Storage Service).
Register to get an AMI ID.
Use this AMI ID and the Amazon EC2 web service APIs to run, monitor, and terminate as many instances of this AMI as required.
EC2 Pricing Policy: pay as you go, no minimal fee. The prices are based on the Region in which the application instance is running.
http://aws.amazon.com/ec2/pricing
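As a present-day illustration of the run/monitor/terminate workflow above (the boto3 SDK postdates these slides; the AMI ID, region, and instance type below are placeholders, not real values):

# Hypothetical sketch: launch, check, and terminate one EC2 instance with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Run one instance of a previously registered AMI.
resp = ec2.run_instances(ImageId="ami-12345678", MinCount=1, MaxCount=1,
                         InstanceType="t2.micro")
instance_id = resp["Instances"][0]["InstanceId"]

# Monitor the instance's state.
state = ec2.describe_instances(InstanceIds=[instance_id])
print(state["Reservations"][0]["Instances"][0]["State"]["Name"])

# Terminate when done (pricing is per region and per running time).
ec2.terminate_instances(InstanceIds=[instance_id])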
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1059
Data Center Services
Example: Colocation Services of Cogent
http://www.cogentco.com/en
Cogent is a multinational Tier 1 Internet Service Provider.
Companies can colocate their business-critical equipment in one of 43 of Cogent's secure, state-of-the-art data centers that connect directly to a Tier-1 IP network. The data centers have extensive power backup systems and complete fire detection and suppression plans to ensure the safety and security of equipment.
Cogent Data Center Features:
http://www.cogentco.com/en/products-and-services/colocation-services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1159
Colocation Data Centers and Cloud Servers
http://www.datacentermap.com/datacenters.html
http://www.datacentermap.com/cloud.html
Example: Atlantic.Net, Orlando Data Center
http://www.atlantic.net/orlando-colocation-florida.html
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1259
Data Center TCP (DCTCP)
M. Alizadeh, A. Greenberg, D. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan
Microsoft Research & Stanford University
ACM SIGCOMM September 2010
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1559
Rack Servers with Commodity Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1659
Performance Impairments of Shallow-buffered Switches
1. TCP Incast Collapse
Many applications generate barrier-synchronized requests in which the client cannot make forward progress until the responses from every server for the current request have been received. An example of these applications is a web search query (e.g., a Google search) sent to a large number of nodes, with results returned to the parent node to be sorted.
Barrier-synchronized requests can result in packets overfilling the shallow buffers on the client's port on the switch. In other words, these requests create many flows that converge on the same interface of a switch over a short period of time. The response packets create a long queue and may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses and throughput collapse.
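A back-of-the-envelope illustration of the arithmetic behind this collapse (all numbers invented): when many synchronized responses converge on one shallow-buffered port, everything beyond the buffer plus what can drain during the burst is dropped.

# Toy model: N synchronized responses converge on one switch port.
def incast_drops(n_flows, pkts_per_flow, buffer_pkts, drained_during_burst):
    arriving = n_flows * pkts_per_flow
    return max(0, arriving - (buffer_pkts + drained_during_burst))

# 40 workers answering at once, 3 packets each, into a 100-packet buffer:
print(incast_drops(40, 3, buffer_pkts=100, drained_during_burst=10))  # 10 drops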
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1759
1. TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the Partition/Aggregate workflow pattern, which is the foundation of many large-scale web applications.
Requests from higher layers of the application are broken into pieces and farmed out to workers in lower layers. The responses of these workers are aggregated to produce a result. Web searches, social network content composition, and advertisement selection are based around the Partition/Aggregate design pattern.
In a multi-layer partition/aggregate pattern, workflow lags at one layer delay the initiation of others. Further, answering a request may require iteratively invoking the pattern, with an aggregator making serial requests to the workers below it to prepare a response (1 to 4 iterations are typical, though as many as 20 may occur).
The propagation of the request down to the leaves and of the responses back up to the root must be completed within the deadline.
In other publications this pattern is referred to as the Scatter/Gather pattern.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1859
The partition/aggregate design pattern
[Figure: a request fans out from an aggregator to lower-level aggregators and then to workers; typical latency deadlines are 250 ms for the overall request, 50 ms at the aggregator level, and 10 ms at the worker level.]
The total permissible latency for a request is limited, and the "backend" part of the application is typically allocated between 230-300 ms. This limit is called the all-up SLA.
Example: in web search, a query might be sent to many aggregators and workers, each responsible for a different part of the index. Based on the replies, an aggregator might refine the query and send it out again to improve the relevance of the result. Lagging instances of partition/aggregate can thus add up to threaten the all-up SLAs for queries.
A high-level aggregator (HLA) partitions queries to a large number of mid-level aggregators (MLAs) that in turn partition each query over the other servers in the same rack as the MLA. Servers act as both MLAs and workers, so each server will be acting as an aggregator for some queries and as a worker for other queries.
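A minimal sketch of the aggregator side of this pattern (illustrative only, not from the paper): scatter the query to workers, then return whatever partial results arrive before the deadline budget expires, trading answer quality for responsiveness.

# Sketch of a deadline-bound scatter/gather aggregator.
from concurrent.futures import ThreadPoolExecutor, wait

def query_worker(worker_id, query):
    # Placeholder for the RPC to one worker over the network.
    return "partial result from worker %d" % worker_id

def aggregate(query, worker_ids, deadline_s=0.05):      # e.g., a 50 ms budget
    pool = ThreadPoolExecutor(max_workers=len(worker_ids))
    futures = [pool.submit(query_worker, w, query) for w in worker_ids]
    done, _not_done = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)                           # do not block on stragglers
    return [f.result() for f in done]                   # late answers are dropped

print(aggregate("example query", worker_ids=range(4)))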
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1959
A TCP Incast Event
[Figure: timeline of an aggregator sending a query to workers 1-4; each worker returns its response and the aggregator acknowledges. The response from worker 3 is lost due to incast and is retransmitted only after a timeout.]
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
Incast Collapse Summary
Incast scenario: packets from many flows arriving to the same port at the same time.
In other publications the incast scenario is referred to as the fan-in burst at the parent node. This incast is a key reason for increased network delay and occurs when all the children (e.g., workers at the leaf level) of a parent node face the same deadline and are likely to respond nearly at the same time, causing a fan-in burst at the parent node.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance Impairments of Shallow-buffered Switches
2. Queue Buildup
When long and short flows traverse the same queue, there is a queue buildup impairment: the short flows experience increased latency as they are in queue behind packets from the large flows. Since every worker in the cluster handles both query traffic and background traffic (large flows needed to update the data structures on the workers), this traffic pattern occurs very frequently.
This indicates that query flows can experience queuing delays because of long-lived, greedy TCP flows. Further, answering a request can require multiple iterations, which magnifies the impact of this delay.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance Impairments of Shallow-buffered Switches
3. Buffer Pressure
Given the mix of long and short flows in a data center, it is very common for short flows on one port to be impacted by activity on other ports. The loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports.
The long, greedy TCP flows build up queues on their interfaces. Since the switch is shallow-buffered and the buffer space is a shared resource, the queue buildup reduces the amount of buffer space available to absorb bursts of traffic from the Partition/Aggregate traffic. This impairment is called buffer pressure. The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Flow Interactions in Shallow-buffered Switches
Incast scenario: multiple short flows on the same port.
Queue buildup: short and long flows on the same port.
Buffer pressure: short flows on one port and long flows on another port.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
Legacy TCP Congestion Control
[Figure: congestion window vs. round-trip times, showing slow start, congestion avoidance, a time-out, and a fast retransmit; ss_thresh = 16 initially, cwnd reaches 20, segment losses occur, and ss_thresh drops to 10.]
Fast Retransmission: ssthresh = cwnd/2 = cwnd × (1 − 0.5); cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly different from wide area networks:
o round trip times (RTTs) can be less than 250 microseconds in absence of queuing
o Applications need extremely high bandwidths and very low latencies
o little statistical multiplexing: a single flow can dominate a particular path
o The network is largely homogeneous and under a single administrative control
o Traffic flowing in switches is mostly internal. Connectivity to the external Internet is typically managed through load balancers and application proxies that effectively separate internal traffic from external traffic.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long flows. The measurements by the authors reveal that 99.91% of traffic in the data center is TCP traffic. The traffic consists of query traffic (2KB to 20KB in size), delay-sensitive short messages (100KB to 1MB), and throughput-sensitive long flows (1MB to 100MB). These applications require three things from the data center network:
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow-buffered commodity switches, legacy TCP protocols fall short of satisfying the above requirements.
See paper for details of workload characterization in cloud data centers.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput with commodity, shallow-buffered switches. DCTCP uses the concept of ECN (Explicit Congestion Notification).
DCTCP achieves these goals primarily by reacting to congestion in proportion to the extent of congestion.
DCTCP uses a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold.
The DCTCP source reacts by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor. This is different from standard TCP, which cuts its window size by a factor of 2 when it receives ECN notification.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP - Simple Marking at the Switch
DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.
An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival.
Marking is based on the instantaneous value of the queue, not the average value as in RED routers.
The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.
The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.
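The marking rule can be summarized in a few lines (a sketch mirroring the switch pseudocode shown later in these slides; K and the buffer limit are expressed in packets here only for illustration):

# Sketch of DCTCP's single-threshold marking on the instantaneous queue length.
def on_packet_arrival(queue_len, K, buffer_limit):
    if queue_len <= K:
        return "forward"
    if queue_len <= buffer_limit:
        return "forward and mark CE"     # set the CE codepoint, keep the packet
    return "drop"

print(on_packet_arrival(12, K=20, buffer_limit=100))   # forward
print(on_packet_arrival(35, K=20, buffer_limit=100))   # forward and mark CE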
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP - ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.
For senders that use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See paper for details of the delayed ACK scheme.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP - Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:
α = (1 - g) × α + g × F
where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K, and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.
Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
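A small sketch contrasting this reaction with legacy TCP's (the value of g is only an example; names are illustrative):

# Sketch of the DCTCP sender reaction vs. legacy TCP, per the slides.
def dctcp_update_alpha(alpha, F, g=1.0 / 16):
    # EWMA over the fraction F of CE-marked packets in the last window.
    return (1 - g) * alpha + g * F

def dctcp_on_mark(cwnd, alpha):
    return cwnd * (1 - alpha / 2)      # cut in proportion to congestion level

def legacy_tcp_on_mark(cwnd):
    return cwnd / 2                    # always halve on an ECN signal

alpha = dctcp_update_alpha(0.0, F=0.2)   # mild congestion
print(dctcp_on_mark(20.0, alpha))        # ~19.9 segments: gentle backoff
print(legacy_tcp_on_mark(20.0))          # 10 segments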
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
[Figure: RED queue regions: accept below THmin, discard or mark with increasing probability between THmin and THmax, discard above THmax, up to capacity C.]
RED Router:
Update the value of the average queue size:
avg = (1 - wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa
    with probability Pa, discard or mark packet
    otherwise, with probability 1 - Pa, accept packet
else if (avg > THmax) discard packet
DCTCP Switch:
[Figure: DCTCP queue regions: accept without marking below K, accept with marking between K and the buffer limit.]
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet
DCTCP Sender:
Update α = (1 - g) × α + g × F
Reaction to marked ACK in a new window:
ssthresh = cwnd × (1 - α/2); cwnd = ssthresh
Legacy TCP Sender:
Reaction to marked ACK in a new window:
ssthresh = cwnd/2; cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.
However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and resulting timeouts.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on:
Guidelines for choosing parameters and estimating gain
Analytical model for the steady-state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (w/ SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 759
Hewlett-Packard Development Company
Example HP Cloud Services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 859
Example Amazon Data CentersAmazon data centers serve four regions in
the US and three regions in Europe and
Asia Another data center in the US was
opened July 2011 in the state of Oregon to
serve the Pacific Northwest region
In December 2011 Amazon announced it is
opening a data center in Sao Paulo Brazil
its first in South America
In November 2012 Amazon announced it
is adding a ninth region by opening a data
center in Sydney Australia
The data centers support all Amazon Web
Services (AWS) including Amazon Elastic
Compute Cloud (EC2) and Amazon
Simple Storage Service (S3)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 959
Example Amazon Data Centers
The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users
with the ability to execute their applications in Amazons computing environment
To use Amazon EC2 Create an Amazon Machine Image (AMI) containing all the software including
the operating system
Upload this AMI to the Amazon S3 (Amazon Simple Storage Service)
Register to get an AMI ID Use this AMI ID and the Amazon EC2 web service APIs to run monitor and
terminate as many instances of this AMI as required
EC2 Pricing Policy pay as you go no minimal fee The prices are based on theRegion in which the application instance is running
httpawsamazoncomec2pricing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1059
Data Center Services
Exampe Colocation Services of Cogent
httpwwwcogentcocomen
Cogent is a multinational Tier 1 Internet Service Provider
Companies can colocate their business critical equipment in one of
43 Cogents secure state-of-the-art data centers that connect directly
to a Tier-1 IP network The data centers have extensive powerbackup systems complete fire detection and suppression plans to
ensure the safety and security of equipment
Cogent Data Center Features
httpwwwcogentcocomenproducts-and-servicescolocation-
services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1159
Colocation Data Centers and Cloud Servers
httpwwwdatacentermapcomdatacentershtml
httpwwwdatacentermapcomcloudhtml
Example AtlanticNet
httpwwwatlanticnetorlando-colocation-floridahtml
Orlando Data Center
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1259
Data Center TCP (DCTCP)
M Alizadehzy A Greenbergy D Maltzy J Padhyey P
Pately B Prabhakarz S Senguptay M Sridharan
983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161
ACM SIGCOMM September 2010
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1359
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1459
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1559
Rack Servers with Commodity Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1659
Performance impairments of Shallow-buffered
Switches1 TCP Incast Collapse
Many applications generate barrier-synchronized requests in which the
client cannot make forward progress until the responses from every
server for the current request have been received An Example of these
applications is a web search query (eg a Google search) sent to a large
number of nodes with results returned to the parent node to be sorted
Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these
requests create many flows that converge on the same interface of a
switch over a short period of time The response packets create a long
queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and
throughput collapse
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1759
1 TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the PartitionAggregate workflow
pattern which is the foundation of many large scale web applications
Requests from higher layers of the application are broken into pieces and
farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content
composition and advertisement selection are based around the
PartitionAggregate design pattern
In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require
iteratively invoking the pattern with an aggregator making serial requests
to the workers below it to prepare a response (1 to 4 iterations are typical
though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up
to the root must be completed within the deadline
In other publications this pattern is referred to as the ScatterGather pattern
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1859
983137983143983143983154983141983143983137983156983151983154
983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154
The partitionaggregate design pattern
Request Latency deadline 250 ms
deadline 50 ms
deadline 10 ms
The total permissible latency for a request is limited and the ldquobackendrdquo part of the
application is typically allocated between 230-300 ms This limit is called the all-up SLA
Example in web search a query might be sent to many aggregators and workers each
responsible for a different part of the index Based on the replies an aggregator might
refine the query and send it out again to improve the relevance of the result Lagging
instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries
A high-level aggregator
(HLA) partitions queries to
a large number of mid-level
aggregators (MLAs) that in
turn partition each query
over the other servers in the
same rack as the MLA
Servers act as both MLAs
and workers so each server
will be acting as an
aggregator for some queries
and as a worker for other
queries
HLA
MLAMLA
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1959
aggregator
worker 1worker 2worker 3worker 4
query
response
Ack
A TCP Incast Event
Response from worker 3 is lost due to incast and is
retransmitted after a timeout
timeout
983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154
983090
983159983151983154983147983141983154
983089
983159983151983154983147983141983154
983091
983159983151983154983147983141983154
983091
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
IncastScenario
Packets from many
flows arriving to
the same port at
the same time
Incast Collapse Summary
In other publications the incast scnario
is referred to as the fan-in burst at the
parent node This incast is a key reason
for increased network delay and occurswhen all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time causing a fan-
in burst at the parent node
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long flows. The measurements by the authors reveal that 99.91% of traffic in the data center is TCP traffic. The traffic consists of query traffic (2KB to 20KB in size), delay-sensitive short messages (100KB to 1MB), and throughput-sensitive long flows (1MB to 100MB). These applications require three things from the data center network:
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow-buffered commodity switches, legacy TCP protocols fall short of satisfying the above requirements.
See paper for details of workload characterization in cloud data centers.
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput with commodity shallow-buffered switches.
DCTCP uses the concept of ECN (Explicit Congestion Notification).
DCTCP achieves these goals primarily by reacting to congestion in proportion to the extent of congestion.
DCTCP uses a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold.
The DCTCP source reacts by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor. This is different from standard TCP, which cuts its window size by a factor of 2 when it receives ECN notification.
DCTCP - Simple Marking at the Switch
DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.
An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival.
Marking is based on the instantaneous value of the queue, not the average value as in RED routers.
The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.
The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.
DCTCP - ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.
For senders that use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See paper for details of the delayed ACK scheme.
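As an illustration, here is a minimal sketch (not from the paper) of such a two-state receiver; send_ack(seq, ece) is a hypothetical primitive that emits a cumulative ACK, and m is the delayed-ACK factor:

    # Sketch of the DCTCP receiver's two-state ECN-Echo logic.
    class DctcpReceiver:
        def __init__(self, m=2):
            self.m = m              # delayed-ACK factor
            self.ce_state = 0       # CE codepoint of the most recent packets (0 or 1)
            self.pending = 0        # packets received but not yet ACKed

        def on_packet(self, seq, ce_marked):
            ce = 1 if ce_marked else 0
            if ce != self.ce_state:
                # CE state changed: immediately ACK what is pending with the old
                # state's ECN-Echo value, so the sequence of marks is conveyed exactly.
                if self.pending > 0:
                    send_ack(seq - 1, ece=self.ce_state)
                    self.pending = 0
                self.ce_state = ce
            self.pending += 1
            if self.pending >= self.m:
                send_ack(seq, ece=self.ce_state)   # delayed ACK carries the current CE state
                self.pending = 0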
DCTCP - Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:
α = (1 − g) × α + g × F
where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.
Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
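For concreteness, a minimal sender-side sketch that applies the two formulas above (one α update and one window cut per window of data; g = 1/16 is just an illustrative choice) could look like this:

    # Sketch of DCTCP sender-side control; cwnd and ssthresh are in segments.
    class DctcpSender:
        def __init__(self, g=1.0 / 16):
            self.g = g
            self.alpha = 0.0
            self.cwnd = 10.0
            self.ssthresh = 64.0

        def on_window_of_acks(self, acked_pkts, marked_pkts):
            F = marked_pkts / float(acked_pkts)              # fraction marked this window
            self.alpha = (1 - self.g) * self.alpha + self.g * F
            if marked_pkts > 0:
                # Cut in proportion to the extent of congestion,
                # instead of the legacy-TCP cut ssthresh = cwnd / 2.
                self.ssthresh = self.cwnd * (1 - self.alpha / 2)
                self.cwnd = self.ssthresh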
RED Router
[Figure: drop/mark probability versus average queue size — packets are accepted below THmin, discarded or marked with increasing probability between THmin and THmax, and discarded above THmax (buffer capacity C).]
RED Router:
  Update the value of the average queue size: avg = (1 − wq) × avg + wq × q
  if (avg < THmin) accept packet
  else if (THmin ≤ avg ≤ THmax)
      calculate probability Pa
      with probability Pa, discard or mark packet
      otherwise (with probability 1 − Pa), accept packet
  else if (avg > THmax) discard packet

DCTCP Switch
[Figure: packets are accepted without marking below queue occupancy K, accepted with marking between K and the buffer limit, and discarded above the limit.]
DCTCP Switch:
  if (q ≤ K) accept packet
  else if (K < q ≤ limit) accept and mark packet
  else if (q > limit) discard packet

DCTCP Sender (reaction to marked ACKs in a new window):
  Update α = (1 − g) × α + g × F
  ssthresh = cwnd × (1 − α/2); cwnd = ssthresh

Legacy TCP Sender (reaction to marked ACKs in a new window):
  ssthresh = cwnd/2; cwnd = ssthresh
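The same contrast on the switch side, written as a small sketch (function and variable names are illustrative; the RED probability Pa is left abstract as in the pseudocode above):

    # RED marks/drops based on an EWMA of the queue size; DCTCP marks on the
    # instantaneous queue size crossing a single threshold K.
    def red_on_arrival(q, state, wq, th_min, th_max):
        state["avg"] = (1 - wq) * state["avg"] + wq * q
        if state["avg"] < th_min:
            return "accept"
        if state["avg"] <= th_max:
            return "mark_or_drop_with_probability_Pa"   # Pa grows with the average
        return "drop"

    def dctcp_on_arrival(q, K, limit):
        if q <= K:
            return "accept"
        if q <= limit:
            return "accept_and_mark_CE"                 # instantaneous, no averaging
        return "drop"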
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.
However, in practice, each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.
DCTCP Performance
The paper has more details on:
• Guidelines for choosing parameters and estimating gain
• Analytical model for the steady-state behavior of DCTCP
• Benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP
• Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation
D3 TCP
Better Never Than Late: Meeting Deadlines in Datacenter Networks
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
ACM SIGCOMM, August 2011
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP: Basic Idea of Deadline Awareness
[Figure: two flows (f1, f2) with different deadlines (d1, d2) under DCTCP versus D3 TCP; the thickness of a flow line represents the rate allocated to it.]
DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines.
D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50KB) and RTTs are minimal (~300 μs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic into the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges.
Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination.
Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
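To make the greedy idea concrete, a simplified illustration (not the actual D3 TCP protocol; the helper name and units are assumptions) of the kind of rate an end host might request is:

    # Illustrative only: ask for roughly the rate needed to finish the remaining
    # bytes of the flow by its deadline.
    def requested_rate(bytes_remaining, seconds_to_deadline):
        if seconds_to_deadline <= 0:
            return float("inf")          # deadline already passed; send as fast as possible
        return bytes_remaining / seconds_to_deadline   # bytes per second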
D2 TCP
Deadline-Aware Datacenter TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
• does not handle fan-in bursts well
• introduces priority inversion at fan-in bursts (see next slide)
• does not co-exist with TCP
• requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP
[Figure: bandwidth requests arriving at a switch, which grants requests FCFS; a request with a far deadline arriving slightly earlier is granted while a request with a near deadline is paused.]
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
D2 TCP's Contributions
• Deadline-aware and handles fan-in bursts well
• Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
• Reactive and decentralized
• Does not hinder long-lived (non-deadline) flows
• Coexists with TCP → incrementally deployable
• No change to switch hardware → deployable today
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing, high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern; tree-like structure
• Deadline budget split: total = 300 ms, parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
[Figure: a root node fanning out to parent nodes, which fan out to leaf nodes; a user query enters at the root and the OLDI response is returned within ~250 ms.]
D2 TCP
• Deadline-aware and handles fan-in bursts
• Key idea: vary the sending rate based on both the deadline and the extent of congestion
• Built on top of DCTCP
• Distributed: uses per-flow state at end hosts
• Reactive: senders react to congestion
• No knowledge of other flows
D2 TCP: Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average that quantitatively measures the extent of congestion:
α = (1 − g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction.
D2 TCP: Adjusting the Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of heavy congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
D2 TCP: Basic Formulas
After determining p, we resize the congestion window W as follows:
W = W × (1 − p/2)   if f > 0
p = α^d   (Gamma Correction Function)
where d is the deadline imminence factor:
d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).
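Putting these formulas together, a minimal sketch of the D2 TCP window adjustment (a direct transcription of the equations above; variable names are illustrative) is:

    # Deadline-aware window adjustment built on the DCTCP congestion estimate alpha.
    def d2tcp_adjust_window(cwnd, alpha, f, Tc, D):
        d = 1.0 if D is None else Tc / D    # no deadline: d = 1 (DCTCP behavior)
        p = alpha ** d                      # gamma-correction penalty
        if f > 0:                           # some packets were CE-marked
            # d > 1 (near deadline): p is small, mild backoff;
            # d < 1 (far deadline): p is large, strong backoff.
            return cwnd * (1 - p / 2)
        return cwnd + 1                     # no marks: additive increase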
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
[Figure: the penalty p = α^d plotted against α, for d = 1, d < 1 (far deadline), and d > 1 (near deadline).]
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
W := W × (1 − p/2), with p = α^d
• d < 1 → p > α for far-deadline flows; p is large → shrink window
• d > 1 → p < α for near-deadline flows; p is small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
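A quick numeric illustration (not from the paper, just plugging values into p = α^d): with α = 0.5, a far-deadline flow with d = 0.5 gets p = 0.5^0.5 ≈ 0.71 and cuts its window to about 0.65 W; a near-deadline flow with d = 2 gets p = 0.25 and only cuts to about 0.88 W; and a flow with d = 1 gets p = 0.5 and cuts to 0.75 W, exactly the DCTCP behavior.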
D2 TCP: Computing α
α is calculated by aggregating ECN marks (like DCTCP): switches mark packets if the queue length exceeds the threshold, and the sender computes the fraction of marked packets, averaged over time.
Switch:
  if (q ≤ K) accept packet without marking
  else if (K < q ≤ Buffer_limit) accept and mark packet
  else if (q > Buffer_limit) discard packet
[Figure: switch buffer with an "accept without marking" region below K and an "accept with marking" region between K and Buffer_limit.]
Sender (update once every RTT):
  α = (1 − g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data.
D2 TCP: Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection (p = 1), after which the window grows back.
[Figure: sawtooth of the window between W/2 and W over time, with period L and total transmission time Tc > L, shown against the deadline D.]
D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d (continued)
Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message. For Tc > L:
B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)
[Figure: sawtooth of the window between W/2 and W with period L over a total time Tc > L (time in RTTs), shown against the deadline D.]
Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives:
B = (0.75) W × Tc   (in bytes)
Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d (continued)
[Figure: sawtooth of the window between W/2 and W with period L over a total time Tc > L (time in RTTs), shown against the deadline D.]
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as:
d = Tc / D
using the approximation B ≈ (0.75) W × Tc to obtain Tc.
D2 TCP: the deadline imminence factor d — what if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure, and we have:
B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)
[Figure: partial sawtooth of the window between W/2 and W for Tc < L, under the deadline-agnostic (DCTCP-like) behavior.]
Since the value of B is known by the application, the value Tc can be computed. The value d is then given by:
d = Tc / D
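A small sketch of how a sender could compute Tc and d (using the 0.75 W average-window approximation above rather than the exact sums; B and W are in bytes, Tc and D in RTTs; names are illustrative):

    # Deadline imminence factor from bytes remaining B, window W, and deadline D.
    def completion_time_rtts(B, W):
        return B / (0.75 * W)        # B ≈ 0.75 * W * Tc  =>  Tc ≈ B / (0.75 * W)

    def imminence_factor(B, W, D):
        if D is None:                # no deadline: behave like DCTCP
            return 1.0
        return completion_time_rtts(B, W) / D    # d = Tc / D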
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
Example: Amazon Data Centers
Amazon data centers serve four regions in the US and three regions in Europe and Asia. Another data center in the US was opened in July 2011 in the state of Oregon to serve the Pacific Northwest region.
In December 2011, Amazon announced it is opening a data center in Sao Paulo, Brazil, its first in South America.
In November 2012, Amazon announced it is adding a ninth region by opening a data center in Sydney, Australia.
The data centers support all Amazon Web Services (AWS), including Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
Example: Amazon Data Centers
The Amazon Elastic Compute Cloud (Amazon EC2) web service provides users with the ability to execute their applications in Amazon's computing environment.
To use Amazon EC2:
• Create an Amazon Machine Image (AMI) containing all the software, including the operating system.
• Upload this AMI to Amazon S3 (Amazon Simple Storage Service).
• Register to get an AMI ID.
• Use this AMI ID and the Amazon EC2 web service APIs to run, monitor, and terminate as many instances of this AMI as required.
EC2 Pricing Policy: pay as you go, with no minimum fee. The prices are based on the Region in which the application instance is running.
http://aws.amazon.com/ec2/pricing
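As an illustration only (not part of the original slides), the run/monitor/terminate workflow above can be driven programmatically, for example with the boto3 Python SDK; the AMI ID, region, and instance type below are placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")     # prices depend on the Region

    resp = ec2.run_instances(ImageId="ami-00000000",       # placeholder AMI ID
                             InstanceType="t2.micro",
                             MinCount=1, MaxCount=1)
    instance_id = resp["Instances"][0]["InstanceId"]

    ec2.describe_instances(InstanceIds=[instance_id])      # monitor
    ec2.terminate_instances(InstanceIds=[instance_id])     # terminate when done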
Data Center Services
Example: Colocation Services of Cogent
http://www.cogentco.com/en
Cogent is a multinational Tier 1 Internet Service Provider.
Companies can colocate their business-critical equipment in one of 43 of Cogent's secure, state-of-the-art data centers that connect directly to a Tier-1 IP network. The data centers have extensive power backup systems and complete fire detection and suppression plans to ensure the safety and security of equipment.
Cogent Data Center Features:
http://www.cogentco.com/en/products-and-services/colocation-services
Colocation Data Centers and Cloud Servers
http://www.datacentermap.com/datacenters.html
http://www.datacentermap.com/cloud.html
Example: AtlanticNet Orlando Data Center
http://www.atlantic.net/orlando-colocation-florida.html
Data Center TCP (DCTCP)
M. Alizadeh, A. Greenberg, D. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan
Microsoft Research & Stanford University
ACM SIGCOMM, September 2010
Rack Servers with Commodity Switches
Performance impairments of Shallow-buffered Switches
1. TCP Incast Collapse
Many applications generate barrier-synchronized requests, in which the client cannot make forward progress until the responses from every server for the current request have been received. An example of these applications is a web search query (e.g., a Google search) sent to a large number of nodes, with results returned to the parent node to be sorted.
Barrier-synchronized requests can result in packets overfilling the shallow buffers on the client's port on the switch. In other words, these requests create many flows that converge on the same interface of a switch over a short period of time. The response packets create a long queue and may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses and throughput collapse.
1. TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the Partition/Aggregate workflow pattern, which is the foundation of many large-scale web applications. Requests from higher layers of the application are broken into pieces and farmed out to workers in lower layers. The responses of these workers are aggregated to produce a result. Web searches, social network content composition, and advertisement selection are based around the Partition/Aggregate design pattern.
In a multi-layer partition/aggregate pattern workflow, lags at one layer delay the initiation of others. Further, answering a request may require iteratively invoking the pattern, with an aggregator making serial requests to the workers below it to prepare a response (1 to 4 iterations are typical, though as many as 20 may occur).
The propagation of the request down to the leaves, and of the responses back up to the root, must be completed within the deadline.
In other publications this pattern is referred to as the Scatter/Gather pattern.
The Partition/Aggregate design pattern
[Figure: an aggregator at the top fans out to lower-level aggregators, which fan out to workers; request latency deadlines of 250 ms, 50 ms, and 10 ms apply at successive levels.]
The total permissible latency for a request is limited, and the "backend" part of the application is typically allocated between 230-300 ms. This limit is called the all-up SLA.
Example: in web search, a query might be sent to many aggregators and workers, each responsible for a different part of the index. Based on the replies, an aggregator might refine the query and send it out again to improve the relevance of the result. Lagging instances of partition/aggregate can thus add up to threaten the all-up SLAs for queries.
A high-level aggregator (HLA) partitions queries to a large number of mid-level aggregators (MLAs), which in turn partition each query over the other servers in the same rack as the MLA. Servers act as both MLAs and workers, so each server will be acting as an aggregator for some queries and as a worker for other queries.
A TCP Incast Event
[Figure: an aggregator sends a query to workers 1-4; each worker returns a response, which the aggregator ACKs. The response from worker 3 is lost due to incast and is retransmitted only after a timeout.]
Incast Collapse Summary
[Figure: incast scenario — packets from many flows arriving at the same port at the same time.]
In other publications the incast scenario is referred to as the fan-in burst at the parent node. This incast is a key reason for increased network delay and occurs when all the children (e.g., workers at the leaf level) of a parent node face the same deadline and are likely to respond nearly at the same time, causing a fan-in burst at the parent node.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1059
Data Center Services
Exampe Colocation Services of Cogent
httpwwwcogentcocomen
Cogent is a multinational Tier 1 Internet Service Provider
Companies can colocate their business critical equipment in one of
43 Cogents secure state-of-the-art data centers that connect directly
to a Tier-1 IP network The data centers have extensive powerbackup systems complete fire detection and suppression plans to
ensure the safety and security of equipment
Cogent Data Center Features
httpwwwcogentcocomenproducts-and-servicescolocation-
services
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1159
Colocation Data Centers and Cloud Servers
httpwwwdatacentermapcomdatacentershtml
httpwwwdatacentermapcomcloudhtml
Example AtlanticNet
httpwwwatlanticnetorlando-colocation-floridahtml
Orlando Data Center
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1259
Data Center TCP (DCTCP)
M Alizadehzy A Greenbergy D Maltzy J Padhyey P
Pately B Prabhakarz S Senguptay M Sridharan
983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161
ACM SIGCOMM September 2010
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1359
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1459
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1559
Rack Servers with Commodity Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1659
Performance impairments of Shallow-buffered
Switches1 TCP Incast Collapse
Many applications generate barrier-synchronized requests in which the
client cannot make forward progress until the responses from every
server for the current request have been received An Example of these
applications is a web search query (eg a Google search) sent to a large
number of nodes with results returned to the parent node to be sorted
Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these
requests create many flows that converge on the same interface of a
switch over a short period of time The response packets create a long
queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and
throughput collapse
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1759
1 TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the PartitionAggregate workflow
pattern which is the foundation of many large scale web applications
Requests from higher layers of the application are broken into pieces and
farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content
composition and advertisement selection are based around the
PartitionAggregate design pattern
In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require
iteratively invoking the pattern with an aggregator making serial requests
to the workers below it to prepare a response (1 to 4 iterations are typical
though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up
to the root must be completed within the deadline
In other publications this pattern is referred to as the ScatterGather pattern
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1859
983137983143983143983154983141983143983137983156983151983154
983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154
The partitionaggregate design pattern
Request Latency deadline 250 ms
deadline 50 ms
deadline 10 ms
The total permissible latency for a request is limited and the ldquobackendrdquo part of the
application is typically allocated between 230-300 ms This limit is called the all-up SLA
Example in web search a query might be sent to many aggregators and workers each
responsible for a different part of the index Based on the replies an aggregator might
refine the query and send it out again to improve the relevance of the result Lagging
instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries
A high-level aggregator
(HLA) partitions queries to
a large number of mid-level
aggregators (MLAs) that in
turn partition each query
over the other servers in the
same rack as the MLA
Servers act as both MLAs
and workers so each server
will be acting as an
aggregator for some queries
and as a worker for other
queries
HLA
MLAMLA
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1959
aggregator
worker 1worker 2worker 3worker 4
query
response
Ack
A TCP Incast Event
Response from worker 3 is lost due to incast and is
retransmitted after a timeout
timeout
983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154
983090
983159983151983154983147983141983154
983089
983159983151983154983147983141983154
983091
983159983151983154983147983141983154
983091
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
IncastScenario
Packets from many
flows arriving to
the same port at
the same time
Incast Collapse Summary
In other publications the incast scnario
is referred to as the fan-in burst at the
parent node This incast is a key reason
for increased network delay and occurswhen all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time causing a fan-
in burst at the parent node
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
D2 TCP: Adjusting the Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, so the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
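A corresponding sketch of the per-window reaction (window counted in segments; again our own illustration rather than the authors' code):

def adjust_window(W, p, f):
    # f > 0: at least one packet was CE-marked in the last window of data
    if f > 0:
        return W * (1 - p / 2.0)   # deadline-aware multiplicative decrease
    return W + 1                   # no marks: additive increase, as in TCP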
D2 TCP Basic Formulas
After determining p, we resize the congestion window W as follows:
W = W × (1 − p/2)   if f > 0,   where p = α^d
d = deadline imminence factor: d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
p = α^d,   W := W × (1 − p/2)
[Figure: p = α^d plotted against α (both ranging up to 1.0), with curves for far-deadline (d < 1), d = 1, and near-deadline (d > 1) flows]
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
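For example, with α = 0.5 (half of the last window's packets marked) and illustrative deadline factors: d = 0.5 (far deadline) gives p ≈ 0.71 and the window is cut to about 0.65 W; d = 1 (no deadline) gives p = 0.5 and a cut to 0.75 W; d = 2 (near deadline) gives p = 0.25 and a cut to only 0.875 W.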
D2 TCP: Computing α
α is calculated by aggregating ECN marks (as in DCTCP): switches mark packets if the queue length exceeds the threshold, and the sender computes the fraction of marked packets, averaged over time.
Switch (packets are accepted without marking while q ≤ K, and accepted with marking between K and Buffer_limit):
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet
Sender (update once every RTT):
α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data
D2 TCP: Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection (i.e., p = 1), with the window then growing back over L round trips.
[Figure: sawtooth waves oscillating between W/2 and W over time, with Tc > L and the deadline D marked]
D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior; we want Tc ≤ D
Analysis continued on the next slide.
For Tc ≥ L, summing the bytes sent over the sawtooth waves gives
B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), window oscillating between W/2 and W, with Tc > L and deadline D; time in RTTs]
Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives
Tc = B / (0.75 W),   with W expressed in bytes.
Analysis continued on the next slide.
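One way to connect the two expressions (counting B and W in segments): the bracketed series sums to L·W/2 + L(L − 1)/2, so Tc = B·L / [L·W/2 + L(L − 1)/2] = B / [W/2 + (L − 1)/2]; substituting L − 1 = W/2 gives Tc = B / (0.75 W), which is exactly the approximation above.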
D2 TCP: Computing the deadline imminence factor d (continued)
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), window between W/2 and W, with Tc > L and deadline D; time in RTTs]
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate. It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as
d = Tc / D,   with Tc = B / (0.75 W) (approximation).
D2 TCP: the deadline imminence factor d when Tc < L
What if Tc < L? In this case the flow finishes within a single, partial sawtooth wave, as shown in the figure, and we have
B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)
[Figure: a single partial sawtooth wave for deadline-agnostic behavior (DCTCP), window growing from W/2, with Tc < L]
Since the value of B is known by the application, the value Tc can be computed. The value d is then given by
d = Tc / D
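Putting the two cases together, a small Python sketch (our own illustration; B and W are in segments and D is in RTTs, so units must be converted consistently in practice):

def completion_time_rtts(B, W):
    # Tc under the pessimistic sawtooth: whole waves of L RTTs, or a partial wave.
    L = W // 2 + 1                               # window grows by 1 segment per RTT, from W/2 back to W
    per_wave = L * (W // 2) + L * (L - 1) // 2   # W/2 + (W/2+1) + ... + (W/2+L-1)
    if B >= per_wave:                            # Tc >= L case
        return (B / per_wave) * L                # approximately B / (0.75 * W)
    sent, tc = 0, 0                              # Tc < L case: smallest Tc covering B
    while sent < B:
        sent += W // 2 + tc
        tc += 1
    return tc

def imminence_factor(B, W, D):
    # d = Tc / D; flows without a deadline use d = 1 (pure DCTCP behavior).
    if D is None:
        return 1.0
    return completion_time_rtts(B, W) / D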
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner: when congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.
D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.
Colocation Data Centers and Cloud Servers
http://www.datacentermap.com/datacenters.html
http://www.datacentermap.com/cloud.html
Example: AtlanticNet
http://www.atlantic.net/orlando-colocation-florida.html
Orlando Data Center
Data Center TCP (DCTCP)
M. Alizadeh, A. Greenberg, D. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan
Microsoft Research & Stanford University
ACM SIGCOMM September 2010
Rack Servers with Commodity Switches
Performance Impairments of Shallow-buffered Switches
1. TCP Incast Collapse
Many applications generate barrier-synchronized requests, in which the client cannot make forward progress until the responses from every server for the current request have been received. An example of these applications is a web search query (e.g., a Google search) sent to a large number of nodes, with results returned to the parent node to be sorted.
Barrier-synchronized requests can result in packets overfilling the shallow buffers on the client's port on the switch. In other words, these requests create many flows that converge on the same interface of a switch over a short period of time. The response packets create a long queue and may exhaust either the switch memory or the maximum permitted buffer for that interface, resulting in packet losses and throughput collapse.
1. TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the Partition/Aggregate workflow pattern, which is the foundation of many large-scale web applications. Requests from higher layers of the application are broken into pieces and farmed out to workers in lower layers. The responses of these workers are aggregated to produce a result. Web searches, social network content composition, and advertisement selection are based around the Partition/Aggregate design pattern.
In a multi-layer partition/aggregate pattern workflow, lags at one layer delay the initiation of others. Further, answering a request may require iteratively invoking the pattern, with an aggregator making serial requests to the workers below it to prepare a response (1 to 4 iterations are typical, though as many as 20 may occur).
The propagation of the request down to the leaves, and of the responses back up to the root, must be completed within the deadline.
In other publications this pattern is referred to as the Scatter/Gather pattern.
The partition/aggregate design pattern
[Figure: a root aggregator fans out to mid-level aggregators, which fan out to workers; request latency deadlines of 250 ms, 50 ms, and 10 ms apply at successive levels]
The total permissible latency for a request is limited, and the "backend" part of the application is typically allocated between 230-300 ms. This limit is called the all-up SLA.
Example: in web search, a query might be sent to many aggregators and workers, each responsible for a different part of the index. Based on the replies, an aggregator might refine the query and send it out again to improve the relevance of the result. Lagging instances of partition/aggregate can thus add up to threaten the all-up SLAs for queries.
A high-level aggregator (HLA) partitions queries to a large number of mid-level aggregators (MLAs) that in turn partition each query over the other servers in the same rack as the MLA. Servers act as both MLAs and workers, so each server will be acting as an aggregator for some queries and as a worker for other queries.
A TCP Incast Event
[Figure: an aggregator sends a query to workers 1-4; their responses and the ACKs return over the same switch port; the response from worker 3 is lost due to incast and is retransmitted only after a timeout]
Incast scenario: packets from many flows arriving to the same port at the same time.
Incast Collapse Summary
In other publications the incast scenario is referred to as the fan-in burst at the parent node. This incast is a key reason for increased network delay and occurs when all the children (e.g., workers at the leaf level) of a parent node face the same deadline and are likely to respond nearly at the same time, causing a fan-in burst at the parent node.
Performance Impairments of Shallow-buffered Switches
2. Queue Buildup
When long and short flows traverse the same queue, there is a queue buildup impairment: the short flows experience increased latency as they are in queue behind packets from the large flows. Since every worker in the cluster handles both query traffic and background traffic (large flows needed to update the data structures on the workers), this traffic pattern occurs very frequently.
This indicates that query flows can experience queuing delays because of long-lived, greedy TCP flows. Further, answering a request can require multiple iterations, which magnifies the impact of this delay.
Performance Impairments of Shallow-buffered Switches
3. Buffer Pressure
Given the mix of long and short flows in a data center, it is very common for short flows on one port to be impacted by activity on other ports. The loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports.
The long, greedy TCP flows build up queues on their interfaces. Since the switch is shallow-buffered and the buffer space is a shared resource, the queue buildup reduces the amount of buffer space available to absorb bursts of traffic from the Partition/Aggregate traffic. This impairment is called buffer pressure. The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.
Flow Interactions in Shallow-buffered Switches
Incast scenario: multiple short flows on the same port.
Queue buildup: short and long flows on the same port.
Buffer pressure: short flows on one port and long flows on another port.
Legacy TCP Congestion Control
[Figure: congestion window (in segments) versus round-trip times, showing slow start up to ss_thresh = 16, congestion avoidance up to cwnd = 20, segment losses triggering fast retransmit with ss_thresh dropping to 10, and a time-out returning the flow to slow start]
Fast retransmission: ssthresh = cwnd/2 = cwnd × (1 − 0.5); cwnd = ssthresh
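In code form, a sketch of this reaction for later contrast with DCTCP's proportional cut (not from the paper):

def legacy_fast_retransmit(cwnd):
    # Legacy TCP always halves on congestion, regardless of its severity.
    ssthresh = cwnd * (1 - 0.5)
    return ssthresh, ssthresh  # new (ssthresh, cwnd)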
The Need for a Data Center TCP
The data center environment is significantly different from wide area networks:
o round trip times (RTTs) can be less than 250 μs in the absence of queuing
o applications need extremely high bandwidths and very low latencies
o little statistical multiplexing: a single flow can dominate a particular path
o the network is largely homogeneous and under a single administrative control
o traffic flowing in switches is mostly internal; connectivity to the external Internet is typically managed through load balancers and application proxies that effectively separate internal traffic from external
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long flows. The measurements by the authors reveal that 99.91% of traffic in the data center is TCP traffic. The traffic consists of query traffic (2 KB to 20 KB in size), delay-sensitive short messages (100 KB to 1 MB), and throughput-sensitive long flows (1 MB to 100 MB). These applications require three things from the data center network:
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow-buffered commodity switches, legacy TCP protocols fall short of satisfying the above requirements.
See the paper for details of workload characterization in cloud data centers.
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput with commodity, shallow-buffered switches. DCTCP uses the concept of ECN (Explicit Congestion Notification) and achieves these goals primarily by reacting to congestion in proportion to the extent of congestion.
DCTCP uses a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold.
The DCTCP source reacts by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor. This is different from standard TCP, which cuts its window size by a factor of 2 when it receives an ECN notification.
DCTCP - Simple Marking at the Switch
DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.
An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival. Marking is based on the instantaneous value of the queue, not the average value as in RED routers.
The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.
The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.
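A minimal sketch of this marking rule (queue lengths in packets; K and buffer_limit are assumed configuration names, not taken from the paper):

def on_packet_arrival(queue_len, K, buffer_limit):
    # Decide per packet, based on the instantaneous queue length, as described above.
    if queue_len <= K:
        return "accept"
    elif queue_len <= buffer_limit:
        return "accept_and_mark"   # set the CE codepoint
    else:
        return "drop"              # buffer exhausted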
DCTCP - ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.
For senders that use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See the paper for details of the delayed ACK scheme.
DCTCP - Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:
α = (1 − g) × α + g × F
where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K, and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.
Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
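A sender-side sketch of this estimate together with DCTCP's window cut (the cut formula appears on the next slide; the g value shown is only a typical choice, not mandated here):

def dctcp_update_alpha(alpha, F, g=1.0 / 16):
    # F: fraction of ACKs carrying ECN-Echo in the last window of data
    return (1 - g) * alpha + g * F

def dctcp_on_congestion(cwnd, alpha):
    # Cut in proportion to the measured congestion, instead of always halving.
    ssthresh = cwnd * (1 - alpha / 2.0)
    return ssthresh, ssthresh  # new (ssthresh, cwnd)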
RED Router
[Figure: packets are accepted while the average queue size is below THmin, discarded or marked with increasing probability between THmin and THmax, and discarded above THmax, up to the link capacity C]
Update the value of the average queue size: avg = (1 − wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa;
    with probability Pa, discard or mark packet;
    otherwise (with probability 1 − Pa), accept packet
else if (avg > THmax) discard packet

DCTCP Switch
[Figure: packets are accepted without marking while the instantaneous queue is at or below K, and accepted with marking between K and the buffer limit]
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender
Update α = (1 − g) × α + g × F
Reaction to a marked ACK in a new window: ssthresh = cwnd × (1 − α/2); cwnd = ssthresh

Legacy TCP Sender
Reaction to a marked ACK in a new window: ssthresh = cwnd/2; cwnd = ssthresh
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even one packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.
However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.
DCTCP Performance
The paper has more details on:
Guidelines for choosing parameters and estimating gain
Analytical model for the steady-state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (w/ SACK) implementation
D3 TCP
Better Never Than Late: Meeting Deadlines in Datacenter Networks
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
ACM SIGCOMM, August 2011
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP: Basic Idea of Deadline Awareness
[Figure: two flows (f1, f2) with different deadlines (d1, d2) under DCTCP and under D3 TCP; the thickness of a flow line represents the rate allocated to it]
DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines.
D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50 KB) and RTTs are minimal (~300 μs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
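To give the flavor of the rate request (a rough sketch of the idea only; the names are ours, and the full D3 allocation logic at routers is more involved):

def d3_requested_rate(bytes_remaining, time_to_deadline):
    # Ask for just enough bandwidth to finish the flow by its deadline;
    # deadline-less flows request no extra rate and rely on their fair share.
    if time_to_deadline is None:
        return 0.0
    return bytes_remaining / time_to_deadline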
D2 TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (i.e., switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1259
Data Center TCP (DCTCP)
M Alizadehzy A Greenbergy D Maltzy J Padhyey P
Pately B Prabhakarz S Senguptay M Sridharan
983117983145983139983154983151983155983151983142983156 983122983141983155983141983137983154983139983144 amp 983123983156983137983150983142983151983154983140 983125983150983145983158983141983154983155983145983156983161
ACM SIGCOMM September 2010
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1359
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1459
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1559
Rack Servers with Commodity Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1659
Performance impairments of Shallow-buffered
Switches1 TCP Incast Collapse
Many applications generate barrier-synchronized requests in which the
client cannot make forward progress until the responses from every
server for the current request have been received An Example of these
applications is a web search query (eg a Google search) sent to a large
number of nodes with results returned to the parent node to be sorted
Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these
requests create many flows that converge on the same interface of a
switch over a short period of time The response packets create a long
queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and
throughput collapse
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1759
1 TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the PartitionAggregate workflow
pattern which is the foundation of many large scale web applications
Requests from higher layers of the application are broken into pieces and
farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content
composition and advertisement selection are based around the
PartitionAggregate design pattern
In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require
iteratively invoking the pattern with an aggregator making serial requests
to the workers below it to prepare a response (1 to 4 iterations are typical
though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up
to the root must be completed within the deadline
In other publications this pattern is referred to as the ScatterGather pattern
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1859
983137983143983143983154983141983143983137983156983151983154
983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154
The partitionaggregate design pattern
Request Latency deadline 250 ms
deadline 50 ms
deadline 10 ms
The total permissible latency for a request is limited and the ldquobackendrdquo part of the
application is typically allocated between 230-300 ms This limit is called the all-up SLA
Example in web search a query might be sent to many aggregators and workers each
responsible for a different part of the index Based on the replies an aggregator might
refine the query and send it out again to improve the relevance of the result Lagging
instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries
A high-level aggregator
(HLA) partitions queries to
a large number of mid-level
aggregators (MLAs) that in
turn partition each query
over the other servers in the
same rack as the MLA
Servers act as both MLAs
and workers so each server
will be acting as an
aggregator for some queries
and as a worker for other
queries
HLA
MLAMLA
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1959
aggregator
worker 1worker 2worker 3worker 4
query
response
Ack
A TCP Incast Event
Response from worker 3 is lost due to incast and is
retransmitted after a timeout
timeout
983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154
983090
983159983151983154983147983141983154
983089
983159983151983154983147983141983154
983091
983159983151983154983147983141983154
983091
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
IncastScenario
Packets from many
flows arriving to
the same port at
the same time
Incast Collapse Summary
In other publications the incast scnario
is referred to as the fan-in burst at the
parent node This incast is a key reason
for increased network delay and occurswhen all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time causing a fan-
in burst at the parent node
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination.
Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
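To make the design idea concrete, here is a minimal, illustrative sketch (in Python) of how an end host might turn a flow's remaining size and deadline into a rate request, and how a router might grant requests in arrival order. The function names and the rate formula are assumptions for illustration, not the actual D3 TCP protocol from the paper.

```python
# Illustrative sketch only; not the actual D3 TCP wire protocol.

def requested_rate(bytes_remaining, time_to_deadline, base_rate=1.0):
    """End host: ask for roughly bytes_remaining / time_to_deadline.

    Flows without a deadline (time_to_deadline is None) just ask for a base rate.
    """
    if time_to_deadline is None or time_to_deadline <= 0:
        return base_rate
    return bytes_remaining / time_to_deadline


def grant_rates_fcfs(requests, link_capacity):
    """Router: grant each request in arrival order (first-come, first-served)
    until capacity runs out. This greedy FCFS granting is also what opens the
    door to the priority inversion discussed on a later slide.

    requests: list of (flow_id, desired_rate) tuples in arrival order.
    """
    grants = {}
    remaining = link_capacity
    for flow_id, desired in requests:
        granted = min(desired, remaining)
        grants[flow_id] = granted
        remaining -= granted
    return grants
```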
D2 TCP
Deadline-Aware Datacenter TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in & tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
• does not handle fan-in bursts well
• introduces priority inversion at fan-in bursts (see next slide)
• does not co-exist with TCP
• requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP
[Figure: bandwidth requests with near and far deadlines arriving at a switch; the switch grants requests FCFS, so a far-deadline request can be granted while a near-deadline request arriving just after it is paused]
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24-33% of requests.
D2 TCP's Contributions
• Deadline-aware and handles fan-in bursts well.
• Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less).
• Reactive and decentralized.
• Does not hinder long-lived (non-deadline) flows.
• Coexists with TCP → incrementally deployable.
• No change to switch hardware → deployable today.
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively.
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-to-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
OLDI applications employ tree-based divide-and-conquer algorithms where every query operates on data spanning thousands of servers.
[Figure: a user query enters at the root, is partitioned across parent aggregators and then across leaf servers, and the OLDI response returns within ~250 ms]
D2 TCP
Deadline-aware and handles fan-in bursts.
Key Idea: vary the sending rate based on both the deadline and the extent of congestion.
Built on top of DCTCP.
Distributed: uses per-flow state at end hosts.
Reactive: senders react to congestion, with no knowledge of other flows.
D2 TCP: Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:
α = (1 − g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as gamma-correction.
D2 TCP: Adjusting the Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
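A minimal Python sketch of this sender-side resize, assuming α (the congestion estimate) and d (the deadline imminence factor defined above and computed on later slides) are already available; the function name and argument layout are illustrative.

```python
def d2tcp_resize_window(cwnd, alpha, d, f):
    """Gamma-correction window resize as described on the slides (sketch).

    cwnd  : current congestion window, in segments
    alpha : EWMA fraction of CE-marked packets, 0 <= alpha <= 1
    d     : deadline imminence factor (d > 1 near deadline, d < 1 far deadline)
    f     : fraction of packets marked in the latest window of data
    """
    if f == 0:
        return cwnd + 1              # no marks: grow by one segment, as in TCP
    p = alpha ** d                   # penalty; p = alpha when d = 1 (DCTCP-like)
    return cwnd * (1 - p / 2.0)      # p = 1 halves the window, as in TCP
```

For example, with α = 0.5 a far-deadline flow with d = 0.5 gets p ≈ 0.71 and backs off strongly, while a near-deadline flow with d = 2 gets p = 0.25 and largely retains its window.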
D2 TCP: Basic Formulas
After determining p, we resize the congestion window W as follows:
W = W × (1 − p/2)   if f > 0
where
p = α^d
d = Tc / D   (deadline imminence factor)
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
p = α^d
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
W := W × (1 − p/2)
• d < 1 → p > α for far-deadline flows: p is large → shrink the window
• d > 1 → p < α for near-deadline flows: p is small → retain the window
• d = 1 → p = α for long-lived flows: DCTCP behavior
[Figure: plot of the penalty p = α^d versus α for d < 1 (far deadline), d = 1, and d > 1 (near deadline)]
D2 TCP: Computing α
α is calculated by aggregating ECN marks (like DCTCP):
• Switches mark packets if queue_length > threshold.
• The sender computes the fraction of marked packets, averaged over time.
[Figure: switch buffer with marking threshold K and Buffer_limit; packets are accepted without marking below K and accepted with marking between K and Buffer_limit]
Switch:
if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit)
    discard packet
Sender (update once every RTT):
α = (1 − g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data.
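The marking rule at the switch and the once-per-RTT averaging at the sender can be summarized in a short sketch (illustrative Python; the parameter names, and g = 1/16 as a sample weight, are assumptions):

```python
def switch_action(q, K, buffer_limit):
    """Per-packet decision at the switch, based on the instantaneous queue length q."""
    if q <= K:
        return "accept"                 # below threshold: no marking
    elif q <= buffer_limit:
        return "accept_and_mark"        # set the CE codepoint
    else:
        return "drop"                   # buffer exhausted


def update_alpha(alpha, f, g=1.0 / 16):
    """Sender, once per RTT: EWMA of the fraction f of marked packets.

    g (0 < g < 1) weights new samples; 1/16 is only an example value.
    """
    return (1 - g) * alpha + g * f
```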
D2 TCP: Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection, after which the window grows again.
D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
[Figure: sawtooth window waves of length L RTTs, shown for the case Tc > L]
Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d (continued)
Case Tc ≥ L, where L is the length of one sawtooth wave in RTTs:
B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × Tc / L
Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives
B = (0.75 W) × Tc, in bytes.
[Figure: sawtooth window waves for the deadline-agnostic behavior (similar to DCTCP), case Tc > L]
Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d (continued)
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as
d = Tc / D
with Tc ≈ B / (0.75 W) under the approximation above.
[Figure: sawtooth window waves for the deadline-agnostic behavior (similar to DCTCP), case Tc > L]
D2 TCP: the deadline imminence factor d
What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure, and we have
B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)
Since the value of B is known by the application, the value Tc can be computed. The value d is again given by
d = Tc / D
[Figure: partial sawtooth wave for the deadline-agnostic behavior (DCTCP), case Tc < L]
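Putting the two cases together, the following sketch (illustrative Python; variable names and the handling of no-deadline flows are assumptions) estimates Tc in RTTs from the remaining bytes B and the current window W, and then derives d = Tc / D:

```python
def estimate_tc(B, W):
    """Estimate Tc (in RTTs) to send B remaining bytes under the pessimistic
    sawtooth: the window drops to W/2 and grows by one unit per RTT back to W.
    Assumes W > 0; B and the per-RTT window are in the same units (e.g., bytes).
    """
    half = W / 2.0
    L = int(half) + 1                             # RTTs per sawtooth wave (L - 1 = W/2)
    per_wave = sum(half + i for i in range(L))    # bytes sent in one full wave
    if B >= per_wave:                             # case Tc >= L: one or more full waves
        return L * B / per_wave
    sent, tc = 0.0, 0                             # case Tc < L: walk the partial wave
    while sent < B:
        sent += half + tc
        tc += 1
    return tc


def imminence_factor(B, W, D):
    """d = Tc / D, where D is the time to the deadline in RTTs.
    Flows without a deadline use d = 1, giving DCTCP-like behavior."""
    if D is None:
        return 1.0
    return estimate_tc(B, W) / D                  # or the coarser Tc ~ B / (0.75 * W)
```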
D2 TCP: Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1459
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1559
Rack Servers with Commodity Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1659
Performance impairments of Shallow-buffered
Switches1 TCP Incast Collapse
Many applications generate barrier-synchronized requests in which the
client cannot make forward progress until the responses from every
server for the current request have been received An Example of these
applications is a web search query (eg a Google search) sent to a large
number of nodes with results returned to the parent node to be sorted
Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these
requests create many flows that converge on the same interface of a
switch over a short period of time The response packets create a long
queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and
throughput collapse
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1759
1 TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the PartitionAggregate workflow
pattern which is the foundation of many large scale web applications
Requests from higher layers of the application are broken into pieces and
farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content
composition and advertisement selection are based around the
PartitionAggregate design pattern
In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require
iteratively invoking the pattern with an aggregator making serial requests
to the workers below it to prepare a response (1 to 4 iterations are typical
though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up
to the root must be completed within the deadline
In other publications this pattern is referred to as the ScatterGather pattern
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1859
983137983143983143983154983141983143983137983156983151983154
983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154
The partitionaggregate design pattern
Request Latency deadline 250 ms
deadline 50 ms
deadline 10 ms
The total permissible latency for a request is limited and the ldquobackendrdquo part of the
application is typically allocated between 230-300 ms This limit is called the all-up SLA
Example in web search a query might be sent to many aggregators and workers each
responsible for a different part of the index Based on the replies an aggregator might
refine the query and send it out again to improve the relevance of the result Lagging
instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries
A high-level aggregator
(HLA) partitions queries to
a large number of mid-level
aggregators (MLAs) that in
turn partition each query
over the other servers in the
same rack as the MLA
Servers act as both MLAs
and workers so each server
will be acting as an
aggregator for some queries
and as a worker for other
queries
HLA
MLAMLA
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1959
aggregator
worker 1worker 2worker 3worker 4
query
response
Ack
A TCP Incast Event
Response from worker 3 is lost due to incast and is
retransmitted after a timeout
timeout
983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154
983090
983159983151983154983147983141983154
983089
983159983151983154983147983141983154
983091
983159983151983154983147983141983154
983091
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
IncastScenario
Packets from many
flows arriving to
the same port at
the same time
Incast Collapse Summary
In other publications the incast scnario
is referred to as the fan-in burst at the
parent node This incast is a key reason
for increased network delay and occurswhen all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time causing a fan-
in burst at the parent node
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.
However, in practice, each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and resulting timeouts.
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation
D3 TCP
Better Never Than Late: Meeting Deadlines in Datacenter Networks
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
ACM SIGCOMM, August 2011
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP: Basic Idea of Deadline Awareness
[Figure: two timelines of flows f1 and f2 with deadlines d1 and d2, one under DCTCP and one under D3 TCP; the thickness of a flow line represents the rate allocated to it.]
Two flows (f1, f2) with different deadlines (d1, d2). DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines. D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50KB) and RTTs are minimal (300 microseconds). Consequently, reaction time-scales are short, and centralized, heavy-weight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible, as illustrated by the sketch below.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
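To make the idea concrete, here is a deliberately simplified sketch of deadline-driven rate requests and greedy, first-come-first-served allocation at a router. All names and the allocation rule are illustrative assumptions; the real D3 protocol also handles base rates, rate quenching, and other details covered in the paper.

# Hedged, simplified sketch in the spirit of D3: each flow requests
# remaining_bytes / time_to_deadline, and the router grants requests
# greedily in arrival order until the link capacity is exhausted.
def desired_rate(remaining_bytes: float, time_to_deadline: float) -> float:
    return remaining_bytes / max(time_to_deadline, 1e-6)

def greedy_allocate(requests, capacity: float):
    """requests: list of (flow_id, desired_rate) in arrival order."""
    grants, left = {}, capacity
    for flow_id, rate in requests:
        grant = min(rate, left)
        grants[flow_id] = grant
        left -= grant
    return grants

if __name__ == "__main__":
    reqs = [("far_deadline", desired_rate(6e6, 2.0)),    # needs 3 MB/s
            ("near_deadline", desired_rate(6e6, 0.5))]   # needs 12 MB/s
    print(greedy_allocate(reqs, capacity=10e6))
    # The far-deadline flow arrived first and is granted in full, leaving
    # the near-deadline flow short of the rate it needs: exactly the
    # priority-inversion risk discussed on a later slide.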
D2 TCP
Deadline-Aware Datacenter TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in & tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
• does not handle fan-in bursts well
• introduces priority inversion at fan-in bursts (see next slide)
• does not co-exist with TCP
• requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP
[Figure: bandwidth requests arriving at a switch; the switch grants requests FCFS. A request with a far deadline arriving slightly ahead of a request with a near deadline is granted, while the near-deadline request is paused.]
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24-33% of requests.
D2 TCP's Contributions
• Deadline-aware and handles fan-in bursts well
• Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
• Reactive, decentralized
• Does not hinder long-lived (non-deadline) flows
• Coexists with TCP → incrementally deployable
• No change to switch hardware → deployable today
• D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
[Figure: a root node fanning out to parent nodes, which fan out to leaf nodes; a user query enters at the root and the OLDI response returns within ~250 ms.]
D2 TCP
Deadline-aware and handles fan-in bursts.
Key Idea: Vary sending rate based on both deadline and extent of congestion.
Built on top of DCTCP.
Distributed: uses per-flow state at end hosts.
Reactive: senders react to congestion, with no knowledge of other flows.
D2 TCP Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 - g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction.
D2 TCP Adjusting Congestion Window
The congestion window W is adjusted as follows:

W = W × (1 - p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
Note: Larger p ⇒ smaller window.
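A compact sketch of this gamma-corrected update (illustrative values; window in segments):

# Hedged sketch of D2 TCP's gamma-corrected congestion-window update:
# p = alpha ** d; shrink by p/2 when marks were seen, else grow by 1.
def d2tcp_adjust(cwnd: float, alpha: float, d: float, f: float) -> float:
    if f > 0:                      # some packets were CE-marked
        p = alpha ** d             # gamma-correction penalty
        return cwnd * (1 - p / 2)
    return cwnd + 1                # no marks: additive increase

if __name__ == "__main__":
    cwnd, alpha = 20.0, 0.5
    for d in (0.5, 1.0, 2.0):      # far deadline, no deadline, near deadline
        print(d, round(d2tcp_adjust(cwnd, alpha, d, f=0.5), 2))
    # d = 0.5 -> p ≈ 0.71 -> cwnd ≈ 12.9 (far deadline backs off more)
    # d = 1.0 -> p = 0.50 -> cwnd = 15.0 (DCTCP behavior)
    # d = 2.0 -> p = 0.25 -> cwnd = 17.5 (near deadline backs off less)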
D2 TCP Basic Formulas
After determining p, we resize the congestion window W as follows:

W = W × (1 - p/2)   if f > 0,   where p = α^d

and d is the deadline imminence factor:

d = Tc / D

Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
Key insight: Near-deadline flows back off less while far-deadline flows back off more.
[Figure: plot of the penalty p = α^d versus α for d = 1, d < 1 (far deadline), and d > 1 (near deadline), with the update W := W × (1 - p/2).]
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
D2 TCP: Computing α
α is calculated by aggregating ECN (like DCTCP).
Switches mark packets if queue_length > threshold; the sender computes the fraction of marked packets, averaged over time.
[Figure: switch buffer - accept without marking below K; accept with marking between K and Buffer_limit.]
Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet
Sender: update once every RTT:
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data.
D2 TCP: Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior.
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); the window oscillates between W/2 and W over L RTTs, with W → W/2 upon congestion detection; Tc > L, and the deadline D is marked on the time axis.]
D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L - 1) ] × Tc/L,   for Tc > L

Note that Tc/L is the number of sawtooth waves needed to complete transmitting the message.
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); window between W/2 and W, time in RTTs, Tc > L, deadline D.]
Since the value of B is known by the application, and L - 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives

B = (0.75 W) × Tc   (W in bytes)

Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); window between W/2 and W, time in RTTs, Tc > L, deadline D.]
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc ≅ D), then d = 1 is appropriate. It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D

with Tc = B / (0.75 W) as the approximation.
D2 TCP: the deadline imminence factor d
What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc - 1)

[Figure: partial sawtooth for deadline-agnostic behavior (DCTCP); window between W/2 and W, Tc < L.]
Since the value of B is known by the application, the value Tc can be computed. The value d is given by

d = Tc / D
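Putting the pieces together, here is a small sketch of computing d from B, W, and D using the 0.75 W approximation from the previous slides. Units are illustrative (W and B in bytes, times in RTTs), and treating a missing deadline as d = 1 is an assumption consistent with the slides.

# Hedged sketch: deadline imminence factor d from bytes remaining (B),
# current window (W, bytes) and time remaining until the deadline
# (D, in RTTs), using the approximation Tc ≈ B / (0.75 * W).
def completion_time_rtts(B: float, W: float) -> float:
    return B / (0.75 * W)          # sawtooth average window ≈ 0.75 W

def imminence_factor(B: float, W: float, D_rtts: float) -> float:
    if D_rtts <= 0:                # no deadline specified: behave like DCTCP
        return 1.0
    return completion_time_rtts(B, W) / D_rtts

if __name__ == "__main__":
    W = 15_000.0                   # ~10 segments of 1500 bytes
    print(imminence_factor(B=90_000, W=W, D_rtts=16))   # 0.5: far deadline
    print(imminence_factor(B=90_000, W=W, D_rtts=4))    # 2.0: near deadline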
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1659
Performance impairments of Shallow-buffered
Switches1 TCP Incast Collapse
Many applications generate barrier-synchronized requests in which the
client cannot make forward progress until the responses from every
server for the current request have been received An Example of these
applications is a web search query (eg a Google search) sent to a large
number of nodes with results returned to the parent node to be sorted
Barrier-synchronized requests can result in packets overfilling theshallow buffers on the clients port on the switch In other words these
requests create many flows that converge on the same interface of a
switch over a short period of time The response packets create a long
queue and may exhaust either the switch memory or the maximumpermitted buffer for that interface resulting in packet losses and
throughput collapse
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1759
1 TCP Incast Collapse (continued)
Barrier-synchronized requests exhibit the PartitionAggregate workflow
pattern which is the foundation of many large scale web applications
Requests from higher layers of the application are broken into pieces and
farmed out to workers in lower layers The responses of these workers areaggregated to produce a result Web searches social network content
composition and advertisement selection are based around the
PartitionAggregate design pattern
In a multi-layer partitionaggregate pattern workflow lags at one layerdelay the initiation of others Further answering a request may require
iteratively invoking the pattern with an aggregator making serial requests
to the workers below it to prepare a response (1 to 4 iterations are typical
though as many as 20 may occur) The propagation of the request down to leaves and of the responses back up
to the root must be completed within the deadline
In other publications this pattern is referred to as the ScatterGather pattern
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1859
983137983143983143983154983141983143983137983156983151983154
983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154
The partitionaggregate design pattern
Request Latency deadline 250 ms
deadline 50 ms
deadline 10 ms
The total permissible latency for a request is limited and the ldquobackendrdquo part of the
application is typically allocated between 230-300 ms This limit is called the all-up SLA
Example in web search a query might be sent to many aggregators and workers each
responsible for a different part of the index Based on the replies an aggregator might
refine the query and send it out again to improve the relevance of the result Lagging
instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries
A high-level aggregator
(HLA) partitions queries to
a large number of mid-level
aggregators (MLAs) that in
turn partition each query
over the other servers in the
same rack as the MLA
Servers act as both MLAs
and workers so each server
will be acting as an
aggregator for some queries
and as a worker for other
queries
HLA
MLAMLA
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1959
aggregator
worker 1worker 2worker 3worker 4
query
response
Ack
A TCP Incast Event
Response from worker 3 is lost due to incast and is
retransmitted after a timeout
timeout
983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154
983090
983159983151983154983147983141983154
983089
983159983151983154983147983141983154
983091
983159983151983154983147983141983154
983091
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
IncastScenario
Packets from many
flows arriving to
the same port at
the same time
Incast Collapse Summary
In other publications the incast scnario
is referred to as the fan-in burst at the
parent node This incast is a key reason
for increased network delay and occurswhen all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time causing a fan-
in burst at the parent node
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP - Simple Marking at the Switch
DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.
An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival.
Marking is based on the instantaneous value of the queue, not the average value as in RED routers.
The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.
The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.
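A minimal sketch of this marking rule, with hypothetical names (in practice the logic lives in the switch forwarding hardware and only the threshold K is configured):

def on_packet_arrival(queue_len, packet, K, buffer_limit):
    """Sketch of DCTCP-style AQM: act on the instantaneous queue occupancy.
    queue_len    -- current queue length of the egress port (packets)
    K            -- single marking threshold
    buffer_limit -- physical buffer available to the port
    """
    if queue_len > buffer_limit:
        return "drop"                  # buffer exhausted: packet is lost
    if queue_len > K:
        packet["CE"] = 1               # set the Congestion Experienced codepoint
        return "accept_and_mark"
    return "accept"                    # below the threshold: forward unmarked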
DCTCP - ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.
For senders that use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See paper for details of the delayed ACK scheme.
DCTCP - Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:
α = (1 - g) × α + g × F
where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.
Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
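A minimal sketch of this sender-side logic, under stated assumptions (the class and method names are hypothetical, g = 1/16 is only an example weight, and per-packet ECN-Echo accounting is abstracted into the marked_pkts count):

class DctcpSender:
    """Sketch of the sender's estimate of alpha and its window reaction."""
    def __init__(self, g=1.0 / 16, cwnd=10.0):
        self.g = g            # weight given to new samples (0 < g < 1)
        self.alpha = 0.0      # estimated fraction of marked packets
        self.cwnd = cwnd      # congestion window, in segments

    def on_window_of_acks(self, acked_pkts, marked_pkts):
        """Called once per window of data (roughly once per RTT)."""
        F = marked_pkts / float(acked_pkts)            # fraction marked this window
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked_pkts > 0:
            self.cwnd *= (1 - self.alpha / 2)          # cut in proportion to congestion
        else:
            self.cwnd += 1                             # additive increase, as in TCP

With α close to 0 the window barely shrinks; with α = 1 (every packet marked) the window halves, matching legacy TCP's reaction.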
RED Router
(Buffer occupancy regions: accept below THmin; discard or mark with increasing probability between THmin and THmax; discard above THmax; total buffer capacity C.)
RED Router algorithm:
Update the value of the average queue size: avg = (1 - wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa
    with probability Pa, discard or mark packet
    otherwise, with probability 1 - Pa, accept packet
else if (avg > THmax) discard packet

DCTCP Switch
(Buffer occupancy regions: accept without marking below K; accept with marking between K and limit; discard above limit.)
DCTCP Switch algorithm:
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender
Update α = (1 - g) × α + g × F
Reaction to marked ACK in a new window: ssthresh = cwnd × (1 - α/2); cwnd = ssthresh

Legacy TCP Sender
Reaction to marked ACK in a new window: ssthresh = cwnd/2; cwnd = ssthresh
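For contrast, a toy Python sketch of the RED discipline above (the Pa computation is simplified; real RED also scales Pa by the number of packets accepted since the last mark, and all names here are illustrative):

import random

def red_on_packet(avg, q, wq, th_min, th_max, max_p):
    """RED: mark/drop based on an EWMA of the queue length.
    Returns (new_avg, action); max_p is the marking probability reached at th_max."""
    avg = (1 - wq) * avg + wq * q                        # update average queue size
    if avg < th_min:
        return avg, "accept"
    if avg <= th_max:
        pa = max_p * (avg - th_min) / (th_max - th_min)  # simplified Pa
        if random.random() < pa:
            return avg, "mark_or_drop"
        return avg, "accept"
    return avg, "drop"                                   # average above th_max

The DCTCP switch rule needs none of this bookkeeping: a single comparison of the instantaneous queue against K suffices.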
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.
However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.
DCTCP Performance
The paper has more details on:
Guidelines for choosing parameters and estimating gain
Analytical model for the steady-state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation
D3 TCP
"Better Never Than Late: Meeting Deadlines in Datacenter Networks"
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
ACM SIGCOMM, August 2011
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP: Basic Idea of Deadline Awareness
(Figure: two flows (f1, f2) with different deadlines (d1, d2), shown as rate-versus-time lines under DCTCP and under D3 TCP; the thickness of a flow line represents the rate allocated to it.)
DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines.
D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50KB) and RTTs are minimal (~300 μs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic into the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
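A rough, illustrative sketch of the two sides of this idea (function names are hypothetical and the router signaling protocol is omitted; this is not the full D3 TCP mechanism):

def requested_rate(bytes_remaining, seconds_to_deadline, base_rate=0.0):
    """Rate (bytes/s) an end host asks for so its flow finishes by the deadline.
    Flows with no deadline ask only for a base rate."""
    if seconds_to_deadline is None or seconds_to_deadline <= 0:
        return base_rate
    return bytes_remaining / seconds_to_deadline

def greedy_grant(requests, link_capacity):
    """Router-side sketch: grant requests greedily, in arrival (FCFS) order,
    until the link capacity is exhausted."""
    grants, remaining = [], link_capacity
    for r in requests:
        g = min(r, remaining)
        grants.append(g)
        remaining -= g
    return grants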
D2 TCP
"Deadline-Aware Datacenter TCP"
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP
(Figure: bandwidth requests arriving at a switch, which grants requests FCFS; a request with a far deadline is granted while a request with a near deadline is paused.)
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
D2 TCP's Contributions
Deadline-aware and handles fan-in bursts well
Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
Reactive, decentralized
Does not hinder long-lived (non-deadline) flows
Coexists with TCP → incrementally deployable
No change to switch hardware → deployable today
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms where every query operates on data spanning thousands of servers.
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
(Figure: a tree with a root, parent aggregators, and leaves; a user query produces the OLDI response in ~250 ms.)
D2 TCP
Deadline-aware and handles fan-in bursts
Key idea: vary sending rate based on both deadline and extent of congestion
Built on top of DCTCP
Distributed: uses per-flow state at end hosts
Reactive: senders react to congestion
No knowledge of other flows
D2 TCP: Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction function.
D2 TCP: Adjusting the Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 - p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
D2 TCP: Basic Formulas
After determining p, we resize the congestion window W as follows:
W = W × (1 - p/2)   if f > 0, where p = α^d
d is the deadline imminence factor:
d = Tc / D
Tc = flow completion time achieved with the current sending rate; D = the time remaining until the deadline expires.
d < 1 for far-deadline flows; d > 1 for near-deadline flows.
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).
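Putting the formulas together, a compact sketch of the D2 TCP window update (illustrative only; α is maintained exactly as in the DCTCP sketch earlier, and d comes from Tc and D as derived on the following slides):

def d2tcp_window_update(cwnd, alpha, d, any_marked):
    """Resize the congestion window with the gamma-correction penalty.
    alpha -- EWMA fraction of marked packets (0..1)
    d     -- deadline imminence factor: <1 far deadline, >1 near deadline,
             =1 for flows without a deadline (pure DCTCP behavior)
    """
    if not any_marked:
        return cwnd + 1              # no congestion: grow by one segment
    p = alpha ** d                   # penalty: gamma correction of alpha
    return cwnd * (1 - p / 2)        # near-deadline flows back off less

For d = 1 this reduces exactly to the DCTCP update above.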
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
(Figure: the penalty p = α^d plotted against α for d < 1 (far deadline), d = 1, and d > 1 (near deadline); both axes run from 0 to 1.0.)
W := W × (1 - p/2), with p = α^d
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
D2 TCP: Computing α
α is calculated by aggregating ECN marks (like DCTCP): switches mark packets if queue_length > threshold, and the sender computes the fraction of marked packets, averaged over time.
Switch (buffer regions: accept without marking below K; accept with marking between K and Buffer_limit; discard above Buffer_limit):
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet
Sender (update once every RTT):
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data.
D2 TCP: Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc needed to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection, after which the window grows again.
(Figure: sawtooth waves of the window between W/2 and W over time, with sawtooth length L, Tc > L, and the deadline D marked on the time axis.)
D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
For Tc ≥ L, the bytes transmitted over the sawtooth pattern satisfy
B = (Tc / L) × [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L - 1) ]
(Figure: sawtooth waves of the window between W/2 and W versus time in RTTs, with sawtooth length L, Tc > L, and D marked on the time axis.)
Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application, and L - 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 × W (i.e., Tc is an integer multiple of L). This gives
B = (0.75 × W) × Tc, with W in bytes.
Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d
(Figure: sawtooth waves of the window between W/2 and W versus time in RTTs, with Tc > L and D marked on the time axis.)
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as
d = Tc / D
with Tc = B / (0.75 × W) under the approximation above.
D2 TCP: the deadline imminence factor d
What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure.
(Figure: a partial sawtooth wave of the window between W/2 and W versus time, with Tc < L.)
In this case we have
B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc - 1)
Since the value of B is known by the application, the value Tc can be computed. The value d is then given by
d = Tc / D
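A small sketch of the d computation under the 0.75·W approximation, together with the exact partial-sawtooth sum for the Tc < L case (names are hypothetical; W, B, and D must use consistent units, e.g. W and B in bytes and D in RTTs):

def tc_approx_rtts(B, W):
    """Approximate Tc (in RTTs) assuming the average window over Tc is 0.75*W."""
    return B / (0.75 * W)

def segments_in_partial_sawtooth(W_segments, tc_rtts):
    """Exact segments sent when Tc < L: W/2 + (W/2 + 1) + ... + (W/2 + Tc - 1),
    with the window measured in segments."""
    return sum(W_segments / 2.0 + i for i in range(int(tc_rtts)))

def deadline_imminence(B, W, D_rtts):
    """d = Tc / D; d > 1 flags a tight (near) deadline, d < 1 a far one."""
    if D_rtts is None:               # no deadline specified: behave like DCTCP
        return 1.0
    return tc_approx_rtts(B, W) / D_rtts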
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1859
983137983143983143983154983141983143983137983156983151983154
983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154 983159983151983154983147983141983154
The partitionaggregate design pattern
Request Latency deadline 250 ms
deadline 50 ms
deadline 10 ms
The total permissible latency for a request is limited and the ldquobackendrdquo part of the
application is typically allocated between 230-300 ms This limit is called the all-up SLA
Example in web search a query might be sent to many aggregators and workers each
responsible for a different part of the index Based on the replies an aggregator might
refine the query and send it out again to improve the relevance of the result Lagging
instances of partitionaggregate can thus add up to threaten the all-up SLAs for queries
A high-level aggregator
(HLA) partitions queries to
a large number of mid-level
aggregators (MLAs) that in
turn partition each query
over the other servers in the
same rack as the MLA
Servers act as both MLAs
and workers so each server
will be acting as an
aggregator for some queries
and as a worker for other
queries
HLA
MLAMLA
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 1959
aggregator
worker 1worker 2worker 3worker 4
query
response
Ack
A TCP Incast Event
Response from worker 3 is lost due to incast and is
retransmitted after a timeout
timeout
983137983143983143983154983141983143983137983156983151983154
983159983151983154983147983141983154
983090
983159983151983154983147983141983154
983089
983159983151983154983147983141983154
983091
983159983151983154983147983141983154
983091
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
IncastScenario
Packets from many
flows arriving to
the same port at
the same time
Incast Collapse Summary
In other publications the incast scnario
is referred to as the fan-in burst at the
parent node This incast is a key reason
for increased network delay and occurswhen all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time causing a fan-
in burst at the parent node
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance, low latency, and high throughput with commodity, shallow-buffered switches. DCTCP uses the concept of ECN (Explicit Congestion Notification).
DCTCP achieves these goals primarily by reacting to congestion in proportion to the extent of congestion.
DCTCP uses a simple marking scheme at switches that sets the Congestion Experienced (CE) codepoint of packets as soon as the buffer occupancy exceeds a fixed small threshold.
The DCTCP source reacts by reducing the window by a factor that depends on the fraction of marked packets: the larger the fraction, the bigger the decrease factor. This is different from standard TCP, which cuts its window size by a factor of 2 when it receives an ECN notification.
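As a hedged numerical illustration (the numbers are ours, not from the slides): if about 10% of a window's packets come back CE-marked, the estimated congestion fraction is roughly 0.1, so a DCTCP sender shrinks its window by only about 0.1/2 = 5%; a standard ECN-enabled TCP sender would cut the window by 50% for the same mild congestion signal.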
DCTCP - Simple Marking at the Switch
DCTCP employs a simple active queue management scheme. There is only a single parameter, the marking threshold K, as opposed to two parameters, THmin and THmax, in RED routers.
An arriving packet is marked with the CE codepoint if the queue occupancy for the interface is greater than K upon its arrival.
Marking is based on the instantaneous value of the queue, not the average value as in RED routers.
The DCTCP scheme ensures that sources are quickly notified of the queue overshoot.
The RED marking scheme implemented by most modern switches can be re-purposed for DCTCP. To do so, we set both the low and high thresholds to K and mark based on instantaneous instead of average queue length.
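A minimal sketch of this single-threshold marking, assuming the queue, K, and the buffer limit are counted in packets (real switches typically count bytes or cells):

    # Sketch of DCTCP marking at a switch port: mark CE when the instantaneous
    # queue exceeds K; discard only when the shallow buffer itself is exhausted.
    def enqueue(queue, packet, K, buffer_limit):
        q = len(queue)
        if q > buffer_limit:
            return "discard"
        if q > K:
            packet["ce"] = True      # set the CE codepoint
        queue.append(packet)
        return "accept"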
DCTCP - ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.
For senders that use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See the paper for details of the delayed ACK scheme.
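The slides defer the delayed-ACK state machine to the paper; the sketch below is one plausible reading of that two-state scheme (the class and method names are ours). The intent is that whenever the CE marking of an arriving packet differs from the marking of the previous run, the receiver immediately ACKs the packets seen so far, so the sender can reconstruct the exact run of marked packets even with delayed ACKs:

    # Hedged sketch of a DCTCP receiver using delayed ACKs (one ACK per m packets).
    class DctcpReceiver:
        def __init__(self, m=2):
            self.m = m
            self.prev_ce = False   # state: CE value of the last packet seen
            self.pending = 0       # packets received but not yet acknowledged

        def on_packet(self, ce_marked, send_ack):
            if ce_marked != self.prev_ce and self.pending > 0:
                # Marking changed: flush an ACK for the previous run, with
                # ECN-Echo reflecting that run's marking.
                send_ack(ece=self.prev_ce)
                self.pending = 0
            self.prev_ce = ce_marked
            self.pending += 1
            if self.pending >= self.m:
                send_ack(ece=ce_marked)   # ordinary delayed ACK for this run
                self.pending = 0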
DCTCP - Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:

α = (1 - g) × α + g × F

where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K, and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.
Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
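A compact sketch of this estimator (names are ours; g = 1/16 is an assumed default, not a value given on this slide):

    # EWMA estimate of the fraction of marked packets, updated once per window of data.
    def update_alpha(alpha, marked_pkts, total_pkts, g=1.0 / 16):
        F = marked_pkts / total_pkts if total_pkts else 0.0
        return (1 - g) * alpha + g * F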
[Figure: RED router marking profile - packets are accepted while the average queue size is below THmin, discarded or marked with increasing probability between THmin and THmax, and discarded above THmax (queue axis: 0, THmin, THmax, C). DCTCP switch marking profile - packets are accepted without marking below K and accepted with marking between K and the buffer limit.]

RED Router
Update the value of the average queue size:
avg = (1 - wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa
    with probability Pa: discard or mark packet
    otherwise, with probability 1 - Pa: accept packet
else if (avg > THmax) discard packet

DCTCP Switch
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet

DCTCP Sender
Update α = (1 - g) × α + g × F
Reaction to marked ACK in a new window:
ssthresh = cwnd × (1 - α/2); cwnd = ssthresh

Legacy TCP Sender
Reaction to marked ACK in a new window:
ssthresh = cwnd/2; cwnd = ssthresh
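The two sender reactions above, written as a small Python sketch (our naming; cwnd may be in bytes or segments):

    # Per-window reaction to ECN feedback, DCTCP vs. legacy TCP.
    def dctcp_react(cwnd, alpha, window_had_marks):
        if window_had_marks:
            cwnd = cwnd * (1 - alpha / 2)   # cut in proportion to congestion
        return cwnd

    def legacy_tcp_react(cwnd, window_had_marks):
        if window_had_marks:
            cwnd = cwnd / 2                 # always cut by half
        return cwnd

With alpha near 0 the DCTCP cut is tiny; with alpha = 1 it degenerates to the legacy halving.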
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP, or any congestion control scheme, can do to avoid packet drops.
However, in practice, each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.
DCTCP Performance
The paper has more details on:
o Guidelines for choosing parameters and estimating gain
o Analytical model for the steady-state behavior of DCTCP
o Benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP
o Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (w/ SACK) implementation
D3 TCP
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
"Better Never Than Late: Meeting Deadlines in Datacenter Networks"
ACM SIGCOMM, August 2011
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP: Basic Idea of Deadline Awareness
[Figure: two panels (DCTCP vs. D3 TCP) plotting rate over time for flows f1 and f2 with deadlines d1 and d2.]
Two flows (f1, f2) with different deadlines (d1, d2). The thickness of a flow line represents the rate allocated to it.
DCTCP is not aware of deadlines and treats all flows equally; it can easily cause some flows to miss their deadline. D3 TCP allocates bandwidth to flows based on their deadline; awareness of deadlines can be used in D3 TCP to ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50 KB) and RTTs are minimal (300 μs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
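The slides leave the protocol details to the paper; the sketch below only illustrates the basic per-RTT rate an end host would ask for, namely just enough to finish the remaining bytes by the deadline (our naming; the routers' actual allocation and the handling of non-deadline flows are not shown):

    # Hedged sketch of a D3-style rate request from an end host.
    def requested_rate(bytes_remaining, seconds_to_deadline):
        if seconds_to_deadline <= 0:
            return None                                  # deadline already passed
        return bytes_remaining / seconds_to_deadline     # bytes per second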
D2 TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
"Deadline-Aware Datacenter TCP"
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in & tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP
[Figure: bandwidth requests arriving at a switch are granted FCFS; a request with a far deadline arriving slightly earlier is granted while a request with a near deadline is paused.]
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
D2 TCP's Contributions
Deadline-aware and handles fan-in bursts well
Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
Reactive, decentralized
Does not hinder long-lived (non-deadline) flows
Coexists with TCP → incrementally deployable
No change to switch hardware → deployable today
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components, generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parents-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
[Figure: a root node fans out the user query to parent nodes, which fan out to leaf nodes; the OLDI response is assembled back up the tree within ~250 ms.]
D2 TCP
Deadline-aware and handles fan-in bursts
Key Idea: vary sending rate based on both deadline and extent of congestion
Built on top of DCTCP
Distributed: uses per-flow state at end hosts
Reactive: senders react to congestion
No knowledge of other flows
D2 TCP: Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 - g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as the gamma-correction.
D2 TCP: Adjusting Congestion Window
The congestion window W is adjusted as follows:

W = W × (1 - p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1; the window size then gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
D2 TCP: Basic Formulas
After determining p, we resize the congestion window W as follows:

W = W × (1 - p/2),   f > 0

where

p = α^d

d = deadline imminence factor, d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)
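A compact sketch of the adjustment defined by these formulas (our naming; d is assumed to be precomputed as Tc/D):

    # D2 TCP per-window update: gamma-corrected penalty p = alpha ** d.
    def d2tcp_adjust(W, alpha, d, f):
        # alpha: EWMA of the marked fraction; f: marked fraction in the latest window.
        if f > 0:
            p = alpha ** d          # d > 1 (near deadline) -> smaller p -> back off less
            W = W * (1 - p / 2)
        else:
            W = W + 1               # no marks: grow by one segment, like TCP
        return W

With d = 1 the update reduces to DCTCP's W × (1 - α/2), matching the slide's note that flows without deadlines behave like DCTCP.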
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines:

W := W × (1 - p/2),  with  p = α^d

[Figure: plot of the penalty p versus α for d = 1, d < 1 (far deadline), and d > 1 (near deadline); both axes run from 0 to 1.0.]

Key insight: near-deadline flows back off less, while far-deadline flows back off more.
• d < 1 → p > α for far-deadline flows; p large → shrink window
• d > 1 → p < α for near-deadline flows; p small → retain window
• d = 1 → p = α for long-lived flows; DCTCP behavior
D2 TCP: Computing α
α is calculated by aggregating ECN (like DCTCP): switches mark packets if queue_length > threshold, and the sender computes the fraction of marked packets, averaged over time.
[Figure: switch buffer with marking threshold K and Buffer_limit; packets are accepted without marking below K and accepted with marking between K and Buffer_limit.]

Switch
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

Sender
Update once every RTT:
α = (1 - g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data
D2 TCP: Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior.
[Figure: sawtooth waves of the window between W and W/2 over time for the deadline-agnostic behavior (similar to DCTCP), with W → W/2 upon congestion detection; L is the length of one sawtooth and D the time to the deadline; case shown: Tc > L.]
D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d (continued)
[Figure: sawtooth waves between W/2 and W versus time in RTTs for the deadline-agnostic behavior (similar to DCTCP); L is the length of one sawtooth in RTTs and D the time to the deadline; case shown: Tc > L.]

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L - 1) ] × (Tc / L)

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application, and L - 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives

B = Tc × (0.75 W), in bytes

Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d (continued)
[Figure: sawtooth waves between W/2 and W versus time in RTTs for the deadline-agnostic behavior (similar to DCTCP), with Tc > L and deadline D marked.]
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore, we compute d as

d = Tc / D,   with   Tc = B / (0.75 W)   (approximation)
D2 TCP: the deadline imminence factor d
What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure.
[Figure: a single partial sawtooth rising from W/2 toward W over time for the deadline-agnostic behavior (DCTCP), with Tc < L.]
In this case we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc - 1)

Since the value of B is known by the application, the value Tc can be computed. The value d is given by

d = Tc / D
D2 TCP: Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2059
IncastScenario
Packets from many
flows arriving to
the same port at
the same time
Incast Collapse Summary
In other publications the incast scnario
is referred to as the fan-in burst at the
parent node This incast is a key reason
for increased network delay and occurswhen all the children (eg workers at
the leaf level) of a parent node face the
same deadline and are likely to respond
nearly at the same time causing a fan-
in burst at the parent node
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2159
Performance impairments of Shallow-buffered
Switches2 Queue Buildup
When long and short flows traverse the same queue there is a queue
buildup impairment the short flows experience increased latency asthey are in queue behind packets from the large flows Since every
worker in the cluster handles both query traffic and background
traffic (large flows needed to update the data structures on the
workers) this traffic pattern occurs very frequentlyThis indicates that query flows can experience queuing delays
because of long-lived greedy TCP flows Further answering a
request can require multiple iterations which magnifies the impact of
this delay
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once per RTT) as follows:

α = (1 - g) × α + g × F

where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.
Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
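A minimal sketch of the sender-side estimator; the value g = 1/16 is an illustrative choice (the paper gives guidelines for choosing g).

    g = 1.0 / 16            # weight of new samples (illustrative value)
    alpha = 0.0             # running estimate of the marked fraction

    def update_alpha(marked, total):
        # Called once per window of data, roughly once per RTT.
        global alpha
        F = marked / total if total else 0.0
        alpha = (1 - g) * alpha + g * F
        return alpha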
RED Router
(Figure: RED decision regions over the average queue size — accept below THmin; discard or mark with increasing probability between THmin and THmax; discard above THmax, up to the capacity C.)
Update the value of the average queue size:
    avg = (1 - wq) × avg + wq × q
if (avg < THmin): accept packet
else if (THmin ≤ avg ≤ THmax):
    calculate probability Pa;
    with probability Pa, discard or mark packet;
    otherwise (with probability 1 - Pa), accept packet
else if (avg > THmax): discard packet

DCTCP Switch
(Figure: DCTCP decision regions over the instantaneous queue size — accept without marking below K; accept with marking between K and the buffer limit.)
if (q ≤ K): accept packet
else if (K < q ≤ limit): accept and mark packet
else if (q > limit): discard packet

DCTCP Sender
Update α = (1 - g) × α + g × F.
Reaction to a marked ACK in a new window:
    ssthresh = cwnd × (1 - α/2); cwnd = ssthresh

Legacy TCP Sender
Reaction to a marked ACK in a new window:
    ssthresh = cwnd/2; cwnd = ssthresh
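The two sender reactions differ only in the cut factor. A small Python sketch of the window update on receiving ECN feedback for a new window (variable names are illustrative):

    def legacy_tcp_on_ecn(cwnd):
        # Standard TCP: halve the window regardless of how much was marked.
        ssthresh = cwnd / 2
        return ssthresh, ssthresh          # new (cwnd, ssthresh)

    def dctcp_on_ecn(cwnd, alpha):
        # DCTCP: cut in proportion to the estimated extent of congestion.
        ssthresh = cwnd * (1 - alpha / 2)
        return ssthresh, ssthresh

    # Example with cwnd = 100 segments and alpha = 0.2:
    #   legacy TCP -> cwnd = 50
    #   DCTCP      -> cwnd = 90 (mild congestion, mild back-off)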
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP (or any congestion control scheme) can do to avoid packet drops.
However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on the instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and resulting timeouts.
DCTCP Performance
The paper has more details on:
Guidelines for choosing parameters and estimating gain.
An analytical model for the steady-state behavior of DCTCP.
The benchmark traffic and the micro-benchmark experiments used to evaluate DCTCP.
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (w/ SACK) implementation.
D3 TCP
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
ACM SIGCOMM, August 2011
Better Never Than Late: Meeting Deadlines in Datacenter Networks
Microsoft Research
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP Basic Idea of Deadline Awareness
(Figure: rate allocation over time for flows f1 and f2 with deadlines d1 and d2, under DCTCP and under D3 TCP.)
Two flows (f1, f2) with different deadlines (d1, d2). The thickness of a flow line represents the rate allocated to it.
DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines.
D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50KB) and RTTs are minimal (~300 μs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic into the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges.
Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination.
Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
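As an illustration of the rate-request idea, here is a simplified sketch, not the actual D3 TCP protocol; the function names and the allocation policy shown are assumptions. An end host asks for the rate needed to finish before its deadline, and a router grants requests greedily from its residual capacity:

    def requested_rate(bytes_remaining, deadline_remaining):
        # Rate needed to deliver the remaining bytes before the deadline expires.
        return bytes_remaining / deadline_remaining

    def greedy_allocate(requests, capacity):
        # requests: list of (flow_id, desired_rate), processed in arrival order.
        grants = {}
        for flow_id, desired in requests:
            grant = min(desired, capacity)
            grants[flow_id] = grant
            capacity -= grant
        return grants

    # Example with 10 Gb/s of spare capacity and FCFS arrival order:
    # greedy_allocate([("f1", 6e9), ("f2", 7e9)], 10e9) -> {"f1": 6e9, "f2": 4e9}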
D2 TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Deadline-Aware Datacenter TCP
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
does not handle fan-in bursts well,
introduces priority inversion at fan-in bursts (see next slide),
does not co-exist with TCP,
requires custom silicon (i.e., switches).
Priority Inversion in D3 TCP
(Figure: bandwidth requests arriving at a switch that grants requests FCFS; a request with a far deadline is granted while a request with a near deadline is paused.)
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
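The race can be seen with the greedy allocation sketched earlier (hypothetical numbers): a far-deadline request that arrives an instant earlier is served first and starves the near-deadline request behind it.

    requests_fcfs = [("far-deadline", 8), ("near-deadline", 8)]   # Gbps, in arrival order
    capacity = 10                                                 # Gbps of spare capacity
    grants = {}
    for flow, desired in requests_fcfs:
        grants[flow] = min(desired, capacity)
        capacity -= grants[flow]
    print(grants)   # {'far-deadline': 8, 'near-deadline': 2} -> the near-deadline flow is starved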
D2 TCP's Contributions
Deadline-aware and handles fan-in bursts well.
Elegant: uses gamma correction for congestion avoidance (far deadline → back off more; near deadline → back off less). Reactive and decentralized.
Does not hinder long-lived (non-deadline) flows.
Coexists with TCP → incrementally deployable.
No change to switch hardware → deployable today.
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.
Online Data-Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components, generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms, parents-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
(Figure: a root node fanning out to parent nodes and then to leaf nodes; a user query produces an OLDI response in ~250 ms.)
D2 TCP
Deadline-aware and handles fan-in bursts.
Key idea: vary the sending rate based on both the deadline and the extent of congestion.
Built on top of DCTCP. Distributed: uses per-flow state at end hosts.
Reactive: senders react to congestion, with no knowledge of other flows.
D2 TCP Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 - g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as gamma correction.
D2 TCP Adjusting Congestion Window
The congestion window W is adjusted as follows:

W = W × (1 - p/2)    if f > 0 (case of packets marked)
W = W + 1            if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For α between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
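Putting the penalty and the window rule together, a minimal Python sketch of one D2 TCP adjustment step (variable names are illustrative; α and d are computed as described on the surrounding slides):

    def d2tcp_adjust(cwnd, alpha, d, f):
        # f is the fraction of packets marked in the latest window of data.
        if f == 0:
            return cwnd + 1            # no congestion: additive increase
        p = alpha ** d                 # gamma-correction penalty
        return cwnd * (1 - p / 2)      # proportional, deadline-aware back-off

    # With alpha = 0.5: a far-deadline flow (d = 0.5) gets p ~= 0.71 and backs
    # off more; a near-deadline flow (d = 2) gets p = 0.25 and backs off less.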
D2 TCP Basic Formulas
After determining p, we resize the congestion window W as follows:

W = W × (1 - p/2)    if f > 0,  where p = α^d

and d is the deadline imminence factor:

d = Tc / D

Tc = flow completion time achieved with the current sending rate.
D = the time remaining until the deadline expires.
d < 1 for far-deadline flows; d > 1 for near-deadline flows.
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
(Figure: the penalty p = α^d plotted against α, with curves for d < 1 (far deadline), d = 1, and d > 1 (near deadline).)
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
W := W × (1 - p/2), with p = α^d
• d < 1 → p > α for far-deadline flows; p is large → shrink window
• d > 1 → p < α for near-deadline flows; p is small → retain window
• d = 1 → p = α for long-lived flows → DCTCP behavior
D2 TCP Computing α
α is calculated by aggregating ECN marks (as in DCTCP): switches mark packets if queue_length > threshold, and the sender computes the fraction of marked packets, averaged over time.
(Figure: switch buffer — accept without marking below K; accept with marking between K and Buffer_limit.)
Switch:
if (q ≤ K): accept packet without marking
else if (K < q ≤ Buffer_limit): accept and mark packet
else if (q > Buffer_limit): discard packet
Sender (update once every RTT):
α = (1 - g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data
D2 TCP Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior.
(Figure: sawtooth congestion-window waves between W/2 and W, similar to DCTCP; W → W/2 upon congestion detection; the flow completes at time Tc, the deadline is at D, and each sawtooth lasts L RTTs; here Tc > L.)
D = the time remaining until the deadline expires.
W = flow's current window size.
B = bytes remaining to fully transmit the message.
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
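A quick numeric check of the approximation, with an illustrative window size:

    W = 16                                  # segments at the top of the sawtooth
    L = W // 2 + 1                          # RTTs per wave, since L - 1 = W/2
    per_wave = sum(range(W // 2, W + 1))    # 8 + 9 + ... + 16 = 108 segments
    assert per_wave / L == 0.75 * W         # 108 / 9 = 12 = 0.75 * 16 segments per RTT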
D2 TCP Computing the deadline imminence factor d
(Figure: sawtooth waves for deadline-agnostic behavior, similar to DCTCP; window between W/2 and W; time in RTTs; Tc > L; deadline at D.)
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,  with Tc = B / (0.75 × W)  (using the approximation).
D2 TCP the deadline imminence factor d
What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure, and we have
(Figure: a partial sawtooth wave for deadline-agnostic (DCTCP-like) behavior; window between W/2 and W; the flow completes at Tc < L.)

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

Since the value of B is known by the application, the value Tc can be computed. The value d is then given by d = Tc / D.
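Combining the two cases, a minimal sketch of how a sender could compute d once per adjustment; this is a sketch under the 0.75 × W approximation stated above, with illustrative names (B, W, and D come from the application and connection state).

    def deadline_imminence(B, W, D, rtt, has_deadline=True):
        # B: bytes left to send, W: current window in bytes,
        # D: time until the deadline expires, rtt: round-trip time.
        if not has_deadline:
            return 1.0                       # long flows without deadlines: DCTCP behavior
        Tc = (B / (0.75 * W)) * rtt          # completion time under the sawtooth approximation
        return Tc / D                        # d > 1: tight deadline, d < 1: loose deadline

    # The result feeds p = alpha ** d and the window rule W = W * (1 - p/2) shown earlier.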
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2259
Performance impairments of Shallow-buffered
Switches3 Buffer Pressure
Given the mix of long and short flows in a data center it is very
common for short flows on one port to be impacted by activity on
other ports The loss rate of short flows in this traffic pattern depends
on the number of long flows traversing other ports
The long greedy TCP flows build up queues on their interfaces
Since the switch is shallow-buffered and the buffer space is a sharedresource the queue build up reduces the amount of buffer space
available to absorb bursts of traffic from the PartitionAggregate
traffic This impairment is called buffer pressure The result is packet
loss and timeouts as in incast but without requiring synchronizedflows
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2359
Buffer
Pressure
Short flows on oneport and long flows
on another port
Incast
Scenario
Multiple shortflows on the same
port
Queue
Buildup
Short and longflows on the same
port
Flow Interactions in Shallow-buffered Switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Challenges

Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.

Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.

Most flows are very short (<50 KB) and RTTs are minimal (about 300 µs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea

D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic into the network.

D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.

D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.

Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
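To make the greedy, first-come-first-served flavor of this allocation concrete, here is a toy Python sketch. It assumes (an assumption of this sketch, not stated on the slide) that each sender requests a rate equal to its remaining bytes divided by the time left until its deadline, and that a router grants requests in arrival order out of a fixed link capacity; all names are illustrative.

# Toy sketch of deadline-driven, FCFS rate allocation (illustrative only).

def requested_rate(bytes_remaining: float, time_to_deadline: float) -> float:
    """Rate (bytes/sec) a sender would request to just meet its deadline."""
    return bytes_remaining / max(time_to_deadline, 1e-9)

def grant_rates_fcfs(requests, link_capacity):
    """Grant rates in arrival order until the link capacity is exhausted.
    requests: list of (flow_id, requested_rate) in arrival order.
    Returns {flow_id: granted_rate}."""
    remaining = link_capacity
    grants = {}
    for flow_id, rate in requests:
        granted = min(rate, remaining)
        grants[flow_id] = granted
        remaining -= granted
    return grants

# Example: a far-deadline request that arrives first can squeeze a
# near-deadline request that arrives just after it (priority inversion).
reqs = [("far_deadline", 8e6), ("near_deadline", 6e6)]   # requested bytes/sec
print(grant_rates_fcfs(reqs, link_capacity=10e6))
# -> {'far_deadline': 8000000.0, 'near_deadline': 2000000.0}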
D2 TCP
Deadline-Aware Datacenter TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.

D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
• does not handle fan-in bursts well
• introduces priority inversion at fan-in bursts (see next slide)
• does not co-exist with TCP
• requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP

[Figure: bandwidth requests arriving at a switch, which grants requests FCFS; a request with a far deadline arrives slightly before a request with a near deadline, so the far-deadline request is granted while the near-deadline request is paused.]

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24%-33% of requests.
D2 TCP's Contributions

• Deadline-aware and handles fan-in bursts well
• Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
• Reactive, decentralized
• Does not hinder long-lived (non-deadline) flows
• Coexists with TCP → incrementally deployable
• No change to switch hardware → deployable today
• D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively
OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.

Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications

OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.

[Figure: a root node fans out to parent nodes, each of which fans out to many leaf nodes; a user query enters at the root and the OLDI response returns in ~250 ms.]

Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
D2 TCP

• Deadline-aware and handles fan-in bursts
• Key idea: vary the sending rate based on both the deadline and the extent of congestion
• Built on top of DCTCP
• Distributed: uses per-flow state at end hosts
• Reactive: senders react to congestion
• No knowledge of other flows
D2 TCP Gamma Correction

Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:

α = (1 - g) × α + g × f

where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.

We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:

p = α^d

Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as gamma-correction.
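As a concrete illustration of these two formulas, here is a minimal Python sketch of the per-RTT sender-side computation; the function and variable names are illustrative, not from the paper.

# Minimal sketch of D2 TCP's congestion estimate and gamma-correction penalty.

def update_alpha(alpha: float, f: float, g: float) -> float:
    """EWMA of the fraction of CE-marked packets, updated once per RTT.
    f is the fraction of packets marked in the latest window (0 <= f <= 1);
    g is the weight given to new samples (0 < g < 1)."""
    return (1.0 - g) * alpha + g * f

def penalty(alpha: float, d: float) -> float:
    """Gamma-correction penalty p = alpha ** d.
    d is the deadline imminence factor: d > 1 for near deadlines,
    d < 1 for far deadlines, d = 1 for flows without deadlines."""
    return alpha ** d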
D2 TCP Adjusting Congestion Window

The congestion window W is adjusted as follows:

W = W × (1 - p/2),   if f > 0 (case of packets marked)
W = W + 1,           if f = 0 (case of no packets marked)

• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of heavy congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.

Note: larger p ⇒ smaller window.
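Continuing the sketch above, the window-update rule can be written as follows (again an illustration under the slide's assumptions, not the paper's implementation):

def adjust_window(W: float, alpha: float, d: float, f: float) -> float:
    """Resize the congestion window once per RTT, D2 TCP style.
    W: current window (in segments); f: fraction of marked packets in the
    last window of data; alpha and d as defined above."""
    if f > 0:                        # congestion observed
        p = alpha ** d               # gamma-correction penalty
        return W * (1.0 - p / 2.0)   # near-deadline (d > 1): small p, mild backoff
    return W + 1.0                   # no marks: additive increase, like TCP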
D2 TCP Basic Formulas

After determining p, we resize the congestion window W as follows:

W = W × (1 - p/2),   f > 0

where

p = α^d
d = deadline imminence factor, d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires

d < 1 for far-deadline flows; d > 1 for near-deadline flows;
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP).
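To see how the deadline skews the backoff, consider a small worked example (the values are chosen only for illustration). Suppose α = 0.5 after the latest RTT:
• Far-deadline flow, d = 0.5: p = 0.5^0.5 ≈ 0.71, so W ← W × (1 - 0.71/2) ≈ 0.65 W (backs off more).
• No deadline, d = 1: p = 0.5, so W ← 0.75 W (exactly DCTCP's behavior).
• Near-deadline flow, d = 2: p = 0.5^2 = 0.25, so W ← W × (1 - 0.125) ≈ 0.88 W (retains most of its window).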
Gamma Correction Function

Gamma correction elegantly combines congestion and deadlines:

p = α^d,   W := W × (1 - p/2)

[Figure: plot of the penalty p = α^d versus α (both between 0 and 1) for d < 1 (far deadline), d = 1, and d > 1 (near deadline).]

Key insight: near-deadline flows back off less, while far-deadline flows back off more.
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
D2 TCP Computing α

α is calculated by aggregating ECN marks (like DCTCP):
• Switches mark packets if queue_length > threshold.
• The sender computes the fraction of marked packets, averaged over time.

[Figure: switch buffer with marking threshold K and Buffer_limit; packets are accepted without marking below K and accepted with marking between K and Buffer_limit.]

Switch (per arriving packet, based on the instantaneous queue length q):
    if (q <= K)
        accept packet without marking
    else if (K < q <= Buffer_limit)
        accept and mark packet
    else if (q > Buffer_limit)
        discard packet

Sender (update once every RTT):
    α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data.
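The same marking and aggregation logic, as a small illustrative Python sketch (parameter names such as buffer_limit mirror the slide and are not an actual switch API):

# Sketch of the switch-side marking decision and the sender-side fraction f.

def switch_action(q: int, K: int, buffer_limit: int) -> str:
    """Decide what to do with an arriving packet, given the instantaneous
    queue length q (in packets). K and buffer_limit are switch parameters."""
    if q <= K:
        return "accept"
    if q <= buffer_limit:
        return "accept_and_mark"      # sets the CE codepoint
    return "drop"

def marked_fraction(acks_with_ecn_echo: int, acks_total: int) -> float:
    """Fraction f of packets marked in the latest window of data,
    as echoed back by the receiver."""
    return acks_with_ecn_echo / acks_total if acks_total else 0.0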
D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc needed to complete transmitting the message (flow), D2 TCP assumes a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection (i.e., as if p = 1), after which the window grows again.

[Figure: sawtooth of the window between W/2 and W over time (in RTTs), with wave length L and the deadline D marked; the case shown is Tc > L.]

D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d (continued)

[Figure: sawtooth waves of the window between W/2 and W, each of length L RTTs, for the case Tc > L.]

For the case Tc > L, the bytes transmitted over Tc are

B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L - 1) ] × (Tc / L)

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message. Since the value of B is known by the application, and L - 1 = W/2 for the sawtooth pattern, the value Tc can be computed.

An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives

B ≈ 0.75 × W × Tc (with W in bytes), i.e., Tc ≈ B / (0.75 × W).

Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d (continued)

[Figure: the same sawtooth (W/2 to W, wave length L, case Tc > L) with the deadline D marked on the time axis.]

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc ≅ D), then d = 1 is appropriate.

It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,   with Tc ≈ B / (0.75 × W) (approximation).
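Putting the formulas together, a minimal sketch of the approximate computation of d, assuming B and W are expressed in bytes and D in RTTs (all names are illustrative):

def deadline_imminence_factor(bytes_remaining: float,
                              window_bytes: float,
                              time_to_deadline_rtts: float) -> float:
    """Approximate d = Tc / D using the 0.75*W average-window approximation.
    A flow with no deadline (or a non-positive D) falls back to d = 1.0,
    i.e., DCTCP-like behavior; this fallback is a simplification of the sketch."""
    if time_to_deadline_rtts is None or time_to_deadline_rtts <= 0:
        return 1.0
    tc = bytes_remaining / (0.75 * window_bytes)   # completion time in RTTs
    return tc / time_to_deadline_rtts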
D2 TCP the deadline imminence factor d: What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc - 1)

[Figure: a partial sawtooth starting at W/2 and ending before the window reaches W, for the case Tc < L.]

Since the value of B is known by the application, the value Tc can be computed. The value d is then given by

d = Tc / D.
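For completeness, here is a small sketch that computes Tc exactly from the sawtooth model by stepping RTT by RTT, which covers both the Tc ≥ L and the Tc < L cases (B and W are in segments here; the function name and example numbers are illustrative only):

def completion_time_rtts(segments_remaining: int, window: int) -> int:
    """Exact Tc (in RTTs) under the pessimistic sawtooth model: the window
    starts at W/2, grows by one segment per RTT, and is halved back to W/2
    every time it reaches W."""
    w = window // 2
    sent = 0
    rtts = 0
    while sent < segments_remaining:
        sent += w
        rtts += 1
        w = window // 2 if w >= window else w + 1   # wrap the sawtooth at W
    return rtts

# Example: 120 segments left, window of 16 segments.
# print(completion_time_rtts(120, 16))   # -> 11 RTTs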
D2 TCP Summary

D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively, while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2459
C o n g e s t i o n
w i n d o w
10
5
15
20
0
Round-trip times
Slow
start
Congestionavoidance
Time-out
Legacy TCP Congestion Control
983155983155983135983156983144983154983141983155983144 983101983089983094
983139983159983150983140 983101983090983088
983155983155983135983156983144983154983141983155983144 983101983089983088
Segment loss
Segment loss
FastRetransmit
Fast Retransmission ssthresh = cwnd2 = cwnd(1-05) cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2559
The Need for a Data Center TCP
The data center environment is significantly
different from wide area networks
o round trip times (RTTs) can be less than 250 ms in absence ofqueuing
o Applications need extremely high bandwidths and very low
latencies
o little statistical multiplexing a single flow can dominate a
particular path
o The network is largely homogeneous and under a single
administrative controlo Traffic flowing in switches is mostly internal Connectivity to the
external Internet is typically managed through load balancers and
application proxies that effectively separate internal traffic from
external
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
Switch (buffer with marking threshold K and Buffer_limit):

if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit)
    discard packet

Sender (update once every RTT):

α = (1 - g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data
α is calculated by aggregating ECN marks (like DCTCP):
• Switches mark packets if queue_length > threshold.
• The sender computes the fraction of marked packets, averaged over time.
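The two halves above can be sketched as follows (a minimal Python illustration under the slide's assumptions; the function names and the value of g are illustrative, not from the paper):

def switch_enqueue(q, K, buffer_limit):
    # Decide the fate of an arriving packet from the instantaneous queue length q.
    if q <= K:
        return ("accept", False)      # accept without marking
    if q <= buffer_limit:
        return ("accept", True)       # accept and set the CE codepoint
    return ("drop", False)            # buffer exhausted

def update_alpha(alpha, f, g=1.0 / 16):
    # Once per RTT: EWMA of f, the fraction of packets marked in the last window;
    # g is the weight given to new samples (0 < g < 1), 1/16 here only as an example.
    return (1 - g) * alpha + g * f

For example, update_alpha(0.2, 0.5) with g = 1/16 yields 0.21875.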
D2 TCP Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP assumes a sawtooth deadline-agnostic congestion behavior.
(Figure: sawtooth waves for the deadline-agnostic behavior (similar to DCTCP); the window oscillates between W/2 and W, dropping W → W/2 upon congestion detection; L RTTs per wave; case Tc > L shown, with the deadline D on the time axis.)
D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior; we want Tc ≤ D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
B = (Tc / L) × [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ]   for Tc ≥ L

(Figure: sawtooth waves for the deadline-agnostic behavior (similar to DCTCP); window between W/2 and W, time axis in RTTs; case Tc > L, deadline D marked.)
Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed (here Tc is an integer multiple of L). Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message. An alternative reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W. This gives

Tc = B / (0.75 W),   with W in bytes
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
(Figure: the same sawtooth pattern as on the previous slide; window between W/2 and W, time in RTTs, case Tc > L, deadline D marked.)
It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,   with Tc = B / (0.75 W) under the approximation

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc ≅ D), then d = 1 is appropriate.
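Under the 0.75 W approximation this is a one-liner (illustrative Python; the names are assumptions, and both Tc and D are expressed in RTTs with W in the same byte units as B):

def deadline_imminence(B, W, D):
    # Tc = B / (0.75 W): estimated RTTs to finish under deadline-agnostic behavior
    tc = B / (0.75 * W)
    # d = Tc / D: d > 1 flags a tight (near) deadline, d < 1 a loose (far) one
    return tc / D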
D2 TCP the deadline imminence factor d
What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have:
(Figure: a partial sawtooth wave for the deadline-agnostic behavior (DCTCP); the window grows from W/2 but the flow finishes within a single wave; case Tc < L.)
B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)
Since the value of B is known by the application, the value Tc can be computed. The value d is then given by

d = Tc / D
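For completeness, the exact sawtooth accounting can be sketched as follows (illustrative Python; it simply walks the sawtooth one RTT at a time, so it covers both the Tc < L and the Tc ≥ L cases; units are segments and RTTs, and all names are assumptions):

def estimate_tc_exact(B, W):
    # B = segments remaining, W = current window in segments (assumed W >= 2).
    sent, rtts, cwnd = 0, 0, max(1, W // 2)
    while sent < B:
        sent += cwnd                                      # one RTT of transmission
        rtts += 1
        cwnd = max(1, W // 2) if cwnd >= W else cwnd + 1  # grow by 1, wrap at the top of the sawtooth
    return rtts                                           # Tc in RTTs

def deadline_imminence_exact(B, W, D):
    return estimate_tc_exact(B, W) / D                    # d = Tc / D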
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.
D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2659
The Need for a Data Center TCP (continued)
Data center applications generate a diverse mix of short and long
flows The measurements by the authors reveal that 9991 of
traffic in the data center is TCP traffic The traffic consists of query
traffic (2KB to 20KB in size) delay sensitive short messages(100KB to 1MB) and throughput sensitive long flows (1MB to
100MB) These applications require three things from the data
center network
o low latency for short flows
o high burst tolerance
o high utilization for long flows
Because of the impairments of shallow buffered commodityswitches legacy TCP protocols fall short of satisfying the above
requirements
See paper for details of workload
characterization in cloud data centers
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2759
The Data Center TCP (DCTCP) Algorithm
The goal of DCTCP is to achieve high burst tolerance low latency
and high throughput with commodity shallow buffered switches
DCTCP uses the concept of ECN (Explicit Congestion Notification)
DCTCP achieves these goals primarily by reacting to congestion in
proportion to the extent of congestion
DCTCP uses a simple marking scheme at switches that sets the
Congestion Experienced (CE) codepoint of packets as soon as thebuffer occupancy exceeds a fixed small threshold
The DCTCP source reacts by reducing the window by a factor that
depends on the fraction of marked packets the larger the fraction the
bigger the decrease factor This is different from standard TCP whichcuts its window size by a factor of 2 when it receives ECN
notification
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized
small flows hit the same queue, is the most difficult to handle. If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst, then there
is not much that DCTCP, or any congestion control scheme, can do to
avoid packet drops.
However, in practice each flow has several packets to transmit and
their windows build up over multiple RTTs. It is often bursts in
subsequent RTTs that lead to drops. Because DCTCP starts marking
early (and aggressively, based on instantaneous queue length),
DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow-up bursts. This prevents buffer
overflows and resulting timeouts.
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a full
implementation of DCTCP and a state-of-the-art TCP New
Reno (w/ SACK) implementation.
D3 TCP
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
ACM SIGCOMM, August 2011
Better Never Than Late: Meeting Deadlines in Datacenter Networks
Microsoft Research
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion, thereby reducing queuing delays and congestive
packet drops, and hence also retransmits. DCTCP has been found
to reduce the 99th percentile of the network latency by 29%.
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP: Basic Idea of Deadline Awareness

[Figure: two flows (f1, f2) with deadlines (d1, d2) plotted as rate versus time, once under DCTCP and once under D3 TCP]
Two flows (f1, f2) with different deadlines (d1, d2). The
thickness of a flow line represents the rate allocated to it.
DCTCP is not aware of deadlines and treats all flows equally;
DCTCP can easily cause some flows to miss their deadline.
D3 TCP allocates bandwidth to flows based on their
deadline. Awareness of deadlines can be used in D3 TCP to
ensure they are met.
Deadlines are associated with flows, not packets. All packets of a
flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online
services like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline). Further,
datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50KB) and RTTs are minimal
(~300 microsec). Consequently, reaction time-scales are short, and
centralized heavy-weight (complex) mechanisms to reserve
bandwidth for flows are impractical.
Challenges
D3 TCP explores the feasibility of exploiting deadline
information to control the rate at which end hosts introduce
traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges. Each application knows the deadline for a message and the size
of the message, and passes this information to the transport layer
in the request to send. End hosts use the deadline information to
request rates from routers along the data path to the destination.
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of
flows meet their deadlines.
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
D2 TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows
may miss their deadlines with DCTCP. Our results show
DCTCP with 25% missed deadlines at high fan-in & tight
deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network. While D3
TCP improves upon DCTCP, it has significant performance and
practical shortcomings. Specifically, D3 TCP:
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
[Figure: bandwidth requests arriving at a switch that grants requests FCFS; a request with a far deadline arriving just ahead of a request with a near deadline is granted first, while the near-deadline request is paused]
Priority Inversion in D3 TCP
D3 TCP's greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests. Due to this
race condition, D3 TCP causes frequent priority inversions which
contribute to missed deadlines. Our results show that D3 TCP inverts
the priority of 24%-33% of requests.
Deadline-aware and handles fan-in bursts well
Elegant: uses gamma-correction for congestion
avoidance (far-deadline → back off more;
near-deadline → back off less)
Reactive, decentralized
Does not hinder long-lived (non-deadline) flows
Coexists with TCP → incrementally deployable
No change to switch hardware → deployable today
D2 TCP achieves 75% and 50% fewer
missed deadlines than DCTCP and D3
D2 TCP's Contributions
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300
ms latency). OLDI applications can be found in the growing high-
revenue online services such as Web search, online retail, and
advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable
by the user and her friends, a cascade of friend event notifications, a chat
application listing friends currently on-line, and advertisements. This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some of
their content, potentially sacrificing responsiveness if some subsystems are
delayed. Alternatively, it must present what it has at the deadline, sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms;
  parents-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
[Figure: partition-aggregate tree: a root fans out to parent nodes, which fan out to leaf servers; the user query enters at the root and the OLDI response is returned in ~250 ms]
Deadline-aware and handles fan-in bursts
Key Idea: vary sending rate based on both deadline
and extent of congestion
Built on top of DCTCP
Distributed: uses per-flow state at end hosts
Reactive: senders react to congestion
No knowledge of other flows
D2 TCP
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively
measures the extent of congestion:
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest
window of data and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies
a closer deadline. Based on α and d, we compute p, the penalty
function applied to the window size, as follows:
D2 TCP Gamma Correction
p = α^d
Note that α, being a fraction, is ≤ 1 and therefore p ≤ 1. The above
function is known in computer graphics as the gamma-correction.
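A small numeric sketch (Python; the values of α and d below are assumptions chosen only to show the trend) of how the gamma-corrected penalty behaves:

    alpha = 0.5                    # assumed congestion estimate
    for d in (0.5, 1.0, 2.0):      # far deadline, no deadline, near deadline
        p = alpha ** d             # gamma-corrected penalty
        print(f"d = {d}: p = {p:.2f}")
    # Prints p ≈ 0.71, 0.50, 0.25: far-deadline flows see a larger penalty
    # (back off more), near-deadline flows a smaller one (back off less).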
The congestion window W is adjusted as follows:
W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (i.e., no CE-marked packets, indicating absence of
congestion), the window size is grown by one segment, similar to
TCP.
• When all packets are CE-marked (case of congestion), α = 1 and
therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
D2 TCP Adjusting Congestion Window
Note: larger p ⇒ smaller window
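A minimal sketch (Python; the function name and arguments are illustrative assumptions) tying the penalty and the window adjustment together:

    def adjust_window(cwnd, alpha, d, f):
        # f is the fraction of packets CE-marked in the latest window
        if f > 0:
            p = alpha ** d                 # gamma-corrected penalty
            return cwnd * (1 - p / 2)      # back off in proportion to p
        return cwnd + 1                    # no marks: grow by one segment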
After determining p, we resize the congestion window W as follows:
W = W × (1 − p/2)   if f > 0
where d = deadline imminence factor:
d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows, d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = α^d
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.

[Figure: the gamma-correction function p = α^d plotted against α (both axes from 0 to 1.0) for d = 1, d < 1 (far deadline), and d > 1 (near deadline)]

Key insight: near-deadline flows back off less while far-deadline flows back off more.

W := W × (1 − p/2), with p = α^d
• d < 1 → p > α for far-deadline flows:
  p large → shrink window
• d > 1 → p < α for near-deadline flows:
  p small → retain window
• d = 1 → p = α for long-lived flows:
  DCTCP behavior
D2 TCP: Computing α
[Figure: switch buffer with marking threshold K and Buffer_limit: accept without marking below K; accept with marking between K and Buffer_limit]

Switch:
if (q ≤ K) accept packet without marking
else if (K < q ≤ Buffer_limit) accept and mark packet
else if (q > Buffer_limit) discard packet

Sender: update once every RTT:
α = (1 - g) × α + g × f, where f is the fraction of packets that were
marked in the latest window of data
D2 TCP: Computing α
α is calculated by aggregating ECN (like DCTCP):
switches mark packets if queue_length > threshold;
the sender computes the fraction of marked packets,
averaged over time.
D2 TCP Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a
message and passes this information to the transport layer
in the request to send.
To estimate the time Tc to complete transmitting the message (flow),
D2 TCP uses a sawtooth deadline-agnostic congestion behavior.

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), case Tc > L; the window oscillates between W/2 and W (W → W/2 upon congestion detection), with L RTTs per sawtooth and the deadline D marked on the time axis]

D = the time remaining until the deadline expires
W = flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d
B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L)

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), case Tc > L; the window oscillates between W/2 and W, the time axis is in RTTs, and the deadline D is marked]
Since the value of B is known by the application, and L − 1 = W/2 for the
sawtooth pattern, the value Tc can be computed. (Note that Tc / L is the
number of sawtooth waves needed to complete transmitting the message.)
An alternative reasonable approximation is to assume that the average
window size over the duration of Tc is 0.75 W (i.e., Tc is an integer
multiple of L). This gives
B = (0.75) W × Tc   in bytes
Analysis continued on the next slide.
D2 TCP: Computing the deadline imminence factor d
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP), case Tc > L; the window oscillates between W/2 and W, time is in RTTs, and the deadline D is marked]
It also follows that if Tc > D then we should set d > 1 to indicate a tight
deadline, and vice versa. Therefore we compute d as
d = Tc / D
Tc is the time needed for a flow to complete transmitting all its data
under the deadline-agnostic behavior, and D is the time remaining until
its deadline expires. If the flow can just meet its deadline under the
deadline-agnostic congestion behavior (i.e., Tc ≅ D), then d = 1 is
appropriate.
Tc = B / (0.75 W)   (approximation)
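For concreteness, a small worked sketch (Python; all numbers are assumed, purely illustrative) of computing d from the approximation above:

    B = 36000.0               # bytes remaining to transmit (assumed)
    W = 12000.0               # current window in bytes per RTT (assumed)
    D = 8.0                   # RTTs remaining until the deadline (assumed)
    Tc = B / (0.75 * W)       # ≈ 4 RTTs under the sawtooth approximation
    d = Tc / D                # ≈ 0.5 < 1: a far-deadline flow, so back off more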
D2 TCP the deadline imminence factor d
What if Tc < L?
In this case the partial sawtooth pattern is as shown
in the figure, and we have
[Figure: partial sawtooth for the case Tc < L; the window grows from W/2 without completing a full sawtooth before the flow finishes]

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)
Since the value of B is known by the application, the value Tc can
be computed. The value d is then given by
d = Tc / D
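Because the sum above is quadratic in Tc, Tc can be solved for directly; a small sketch (Python; the helper name is an assumption):

    import math

    def tc_partial_sawtooth(B, W):
        # B = W/2 + (W/2 + 1) + ... + (W/2 + Tc - 1)
        #   = Tc*W/2 + Tc*(Tc - 1)/2, so solve Tc^2 + (W - 1)*Tc - 2*B = 0
        a, b, c = 1.0, W - 1.0, -2.0 * B
        return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)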
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner. When congestion occurs, far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all. With such deadline-aware congestion management,
not only can the number of missed deadlines be reduced, but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN, which is true of today's
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2859
DCTCP- Simple Marking at the Switch
DCTCP employs a simple active queue management scheme There
is only a single parameter the marking threshold K as opposed to
two parameters THmin and THmax in RED routers
An arriving packet is marked with the CE codepoint if the queueoccupancy for the interface is greater than K upon itrsquos arrival
Marking is based on the instantaneous value of the queue not the
average value as in RED routers
The DCTCP scheme ensures that sources are quickly notified of the
queue overshoot
The RED marking scheme implemented by most modern switches
can be re-purposed for DCTCP To do so we set both the low andhigh thresholds to K and mark based on instantaneous instead of
average queue length
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 2959
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3059
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packetsuntil it receives confirmation from the sender (through the CWR flag) that the
congestion notification has been received The DCTCP receiver however tries to
accurately convey the exact sequence of marked packets back to the sender This is
done by setting the ECN-Echo flag if and only if the packet has a marked CEcodepoint For each marked packet there is only a single ECN-Echo ACK
For senders that use delayed ACKs (one cumulative ACK for every m
consecutively received packets) the DCTCP receiver uses a state-machine with twostates to determine whether to set the ECN Echo bit See paper for details of the
delayed ACK scheme
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
DCTCP- ECN Echo at the Receiver
RFC 3168 states that a receiver sets the ECN-Echo flag in a series of ACK packets until it receives confirmation from the sender (through the CWR flag) that the congestion notification has been received. The DCTCP receiver, however, tries to accurately convey the exact sequence of marked packets back to the sender. This is done by setting the ECN-Echo flag if and only if the packet has a marked CE codepoint. For each marked packet there is only a single ECN-Echo ACK.
For senders that use delayed ACKs (one cumulative ACK for every m consecutively received packets), the DCTCP receiver uses a state machine with two states to determine whether to set the ECN-Echo bit. See the paper for details of the delayed ACK scheme.
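A rough sketch of the two-state idea is given below. This is an illustrative reconstruction of the behavior described in the DCTCP paper, not code from these slides; the helper send_ack() and the variable names are hypothetical.

# Rough sketch of a two-state ECN-Echo decision for a delayed-ACK receiver
# (illustrative reconstruction; send_ack(seq, ece) is a hypothetical helper).

m = 2                 # delayed-ACK factor: one cumulative ACK per m packets
last_ce = False       # state: CE codepoint of the most recently received packet
pending = 0           # packets received since the last ACK was sent

def on_data_packet(seq, ce_marked, send_ack):
    """Handle one in-order data packet whose CE codepoint is ce_marked."""
    global last_ce, pending
    if ce_marked != last_ce and pending > 0:
        # CE state changed: immediately ACK the packets seen so far, echoing the
        # old state, so the sender learns the exact runs of marked/unmarked packets.
        send_ack(seq - 1, ece=last_ce)
        pending = 0
    last_ce = ce_marked
    pending += 1
    if pending >= m:
        send_ack(seq, ece=last_ce)
        pending = 0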
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked, called α, which is updated once for every window of data (roughly once every RTT) as follows:
α = (1 - g) × α + g × F
where F is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples against the past in the estimation of α. Given that the sender receives marks for every packet when the queue length is higher than K, and does not receive any marks when the queue length is below K, the above equation implies that α estimates the probability that the queue size is greater than K. The higher the value of α, the higher the level of congestion.
Notice that the above equation uses the exponentially weighted average formula used in many applications, e.g., estimating the average queue size in RED routers, estimating the RTO in a TCP connection, and flow traffic prediction in online multihoming smart routing.
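As a quick numeric illustration (values assumed for the example): with g = 1/16, a sender whose current α is 0.2 and that sees F = 0.5 of its packets marked in the latest window updates α to (15/16)(0.2) + (1/16)(0.5) ≈ 0.219. Repeated heavily marked windows pull α toward 1, while unmarked windows decay it toward 0.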
RED Router
Figure: RED router buffer occupancy from 0 to capacity C; packets are accepted below THmin, discarded or marked with increasing probability between THmin and THmax, and discarded beyond THmax.
Update the value of the average queue size:
avg = (1 - wq) × avg + wq × q
if (avg < THmin) accept packet
else if (THmin ≤ avg ≤ THmax)
    calculate probability Pa
    with probability Pa: discard or mark packet
    otherwise, with probability 1 - Pa: accept packet
else if (avg > THmax) discard packet
DCTCP Switch
Figure: DCTCP switch buffer; packets are accepted without marking up to K, accepted with marking between K and the buffer limit, and discarded beyond the limit.
if (q ≤ K) accept packet
else if (K < q ≤ limit) accept and mark packet
else if (q > limit) discard packet
DCTCP Sender
Update α = (1 - g) × α + g × F
Reaction to marked ACK in a new window:
ssthresh = cwnd × (1 - α/2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new window:
ssthresh = cwnd/2
cwnd = ssthresh
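The difference between the two sender reactions above can be captured in a few lines; the Python below is an illustrative sketch only, with cwnd expressed in segments.

# Illustrative contrast of the per-window reaction to ECN marks (cwnd in segments).

def legacy_tcp_reaction(cwnd):
    """Legacy TCP: any congestion indication in a window halves the window."""
    return cwnd / 2.0

def dctcp_reaction(cwnd, alpha):
    """DCTCP: cut the window in proportion to the estimated extent of congestion.
    alpha near 0 -> barely shrink; alpha = 1 -> halve, like legacy TCP."""
    return cwnd * (1 - alpha / 2.0)

print(legacy_tcp_reaction(100))            # 50.0
print(dctcp_reaction(100, alpha=0.1))      # 95.0: mild congestion, mild backoff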
Benefits of DCTCP
Queue buildup: DCTCP senders start reacting as soon as the queue length on an interface exceeds K. This reduces queuing delays on congested switch ports, which minimizes the impact of long flows on the completion time of small flows. The availability of more buffer space mitigates costly packet losses that can lead to timeouts.
Buffer pressure: a congested port's queue length does not grow exceedingly large. Therefore, in shared-memory switches, a few congested ports will not exhaust the buffer resources for flows passing through other ports.
Benefits of DCTCP (continued)
Incast: the incast scenario, where a large number of synchronized small flows hit the same queue, is the most difficult to handle. If the number of small flows is so high that even 1 packet from each flow is sufficient to overwhelm the buffer on a synchronized burst, then there isn't much DCTCP (or any congestion control scheme) can do to avoid packet drops.
However, in practice each flow has several packets to transmit, and their windows build up over multiple RTTs. It is often bursts in subsequent RTTs that lead to drops. Because DCTCP starts marking early (and aggressively, based on instantaneous queue length), DCTCP sources receive enough marks during the first one or two RTTs to tame the size of follow-up bursts. This prevents buffer overflows and the resulting timeouts.
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a full implementation of DCTCP and a state-of-the-art TCP New Reno (with SACK) implementation
D3 TCP
Better Never Than Late: Meeting Deadlines in Datacenter Networks
C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron
Microsoft Research
ACM SIGCOMM, August 2011
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by gracefully throttling flows in proportion to the extent of congestion, thereby reducing queuing delays and congestive packet drops, and hence also retransmits. DCTCP has been found to reduce the 99th percentile of the network latency by 29%.
Unfortunately, DCTCP is a deadline-agnostic protocol that equally throttles all flows, irrespective of whether their deadlines are near or far.
Rule: a flow is useful if and only if it satisfies its deadline.
D3 TCP Basic Idea of Deadline Awareness
Figure: two flows (f1, f2) with different deadlines (d1, d2), rate vs. time, under DCTCP and under D3 TCP; the thickness of a flow line represents the rate allocated to it.
DCTCP is not aware of deadlines and treats all flows equally; DCTCP can easily cause some flows to miss their deadlines.
D3 TCP allocates bandwidth to flows based on their deadlines. Awareness of deadlines can be used in D3 TCP to ensure they are met.
Challenges
Deadlines are associated with flows, not packets. All packets of a flow need to arrive before the deadline.
Deadlines for flows can vary significantly. For example, online services like Bing and Google include flows with a continuum of deadlines (including some that do not have a deadline). Further, datacenters host multiple services with diverse traffic patterns.
Most flows are very short (<50 KB) and RTTs are minimal (about 300 µs). Consequently, reaction time-scales are short, and centralized, heavyweight (complex) mechanisms to reserve bandwidth for flows are impractical.
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges. Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination. Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses. A rough sketch of the greedy allocation idea follows below.
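The slides defer the protocol details to the paper, but the greedy, FCFS flavor of the allocation can be illustrated with a toy sketch. The code below is only a rough illustration under stated assumptions (each request asks for remaining_bytes / time_to_deadline, and the router grants requests in arrival order from the remaining capacity); it is not the actual D3 TCP algorithm.

# Toy sketch of greedy FCFS rate allocation at a router (illustrative only).
# Assumption: each flow requests rate = bytes_left / time_to_deadline, and the
# router grants requests in arrival order out of the remaining link capacity.

def allocate_fcfs(requests, capacity):
    """requests: list of (flow_id, bytes_left, time_to_deadline) in arrival order.
    Returns a dict {flow_id: granted_rate}."""
    grants, remaining = {}, capacity
    for flow_id, bytes_left, t_deadline in requests:
        desired = bytes_left / t_deadline       # rate needed to just meet the deadline
        granted = min(desired, remaining)       # greedy: FCFS, give whatever is left
        grants[flow_id] = granted
        remaining -= granted
    return grants

# A far-deadline request that arrives first can starve a later near-deadline one;
# this is the priority-inversion risk discussed on a later slide.
print(allocate_fcfs([("far", 8e6, 4.0), ("near", 8e6, 1.0)], capacity=5e6))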
D2 TCP
Deadline-Aware Datacenter TCP
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
 does not handle fan-in bursts well
 introduces priority inversion at fan-in bursts (see next slide)
 does not co-exist with TCP
 requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP
Figure: bandwidth requests arriving at a switch, with near-deadline and far-deadline requests; the switch grants requests FCFS, so a near-deadline request can be paused while a far-deadline request is granted.
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24% to 33% of requests.
D2 TCP's Contributions
 Deadline-aware and handles fan-in bursts well
 Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
 Reactive, decentralized
 Does not hinder long-lived (non-deadline) flows
 Coexists with TCP → incrementally deployable
 No change to switch hardware → deployable today
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.
Figure: partition-aggregate tree with a root, parent aggregators, and leaf servers; a user query fans out from the root and the OLDI response returns within ≈250 ms.
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms, parent-to-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
D2 TCP
Deadline-aware and handles fan-in bursts.
Key idea: vary the sending rate based on both the deadline and the extent of congestion.
Built on top of DCTCP. Distributed: uses per-flow state at end hosts.
Reactive: senders react to congestion, with no knowledge of other flows.
D2 TCP Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:
α = (1 - g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1, and therefore p ≤ 1. The above function is known in computer graphics as gamma correction.
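A quick numeric illustration (values chosen for the example): with α = 0.25, a far-deadline flow with d = 0.5 gets p = 0.25^0.5 = 0.5 and so backs off more; a flow with d = 1 gets p = 0.25 (DCTCP behavior); and a near-deadline flow with d = 2 gets p = 0.25^2 ≈ 0.06 and so largely retains its window.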
D2 TCP Adjusting Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 - p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, so the window size gets halved, similar to TCP.
• For α between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
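Putting the pieces together, the per-window sender logic can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the window is expressed in segments and d is assumed to be already computed.

# Illustrative sketch of the D2 TCP per-window adjustment (window in segments).

def d2tcp_window_update(W, alpha, d, f):
    """W: current window; alpha: EWMA of the marked fraction; d: deadline
    imminence factor (d > 1 near deadline, d < 1 far, d = 1 no deadline);
    f: fraction of packets marked in the latest window."""
    if f > 0:
        p = alpha ** d             # gamma correction: smaller penalty for larger d
        return W * (1 - p / 2.0)   # back off in proportion to the penalty
    return W + 1                   # no marks: additive increase, as in TCP

# Same congestion level (alpha = 0.5), different deadlines:
print(d2tcp_window_update(100, alpha=0.5, d=0.5, f=0.3))   # far deadline: ~64.6
print(d2tcp_window_update(100, alpha=0.5, d=2.0, f=0.3))   # near deadline: 87.5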
D2 TCP Basic Formulas
p = α^d
After determining p, we resize the congestion window W as follows:
W = W × (1 - p/2)   if f > 0
where d = deadline imminence factor:
d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines.
Figure: the gamma-correction function p = α^d, plotted as p (0 to 1.0) versus α (0 to 1.0), with one curve each for far-deadline flows (d < 1), deadline-agnostic flows (d = 1), and near-deadline flows (d > 1).
W := W × (1 - p/2),   where p = α^d
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3159
DCTCP- Control at the Sender
The sender maintains an estimate of the fraction of packets that are marked
called α which is updated once for every window of data (roughly once
every one RTT) as follows
αααα= (1 - g)
timestimestimestimes αααα+ g
timestimestimestimesF
where F is the fraction of packets that were marked in the latest window
of data and 0 lt g lt 1 is the weight given to new samples against the past in
the estimation of α Given that the sender receives marks for every packet
when the queue length is higher than K and does not receive any markswhen the queue length is below K the above equation implies that α
estimates the probability that the queue size is greater than K The higher the
value of α the higher the level of congestion
Notice that the above equation uses the exponentially weighted average
formula used in many applications eg estimating the average queue size
in RED routers estimating RTO in a TCP connection and flow traffic
prediction in online multihoming smart routing
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3259
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3359
RED Router
983108983145983155983139983137983154983140983108983145983155983139983137983154983140 983151983154 983117983137983154983147 983159983145983156983144
983145983150983139983154983141983137983155983145983150983143 983152983154983151983138983137983138983145983148983145983156983161
983105983139983139983141983152983156
983088 983124983112983149983145983150 983124983112983149983137983160 983107
RED Router
Update the value of the average queue size
avg = (1- wq ) times avg + wq times q
if (avg lt THmin) accept packet
else if (THmin le avg le THmax)
calculate probability Pa
with probability Pa
discard or mark packet
otherwise with probability 1 ndash Paaccept packet
else if (avg gt THmax) discard packet
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
983148983145983149983145983156
DCTCP Switch
DCTCP Switch
if (q le K) accept packet
else if ( K lt q le limit )
accept and mark packet
else if ( q gt limit) discard packet
DCTCP Sender
Update α = (1 - g) times α + g times F Reaction to marked ACK in a new window
ssthresh = cwnd times (1- α 2)
cwnd = ssthresh
Legacy TCP Sender
Reaction to marked ACK in a new windowssthresh = cwnd2 cwnd = ssthresh
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
Counting the bytes sent over the complete sawtooth waves (case Tc > L):
B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × Tc/L
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); window between W/2 and W, time axis in RTTs, deadline D marked; case Tc > L.]
Note that Tc/L is the number of sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives
Tc = B / (0.75 W)   (B and W in bytes)
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); window between W/2 and W, time axis in RTTs, deadline D marked; case Tc > L.]
It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as
d = Tc / D
where Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc ≅ D), then d = 1 is appropriate.
Tc = B / (0.75 W)   (approximation)
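Putting the pieces together, the following sketch (not the authors' implementation; the function names and units are assumptions, with W and B in bytes and D in RTTs) shows how a sender could combine α and d to resize its window:

# Sketch: deadline-aware window resize following the formulas above.
def deadline_imminence(W, B, D):
    Tc = B / (0.75 * W)               # approx. completion time in RTTs (average window = 0.75 W)
    return Tc / D if D > 0 else 1.0   # d = 1 for flows without a deadline

def resize_on_congestion(W, alpha, d):
    p = alpha ** d                    # gamma-corrected penalty
    return W * (1 - p / 2)            # near deadline (d > 1) -> small p -> mild backoff

# Example: 60 KB remaining, deadline 10 RTTs away, W = 8 KB, alpha = 0.5
d = deadline_imminence(W=8000, B=60000, D=10)        # Tc = 10 RTTs -> d = 1.0
print(resize_on_congestion(W=8000, alpha=0.5, d=d))  # 6000.0 bytes, same as DCTCP here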
D2 TCP the deadline imminence factor d
What if Tc < L?
In this case the partial sawtooth pattern is as shown in the figure, and we have
[Figure: partial sawtooth wave for deadline-agnostic behavior (DCTCP); the window grows from W/2 but the flow finishes after Tc < L RTTs, before reaching W.]
B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)
Since the value of B is known by the application, the value Tc can be computed. The value d is given by
d = Tc / D
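For completeness, a small sketch (an assumed helper, not from the slides) of computing Tc exactly by accumulating the per-RTT sawtooth terms above; it covers both the Tc < L and Tc > L cases:

# Sketch (assumed helper): exact completion time Tc, in RTTs, under the
# deadline-agnostic sawtooth: the window starts at W/2, grows by 1 per RTT
# up to W, then halves again; W and B must be in the same units.
def completion_time(B, W):
    sent, rtts, w = 0, 0, W // 2
    while sent < B:
        sent += w                          # amount sent in this RTT
        rtts += 1
        w = W // 2 if w >= W else w + 1    # halve at the peak, otherwise grow by 1
    return rtts

print(completion_time(B=30, W=8))  # per-RTT sends 4,5,6,7,8 -> 30 in 5 RTTs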
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.
D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3459
Benefits of DCTCP
Queue buildup DCTCP senders start reacting as soon as the
queue length on an interface exceeds K This reduces queuing
delays on congested switch ports which minimizes the impact
of long flows on the completion time of small flows Theavailability of more buffer space mitigates costly packet losses
that can lead to timeouts
Buffer pressure a congested portrsquos queue length does not
grow exceedingly large Therefore in shared memory
switches a few congested ports will not exhaust the buffer
resources for flows passing through other ports
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3559
Benefits of DCTCP (continued)
Incast the incast scenario where a large number of synchronized
small flows hit the same queue is the most difficult to handle If the
number of small flows is so high that even 1 packet from each flow is
sufficient to overwhelm the buffer on a synchronized burst then thereisnrsquot much DCTCP or any congestion control scheme can do to
avoid packet drops
However in practice each flow has several packets to transmit and
their windows build up over multiple RTTs It is often bursts in
subsequent RTTs that lead to drops Because DCTCP starts marking
early (and aggressively based on instantaneous queue length)DCTCP sources receive enough marks during the first one or two
RTTs to tame the size of follow up bursts This prevents buffer
overflows and resulting timeouts
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3659
DCTCP Performance
The paper has more details on
Guidelines for choosing parameters and estimating gain
Analytical model for the steady state behavior of DCTCP
Benchmark traffic and the micro-benchmark experiments
used to evaluate DCTCP
Results of the performance comparisons between a fullimplementation of DCTCP and a state-of-the-art TCP New
Reno (w SACK) implementation
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
Basic Design Idea
D3 TCP explores the feasibility of exploiting deadline information to control the rate at which end hosts introduce traffic in the network.
D3 TCP uses a Deadline-Driven Delivery control protocol that addresses the aforementioned challenges.
Each application knows the deadline for a message and the size of the message, and passes this information to the transport layer in the request to send. End hosts use the deadline information to request rates from routers along the data path to the destination (see the sketch below). Routers allocate sending rates to flows to greedily satisfy as many deadlines as possible.
D3 TCP tries to ensure that the largest possible fraction of flows meet their deadlines.
Details of the D3 TCP scheme can be found in the paper posted on Webcourses.
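To make the rate-request idea concrete, the sketch below shows the kind of deadline-driven rate an end host might compute and ask the on-path routers for. It is a rough illustration under assumed conventions; the function name, the base-rate fallback for flows without deadlines, and the units are assumptions, not the exact D3 TCP algorithm from the paper.

    from typing import Optional

    def requested_rate(bytes_remaining: float,
                       time_to_deadline: Optional[float],
                       base_rate: float = 125_000.0) -> float:
        # Sending rate (bytes/s) an end host could request from on-path routers.
        # Deadline flows ask for just enough bandwidth to finish on time; flows
        # without a deadline fall back to an assumed base rate.
        if time_to_deadline is None or time_to_deadline <= 0:
            return base_rate
        return bytes_remaining / time_to_deadline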
D2 TCP
"Deadline-Aware Datacenter TCP"
B. Vamanan, J. Hasan, T. Vijaykumar
Purdue University & Google Inc.
ACM SIGCOMM, August 2012
Pros and Cons of DCTCP and D3 TCP
Results reported in the D3 TCP paper show that as much as 7% of flows may miss their deadlines with DCTCP. Our results show DCTCP with 25% missed deadlines at high fan-in and tight deadlines.
D3 TCP tackles missed deadlines by pioneering the idea of incorporating deadline awareness into the network. While D3 TCP improves upon DCTCP, it has significant performance and practical shortcomings. Specifically, D3 TCP:
• does not handle fan-in bursts well
• introduces priority inversion at fan-in bursts (see next slide)
• does not co-exist with TCP
• requires custom silicon (i.e., switches)
Priority Inversion in D3 TCP
[Figure: bandwidth requests with near and far deadlines arriving at a switch; the switch grants requests FCFS, so a near-deadline request can be paused while a far-deadline request is granted]
D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24-33% of requests.
D2 TCP's Contributions
• Deadline-aware and handles fan-in bursts well
• Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
• Reactive and decentralized
• Does not hinder long-lived (non-deadline) flows
• Coexists with TCP → incrementally deployable
• No change to switch hardware → deployable today
D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3 TCP, respectively.
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.
Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components generated by independent subsystems and "mixed" together to provide a rich presentation.
The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications
Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern
• Tree-like structure
• Deadline budget split: total = 300 ms; parent-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.
[Figure: a user query enters at the root, fans out through parent nodes to leaf servers, and the aggregated OLDI response returns in ~250 ms]
D2 TCP
• Deadline-aware and handles fan-in bursts
• Key idea: vary sending rate based on both deadline and extent of congestion
• Built on top of DCTCP
• Distributed: uses per-flow state at end hosts
• Reactive: senders react to congestion
• No knowledge of other flows
D2 TCP Gamma Correction
Like DCTCP, D2 TCP maintains a weighted average α that quantitatively measures the extent of congestion:
α = (1 − g) × α + g × f
where f is the fraction of packets that were marked in the latest window of data, and 0 < g < 1 is the weight given to new samples.
We now define d as the deadline imminence factor. A larger d implies a closer deadline. Based on α and d, we compute p, the penalty function applied to the window size, as follows:
p = α^d
Note that α, being a fraction, is ≤ 1 and therefore p ≤ 1. The above function is known in computer graphics as gamma-correction.
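As a concrete rendering of the two formulas above, the short Python sketch below updates α and computes the gamma-correction penalty p; the function names and the default weight g = 1/16 are illustrative assumptions, not prescribed by the slide.

    def update_alpha(alpha: float, f: float, g: float = 1.0 / 16) -> float:
        # EWMA of the fraction f of ECN-marked packets (0 < g < 1).
        return (1.0 - g) * alpha + g * f

    def gamma_penalty(alpha: float, d: float) -> float:
        # p = alpha ** d: d > 1 for near-deadline flows, d < 1 for far-deadline
        # flows, d = 1 for flows without deadlines (then p = alpha, as in DCTCP).
        return alpha ** d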
D2 TCP Adjusting Congestion Window
The congestion window W is adjusted as follows:
W = W × (1 − p/2)   if f > 0 (case of packets marked)
W = W + 1           if f = 0 (case of no packets marked)
• When f is zero (i.e., no CE-marked packets, indicating absence of congestion), the window size is grown by one segment, similar to TCP.
• When all packets are CE-marked (case of congestion), α = 1 and therefore p = 1, and the window size gets halved, similar to TCP.
• For p between 0 and 1, the window size is modulated by p.
Note: larger p ⇒ smaller window.
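A minimal sketch of this per-window update follows (window measured in segments; the function name and the one-segment floor are assumptions added for illustration):

    def adjust_window(W: float, p: float, f: float) -> float:
        # Resize the congestion window once per window of acknowledged data.
        # f is the fraction of marked packets, p the gamma-correction penalty.
        if f > 0:
            W = W * (1.0 - p / 2.0)   # deadline-aware, proportional back-off
        else:
            W = W + 1.0               # additive increase, as in TCP
        return max(W, 1.0)            # assumed floor of one segment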
D2 TCP Basic Formulas
After determining p, we resize the congestion window W as follows:
W = W × (1 − p/2)   if f > 0
p = α^d
where d = deadline imminence factor:
d = Tc / D
Tc = flow completion time achieved with the current sending rate
D = the time remaining until the deadline expires
d < 1 for far-deadline flows; d > 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (i.e., in this case D2 TCP behaves like DCTCP)
Gamma Correction Function
Gamma correction elegantly combines congestion and deadlines:
p = α^d
W := W × (1 − p/2)
[Figure: p = α^d plotted against α for d = 1, d < 1 (far deadline), and d > 1 (near deadline)]
• d < 1 → p > α for far-deadline flows: p large → shrink window
• d > 1 → p < α for near-deadline flows: p small → retain window
• d = 1 → p = α for long-lived flows: DCTCP behavior
Key insight: near-deadline flows back off less, while far-deadline flows back off more.
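For example, assuming the marking history yields α = 0.5: a far-deadline flow with d = 0.5 gets p = 0.5^0.5 ≈ 0.71 and multiplies its window by 1 − p/2 ≈ 0.65 (roughly a 35% back-off); a near-deadline flow with d = 2 gets p = 0.25 and multiplies its window by about 0.875 (only a 12.5% back-off); and a flow with no deadline (d = 1) backs off by α/2 = 25%, exactly as DCTCP would.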
D2 TCP Computing α
α is calculated by aggregating ECN feedback (as in DCTCP):
• Switches mark packets if queue_length > threshold K
• The sender computes the fraction of marked packets, averaged over time
Switch (buffer with marking threshold K and Buffer_limit):
if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit)
    discard packet
Sender (update once every RTT):
α = (1 − g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data
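The switch-side marking rule and the per-RTT sender update can be written compactly as below; the names and the default g = 1/16 are illustrative, and the snippet is a sketch of the logic on this slide rather than a switch or kernel implementation.

    def switch_action(q: int, K: int, buffer_limit: int) -> str:
        # ECN decision for an arriving packet, given instantaneous queue length q.
        if q <= K:
            return "accept"            # below threshold: no marking
        elif q <= buffer_limit:
            return "accept_and_mark"   # congestion signalled via CE mark
        else:
            return "drop"              # buffer exhausted

    def sender_update(alpha: float, marked: int, sent: int,
                      g: float = 1.0 / 16) -> float:
        # Once per RTT: fold the observed marking fraction f into alpha.
        f = marked / sent if sent else 0.0
        return (1.0 - g) * alpha + g * f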
D2 TCP Computing the deadline imminence factor d
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.
To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection (i.e., p = 1).
[Figure: sawtooth of the window between W/2 and W over time, with a sawtooth period of L RTTs and the deadline D marked; here Tc > L]
D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.
Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d
[Figure: sawtooth of the window between W/2 and W, period L RTTs, with Tc > L and the deadline D marked on the time axis (time in RTTs)]
For Tc ≥ L, the bytes transmitted per sawtooth wave are
W/2 + (W/2 + 1) + (W/2 + 2) + ... + (W/2 + L − 1),
and Tc / L is the number of sawtooth waves needed to complete transmitting the message.
Since the value of B is known by the application, and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., Tc is an integer multiple of L). This gives
B = (0.75 W) × Tc, i.e., Tc = B / (0.75 W), with W and B in bytes.
Analysis continued on the next slide.
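Under the 0.75 W approximation, estimating Tc and d takes one line each. The sketch below assumes W and B are in bytes and the remaining time to the deadline is expressed in RTTs; the names and the guard for an already-expired deadline are illustrative assumptions.

    def estimate_d(bytes_remaining: float, window_bytes: float,
                   time_to_deadline_rtts: float) -> float:
        # d = Tc / D, with Tc approximated as B / (0.75 W).
        if time_to_deadline_rtts <= 0:
            return float("inf")        # deadline already passed: maximally urgent
        tc = bytes_remaining / (0.75 * window_bytes)
        return tc / time_to_deadline_rtts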
D2 TCP Computing the deadline imminence factor d
[Figure: sawtooth of the window between W/2 and W, period L RTTs, Tc > L, deadline D on the time axis (time in RTTs)]
Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
It also follows that if Tc > D, then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore, we compute d as:
d = Tc / D
with Tc = B / (0.75 W) under the approximation above.
D2 TCP Computing the deadline imminence factor d
What if Tc < L?
In this case, the partial sawtooth pattern is as shown in the figure, and we have:
B = W/2 + (W/2 + 1) + (W/2 + 2) + ... + (W/2 + Tc − 1)
[Figure: partial sawtooth for the deadline-agnostic (DCTCP-like) behavior, with Tc < L]
Since the value of B is known by the application, the value Tc can be computed. The value of d is then given by:
d = Tc / D
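When the approximation is not used, Tc can be obtained directly from the sawtooth model by stepping the window one RTT at a time until B worth of segments have been sent; this covers both the Tc ≥ L and Tc < L cases. The sketch below works in units of segments (MSS) and uses illustrative names.

    def sawtooth_tc(B: int, W: int) -> int:
        # RTTs needed to send B segments under the pessimistic sawtooth:
        # the window starts at W/2 (as if congestion was just detected),
        # grows by one segment per RTT up to W, then halves and repeats.
        cwnd, sent, rtts = max(W // 2, 1), 0, 0
        while sent < B:
            sent += cwnd
            rtts += 1
            cwnd = max(W // 2, 1) if cwnd >= W else cwnd + 1
        return rtts

    def imminence_factor(B: int, W: int, D: float) -> float:
        # d = Tc / D, with Tc from the sawtooth model and D in RTTs.
        return sawtooth_tc(B, W) / D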
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner: when congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.
D2 TCP requires no changes to the switch hardware and only requires that the switches support ECN, which is true of today's datacenter switches.
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3759
D3 TCP
983107 983127983145983148983155983151983150 983112 983106983137983148983148983137983150983145 983124 983115983137983154983137983143983145983137983150983150983145983155 983105 983122983151983159983155983156983154983151983150
ACM SIGCOMM August 2011
Better Never Than LateMeeting Deadlines in Datacenter Networks
Microsoft Research
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3859
Pros and Cons of DCTCP
DCTCP is an elegant proposal that targets the tail-end latency by
gracefully throttling flows in proportion to the extent of
congestion thereby reducing queuing delays and congestivepacket drops and hence also retransmits DCTCP has been found
to reduce the 99th-percentile of the network latency by 29
Unfortunately DCTCP is a deadline-agnostic protocol that
equally throttles all flows irrespective of whether their deadlines
are near or far
Rule a flow is useful if and only if it satisfies its deadline
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 3959
d1 d2
f1
f2
Time
Flow
D3 TCP Basic Idea of Deadline Awareness
d1 d2
f1
f2
Time
FlowD3 TCPDCTCP
Two flows (f1 f2) with different deadlines (d1 d2) The
thickness of a flow line represents the rate allocated to it
DCTCP is not aware of deadlines and treat all flows equally
DCTCP can easily cause some flows to miss their deadline D3 TCP allocates bandwidth to flows based on their
deadline Awareness of deadlines can be used in D3 TCP to
ensure they are met
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4059
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4159
Deadlines are associated with flows not packets All packets of a
flow need to arrive before the deadline
Deadlines for flows can vary significantly For example onlineservices like Bing and Google include flows with a continuum of
deadlines (including some that do not have a deadline) Further
datacenters host multiple services with diverse traffic patterns
Most flows are very short (lt50KB) and RTTs are minimal
(300microsec) Consequently reaction time-scales are short and
centralized heavy weight (complex) mechanisms to reserve
bandwidth for flows are impractical
Challenges
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4259
D3 TCP explores the feasibility of exploiting deadlineinformation to control the rate at which end hosts introduce
traffic in the network
D3 TCP uses a Deadline-Driven Delivery control protocol that
addresses the aforementioned challenges Each application knows the deadline for a message and the size
of the message and pass this information to the transport layer
in the request to send End hosts use the deadline information torequest rates from routers along the data path to the destination
Routers allocate sending rates to flows to greedily satisfy as
many deadlines as possible
D3
TCP tries to ensure that the largest possible fraction offlows meet their deadlines
Basic Design Idea
Details of the D3 TCP scheme can be found in the paper
posted on Webcourses
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4359
D2 TCP
B Vamanan J Hasan T Vijaykumar
Purdue University amp Google Inc
ACM SIGCOMM August 2012
Deadline-Aware Datacenter TCP
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4459
Pros and Cons of DCTCP and D3 TCP
Results reported in D3 TCP show that as much as 7 of flows
may miss their deadlines with DCTCP Our results show
DCTCP with 25 missed deadlines at high fan-in amp tight
deadlines
D3 TCP tackles missed deadlines by pioneering the idea of
incorporating deadline awareness into the network While D3
TCP improves upon DCTCP it has significant performance andpractical shortcomings Specifically D3 TCP
does not handle fan-in bursts well
introduces priority inversion at fan-in bursts (see next slide)
does not co-exist with TCP
requires custom silicon (ie switches)
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4559
983123983159983145983156983139983144
Switch grants
requests FCFS
Bandwidth requests arriving at switch
request paused request granted
Request with near deadline
Request with far deadline
Priority Inversion in D3 TCP
D3 TCP greedy approach may allocate bandwidth to far-deadline
requests arriving slightly ahead of near-deadline requests Due to this
race condition D3 TCP causes frequent priority inversions whichcontribute to missed deadlines Our results in show that D3 TCP inverts
the priority of 24-33 requests
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing α

[Figure: switch buffer with marking threshold K and Buffer_limit; packets are accepted without marking below K and accepted with marking between K and Buffer_limit]

Switch:
if (q ≤ K)
    accept packet without marking
else if (K < q ≤ Buffer_limit)
    accept and mark packet
else if (q > Buffer_limit)
    discard packet

Sender (update once every RTT):
α = (1 - g) × α + g × f, where f is the fraction of packets that were marked in the latest window of data

α is calculated by aggregating ECN (like DCTCP):
• Switches mark packets if queue_length > threshold
• The sender computes the fraction of marked packets, averaged over time
D2 TCP Computing the deadline imminence factor d

As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP): W → W/2 upon congestion detection (as if p = 1).

[Figure: sawtooth window waves oscillating between W/2 and W over time, with sawtooth wavelength L and completion time Tc > L shown against the deadline D]

D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior. We want Tc ≤ D.

Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d
B = (Tc / L) × [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L - 1) ]   for Tc > L

[Figure: sawtooth window waves between W/2 and W, with wavelength L and Tc > L shown against the deadline D; time axis in RTTs]

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

Since the value of B is known by the application and L - 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives

B ≈ 0.75 × W × Tc   (B in bytes)
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,   with Tc ≈ B / (0.75 × W)   (approximation)

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.
D2 TCP: the deadline imminence factor d

What if Tc < L? In this case the partial sawtooth pattern is as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc - 1)

[Figure: a single partial sawtooth wave rising from W/2, ending at Tc < L]

Since the value of B is known by the application, the value Tc can be computed. The value d is then given by

d = Tc / D
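A small sketch of how a sender could evaluate Tc and d from the quantities above, covering both the Tc > L and Tc < L cases by walking the sawtooth directly (a sketch under stated assumptions: windows and B are in segments, W ≥ 2, and the names are illustrative rather than taken from the D2 TCP code):

def estimate_d(B, W, D, rtt):
    # Walk the deadline-agnostic sawtooth: the window grows from W/2 by one
    # segment per RTT up to W, then drops back to W/2, until B segments are sent.
    sent, rtts, w = 0, 0, W // 2
    while sent < B:
        sent += w
        rtts += 1
        w = W // 2 if w >= W else w + 1
    Tc = rtts * rtt          # completion time under the pessimistic sawtooth
    return Tc / D            # deadline imminence factor d

The coarser closed form used on the slides, Tc ≈ B / (0.75 × W) RTTs, avoids the loop at the cost of assuming Tc spans whole sawtooth waves.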
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but also tighter deadlines can be met.

D2 TCP requires no changes to the switch hardware; it only requires that the switches support ECN, which is true of today's datacenter switches.
Priority Inversion in D3 TCP

[Figure: bandwidth requests arriving at a switch that grants requests FCFS; a request with a far deadline is granted while a request with a near deadline is paused]

D3 TCP's greedy approach may allocate bandwidth to far-deadline requests arriving slightly ahead of near-deadline requests. Due to this race condition, D3 TCP causes frequent priority inversions, which contribute to missed deadlines. Our results show that D3 TCP inverts the priority of 24-33% of requests.
D2 TCP's Contributions

• Deadline-aware and handles fan-in bursts well
• Elegant: uses gamma-correction for congestion avoidance (far-deadline → back off more; near-deadline → back off less)
• Reactive, decentralized
• Does not hinder long-lived (non-deadline) flows
• Coexists with TCP → incrementally deployable
• No change to switch hardware → deployable today

D2 TCP achieves 75% and 50% fewer missed deadlines than DCTCP and D3, respectively.
OnLine Data Intensive Applications (OLDI)

OLDI applications operate under soft-real-time constraints (e.g., 300 ms latency). OLDI applications can be found in the growing high-revenue online services such as Web search, online retail, and advertisement.

Example:
A typical Facebook page consists of a timeline-organized "wall" writeable by the user and her friends, a cascade of friend event notifications, a chat application listing friends currently on-line, and advertisements. This Facebook page is made up of many components, generated by independent subsystems and "mixed" together to provide a rich presentation.

The final mixing system must wait for all subsystems to deliver some of their content, potentially sacrificing responsiveness if some subsystems are delayed. Alternatively, it must present what it has at the deadline, sacrificing page quality and wasting resources consumed in creating parts of a page that a user never sees.
OLDI Applications

OLDI applications employ tree-based divide-and-conquer algorithms, where every query operates on data spanning thousands of servers.

[Figure: a root node fanning out to parent nodes and then to leaf nodes; the user query enters at the root and the OLDI response returns in ~250 ms]

Features:
• Deadline bound
• Handle large data
• Partition-aggregate pattern; tree-like structure
• Deadline budget split: total = 300 ms, parents-leaf RPC = 50 ms
• Missed deadlines → incomplete responses
• Affect user experience & revenue
D2 TCP

Deadline-aware and handles fan-in bursts.

Key Idea: Vary the sending rate based on both the deadline and the extent of congestion.
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4659
Deadline-aware and handles fan-in bursts wellElegant uses gamma-correction for congestion
avoidance far-deadline rarr back off morenear-deadline rarr back off less
Reactive decentralized
Does not hinder long-lived (non-deadline) flowsCoexists with TCP rarr incrementally deployableNo change to switch hardware rarr deployable today
D2 TCP achieves 75 and 50 fewer
missed deadlines than DCTCP and D3
D
2
TCPrsquos Contributions
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4759
OnLine Data Intensive Applications (OLDI)
OLDI applications operate under soft-real-time constraints (eg 300
ms latency) OLDI applications cane be found in the growing high-
revenue online services such as Web search online retail and
advertisementExample
A typical Facebook page consists of a timeline-organized ldquowallrdquo writeable
by the user and her friends cascade of friend event notifications a chat
application listing friends currently on-line and advertisements This
Facebook page is made up of many components generated by independent
subsystems and ldquomixedrdquo together to provide rich presentation
The final mixing system must wait for all subsystems to deliver some oftheir content potentially sacrificing responsiveness if some subsystems are
delayed Alternatively it must present what it has at the deadline sacrificing
page quality and wasting resources consumed in creating parts of a page that
a user never sees
O A i i
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4859
Features
bull Deadline bound
bull Handle large data
bull Partition-aggregate patternbull Tree-like structure
bull Deadline budget splittotal = 300 ms
parents-leaf RPC = 50 ms
bull Missed deadlines rarr incomplete responses
bull Affect user experience amp revenue
OLDI Applications
OLDI applications employ tree-based divide-and-conquer
algorithms where every query operates on data spanning thousands
of servers
parent parent
leaf leaf leaf leaf
root
bull bull bull bull bull bull bull bull
bull bull bull bullbull bull bull
User
queryOLDI response
simsimsimsim250 ms
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 4959
Deadline-aware and handles fan-in bursts
Key Idea Vary sending rate based on both deadlineand extent of congestion
Built on top of DCTCP Distributed uses per-flow state at end hosts
Reactive senders react to congestion
No knowledge of other flows
D
2
TCP
D2 TCP G C ti
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
D TCP Computing the deadline imminence factor d
As in D3 TCP The applicationknows the deadline D for a
message and pass this information
to the transport layer in the request
to send
To estimate the time Tc to complete
transmitting the message (flow) D2
TCP uses a sawtooth deadline-agnostic congestion behavior
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W rarrW2 upon congestion detection p=
W
W2
time
D
D = the time remaining until the deadline expires
W= flowrsquos current window size
B = bytes remaining to fully transmit the messageTc = time when the flow finishes transmitting B bytes under the pessimistic
sawtooth transmission behavior We want Tc le D
Analysis continued on the next slide
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5659
= 2 +
2 +1 +
2 +2 +
⋯ +
2 + L-1 Tc L
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
Since the value of B is known by the application and L -1 = W2 for the
sawtooth pattern the value Tc can be computed An alternative reasonable
approximation is to assume that the average window size over the duration ofTc is 075 (983145983141 983124983139 983145983155 983137983150 983145983150983156983141983143983141983154 983149983157983148983156983145983152983148983141 983151983142 983116) 983124983144983145983155 983143983145983158983141983155
Note that Tc L is the number of
sawtooth waves needed to completetransmitting the message
= (075) W in bytes
Analysis continued on the next slide
D TCP Computing the deadline imminence factor d
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5759
Tc
L
Tc gt L
Sawtooth waves for deadline-agnostic
behavior (similar to DCTCP)
W
W2
Time in RTT
D
It also follows that if gt then we should set gt1 to indicate a tight
deadline and vice versa Therefore we compute d as
is the time needed for a flow to
complete transmitting all its data
under the deadline-agnosticbehavior and D is the time
remaining until its deadline
expires If the flow can just meet
its deadline under the deadline-agnostic congestion behavior (ie
cong) then d = 1 is appropriate
= (075) approximation
D TCP Computing the deadline imminence factor d
=
D2 TCP the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5859
D TCP the deadline imminence factor d
What if Tc lt L
In this case the partial
sawtooth pattern is as shown
in the figure In this case wehave
Tc
L
Tc lt L
Sawtooth waves for deadline-agnostic
behavior (DCTCP)
W
W2
time
= 2 +
2 +1 + 2 +2 +
⋯ +
2 + Tc-1
Since the value of B is known by the application the value Tc canbe computed The value d is given by
=
D2 TCP Summary
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5959
D TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware
manner When congestion occurs far-deadline flows back off
aggressively while near-deadline flows back off only a little or
not at all With such deadline-aware congestion managementnot only can the number of missed deadlines be reduced but
also tighter deadlines can be met
D2 TCP requires no changes to the switch hardware and only
requires that the switches support ECN which is true of todayrsquos
datacenter switches
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5059
Like DCTCP D2 TCP maintains a weighted average that quantitatively
measures the extent of congestion
α αα α = (1 - g) times timestimes times α αα α + g times timestimes times f
where f is the fraction of packets that were marked in the latestwindow of data and 0 lt g lt 1 is the weight given to new samples
We now define d as the deadline imminence factor A larger d impliesa closer deadline Based on and d we compute p the penalty
function applied to the window size as follows
D2 TCP Gamma Correction
p = d
Note that being a fraction le 983089 and therefore le 1 The above
function is known in computer graphics as the gamma-correction
D2 TCP Adj ti C ti Wi d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5159
The congestion window W is adjusted as follows
= times (1 minus 2) f gt0 case of packets marked
= + 1 f=0 case of no packets marked
bull When 983142 is zero (ie no CE-marked packets indicating absence of
congestion) the window size is grown by one segment similar toTCP
bull When all packets are CE-marked (case of congestion) 983101983089 and
therefore 983101983089 then the window size gets halved similar to TCP
bull For between 0 and 1 the window size is modulated by
D2 TCP Adjusting Congestion Window
Note Larger p rArrrArrrArrrArr smaller window
D2 TCP Basic Form las
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5259
After determining p we resize the congestion window W as follows
= times (1 minus 2) f gt0
whereWhere d = deadline imminence factor
d = T c D
Tc = flow completion time achieved with the current sending rate D = the time remaining until the deadline expires
d lt 1 for far-deadline flows d gt 1 for near-deadline flows
d = 1 for long flows that do not specify deadlines (ie in this
case D2 TCP behaves like DCTCP)
D2 TCP Basic Formulas
p = d
Gamma Correction Function
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5359
Gamma correction elegantly combinescongestion and deadlines
Gamma Correction Function
983140 983101 983089
983140 983100 983089 (983142983137983154 983140983141983137983140983148983145983150983141)
983140 983102 983089 (983150983141983137983154 983140983141983137983140983148983145983150983141)
Key insight Near-deadline flows back off lesswhile far-deadline flows back off more
983127 983098983101 983127 983082 983080 983089 983085991251
983152 983087 983090 983081
10
983152
10
983142983137983154
983140 983101 983089
983150983141983137983154
983152 983101 α983134983140
bull 983140 983100 983089 rarr 983152 983102 983142983151983154 983142983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983148983137983154983143983141 rarr 983155983144983154983145983150983147 983159983145983150983140983151983159
bull 983140 983102 983089 rarr 983152 983100 983142983151983154 983150983141983137983154983085983140983141983137983140983148983145983150983141 983142983148983151983159983155
983152 983155983149983137983148983148 rarr 983154983141983156983137983145983150 983159983145983150983140983151983159
bull 983140 983101 983089 rarr 983152 983101 983142983151983154 983148983151983150983143 983148983145983158983141983140 983142983148983151983159983155
983108983107983124983107983120 983138983141983144983137983158983145983151983154
D2 TCP Computing αααα
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5459
983105983139983139983141983152983156 983159983145983156983144 983149983137983154983147983145983150983143983105983139983139983141983152983156 983159983145983156983144983151983157983156 983149983137983154983147983145983150983143 983115
Buffer_l 983145983149983145983156
Switch Buffer
Switch
if (q le K)
accept packet without marking
else if ( K lt q le Buffer_limit )
accept and mark packetelse if ( q gt Buffer_limit) discard packet
SenderUpdate once every RTT
α = (1 - g) times α + g times f f is the fraction of packets that were
marked in the latest window of data
D2 TCP Computing αααα
α is calculated by
aggregating ECN (like
DCTCP)
Switches mark packets if
queue_length gt threshold
Sender computes thefraction of marked packets
averaged over time
D2 TCP Computing the deadline imminence factor d
8132019 TCP for Data Centers
httpslidepdfcomreaderfulltcp-for-data-centers 5559
As in D3 TCP, the application knows the deadline D for a message and passes this information to the transport layer in the request to send.

To estimate the time Tc needed to complete transmitting the message (flow), D2 TCP uses a sawtooth, deadline-agnostic congestion behavior (similar to DCTCP), in which W → W/2 upon congestion detection (p = 1).

[Figure: sawtooth waves for deadline-agnostic behavior; the window oscillates between W/2 and W over time; Tc > L; deadline D]

D = the time remaining until the deadline expires
W = the flow's current window size
B = bytes remaining to fully transmit the message
Tc = time when the flow finishes transmitting B bytes under the pessimistic sawtooth transmission behavior; we want Tc ≤ D

Analysis continued on the next slide.
D2 TCP Computing the deadline imminence factor d
B = [ W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + L − 1) ] × (Tc / L),   for Tc ≥ L

[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); window between W/2 and W; time in RTTs; Tc > L]

Note that Tc / L is the number of sawtooth waves needed to complete transmitting the message.

Since the value of B is known by the application and L − 1 = W/2 for the sawtooth pattern, the value Tc can be computed. An alternative, reasonable approximation is to assume that the average window size over the duration of Tc is 0.75 W (i.e., that Tc is an integer multiple of L). This gives

B = (0.75 W) × Tc   (in bytes)

Analysis continued on the next slide.
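For reference, the sum above collapses to the 0.75 W form in one line. The following derivation is a sketch that uses the slide's L − 1 = W/2 relation and assumes W is measured in bytes sent per RTT:

    \begin{align*}
    B &= \frac{T_c}{L}\sum_{i=0}^{L-1}\left(\frac{W}{2} + i\right)
       = \frac{T_c}{L}\left(L\,\frac{W}{2} + \frac{L(L-1)}{2}\right)
       = T_c\left(\frac{W}{2} + \frac{L-1}{2}\right) \\
      &\approx T_c\left(\frac{W}{2} + \frac{W}{4}\right) = 0.75\,W\,T_c
       \qquad\Longrightarrow\qquad T_c \approx \frac{B}{0.75\,W}
    \end{align*}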
D2 TCP Computing the deadline imminence factor d
[Figure: sawtooth waves for deadline-agnostic behavior (similar to DCTCP); window between W/2 and W; time in RTTs; deadline D; Tc > L]

Tc is the time needed for a flow to complete transmitting all its data under the deadline-agnostic behavior, and D is the time remaining until its deadline expires. If the flow can just meet its deadline under the deadline-agnostic congestion behavior (i.e., Tc = D), then d = 1 is appropriate.

It also follows that if Tc > D then we should set d > 1 to indicate a tight deadline, and vice versa. Therefore we compute d as

d = Tc / D,   with   Tc ≈ B / (0.75 W)   (approximation)
D2 TCP the deadline imminence factor d
What if Tc < L?

In this case the partial sawtooth pattern is as shown in the figure, and we have

B = W/2 + (W/2 + 1) + (W/2 + 2) + ⋯ + (W/2 + Tc − 1)

[Figure: partial sawtooth for deadline-agnostic behavior (DCTCP); the window grows from W/2 without completing a full sawtooth; Tc < L]

Since the value of B is known by the application, the value Tc can be computed. The value d is then given by

d = Tc / D
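Putting the two cases together, the deadline imminence factor can be estimated as sketched below. The names, the MSS value, and the threshold test for "at least one full sawtooth" are assumptions for illustration, not code from the D2TCP paper.

    MSS = 1460  # assumed segment size in bytes

    def estimate_tc_rtts(B_bytes, W_bytes):
        """Estimate Tc (in RTTs) under the deadline-agnostic sawtooth behavior."""
        W = W_bytes / MSS               # current window, in segments
        L = W / 2.0                     # RTTs per full sawtooth (slide: L - 1 = W/2)
        B = B_bytes / MSS               # remaining data, in segments
        if B >= 0.75 * W * L:           # Tc >= L: use the B = 0.75 * W * Tc approximation
            return B / (0.75 * W)
        # Tc < L: B = W/2 + (W/2 + 1) + ... + (W/2 + Tc - 1); accumulate RTT by RTT
        tc, sent = 0, 0.0
        while sent < B:
            sent += W / 2.0 + tc
            tc += 1
        return tc

    def deadline_imminence(tc_rtts, D_rtts=None):
        """d = Tc / D; flows without a deadline use d = 1 and fall back to DCTCP behavior."""
        return 1.0 if D_rtts is None else tc_rtts / D_rtts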
D2 TCP Summary
D2 TCP adjusts the congestion window size in a deadline-aware manner. When congestion occurs, far-deadline flows back off aggressively while near-deadline flows back off only a little or not at all. With such deadline-aware congestion management, not only can the number of missed deadlines be reduced, but tighter deadlines can also be met.

D2 TCP requires no changes to the switch hardware; it only requires that switches support ECN, which is true of today's datacenter switches.