Resilient Datacenter Load Balancing in the Wild · CCS CONCEPTS • Networks ... However, the standard multi-path load balancing mechanism ... Resilient Datacenter Load Balancing

Resilient Datacenter Load Balancing in the WildHong Zhang1 Junxue Zhang1 Wei Bai1 Kai Chen1 Mosharaf Chowdhury2

1SING Lab Hong Kong University of Science and Technology2University of Michigan

hzhanganjzhangcswbaiabkaichencseusthk mosharafumicheduABSTRACTProduction datacenters operate under various uncertainties suchas trac dynamics topology asymmetry and failures Thereforedatacenter load balancing schemes must be resilient to these un-certainties ie they should accurately sense path conditions andtimely react to mitigate the fallouts Despite signicant eortsprior solutions have important drawbacks On the one hand solu-tions such as Presto and DRB are oblivious to path conditions andblindly reroute at xed granularity On the other hand solutionssuch as CONGA and CLOVE can sense congestion but they canonly reroute when owlets emerge thus they cannot always reacttimely to uncertainties To make things worse these solutions failto detecthandle failures such as blackholes and random packetdrops which greatly degrades their performance

In this paper we introduce Hermes a datacenter load balancerthat is resilient to the aforementioned uncertainties At its heartHermes leverages comprehensive sensing to detect path conditionsincluding failures unattended before and it reacts using timelyyet cautious rerouting Hermes is a practical edge-based solutionwith no switch modication We have implemented Hermes withcommodity switches and evaluated it through both testbed experi-ments and large-scale simulations Our results show that Hermesachieves comparable performance to CONGA and Presto in normalcases and well handles uncertainties under asymmetries Hermesachieves up to 10 and 20 better ow completion time (FCT) thanCONGA and CLOVE under switch failures it outperforms all otherschemes by over 32

CCS CONCEPTSbull Networks rarr Network architectures End nodes Data cen-ter networks Data path algorithms

KEYWORDSDatacenter fabric Load balancing Distributed

1 INTRODUCTIONModern datacenter networks enable multiple paths between hostpairs and balance trac among them to deliver good performance

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor prot or commercial advantage and that copies bear this notice and the full citationon the rst page Copyrights for components of this work owned by others than ACMmust be honored Abstracting with credit is permitted To copy otherwise or republishto post on servers or to redistribute to lists requires prior specic permission andor afee Request permissions from permissionsacmorgSIGCOMM rsquo17 Los Angeles CA USAcopy 2017 ACM 978-1-4503-4653-51708 $1500DOI 10114530988223098841

to dierent bandwidth- and latency-sensitive datacenter applica-tions [3 4 18] Meanwhile production datacenters operate undera multitude of uncertainties such as congestion asymmetry andfailures [5 10 14 17] Uncertainties arise due to a variety of reasonsthat include among others trac dynamics link cuts device het-erogeneity and switch malfunctions [19] Naturally a datacenterload balancer must adapt to these uncertainties ie they should (1)accurately sense path conditions and (2) appropriately split tracamong parallel paths in reaction to path conditions

However the standard multi-path load balancing mechanismused in todayrsquos datacenters ECMP (Equal Cost Multi-Path) [21] bal-ances trac poorly ECMP randomly stripes ows across availablepaths using ow hashing Because it accounts for neither networkuncertainties nor ow sizes it can waste over 50 of the bisectionbandwidth [4]

As a result many load balancing schemes have been proposed toaddress the problem Despite signicant eorts prior solutions stillhave important drawbacks (see sect2 and sect7 for details) Some of themsuch as DRB [12] and Presto [20] are insensitive or even obliviousto path conditions and blindly split trac at xed (eg packetor owcell) granularity These solutions suer in the presence ofasymmetry because the optimal trac splitting across parallel pathsdepends on trac demands and path conditions [5 14] and schemesthat lack visibility of path conditions are unable to make optimaldecisions Furthermore blindly splitting ows onto dierent pathson a round-robin basis can adversely aect transport protocolsBesides the well-known packet reordering problem we furtherunveil the relatively less understood congestion mismatch problem(sect222)

In contrast other solutions such as CONGA [5] and CLOVE[24] are designed to be congestion-aware Although they can sensecongestion these solutions only reroute when owlets emergeWhile owlet switching helps to minimize packet re-ordering [33]it is inexible As the formation of owlets is decided by manyfactors such as applications and transport protocols owlet-basedsolutions are inherently passive and cannot always timely react tocongestion by rerouting ows when needed We show that this caneasily result in sim50 performance loss (sect222)

Furthermore existing approaches for congestion sensing havekey shortcomings They are either impractical because they requirenon-trivial switch modications [5 25 35] or inecient becausethey provide limited visibility into network congestion [23 24]Moreover none of them can detect switch failures such as packetblackholes and random packet drops which are frequently wit-nessed in production datacenters [19] Our experiments show thatbeing unable to detect these failures may lead to unnished owsand over 100times worse average FCT

SIGCOMM rsquo17 August 21-25 2017 Los Angeles CA USA H Zhang et al

Given the above ineciencies we ask the following question canwe design a resilient load balancing scheme that can gracefully handleall these uncertainties in a practical readily-deployable fashion Inthis paper we present Hermes to answer this question armatively

At its heart Hermes detects path conditions via comprehensivesensing (sect31) It makes use of transport-level signals such as ECNand RTT to measure path congestion while leveraging events suchas retransmissions and timeouts to detect packet blackholes andrandom packet drops caused by malfunctioning switches To furtherimprove visibility Hermes employs active probing ndash guided by thepower of two choices technique [28] ndash that can eectively increasethe scope of sensing at minimal probing cost

Given path conditions how to react to the perceived uncertain-ties is still non-trivial Considering the passiveness of owlets anatural choice is to actively split ows at a ner granularity andalways switching to the best available path instantly However ouranalysis reveals that such vigorous path changing even coupledwith comprehensive sensing can interfere with transport protocolsby causing frequent packet reordering and congestion mismatchproblems

Therefore Hermes handles uncertainties timely yet cautiously(sect32) On the one hand rather than being constrained by a xedor passive granularity Hermes is capable of reacting timely once itsenses congestion or failures On the other hand instead of vigor-ously changing paths Hermes assesses ow status and path condi-tions and makes deliberate rerouting decisions only if they bringperformance gains This enables Hermes to prune unnecessaryreroutings and to reduce packet reordering and congestion mis-match making it transport-friendly

Hermes is a practical edge-based solution with no switch modi-cations We implemented a Hermes prototype and deployed it inour small-scale testbed (sect4) Testbed experiments as well as large-scale simulations with the realistic web-search [6] and data-mining[18] workloads show that (sect5)bull Under symmetric topologies Hermes achieves 10-38 better FCT

than ECMP outperforms CLOVE-ECN by up to 15 for bothworkloads and achieves comparable performance to Presto andCONGA

bull Under asymmetric topologies Hermes outperforms CONGA byup to 10 due to its timely rerouting when sensing congestionwith the data-mining workload Furthermore with better visi-bility via active probing it achieves up to 20 better FCT thansolutions with limited visibility such as CLOVE-ECN and Let-Flow [14]

bull In case of switch failures Hermes can eectively detect the fail-ures and it outperforms all other schemes by more than 32

2 BACKGROUND AND MOTIVATIONIn this section we rst summarize dierent sources of uncertaintiesin datacenters (sect21) and then elaborate the limitations of priorsolutions in terms of both sensing and reacting to uncertainties(sect22)

21 UncertaintiesDatacenters are lled with the following uncertainties

bull Trac dynamics As shown in previous studies [17] trac inproduction datacenters can be highly dynamic Congestion canquickly arise as a few high-rate ows start and it can dissipateas they complete

bull Asymmetries Asymmetries are the norm in datacenters Forexample topology asymmetry can arise as a datacenter evolvesby adding racks and switches over time which may also intro-duce coexistence of heterogenous devices (eg both 10G and40G spine switches) Furthermore it is reported that datacenterscan experience frequent link cuts [17 19] creating asymmetries

bull Switch failures Besides link failures that directly cause topol-ogy asymmetry production datacenters also suer from a varietyof switch failures or malfunctions that traditionally have receivedless attention Notably a recent study of Microsoft production dat-acenters [19] reveals two types of switch failures that adverselyaect network performance (1) packet blackholes packets thatmeet certain lsquopatternsrsquo (eg certain source-destination IP pairsor port numbers) are dropped deterministically (ie 100) bythe switch and (2) silent random packet drops a switch dropspackets silently and randomly at a high rate The root causesof these switch failures include TCAM decits in the switchingASIC switching fabric CRC checksum errors and linecards notbeing well seated among othersLoad balancing trac under these uncertainties is a challenge To

deal with trac dynamics and asymmetries an ideal solution needsto be congestion-aware To handle failures it needs to detect themquickly and accurately Despite continuous eorts in recent yearsprior solutions still have important drawbacks as we illustrate inthe following

22 Drawbacks of Prior Solutions in HandlingUncertainties

We follow Table 1 to discuss the drawbacks of prior solutionsmotivating our design of Hermes

221 Limitations in Sensing UncertaintiesFirst of all solutions such as DRB Presto and ECMP [12 20 21] areeither oblivious to congestion or rely only on local trac conditionsThey perform poorly under asymmetry because the optimal tracsplitting across paths in asymmetric topologies depends primarilyon trac demands and path conditions Schemes without globalcongestion-awareness cannot eectively balance the trac [5 14]

Second existing solutions that are designed to be global congestion-aware [5 23ndash25 35] have drawbacks too They are either impracti-cal because they require non-trivial switch modications [5 25 35]or inecient because they only provide limited visibility into net-work congestion at the end hosts [23 24]

To illustrate this we quantify network visibility as the averagenumber of concurrent ows observed on parallel paths We runa trace-driven simulation based on web-search and data-miningworkloads in a 8times8 leaf spine topology with 10 Gbps links and 128servers We measure the visibility of both ToR switches and endhosts for 2s The results are shown in Table 2 We nd that a sourceToR switch can observe congestion status of several parallel paths toeach destination ToR simultaneously while the same is not the casefor end host pairs Overall current end host based solutions [23 24]

Resilient Datacenter Load Balancing in the Wild SIGCOMM rsquo17 August 21-25 2017 Los Angeles CA USA

SchemesSensing Uncertainties Reacting to Uncertainties

Advanced HardwareCongestion Switch Minimum Switchable Switching Method and FrequencyFailure Unit

ECMP [21]Oblivious Oblivious

Flow Per-ow random hashingNoPresto [20] Flowcell (small xed-sized unit) Per-owcell round robin

DRB [12] Packet Per-packet round robinLetFLow [14] Oblivious Oblivious Flowlet Per-owlet random hashing Yes1

DRILL [16] Local awareness (Switch) Oblivious Packet Per-packet rerouting (according to local congestion) YesCONGA [5]HULA [25] Global awareness (Switch) Oblivious Flowlet Per-owlet rerouting (according to global congestion) Yes

CLOVE-INT [24]FlowBender [23] Global awareness (End host) Oblivious Packet Reactive and random rerouting (when congested) NoCLOVE-ECN [24] Flowlet Per-owlet weighted round robing (according to global congestion)

Hermes Global awareness (End host) Aware Packet Timely yet cautious rerouting (based on global congestion and failure) No

Table 1 Summary of prior work under uncertainties Note that unlike link failures that directly cause topology asymmetryhere switch failure refers to malfunctions such as packet blackholes and silent random packet drops

Workload Data-mining Data-mining Web-search Web-search60 load 80 load 60 load 80 load

Switch pair 1725 2344 4173 5859End host pair 0007 0009 0016 0022

Table 2 The average number of concurrent ows observedon parallel paths between ToR-to-ToR host-to-host pairs

which rely primarily on piggybacked information lack sucientvisibility for making appropriate load balancing decisions

Finally none of the existing schemes detect switch failures suchas packet blackholes and silent random packet drops We note thatsome solutions such as Presto [20] and DRB [12] rely on a cen-tral controller to detect link failures however they do not detectswitch failures Moreover current congestion-aware solutions suchas CONGA [5] estimate congestion level by measuring link uti-lization thus they cannot eectively detect switch failures eitherConsider a switch that experiences silent random packet dropsows traversing this switch tend to have a low sending rate dueto frequent packet drops Then the corresponding paths experi-encing switch failures will have even lower measured congestionlevels compared to other parallel paths As a consequence existingcongestion-aware solutions may shift more trac to these undesir-able paths In our evaluation this causes CONGA to perform worsethan ECMP (sect533)

222 Problems with Reacting to UncertaintiesFlowlet switching cannot timely react to uncertainties Manycongestion-aware solutions [5 24 25] adopt owlet switching Thebenet is that owlets provide a ner-granularity alternative toows for load balancing without causing much packet reorderingHowever because owlets are decided by many factors such asapplications and transport protocols solutions relying on owletswitching are inherently passive and cannot always timely react tocongestion by splitting trac when needed

Figure 1 shows such an example with CONGA [5] We have twosmall ows (A B) and two large ows (C D) from L0 to L1 via paral-lel paths P1 and P2 We use DCTCP [6] as the underlying transportprotocol At the beginning CONGA balances load by placing ow

1LetFlow [14] has been implemented for Ciscorsquos datacenter switch production linehowever currently it is not widely supported by commodity switches For examplemost Broadcom switching chipsets do not support owlet switching

10G10G

10G 10GL0 L1

(a) A simple case of 4 ows

Flow A B finish

Flow C D finish

Flow A B finish

Flow C D finish

Flow C reroute from to

(b) Comparison of schemes

Figure 1 [Example 1] Flowlet switching cannot timely reactto congestion by splitting ows under stable trac pattern

A B on P1 and C D on P2 After ow A and B complete CONGAsenses that P1 is idle However when using a owlet timeout of 100-500micros [5] CONGA cannot identify owlets for rerouting mainlybecause DCTCP is less bursty ndash this is because DCTCP adjusts itscongestion window smoothly and thus it is less likely to generatesucient inactivity gaps that form owlets On the other handa smaller owlet timeout value (eg 50micros [14]) triggers frequentipping causing severe packet reordering in our simulation In theideal scenario appropriate rerouting can almost halve the FCTs ofthe large owsVigorous rerouting is harmful Considering the deciency ofowlets a natural choice is to split ows at a ner granularity andalways switching to the best available path instantly However suchvigorous path changing even coupled with congestion awarenesscan adversely interfere with transport protocols Besides the well-known packet re-ordering problem we further unveil the congestionmismatch problem (dened below) that has previously received littleattention

Basically congestion control algorithms adjust the rate (window)of a ow based on the congestion state of the current path Henceeach rerouting event can cause a mismatch between the sendingrate and the state of the new path With vigorous rerouting withina ow the congestion states of dierent paths are mixed togetherand congestion on one path may be mistakenly used to adjust therate on another path We refer to this phenomenon as congestionmismatch

10GFlow ADCTCP

Flow BUDP9Gbps

(a) Asymmetry due to broken linkL0-S1

0 10 20 30 40 50 60 70 80 90

Through

put (G

ueue Size (KB)

Time (ms)

Queue Size Throughput

(b) Queue length of S0-L2 and throughputof ow A

Figure 2 [Example 2] Congestion mismatch results in se-vere throughput loss and queue oscillation under asymmet-ric topologies with Presto

10G 10G

Flow ADCTCP

(a) A 110G hybrid network

0 10 20 30 40 50 60 70 80 90

Through

put (Gbps)

Time (ms)

(b) Throughput of ow A

Figure 3 [Example 3] Distributing loads according to linkcapacities does not solve the congestion mismatch problem

In the following we show three cases of congestion mismatchand their impacts Although we use DCTCP as the transport proto-col the problems demonstrated are not tied to any specic transportprotocol To mask the throughput loss caused by packet reorderingwe set DupAckThreshold to 500

First we show that for congestion-oblivious load balancingschemes with small granularity (eg Presto [20] and DRB [12])congestion mismatch causes throughput loss and queue oscillationsunder asymmetries Consider a simple 3times2 leaf-spine topology inFigure 2a2 with a broken link from L0 to S1 Flow A is a DCTCPow from L1 to L2 and ow B is a UDP ow from L0 to L2 witha limited rate of 9 Gbps We adopt Presto with equal weights fordierent paths

As shown in Figure 2b ow A only achieves around 1 Gbpsoverall throughput and the queue length of the output port ofspine S0 to leaf L2 experiences large variations Due to vigorousrerouting the congestion feedback (ie ECN) of the upper pathconstrains the congestion window resulting in throughput loss inthe bottom path Furthermore when ow A with a larger windowshifts from the bottom path to the upper path the upper path cannotimmediately absorb such a burst causing queue length oscillations

Second we show that congestion mismatch remains harmfuleven if we distribute trac proportionally to path capacity Toillustrate this consider a heterogenous network with 1 and 10 Gbpspaths shown in Figure 3a We spread owcells using 110 ratio tomatch path capacities and expect both paths to be fully utilized

However as shown in Figure 3b ow A can only achieve an over-all throughput of around 5 Gbps To understand the reason assumethat the rst 10 owcells go through the 10 Gbps path Because the

2We ip leaf L2 and omit the unused paths for clarity

Flow A On 001s Off 0003s

L0 L1 L2

Flow BAlways on

(a) Hidden terminal scenario

0 005 01 015 02 025

Queue Size (KB)

Time (s)

(b) Queue length of S1-L2

Figure 4 [Example 4] The hidden terminal scenario (a)causes owA toip between spine S0 and S1with stale infor-mation (b) causes sudden queue build-up every time whenow A reroutes through S1

path is not congested DCTCP will increase the congestion windowwithout realizing that the subsequent owcell will go through the 1Gbps path With a large congestion window the 11th owcell is sentat a high rate on the 1 Gbps path causing a rapid queue buildup onthe output port of spine S0 to leaf L2 As the queue length exceedsthe ECN marking threshold (ie 32KB for 1 Gbps link) DCTCPwill reduce the window upon receiving ECN-marked ACKs withoutrealizing that such a reduction will aect the following owcellson the 10 Gbps path As a result such a congestion mismatch stillcauses throughput loss and queue oscillations

Third we show that for congestion-aware solutions suboptimalrerouting also leads to severe congestion mismatch Figure 4a showssuch an example with CONGA Flow A starts from L0 to L2 andwe add a 3ms pause every 10ms to create a owlet timeout FlowB keeps sending from L1 to L2 We nd that ow A keeps ippingbetween spine S0 and S1 This is because no matter which pathow A chooses it does not have explicit feedback packets on thealternative path thus it always assumes the alternative path to beempty after an aging period (ie 10ms as suggested in [5]) As aresult ow A will aggressively reroute to the alternative path everytime it observes a owlet

We note that imperfect congestion information is unavoidable ina distributed setting However current congestion-aware solutions[5 25] do not take this into account and their aggressive reroutingmay lead to congestion mismatch As shown in Figure 4b each timeow A reroutes from spine S0 to S1 its sending rate is much higherthan the desired rate at the new path this causes an acute queueincrease at the output port of S1-L2 These periodic spikes in queueoccupancy can lead to high tail latencies for small ows or evenpacket drops when buer-hungry protocols (eg TCP) are usedIn large-scale simulations with production workloads (sect532) weobserve that CONGA performs 30 worse with a smaller owlettimeout of 50micros compared to that of 150micros even after we maskpacket reordering We believe that such performance degradationsare due to congestion mismatch

3 DESIGNThe limitations discussed in sect2 highlight the following propertiesof an ideal load balancing solutionbull Comprehensiveness it should eectively detect congestion and

failures to guide load balancing decisions

End host

Active Probing

Sensing Congestion

Sensing Module (Re)Routing Module

Sensing Failures When amp Where to reroute N

etwork Traffic

(Re)Route

Hypervisor

FeedTrigger

Figure 5 Hermes overview

Flow-level Variablerf Sending rate of a ow

ssent Size sent of a ow used to estimate the remaining sizei ft imeout Set if a ow experiences a timeout

Path-level VariablefECN Fraction of ECN-marked packets of a pathtRTT RTT measurement of a path

nt imeout Number of timeout events of a pathfr etransmission Fraction of retransmission events of a path

rp Aggregate sending rate of all ows of a path (sumrf )

type Characterization of path condition

Table 3 Variables in Hermes

Parameter Recommended SettingTECN threshold for fraction of ECN 40TRTT _low threshold for low RTT 20 minus 40micros + base RTTTRTT _hiдh threshold for high RTT 15timesone hop delay + base RTT∆RTT threshold for notably better RTT one hop delay∆ECN threshold for notably better ECN fraction 3-10R the highest ow sending rate threshold for rerouting 20-40 of the link capacityS the smallest ow sent size threshold for rerouting 100 minus 800KBProbe interval 100 minus 500micros

Table 4 Parameters in Hermes and recommended settings

bull Timeliness it should quickly react to various uncertainties andmake timely rerouting decisions

bull Transport-friendliness it should limit its impact (ie packet re-ordering and congestion mismatch) on transport protocols

bull Deployability it should be implementable with commodity hard-ware in current datacenter environmentsTo this end we propose Hermes a resilient load balancing scheme

to gracefully handle uncertainties Figure 5 overviews the two mainmodules of Hermes (1) the sensing module that is responsiblefor sensing path conditions and (2) the rerouting module that isresponsible for determining when and where to reroute the trac

Akin to previous work [9 20 24] we implement a Hermes in-stance running in the hypervisor of each end host leveraging itsprogrammability to enable comprehensive sensing (sect31) and timelyyet cautious rerouting (sect32) to meet the aforementioned propertiesTable 3 and 4 summarize the key variables and parameters used inthe design and implementation of Hermes

31 Comprehensive Sensing311 Sensing Congestion

Hermes leverages both RTT and ECN to detect path conditions

Algorithm 1 Hermes Path Characterization1 for each path p do2 if fECN lt TECN and tRTT lt TRTT _low then3 type = дood4 else if fECN gt TECN and tRTT gt TRTT _hiдh then5 type = conдested6 else7 type = дray

8 if (nt imeout gt 3 and no packet is ACKed) or(fr etransmission gt 1 and type conдested ) then

9 type = f ailed

ECN RTT Possible Cause CharacterizationHigh High Congested Congested path

High ModerateLow Not enough ECN samples or

Gray pathall delay is built up at one hop

Low High Not enough ECN samples orthe network stack incurs high RTT

Low Moderate Moderately loadedLow Low Underutilized Good path

Table 5 Outcome of path conditions using ECN and RTT

bull RTT directly signals the extent of end-to-end congestion Whileit is informative accurate RTT measurement is dicult in com-modity datacenters without advanced NIC hardware [26] Thusin our implementation we primarily use RTT to make a course-grained categorization of paths ie to separate congested pathsfrom uncongested ones While a large RTT does not necessarilyindicate a highly congested path (eg end host network stackdelay can increase RTT) a small RTT is an indicator of an un-derutilized path

bull ECN captures congestion on individual hops It is supported bycommodity switches and widely used as congestion signal formany congestion control algorithms [6 34] Note that ECN is abinary signal that is marked when a local switch queue lengthexceeds a marking threshold it can well capture the most con-gested hop along the path However in a highly loaded networkwhere congestion may occur at multiple hops ECN cannot reectthis Furthermore a low ECN marking rate does not necessarilyindicate a vacant end-to-end path especially when there are notenough samples

Given that neither RTT nor ECN can accurately indicate thecondition of an end-to-end path to get the best of both we com-bine these two signals using a set of simple guidelines in Algo-rithm 1 Specically we characterize a path to be good if both RTTmeasurement and ECN fraction are low In contrast if both RTTmeasurement and ECN fraction are high we identify the path to becongested ndash this is because a large RTT value alone may be causedby the network stack latency at the end host whereas a high ECNfraction alone from a small number of samples may be inaccurateas well Otherwise in all the other cases we classify the path to begray Table 5 summarizes the outcome of the algorithm and reasonsbehind

Scheme Piggy-back Brute-force Power of two Hermes([23 24]) Probing ChoicesVisibility lt 001 100 gt3 gt3Overhead NA 100times 3times 3

Table 6 Comparison of dierent probing schemes in termsof visibility and corresponding overhead Recall that visibil-ity is quantied as the average number of concurrent owsa sender can observe on parallel paths to each end host(sect221) and overhead is dened as the sending rate of theextra probe trac introduced over the edge-leaf link capac-ity Here we consider a 100times100 leaf-spine topology with 105end hosts and 10Gbps link A probe packet is typically 64bytes and the probe interval is set to 500micros

312 Sensing FailuresHermes leverages packet retransmission and timeout events to inferswitch failures such as packet blackholes and random drops3

Recall that a switch with blackholes will drop packets from cer-tain source-destination IP pairs (or plus port numbers) deterministi-cally [19] To detect a blackhole of a certain source-destination pairHermes monitors ow timeouts on each path Once it observes3 timeouts on a path it further checks if any of the packets onthat path have been successfully ACKed If none have been ACKedHermes concludes that all packets on this path are being droppedand identies it as a failed path Blackholes including port numberscan be detected in a similar way

A switch experiencing silent random packet drops introduceshigh packet drop rates To detect such failures we observe thatpacket drops always trigger retransmissions which can be capturedby monitoring TCP sequence numbers Based on this observationHermes records packet retransmissions on each path and picksout paths experiencing high retransmission rates (eg 1 underDCTCP) for every τms We set τ to be 10 by default Congestion cancause frequent retransmissions as well Therefore Hermes furtherchecks the congestion status and identies uncongested paths withhigh retransmission rates as failed paths (lines 8-9 in Algorithm 1)

313 Improving VisibilityHermes requires visibility into the network for good load balancingdecisions However visibility comes at a cost One can exhaustivelyprobe all the paths to achieves full visibility but it introduces highprobing overhead ndash 100times the capacity of a 10Gbps link (Table 6) Incontrast piggybacking used in existing edge-based load balancingsolutions [23 24] incur little overhead but it provides very limitedvisibility

We seek a better tradeo between visibility and probing overheadby leveraging the well-known power-of-two-choices technique [28]Unlike DRILL [16] that applies this for packet routing within aswitch we adopt it at end hosts to probe path-wise informationwith the following considerations

First we nd that there is no need to pursue congestion informa-tion for all the paths instead probing a small number of them can

3We note that existing diagnostic tools [19 32] take at least 10s of seconds to detectand locate switch failures As a result they cannot be directly adopted by Hermes asload balancing requires fast reactions to these failures

R2Estimation

Remaining size = R1 T1

Do not reroute

Reroute

Figure 6 A simplied model to assess the cost and benetof rerouting Whether rerouting will shorten a owrsquos com-pletion time depends on factors such as the rate dierencebetween the new (R2) and current (R1) paths the gradient ofrate change and the ow size

eectively improve the load balancing performance with aord-able probing cost Second all path-wise congestion signals have atleast one RTT delay Hence probing only a small number of pathsreduces the risk of many end hosts making identical choices andoverwhelming one path while leaving others underutilized4 Thirdin addition to the two random probes we add an extra probe on thepreviously observed best path This brings better stability [16 28]and increases the chance of nding an underutilized path [29]

As shown in Table 6 with such a technique Hermes ensuresthat each source can see the status of at least 3 parallel paths toits destination which can eectively guide load balancing deci-sions Meanwhile it reduces the overhead by over 30times comparedto the brute-force approach To further reduce the overhead Her-mes picks one hypervisor under each rack as the probe agent Ineach probe interval these agents probe each other and share theprobed information among all hypervisors under the same rackThis further reduces the overhead by 100times Note that this approachcan introduce inaccuracies when the last hop latencies to dierenthosts under the same rack are signicantly dierent

Overall Hermes achieves over 300times better visibility than pig-gybacking Its overhead is around 3 which is over 3000times betterthan the brute-force approach

32 Timely yet Cautious ReroutingEven with comprehensive sensing reacting to the perceived un-certainties is still non-trivial As shown in sect222 the challengesare two-fold owlet switching is passive and cannot always timelyreact to uncertainties whereas vigorous rerouting adversely inter-feres with transport protocols To address these challenges Hermesemploys timely yet cautious rerouting

On the one hand instead of passively waiting for owlets Her-mes uses packet as the minimum switchable granularity so that itis capable of timely reacting to uncertainties On the other handbecause vigorous rerouting can introduce congestion mismatch andpacket reordering Hermes tries to be transport-friendly by reduc-ing the frequency of rerouting Unlike prior solutions that performrandom hashing [14 21] round-robin [12 20 24] or always rerouteto the best path [5] Hermes cautiously makes rerouting decisionsbased on both path conditions and ow status

4This herd behavior caused by delayed information update is elaborated in [27]

Figure 6 illustrates a simplied cost-benet assessment Hermesperforms before making rerouting decisions Consider a ow thathas already sent some data and due to changes in network condi-tions we need to determine whether to reroute Rerouting to a lesscongested path may result in packet reordering which in turn canhalve its sending rate (R1 rarr 1

2R1) assuming TCP fast recovery istriggered In this case whether a rerouting can improve the owrsquoscompletion time depends on many factors such as the sendingrate on the new path (R2) the current path (R1) and the remainingow size Note that it is a coarse-grained model because (1) somemetrics such as R2 cannot be eectively measured and (2) it doesnot accurately capture the cost of congestion mismatch caused byfrequent rerouting However it highlights the need for caution andhelps us identify the following heuristics for cautious rerouting

First considering the rate reduction caused by rerouting and theestimation error of R2 shown in Figure 6 we note that reroutingis not always benecial if R2 is not signicantly larger than R1 Asa result Hermes reroutes a ow only when it nds a path withnotably better condition (evaluated by an RTT threshold ∆RTT andan ECN threshold ∆ECN in our implementation) than the currentpath

Second it indicates that rerouting a ow with a small remainingsize may bring limited benet this is because it may nish before itssending rate peaks As a result Hermes uses the size a ow alreadysent to estimate the remaining size [7 30] and reroutes ows onlywhen the size sent exceeds a threshold S

Finally although we cannot accurately measure R2 we can mea-sure R1 by using CONGArsquos DRE algorithm [5] If R1 is already highthe potential benet of rerouting is likely to be marginal moreoverthe cost would be high if we reroute to a wrong path (as shownin the example of Figure 4b) Therefore we avoid rerouting a owwhose sending rate exceeds a threshold R

To summarize Algorithm 2 shows the rerouting logic of HermesIt is timely triggered for every packet when (1) it belongs to a newow or a ow experiencing failuretimeout or (2) the current pathis congested In the former case (lines 3-12) Hermes rst tries toselect a good path with the least sending rate rp in order to preventlocal hotspots If it fails Hermes further checks gray paths If itfails again Hermes randomly chooses an available path with nofailure For the latter case (lines 13-23) Hermes rst tries to evaluatethe benet of rerouting As discussed above Hermes checks thesending rate rf and size sent Ssent and decides to reroute onlywhen both conditions for cautious rerouting are met (line 14) Thefollowing rerouting procedure is similar to that in the former caseexcept that the selected path should be notably better than thecurrent path (lines 15 and 19) The ow stays on its original path ifno better path is found

33 Parameter SettingsHermes has a number of parameters as shown in Table 4 We nowdiscuss how we set them up Note that in this section we onlyprovide several useful rules-of-thumb and leave (automatic) optimalparameter conguration as an important future work

Recall that we use 3 parameters to infer congestion TRTT _low TECN andTRTT _hiдh (sect31)TRTT _low is a threshold for good pathwe set it to be 20-40micros (20micros by default) plus one-way base RTT to

Algorithm 2 Timely yet Cautious Rerouting1 for every packet do2 Assume its corresponding ow is f and path is p3 if f is a new ow or f i ft imeout==true or p type == f ailed

then4 pprime = all good paths5 if pprime empty then6 plowast = Arдminpisinpprime(p rp )

Select a good path with the smallest

local sending rate

7 else8 pprimeprime = all gray paths9 if pprimeprime empty then

10 plowast = Arдminpisinpprimeprime(p rp )11 else12 plowast = a randomly selected path with no failure

13 else if p type == conдested then14 if f ssent gt S and f rf lt R then15 pprime = all good paths notably better than p

forallpprime isin pprime we have p tRTT minus pprimetRTT gt ∆RTTand p fECN minus pprimefECN gt ∆ECN

16 if pprime empty then17 plowast = Arдminpisinpprime(p rp )18 else19 pprimeprime = all gray paths notably better than p20 if pprimeprime empty then21 plowast = Arдminpisinpprimeprime(p rp )22 else23 plowast = p Do not reroute

24 return plowast The new routing path

ensure that a good path is only lightly loadedTECN is an indicatorfor congested path we set it to be 40 to ensure the path is heav-ily loaded To set TRTT _hiдh we use per-hop delay as a guidelineNote that each fully loaded hop introduces a relatively stable de-lay This value can be calculated by ECN markinд threshold

Link capacity withDCTCP [6] We set TRTT _hiдh to be base RTT plus 15times of the onehop delay Such a value suggests that the path is likely to be highlyloaded at more than one hop Due to dierent settings TRTT _hiдhis congured to be 300micros in our testbed and 180micros in simulationsSensitivity analysis in sect54 shows that Hermes performs well withTRTT _hiдh between 140 to 280micros in simulations

For probing our evaluation suggests that an interval of 100-500micros brings good performance and we set the default value as500micros For rerouting the key idea is to ensure that each reroutingis indeed necessary ∆RTT is the RTT threshold to ensure a path isnotably better than another we set it to be the one hop delay (80microsin simulation and 120micros in testbed) to mask potential RTT mea-surement inaccuracies Similarly we set the ECN fraction threshold∆ECN to be 3-10 (5 by default) Moreover we nd that setting Sto be 100-800KB can avoid rerouting small ows and improve theFCT of small ows under high load Also setting R to be 20-40 ofthe link capacity can avoid rerouting ows with high sending rateswhich eectively improves the FCT of large ows The performance

of Hermes is fairly stable with the above settings and we set S tobe 600KB and R to be 30 of the link capacity by default

4 IMPLEMENTATIONWe have implemented a Hermes prototype and deployed it in ourtestbed We enforce the explicit routing path control at the endhost using XPath [22] Note that our implementation is compatiblewith legacy kernel network stacks and requires no modication toswitch hardwareEnd host module The end host module is implemented as aLinux kernel module residing between TCPIP stacks and Linuxqdisc It can be installed and removed while the system is runningwithout recompiling the OS kernel The kernel module consists offour components a Netfilter TX hook a Netfilter RX hooka hash-based ow table and a path table Note that it operates atboth TX and RX paths

The operations in the TX path are as follows (1) The NetfilterTX hook intercepts all outgoing packets and forwards them to theow table (2) We identify the ow entry the packet belongs to andupdate its state (eg size sent sending rate sequence number etc)(3) Then we compute the packetrsquos path using Hermesrsquos reroutingalgorithm (4) Note that each path in XPath [22] framework canbe identied by a unique IP address So after obtaining the desiredpath we add an IP header to the packet and write the desired pathIP into the destination address eld

The operations in the RX path are as follows (1) The NetfilterRX hook intercepts all incoming packets (2) We identify the owentry of this packet and update its ow state (3) We identify thepath this ow uses and update the path state (4) After updatinginformation we remove the outer IP header and deliver this packetto the upper layer In addition to the kernel module we generateprobe packets using a simple client-server applicationSystem overhead To quantify the system overhead introduced byHermes we install it on a server with 4-core Intel E5-1410 28GHzCPU and a Broadcom BCM57810 NetXtreme II 10Gigabit EthernetNIC We generate more than 9Gbps of trac with more than 5000concurrent connections The extra CPU overhead introduced is lessthan 2 compared to the case where the Hermes end host moduleis not running The measured throughput remains the same in bothcasesSwitch conguration We congure L3 forwarding table ECNREDmarking and strict priority queueing (two priorities) at the switchInspired by previous work [26] we classify pure ACK packets inthe reverse path into the high priority queue for more accurate RTTmeasurements

5 EVALUATIONWe evaluate Hermes using a combination of testbed experimentsand large-scale ns-3 simulations Our evaluation seeks to answerthe following questionsHow does Hermes perform under a symmetric topologyTestbed experiments (sect52) show that Hermes achieves 10-38 bet-ter FCT than ECMP outperforms CLOVE-ECN by up to 15 andachieves comparable performance to Presto In a large simulated

10E+01 10E+03 10E+05 10E+07 10E+09

Size (Bytes)

Data Mining

Web Search

Figure 7 Trac distributions used for evaluation

topology we show that Hermes performs close to (and slightly bet-ter than) CONGA for the web-search (and data-mining) workload(sect531)How does Hermes perform under an asymmetric topologyTestbed experiments with a link cut show that Hermes performs 12-26 better than CLOVE-ECN (sect52) and signicantly outperformsPresto and ECMP at high loads Moreover we nd that Hermes isvery competitive in a large simulated topology with a high degreeof asymmetry (sect532) 1) For the web-search workload Hermesis within 4-30 of CONGA Compared to LetFlow and CLOVE-ECN cautious rerouting in Hermes leads to up to 33times better FCTsfor small ows at high loads 2) For the data-mining workloadHermes outperforms CONGA by up to 10 due to its more timelyrerouting With active probing Hermes achieves up to 20 betterFCT than prior solutions with limited visibility (eg CLOVE-ECNand LetFlow)How eective is Hermes under switch failures (sect533) Wesimulate both silent random packet drop and packet blackholescenarios and our results show that Hermes can eectively senseand avoid both types of failures achieving more than 32 betterFCT compared to all other schemesHow does each design component contributes and how ro-bust isHermes under dierent parameter settings (sect54) Weevaluate the benets brought by good visibility and cautious yettimely rerouting separately Results show that both componentscontribute to over 10 improvements in the overall outcomes More-over we also nd that Hermesrsquo performance is stable under a varietyof parameter settings

51 Methodology

Transport We use DCTCP as the default transport protocol Inour testbed we use the DCTCP implementation in Linux kernel31811 [1] In our simulator we implement DCTCP on top of ns-3rsquosTCP New Reno protocol faithfully capturing its optimizations Theinitial window is 10 packets We set both initial and minimum valueof TCP RTO to 10ms We set other parameters as suggested in [6]Schemes compared Besides ECMP we compare Hermes withthe following solutionsbull CONGA [simulation] We simulate CONGA following the param-

eter settings in [5] However because DCTCP is less bursty thanTCP the default 500micros owlet timeout value is too big For a faircomparison against owlet-based solutions we tried dierentowlet timeout values and adopted the best one (ie 150micros) inour simulations

bull LetFlow [simulation] We simulate LetFlow with a owlet time-out value of 150micros (as we do for CONGA)

Spine 0 Spine 1

Leaf 0 Leaf 1 6x1Gbps 6x1Gbps

4x1Gbps 4x1Gbps

(a) The symmetric case

Spine 0 Spine 1

4x1Gbps 3x1Gbps

Failed Link

(b) The asymmetric case

Figure 8 [Testbed] Topology used for testbed experiments

bull Presto [testbed simulation] Similar to [14] we implement avariant of Presto (Presto) to isolate performance issues fromthose caused by packet reordering We spray packets insteadof owcells and implement a reordering buer to mask packetreordering

bull CLOVE-ECN [testbed simulation] We evaluated two versionsof CLOVE Edge-Flowlet and CLOVE-ECN via both testbed ex-periments and simulations We only show the results of CLOVE-ECN since it slightly outperforms Edge-Flowlet in most casesWe do not simulate CLOVE-INT since it requires programmableswitches and it is shown to be outperformed by CONGA [24]We set a 150micros owlet timeout value in simulations For testbedwe pick the best owlet timeout value (ie 800micros) after tryingdierent values

Remark We do not compare against MPTCP [31] because there is alack of a reliable ns-3 simulation package and the latest publiclyavailable MPTCP Linux release suers from performance instabil-ity [20 23] Note that MPTCP suers from incast issues and hasbeen shown to be outperformed by CONGA in many cases similarto those we consider Moreover we have implemented FlowBen-der [23] on top of DCTCP in our testbed and we follow the defaultsettings in [23] However the performance is close to ECMP Onepossible reason is that the workload or topology do not suit the de-fault parameter settings Hence we omit the results of FlowBenderto avoid unfair comparisonWorkloads We use two widely-used realistic workloads observedfrom deployed datacenters data-mining [18] and web-search [6] Asshown in Figure 7 both distributions are heavy-tailed Particularlythe data-mining workload is more skewed with 95 of all data bytesbelonging to about 36 of ows that are larger than 35MB whichmakes it more challenging for load balancing [5] We adopt the owgenerator in [8] which generates ows between random sendersand receivers under dierent leaf switches according to Poissonprocesses with varying trac loadsMetrics We use ow completion time (FCT) as the primary per-formance metric Besides the overall average FCT we also breakdown the FCT for small ows (lt100KB) and large ows (gt10MB)in some cases to better understand the results The results are theaverage of 5 runs

52 Testbed Experiments

Testbed setup Our testbed consists of 12 servers connected to 4switches As shown in Figure 8a the servers are organized in two

30 40 50 60 70 80 90

Load ()

CLOVE-ECN

Hermes

Presto

(a) Web-search

30 40 50 60 70 80 90

Load ()

CLOVE-ECN

Hermes

Presto

(b) Data-mining

Figure 9 [Testbed] Overall avg FCT in the symmetric case

racks (6 servers each) We adopt the same topology as used in priorwork [5 14 24] and all servers and switches are connected with1Gbps links Given that there is a 32 oversubscription at the leaflevel We also consider an asymmetric case where we cut one ofthe links between a leaf switch and a spine switch as indicated inFigure 8b

Our servers have 4-core Intel E5-1410 28GHz CPU and a Broad-com BCM5719 NetXtreme Gigabit Ethernet NIC All the servers runLinux kernel 31811 The switches are Pronto 3295 48-port GigabitEthernet switch Since the base RTT of our testbed is around 100microswe set standard ECN marking threshold to 30KB accordingly Weuse a simple client-server application to generate trac accordingto the two workloads discussed above and measure the FCT on theapplication layer We compare Hermes with ECMP CLOVE-ECNand Presto all of which are implementable on our testbedThe symmetric case Figure 9 shows the results for both theweb-search and data-mining workloads in the symmetric case Wesee similar trends in both workloads First we nd that Hermesperforms increasingly better (10-38) compared to ECMP as theload increases This is as expected because ECMP suers morefrom hash collisions at higher loads Moreover Hermes performs9-15 better at 30-70 loads compared to CLOVE-ECN This isbecause Hermes has better visibility and can react to congestionmore timely than owlet-based solutions especially under low andmedium loads Also we observe that Hermes performs closely toPresto which is shown to achieve near-optimal performance insymmetric topologies [20]

Another observation is that Hermes outperforms Presto bysim13at 90 load with the web-search workload One possible reason isthat our testbed is not perfectly symmetric because links may nothave exactly the same capacity As a result Presto may always beconstrained by the bottlenecked link similar to the case shown inExample 2 of sect222 Such congestion mismatch caused by imperfectsymmetry may aect Presto under very high load In comparisonPresto is more stable in the data-mining workload We believe thatthis is due to the more bursty nature of the web-search workloadwhich has a smaller inter-ow arrival time and a larger number ofsmall owsThe asymmetric case We repeat the above experiments for theasymmetric topology Note that we only consider loads up to 70relative to the symmetric case because the bisection bandwidth isonly 75 of the symmetric case

First we observe that the performance of ECMP deterioratesafter the load exceeds 40-50 This is because the link between

30 40 50 60 70

Load ()

CLOVE-ECNHermesPrestoECMP

(a) Web-search

30 40 50 60 70F

Load ()

(b) Data-mining

Figure 10 [Testbed] Overall avg FCT in the asymmetric case

30 40 50 60 70

Load ()

CLOVE-ECN

Hermes

Presto

(a) Small ow (lt100KB) avg

30 40 50 60 70

Load ()

CLOVE-ECNHERMESPrestoECMP

(b) Large ow (gt10MB) avg

Figure 11 [Testbed] Web-search workload statistics in theasymmetric case Note that for large ows we normalize theFCT to Hermes to better visualize the results

spine 1 and leaf 1 becomes fully loaded Moreover we observe thatHermes performs 12-30 better than CLOVE-ECN at 30-65 loads mdashwith both workloads and among dierent ow size groups (Figure11) Similar to the symmetric case this demonstrates Hermesrsquosbetter visibility and more timely reaction to congestion comparedwith CLOVE-ECN Note that CLOVE-ECN achieves comparableperformance to Hermes at 70 load One possible reason is thatmore owlets are created at high loads thus CLOVE-ECN can moretimely converge to a balanced load

For Presto we take the path asymmetry into account and assignweights for parallel paths statically to equalize the average load[14] However we nd that even with topology-dependent weightsPresto fails to match Hermesrsquos performance We believe that theroot cause is congestion mismatch As the load increases dierentpaths begin to experience dierent levels of congestion becausethe trac matrix is not symmetric As a result the congestionwindow of Presto is always constrained by the most congestedpath and ECN signals are also mismatched among dierent pathsSuch mismatch greatly aects the normal transport behavior (asshown in the Examples 2 and 3 of sect222) and causes a dramaticincrease in FCT after the load exceeds 60

53 Deep Dive with Large SimulationsDue to testbed limitations our experiments contain 4 switches 1Gbps links and 1 link failure Using simulations we further evalu-ate Hermes with a larger topology dierent degrees of asymmetryand multiple types of switch failures Specially we look into theperformance of dierent schemes in detail and many of our ob-servations echo our design rationale of timely triggering cautiousrerouting and enhanced visibility

101520253035

30 50 70 90

Load ()

Hermes

(a) Web-search

30 50 70 90

Load ()

ECMP CONGA Hermes

(b) Data-mining

Figure 12 [Simulation] Overall avg FCT (baseline topology)

531 BaselineWe rst inspect the performance of Hermes under a symmetric 8times8leaf-spine topology with 128 hosts connected by 10Gbps links Wesimulate a 21 oversubscription at the leaf level typical of todayrsquosdatacenter deployments [5] Figure 12 shows the overall averageFCT for the web-search and data-mining workloads For the web-search workload Hermes outperforms ECMP by up to 55 as thetrac load increases As a readily-deployable solution we observethat Hermes is always within 17 of CONGA (requiring switchmodications) at all levels of loads For the data-mining workloadHermes is 29 better than ECMP at high load Moreover unlike theweb-search workload we nd that Hermes can slightly outperformCONGA by up to 4Analysis The distinction between CONGA and Hermes underthe data-mining and web-search workloads suggests the impor-tance of visibility and timely reaction On the one hand CONGAoutperforms Hermes for the web-search workload One key reasonis that CONGA achieves better visibility by keeping track of allthe ows on a leaf switch as shown in Table 2 On the other handHermes performs slightly better for the data-mining workload be-cause of its timely reaction to congestion To explain this note thatthe data-mining workload contains more large ows hence it ismore challenging to deal with because multiple large ows maycollide on one path [5] Furthermore the data-mining workload hasa much bigger inter-ow arrival time So when there are no burstyarrivals of new ows the packet inter-arrival time of these largeows can be quite stable Therefore CONGA cannot always timelyresolve such bottlenecks because there are not enough owlet gapsas shown in the Example 1 of sect222 In comparison Hermes cantimely reroute these ows once congestion is sensed and a vacantpath is found

532 Impact of Asymmetric TopologyWe further compare Hermes with CONGA LetFlow Clove-ECNand Presto under an asymmetric topology We adopt the baselinetopology and reduce the capacity from 10Gbps to 2Gbps for 20 ofrandomly selected leaf-to-spine links Note that we normalize theFCT to Hermes in order to better visualize the resultsUnder theweb-searchworkload As shown in Figure 13 CONGAperforms over 10 better than the other schemes in most casesHermes CLOVE-ECN and LetFlow achieve similar performanceThis is because the web-search workload contains many small owsand is also more bursty Dynamics such as frequent ow arrivalare likely to create owlet gaps to break large ows into smallsized owlets With enough owlets CLOVE-ECN and LetFlow can

30 40 50 60 70 80 90

Load()

CLOVE-ECN CONGAHermes PrestoLetFlow

(a) Overall avg FCT

30 40 50 60 70 80 90

Load()

(b) Small ow(lt100KB) avg

30 40 50 60 70 80 90

Load()

(c) Large ow (gt10MB) avg

30 40 50 60 70 80 90

Load()

(d) Small ow 99th percentile

Figure 13 [Simulation] FCT statistics for the web-search workload in the asymmetric topology (normalized to Hermes)

30 40 50 60 70 80 90

Load ()

(a) Overall avg FCT

30 40 50 60 70 80 90

Load ()

(b) Small ow (lt100KB) avg

30 40 50 60 70 80 90FC

Load ()

30 40 50 60 70 80 90

Load ()

(d) Overall 99th percentile

Figure 14 [Simulation] FCT statistics for the data-mining workload in the asymmetric topology (normalized to Hermes)

Overall Large (gt10MB) Small (lt100KB)FC

50us 150us 500us

Figure 15 [Simulation] CONGA with dierent owlet time-out values (web-search) To evaluate the impact of conges-tionmismatch we try to mask the impact of packet reorder-ing by implementing a reordering buer [15]

quickly converge to a balanced load even without good visibilityIn comparison recall that Hermes outperforms CLOVE-ECN by9-15 with the web-search workload in testbed experiments Onepossible reason is that our 1Gbps testbed has a higher RTT thusCLOVE-ECN converges more slowly

However excessive rerouting opportunities may negatively af-fect small ows without cautious rerouting As we can see in Figure13b and 13d the average and the 99th percentile FCTs for small owsgrow dramatically for owlet-based solutions as the load increasesThis is because small ows are broken into several owlets underhigh loads thus they are heavily aected by packet reorderingand congestion mismatch In comparison Hermes outperforms thecompetition by 15-33times at 90 load due to its cautious reroutingUnder the data-miningworkload Figure 14 shows that Hermesoutperforms CONGA by 5-10 Compared to the baseline (Figure12b) we observe that the timely reaction of Hermes brings moreobvious performance gains This is perhaps because Hermes caneectively resolve collisions of large ows on the 2Gbps links We

also observe that Hermes is 13-20 better than CLOVE-ECN andLetFlow This is because the data-mining workload is signicantlyless bursty When there are not enough rerouting opportunities(ie owlets for CLOVE-ECN and LetFlow) solutions without goodvisibility can hardly balance the tracValidating congestion mismatch Similar to our testbed experi-ments we observe that Presto with topology-dependent weightsfails to achieve comparable performance to others under the asym-metric topology To further validate the eect of congestion mis-match caused by vigorous rerouting we x the trac load at 80and run CONGA with dierent owlet timeout gaps We maskpacket reordering to rule out their impact As shown in Figure15 we nd that reducing the owlet timeout from 500micros to 150microsimproves FCT (by sim6) due to more rerouting opportunities How-ever further reducing the timeout value to 50micros degrades FCT bysim30 This indicates that even congestion-aware solutions suerfrom congestion mismatch With such a small owlet gap CONGAchanges path vigorously This in turn negatively aects the normalbehavior of transport protocols as we showed in the Example 4 ofsect222

533 Impact of Switch FailuresWe next evaluate Hermes under failure scenarios We adopt thebaseline topology and randomly select one core switch to simulatesilent random packet drop and packet blackhole [19] Note thatbecause only 7 out of 8 core switches are working appropriatelywe consider trac loads up to 70 We compare Hermes againstCONGA Presto LetFlow and ECMPSilent randompacket drop To simulate the silent random packetdrop scenario we set the drop rate to 2 on a randomly selectedcore switch Figure 16 shows the performance of dierent schemes

30 40 50 60 70

Load ()

ECMP CONGAHermes PrestoLetFlow

(a) Overall avg FCT

30 40 50 60 70

s)Load ()

Figure 16 [Simulation] Performance with random packetdrops (web-search)

30 40 50 60 70

Load ()

ECMP CONGA

Hermes Presto

LetFlow

(a) Overall avg FCT

30 40 50 60 70

Load ()

(b) of unnished ows

Figure 17 [Simulation] Performance with packet blackhole(web-search)

First of all we observe that Hermes outperforms all other schemesby over 32 this is because it can eectively sense the failure andavoid routing through the failed switch ECMP has 17-23times higherFCT compared to Hermes even at low loads this is because about18th of the ows traverse the failed switch using ECMP and areaected by the high packet drop rate We also observe that CONGAperforms similarly to ECMP To explain this note that ows travers-ing the failed switch tend to have a low sending rate due to frequentpacket drops Therefore CONGA paradoxically shifts more tracto such undesirable paths because it senses and balances tracbased on network utilization Presto is more heavily aected be-cause all the ows have to go through this failed switch and thus beaected5 LetFlow is comparatively less aected because randompacket drops create more rerouting opportunities on the aectedpaths However without the ability to explicitly detect and avoidfailures LetFlow still performs sim15times worse than HermesPacket blackhole To simulate the packet blackhole scenario wedrop packets for half of the source-destination IP pairs from Rack 1to Rack 8 deterministically on one randomly selected switch Figure17 shows the performance of dierent schemes

As expected Hermes can eectively detect the blackhole after 3timeouts hence all the ows can nish and Hermes achieves over16times better FCT than others For ECMP a xed group of ows fromRack 1 to Rack 8 will be hashed to the failed switch which leads to asim15 of unnished ows (Figure 17b) The unnished ows greatlyenlarge the average FCT which makes ECMP 9-22times worse thanHermes Similar to the random drop scenario CONGA paradoxi-cally shifts more ows to the failed switch because it appears to be5Note that we observe good average FCT for small ows in case of Presto One possiblereason is that Presto has a lower network utilization on all the paths as all large owsare heavily aected and cannot send at high rates

40 60 80

Hermes wo Both Hermes wo Rerouting Hermes

(a) Eectiveness of probing andrerouting (overall avg FCT)

40 60 80

No Probing 500μs Probing 100 μs Probing

(b) Impact of probe interval (small ow(lt100KB) avg)

Figure 18 [Simulation] Hermes deep dive (data-mining)Here ldquowithout both refers to Hermes without both probingand rerouting

140 160 180 200 220 240 260 280Nor

esTRTT_high(micros)

Web-Search Data-Mining

(a) Sensitivity to TRTT _hiдh

25 50 75 100 125 150 175 200

∆RTT (micros)

(b) Sensitivity to ∆RTT

Figure 19 [Simulation] Sensitivity to TRTT _hiдh and ∆RTT

less congested This results in a higher portion of unnished owsand higher FCT compared to ECMP Presto works in a round-robinmanner so all the ows can nish However the average FCT is stillgreatly enlarged because all the corresponding ows are aectedFinally we see that LetFlow performs the second best because itcan timely reroute the aected ows However it is still over 16timesworse than Hermes because it cannot explicitly detect and avoidfailures

54 Performance Breakdown and Robustness

Eectiveness of probing and rerouting We have already shownthat good visibility and timely yet cautious rerouting together bringgood performance Now we further investigate the incremental ben-ets of some key design components eg probing and reroutingusing the data-mining workload Figure 18a shows that probingand rerouting bring around 20 and 10 improvement to the over-all average FCT respectively A similar trend is observed for thelarge and small ows as well This observation validates that ac-tive probing can eectively increase the visibility of end hostsand timely rerouting can eectively resolve hotspots caused bycollisions among large owsImpact of probe intervals Figure 18b shows that compared toHermes without probing a 500micros probe interval brings around11-15 improvement while reducing the probe interval to 100microsbrings around 1-3 improvementSensitivity of parameter settings We next study how dierentparameter settings aect the performance of Hermes Figure 19aand 19b show the sensitivity analysis for ∆RTT and TRTT _hiдh First we observe that the FCT is relatively stable when these two

parameters are set around the suggested values Another observa-tion is that the web-search and data-mining workloads experiencedierent trends as TRTT _hiдh and ∆RTT increase To understandthis recall that the web-search workload is more bursty so it tendsto create frequent rerouting opportunities Therefore a conserva-tive parameter setting (ie highTRTT _hiдh and ∆RTT ) brings betterperformance because it can prune excessive reroutings Howeverbecause the data-mining is less bursty an aggressive parametersetting leads to better performance We have also tested other pa-rameters and observed that Hermes performs well and relativelystable around the settings suggested in sect33Dierent transport protocols We nally check the performanceof Hermes with TCP under the 8times8 topology in simulation For Her-mes we rely only on RTT to sense congestion Since TCP is morebursty and has a larger RTT we set∆RTT andTRTT _hiдh 15times largerand keep all the other parameters unchanged For CONGA we setthe owlet timout to be 500micros Under the web-search workloadHermes is within 10-25 of CONGA at all loads in both baselineand asymmetric topology Under the data-mining workload Her-mes performs almost identically to CONGA in most cases (guresomitted due to space limitation) We have observed a similar trendfor DCTCP in sect53 except that CONGA performs slightly betterrelative to Hermes This is because TCP is more bursty thus morelikely to create sucient owlet gaps

6 DISCUSSIONHermes is not a panacea Here we discuss some design tradeosand potential deployment concernsEndhost based sensing The choice of pushing congestion aware-ness to the network edge makes Hermes readily deployable andresilient to uncertainties at the same time However we note thatvisibility at end hosts although enhanced by active probing is stilllimited in comparison to switch-based solutions [5 25 35] Bettervisibility often leads to better initial routing assignments whichcan be especially important for small ows For example our eval-uation shows that CONGA outperforms Hermes in the web-searchworkload which has a relatively smaller average ow sizeEectiveness of the rerouting design Compared to owlet-based solutions the choice of timely yet cautious rerouting is thekey for faster reaction to congestion especially when trac istoo steady to create enough owlet gaps As a result Hermes out-performs owlet-based solutions under the relatively stable data-mining workload (Figure 12b and 14) However when ows aresmall and bursty dynamics such as frequent ow arrival are likely tocreate enough owlet gaps making owlet-based solutions equallyecient (Figure 13)Number of parameters Hermes introduces a number of parame-ters to eectively sense path conditions and to make deliberate loadbalancing decisions As a result deploying Hermes requires moretuning compared to solutions with much simpler load balancinglogics At this point we provide some rules of thumb for param-eter tuning A full exploration of the optimal parameter settingstogether with an automatic parameter tuning procedure wouldgreatly simplify the deployment of Hermes We consider it as animportant future work

Burst avoidance and stability Hermes takes at least one RTTto sense and react to uncertainties and thus it does not directlyhandle microbursts [16] As for stability recent congestion-awareload balancing solutions [5 24 25] have demonstrated that stableperformance can be achieved in practice as long as network state iscollected at ne-grained timescales Compared to these solutionsHermes can more eectively prevent path oscillations and burstscaused by synchronized routing choices because 1) Hermes lever-ages the power of two choices to avoid the herd behavior and 2)Hermes does not reroute small ows and ows with a high sendingrate

7 RELATEDWORKThe literature on datacenter load balancing is vast Among themwe only discuss some representative ones close to this work

Presto [20] DRB [12] and RPS [13] are per-packetowcell basedcongestion-oblivious load balancing schemes We have shown thatthey suer from congestion mismatch under asymmetry

CONGA [5] and Expeditus [35] employ congestion aware switch-ing on specialized switch chipsets to load balance the networkHULA [25] and CLOVE-INT [24] leverage advanced programmableswitches [2 11] to achieve better visibility

LetFlow [14] leverages owlet switching to automatically con-verge to a balanced trac share among parallel paths However wehave shown that owlets cannot always timely react to congestionunder stable trac patterns which makes LetFlow converge slowlyand achieve suboptimal performance

CLOVE-ECN [24] adopts per-owlet weighted round-robin atend hosts where path weights are calculated based on piggybackedECN signals Due to its limited visibility and slow reaction to con-gestion CLOVE-ECN performs up to 25 worse than Hermes underan asymmetric topology

MPTCP [31] is a transport protocol that routes several subowsconcurrently over multiple paths Because subows do not changepaths MPTCP does not suer from the congestion mismatch prob-lem However MPTCP has some important drawbacks [5 23 24]First it modies the end host networking stacks which makes itchallenging to deploy in multi-tenant environments Moreover itperforms poorly in incast scenarios because several connectionsare maintained simultaneously for each ow

DRILL [16] is a switch-local per-packet load balancing solutionwith a major goal of resolving micro bursts under heavy load DRILLdoes not consider asymmetric topologies in their design and itreroutes every packet vigorously with only local information As aresult it also suers from congestion mismatch under asymmetry

FlowBender [23] reroutes ows blindly whenever congestion isdetected by end hosts Such random and vigorous rerouting bringssub-optimal performance under high loads

Finally all the aforementioned schemes lack the ability to timelyreact to switch malfunctions which results in signicant perfor-mance degradation as we showed in sect533

8 CONCLUSIONThis paper has introduced Hermes a datacenter load balancer re-silient to uncertainties such as trac dynamics topology asym-metry and failures Hermes leverages comprehensive sensing to

detect uncertainties (including switch failures unattended before)and reacts by timely yet cautious rerouting We have implementedHermes with commodity switches and evaluated it through testbedexperiments and large simulations We demonstrated that Hermeshandles uncertainties well under asymmetries Hermes achievesup to 10 (20) better FCT than CONGA (CLOVE) under switchfailures it outperforms all other schemes by over 32

ACKNOWLEDGMENTSThis work is supported in part by the Hong Kong RGC ECS-26200014GRF-16203715 GRF-613113 CRF-C703615G and the China 973 Pro-gram No2014CB340303 Mosharaf Chowdhury was supported inpart by NSF grants CNS-1617773 CCF-1629397 and CNS-1563095We are thankful to Zhouwang Fu for his help in testbed evaluationWe would also like to thank our shepherd Dina Papagiannaki andthe anonymous SIGCOMM reviewers for their valuable feedback

REFERENCES[1] DCTCP in Linux Kernel 318 httpkernelnewbiesorgLinux318[2] In-band Network Telemetry (INT) httpp4orgwp-contentuploads

xedINTINT-current-specpdf[3] Mohammad Al-Fares Alexander Loukissas and Amin Vahdat A

Scalable Commodity Data Center Network Architecture In SIGCOMM2008

[4] Mohammad Al-Fares Sivasankar Radhakrishnan Barath RaghavanNelson Huang and Amin Vahdat Hedera Dynamic Flow Schedulingfor Data Center Networks In NSDI 2010

[5] Mohammad Alizadeh Tom Edsall Sarang Dharmapurikar RamananVaidyanathan Kevin Chu Andy Fingerhut Francis Matus Rong PanNavindra Yadav George Varghese and others CONGA DistributedCongestion-Aware Load balancing for Datacenters In SIGCOMM 2014

[6] Mohammad Alizadeh Albert Greenberg David A Maltz Jitendra Pad-hye Parveen Patel Balaji Prabhakar Sudipta Sengupta and MurariSridharan Data Center TCP (DCTCP) In SIGCOMM 2010

[7] Wei Bai Li Chen Kai Chen Dongsu Han Chen Tian and Hao WangInformation-Agnostic Flow Scheduling for Commodity Data CentersIn NSDI 2015

[8] Wei Bai Li Chen Kai Chen and Haitao Wu Enabling ECN in Multi-Service Multi-Queue Data Centers In NSDI 2016

[9] Hitesh Ballani Paolo Costa Christos Gkantsidis Matthew PGrosvenor Thomas Karagiannis Lazaros Koromilas and Greg OrsquoSheaEnabling End-host Network Functions In SIGCOMM 2015

[10] Peter Bodiacutek Ishai Menache Mosharaf Chowdhury PradeepkumarMani David A Maltz and Ion Stoica Surviving Failures in Bandwidth-Constrained Datacenters In SIGCOMM 2012

[11] Pat Bosshart Dan Daly Glen Gibb Martin Izzard Nick McKeownJennifer Rexford Cole Schlesinger Dan Talayco Amin Vahdat GeorgeVarghese and others 2014 P4 Programming Protocol-IndependentPacket Processors SIGCOMM CCR 44 3 (2014) 87ndash95

[12] Jiaxin Cao Rui Xia Pengkun Yang Chuanxiong Guo Guohan LuLihua Yuan Yixin Zheng Haitao Wu Yongqiang Xiong and DaveMaltz Per-packet Load-balanced Low-latency Routing for Clos-basedData Center Networks In CoNEXT 2013

[13] Advait Dixit Pawan Prakash Y Charlie Hu and Ramana Rao Kom-pella On the Impact of Packet Spraying in Data Center Networks InINFOCOM 2013

[14] Vanini Erico Pan Rong Alizadeh Mohammad Taheri Parvin andEdsall Tom Let it FLow Resilient Asymmetric Load Balancing withFlowlet Switching In NSDI 2017

[15] Yilong Geng Vimalkumar Jeyakumar Abdul Kabbani and MohammadAlizadeh JUGGLER A Practical Reordering Resilient Network Stackfor Datacenters In EuroSys 2016

[16] Soudeh Ghorbani Brighten Godfrey Yashar Ganjali and AminFiroozshahian Micro Load Balancing in Data Centers with DRILL InHotNets 2015

[17] Phillipa Gill Navendu Jain and Nachiappan Nagappan Understand-ing Network Failures in Data Centers Measurement Analysis andImplications In SIGCOMM 2011

[18] Albert Greenberg James R Hamilton Navendu Jain Srikanth KandulaChanghoon Kim Parantap Lahiri David A Maltz Parveen Patel andSudipta Sengupta VL2 A Scalable and Flexible Data Center NetworkIn ACM SIGCOMM 2009

[19] Chuanxiong Guo Lihua Yuan Dong Xiang Yingnong Dang RayHuang Dave Maltz Zhaoyi Liu Vin Wang Bin Pang Hua Chen Zhi-Wei Lin and Varugis Kurien Pingmesh A Large-Scale System for DataCenter Network Latency Measurement and Analysis In SIGCOMM2015

[20] Keqiang He Eric Rozner Kanak Agarwal Wes Felter John Carter andAditya Akella Presto Edge-based Load Balancing for Fast DatacenterNetworks In SIGCOMM 2015

[21] CE Hopps Analysis of an Equal-Cost Multi-Path Algorithm In RFC2992

[22] Shuihai Hu Kai Chen Haitao Wu Wei Bai Chang Lan Hao WangHongze Zhao and Chuanxiong Guo Explicit Path Control in Com-modity Data Centers Design and Applications In NSDI 2015

[23] Abdul Kabbani Balajee Vamanan Jahangir Hasan and Fabien Duch-ene FlowBender Flow-level Adaptive Routing for Improved Latencyand Throughput in Datacenter Networks In CoNEXT 2014

[24] Naga Katta Mukesh Hira Aditi Ghag Changhoon Kim Isaac Keslassyand Jennifer Rexford CLOVE How I Learned to Stop Worrying aboutthe Core and Love the Edge In HotNets 2016

[25] Naga Katta Mukesh Hira Changhoon Kim Anirudh Sivaraman andJennifer Rexford HULA Scalable Load Balancing Using ProgrammableData Planes In SOSR 2016

[26] Radhika Mittal Nandita Dukkipati Emily Blem Hassan Wassel Mo-nia Ghobadi Amin Vahdat Yaogong Wang David Wetherall DavidZats and others TIMELY RTT-based Congestion Control for theDatacenter In SIGCOMM 2015

[27] Michael Mitzenmacher 2000 How Useful Is Old Information IEEETPDS 11 1 (2000) 6ndash20

[28] Michael Mitzenmacher 2001 The Power of Two Choices in Random-ized Load Balancing IEEE TPDS 12 10 (2001) 1094ndash1104

[29] Michael Mitzenmacher Balaji Prabhakar and Devavrat Shah LoadBalancing with Memory In FOCS 2002

[30] Jayakrishnan Nair Adam Wierman and Bert Zwart The Funda-mentals of Heavy-tails Properties Emergence and Identication InSIGMETRICS 2013

[31] Costin Raiciu Sebastien Barre Christopher Pluntke Adam Green-halgh Damon Wischik and Mark Handley Improving DatacenterPerformance and Robustness with Multipath TCP In SIGCOMM 2011

[32] Arjun Roy Hongyi Zeng Jasmeet Bagga and Alex C Snoeren PassiveRealtime Datacenter Fault Detection and Localization In NSDI 2017

[33] Shan Sinha Srikanth Kandula and Dina Katabi Harnessing TCPrsquosBurstiness with Flowlet Switching In HotNets 2004

[34] Balajee Vamanan Jahangir Hasan and TN Vijaykumar Deadline-Aware Datacenter TCP (D2TCP) In SIGCOMM 2012

[35] Peng Wang Hong Xu Zhixiong Niu Dongsu Han and YongqiangXiong Expeditus Congestion-aware Load Balancing in Clos DataCenter Networks In SoCC 2016

Abstract

1 Introduction

2 Background and Motivation

21 Uncertainties

22 Drawbacks of Prior Solutions in Handling Uncertainties

3 Design

31 Comprehensive Sensing

32 Timely yet Cautious Rerouting

33 Parameter Settings

4 Implementation

5 Evaluation

51 Methodology


53 Deep Dive with Large Simulations


6 Discussion

7 Related Work

8 Conclusion

References