Hedera: Dynamic Flow Scheduling for Data Center Networks
Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat
- USENIX NSDI 2010 -
Presenter: Jason Tsung-Cheng Hou
Advisor: Wanjiun Liao
Dec. 22nd, 2011
Problem
• Relying on multipathing, due to…
– Limited port densities of routers/switches
– Horizontal expansion
• Multi-rooted tree topologies
– Example: fat-tree / Clos
Problem
• BW demand is essential and volatile
– Must route among multiple paths
– Avoid bottlenecks and deliver aggregate BW
• However, current multipath routing…
– Mostly: flow-hash-based ECMP
– Static and oblivious to link utilization
– Causes long-term large-flow collisions (see the sketch after this slide)
• Inefficiently utilizing path diversity
– Need a protocol or a scheduler
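To make the collision problem concrete, here is a minimal sketch of static flow-hash path selection. The hash below is only illustrative (real switches use their own hash over header fields); the point is that two long-lived elephant flows hashing to the same index share one core-bound link for their whole lifetime, regardless of load.

```python
import hashlib

def ecmp_path(flow_5tuple, num_equal_cost_paths):
    """Pick an upward path by hashing the flow's 5-tuple (illustrative only)."""
    digest = hashlib.md5(repr(flow_5tuple).encode()).hexdigest()
    return int(digest, 16) % num_equal_cost_paths

# Two long-lived flows that happen to hash to the same index will contend on
# one upward link no matter how idle the other equal-cost paths are.
flow_a = ("10.0.0.1", "10.4.0.1", 34567, 80, "TCP")
flow_b = ("10.1.0.2", "10.5.0.3", 45678, 80, "TCP")
print(ecmp_path(flow_a, 4), ecmp_path(flow_b, 4))
```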
Collisions of elephant flows
• Collisions in two ways: Upward or Downward
[Figure: fat-tree with flows S1→D1 through S4→D4 colliding on shared upward/downward links]
Equal Cost Paths
• Many equal cost paths going up to the core switches
• Only one path down from each core switch
• Need to find a good flow-to-core mapping
Goal
• Given dynamic flow demands
– Need to find paths that maximize network bisection BW
– No end-host modifications
• However, local switch information cannot find a proper allocation
– Need a central scheduler
– Must use commodity Ethernet switches
– OpenFlow
Architecture
• Detect Large Flows
– Flows that need bandwidth but are network-limited
• Estimate Flow Demands
– Use max-min fairness to allocate flows between SD pairs
• Allocate Flows
– Use estimated demands to heuristically find better placements of large flows on the equal-cost paths
– Arrange switches and iterate again
[Figure: control loop — Detect Large Flows → Estimate Flow Demands → Allocate Flows]
Architecture
• Feedback loop
• Optimize achievable bisection BW by assigning flow-to-core mappings
• Heuristics of flow demand estimation and placement
• Central Scheduler
– Global knowledge of all links in the network
– Control tables of all switches (OpenFlow)
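A minimal sketch of this feedback loop, with hypothetical stand-in functions for the three stages (the real scheduler talks to switches over OpenFlow rather than calling local stubs):

```python
import time

# Hypothetical stand-ins for the stages named on the slide, for illustration only.
def detect_large_flows():            return []   # poll edge switches
def estimate_demands(flows):         return {}   # max-min fair natural rates
def allocate_flows(flows, demands):  return {}   # Global First-Fit or Simulated Annealing
def install_placement(placement):    pass        # push flow entries via OpenFlow

def scheduler_loop(poll_interval_s, rounds):
    """One pass per polling interval: detect -> estimate -> allocate -> install."""
    for _ in range(rounds):
        flows = detect_large_flows()
        demands = estimate_demands(flows)
        placement = allocate_flows(flows, demands)
        install_placement(placement)
        time.sleep(poll_interval_s)

scheduler_loop(poll_interval_s=0.0, rounds=1)
```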
Elephant Detection
• Scheduler polls edge switches
– Flows exceeding a threshold are "large"
– Threshold: 10% of the host link capacity (> 100 Mbps)
• Small flows: default ECMP hashing
• Hedera complements ECMP
– Default forwarding is ECMP
– Only schedules large flows that contribute to bisection BW bottlenecks
• Centralizes only the essential functions
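A rough sketch of the selection rule above; the stats layout and names are assumptions for illustration, not Hedera's code.

```python
LINK_CAPACITY_BPS = 1_000_000_000           # 1 Gbps host links
THRESHOLD_BPS = 0.10 * LINK_CAPACITY_BPS    # 10% of host link capacity

def detect_large_flows(edge_switch_stats):
    """Return flows whose measured rate exceeds the elephant threshold.

    edge_switch_stats: dict mapping (src, dst) -> rate in bit/s, as gathered
    by polling edge-switch flow counters.
    """
    return [flow for flow, rate in edge_switch_stats.items()
            if rate > THRESHOLD_BPS]

# Example: only the second flow is scheduled by Hedera; the first stays on ECMP.
stats = {("h1", "h9"): 40_000_000, ("h2", "h12"): 400_000_000}
print(detect_large_flows(stats))   # [('h2', 'h12')]
```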
Demand Estimation
• Current flow rate: misleading
– May already be constrained by the network
• Need to find a flow's "natural" BW demand when not limited by the network
– As if only limited by the NIC of S or D
• Allocate S/D capacity among flows using max-min fairness
• Equals the BW allocation of optimal routing; input to the placement algorithm
Demand Estimation
• Given the set of large flows (SD pairs), modify each flow's estimate at S/D iteratively
– Sender distributes unconverged BW among its flows
– Rate-limited receiver: redistributes BW among excessive-demand flows
– Repeat until all flows converge
• Guaranteed to converge in O(|F|)
– Linear in the number of flows
Demand Estimation (example, step 1: senders)
[Figure: hosts A, B, C send to hosts X, Y; flows A→X, A→Y, B→Y, C→Y]

Flow  Estimate  Conv.?
A→X   -         -
A→Y   -         -
B→Y   -         -
C→Y   -         -

Sender  Available unconv. BW  Flows  Share
A       1                     2      1/2
B       1                     1      1
C       1                     1      1
Demand Estimation (example, step 2: receivers)

Flow  Estimate  Conv.?
A→X   1/2       -
A→Y   1/2       -
B→Y   1         -
C→Y   1         -

Receiver  Rate-limited?  Non-sender-limited flows  Share
X         No             -                         -
Y         Yes            3                         1/3
Demand Estimation (example, step 3: senders)

Flow  Estimate  Conv.?
A→X   1/2       -
A→Y   1/3       Yes
B→Y   1/3       Yes
C→Y   1/3       Yes

Sender  Available unconv. BW  Flows  Share
A       2/3                   1      2/3
B       0                     0      0
C       0                     0      0
Demand Estimation (example, step 4: receivers)

Flow  Estimate  Conv.?
A→X   2/3       Yes
A→Y   1/3       Yes
B→Y   1/3       Yes
C→Y   1/3       Yes

Receiver  Rate-limited?  Non-sender-limited flows  Share
X         No             -                         -
Y         No             -                         -
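The iterative sender/receiver procedure walked through above can be sketched as follows. This is an illustrative implementation assuming unit-capacity NICs, not the paper's code; running it on the A/B/C → X/Y example reproduces the final estimates.

```python
from collections import defaultdict

def estimate_demands(flows, capacity=1.0):
    """Max-min fair 'natural demand' estimation for flows given as (src, dst)
    pairs, all NICs normalized to capacity 1.0. Returns {(src, dst): demand}."""
    demand = {f: 0.0 for f in flows}
    converged = {f: False for f in flows}
    by_src, by_dst = defaultdict(list), defaultdict(list)
    for f in flows:
        by_src[f[0]].append(f)
        by_dst[f[1]].append(f)

    changed = True
    while changed:
        changed = False
        # Sender step: split each sender's unconverged capacity evenly among
        # its not-yet-converged flows.
        for src, fs in by_src.items():
            unconv = [f for f in fs if not converged[f]]
            if not unconv:
                continue
            leftover = capacity - sum(demand[f] for f in fs if converged[f])
            share = leftover / len(unconv)
            for f in unconv:
                if demand[f] != share:
                    demand[f] = share
                    changed = True
        # Receiver step: if a receiver is oversubscribed, cap the flows that
        # ask for more than an equal share and mark them converged.
        for dst, fs in by_dst.items():
            if sum(demand[f] for f in fs) <= capacity:
                continue
            limited, cap_left = list(fs), capacity
            while True:   # peel off sender-limited flows below the equal share
                share = cap_left / len(limited)
                small = [f for f in limited if demand[f] < share]
                if not small:
                    break
                cap_left -= sum(demand[f] for f in small)
                limited = [f for f in limited if f not in small]
            for f in limited:
                if not converged[f] or demand[f] != share:
                    demand[f], converged[f] = share, True
                    changed = True
    return demand

# The example from the slides: A sends to X and Y; B and C each send to Y.
print(estimate_demands([("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]))
# -> roughly {('A','X'): 2/3, ('A','Y'): 1/3, ('B','Y'): 1/3, ('C','Y'): 1/3}
```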
Placement Heuristics
• Find a good large-flow-to-core mapping
– such that average bisection BW is maximized
• Two approaches
• Global First-Fit: greedily choose the first path that has sufficient unreserved BW
– O([ports/switch]^2)
• Simulated Annealing: iteratively search for a globally better mapping of flows to paths
– O(# flows)
Global First-Fit
• When a new large flow is found, linearly search all equal-cost paths from S to D
• Place it on the first path whose links can all fit the flow
• Once the flow ends, forwarding entries and reservations time out
[Figure: scheduler searches cores 0–3 for the first path from S to D that fits flows A, B, C]
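A minimal sketch of the heuristic described on this slide; the path/link naming and the `reserved` bookkeeping dictionary are assumptions for illustration, not the paper's implementation.

```python
def global_first_fit(flow_demand, candidate_paths, reserved, link_capacity=1.0):
    """Place one new elephant on the first equal-cost path that can fit it.

    candidate_paths: list of paths, each a list of link identifiers.
    reserved: dict link -> bandwidth already reserved for scheduled elephants.
    Returns the chosen path (updating reservations), or None if nothing fits.
    """
    for path in candidate_paths:
        if all(reserved.get(link, 0.0) + flow_demand <= link_capacity
               for link in path):
            for link in path:
                reserved[link] = reserved.get(link, 0.0) + flow_demand
            return path
    return None   # fall back to the flow's current ECMP path

# Toy example: the second flow avoids the links already half-reserved by the first.
paths = [["e1-a1", "a1-c1"], ["e1-a2", "a2-c2"]]
reservations = {}
print(global_first_fit(0.6, paths, reservations))   # first path
print(global_first_fit(0.6, paths, reservations))   # second path
```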
Simulated Annealing
• Annealing: letting metal cool down to obtain a better crystal structure
– Heating up to enter a higher-energy state
– Cooling to a lower-energy state with a better structure, stopping at a target temperature
• Simulated Annealing:
– Search the neighborhood for possible states
– Probabilistically accept worse states
– Accept better states, settling gradually
– Avoids local minima
Simulated Annealing
• State / state space
– Possible solutions
• Energy
– Objective function
• Neighborhood
– Other candidate states
• Boltzmann's function
– Probability of moving to a higher-energy state
• Control temperature
– Current temperature affects the probability of accepting a higher-energy state
• Cooling schedule
– How the temperature falls
• Stopping criterion
Acceptance probability (Boltzmann's function), with Δf = f(X') − f(X):
p(X → X') = 1                 if Δf ≤ 0
p(X → X') = exp(−Δf / T)      if Δf > 0
Simulated Annealing
• State space:
– All possible large-flow-to-core mappings
– However, flows to the same destination map to the same core
– Reduces the state space, as long as there are not too many large flows and the threshold is proper
• Neighborhood:
– Swap the assigned cores of two hosts within the same pod, attached to the same edge / aggregation switch
– Avoids local minima
Simulated Annealing
• Energy:
– Based on the estimated demands of flows
– Total exceeded BW capacity over all links; to be minimized
• Temperature: remaining iterations
• Probability: the Boltzmann acceptance function above
• Final state is published to the switches and used as the initial state for the next round
• Incremental calculation of exceeded capacity
– No recalculation over all links; only for newly found large flows and neighborhood swaps
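The loop below is a generic simulated-annealing sketch following the recipe on these slides (temperature = remaining iterations, Boltzmann acceptance). The toy energy and neighbor functions are placeholders for illustration, not Hedera's exceeded-capacity model or its pod-constrained core swap.

```python
import math, random

def simulated_annealing(initial_state, energy, neighbor, iterations=1000):
    """Always accept better states; accept worse ones with prob. exp(-(E'-E)/T),
    where the temperature T is the number of remaining iterations."""
    state, e = initial_state, energy(initial_state)
    best_state, best_e = state, e
    for remaining in range(iterations, 0, -1):
        temperature = remaining
        candidate = neighbor(state)
        e_new = energy(candidate)
        if e_new <= e or random.random() < math.exp(-(e_new - e) / temperature):
            state, e = candidate, e_new
            if e < best_e:
                best_state, best_e = state, e
    return best_state, best_e

# Tiny illustration with a made-up energy function: assign each of 4
# destination hosts to one of 2 cores and minimize the core-load imbalance.
def toy_energy(assignment):
    load = [0, 0]
    for core in assignment.values():
        load[core] += 1
    return abs(load[0] - load[1])

def toy_neighbor(assignment):
    new = dict(assignment)
    host = random.choice(list(new))
    new[host] = 1 - new[host]          # move one host to the other core
    return new

start = {"h0": 0, "h1": 0, "h2": 0, "h3": 0}
print(simulated_annealing(start, toy_energy, toy_neighbor, iterations=200))
```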
Evaluation
Implementation
• 16 hosts, k=4 fat-tree data plane
– 20 switches: 4-port NetFPGAs / OpenFlow
– Parallel 48-port non-blocking Quanta switch
– 1 scheduler, OpenFlow control protocol
– Testbed: PortLand
Simulator
• k=32; 8,192 hosts
– Packet-level simulators not applicable
– 1 Gbps for 8k hosts takes 2.5×10^11 packets
• Model TCP flows
– TCP's AIMD when constrained by topology
– Poisson arrival of flows
– No packet size variations
– No bursty traffic
– No inter-flow dynamics
[Results figures: PortLand/OpenFlow testbed (k=4) and the k=32 simulator]
Reactiveness
• Demand Estimation:
– 27K hosts, 250K flows, converges < 200ms
• Simulated Annealing:
– Asymptotically dependent on # of flows + # iter., 50K flows and 1K iter.: 11ms
– Most of final bisection BW: few hundred iter.
• Scheduler control loop:
– Polling + estimation + SA = 145 ms for 27K hosts
Comments
• Flows destined to the same host go via the same core
– May congest at the cores, but how severe?
– Large flows to/from a host: < k/2
– No proof, no evaluation
• Decreases search space and runtime
– Scalable on a per-flow basis? For large k?
• No protection for mice flows or RPCs
– Only assumes they work well under ECMP
– Does not address routing them alongside large flows
Comments
• Own flow-level simulator
– Aims to saturate the network
– No breakdown of flow counts by flow size
– Traffic generation: average flow size and Poisson arrival rates with a given mean
– Only these descriptions, no specific numbers
– Too idealized, or not volatile enough?
– Average bisection BW reported, but where are real-time graphs?
• States that per-flow VLB = per-flow ECMP
– Does not compare with other options (VL2)
– No further elaboration
Comments
• Shared responsibility
– Controller only deals with critical situations
– Switches perform default measures
– Improves performance and saves time
– How to strike a balance?
– Adapt to different problems?
• Default multipath routing
– States the problems of per-flow VLB and ECMP
– How about per-packet? The authors' future work
– How to improve switches' default actions?
Comments
• Critical controller actions
– Considers that large flows degrade overall efficiency
– What are the critical situations?
– How to detect and react?
– How to improve reactiveness and adaptability?
• Amin Vahdat's lab
– Proposes the fat-tree topology
– Develops PortLand L2 virtualization
– Hedera: enhances multipath performance
– Integrates all of the above
References
• M. Al-Fares et al., "Hedera: Dynamic Flow Scheduling for Data Center Networks," USENIX NSDI 2010
• Tathagata Das, "Hedera: Dynamic Flow Scheduling for Data Center Networks," UC Berkeley course CS 294
• M. Al-Fares, "Hedera: Dynamic Flow Scheduling for Data Center Networks," USENIX NSDI 2010, presentation slides
Supplement
Fault-Tolerance
• Link / switch failure
– Use PortLand's fault notification protocol
– Hedera routes around failed components
[Figure: scheduler routes flows A, B, C around failed components among cores 0–3]
Fault-Tolerance
• Scheduler failure
– Soft state, not required for correctness (connectivity)
– Switches fall back to ECMP
[Figure: with the scheduler down, flows A, B, C are hashed across cores 0–3 by ECMP]
Limitations
• Dynamic workloads, large flow turnover faster than control loop
– Scheduler will be continually chasing the traffic matrix
• Need to include penalty term for unnecessary SA flow re-assignments
[Figure: regions where ECMP vs. Hedera applies, plotted by flow size (x-axis) and traffic matrix stability, stable to unstable (y-axis)]