Hedera: Dynamic Flow Scheduling for Data Center Networks
Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat
- USENIX NSDI 2010 -
Presenter: Jason Tsung-Cheng Hou
Advisor: Wanjiun Liao
Dec. 22nd, 2011
Problem
• Relying on multipathing, due to…
– Limited port densities of routers/switches
– Horizontal expansion
• Multi-rooted tree topologies
– Example: fat-tree / Clos
Problem
• BW demand is essential and volatile
– Must route among multiple paths
– Avoid bottlenecks and deliver aggregate BW
• However, current multipath routing…
– Mostly: flow-hash-based ECMP
– Static and oblivious to link utilization
– Causes long-term large-flow collisions (see the sketch after this slide)
• Inefficiently utilizing path diversity
– Need a protocol or a scheduler
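To make the collision problem concrete, here is a minimal sketch of static flow-hash path selection. The hash below is only illustrative (real switches use their own hash over header fields); the point is that two long-lived elephant flows hashing to the same index share one core-bound link for their whole lifetime, regardless of load.

```python
import hashlib

def ecmp_path(flow_5tuple, num_equal_cost_paths):
    """Pick an upward path by hashing the flow's 5-tuple (illustrative only)."""
    digest = hashlib.md5(repr(flow_5tuple).encode()).hexdigest()
    return int(digest, 16) % num_equal_cost_paths

# Two long-lived flows that happen to hash to the same index will contend on
# one upward link no matter how idle the other equal-cost paths are.
flow_a = ("10.0.0.1", "10.4.0.1", 34567, 80, "TCP")
flow_b = ("10.1.0.2", "10.5.0.3", 45678, 80, "TCP")
print(ecmp_path(flow_a, 4), ecmp_path(flow_b, 4))
```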
Collisions of elephant flows
• Collisions in two ways: Upward or Downward
[Figure: fat-tree with flows S1→D1 through S4→D4 colliding on shared upward/downward links]
Equal Cost Paths
• Many equal cost paths going up to the core switches
• Only one path down from each core switch
• Need to find a good flow-to-core mapping
Goal
• Given dynamic flow demands
– Need to find paths that maximize network bisection BW
– No end-host modifications
• However, local switch information cannot find a proper allocation
– Need a central scheduler
– Must use commodity Ethernet switches
– OpenFlow
Architecture
• Detect Large Flows
– Flows that need bandwidth but are network-limited
• Estimate Flow Demands
– Use max-min fairness to allocate flows between SD pairs
• Allocate Flows
– Use estimated demands to heuristically find better placements of large flows on the equal-cost paths
– Arrange switches and iterate again
[Figure: control loop — Detect Large Flows → Estimate Flow Demands → Allocate Flows]
Architecture
• Feedback loop
• Optimize achievable bisection BW by assigning flow-to-core mappings
• Heuristics of flow demand estimation and placement
• Central Scheduler
– Global knowledge of all links in the network
– Control tables of all switches (OpenFlow)
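A minimal sketch of this feedback loop, with hypothetical stand-in functions for the three stages (the real scheduler talks to switches over OpenFlow rather than calling local stubs):

```python
import time

# Hypothetical stand-ins for the stages named on the slide, for illustration only.
def detect_large_flows():            return []   # poll edge switches
def estimate_demands(flows):         return {}   # max-min fair natural rates
def allocate_flows(flows, demands):  return {}   # Global First-Fit or Simulated Annealing
def install_placement(placement):    pass        # push flow entries via OpenFlow

def scheduler_loop(poll_interval_s, rounds):
    """One pass per polling interval: detect -> estimate -> allocate -> install."""
    for _ in range(rounds):
        flows = detect_large_flows()
        demands = estimate_demands(flows)
        placement = allocate_flows(flows, demands)
        install_placement(placement)
        time.sleep(poll_interval_s)

scheduler_loop(poll_interval_s=0.0, rounds=1)
```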
Elephant Detection
• Scheduler polls edge switches
– Flows exceeding a threshold are "large"
– Threshold: 10% of the host link capacity (> 100 Mbps)
• Small flows: default ECMP hashing
• Hedera complements ECMP
– Default forwarding is ECMP
– Only schedules large flows that contribute to bisection BW bottlenecks
• Centralizes only the essential functions
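A rough sketch of the selection rule above; the stats layout and names are assumptions for illustration, not Hedera's code.

```python
LINK_CAPACITY_BPS = 1_000_000_000           # 1 Gbps host links
THRESHOLD_BPS = 0.10 * LINK_CAPACITY_BPS    # 10% of host link capacity

def detect_large_flows(edge_switch_stats):
    """Return flows whose measured rate exceeds the elephant threshold.

    edge_switch_stats: dict mapping (src, dst) -> rate in bit/s, as gathered
    by polling edge-switch flow counters.
    """
    return [flow for flow, rate in edge_switch_stats.items()
            if rate > THRESHOLD_BPS]

# Example: only the second flow is scheduled by Hedera; the first stays on ECMP.
stats = {("h1", "h9"): 40_000_000, ("h2", "h12"): 400_000_000}
print(detect_large_flows(stats))   # [('h2', 'h12')]
```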
Demand Estimation
• Current flow rate: misleading
– May already be constrained by the network
• Need to find a flow's "natural" BW demand when not limited by the network
– As if only limited by the NIC of S or D
• Allocate S/D capacity among flows using max-min fairness
• Equals the BW allocation of optimal routing; input to the placement algorithm
Demand Estimation
• Given the set of large flows (SD pairs), modify each flow's estimate at S/D iteratively
– Sender distributes unconverged BW among its flows
– Rate-limited receiver: redistributes BW among excessive-demand flows
– Repeat until all flows converge
• Guaranteed to converge in O(|F|)
– Linear in the number of flows
Demand Estimation (example, step 1: senders)
[Figure: hosts A, B, C send to hosts X, Y; flows A→X, A→Y, B→Y, C→Y]

Flow  Estimate  Conv.?
A→X   -         -
A→Y   -         -
B→Y   -         -
C→Y   -         -

Sender  Available unconv. BW  Flows  Share
A       1                     2      1/2
B       1                     1      1
C       1                     1      1
Demand Estimation (example, step 2: receivers)

Flow  Estimate  Conv.?
A→X   1/2       -
A→Y   1/2       -
B→Y   1         -
C→Y   1         -

Receiver  Rate-limited?  Non-sender-limited flows  Share
X         No             -                         -
Y         Yes            3                         1/3
Demand Estimation (example, step 3: senders)

Flow  Estimate  Conv.?
A→X   1/2       -
A→Y   1/3       Yes
B→Y   1/3       Yes
C→Y   1/3       Yes

Sender  Available unconv. BW  Flows  Share
A       2/3                   1      2/3
B       0                     0      0
C       0                     0      0
Demand Estimation (example, step 4: receivers)

Flow  Estimate  Conv.?
A→X   2/3       Yes
A→Y   1/3       Yes
B→Y   1/3       Yes
C→Y   1/3       Yes

Receiver  Rate-limited?  Non-sender-limited flows  Share
X         No             -                         -
Y         No             -                         -
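The iterative sender/receiver procedure walked through above can be sketched as follows. This is an illustrative implementation assuming unit-capacity NICs, not the paper's code; running it on the A/B/C → X/Y example reproduces the final estimates.

```python
from collections import defaultdict

def estimate_demands(flows, capacity=1.0):
    """Max-min fair 'natural demand' estimation for flows given as (src, dst)
    pairs, all NICs normalized to capacity 1.0. Returns {(src, dst): demand}."""
    demand = {f: 0.0 for f in flows}
    converged = {f: False for f in flows}
    by_src, by_dst = defaultdict(list), defaultdict(list)
    for f in flows:
        by_src[f[0]].append(f)
        by_dst[f[1]].append(f)

    changed = True
    while changed:
        changed = False
        # Sender step: split each sender's unconverged capacity evenly among
        # its not-yet-converged flows.
        for src, fs in by_src.items():
            unconv = [f for f in fs if not converged[f]]
            if not unconv:
                continue
            leftover = capacity - sum(demand[f] for f in fs if converged[f])
            share = leftover / len(unconv)
            for f in unconv:
                if demand[f] != share:
                    demand[f] = share
                    changed = True
        # Receiver step: if a receiver is oversubscribed, cap the flows that
        # ask for more than an equal share and mark them converged.
        for dst, fs in by_dst.items():
            if sum(demand[f] for f in fs) <= capacity:
                continue
            limited, cap_left = list(fs), capacity
            while True:   # peel off sender-limited flows below the equal share
                share = cap_left / len(limited)
                small = [f for f in limited if demand[f] < share]
                if not small:
                    break
                cap_left -= sum(demand[f] for f in small)
                limited = [f for f in limited if f not in small]
            for f in limited:
                if not converged[f] or demand[f] != share:
                    demand[f], converged[f] = share, True
                    changed = True
    return demand

# The example from the slides: A sends to X and Y; B and C each send to Y.
print(estimate_demands([("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]))
# -> roughly {('A','X'): 2/3, ('A','Y'): 1/3, ('B','Y'): 1/3, ('C','Y'): 1/3}
```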
Placement Heuristics
• Find a good large-flow-to-core mapping
– such that average bisection BW is maximized
• Two approaches
• Global First-Fit: greedily choose the first path that has sufficient unreserved BW
– O([ports/switch]^2)
• Simulated Annealing: iteratively search for a globally better mapping of flows to paths
– O(# flows)
Global First-Fit
• When a new large flow is found, linearly search all equal-cost paths from S to D
• Place it on the first path whose links can all fit the flow
• Once the flow ends, forwarding entries and reservations time out
[Figure: scheduler searches cores 0–3 for the first path from S to D that fits flows A, B, C]
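A minimal sketch of the heuristic described on this slide; the path/link naming and the `reserved` bookkeeping dictionary are assumptions for illustration, not the paper's implementation.

```python
def global_first_fit(flow_demand, candidate_paths, reserved, link_capacity=1.0):
    """Place one new elephant on the first equal-cost path that can fit it.

    candidate_paths: list of paths, each a list of link identifiers.
    reserved: dict link -> bandwidth already reserved for scheduled elephants.
    Returns the chosen path (updating reservations), or None if nothing fits.
    """
    for path in candidate_paths:
        if all(reserved.get(link, 0.0) + flow_demand <= link_capacity
               for link in path):
            for link in path:
                reserved[link] = reserved.get(link, 0.0) + flow_demand
            return path
    return None   # fall back to the flow's current ECMP path

# Toy example: the second flow avoids the links already half-reserved by the first.
paths = [["e1-a1", "a1-c1"], ["e1-a2", "a2-c2"]]
reservations = {}
print(global_first_fit(0.6, paths, reservations))   # first path
print(global_first_fit(0.6, paths, reservations))   # second path
```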
Simulated Annealing
• Annealing: letting metal cool down to obtain a better crystal structure
– Heating up to enter a higher-energy state
– Cooling to a lower-energy state with a better structure, stopping at a target temperature
• Simulated Annealing:
– Search the neighborhood for possible states
– Probabilistically accept worse states
– Accept better states, settling gradually
– Avoids local minima
Simulated Annealing
• State / state space
– Possible solutions
• Energy
– Objective function
• Neighborhood
– Other candidate states
• Boltzmann's function
– Probability of moving to a higher-energy state
• Control temperature
– Current temperature affects the probability of accepting a higher-energy state
• Cooling schedule
– How the temperature falls
• Stopping criterion
Acceptance probability (Boltzmann's function), with Δf = f(X') − f(X):
p(X → X') = 1                 if Δf ≤ 0
p(X → X') = exp(−Δf / T)      if Δf > 0
Simulated Annealing
• State space:
– All possible large-flow-to-core mappings
– However, flows to the same destination map to the same core
– Reduces the state space, as long as there are not too many large flows and the threshold is proper
• Neighborhood:
– Swap the assigned cores of two hosts within the same pod, attached to the same edge / aggregation switch
– Avoids local minima
Simulated Annealing
• Energy:
– Based on the estimated demands of flows
– Total exceeded BW capacity over all links; to be minimized
• Temperature: remaining iterations
• Probability: the Boltzmann acceptance function above
• Final state is published to the switches and used as the initial state for the next round
• Incremental calculation of exceeded capacity
– No recalculation over all links; only for newly found large flows and neighborhood swaps
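The loop below is a generic simulated-annealing sketch following the recipe on these slides (temperature = remaining iterations, Boltzmann acceptance). The toy energy and neighbor functions are placeholders for illustration, not Hedera's exceeded-capacity model or its pod-constrained core swap.

```python
import math, random

def simulated_annealing(initial_state, energy, neighbor, iterations=1000):
    """Always accept better states; accept worse ones with prob. exp(-(E'-E)/T),
    where the temperature T is the number of remaining iterations."""
    state, e = initial_state, energy(initial_state)
    best_state, best_e = state, e
    for remaining in range(iterations, 0, -1):
        temperature = remaining
        candidate = neighbor(state)
        e_new = energy(candidate)
        if e_new <= e or random.random() < math.exp(-(e_new - e) / temperature):
            state, e = candidate, e_new
            if e < best_e:
                best_state, best_e = state, e
    return best_state, best_e

# Tiny illustration with a made-up energy function: assign each of 4
# destination hosts to one of 2 cores and minimize the core-load imbalance.
def toy_energy(assignment):
    load = [0, 0]
    for core in assignment.values():
        load[core] += 1
    return abs(load[0] - load[1])

def toy_neighbor(assignment):
    new = dict(assignment)
    host = random.choice(list(new))
    new[host] = 1 - new[host]          # move one host to the other core
    return new

start = {"h0": 0, "h1": 0, "h2": 0, "h3": 0}
print(simulated_annealing(start, toy_energy, toy_neighbor, iterations=200))
```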
Evaluation
Implementation
• 16 hosts, k=4 fat-tree data plane
– 20 switches: 4-port NetFPGAs / OpenFlow
– Parallel 48-port non-blocking Quanta switch
– 1 scheduler, OpenFlow control protocol
– Testbed: PortLand
Simulator
• k=32; 8,192 hosts
– Packet-level simulators not applicable
– 1 Gbps for 8k hosts takes 2.5×10^11 packets
• Model TCP flows
– TCP's AIMD when constrained by topology
– Poisson arrival of flows
– No packet size variations
– No bursty traffic
– No inter-flow dynamics
[Results figures: PortLand/OpenFlow testbed (k=4) and the k=32 simulator]
Reactiveness
• Demand Estimation:
– 27K hosts, 250K flows, converges < 200ms
• Simulated Annealing:
– Asymptotically dependent on # of flows + # iter., 50K flows and 1K iter.: 11ms
– Most of final bisection BW: few hundred iter.
• Scheduler control loop:
– Polling + estimation + SA = 145 ms for 27K hosts
Comments
• Flows destined to the same host go via the same core
– May congest at the cores, but how severe?
– Large flows to/from a host: < k/2
– No proof, no evaluation
• Decreases search space and runtime
– Scalable on a per-flow basis? For large k?
• No protection for mice flows or RPCs
– Only assumes they work well under ECMP
– Does not address routing them alongside large flows
Comments
• Own flow-level simulator
– Aims to saturate the network
– No breakdown of flow counts by flow size
– Traffic generation: average flow size and Poisson arrival rates with a given mean
– Only these descriptions, no specific numbers
– Too idealized, or not volatile enough?
– Average bisection BW reported, but where are real-time graphs?
• States that per-flow VLB = per-flow ECMP
– Does not compare with other options (VL2)
– No further elaboration
Comments
• Shared responsibility
– Controller only deals with critical situations
– Switches perform default measures
– Improves performance and saves time
– How to strike a balance?
– Adapt to different problems?
• Default multipath routing
– States the problems of per-flow VLB and ECMP
– How about per-packet? The authors' future work
– How to improve switches' default actions?
Comments
• Critical controller actions
– Considers that large flows degrade overall efficiency
– What are the critical situations?
– How to detect and react?
– How to improve reactiveness and adaptability?
• Amin Vahdat's lab
– Proposes the fat-tree topology
– Develops PortLand L2 virtualization
– Hedera: enhances multipath performance
– Integrates all of the above
References
• M. Al-Fares et al., "Hedera: Dynamic Flow Scheduling for Data Center Networks," USENIX NSDI 2010
• Tathagata Das, "Hedera: Dynamic Flow Scheduling for Data Center Networks," UC Berkeley course CS 294
• M. Al-Fares, "Hedera: Dynamic Flow Scheduling for Data Center Networks," USENIX NSDI 2010, presentation slides
Supplement
Fault-Tolerance
• Link / switch failure
– Use PortLand's fault notification protocol
– Hedera routes around failed components
[Figure: scheduler routes flows A, B, C around failed components among cores 0–3]
Fault-Tolerance
• Scheduler failure
– Soft state, not required for correctness (connectivity)
– Switches fall back to ECMP
[Figure: with the scheduler down, flows A, B, C are hashed across cores 0–3 by ECMP]
Limitations
• Dynamic workloads, large flow turnover faster than control loop
– Scheduler will be continually chasing the traffic matrix
• Need to include penalty term for unnecessary SA flow re-assignments
[Figure: regions where ECMP vs. Hedera applies, plotted by flow size (x-axis) and traffic matrix stability, stable to unstable (y-axis)]