A Load-Balanced Switch with an Arbitrary Number of Linecards
Isaac Keslassy, Shang-Tse (Da) Chuang, Nick McKeown
Stanford University
Stanford 100Tb/s Router
“Optics in Routers” project http://yuba.stanford.edu/or/
Some challenging numbers:
  100 Tb/s router
  R = 160 Gb/s linecard rate
  N = 640 linecards
Performance guarantees
Router Wish List
Scale to High Linecard Speeds
  No Centralized Scheduler
  Optical Switch Fabric
  Low Packet-Processing Complexity
Scale to High Number of Linecards
  High Number of Linecards
  Arbitrary Arrangement of Linecards
Provide Performance Guarantees
  100% Throughput Guarantee
  Delay Guarantee
  No Packet Reordering
Load-Balanced Switch
[Figure: inputs (In) and outputs (Out), each at rate R, joined by a load-balancing mesh followed by a forwarding mesh; every mesh link runs at R/N; packets labeled 1, 2, 3]
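The link rates in the two meshes can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: it assumes that each input spreads its packets round-robin over the intermediate linecards regardless of destination, and the names `simulate` and `dest_of_input` are hypothetical.

```python
from collections import Counter

def simulate(n, packets_per_input, dest_of_input):
    """Count packets carried by every link of the two meshes.

    Assumption: input i sends its k-th packet to intermediate (i + k) % n,
    whatever the packet's destination; the intermediate then forwards it
    to the packet's output.
    """
    first_mesh = Counter()   # (input, intermediate) -> packets
    second_mesh = Counter()  # (intermediate, output) -> packets
    for i in range(n):
        for k in range(packets_per_input):
            mid = (i + k) % n
            first_mesh[(i, mid)] += 1
            second_mesh[(mid, dest_of_input[i])] += 1
    return first_mesh, second_mesh

n = 4
# Each input sends all of its traffic to a single output.
dest = [(i + 1) % n for i in range(n)]
first, second = simulate(n, 8, dest)
print(max(first.values()), max(second.values()))  # prints "2 2"
```

With 8 packets per input and N = 4, no link in either mesh carries more than 8/4 = 2 packets, which corresponds to the R/N link rate in the figure.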
Load-Balanced Switch
[Figure: the same two-mesh diagram (In and Out at rate R, mesh links at R/N), with the packets labeled 1, 2, 3 repositioned]
Combining the Two Meshes
[Figure: the two-mesh diagram (In and Out at rate R, mesh links at R/N), with an In and an Out marked together as one linecard]
A Single Combined Mesh
[Figure: linecards, each with an In and an Out at rate R, joined by a single mesh whose links each run at 2R/N]
References on Early Work
Initial Work: C.-S. Chang, D.-S. Lee and Y.-S. Jou, "Load Balanced Birkhoff-von Neumann Switches, part I: One-Stage Buffering," Computer Communications, Vol. 25, pp. 611-622, 2002.
Sigcomm'03: I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N. McKeown, "Scaling Internet Routers Using Optics," ACM SIGCOMM '03, Karlsruhe, Germany, August 2003.
Summary of Early Work
Scheduler
  Initial Work (C.-S. Chang et al.): No centralized scheduler
  Sigcomm'03: No centralized scheduler
Architecture
  Initial Work (C.-S. Chang et al.): Crossbar-based architecture
  Sigcomm'03: Mesh-based architecture => no reconfiguration; Single Mesh
Performance guarantees
  Initial Work (C.-S. Chang et al.): 100% throughput guarantee for weakly-mixing traffic
  Sigcomm'03: 100% throughput guarantee for any adversarial traffic; Average delay within constant from output-queued router; No packet reordering
Example: N=8
[Figure: linecards numbered 1 to 8 joined by a single mesh whose links each run at 2R/8]
When N is Too Large
Decompose into groups (or racks)
[Figure: linecards numbered 1 to 8, each connected at 2R, decomposed into groups; diagram labels: 2R, 4R, 4R/4]
When N is Too Large
Decompose into groups (or racks)
[Figure: Group/Rack 1 to Group/Rack G, each holding linecards 1 to L connected at 2R; each group aggregates 2RL, and the links between groups each run at 2RL/G]
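The effect of the decomposition on link count and link rate can be checked numerically. This is a rough sketch under stated assumptions: links are counted as ordered pairs including a node's link to itself, the helper names are hypothetical, and the values G = 40 and L = 16 are the ones used in the examples later in this deck.

```python
def flat_mesh(n, rate_gbps):
    """Single combined mesh: every linecard pair gets a 2R/N link."""
    return {"links": n * n, "link_rate_gbps": 2 * rate_gbps / n}

def grouped_mesh(groups, linecards_per_group, rate_gbps):
    """Mesh between groups: every group pair gets a 2RL/G link."""
    return {
        "links": groups * groups,
        "link_rate_gbps": 2 * rate_gbps * linecards_per_group / groups,
    }

R = 160  # Gb/s linecard rate, from the "challenging numbers" slide
flat = flat_mesh(640, R)
grouped = grouped_mesh(40, 16, R)
print(flat)     # 409600 links at 0.5 Gb/s each
print(grouped)  # 1600 links at 128.0 Gb/s each
```

Grouping trades a very large number of thin links for far fewer, fatter links between racks.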
When Linecards are Missing
Failures, Incremental Additions, and Removals…
[Figure: Group/Rack 1 to Group/Rack G, each holding linecards 1 to L connected at 2R; diagram labels: 2RL per group, 2RL/G between groups]
Solution: replace mesh with sum of permutations
[Figure: a mesh with 2RL/G links drawn as the sum (= + +) of permutations, each with 2RL/G links; diagram labels: ≤ 2RL, G * 2RL/G]
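The decomposition of a uniform mesh into permutations can be sketched as follows. This is an illustrative example only: it assumes cyclic-shift permutations, which is one convenient choice and not necessarily the one used in the talk, and the function names are hypothetical.

```python
def cyclic_permutations(g):
    """Return g permutations; the k-th maps group i to group (i + k) % g."""
    return [[(i + k) % g for i in range(g)] for k in range(g)]

def sum_of_permutations(perms, rate_per_perm):
    """Add up the rate each permutation provides between every group pair."""
    g = len(perms[0])
    total = [[0.0] * g for _ in range(g)]
    for perm in perms:
        for src, dst in enumerate(perm):
            total[src][dst] += rate_per_perm
    return total

G, L, R = 4, 2, 160           # small illustrative values
link = 2 * R * L / G          # 2RL/G, the rate of one mesh link
mesh = sum_of_permutations(cyclic_permutations(G), link)

# Every group pair ends up with exactly one mesh link's worth of rate.
assert all(entry == link for row in mesh for entry in row)
```

Each group pair is covered by exactly one of the G permutations, so the sum reproduces the uniform mesh with 2RL/G on every link.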
Hybrid Electro-Optical Architecture
Using MEMS Switches
[Figure: Group/Rack 1 to Group/Rack G, each holding linecards 1 to L connected at 2R; the groups are electronic (Electronics) and are interconnected optically (Optics) through MEMS switches]
When Linecards are Missing
[Figure: Group/Rack 1 to Group/Rack G, each holding linecards 1 to L connected at 2R, interconnected through MEMS switches]
Questions
Number of MEMS Switches?
TDM Schedule?
All Link Capacities Are Equal
[Figure: Group/Rack 1 to Group/Rack G, each holding linecards 1 to L connected at 2R, interconnected through MEMS switches; each link is fed by a laser/modulator and a MUX and carries ≤ 2R]
Link Capacity ≈ 64 λ's * 5 Gb/s/λ = 320 Gb/s = 2R
Example: 2 Groups of 2 Linecards
[Figure: Group/Rack 1 and Group/Rack 2, each holding linecards 1 and 2 connected at 2R (4R per group), joined by 2R links]
Intuition on Worst-Case
[Figure: Group/Rack 1 holds linecards 1 to L, while Group/Rack 2 to Group/Rack G hold one linecard each; every linecard connects at 2R, and links through the MEMS switches carry ≤ 2R; diagram labels: 2RL, L, G-1]
Number of MEMS Switches
Theorem: M ≤ L+G-1
Examples:
  N = 640, L = 16, G = 40 ⇒ M ≤ 55
  L = G = √N ⇒ M ≤ 2√N
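The bound can be evaluated directly for the examples above. This is a small sketch; the function name `mems_switch_bound` is hypothetical, and N = 625 is chosen here only because it is a perfect square.

```python
import math

def mems_switch_bound(linecards_per_group, groups):
    """Upper bound on the number of MEMS switches: M <= L + G - 1."""
    return linecards_per_group + groups - 1

# N = 640 linecards arranged as G = 40 groups of L = 16.
print(mems_switch_bound(16, 40))  # prints 55

# Square arrangement, L = G = sqrt(N): the bound stays below 2 * sqrt(N).
n = 625
side = math.isqrt(n)
square_bound = mems_switch_bound(side, side)
print(square_bound, 2 * side)  # prints "49 50"
```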
Questions
Number of MEMS Switches?
TDM Schedule?
TDM Schedule
[Figure: Group A and Group B, each holding linecards 1 and 2 connected at 2R (4R per group), joined by 2R links]
TDM Schedule
                        T+1  T+2  T+3  T+4
Tx Group A   Tx LC A1    ?    ?    ?    ?
             Tx LC A2    ?    ?    ?    ?
Tx Group B   Tx LC B1    ?    ?    ?    ?
             Tx LC B2    ?    ?    ?    ?
TDM Schedule
                        T+1  T+2  T+3  T+4
Tx Group A   Tx LC A1    A1   A2   B1   B2
             Tx LC A2    B2   A1   A2   B1
Tx Group B   Tx LC B1    B1   B2   A1   A2
             Tx LC B2    A2   B1   B2   A1
Bad TDM Schedule
                        T+1  T+2  T+3  T+4
Tx Group A   Tx LC A1    A1   A2   B1   B2
             Tx LC A2    B2   A1   A2   B1
Tx Group B   Tx LC B1    B1   B2   A1   A2
             Tx LC B2    A2   B1   B2   A1
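One way to see why a schedule like the one in the table can be bad is to count, per time slot, how many connections each pair of groups needs. The following is a sketch under an explicit assumption: in the 2-group example each group pair (including a group to itself) can carry at most one linecard connection per slot. The checker and its names are hypothetical, not the authors' tool.

```python
from collections import Counter

# Receiver chosen by each transmitting linecard in slots T+1..T+4 (from the table).
SCHEDULE = {
    "A1": ["A1", "A2", "B1", "B2"],
    "A2": ["B2", "A1", "A2", "B1"],
    "B1": ["B1", "B2", "A1", "A2"],
    "B2": ["A2", "B1", "B2", "A1"],
}

def group_of(linecard):
    """Group name is the first character of the linecard name."""
    return linecard[0]

def check_schedule(schedule, max_per_group_pair=1):
    """Return a list of human-readable constraint violations."""
    linecards = sorted(schedule)
    num_slots = len(next(iter(schedule.values())))
    problems = []
    for tx, row in schedule.items():
        if sorted(row) != linecards:
            problems.append(f"{tx} does not reach every linecard exactly once")
    for slot in range(num_slots):
        receivers = [schedule[tx][slot] for tx in linecards]
        if sorted(receivers) != linecards:
            problems.append(f"slot T+{slot + 1}: receivers collide")
        pairs = Counter(
            (group_of(tx), group_of(schedule[tx][slot])) for tx in linecards
        )
        for (src, dst), count in sorted(pairs.items()):
            if count > max_per_group_pair:  # assumed per-slot group capacity
                problems.append(
                    f"slot T+{slot + 1}: {count} connections "
                    f"from group {src} to group {dst}"
                )
    return problems

problems = check_schedule(SCHEDULE)
for line in problems:
    print(line)
```

Under that assumption, every row and column of the table is a valid permutation, yet slots T+2 and T+4 each ask for two connections between the same pair of groups.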
TDM Schedule Algorithm
Intuition
1. Create TDM schedule between groups: “Group A sends to group B”
2. Assign group connections to specific linecards: “Linecard A1 sends to linecard B3”
Theorem: There exists a polynomial-time algorithm to find a correct TDM schedule.
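The two-step intuition can be illustrated for one easy special case. This sketch is not the polynomial-time algorithm from the theorem: it assumes all linecards are present, that L = G, and that each group pair should carry exactly one connection per slot. The function names and the cyclic construction are hypothetical.

```python
from collections import Counter

def build_schedule(num_groups):
    """TDM schedule for G groups of L = G linecards (special case only).

    Linecards are (group, index) pairs; schedule[tx][slot] is the receiver.
    """
    g_count = l_count = num_groups
    schedule = {}
    for g in range(g_count):
        for a in range(l_count):
            row = []
            for slot in range(g_count * l_count):
                k, j = divmod(slot, g_count)
                dst_group = (g + a + j) % g_count  # step 1: group-level schedule
                dst_card = (g + k) % l_count       # step 2: linecard within group
                row.append((dst_group, dst_card))
            schedule[(g, a)] = row
    return schedule

def is_valid(schedule, num_groups):
    """Check rows, columns, and one connection per group pair per slot."""
    linecards = sorted(schedule)
    if any(sorted(row) != linecards for row in schedule.values()):
        return False  # some sender misses or repeats a receiver
    for slot in range(len(linecards)):
        receivers = [schedule[tx][slot] for tx in linecards]
        if sorted(receivers) != linecards:
            return False  # two senders hit the same receiver
        pairs = Counter((tx[0], schedule[tx][slot][0]) for tx in linecards)
        if len(pairs) != num_groups ** 2 or set(pairs.values()) != {1}:
            return False  # group-to-group load is uneven in this slot
    return True

print(is_valid(build_schedule(2), 2))  # prints True
```

Step 1 decides which group each linecard targets in a slot, and step 2 picks the linecard inside that group; handling missing linecards or L ≠ G needs the more general algorithm the theorem refers to.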
Algorithm Running Time
[Figure: running time in milliseconds (y-axis, 0 to 40) versus number of linecards (x-axis, bins from 0-49 to 600-639); series: Worst Case, Average Case, Best Case]
[Verilog simulation, linecard placement generated uniformly-at-random among 40 groups, 4ns clock cycle, 1000 runs per case. Source: Srikanth Arekapudi]
Open Questions
Greedy TDM algorithm with more capacity?
A better switch fabric architecture?
Thank you.