A Load-Balanced Switch with an Arbitrary Number of Linecards

Isaac Keslassy, Shang-Tse (Da) Chuang, Nick McKeown

Stanford University

Stanford 100Tb/s Router

“Optics in Routers” project http://yuba.stanford.edu/or/

Some challenging numbers: 100 Tb/s total capacity, R = 160 Gb/s linecard rate, N = 640 linecards

Performance guarantees

Router Wish List

Scale to High Linecard Speeds: No Centralized Scheduler; Optical Switch Fabric; Low Packet-Processing Complexity

Scale to High Number of Linecards: High Number of Linecards; Arbitrary Arrangement of Linecards

Provide Performance Guarantees: 100% Throughput Guarantee; Delay Guarantee; No Packet Reordering

Load-Balanced Switch

[Figure: three linecards. Each input (In) and each output (Out) runs at rate R. A load-balancing mesh followed by a forwarding mesh connects them, and every mesh link runs at rate R/N. The packets in the example are labeled 1, 2, 3.]

Combining the Two Meshes

[Figure: the In and Out sides of the two meshes, with one In/Out pair marked as one linecard.]

A Single Combined Mesh

[Figure: linecards, each with an input (In) and an output (Out) at rate R, joined by a single mesh whose links run at rate 2R/N.]
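As a rough check of the R/N and 2R/N link rates, the sketch below counts how often each mesh link is used during one frame of N time slots. It assumes, which the slides do not state, that every linecard spreads its packets over the linecards in a fixed round-robin order; the function name `mesh_link_rates` is hypothetical.

```python
from fractions import Fraction

def mesh_link_rates(n):
    """Return (per_mesh_rate, combined_rate) of one link, as fractions of R.

    Assumption: in slot t, linecard i sends to linecard (i + t) mod n.
    """
    uses = [[0] * n for _ in range(n)]
    for t in range(n):                  # one frame of n slots
        for i in range(n):
            uses[i][(i + t) % n] += 1

    # Uniform spreading means every link is used equally often per frame.
    counts = {uses[i][j] for i in range(n) for j in range(n)}
    if len(counts) != 1:
        raise AssertionError("links are not used uniformly")

    per_mesh = Fraction(counts.pop(), n)   # share of the line rate R on one link
    combined = 2 * per_mesh                # one physical link carries both meshes
    return per_mesh, combined

print(mesh_link_rates(3))
print(mesh_link_rates(8))
```

With N = 8 the combined rate comes out as 2/8 of R, matching the 2R/8 label in the N = 8 example.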

References on Early Work

Initial Work: C.-S. Chang, D.-S. Lee and Y.-S. Jou, "Load Balanced Birkhoff-von Neumann Switches, part I: One-Stage Buffering," Computer Communications, Vol. 25, pp. 611-622, 2002.

SIGCOMM '03: I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N. McKeown, "Scaling Internet Routers Using Optics," ACM SIGCOMM '03, Karlsruhe, Germany, August 2003.

Summary of Early Work

Scheduler
  Initial Work (C.-S. Chang et al.): No centralized scheduler
  SIGCOMM '03: No centralized scheduler

Architecture
  Initial Work (C.-S. Chang et al.): Crossbar-based architecture
  SIGCOMM '03: Mesh-based architecture => no reconfiguration; single mesh

Performance guarantees
  Initial Work (C.-S. Chang et al.): 100% throughput guarantee for weakly-mixing traffic
  SIGCOMM '03: 100% throughput guarantee for any adversarial traffic; average delay within a constant from an output-queued router; no packet reordering





Example: N = 8

[Figure: eight linecards, numbered 1 to 8, connected by a single mesh; each link runs at rate 2R/8.]

When N is Too Large: Decompose into groups (or racks)

[Figure: the eight linecards of the N = 8 example split into two groups of four. Link labels: 2R at the linecards, 4R between the groups.]

[Figure: the general case, with G groups (racks) of L linecards each. Link labels: 2R at the linecards, 2RL per group, 2RL/G on each group-to-group link.]





When Linecards are Missing: Failures, Incremental Additions, and Removals…

[Figure: G groups (racks) of up to L linecards each. Link labels: 2R at the linecards, 2RL per group, 2RL/G on the group-to-group mesh links, and 2RL on one highlighted link.]

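Solution: replace mesh with sum of permutations

[Figure: a mesh with links of rate 2RL/G, drawn as a sum of permutations, each on links of rate 2RL/G: mesh = permutation + permutation + permutation.]

Rate between two groups ≤ 2RL = G * (2RL/G)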
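To make the "sum of permutations" idea concrete, here is a minimal sketch for the uniform case only. It writes a G x G group-level mesh as the sum of G cyclic-shift permutations, each carried at rate 2RL/G. The choice of cyclic shifts and the helper names are assumptions for illustration; the values R = 160 Gb/s, L = 16 and G = 40 are the ones used elsewhere in the talk.

```python
def cyclic_permutations(g):
    """Return g permutation matrices; the k-th connects group i to group (i + k) mod g."""
    return [
        [[1 if j == (i + k) % g else 0 for j in range(g)] for i in range(g)]
        for k in range(g)
    ]

def sum_matrices(mats):
    """Element-wise sum of equally sized square matrices."""
    g = len(mats[0])
    return [[sum(m[i][j] for m in mats) for j in range(g)] for i in range(g)]

R, L, G = 160, 16, 40              # Gb/s, linecards per group, number of groups
link_rate = 2 * R * L // G         # 2RL/G, the rate of one permutation link

perms = cyclic_permutations(G)
mesh = sum_matrices(perms)

# Every group pair is connected by exactly one of the G permutations,
# so the sum reproduces the full mesh of 2RL/G links.
is_full_mesh = all(mesh[i][j] == 1 for i in range(G) for j in range(G))

# Each group sends on one link per permutation: G * 2RL/G = 2RL in total.
per_group_total = G * link_rate

print(is_full_mesh, link_rate, per_group_total)
```

The point of the decomposition is that each permutation can be realized by one switch configuration, whereas a fixed mesh cannot shift capacity between group pairs when linecards are missing.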

Hybrid Electro-Optical Architecture: Using MEMS Switches

[Figure: G groups (racks) of L linecards each, every linecard attached at 2R. The groups are interconnected through MEMS switches. The group side is labeled Electronics, the MEMS switches in the middle are labeled Optics.]

When Linecards are Missing

[Figure: the hybrid architecture again, with G groups (racks) of up to L linecards attached at 2R and interconnected through MEMS switches.]





Questions

Number of MEMS Switches?

TDM Schedule?

All Link Capacities Are Equal

[Figure: G groups (racks) of L linecards, each attached at 2R, interconnected through MEMS switches. Each link uses a laser/modulator and a MUX, and every link into the MEMS switches carries ≤ 2R.]

Link Capacity ≈ 64 λ’s * 5 Gb/s/λ = 320 Gb/s = 2R

Example: 2 Groups of 2 Linecards

[Figure: Group/Rack 1 and Group/Rack 2, two linecards each. Link labels: 2R at the linecards, 4R per group, 2R on the links between the groups.]

Intuition on Worst-Case

[Figure: Group/Rack 1 holds L linecards, while Groups/Racks 2 through G hold a single linecard each. Link labels: 2R at the linecards, 2RL for Group 1, ≤ 2R on each link into the MEMS switches. The counts L and G-1 are marked.]

Number of MEMS Switches

Theorem: M ≤ L+G-1

Examples:
L = G = √N: M ≈ 2√N
N = 640, L = 16, G = 40: M ≤ 55
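The bound is easy to tabulate. The sketch below evaluates L + G - 1 for the N = 640 example and for a square arrangement with L = G = √N. The function name `mems_switch_bound` is hypothetical, and N = 625 is an illustrative value chosen only because it is a perfect square; it does not appear in the talk.

```python
import math

def mems_switch_bound(l, g):
    """Upper bound on the number of MEMS switches from the theorem: M <= L + G - 1."""
    return l + g - 1

# Example from the talk: N = 640 linecards as G = 40 groups of L = 16.
n, l, g = 640, 16, 40
assert l * g == n
print(mems_switch_bound(l, g))          # 55

# Square arrangement L = G = sqrt(N), using an illustrative N = 625.
n_square = 625
side = math.isqrt(n_square)
print(mems_switch_bound(side, side))    # 49, i.e. 2 * sqrt(N) - 1
```

In both cases the bound is far below N, which is the argument for grouping linecards instead of building a flat mesh.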

Questions

Number of MEMS Switches?

TDM Schedule?

TDM Schedule

[Figure: Group A and Group B, two linecards each. Link labels: 2R at the linecards, 4R per group, 2R on the links between the groups.]

TDM Schedule

            T+1  T+2  T+3  T+4
Tx LC A1     ?    ?    ?    ?
Tx LC A2     ?    ?    ?    ?
Tx LC B1     ?    ?    ?    ?
Tx LC B2     ?    ?    ?    ?

Tx Group A

Tx Group B

TDM Schedule

            T+1  T+2  T+3  T+4
Tx LC A1    A1   A2   B1   B2
Tx LC A2    B2   A1   A2   B1
Tx LC B1    B1   B2   A1   A2
Tx LC B2    A2   B1   B2   A1

Tx Group A

Tx Group B

Bad TDM Schedule

            T+1  T+2  T+3  T+4
Tx LC A1    A1   A2   B1   B2
Tx LC A2    B2   A1   A2   B1
Tx LC B1    B1   B2   A1   A2
Tx LC B2    A2   B1   B2   A1

Tx Group A

Tx Group B
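The schedule in the table can be checked mechanically. The sketch below verifies two linecard-level properties (each slot is a permutation, and each transmitter reaches every receiver once per frame) and then counts, per slot, the largest number of simultaneous connections between any one pair of groups. Reading the "bad" label as referring to that uneven group-to-group demand is an assumption, as are the helper names.

```python
from collections import Counter

# Rows: transmitting linecard. Columns: receiver in slots T+1 .. T+4.
SCHEDULE = {
    "A1": ["A1", "A2", "B1", "B2"],
    "A2": ["B2", "A1", "A2", "B1"],
    "B1": ["B1", "B2", "A1", "A2"],
    "B2": ["A2", "B1", "B2", "A1"],
}

def group_of(linecard):
    """Group name is the leading letter of the linecard name, e.g. 'A1' -> 'A'."""
    return linecard[0]

def check_schedule(schedule):
    linecards = sorted(schedule)
    slots = range(len(linecards))

    # Each slot must be a permutation: every receiver is used exactly once.
    slot_is_permutation = all(
        sorted(schedule[tx][t] for tx in linecards) == linecards for t in slots
    )
    # Over the frame, each transmitter must reach every receiver exactly once.
    full_coverage = all(sorted(schedule[tx]) == linecards for tx in linecards)

    # Largest number of simultaneous connections between one pair of groups.
    peak_per_slot = []
    for t in slots:
        pairs = Counter(
            (group_of(tx), group_of(schedule[tx][t])) for tx in linecards
        )
        peak_per_slot.append(max(pairs.values()))
    return slot_is_permutation, full_coverage, peak_per_slot

print(check_schedule(SCHEDULE))
```

The schedule passes both linecard-level checks, but in slots T+2 and T+4 two connections fall on the same group pair, while in T+1 and T+3 every group pair carries exactly one.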

TDM Schedule Algorithm

Intuition

1. Create TDM schedule between groups: “Group A sends to group B”

2. Assign group connections to specific linecards: “Linecard A1 sends to linecard B3”

Theorem: There exists a polynomial-time algorithm to find a correct TDM schedule.
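The two-step intuition can be illustrated with a much simpler construction than the general algorithm. The sketch below is not the algorithm from the theorem: it only handles the special case of G full groups of L linecards with L divisible by G, and the function names are hypothetical. Step 1 picks the destination group, step 2 picks the linecard inside that group, and a checker confirms that every slot is a permutation with exactly L/G connections between each pair of groups.

```python
from collections import Counter

def build_tdm_schedule(g, l):
    """Schedule for g full groups of l linecards (requires l % g == 0).

    Returns {(src_group, src_lc): [(dst_group, dst_lc) for each slot]}.
    """
    if l % g != 0:
        raise ValueError("this simplified sketch needs l to be a multiple of g")
    n = g * l
    schedule = {}
    for src_group in range(g):
        for src_lc in range(l):
            row = []
            for t in range(n):
                s, u = t % g, t // g
                # Step 1: group-level schedule ("group A sends to group B").
                dst_group = (src_group + src_lc % g + s) % g
                # Step 2: pick the specific linecard inside that group.
                dst_lc = (src_lc + u) % l
                row.append((dst_group, dst_lc))
            schedule[(src_group, src_lc)] = row
    return schedule

def is_valid(schedule, g, l):
    """Check permutation per slot, even group demand, and full coverage."""
    everyone = sorted(schedule)
    for t in range(g * l):
        if sorted(schedule[tx][t] for tx in everyone) != everyone:
            return False                      # some receiver used twice
        pairs = Counter((tx[0], schedule[tx][t][0]) for tx in everyone)
        if len(pairs) != g * g or set(pairs.values()) != {l // g}:
            return False                      # uneven group-to-group demand
    # Each transmitter reaches every receiver exactly once per frame.
    return all(sorted(row) == everyone for row in schedule.values())

print(is_valid(build_tdm_schedule(2, 2), 2, 2))
print(is_valid(build_tdm_schedule(4, 8), 4, 8))
```

The hard part addressed by the theorem is the arbitrary case, where groups hold different numbers of linecards and this kind of symmetric construction no longer applies.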

Algorithm Running Time

[Figure: running time in milliseconds (y-axis, 0 to 40) versus number of linecards (x-axis, bins from 0-49 to 600-639), with series for Worst Case, Average Case, and Best Case.]

[Verilog simulation, linecard placement generated uniformly-at-random among 40 groups, 4ns clock cycle, 1000 runs per case. Source: Srikanth Arekapudi]

Open Questions

Greedy TDM algorithm with more capacity?

A better switch fabric architecture?

Thank you.
