The Fork-Join Router Nick McKeown Assistant Professor of Electrical Engineering and Computer...

High Performance Switching and Routing, Telecom Center Workshop: Sept 4, 1997.

The Fork-Join Router

Nick McKeown
Assistant Professor of Electrical Engineering and Computer Science, Stanford University
nickm@stanford.edu
http://www.stanford.edu/~nickm

Outline

• Quick Background on Packet Switches

• What’s the problem? “What if data rates exceed memory bandwidth?”

• The Fork-Join Router

• Parallel Packet Switches

First Generation Packet Switches

Figure: line interfaces (each with DMA and MAC) and a CPU with buffer memory, all attached to a shared backplane. Annotations: fixed-length “DMA” blocks or cells, reassembled on the egress linecard; fixed-length cells or variable-length packets.

Second Generation Packet Switches

Figure: a CPU with buffer memory and three line cards, each with DMA, MAC and local buffer memory.

Third Generation Packet Switches

Figure: line cards (each with MAC and local buffer memory) and a CPU card, connected by a switched backplane.

Fourth Generation Packet Switches

Two Basic Techniques

• Input-queued Crossbar: 1+1 = 2 operations per cell time

• Shared Memory: N+N = 2N operations per cell time
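The operation counts above translate directly into a memory access-time budget. The following is a minimal sketch, assuming a fixed cell size and a per-port line rate; the function names and the example figures (64-byte cells, 10 Gb/s ports, 16 ports) are illustrative assumptions, not values from the talk.

```python
def memory_ops_per_cell_time(n_ports: int, architecture: str) -> int:
    """Memory operations each buffer must sustain in one cell time."""
    if architecture == "input_queued":
        # One write (arriving cell) plus one read (departing cell).
        return 1 + 1
    if architecture == "shared_memory":
        # Up to N writes and N reads hit the single shared memory.
        return n_ports + n_ports
    raise ValueError(f"unknown architecture: {architecture}")


def required_access_time_ns(cell_bytes: int, line_rate_gbps: float, ops: int) -> float:
    """Access time needed to fit `ops` operations into one cell time."""
    # bits / (Gb/s) gives nanoseconds.
    cell_time_ns = cell_bytes * 8 / line_rate_gbps
    return cell_time_ns / ops


if __name__ == "__main__":
    for arch in ("input_queued", "shared_memory"):
        ops = memory_ops_per_cell_time(16, arch)
        budget = required_access_time_ns(64, 10.0, ops)
        print(f"{arch}: {ops} ops per cell time, {budget:.2f} ns per operation")
```

The shared memory budget shrinks with the port count N, while the input-queued budget depends only on the line rate.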

Shared Memory: The Ideal

Figure: cells (shown as letters) queued in a shared-memory switch.

A large body of work has proven and made possible:
– Fairness
– Delay Guarantees
– Delay Variation Control
– Loss Guarantees
– Statistical Guarantees

Precise Emulation of an Output Queued Switch

Figure: an N×N output-queued switch compared (“= ?”) with an N×N combined input-output queued switch and its scheduler.

Result

Theorem: A speedup of 2 - 1/N is necessary and sufficient for a combined input- and output-queued switch to precisely emulate an output-queued switch for all traffic.

Joint work with Balaji Prabhakar at Stanford.
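To give a feel for the bound, the sketch below simply evaluates 2 - 1/N for a few port counts. The function name and the chosen values of N are assumptions for illustration.

```python
def cioq_speedup(n_ports: int) -> float:
    """Speedup 2 - 1/N stated in the theorem for an N-port switch."""
    return 2 - 1 / n_ports


if __name__ == "__main__":
    for n in (2, 4, 16, 64, 1024):
        print(f"N = {n:5d}: speedup = {cioq_speedup(n):.4f}")
```

The value stays below 2 for every N and approaches 2 as the switch grows.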

Outline

• Quick Background on Packet Switches

• What’s the problem? “What if data rates exceed memory bandwidth?”

• The Fork-Join Router

• Parallel Packet Switches

Buffer Memory: How Fast Can I Make a Packet Buffer?

Figure: buffer memory built from 5ns SRAM, with a 64-byte wide bus on each side.

Rough Estimate:
– 5ns per memory operation.
– Two memory operations per packet.
– Therefore, maximum 51.2Gb/s.
– In practice, closer to 40Gb/s.
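The rough estimate can be reproduced in a few lines. This is a sketch of the arithmetic only, using the 64-byte bus, 5ns access time and two operations per packet from the slide; the function name is an assumption.

```python
def max_buffer_throughput_gbps(bus_bytes: int = 64,
                               access_ns: float = 5.0,
                               ops_per_packet: int = 2) -> float:
    """Upper bound on packet buffer throughput in Gb/s."""
    bits_per_operation = bus_bytes * 8          # 512 bits move per access
    raw_gbps = bits_per_operation / access_ns   # bits per ns equals Gb/s
    # Every packet is written once and read once, so halve the raw rate.
    return raw_gbps / ops_per_packet


if __name__ == "__main__":
    print(f"{max_buffer_throughput_gbps():.1f} Gb/s")
```

The raw bus rate is 102.4 Gb/s; sharing it between one write and one read per packet gives the 51.2 Gb/s ceiling.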

Buffer Memory: Is It Going to Get Better?

Figure: two plots against time. One shows Specmarks, memory size and gate density; the other shows memory bandwidth (to core).

Optical Physical Layers… …are Going to Make Things “Worse”

DWDM:
– More λ’s per fiber ⇒ more “ports” per switch.
– # ports: 16, …, 1000’s.

Data rate:
– More b/s per λ ⇒ higher capacity.
– Data rates: 2.5Gb/s, 10Gb/s, 40Gb/s, 160Gb/s, …

Approach #1: Ping-pong Buffering

Figure: two buffer memories, each with its own 64-byte wide bus.

Memory bandwidth doubled to ~80 Gb/s
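One way to picture ping-pong buffering is as two memory banks where, in every cell time, the read goes to the bank holding the oldest cell and the write is steered to the other bank, so each bank performs at most one operation. The sketch below is an illustrative model under that assumption, not the design from the talk; the class and attribute names are hypothetical.

```python
from collections import deque


class PingPongBuffer:
    """FIFO built from two banks, each doing at most one operation per cell time."""

    def __init__(self):
        self.banks = (deque(), deque())
        self.order = deque()          # bank index of each queued cell, oldest first
        self.ops_this_slot = [0, 0]   # operations per bank in the latest cell time

    def cell_time(self, arriving=None, depart=False):
        """Advance one cell time: optionally read the oldest cell and write a new one."""
        self.ops_this_slot = [0, 0]
        read_bank = None
        departed = None
        if depart and self.order:
            read_bank = self.order.popleft()
            departed = self.banks[read_bank].popleft()
            self.ops_this_slot[read_bank] += 1
        if arriving is not None:
            if read_bank is None:
                # No read this slot: write to the less-occupied bank.
                write_bank = 0 if len(self.banks[0]) <= len(self.banks[1]) else 1
            else:
                # Steer the write away from the bank being read.
                write_bank = 1 - read_bank
            self.banks[write_bank].append(arriving)
            self.order.append(write_bank)
            self.ops_this_slot[write_bank] += 1
        return departed


if __name__ == "__main__":
    buf = PingPongBuffer()
    for t in range(6):
        out = buf.cell_time(arriving=f"cell{t}", depart=(t >= 2))
        print(t, out, buf.ops_this_slot)
```

Because a read and a write never share a bank within a cell time, each bank only needs to sustain one operation per cell time instead of two, which is where the doubling comes from. One known cost of this arrangement is that occupancy can become unbalanced between the two banks.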

Approach #2: Multiple Parallel Buffers

aka Banking, Interleaving

Figure: four buffer memories in parallel.

Outline

• Quick Background on Packet Switches

• What’s the problem? “What if data rates exceed memory bandwidth?”

• The Fork-Join Router

• Parallel Packet Switches

The Fork-Join Router

Figure: N input links and N output links, each at rate R, connected through k parallel routers (1, 2, …, k). Labels: “Router”, “Bufferless”.

The Fork-Join Router

• Advantages
– k × memory bandwidth
– k × lookup/classification rate
– k × routing/classification table size

• Problems
– How to demultiplex prior to lookup/classification?
– How does the system perform/behave?
– Can we predict/guarantee performance?

Outline

• Quick Background on Packet Switches

• What’s the problem? “What if data rates exceed memory bandwidth?”

• The Fork-Join Router

• Parallel Packet Switches

A Parallel Packet Switch

Figure: N input links and N output links, each at rate R, connected through k parallel output-queued switches (1, 2, …, k).

Parallel Packet Switch: Questions

1. Can it be work-conserving?
2. Can it emulate a single big output queued switch?
3. Can it support delay guarantees, strict-priorities, WFQ, …?
4. What happens with multicast?

Parallel Packet Switch: Work Conservation

Figure: one input and one output at rate R, connected through layers 1, 2, …, k by internal links of rate R/k. Annotations: Input Link Constraint; Output Link Constraint.

Parallel Packet Switch: Work Conservation

Figure: the same arrangement (rate R links, layers 1, 2, …, k, internal links of rate R/k) with numbered cells passing through the layers. Annotation: Output Link Constraint.

Parallel Packet Switch: Work Conservation

Figure: N input links and N output links at rate R, connected through k output-queued switches (1, 2, …, k) by internal links of rate S(R/k).
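One reading of the input link constraint is that, with internal links running at S(R/k), a cell arriving at rate R occupies an internal link for k/S cell times, so an input cannot reuse the same layer until that link is free again. The sketch below models that reading in discrete cell times. The function names and the "first free layer" choice are assumptions made for illustration; it is not the demultiplexing algorithm behind the talk's results.

```python
import math


def busy_slots(k: int, speedup: float) -> int:
    """Cell times an internal link of rate S(R/k) needs to carry one cell."""
    return math.ceil(k / speedup)


def available_layers(last_sent, now, k, speedup):
    """Layers whose link from this input is free at cell time `now`."""
    hold = busy_slots(k, speedup)
    return [layer for layer in range(k)
            if last_sent[layer] is None or now - last_sent[layer] >= hold]


def spread_cells(num_slots: int, k: int, speedup: float):
    """Send one cell per cell time to the first layer whose link is free."""
    last_sent = [None] * k
    choices = []
    for now in range(num_slots):
        free = available_layers(last_sent, now, k, speedup)
        layer = free[0]  # placeholder policy; needs speedup >= 1 to be non-empty
        last_sent[layer] = now
        choices.append(layer)
    return choices


if __name__ == "__main__":
    print(spread_cells(8, 4, 1.0))
    print(spread_cells(8, 4, 2.0))
```

With S = 1 the input has exactly one legal layer in steady state; a larger S widens the choice, which is the freedom a demultiplexer needs in order to also respect the output link constraint.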

Precise Emulation of an Output Queued Switch

Figure: an N×N output-queued switch compared (“= ?”) with an N×N parallel packet switch.

Parallel Packet Switch: Theorems

1. If S > 2k/(k+2) (which approaches 2 for large k), then a parallel packet switch can be work-conserving for all traffic.

2. If S > 2k/(k+2) (which approaches 2 for large k), then a parallel packet switch can precisely emulate a FCFS output-queued switch for all traffic.

Parallel Packet Switch: Theorems

3. If S > 3k/(k+3) (which approaches 3 for large k), then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
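The two thresholds are easy to tabulate. The sketch below evaluates 2k/(k+2) and 3k/(k+3) for a few values of k; the function names and the sample values of k are assumptions.

```python
def fcfs_speedup_bound(k: int) -> float:
    """Threshold 2k/(k+2) from theorems 1 and 2."""
    return 2 * k / (k + 2)


def qos_speedup_bound(k: int) -> float:
    """Threshold 3k/(k+3) from theorem 3."""
    return 3 * k / (k + 3)


if __name__ == "__main__":
    print(" k   2k/(k+2)   3k/(k+3)")
    for k in (2, 4, 8, 16, 64, 256):
        print(f"{k:3d}   {fcfs_speedup_bound(k):8.3f}   {qos_speedup_bound(k):8.3f}")
```

Both thresholds grow with k but stay strictly below 2 and 3 respectively, so the required speedup is bounded no matter how many layers are used.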

An aside: Unbuffered Clos Circuit Switch

Expansion factor required = 2 - 1/N

Clos Network

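Figure: three-stage Clos network with input switches I1 … IX and output switches O1 … OX, each with m ports, and R middle-stage switches (a, b, c, …). Beside it, a connection matrix with rows I1 … IX and columns O1 … OX, whose entries name the middle-stage switch carrying each connection.

<= min(R,m) entries in each row
<= min(R,m) entries in each column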

Define:
– UIL(Ii) = used links at switch Ii to connect to middle stages.
– UOL(Oi) = used links at switch Oi to connect to middle stages.

If we wish to connect Ii to Oi:
– When adding the connection: |UIL(Ii)| <= m-1 and |UOL(Oi)| <= m-1.
– Worst case: |UIL(Ii) U UOL(Oi)| = 2m-2.
– Therefore, if R >= 2m-1 (that is, R > 2m-2), there is always a free middle-stage switch.
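The counting argument can be checked with sets. In the sketch below, middle-stage switches are numbered 0 to R-1 and the worst case places the m-1 busy links at Ii and the m-1 busy links at Oi on disjoint middle-stage switches. The function names are assumptions, and the check covers only this worst case, not a full routing algorithm.

```python
def free_middle_switches(num_middle, used_input_links, used_output_links):
    """Middle-stage switches reachable from both Ii and Oi."""
    return set(range(num_middle)) - set(used_input_links) - set(used_output_links)


def worst_case_free(m: int, num_middle: int):
    """Free middle-stage switches when UIL and UOL are as large and disjoint as possible."""
    if num_middle < 2 * m - 2:
        raise ValueError("worst case needs at least 2m-2 middle-stage switches")
    uil = set(range(m - 1))              # m-1 links already used at Ii
    uol = set(range(m - 1, 2 * m - 2))   # m-1 different links used at Oi
    return free_middle_switches(num_middle, uil, uol)


if __name__ == "__main__":
    m = 4
    for r in (2 * m - 2, 2 * m - 1):
        print(f"m = {m}, R = {r}: free middle-stage switches = {sorted(worst_case_free(m, r))}")
```

With R = 2m-2 the worst case leaves no free middle-stage switch, while R = 2m-1 always leaves one, which gives an expansion of (2m-1)/m = 2 - 1/m.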

An aside: Unbuffered Clos Circuit Switch

Expansion factor required = 2 - 1/N

Expansion: 2 - 4/(k+2) = 2k/(k+2)

Fork-Join Router Project: What’s next?

• Theory:
– Extending results to distributed algorithms.
– Extending results to multicast.

• Implementation/Prototyping:
– Under discussion...
