HPCA: 1 Feb 12, 2007
Interconnect-Centric Computing
William J. Dally
Computer Systems Laboratory
Stanford University
HPCA Keynote
February 12, 2007
HPCA: 2 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 3 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 4 Feb 12, 2007
INs: Connect Processors in Clusters
IBM Blue Gene
HPCA: 5 Feb 12, 2007
and on chip
MIT RAW
HPCA: 6 Feb 12, 2007
Connect Processors to Memories in Systems
Cray Black Widow
HPCA: 7 Feb 12, 2007
and on chip
Texas TRIPS
HPCA: 8 Feb 12, 2007
provide the fabric for network Switches and Routers
Avici TSR
HPCA: 9 Feb 12, 2007
and connect I/O Devices
Brocade Switch
HPCA: 10 Feb 12, 2007
Group History: Routing Chips & Interconnection Networks
• Mars Router, Torus Routing Chip, Network Design Frame, Reliable Router
• Basis for Intel, Cray/SGI, Mercury, Avici network chips
MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994)
HPCA: 11 Feb 12, 2007
Group History: Parallel Computer Systems
• J-Machine (MDP) led to Cray T3D/T3E
• M-Machine (MAP)
– Fast messaging, scalable processing nodes, scalable memory architecture
• Imagine – basis for SPI
MDP Chip, J-Machine, Cray T3D, MAP Chip, Imagine Chip
HPCA: 12 Feb 12, 2007
Interconnection Networks are THE Central Component of Modern Computer Systems
• Processors are a commodity
– Performance no longer scaling (ILP mined out)
– Future growth is through CMPs - connected by INs
• Memory is a commodity
– Memory system performance determined by interconnect
• I/O systems are largely interconnect
• Embedded systems built using SoCs
– Standard components
– Connected by on-chip INs (OCINs)
HPCA: 13 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 14 Feb 12, 2007
Technology Trends…

[Figure: bandwidth per router node (Gb/s) vs. year, 1985-2010, log scale from 0.1 to 10,000 Gb/s, with points for the Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, YARC, and BlackWidow]
HPCA: 15 Feb 12, 2007
High-Radix Router
HPCA: 16 Feb 12, 2007
High-Radix Router
[Figure: a low-radix router (small number of fat ports) vs. a high-radix router (large number of skinny ports)]
HPCA: 17 Feb 12, 2007
Low-Radix vs. High-Radix Router

[Figure: a network connecting 16 inputs (I0-I15) to 16 outputs (O0-O15), built from low-radix routers (4 hops, 96 channels) vs. high-radix routers (2 hops, 32 channels), with latency and cost compared]
HPCA: 18 Feb 12, 2007
Latency
Latency = H·tr + L/b
        = 2·tr·logk(N) + 2kL/B

where k = radix, B = total router bandwidth, N = # of nodes, L = message size
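For concreteness, here is a small sketch that evaluates this latency expression for a few radices. The router delay, bandwidth, message size, and node count below are illustrative assumptions, not numbers from the talk.

```python
import math

def network_latency(k, N, B, L, t_r):
    """Latency = H*t_r + L/b = 2*t_r*log_k(N) + 2*k*L/B.

    k   : router radix (ports per router)
    N   : number of nodes
    B   : total router bandwidth (bits/s)
    L   : message size (bits)
    t_r : per-hop router delay (s)
    """
    header = 2 * t_r * math.log(N, k)   # hop count H = 2*log_k(N), t_r per hop
    serialization = 2 * k * L / B       # per-port bandwidth b = B/(2k)
    return header + serialization

# Illustrative (assumed) parameters: 4K nodes, 1 Tb/s router, 1 kb messages, 20 ns hop delay
for k in (8, 16, 32, 64, 128):
    t = network_latency(k, N=4096, B=1e12, L=1024, t_r=20e-9)
    print(f"radix {k:4d}: {t * 1e9:6.1f} ns")
```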
HPCA: 19 Feb 12, 2007
Latency vs. Radix
[Figure: latency (nsec) vs. radix (0-250) for 2003 and 2010 technology. Header latency decreases with radix while serialization latency increases, giving an optimal radix of ~40 for 2003 technology and ~128 for 2010 technology]
HPCA: 20 Feb 12, 2007
Determining Optimal Radix
Latency = Header Latency + Serialization Latency
        = H·tr + L/b
        = 2·tr·logk(N) + 2kL/B

Minimizing over k gives the optimal radix:

k·log²k = (B·tr·log N) / L = Aspect Ratio

where k = radix, B = total router bandwidth, N = # of nodes, L = message size
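To make the aspect-ratio relation concrete, the sketch below solves k·log²k = A numerically for the optimal radix. The technology parameters are assumed values for illustration, not figures from the talk.

```python
import math

def aspect_ratio(B, t_r, N, L):
    """A = (B * t_r * log2(N)) / L, with B in bits/s, t_r in s, L in bits."""
    return B * t_r * math.log2(N) / L

def optimal_radix(A):
    """Solve k * (log2 k)^2 = A for k by bisection (the left side is increasing for k > 1)."""
    lo, hi = 2.0, 1e6
    f = lambda k: k * math.log2(k) ** 2 - A
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Illustrative (assumed) parameters: 10 Tb/s router, 10 ns hop delay, 1K nodes, 128-bit messages
A = aspect_ratio(B=1e13, t_r=10e-9, N=1024, L=128)
print(f"aspect ratio ~ {A:.0f}, optimal radix ~ {optimal_radix(A):.0f}")
```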
HPCA: 21 Feb 12, 2007
Higher Aspect Ratio, Higher Optimal Radix
[Figure: optimal radix k (1-1000) vs. aspect ratio (10-10,000) on log-log axes, with points for 1991, 1996, 2003, and 2010 technology]
HPCA: 22 Feb 12, 2007
High-Radix Topology
• Use high radix, k, to get low hop count
– H = logk(N)
• Provide good performance on both benign and adversarial traffic patterns
– Rules out butterfly networks - no path diversity
– Clos networks work well
• H = 2logk(N) - with short circuit
– Cayley graphs have nice properties but are hard to route
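For intuition, a quick sketch comparing the hop counts implied by the two formulas above. The node count and the use of ceil() are my own illustrative choices, not from the slide.

```python
import math

def hops_one_pass(k, N):
    """High-radix butterfly-style network: H = log_k(N), rounded up to whole hops."""
    return math.ceil(math.log(N, k))

def hops_folded_clos(k, N):
    """Folded Clos without short-circuiting: H = 2 * log_k(N)."""
    return 2 * math.ceil(math.log(N, k))

N = 32 * 1024  # example system size
for k in (4, 16, 64):
    print(f"radix {k:3d}: one-pass {hops_one_pass(k, N)} hops, "
          f"folded Clos {hops_folded_clos(k, N)} hops")
```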
HPCA: 23 Feb 12, 2007
Example radix-64 Clos Network
[Figure: radix-64 folded Clos with 1024 endpoints (BW0-BW1023); rank 1 routers Y0-Y31 each connect 32 endpoints, rank 2 routers Y32-Y63 connect the rank 1 routers]
HPCA: 24 Feb 12, 2007
Flattened Butterfly Topology
HPCA: 25 Feb 12, 2007
Packaging the Flattened Butterfly
HPCA: 26 Feb 12, 2007
Packaging the Flattened Butterfly (2)
HPCA: 27 Feb 12, 2007
Cost
HPCA: 28 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 29 Feb 12, 2007
Routing in High-Radix Networks
• Adaptive routing avoids transient load imbalance
• Global adaptive routing balances load for adversarial traffic
– Cost/perf of a butterfly on benign traffic and at low loads
– Cost/perf of a Clos on adversarial traffic
HPCA: 30 Feb 12, 2007
A Clos can statically load balance traffic using oblivious routing

[Figure: the radix-64 folded Clos again (rank 1 routers Y0-Y31, rank 2 routers Y32-Y63, endpoints BW0-BW1023)]
HPCA: 31 Feb 12, 2007
Transient Imbalance
HPCA: 32 Feb 12, 2007
With Adaptive Routing
HPCA: 33 Feb 12, 2007
Latency for UR traffic
HPCA: 34 Feb 12, 2007
Flattened Butterfly Topology
[Figure: flattened butterfly with nodes 0-7]
HPCA: 35 Feb 12, 2007
Flattened Butterfly Topology
[Figure: flattened butterfly with nodes 0-7]
What if node 0 sends all of its traffic to node 1?
HPCA: 36 Feb 12, 2007
Flattened Butterfly Topology
[Figure: flattened butterfly with nodes 0-7]
What if node 0 sends all of its traffic to node 1?
How much traffic should we route over alternate paths?
HPCA: 37 Feb 12, 2007
Simpler Case: a ring of 8 nodes, sending traffic from node 2 to node 5

• Model: assume the queues form a network of independent M/D/1 queues

[Figure: ring of nodes 0-7, with a fraction x1 of the traffic on the minimal path and a fraction x2 on the non-minimal path; total traffic = x1 + x2]

• Min path delay = Dm(x1)
• Non-min path delay = Dnm(x2)
• Routing remains minimal as long as the marginal minimal-path delay at the total offered load does not exceed the marginal non-minimal delay at zero load, i.e. Dm′(x1 + x2) ≤ Dnm′(0)
• Afterwards, route a fraction x2 non-minimally such that Dm′(x1) = Dnm′(x2) (a numeric sketch follows below)
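A minimal numeric sketch of this balance. It assumes (my assumption, not stated on the slide) that the minimal path 2→3→4→5 is 3 hops and the non-minimal path 2→1→0→7→6→5 is 5 hops, and that each hop is an M/D/1 queue with unit service time, so a path's delay is its hop count times the per-hop delay.

```python
# Per-hop M/D/1 delay with unit service time at utilization x (0 <= x < 1):
#   d(x)  = 1 + x / (2 * (1 - x))        (service time + mean waiting time)
#   d'(x) = 1 / (2 * (1 - x) ** 2)       (marginal delay)
H_MIN, H_NONMIN = 3, 5  # assumed hop counts of the two paths on the 8-node ring

def d_prime(x):
    return 1.0 / (2.0 * (1.0 - x) ** 2)

def split(load, steps=60):
    """Split the total load between the minimal (x1) and non-minimal (x2) paths
    so the marginal path delays H * d'(x) balance, per the slide's condition."""
    # Stay fully minimal while Dm'(load) <= Dnm'(0)
    if H_MIN * d_prime(load) <= H_NONMIN * d_prime(0.0):
        return load, 0.0
    # Otherwise find x2 in (0, load) with Dm'(x1) = Dnm'(x2) by bisection
    lo, hi = 0.0, load
    for _ in range(steps):
        x2 = (lo + hi) / 2
        x1 = load - x2
        if H_MIN * d_prime(x1) > H_NONMIN * d_prime(x2):
            lo = x2   # minimal path too loaded: shift more traffic non-minimally
        else:
            hi = x2
    return load - x2, x2

for load in (0.1, 0.2, 0.3, 0.4, 0.5):
    x1, x2 = split(load)
    print(f"load {load:.1f}: minimal {x1:.3f}, non-minimal {x2:.3f}")
```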
HPCA: 38 Feb 12, 2007
Traffic divides to balance delay; load is balanced at saturation

[Figure: accepted throughput vs. offered load (fraction of capacity, 0-0.6) from the queueing model, showing the overall, minimal, and non-minimal components]
HPCA: 39 Feb 12, 2007
Channel-Queue Routing
• Estimate the delay per hop by the local queue length Qi
• Overall latency of route i estimated by Li ≈ Qi·Hi
• Route each packet on the route with the lowest estimated Li (a small sketch follows below)
• Works extremely well in practice
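A minimal sketch of this selection rule; the Route type and field names are mine, introduced only to illustrate the Qi·Hi comparison.

```python
from dataclasses import dataclass

@dataclass
class Route:
    name: str
    queue_len: int   # Qi: occupancy of the local output queue for this route
    hops: int        # Hi: hop count of this route to the destination

def choose_route(routes):
    """Channel-queue routing: pick the route with the smallest Qi * Hi estimate."""
    return min(routes, key=lambda r: r.queue_len * r.hops)

# Example: a lightly loaded non-minimal route can beat a congested minimal one
routes = [Route("minimal", queue_len=12, hops=2),
          Route("non-minimal", queue_len=3, hops=4)]
print(choose_route(routes).name)   # -> non-minimal (3*4 = 12 < 12*2 = 24)
```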
HPCA: 40 Feb 12, 2007
Performance on UR Traffic
HPCA: 41 Feb 12, 2007
Performance on WC Traffic
HPCA: 42 Feb 12, 2007
Allocator Design Matters
HPCA: 43 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE centralcomponent of modern computer systems
• Topology driven to high-radix by packagingtechnology
• Global adaptive routing balances load - and enablesefficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) faceunique challenges
• The road ahead…
HPCA: 44 Feb 12, 2007
Putting it all together: the Cray BlackWidow Network

In collaboration with Steve Scott and Dennis Abts (Cray Inc.)
HPCA: 45 Feb 12, 2007
Cray Black Widow
• Shared-memory vector parallel computer
• Up to 32K nodes
• Vector processor per node
• Shared memory across nodes
HPCA: 46 Feb 12, 2007
Black Widow Topology
• Up to 32K nodes in a 3-level folded Clos
• Each node has four 18.75 Gb/s channels, one to each of 4 network slices
HPCA: 47 Feb 12, 2007
YARC: Yet Another Router Chip
• 64 Ports
• Each port is 18.75 Gb/s (3 x 6.25Gb/s links)
• Table-driven routing
• Fault tolerance
– CRC with link-level retry
– Graceful degradation of links
• 3 bits -> 2 bits -> 1 bit -> OTS
HPCA: 48 Feb 12, 2007
YARC Microarchitecture
• Regular 8x8 array of tiles
– Easy to lay out chip
• No global arbitration
– All decisions local
• Simple routing
• Hierarchical organization
– Input buffers
– Row buffers
– Column buffers
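A rough model of this hierarchical row/column organization is sketched below: a packet crosses its input tile's row to an intermediate row buffer in the output's column, then crosses that column to the output tile. The port-to-tile numbering is an illustrative assumption on my part, not taken from the YARC documentation.

```python
RADIX, SIDE = 64, 8   # 64 ports arranged as an 8x8 array of tiles

def tile_of(port):
    """Map a port number to its tile's (row, column) - assumed numbering."""
    return port // SIDE, port % SIDE

def path(in_port, out_port):
    """Buffers a packet visits: input buffer -> row buffer -> column buffer."""
    r_in, c_in = tile_of(in_port)
    r_out, c_out = tile_of(out_port)
    return [("input buffer", r_in, c_in),     # buffered at the arriving tile
            ("row buffer", r_in, c_out),      # after crossing the row bus
            ("column buffer", r_out, c_out)]  # after crossing the column bus

print(path(in_port=5, out_port=58))
```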
HPCA: 49 Feb 12, 2007
A Closer Look at a Tile
• No global arbitration
• Non-blocking with an 8x internal speedup in the subswitch
• Simple routing
– Small 8-entry routing table per tile
– High routing throughput for small packets
HPCA: 50 Feb 12, 2007
YARC Implementation
• Implemented in a 90 nm CMOS standard-cell ASIC technology
• 192 SerDes on the chip (64 ports x 3 bits per port)
• 6.25 Gbaud data rate
• Estimated power
– 80 W (idle)
– 87 W (peak)
• 17 mm x 17 mm die
HPCA: 52 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 53 Feb 12, 2007
Much of the future is on-chip (CMP, SoC, Operand)

[Figure: process generations from 2006 through 2015 (2006, 2007.5, 2009, 2010.5, 2012, 2013.5, 2015)]
HPCA: 54 Feb 12, 2007
On-Chip Networks are Fundamentally Different
• Different cost model
– Wires plentiful, no pin constraints
– Buffers expensive (consume die area)
– Slow signal propagation
• Different usage patterns
– Particularly for SoCs
• Significant isochronous traffic
• Hard real-time constraints
• Different design problems
– Floorplans
– Energy-efficient transmission circuits
HPCA: 55 Feb 12, 2007
NSF Workshop Identified 3 Critical Issues
• Power
– OCINs will have 10x the required power with current approaches
• Circuit and architecture innovations can close this gap
• Latency
– OCIN latency currently not competitive with buses and dedicated wiring
• Novel flow-control strategies required
• Tool Integration
– OCINs need to be integrated with standard tool flows to enable widespread use
HPCA: 56 Feb 12, 2007
The Road Ahead
• INs become an even more dominant system component
– Number of processors goes up, cost of processors decreases
– Communication dominates performance and cost
– From hand-held media UI devices to huge data centers
• Technology drives topology in new directions
– On-chip, short-reach electrical (10 m), optical
– Expect radix to continue to increase
– Hybrid topologies to match each packaging level
• Latency will approach that of dedicated wiring
– Better flow-control and router architecture
– Optimized circuits
• Adaptivity will optimize performance
– Balance load, route around defects, tolerate variation, tune power to load
HPCA: 57 Feb 12, 2007
Summary
• Interconnection Networks (INs) are THE central component of modern computing systems
• High-radix topologies have evolved to exploit packaging/signaling technology
– Including hybrid optical/electrical
– Flattened Butterfly
• Global adaptive routing balances load and enables advanced topologies
– Eliminate transient load imbalance
– Use local queues to estimate global congestion
• Cray Black Widow - an example high-radix network
• On-Chip INs
– Very different constraints
– Three “Gaps” identified - power, latency, tools.
• The road ahead
– Lots of room for improvement, INs are in their infancy
HPCA: 58 Feb 12, 2007
Some very good books
HPCA: 59 Feb 12, 2007
Backup
HPCA: 60 Feb 12, 2007
Virtual Channel Router Architecture
[Figure: canonical virtual-channel router: per-input virtual-channel buffers (VC 1 … VC v) on each of inputs 1 … k, routing computation, a VC allocator, a switch allocator, and a crossbar switch driving outputs 1 … k]
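As a companion to the figure, here is a minimal, generic sketch of the stages a packet passes through in such a virtual-channel router (routing computation and VC allocation once per packet, then switch allocation and switch traversal per flit). It illustrates the textbook pipeline, not any specific implementation from the talk.

```python
from enum import Enum, auto

class Stage(Enum):
    RC = auto()  # routing computation: choose the output port
    VA = auto()  # virtual-channel allocation: acquire an output VC
    SA = auto()  # switch allocation: win a crossbar slot (per flit)
    ST = auto()  # switch traversal: cross the crossbar (per flit)

def packet_pipeline(num_body_flits):
    """Stages visited by one packet: RC and VA once, then SA/ST for each flit."""
    stages = [Stage.RC, Stage.VA]
    for _ in range(1 + num_body_flits):  # head flit plus body/tail flits
        stages += [Stage.SA, Stage.ST]
    return stages

print([s.name for s in packet_pipeline(num_body_flits=2)])
```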
HPCA: 61 Feb 12, 2007
Baseline Performance Evaluation
[Figure: latency (cycles) vs. offered load (0-1) for the low-radix router]
HPCA: 62 Feb 12, 2007
Baseline Performance Evaluation
[Figure: latency (cycles) vs. offered load (0-1) for the low-radix router and the baseline high-radix router; annotation: low radix better]