
Page 1: Interconnect-Centric Computing

Interconnect-Centric Computing

William J. Dally
Computer Systems Laboratory

Stanford University

HPCA Keynote

February 12, 2007

Page 2: Interconnect-Centric Computing

Outline

• Interconnection Networks (INs) are THE central component of modern computer systems

• Topology driven to high-radix by packaging technology

• Global adaptive routing balances load - and enables efficient topologies

• Case study: the Cray Black Widow

• On-Chip Interconnection Networks (OCINs) face unique challenges

• The road ahead…


Page 4: Interconnect-Centric Computing

INs: Connect Processors in Clusters

IBM Blue Gene

Page 5: Interconnect-Centric Computing

and on chip

MIT RAW

Page 6: Interconnect-Centric Computing

Connect Processors to Memories in Systems

Cray Black Widow

Page 7: Interconnect-Centric Computing

and on chip

Texas TRIPS

Page 8: Interconnect-Centric Computing

…provide the fabric for network Switches and Routers

Avici TSR

Page 9: Interconnect-Centric Computing

and connect I/O Devices

Brocade Switch

Page 10: Interconnect-Centric Computing

Group History: Routing Chips & Interconnection Networks

• Mars Router, Torus Routing Chip, Network Design Frame, Reliable Router

• Basis for Intel, Cray/SGI, Mercury, Avici network chips

[Chip photos: MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994)]

Page 11: Interconnect-Centric Computing

Group History: Parallel Computer Systems

• J-Machine (MDP) led to Cray T3D/T3E

• M-Machine (MAP)

– Fast messaging, scalable processing nodes, scalable memory architecture

• Imagine – basis for SPI (Stream Processors, Inc.)

[Photos: MDP chip, J-Machine, Cray T3D, MAP chip, Imagine chip]

Page 12: Interconnect-Centric Computing

Interconnection Networks are THE Central Component of Modern Computer Systems

• Processors are a commodity

– Performance no longer scaling (ILP mined out)

– Future growth is through CMPs - connected by INs

• Memory is a commodity

– Memory system performance determined by interconnect

• I/O systems are largely interconnect

• Embedded systems built using SoCs

– Standard components

– Connected by on-chip INs (OCINs)

Page 13: Interconnect-Centric Computing

Outline

• Interconnection Networks (INs) are THE central component of modern computer systems

• Topology driven to high-radix by packaging technology

• Global adaptive routing balances load - and enables efficient topologies

• Case study: the Cray Black Widow

• On-Chip Interconnection Networks (OCINs) face unique challenges

• The road ahead…

Page 14: Interconnect-Centric Computing

Technology Trends…

[Plot: bandwidth per router node (Gb/s), log scale from 0.1 to 10,000, vs. year (1985 to 2010). Machines plotted: Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, YARC/BlackWidow.]

Page 15: Interconnect-Centric Computing

High-Radix Router


Page 16: Interconnect-Centric Computing

High-Radix Router

[Diagram: a low-radix router (small number of fat ports) vs. a high-radix router (large number of skinny ports).]

Page 17: Interconnect-Centric Computing

Low-Radix vs. High-Radix Router

• Latency: 4 hops (low-radix) vs. 2 hops (high-radix)

• Cost: 96 channels (low-radix) vs. 32 channels (high-radix)

[Diagram: two networks connecting inputs I0 to I15 to outputs O0 to O15, one built from low-radix routers and one from high-radix routers.]

Page 18: Interconnect-Centric Computing

Latency

Latency = H·tr + L/b
        = 2·tr·logk(N) + 2kL/B

where k = radix, B = total router bandwidth, N = number of nodes, L = message size, tr = per-hop router delay, H = hop count, and b = per-channel bandwidth (so b = B/2k)
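
To make the trade-off concrete, here is a small editorial sketch of this latency model in Python (the parameter values are illustrative assumptions, not figures from the talk):

```python
import math

def latency(k, N=1024, B=1.0e12, L=1024, t_r=20e-9):
    """T(k) = 2*t_r*log_k(N) + 2*k*L/B: header latency falls with radix k
    (fewer hops) while serialization latency rises (skinnier ports)."""
    header = 2 * t_r * math.log(N, k)   # H = 2*log_k(N) hops at t_r per hop
    serial = 2 * k * L / B              # per-port bandwidth shrinks as B/(2k)
    return header + serial

# Sweep k to find the latency-minimizing radix for these parameters.
k_opt = min(range(2, 257), key=latency)
print(k_opt, latency(k_opt))
```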

Page 19: Interconnect-Centric Computing

Latency vs. Radix

[Plot: latency (nsec, 0 to 300) vs. radix (0 to 250) for 2003 and 2010 technology. Header latency decreases with radix while serialization latency increases; the optimal radix is roughly 40 for 2003 technology and roughly 128 for 2010 technology.]

Page 20: Interconnect-Centric Computing

Determining Optimal Radix

Latency = Header Latency + Serialization Latency
        = H·tr + L/b
        = 2·tr·logk(N) + 2kL/B

Minimizing over k gives the optimal radix:

k·log²k = (B·tr·log N) / L = Aspect Ratio

where k = radix, B = total router bandwidth, N = number of nodes, L = message size
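
For reference, the calculus step behind the slide's result, sketched in natural logs (the change of log base only rescales the constant folded into the aspect ratio):

```latex
T(k) = \frac{2 t_r \ln N}{\ln k} + \frac{2 k L}{B},
\qquad
\frac{dT}{dk} = -\frac{2 t_r \ln N}{k \ln^{2} k} + \frac{2L}{B} = 0
\;\Rightarrow\;
k \ln^{2} k = \frac{B\, t_r \ln N}{L}
```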

Page 21: Interconnect-Centric Computing

Higher Aspect Ratio, Higher Optimal Radix

[Plot: optimal radix k (log scale, 1 to 1,000) vs. aspect ratio (log scale, 10 to 10,000), with technology points for 1991, 1996, 2003, and 2010.]

Page 22: Interconnect-Centric Computing

High-Radix Topology

• Use high radix, k, to get low hop count

– H = logk(N)

• Provide good performance on both benign and adversarial traffic patterns

– Rules out butterfly networks - no path diversity

– Clos networks work well

• H = 2·logk(N), or less with short-circuiting

– Cayley graphs have nice properties but are hard to route
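
For concreteness, an illustrative example: with N = 4096 and k = 64, a butterfly reaches any node in H = log64(4096) = 2 hops, while a Clos needs at most 2·2 = 4, and fewer when a packet can short-circuit at a lower rank.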

Page 23: Interconnect-Centric Computing

Example radix-64 Clos Network

[Diagram: 1,024 endpoints (BW0 to BW1023) attach in groups of 32 to rank-1 routers Y0 to Y31 (BW0 to BW31 on Y0, BW32 to BW63 on Y1, and so on); rank-2 routers Y32 to Y63 interconnect the rank-1 routers.]

Page 24: Interconnect-Centric Computing

Flattened Butterfly Topology

Page 25: Interconnect-Centric Computing

Packaging the Flattened Butterfly

Page 26: Interconnect-Centric Computing

Packaging the Flattened Butterfly (2)

Page 27: Interconnect-Centric Computing

Cost

Page 28: Interconnect-Centric Computing

Outline

• Interconnection Networks (INs) are THE central component of modern computer systems

• Topology driven to high-radix by packaging technology

• Global adaptive routing balances load - and enables efficient topologies

• Case study: the Cray Black Widow

• On-Chip Interconnection Networks (OCINs) face unique challenges

• The road ahead…

Page 29: Interconnect-Centric Computing

Routing in High-Radix Networks

• Adaptive routing avoids transient load imbalance

• Global adaptive routing balances load for adversarial traffic

– Cost/performance of a butterfly on benign traffic and at low loads

– Cost/performance of a Clos on adversarial traffic

Page 30: Interconnect-Centric Computing

A Clos can statically load balance traffic using oblivious routing

[Diagram: the same radix-64 Clos as before (endpoints BW0 to BW1023 on rank-1 routers Y0 to Y31, rank-2 routers Y32 to Y63).]

Page 31: Interconnect-Centric Computing

Transient Imbalance

Page 32: Interconnect-Centric Computing

With Adaptive Routing

Page 33: Interconnect-Centric Computing

Latency for UR (uniform random) traffic

Page 34: Interconnect-Centric Computing

Flattened Butterfly Topology

[Diagram: flattened butterfly connecting eight nodes, 0 to 7.]

Page 35: Interconnect-Centric Computing

Flattened Butterfly Topology


What if node 0 sends all of its traffic to node 1?

Page 36: Interconnect-Centric Computing

Flattened Butterfly Topology


What if node 0 sends all of its traffic to node 1?

How much traffic should we route over alternate paths?

Page 37: Interconnect-Centric Computing

Simpler Case: a ring of 8 nodes, sending traffic from node 2 to node 5

• Model: assume the queues form a network of independent M/D/1 queues

[Diagram: 8-node ring, nodes 0 to 7. The minimal path 2-3-4-5 carries load x1; the non-minimal path 2-1-0-7-6-5 carries load x2; the total load is λ = x1 + x2.]

• Min path delay = Dm(x1); non-min path delay = Dnm(x2)

• Routing remains minimal as long as Dm′(λ) ≤ Dnm′(0)

• Afterwards, route a fraction, x2, non-minimally such that Dm′(x1) = Dnm′(x2)
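
As an editorial illustration, here is a numerical sketch of this balance condition in Python, using the M/D/1 mean-delay formula; the hop counts 3 and 5 are the minimal and non-minimal path lengths of the 2-to-5 example, and unit link capacity is an assumption:

```python
import math

def mdl_delay(load, hops, cap=1.0):
    """Mean delay over `hops` independent M/D/1 queues (service rate `cap`):
    per queue, D = 1/cap + rho/(2*cap*(1-rho)) with rho = load/cap."""
    if load >= cap:
        return math.inf
    rho = load / cap
    return hops * (1.0 / cap + rho / (2.0 * cap * (1.0 - rho)))

def deriv(fn, x, h=1e-6):
    return (fn(x + h) - fn(x)) / h          # one-sided numeric derivative

def nonminimal_share(lam, h_min=3, h_nonmin=5):
    """Split total load lam (< path capacity) between the minimal (3-hop)
    and non-minimal (5-hop) paths so marginal delays match:
    D'm(x1) = D'nm(x2) with x1 = lam - x2."""
    Dm = lambda x: mdl_delay(x, h_min)
    Dnm = lambda x: mdl_delay(x, h_nonmin)
    if deriv(Dm, lam) <= deriv(Dnm, 0.0):   # below threshold: stay minimal
        return 0.0
    lo, hi = 0.0, lam
    for _ in range(60):                     # bisect for the balance point
        x2 = (lo + hi) / 2.0
        if deriv(Dm, lam - x2) > deriv(Dnm, x2):
            lo = x2
        else:
            hi = x2
    return x2
```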

Page 38: Interconnect-Centric Computing

Traffic divides to balance delay; load is balanced at saturation

[Plot: accepted throughput vs. offered load (both as a fraction of capacity, 0 to 0.6), with three model curves: overall, minimal, and non-minimal.]

Page 39: Interconnect-Centric Computing

Channel-Queue Routing

• Estimate delay per hop by the local queue length Qi

• Estimate the overall latency of each candidate route by

– Li ≈ Qi·Hi (queue length times hop count Hi)

• Route each packet on the route with the lowest estimated Li

• Works extremely well in practice
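
A minimal sketch of that decision rule (names and numbers are mine, for illustration):

```python
def best_route(candidates):
    """Channel-queue routing: score each candidate route by (local queue
    length Q_i) * (hop count H_i) and take the smallest estimate.
    `candidates` is a list of (Q_i, H_i) pairs."""
    return min(range(len(candidates)),
               key=lambda i: candidates[i][0] * candidates[i][1])

# A 2-hop minimal route behind a 12-deep queue loses to a lightly loaded
# 4-hop non-minimal route: 12*2 = 24 > 3*4 = 12.
assert best_route([(12, 2), (3, 4)]) == 1
```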

Page 40: Interconnect-Centric Computing

Performance on UR Traffic

Page 41: Interconnect-Centric Computing

Performance on WC (worst-case) Traffic

Page 42: Interconnect-Centric Computing

Allocator Design Matters

Page 43: Interconnect-Centric Computing

Outline

• Interconnection Networks (INs) are THE central component of modern computer systems

• Topology driven to high-radix by packaging technology

• Global adaptive routing balances load - and enables efficient topologies

• Case study: the Cray Black Widow

• On-Chip Interconnection Networks (OCINs) face unique challenges

• The road ahead…

Page 44: Interconnect-Centric Computing

Putting it all together: the Cray BlackWidow Network

In collaboration with Steve Scott and Dennis Abts (Cray Inc.)

Page 45: Interconnect-Centric Computing

Cray Black Widow

• Shared-memory vector parallel computer

• Up to 32K nodes

• Vector processor per node

• Shared memory across nodes

Page 46: Interconnect-Centric Computing

Black Widow Topology

• Up to 32K nodes in a 3-level folded Clos

• Each node has four 18.75 Gb/s channels, one to each of four network slices
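
(In aggregate, that is 4 × 18.75 Gb/s = 75 Gb/s of network bandwidth per node.)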

Page 47: Interconnect-Centric Computing

YARC: Yet Another Router Chip

• 64 ports

• Each port is 18.75 Gb/s (3 × 6.25 Gb/s links)

• Table-driven routing

• Fault tolerance

– CRC with link-level retry

– Graceful degradation of links

• 3 bits -> 2 bits -> 1 bit -> OTS

Page 48: Interconnect-Centric Computing

YARC Microarchitecture

• Regular 8x8 array of tiles

– Easy to lay out the chip

• No global arbitration

– All decisions local

• Simple routing

• Hierarchical organization

– Input buffers

– Row buffers

– Column buffers
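
As a rough editorial sketch of the row-then-column idea behind this hierarchy (assuming a row-major mapping of the 64 ports onto the 8×8 tile array; the real YARC datapath is more involved):

```python
def tile_path(in_port, out_port, cols=8):
    """Row-then-column traversal of an 8x8 tile array: buffer at the input
    tile, cross that tile's row to the output port's column (row buffer),
    then move down the column to the output tile (column buffer)."""
    in_r, in_c = divmod(in_port, cols)
    out_r, out_c = divmod(out_port, cols)
    return [("input buffer", in_r, in_c),
            ("row buffer", in_r, out_c),
            ("column buffer", out_r, out_c)]

# e.g. port 5 (tile 0,5) to port 62 (tile 7,6):
# input buffer at (0,5) -> row buffer at (0,6) -> column buffer at (7,6)
```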

Page 49: Interconnect-Centric Computing

A Closer Look at a Tile

• No global arbitration

• Non-blocking, with an 8× internal speedup in the subswitch

• Simple routing

– Small 8-entry routing table per tile

– High routing throughput for small packets

Page 50: Interconnect-Centric Computing

YARC Implementation

• Implemented in a 90 nm CMOS standard-cell ASIC technology

• 192 SerDes on the chip (64 ports × 3 bits per port)

• 6.25 Gbaud data rate

• Estimated power: 80 W (idle), 87 W (peak)

• 17 mm × 17 mm die


Page 52: Interconnect-Centric Computing

Outline

• Interconnection Networks (INs) are THE central component of modern computer systems

• Topology driven to high-radix by packaging technology

• Global adaptive routing balances load - and enables efficient topologies

• Case study: the Cray Black Widow

• On-Chip Interconnection Networks (OCINs) face unique challenges

• The road ahead…

Page 53: Interconnect-Centric Computing

Much of the future is on-chip (CMP, SoC, operand networks)

[Figure: processor roadmap spanning 2006, 2007.5, 2009, 2010.5, 2012, 2013.5, and 2015.]

Page 54: Interconnect-Centric Computing

On-Chip Networks are Fundamentally Different

• Different cost model

– Wires plentiful, no pin constraints

– Buffers expensive (consume die area)

– Slow signal propagation

• Different usage patterns

– Particularly for SoCs

• Significant isochronous traffic

• Hard real-time constraints

• Different design problems

– Floorplans

– Energy-efficient transmission circuits

Page 55: Interconnect-Centric Computing

NSF Workshop Identified 3 Critical Issues

• Power

– With current approaches, OCINs will consume 10× their power budget

• Circuit and architecture innovations can close this gap

• Latency

– OCIN latency is currently not competitive with buses and dedicated wiring

• Novel flow-control strategies are required

• Tool Integration

– OCINs need to be integrated with standard tool flows to enable widespread use

Page 56: Interconnect-Centric Computing

The Road Ahead

• INs become an even more dominant system component

– Number of processors goes up, cost of processors decreases

– Communication dominates performance and cost

– From hand-held media/UI devices to huge data centers

• Technology drives topology in new directions

– On-chip, short-reach electrical (10 m), optical

– Expect radix to continue to increase

– Hybrid topologies to match each packaging level

• Latency will approach that of dedicated wiring

– Better flow control and router architecture

– Optimized circuits

• Adaptivity will optimize performance

– Balance load, route around defects, tolerate variation, tune power to load

Page 57: Interconnect-Centric Computing

Summary

• Interconnection Networks (INs) are THE central component of modern computing systems

• High-radix topologies have evolved to exploit packaging/signaling technology

– Including hybrid optical/electrical

– Flattened Butterfly

• Global adaptive routing balances load and enables advanced topologies

– Eliminates transient load imbalance

– Uses local queues to estimate global congestion

• Cray Black Widow - an example high-radix network

• On-Chip INs

– Very different constraints

– Three “gaps” identified: power, latency, tools

• The road ahead

– Lots of room for improvement; INs are in their infancy

Page 58: Interconnect-Centric Computing

Some very good books

Page 59: Interconnect-Centric Computing

Backup

Page 60: Interconnect-Centric Computing

Virtual Channel Router Architecture

[Diagram: a virtual-channel router. Each of k input ports holds v virtual-channel buffers (VC 1 through VC v) feeding a k × k crossbar switch to k output ports; a routing-computation unit, a VC allocator, and a switch allocator control the datapath.]
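
A bare-bones data-structure sketch of what the diagram shows (structure only; the routing computation and the two allocators are stubbed out, and all names are mine):

```python
from collections import deque

def make_vc_router(k, v):
    """k input ports, each with v virtual-channel flit buffers, a k x k
    crossbar, and k output ports, per the diagram. Each cycle a real router
    would run: routing computation -> VC allocation -> switch allocation ->
    crossbar traversal."""
    return {
        "input_vcs": [[deque() for _ in range(v)] for _ in range(k)],
        "crossbar": (k, k),                    # k x k switch
        "outputs": [deque() for _ in range(k)],
    }
```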

Page 61: Interconnect-Centric Computing

Baseline Performance Evaluation

[Plot: latency (cycles, 0 to 50) vs. offered load (0 to 1) for the low-radix network.]

Page 62: Interconnect-Centric Computing

Baseline Performance Evaluation (2)

[Plot: latency (cycles, 0 to 50) vs. offered load (0 to 1) for the low-radix network and the baseline high-radix network; on this baseline, the low-radix network performs better.]