HPCA: 1 Feb 12, 2007
Interconnect-Centric Computing
William J. Dally
Computer Systems Laboratory
Stanford University
HPCA Keynote
February 12, 2007
HPCA: 2 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 3 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 4 Feb 12, 2007
INs: Connect Processors in Clusters
IBM Blue Gene
HPCA: 5 Feb 12, 2007
and on chip
MIT RAW
HPCA: 6 Feb 12, 2007
Connect Processors to Memories in Systems
Cray Black Widow
HPCA: 7 Feb 12, 2007
and on chip
Texas TRIPS
HPCA: 8 Feb 12, 2007
provide the fabric for network Switches and Routers
Avici TSR
HPCA: 9 Feb 12, 2007
and connect I/O Devices
Brocade Switch
HPCA: 10 Feb 12, 2007
Group History: Routing Chips & Interconnection Networks
• Mars Router, Torus Routing Chip, Network Design Frame, Reliable Router
• Basis for Intel, Cray/SGI, Mercury, Avici network chips
MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994)
HPCA: 11 Feb 12, 2007
Group History: Parallel Computer Systems
• J-Machine (MDP) led to Cray T3D/T3E
• M-Machine (MAP)
– Fast messaging, scalable processing nodes, scalable memory architecture
• Imagine – basis for SPI
MDP Chip, J-Machine, Cray T3D, MAP Chip, Imagine Chip
HPCA: 12 Feb 12, 2007
Interconnection Networks are THE Central Component of Modern Computer Systems
• Processors are a commodity
– Performance no longer scaling (ILP mined out)
– Future growth is through CMPs - connected by INs
• Memory is a commodity
– Memory system performance determined by interconnect
• I/O systems are largely interconnect
• Embedded systems built using SoCs
– Standard components
– Connected by on-chip INs (OCINs)
HPCA: 13 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 14 Feb 12, 2007
Technology Trends…

[Figure: bandwidth per router node (Gb/s) vs. year, 1985-2010, log scale from 0.1 to 10,000 Gb/s, with points for the Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, YARC, and BlackWidow]
HPCA: 15 Feb 12, 2007
High-Radix Router
HPCA: 16 Feb 12, 2007
High-Radix Router
[Figure: a low-radix router (small number of fat ports) vs. a high-radix router (large number of skinny ports)]
HPCA: 17 Feb 12, 2007
Low-Radix vs. High-Radix Router

[Figure: a network connecting 16 inputs (I0-I15) to 16 outputs (O0-O15), built from low-radix routers (4 hops, 96 channels) vs. high-radix routers (2 hops, 32 channels), with latency and cost compared]
HPCA: 18 Feb 12, 2007
Latency
Latency = H·tr + L/b
        = 2·tr·logk(N) + 2kL/B

where k = radix, B = total router bandwidth, N = # of nodes, L = message size
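For concreteness, here is a small sketch that evaluates this latency expression for a few radices. The router delay, bandwidth, message size, and node count below are illustrative assumptions, not numbers from the talk.

```python
import math

def network_latency(k, N, B, L, t_r):
    """Latency = H*t_r + L/b = 2*t_r*log_k(N) + 2*k*L/B.

    k   : router radix (ports per router)
    N   : number of nodes
    B   : total router bandwidth (bits/s)
    L   : message size (bits)
    t_r : per-hop router delay (s)
    """
    header = 2 * t_r * math.log(N, k)   # hop count H = 2*log_k(N), t_r per hop
    serialization = 2 * k * L / B       # per-port bandwidth b = B/(2k)
    return header + serialization

# Illustrative (assumed) parameters: 4K nodes, 1 Tb/s router, 1 kb messages, 20 ns hop delay
for k in (8, 16, 32, 64, 128):
    t = network_latency(k, N=4096, B=1e12, L=1024, t_r=20e-9)
    print(f"radix {k:4d}: {t * 1e9:6.1f} ns")
```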
HPCA: 19 Feb 12, 2007
Latency vs. Radix
[Figure: latency (nsec) vs. radix (0-250) for 2003 and 2010 technology. Header latency decreases with radix while serialization latency increases, giving an optimal radix of ~40 for 2003 technology and ~128 for 2010 technology]
HPCA: 20 Feb 12, 2007
Determining Optimal Radix
Latency = Header Latency + Serialization Latency
        = H·tr + L/b
        = 2·tr·logk(N) + 2kL/B

Minimizing over k gives the optimal radix:

k·log²k = (B·tr·log N) / L = Aspect Ratio

where k = radix, B = total router bandwidth, N = # of nodes, L = message size
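To make the aspect-ratio relation concrete, the sketch below solves k·log²k = A numerically for the optimal radix. The technology parameters are assumed values for illustration, not figures from the talk.

```python
import math

def aspect_ratio(B, t_r, N, L):
    """A = (B * t_r * log2(N)) / L, with B in bits/s, t_r in s, L in bits."""
    return B * t_r * math.log2(N) / L

def optimal_radix(A):
    """Solve k * (log2 k)^2 = A for k by bisection (the left side is increasing for k > 1)."""
    lo, hi = 2.0, 1e6
    f = lambda k: k * math.log2(k) ** 2 - A
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Illustrative (assumed) parameters: 10 Tb/s router, 10 ns hop delay, 1K nodes, 128-bit messages
A = aspect_ratio(B=1e13, t_r=10e-9, N=1024, L=128)
print(f"aspect ratio ~ {A:.0f}, optimal radix ~ {optimal_radix(A):.0f}")
```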
HPCA: 21 Feb 12, 2007
Higher Aspect Ratio, Higher Optimal Radix
[Figure: optimal radix k (1-1000) vs. aspect ratio (10-10,000) on log-log axes, with points for 1991, 1996, 2003, and 2010 technology]
HPCA: 22 Feb 12, 2007
High-Radix Topology
• Use high radix, k, to get low hop count
– H = logk(N)
• Provide good performance on both benign and adversarial traffic patterns
– Rules out butterfly networks - no path diversity
– Clos networks work well
• H = 2logk(N) - with short circuit
– Cayley graphs have nice properties but are hard to route
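For intuition, a quick sketch comparing the hop counts implied by the two formulas above. The node count and the use of ceil() are my own illustrative choices, not from the slide.

```python
import math

def hops_one_pass(k, N):
    """High-radix butterfly-style network: H = log_k(N), rounded up to whole hops."""
    return math.ceil(math.log(N, k))

def hops_folded_clos(k, N):
    """Folded Clos without short-circuiting: H = 2 * log_k(N)."""
    return 2 * math.ceil(math.log(N, k))

N = 32 * 1024  # example system size
for k in (4, 16, 64):
    print(f"radix {k:3d}: one-pass {hops_one_pass(k, N)} hops, "
          f"folded Clos {hops_folded_clos(k, N)} hops")
```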
HPCA: 23 Feb 12, 2007
Example radix-64 Clos Network
[Figure: radix-64 folded Clos with 1024 endpoints (BW0-BW1023); rank 1 routers Y0-Y31 each connect 32 endpoints, rank 2 routers Y32-Y63 connect the rank 1 routers]
HPCA: 24 Feb 12, 2007
Flattened Butterfly Topology
HPCA: 25 Feb 12, 2007
Packaging the Flattened Butterfly
HPCA: 26 Feb 12, 2007
Packaging the Flattened Butterfly (2)
HPCA: 27 Feb 12, 2007
Cost
HPCA: 28 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 29 Feb 12, 2007
Routing in High-Radix Networks
• Adaptive routing avoids transient load imbalance
• Global adaptive routing balances load for adversarial traffic
– Cost/perf of a butterfly on benign traffic and at low loads
– Cost/perf of a Clos on adversarial traffic
HPCA: 30 Feb 12, 2007
A Clos can statically load balance traffic using oblivious routing

[Figure: the radix-64 folded Clos again (rank 1 routers Y0-Y31, rank 2 routers Y32-Y63, endpoints BW0-BW1023)]
HPCA: 31 Feb 12, 2007
Transient Imbalance
HPCA: 32 Feb 12, 2007
With Adaptive Routing
HPCA: 33 Feb 12, 2007
Latency for UR traffic
HPCA: 34 Feb 12, 2007
Flattened Butterfly Topology
[Figure: flattened butterfly with nodes 0-7]
HPCA: 35 Feb 12, 2007
Flattened Butterfly Topology
[Figure: flattened butterfly with nodes 0-7]
What if node 0 sends all of its traffic to node 1?
HPCA: 36 Feb 12, 2007
Flattened Butterfly Topology
[Figure: flattened butterfly with nodes 0-7]
What if node 0 sends all of its traffic to node 1?
How much traffic should we route over alternate paths?
HPCA: 37 Feb 12, 2007
Simpler Case: a ring of 8 nodes, sending traffic from node 2 to node 5

• Model: assume the queues form a network of independent M/D/1 queues

[Figure: ring of nodes 0-7, with a fraction x1 of the traffic on the minimal path and a fraction x2 on the non-minimal path; total traffic = x1 + x2]

• Min path delay = Dm(x1)
• Non-min path delay = Dnm(x2)
• Routing remains minimal as long as the marginal minimal-path delay at the total offered load does not exceed the marginal non-minimal delay at zero load, i.e. Dm′(x1 + x2) ≤ Dnm′(0)
• Afterwards, route a fraction x2 non-minimally such that Dm′(x1) = Dnm′(x2) (a numeric sketch follows below)
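A minimal numeric sketch of this balance. It assumes (my assumption, not stated on the slide) that the minimal path 2→3→4→5 is 3 hops and the non-minimal path 2→1→0→7→6→5 is 5 hops, and that each hop is an M/D/1 queue with unit service time, so a path's delay is its hop count times the per-hop delay.

```python
# Per-hop M/D/1 delay with unit service time at utilization x (0 <= x < 1):
#   d(x)  = 1 + x / (2 * (1 - x))        (service time + mean waiting time)
#   d'(x) = 1 / (2 * (1 - x) ** 2)       (marginal delay)
H_MIN, H_NONMIN = 3, 5  # assumed hop counts of the two paths on the 8-node ring

def d_prime(x):
    return 1.0 / (2.0 * (1.0 - x) ** 2)

def split(load, steps=60):
    """Split the total load between the minimal (x1) and non-minimal (x2) paths
    so the marginal path delays H * d'(x) balance, per the slide's condition."""
    # Stay fully minimal while Dm'(load) <= Dnm'(0)
    if H_MIN * d_prime(load) <= H_NONMIN * d_prime(0.0):
        return load, 0.0
    # Otherwise find x2 in (0, load) with Dm'(x1) = Dnm'(x2) by bisection
    lo, hi = 0.0, load
    for _ in range(steps):
        x2 = (lo + hi) / 2
        x1 = load - x2
        if H_MIN * d_prime(x1) > H_NONMIN * d_prime(x2):
            lo = x2   # minimal path too loaded: shift more traffic non-minimally
        else:
            hi = x2
    return load - x2, x2

for load in (0.1, 0.2, 0.3, 0.4, 0.5):
    x1, x2 = split(load)
    print(f"load {load:.1f}: minimal {x1:.3f}, non-minimal {x2:.3f}")
```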
HPCA: 38 Feb 12, 2007
Traffic divides to balance delay; load is balanced at saturation

[Figure: accepted throughput vs. offered load (fraction of capacity, 0-0.6) from the queueing model, showing the overall, minimal, and non-minimal components]
HPCA: 39 Feb 12, 2007
Channel-Queue Routing
• Estimate the delay per hop by the local queue length Qi
• Overall latency of route i estimated by Li ≈ Qi·Hi
• Route each packet on the route with the lowest estimated Li (a small sketch follows below)
• Works extremely well in practice
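A minimal sketch of this selection rule; the Route type and field names are mine, introduced only to illustrate the Qi·Hi comparison.

```python
from dataclasses import dataclass

@dataclass
class Route:
    name: str
    queue_len: int   # Qi: occupancy of the local output queue for this route
    hops: int        # Hi: hop count of this route to the destination

def choose_route(routes):
    """Channel-queue routing: pick the route with the smallest Qi * Hi estimate."""
    return min(routes, key=lambda r: r.queue_len * r.hops)

# Example: a lightly loaded non-minimal route can beat a congested minimal one
routes = [Route("minimal", queue_len=12, hops=2),
          Route("non-minimal", queue_len=3, hops=4)]
print(choose_route(routes).name)   # -> non-minimal (3*4 = 12 < 12*2 = 24)
```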
HPCA: 40 Feb 12, 2007
Performance on UR Traffic
HPCA: 41 Feb 12, 2007
Performance on WC Traffic
HPCA: 42 Feb 12, 2007
Allocator Design Matters
HPCA: 43 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE centralcomponent of modern computer systems
• Topology driven to high-radix by packagingtechnology
• Global adaptive routing balances load - and enablesefficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) faceunique challenges
• The road ahead…
HPCA: 44 Feb 12, 2007
Putting it all together: the Cray BlackWidow Network

In collaboration with Steve Scott and Dennis Abts (Cray Inc.)
HPCA: 45 Feb 12, 2007
Cray Black Widow
• Shared-memory vector parallel computer
• Up to 32K nodes
• Vector processor per node
• Shared memory across nodes
HPCA: 46 Feb 12, 2007
Black Widow Topology
• Up to 32K nodes in a 3-level folded Clos
• Each node has four 18.75 Gb/s channels, one to each of 4 network slices
HPCA: 47 Feb 12, 2007
YARC: Yet Another Router Chip
• 64 Ports
• Each port is 18.75 Gb/s (3 x 6.25Gb/s links)
• Table-driven routing
• Fault tolerance
– CRC with link-level retry
– Graceful degradation of links
• 3 bits -> 2 bits -> 1 bit -> OTS
HPCA: 48 Feb 12, 2007
YARC Microarchitecture
• Regular 8x8 array of tiles
– Easy to lay out chip
• No global arbitration
– All decisions local
• Simple routing
• Hierarchical organization
– Input buffers
– Row buffers
– Column buffers
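A rough model of this hierarchical row/column organization is sketched below: a packet crosses its input tile's row to an intermediate row buffer in the output's column, then crosses that column to the output tile. The port-to-tile numbering is an illustrative assumption on my part, not taken from the YARC documentation.

```python
RADIX, SIDE = 64, 8   # 64 ports arranged as an 8x8 array of tiles

def tile_of(port):
    """Map a port number to its tile's (row, column) - assumed numbering."""
    return port // SIDE, port % SIDE

def path(in_port, out_port):
    """Buffers a packet visits: input buffer -> row buffer -> column buffer."""
    r_in, c_in = tile_of(in_port)
    r_out, c_out = tile_of(out_port)
    return [("input buffer", r_in, c_in),     # buffered at the arriving tile
            ("row buffer", r_in, c_out),      # after crossing the row bus
            ("column buffer", r_out, c_out)]  # after crossing the column bus

print(path(in_port=5, out_port=58))
```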
HPCA: 49 Feb 12, 2007
A Closer Look at a Tile
• No global arbitration
• Non-blocking with an 8x internal speedup in the subswitch
• Simple routing
– Small 8-entry routing table per tile
– High routing throughput for small packets
HPCA: 50 Feb 12, 2007
YARC Implementation
• Implemented in a 90 nm CMOS standard-cell ASIC technology
• 192 SerDes on the chip (64 ports x 3 bits per port)
• 6.25 Gbaud data rate
• Estimated power
– 80 W (idle)
– 87 W (peak)
• 17 mm x 17 mm die
HPCA: 52 Feb 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study, the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
HPCA: 53 Feb 12, 2007
Much of the future is on-chip (CMP, SoC, Operand)

[Figure: process generations from 2006 through 2015 (2006, 2007.5, 2009, 2010.5, 2012, 2013.5, 2015)]
HPCA: 54 Feb 12, 2007
On-Chip Networks are Fundamentally Different
• Different cost model
– Wires plentiful, no pin constraints
– Buffers expensive (consume die area)
– Slow signal propagation
• Different usage patterns
– Particularly for SoCs
• Significant isochronous traffic
• Hard real-time constraints
• Different design problems
– Floorplans
– Energy-efficient transmission circuits
HPCA: 55 Feb 12, 2007
NSF Workshop Identified 3 Critical Issues
• Power
– OCINs will have 10x the required power with current approaches
• Circuit and architecture innovations can close this gap
• Latency
– OCIN latency currently not competitive with buses and dedicated wiring
• Novel flow-control strategies required
• Tool Integration
– OCINs need to be integrated with standard tool flows to enable widespread use
HPCA: 56 Feb 12, 2007
The Road Ahead
• INs become an even more dominant system component
– Number of processors goes up, cost of processors decreases
– Communication dominates performance and cost
– From hand-held media UI devices to huge data centers
• Technology drives topology in new directions
– On-chip, short-reach electrical (10 m), optical
– Expect radix to continue to increase
– Hybrid topologies to match each packaging level
• Latency will approach that of dedicated wiring
– Better flow-control and router architecture
– Optimized circuits
• Adaptivity will optimize performance
– Balance load, route around defects, tolerate variation, tune power to load
HPCA: 57 Feb 12, 2007
Summary
• Interconnection Networks (INs) are THE central component of modern computing systems
• High-radix topologies have evolved to exploit packaging/signaling technology
– Including hybrid optical/electrical
– Flattened Butterfly
• Global adaptive routing balances load and enables advanced topologies
– Eliminate transient load imbalance
– Use local queues to estimate global congestion
• Cray Black Widow - an example high-radix network
• On-Chip INs
– Very different constraints
– Three “Gaps” identified - power, latency, tools.
• The road ahead
– Lots of room for improvement, INs are in their infancy
HPCA: 58 Feb 12, 2007
Some very good books
HPCA: 59 Feb 12, 2007
Backup
HPCA: 60 Feb 12, 2007
Virtual Channel Router Architecture
[Figure: canonical virtual-channel router: per-input virtual-channel buffers (VC 1 … VC v) on each of inputs 1 … k, routing computation, a VC allocator, a switch allocator, and a crossbar switch driving outputs 1 … k]
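As a companion to the figure, here is a minimal, generic sketch of the stages a packet passes through in such a virtual-channel router (routing computation and VC allocation once per packet, then switch allocation and switch traversal per flit). It illustrates the textbook pipeline, not any specific implementation from the talk.

```python
from enum import Enum, auto

class Stage(Enum):
    RC = auto()  # routing computation: choose the output port
    VA = auto()  # virtual-channel allocation: acquire an output VC
    SA = auto()  # switch allocation: win a crossbar slot (per flit)
    ST = auto()  # switch traversal: cross the crossbar (per flit)

def packet_pipeline(num_body_flits):
    """Stages visited by one packet: RC and VA once, then SA/ST for each flit."""
    stages = [Stage.RC, Stage.VA]
    for _ in range(1 + num_body_flits):  # head flit plus body/tail flits
        stages += [Stage.SA, Stage.ST]
    return stages

print([s.name for s in packet_pipeline(num_body_flits=2)])
```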
HPCA: 61 Feb 12, 2007
Baseline Performance Evaluation
[Figure: latency (cycles) vs. offered load (0-1) for the low-radix router]
HPCA: 62 Feb 12, 2007
Baseline Performance Evaluation
[Figure: latency (cycles) vs. offered load (0-1) for the low-radix router and the baseline high-radix router; annotation: low radix better]