Streaming Supercomputer Strawman Architecture
November 27, 2001
Ben Serebrin
High-level Programming Model
Streams are partitioned across nodes: for example, a stream of 1024 records spread across 8 nodes gives 128 records per node.
Programming: Partitioning
Across nodes, partitioning is straightforward domain decomposition. Within nodes we have 2 choices (SW), both sketched in code below:
- Domain decomposition: each cluster receives neighboring records. With m records per cluster, cluster 1 holds records 1..m, cluster 2 holds m+1..2m, cluster 3 holds 2m+1..3m, and cluster 4 holds 3m+1..4m.
- Interleaving: each cluster receives strided records, dealt round-robin. Cluster 1 holds records 1, 5, ..., 4m-3, cluster 2 holds 2, 6, ..., 4m-2, cluster 3 holds 3, 7, ..., 4m-1, and cluster 4 holds 4, 8, ..., 4m.
In either case the data sits in the SRF, one SRF lane per cluster; a node has 16 clusters.
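A minimal sketch of the two record-to-cluster mappings, with illustrative function names; m is the per-cluster record count from the figure, and indices here are 0-based rather than the slide's 1-based numbering.

```c
enum { CLUSTERS = 16 };   /* clusters per node */

/* Domain decomposition: cluster c holds a contiguous block of m records. */
int cluster_of_blocked(int record, int m) {
    return record / m;
}

/* Interleaving: records are dealt round-robin across the clusters. */
int cluster_of_interleaved(int record) {
    return record % CLUSTERS;
}
```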
High-level Programming Model
Parallelism within a node
[Figure: a stream of input records is spread across the clusters; each cluster applies the kernel to its records in parallel, producing one result per input record.]
Streams vs. Vectors
Streams:
- Compound operations on records: traverse operations first and records second
- Temporary values are encapsulated within the kernel
- Global instruction bandwidth consists of kernel invocations
- Group whole records into streams
- Gather records from memory: one stream buffer per record type

Vectors:
- Simple operations on vectors of elements: first fetch all elements of all records, then operate
- Large set of temporary values
- Global instruction bandwidth consists of many simple operations
- Group like elements of records into vectors
- Gather elements from memory: one stream buffer per record element type

The two data groupings are sketched below.
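The layout difference can be seen in how the same 1024 records would be declared; this is only an illustrative host-side sketch (the Vertex type and all names are assumptions).

```c
#define N 1024

/* Stream style: whole records grouped into one stream;
 * one stream buffer per record type. */
typedef struct { float x, y, z, w; } Vertex;
Vertex vertices[N];

/* Vector style: like elements grouped into separate vectors;
 * one stream buffer per record element type. */
float xs[N], ys[N], zs[N], ws[N];
```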
Example – Vertex Transform
The input record is a vertex position (x, y, z, w). The kernel multiplies it by the 4x4 transform matrix T; the 16 products t_ij * component are the intermediate results, and the result record is (x', y', z', w'):

x' = t00*x + t01*y + t02*z + t03*w
y' = t10*x + t11*y + t12*z + t13*w
z' = t20*x + t21*y + t22*z + t23*w
w' = t30*x + t31*y + t32*z + t33*w

i.e. [x' y' z' w']^T = T [x y z w]^T.
The example shows why streams win here: the kernel encapsulates the intermediate results, enabling small and fast LRFs. In the vector formulation, the large working set of intermediates must use the global RF.
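A minimal C sketch of the kernel, assuming the Vec4 type below; the point is that the 16 products are short-lived locals (LRF candidates) rather than global-register-file traffic.

```c
typedef struct { float x, y, z, w; } Vec4;

/* Per-record kernel body: intermediates live in locals (i.e., LRFs);
 * only the input and result records cross the SRF boundary. */
Vec4 vertex_transform(const float t[4][4], Vec4 v) {
    Vec4 r;
    r.x = t[0][0]*v.x + t[0][1]*v.y + t[0][2]*v.z + t[0][3]*v.w;
    r.y = t[1][0]*v.x + t[1][1]*v.y + t[1][2]*v.z + t[1][3]*v.w;
    r.z = t[2][0]*v.x + t[2][1]*v.y + t[2][2]*v.z + t[2][3]*v.w;
    r.w = t[3][0]*v.x + t[3][1]*v.y + t[3][2]*v.z + t[3][3]*v.w;
    return r;
}
```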
Instruction Set Architecture
Machine state:
- Program Counter (pc)
- Scalar Registers: part of the MIPS/ARM core
- Local Registers (LRF): local to each ALU in a cluster
- Scratchpad: small RAM within the cluster
- Stream Buffers (SB): between the SRF and the clusters; serve to make the SRF appear multi-ported
Instruction Set Architecture
Machine state (continued):
- Stream Register File (SRF): clustered memory that sources most data
- Stream Cache (SC): to make graph stream accesses efficient. With the SRF or outside?
- Segment Registers: a set of registers to provide paging and protection
- Global Memory (M)
ISA: Instruction Types
Scalar processor:
- Scalar: standard RISC
- Stream Load/Store
- Stream Prefetch (graph stream)
- Execute Kernel

Clusters:
- Kernel instructions: VLIW instructions
ISA: Memory Model
Memory model for global shared addressing:
- Segmented (to allow time-sharing?)
- A descriptor contains node and size information:
  - Length of segment (power of 2)
  - Base address (aligned to a multiple of the length)
  - Range of nodes owning the data (power of 2)
  - Interleaving (which bits select nodes)
  - Cache behavior? (non-cached, read-only, (full?))
- No paging, no TLBs (a descriptor layout along these lines is sketched below)
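A hypothetical C rendering of such a descriptor and of how the interleave bits could pick the owning node; all field names and widths are assumptions, not the deck's encoding.

```c
#include <stdint.h>

typedef struct {
    uint64_t base;        /* aligned to a multiple of the length      */
    uint8_t  log2_len;    /* length of segment (power of 2)           */
    uint8_t  node_lo;     /* first node owning the data               */
    uint8_t  log2_nodes;  /* range of owning nodes (power of 2)       */
    uint8_t  ilv_shift;   /* which address bits select the node       */
    uint8_t  cacheable;   /* cache behavior: non-cached/read-only/... */
} Segment;

/* Pick the owning node from the interleave bits of a global address. */
static inline unsigned owner_node(const Segment *s, uint64_t addr) {
    uint64_t mask = ((uint64_t)1 << s->log2_nodes) - 1;
    return s->node_lo + (unsigned)((addr >> s->ilv_shift) & mask);
}
```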
ISA: Caching
- Stream cache improves bandwidth and latency for graph accesses (irregular structures)
  - Pseudo read-only (like a texture cache: changes very infrequently)
  - Explicit gang-invalidation
- The scalar processor has instruction and data caches
Global Mechanisms
- Remote memory access: a processor can busy-wait on a location until a remote processor updates it
- Signal and Wait (on named broadcast signals)
- Fuzzy barriers (split barriers, sketched below): a processor signals "I'm done" and can continue with other work; when the next phase is reached, the processor waits for all other processors to signal
- Barriers are named and can be implemented with signals and atomic ops
- Atomic remote operations: fetch&op (add, or, etc.), compare&swap
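A minimal sketch of a fuzzy barrier built from these primitives, using C11 atomics to stand in for remote fetch&add and busy-waiting; the counter, the names, and the single-phase (no reset) form are illustrative assumptions, not the machine's actual interface.

```c
#include <stdatomic.h>

atomic_int arrived;                    /* shared count, cleared to 0 */

/* "I'm done": bump the count and keep working on other tasks. */
void barrier_signal(void) {
    atomic_fetch_add(&arrived, 1);     /* stands in for remote fetch&add */
}

/* At the start of the next phase: wait for everyone's signal. */
void barrier_wait(int nprocs) {
    while (atomic_load(&arrived) < nprocs)
        ;                              /* busy-wait on a remote location */
}
```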
Scan Example
Prefix-sum operation, computed recursively (see the sketch below):
- Higher-level processor ("thread"):
  - clear memory locations for the partial sums and ready bits
  - signal Si
  - poll the ready bits and add to the local sum when ready
- Lower-level processor:
  - calculate the local sum
  - wait on Si
  - write the local sum to the prepared memory location
  - atomically update the ready bit in the higher level
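A sketch of one level of this protocol, the summing step of the scan, using C11 atomics in place of the machine's signals and remote atomic ops; NCHILD, the flag encoding, and all names are illustrative.

```c
#include <stdatomic.h>

#define NCHILD 16
long        partial[NCHILD];   /* cleared by the higher-level thread */
atomic_int  ready[NCHILD];     /* ready bits, cleared then polled    */
atomic_int  si;                /* stands in for broadcast signal Si  */

void lower_level(int i, long local_sum) {
    while (!atomic_load(&si))            /* wait on Si               */
        ;
    partial[i] = local_sum;              /* write to prepared slot   */
    atomic_store(&ready[i], 1);          /* atomic ready-bit update  */
}

long higher_level(void) {
    for (int i = 0; i < NCHILD; i++) {   /* clear sums and ready bits */
        partial[i] = 0;
        atomic_store(&ready[i], 0);
    }
    atomic_store(&si, 1);                /* signal Si                 */
    long sum = 0;
    for (int i = 0; i < NCHILD; i++) {
        while (!atomic_load(&ready[i]))  /* poll ready bits           */
            ;
        sum += partial[i];               /* add when ready            */
    }
    return sum;
}
```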
System Architecture
[Figure: system hierarchy.]
- Node: a Stream Processor with 64 FPUs (64 GFLOPS) and 16 xDRDRAM chips (2 GBytes at 38 GBytes/s); 20 GBytes/s (32+32 pairs) to the on-board network.
- Board: 16 nodes (1K FPUs, 1 TFLOPS, 32 GBytes); 160 GBytes/s (256+256 pairs) over 10.5" Teradyne GbX to the intra-cabinet network (passive, wires only).
- Cabinet: 64 boards (1K nodes, 64K FPUs, 64 TFLOPS, 2 TBytes); E/O-O/E conversion onto ribbon fiber at 5 TBytes/s (8K+8K links) to the inter-cabinet network.
- System: 16 cabinets; bisection bandwidth 64 TBytes/s.
All links run at 5 Gb/s per pair or fiber; all bandwidths are full duplex.
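As a sanity check on the totals (arithmetic implied by, not stated on, the slide): 16 nodes × 64 GFLOPS = 1 TFLOPS and 16 × 2 GBytes = 32 GBytes per board; 64 boards × 1 TFLOPS = 64 TFLOPS and 64 × 32 GBytes = 2 TBytes per cabinet; 16 cabinets would therefore total roughly 1 PFLOPS and 32 TBytes.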
Node Microarchitecture
[Figure: node microarchitecture.]
- Stream Execution Unit: 16 arithmetic clusters (cluster 0 through cluster 15) attached to the Stream Register File, with 260 GB/s to/from the SRF.
- Each cluster: a Comm unit, 2 FP MUL, 2 FP ADD, and an FP DSQ unit, each input fed by local register files (aggregate LRF bandwidth 1.5 TB/s), plus a scratchpad and a cluster switch.
- Scalar processor with its cache.
- Address generators and memory control, connected to local DRDRAM (38 GB/s).
- Network interface driving the network channels (20 GB/s).
uArch: Scalar Processor
- Standard RISC (MIPS, ARM)
- Scalar ops and stream dispatch are interleaved (no synchronization needed)
- Accesses the same memory space (SRF & global memory) as the clusters
- I and D caches
- Small RTOS
uArch: Arithmetic Clusters
- 16 identical arithmetic clusters
- 2 ADD, 2 MUL, 1 DSQ, scratchpad (?)
- ALUs connect to the SRF via Stream Buffers and Local Register Files
- LRF: one for each ALU input, 32 64-bit entries each
- Local inter-cluster crossbar
- Statically-scheduled VLIW control
- SIMD/MIMD?
uArch: Stream Register File
- Stream Register File (SRF): arranged in clusters parallel to the arithmetic clusters
- Accessible by the clusters, the scalar processor, and the memory system
- Kernels refer to a stream number (and offset?)
- Stream Descriptor Registers track the start, end, and direction of streams (a possible layout is sketched below)
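A hypothetical layout for one stream descriptor register, following the fields named above; the widths and names are assumed, not taken from the deck.

```c
#include <stdint.h>

typedef struct {
    uint32_t start;      /* first SRF location of the stream */
    uint32_t end;        /* last SRF location (exclusive)    */
    int8_t   direction;  /* +1 = forward, -1 = backward      */
} StreamDesc;
```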
uArch: Memory
- Address generator (above the cache):
  - creates a stream of addresses for strided access
  - accepts a stream of addresses for gather/scatter
  - (both modes are sketched below)
- Memory access: check whether the address is in the cache; if not, check whether it is in local memory; otherwise fetch it over the network
- Network: sends and receives memory requests
- Memory controller: talks to the SRF and to the network
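An illustrative sketch of the two address-generation modes, written as plain array accesses; the names and the flat index-based addressing are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Strided: the generator computes base, base+stride, base+2*stride, ... */
void strided(const float *mem, size_t base, size_t stride,
             size_t n, float *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = mem[base + i * stride];
}

/* Gather: the address stream itself comes from another stream. */
void gather(const float *mem, const uint32_t *addr, size_t n, float *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = mem[addr[i]];
}
```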
Feeds and Speeds: In Node
- 2 GByte DRDRAM local memory: 38 GBytes/s
- On-chip memory: 64 GBytes/s
- Stream registers: 256 GBytes/s
- Local registers: 1520 GBytes/s
Feeds and Speeds: Global
- Card level (16 nodes): 20 GBytes/sec
- Backplane (64 cards): 10 GBytes/sec
- System (16 backplanes): 4 GBytes/sec
- Expect < 1 µs latency (500 ns?) for a memory request to a random address
Open Issues
- 2-port DRF? Currently, the ALUs all have an LRF for each input.
Open Issues
- Is rotation enough, or do we want fully random access to the SRF, with reduced bandwidth when accessing the same bank?
- Rotation allows arbitrary linear rotation and is simpler (see the sketch below)
- Full random access requires a big switch; we can trade bandwidth for size
Open Issues
Do we need an explicitly managed cache (for example, to lock the root of a tree)?
Open Issues
- Do we want messaging? (probably yes)
  - allows elegant distributed control
  - allows complex "fetch&ops" (remote procedures)
  - can build software coherency protocols and such
- Do we need coherency in the scalar part?
Open Issues
Is dynamic migration important? Moving data from one node to another is not possible without pages or COMA.
Open Issues
- Exceptions?
  - No external exceptions
  - Arithmetic overflow/underflow, divide by 0, etc.
  - Exception on a cache miss? (Can we guarantee no cache misses?) It disrupts stream sequencing and control flow
- Interrupts and scalar/stream synchronization
  - Interrupts from the network? From stream to scalar? From scalar to stream?
Experiments
- Conditionals experiment (see the sketch below)
  - Are predication and conditional streams sufficient?
  - Experiment with adding an instruction sequencer for each cluster (quasi-MIMD)
  - Examine cost and performance
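To make the two mechanisms concrete, here is a scalar C sketch of both; the function and array names are illustrative, not part of the proposed ISA.

```c
float f(float x) { return 2.0f * x; }   /* stand-in kernel body */

/* Predication: every record executes; a select keeps or discards. */
void predicated(const float *in, const int *pred, float *out, int n) {
    for (int i = 0; i < n; i++) {
        float t = f(in[i]);
        out[i] = pred[i] ? t : in[i];   /* select, no branch */
    }
}

/* Conditional stream: true records are compacted into a derived
 * stream for later processing; returns its length. */
int conditional(const float *in, const int *pred, float *cond, int n) {
    int m = 0;
    for (int i = 0; i < n; i++)
        if (pred[i]) cond[m++] = in[i];
    return m;
}
```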