Vector Processors
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University
Carleton University © S. Dandamudi 2
Pipelining
Vector machines exploit pipelining in all their activities
  Computations
  Movement of data from/to memory
Pipelining provides overlapped execution
  Increases throughput
  Hides latency …
Pipelining (cont’d)
Pipeline overlaps execution: 6 versus 18 cycles
Pipelining (cont’d)
One measure of performance:

$$\text{Speedup} = \frac{\text{Non-pipelined execution time}}{\text{Pipelined execution time}}$$

Ideal case: an n-stage pipeline should give a speedup of n
Two factors affect this:
  Pipeline fill
  Pipeline drain
Pipelining (cont’d)
N computations, each takes n * T time
Non-pipelined time = N * n * T
Pipelined time = n * T + (N – 1) * T = (n + N – 1) * T

$$\text{Speedup} = \frac{n \cdot N}{n + N - 1} = \frac{1}{\dfrac{1}{N} + \dfrac{1}{n} - \dfrac{1}{n \cdot N}}$$
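The speedup expression can be checked numerically. The following is a minimal Python sketch; the function name `pipeline_speedup` is an assumption of this example and does not come from the slides.

```python
def pipeline_speedup(n: int, N: int) -> float:
    """Speedup of an n-stage pipeline over non-pipelined execution of N computations.

    Non-pipelined time = N * n * T and pipelined time = (n + N - 1) * T,
    so the stage time T cancels out of the ratio.
    """
    if n < 1 or N < 1:
        raise ValueError("n and N must be positive integers")
    return (n * N) / (n + N - 1)


if __name__ == "__main__":
    # Tabulate the speedup for the three pipeline depths used in the slides.
    for n in (3, 6, 9):
        print(n, [round(pipeline_speedup(n, N), 2) for N in (1, 10, 64, 150)])
```

For a single computation (N = 1) the speedup is 1, and as N grows the speedup approaches the pipeline depth n, so pipeline fill and drain matter less for long vectors.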
Pipelining (cont’d)
[Figure: speedup versus number of elements, N (0 to 150), for pipeline depths n = 3, 6, and 9; the speedup axis runs from 1.0 to 9.0]
Pipelining (cont’d)
[Figure: plot with pipeline depth, n, on the horizontal axis]
Vector Machines
Provide high-level operations
Work on vectors (linear arrays of numbers)
A typical vector operation
  Add two 64-element floating-point vectors
  Equivalent to an entire loop
CRAY format
  V3 V2 VOP V1    (V3 = V2 VOP V1)
Vector Machines (cont’d)
Consists of
  Scalar unit
    Works on scalars
    Address arithmetic
  Vector unit
    Responsible for vector operations
    Several vector functional units
      Integer add, FP add, FP multiply …
Vector Machines (cont’d)
Two types of architecture
  Memory-to-memory architecture
    Vectors are memory resident
    The first vector machines were of this type
    Examples: CDC Star 100, CYBER 205
  Vector-register architecture
    Vectors are stored in registers
    Modern vector machines belong to this type
    Examples: Cray 1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820
Components
Primary components of a vector-register machine
Vector registers
  Each register can hold a small vector
  Example: Cray-1 has 8 vector registers
    Each vector register can hold 64 doublewords (64-bit values)
  Two read ports and one write port
    Allows overlap among the vector operations
Cray-1 Architecture
Components
Vector functional units
  Each unit is fully pipelined
  Can start a new operation on every clock cycle
  Cray-1 has six functional units
    FP add, FP multiply, FP reciprocal, Integer add, Logical, Shift
Scalar registers
  Store scalars
  Compute addresses to pass on to the load/store unit
Components
Vector load/store unit
  Moves vectors between memory and vector registers
  Load and store operations are pipelined
  Some processors have more than one load/store unit
    NEC SX/2 has 8 load/store units
Memory
  Designed to allow pipelined access
  Typically use interleaved memories
    Will discuss later
Some Example Vector Machines
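Machine          Year   # VR      VR size     # LSUs
CRAY-2           1985   8         64          1
Cray Y-MP        1988   8         64          2 loads/1 store
Fujitsu VP100    1982   8-256     32-1024     2
Hitachi S810     1983   32        256         4
NEC SX/2         1984   8+8192    256+var.    8
Convex C-1       1985   8         128         1

# VR = number of vector registers; VR size = elements per vector register; # LSUs = number of load/store units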
Some Example Vector Machines (cont’d)
Vector functional units
Cray X-MP/Y-MP
  8 units
    FP add, FP multiply, FP reciprocal
    Integer add, 2 logical
    Shift
    Population count/parity
Some Example Vector Machines (cont’d)
Vector functional units (cont’d)
NEC SX/2
  16 units
    4 FP add, 4 FP multiply/divide
    4 Integer add/logical, 4 Shift
Advantages of Vector Machines
Flynn’s bottleneck can be reduced
  Vector instructions significantly improve code density
    A single vector instruction specifies a great deal of work
    Reduce the number of instructions needed to execute a program
  Eliminate control overhead of a loop
    A vector instruction represents the entire loop
    Loop overhead can be substantial
Advantages of Vector Machines (cont’d)
Impact of main memory latency can be reduced
  Vector instructions that access memory have a known pattern
  Pipelined access can be used
  Can exploit interleaved memory
  High latency associated with memory can be amortized over the entire vector
    Latency is not associated with each data item
      When accessing a floating-point number
Advantages of Vector Machines (cont’d)
Control hazards can be reduced
  Vector machines organize data operands into regular sequences
    Suitable for pipelined access in hardware
  Vector operation replaces the loop
Data hazards can be eliminated
  Due to structured nature of data
  Allows planned prefetching of data
Example Problem
A typical vector problem
  Y = a * X + Y
  X and Y are vectors
This problem is known as
  SAXPY (single precision A*X Plus Y)
  DAXPY (double precision A*X Plus Y)
SAXPY/DAXPY represents a small piece of code that takes most of the time in the Linpack benchmark
Example Problem (cont’d)
Non-vector code fragment
      LD    F0,a          ;load scalar a
      ADDI  R4,Rx,#512    ;last address to load
loop: LD    F2,0(Rx)      ;F2 := M[0+Rx], i.e., load X[i]
      MULT  F2,F0,F2      ;a*X[i]
      LD    F4,0(Ry)      ;load Y[i]
      ADD   F4,F2,F4      ;a*X[i] + Y[i]
      SD    F4,0(Ry)      ;store into Y[i]
      ADDI  Rx,Rx,#8      ;increment index to X
      ADDI  Ry,Ry,#8      ;increment index to Y
      SUB   R20,R4,Rx     ;R20 := R4-Rx
      JNZ   R20,loop      ;jump if not done
9 instructions in the loop
Example Problem (cont’d)
Vector code fragment
      LD      F0,a        ;load scalar a
      LV      V1,Rx       ;load vector X
      MULTSV  V2,F0,V1    ;V2 := F0 * V1
      LV      V3,Ry       ;load vector Y
      ADDV    V4,V2,V3    ;V4 := V2 + V3
      SV      Ry,V4       ;store the result
Only 6 instructions!
Example Problem (cont’d)
Two main observations
Execution efficiency
  Vector code
    Executes 6 instructions
  Non-vector code
    Nearly 600 instructions (9 * 64)
    Lots of control overhead
      4 out of 9 instructions!
      Absent in the vector code
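The instruction counts can be reproduced with a small Python sketch. The names `daxpy`, `scalar_instruction_count`, and `VECTOR_INSTRUCTION_COUNT` are hypothetical helpers introduced only for this illustration; the constants come from the two code fragments (2 setup instructions and a 9-instruction loop body in the non-vector version, 6 instructions in the vector version).

```python
def daxpy(a, x, y):
    """Reference computation of Y = a * X + Y, returned as a new list."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def scalar_instruction_count(n, loop_body=9, setup=2):
    """Instructions executed by the non-vector fragment for n elements.

    setup counts the two instructions before the loop label;
    loop_body counts the 9 instructions executed once per element.
    """
    return setup + loop_body * n


# The vector fragment executes the same 6 instructions for a 64-element vector.
VECTOR_INSTRUCTION_COUNT = 6

if __name__ == "__main__":
    print(scalar_instruction_count(64), "versus", VECTOR_INSTRUCTION_COUNT)
```

For 64 elements the non-vector count is 2 + 9 * 64 = 578, which is the "nearly 600" figure quoted above.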
Example Problem (cont’d)
Two main observations
Frequency of pipeline interlock
  Non-vector code
    Every ADD must wait for MULT
    Every SD must wait for ADD
    Loop unrolling can eliminate this interlock
  Vector code
    Each instruction is independent
    Pipeline stalls once per vector operation
      Not once per vector element
Vector Length
Vector register has a natural vector length
  64 elements in CRAY systems
What if the vector has a different length?
Three cases
  Vector length < Vector register length
    Use a vector length register to indicate the vector length
  Vector length = Vector register length
  Vector length > Vector register length
Vector Length (cont’d)
Vector length > Vector register length
  Use strip mining
  Vector is partitioned into strips that are less than or equal to the vector register length
[Figure: a vector partitioned into strips, with the odd strip marked]
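Strip mining can be sketched in Python as follows. This is an illustration under stated assumptions: the names `MVL`, `strip_mine`, and `daxpy_strip_mined` are hypothetical, the register length of 64 is the CRAY value quoted earlier, and handling the odd strip first is a choice made for this sketch rather than something the slides specify.

```python
MVL = 64  # assumed vector register length (64 elements in CRAY systems)


def strip_mine(length, mvl=MVL):
    """Split range(length) into (start, strip_length) pairs of at most mvl elements.

    The odd strip (length % mvl elements) is emitted first, followed by
    full-length strips; this ordering is an assumption of the sketch.
    """
    if length < 0 or mvl < 1:
        raise ValueError("length must be >= 0 and mvl must be >= 1")
    strips = []
    start = 0
    odd = length % mvl
    if odd:
        strips.append((start, odd))
        start += odd
    while start < length:
        strips.append((start, mvl))
        start += mvl
    return strips


def daxpy_strip_mined(a, x, y, mvl=MVL):
    """Compute a * x + y one strip at a time, as a vector machine would."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    result = list(y)
    for start, vl in strip_mine(len(x), mvl):
        # Each iteration stands for one group of vector instructions
        # executed with the vector length register set to vl.
        end = start + vl
        result[start:end] = [a * xi + yi for xi, yi in zip(x[start:end], y[start:end])]
    return result
```

A 200-element vector, for example, is split into one odd strip of 8 elements followed by three full strips of 64.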
Vector Stride
Vector stride
  Distance separating the elements that are to be merged into a single vector
    In elements, not bytes
Multidimensional matrices typically have non-unit stride access patterns
  Example: matrix multiply
Vector Stride (cont’d)
Matrix multiplication
for (i = 1, 100)
  for (j = 1, 100)
    A[i,j] = 0
    for (k = 1, 100)
      A[i,j] = A[i,j] + B[i,k] * C[k,j]
In the inner loop, one of B[i,k] and C[k,j] is accessed with unit stride and the other with non-unit stride
Vector Stride (cont’d)
Access pattern of B and C depends on how the matrix is stored
  Row-major
    Matrix is stored row-by-row
    Used by most languages except FORTRAN
  Column-major
    Matrix is stored column-by-column
    Used by FORTRAN
Vector Stride (cont’d)
11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44
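The effect of storage order on stride can be worked out with a short Python sketch. The helpers `element_offset` and `stride` are hypothetical names for this illustration, and indices are zero-based here, unlike the one-based loop in the matrix multiply example.

```python
def element_offset(i, j, rows, cols, order):
    """Offset, in elements, of entry (i, j) of a rows x cols matrix (zero-based)."""
    if order == "row":
        return i * cols + j      # row-major: rows are stored one after another
    if order == "column":
        return j * rows + i      # column-major: columns are stored one after another
    raise ValueError("order must be 'row' or 'column'")


def stride(first, second, rows, cols, order):
    """Distance in elements between two entries given as (i, j) pairs."""
    return (element_offset(*second, rows, cols, order)
            - element_offset(*first, rows, cols, order))


if __name__ == "__main__":
    for order in ("row", "column"):
        # B[i,k] -> B[i,k+1] walks along a row; C[k,j] -> C[k+1,j] walks down a column.
        b = stride((0, 0), (0, 1), 100, 100, order)
        c = stride((0, 0), (1, 0), 100, 100, order)
        print(order, "major: B stride =", b, ", C stride =", c)
```

For the 100 x 100 matrices of the example, row-major storage gives B a stride of 1 and C a stride of 100 in the inner loop; column-major storage reverses the two.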
Cray X-MP Instructions
Integer addition
  Vi Vj+Vk      Vi = Vj + Vk
  Vi Sj+Vk      Vi = Sj + Vk
    Sj is a scalar
Floating-point addition
  Vi Vj+FVk     Vi = Vj + Vk
  Vi Sj+FVk     Vi = Sj + Vk
    Sj is a scalar
Cray X-MP Instructions (cont’d)
Load instructions
  Vi ,A0,Ak     Vi = M(A0)+Ak
    Vector load with stride Ak
    Loads VL elements from memory address A0
  Vi ,A0,1      Vi = M(A0)+1
    Vector load with stride 1
    Special case
Cray X-MP Instructions (cont’d)
Store instructions
  ,A0,Ak Vi
    Vector store with stride Ak
    Stores VL elements starting at memory address A0
  ,A0,1 Vi
    Vector store with stride 1
    Special case
Cray X-MP Instructions (cont’d)
Logical AND instructions
  Vi Vj&Vk      Vi = Vj & Vk
  Vi Sj&Vk      Vi = Sj & Vk
    Sj is a scalar
Shift instructions
  Vi Vj>Ak      Vi = Vj >> Ak
  Vi Vj<Ak      Vi = Vj << Ak
    Left/right shift each element of Vj and store the result in Vi
Sample Vector Functional Units
Vector functional unit    # Stages    Available to chain    Vector results
                                      (clock cycles)        (clock cycles)
Integer ADD (64-bit)      3           8                     VL+8
64-bit shift              3           8                     VL+8
128-bit shift             4           9                     VL+9
Floating ADD              6           11                    VL+11
Floating MULTIPLY         7           12                    VL+12
X-MP Pipeline Operation
Three phases
  Setup phase
    Sets functional units to perform the appropriate operation
    Establishes routes to source and destination vector registers
    Requires 3 clock cycles for all functional units
  Execution phase
  Shutdown phase
X-MP Pipeline Operation (Cont’d)
Three phases (cont’d)
Execution phase
  Source and destination vector registers are reserved
    Cannot be used by another instruction
  Source vector register is reserved for VL+3 clock cycles
    VL = vector length
  One pair of operands enters the first stage per clock cycle
X-MP Pipeline Operation (Cont’d)
Three phases (cont’d)
Shutdown phase
  Shutdown time = 3 clock cycles
  Shutdown time
    Time difference between when the last result emerges and when the destination vector register becomes available for other instructions
X-MP Pipeline Operation (Cont’d)
Three phases (cont’d)
Shutdown phase
  Destination register becomes available after 3 + n + (VL – 1) + 3 = n + VL + 5 clock cycles
    Setup time = shutdown time = 3 clock cycles
    First result comes after n clock cycles
    Remaining (VL – 1) results come out at one/clock cycle
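The timing expression can be checked against the functional-unit table with a brief Python sketch. The function name `destination_ready_cycles` and the dictionary `STAGES` are assumptions of this example; the stage counts are copied from the Sample Vector Functional Units table.

```python
SETUP_CYCLES = 3     # setup time for all functional units
SHUTDOWN_CYCLES = 3  # shutdown time


def destination_ready_cycles(stages, vl):
    """Clock cycles until the destination vector register is available again.

    setup + first result after `stages` cycles + (vl - 1) remaining results
    at one per cycle + shutdown, which simplifies to stages + vl + 5.
    """
    if stages < 1 or vl < 1:
        raise ValueError("stages and vl must be positive")
    return SETUP_CYCLES + stages + (vl - 1) + SHUTDOWN_CYCLES


# Number of pipeline stages per functional unit, from the table.
STAGES = {
    "Integer ADD (64-bit)": 3,
    "64-bit shift": 3,
    "128-bit shift": 4,
    "Floating ADD": 6,
    "Floating MULTIPLY": 7,
}

if __name__ == "__main__":
    for unit, n in STAGES.items():
        print(unit, destination_ready_cycles(n, 64))
```

With n = 6 the expression gives VL + 11, matching the "Vector results" entry for the floating-point adder.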
A Simple Vector Add Operation
A1 5          A1 = 5
VL A1         VL = A1 (vector length = 5)
V1 V2+FV3     V1 = V2 + V3 (floating-point add)
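Overlapped Vector Operations
A1 5          A1 = 5
VL A1         VL = A1 (vector length = 5)
V1 V2+FV3     V1 = V2 + V3 (floating-point add)
V4 V5*FV6     V4 = V5 * V6 (floating-point multiply, independent of the add)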
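Chaining Example
A1 5          A1 = 5
VL A1         VL = A1 (vector length = 5)
V1 V2+FV3     V1 = V2 + V3 (floating-point add)
V4 V5*FV1     V4 = V5 * V1 (floating-point multiply, uses the V1 result of the add)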
Vector Processing Performance
Interleaved Memories
Traditional memory designs
  Provide sequential, non-overlapped access
  Use high-order interleaving
Interleaved memories
  Facilitate overlapped, pipelined access
  Used by vector and high performance systems
  Use low-order interleaving
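The difference between the two interleaving schemes is which address bits select the bank. The Python sketch below illustrates this; the function names and the example sizes (8 banks of 4 words) are assumptions made for the illustration.

```python
def low_order_bank(address, banks):
    """Low-order interleaving: the low-order part of the address selects the bank,
    so consecutive addresses fall in consecutive banks."""
    if address < 0 or banks < 1:
        raise ValueError("address must be >= 0 and banks must be >= 1")
    return address % banks


def high_order_bank(address, banks, words_per_bank):
    """High-order interleaving: the high-order part of the address selects the bank,
    so consecutive addresses stay in the same bank until it is full."""
    if not 0 <= address < banks * words_per_bank:
        raise ValueError("address out of range")
    return address // words_per_bank


if __name__ == "__main__":
    addresses = range(10)
    print("low-order :", [low_order_bank(a, 8) for a in addresses])
    print("high-order:", [high_order_bank(a, 8, 4) for a in addresses])
```

With low-order interleaving a unit-stride vector access visits a different bank on every reference, which is what makes overlapped, pipelined access possible.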
Interleaved Memories (cont’d)
Interleaved Memories (cont’d)
Two types of designs
  Synchronized access organization
    Upper m bits are given to all memory banks simultaneously
    Requires output latches
    Does not efficiently support non-sequential access
  Independent access organization
    Supports pipelined access for arbitrary access pattern
    Requires address registers
Interleaved Memories (cont’d)
Synchronized access organization
Interleaved Memories (cont’d)
Pipelined transfer of data in interleaved memories
Interleaved Memories (cont’d)
Independent access organization
Interleaved Memories (cont’d)
Number of banks B ≥ M
  M = memory access time in cycles
Access becomes sequential if stride = B (every reference goes to the same bank)
Example: B = 8, M = 6 clock cycles, stride = 1
  Time to read 16 words = 6 + 16 = 22 clock cycles
  If stride is 8, it takes 16 * 6 = 96 clock cycles
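The two access-time figures can be reproduced with a small Python sketch. It covers only the two situations described above (enough distinct banks to pipeline, or every reference hitting one bank); the function name `read_time` and the use of `gcd` to count the banks touched are assumptions of this illustration, and strides that fall between the two cases are deliberately left unhandled.

```python
from math import gcd


def read_time(words, banks, access_cycles, stride):
    """Clock cycles to read `words` elements from low-order interleaved memory.

    Implements only the two cases given in the slides:
      - pipelined access:   access_cycles + words
      - single-bank access: words * access_cycles
    """
    if min(words, banks, access_cycles, stride) < 1:
        raise ValueError("all arguments must be positive")
    # Number of distinct banks a constant-stride access pattern visits.
    banks_touched = banks // gcd(stride, banks)
    if banks_touched >= access_cycles:
        # A bank is revisited only after its previous access has finished.
        return access_cycles + words
    if banks_touched == 1:
        # Every reference waits for the same bank: fully sequential.
        return words * access_cycles
    raise NotImplementedError("stride not covered by the two cases in the slides")


if __name__ == "__main__":
    print(read_time(16, 8, 6, 1))  # stride 1
    print(read_time(16, 8, 6, 8))  # stride 8
```

With B = 8 and M = 6 this gives 22 clock cycles for stride 1 and 96 clock cycles for stride 8, the two values worked out above.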