Vector Processors Prof. Sivarama Dandamudi School of Computer Science Carleton University


Carleton University © S. Dandamudi

Pipelining

Vector machines exploit pipelining in all their activities
- Computations
- Movement of data from/to memory

Pipelining provides overlapped execution
- Increases throughput
- Hides latency
- …


Pipelining (cont’d)

Pipeline overlaps execution: 6 versus 18 cycles



One measure of performance:

$$\text{Speedup} = \frac{\text{Non-pipelined execution time}}{\text{Pipelined execution time}}$$

Ideal case: an n-stage pipeline should give a speedup of n

Two factors affect this:
- Pipeline fill
- Pipeline drain



N computations, each taking n * T time

Non-pipelined time = N * n * T

Pipelined time = n * T + (N – 1) * T = (n + N – 1) * T

$$\text{Speedup} = \frac{n N}{n + N - 1} = \frac{1}{1/N + 1/n - 1/(n N)}$$
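The speedup expression is easy to evaluate for specific pipeline depths and vector lengths. The following is a minimal sketch in Python; the function name `pipeline_speedup` and its parameter names are assumptions made for illustration and do not come from the slides.

```python
def pipeline_speedup(n_stages, n_items):
    """Speedup of an n-stage pipeline over non-pipelined execution of N items.

    Implements n*N / (n + N - 1); the stage time T cancels out.
    """
    if n_stages < 1 or n_items < 1:
        raise ValueError("n_stages and n_items must be positive")
    return (n_stages * n_items) / (n_stages + n_items - 1)


# Speedup approaches the pipeline depth n only as N grows large.
for n in (3, 6, 9):
    print(n, [round(pipeline_speedup(n, N), 2) for N in (1, 10, 150)])
```

With a single item (N = 1) the speedup is 1 regardless of depth, which reflects the fill and drain cost that the ideal figure of n ignores.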



[Figure: speedup (1.0 to 9.0) versus number of elements, N (0 to 150), with curves for n = 3, n = 6, and n = 9]



Pipeline depth, n


Vector Machines

Provide high-level operations
- Work on vectors (linear arrays of numbers)

A typical vector operation
- Add two 64-element floating-point vectors
- Equivalent to an entire loop

CRAY format
- V3 V2 VOP V1 (that is, V3 = V2 VOP V1)


Vector Machines (cont’d)

Consists of
- Scalar unit
  - Works on scalars
  - Address arithmetic
- Vector unit
  - Responsible for vector operations
  - Several vector functional units
    - Integer add, FP add, FP multiply, …


Vector Machines (cont’d)

Two types of architecture
- Memory-to-memory architecture
  - Vectors are memory resident
  - The first vector machines were of this type
  - Examples: CDC Star 100, CYBER 205
- Vector-register architecture
  - Vectors are stored in registers
  - Modern vector machines belong to this type
  - Examples: Cray 1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820


Components

Primary components of a vector-register machine
- Vector registers
  - Each register can hold a small vector
  - Example: Cray-1 has 8 vector registers
    - Each vector register can hold 64 doublewords (64-bit values)
  - Two read ports and one write port
    - Allows overlap among the vector operations


Cray-1 Architecture


Components

- Vector functional units
  - Each unit is fully pipelined
    - Can start a new operation on every clock cycle
  - Cray-1 has six functional units
    - FP add, FP multiply, FP reciprocal, integer add, logical, shift
- Scalar registers
  - Store scalars
  - Compute addresses to pass on to the load/store unit


Components

- Vector load/store unit
  - Moves vectors between memory and vector registers
  - Load and store operations are pipelined
  - Some processors have more than one load/store unit
    - NEC SX/2 has 8 load/store units
- Memory
  - Designed to allow pipelined access
  - Typically uses interleaved memories
    - Discussed later under Interleaved Memories


Some Example Vector Machines

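| Machine | Year | # Vector registers (VR) | VR size (elements) | # Load/store units (LSUs) |
|---|---|---|---|---|
| CRAY-2 | 1985 | 8 | 64 | 1 |
| Cray Y-MP | 1988 | 8 | 64 | 2 loads/1 store |
| Fujitsu VP100 | 1982 | 8-256 | 32-1024 | 2 |
| Hitachi S810 | 1983 | 32 | 256 | 4 |
| NEC SX/2 | 1984 | 8+8192 | 256+var. | 8 |
| Convex C-1 | 1985 | 8 | 128 | 1 |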


Some Example Vector Machines (cont’d)

Vector functional units
- Cray X-MP/Y-MP: 8 units
  - FP add, FP multiply, FP reciprocal
  - Integer add, 2 logical
  - Shift
  - Population count/parity


Some Example Vector Machines (cont’d)

Vector functional units (cont’d)

- NEC SX/2: 16 units
  - 4 FP add, 4 FP multiply/divide
  - 4 integer add/logical, 4 shift


Advantages of Vector Machines

Flynn’s bottleneck can be reduced
- Vector instructions significantly improve code density
  - A single vector instruction specifies a great deal of work
  - Reduces the number of instructions needed to execute a program
- Eliminates the control overhead of a loop
  - A vector instruction represents the entire loop
  - Loop overhead can be substantial


Advantages of Vector Machines (cont’d)

Impact of main memory latency can be reduced
- Vector instructions that access memory have a known pattern
- Pipelined access can be used
- Can exploit interleaved memory
- High latency associated with memory can be amortized over the entire vector
  - Latency is not incurred for each data item (for example, each floating-point number accessed)


Advantages of Vector Machines (cont’d)

Control hazards can be reduced
- Vector machines organize data operands into regular sequences
  - Suitable for pipelined access in hardware
  - A vector operation replaces a loop

Data hazards can be eliminated
- Due to the structured nature of the data
- Allows planned prefetching of data


Example Problem

A typical vector problem: Y = a * X + Y
- X and Y are vectors
- This problem is known as
  - SAXPY (single precision A*X Plus Y)
  - DAXPY (double precision A*X Plus Y)
- SAXPY/DAXPY represents a small piece of code that takes most of the time in the benchmark
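For reference, the operation can be written as an element-by-element loop. This is a minimal sketch in Python; the function name `daxpy` and the use of Python lists for the vectors are assumptions made for illustration. It returns a new list rather than overwriting Y in place.

```python
def daxpy(a, x, y):
    """Return a*x + y computed one element at a time, as a scalar loop would."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    result = []
    for i in range(len(x)):              # one iteration per vector element
        result.append(a * x[i] + y[i])   # multiply, then add
    return result


print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

Each iteration performs one multiply and one add plus the loop bookkeeping, which is the work a vector machine expresses in a handful of vector instructions.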


Example Problem (cont’d)

Non-vector code fragment

          LD    F0,a          ;load scalar a
          ADDI  R4,Rx,#512    ;last address to load
    loop: LD    F2,0(Rx)      ;F2 := M[0+Rx], i.e., load X[i]
          MULT  F2,F0,F2      ;a*X[i]
          LD    F4,0(Ry)      ;load Y[i]
          ADD   F4,F2,F4      ;a*X[i] + Y[i]
          SD    F4,0(Ry)      ;store into Y[i]
          ADDI  Rx,Rx,#8      ;increment index to X
          ADDI  Ry,Ry,#8      ;increment index to Y
          SUB   R20,R4,Rx     ;R20 := R4-Rx
          JNZ   R20,loop      ;jump if not done

9 instructions in the loop



Vector code fragment

    LD      F0,a        ;load scalar a
    LV      V1,Rx       ;load vector X
    MULTSV  V2,F0,V1    ;V2 := F0 * V1
    LV      V3,Ry       ;load vector Y
    ADDV    V4,V2,V3    ;V4 := V2 + V3
    SV      Ry,V4       ;store the result

Only 6 instructions (5 vector, 1 scalar)!



Two main observations
- Execution efficiency
  - Vector code
    - Executes 6 instructions
  - Non-vector code
    - Nearly 600 instructions (9 * 64 = 576 in the loop)
    - Lots of control overhead
      - 4 out of 9 instructions!
      - Absent in the vector code



Two main observations
- Frequency of pipeline interlock
  - Non-vector code
    - Every ADD must wait for MULT
    - Every SD must wait for ADD
    - Loop unrolling can eliminate this interlock
  - Vector code
    - Each instruction is independent
    - Pipeline stalls once per vector operation
      - Not once per vector element


Vector Length

A vector register has a natural vector length
- 64 elements in CRAY systems

What if the vector has a different length? Three cases
- Vector length < vector register length
  - Use a vector length register to indicate the vector length
- Vector length = vector register length
- Vector length > vector register length


Vector Length (cont’d)

Vector length > vector register length
- Use strip mining
- Vector is partitioned into strips that are less than or equal to the vector register length
- Odd strip: the leftover strip that is shorter than the vector register length
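Strip mining can be sketched as a routine that splits a long vector into register-sized pieces. The Python below is illustrative only: the function name `strip_mine`, the default register length of 64, and the choice to process the odd strip first are assumptions, not details taken from the slides.

```python
def strip_mine(vector_length, register_length=64):
    """Return (start, length) pairs that cover the vector.

    Every strip length is <= register_length. The odd strip, if any,
    is placed first (an assumption; it could equally be handled last).
    """
    if vector_length < 0 or register_length < 1:
        raise ValueError("invalid lengths")
    strips = []
    start = 0
    odd = vector_length % register_length
    if odd:
        strips.append((0, odd))          # odd strip: shorter than the register
        start = odd
    while start < vector_length:
        strips.append((start, register_length))   # full-length strip
        start += register_length
    return strips


print(strip_mine(200))  # [(0, 8), (8, 64), (72, 64), (136, 64)]
```

On a real machine each strip's length would be written to the vector length register before the vector instructions for that strip are issued.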


Vector Stride

Vector stride
- Distance separating the elements that are to be merged into a single vector
- Measured in elements, not bytes

Multidimensional matrices often have non-unit stride access patterns
- Example: matrix multiply


Vector Stride (cont’d)

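Matrix multiplication

    for (i = 1, 100)
        for (j = 1, 100)
            A[i,j] = 0
            for (k = 1, 100)
                A[i,j] = A[i,j] + B[i,k] * C[k,j]

In the inner loop over k, B[i,k] steps along a row and C[k,j] steps down a column, so one of the two accesses is unit stride and the other is non-unit stride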



Access pattern of B and C depends on how the matrix is stored
- Row-major
  - Matrix is stored row-by-row
  - Used by most languages except FORTRAN
- Column-major
  - Matrix is stored column-by-column
  - Used by FORTRAN

Example 4 × 4 matrix (each entry shows its row and column index)

    11 12 13 14
    21 22 23 24
    31 32 33 34
    41 42 43 44
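The effect of storage order on stride can be made concrete for the matrix multiply loop. The sketch below is illustrative Python; the function names `index_strides` and `inner_loop_strides` and the layout strings are assumptions made for this example.

```python
def index_strides(n_rows, n_cols, layout):
    """Return (stride when the row index advances, stride when the column index advances).

    Strides are measured in elements, not bytes.
    """
    if layout == "row-major":
        return n_cols, 1
    if layout == "column-major":
        return 1, n_rows
    raise ValueError("layout must be 'row-major' or 'column-major'")


def inner_loop_strides(n, layout):
    """Strides of B[i,k] and C[k,j] as k advances in an n x n matrix multiply."""
    row_step, col_step = index_strides(n, n, layout)
    # k is the column index of B and the row index of C.
    return {"B": col_step, "C": row_step}


print(inner_loop_strides(100, "row-major"))     # {'B': 1, 'C': 100}
print(inner_loop_strides(100, "column-major"))  # {'B': 100, 'C': 1}
```

Whichever layout is used, one of the two operands in the inner loop is accessed with a stride equal to the matrix dimension.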


Cray X-MP Instructions

Integer addition
- Vi Vj+Vk: Vi = Vj + Vk
- Vi Sj+Vk: Vi = Sj + Vk (Sj is a scalar)

Floating-point addition
- Vi Vj+FVk: Vi = Vj + Vk
- Vi Sj+FVk: Vi = Sj + Vk (Sj is a scalar)


Cray X-MP Instructions (cont’d)

Load instructions
- Vi ,A0,Ak: Vi = M(A0)+Ak
  - Vector load with stride Ak
  - Loads VL elements starting at memory address A0
- Vi ,A0,1: Vi = M(A0)+1
  - Vector load with stride 1
  - Special case



Store instructions
- ,A0,Ak Vi
  - Vector store with stride Ak
  - Stores VL elements starting at memory address A0
- ,A0,1 Vi
  - Vector store with stride 1
  - Special case
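The behaviour of the strided load and store instructions can be modelled in a few lines. This is a sketch in Python under stated assumptions: memory is a word-addressed Python list, and the function names `vector_load` and `vector_store` are hypothetical helpers, not Cray mnemonics.

```python
def vector_load(memory, a0, ak, vl):
    """Model of 'Vi ,A0,Ak': read VL elements starting at address A0 with stride Ak."""
    return [memory[a0 + i * ak] for i in range(vl)]


def vector_store(memory, a0, ak, vi):
    """Model of ',A0,Ak Vi': write the elements of Vi starting at A0 with stride Ak."""
    for i, value in enumerate(vi):
        memory[a0 + i * ak] = value


memory = list(range(100, 132))           # 32 words; the word at address k holds 100 + k
v1 = vector_load(memory, a0=4, ak=8, vl=4)
print(v1)                                # [104, 112, 120, 128]

vector_store(memory, a0=0, ak=1, vi=[7, 8])   # stride 1 is the special case
print(memory[:3])                        # [7, 8, 102]
```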



Logical AND instructions
- Vi Vj&Vk: Vi = Vj & Vk
- Vi Sj&Vk: Vi = Sj & Vk (Sj is a scalar)

Shift instructions
- Vi Vj>Ak: Vi = Vj >> Ak
- Vi Vj<Ak: Vi = Vj << Ak
- Left/right shift each element of Vj and store the result in Vi


Sample Vector Functional Units

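| Vector functional unit | # Stages | Available to chain (clock cycles) | Vector results (clock cycles) |
|---|---|---|---|
| Integer ADD (64-bit) | 3 | 8 | VL+8 |
| 64-bit shift | 3 | 8 | VL+8 |
| 128-bit shift | 4 | 9 | VL+9 |
| Floating ADD | 6 | 11 | VL+11 |
| Floating MULTIPLY | 7 | 12 | VL+12 |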


X-MP Pipeline Operation

Three phases
- Setup phase
  - Sets functional units to perform the appropriate operation
  - Establishes routes to source and destination vector registers
  - Requires 3 clock cycles for all functional units
- Execution phase
- Shutdown phase


X-MP Pipeline Operation (Cont’d)

Three phases (cont’d)

- Execution phase
  - Source and destination vector registers are reserved
    - Cannot be used by another instruction
  - Source vector register is reserved for VL + 3 clock cycles
    - VL = vector length
  - One pair of operands per clock cycle enters the first stage




- Shutdown phase
  - Shutdown time = 3 clock cycles
  - Shutdown time is the time difference between when the last result emerges and when the destination vector register becomes available for other instructions
  - Destination register becomes available after 3 + n + (VL – 1) + 3 = n + VL + 5 clock cycles
    - Setup time = shutdown time = 3 clock cycles
    - First result comes after n clock cycles
    - Remaining (VL – 1) results come out at one per clock cycle
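The timing expression can be checked against the functional-unit stage counts listed earlier. The following Python is a small sketch; the constant and function names are assumptions made for illustration, while the 3-cycle setup and shutdown times and the stage counts come from the slides.

```python
SETUP_CYCLES = 3      # setup time, same for all functional units
SHUTDOWN_CYCLES = 3   # shutdown time


def destination_available(n_stages, vl):
    """Clock cycles until the destination vector register is free for other instructions.

    setup + first result after n stages + (VL - 1) remaining results + shutdown
    """
    if n_stages < 1 or vl < 1:
        raise ValueError("n_stages and vl must be positive")
    return SETUP_CYCLES + n_stages + (vl - 1) + SHUTDOWN_CYCLES


for unit, stages in [("Integer ADD (64-bit)", 3),
                     ("Floating ADD", 6),
                     ("Floating MULTIPLY", 7)]:
    print(unit, destination_available(stages, vl=64))   # 72, 75, 76
```

For VL = 64 these values equal VL+8, VL+11, and VL+12, matching the "Vector results" column of the functional unit table.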


A Simple Vector Add Operation

    A1 5         (A1 = 5)
    VL A1        (VL = A1)
    V1 V2+FV3    (V1 = V2 + V3)


Overlapped Vector Operations

    A1 5         (A1 = 5)
    VL A1        (VL = A1)
    V1 V2+FV3    (V1 = V2 + V3)
    V4 V5*FV6    (V4 = V5 * V6)


Chaining Example

    A1 5         (A1 = 5)
    VL A1        (VL = A1)
    V1 V2+FV3    (V1 = V2 + V3)
    V4 V5*FV1    (V4 = V5 * V1, uses the result of the previous instruction)


Vector Processing Performance


Interleaved Memories

Traditional memory designs
- Provide sequential, non-overlapped access
- Use high-order interleaving

Interleaved memories
- Facilitate overlapped, pipelined access
- Used by vector and high-performance systems
- Use low-order interleaving
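The difference between the two interleaving schemes comes down to which part of the address selects the bank. The Python below is a minimal sketch; the function names, the 8-bank configuration, and the 16 words per bank are assumptions chosen for the example.

```python
def low_order_bank(address, num_banks):
    """Low-order interleaving: the low-order part of the address selects the bank."""
    return address % num_banks


def high_order_bank(address, words_per_bank):
    """High-order interleaving: the high-order part of the address selects the bank."""
    return address // words_per_bank


addresses = range(10)
print([low_order_bank(a, 8) for a in addresses])    # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
print([high_order_bank(a, 16) for a in addresses])  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

With low-order interleaving, consecutive addresses fall in different banks, so their accesses can overlap; with high-order interleaving they fall in the same bank and must be served one after another.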


Interleaved Memories (cont’d)



Two types of designs
- Synchronized access organization
  - Upper m bits are given to all memory banks simultaneously
  - Requires output latches
  - Does not efficiently support non-sequential access
- Independent access organization
  - Supports pipelined access for arbitrary access patterns
  - Requires address registers



Synchronized access organization



Pipelined transfer of data in interleaved memories



Independent access organization



Number of banks: B ≥ M
- M = memory access time in cycles

Access becomes sequential if stride = B (every reference falls in the same bank)

Example: B = 8, M = 6 clock cycles
- With stride = 1, time to read 16 words = 6 + 16 = 22 clock cycles
- With stride = 8, it takes 16 * 6 = 96 clock cycles
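The two timings in the example can be reproduced with a small calculation. The sketch below is illustrative Python: the function name `read_time` is hypothetical, and the rule used to decide whether accesses pipeline (the stride must touch at least M distinct banks) is an assumption layered on the slide's two cases. Strides that cause only partial bank conflicts are deliberately left out.

```python
from math import gcd


def read_time(num_words, num_banks, access_cycles, stride):
    """Clock cycles to read num_words from low-order interleaved memory.

    Covers only the two cases in the example: fully pipelined access
    and fully sequential access (all references hit one bank).
    """
    banks_visited = num_banks // gcd(stride, num_banks)
    if banks_visited >= access_cycles:
        return access_cycles + num_words        # pipelined: latency paid once
    if banks_visited == 1:
        return num_words * access_cycles        # same bank every time: sequential
    raise NotImplementedError("partial bank conflicts are not modelled here")


print(read_time(16, num_banks=8, access_cycles=6, stride=1))  # 22
print(read_time(16, num_banks=8, access_cycles=6, stride=8))  # 96
```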
