Vector Processors Prof. Sivarama Dandamudi School of Computer Science Carleton University


Vector Processors

Prof. Sivarama Dandamudi

School of Computer Science

Carleton University


Carleton University © S. Dandamudi 2

Pipelining

Vector machines exploit pipelining in all their activities
    Computations
    Movement of data from/to memory

Pipelining provides overlapped execution
    Increases throughput
    Hides latency
    …


Pipelining (cont’d)

Pipeline overlaps execution: 6 versus 18 cycles


Pipelining (cont’d)

One measure of performance:

$$\text{Speedup} = \frac{\text{Non-pipelined execution time}}{\text{Pipelined execution time}}$$

Ideal case: an n-stage pipeline should give a speedup of n

Two factors affect this:
    Pipeline fill
    Pipeline drain


Pipelining (cont’d)

N computations, each takes n * T time

Non-pipelined time = N * n * T

Pipelined time = n * T + (N − 1) * T = (n + N − 1) * T

$$\text{Speedup} = \frac{n \cdot N}{n + N - 1} = \frac{1}{\frac{1}{N} + \frac{1}{n} - \frac{1}{n \cdot N}}$$
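The speedup expression can be checked numerically. The following is a minimal sketch in Python; the function name `pipeline_speedup` is made up for illustration, and times are expressed in units of the stage time T.

```python
def pipeline_speedup(n: int, N: int) -> float:
    """Speedup of an n-stage pipeline over non-pipelined execution
    for N computations (all times in units of the stage time T)."""
    non_pipelined = N * n      # each computation takes n stage times
    pipelined = n + N - 1      # n to fill the pipeline, then one result per stage time
    return non_pipelined / pipelined


if __name__ == "__main__":
    for N in (1, 10, 100, 1000):
        print(N, round(pipeline_speedup(6, N), 2))
```

For a single computation (N = 1) the speedup is 1, and as N grows the speedup approaches, but never reaches, the pipeline depth n.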


Pipelining (cont’d)

[Figure: speedup (1.0 to 9.0) versus number of elements, N (0 to 150), for pipeline depths n = 3, 6, and 9]


Pipelining (cont’d)

[Figure: plot against pipeline depth, n]


Vector Machines

Provide high-level operations
Work on vectors (linear arrays of numbers)
A typical vector operation
    Add two 64-element floating-point vectors
    Equivalent to an entire loop
CRAY format
    V3 ← V2 VOP V1
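To make "equivalent to an entire loop" concrete, here is a small Python sketch of the loop that a single vector add replaces. The function name `vector_add` and the use of Python lists to stand in for vector registers are assumptions for illustration only.

```python
VECTOR_LENGTH = 64  # elements per vector in the typical operation above


def vector_add(v1, v2):
    """Model of V3 <- V2 + V1, written out as the element-by-element loop
    that one vector instruction stands for."""
    v3 = [0.0] * len(v1)
    for i in range(len(v1)):
        v3[i] = v2[i] + v1[i]
    return v3


if __name__ == "__main__":
    a = [float(i) for i in range(VECTOR_LENGTH)]
    b = [1.0] * VECTOR_LENGTH
    print(vector_add(a, b)[:4])
```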


Vector Machines (cont’d)

Consists of
    Scalar unit
        Works on scalars
        Address arithmetic
    Vector unit
        Responsible for vector operations
        Several vector functional units
            Integer add, FP add, FP multiply …


Vector Machines (cont’d)

Two types of architecture
    Memory-to-memory architecture
        Vectors are memory resident
        The first vector machines were of this type
        Examples: CDC Star 100, CYBER 205
    Vector-register architecture
        Vectors are stored in registers
        Modern vector machines belong to this type
        Examples: Cray 1/2/X-MP/Y-MP, NEC SX/2, Fujitsu VP200, Hitachi S820


Components

Primary components of a vector-register machine
    Vector registers
        Each register can hold a small vector
        Example: Cray-1 has 8 vector registers
            Each vector register can hold 64 doublewords (64-bit values)
        Two read ports and one write port
            Allows overlap among the vector operations


Cray-1 Architecture


Components

Vector functional units
    Each unit is fully pipelined
    Can start a new operation on every clock cycle
    Cray-1 has six functional units
        FP add, FP multiply, FP reciprocal, Integer add, Logical, Shift

Scalar registers
    Store scalars
    Compute addresses to pass on to the load/store unit


Components

Vector load/store unit
    Moves vectors between memory and vector registers
    Load and store operations are pipelined
    Some processors have more than one load/store unit
        NEC SX/2 has 8 load/store units

Memory
    Designed to allow pipelined access
    Typically uses interleaved memories
        Will discuss later


Some Example Vector Machines

Machine         Year   # VR      VR size (elements)   # LSUs
CRAY-2          1985   8         64                   1
Cray Y-MP       1988   8         64                   2 loads/1 store
Fujitsu VP100   1982   8-256     32-1024              2
Hitachi S810    1983   32        256                  4
NEC SX/2        1984   8+8192    256+var.             8
Convex C-1      1985   8         128                  1

VR = vector registers; LSUs = load/store units


Some Example Vector Machines (cont’d)

Vector functional units
    Cray X-MP/Y-MP
        8 units
            FP add, FP multiply, FP reciprocal
            Integer add, 2 logical
            Shift
            Population count/parity


Some Example Vector Machines (cont’d)

Vector functional units (cont’d)

    NEC SX/2
        16 units
            4 FP add, 4 FP multiply/divide
            4 Integer add/logical, 4 Shift


Advantages of Vector Machines

Flynn's bottleneck can be reduced
    Vector instructions significantly improve code density
    A single vector instruction specifies a great deal of work
    Reduce the number of instructions needed to execute a program
    Eliminate control overhead of a loop
        A vector instruction represents the entire loop
        Loop overhead can be substantial


Advantages of Vector Machines (cont’d)

Impact of main memory latency can be reduced
    Vector instructions that access memory have a known pattern
    Pipelined access can be used
    Can exploit interleaved memory
    High latency associated with memory can be amortized over the entire vector
        Latency is not associated with each data item
            When accessing a floating-point number


Advantages of Vector Machines (cont’d)

Control hazards can be reduced
    Vector machines organize data operands into regular sequences
        Suitable for pipelined access in hardware
        A vector operation replaces a loop

Data hazards can be eliminated
    Due to the structured nature of the data
    Allows planned prefetching of data


Example Problem

A typical vector problem
    Y = a * X + Y
    X and Y are vectors
    This problem is known as
        SAXPY (single precision A*X Plus Y)
        DAXPY (double precision A*X Plus Y)
    SAXPY/DAXPY is a small piece of code that takes most of the time in the Linpack benchmark


Example Problem (cont’d)

Non-vector code fragment

          LD    F0,a
          ADDI  R4,Rx,#512   ;last address to load
    loop: LD    F2,0(Rx)     ;F2 := M[0+Rx], i.e., load X[i]
          MULT  F2,F0,F2     ;a*X[i]


Example Problem (cont’d)

          LD    F4,0(Ry)     ;load Y[i]
          ADD   F4,F2,F4     ;a*X[i] + Y[i]
          SD    F4,0(Ry)     ;store into Y[i]
          ADDI  Rx,Rx,#8     ;increment index to X
          ADDI  Ry,Ry,#8     ;increment index to Y
          SUB   R20,R4,Rx    ;R20 := R4-Rx
          JNZ   R20,loop     ;jump if not done

9 instructions in the loop


Example Problem (cont’d)

Vector code fragment

          LD      F0,a        ;load scalar a
          LV      V1,Rx       ;load vector X
          MULTSV  V2,F0,V1    ;V2 := F0 * V1
          LV      V3,Ry       ;load vector Y
          ADDV    V4,V2,V3    ;V4 := V2 + V3
          SV      Ry,V4       ;store the result

Only 6 instructions!


Example Problem (cont’d)

Two main observations
    Execution efficiency
        Vector code
            Executes 6 instructions
        Non-vector code
            Nearly 600 instructions (9 * 64 = 576 in the loop, plus 2 before it)
            Lots of control overhead
                4 out of 9 instructions!
                Absent in the vector code
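The instruction counts above can be reproduced with a small Python model. This is only a sketch: the function names are hypothetical, Python lists stand in for memory and vector registers, and the counts simply mirror the two code fragments (2 setup instructions plus 9 per iteration for the scalar loop, 6 in total for the vector version).

```python
def saxpy_scalar(a, x, y):
    """Element-by-element loop; returns (result, instructions executed)."""
    instructions = 2                       # LD F0,a and ADDI R4,Rx,#512
    result = list(y)
    for i in range(len(x)):
        result[i] = a * x[i] + result[i]   # LD, MULT, LD, ADD, SD
        instructions += 9                  # 5 work + 4 loop-control instructions
    return result, instructions


def saxpy_vector(a, x, y):
    """Whole-vector form; models the 6-instruction vector fragment."""
    result = [a * xi + yi for xi, yi in zip(x, y)]
    return result, 6                       # LD, LV, MULTSV, LV, ADDV, SV


if __name__ == "__main__":
    x = [float(i) for i in range(64)]
    y = [1.0] * 64
    print(saxpy_scalar(2.0, x, y)[1], saxpy_vector(2.0, x, y)[1])
```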


Example Problem (cont’d)

Two main observations (cont'd)
    Frequency of pipeline interlock
        Non-vector code
            Every ADD must wait for MULT
            Every SD must wait for ADD
            Loop unrolling can eliminate this interlock
        Vector code
            Element operations within each vector instruction are independent
            Pipeline stalls once per vector operation
                Not once per vector element


Vector Length

Vector register has a natural vector length
    64 elements in CRAY systems

What if the vector has a different length?
    Three cases
        Vector length < Vector register length
            Use a vector length register to indicate the vector length
        Vector length = Vector register length
        Vector length > Vector register length


Vector Length (cont’d)

Vector length > Vector register length
    Use strip mining
    Vector is partitioned into strips that are less than or equal to the vector register length
    The leftover strip that is shorter than the vector register length is the odd strip
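Strip mining can be sketched in a few lines of Python. The names `strip_mine` and `saxpy_strip_mined` are hypothetical, and processing the odd strip first is an assumption made for this sketch; the slide does not say where the odd strip falls.

```python
MVL = 64  # maximum vector length (vector register size), as in CRAY systems


def strip_mine(n, mvl=MVL):
    """Yield (start, length) pairs that cover a vector of n elements.

    Assumption: the odd (short) strip is handled first, then full strips.
    """
    start = 0
    odd = n % mvl
    if odd:
        yield start, odd
        start += odd
    while start < n:
        yield start, mvl
        start += mvl


def saxpy_strip_mined(a, x, y, mvl=MVL):
    """Y = a*X + Y computed one strip at a time."""
    out = list(y)
    for start, vl in strip_mine(len(x), mvl):   # vl plays the role of the vector length register
        end = start + vl
        out[start:end] = [a * xi + yi for xi, yi in zip(x[start:end], y[start:end])]
    return out


if __name__ == "__main__":
    print(list(strip_mine(150)))
```

For a 150-element vector this produces one odd strip of 22 elements followed by two full strips of 64.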


Vector Stride

Vector stride
    Distance separating the elements that are to be merged into a single vector
        In elements, not bytes

Accesses to multidimensional matrices typically have non-unit stride patterns
    Example: matrix multiply


Vector Stride (cont’d)

Matrix multiplication

    for (i = 1, 100)
        for (j = 1, 100)
            A[i,j] = 0
            for (k = 1, 100)
                A[i,j] = A[i,j] + B[i,k] * C[k,j]

In the inner loop (k varies), one of B[i,k] and C[k,j] is accessed with unit stride and the other with non-unit stride


Vector Stride (cont’d)

Access pattern of B and C depends on how the matrix is stored
    Row-major
        Matrix is stored row-by-row
        Used by most languages except FORTRAN
    Column-major
        Matrix is stored column-by-column
        Used by FORTRAN


Vector Stride (cont’d)

    11 12 13 14
    21 22 23 24
    31 32 33 34
    41 42 43 44
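How the storage order determines the stride can be illustrated with a short Python sketch using the 4 x 4 matrix above. The function names and the `"row"`/`"col"` and `"col_index"`/`"row_index"` labels are made up for this example.

```python
MATRIX = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
]


def linearize(matrix, order):
    """Return the matrix elements in memory order ('row' or 'col' major)."""
    rows, cols = len(matrix), len(matrix[0])
    if order == "row":
        return [matrix[i][j] for i in range(rows) for j in range(cols)]
    if order == "col":
        return [matrix[i][j] for j in range(cols) for i in range(rows)]
    raise ValueError("order must be 'row' or 'col'")


def stride(rows, cols, order, varying):
    """Stride, in elements, between consecutive accesses when one index varies.

    varying='col_index' walks along a row (like B[i,k] as k changes);
    varying='row_index' walks down a column (like C[k,j] as k changes).
    """
    along_row = 1 if order == "row" else rows
    down_column = cols if order == "row" else 1
    return along_row if varying == "col_index" else down_column


if __name__ == "__main__":
    print(linearize(MATRIX, "row"))
    print(linearize(MATRIX, "col"))
    print(stride(100, 100, "row", "row_index"), stride(100, 100, "col", "row_index"))
```

For the 100 x 100 matrices in the multiplication example, row-major storage gives B[i,k] unit stride and C[k,j] a stride of 100; column-major storage reverses the two.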


Cray X-MP Instructions

Integer addition
    Vi Vj+Vk      Vi = Vj + Vk
    Vi Sj+Vk      Vi = Sj + Vk
        Sj is a scalar

Floating-point addition
    Vi Vj+FVk     Vi = Vj + Vk
    Vi Sj+FVk     Vi = Sj + Vk
        Sj is a scalar


Cray X-MP Instructions (cont’d)

Load instructions
    Vi ,A0,Ak     Vi = M(A0)+Ak
        Vector load with stride Ak
        Loads VL elements starting at memory address A0
    Vi ,A0,1      Vi = M(A0)+1
        Vector load with stride 1
        Special case


Cray X-MP Instructions (cont’d)

Store instructions
    ,A0,Ak Vi
        Vector store with stride Ak
        Stores VL elements starting at memory address A0
    ,A0,1 Vi
        Vector store with stride 1
        Special case
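The strided vector load and store instructions can be modeled in Python as follows. This is a rough sketch, not a description of the hardware: memory is assumed to be a word-addressed list, and the function names `vector_load` and `vector_store` are hypothetical.

```python
def vector_load(memory, a0, ak, vl):
    """Model of 'Vi ,A0,Ak': read VL elements starting at address a0 with stride ak."""
    return [memory[a0 + i * ak] for i in range(vl)]


def vector_store(memory, a0, ak, vi):
    """Model of ',A0,Ak Vi': write the elements of vi starting at address a0 with stride ak."""
    for i, value in enumerate(vi):
        memory[a0 + i * ak] = value


if __name__ == "__main__":
    memory = list(range(100))
    print(vector_load(memory, 10, 4, 5))   # elements at addresses 10, 14, 18, 22, 26
    vector_store(memory, 0, 2, [7, 8, 9])
    print(memory[:6])
```

Passing a stride of 1 gives the special-case unit-stride forms.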


Cray X-MP Instructions (cont’d)

Logical AND instructions
    Vi Vj&Vk      Vi = Vj & Vk
    Vi Sj&Vk      Vi = Sj & Vk
        Sj is a scalar

Shift instructions
    Vi Vj>Ak      Vi = Vj >> Ak
    Vi Vj<Ak      Vi = Vj << Ak
        Right/left shift each element of Vj and store the result in Vi


Sample Vector Functional Units

Vector functional unit   # Stages   Available to chain   Vector results
Integer ADD (64-bit)     3          8                    VL+8
64-bit shift             3          8                    VL+8
128-bit shift            4          9                    VL+9
Floating ADD             6          11                   VL+11
Floating MULTIPLY        7          12                   VL+12

Times are in clock cycles; VL = vector length


X-MP Pipeline Operation

Three phases
    Setup phase
        Sets functional units to perform the appropriate operation
        Establishes routes to source and destination vector registers
        Requires 3 clock cycles for all functional units
    Execution phase
    Shutdown phase


X-MP Pipeline Operation (Cont’d)

Three phases (cont’d)

Execution phase
    Source and destination vector registers are reserved
        Cannot be used by another instruction
    Source vector register is reserved for VL+3 clock cycles
        VL = vector length
    One pair of operands enters the first stage per clock cycle


X-MP Pipeline Operation (Cont’d)

Three phases (cont’d)

Shutdown phase
    Shutdown time = 3 clock cycles
    Shutdown time
        Time difference between when the last result emerges and when the destination vector register becomes available for other instructions


X-MP Pipeline Operation (Cont’d)

Three phases (cont’d)

Shutdown phase
    Destination register becomes available after
        3 + n + (VL − 1) + 3 = n + VL + 5 clock cycles
        Setup time = shutdown time = 3 clock cycles
        First result comes after n clock cycles
        Remaining (VL − 1) results come out at one per clock cycle
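The timing expression can be written as a small Python helper. The function name `destination_available` and the constant names are assumptions for this sketch; the values come from the three phases described above.

```python
SETUP = 3      # setup time in clock cycles
SHUTDOWN = 3   # shutdown time in clock cycles


def destination_available(n_stages, vl):
    """Clock cycles until the destination vector register is free again.

    setup + first result after n_stages cycles
          + remaining (vl - 1) results at one per cycle + shutdown
    """
    return SETUP + n_stages + (vl - 1) + SHUTDOWN


if __name__ == "__main__":
    # Floating ADD has 6 stages; with VL = 5 the register is free after 16 cycles.
    print(destination_available(6, 5))
```

With n = 3, 4, 6, and 7 stages this reduces to VL+8, VL+9, VL+11, and VL+12, which agrees with the "Vector results" column of the functional-unit table.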


A Simple Vector Add Operation

    A1  5           A1 = 5
    VL  A1          VL = A1 (vector length is 5)
    V1  V2+FV3      V1 = V2 + V3 (floating-point add)


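Overlapped Vector Operations

    A1  5           A1 = 5
    VL  A1          VL = A1 (vector length is 5)
    V1  V2+FV3      V1 = V2 + V3 (floating-point add)
    V4  V5*FV6      V4 = V5 * V6 (floating-point multiply, independent of the add)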


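Chaining Example

    A1  5           A1 = 5
    VL  A1          VL = A1 (vector length is 5)
    V1  V2+FV3      V1 = V2 + V3 (floating-point add)
    V4  V5*FV1      V4 = V5 * V1 (floating-point multiply, uses the result V1 of the add)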


Vector Processing Performance


Interleaved Memories

Traditional memory designs
    Provide sequential, non-overlapped access
    Use high-order interleaving

Interleaved memories
    Facilitate overlapped, pipelined access
    Used by vector and high performance systems
    Use low-order interleaving
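The difference between the two interleaving schemes is which address bits select the bank. A minimal Python sketch, assuming word addresses and equal-sized banks (the function names are hypothetical):

```python
def high_order_bank(addr, words_per_bank):
    """High-order interleaving: the upper address bits select the bank,
    so consecutive addresses fall in the same bank."""
    return addr // words_per_bank


def low_order_bank(addr, num_banks):
    """Low-order interleaving: the lower address bits select the bank,
    so consecutive addresses fall in consecutive banks."""
    return addr % num_banks


if __name__ == "__main__":
    # 4 banks of 4 words each, addresses 0..7
    print([high_order_bank(a, 4) for a in range(8)])
    print([low_order_bank(a, 4) for a in range(8)])
```

With low-order interleaving a unit-stride vector access touches a different bank on each reference, which is what makes overlapped, pipelined access possible.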


Interleaved Memories (cont’d)


Interleaved Memories (cont’d)

Two types of designs
    Synchronized access organization
        Upper m bits are given to all memory banks simultaneously
        Requires output latches
        Does not efficiently support non-sequential access
    Independent access organization
        Supports pipelined access for arbitrary access patterns
        Requires address registers


Interleaved Memories (cont’d)

Synchronized access organization


Interleaved Memories (cont’d)

Pipelined transfer of data in interleaved memories


Interleaved Memories (cont’d)

Independent access organization


Interleaved Memories (cont’d)

Number of banks: B ≥ M
    M = memory access time in cycles

Access becomes sequential (every reference goes to the same bank) if stride = B

Example: B = 8, M = 6 clock cycles
    With stride = 1, time to read 16 words = 6 + 16 = 22 clock cycles
    If stride is 8, it takes 16 * 6 = 96 clock cycles
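The two access times in this example can be reproduced with a simple Python model. This is a sketch under stated assumptions, not a cycle-accurate simulation: the function name `read_time` is hypothetical, the conflict-free and same-bank cases follow the slide's arithmetic, and the estimate for strides in between (where only a few banks are visited) is an assumption added here.

```python
from math import gcd


def read_time(num_words, stride, num_banks, access_time):
    """Estimated clock cycles to read num_words from low-order interleaved memory."""
    # Number of distinct banks touched by this stride
    banks_visited = num_banks // gcd(stride, num_banks)
    if banks_visited >= access_time:
        # No bank conflicts: pay the access time once, then one word per cycle
        return access_time + num_words
    # Assumption: with fewer banks than the access time, each round of
    # banks_visited words costs one full access time
    rounds = -(-num_words // banks_visited)   # ceiling division
    return rounds * access_time


if __name__ == "__main__":
    print(read_time(16, 1, 8, 6))   # stride 1
    print(read_time(16, 8, 8, 6))   # stride 8: every access hits the same bank
```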
