VECTOR PROCESSING

Τσόλκας Χρήστος & Αντωνίου Χρυσόστομος

Contents

Introduction
Vector Processor Definition
Components & Properties of Vector Processors
Advantages/Disadvantages of Vector Processors
Vector Machines & Architectures
Virtual Processors Model
Vectorization Inhibitors
Improving Performance
Vector Metrics
Applications

Architecture Classification

SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MIMD: Multiple Instruction, Multiple Data
MISD: Multiple Instruction, Single Data

Alternative Forms of Machine Parallelism

Instruction-Level Parallelism (ILP)
Thread-Level Parallelism (TLP)
Vector Data Parallelism (DP)

Drawbacks of ILP and TLP

Coherency
Synchronization
Large overhead
  Instruction fetch and decode: at some point, it is hard to fetch and decode more instructions per clock cycle
  Cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality

Alternative: Vector Processors

What is a Vector Processor?

Provides high-level operations that work on vectors
A vector is a linear array of numbers
  The type of number can vary, but is usually 64-bit floating point (IEEE 754); integers use 2's complement
  The length of the array also varies, depending on the hardware
  Example vectors would be 64 or 128 elements in length
  Small vectors (e.g. MMX/SSE) are about 4 elements in length

Components of a Vector Processor

Vector Registers
  Fixed-length bank holding a single vector
  Has at least 2 read ports and 1 write port
  Typically 8-32 vector registers, each holding 64-128 64-bit elements
Vector Functional Units
  Fully pipelined; can start a new operation every clock cycle
  Typically 4-8 FUs: FP add, FP multiply, FP reciprocal, integer add, logical, shift
Scalar Registers
  Single element for an FP scalar or an address
Load-Store Units

Vector Processor Properties

Computation of each result must be independent of previous results
A single vector instruction specifies a great deal of work
  Equivalent to executing an entire loop
Vector instructions must access memory in a known access pattern
Many control hazards can be avoided, since the entire loop is replaced by a vector instruction

Advantages of Vector Processors

Increase in code density
Decrease in the total number of instructions
Data is organized in regular patterns, which are easier for the hardware to process
Simple loops are replaced with vector instructions, hence a decrease in loop overhead
Scalable

Disadvantages of Vector Processors

Expansion of the Instruction Set Architecture (ISA) is needed
Additional vector functional units and registers
Modification of the memory system

Example Vector Machines

Machine       Year   Clock     Regs    Elements   FUs   LSUs
Cray 1        1976    80 MHz   8       64          6    1
Cray XMP      1983   120 MHz   8       64          8    2 L, 1 S
Cray YMP      1988   166 MHz   8       64          8    2 L, 1 S
Cray C-90     1991   240 MHz   8       128         8    4
Cray T-90     1996   455 MHz   8       128         8    4
Conv. C-1     1984    10 MHz   8       128         4    1
Conv. C-4     1994   133 MHz   16      128         3    1
Fuj. VP200    1982   133 MHz   8-256   32-1024     3    2
NEC SX/2      1984   160 MHz   8+8K    256+var    16    8
NEC SX/3      1995   400 MHz   8+8K    256+var    16    8

Regs = vector registers; Elements = elements per vector register; FUs = vector functional units; LSUs = load-store units (L = load, S = store)

Vector Instruction Execution

Static scheduling
Prefetching
Dynamic scheduling

Styles of Vector Architectures

Memory-memory vector processors
  All vector operations are memory to memory
  Example: CDC Star 100
Vector-register processors
  All vector operations are between vector registers
  Vector equivalent of a load-store architecture
  Includes all vector machines since the late 1980s
  Examples: Cray, Convex, Fujitsu, Hitachi, NEC

Vector-Register Architecture

Memory operations

Load/store operations move groups of data between registers and memory
Three types of addressing
  Unit stride access
    Fastest
  Non-unit (constant) stride access
  Indexed (gather-scatter)
    Vector equivalent of register indirect
    Increases the number of programs that vectorize
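The three addressing modes can be illustrated with a rough Python sketch. Memory is modelled as a flat list, and the function names (load_unit_stride, load_strided, load_indexed, store_indexed) are hypothetical helpers chosen for this illustration, not instructions of any real ISA.

```python
def load_unit_stride(memory, base, vl):
    """Load vl consecutive elements starting at address base."""
    return [memory[base + i] for i in range(vl)]

def load_strided(memory, base, stride, vl):
    """Load vl elements that sit a constant stride apart."""
    return [memory[base + i * stride] for i in range(vl)]

def load_indexed(memory, base, index_vector):
    """Gather: load one element per entry of index_vector."""
    return [memory[base + idx] for idx in index_vector]

def store_indexed(memory, base, index_vector, values):
    """Scatter: store values[i] at address base + index_vector[i]."""
    for idx, value in zip(index_vector, values):
        memory[base + idx] = value

memory = list(range(100, 120))              # toy memory holding 100..119
print(load_unit_stride(memory, 2, 4))       # [102, 103, 104, 105]
print(load_strided(memory, 0, 5, 4))        # [100, 105, 110, 115]
print(load_indexed(memory, 0, [7, 1, 12]))  # [107, 101, 112]
```

The indexed form is what lets loops over sparse or indirectly addressed data vectorize, at the cost of a less predictable memory access pattern.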

Vector Stride

The elements we want may not be stored sequentially in memory
Consider the following code:

      Do 10 I = 1, 100
        Do 10 j = 1, 100
          A(I,j) = 0.0
          Do 10 k = 1, 100
            A(I,j) = A(I,j) + B(I,k)*C(k,j)
10    Continue
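To see where the non-unit stride comes from, the sketch below computes element offsets under two assumptions that the slide does not state explicitly: the arrays are declared 100x100, and they are stored in Fortran's column-major order. The helper name offset is hypothetical.

```python
ROWS = 100  # assumed leading dimension of the 100x100 arrays

def offset(i, j, rows=ROWS):
    """Element offset of X(i,j) for 1-based, column-major storage."""
    return (i - 1) + (j - 1) * rows

# Inner loop over k, with I and j held fixed:
stride_b = offset(1, 2) - offset(1, 1)   # step from B(I,k) to B(I,k+1)
stride_c = offset(2, 1) - offset(1, 1)   # step from C(k,j) to C(k+1,j)
print(stride_b, stride_c)                # 100 1
```

Under these assumptions, successive elements of C are adjacent in memory (unit stride), while successive elements of B are 100 elements apart, so loading a row of B needs a constant-stride vector load.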

Virtual Processor Model

Vector operations are SIMD (single instruction, multiple data) operations
Each element is computed by a virtual processor (VP)
The number of VPs is given by the vector length

Vectorization Example

      DO 100 I = 1, N
        A(I) = B(I) + C(I)
100   CONTINUE

Scalar process:

1. B(1) will be fetched from memory

2. C(1) will be fetched from memory

3. A scalar add instruction will operate on B(1) and C(1)

4. A(1) will be stored back to memory

5. Steps (1) to (4) are repeated for each of the N elements.

Vectorization Example

      DO 100 I = 1, N
        A(I) = B(I) + C(I)
100   CONTINUE

Vector process:

1. A vector of values in B(I) will be fetched from memory

2. A vector of values in C(I) will be fetched from memory.

3. A vector add instruction will operate on pairs of B(I) and C(I) values.

4. After a short start-up time, stream of A(I) values will be stored back to memory, one value every clock cycle.
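The contrast between the two processes can be sketched in Python. This is only a functional model under stated assumptions: memory is a flat list, a, b and c are hypothetical base addresses of the three arrays, and N is assumed to fit in one vector register. It says nothing about timing.

```python
def scalar_add(memory, a, b, c, n):
    """Scalar process: one fetch, fetch, add, store sequence per element."""
    for i in range(n):
        x = memory[b + i]          # 1. fetch B(i)
        y = memory[c + i]          # 2. fetch C(i)
        s = x + y                  # 3. scalar add
        memory[a + i] = s          # 4. store A(i)

def vector_add(memory, a, b, c, n):
    """Vector process: each step handles all n elements."""
    vb = memory[b:b + n]                    # 1. vector load of B
    vc = memory[c:c + n]                    # 2. vector load of C
    va = [x + y for x, y in zip(vb, vc)]    # 3. vector add on element pairs
    memory[a:a + n] = va                    # 4. vector store of A

memory = [0, 0, 0, 0, 1, 2, 3, 4, 10, 20, 30, 40]  # A at 0, B at 4, C at 8
vector_add(memory, 0, 4, 8, 4)
print(memory[:4])                                   # [11, 22, 33, 44]
```

Both functions compute the same result; the difference is that the scalar version issues four instructions per element, while the vector version issues four instructions in total.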

Example (2): Y = aX + Y

Scalar code:

      LD    F0, A
      ADDI  R4, Rx, #512    ; Last address
Loop: LD    F2, 0(Rx)
      MULTD F2, F0, F2      ; A * X[I]
      LD    F4, 0(Ry)
      ADDD  F4, F2, F4      ; + Y[I]
      SD    0(Ry), F4
      ADDI  Rx, Rx, #8      ; Increment index
      ADDI  Ry, Ry, #8
      SUB   R20, R4, Rx
      BNEZ  R20, Loop

The loop runs 64 times: 2 + 9*64 = 578 operations

Vector code:

      LD     F0, A
      LV     V1, Rx         ; Load vector X
      MULTSV V2, F0, V1     ; Vector-scalar multiply
      LV     V3, Ry         ; Load vector Y
      ADDV   V4, V2, V3     ; Vector add
      SV     Ry, V4         ; Store result

The vector length is 64, so no loop is needed: 6 instructions, 1 + 5*64 = 321 operations

Scalar/Vector = 578/321 = 1.8x
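The operation counts can be checked with a few lines of Python. The function and parameter names are hypothetical; the constants come from the two listings (2 set-up instructions plus 9 per iteration for the scalar loop, 1 scalar load plus 5 vector instructions that each process every element).

```python
def scalar_ops(n, setup=2, per_iteration=9):
    """Operations executed by the scalar loop over n elements."""
    return setup + per_iteration * n

def vector_ops(n, scalar_setup=1, vector_instructions=5):
    """Element operations performed by the vector code on n elements."""
    return scalar_setup + vector_instructions * n

n = 64
print(scalar_ops(n), vector_ops(n))              # 578 321
print(round(scalar_ops(n) / vector_ops(n), 1))   # 1.8
```

The ratio compares operation counts only; the number of instructions fetched and decoded drops far more sharply, from 578 to 6.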

Vector Length

We would like loops to iterate the same number of times as there are elements in a vector
  But this is unlikely in a real program
  Also, the number of iterations might be unknown at compile time
Problem: n, the number of iterations, is greater than MVL (Maximum Vector Length)
Solution: Strip mining
  Create one loop that iterates a multiple of MVL times
  Create a final loop that handles any remaining iterations, which must be fewer than MVL

Strip Mining Example

      low = 1
      VL = (n mod MVL)              ; Find odd-sized piece
      Do 1 j = 0, (n/MVL)           ; Outer loop
        Do 10 I = low, low+VL-1     ; Runs for length VL
          Y(I) = a*X(I) + Y(I)      ; Main operation
10      continue
        low = low + VL
        VL = MVL
1     Continue

Executes the loop in blocks of MVL
The inner loop can be vectorized

Strip Mining Example: trace for n = 130, MVL = 32 (first pass)

      low = 1                       ; low = 1
      VL = (n mod MVL)              ; VL = 2
      Do 1 j = 0, (n/MVL)           ; j = 0
        Do 10 I = low, low+VL-1     ; I = 1 .. 2
          Y(I) = a*X(I) + Y(I)      ; Y(1) and Y(2)
10      continue
        low = low + VL              ; low = 3
        VL = MVL                    ; VL = 32
1     Continue

Strip Mining Example: trace for n = 130, MVL = 32 (second pass)

      low = 1                       ; now low = 3
      VL = (n mod MVL)              ; now VL = 32
      Do 1 j = 0, (n/MVL)           ; j = 1
        Do 10 I = low, low+VL-1     ; I = 3 .. 34
          Y(I) = a*X(I) + Y(I)      ; Y(3) .. Y(34)
10      continue
        low = low + VL              ; low = 35
        VL = MVL                    ; VL = 32


Strip Mining Example: trace for n = 130, MVL = 32 (last pass)

      low = 1                       ; now low = 99
      VL = (n mod MVL)              ; now VL = 32
      Do 1 j = 0, (n/MVL)           ; j = 4
        Do 10 I = low, low+VL-1     ; I = 99 .. 130
          Y(I) = a*X(I) + Y(I)      ; Y(99) .. Y(130)
10      continue
        low = low + VL              ; low = 131
        VL = MVL                    ; VL = 32
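The strip-mined loop can be sketched in Python as follows. The names strip_mine and daxpy_strip_mined are hypothetical, and the list comprehension merely stands in for a single vector instruction of length VL. As in the slide code, the odd-sized piece is processed first; the sketch skips it when n is an exact multiple of MVL.

```python
def strip_mine(n, mvl):
    """Return the (low, VL) pairs visited by the strip-mined loop (1-based low)."""
    strips = []
    low = 1
    vl = n % mvl                      # odd-sized piece goes first
    for _ in range(n // mvl + 1):     # outer loop: j = 0 .. n/MVL
        if vl > 0:                    # nothing to do for an empty first strip
            strips.append((low, vl))
        low += vl
        vl = mvl                      # every later strip is a full MVL
    return strips

def daxpy_strip_mined(a, x, y, mvl):
    """Compute y = a*x + y strip by strip; each strip models one vector operation."""
    for low, vl in strip_mine(len(x), mvl):
        start = low - 1               # convert the 1-based low to a 0-based slice
        stop = start + vl
        y[start:stop] = [a * xi + yi
                         for xi, yi in zip(x[start:stop], y[start:stop])]
    return y

print(strip_mine(130, 32))
# [(1, 2), (3, 32), (35, 32), (67, 32), (99, 32)]
```

For n = 130 and MVL = 32 this yields one strip of 2 elements followed by four strips of 32, covering elements 1 through 130.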


Vectorization Inhibitors

Subroutine calls
I/O statements
Character data
Unstructured branches
Data dependencies
Complicated programming

Subroutine calls
Solution: Inline

#include <cmath>   // for sqrt

// Declared inline so the compiler can expand the body inside the loop
inline double radius( double x, double y, double z )
{
    return sqrt( x*x + y*y + z*z );
}

....

int main()
{
    ..
    for( int i=1; i<=n; ++i ){
        r[i] = radius( x[i], y[i], z[i] );
    }
    ..
}

Dependence Example

Loop with a dependence between iterations:

      do i=2,n
        a(i) = a(i-1)
      enddo

Rewritten with a temporary vector:

      do i=2,n
        temp(i) = a(i-1)   ! temporary vector
      enddo
      temp(1) = a(1)
      do i=1,n
        a(i) = temp(i)
      enddo
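A small Python sketch makes the dependence visible. It is worth noting that the two versions do not compute the same thing: run strictly in order, the first loop propagates a(1) into every element, whereas the temporary-vector version shifts the original values by one position, which is what a vector execution (all reads before any write) would produce. The function names are hypothetical.

```python
def sequential_loop(a):
    """a(i) = a(i-1), executed one iteration at a time."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1]        # reads the value written by the previous iteration
    return a

def temp_vector_version(a):
    """Same statement with all reads done before any write."""
    a = list(a)
    temp = [a[0]] + a[:-1]     # temp(1) = a(1); temp(i) = a(i-1) for i >= 2
    for i in range(len(a)):
        a[i] = temp[i]
    return a

print(sequential_loop([1, 2, 3, 4]))      # [1, 1, 1, 1]
print(temp_vector_version([1, 2, 3, 4]))  # [1, 1, 2, 3]
```

The mismatch is why a compiler cannot vectorize the first loop automatically: doing so would change the result, so the programmer has to decide which behaviour is intended.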

Improving Vector Performance

Better compiler techniques
  As with all other techniques, we may be able to rearrange code to increase the amount of vectorization
Techniques for accessing sparse matrices
  Hardware support to move between dense (no zeros) and normal (include zeros) representations
Chaining
  Same idea as forwarding in pipelining
  Consider:

      MULTV V1, V2, V3
      ADDV  V4, V1, V5

  Without chaining, ADDV must wait for MULTV to finish
  But we could implement forwarding: as each element from the MULTV finishes, send it off to the ADDV to start work

Chaining Example

Unchained: MULTV (7 + 64 cycles) followed by ADDV (6 + 64 cycles)
  Total = 7 + 64 + 6 + 64 = 141 cycles
Chained: ADDV starts as soon as the first MULTV result is available
  Total = 7 + 6 + 64 = 77 cycles

The 7 and 6 cycles are the start-up times of the multiplier and the adder
Every vector processor today performs chaining
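The two totals follow from a simple timing model, sketched here in Python. The model is an assumption made for illustration: each functional unit has a fixed start-up latency and then delivers one result per cycle. The function names are hypothetical.

```python
def unchained_cycles(vl, startups):
    """Each instruction waits for the previous one to finish completely."""
    return sum(startup + vl for startup in startups)

def chained_cycles(vl, startups):
    """Results are forwarded element by element, so only the start-ups add up."""
    return sum(startups) + vl

startups = [7, 6]                       # multiplier, adder
print(unchained_cycles(64, startups))   # 141
print(chained_cycles(64, startups))     # 77
```

In this model the benefit of chaining grows with the number of dependent instructions, since each extra chained instruction adds only its start-up time rather than a full vector length.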

Improving Performance

Conditionally executed statements
  Consider the following loop:

      Do 100 i = 1, 64
        If (a(i) .ne. 0) then
          a(i) = a(i) - b(i)
        Endif
100   continue

  Not vectorizable as written, because of the conditional statement
  But we could vectorize it if we could somehow include in the vector operation only those elements where a(i) != 0

Conditional Execution

Solution: create a vector mask of bits, one for each vector element
  1 = apply operation
  0 = leave alone
As long as we set the mask properly first, we can now vectorize the previous loop with the conditional
Implemented on most vector processors today

Conditional Execution

      lv    v1, ra        ; load vector a into v1
      lv    v2, rb        ; load vector b into v2
      ld    f0, #0        ; f0 = 0
      vsnes f0, v1        ; set VM(i) to 1 if v1(i) != 0
      vsub  v1, v1, v2    ; subtract under the vector mask
      cvm                 ; set the vector mask to all 1s
      sv    ra, v1        ; store the result to a
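The effect of the masked sequence can be modelled in Python. This is a functional sketch only: the mask is an ordinary list of 0/1 values and the function name masked_subtract is hypothetical.

```python
def masked_subtract(a, b):
    """Return a(i) - b(i) where a(i) != 0, and a(i) unchanged elsewhere."""
    mask = [1 if x != 0 else 0 for x in a]          # vsnes: build the vector mask
    return [x - y if m else x                       # vsub under the mask
            for x, y, m in zip(a, b, mask)]

print(masked_subtract([5, 0, 3, 0], [1, 2, 1, 4]))  # [4, 0, 2, 0]
```

Elements whose mask bit is 0 still occupy a slot in the vector operation, so a masked instruction typically takes as long as an unmasked one even when few elements are updated.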

Common Vector Metrics

R∞: MFLOPS rate on an infinite-length vector
  Real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger
  (Rn is the MFLOPS rate for a vector of length n)
N1/2: the vector length needed to reach one-half of R∞
  A good measure of the impact of start-up
Nv: the vector length needed to make vector mode faster than scalar mode
  Measures both start-up and the speed of scalars relative to vectors, and the quality of the connection between the scalar unit and the vector unit
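The three metrics can be made concrete with a simple linear timing model. Everything in this sketch is an assumption for illustration: a vector of length n is taken to cost startup + n * per_element cycles, rates are expressed in results per cycle rather than MFLOPS, and the sample figures (12-cycle start-up, 1 cycle per vector element, 4 cycles per scalar element) are invented.

```python
def vector_time(n, startup, per_element):
    """Cycles to process a vector of length n under the linear model."""
    return startup + n * per_element

def rate(n, startup, per_element):
    """R_n: results per cycle for a vector of length n."""
    return n / vector_time(n, startup, per_element)

def r_infinity(per_element):
    """Asymptotic rate as n grows without bound."""
    return 1 / per_element

def n_half(startup, per_element, max_n=10_000):
    """Smallest n whose rate reaches half of the asymptotic rate."""
    target = r_infinity(per_element) / 2
    for n in range(1, max_n + 1):
        if rate(n, startup, per_element) >= target:
            return n
    return None

def n_v(startup, per_element, scalar_per_element, max_n=10_000):
    """Smallest n for which vector mode beats scalar mode."""
    for n in range(1, max_n + 1):
        if vector_time(n, startup, per_element) < n * scalar_per_element:
            return n
    return None

print(n_half(12, 1))   # 12
print(n_v(12, 1, 4))   # 5
```

Under this model N1/2 equals startup / per_element, which is why it is a convenient single-number measure of start-up cost.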

Applications

Linear algebra
Image processing (convolution, composition, compression, etc.)
Audio synthesis
Compression
Cryptography
Speech recognition

Applications in multimedia

Kernel                               Vector length
Matrix transpose/multiply            # vertices at once
DCT (video, communication)           image width
FFT (audio)                          256-1024
Motion estimation (video)            image width, iw/16
Gamma correction (video)             image width
Median filter (image processing)     image width
Separable convolution (img. proc.)   image width

Vector Summary

Alternative model that does not rely on caches, as out-of-order and superscalar implementations do
Handles memory in a more organized way
Powerful instructions that replace loops
Copes with multimedia applications
Ideal architecture for scientific simulation