VECTOR PROCESSING
Τσόλκας Χρήστος & Αντωνίου Χρυσόστομος

Contents
  Introduction
  Vector Processor Definition
  Components & Properties of Vector Processors
  Advantages/Disadvantages of Vector Processors
  Vector Machines & Architectures
  Virtual Processors Model
  Vectorization Inhibitors
  Improving Performance
  Vector Metrics
  Applications

Architecture Classification
  SISD: Single Instruction, Single Data
  SIMD: Single Instruction, Multiple Data
  MIMD: Multiple Instruction, Multiple Data
  MISD: Multiple Instruction, Single Data

Alternative Forms of Machine Parallelism
  Instruction Level Parallelism (ILP)
  Thread Level Parallelism (TLP)
  Vector Data Parallelism (DP)

Drawbacks of ILP and TLP
  Coherency
  Synchronization
  Large overhead
    Instruction fetch and decode: at some point, it is hard to fetch and decode more instructions per clock cycle
    Cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality

Alternative: Vector Processors

What is a Vector Processor?
  Provides high-level operations that work on vectors
  A vector is a linear array of numbers
    Type of number can vary, but usually 64-bit floating point (IEEE 754) or integer (2's complement)
    Length of the array also varies depending on hardware
    Example vectors would be 64 or 128 elements in length
    Small vectors (e.g. MMX/SSE) are about 4 elements in length

Components of Vector Processor
  Vector Registers
    Fixed-length bank holding a single vector
    Has at least 2 read ports and 1 write port
    Typically 8-32 vector registers, each holding 64-128 64-bit elements
  Vector Functional Units
    Fully pipelined, start a new operation every clock
    Typically 4-8 FUs: FP add, FP mult, FP reciprocal, integer add, logical, shift
  Scalar Registers
    Single element for FP scalar or address
  Load-Store Units

Vector Processor Properties
  Computation of each result must be independent of previous results
  A single vector instruction specifies a great deal of work
    Equivalent to executing an entire loop
  Vector instructions must access memory in a known access pattern
  Many control hazards can be avoided since the entire loop is replaced by a vector instruction

Advantages of Vector Processors
  Increase in code density
  Decrease in total number of instructions
  Data is organized in patterns, which are easier for the hardware to compute
  Simple loops are replaced with vector instructions, hence a decrease in overhead
  Scalable

Disadvantages of Vector Processors
  Expansion of the Instruction Set Architecture (ISA) is needed
  Additional vector functional units and registers
  Modification of the memory system

Example Vector Machines
  Machine      Year  Clock    Regs   Elements  FUs  LSUs
  Cray 1       1976  80 MHz   8      64        6    1
  Cray XMP     1983  120 MHz  8      64        8    2 L, 1 S
  Cray YMP     1988  166 MHz  8      64        8    2 L, 1 S
  Cray C-90    1991  240 MHz  8      128       8    4
  Cray T-90    1996  455 MHz  8      128       8    4
  Conv. C-1    1984  10 MHz   8      128       4    1
  Conv. C-4    1994  133 MHz  16     128       3    1
  Fuj. VP200   1982  133 MHz  8-256  32-1024   3    2
  NEC SX/2     1984  160 MHz  8+8K   256+var   16   8
  NEC SX/3     1995  400 MHz  8+8K   256+var   16   8

Vector Instruction Execution
  Static scheduling
  Prefetching
  Dynamic scheduling

Styles of Vector Architectures
  Memory-memory vector processors
    All vector operations are memory to memory
    CDC Star 100
  Vector-register processors
    All vector operations between vector registers
    Vector equivalent of load-store architecture
    Includes all vector machines since late 1980s
    Cray, Convex, Fujitsu, Hitachi, NEC

Vector-Register Architecture

Memory operations
  Load/store operations move groups of data between registers and memory
  Three types of addressing
    Unit stride access
      Fastest
    Non-unit (constant) stride access
    Indexed (gather-scatter)
      Vector equivalent of register indirect
      Increases number of programs that vectorize
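
The three addressing types can be sketched in C as plain loops that fill or drain a "vector register" array. This is only an illustration of the access patterns, not how any particular machine implements them; the function names and the use of an array to stand in for a vector register are assumptions made for this sketch.

```c
#include <stddef.h>

/* Unit stride: consecutive elements starting at base. */
void load_unit(double *vreg, const double *base, size_t vl) {
    for (size_t i = 0; i < vl; ++i)
        vreg[i] = base[i];
}

/* Non-unit (constant) stride: every stride-th element starting at base. */
void load_strided(double *vreg, const double *base, size_t stride, size_t vl) {
    for (size_t i = 0; i < vl; ++i)
        vreg[i] = base[i * stride];
}

/* Indexed load (gather): element positions come from an index vector. */
void load_gather(double *vreg, const double *base, const size_t *index, size_t vl) {
    for (size_t i = 0; i < vl; ++i)
        vreg[i] = base[index[i]];
}

/* Indexed store (scatter): write each element to the position named by index. */
void store_scatter(double *base, const size_t *index, const double *vreg, size_t vl) {
    for (size_t i = 0; i < vl; ++i)
        base[index[i]] = vreg[i];
}
```

Gather and scatter are what allow loops over sparse or indirectly addressed data, such as x[idx[i]], to be expressed as vector operations.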

Vector Stride
  Position of the elements we want in memory may not be sequential
  Consider the following code:
      Do 10 I = 1, 100
        Do 10 j = 1, 100
          A(I,j) = 0.0
          Do 10 k = 1, 100
            A(I,j) = A(I,j) + B(I,k)*C(k,j)
10    Continue
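
To make the stride in this loop concrete, here is a minimal C sketch that stores the matrices in column-major order, as Fortran does, so that element (i, j) sits at index i + j*n. Under that layout the inner loop walks a row of B with stride n and a column of C with stride 1. The function names and the flat-array layout are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Dot product of two vectors whose elements are sx and sy apart in memory. */
double dot_strided(const double *x, size_t sx,
                   const double *y, size_t sy, size_t n) {
    double sum = 0.0;
    for (size_t k = 0; k < n; ++k)
        sum += x[k * sx] * y[k * sy];
    return sum;
}

/* A = B * C for n x n matrices stored column-major (Fortran layout):
   element (i, j) lives at index i + j * n. */
void matmul_colmajor(double *a, const double *b, const double *c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            /* Row i of B: stride n. Column j of C: stride 1. */
            a[i + j * n] = dot_strided(b + i, n, c + j * n, 1, n);
}
```

With n = 100 as in the slide, the accesses to B are 100 elements apart, which is why a vector load with a non-unit stride is needed.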

Virtual Processor Model
  Vector operations are SIMD (single instruction, multiple data) operations
  Each element is computed by a virtual processor (VP)
  Number of VPs given by vector length

Vectorization Example
      DO 100 I = 1, N
        A(I) = B(I) + C(I)
100   CONTINUE
Scalar process:
1. B(1) will be fetched from memory.
2. C(1) will be fetched from memory.
3. A scalar add instruction will operate on B(1) and C(1).
4. A(1) will be stored back to memory.
5. Steps (1) to (4) are carried out N times in total, once per element.
Vector process:
1. A vector of values in B(I) will be fetched from memory.
2. A vector of values in C(I) will be fetched from memory.
3. A vector add instruction will operate on pairs of B(I) and C(I) values.
4. After a short start-up time, a stream of A(I) values will be stored back to memory, one value every clock cycle.

Example (2): Y = aX + Y
Scalar code:
        LD     F0, A
        ADDI   R4, Rx, #512    ; Last addr
  Loop: LD     F2, 0(Rx)
        MULTD  F2, F0, F2      ; A * X[I]
        LD     F4, 0(Ry)
        ADDD   F4, F2, F4      ; + Y[I]
        SD     0(Ry), F4
        ADDI   Rx, Rx, #8      ; Inc index
        ADDI   Ry, Ry, #8
        SUB    R20, R4, Rx
        BNEZ   R20, Loop
  The loop runs 64 times: 2 + 9*64 = 578 operations
Vector code:
        LD     F0, A
        LV     V1, Rx          ; Load vecX
        MULTSV V2, F0, V1      ; Vec Mult
        LV     V3, Ry          ; Load vecY
        ADDV   V4, V2, V3      ; Vec Add
        SV     Ry, V4          ; Store result
  The vector length is 64 elements, so no loop is needed: 1 + 5*64 = 321 operations
Scalar/Vector = 578/321 = 1.8x
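
The operation counts above can be reproduced with a short C sketch. The daxpy function is the usual C form of Y = aX + Y; the two counting helpers simply encode the slide's arithmetic (2 setup instructions plus 9 per iteration for scalar code, 1 scalar load plus 5 vector instructions each counted as n element operations). The helper names are hypothetical.

```c
#include <stddef.h>

/* Scalar DAXPY: y[i] = a * x[i] + y[i] for each element. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Scalar count from the slide: 2 setup instructions + 9 per iteration. */
unsigned long scalar_ops(unsigned long n) {
    return 2 + 9 * n;
}

/* Vector count from the slide: 1 scalar load + 5 vector instructions,
   each counted as n element operations. */
unsigned long vector_ops(unsigned long n) {
    return 1 + 5 * n;
}
```

For n = 64 these return 578 and 321, matching the ratio of about 1.8 quoted above.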

Vector Length
  We would like loops to iterate the same number of times that we have elements in a vector
    But unlikely in a real program
    Also the number of iterations might be unknown at compile time
  Problem: n, the number of iterations, is greater than MVL (Maximum Vector Length)
  Solution: Strip Mining
    Create one loop that iterates a multiple of MVL times
    Create a final loop that handles any remaining iterations, which must be less than MVL

Strip Mining Example
      low = 1
      VL = (n mod MVL)              ! Find odd-sized piece
      Do 1 j = 0, (n/MVL)           ! Outer loop
        Do 10 I = low, low+VL-1     ! Runs for length VL
          Y(I) = a*X(I) + Y(I)      ! Main operation
10      continue
        low = low + VL
        VL = MVL
1     Continue
  Executes loop in blocks of MVL
  Inner loop can be vectorized
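
A C rendering of the same strip-mining scheme may help when reading the Fortran. It is a sketch under assumptions: MVL is fixed at 32, indices are 0-based, and daxpy_strip is a hypothetical stand-in for the vector instruction sequence that would process one strip.

```c
#include <stddef.h>

#define MVL 32  /* assumed maximum vector length */

/* Stand-in for one vector instruction sequence of length vl (vl <= MVL). */
static void daxpy_strip(size_t vl, double a, const double *x, double *y) {
    for (size_t i = 0; i < vl; ++i)
        y[i] = a * x[i] + y[i];
}

/* Strip-mined DAXPY: first an odd-sized piece of n % MVL elements,
   then n / MVL full strips of MVL elements each.
   Returns the number of passes through the outer loop. */
size_t daxpy_strip_mined(size_t n, double a, const double *x, double *y) {
    size_t low = 0;        /* first element of the current strip */
    size_t vl = n % MVL;   /* odd-sized piece goes first */
    size_t passes = 0;
    for (size_t j = 0; j <= n / MVL; ++j) {
        daxpy_strip(vl, a, x + low, y + low);
        low += vl;
        vl = MVL;          /* all remaining strips are full length */
        ++passes;
    }
    return passes;
}
```

When n is an exact multiple of MVL the first pass has length 0 and does no work, which mirrors the behaviour of the Fortran version.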

Strip Mining Example
      low = 1                       ! low = 1
      VL = (n mod MVL)              ! VL = 2
      Do 1 j = 0, (n/MVL)           ! j = 0
        Do 10 I = low, low+VL-1     ! I = 1 .. 2
          Y(I) = a*X(I) + Y(I)      ! Y(1) and Y(2)
10      continue
        low = low + VL              ! low = 3
        VL = MVL                    ! VL = 32
1     Continue

Strip Mining Example
      low = 1                       ! now low = 3
      VL = (n mod MVL)              ! now VL = 32
      Do 1 j = 0, (n/MVL)           ! j = 1
        Do 10 I = low, low+VL-1     ! I = 3 .. 34
          Y(I) = a*X(I) + Y(I)      ! Y(3) .. Y(34)
10      continue
        low = low + VL              ! low = 35
        VL = MVL                    ! VL = 32
1     Continue

Strip Mining Example
      low = 1                       ! now low = 99
      VL = (n mod MVL)              ! now VL = 32
      Do 1 j = 0, (n/MVL)           ! j = 4
        Do 10 I = low, low+VL-1     ! I = 99 .. 130
          Y(I) = a*X(I) + Y(I)      ! Y(99) .. Y(130)
10      continue
        low = low + VL              ! low = 131
        VL = MVL                    ! VL = 32
1     Continue

Vectorization Inhibitors
  Subroutine calls
  I/O statements
  Character data
  Unstructured branches
  Data dependencies
  Complicated programming

Vectorization Inhibitors: Subroutine calls
Solution: Inline
inline double radius( double x, double y, double z )
{
    return sqrt( x*x + y*y + z*z );
}
..
int main()
{
    ..
    for( int i=1; i<=n; ++i ){
        r[i] = radius( x[i], y[i], z[i] );
    }
    ..
}
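
Dependence Example
      do i=2,n
        a(i) = a(i-1)
      enddo
Each iteration reads the value written by the previous one, so the loop cannot be vectorized as written
Rewritten with a temporary vector (vectorizable; note that it reads the old values of a, so it shifts a by one element rather than propagating a(1)):
      do i=2,n
        temp(i) = a(i-1)   ! temporary vector
      enddo
      temp(1) = a(1)
      do i=1,n
        a(i) = temp(i)
      enddo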
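
The difference between the two forms is easy to check with a small C sketch (0-based indices; the function names are hypothetical). Run in order, the first loop copies a[0] into every element. The temporary-vector form performs all of its reads before any write, so each element receives the old value of its left neighbour. The two agree only if that shift is what the program intended.

```c
#include <stddef.h>

/* Original loop, run in order: a[i] reads the value just written to a[i-1],
   so a[0] ends up copied into every element. */
void recurrence_sequential(double *a, size_t n) {
    for (size_t i = 1; i < n; ++i)
        a[i] = a[i - 1];
}

/* Temporary-vector version: all reads happen before any write,
   so each element receives the OLD value of its left neighbour (a shift). */
void shift_with_temp(double *a, double *temp, size_t n) {
    if (n == 0)
        return;
    for (size_t i = 1; i < n; ++i)
        temp[i] = a[i - 1];
    temp[0] = a[0];
    for (size_t i = 0; i < n; ++i)
        a[i] = temp[i];
}
```

Starting from {1, 2, 3, 4}, the first function leaves {1, 1, 1, 1} and the second leaves {1, 1, 2, 3}.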

Improving Vector Performance
  Better compiler techniques
    As with all other techniques, we may be able to rearrange code to increase the amount of vectorization
  Techniques for accessing sparse matrices
    Hardware support to move between dense (no zeros) and normal (include zeros) representations
  Chaining
    Same idea as forwarding in pipelining
    Consider:
        MULTV V1, V2, V3
        ADDV  V4, V1, V5
    ADDV must wait for MULTV to finish
    But we could implement forwarding: as each element from the MULTV finishes, send it off to the ADDV to start work

Chaining Example
  Unchained: MULTV (7 + 64 cycles), then ADDV (6 + 64 cycles); total = 141 cycles
  Chained: MULTV (7 + 64 cycles) overlapped with ADDV (6 + 64 cycles); total = 7 + 6 + 64 = 77 cycles
  The 6 and 7 cycles are the start-up times of the adder and multiplier
  Every vector processor today performs chaining
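
The two totals follow from a simple timing model, sketched here in C. The model is an assumption made for illustration: each vector instruction costs its start-up time plus one cycle per element, and with chaining the second instruction begins as soon as the first result emerges. The function names are hypothetical.

```c
/* MULTV followed by a dependent ADDV, without chaining:
   each instruction pays its start-up plus one cycle per element,
   and the ADDV cannot begin until the MULTV has finished. */
unsigned unchained_cycles(unsigned mul_startup, unsigned add_startup,
                          unsigned vl) {
    return (mul_startup + vl) + (add_startup + vl);
}

/* With chaining the ADDV starts as soon as the first MULTV result is
   available, so the element streams overlap and only the start-up
   times add up. */
unsigned chained_cycles(unsigned mul_startup, unsigned add_startup,
                        unsigned vl) {
    return mul_startup + add_startup + vl;
}
```

With start-up times of 7 and 6 cycles and a vector length of 64, these give 141 and 77 cycles.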

Improving Performance
  Conditionally Executed Statements
    Consider the following loop
      Do 100 i = 1, 64
        If (a(i) .ne. 0) then
          a(i) = a(i) - b(i)
        Endif
100   continue
    Not vectorizable due to the conditional statement
    But we could vectorize this if we could somehow only include in the vector operation those elements where a(i) != 0

Conditional Execution
  Solution: Create a vector mask of bits that corresponds to each vector element
    1 = apply operation
    0 = leave alone
  As long as we properly set the mask first, we can now vectorize the previous loop with the conditional
  Implemented on most vector processors today

Conditional Execution
      lv    v1 ra       ; load vector into v1
      lv    v2 rb       ; load vector into v2
      ld    f0 #0       ; f0 = 0
      vsnes f0 v1       ; set VM to 1 if v1(i) != 0
      vsub  v1 v1 v2    ; subtract under vector mask
      cvm               ; set vector mask to all 1s
      sv    ra v1       ; store the result to a
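
The effect of the mask can be imitated in C with an explicit array of flags. This is a sketch of the semantics only: the function names are hypothetical, the vector length of 64 is taken from the loop in the example, and a byte per element stands in for one mask bit.

```c
#include <stddef.h>

#define VLEN 64  /* vector length of the loop in the example */

/* Like vsnes: set mask[i] to 1 where v[i] differs from scalar, else 0. */
void set_mask_ne(unsigned char *mask, double scalar, const double *v) {
    for (size_t i = 0; i < VLEN; ++i)
        mask[i] = (v[i] != scalar);
}

/* Like vsub under mask: dst[i] = a[i] - b[i] only where mask[i] is 1;
   elements with mask[i] == 0 are left untouched. */
void sub_masked(double *dst, const double *a, const double *b,
                const unsigned char *mask) {
    for (size_t i = 0; i < VLEN; ++i)
        if (mask[i])
            dst[i] = a[i] - b[i];
}
```

Calling set_mask_ne(mask, 0.0, a) followed by sub_masked(a, a, b, mask) gives the same result as the conditional loop, with the branch replaced by a per-element enable flag.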

Common Vector Metrics
  R∞: MFLOPS rate on an infinite-length vector
    Real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger
    (Rn is the MFLOPS rate for a vector of length n)
  N1/2: The vector length needed to reach one-half of R∞
    A good measure of the impact of start-up
  NV: The vector length needed to make vector mode faster than scalar mode
    Measures both start-up and speed of scalars relative to vectors, quality of connection of scalar unit to vector unit
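
To see how start-up time drives N1/2 and NV, consider a deliberately simple timing model that is not taken from the slides: a vector operation on n elements takes startup + n cycles, and scalar code takes scalar_cpe cycles per element. Under this assumed model the asymptotic rate is one result per cycle, so N1/2 is the smallest n with n / (startup + n) >= 1/2. The function and parameter names are hypothetical.

```c
/* Assumed model: a vector operation on n elements takes startup + n cycles,
   so the rate is n / (startup + n) results per cycle and tends to 1. */

/* N_1/2: smallest n whose rate is at least half the asymptotic rate,
   i.e. 2 * n >= startup + n. */
unsigned n_half(unsigned startup) {
    unsigned n = 1;
    while (2 * n < startup + n)
        ++n;
    return n;
}

/* N_V: smallest n for which vector mode (startup + n cycles) is strictly
   faster than scalar mode (scalar_cpe * n cycles).
   Returns 0 if scalar code is never slower (scalar_cpe <= 1). */
unsigned n_v(unsigned startup, unsigned scalar_cpe) {
    if (scalar_cpe <= 1)
        return 0;
    unsigned n = 1;
    while (startup + n >= scalar_cpe * n)
        ++n;
    return n;
}
```

In this model N1/2 equals the start-up time in cycles, which is why it is a good measure of start-up impact. Real machines add memory latency and strip-mining overhead, so measured values differ.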

Applications
  Linear Algebra
  Image processing (Convolution, Composition, Compressing, etc.)
  Audio synthesis
  Compression
  Cryptography
  Speech recognition

Applications in multimedia
  Kernel                               Vector length
  Matrix transpose/multiply            # vertices at once
  DCT (video, communication)           image width
  FFT (audio)                          256-1024
  Motion estimation (video)            image width, iw/16
  Gamma correction (video)             image width
  Median filter (image processing)     image width
  Separable convolution (img. proc.)   image width

Vector Summary
  Alternate model that doesn't rely on caches as out-of-order and superscalar implementations do
  Handles memory in a more organized way
  Powerful instructions that replace loops
  Copes with multimedia applications
  Ideal architecture for scientific simulation