Aca2 08 new

CSE539: Advanced Computer Architecture

Chapter 8

Multivector and SIMD Computers

Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani

Sumit Mittu

Assistant Professor, CSE/IT

Lovely Professional University

[email protected]


In this chapter…

• Vector Processing Principles

• Compound Vector Operations

• Vector Loops and Chaining

• SIMD Computer Implementation Models


VECTOR PROCESSING PRINCIPLES


• Vector Processing Definitions

o Vector

o Stride

o Vector Processor

o Vector Processing

o Vectorization

o Vectorizing Compiler or Vectorizer

• Vector Instruction Types

o Vector-vector instructions

o Vector-scalar instructions

o Vector-memory instructions





• Vector-Vector Instructions

o F1: Vi → Vj

o F2: Vi x Vj → Vk

o Examples: V1 = sin(V2), V3 = V1 + V2

• Vector-Scalar Instructions

o F3: s x Vi → Vj

o Examples: V2 = 6 + V1

• Vector-Memory Instructions

o F4: M → V (Vector Load)

o F5: V → M (Vector Store)

o Examples: X = V1, V2 = Y



• Vector Reduction Instructions

o F6: Vi → s

o F7: Vi x Vj → s

• Gather and Scatter Instructions

o F8: M → Vi x Vj (Gather)

o F9: Vi x Vj → M (Scatter)

• Masking

o F10: Vi x Vm → Vj (Vm is a binary vector)

• Examples…
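The reduction, gather, scatter and masking forms above can be sketched in plain C. This is only an illustration of the data movement, not the instruction set of any particular machine: the function names are hypothetical, the reduction is shown as a dot product (one possible F7), and masking is shown as a compress operation, which is one common reading of F10.

```c
#include <assert.h>
#include <stddef.h>

/* F7-style reduction (Vi x Vj -> s): dot product of two vectors. */
double vec_dot(const double *vi, const double *vj, size_t n)
{
    double s = 0.0;
    assert(vi != NULL && vj != NULL);
    for (size_t i = 0; i < n; i++)
        s += vi[i] * vj[i];
    return s;
}

/* F8-style gather: fetch mem[index[i]] into consecutive elements of v. */
void vec_gather(double *v, const double *mem, const size_t *index, size_t n)
{
    assert(v != NULL && mem != NULL && index != NULL);
    for (size_t i = 0; i < n; i++)
        v[i] = mem[index[i]];
}

/* F9-style scatter: store consecutive elements of v to mem[index[i]]. */
void vec_scatter(double *mem, const double *v, const size_t *index, size_t n)
{
    assert(v != NULL && mem != NULL && index != NULL);
    for (size_t i = 0; i < n; i++)
        mem[index[i]] = v[i];
}

/* F10-style masking (assumed here to mean "compress"): copy the elements
 * of vi whose mask bit in vm is 1 into vj, packed; returns how many
 * elements were copied. */
size_t vec_compress(double *vj, const double *vi,
                    const unsigned char *vm, size_t n)
{
    size_t k = 0;
    assert(vj != NULL && vi != NULL && vm != NULL);
    for (size_t i = 0; i < n; i++)
        if (vm[i])
            vj[k++] = vi[i];
    return k;
}
```

In gather and scatter, the index vector plays the role of the second vector operand: it says where in memory each element comes from or goes to.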





• Vector-Access Memory Schemes

o Vector-operand Specifications

• Base address, stride and length

o C-Access Memory Organization

• Low-order m-way interleaved memory

o S-Access Memory Organization

• High-order m-way interleaved memory

o C/S Access Memory Organization
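A small sketch of how low-order m-way interleaving maps a vector operand (base address, stride, length) onto memory modules. The helper names and the 64-module cap are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Low-order interleaving: the low-order part of the address selects the
 * module, the remaining part selects the word within that module. */
size_t module_of(size_t addr, size_t m)
{
    assert(m > 0);
    return addr % m;
}

size_t offset_of(size_t addr, size_t m)
{
    assert(m > 0);
    return addr / m;
}

/* Count how many distinct modules a vector access touches.
 * Supports at most 64 modules (an arbitrary limit for this sketch). */
size_t modules_touched(size_t base, size_t stride, size_t length, size_t m)
{
    unsigned char seen[64] = {0};
    size_t count = 0;
    assert(m > 0 && m <= 64);
    for (size_t i = 0; i < length; i++) {
        size_t mod = module_of(base + i * stride, m);
        if (!seen[mod]) {
            seen[mod] = 1;
            count++;
        }
    }
    return count;
}
```

With m = 8, a stride of 1 spreads eight consecutive elements over all eight modules, while a stride of 8 sends every element to the same module, so the accesses cannot overlap.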

• Early Supercomputers (Vector Processors)

o Cray Series

o ETA 10E

o NEC SX-X 44

o CDC Cyber

o Fujitsu VP2600

o Hitachi 820/80



• Relative Vector/Scalar Performance

o Vector/scalar speed ratio r

o Vectorization ratio in program f

o Relative Performance P is given by:

• $P = \frac{1}{(1-f) + f/r} = \frac{r}{(1-f)r + f}$

o When f is low, the speedup cannot be high even with very high r

o Limiting Case:

• P → 1 as f → 0

o Maximum Case:

• P → r as f → 1

o Powerful single-chip processors and multicore systems-on-a-chip provide High-Performance Computing (HPC) using MIMD and/or SPMD configurations with a large number of processors.
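The relative performance formula can be checked numerically with a short C function. The function name is hypothetical; f and r are the vectorization ratio and vector/scalar speed ratio defined above.

```c
#include <assert.h>

/* Relative performance P = 1 / ((1 - f) + f / r),
 * where f is the vectorization ratio (0 <= f <= 1)
 * and r is the vector/scalar speed ratio (r > 0). */
double relative_performance(double f, double r)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(r > 0.0);
    return 1.0 / ((1.0 - f) + f / r);
}
```

For instance, with f = 0.5 the result stays below 2 even for r = 100, since P = 1 / (0.5 + 0.005). This is the point of the "low f" remark: the scalar part of the program dominates.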

COMPOUND VECTOR PROCESSING


• Compound Vector Operations

o Compound Vector Functions (CVFs)

• Composite function of vector operations converted from a looping structure of linked scalar operations

o CVF Example: The SAXPY (Single-precision A multiply X Plus Y) Code

• For I = 1 to N

o Load R1, X(I)

o Load R2, Y(I)

o Multiply R1, A

o Add R2, R1

o Store Y(I), R2

• (End of Loop)
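The scalar loop above corresponds to the following C function, shown as a minimal sketch. The signature is an assumption for illustration and is not the BLAS interface.

```c
#include <assert.h>
#include <stddef.h>

/* SAXPY: y[i] = a * x[i] + y[i] for i = 0 .. n-1.
 * Each iteration mirrors the Load / Load / Multiply / Add / Store
 * sequence of the scalar loop. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

As a compound vector function, the whole loop becomes Y(1:N) = A x X(1:N) + Y(1:N): a vector load of X, a vector load of Y, a vector-scalar multiply, a vector add and a vector store.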



• One-dimensional CVF Examples

o V1(I) = V2(I) + V3(I) x V4(I)

o V1(I) = B(I) + C(I)

o A(I) = V1(I) x S + B(I)

o A(I) = V1(I) + B(I) + C(I)

o A(I) = Q x V1(I) x (R x B(I) + C(I)), etc.

Legend:

o Vi(I) are vector registers

o A(I), B(I), C(I) are vectors in memory

o Q, R, S are scalars available from scalar registers or from memory



• Vector Loops

o Vector segmentation or strip-mining approach

o Example
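Strip-mining can be sketched in C as an outer loop over segments no longer than the vector register, with an inner loop standing in for the vector instructions applied to one segment. The register length MVL = 64 and the function name are assumptions for this illustration; SAXPY is used as the loop body.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length (hypothetical value) */

/* Strip-mined SAXPY: processes the n-element vectors in segments of at
 * most MVL elements. Returns the number of segments (strips) executed. */
size_t saxpy_strip_mined(size_t n, float a, const float *x, float *y)
{
    size_t strips = 0;
    assert(x != NULL && y != NULL);
    for (size_t start = 0; start < n; start += MVL) {
        /* The last segment may be shorter than MVL. */
        size_t len = (n - start < MVL) ? (n - start) : MVL;
        /* On a vector machine this inner loop would be one sequence of
         * vector instructions operating on len elements. */
        for (size_t i = 0; i < len; i++)
            y[start + i] = a * x[start + i] + y[start + i];
        strips++;
    }
    return strips;
}
```

A vector of 150 elements is handled as three segments of 64, 64 and 22 elements.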

• Vector Chaining

o Example: SAXPY code

• Limited Chaining using only one memory-access pipe in Cray-1

• Complete Chaining using three memory-access pipes in Cray X-MP

• Functional Unit Independence

• Vector Recurrence





SIMD COMPUTER ORGANIZATIONS


• SIMD Computer Variants

o Array Processor

o Associative Processor

• SIMD Processor vs. SISD vs. Vector Processor Operation

o Illustration: for(i=0;i<5;i++) a[i] = a[i]+2;

o Lockstep mode of operation in SIMD processor

o Relative Performance comparison

• SIMD Implementation Models

o Distributed Memory Model

• E.g. Illiac IV

o Shared Memory Model

• E.g. BSP (Burroughs Scientific Processor)







• SIMD Instructions

o Scalar Operations

• Arithmetic/Logical

o Vector Operations

• Arithmetic/Logical

o Data Routing Operations

• Permutations, broadcasts, multicasts, rotation and shifting

o Masking Operations

• Enable/Disable PEs
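The vector, data routing and masking operations above can be mimicked in C by treating an array as the local data of the processing elements (PEs). The loops only simulate what a real SIMD machine would do in lockstep across all PEs; NUM_PE and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PE 8  /* assumed number of processing elements */

/* Masked vector operation: every enabled PE adds k to its local value
 * in the same step; disabled PEs (mask bit 0) stay idle. */
void simd_masked_add(int local[NUM_PE], const unsigned char mask[NUM_PE], int k)
{
    assert(local != NULL && mask != NULL);
    for (size_t pe = 0; pe < NUM_PE; pe++)
        if (mask[pe])
            local[pe] += k;
}

/* Data routing (rotation): PE i sends its value to PE (i + shift) mod NUM_PE. */
void simd_rotate(int local[NUM_PE], size_t shift)
{
    int tmp[NUM_PE];
    assert(local != NULL);
    for (size_t pe = 0; pe < NUM_PE; pe++)
        tmp[(pe + shift) % NUM_PE] = local[pe];
    for (size_t pe = 0; pe < NUM_PE; pe++)
        local[pe] = tmp[pe];
}

/* Data routing (broadcast): one value is delivered to every PE. */
void simd_broadcast(int local[NUM_PE], int value)
{
    assert(local != NULL);
    for (size_t pe = 0; pe < NUM_PE; pe++)
        local[pe] = value;
}
```

The temporary array in simd_rotate reflects that all PEs send before any PE overwrites its own value, as they would when operating in lockstep.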

• Host and I/O

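• Bit-slice and Word-slice Processing

o WSBS (word-serial, bit-serial), WSBP (word-serial, bit-parallel), WPBS (word-parallel, bit-serial), WPBP (word-parallel, bit-parallel)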
