SIMD: Single Instruction, Multiple Data
SIMD
• Only successful when the data is highly parallel
• There is a very large amount of time spent on array processing
• The array element processing is somewhat independent
– Such as adding corresponding array elements of two arrays
• There are plenty of applications, but they are specialized, usually scientific
Data level parallelism
• Suppose we have two arrays of 32 floating point operands and we want to add them
• A single processor will go down the line summing one pair at a time (see the sketch below)
– If it is superscalar and has two FPUs it can do this in slightly more than 16 units of time, otherwise 32
• Not bad, but generally outperformed by array and vector processors
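As a rough illustration (a plain C sketch; the array and function names are made up, not from the slides), the scalar approach is just a loop that produces one sum per iteration; with two floating point units a superscalar core can overlap roughly two iterations at a time:

#include <stddef.h>

#define N 32

/* Scalar approach: one floating point add per loop iteration.
   A superscalar core with two FPUs can overlap roughly two
   independent iterations, finishing in a bit more than N/2
   "units" instead of N. */
void add_arrays(const float a[N], const float b[N], float c[N])
{
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}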
Array Processor
• Single control unit that drives multiple ALUs
– The ALUs usually have individual memories
• In the previous case it will take 16-32 units
• Here, if there are 32 floating point units and the vector register contains 32 slots, it will take one unit (see the sketch below)
• When adding two scalar variables the two would be the same speed, but when adding two array variables (length <= 32) the vector processor would be 32 times faster
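One way to picture the difference is with GCC/Clang's vector extensions (a sketch only; a real compiler still lowers this to whatever width the hardware actually provides):

/* A 32-slot "vector register" type built from GCC/Clang vector
   extensions.  The single '+' expresses the whole 32-element add
   as one operation; an array processor with 32 FPUs could execute
   it in one unit of time, where a scalar core needs about 32. */
typedef float vec32f __attribute__((vector_size(32 * sizeof(float))));

vec32f add_all(vec32f a, vec32f b)
{
    return a + b;   /* element-wise add of all 32 slots at once */
}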
Why
• In most applications such parallelism would be a waste, but in many scientific applications an array of size 32 is pretty small and substantial use could be made of this parallelism
• An array processor is a large number of identical processors that perform the same instruction on different pieces of data
– Single control unit for the many processors
– Parallel memories for the parallel processors
Examples
• ILLIAC IV was the first, in the late 60s
– Largely used by NASA for fluid dynamics calculations
– Very large amount of parallelism in this application
• Thinking Machines Connection Machine 1 and 2
• Goodyear Massively Parallel Processor
• MasPar MP-1 and MP-2
Disadvantages
• Hardware heavy, and therefore expensive
– Never mass produced since they fit a niche market
– Register/memory configuration is unusual
• Difficult to program
– Most languages have no support
– High Performance Fortran is the usual choice
• Only exceptional on truly parallel computations
Vector processor
• Essentially a normal processor, usually superscalar and heavily pipelined
• What it has that is different are vector registers
– A normal register contains a single value, either integer or floating point of some size
– A vector register contains an array of these items that can be added with array arithmetic (see the sketch below)
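As a concrete stand-in (the slides name no particular instruction set, so this sketch uses x86 SSE intrinsics), a 128-bit XMM register behaves like a small 4-slot vector register of floats, and one instruction adds all four slots:

#include <xmmintrin.h>   /* SSE intrinsics */

/* Vector load, one vector add, vector store: the same pattern
   the Cray slide below describes, here with a vector register
   that holds 4 floats. */
void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);      /* vector load  */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one add covers 4 slots */
    _mm_storeu_ps(c, vc);             /* vector store */
}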
Crays
• Most of the Cray supercomputers were vector processors
• Programmed more like a regular processor
– There was usually a vector load/store instruction
• The number of values in a vector register was usually modest: 4-8
– This made the cost more reasonable
– The performance was not so lopsided on vector operations
Commercially
• The market for these sorts of array and vector processors is very limited
• There are few organizations that will always be able to utilize them
• In general it is a niche market
• However, there are some common ones as well
Intel MMX instructions
• The Pentium should not be considered a vector processor
• Yet it has vector operations in the MMX subset
– The SSE sets extend these
• These allow one 64-bit MMX register to be treated as eight 8-bit values or four 16-bit values
• This allows array processing of 8-bit pixels or 16-bit sound samples (see the sketch below)
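For instance (a sketch using the later 128-bit SSE2 intrinsics rather than the original MMX ones; the function and buffer names are made up), sixteen 8-bit pixels can be processed with a single saturating add per group:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two rows of 8-bit pixels, 16 at a time.  _mm_adds_epu8 is
   a saturating add: results above 255 clamp to 255 instead of
   wrapping, which is what image data wants.  Assumes n is a
   multiple of 16 for simplicity. */
void add_pixels(const unsigned char *a, const unsigned char *b,
                unsigned char *out, int n)
{
    for (int i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_adds_epu8(va, vb));
    }
}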
GPU
• The graphics processing unit is the most common vector processor
• The pixel manipulation present in a GPU is an ideal SIMD environment
• Shading, for example, can easily be done in parallel
• Let's consider one GPU: the ATI Radeon HD 4870
Radeon HD 4870
• There are 10 cores
– Each is SIMD
• Each core has 256 registers
• Each of these registers is actually a vector register of size 64
• The contents of one of these slots is a 4-byte float
• Multiply this out and it is 2.5 MB of register
Exploiting the GPU
• There is substantial power sitting in the GPU
• If no 3D moving displays (such as games) or video are playing, most of this power is sitting idle
• A number of options are now available to use this for scientific computing (see the sketch below)
• GPGPU – General Purpose computing on Graphics Processing Units
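To give a flavor of what GPGPU code looks like, here is a minimal CUDA sketch (the kernel and helper names are made up, and error checking is omitted): the element-wise array add from earlier maps onto one GPU thread per element, with the hardware running the threads in SIMD groups:

#include <cuda_runtime.h>

/* Each thread adds one pair of elements; the GPU schedules the
   threads in SIMD groups (warps), so one instruction stream
   drives many data elements at once. */
__global__ void add_kernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

/* Host side: copy the inputs to the GPU, launch enough threads
   to cover n elements, copy the result back. */
void gpu_add(const float *a, const float *b, float *c, int n)
{
    float *da, *db, *dc;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    add_kernel<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
}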
Super Computers
• A number of groups have organized clusters of GPUs into supercomputers
• Example: the Chinese Mole-8.5 (2011)
– 2200 NVIDIA Tesla GPUs
– Used to simulate an H1N1 influenza virus
Finally
• The big scientific computers are a niche market
• Supercomputers have been fabricated using clusters of GPUs
– This is likely the future of SIMD
Copyright © 2011-2014 Curt Hill