SIMD: Single Instruction, Multiple Data
SIMD
• Only successful when the data is highly parallel
• There is a very large amount of time spent on array processing
• The array element processing is somewhat independent
– Such as adding corresponding array elements of two arrays
• There are plenty of applications, but they are specialized, usually scientific
Data level parallelism
• Suppose we have two arrays of 32 floating point operands and we want to add them
• A single processor will go down the line summing one pair at a time (see the sketch below)
– If it is superscalar and has two FPUs it can do this in slightly more than 16 units of time, otherwise 32
• Not bad, but generally outperformed by array and vector processors
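As a rough illustration (a plain C sketch; the array and function names are made up, not from the slides), the scalar approach is just a loop that produces one sum per iteration; with two floating point units a superscalar core can overlap roughly two iterations at a time:

#include <stddef.h>

#define N 32

/* Scalar approach: one floating point add per loop iteration.
   A superscalar core with two FPUs can overlap roughly two
   independent iterations, finishing in a bit more than N/2
   "units" instead of N. */
void add_arrays(const float a[N], const float b[N], float c[N])
{
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}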
Array Processor
• Single control unit that drives multiple ALUs
– The ALUs usually have individual memories
• In the previous case it will take 16-32 units
• Here, if there are 32 floating point units and the vector register contains 32 slots, it will take one unit (see the sketch below)
• When adding two scalar variables the two would be the same speed, but when adding two array variables (length <= 32) the vector processor would be 32 times faster
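One way to picture the difference is with GCC/Clang's vector extensions (a sketch only; a real compiler still lowers this to whatever width the hardware actually provides):

/* A 32-slot "vector register" type built from GCC/Clang vector
   extensions.  The single '+' expresses the whole 32-element add
   as one operation; an array processor with 32 FPUs could execute
   it in one unit of time, where a scalar core needs about 32. */
typedef float vec32f __attribute__((vector_size(32 * sizeof(float))));

vec32f add_all(vec32f a, vec32f b)
{
    return a + b;   /* element-wise add of all 32 slots at once */
}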
Why
• In most applications such parallelism would be a waste, but in many scientific applications an array of size 32 is pretty small and substantial use could be made of this parallelism
• An array processor is a large number of identical processors that perform the same instruction on different pieces of data
– Single control unit for the many processors
– Parallel memories for the parallel processors
Examples
• ILLIAC IV was the first, in the late 60s
– Largely used by NASA for fluid dynamics calculations
– Very large amount of parallelism in this application
• Thinking Machines Connection Machine 1 and 2
• Goodyear Massively Parallel Processor
• MasPar MP-1 and MP-2
Disadvantages
• Hardware heavy, and therefore expensive
– Never mass produced since they fit a niche market
– Register/memory configuration is unusual
• Difficult to program
– Most languages have no support
– High Performance Fortran is the usual choice
• Only exceptional on truly parallel computations
Vector processor
• Essentially a normal processor, usually superscalar and heavily pipelined
• What it has that is different are vector registers
– A normal register contains a single value, either integer or floating point of some size
– A vector register contains an array of these items that can be added with array arithmetic (see the sketch below)
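As a concrete stand-in (the slides name no particular instruction set, so this sketch uses x86 SSE intrinsics), a 128-bit XMM register behaves like a small 4-slot vector register of floats, and one instruction adds all four slots:

#include <xmmintrin.h>   /* SSE intrinsics */

/* Vector load, one vector add, vector store: the same pattern
   the Cray slide below describes, here with a vector register
   that holds 4 floats. */
void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);      /* vector load  */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one add covers 4 slots */
    _mm_storeu_ps(c, vc);             /* vector store */
}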
Crays
• Most of the Cray supercomputers were vector processors
• Programmed more like a regular processor
– There was usually a vector load/store instruction
• The number of values in a vector register was usually modest: 4-8
– This made the cost more reasonable
– The performance was not so lopsided on vector operations
Commercially
• The market for these sorts of array and vector processors is very limited
• There are few organizations that will always be able to utilize them
• In general it is a niche market
• However, there are some common ones as well
Intel MMX instructions
• The Pentium should not be considered a vector processor
• Yet it has vector operations in the MMX subset
– The SSE sets extend these
• These allow one 64-bit MMX register to be treated as eight 8-bit values or four 16-bit values
• This allows array processing of 8-bit pixels or 16-bit sound samples (see the sketch below)
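For instance (a sketch using the later 128-bit SSE2 intrinsics rather than the original MMX ones; the function and buffer names are made up), sixteen 8-bit pixels can be processed with a single saturating add per group:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two rows of 8-bit pixels, 16 at a time.  _mm_adds_epu8 is
   a saturating add: results above 255 clamp to 255 instead of
   wrapping, which is what image data wants.  Assumes n is a
   multiple of 16 for simplicity. */
void add_pixels(const unsigned char *a, const unsigned char *b,
                unsigned char *out, int n)
{
    for (int i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_adds_epu8(va, vb));
    }
}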
GPU
• The graphics processing unit is the most common vector processor
• The pixel manipulation present in a GPU is an ideal SIMD environment
• Shading, for example, can easily be done in parallel
• Let's consider one GPU: the ATI Radeon HD 4870
Radeon HD 4870
• There are 10 cores
– Each is SIMD
• Each core has 256 registers
• Each of these registers is actually a vector register of size 64
• The contents of one of these slots is a 4-byte float
• Multiply this out and it is 2.5 MB of register
Exploiting the GPU
• There is substantial power sitting in the GPU
• If no 3D moving displays (such as games) or video are playing, most of this power is sitting idle
• A number of options are now available to use this for scientific computing (see the sketch below)
• GPGPU – General Purpose computing on Graphics Processing Units
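To give a flavor of what GPGPU code looks like, here is a minimal CUDA sketch (the kernel and helper names are made up, and error checking is omitted): the element-wise array add from earlier maps onto one GPU thread per element, with the hardware running the threads in SIMD groups:

#include <cuda_runtime.h>

/* Each thread adds one pair of elements; the GPU schedules the
   threads in SIMD groups (warps), so one instruction stream
   drives many data elements at once. */
__global__ void add_kernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

/* Host side: copy the inputs to the GPU, launch enough threads
   to cover n elements, copy the result back. */
void gpu_add(const float *a, const float *b, float *c, int n)
{
    float *da, *db, *dc;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    add_kernel<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
}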
Super Computers
• A number of groups have organized clusters of GPUs into supercomputers
• Example: the Chinese Mole-8.5 (2011)
– 2200 NVIDIA Tesla GPUs
– Used to simulate an H1N1 influenza virus
Finally
• The big scientific computers are a niche market
• Supercomputers have been fabricated using clusters of GPUs
– This is likely the future of SIMD
Copyright © 2011-2014 Curt Hill