Computer Architectures ... High Performance Computing I Fall 2001 MAE609 /Mth667 Abani Patra


Page 1:

Computer Architectures ... High Performance Computing I

Fall 2001, MAE609/Mth667

Abani Patra

Page 2:

Microprocessor Basic Architecture

CISC vs. RISC, Superscalar, EPIC

Page 3:

Performance Measures

Floating Point Operations Per Second (FLOPS)

1 MFLOP -- workstations; 1 GFLOP -- readily available

HPC: 1 TFLOP best now !! 1 PFLOP ... by 2010 ??

Page 4:

Performance

Ttheor: theoretical peak performance, obtained by multiplying the clock rate by the number of CPUs and the number of FPUs per CPU

Treal: real performance on some specific operation, e.g. vector add and multiply

Tsustained: sustained performance on a full application, e.g. CFD

Tsustained << Treal << Ttheor
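As a small illustration of Ttheor (the machine figures below are hypothetical, not from the slides):

```python
# Ttheor is just clock rate x number of CPUs x number of FPUs per CPU,
# assuming one result per FPU per cycle.
def peak_mflops(clock_mhz, n_cpu, n_fpu_per_cpu):
    """Theoretical peak performance (Ttheor) in MFLOPS."""
    return clock_mhz * n_cpu * n_fpu_per_cpu

# A hypothetical 4-processor machine, 2 FPUs per CPU, 500 MHz clock:
print(peak_mflops(500, 4, 2))  # 4000 MFLOPS = 4 GFLOPS
```

Treal and Tsustained can only be measured, not computed this way; they sit well below this number.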

Page 5:

Performance

Performance degrades if the CPU has to wait for data to operate on

Fast CPU => need adequately fast memory

Rule of thumb -- Memory in MB = Ttheor in MFLOPS

Page 6:

Making a Supercomputer Faster

Reduce Cycle time

Pipelining -- Instruction Pipelines, Vector Pipelines

Internal Parallelism -- Superscalar, EPIC

External Parallelism

Page 7:

Making a Supercomputer Faster

Reduce Cycle time -- increase the clock rate. Limited by semiconductor manufacture! Current generation 1-2 GHz (immediate future 10 GHz)

Pipelining -- fine subdivision of an operation into sub-operations, leading to shorter cycle time but larger start-up time

Page 8:

Pipelining

Instruction Pipelining

• 4-stage instruction pipeline: Fetch Ins, Fetch Data, Execute, Store

• 3 instructions A, B, C

• 4 cycles needed by each instruction

stage \ cycle    1    2    3    4    5    6
Fetch Ins        A    B    C
Fetch Data            A    B    C
Execute                    A    B    C
Store                           A    B    C

• one result per cycle after pipe is "full" -- startup time
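The cycle counts in this example can be written down directly; a minimal sketch of the startup-time effect:

```python
def pipeline_cycles(n_instructions, n_stages):
    # After the pipe fills (n_stages cycles of startup), one result
    # completes per cycle, so the last instruction finishes at:
    return n_stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions, n_stages):
    # Without pipelining each instruction takes all stages serially.
    return n_instructions * n_stages

# The slide's example: 3 instructions through a 4-stage pipe
print(pipeline_cycles(3, 4))     # 6 cycles
print(unpipelined_cycles(3, 4))  # 12 cycles
```

The advantage grows with the number of instructions: for long streams the pipeline approaches one result per cycle.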

Page 9:

Pipelining

Almost all current computers use some pipelining, e.g. IBM RS6000

Speedup from instruction pipelining cannot always be achieved !!

The next instruction may not be known till execution -- e.g. a branch

Data for execution may not be available

Page 10:

Vector Pipelines

Effective for operations like

      do 10 i=1,1000
10    c(i)=a(i)*b(i)

the same instruction executed 1000 times with different data

using a "vector pipe" the whole loop is one vector instruction -- Cray XMP, YMP, T90 ...

Page 11:

Vector Pipelining

For some operations like a(I) = b(I) + c(I)*d(I), the results of the multiply are chained to the addition pipeline

Disadvantages:

startup time

code has to be vectorized; loops have to be blocked into vector lengths

Page 12:

Internal Parallelism

Use multiple functional units per processor

Cray T90 has 2-track vector units; NEC SX4, Fujitsu VPP300 -- 8-track vector units

superscalar -- e.g. IBM RS6000 POWER2 uses 2 arithmetic units

EPIC

Need to provide data to multiple functional units => fast memory access

The limiting factor is memory-processor bandwidth

Page 13:

External Parallelism

Use multiple processors

Shared Memory (SMP: Symmetric Multi-processors)

many processors accessing the same memory

limited by memory-processor bandwidth

SUN Ultra2, SGI Octane, SGI Onyx, Compaq ...

[diagram: CPU 0, CPU 1, ... sharing memory banks]

Page 14:

External Parallelism

Distributed memory

many processors, each with local memory and some type of high-speed interconnect

[diagram: CPU 0, CPU 1, ... with local memories joined by an interconnection network]

E.g. IBM SPx, Cray T3E, networks of workstations, Beowulf clusters of Pentium PCs

Page 15:

External Parallelism

SMP Clusters

nodes with multiple processors sharing local memory; nodes connected by an interconnect

"best of both ?"

Page 16:

Classification of Computers

Hardware

SISD (Single Instruction Single Data)
SIMD (Single Instruction Multiple Data)
MIMD (Multiple Instruction Multiple Data)

Programming Model

SPMD (Single Program Multiple Data)
MPMD (Multiple Program Multiple Data)

Page 17:

Hardware Classification

SISD (Single Instruction Single Data)

classical scalar/vector computer -- one instruction, one datum

superscalar -- instructions may run in parallel

SIMD (Single Instruction Multiple Data)

vector computers; Data Parallel -- Connection Machine etc. (extinct now)

Page 18:

Hardware Classification

MIMD (Multiple Instruction Multiple Data)

the usual parallel computer

each processor executes its own instructions on different data streams

needs synchronization to get meaningful results

Page 19:

Programming Model

SPMD (Single Program Multiple Data)

a single program is run on all processors with different data

each processor knows its ID -- thus constructs like

      if (procID .eq. N) then
         ....
      else
         ....
      endif

can be used for program control
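A sketch of the SPMD idea (plain Python standing in for the Fortran construct above; the processors are simulated, and the master/worker split is an invented example, not from the slides):

```python
def spmd_body(proc_id, data):
    # Every processor runs this SAME program; behaviour branches on
    # the processor ID, mirroring "if (procID .eq. N) then ... else ...".
    if proc_id == 0:
        return ("master", sum(data))              # e.g. rank 0 reduces its slice
    else:
        return ("worker", [x * 2 for x in data])  # others transform their slice

# Simulating 3 processors, each holding a different slice of the data:
slices = [[1, 2], [3, 4], [5, 6]]
results = [spmd_body(p, slices[p]) for p in range(3)]
print(results)
```

On a real machine each `spmd_body` call would run on its own processor against local memory, with no shared `slices` list.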

Page 20:

Programming Model

MPMD (Multiple Program Multiple Data)

Different programs run on different processors

usually a master-slave model is used

Page 21:

Topologies/Interconnects

Hypercube, Torus

Page 22:

Prototype Supercomputers and Bottlenecks

Page 23:

Types of Processors/Computers used in HPC

Prototype processors

Vector Processors
Superscalar Processors

Prototype Parallel Computers

Shared Memory -- without cache; with cache (SMP)
Distributed Memory

Page 24:

Vector Processors

Page 25:

Vector Processors

Components: vector registers; ADD/Logic and MULTIPLY pipelines; Load/Store pipelines; scalar registers + pipelines

Page 26:

Vector Registers

Finite length of vector registers -- 32/64/128 etc.

Strip mining to operate on longer vectors

Codes are often manually restructured into vector-length loops

Sawtooth performance curve -- maxima at multiples of the vector length
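Strip mining can be sketched as follows (Python, with an assumed helper `strips`; 0-based half-open ranges for brevity):

```python
def strips(n, vlen):
    """Split an n-element loop into full strips of the vector length
    plus one remainder strip -- what strip mining does."""
    full, rem = divmod(n, vlen)
    out = [(i * vlen, (i + 1) * vlen) for i in range(full)]
    if rem:
        out.append((full * vlen, n))  # the short tail strip
    return out

# n = 200 with 64-element vector registers: three full strips plus an
# 8-element tail, which is why performance peaks when n is a multiple
# of the vector length (no partially filled pipe).
print(strips(200, 64))  # [(0, 64), (64, 128), (128, 192), (192, 200)]
```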

Page 27:

Vector Processors

Memory-processor bandwidth: performance depends completely on keeping the vector registers supplied with operands from memory

Size of main memory and extended memory: the bandwidth of main memory is much higher, but main memory is more expensive; size determines the size of problem that can be run

scalar registers/scalar processors for scalar instructions

I/O through special processors -- the T90 can produce data at 14400 MB/sec while a disk delivers 20 MB/s; thus a single word can take 720 cycles on a Cray T90 !!

Page 28:

Superscalar Processor

Workstations and nodes of parallel supercomputers

Page 29:

Superscalar Processor

Main components are multiple ALUs and FPUs, plus data and instruction caches

superscalar since the ALUs and FPUs can operate in parallel, producing more than one result per cycle

e.g. IBM POWER2 -- 2 FPU/ALUs that can operate in parallel, producing up to 4 results per cycle if operands are in registers

Page 30:

Superscalar Processor

RISC architecture operating at very high clock speeds (>1 GHz now -- more in a year)

The processor works only on data in registers, which comes only from and goes only to the data cache. If data is not in cache -- a "cache miss" -- the processor is idle while another cache line (4-16 words) is fetched from memory !!

Page 31:

Superscalar Processor

Large off-chip Level 2 caches help with data availability. L1 cache data is accessed in 1-2 cycles, L2 cache in 3-4 cycles, and memory can take 8 times that!

Efficiency is directly related to reuse of data in cache

Remedies: blocked algorithms, contiguous storage, avoiding strides and random/non-deterministic access

Page 32:

Superscalar Processor

Remedies:

Blocked algorithms -- rewrite

      do i=1,1000
         a(i)=....

as

      do j=1,20
         do i=(j-1)*50+1,j*50
            a(i)=....

contiguous storage; avoid strides and random/non-deterministic access such as

      a(ix(i)) = ...
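The blocking transformation can be sanity-checked with a short sketch (Python here, reproducing the Fortran 1-based, inclusive loop bounds; note the blocked inner loop must start at (j-1)*50+1 so each index is visited exactly once):

```python
# Verify that 20 blocks of 50 iterations cover the same index set as
# the original do i=1,1000 loop.
original = list(range(1, 1001))

blocked = []
for j in range(1, 21):                             # do j = 1, 20
    for i in range((j - 1) * 50 + 1, j * 50 + 1):  # do i = (j-1)*50+1, j*50
        blocked.append(i)

assert blocked == original  # same work, but each 50-element block can
                            # stay resident in cache while it is reused
```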

Page 33:

Superscalar Processors

Memory bandwidth is critical to performance

Many engineering applications are difficult to optimize for cache efficiency

Application efficiency => memory bandwidth

Size of memory determines the size of problem that can be solved

DMA (direct memory access) channels take memory-access duties for external operations (I/O, remote processor requests) away from the CPU

Page 34:

Shared Memory Parallel Computer

Memory in banks is accessed equally through a switch (crossbar) by the processors (usually vector)

Processors run "p" independent tasks with possibly shared data

Usually compilers and preprocessors can extract the fine-grained parallelism available

[diagram: Shared Memory Computer, e.g. Cray T90 -- processors P1, P2, P3, ... joined through a switch to shared memory]

Page 35:

Shared Memory Parallel ...

Memory contention and bandwidth limit the number of processors that may be connected

Memory contention can be reduced by increasing the number of banks and reducing the bank busy time (bbt)

This type of parallel computer is closest in programming model to the general-purpose single-processor computer

Page 36:

Symmetric Multiprocessors (SMP)

Processors are usually superscalar -- SUN Ultra, MIPS R10000 -- with large caches

A bus/crossbar is used to connect to the memory modules

With a bus -- only 1 processor can access memory at a time

[diagram: SMP Computer -- processors P1, P2, P3, ... with caches c1, c2, c3, ..., joined by a bus/crossbar to memory modules M1, M2, M3]

Sun Ultra Enterprise 10000, SGI PowerChallenge

Page 37:

Symmetric Multi-processors

If an interconnect is used -- then there will be memory contention

Data flows from memory to cache to processors

Cache coherence: if a piece of data is changed in one cache, then all other caches that contain that data must update the value. Hardware and software must take care of this.

Page 38:

Symmetric Multi-Processors

Performance depends dramatically on the reuse of data in cache; fetching data from the larger memory with potential memory contention can be expensive! Caches and cache lines are also bigger

The large L2 cache really plays the role of local fast memory, while the memory banks are more like extended memory accessed in blocks

Page 39:

Distributed Memory Parallel Computer

Prototype DMP

Processors are superscalar RISC with only LOCAL memory

Each processor can only work on data in local memory

Communication is required for access to remote memory

[diagram: processor + memory (P, M) nodes joined by a communication network]

IBM SP, Intel Paragon, SGI Origin2000

Page 40:

Distributed Memory Parallel Computer

Problems need to be broken up into independent tasks with independent memory -- this naturally matches a data-based decomposition of the problem using an "owner computes" rule

Parallelization is mostly at a high granularity level, controlled by the user -- difficult for compilers/automatic parallelization tools

These computers are scalable to very large numbers of processors

Page 41:

Distributed Memory Parallel Computer

Hybrid Parallel Computer

NUMA: non-uniform memory access based classification

Intel Paragon (the 1st teraflop machine) had 4 Pentiums per node with a bus

The HP Exemplar has a bus at the node

[diagram: nodes, each with processor + memory pairs on a local bus, joined by a communication network]

Page 42:

Distributed Memory Parallel Computer

Semi-autonomous memory: processors can access remote memory using memory control units (MCUs)

CRAY T3E and SGI Origin 2000

[diagram: processor + memory nodes, each with an MCU, joined by a communication network]

Page 43:

Distributed Memory Parallel Computer

Fully autonomous memory

Memory and processors are equally distributed over the network

The Tera MTA is the only example

Latency and data transfer from memory are at the speed of the network!

[diagram: memories and processors attached directly to the communication network]

Page 44:

Accessing Distributed Memory

Message Passing

The user transfers all data using explicit send/receive instructions

synchronous message passing can be slow

programming with a NEW programming model! The user must optimize communication

asynchronous/one-sided get and put are faster but need more care in programming

Codes used to be machine specific -- Intel NEXUS etc. -- until standardized to PVM (parallel virtual machine) and subsequently MPI (message passing interface)
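The send/receive style can be mimicked in a few lines; this is a toy sketch using a Python queue as the "wire", not real MPI (where the calls would be MPI_Send/MPI_Recv):

```python
from queue import Queue

# A toy send/receive pair: the pattern -- explicit, user-managed data
# transfer between processors -- is what message passing means.
channel = Queue()

def send(data):
    channel.put(data)

def receive():
    # Blocks until a message arrives, like a synchronous receive.
    return channel.get()

send([1.0, 2.0, 3.0])  # "processor 0" ships its boundary data
msg = receive()        # "processor 1" waits for it
print(msg)
```

The blocking `receive` is exactly where synchronous message passing loses time; one-sided get/put avoids the rendezvous at the cost of trickier correctness reasoning.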

Page 45:

Accessing Distributed Memory

Global distributed memory

Physically distributed and globally addressable -- Cray T3E / SGI Origin 2000

The user formally accesses remote memory as if it were local -- the operating system/compilers translate such accesses into fetches/stores over the communication network

High Performance FORTRAN (HPF) -- a software realization of distributed memory -- arrays etc. can be distributed when declared, using compiler directives. The compiler translates remote memory accesses into appropriate calls (message passing / OS calls, as supported by the hardware)

Page 46:

Processor interconnects/topologies

Buses

Lower cost -- but only one pair of devices (processors/memories etc.) can communicate at a time, e.g. ethernet used to link workstation networks

Switches

Like the telephone network -- can sustain many simultaneous communications; higher cost!

The critical measure is bisection bandwidth -- how much data can be passed between the two halves of the machine

Page 47:

Processor interconnects/topologies (figure)

Page 48:

Processor interconnects/topologies (figure)

Page 49:

Processor interconnects/topologies

Workstation network on ethernet -- very high latency; processors must participate in communication

Page 50:

Processor interconnects/topologies

1D and 2D meshes and rings/tori

Page 51:

Processor interconnects/topologies

3D meshes and rings/tori

Page 52:

Processor interconnects/topologies

d-dimensional hypercubes

Page 53:

Processor Scheduling

Space Sharing -- processor banks of 4/8/16 etc. assigned to users for specific times

Time sharing on processor partitions

Livermore Gang Scheduling

Page 54:

IBM RS/6000 SP

• Distributed Memory Parallel Computer

• Assembly of workstations using the HPS (a crossbar-type switch)

• Comes with a choice of processors -- POWER2 (variants), POWER3, and clusters of PowerPC (also used by Apple G3, G4 etc.)

Page 55:

POWER2 Processor

Different versions -- with different frequencies, cache sizes and bandwidths

Page 56:

POWER2 ARCHITECTURE

Page 57:

POWER2

Dual fixed-point/floating-point units -- a multiply/add in each

Max. 4 floating-point results/cycle

The ICU (with 32 KB instruction cache) can execute a branch and a condition per cycle

Per cycle, 8 instructions may be issued and executed -- truly SUPERSCALAR!

Page 58:

Wide Node (77 MHz) Performance

Theoretical peak performance:
2*77 = 154 MFLOP for a dyad
4*77 = 308 MFLOP for a triad

Cache effects dominate performance

256 KB cache with a 256-bit path to cache and from cache to memory -- 2 words (8 bytes each) may be fetched and 2 words stored per cycle

Page 59:

Expected Performance

For a dyad ai = bi*ci or ai = bi+ci -- needs 2 loads and 1 store, i.e. 6 memory references to feed the 2 FPUs -- but only 4 are available:
(2*77)*(4/6) = 102.7 MFLOP

For a linked triad ai = bi + s*ci (2 loads, 1 store):
(4*77)*(4/6) = 205.3 MFLOP

For a vector triad ai = bi + ci*di (3 loads, 1 store):
(4*77)*(4/8) = 154 MFLOP
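Each estimate is one ratio: results per cycle, scaled by the fraction of required memory references that the 4 available per-cycle references can supply. A quick check of the arithmetic:

```python
clock = 77                 # MHz, POWER2 wide node
peak_dyad = 2 * clock      # 2 FPU results per cycle -> 154 MFLOP
peak_triad = 4 * clock     # multiply-add in both FPUs -> 308 MFLOP

# Only 4 memory references per cycle are available from cache:
dyad   = peak_dyad  * 4 / 6   # dyad: 2 loads + 1 store per result, x2 FPUs = 6 refs
ltriad = peak_triad * 4 / 6   # linked triad: 2 loads + 1 store
vtriad = peak_triad * 4 / 8   # vector triad: 3 loads + 1 store = 8 refs

print(round(dyad, 1), round(ltriad, 1), vtriad)  # 102.7 205.3 154.0
```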

Page 60:

Cache Hit/Miss

The performance numbers above assumed that data was available in cache

If data is not in cache, it must be fetched in cache lines of 256 bytes each from memory, at a much slower pace

Page 61:

Page 62:

TERM PAPER

Based on the analysis of the POWER2 processor and the IBM SP presented here, prepare a similar analysis (including estimates of performance) for the new POWER4 chip in the IBM SP, or for a cluster of Pentium 4s.