Parallel Scientific Computing: Algorithms and Tools
Lecture #3
APMA 2821A, Spring 2008
Instructors: George Em Karniadakis
Leopold Grinberg
Levels of Parallelism

Job-level parallelism: capacity computing
  Goal: run as many jobs as possible on a system in a given time period.
  Concerned with throughput; an individual user's jobs may not run faster.
  Of interest to administrators.
Program/task-level parallelism: capability computing
  Use multiple processors to solve a single problem.
  Controlled by users.
Instruction-level parallelism:
  Pipelining, multiple functional units, multiple cores.
  Invisible to users.
Bit-level parallelism:
  Of concern to hardware designers of arithmetic-logic units.
Granularity of Parallel Tasks

Large/coarse-grain parallelism:
  The amount of operations that run in parallel is fairly large,
  e.g., on the order of an entire program.
Small/fine-grain parallelism:
  The amount of operations that run in parallel is relatively small,
  e.g., on the order of a single loop.
Coarse/large grains usually result in more favorable parallel performance.
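To make the distinction concrete, here is a small illustrative C/OpenMP sketch (not from the original slides; the array and loop are placeholders) of fine-grain, loop-level parallelism:

/* Illustrative only: fine-grain (loop-level) parallelism with OpenMP.
   The parallel work is the iterations of a single loop; coarse-grain
   parallelism would instead hand each process an entire subdomain or
   program phase. */
#define N 1000000
static double a[N], b[N];

void scale(double alpha)
{
    #pragma omp parallel for        /* loop iterations split among threads */
    for (int i = 0; i < N; i++)
        a[i] = alpha * b[i];
}

int main(void)
{
    for (int i = 0; i < N; i++) b[i] = i;
    scale(2.0);
    return 0;
}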
Flynn’s Taxonomy of Computers
SISD: Single instruction stream, single data stream
MISD: Multiple instruction streams, single data stream
SIMD: Single instruction stream, multiple data streams
MIMD: Multiple instruction streams, multiple data streams
Classification of Computers

SISD: single instruction, single data
  Conventional computers.
  The CPU fetches from one instruction stream and works on one data stream.
  Instructions may run in parallel (superscalar).
MISD: multiple instruction, single data
  No real-world implementation.
Classification of Computers

SIMD: single instruction, multiple data
  Controller + processing elements (PEs).
  The controller dispatches an instruction to the PEs; all PEs execute the same instruction, but on different data.
  e.g., MasPar MP-1, Thinking Machines CM-1, vector computers (?)
MIMD: multiple instruction, multiple data
  Processors execute their own instructions on different data streams.
  Processors communicate with one another directly, or through shared memory.
  Usual parallel computers, clusters of workstations.
Flynn’s Taxonomy
Programming Model

SPMD: single program, multiple data
MPMD: multiple programs, multiple data
Programming Model

SPMD: single program, multiple data
  The usual parallel programming model.
  All processors execute the same program, on multiple data sets (domain decomposition).
  Each processor knows its own ID:
  • if (my_cpu_id == N) {}
  • else {}
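As an illustration of the SPMD style (an illustrative sketch, not the course's code; the printed messages are placeholders), every MPI process below runs the same program and branches on its own rank:

/* Minimal SPMD sketch with MPI: every process runs this same program
   and branches on its rank (its "ID"). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_cpu_id, num_cpus;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_cpu_id);   /* processor knows its own ID */
    MPI_Comm_size(MPI_COMM_WORLD, &num_cpus);

    if (my_cpu_id == 0) {
        printf("rank 0: could read input and hand it out\n");
    } else {
        printf("rank %d of %d: works on its own piece of the data\n",
               my_cpu_id, num_cpus);
    }

    MPI_Finalize();
    return 0;
}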
Programming Model

MPMD: multiple programs, multiple data
  Different processors execute different programs, on different data.
  Usually a master-slave model is used.
  • The master CPU spawns and dispatches computations to slave CPUs running a different program.
  Can be converted into the SPMD model:
  • if (my_cpu_id == 0) run function_containing_program_1;
  • else run function_containing_program_2;
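A hedged sketch of that conversion (illustrative, not from the slides): the two functions keep the slide's placeholder names, and the "work item" the master sends to each slave is made up for the example.

/* MPMD-style master-slave folded into one SPMD program: rank 0 plays the
   "master" program, all other ranks play the "slave" program. */
#include <mpi.h>
#include <stdio.h>

static void function_containing_program_1(int num_cpus)   /* master */
{
    for (int slave = 1; slave < num_cpus; slave++) {
        int work_item = 100 + slave;                       /* made-up work */
        MPI_Send(&work_item, 1, MPI_INT, slave, 0, MPI_COMM_WORLD);
    }
}

static void function_containing_program_2(int my_cpu_id)  /* slave */
{
    int work_item;
    MPI_Recv(&work_item, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("slave %d received work item %d\n", my_cpu_id, work_item);
}

int main(int argc, char **argv)
{
    int my_cpu_id, num_cpus;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_cpu_id);
    MPI_Comm_size(MPI_COMM_WORLD, &num_cpus);

    if (my_cpu_id == 0) function_containing_program_1(num_cpus);
    else                function_containing_program_2(my_cpu_id);

    MPI_Finalize();
    return 0;
}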
Classification of Parallel Computers

Flynn's MIMD class contains a wide variety of parallel computers.
Based on memory organization (address space):
  Shared-memory parallel computers
  • Processors can access all memories.
  Distributed-memory parallel computers
  • A processor can only access its local memory.
  • Remote memory access requires explicit communication.
Shared-Memory Parallel Computer

Superscalar processors with L2 caches, connected to memory modules through a bus or crossbar.
All processors have access to all machine resources, including memory and I/O devices.
SMP (symmetric multiprocessor): the processors are all the same and have equal access to machine resources, i.e., the machine is symmetric.
SMPs are UMA (Uniform Memory Access) machines.
e.g., a node of an IBM SP machine; SUN Ultraenterprise 10000
[Figure: prototype shared-memory parallel computer. Processors P1…Pn, each with a cache C, are connected through a bus or crossbar to memory modules M1…Mn. P – processor; C – cache; M – memory.]
Shared-Memory Parallel Computer

If a bus is used:
  Only one processor can access the memory at a time.
  Processors contend for the bus to access memory.
If a crossbar is used:
  Multiple processors can access memory through independent paths.
  Contention occurs when different processors access the same memory module.
  A crossbar can be very expensive.
Processor count is limited by memory contention and bandwidth; the maximum is usually 64 or 128.
[Figures: bus-based and crossbar-based shared-memory organizations. Processors P1…Pn with caches C are connected to memory modules M1…Mn through a bus in one case and a crossbar in the other.]
Shared-Memory Parallel Computer

Data flows from memory to cache, then to the processors.
Performance depends dramatically on reuse of data in cache.
Fetching data from memory, with potential memory contention, can be expensive.
The L2 cache plays the role of local fast memory; shared memory is analogous to extended memory accessed in blocks.
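To make the cache-reuse point concrete, a small illustrative C sketch (not from the slides; the matrix size is arbitrary): traversing the array in storage order reuses each fetched cache line, while the strided traversal keeps going back to memory.

/* Same arithmetic, very different cache behavior.
   C stores arrays row-major, so the row-wise loop reuses each cache line. */
#include <stdio.h>
#define N 2048
static double a[N][N];

double sum_row_major(void)      /* good reuse: consecutive addresses */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_column_major(void)   /* poor reuse: stride of N doubles */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    printf("%f %f\n", sum_row_major(), sum_column_major());
    return 0;
}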
Cache Coherency

If a piece of data in one processor's cache is modified, then all other processors' caches that contain that data must be updated.
Cache coherency: the state achieved by maintaining consistent values of the same data in all processors' caches.
Usually hardware maintains cache coherency; system software can also do this, but it is more difficult.
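One place where the coherency machinery becomes visible to programmers is false sharing; the sketch below is an illustration (the two-thread count and 64-byte line size are assumptions), not material from the slides.

/* Illustrative sketch of false sharing, assuming 64-byte cache lines.
   Each thread increments its own counter, but both counters live in the
   same cache line, so the coherency protocol keeps moving that line
   between the two processors' caches. Padding each counter out to its
   own cache line would remove the effect. */
#include <omp.h>

long counters[2];                    /* adjacent longs: one shared cache line */

void update(long iters)
{
    #pragma omp parallel num_threads(2)
    {
        int t = omp_get_thread_num();        /* thread 0 or 1 */
        for (long i = 0; i < iters; i++)
            counters[t]++;                   /* coherency traffic on each write */
    }
}

int main(void)
{
    update(10000000);
    return 0;
}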
Programming Shared-Memory Parallel Computers

All memory modules are in the same global address space.
Closest to a single-processor computer; relatively easy to program.
Multi-threaded programming:
  Auto-parallelizing compilers can extract fine-grain (loop-level) parallelism automatically;
  or use OpenMP;
  or use explicit POSIX (Portable Operating System Interface) threads or other thread libraries.
Message passing:
  MPI (Message Passing Interface).
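For example, a minimal OpenMP sketch (illustrative; the arrays and sizes are made up) of loop-level parallelism in the shared address space:

/* Loop-level parallelism on a shared-memory machine: all threads see the
   same arrays in the single global address space. */
#include <omp.h>
#include <stdio.h>

#define N 1000000
static double x[N], y[N];

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double s = 0.0;
    #pragma omp parallel for reduction(+:s)   /* threads share x and y */
    for (int i = 0; i < N; i++)
        s += x[i] * y[i];

    printf("dot = %f (threads available: %d)\n", s, omp_get_max_threads());
    return 0;
}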
Distributed-Memory Parallel Computer

Superscalar processors with local memory, connected through a communication network.
Each processor can only work on data in its local memory.
Access to remote memory requires explicit communication.
Present-day large supercomputers are all some sort of distributed-memory machine.
[Figure: prototype distributed-memory computer. Processors P1…Pn, each with its own local memory M, are connected by a communication network.]
e.g. IBM SP, BlueGene; Cray XT3/XT4
Distributed-Memory Parallel Computer

High scalability:
  No memory contention such as that in shared-memory machines.
  Now scaled to > 100,000 processors.
Performance of the network connection is crucial to application performance:
  Ideal: low latency, high bandwidth.
  Communication is much slower than local memory reads/writes.
  Data locality is important: keep frequently used data in local memory.
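A common way to see the latency/bandwidth point is a ping-pong test between two processes; the sketch below is illustrative (message size and repetition count are arbitrary) and assumes at least two MPI ranks.

/* Ping-pong between ranks 0 and 1: for small messages the round-trip time
   is dominated by latency, for large messages by bandwidth. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { NBYTES = 1 << 20, REPS = 100 };   /* arbitrary choices */
    static char buf[NBYTES];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg round trip: %g s for %d bytes\n", (t1 - t0) / REPS, NBYTES);

    MPI_Finalize();
    return 0;
}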
Programming Distributed-Memory Parallel Computer

"Owner computes" rule:
  The problem needs to be broken up into independent tasks with independent memory.
  Each task is assigned to a processor.
  Naturally matches data-based decomposition, such as a domain decomposition.
Message passing: tasks explicitly exchange data by message passing.
  All data is transferred using explicit send/receive instructions.
  The user must optimize communications.
  Usually MPI (used to be PVM): portable, high performance.
Parallelization mostly at a large granularity level, controlled by the user.
  Difficult for compilers/auto-parallelization tools.
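As a hedged illustration of the explicit send/receive style (a sketch of a 1-D domain decomposition, not the course's code; sizes and data are placeholders), each rank owns a slice of an array plus ghost cells and exchanges boundary values with its neighbors:

/* 1-D domain decomposition with explicit message passing.
   Each rank owns LOCAL_N interior points plus two ghost cells and swaps
   boundary values with its left/right neighbors. */
#include <mpi.h>

#define LOCAL_N 1000

int main(int argc, char **argv)
{
    double u[LOCAL_N + 2];               /* u[0] and u[LOCAL_N+1] are ghosts */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int i = 0; i < LOCAL_N + 2; i++) u[i] = rank;   /* made-up data */

    /* exchange ghost cells: send my boundary value, receive the neighbor's */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}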
Programming Distributed-Memory Parallel Computer

A global address space is provided on some distributed-memory machines:
  Memory is physically distributed, but globally addressable; the machine can be treated as a "shared-memory" machine; so-called distributed shared memory.
  e.g., Cray T3E; SGI Altix, Origin.
  Multi-threaded programs (OpenMP, POSIX threads) can also be used on such machines.
  The user accesses remote memory as if it were local; the OS/compilers translate such accesses into fetches/stores over the communication network.
  But it is difficult to control data locality; performance may suffer.
  NUMA (non-uniform memory access); ccNUMA (cache-coherent non-uniform memory access); overhead.
Hybrid Parallel Computer

Distributed memory overall, with SMP nodes.
Most modern supercomputers and workstation clusters are of this type.
Message passing; or hybrid message passing/threading.
[Figure: hybrid parallel computer. SMP nodes (processors P and memories M connected by a bus or crossbar) are linked by a communication network.]
e.g. IBM SP, Cray XT3
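A minimal sketch of the hybrid style (illustrative; the work loop and the one-MPI-process-per-node mapping are assumptions): MPI between nodes, OpenMP threads within each SMP node.

/* Hybrid programming: message passing between nodes, threads inside a node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double a[N];
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)   /* threads within the node */
    for (int i = 0; i < N; i++) {
        a[i] = rank + i * 1e-6;                   /* made-up local work */
        local += a[i];
    }

    /* message passing between nodes */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %g\n", global);

    MPI_Finalize();
    return 0;
}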
Interconnection Network/Topology

Nodes, links.
Neighbors: nodes with a link between them.
Degree of a node: number of neighbors it has.
Scalability: increase in complexity when more nodes are added.

Ring; fully connected network
Topology
Hypercube
Topology
1D/2D mesh/torus
3D mesh/torus
Topology

Tree; star
Topology

Bisection width: the minimum number of links that must be cut in order to divide the topology into two independent networks of the same size (plus/minus one node).
Bisection bandwidth: the communication bandwidth across the links that are cut in defining the bisection width.
A larger bisection bandwidth is better.
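For example (standard textbook values, not listed on this slide): a ring of p nodes has bisection width 2; a √p × √p 2-D mesh has bisection width √p; a hypercube with p nodes has bisection width p/2; and a fully connected network has about p²/4. Multiplying the bisection width by the per-link bandwidth gives the corresponding bisection bandwidth.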