
Lecture Notes on Parallel Scientific Computing


Tao Yang (tyang@cs.ucsb.edu)

December

Contents

1 Introduction
2 Design and Implementation of Parallel Algorithms
  2.1 A simple model of parallel computation
  2.2 Message-Passing Parallel Programming
  2.3 Complexity analysis for parallel algorithms
3 Issues in Processor Network and Communication
  3.1 Basic Communication Operations
    3.1.1 One-to-All Broadcast
    3.1.2 All-to-All Broadcast
    3.1.3 One-to-all personalized broadcast
  3.2 Network Embedding
4 Model-based Programming Methods
  4.1 Embarrassingly Parallel Computations
    4.1.1 Geometrical Transformations of Images
    4.1.2 Mandelbrot Set
    4.1.3 Monte Carlo Methods
  4.2 Divide-and-Conquer
    4.2.1 Add n numbers
  4.3 Pipelined Computations
    4.3.1 Sorting n numbers
5 Transformation-based Parallel Programming
  5.1 Dependence Analysis
    5.1.1 Basic dependence
    5.1.2 Loop Parallelism
  5.2 Program Partitioning
    5.2.1 Loop blocking/unrolling
    5.2.2 Interior loop blocking
    5.2.3 Loop interchange
  5.3 Data Partitioning
    5.3.1 Data partitioning methods
    5.3.2 Consistency between program and data partitioning
    5.3.3 Data indexing between global space and local space
  5.4 A summary on program parallelization
6 Matrix Vector Multiplication
7 Matrix-Matrix Multiplication
  7.1 Sequential algorithm
  7.2 Parallel algorithm with sufficient memory
  7.3 Parallel algorithm with 1D partitioning
  7.4 Fox's algorithm for 2D data partitioning
    7.4.1 Reorder the sequential additions
    7.4.2 Submatrix partitioning for block-based matrix multiplication
  7.5 How to produce block-based matrix multiplication code
8 Gaussian Elimination for Solving Linear Systems
  8.1 Gaussian Elimination without Partial Pivoting
    8.1.1 The Row-Oriented GE sequential algorithm
    8.1.2 The row-oriented parallel algorithm
    8.1.3 The column-oriented algorithm
9 Gaussian elimination with partial pivoting
  9.1 The sequential algorithm
  9.2 Parallel column-oriented GE with pivoting
10 Iterative Methods for Solving Ax = b
  10.1 The iterative methods
  10.2 Norms and Convergence
    10.2.1 Norms of vectors and matrices
    10.2.2 Convergence of iterative methods
  10.3 Jacobi Method for Ax = b
  10.4 Parallel Jacobi Method
  10.5 Gauss-Seidel Method
  10.6 More on convergence
  10.7 The SOR method
11 Numerical Differentiation
  11.1 Approximation of Derivatives
  11.2 Central difference for second derivatives
  11.3 Example
12 ODE and PDE
  12.1 Finite Difference Method
  12.2 Gaussian Elimination for solving linear tridiagonal systems
  12.3 PDE: Laplace's Equation
A MPI Parallel Programming
B Pthreads
  B.1 Introduction
  B.2 Thread creation and manipulation
  B.3 The Synchronization Routines: Lock
  B.4 Condition Variables

1 Introduction

- Scientific computing uses computers to analyze and solve scientific and engineering problems based on mathematical models. Applications of scientific computing include computer graphics, image processing, stock analysis, weather prediction, seismic data processing, design simulation (airplanes, cars), computational medicine, astrophysics, and nuclear engineering.

- An example of parallel computing: the summation of n numbers in parallel.
- Why parallel computing? To solve large-scale problems. Many scientific problems can be reduced to the problem of solving linear equations. The time to solve n equations on a SUN SPARC workstation grows as n³ and the space as n² bytes: small systems take seconds and tens of kilobytes, while very large systems take hours or even years and hundreds of megabytes to terabytes of memory.

- Applications of parallel computing: large-scale scientific computing, parallel databases (credit cards), multimedia information servers, etc. Parallel computing is everywhere, even on PCs.
- Performance of (then-)current commercial machines: MF (Megaflops, 1 million floating-point operations per second), GF (Gigaflops, 1000 MF).

(The original table lists peak and benchmark performance for a Cray Y-MP with 1 and 8 processors, a SUN SPARC 10, Meiko configurations with 1 node and with many nodes, an SGI Origin 2000 with one node and with many nodes, and a PC Xeon server with 4 nodes of Pentium II processors.)

- Topics on parallel computing:
  - Introduction to parallel architectures: models of parallel computers, interconnection networks, network embedding, task graph computation.
  - Message-passing parallel programming.
  - Model-based parallel programming: embarrassingly parallel, divide-and-conquer, and pipelining.
  - Transformation-based parallel programming: dependence analysis, partitioning and mapping of programs/data.
  - Parallel algorithms for scientific computing.
  - Parallel programming on shared-memory machines.

- Computer architectures: sequential machines, vector machines, parallel machines.
- Parallel architectures:
  - Control mechanism: SIMD (single instruction stream, multiple data streams), e.g. MasPar MP-1, CM-2; MIMD (multiple instruction streams, multiple data streams), e.g. Meiko CS-2, Intel Paragon, Cray T3E, IBM SP-2.
  - Address-space organization: message-passing architectures (Intel Paragon, Meiko CS-2, Cray T3E, IBM SP-2) vs. shared address space architectures (SGI Origin 2000, SUN Enterprise).
  - Interconnection network: bus, ring, hypercube, mesh.

Hypercube Network

- 1-dimensional hypercube: contains 2 nodes, labeled 0 and 1.
- 2-dimensional hypercube: contains 4 nodes, labeled 00, 01, 10, 11 using binary representations.
- 3-dimensional hypercube: contains 8 nodes.
- n-dimensional hypercube: contains 2^n nodes. Two (n-1)-dimensional hypercubes are connected together by adding a link between each pair of nodes that differ by one bit in their binary IDs.

Message routing for node-to-node communication:

1. Circuit switching: establish the path and maintain all the links in the path for the message to pass, uninterrupted, from the source to the destination.
2. Packet switching: a message is divided into packets of information, each of which includes source and destination addresses for routing.
   - Packet delivery using store-and-forward: each intermediate processor forwards a packet to the next processor after it has received and stored the entire packet.
   - Virtual cut-through: if an outgoing link is available, the message is immediately passed forward without being stored.
3. Wormhole routing: a channel is opened between the source and destination. The message is divided into small pieces called flits. The flits are pipelined through the network channel.

Networked computers as a parallel machine:

- Motivation: high-performance workstations and PCs are readily available at low cost. The latest processors can be upgraded easily. Existing software can be used or modified.
- Successful applications: parallel Web searching using a cluster of workstations (Inktomi).
- Network: Ethernet, Fast Ethernet, Gigabit Ethernet. Other networks such as Token Ring, FDDI, Myrinet, SCI, ATM.

Communication in Ethernet (10 Mbps bus with distributed access control):

- Broadcast medium: all transceivers see all packets and pass all packets to the host interface. The host interface chooses the packets the host should receive and discards the others.
- Access scheme: carrier sense multiple access with collision detection. Each access point senses the carrier wave to figure out if the medium is idle. To transmit a packet, it waits until the carrier is idle, then starts transmitting.
- Collision detection and recovery: transceivers monitor the carrier during transmission to detect interference. Interference can happen if two transceivers start sending at the same time; if interference happens, a transceiver detects a collision.
- When a collision is detected, the sender uses a binary exponential backoff policy to retry the send, adding a random delay to avoid synchronized retries.

Parallel performance evaluation:

    Speedup = Sequential time / Parallel time,    Efficiency = Speedup / (number of processors used).

- Performance is restricted by the availability of parallelism (Amdahl's Law).
  Let T1 be the sequential time of a program. Assume that α is the fraction of this program that has to be executed sequentially; only a (1 - α) fraction of this program can be executed concurrently. The shortest parallel time for this program on p processors is

      Tp = α·T1 + (1 - α)·T1/p.

  The speedup bound is

      Speedup = T1 / (α·T1 + (1 - α)·T1/p) = 1 / (α + (1 - α)/p).

  When α = 0, Speedup = p; and when α = 0.5, Speedup stays below 2 no matter how many processors are used.

- Scalability: whether a parallel algorithm can achieve high speedups with a large number of processors.
- Constant problem size scaling: fix a problem size and check whether speedups increase when adding more processors (Amdahl's assumption).
- Time-constrained scaling: fix the parallel time for a given p and increase the problem size to see if speedups increase (Gustafson's assumption).
- Gustafson's Law. Let PTp = α + (1 - α) = 1 and Seq = α + p(1 - α). Then

      Speedup = Seq / PTp = α + p(1 - α) = p - (p - 1)α,

  where α is the fraction of operations which have to be done sequentially, independent of the problem size. When α = 0, Speedup = p. When α = 0.5, Speedup = 0.5p + 0.5. The result is different from Amdahl's law. Why?
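To see the difference numerically, the short C program below (my own illustration, not part of the original notes) evaluates both formulas for α = 0.5 and a few processor counts: the Amdahl bound saturates below 2, while the Gustafson speedup keeps growing with p.

    #include <stdio.h>

    /* Speedup bound under Amdahl's law: 1 / (alpha + (1 - alpha)/p) */
    double amdahl(double alpha, int p) {
        return 1.0 / (alpha + (1.0 - alpha) / p);
    }

    /* Speedup under Gustafson's law: p - (p - 1)*alpha */
    double gustafson(double alpha, int p) {
        return p - (p - 1) * alpha;
    }

    int main(void) {
        double alpha = 0.5;
        int procs[] = {2, 8, 64, 1024};
        for (int i = 0; i < 4; i++) {
            int p = procs[i];
            printf("p=%4d  Amdahl=%6.2f  Gustafson=%8.2f\n",
                   p, amdahl(alpha, p), gustafson(alpha, p));
        }
        return 0;
    }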

Basic steps of parallel programming:

1. Preparing parallelism (task partitioning, program dependence analysis).
2. Mapping and scheduling of parallelism.
3. Coding and debugging.
4. Evaluating parallel performance.

2 Design and Implementation of Parallel Algorithms

2.1 A simple model of parallel computation

- Representation of parallel computation: the task model.
  A task is an indivisible unit of computation, which may be an assignment statement, a subroutine, or even an entire program. We assume that tasks are convex, which means that once a task starts its execution it can run to completion without interrupting for communication.
  Dependence: there may exist dependences between tasks. If a task Ty depends on Tx, then there is a dependence edge from Tx to Ty. Task nodes and their dependences constitute a directed acyclic task graph (DAG).
  Weights: each task Tx can have a computation weight τx representing the execution time of this task. There is a cost c(x,y) in sending a message from one task Tx to another task Ty if they are assigned to different processors.

- Execution of task computation.
  Architecture model: let us first assume message-passing architectures. Each processor has its own local memory and the processors are fully connected.
  Task execution: in the task computation, a task waits to receive all data before it starts its execution. As soon as the task completes its execution it sends the output data to all successors.
  Scheduling is defined by a processor assignment mapping PA(Tx) of the tasks Tx onto the p processors and by a starting time mapping ST(Tx) of all nodes onto the positive real numbers. CT(Tx) = ST(Tx) + τx is defined as the completion time of task Tx in this schedule.
  Dependence constraints: if a task Ty depends on Tx, Ty cannot start until the data produced by Tx is available in the processor of Ty, i.e.

      ST(Ty) ≥ ST(Tx) + τx + c(x,y).

  Resource constraints: two tasks cannot be executed on the same processor at the same time.
  Fig. 1(a) shows a weighted DAG with all computation weights assumed to be equal to 1. Fig. 1(b) and (c) show the schedules under different communication weight assumptions. Both (b) and (c) use Gantt charts to represent schedules. A Gantt chart completely describes the corresponding schedule since it defines both PA(nj) and ST(nj). The PA and ST values for schedule (b) are summarized in Figure 1(d).
  Difficulty: finding the shortest schedule for a general task graph is hard (known to be NP-complete).

- Evaluation of parallel performance. Let p be the number of processors used.

  Sequential time = summation of all task weights. Parallel time = length of the schedule.

      Speedup = Sequential time / Parallel time,    Efficiency = Speedup / p.

Figure 1: (a) A DAG with four tasks T1, ..., T4 and node weights equal to 1; (b) a schedule (Gantt chart) with communication weights c = 0; (c) a schedule with communication weights c = 0.5; (d) the PA and ST values for schedule (b).

- Performance bounds.
  Let the degree of parallelism be the maximum size of an independent task set. Let the critical path be the path in the task graph with the longest length (including node computation weights only). Then

      Parallel time ≥ Length of the critical path,    Parallel time ≥ Sequential time / p,

      Speedup ≤ Sequential time / Length of the critical path,    Speedup ≤ Degree of parallelism.

Figure 2: A DAG with node weights equal to 1, containing the tasks x1, ..., x5, y, z and an edge communication cost c.

Example: for Figure 2, assume that p = 2 and that all task weights are τ = 1, so Seq = 7. We do not know the communication weights. One maximum independent set consists of three tasks (one of the xi together with y and z), so the degree of parallelism is 3. One critical path is (x1, x2, x3, x4, x5) with length 5 (note that the edge communication weights are not included). Thus

    PT ≥ max(Length(CP), Seq/p) = max(5, 7/2) = 5,    Speedup ≤ Seq/5 = 7/5 = 1.4.

2.2 Message-Passing Parallel Programming

Two programming styles are:

- SPMD (Single Program Multiple Data):
  - Data and program are distributed among processors; code is executed based on a predetermined schedule.
  - Each processor executes the same program but operates on different data based on its processor identification.
- Master/slaves: one control process is called the master (or host). There are a number of slaves working for this master. These slaves can be coded using an SPMD style.
- MPMD (Multiple Programs Multiple Data).

Note that master/slave or MPMD code can be implemented using an SPMD coding style.

Data communication and synchronization are implemented via message exchange. Two types:

- Synchronous message passing: a processor sends a message and waits until the message is received by the destination.
- Asynchronous message passing: sending is non-blocking and the processor that executes the sending procedure does not have to wait for an acknowledgment from the destination.

Programming languages: C/Fortran + library functions.

Library functions:

- mynode(): return the processor ID; p processors are numbered 0, 1, 2, ..., p-1.
- numnodes(): return the number of processors allocated.
- send(data, dest): send data to a destination processor.
- recv(data) or recv(data, processor): executing recv() gets a message from a particular processor (or from any processor) and stores it in the user buffer specified by data.
- broadcast(data): broadcast a message to all processors.

Examples:

- SPMD code: Print "hello".
  Executed on 4 processors, the screen shows:

      hello
      hello
      hello
      hello

- SPMD code: x = mynode(); if (x != 0) then print "hello from", x.
  Screen (the order may vary):

      hello from 1
      hello from 2
      hello from 3

- A sequential program is:

      x = 3;
      For i = 0 to p-1
         y(i) = i*x;
      Endfor

The SPMD code is:

      int x, y, i;
      x = 3;
      i = mynode();
      y = i*x;

or

      int x, y, i;
      i = mynode();
      if (i == 0) then { x = 3; broadcast(x); }
      else recv(x);
      y = i*x;

Libraries available for parallel programming using C/C++ or Fortran:

- PVM: master/slaves programming style. The first widely adopted library for using a workstation cluster as a parallel machine.
- MPI: SPMD programming style. The most widely used message-passing standard for parallel programming, available on all commercial parallel machines and workstation clusters.

PVM:

- A master program is regular C or Fortran code. It calls pvm_spawn(slave binary code, arguments) to create a slave process.
- The system spawns and executes slave Unix processes on different nodes of a parallel machine or workstation cluster.
- The master and slaves communicate using message-passing library functions (send, recv, broadcast, etc.).

2.3 Complexity analysis for parallel algorithms

How to analyze parallel time: if a parallel program consists of separate phases of computation and communication, then

    PTp = Computation time + Communication time.

The communication cost for sending one message of size n between two nodes is

    α + βn,

where

- α is the startup time for handling a message at the sending processor (e.g. the packaging cost), plus the latency for a message to reach the destination;
- β is the data transmission time per unit between two processors (1/bandwidth);
- n is the size of this message.

Example: add n numbers on two computers, where the first computer initially owns all the numbers.

1. Computer 1 sends n/2 numbers to computer 2.
2. Both computers add n/2 numbers simultaneously.
3. Computer 2 sends its partial result back to computer 1.
4. Computer 1 adds the two partial sums to produce the final result.

Assume that the computation cost per operation is ω. The computation cost for steps 2 and 4 is (n/2 + 1)ω. The communication cost for steps 1 and 3 is (α + βn/2) + (α + β). The total parallel time is

    (n/2 + 1)ω + 2α + β(n/2 + 1).
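As a concrete illustration of this cost model, the small C helper below (my own sketch; the parameter values are arbitrary) evaluates the two-computer parallel time for given α, β, ω and n.

    #include <stdio.h>

    /* Parallel time for adding n numbers on two computers:
       (n/2 + 1) additions, plus sending n/2 numbers and one result back. */
    double two_node_add_time(double alpha, double beta, double omega, double n) {
        double compute = (n / 2.0 + 1.0) * omega;
        double comm    = (alpha + beta * n / 2.0) + (alpha + beta);
        return compute + comm;
    }

    int main(void) {
        /* Hypothetical machine parameters, in microseconds. */
        double alpha = 100.0, beta = 0.01, omega = 0.001;
        printf("n = 1e6: %.1f microseconds\n",
               two_node_add_time(alpha, beta, omega, 1.0e6));
        return 0;
    }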

Approximation in analysis: the O() notation for measuring space/time complexity.

- O is read "big oh"; it gives the order of magnitude.
- Example: f(x) = 2x³ + 3x + 1 is O(x³).
- Example: f(x) = 2x² + 3x log x + 1 is O(x²).
- Formal definition: f(x) = O(g(x)) iff there exist positive constants c and x0 such that 0 ≤ f(x) ≤ c·g(x) for all x ≥ x0.

Approximation in cost analysis for parallel computing:

- You can drop insignificant terms during the analysis.
- Example: a parallel time such as 2 + 3n + 2n log n + 1/n can be approximated by its dominant terms, 2n log n + 3n.

Empirical program evaluation: measuring execution time through instrumentation.

- Add extra code to the program (e.g. clock(), time(), gettimeofday()).
- Measure the elapsed time between two points:

      L1:  time(&t1);          /* start timer */
           ... parallel code ...
      L2:  time(&t2);          /* stop timer */
           elapsed_time = difftime(t2, t1);

To measure communication time, use a ping-pong test for point-to-point communication.

At process 0:

      L1:  time(&t1);
           send(x, P1);
           recv(x, P1);
      L2:  time(&t2);
           elapsed_time = 0.5 * difftime(t2, t1);

At process 1:

      recv(x, P0);
      send(x, P0);
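For a runnable version of the ping-pong test, here is a minimal sketch using MPI (which the notes adopt for SPMD programming); the message size and the use of MPI_Wtime are my own choices.

    #include <mpi.h>
    #include <stdio.h>

    #define N (1 << 20)                  /* message size: 1M doubles (arbitrary) */
    static double buf[N];

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t1 = MPI_Wtime();
        if (rank == 0) {                 /* process 0: send, then wait for the echo */
            MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {          /* process 1: echo the message back */
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        double t2 = MPI_Wtime();

        if (rank == 0)                   /* one-way time is half of the round trip */
            printf("one-way time: %g seconds\n", (t2 - t1) / 2.0);
        MPI_Finalize();
        return 0;
    }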

3 Issues in Processor Network and Communication

In this section we discuss several issues. (1) Many-to-many communication: how communication primitives such as broadcasting can be implemented on different architectures. (2) Network embedding: different parallel machines may use different networks to connect their processors; we address how a program running on one network can be simulated on another network.

3.1 Basic Communication Operations

In most scientific computing programs, program code and data are divided among processors, and data exchanges between processors are often needed. Such exchanges require efficient communication schemes since communication delay affects the overall parallel performance. There are many communication patterns in a parallel program; we list some of them below.

- One-to-one sending. This is the simplest form of communication.
- One-to-all broadcasting. In this operation, one processor broadcasts the same message to all other processors.
- All-to-all broadcasting. Every processor broadcasts a message to all other processors. This operation is depicted in Figure 3(a).

Figure 3: (a) An example of all-to-all broadcasting; (b) an example of accumulation (gather).

- Accumulation (also called gather). In this case, one processor receives a message from each of the other processors and puts these messages together, as shown in Figure 3(b).
- Reduction. In this operation, each processor Pi holds one value Vi. Assume there are p processors. The global reduction computes the aggregate value

      V1 ⊕ V2 ⊕ ... ⊕ Vp

  and makes it available to one processor (or to all processors). The operation ⊕ could stand for any reduction function, for example the global sum.
- One-to-all personalized communication (also called single-node scatter). In this operation, one processor sends a personalized message to each of the other processors; different processors receive different messages.
- All-to-all personalized communication. In this operation, each processor sends a personalized message to each of the other processors.

3.1.1 One-to-All Broadcast

We first discuss the implementation of broadcast on a ring using the store-and-forward routing model. Figure 4 depicts a broadcast operation carried out on a ring.

Figure 4: Broadcasting on a ring with the store-and-forward model.

The message is sent from processor 0. Let p be the number of processors, α the startup time, β the transmission time per data unit, and m the size of the message. Then the total cost of this broadcast is

    (p/2)(α + βm).

If the architecture is a linear array instead of a ring, then the worst-case cost is

    (p - 1)(α + βm).

Figure 5 depicts a broadcast operation carried out on a mesh. Assume that the message is sent from the top-left processor. The communication is divided into two stages:

- At stage 1, the message is broadcast within the first row. This costs about √p (α + βm).
- At stage 2, the message is broadcast within all columns independently. The cost is again about √p (α + βm).

The total cost is about 2√p (α + βm).

Figure 5: Broadcasting on a mesh with the store-and-forward model (stage 1 along the first row, stage 2 down all columns).

If the wormhole model is used, the communication cost can be further reduced for this broadcast. Figure 6 shows a bisection method to broadcast on a linear array using the wormhole routing scheme. The bisection method is described as follows:

1. First, the message is sent directly to the center of the linear array. This costs α + βm.
2. Then the broadcast is conducted independently in the left part and the right part.
3. Steps 1 and 2 are repeated recursively.

The bisection method takes about log p steps. The cost is (α + βm) log p. Notice that there is no channel contention in wormhole routing, since the broadcast is conducted independently in the subparts.

Figure 6: Broadcasting on a linear array with the wormhole model.

3.1.2 All-to-All Broadcast

We briefly discuss how an all-to-all broadcast algorithm can be designed on a ring with p processors.

- Each processor x forwards a message to its neighbor (x + 1) mod p.
- After p - 1 steps, all processors have received messages from all other processors. Thus the cost is (p - 1)(α + βm).

Figure 7 depicts the method with p = 4. At step 1, each message reaches its neighbor, and at step 2 it reaches the next neighbor. This process is repeated for 3 steps.

Figure 7: All-to-all broadcasting on a ring with the store-and-forward model.
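A runnable sketch of this ring all-to-all broadcast, written with MPI for concreteness (the buffer layout and the use of MPI_Sendrecv are my own choices; the notes' send/recv primitives would be used the same way):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int me, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double all[64];                   /* one value per processor (assumes p <= 64) */
        double passing = (double)me;      /* value currently travelling around the ring */
        all[me] = passing;

        int right = (me + 1) % p, left = (me - 1 + p) % p;
        for (int step = 1; step < p; step++) {
            double incoming;
            /* forward the value we hold to the right, receive a new one from the left */
            MPI_Sendrecv(&passing, 1, MPI_DOUBLE, right, 0,
                         &incoming, 1, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            passing = incoming;
            all[(me - step + p) % p] = incoming;   /* it originated at rank me-step */
        }
        printf("rank %d now holds all %d values\n", me, p);
        MPI_Finalize();
        return 0;
    }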

3.1.3 One-to-all personalized broadcast

We briefly discuss broadcasting a personalized message to each processor. Assume that the architecture is a linear array and that the broadcast starts from the center of this linear array, as shown in Figure 8.

Figure 8: Personalized broadcasting on a linear array with the store-and-forward model (message loads on the links, from left to right: p/2-1, p/2, p/2, p/2-1).

The center carries p/2 messages for the left part and p/2 messages for the right part. After the center sends these p/2 messages to its left neighbor, the number of messages that this left neighbor has to forward decreases by 1. Thus the cost of each step is:

    Step 1:    α + (p/2) βm
    Step 2:    α + (p/2 - 1) βm
    ...
    Step p/2:  α + βm

    Total cost = (p/2) α + (p/2 + (p/2 - 1) + ... + 1) βm = (p/2) α + (p/2)(p/2 + 1)/2 · βm.

3.2 Network Embedding

Given two processor networks G = (V, E) and G' = (V', E'), we want a mapping of the nodes in G onto the nodes in G'. We define it as

    v → m(v),

where v is a node in the processor node set V of G.

Figure 9 gives an example of such an embedding: four processors in a linear processor array G are mapped onto a ring G', each array node being assigned to a distinct ring node by the mapping function m.

Figure 9: An example of processor embedding.

An embedding is called perfect if for every edge (n1, n2) in G, (m(n1), m(n2)) is also an edge in G'. Figure 9 is a perfect embedding. It is not always possible to have a perfect mapping. If (m(n1), m(n2)) is not an edge in G', then this communication edge must be simulated using a path in G'. The longer this path is, the more overhead this communication will have. The goal of embedding is to minimize the length of such paths. Thus we measure the overall quality of an embedding using the following terms:

- The distance between two processors is the length of the shortest path connecting these two processors.
- For an edge (n1, n2) in G, let the map distance be the distance between m(n1) and m(n2) in G'. Define the dilation to be the maximum of the map distances over all edges of G.

Thus the goal of optimization is to minimize the dilation. For a perfect embedding such as Figure 9, the dilation of the embedding is always 1. Figure 10 shows another mapping with dilation 2, because some edges of G are mapped to pairs of nodes at distance 2 in G'.

Given an embedding from one network G to another network G', we can use its mapping function to simulate programs written for G on G'.

Figure 10: Another example of processor embedding (this mapping has dilation 2).

For example, given an SPMD program called prog(x), where x is the processor identification number, designed for a linear array of 8 processors, suppose we need to execute it on a 2 × 4 mesh. Figure 11 shows an embedding of this linear array onto the mesh: array nodes 0-3 map to the first mesh row (positions 0-3) and array nodes 4-7 map to the second row in reverse order (positions 7, 6, 5, 4).

Figure 11: Executing a linear array SPMD program on a mesh.

    me = mynode();
    if (me==0) call prog(0);
    else if (me==1) call prog(1);
    else if (me==2) call prog(2);
    else if (me==3) call prog(3);
    else if (me==4) call prog(7);
    else if (me==5) call prog(6);
    else if (me==6) call prog(5);
    else if (me==7) call prog(4);

Figure 12: The simulation of a linear array code on a mesh.

We can write the SPMD code shown in Figure 12 to run the linear array program on the mesh. Notice that the processor identification numbers in the linear array differ from those in the mesh, so a re-mapping in the SPMD code is needed. The code in Figure 12 enumerates all cases; if the number of processors is unknown, this approach is not feasible. If the number of processors is p, the mapping function m(i) can be generalized as: if i < p/2 then m(i) = i, else m(i) = 3p/2 - 1 - i. Then we can write a generalized SPMD code as shown in Figure 13.

4 Model-based Programming Methods

We can follow a specific parallelism model to program an application: (1) embarrassingly parallel computations, (2) divide-and-conquer, (3) pipelined computations.

4.1 Embarrassingly Parallel Computations

The computation can be divided into parts that can be done independently on multiple processors. Examples: geometrical transformations of images, the Mandelbrot set (image processing), and Monte Carlo methods for numerical computation.

Figure 13: A generalized version for simulating a linear array code on a mesh of p processors:

    me = mynode();
    p = numnodes();
    if (me < p/2) call prog(me);
    else call prog(3*p/2 - 1 - me);

4.1.1 Geometrical Transformations of Images

Given a 2D image (pixmap), each pixel is located at position (x, y).

- Shifting:
      x' = x + Δx,   y' = y + Δy.
- Scaling:
      x' = x · Sx,   y' = y · Sy   (scale factors Sx, Sy).
- Rotation:
      x' = x cos θ + y sin θ,   y' = -x sin θ + y cos θ.
- Clipping: (x, y) will be displayed only if
      xl ≤ x ≤ xh,   yl ≤ y ≤ yh.

Since the operations performed on pixels are independent, the parallel code can be designed as follows:

- A master holds an n × n image.
- p slaves are used to process the image.
- Each slave processes a portion of the rows.

Master code:

- Tell each slave which rows to process.
- Wait to receive results from each slave. Each result is a mapping (old pixel position → new pixel position).
- Update the bitmap.

Slave code:

- Receive from the master the row numbers that should be processed.
- For each pixel in the assigned rows, apply the transformation.
- Send results back to the master, pixel by pixel.
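As an illustration of the slave's work, here is a sequential C sketch of the shifting transformation applied to a block of assigned rows (the array layout, function name, and clipping convention are my own; the master/slave messaging is omitted):

    /* Apply a shift (dx, dy) to every pixel in rows [row_lo, row_hi) of an n x n image.
       out_x/out_y record the new position of pixel (x, y); positions that fall outside
       the image are marked with -1 (a simple form of clipping). */
    void shift_rows(int n, int row_lo, int row_hi, int dx, int dy,
                    int out_x[], int out_y[]) {
        for (int y = row_lo; y < row_hi; y++) {
            for (int x = 0; x < n; x++) {
                int nx = x + dx, ny = y + dy;
                int idx = y * n + x;
                if (nx >= 0 && nx < n && ny >= 0 && ny < n) {
                    out_x[idx] = nx;
                    out_y[idx] = ny;
                } else {
                    out_x[idx] = -1;
                    out_y[idx] = -1;
                }
            }
        }
    }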

Parallel time analysis:

- Let ω be the cost of each transformation.
- The sequential computation time is 2n²ω (each pixel involves both its x and y coordinates).
- The parallel computation cost per slave is 2n²ω/p, assuming a uniform row distribution.
- Communication cost between master and slaves:
  - Distributing the workload involves only a small overhead, which can be ignored.
  - Receiving the results sequentially costs about n²(α + 4β), since each pixel's result carries its old and new coordinates.

Note that the communication cost in the textbook is higher because the numbers are sent one by one. Also, it is not necessary for the master to assign rows, since the mapping is statically determined.

4.1.2 Mandelbrot Set

Given a 2D image (pixmap), each pixel is located at position (x, y), represented as a complex number c = x + yi, where i = √-1. The pixel value is iterated as:

1. z = 0;
2. while |z| < 2 do z = z² + c.

The number of iterations performed for each pixel (up to some limit) can be displayed as the color of that pixel. Dark areas on the screen represent computation-intensive parts. Notice that

- if z = x + yi, then |z| = √(x² + y²);
- z² = x² + 2xyi + (yi)², which decomposes into real and imaginary parts as

      x' = x² - y²,   y' = 2xy.

Since the operations performed on pixels are independent, parallel code with static assignment can be designed as:

- A master holds an n × n image.
- p slaves are used to process the image.
- Each slave processes a fixed portion of the rows (n/p rows).

Master code:

- Wait to receive results from each slave.
- Display the image using the computed colors.

Slave code:

- For each pixel in the assigned rows, apply the iteration.
- Send results back to the master.

The computing costs of different rows are non-uniform, so the processor load may not be balanced. Parallel code with dynamic assignment can balance the load better. The basic idea is that the master assigns rows based on the slaves' computation speed.

Master code (dynamic assignment):

- Assign one row per slave.
- Receive results from a slave (any slave) and assign another row to that slave. Repeat until all rows are processed.
- Tell all slaves to stop. Display the results.

Slave code:

1. Receive a row from the master.
2. For each pixel in the assigned row, apply the iteration.
3. Send results back to the master.
4. Go to 1 until the master sends a termination signal.
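The per-pixel iteration described above can be written directly in C; the helper below (the name and the iteration cap are mine) returns the iteration count that a slave would report as the pixel's color.

    /* Number of iterations before |z| exceeds 2, capped at max_iter,
       for the pixel whose complex value is c = cx + cy*i. */
    int mandelbrot_iters(double cx, double cy, int max_iter) {
        double x = 0.0, y = 0.0;                        /* z = 0 */
        int k = 0;
        while (x * x + y * y < 4.0 && k < max_iter) {   /* |z| < 2  <=>  |z|^2 < 4 */
            double xt = x * x - y * y + cx;             /* real part of z^2 + c */
            y = 2.0 * x * y + cy;                       /* imaginary part of z^2 + c */
            x = xt;
            k++;
        }
        return k;                                       /* used as the pixel's color */
    }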

4.1.3 Monte Carlo Methods

Such a method uses random selections in its calculations to solve numerical computing problems. Example: how can we compute an integral such as

    ∫₀¹ √(1 - x²) dx = π/4 ?

A probabilistic method for such a problem:

    ∫ f(x) dx over [x1, x2]  ≈  (x2 - x1) · (1/N) Σ_{i=1}^{N} f(xᵢ)   as N → ∞,

where xᵢ is a random number between x1 and x2. The sequential code is:

    sum = 0;
    for (i = 0; i < N; i++) {
       z = random_number(x1, x2);
       sum = sum + f(z);
    }

Parallel code:

- The master sequentially generates the random numbers.
- Each slave requests some numbers, uses these numbers for computation, and then requests more.

Master code (dynamic load assignment):

1. Generate a batch of N/n random numbers.
2. Wait for a slave to make a request.
3. Send these numbers to that slave.
4. Repeat from step 1 until all N numbers have been handed out.
5. Tell all slaves to stop.
6. Perform a reduction with the slaves to sum all partial results.

Slave code:

1. Send a request to the master.
2. Receive N/n numbers and work on them.
3. Repeat from step 1 until receiving a stop signal from the master.
4. Perform a reduction with the other slaves to sum all partial results.

The way we design the random number generation affects the parallel performance:

- Sequential generation of random numbers is the performance bottleneck.
- If possible, each slave generates its own random numbers, so the generation can also be done in parallel.
- The master may then not be needed.

Sequential generation of random numbers x0, x1, x2, ..., xi, xi+1, ...:

    xi+1 = (a·xi + c) mod m.

For example, a = 16807, m = 2³¹ - 1, and c = 0 (a common choice).

Parallel generation with p processors:

- Processor 0 uses x0, xp, x2p, ...
- Processor j uses xj, xj+p, xj+2p, ...

The formula is

    xi+p = (A·xi + C) mod m,

where A = aᵖ mod m and C = c(aᵖ⁻¹ + aᵖ⁻² + ... + 1) mod m.
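A small C sketch of this leapfrog scheme (the generator constants follow the example above with c = 0; the function names and the way the streams are seeded are my own):

    #include <stdio.h>
    #include <stdint.h>

    #define M  2147483647ULL             /* m = 2^31 - 1 */
    #define A0 16807ULL                  /* a */

    /* a^e mod m by repeated squaring */
    static uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m) {
        uint64_t r = 1;
        while (e) {
            if (e & 1) r = r * a % m;
            a = a * a % m;
            e >>= 1;
        }
        return r;
    }

    int main(void) {
        int p = 4;                        /* number of leapfrog streams (processors) */
        uint64_t A = pow_mod(A0, p, M);   /* leapfrog multiplier a^p mod m; C = 0 since c = 0 */
        uint64_t seed = 1;

        /* Stream j starts at x_j and then jumps by p using the multiplier A. */
        for (int j = 0; j < p; j++) {
            uint64_t x = seed;
            for (int k = 0; k < j; k++) x = x * A0 % M;   /* advance seed to x_j */
            printf("stream %d: %llu %llu\n", j,
                   (unsigned long long)x,
                   (unsigned long long)(x * A % M));      /* next value x_{j+p} */
        }
        return 0;
    }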

4.2 Divide-and-Conquer

- Partition a problem into subproblems.
- Solve the small problems first (possibly in parallel).
- Then combine them to solve the original problem.
- This divide-and-conquer process can be recursive.

Examples: adding n numbers, numerical integration, the N-body simulation problem.

4.2.1 Add n numbers

Master/slave version for adding n numbers with simple partitioning:

- The master distributes n/m numbers to each slave.
- Each of the m slaves adds its n/m numbers.
- The master collects the m partial results from the slaves and adds them together.

SPMD version:

- Assume that each of the m slaves holds n/m numbers.
- Each slave adds its n/m numbers.
- Perform a global reduction to add the partial results together.

Summation with a tree structure (recursive divide-and-conquer sequential code):

    int add(int s) {
       if (the set s contains only a few numbers) return simple_add(s);
       Divide(s, s1, s2);
       part_sum1 = add(s1);
       part_sum2 = add(s2);
       return (part_sum1 + part_sum2);
    }

Parallel summation with a tree structure: the eight numbers a1, ..., a8 are first added pairwise by processors P0, ..., P3, and the partial sums are then combined level by level (additions 1-4 at the leaves, 5-6 at the next level, 7 at the root) until the final sum is produced; the accompanying schedule assigns each addition to a processor and time step.

SPMD code:

    me = mynode(); p = numnodes(); d = log2(p);
    sum = sum of the local numbers at this processor;
    /* Leaf nodes */
    if (me mod 2 == 1) send sum to node me-1;
    /* Internal nodes */
    for i = 1 to d do {
       q = 2^i;
       if (me mod q == 0) {
          x = receive partial sum from node me + q/2;
          sum = sum + x;
          if (me mod (2*q) != 0)
             send sum to node me - q;
       }
    }

How can the computation of the mod and power operations in this code be sped up? (See the sketch below.)

Parallel time analysis: assume that each addition costs ω and that initially each node holds n/p data items. Then the communication cost is (log p)(α + β) and the computation cost is (n/p + log p)ω.
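One possible answer to the question above (my own sketch): when p is a power of two, the power and mod operations reduce to shift and mask operations.

    #include <stdio.h>

    int main(void) {
        int me = 6, d = 3;                     /* example node id and tree depth */
        for (int i = 1; i <= d; i++) {
            int q  = 1 << i;                   /* 2^i without calling pow()       */
            int r  = me & (q - 1);             /* me mod 2^i as a bit mask        */
            int r2 = me & (2 * q - 1);         /* me mod 2^(i+1) as a bit mask    */
            printf("i=%d  q=%d  me mod q=%d  me mod 2q=%d\n", i, q, r, r2);
        }
        return 0;
    }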

4.3 Pipelined Computations

- A problem is divided into a series of tasks (a linear chain).
- More than one instance of the problem needs to be processed.

Example: sending v messages through a chain of processors P0, P1, P2; at each time step every message advances one stage, so the transmissions overlap in a pipeline.

4.3.1 Sorting n numbers

Given n numbers, sort them in non-increasing order; for example, given five unsorted numbers, the pipeline outputs them from largest to smallest.

Parallel algorithm using pipelined processing:

- Assume one master and n slaves P0, ..., Pn-1.
- The master sends all numbers to P0.
- Each slave node receives numbers from its left neighbor, keeps the largest number x seen so far, and passes the numbers smaller than x to the slaves on its right.

SPMD code for the slaves:

    me = mynode();
    no_stage = n - me - 1;
    recv(x, me-1);
    for j = 1 to no_stage {
       recv(newnumber, me-1);
       if (newnumber > x) {
          send(x, me+1);
          x = newnumber;
       } else send(newnumber, me+1);
    }

5 Transformation-based Parallel Programming

We discuss the basic program parallelization techniques: dependence analysis, program/data partitioning, and mapping.

5.1 Dependence Analysis

Before executing a program on multiple processors, we need to identify the inherent parallelism in this program. In this chapter we discuss several graphical representations of program parallelism.

5.1.1 Basic dependence

We first need the concept of dependence. We call a basic computation unit a task. A task is a program fragment, which could be a basic assignment, a procedure, or a logical test. A program consists of a sequence of tasks. When tasks are executed in parallel on different processors, the relative execution order of those tasks differs from the one in the sequential execution. It is mandatory to study the orders between tasks that must be preserved so that the semantics of the program do not change during parallel execution. For example:

    S1:  x = a + b
    S2:  y = x + c

Here each statement is considered a task. Assume the statements are executed on separate processors. S2 needs to use the value of x defined by S1. If the two processors share one memory, S1 has to be executed first on one processor and the result x updated in the global memory; then S2 can fetch x from memory and start its execution. In a distributed environment, after the execution of S1, the data x needs to be sent to the processor where S2 is executed. This example demonstrates the importance of dependence relations between tasks. We formally define the basic types of data dependence between tasks.

Definition: let IN(T) be the set of data items used by task T and OUT(T) be the set of data items modified by T.

- If OUT(T1) ∩ IN(T2) ≠ ∅, then T2 is data flow-dependent (also called true-dependent) on T1. Example:
      S1: A = x + B
      S2: C = A + 1
- If OUT(T1) ∩ OUT(T2) ≠ ∅, then T2 is output-dependent on T1. Example:
      S1: A = x + B
      S2: A = 2
- If IN(T1) ∩ OUT(T2) ≠ ∅, then T2 is anti-dependent on T1. Example:
      S1: B = A + 1
      S2: A = 2

Coarse-grain dependence graph: tasks operate on data items of large sizes and perform large chunks of computation. An example of such a dependence graph is shown in Figure 14.

Figure 14: An example of a dependence graph for S1: A = f(X,B); S2: C = g(A); S3: A = h(A,C). The functions f, g, h do not modify their input arguments. Edges: S1 → S2 (flow on A), S1 → S3 (flow and output on A), S2 → S3 (flow on C, anti on A).

5.1.2 Loop Parallelism

Loop parallelism can be modeled by the iteration space of a loop program, which contains all iterations of the loop together with the data dependences between iteration statements. Examples are shown in Figure 15:

    1D loop (independent iterations):
        For i = 1 to n
           Si: a(i) = b(i) + c(i)

    1D loop (dependence chain):
        For i = 1 to n
           Si: a(i) = a(i-1) + c(i)

    2D loop:
        For i = 1 to 3
           For j = 1 to 3
              S(i,j): X(i,j) = X(i,j-1) + 1

Figure 15: Examples of loop dependence graphs (loop iteration spaces); in the 2D example the iterations of each row i form a dependence chain along j.

5.2 Program Partitioning

Purpose:

- Increase the task grain size.
- Reduce unnecessary communication.
- Simplify the mapping of a large number of tasks to a small number of processors.

Two techniques are considered: loop blocking/unrolling and interior loop blocking. The loop interchange technique, which assists partitioning, will also be discussed.

Figure 16: An example of 1D loop blocking/unrolling. The loop "For i = 1 to 2n: Si: a(i) = b(i) + c(i)" is unrolled by a factor of 2 into "For i = 1 to n: S(2i-1): a(2i-1) = b(2i-1) + c(2i-1); S(2i): a(2i) = b(2i) + c(2i)", pairing S(2i-1) and S(2i) into one task.

5.2.1 Loop blocking/unrolling

An example is shown in Fig. 16. In general, given the sequential code

    For i = 1 to r*p
       Si: a(i) = b(i) + c(i)

blocking this loop by a factor of r gives

    For j = 0 to p-1
       For i = r*j+1 to r*j+r
          a(i) = b(i) + c(i)

Assume there are p processors. An SPMD code for the above partitioning can be:

    me = mynode();
    For i = r*me+1 to r*me+r
       a(i) = b(i) + c(i)
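A plain C rendering of the blocked loop (the outer loop over me stands in for the p SPMD processes; the array contents and sizes are my own illustration):

    #include <stdio.h>

    #define P 4                  /* number of processors (illustrative) */
    #define R 5                  /* block size r */

    int main(void) {
        double a[R * P], b[R * P], c[R * P];
        for (int i = 0; i < R * P; i++) { b[i] = i; c[i] = 1.0; }

        /* Each "processor" me executes only its own block of r iterations
           (0-based version of i = r*me+1 .. r*me+r). */
        for (int me = 0; me < P; me++)
            for (int i = R * me; i < R * me + R; i++)
                a[i] = b[i] + c[i];

        printf("a[last] = %g\n", a[R * P - 1]);
        return 0;
    }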

5.2.2 Interior loop blocking

This technique blocks an interior loop and makes it one task.

Figure 17: An example of interior loop blocking. For the loop "For i = 1 to 3; For j = 1 to 3; S(i,j): X(i,j) = X(i,j-1) + 1", the dependence runs along j, so blocking the interior j loop turns each row i into one task Ti; the resulting tasks T1, T2, T3 are independent, which preserves the parallelism.

Grouping statements together may reduce the available parallelism. We should try to preserve as much parallelism as possible when partitioning a loop program. One such example is in Fig. 17. Fig. 18 shows an example of interior loop blocking that does not preserve parallelism. Loop interchange can be used before partitioning to assist in the exploitation of parallelism.

5.2.3 Loop interchange

Loop interchange is a program transformation that changes the execution order of a loop program, as shown in Fig. 19. Loop interchange is not legal if the new execution order violates a data dependence. An example of an illegal interchange is in Fig. 20.

An example of interchanging triangular loops is shown below.

Figure 18: An example of partitioning that reduces parallelism, and the role of loop interchange. For the loop "For i = 1 to 3; For j = 1 to 3; S(i,j): X(i,j) = X(i-1,j) + 1" (dependence along i), blocking the interior j loop creates a dependent chain of tasks T1 → T2 → T3 with no parallelism; interchanging the loops first and then blocking the interior loop yields three independent tasks, which preserves the parallelism.

Figure 19: Loop interchange reorders execution: for the doubly nested loop over i and j, the original order sweeps the iteration space row by row (j varying fastest), while the interchanged loop sweeps it column by column (i varying fastest).

Figure 20: An illegal loop interchange. For the loop "For i = 1 to 3; For j = 1 to 3; S(i,j): X(i,j) = X(i-1,j+1) + 1", the dependence goes from iteration (i-1, j+1) to (i, j); after interchanging the loops, some iterations would execute before the iterations they depend on, so the interchange violates the dependence and is not legal.

The original triangular loop

    For i = 1 to 10
       For j = i+1 to 10
          X(i,j) = X(i,j) + 1

becomes, after interchange,

    For j = 2 to 10
       For i = 1 to min(j-1, 10)
          X(i,j) = X(i,j) + 1

How can you derive the new bounds for the i and j loops? (A small consistency check follows the derivation below.)

- Step 1: list all inequalities involving i and j from the original code:

      i ≤ 10,   i ≥ 1,   j ≤ 10,   j ≥ i+1.

- Step 2: derive the bounds for loop j.
  - Extract all inequalities for the upper bound of j: j ≤ 10, so the upper bound is 10.
  - Extract all inequalities for the lower bound of j: j ≥ i+1, so the lower bound is 2, since i can be as low as 1.
- Step 3: derive the bounds for loop i for a fixed value of j (loop i is now the inner loop).
  - Extract all inequalities for the upper bound of i: i ≤ 10 and i ≤ j-1, so the upper bound is min(10, j-1).
  - Extract all inequalities for the lower bound of i: i ≥ 1, so the lower bound is 1.
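A quick C check (my own) that the original and interchanged triangular loops visit exactly the same set of (i, j) iterations:

    #include <stdio.h>

    int main(void) {
        int count1 = 0, count2 = 0, sum1 = 0, sum2 = 0;

        /* Original order: i outer, j = i+1 .. 10 */
        for (int i = 1; i <= 10; i++)
            for (int j = i + 1; j <= 10; j++) { count1++; sum1 += 100 * i + j; }

        /* Interchanged order: j outer from 2, i = 1 .. min(j-1, 10) */
        for (int j = 2; j <= 10; j++)
            for (int i = 1; i <= (j - 1 < 10 ? j - 1 : 10); i++) { count2++; sum2 += 100 * i + j; }

        printf("iterations: %d vs %d, checksums: %d vs %d\n", count1, count2, sum1, sum2);
        return 0;   /* both orders agree, so the derived bounds are consistent */
    }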

5.3 Data Partitioning

For distributed-memory architectures, data partitioning is needed when there is not enough space for replication. A data structure is divided into data units which are assigned to the local memories of the processors. A data unit can be a scalar variable, a vector, or a submatrix block.

5.3.1 Data partitioning methods

1D array on 1D processors.

Assume that the data items are numbered 0, 1, ..., n-1 and the processors are numbered from 0 to p-1. Let r = ⌈n/p⌉. Three common methods for a 1D array are depicted in Fig. 21.

Figure 21: Data partitioning schemes: (a) block, (b) cyclic, (c) block-cyclic.

- 1D block: data item i is mapped to processor ⌊i/r⌋.
- 1D cyclic: data item i is mapped to processor i mod p.
- 1D block-cyclic: first the array is divided into a set of units using block partitioning with block size b; then these units are mapped in a cyclic manner to the p processors. Data item i is mapped to processor ⌊i/b⌋ mod p.

2D array on 1D processors.

Data elements are indexed (i, j) with 0 ≤ i, j ≤ n-1. Processors are numbered from 0 to p-1. Let r = ⌈n/p⌉.

- Row-wise block: data item (i, j) is mapped to processor ⌊i/r⌋.
- Column-wise block: data item (i, j) is mapped to processor ⌊j/r⌋.

Figure 22: The left is the column-wise block partitioning and the right is the row-wise block partitioning.

- Row-wise cyclic: data item (i, j) is mapped to processor i mod p.
- Other partitionings: column-wise cyclic, column-wise block-cyclic, row-wise block-cyclic.

2D array on 2D processors.

Data elements are indexed (i, j) with 0 ≤ i, j ≤ n-1. Processors are numbered (s, t) with 0 ≤ s, t ≤ q-1, where q = √p. Let r = ⌈n/q⌉.

- (Block, Block): data item (i, j) is mapped to processor (⌊i/r⌋, ⌊j/r⌋).
- (Cyclic, Cyclic): data item (i, j) is mapped to processor (i mod q, j mod q).
- Other partitionings: (Block, Cyclic), (Cyclic, Block), (Block-cyclic, Block-cyclic).

Figure 23: The left is the 2D block mapping (block, block) and the right is the 2D cyclic mapping (cyclic, cyclic).

5.3.2 Consistency between program and data partitioning

Given a computation partitioning and processor mapping, there are several choices available for data partitioning and mapping. How can we make a good choice of data partitioning and mapping? For distributed-memory architectures, large-grain data partitioning is preferred because there is a high communication startup overhead in transferring a small data unit. If a task needs to access a large number of distinct data units and the data units are evenly distributed among processors, then there will be substantial communication overhead in fetching the many non-local data items needed to execute this task. Thus the following rule of thumb can be used to guide the design of program and data partitioning:

Consistency: the program partitioning and data partitioning are consistent if sufficient parallelism is provided by the partitioning and, at the same time, the number of distinct units accessed by each task is minimized.

The following rule is a simple heuristic used in determining data and program mapping:

"Owner computes" rule: if a computation statement x modifies a data item i, then the processor that owns data item i executes statement x.

For example, given the sequential code

    For i = 0 to r*p - 1
       Si: a(i) = 0

blocking this loop by a factor of r gives

    For j = 0 to p-1
       For i = r*j to r*j+r-1
          a(i) = 0

Assume there are p processors. The SPMD code is:

    me = mynode();
    For i = r*me to r*me+r-1
       a(i) = 0

The data array a(i) is distributed to the processors such that if processor x executes a(i) = 0, then a(i) is assigned to processor x. For example, we let processor 0 own data a(0), a(1), ..., a(r-1). Otherwise, if processor 0 does not own a(0), this processor needs to allocate some temporary space to perform a(0) = 0 and send the result back to the processor that owns a(0), which leads to a certain amount of communication overhead.

The above SPMD code is for block mapping. For cyclic mapping, the code is:

    me = mynode();
    For i = me to r*p - 1 step p
       a(i) = 0

A more general SPMD code structure, for an arbitrary processor mapping function proc_map(i) but with more code overhead, is:

    me = mynode();
    For i = 0 to r*p - 1
       if (proc_map(i) == me) a(i) = 0

For block mapping, proc_map(i) = ⌊i/r⌋. For cyclic mapping, proc_map(i) = i mod p.

5.3.3 Data indexing between global space and local space

There is one more problem with the previous program: the statement "a(i) = 0" uses i as the index, and the value of i ranges from 0 to r*p - 1. When a processor allocates only r units, the global index i must be translated to a local index that accesses the local memory of that processor. This is depicted in Fig. 24. The correct code structure is shown below.

    int a[r];
    me = mynode();
    For i = 0 to r*p - 1
       if (proc_map(i) == me) a(local(i)) = 0

For block mapping, local(i) = i mod r. For cyclic mapping, local(i) = ⌊i/p⌋.

Figure 24: Global data index vs. local data index: a global array of 6 elements is split across two processors, each holding local indices 0, 1, 2.

In summary, given data item i (i starts from 0):

- 1D block:
  Processor ID: proc_map(i) = ⌊i/r⌋.
  Local data address: local(i) = i mod r.
  An example of block mapping with p = 2 and r = 4 (global indices and their local addresses):

      Proc 0: globals 0 1 2 3 → locals 0 1 2 3
      Proc 1: globals 4 5 6 7 → locals 0 1 2 3

- 1D cyclic:
  Processor ID: proc_map(i) = i mod p.
  Local data address: local(i) = ⌊i/p⌋.
  An example of cyclic mapping with p = 2:

      Proc 0: globals 0 2 4 6 → locals 0 1 2 3
      Proc 1: globals 1 3 5 7 → locals 0 1 2 3
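The mapping formulas above translate directly into C; the helper functions below (the names are mine) compute the owner and the local address of a global index for the block and cyclic schemes, and print the p = 2, r = 4 example.

    #include <stdio.h>

    /* Block mapping with block size r: owner and local offset of global index i. */
    int block_owner(int i, int r)  { return i / r; }
    int block_local(int i, int r)  { return i % r; }

    /* Cyclic mapping over p processors: owner and local offset of global index i. */
    int cyclic_owner(int i, int p) { return i % p; }
    int cyclic_local(int i, int p) { return i / p; }

    int main(void) {
        int p = 2, r = 4;
        for (int i = 0; i < p * r; i++)
            printf("i=%d  block:(proc %d, local %d)  cyclic:(proc %d, local %d)\n",
                   i, block_owner(i, r), block_local(i, r),
                   cyclic_owner(i, p), cyclic_local(i, p));
        return 0;
    }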

5.4 A summary on program parallelization

Figure 25: The process of program parallelization: a program undergoes dependence analysis, code partitioning, and data partitioning to produce tasks and data units, which are then mapped and scheduled onto the p processors to give the parallel code.

The process of program parallelization is depicted in Figure 25, and we will demonstrate this process by parallelizing the following scientific computing algorithms:

- Matrix-vector multiplication.
- Matrix-matrix multiplication.
- Direct methods for solving linear equations, e.g. Gaussian elimination.
- Iterative methods for solving linear equations, e.g. Jacobi.
- Finite difference methods for differential equations.

6 Matrix Vector Multiplication

Problem: y = A·x, where A is an n × n matrix and x is a column vector of dimension n.

Sequential code:

    for i = 1 to n do
       y(i) = 0;
       for j = 1 to n do
          y(i) = y(i) + a(i,j) * x(j);
       Endfor
    Endfor

An example with a 3 × 3 matrix and a 3-vector: each result entry is y(i) = a(i,1)·x(1) + a(i,2)·x(2) + a(i,3)·x(3). (A worked instance with concrete numbers is given below.)
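For instance, with numbers of my own choosing (not the ones in the original notes):

    [ 1 2 0 ]   [ 1 ]   [ 1*1 + 2*2 + 0*3 ]   [ 5  ]
    [ 0 1 3 ] * [ 2 ] = [ 0*1 + 1*2 + 3*3 ] = [ 11 ]
    [ 2 0 1 ]   [ 3 ]   [ 2*1 + 0*2 + 1*3 ]   [ 5  ]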

The sequential complexity is 2n²ω, where ω is the time for one addition or multiplication.

Partitioned code:

    for i = 1 to n do
       Si:  y(i) = 0;
            for j = 1 to n do
               y(i) = y(i) + a(i,j) * x(j);
            Endfor
    Endfor

Figure 26: Task graph and a schedule for matrix-vector multiplication: the tasks S1, ..., Sn are independent, and blocks of r consecutive tasks are assigned to each of the p processors.

The dependence task graph and a task schedule on p processors are shown in Fig. 26.

Mapping function for the tasks Si: for the above schedule, proc_map(i) = ⌊(i-1)/r⌋, where r = ⌈n/p⌉.

Data partitioning based on the above schedule: matrix A is divided into n rows A1, A2, ..., An.

Figure 27: An illustration of data mapping for matrix-vector multiplication: rows A1, ..., A4 go to processor 0 (local rows 0-3) and rows A5, ..., A8 go to processor 1 (local rows 0-3).

Data mapping: row Ai is mapped to processor proc_map(i), the same processor as task Si. The indexing function is local(i) = (i-1) mod r. The vectors x and y are replicated on all processors.

SPMD parallel code:

    int x[n], y[n], a[r][n];
    me = mynode();
    for i = 1 to n do
       if (proc_map(i) == me) then do Si:
          y[i] = 0;
          for j = 1 to n do
             y[i] = y[i] + a[local(i)][j] * x[j];
          Endfor
    Endfor

The parallel time is PT = (n/p) · 2nω, since each task Si costs 2nω (ignoring the overhead of computing local(i)). Thus

    PT = 2n²ω / p.

7 Matrix-Matrix Multiplication

7.1 Sequential algorithm

Problem: C = A·B, where A and B are n × n matrices.

Sequential code:

    for i = 1 to n do
       for j = 1 to n do
          sum = 0;
          for k = 1 to n do
             sum = sum + a(i,k) * b(k,j);
          Endfor
          c(i,j) = sum;
       Endfor
    Endfor

An example with 2 × 2 matrices: each entry of the product is c(i,j) = a(i,1)·b(1,j) + a(i,2)·b(2,j).

Time complexity: each multiplication or addition counts as one time unit ω.

    Number of operations = Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} 2 = 2n³,

so the sequential time is 2n³ω.

Parallel algorithm with sufficient memory

Partitioned code:
for i = 1 to n do
  T_i:  for j = 1 to n do
          sum = 0;
          for k = 1 to n do
            sum = sum + a(i,k) * b(k,j);
          Endfor
          c(i,j) = sum;
        Endfor
Endfor

Since tasks T_i (1 <= i <= n) are independent, we use the following mapping for the parallel code:

- Matrix A is partitioned using row-wise block mapping.
- Matrix C is partitioned using row-wise block mapping.
- Matrix B is duplicated on all processors.
- Task T_i is mapped to the processor that owns row i of matrix A.

The SPMD code with parallel time 2n^3 ω / p is:
For i = 1 to n
  if (proc_map(i) == me) do T_i;
Endfor

A more detailed description of the algorithm is:

for i = 1 to n do
  if (proc_map(i) == me) do
    for j = 1 to n do
      sum = 0;
      for k = 1 to n do
        sum = sum + a(local(i), k) * b(k,j);
      Endfor
      c(local(i), j) = sum;
    Endfor
  endif
Endfor


Parallel algorithm with 1D partitioning

The above algorithm assumes that B can be replicated on all processors. In practice, that costs too much memory. We describe an algorithm in which all of the matrices A, B, and C are uniformly distributed among processors.

Partitioned code:
for i = 1 to n do
  for j = 1 to n do
    T_ij:  sum = 0;
           for k = 1 to n do
             sum = sum + a(i,k) * b(k,j);
           Endfor
           c(i,j) = sum;
  Endfor
Endfor

Data access: Each task T_ij reads row A_i and column B_j and writes data element c_ij.

Task graph: There are n^2 independent tasks T_11, T_12, ..., T_1n, T_21, ..., T_2n, ..., T_n1, ..., T_nn.

Mapping:

- Matrix A is partitioned using row-wise block mapping.
- Matrix C is partitioned using row-wise block mapping.
- Matrix B is partitioned using column-wise block mapping.
- Task T_ij is mapped to the processor that owns row i of matrix A. Thus Cluster 1 = {T_11, ..., T_1n}, Cluster 2 = {T_21, ..., T_2n}, ..., Cluster n = {T_n1, ..., T_nn}.

Parallel algorithm:
For j = 1 to n
  Broadcast column B_j to all processors.
  Do tasks T_1j, T_2j, ..., T_nj in parallel.
Endfor

Parallel time analysis: Each multiplication or addition counts one time unit ω, and each task T_ij costs 2nω. Also assume that each broadcast costs (α + βn) log p. Then

PT = sum_{j=1}^{n} [ (α + βn) log p + (n/p) * 2nω ] = n (α + βn) log p + 2 n^3 ω / p.
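One possible way to code this 1D scheme with MPI is sketched below; it assumes each process owns n/p rows of A (and C) and n/p columns of B, that p divides n, and the names (a_local, b_local, c_local, col_j) are illustrative rather than taken from the notes.

#include <mpi.h>

/* For each column j of B, the owning process broadcasts it and every
   process then updates its n/p local rows of C (tasks T_{i,j}).
   a_local: (n/p) x n rows of A; b_local: n x (n/p) local columns of B,
   row-major; c_local: (n/p) x n rows of C; col_j: scratch of length n. */
void matmul_1d(int n, int p, int me, const double *a_local,
               const double *b_local, double *c_local, double *col_j)
{
    int nloc = n / p;
    for (int j = 0; j < n; j++) {
        int root = j / nloc;                 /* owner of column j under block mapping */
        if (me == root)
            for (int k = 0; k < n; k++)      /* copy the local column into the buffer */
                col_j[k] = b_local[k * nloc + (j % nloc)];
        MPI_Bcast(col_j, n, MPI_DOUBLE, root, MPI_COMM_WORLD);
        for (int i = 0; i < nloc; i++) {     /* locally owned rows of C */
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a_local[i * n + k] * col_j[k];
            c_local[i * n + j] = sum;
        }
    }
}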

Fox's algorithm for 2D data partitioning

This algorithm is similar to Cannon's algorithm described in the textbook.

Naturally, another option we have for program partitioning is to treat each interior statement as one task. Let us first ignore the initialization statement. We have the following partitioning:


for i = 1 to n do
  for j = 1 to n do
    for k = 1 to n do
      T_ijk:  c(i,j) = c(i,j) + a(i,k) * b(k,j);
    Endfor
  Endfor
Endfor

Then the iteration space is 3D. All tasks T_ijk (1 <= k <= n) that modify the same c(i,j) form a dependence chain, which corresponds to a summation of n terms for c(i,j). There are no dependences between those chains.

Fox's algorithm is used to support 2D block partitioning of matrices A, B and C. This algorithm requires a reorganization of the sequential code. There are two basic ideas: reordering the sequential additions and submatrix-based partitioning.

Reorder the sequential additions

We examine the summation sequence for each result element c_ij:

c_ij = a_i1 * b_1j + a_i2 * b_2j + ... + a_in * b_nj.

It can be reordered to start at the diagonal element:

c_ij = a_ii * b_ij + a_i,i+1 * b_i+1,j + ... + a_in * b_nj + a_i1 * b_1j + ... + a_i,i-1 * b_i-1,j,

i.e.

c_ij = sum_{k=1}^{n} a_i,t * b_t,j,   where t = ((i + k - 2) mod n) + 1.

Assume that we use a 2D data mapping on n x n processors: processor (i,j) owns data a_ij, b_ij and c_ij. We execute the computation for c_ij in n stages. At stage k (k = 1, ..., n), for all 1 <= i, j <= n, do

c_ij = c_ij + a_i,t * b_t,j,   where t = ((i + k - 2) mod n) + 1.

We use an example with 3 x 3 processors to demonstrate the computation sequence. At stage k = 1 each processor (i,j) computes c_ij += a_ii * b_ij; at stage k = 2 it computes c_ij += a_i,i+1 * b_i+1,j; and at stage k = 3 it computes c_ij += a_i,i+2 * b_i+2,j (row indices taken modulo 3).

Examining which elements of a are used during these three stages, we see that at each stage every processor in row i needs the single element a_i,t with t = ((i + k - 2) mod n) + 1. Thus the communication pattern is that at each stage a_i,t is broadcast to every processor in row i.

We also examine which elements of b are used at each grid point at each stage: processor (i,j) uses b_ij at stage 1, b_i+1,j at stage 2, and b_i+2,j at stage 3.


For b, studying how a column of b moves from stage k to stage k+1, we can observe that essentially a ring pipelining (all-to-all broadcast) is conducted along every column direction.

Parallel algorithm:

For k = 1 to n
  On each processor (i,j), set t = ((k + i - 2) mod n) + 1.
  In each row i, a_i,t at processor (i,t) is broadcast to the other processors (i,j) in the same row, 1 <= j <= n.
  Do c_ij = c_ij + a_i,t * b_t,j in parallel on all processors.
  Every processor (i,j) sends its current element of b (b_t,j) to processor (i-1, j), with wraparound.
Endfor

Submatrix partitioning for block-based matrix multiplication

In practice, the number of processors is much less than n^2. Also, it does not make sense to utilize n^2 processors since the code suffers too much communication overhead for such fine-grain computation. To increase the granularity of computation, we map the n x n grid to processors using the 2D block method.

First we partition all matrices (A, B, C, of size n x n) in a submatrix style. Each processor (i,j) is assigned a submatrix of size n/q x n/q, where q = sqrt(p). Let r = n/q. The submatrix partitioning of A can be demonstrated using the following example with n = 4 and q = 2.

A, a 4 x 4 matrix with entries a_11 ... a_44, is viewed as a 2 x 2 block matrix

A = ( A_11  A_12 ; A_21  A_22 ),

where A_11 contains rows 1-2 and columns 1-2, A_12 contains rows 1-2 and columns 3-4, A_21 contains rows 3-4 and columns 1-2, and A_22 contains rows 3-4 and columns 3-4; each block is a 2 x 2 submatrix.

Then the sequential algorithm can be reorganized in terms of submatrix blocks as

for i = 1 to q do
  for j = 1 to q do
    C_ij = 0;
    for k = 1 to q do
      C_ij = C_ij + A_ik * B_kj;
    Endfor
  Endfor
Endfor

Then we use the same reordering idea:

C_ij = A_ii * B_ij + A_i,i+1 * B_i+1,j + ... + A_iq * B_qj + A_i1 * B_1j + ... + A_i,i-1 * B_i-1,j.

Parallel algorithm:

For k = 1 to q
  On each processor (i,j), set t = ((k + i - 2) mod q) + 1.
  In each row i, A_i,t at processor (i,t) is broadcast to the other processors (i,j) in the same row, 1 <= j <= q.
  Do C_ij = C_ij + A_i,t * B_t,j in parallel on all processors.
  Every processor (i,j) sends its current submatrix of B (B_t,j) to processor (i-1, j), with wraparound.
Endfor

Parallel time analysis: A submatrix multiplication A_i,t * B_t,j costs 2r^3 ω and a submatrix addition costs r^2 ω, so each innermost statement costs about 2r^3 ω. Also assume that each broadcast of a submatrix costs (α + βr^2) log q. Then

PT = sum_{k=1}^{q} [ (α + βr^2) log q + 2r^3 ω ]
   = q (α + β(n/q)^2) log q + 2(n/q)^3 ω q
   = sqrt(p) α log sqrt(p) + (n^2 / sqrt(p)) β log sqrt(p) + 2 n^3 ω / p.

This algorithm has much smaller communication overhead than the 1D algorithm presented in the previous subsection.

How to produce block-based matrix multiplication code

Again, we assume that matrix A is partitioned into q x q submatrices. Let r = n/q. Using the loop blocking technique discussed earlier, we block each loop nest by introducing three control loops (ii, jj, kk):

for ii = 1 to q do
  for i = (ii-1)*r + 1 to ii*r do
    for jj = 1 to q do
      for j = (jj-1)*r + 1 to jj*r do
        for kk = 1 to q do
          for k = (kk-1)*r + 1 to kk*r do
            c(i,j) = c(i,j) + a(i,k) * b(k,j);
          Endfor Endfor Endfor Endfor Endfor Endfor

Then we use the loop interchange technique discussed earlier to move the control loops outermost:

for ii = 1 to q do
  for jj = 1 to q do
    for kk = 1 to q do
      for i = (ii-1)*r + 1 to ii*r do
        for j = (jj-1)*r + 1 to jj*r do
          for k = (kk-1)*r + 1 to kk*r do
            c(i,j) = c(i,j) + a(i,k) * b(k,j);
          Endfor Endfor Endfor Endfor Endfor Endfor

The inner three loops access one submatrix of each of A, B, and C. The above code is the same as the block-based code used in the previous subsection, excluding the initialization of matrix C. A C version is sketched below.
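For reference, the blocked and interchanged loop nest can be written directly in C as the following sketch (0-based indexing; it assumes n is a multiple of the block size r, that c has been zero-initialized, and the array names are illustrative).

/* Blocked matrix multiplication: the three outer loops walk over r x r
   blocks, the three inner loops multiply one block of A by one block of
   B and accumulate into the corresponding block of C. */
void blocked_matmul(int n, int r, const double *a, const double *b, double *c)
{
    for (int ii = 0; ii < n; ii += r)
        for (int jj = 0; jj < n; jj += r)
            for (int kk = 0; kk < n; kk += r)
                for (int i = ii; i < ii + r; i++)
                    for (int j = jj; j < jj + r; j++) {
                        double sum = c[i * n + j];
                        for (int k = kk; k < kk + r; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
}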

Gaussian Elimination for Solving Linear Systems

Gaussian Elimination without Partial Pivoting

The row-oriented GE sequential algorithm

The Gaussian elimination method for solving a linear system Ax = b is listed below. Assume that column n+1 of A stores the vector b. Loop k controls the elimination steps, loop i controls the i-th row access, and loop j controls the j-th column access.

Forward Elimination:
For k = 1 to n-1
  For i = k+1 to n
    a(i,k) = a(i,k)/a(k,k);
    For j = k+1 to n+1
      a(i,j) = a(i,j) - a(i,k) * a(k,j);
    Endfor
  Endfor
Endfor

Notice that since the lower-triangular matrix elements become zero after elimination, their space is used to store the multipliers a(i,k) = a(i,k)/a(k,k).

Backward Substitution: Note that x_i uses the space of a(i,n+1).
For i = n to 1
  For j = i+1 to n
    x_i = x_i - a(i,j) * x_j;
  Endfor
  x_i = x_i / a(i,i);
Endfor

An example: Given a 3 x 3 system in unknowns x_1, x_2, x_3, forward elimination first eliminates the coefficient of x_1 from equations (2) and (3) by subtracting suitable multiples of equation (1), and then eliminates the coefficient of x_2 from equation (3) using equation (2). The result is an upper triangular system

  a_11 x_1 + a_12 x_2 + a_13 x_3 = b_1
             a'_22 x_2 + a'_23 x_3 = b'_2
                         a''_33 x_3 = b''_3.

Given this upper triangular system, backward substitution solves the last equation for x_3, substitutes it into the second equation to get x_2, and finally substitutes x_2 and x_3 into the first equation to get x_1.

The forward elimination process can also be expressed on the augmented matrix (A | b), where b is treated as column n+1 of A: each elimination step zeroes out the entries below the diagonal in one column of the augmented matrix.

Time complexity: Each division, multiplication, or subtraction counts one time unit ω.

Number of operations in forward elimination:

sum_{k=1}^{n-1} sum_{i=k+1}^{n} ( 1 + sum_{j=k+1}^{n+1} 2 ) = sum_{k=1}^{n-1} sum_{i=k+1}^{n} ( 2(n-k) + 3 ) ≈ 2 sum_{k=1}^{n-1} (n-k)^2 ≈ (2/3) n^3.

Number of operations in backward substitution:

sum_{i=1}^{n} ( 1 + sum_{j=i+1}^{n} 2 ) = sum_{i=1}^{n} ( 2(n-i) + 1 ) ≈ n^2.

Thus the total number of operations is about (2/3) n^3 ω. The total space needed is about n^2 double-precision numbers.
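The sequential algorithm above can be written compactly in C as the following sketch (0-based indexing, augmented matrix stored row-major in a single array; no pivoting, so it assumes the pivots a(k,k) remain nonzero; names are illustrative).

/* Row-oriented Gaussian elimination on the augmented matrix (A | b).
   a is n x (n+1), row-major; column n holds b on entry.  x receives the
   solution after backward substitution. */
void gaussian_elimination(int n, double *a, double *x)
{
    for (int k = 0; k < n - 1; k++)              /* elimination steps */
        for (int i = k + 1; i < n; i++) {
            double m = a[i * (n + 1) + k] / a[k * (n + 1) + k];
            a[i * (n + 1) + k] = m;              /* keep the multiplier */
            for (int j = k + 1; j <= n; j++)
                a[i * (n + 1) + j] -= m * a[k * (n + 1) + j];
        }
    for (int i = n - 1; i >= 0; i--) {           /* backward substitution */
        double s = a[i * (n + 1) + n];
        for (int j = i + 1; j < n; j++)
            s -= a[i * (n + 1) + j] * x[j];
        x[i] = s / a[i * (n + 1) + i];
    }
}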

The row-oriented parallel algorithm

A program partitioning for the forward elimination part is listed below (reference: the textbook chapter on Gaussian elimination).

For k = 1 to n-1
  For i = k+1 to n
    T_k^i:  a(i,k) = a(i,k)/a(k,k);
            For j = k+1 to n+1
              a(i,j) = a(i,j) - a(i,k) * a(k,j);
            Endfor
  Endfor
Endfor

The data access pattern of each task is:
T_k^i:  Read rows A_k and A_i; write row A_i.

Dependence Graph:

[Figure: tasks T_1^2, T_1^3, ..., T_1^n form level k=1; tasks T_2^3, ..., T_2^n form level k=2; and so on down to T_{n-1}^n at level k=n-1. Each task of step k depends on the step k-1 task that last wrote the rows it reads.]

Thus, for each step k, the tasks T_k^{k+1}, T_k^{k+2}, ..., T_k^n are independent. We can use the following algorithm design for the row-oriented GE algorithm:

For k = 1 to n-1
  Do T_k^{k+1}, T_k^{k+2}, ..., T_k^n in parallel on p processors.
Endfor

Task and data mapping: We first group tasks into a set of clusters using the "owner computes" rule: every task T_k^i that modifies the same row i is placed in the same cluster C_i. Row i and cluster C_i will be mapped to the same processor.

We now discuss how rows should be mapped to processors. If block mapping is used, the computation load of clusters C_2, C_3, ..., C_n grows with the cluster index, so processor 0 gets the smallest amount of load and the last processor gets the most; the load is therefore NOT balanced among processors. Cyclic mapping should be used instead.

[Figure: the load profile of clusters C_2, C_3, ..., C_n increases from left to right.]


Parallel Algorithm:
Proc 0 broadcasts Row 1.
For k = 1 to n-1
  Do T_k^{k+1}, ..., T_k^n in parallel.
  Broadcast row k+1.
Endfor

SPMD Code:
me = mynode();
For i = 1 to n
  if (proc_map(i) == me) initialize Row i;
Endfor
If (proc_map(1) == me) broadcast Row 1; else receive it.
For k = 1 to n-1
  For i = k+1 to n
    If (proc_map(i) == me) do T_k^i;
  EndFor
  If (proc_map(k+1) == me) then broadcast Row k+1; else receive it.
EndFor

The column-oriented algorithm

The column-oriented algorithm essentially interchanges loops i and j of the GE program in the previous subsection.

Forward elimination part:

For k = 1 to n-1
  For i = k+1 to n
    a(i,k) = a(i,k)/a(k,k);
  EndFor
  For j = k+1 to n+1
    For i = k+1 to n
      a(i,j) = a(i,j) - a(i,k) * a(k,j);
    EndFor
  EndFor
EndFor

Given the �rst step of forward elimination in the previous example�

�� � �� � �� �� � ��� � � �

�A

������������ ��

���������� �� � �� � �� �

� � �� ��

���

��

�A

One can mark the data access writing� sequence for row oriented elimination below� Notice that � computes and

stores the multiplier �� and � stores

��� � �

� � � � �

� � � �

�A

Then the data writing sequence for column oriented elimination is�

�� � � � �

� � � �

�A

Notice that � computes and stores the multiplier �� and � stores

��� �


Column-oriented backward substitution: We interchange the i and j loops of the row-oriented backward substitution code. Notice again that column n+1 stores the solution vector x.

For j = n to 1
  x_j = x_j / a(j,j);
  For i = j-1 to 1
    x_i = x_i - a(i,j) * x_j;
  Endfor
EndFor

For example� given a upper triangular system��x� � �x� � �x� � ����x� � �x� � �

�x� � �� �

The row oriented algorithm performs�

x� ���

x� � �� �x�x� �

x����

x� � � � �x�x� � x� � �x�x� �

x�� �

The column oriented algorithm performs�

x� ���

x� � �� �x�x� � �� �x�x� �

x����

x� � x� � �x�x� �

x�� �

Partitioned forward elimination� For k � � to n� �T kk � For i � k � � to n

aik � aik�akkEndfor

For j � k � � to n� �

T jk � For i � k � � to n

aij � aij � aik � akjEndfor

EndforEndfor

Task graph:

[Figure: for each step k there is a scaling task T_k^k followed by the independent update tasks T_k^{k+1}, ..., T_k^{n+1}; T_k^k depends on the tasks of step k-1, and every T_k^j (j > k) depends on T_k^k.]


Backward substitution on p processors

Partitioning:
For j = n to 1
  S_{x_j}:  x_j = x_j / a(j,j);
            For i = j-1 to 1
              x_i = x_i - a(i,j) * x_j;
            Endfor
EndFor

Dependence:

S_{x_n} -> S_{x_{n-1}} -> ... -> S_{x_1}.

Parallel algorithm: Execute the tasks S_{x_j}, j = n, ..., 1, one after another on the processor that owns x (which is initialized as column n+1). The code is:
For j = n to 1
  If (owner(column x) == me) then
    Receive column j if not available;
    Do S_{x_j};
  Else if (owner(column j) == me) send column j to the owner of column x.
EndFor

Gaussian elimination with partial pivoting

The sequential algorithm

Pivoting avoids the problem that arises when a(k,k) is too small or zero. We just need to change the forward elimination algorithm by adding the following step: at stage k, interchange rows so that |a(k,k)| is the maximum in the lower portion of column k. Notice that b is stored in column n+1, and the interchange must also be applied to the elements of b.

Example of GE with Pivoting�

�� � � �� � ��� � ��

���������

�A ����

�� � � ��� � �� � ��

���������

�A ���� ���

�� � � ��� � �� ��

� �

�����������

�A

����

�� � � ��� ��

� �� � �

�����������

�A ���� �

��� �� � � ��� ��

� �� � �

�����������

�A x� � �

x� � �x� � �

The backward substitution does not need any change.

Row-oriented forward elimination:
For k = 1 to n-1
  Find m such that |a(m,k)| = max_{k<=i<=n} |a(i,k)|;
  If a(m,k) == 0, there is no unique solution; stop.
  Swap row(k) with row(m);
  For i = k+1 to n
    a(i,k) = a(i,k)/a(k,k);
    For j = k+1 to n+1
      a(i,j) = a(i,j) - a(i,k) * a(k,j);
    Endfor
  Endfor
Endfor


Column�oriented forward elimination� For k � � to n� �Find m such that jam�kj � maxn�i�kfjai�kjg�If am�k � �� No unique solution� stop�Swap rowk� with rowm��For i � k � � to nai�k � ai�k�ak�k

EndForFor j � k � � to n� �For i � k � � to nai�j � ai�j � ai�k � ak�j

EndForEndFor

EndFor

Parallel column-oriented GE with pivoting

Partitioned forward elimination:
For k = 1 to n-1
  P_k^k:  Find m such that |a(m,k)| = max_{i>=k} |a(i,k)|;
          If a(m,k) == 0, there is no unique solution; stop.
  For j = k to n+1
    S_k^j:  Swap a(k,j) with a(m,j);
  Endfor
  T_k^k:  For i = k+1 to n
            a(i,k) = a(i,k)/a(k,k);
          Endfor
  For j = k+1 to n+1
    T_k^j:  For i = k+1 to n
              a(i,j) = a(i,j) - a(i,k) * a(k,j);
            Endfor
  Endfor
Endfor

Dependence structure for iteration k� The above partitioning produces the following dependence structure�

Pkk

Skk S k

k+1... Sk

n+1

Tkk

T kk+1 ... Tk

n+1

Find the maximum element.

Swap each column

Scaling column k

updating columns k+1,k+2,...,n+1

Broadcast swapping positions

Broadcast column k

We can further merging tasks and combine small messages as�

� De�ne task Ukk as performing P

kk � S

kk � and T

kk �

� De�ne task U jk as performing S

jk� and T

jk k � � � j � n� ���

Then the new graph has the following structure�


S kk+1

... Skn+1

T kk+1 ... Tk

n+1 updating columns k+1,k+2,...,n+1

Broadcast swapping positions

Pkk

Skk

Tkk

Find the maximum element.

Scaling column kSwap column k.

and column k.

Swap column k+1,k+2,...,n+1

Uk

k

Ukk+1

Ukn+1

Parallel algorithm for GE with pivoting:

For k = 1 to n-1
  The owner of column k does U_k^k and broadcasts the swapping positions and column k.
  Do U_k^{k+1}, ..., U_k^{n+1} in parallel.
Endfor

Iterative Methods for Solving Ax = b

The Gaussian elimination method is called a direct method. There is another kind of method for solving a linear system, called iterative methods. We use such methods when the given matrix A is very sparse, i.e., many of its elements are zero.

The iterative methods

We use a 3 x 3 example to demonstrate the so-called Jacobi iterative method. Given a system

a_11 x_1 + a_12 x_2 + a_13 x_3 = b_1
a_21 x_1 + a_22 x_2 + a_23 x_3 = b_2
a_31 x_1 + a_32 x_2 + a_33 x_3 = b_3,

we solve the i-th equation for x_i and reformulate the system as

x_1 = (b_1 - a_12 x_2 - a_13 x_3) / a_11
x_2 = (b_2 - a_21 x_1 - a_23 x_3) / a_22
x_3 = (b_3 - a_31 x_1 - a_32 x_2) / a_33,

which gives the iteration

x_1^{k+1} = (b_1 - a_12 x_2^k - a_13 x_3^k) / a_11
x_2^{k+1} = (b_2 - a_21 x_1^k - a_23 x_3^k) / a_22
x_3^{k+1} = (b_3 - a_31 x_1^k - a_32 x_2^k) / a_33.

We start from an initial approximation x_1 = 0, x_2 = 0, x_3 = 0, compute a new set of values for x_1, x_2, x_3, and keep doing this until the difference between iterations k and k+1 is small, which means the error is small. After a number of iterations the values settle close to the true solution. Formally, we stop when || x^{k+1} - x^k || <= ε, where x^k is the vector of values of x_1, x_2, x_3 after iteration k. We need to define the norm || x^{k+1} - x^k ||.


A general iterative method can be formulated as:

Assign an initial value to x^0; k = 0;
Do
  x^{k+1} = H * x^k + d
until || x^{k+1} - x^k || <= ε.

H is called the iteration matrix. For the example above, H is the 3 x 3 matrix whose (i,j) entry is -a_ij / a_ii for j != i and 0 on the diagonal, and d is the vector with entries b_i / a_ii.
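A small C sketch of this generic iteration x^{k+1} = H x^k + d with an infinity-norm stopping test is shown below; it uses dense storage and illustrative names, whereas a real sparse solver would not form H explicitly.

#include <math.h>

/* Iterate x <- H x + d until ||x_new - x_old||_inf <= eps or max_iter
   sweeps have been done.  H is n x n row-major; x holds the initial
   guess on entry and the result on exit; xnew is scratch of length n. */
void iterate(int n, const double *H, const double *d,
             double *x, double *xnew, double eps, int max_iter)
{
    for (int it = 0; it < max_iter; it++) {
        double diff = 0.0;
        for (int i = 0; i < n; i++) {
            double s = d[i];
            for (int j = 0; j < n; j++)
                s += H[i * n + j] * x[j];
            xnew[i] = s;
            diff = fmax(diff, fabs(xnew[i] - x[i]));
        }
        for (int i = 0; i < n; i++) x[i] = xnew[i];
        if (diff <= eps) break;          /* converged */
    }
}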

Norms and Convergence

Norms of vectors and matrices

Norm of a vector: One application of vector norms is in error control, e.g. || Error || <= ε. In the rest of the discussion we sometimes simply write x instead of the vector x. Given x = (x_1, x_2, ..., x_n),

|| x ||_1 = sum_{i=1}^{n} |x_i|,    || x ||_2 = sqrt( sum_i |x_i|^2 ),    || x ||_inf = max_i |x_i|.

Example: for x = (1, -2, 2),

|| x ||_1 = 5,    || x ||_2 = sqrt(1 + 4 + 4) = 3,    || x ||_inf = 2.

Properties of a vector norm:

|| x || >= 0, and || x || = 0 only if x = 0;
|| αx || = |α| * || x ||;
|| x + y || <= || x || + || y ||.

Norm of a matrix: Given a matrix of dimension n x n, a matrix norm has the following properties:

|| A || >= 0, and || A || = 0 only if A = 0;
|| αA || = |α| * || A ||;
|| A + B || <= || A || + || B ||;
|| AB || <= || A || * || B ||;
|| Ax || <= || A || * || x ||, where x is a vector.

Define

|| A ||_inf = max_{1<=i<=n} sum_{j=1}^{n} |a_ij|   (maximum row sum),
|| A ||_1   = max_{1<=j<=n} sum_{i=1}^{n} |a_ij|   (maximum column sum).

Example: for

A = ( 1  -2 ; 3  4 ),

|| A ||_inf = max(1+2, 3+4) = 7,    || A ||_1 = max(1+3, 2+4) = 6.
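Two small C helpers matching the definitions above (vector and matrix infinity norms); the names are illustrative.

#include <math.h>

/* ||x||_inf = max_i |x_i| */
double vec_norm_inf(int n, const double *x)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)
        m = fmax(m, fabs(x[i]));
    return m;
}

/* ||A||_inf = maximum row sum of |a_ij|; A is n x n, row-major. */
double mat_norm_inf(int n, const double *a)
{
    double m = 0.0;
    for (int i = 0; i < n; i++) {
        double row = 0.0;
        for (int j = 0; j < n; j++)
            row += fabs(a[i * n + j]);
        m = fmax(m, row);
    }
    return m;
}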


Convergence of iterative methods

Definition: Let x* be the exact solution and x^k be the solution vector after step k. The sequence x^0, x^1, x^2, ... converges to the solution x* with respect to the norm || . || if || x^k - x* || -> 0 as k -> infinity.

A condition for convergence: Let the error be e^k = x^k - x*. Since

x^{k+1} = H x^k + d   and   x* = H x* + d,

subtracting gives

x^{k+1} - x* = H (x^k - x*),   i.e.   e^{k+1} = H e^k.

We have

|| e^{k+1} || = || H e^k || <= || H || * || e^k || <= || H ||^2 * || e^{k-1} || <= ... <= || H ||^{k+1} * || e^0 ||.

Then if || H || < 1, the method converges.

Jacobi Method for Ax = b

For each iteration:

x_i^{k+1} = (1 / a_ii) * ( b_i - sum_{j != i} a_ij x_j^k ),   i = 1, ..., n.

For the 3 x 3 example used earlier, this is exactly the component-wise iteration listed in the previous subsection.

Jacobi method in matrix-vector form: the same iteration can be written as x^{k+1} = H x^k + d, where the (i,j) entry of H is -a_ij / a_ii for j != i and 0 for j = i, and d_i = b_i / a_ii.

In general, split A into its diagonal and off-diagonal parts:

A = D + B,   where D = diag(a_11, a_22, ..., a_nn) and B = A - D.

Then from Ax = b,

(D + B) x = b,   so   D x = -B x + b,

and the Jacobi iteration is

x^{k+1} = -D^{-1} B x^k + D^{-1} b,

i.e.

H = -D^{-1} B,   d = D^{-1} b.


Parallel Jacobi Method

x^{k+1} = -D^{-1} B x^k + D^{-1} b

Parallel implementation:

- Distribute the rows of B and the diagonal elements of D to processors.
- Perform the computation based on the owner-computes rule.
- Perform an all-to-all broadcast of the updated entries after each iteration.

Note: If the iteration matrix is very sparse, i.e. contains a lot of zeros, the code design should take advantage of this and should not store the zero elements; it should also explicitly skip the operations applied to zero elements.

Example:

y_0 - 2 y_1 + y_2 = h^2
y_1 - 2 y_2 + y_3 = h^2
...
y_{n-1} - 2 y_n + y_{n+1} = h^2,

with y_0 = y_{n+1} = 0. This set of equations can be rewritten in matrix form as

( -2   1                )   ( y_1 )     ( h^2 )
(  1  -2   1            )   ( y_2 )     ( h^2 )
(      ..  ..  ..       ) * ( ... )  =  ( ... )
(          1  -2   1    )   (y_{n-1})   ( h^2 )
(              1  -2    )   ( y_n )     ( h^2 )

The Jacobi method in matrix format multiplies the current iterate by the iteration matrix H = -D^{-1}B, where D = -2I is the diagonal part of the matrix above and B is its off-diagonal (tridiagonal) part, and then adds d = D^{-1} times the right-hand side. It would be too time- and space-consuming to multiply by the entire iteration matrix, since it is almost all zeros.

Correct solution: write the Jacobi method as

Repeat
  For i = 1 to n
    ynew_i = 0.5 * ( yold_{i-1} + yold_{i+1} - h^2 );
  Endfor
Until || ynew - yold || <= ε
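A short C sketch of this sparse Jacobi sweep for the tridiagonal system is given below; the arrays have length n+2 so that the boundary values y_0 = y_{n+1} = 0 sit at the ends, and the names are illustrative.

#include <math.h>

/* Jacobi sweeps for y_{i-1} - 2 y_i + y_{i+1} = h^2, y_0 = y_{n+1} = 0.
   yold and ynew have length n+2; the interior unknowns are indices 1..n. */
void jacobi_tridiag(int n, double h, double *yold, double *ynew,
                    double eps, int max_iter)
{
    for (int it = 0; it < max_iter; it++) {
        double diff = 0.0;
        for (int i = 1; i <= n; i++) {
            ynew[i] = 0.5 * (yold[i - 1] + yold[i + 1] - h * h);
            diff = fmax(diff, fabs(ynew[i] - yold[i]));
        }
        for (int i = 1; i <= n; i++) yold[i] = ynew[i];
        if (diff <= eps) break;
    }
}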


Gauss-Seidel Method

The basic idea is to use new solution values as soon as they are available. For the 3 x 3 example, the Jacobi method computes x_1^{k+1}, x_2^{k+1}, x_3^{k+1} entirely from the old values x_1^k, x_2^k, x_3^k, while the Gauss-Seidel method computes

x_1^{k+1} from x_2^k and x_3^k,
x_2^{k+1} from x_1^{k+1} and x_3^k,
x_3^{k+1} from x_1^{k+1} and x_2^{k+1}.

Tracking a number of iterations shows that the GS method converges faster than Jacobi's method on this example.

Formally, the GS method is summarized as follows. For each iteration:

x_i^{k+1} = (1 / a_ii) * ( b_i - sum_{j < i} a_ij x_j^{k+1} - sum_{j > i} a_ij x_j^k ),   i = 1, ..., n.

Matrix-vector format: Let A = D + L + U, where D is the diagonal of A, L is the strictly lower triangular part, and U is the strictly upper triangular part. Then

D x^{k+1} = b - L x^{k+1} - U x^k
(D + L) x^{k+1} = b - U x^k
x^{k+1} = -(D + L)^{-1} U x^k + (D + L)^{-1} b.

Example�

xk��� � � ��� ��xk� � xk���

xk��� � �� �� ��xk��� � �xk���

xk��� � ��� ��� xk��� � �xk��� ��

�xk��� � �xk� �xk� ���

��xk��� ��xk��� � ��xk� ��

xk��� ��xk��� ��xk��� � �

�� � � ��� � �� � ��

�� x�

x�x�

�A

k��

�� � � ��� � ��� � �

�� x�

x�x�

�A

k

�� �����

�A

H �

�� � � ��� � �� � ��

��

��� � � ��� � ��� � �

� d �

�� � � ��� � �� � ��

��

��� �����

�A �

More on convergence

We can actually judge the convergence of Jacobi and GS by examining A rather than H.

We call a matrix A strictly diagonally dominant if

|a_ii| > sum_{j=1, j != i}^{n} |a_ij|,   i = 1, 2, ..., n.

Theorem: If A is strictly diagonally dominant, then both the Gauss-Seidel and Jacobi methods converge.

Example: the coefficient matrix of the 3 x 3 system used above is strictly diagonally dominant (in each row, the magnitude of the diagonal entry exceeds the sum of the magnitudes of the off-diagonal entries), so both the Jacobi and G.S. methods converge for it.

The SOR method

SOR stands for Successive Over-Relaxation. The rate of convergence can be improved (accelerated) by the SOR method:

Step 1: Use the Gauss-Seidel method to compute an intermediate iterate: xhat^{k+1} = H x^k + d.
Step 2: Do a correction: x^{k+1} = x^k + w ( xhat^{k+1} - x^k ).
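A tiny C fragment showing the SOR correction applied on top of one Gauss-Seidel sweep; xold is the previous iterate, xgs the Gauss-Seidel result, w the relaxation parameter, and all names are illustrative.

/* SOR: blend the previous iterate with the Gauss-Seidel iterate.
   w = 1 reproduces plain Gauss-Seidel; 1 < w < 2 typically accelerates. */
void sor_correction(int n, double w, const double *xold,
                    const double *xgs, double *xnew)
{
    for (int i = 0; i < n; i++)
        xnew[i] = xold[i] + w * (xgs[i] - xold[i]);
}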

Numerical Differentiation

In the next section we will discuss the finite difference method for solving ODEs/PDEs. That method uses numerical differentiation to approximate (partial) derivatives. In this section we discuss how to approximate derivatives.

Approximation of Derivatives

We discuss formulas for computing the first derivative f'(x). The definition of the first derivative is

f'(x) = lim_{h -> 0} ( f(x+h) - f(x) ) / h.

Thus the forward difference method is

f'(x) ≈ ( f(x+h) - f(x) ) / h.

We can derive it using the Taylor expansion

f(x+h) = f(x) + h f'(x) + (h^2 / 2) f''(z),

where z is between x and x+h. Then

f'(x) = ( f(x+h) - f(x) ) / h - (h/2) f''(z).

The truncation error is O(h).

There are other formulas:


- Backward difference:

  f'(x) = ( f(x) - f(x-h) ) / h + (h/2) f''(z).

- Central difference:

  f(x+h) = f(x) + h f'(x) + (h^2/2) f''(x) + (h^3/6) f'''(z_1),
  f(x-h) = f(x) - h f'(x) + (h^2/2) f''(x) - (h^3/6) f'''(z_2).

  Subtracting,

  f(x+h) - f(x-h) = 2h f'(x) + (h^3/6) ( f'''(z_1) + f'''(z_2) ).

  Thus the method is

  f'(x) = ( f(x+h) - f(x-h) ) / (2h) + O(h^2).

Central difference for second derivatives

f(x+h) = f(x) + h f'(x) + (h^2/2) f''(x) + (h^3/6) f'''(x) + (h^4/24) f''''(z_1)
f(x-h) = f(x) - h f'(x) + (h^2/2) f''(x) - (h^3/6) f'''(x) + (h^4/24) f''''(z_2)

Adding,

f(x+h) + f(x-h) = 2 f(x) + h^2 f''(x) + O(h^4).

Thus

f''(x) = ( f(x+h) + f(x-h) - 2 f(x) ) / h^2 + O(h^2).

Example

We approximate f'(x) for f(x) = cos(x) using the forward difference formula (f(x+h) - f(x))/h for a sequence of step sizes h, halving h each time. Tabulating the error E_i against h shows that the truncation error E_i is proportional to h: if h is halved, the error is also roughly halved (the ratio E_{i+1}/E_i is about 1/2).

We then approximate f''(x) for the same function using the central difference formula f''(x) ≈ ( f(x+h) - 2 f(x) + f(x-h) ) / h^2. Here the tabulated truncation error E_i is proportional to h^2: halving h reduces the error by about a factor of 4.
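A small C program that reproduces this experiment for f(x) = cos(x); the evaluation point x0 and the list of step sizes are illustrative choices, not taken from the notes.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x0 = 1.0;                       /* illustrative evaluation point */
    double exact1 = -sin(x0);              /* f'(x)  for f = cos */
    double exact2 = -cos(x0);              /* f''(x) for f = cos */
    for (double h = 0.1; h >= 0.0125; h /= 2) {
        double fwd  = (cos(x0 + h) - cos(x0)) / h;                        /* O(h)   */
        double cen2 = (cos(x0 + h) - 2*cos(x0) + cos(x0 - h)) / (h * h);  /* O(h^2) */
        printf("h=%.4f  forward f' error=%.6f  central f'' error=%.8f\n",
               h, fabs(fwd - exact1), fabs(cen2 - exact2));
    }
    return 0;
}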


�� ODE and PDE

ODE stands for Ordinary Di�erential Equations� PDE stands for Partial Di�erential Equations�

An ODE Example� Growth of Population

� Population � Nt�� t is time�

� Assumption � Population in a state grows continuously with time at a birth rate �� proportional to the numberof persons� Let � be the average number of persons moving in to this state after subtracting the moving out

number��d Nt�

dt� �Nt� � ��

� Let � � ����� � � ����� million�d Nt�

dt� ����Nt� � ������

� If N����� � �million� what are N������ N������ N������ � � � � N������

Other Examples

f �x� � x f�� � ��

What are f����� f����� � � � � f������

f�x� � �� f�� � �� f�� � ����

What are f����� f����� f����� f�����

� The Laplace PDE���Ux� y�

�x����Ux� y�

�y�� ��

The domain is a �D region� Usually we know some boundary values or condition and we need to �nd values for

some points within this region�

Finite Difference Method

1. Discretize a region or interval of the variables (the domain of the function).
2. For each point in the discretized domain, set up an equation using a numerical differentiation formula.
3. Solve the resulting linear equations.

Example: Given

y'(x) = 1,   y(0) = 0.

Domain: x in {0, h, 2h, ..., nh}.
Find: y(h), y(2h), ..., y(nh).
Method: at each point x = ih we use the forward difference formula: 1 = y'(ih) ≈ ( y((i+1)h) - y(ih) ) / h.

Thus

y((i+1)h) - y(ih) = h   for i = 0, 1, ..., n-1,

i.e.

y(h) - y(0) = h
y(2h) - y(h) = h
...
y(nh) - y((n-1)h) = h.

Since y(0) = 0, this is a lower bidiagonal linear system in the unknowns y(h), ..., y(nh): the coefficient matrix has 1 on the diagonal and -1 on the subdiagonal, and the right-hand side is (h, h, ..., h).

Example: Given

y''(x) = 1,   y(0) = 0,   y(1) = 0.

Domain: 0 < x_1 < x_2 < ... < x_n < 1.
Let x_i = i * h and y_i = y(x_i), where h = 1/(n+1).
Find: y_1, y_2, ..., y_n.
Method: at each point x_i we use the central difference formula

1 = y''(x_i) ≈ ( y_{i+1} - 2 y_i + y_{i-1} ) / h^2.

Thus

y_{i-1} - 2 y_i + y_{i+1} = h^2   for i = 1, ..., n,

with y_0 = y_{n+1} = 0, i.e.

y_0 - 2 y_1 + y_2 = h^2
y_1 - 2 y_2 + y_3 = h^2
...
y_{n-1} - 2 y_n + y_{n+1} = h^2.

In matrix form this is the tridiagonal system with -2 on the diagonal, 1 on the sub- and super-diagonals, unknowns (y_1, ..., y_n), and right-hand side (h^2, ..., h^2), the same system used in the Jacobi example earlier.


Gaussian Elimination for solving linear tridiagonal systems

Given the following tridiagonal system, we assume that pivoting is not necessary:

( a_1  b_1                 )   ( x_1 )   ( d_1 )
( c_2  a_2  b_2            )   ( x_2 )   ( d_2 )
(      c_3  a_3  b_3       ) * ( x_3 ) = ( d_3 )
(            ..  ..  ..    )   ( ... )   ( ... )
(               c_n  a_n   )   ( x_n )   ( d_n )

Forward Elimination:

For i = 2 to n
  temp = c_i / a_{i-1};
  a_i = a_i - b_{i-1} * temp;
  d_i = d_i - d_{i-1} * temp;
EndFor

Backward Substitution:

x_n = d_n / a_n;
For i = n-1 to 1
  x_i = ( d_i - b_i * x_{i+1} ) / a_i;
EndFor
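A direct C translation of this tridiagonal elimination (often called the Thomas algorithm) follows; the arrays are used 1-based to match the indices above, so each has length n+1 with index 0 unused, and the names are illustrative.

/* Solve a tridiagonal system: a = diagonal, b = superdiagonal,
   c = subdiagonal, d = right-hand side (overwritten), x = solution.
   Indices 1..n are used; no pivoting is performed. */
void tridiag_solve(int n, double *a, const double *b, const double *c,
                   double *d, double *x)
{
    for (int i = 2; i <= n; i++) {           /* forward elimination */
        double temp = c[i] / a[i - 1];
        a[i] -= b[i - 1] * temp;
        d[i] -= d[i - 1] * temp;
    }
    x[n] = d[n] / a[n];                      /* backward substitution */
    for (int i = n - 1; i >= 1; i--)
        x[i] = (d[i] - b[i] * x[i + 1]) / a[i];
}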

Parallel solutions for tridiagonal systems: There is no parallelism in the above GE method. We discuss an algorithm called the Odd-Even Reduction method (also called Cyclic Reduction).

Basic idea: eliminate the odd-numbered variables from the n equations and formulate about n/2 equations in the even-numbered variables. This reduction process is modeled as a tree dependence structure and can be parallelized. For example, combining equations 1, 2 and 3 eliminates x_1 and x_3 and yields a new equation

a'_2 x_2 + b'_2 x_4 = d'_2,

and combining equations 3, 4 and 5 eliminates x_3 and x_5 and yields

c'_4 x_2 + a'_4 x_4 + b'_4 x_6 = d'_4.

Odd-even reduction part: In general, given n = 2^q - 1 equations, one reduction step eliminates all odd variables and reduces the system to 2^{q-1} - 1 equations. Recursively applying such reduction for log n steps (q-1 steps), we finally have one equation with one variable. For example, n = 7, q = 3:

[Figure (a): the odd-even reduction process for n = 7. Reduction step 1 combines Eq. 1-3 into Eq. 1' (in x2, x4), Eq. 3-5 into Eq. 2' (in x2, x4, x6), and Eq. 5-7 into Eq. 3' (in x4, x6). Reduction step 2 combines Eq. 1', 2', 3' into Eq. 1'' (in x4 only).]

Backward substitution part: After the single equation in one variable is derived, the solution for that variable can be found. Then a backward substitution process is applied recursively, in log n steps, to find all of the remaining solutions.

A task graph for odd-even reduction and backward substitution is:


[Figure (b): the task graph for reduction and backward solving. Reduction tasks on Eq. 1-2-3, Eq. 3-4-5, and Eq. 5-6-7 feed a reduction on Eq. 1'-2'-3'; solving x4 from Eq. 1'' then enables solving x2 from Eq. 1' and x6 from Eq. 3', which in turn enable solving x1, x3, x5, x7 from the original equations 1, 3, 5, 7.]

PDE: Laplace's Equation

We demonstrate the use of the finite difference method for solving a PDE that models a steady-state heat flow problem on a rectangular 10cm x 20cm metal sheet. One edge is held at a temperature of 100 degrees; the other three edges are held at 0 degrees, as shown in the figure below. What are the steady-state temperatures at the interior points?

[Figure: the rectangular domain with boundary temperature 100 on one edge and 0 on the other three edges; the interior grid points are labeled u11, u21, u31.]

Figure: The problem domain for a Laplace equation modeling the steady-state temperature.

The mathematical model is the Laplace equation

∂^2 u(x,y)/∂x^2 + ∂^2 u(x,y)/∂y^2 = 0

with the boundary condition that u equals the prescribed temperatures (100 on one edge, 0 on the other three edges).

We divide the region into a grid with gap h along each axis. At each point (ih, jh), let u(ih, jh) = u_ij. The goal is to find the values of all interior points u_ij.

Numerical solution: Use the central difference formula for approximating the second derivatives:

f''(x) ≈ ( f(x+h) + f(x-h) - 2 f(x) ) / h^2.

Letting f(x) = u(x,y) for a fixed y,

∂^2 u(x,y)/∂x^2 ≈ ( u(x+h, y) + u(x-h, y) - 2 u(x,y) ) / h^2,

i.e.

∂^2 u(x_i, y_j)/∂x^2 ≈ ( u_{i+1,j} + u_{i-1,j} - 2 u_ij ) / h^2.

Similarly,

∂^2 u(x_i, y_j)/∂y^2 ≈ ( u_{i,j+1} + u_{i,j-1} - 2 u_ij ) / h^2.

Then

u_{i+1,j} + u_{i-1,j} - 2 u_ij + u_{i,j+1} + u_{i,j-1} - 2 u_ij = 0,

or

4 u_ij - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} = 0.

Example: For the case in the figure above, let u11 = x_1, u21 = x_2, u31 = x_3, where the three interior points sit in a row, the boundary point next to u11 is held at 100, and all other neighboring boundary points are at 0. The three grid equations are

at u11:  4 x_1 - x_2 = 100,
at u21:  4 x_2 - x_1 - x_3 = 0,
at u31:  4 x_3 - x_2 = 0.

Solving this small system gives approximately

x_1 ≈ 26.79,   x_2 ≈ 7.14,   x_3 ≈ 1.79.

A general grid: Given a general (n+2) x (n+2) grid with n x n interior points, we have n^2 equations

4 u_ij - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} = 0

for 1 <= i, j <= n, where the boundary values are the prescribed edge temperatures.

[Figure: a square domain whose four edges are held at fixed temperatures, with an n x n grid of interior unknowns.]

We order the unknowns as u_11, u_12, ..., u_1n, u_21, ..., u_2n, ..., u_n1, ..., u_nn. For n = 2 the ordering is

( x_1, x_2, x_3, x_4 ) = ( u_11, u_12, u_21, u_22 ),

and the linear system is

(  4  -1  -1   0 )   ( x_1 )   ( sum of the boundary neighbors of u_11 )
( -1   4   0  -1 )   ( x_2 )   ( sum of the boundary neighbors of u_12 )
( -1   0   4  -1 ) * ( x_3 ) = ( sum of the boundary neighbors of u_21 )
(  0  -1  -1   4 )   ( x_4 )   ( sum of the boundary neighbors of u_22 )


In general, the coefficient matrix has the block tridiagonal form

(  T  -I              )
( -I   T  -I          )
(     -I   T   ..     )        (n^2 x n^2),
(          ..  ..  -I )
(              -I   T )

where T is the n x n tridiagonal matrix with 4 on the diagonal and -1 on the sub- and super-diagonals, and I is the n x n identity matrix.

The matrix is very sparse, and a direct method for solving this system takes too much time.

The Jacobi Iterative Method: Given

4 u_ij - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} = 0

for 1 <= i, j <= n, the Jacobi program is:

Repeat
  For i = 1 to n
    For j = 1 to n
      unew_ij = 0.25 * ( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} );
    EndFor
  EndFor
Until || unew - u || <= ε

Parallel Jacobi Method: Assume we have a mesh of n x n processors and assign u_ij to processor p_ij.

The SPMD Jacobi program at processor p_ij:

Repeat
  Collect data from the four neighbors: u_{i-1,j}, u_{i+1,j}, u_{i,j-1}, u_{i,j+1} from p_{i-1,j}, p_{i+1,j}, p_{i,j-1}, p_{i,j+1}.
  unew_ij = 0.25 * ( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} );
  diff_ij = | unew_ij - u_ij |;
  Do a global reduction to get the maximum of diff_ij as M.
Until M <= ε.

Performance evaluation�

� Each computation step takes � � operations for each grid point�� There are � data items to be received� The size of each item is � unit� Assume sequential receiving and thencommunication costs ��� �� for these � items�

� Assume that the global reduction takes � � �� logn�

� The sequential time Seq � Kn� where K is the number of steps for convergence�

� Assume � ���� � � ���� � � ���� n � ���� p� � �����

The parallel time PT � K � � � logn��� ����

Speedup � � n�

� � � logn�� � ��� ����

E!ciency �Speedup

n�� ���&�


In practice, the number of processors is much less than n^2. The above analysis also shows that it does not make sense to utilize n^2 processors, since the code suffers too much communication overhead for such fine-grain computation. To increase the granularity of computation, we map the n x n grid to the processors using the 2D block method.

Grid partitioning: Assume that the grid is mapped to a p x p processor mesh where p << n. Let γ = n/p, so that each processor owns a γ x γ block of grid points.

Code partitioning: Rewrite the kernel of the sequential code using the loop blocking and loop interchange techniques as:

For bi = 1 to p
  For bj = 1 to p
    For i = (bi-1)*γ + 1 to bi*γ
      For j = (bj-1)*γ + 1 to bj*γ
        unew_ij = 0.25 * ( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} );
      EndFor EndFor EndFor EndFor

Parallel SPMD code on processor p_{bi,bj}:

Repeat
  Collect the boundary data from its four neighbor processors.
  For i = (bi-1)*γ + 1 to bi*γ
    For j = (bj-1)*γ + 1 to bj*γ
      unew_ij = 0.25 * ( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} );
    EndFor EndFor
  Compute the local maximum diff_{bi,bj} of the difference between old and new values.
  Do a global reduction to get the maximum of diff_{bi,bj} as M.
Until M <= ε.

Performance evaluation�

� At each processor� each computation step takes � operations�� The communication cost for each step on each processor is ��� ���

� Assume that the global reduction takes � � �� log p�

� The number of steps for convergence is K�� Assume � ���� � � ���� � � ���� n � ���� � ���� p� � ���

PT � K� � � � log p��� r���


Speedup ��p�

� � � � log p�� � ��� �����

E!ciency � ��&�

Red-Black Ordering: Reordering the variables can eliminate most of the data dependence in the Gauss-Seidel algorithm for this PDE problem.

[Figure: the grid points are colored in a checkerboard pattern of red and black points.]

- Points are divided into "red" points and "black" points.
- First, the black points are computed using the old red point values.
- Second, the red points are computed using the new black point values.

Parallel code for red-black ordering:

- Point (i,j) is black if i+j is even.
- Point (i,j) is red if i+j is odd.
- Computation on the black points (stage 1) can be done in parallel.
- Computation on the red points (stage 2) can be done in parallel.

The kernel code:

1. For all points (i,j) with (i+j) mod 2 == 0, do in parallel:
   u_ij = 0.25 * ( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} );
2. For all points (i,j) with (i+j) mod 2 == 1, do in parallel:
   u_ij = 0.25 * ( u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} );
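A sequential C sketch of one red-black sweep over an (n+2) x (n+2) array with fixed boundary values is shown below; the two color passes correspond to the two parallel stages above, and the names are illustrative.

/* One red-black Gauss-Seidel sweep: update black points (i+j even),
   then red points (i+j odd).  u is (n+2) x (n+2), row-major, with the
   boundary rows and columns holding the fixed temperatures. */
void red_black_sweep(int n, double *u)
{
    int stride = n + 2;
    for (int color = 0; color < 2; color++)          /* 0 = black, 1 = red */
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                if ((i + j) % 2 == color)
                    u[i * stride + j] = 0.25 * (u[(i - 1) * stride + j] +
                                                u[(i + 1) * stride + j] +
                                                u[i * stride + j - 1] +
                                                u[i * stride + j + 1]);
}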

A  MPI Parallel Programming

References: (1) the MPI standard (HTML) at www.mcs.anl.gov/mpi; (2) an MPI tutorial (PS file) available on the course homepage; (3) the online MPI book (HTML) at www.netlib.org/utk/papers/mpi-book/mpi-book.html.

MPI establishes a widely used, language- and platform-independent standard for writing message-passing programs.

Six MPI basic library functions are:


- MPI_Init() : Initiate an MPI computation.
- MPI_Finalize() : Terminate a computation.
- MPI_Comm_size() : Determine the number of processes.
- MPI_Comm_rank() : Determine my process identifier.
- MPI_Send() : Send a message.
- MPI_Recv() : Receive a message.

For MPI programming, the concept "process" is about the same as "processor" in this course. A communicator indicates the process group in which communication occurs; we use MPI_COMM_WORLD for this course.

Example: hello.c

Example� hello�c

�include �stdio�h�

�include �mpi�h�

main�int argc� char argv� �

int my�rank�

MPI�Init� argc� argv�

MPI�Comm�rank�MPI�COMM�WORLD� my�rank�

printf��Hello from node !d"n�� my�rank�

MPI�Finalize��� Shut down MPI �

The result�

Hello from node �

Hello from node �

Hello from node �

Hello from node �

Basic functions:

- Blocking send

  int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

  - buf: initial address of the send buffer
  - count: number of elements in the send buffer (nonnegative integer)
  - datatype: datatype of each send buffer element
  - dest: rank of the destination
  - tag: message tag
  - comm: communicator

  It does not return until the message has been safely stored away, so that the sender is free to overwrite the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.


- Blocking receive

  int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

  - buf: initial address of the receive buffer
  - count: number of elements in the receive buffer
  - datatype: datatype of each receive buffer element
  - source: rank of the source
  - tag: message tag
  - comm: communicator
  - status: the status of this communication

  The wildcards MPI_ANY_SOURCE and MPI_ANY_TAG indicate that any source and/or tag are acceptable.

- MPI datatypes: MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE.

� Actions of MPI Send�

Network

Send Receive

User data

SystemBuffer

SystemBuffer

User space

Proc �� MPI Send �data�������� MPI CHAR��� ��MPI COMM WORLD��

� Retrieve data#�$ data#�$ ��� data#���$�

� Copy this msg with name�� to the local system bu�er or directly delivercopy to a bu�er at processor ��

� Actions of MPI Recv�

Proc �� MPI Recv �data���� ���� MPI CHAR�����MPI COMM WORLD� �status��

� Wait until a message from processor � with name�� arrives at the local bu�er�

� Then copy the msg to data�� � data�� ���� �

Proc �� MPI Recv �data���� ���� MPI CHAR�MPI ANY SOURCE� MPI ANY TAG�MPI COMM WORL

�status�

� Wait until any msg arrives at the bu�er�

� Copy the msg to data�� � data�� �����

Example greetings�c�� Send a message from all processes with rank �� � to process �� Process � prints the messagesreceived�


�include �stdio�h�

�include �string�h�

�include �mpi�h�

main�int argc� char argv� �

int my�rank� �rank of process �

int p� �number of processes�

int source� �rank of sender �

int dest� �rank of receiver �

int tag � �� �tag for messages �

char message���� ��storage for message�

MPI�Status status� �return status for �

�receive �

MPI�Init� argc� argv�

MPI�Comm�rank�MPI�COMM�WORLD� my�rank�

MPI�Comm�size�MPI�COMM�WORLD� p�

if �my�rank �� � �

sprintf�message� �Greetings from process !d���

my�rank�

dest � ��

MPI�Send�message� strlen�message��� MPI�CHAR� dest� tag� MPI�COMM�WORLD�

�else �

for �source � �� source � p� source�� �

MPI�Recv�message� ���� MPI�CHAR� source� tag� MPI�COMM�WORLD� status�

printf��!s"n�� message�

MPI�Finalize��

An example for numerical integration: the trapezoidal rule.

The integral of f over [a, b] is approximated by

  integral_a^b f(x) dx ≈ ( f(a) + f(b) ) / 2 * (b - a).

If the interval [a, b] is divided into n subintervals with

  x_i = a + i*h for 0 <= i <= n,   h = (b - a)/n,

then

  integral_a^b f(x) dx ≈ sum_{i=0}^{n-1} ( f(x_i) + f(x_{i+1}) ) / 2 * h
                       = [ f(a)/2 + f(b)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) ] * h.

Sequential code

float Trap(float a, float b, int n, float h) {
    float integral, x;
    int i;

    integral = (f(a) + f(b)) / 2.0;
    x = a;
    for (i = 1; i <= n - 1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral * h;
    return integral;
}

Parallel numerical integration�

� �� Each process calculates �its� interval of integration�

� �� Each process estimates the integral of fx� over its interval�

� � Each process �� �� sends its integral to process ��

� � Process � sums the calculations received from the individual processes and prints the result�

The MPI code

MPI�Comm�rank�MPI�COMM�WORLD� my�rank�

MPI�Comm�size�MPI�COMM�WORLD� p�

� Local computation�

h � �b�a�n�

local�n � n�p�

local�a � a � my�ranklocal�nh�

local�b � local�a � local�nh�

integral � Trap�local�a� local�b� local�n� h�

�Add up the integrals from each process�

if �my�rank �� � �

total � integral�

for �source � �� source � p� source�� �

MPI�Recv� integral� �� MPI�FLOAT� source� tag� MPI�COMM�WORLD� status�

total � total � integral�

� else �

MPI�Send� integral� �� MPI�FLOAT� dest� tag� MPI�COMM�WORLD�

Other issues

� Parallel IO issues� Given SPMD code�

scanf��!f !f !d�� a�ptr� b�ptr� n�ptr�

We run this code on two nodes� and type � � � ������ What happens� Many possibilities�

� Both processes get the data�

� One process gets the data�

� Process � gets �� �� and Process � gets �������


There is no consensus yet in parallel computing world� Solution� Have one process do IO�

� IO in the trapezoidal code�

� �� Process � reads the input�

� �� Process � sends input values to other processes�

if �my�rank �� ��

printf��Enter a� b� and n"n��

scanf��!f !f !d�� a� b� n�

for �dest � �� dest � p� dest���

tag � ��

MPI�Send� a� �� MPI�FLOAT� dest� tag� MPI�COMM�WORLD�

tag � ��

MPI�Send� b� �� MPI�FLOAT� dest� tag� MPI�COMM�WORLD�

tag � ��

MPI�Send� n� �� MPI�INT� dest� tag� MPI�COMM�WORLD�

� else �

tag � ��

MPI�Recv� a� �� MPI�FLOAT� source� tag� MPI�COMM�WORLD� status�

tag � ��

MPI�Recv� b� �� MPI�FLOAT� source� tag� MPI�COMM�WORLD� status�

tag � ��

MPI�Recv� n� �� MPI�INT� source� tag� MPI�COMM�WORLD� status�

� Group data for communication� Can we send a� b� and n together in one message�

struct�

float a�part�

float b�part�

int n�part�

� z�

z�a�part�a�

z�b�part�b�

z�n�part�n�

if �my�rank �� ��

for �dest � �� dest � p� dest��

MPI�Send� z� sizeof�z� MPI�CHAR� dest� �� MPI�COMM�WORLD�

�else

MPI�Recv� z� sizeof�z� MPI�CHAR� source� �� MPI�COMM�WORLD� status�

Collective Communication� Global operations over a group of processes�

� MPI Barrier comm� Each process in the comm blocks until every process in comm has called it�


� MPI Bcast bu�er� count� datatype� root� comm� Broadcast from one member of the group to all other

members�

� MPI Reduce operand� result� count� datatype� operator� root� comm� Global reduction operations

such as max� min� sum� product� and min and max operations�

� MPI Gather sendbuf� sendcount� sendtype� recvbuf� recvcount� recvtype� root� comm� Gather

data from all group members to one process�

� MPI Scatter sendbuf� sendcount� sendtype� recvbuf� recvcount� recvtype� root� comm�� Scatter

data from one group member to all other members�

An example of MPI Bcast �� In the trapezoidal code� Process � prompts user for input and reads in the values�

Then it broadcasts input values to other processes�

if �my�rank �� � �

printf��Enter a� b� and n"n��

scanf��!f !f !d�� a� b� n�

MPI�Bcast� a� �� MPI�FLOAT� �� MPI�COMM�WORLD�

MPI�Bcast� b� �� MPI�FLOAT� �� MPI�COMM�WORLD�

MPI�Bcast� n� �� MPI�INT� �� MPI�COMM�WORLD�

An example of MPI Reduce In the trapezoidal code� use MPI Reduce to do global summation for adding up the

integrals calculated by each process�

MPI�Reduce� integral� total� �� MPI�FLOAT� MPI�SUM� �� MPI�COMM�WORLD�

MPI Gather ��

int MPI�Gather�void sendbuf� int sendcount� MPI�Datatype sendtype� void recvbuf�

int recvcount� MPI�Datatype recvtype� int root� MPI�Comm comm

� sendbuf � starting address of send bu�er

� sendcount � number of elements in send bu�er

� sendtype � data type of send bu�er elements

� recvbuf � address of receive bu�er signi�cant only at root�

� recvcount � number of elements for any single receive signi�cant only at root�

� recvtype � data type of recv bu�er elements signi�cant only at root�

� root � rank of receiving process

� comm � communicator�

An example in matrix�vector Multiplication� If x and y are distributed in a �D block manner� each processor

needs to gather the entire x from all other processors�


Proc 0

Proc 1

Proc 2

Proc 3

A x y

* =

float local�x� � �local storage for x�

float global�x� � �storage for all of x�

MPI�Gather�local�x� n�p� MPI�FLOAT� global�x� n�p� MPI�FLOAT� �� MPI�COMM�WORLD�

MPI All gather ��

int MPI�All�gather�void sendbuf� int sendcount� MPI�Datatype sendtype�

void recvbuf� int recvcount� MPI�Datatype recvtype� MPI�Comm comm

� sendbuf � starting address of send bu�er

� sendcount � number of elements in send bu�er

� sendtype � data type of send bu�er elements

� recvbuf � address of receive bu�er signi�cant only at root�

� recvcount � number of elements for any single receive signi�cant only at root�

� recvtype � data type of recv bu�er elements signi�cant only at root�

� comm � communicator�

An example

For �root��� root�p� root��

MPI�Gather�local�x� n�p� MPI�FLOAT� global�x� n�p� MPI�FLOAT� root� MPI�COMM�WORLD�

It is the same as�

MPI�All�gather�local�x� n�p� MPI�FLOAT� global�x� n�p� MPI�FLOAT� MPI�COMM�WORLD�

Sample MPI code for y�Ax�

void Parallel�matrix�vector�prod�

LOCAL�MATRIX�T local�A

int m

int n

float local�x�

float global�x�

float local�y�


int local�m

int local�n �

� local�m � n�p� local�n � n�p �

int i� j�

MPI�Allgather�local�x� local�n� MPI�FLOAT� global�x� local�n� MPI�FLOAT� MPI�COMM�WORLD�

for �i � �� i � local�m� i�� �

local�y�i � ����

for �j � �� j � n� j��

local�y�i � local�y�i �

local�A�i �j global�x�j �

B  Pthreads

References: online material at www.cs.ucsb.edu/~tyang/class/pthreads.

B.1  Introduction

Pthreads can be used to exploit parallelism on shared-memory multiprocessor machines and to implement concurrent processing on sequential machines. Conceptually, threads execute concurrently; this is the best way to reason about the behavior of threads. In practice, however, a machine has only a finite number of processors and cannot run all of the runnable threads at once, so it must multiplex the runnable threads onto that finite number of processors.

Pthreads is the POSIX 1003.1c thread standard put out by the IEEE standards committee. This standard received approval from the International Standards Organization (ISO) committee and the IEEE Standards Board in 1995. Pthreads implementations are available on commercial UNIX platforms. The NT platform supports another type of thread programming; however, its basic concepts are similar to Pthreads. We discuss a number of basic pthread routines below; you can get the detailed specification using the Unix man command.

Notice that you should put "#include <pthread.h>" at the beginning of your program file, and you should add "-lpthread" when linking the code. The course Web page provides sample pthread code and the makefile.

B�� Thread creation and manipulation

� pthread create �

int pthread�create�pthread�t thread� const pthread�attr�t attr�

void �start�routine�void � void arg�

The pthread create� routine creates a new thread within a process� The new thread starts in the start routine

start routine which has a start argument arg� The new thread has attributes speci�ed with attr� or default

attributes if attr is NULL�

If the pthread create� routine succeeds it will return � and put the new thread id into thread� otherwise an error

number shall be returned indicating the error�


� pthread attr init �

int pthread�attr�init�pthread�attr�t attr�

This routine creates and initializes an attribute variable e�g� scheduling policy� for the use with pthread create��

If the pthread attr init� routine succeeds it will return � and put the new attribute variable id into attr� otherwise

an error number shall be returned indicating the error�

� pthread equal �

int pthread�equal�pthread�t thread��� pthread�t thread���

The pthread equal� routine compares the thread ids thread � and thread � and returns a non � value if the ids

represent the same thread otherwise � is returned�

� pthread exit �

void pthread�exit�void status�

The pthread exit� routine terminates the currently running thread and makes status available to the thread that

successfully joins pthread join��� with the terminating thread�

An implicit call to pthread exit� is made if any thread� other than the thread in which main� was �rst called�

returns from the start routine speci�ed in pthread create�� The return value takes the place of status�

The process exits as if exit� was called with a status of � if the terminating thread is the last thread in the process�

The pthread exit� routine cannot return�

� pthread join �

int pthread�join�pthread�t thread� void status�

If the target thread thread is not detached and there are no other threads joined with the speci�ed thread then

the pthread join� function suspends execution of the current thread and waits for the target thread to terminate�

Otherwise the results are unde�ned�

On a successful call pthread join� will return �� and if status is non NULL then status will point to the status

argument of pthread exit�� On failure pthread join� will return an error number indicating the error�

� pthread self �

pthread�t pthread�self�void�

The pthread self� routine returns the thread id of the calling thread�

An example of spawning threads for a hello program is shown below�

p�hello�c �� a hello program �in pthread

�include �stdio�h�

�include �sys�types�h�


�include �pthread�h�

�define MAX�THREAD ����

typedef struct �

int id�

� parm�

void hello�void arg �

parm p��parm arg�

printf��Hello from node !d"n�� p��id�

return �NULL�

void main�int argc� char argv� �

int n�i�

pthread�t threads�

parm p�

if �argc �� � �

printf ��Usage� !s n"n where n is no� of threads"n��argv�� �

exit���

n�atoi�argv�� �

if ��n � � ## �n � MAX�THREAD �

printf ��The no of thread should between � and !d�"n��MAX�THREAD�

exit���

threads��pthread�t malloc�nsizeof�threads�

p��parm malloc�sizeof�parmn�

� Start up thread �

for �i��� i�n� i�� �

p�i �id�i�

pthread�create� threads�i � NULL� hello� �void �p�i�

� Synchronize the completion of each thread� �

for �i��� i�n� i�� pthread�join�threads�i �NULL�

free�p�

B�� The Synchronization Routines� Lock

A lock is used to synchronize threads that access a shared data area� For example� if two threads that increment a

variable�

int a � ��

void sum�int p �

a���

printf��!d � !d"n�� p� a�

void main� �

Spawn a thread executing sum� with argument ��

Spawn a thread executing sum� with argument ��


Possible trace�

� � � � � �

� � � � � �

� � � � � �

� � � � � �

So because of concurrent modi�cation of a shared variable� the results can be nondeterministic you may get di�erent

results when you run the program more than once� So� it can be very di!cult to reproduce bugs� Nondeterministic

execution is one of the things that makes writing parallel programs much more di!cult than writing serial programs�

Lock can be used to achieve mutual exclusion when accessing a shared data area� in order to avoid the above unexpected

results� Normally two operations are supplied in using a lock�

� Acquire the lock� Atomically waits until the lock state is unlocked� then sets the lock state to locked�

� Release the lock� Atomically changes the lock state to unlocked from locked�

The following code is correct after adding a lock�

int a � ��

void sum�int p �

Acquire the lock�

a���

Release the lock�

printf��!d � !d"n�� p� a�

void main� �

Spawn a thread executing sum� with argument ��

Spawn a thread executing sum� with argument ��

In Pthreads� a lock is called mutex� We list related functions as follows�

� pthread mutex init �

int pthread�mutex�init�pthread�mutex�t mutex� const pthread�mutex�attr attr�

The pthread mutex init� routine creates a new mutex� with attributes speci�ed with attr� or default attributes if

attr is NULL�

If the pthread mutex init� routine succeeds it will return � and put the new mutex id into mutex� otherwise an

error number shall be returned indicating the error�

� pthread mutex destroy �

int pthread�mutex�destroy�pthread�mutex�t mutex�


The pthread mutex destroy� routine destroys the mutex speci�ed by mutex�

If the pthread mutex destroy� routine succeeds it will return �� otherwise an error number shall be returned

indicating the error�

� pthread mutex lock �

int pthread�mutex�lock�pthread�mutex�t mutex�

The pthread mutex lock� routine shall lock the mutex speci�ed by mutex� If the mutex is already locked the

calling thread blocks until the mutex becomes available�

If the pthread mutex lock� routine succeeds it will return �� otherwise an error number shall be returned indicating

the error�

� pthread mutex trylock �

int pthread�mutex�trylock�pthread�mutex�t mutex�

The pthread mutex trylock� routine shall lock the mutex speci�ed by mutex and return �� otherwise an error

number shall be returned indicating the error� In all cases the pthread mutex trylock� routine will not block the

current running thread�

� pthread mutex unlock �

int pthread�mutex�unlock�pthread�mutex�t mutex�

If the current thread is the owner of the mutex speci�ed by mutex� then the pthread mutex unlock� routine shall

unlock the mutex� If there are any threads blocked waiting for the mutex� the scheduler will determine which

thread obtains the lock on the mutex� otherwise the mutex is available to the next thread that calls the routine

pthread mutex lock�� or pthread mutex trylock��

If the pthread mutex unlock� routine succeeds it will return �� otherwise an error number shall be returned

indicating the error�

Example�

� greetings�c �� greetings program

Send a message from all processes with rank �� � to process �� Process �

prints the messages received�

Input� none� Output� contents of messages received by process ��

See Chapter �� pp� �� ff in PPMPI� �

�include �stdio�h�

�include �string�h�

�include �sys�types�h�

�include �pthread�h�

�define MAX�THREAD ����

typedef struct �

int id�

int nproc�


� parm�

char message���� � � storage for message �

pthread�mutex�t msg�mutex � PTHREAD�MUTEX�INITIALIZER�

int token � ��

void greeting�void arg �

parm p � �parm arg�

int id � p��id�

int i�

if �id �� � �

� Create message �

while �� �

pthread�mutex�lock� msg�mutex�

if �token �� � �

sprintf�message� �Greetings from process !d��� id�

token���

pthread�mutex�unlock� msg�mutex�

break�

pthread�mutex�unlock� msg�mutex�

usleep���

� Use strlen�� so that $"�$ gets transmitted �

� else � � my�rank �� � �

for �i � �� i � p��nproc� i�� �

while �� �

pthread�mutex�lock� msg�mutex�

if �token �� � �

printf��!s"n�� message�

token���

pthread�mutex�unlock� msg�mutex�

break�

pthread�mutex�unlock� msg�mutex�

usleep���

void main�int argc� char argv� �

int my�rank� � rank of process �

int dest� � rank of receiver �

int tag � �� � tag for messages �

pthread�t threads�

parm p�

int n� i�

��

Page 71: Lecture Notes on Parallel Scientific Computing

if �argc �� � �

printf��Usage� !s n"n where n is no� of thread"n�� argv�� �

exit���

n � atoi�argv�� �

if ��n � � ## �n � MAX�THREAD�

printf��The no of thread should between � and !d�"n�� MAX�THREAD�

exit���

threads � �pthread�t malloc�n sizeof�threads�

p��parm malloc�sizeof�parmn�

� Start up thread �

for �i � �� i � n� i�� �

p�i �id � i�

p�i �nproc � n�

pthread�create� threads�i � NULL� greeting� �void �p�i�

� Synchronize the completion of each thread� �

for �i � �� i � n� i�� pthread�join�threads�i � NULL�

free�p�

� � main �
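As a usage sketch (the source file name is an assumption), the program can be built and run with something like

    cc -o greetings greetings.c -lpthread
    ./greetings 4

which spawns four threads; thread 0 then prints three greeting lines. Because whichever nonzero thread happens to grab the token first writes its message first, the order of the greetings can change from run to run, which is exactly the nondeterminism discussed earlier.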

B.3 Condition Variables

The greeting example above shows that each thread keeps looping to check whether it can write (or read) the message. This is bad because the polling loop wastes CPU resources. There is another synchronization abstraction, called condition variables, designed for exactly this kind of situation. Typical operations provided are:

• Wait(Lock *l): Atomically releases the lock and waits. When Wait returns, the lock will have been reacquired.

• Signal(Lock *l): Enables one of the waiting threads to run. When Signal returns, the lock is still acquired.

• Broadcast(Lock *l): Enables all of the waiting threads to run. When Broadcast returns, the lock is still acquired.

Typically, you associate a lock and a condition variable with a data structure. Before the program performs an operation on the data structure, it acquires the lock. If it has to wait before it can perform the operation, it uses the condition variable to wait for another operation to bring the data structure into a state where the operation can proceed. In some cases you need more than one condition variable.

The functions for Pthread condition variables are:

• pthread_cond_init:

    int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);

  The pthread_cond_init() routine creates a new condition variable, with the attributes specified by attr, or default attributes if attr is NULL.

  If the pthread_cond_init() routine succeeds it will return 0 and initialize the condition variable pointed to by cond; otherwise an error number shall be returned indicating the error.


• pthread_cond_destroy:

    int pthread_cond_destroy(pthread_cond_t *cond);

  The pthread_cond_destroy() routine destroys the condition variable specified by cond.

  If the pthread_cond_destroy() routine succeeds it will return 0; otherwise an error number shall be returned indicating the error.

• pthread_cond_wait:

    int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);

  This routine atomically blocks the current thread on the condition variable specified by cond and unlocks the mutex specified by mutex. The waiting thread unblocks only after another thread calls pthread_cond_signal() or pthread_cond_broadcast() on the same condition variable, and the current thread reacquires the lock on the mutex.

  If the pthread_cond_wait() routine succeeds it will return 0, and the mutex specified by mutex will be locked and owned by the current thread; otherwise an error number shall be returned indicating the error.

• pthread_cond_timedwait:

    int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex,
                               const struct timespec *abstime);

  This routine blocks in the same manner as pthread_cond_wait(). The waiting thread unblocks under the same conditions and returns 0, or when the system time reaches or exceeds the time specified by abstime, in which case ETIMEDOUT is returned. In either case the thread reacquires the mutex specified by mutex.

• pthread_cond_signal:

    int pthread_cond_signal(pthread_cond_t *cond);

  This routine unblocks ONE thread blocked waiting on the condition variable specified by cond. The scheduler determines which thread is unblocked.

  If the pthread_cond_signal() routine succeeds it will return 0; otherwise an error number shall be returned indicating the error.

• pthread_cond_broadcast:

    int pthread_cond_broadcast(pthread_cond_t *cond);

  The pthread_cond_broadcast() routine unblocks ALL threads blocked waiting on the condition variable specified by cond.

  If the pthread_cond_broadcast() routine succeeds it will return 0; otherwise an error number shall be returned indicating the error.
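Before the barrier example below, here is a minimal sketch (not part of the original notes) of how the token hand-off in the greeting program could be rewritten with a condition variable. The helper names put_message/get_message and the single-writer main() are illustrative assumptions.

#include <stdio.h>
#include <pthread.h>

char message[100];
int token = 0;                              /* 0: slot empty, 1: message ready */
pthread_mutex_t msg_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  msg_cond  = PTHREAD_COND_INITIALIZER;

/* Writer side: wait until the slot is empty, then deposit a message. */
void put_message(int id)
{
    pthread_mutex_lock(&msg_mutex);
    while (token != 0)                      /* replaces the unlock/usleep/retry loop */
        pthread_cond_wait(&msg_cond, &msg_mutex);
    sprintf(message, "Greetings from process %d!", id);
    token = 1;
    pthread_cond_broadcast(&msg_cond);      /* wake the reader (and other writers) */
    pthread_mutex_unlock(&msg_mutex);
}

/* Reader side: wait until a message is present, then print it and empty the slot. */
void get_message(void)
{
    pthread_mutex_lock(&msg_mutex);
    while (token != 1)
        pthread_cond_wait(&msg_cond, &msg_mutex);
    printf("%s\n", message);
    token = 0;
    pthread_cond_broadcast(&msg_cond);      /* let the next writer proceed */
    pthread_mutex_unlock(&msg_mutex);
}

void *writer(void *arg) { put_message(*(int *) arg); return NULL; }
void *reader(void *arg) { (void) arg; get_message(); return NULL; }

int main(void)
{
    pthread_t tw, tr;
    int id = 1;

    pthread_create(&tr, NULL, reader, NULL);
    pthread_create(&tw, NULL, writer, &id);
    pthread_join(tw, NULL);
    pthread_join(tr, NULL);
    return 0;
}

A thread blocked in pthread_cond_wait() consumes no CPU time until it is signaled, so the usleep() polling loop is no longer needed. Note the while loop around pthread_cond_wait(): the condition must be re-checked after waking, since the wait may return spuriously or another thread may have changed token first.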

Example:


/* The Pi program to compute pi. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <sys/types.h>
#include <pthread.h>
#include <sys/time.h>

#define MAX_THREAD 32

typedef struct {
    int id;
    int noproc;
    int dim;
} parm;

typedef struct {
    int cur_count;
    pthread_mutex_t barrier_mutex;
    pthread_cond_t  barrier_cond;
} barrier_t;

barrier_t barrier1;
double *finals;
int rootn;

/* barrier: the current implementation can only be called once. */
void barrier_init(barrier_t *mybarrier)
{
    /* must run before spawning the threads */
    pthread_mutex_init(&(mybarrier->barrier_mutex), NULL);
    pthread_cond_init(&(mybarrier->barrier_cond), NULL);
    mybarrier->cur_count = 0;
}

void barrier(int numproc, barrier_t *mybarrier)
{
    pthread_mutex_lock(&(mybarrier->barrier_mutex));
    mybarrier->cur_count = (mybarrier->cur_count + 1) % numproc;
    if (mybarrier->cur_count != 0) {
        do {
            pthread_cond_wait(&(mybarrier->barrier_cond),
                              &(mybarrier->barrier_mutex));
        } while (mybarrier->cur_count != 0);
    } else {
        pthread_cond_broadcast(&(mybarrier->barrier_cond));
    }
    pthread_mutex_unlock(&(mybarrier->barrier_mutex));
}

double f(double a)
{
    return (4.0 / (1.0 + a * a));
}

void *cpi(void *arg)
{
    parm *p = (parm *) arg;
    int myid = p->id;
    int numprocs = p->noproc;
    int i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;
    double startwtime, endwtime;

    if (myid == 0) startwtime = clock();
    if (rootn == 0)
        finals[myid] = 0;
    else {
        h = 1.0 / (double) rootn;
        sum = 0.0;
        for (i = myid + 1; i <= rootn; i += numprocs) {
            x = h * ((double) i - 0.5);
            sum += f(x);
        }
        mypi = h * sum;
        finals[myid] = mypi;
    }

    barrier(numprocs, &barrier1);

    if (myid == 0) {
        pi = 0.0;
        for (i = 0; i < numprocs; i++) pi += finals[i];
        endwtime = clock();
        printf("pi is approximately %.16f, Error is %.16f\n",
               pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n",
               (endwtime - startwtime) / CLOCKS_PER_SEC);
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i, rc;
    double startwtime, endwtime;
    parm *arg;
    pthread_t *threads;

    if (argc != 2) {
        printf("Usage: %s n\n  where n is no. of threads\n", argv[0]);
        exit(1);
    }
    n = atoi(argv[1]);
    if ((n < 1) || (n > MAX_THREAD)) {
        printf("The no of threads should be between 1 and %d.\n", MAX_THREAD);
        exit(1);
    }
    threads = (pthread_t *) malloc(n * sizeof(*threads));

    /* setup barrier */
    barrier_init(&barrier1);

    /* allocate space for final result */
    finals = (double *) malloc(n * sizeof(double));
    rootn = 10000000;

    arg = (parm *) malloc(sizeof(parm) * n);

    /* Spawn the threads */
    for (i = 0; i < n; i++) {
        arg[i].id = i;
        arg[i].noproc = n;
        pthread_create(&threads[i], NULL, cpi, (void *) (arg + i));
    }

    /* Synchronize the completion of each thread. */
    for (i = 0; i < n; i++) pthread_join(threads[i], NULL);
    free(arg);
    return 0;
}

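As a usage note (the source file name and thread count are illustrative), the program can be built and run with something like

    cc -o pi pi.c -lpthread -lm
    ./pi 4

Each thread sums every numprocs-th rectangle of the midpoint rule for the integral of 4/(1+x*x) over [0,1], stores its partial sum in finals[myid], and thread 0 adds up the partial sums after all threads have passed the barrier. The barrier itself follows the lock-plus-condition-variable pattern from the previous subsection: the last thread to arrive makes cur_count wrap back to 0 and broadcasts, while every earlier arrival waits in pthread_cond_wait() inside a loop until that happens.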