Source: people.maths.ox.ac.uk/gilesm/talks/nicf06.pdf

Numerically Intensive Computing in Finance

Lecture 1: Introduction and Prototype Applications

Mike Giles


Objectives

This course is motivated by the fact that large-scale computations are becoming a standard part of mathematical finance.

By the end of this course you should have:
• some understanding of computer hardware, and the trends for the future;
• a good understanding of the different kinds of parallel computing;
• an understanding of the different kinds of parallelism inherent in financial applications;
• some practical experience with parallel codes!


Course Structure

Week 8: MM&SC and ACM MSc’s, and OSC users and others within Oxford University

Lectures in the mornings, Monday-Thursday:
• 9:30 – 10:30
• 10:45 – 11:45
• 12:00 – 13:00

Practicals in the afternoons, Monday-Friday, 2-6. There is no need to be present all the time, just work at your own pace to complete assignments.

For those doing the course as a Special Topic there will be additional projects afterwards.


Course Structure

Week 9: MSc in Mathematical Finance (module 10)

Lectures in the mornings, Tuesday-Friday:
• 9:00 – 10:00
• 10:15 – 11:15
• 11:30 – 12:30

Practicals in the afternoons, Tuesday-Thursday, 1:30-6. There is no need to be present all the time, just work at your own pace to complete practicals 1-4.


Lecture Outline

Day 1: Introduction
1 Two prototype problems: Monte-Carlo and Black-Scholes financial applications
2 The “big picture” overview of high performance computing
3 Distributed resource management and web services

Day 2: Shared-memory Parallelism
4 Processor and memory technology
5 Shared-memory multiprocessors
6 OpenMP multi-threaded computing


Lecture Outline

Day 3: Distributed-memory parallelism
7 Distributed-memory systems
8 BSP model of distributed computing, and parallelisation of explicit approximations
9 Introduction to MPI message passing

Day 4: Distributed-memory applications
10 Parallelisation of explicit approximations
11 Parallelisation of implicit approximations
12 More on MPI


Practicals

The practicals are a very important part of the course. Anyone taking the course for credit as part of one of the MSc’s must complete practicals 1-4 and hand in a write-up showing that they have gone through all of the exercises.
• Using NAG libraries, Grid Engine and web services for parallel Monte-Carlo calculations
• Using OpenMP multithreading for an explicit finite difference Black-Scholes discretisation
• An introduction to MPI message passing, including for Monte-Carlo calculations
• Using MPI for an explicit B-S FD method
• Using MPI for an implicit B-S FD method


Reading Material

Hardware
• Web references for current hardware (see links from course webpage)
• John L. Hennessy and David A. Patterson, Computer Architecture: a Quantitative Approach, 3rd edition, Morgan Kaufmann, 2003.


Reading Material

Software
• R. Chandra et al, Parallel Programming in OpenMP, Morgan Kaufmann, 2001.
• W. Gropp, E. Lusk and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface (second edition), MIT Press, 2000.


Reading Material

Mathematical Finance
• Lecture notes for MSc courses
• P. Wilmott, S. D. Howison and J. Dewynne, Mathematics of Financial Derivatives, CUP, 1995.
• D. Duffy, Finite Difference Methods in Financial Engineering: A Partial Differential Equation Approach, John Wiley and Sons, 2006.
• P. Glasserman, Monte Carlo Methods in Financial Engineering, Springer, 2004.


Monte Carlo Model Problem

Stochastic differential models in mathematical finance have the form

dS = a(S, t) dt + b(S, t) dW

where S, a are vectors, b is a matrix, and dW is an increment of a vector Wiener path with correlation Σ(S, t).

These are to be solved subject to some initial conditions at time t = 0, and the aim is to determine the expected (discounted) value of a payoff function of the state at final time t = T.



In Monte-Carlo simulations, the expected value is estimated by averaging the values obtained by doing lots of different path calculations with different random inputs.

Using Forward Euler time discretisation, each path is calculated from

$$S^{n+1} = S^n + a(S^n, t_n)\,\Delta t + b(S^n, t_n)\,\Delta W^n$$

where $\Delta W^n$ is a vector of normally distributed random variables with zero mean, variance $\Delta t$ and correlation $\Sigma(S^n, t_n)$.



Because of the Central Limit Theorem, the error in the estimated value is proportional to $N^{-1/2}$, where N is the number of paths calculated.

There are various techniques for reducing the constant of proportionality (co-variate variables, variance reduction) and improving the exponent (quasi-random sequences), but for the purposes of this course these are not important.
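As a worked illustration of the $N^{-1/2}$ scaling, assume the payoff of a single path has standard deviation $\sigma_P$ (a symbol introduced here for illustration, not taken from the slides). The error of the average then behaves roughly as follows:

```latex
\[
\text{error}(N) \approx \frac{\sigma_P}{\sqrt{N}}
\quad\Longrightarrow\quad
\frac{\text{error}(4N)}{\text{error}(N)} = \frac{1}{2},
\qquad
\frac{\text{error}(100N)}{\text{error}(N)} = \frac{1}{10}.
\]
```

So halving the error needs about four times as many paths, and one extra decimal digit of accuracy needs about a hundred times as many, which is why the path count grows large so quickly.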



What is important?
• Each path calculation is entirely independent – “trivially parallel”, just run a number of paths on each machine in a “cluster” or “farm” and average the output
• Need to generate lots of random numbers
• Each path needs a completely independent set of random numbers



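Our two-asset model problem is

$$dS_1 = r\,S_1\,dt + \sigma\,S_1\,dW_1$$

$$dS_2 = r\,S_2\,dt + \sigma\,S_2\,dW_2$$

with correlation matrix

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

The initial conditions at t = 0 are $S_1 = S_2 = 1$ and the discounted payoff at t = 1 is

$$P(S_1, S_2) = \begin{cases} e^{-r}, & \max(|S_1 - 1|,\, |S_2 - 1|) < 0.1 \\ 0, & \text{otherwise} \end{cases}$$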


Random Number Generation

• use standard numerical libraries, so don’t need to know how they’re generated
• uniformly distributed random numbers on [0,1] are generated by a recurrence relation
• converted into Normally distributed variables with zero mean and unit variance through:
  – Box-Muller method
  – Marsaglia-Bray method
  – inverting cumulative probability distribution

See Glasserman’s book for more details.



If X is a vector of independent normally distributed random variables with zero mean and unit variance, then the vector Y defined by

$$Y = L\,X$$

is a vector of normally distributed variables with zero mean and covariance matrix

$$\Sigma = L\,L^T$$
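The covariance claim follows in one line from linearity of expectation, using $E[X] = 0$ and $E[X X^T] = I$ (the identity matrix, because the components of X are independent with unit variance):

```latex
\[
E[Y] = L\,E[X] = 0,
\qquad
E[Y Y^T] = E[L\,X X^T L^T] = L\,E[X X^T]\,L^T = L\,I\,L^T = L\,L^T = \Sigma .
\]
```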



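Given a particular desired Σ, the simplest choice for L is a Cholesky factorisation in which L is lower-triangular.

For our model problem

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

this gives

$$L = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{pmatrix}$$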


Finite Difference Model Problem

The Black-Scholes equation for our two-asset model problem can be written in the form

$$V_t + r S_1 V_{S_1} + r S_2 V_{S_2} + \sigma^2 \left( \tfrac{1}{2} S_1^2 V_{S_1 S_1} + \rho\, S_1 S_2 V_{S_1 S_2} + \tfrac{1}{2} S_2^2 V_{S_2 S_2} \right) = r V$$

This is solved backwards in time from the final value equal to the payoff function, to get the value at the initial time t = 0.



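Switching to new variables $\eta = \log S$, $\tau = 1 - t$, and defining

$$r^* = r - \tfrac{1}{2}\sigma^2,$$

the equation becomes

$$V_\tau = r^* \left( V_{\eta_1} + V_{\eta_2} \right) + \sigma^2 \left( \tfrac{1}{2} V_{\eta_1 \eta_1} + \rho\, V_{\eta_1 \eta_2} + \tfrac{1}{2} V_{\eta_2 \eta_2} \right) - r V$$

which is to be solved forward in time from τ = 0 to τ = 1.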



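A simple Explicit Euler central space discretisation on a uniform Cartesian grid is

$$V^{n+1} = (1 - r\Delta t)\,V^n + \frac{r^* \Delta t}{2\Delta\eta} \left( \delta_{2\eta_1} + \delta_{2\eta_2} \right) V^n + \frac{\sigma^2 \Delta t}{2\Delta\eta^2} \left( (1-\rho)\,\delta^2_{\eta_1} + \rho\,\delta^2_{\eta_1 \eta_2} + (1-\rho)\,\delta^2_{\eta_2} \right) V^n$$

where

$$\delta_{2\eta_1} V_{i,j} \equiv V_{i+1,j} - V_{i-1,j}$$

$$\delta_{2\eta_2} V_{i,j} \equiv V_{i,j+1} - V_{i,j-1}$$

and . . .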



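$$\delta^2_{\eta_1} V_{i,j} \equiv V_{i+1,j} - 2V_{i,j} + V_{i-1,j}$$

$$\delta^2_{\eta_1 \eta_2} V_{i,j} \equiv V_{i+1,j+1} - 2V_{i,j} + V_{i-1,j-1}$$

$$\delta^2_{\eta_2} V_{i,j} \equiv V_{i,j+1} - 2V_{i,j} + V_{i,j-1}$$

making it a 7-point stencil: the points (i, j), (i±1, j), (i, j±1), (i+1, j+1) and (i−1, j−1).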
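The explicit update can be sketched in C as a single sweep over the interior grid points. This is an illustrative sketch under stated assumptions: the function name explicit_step, the flat storage with element (i, j) at index i + j*ni, and leaving the boundary values to the caller are all choices made here, since the slides do not specify the data layout or the boundary treatment.

```c
/* One Explicit Euler step on the interior of an ni x nj grid.
   v and vnew are flat arrays with element (i,j) stored at index i + j*ni.
   Boundary values of vnew are NOT set here; the caller must supply them. */
void explicit_step(int ni, int nj, const double *v, double *vnew,
                   double r, double sigma, double rho,
                   double dt, double deta)
{
    double rstar = r - 0.5 * sigma * sigma;                 /* r* = r - sigma^2/2 */
    double c1 = rstar * dt / (2.0 * deta);                  /* first-difference coefficient */
    double c2 = sigma * sigma * dt / (2.0 * deta * deta);   /* second-difference coefficient */

    for (int j = 1; j < nj - 1; j++) {
        for (int i = 1; i < ni - 1; i++) {                  /* i innermost: contiguous access */
            int k = i + j * ni;
            double d1  = v[k + 1]  - v[k - 1];              /* central first difference, eta1 */
            double d2  = v[k + ni] - v[k - ni];             /* central first difference, eta2 */
            double d11 = v[k + 1]  - 2.0 * v[k] + v[k - 1];
            double d22 = v[k + ni] - 2.0 * v[k] + v[k - ni];
            double d12 = v[k + 1 + ni] - 2.0 * v[k] + v[k - 1 - ni]; /* diagonal difference */

            vnew[k] = (1.0 - r * dt) * v[k]
                    + c1 * (d1 + d2)
                    + c2 * ((1.0 - rho) * d11 + rho * d12 + (1.0 - rho) * d22);
        }
    }
}
```

Each new value depends only on old values at the seven stencil points, so all interior points can be updated independently within a timestep.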



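If we instead use Backward Euler time differencing, giving

$$(1 + r\Delta t)\,V^{n+1} - \frac{r^* \Delta t}{2\Delta\eta} \left( \delta_{2\eta_1} + \delta_{2\eta_2} \right) V^{n+1} - \frac{\sigma^2 \Delta t}{2\Delta\eta^2} \left( (1-\rho)\,\delta^2_{\eta_1} + \rho\,\delta^2_{\eta_1 \eta_2} + (1-\rho)\,\delta^2_{\eta_2} \right) V^{n+1} = V^n$$

then the question is how to solve the system of simultaneous equations for $V^{n+1}$.

Jacobi, Gauss-Seidel and CG-like iterative solution methods will be considered later.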



Lecture 2: Computing – the “Big Picture”



The Driving Forces

Money and economics are what drive computing, not technology.

Money: if there’s a big enough market, someone will develop the product.

Economics: cost per unit item is minimised by producing huge numbers of the same item – particularly important in computing where the costs of development and fabrication plant are huge (measured in $bn’s).


Technological Trends

Moore’s Law (from Gordon Moore of Intel, 30 years ago): CPU speed doubles every 18-24 months

There is similar growth in all other hardware aspects: memory size, memory bandwidth, disk size, network speed, . . .

Safe to assume that this will continue for at least the next 10 years, driven by:
• multimedia applications
• anti-virus/firewall/anti-spam software
• image processing
• “intelligent” software


The Hardware Pyramid

[Figure: pyramid diagram with tiers labelled embedded systems, laptops, PC’s, servers and supercomputers]


Hardware

• almost all computing is now done on systems built of commodity components, benefitting from economies of scale
  – the days of highly-specialised “vector” supercomputers are over
• roughly 4 : 2 : 1 ratio in performance for CPUs in servers : PCs : embedded systems
• Intel is the dominant force in CPUs; only AMD, IBM, Sun are left in competition


Multi-level Parallelism

• instruction parallelism (e.g. addition)
• pipeline parallelism, overlapping different instructions
• multiple pipelines, each with own capabilities
• multiple CPU’s within a single “multicore” chip
• multiple chips within a single shared-memory computer
• multiple computers within a distributed-memory system
• multiple systems within an organisation


Hardware

Lecture 4 will look at CPUs to understand the lower levels of parallelism, and at how data is moved between the CPU and the main memory using caches.

An understanding of both is required to get the best execution speed from sequential processes, and the memory hierarchy also has major consequences for parallel computing.


Hardware for high-end computing

1) shared-memory multiprocessor
• the modern mainframe – top products from Sun and IBM are used widely in banks, especially for database applications
• single very large memory (up to 250GB?) accessed by multiple processors (up to 72 dual-core chips)
• hardware challenge is high bandwidth memory access – costly
• often has high-reliability features such as hot-swap disks, redundant power supplies – adds to cost



Oxford Supercomputing Centre plans:
• spend 20% of budget on shared-memory systems for specific applications (e.g. Gaussian, molecular modelling package)
• each with probably 8 dual-core processors, and maybe 32GB memory
• probably no high-reliability features to minimise cost



2) tightly-coupled distributed-memory system
• multiple nodes (with 1 – 4 processors) each with own memory
• high-bandwidth low-latency network connection (Gigabit Ethernet, Myrinet, Infiniband)
• in academia, used to be collections of PCs on a shelf, but now there are tailored packages from the leading vendors
• a key issue is system management; you lose all of the price/performance benefits if you have to employ lots of system managers



Oxford Supercomputing Centre plans:
• spend 80% of budget on several large clusters
• each with probably 128 “nodes” containing 2 dual-proc chips, so a total of 512 cores per cluster
• probably Gigabit Ethernet with custom drivers for networking, except for one cluster with higher-spec custom networking



Another example is the OCCF cluster for the Oxford Centre for Computational Finance and the Computing Laboratory
• 24 Sun Ultra-80 nodes each with 4 UltraSPARC processors and 2GB memory
• connected by Myrinet for parallel computing, and 100Mb/s Ethernet for file i/o and external network access
• very old now – about to be shut down



3) loosely-coupled PC/workstation “farms”
• similar to 2) but with relatively low-speed interconnect (100Mb/s or Gigabit Ethernet with TCP/IP software)
• ideally suited for “trivially-parallel” applications like Monte-Carlo
• system management and resource management are again the key issues



Dedicated farms:
• racks of up to 4000 “pizza box” servers at bio-informatics companies

Collection of “idle” resources:
• traders’ workstations/PCs which are idle overnight and at weekends
• computer teaching labs in the university, unused most of the time!



Comparison:
• 8 : 2 : 1 cost/performance ratio for three categories
• shared-memory and distributed-memory systems built of high-end processors because of cost of interconnect
• PC/workstation farms built of low-end processors for lowest cost/performance ratio



Trends:
• business/finance is replacing science as main user of “supercomputers”
• big shared-memory systems (> 64 procs) are in decline because database software has been re-written for distributed-memory systems
• concerns over power consumption caused move to lower-frequency multicore chips – increasingly the aim is to maximise CPU performance per watt!


Electrical Power

• new OSC computer room will have 600kW supply for computers (plus an additional 400kW to keep them cool)
• total power consumption: 1MW
  total electricity bill: £400k/yr
• averaged over a 3-year lifetime, electricity cost is roughly 40% of the purchase price

• Intel Pentium 4 Extreme Edition had clock frequency up to 3.8GHz and used up to 130W
• new Intel multicore chips run at up to 3GHz and use 65-75W


Final “Big Picture” Considerations

• driving force is vast market for PCs and servers with a price tag of £500 – £1500
• consequence is that a compute cluster costing £1M may have up to 1000 chips (2000 cores) if the interconnect is not too expensive
• move to clusters with multi-core chips means we may have to exploit both shared-memory and distributed-memory parallel computing in high-end applications.


Hardware vs. Software

Hardware:
• phenomenal technological advances, driven by user needs
• new products every year, new architectures every 10 years

Software:
• disappointingly slow progress, limited by “people” issues
• new languages and standards every 10 years or so


Software “People” Issues

• need global standards, agreed by committee which takes time – Fortran 90 was motivated by vector computing, which was on the way out by the time the standard was agreed!
• need (re)training of staff – in the worst case have to wait for existing staff to retire!
• staff can be reluctant to learn new skills which might not be transferable to a new employer – another reason for standards
• companies have been happier investing in hardware than software – changing these days?


Languages

Fortran and C:
• for those who want the highest performance
• closest to the level of the operating system (written in C)

C++, Java, C#:
• object-oriented computing for better software design and re-use of code (in principle)

Visual Basic, Matlab:
• “niche” languages with very strong following
• emphasis on ease of use


Parallel Computing Standards

• OpenMP for multithreaded computing on shared-memory systems
• MPI for message-passing on distributed-memory systems
• both support Fortran, C and C++ and provide portability across all major vendors

However, both are rather low-level and MPI involves tedious programming; I’d like to see more research on developing parallel libraries to handle parallelism automatically.



Lecture 3: Distributed resource management and web services



“Trivial” Parallelism

Monte-Carlo applications are a good example of trivial parallelism
• $10^6$ independent random paths can be grouped into 100 jobs, each with $10^4$ paths
• Each job is independent and has very few inputs and outputs
• Given lots of machines, want “something” to decide where the jobs should be run to give the fastest turnaround time.

Only tricky bit for user is making sure each job uses independent random number generation – see practical 1.
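The bookkeeping side of this can be sketched in a few lines of C: decide how many paths each job gets, then combine the per-job averages into the overall estimate. The function names paths_in_job and combine_means are hypothetical, and the sketch deliberately leaves out the independent random number generation, which the slides defer to practical 1.

```c
/* Number of paths assigned to job `job` (0-based) when npaths paths are
   split as evenly as possible over njobs jobs. */
long paths_in_job(long npaths, int njobs, int job)
{
    long base  = npaths / njobs;
    long extra = npaths % njobs;          /* the first `extra` jobs get one more path */
    return base + (job < extra ? 1 : 0);
}

/* Combine per-job sample means into the overall mean, weighting each job
   by the number of paths it ran. */
double combine_means(const double *job_mean, const long *job_paths, int njobs)
{
    double weighted_sum = 0.0;
    long   total_paths  = 0;

    for (int k = 0; k < njobs; k++) {
        weighted_sum += job_mean[k] * (double)job_paths[k];
        total_paths  += job_paths[k];
    }
    return weighted_sum / (double)total_paths;
}
```

With the figures on the slide, paths_in_job(1000000, 100, k) gives 10000 for every job, and each job only has to return a single number (its mean), which is what makes the inputs and outputs so small.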


Distributed Resource Management

Using loosely-coupled PC/workstation farms for Monte Carlo calculations needs distributed resource management:
• which machines are available?
• how heavily are they being used?
• do they have the necessary software/licenses?
• what rights do I have to use them?



Grid Engine (Sun), LSF (Platform Computing) and Condor (Univ. of Wisconsin) deal with this through users submitting tasks to a unified queue which dispatches jobs based on:
• matching job requirements to machine properties
• taking account of current interactive/batch usage of machine
• taking account of different priorities of different user groups
• doing charging if necessary

Maybe sounds simple – but very important



Some DRM software can also work in a hierarchy:
• each department has Grid Engine to manage its own cluster
• the overall organisation has Grid Engine queues which can feed into the departmental queues
• users normally use departmental resources, but can go to higher level queues for extra resources

This is a more robust solution than having a single control point for the entire organisation.


Web Services

Distributed resource management is one aspect of Grid Computing.

Another is the use of Web Services to link separate applications running on different machines, possibly under different operating systems, even within different organisations.

Because of the requirements of eCommerce, there is a huge development effort with well established standards supported by all of the major companies (Microsoft, IBM, SUN).


Web Services

At its simplest, web services follow an RPC (Remote Procedure Call) approach:
• a client process sends a request to a server
• the server process returns a response

• the client and server processes usually “belong” to different users (different userid)
• the server process is usually a persistent service, running indefinitely waiting for client requests


Web Services

Within the basic client/server arrangement, there are a number of subtle distinctions.

A standard web server can offer web services through CGI executables: it listens to port 80, and if a request asks for a particular CGI to be executed to generate a response then it does it.

Alternatively, can have a standalone web service which listens to a particular port and deals with requests.


Note on Ports

When an application “talks” to an application on another machine, it does so through numbered “ports”.

There is at most one application listening to each port, with reserved port numbers for particular services (/etc/services on a Unix system)
• 21 ftp
• 22 ssh
• 80 http

Firewalls restrict which ports are left open, and hence control external communication.


Web Services

What about handling multiple requests from different clients?

• could queue them up and process them one at a time
• could spin off a separate thread (or fork a separate process) to deal with each one


Web Services

What about handling multiple requests from the same client?

If the history of the interaction needs to be maintained (persistence), this can be done by opening a communication channel and maintaining it (keepalive) until the client closes it, or there’s a timeout.

(In this case, should use a separate thread or process for each client.)


Web Services

Standards are crucial for interoperability of web services.

SOAP (Simple Object Access Protocol) defines the RPC interaction:
• XML for the main content (request and response)
• optional MIME attachment (just like email)
• http/https to send the SOAP messages

There is no restriction on the choice of language for implementing the server or client application.


Web Services

Language-specific support for creating web services includes:
• Java: IBM Websphere, Sun ONE, Borland JBuilder, lots of others
• C#: Microsoft .NET
• Python: ZSI (Zolera Soap Infrastructure)
• C/C++: gSOAP


gSOAP

gSOAP is a package for generating web service servers and clients in C/C++
• a pre-processor generates additional C/C++ files given a header file specification of the RPC routines
• there are also some gSOAP files which contain the code to do all the conversion of data to/from XML
• the distribution includes 150 pages of documentation and lots of example applications


gSOAP

The example is a web service calculator which takes two numbers and adds or subtracts them.

For this application, the user writes 3 files:
• calc.h: a header file defining the RPC routines
• calcserver.c: the server code
• calcclient.c: the client code


calc.h

//gsoap ns service name: calc
//gsoap ns schema namespace: urn:calc

int ns__add(double a, double b, double *result);

int ns__sub(double a, double b, double *result);

The ns__ prefix and the gSOAP declarations avoid ambiguities if an application needs to use two services with the same RPC names.


calcserver.c

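#include <math.h>
#include "soapH.h"
#include "calc.nsmap"

int main(int argc, char **argv)
{
  int m, s; /* master and slave sockets */
  struct soap soap;
  soap_init(&soap);

  /* listen on port 80 with a connection backlog of 100 */
  m = soap_bind(&soap, NULL, 80, 100);

  for ( ; ; )
  {
    s = soap_accept(&soap);  /* wait for the next client connection */
    soap_serve(&soap);       /* dispatch the request to ns__add or ns__sub */
    soap_end(&soap);         /* free temporary data for this request */
  }

  return 0;
}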


calcserver.c

int ns__add(struct soap *soap, double a, double b, double *result)
{
  *result = a + b;
  return SOAP_OK;
}

int ns__sub(struct soap *soap, double a, double b, double *result)
{
  *result = a - b;
  return SOAP_OK;
}


calcclient.c

#include "soapH.h"
#include "calc.nsmap"

const char server[] = "http://booth10.ecs.ox.ac.uk:80";

int main(int argc, char **argv)
{
  struct soap soap;
  double a, b, result;

  soap_init(&soap);

  a = strtod(argv[2], NULL);
  b = strtod(argv[3], NULL);


calcclient.c

  switch (*argv[1])
  {
    case 'a':
      soap_call_ns__add(&soap, server, "", a, b, &result);
      break;
    case 's':
      soap_call_ns__sub(&soap, server, "", a, b, &result);
      break;
  }

  if (soap.error)
    soap_print_fault(&soap, stderr);
  else
    printf("result = %g\n", result);

  return 0;
}


gSOAP

Additional features:
• multiple results handled by a result structure
• dynamic arrays handled by a structure with size and pointer
• keepalive for services needing persistence
• https and SSL for security
• zlib and gzip compression
• MIME attachments


Final comments

Web services are likely to become very useful in linking Windows PCs on the desktop to Unix servers in the back-office:
• web clients on Windows PCs written in Java/C#/Visual Basic using Microsoft’s
• web services on Unix servers written in C/C++ using gSOAP
• much more dynamic/responsive than using software like Grid Engine.



Lecture 4: Processor and Memory Technology



Processor Technology

Why discuss processor technology?
• interesting to learn how Moore’s Law is being upheld
• interesting, because there’s lots of parallelism in the CPU hidden from the programmer/user;
• important, because a better understanding enables an expert programmer to get better performance
• important, because it affects whether higher-level parallelism involves 10’s of processors, or 1000’s


Ideal Von Neumann Processor

• each cycle, CPU takes data from registers, does an operation, and puts the result back
• load/store operations (memory ←→ registers) also take one cycle
• CPU can do different operations each cycle
• output of one operation can be input to next

[Figure: timeline showing op1, op2 and op3 placed along the time axis]

CPU’s haven’t been this simple for a long time!


Pipelining

Pipelining is a technique in which multiple instructions are overlapped in execution.

[Figure: timeline showing three instructions, each passing through stages 1 to 5, along the time axis]

• 1 result per cycle after pipeline fills up
• improved utilisation of hardware
• major complication – an output can only be used as input for an operation starting later
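As a worked count, assume an idealised pipeline with k stages, one new instruction started per cycle, and no stalls (the symbols k and N are introduced here for illustration). Compared with running each instruction to completion before starting the next:

```latex
\[
T_{\text{pipelined}} = k + (N - 1),
\qquad
T_{\text{unpipelined}} = k\,N .
\]
\[
k = 5,\; N = 3:\quad 7 \text{ cycles versus } 15;
\qquad
k = 5,\; N = 1000:\quad 1004 \text{ cycles versus } 5000 .
\]
```

For long instruction streams the cost approaches one cycle per result, which is the "1 result per cycle after pipeline fills up" behaviour.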


Superscalar Processors

Most processors have multiple pipelines for different tasks, and can start a number of different operations each cycle.

Example: Sun Microsystems UltraSPARC III
• 2 integer pipes
• 1 floating-point (FP) multiply pipe
• 1 FP addition/subtraction pipe
• in principle, capable of producing 2 integer and 2 FP results per cycle
• FP division uses both FP pipes and is very slow (29 cycles)


Technical Challenges

• compiler to extract best performance, reordering instructions if necessary
• controller to handle multiple pipelines (sometimes with out-of-order execution)
• memory hierarchy to deliver data to registers fast enough to feed the processor
• tricks to avoid delays waiting for data (pipeline stall)
• tricks to avoid delays due to conditional branching (loops, logical tests)

These all limit the number of pipelines that can be used effectively


Programmer Assistance

The programmer can help the compiler by providing more scope for re-ordering operations – common trick is loop unrolling with added benefit of less branching.

for (i=0; i<1000; i++) {
  x += sqdt*rand[i];
}

Problem: each multiply must complete before addition, and looping probably forces addition to complete before next multiply.



for (i=0; i<1000; i+=4) {
  x += sqdt*rand[i];
  x += sqdt*rand[i+1];
  x += sqdt*rand[i+2];
  x += sqdt*rand[i+3];
}

Each addition must complete before next addition, but multiplies are now almost fully overlapped.

Note: need a “remainder” loop when loop range is not perfectly divisible by unrolling factor.
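The remainder loop mentioned in the note can be sketched as follows. This is an illustration with assumed names: sum_unrolled is hypothetical, and the array is called rnd rather than rand to avoid clashing with the C library function of that name.

```c
/* Accumulates sqdt * rnd[i] for i = 0..n-1, unrolled by a factor of 4,
   with a remainder loop for the case where n is not a multiple of 4. */
double sum_unrolled(const double *rnd, int n, double sqdt)
{
    double x = 0.0;
    int i;
    int nmain = n - (n % 4);        /* largest multiple of 4 not exceeding n */

    for (i = 0; i < nmain; i += 4) {
        x += sqdt * rnd[i];
        x += sqdt * rnd[i + 1];
        x += sqdt * rnd[i + 2];
        x += sqdt * rnd[i + 3];
    }
    for (; i < n; i++) {            /* remainder loop: at most 3 iterations */
        x += sqdt * rnd[i];
    }
    return x;
}
```

The additions are performed in the same order as in the original loop, so the result is identical; only the amount of branching and the scope for overlapping multiplies changes.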



To get more speedup, do 2 Monte-Carlo paths at same time:

for (i=0; i<1000; i+=2) {
  x1 += sqdt*rand[i];
  x2 += sqdt*rand[i+1000];
  x1 += sqdt*rand[i+1];
  x2 += sqdt*rand[i+1001];
}

Now enough scope for overlap to get almost full utilisation of a processor with a single 3-stage pipeline.


Compiler Optimisation

Need even more unrolling for multiple pipelines. Fortunately, the compiler will perform innermost loop unrolling, but sometimes needs to be told to do so – compiler directive.

Sun’s cc compiler also has different optimisation levels, giving a trade-off between compiler and code speed.

-fast does a variety of optimisations including multiplying by a reciprocal instead of dividing repeatedly by the same number, and optimisation for native hardware.


Current Trends

• clock cycle no longer reducing, due to problems with power consumption (up to 130W per chip)
• gates/chip still doubling every 24 months
  ⇒ more on-chip memory and MMU (memory management units)
  ⇒ specialised hardware (e.g. multimedia, encryption)
  ⇒ multi-core (multiple CPU’s on one chip)
• peak performance of chip still doubling every 12-18 months


Intel chips

“Conroe” desktop chip:
• dual-core chip running at 2.66GHz
• 14-stage pipelines capable of 4 operations per cycle

Others:
• dual-core Core Duo already out in laptops
• dual-core “Woodcrest” for servers soon
• 90% of all sales dual-core by end of 2006
• quad-core chips by early 2007


AMD chips

Athlon X2 desktop chip:
• dual-core chip running at up to 2.4GHz, using 90-110W
• quad-core in early 2007

Dual-core Opterons:
• dual-core chip running at up to 2.6GHz, using 55-95W
• up to 8-way (16-core) SMP systems
• quad-core due in early 2007


IBM chips

Power 5:
• 2 cores on a single chip
• each core runs two threads simultaneously, overlapping on different pipelines

IBM/Sony/Toshiba Cell chip:
• originally designed for new Sony Playstation
• has one Power 4 core plus 8 graphics cores
• now to be used as multi-core chip in new IBM blade system


SUN chips

Sparc VI:
• 2 cores, running at 2.4GHz using 120W
• Fujitsu developing quad-core variant for 2008?

UltraSparc T1 (“Niagara”) chip:
• 8 cores, running at up to 1.2GHz
• extra bits for encryption and data compression
• limited floating point performance
• intended for file servers / web servers


ClearSpeed

• startup company
• PCI-Express board with 2 compute chips
• each has 96 cores, running at 133MHz(?), using 10W
• ideally suited for Monte Carlo applications
• best performance/watt in marketplace?


Memory Hierarchy

Why discuss memory?
• more and more, it is the bottleneck in modern computer systems
• in some cases, it is possible to get much greater performance through minor changes to a code
• understanding how caches work is vital to understanding the operation and programming of shared-memory parallel computers


Memory Hierarchy

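[Figure: memory hierarchy; levels become faster, more expensive and smaller moving from main memory towards the registers]
• Main memory: 1 – 8 GB, 400MHz DDR2 (100+ cycle access, 5GB/s)
• L2 Cache: 1 – 4 MB, 1GHz SRAM (12 cycle access, 20GB/s)
• L1 Cache: 64KB, 2GHz SRAM (2 cycle access)
• registers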


Memory Hierarchy

Execution speed relies on exploiting data locality
• temporal locality: a data item just accessed is likely to be used again in the near future, so keep it in the cache
• spatial locality: neighbouring data is also likely to be used soon, so load them into the cache at the same time using a ‘wide’ bus (like a multi-lane motorway)


Caches

The cache line is the basic unit of data transfer; typical size is 128 bytes ≡ 16 × 8-byte items.

In a single cache system, when the CPU loads data into a register:
• looks for line in cache
• if there (hit), get data
• if not (miss), get entire line from main memory, displacing an existing line in cache (usually least recently used)

When the CPU stores data from a register:
• same procedure


Caches

What happens when a cache line is modified?

Write-through cache:
• modified line is immediately written to the main memory
• main memory stays up-to-date
• generates lots of memory traffic

Write-back cache:
• modified line is only written to main memory when it gets displaced from the cache
• much less memory traffic
• main memory may not have latest values – potential problem for parallel computing


Caches

Multi-level caches

All major processors use at least two levels of cache:
• primary cache is small (e.g. 64KB), on-chip and write-through
• secondary cache is larger (e.g. 2MB), usually on-chip and write-back
• if there is a third level cache, then it is even larger, off-chip and write-back


Importance of Locality

Typical workstation:
2 Gflops CPU
5 GB/s memory ←→ L2 cache bandwidth
128 bytes/line

5GB/s ≡ 40M line/s ≡ 600M reals/s

At worst, each flop requires 2 inputs and has 1 output, forcing loading of 3 lines ⇒ 13 Mflops

If all 16 variables/line are used, then this increases to 200 Mflops.

To get up to 2Gflops needs temporal locality, re-using data already in the cache.
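The rounded figures above can be checked step by step, assuming 8-byte reals so that one 128-byte line holds 16 of them:

```latex
\[
\frac{5 \times 10^{9}\ \text{bytes/s}}{128\ \text{bytes/line}} \approx 39 \times 10^{6} \approx 40\text{M lines/s},
\qquad
40\text{M lines/s} \times 16\ \text{reals/line} = 640\text{M} \approx 600\text{M reals/s}
\]
\[
\text{worst case:}\quad \frac{40\text{M lines/s}}{3\ \text{lines/flop}} \approx 13\ \text{Mflops},
\qquad
\text{all 16 reals used:}\quad \frac{600\text{M reals/s}}{3\ \text{reals/flop}} = 200\ \text{Mflops}
\]
```

Even the better case is a factor of 10 below the 2 Gflops peak, which is why re-use of data already in the cache matters so much.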


Loop Ordering

A 2D finite difference code typically has loops of the form

for (i=0; i<1000; i++) {
  for (j=0; j<1000; j++) {
    u[id(i,j)] = ...
  }
}

where id(i,j) maps the indices (i,j) to a unique element of u.

Question: would it be more efficient to re-order the loops?


Loop Ordering

The answer depends on the function id(i,j).

If we use

id(i,j) = i + j*imax

then id(3,7) is next to id(4,7), but not id(3,8).

Multiple dimensions are handled similarly, with the lower dimensions varying most rapidly.

Loop Ordering

Consequently, in the FD example, it is best to have the i loop innermost, to access the elements of u[id(i,j)] sequentially.

If the j loop is innermost, then the cache line with element u[id(i,j)] may have been displaced by the time that u[id(i+1,j)] is to be computed.

This can have very dramatic consequences!
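A minimal sketch of the two loop orderings, assuming the layout id(i,j) = i + j*imax. The macro ID and both function names are illustrative. The two routines compute the same sum; only the memory access pattern differs, so timing them on a large array is one way to see the effect.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: i varies fastest, as in id(i,j) = i + j*imax. */
#define ID(i, j, imax) ((size_t)(i) + (size_t)(j) * (size_t)(imax))

/* i loop innermost: consecutive iterations touch adjacent elements. */
double sum_i_inner(const double *u, int imax, int jmax)
{
    assert(imax >= 0 && jmax >= 0);
    double s = 0.0;
    for (int j = 0; j < jmax; j++)
        for (int i = 0; i < imax; i++)
            s += u[ID(i, j, imax)];
    return s;
}

/* j loop innermost: consecutive iterations are imax elements apart,
   so each access may land on a different cache line. */
double sum_j_inner(const double *u, int imax, int jmax)
{
    assert(imax >= 0 && jmax >= 0);
    double s = 0.0;
    for (int i = 0; i < imax; i++)
        for (int j = 0; j < jmax; j++)
            s += u[ID(i, j, imax)];
    return s;
}
```

How large the difference is depends on the array size relative to the cache, so any timing should be done on arrays much larger than the cache.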


Current Trends

Memory hierarchy seems likely to remain – very high speed memory is too expensive

Importance of cache lines and data locality is likely to remain – transferring multiple bits of data in parallel is the only way to get high throughput

Best we can hope for is that compilers will handle code optimisation, but remember, high-end numerical computing is not a big driver.


Lecture 5: Shared-memory Multiprocessors



Shared-memory Multiprocessors

[Figure: five CPUs, each with its own cache, all connected to a single main memory]

Conceptual arrangement:
• multiple CPUs, each with own cache
• all linked to a unified main memory by a very high bandwidth interconnect

Shared-memory Multiprocessors

For historical reasons, they are also referred to as SMP systems – Symmetric Multi-Processors

“Symmetric” refers to the fact that all processors are equal

An asymmetric system is one in which there is a master processor, and a number of slaves – like the ClearSpeed card

Interconnect

One challenge in building shared-memory systems is achieving sufficient bandwidth between all of the processors and multiple “memory ports” (points of entry into the main memory)
• traditional PC bus is not scalable – fixed bandwidth shared between more and more processors
• scalable performance is achieved using commodity crossbar (full interconnect) chips originally developed for network switches

Cache Coherency

The other challenge in shared-memory multiprocessors is maintaining coherency with write-back caches

[Figure: CPU1 to CPU5, each with its own cache, above a shared main memory]

Suppose CPU2 loads and modifies variable X, and then CPU4 needs to load X – what happens?

Cache Coherency

The solution is a “snoopy bus” linking the caches; CPU2 spots the request from CPU4 and supplies the newer value for X.

[Figure: CPU1 to CPU5 with their caches linked by the snoopy bus, above a shared main memory]

Cache Coherency

In the MESI cache coherency protocol, a cache line can be in one of 4 states:
• Modified: sole owner of modified line
• Exclusive: sole owner, not modified
• Shared: shared ownership, not modified
• Invalid: incorrect data

[Figure: state transition diagram for the M, E, S and I states, with transitions labelled “write”, “write by other” and “read by other”]

Cache Coherency

Note: don’t want different processors “fighting” for ownership of the same cache line – can give very bad performance

As with the main system bus, the snoopy bus has problems scaling to large numbers of processors.

There have been alternative methods used in large shared-memory NUMA (Non-Uniform Memory Access) machines, but they were expensive.

Shared-memory Computing

A key distinction: processors and processes

A processor is a piece of hardware which can execute instructions

A process is a program consisting of a set of instructions

At any instant, there is precisely one process executing on each processor, but the same process may be executing on more than one processor


In a shared-memory system, a user application is a single Unix process, with a number of virtual memory pages holding the user’s data, and a number of “threads” working on it.

Very like having a project being carried out by a pool of “workers”:
• some tasks can only be done by a single worker, while the rest wait around
• other tasks can be carried out in parallel by many workers
• key is deciding what can be done in parallel, avoiding conflicts between workers


The operating system is itself a multithreaded application, with perhaps one thread handling disk i/o, one network i/o, one task scheduling, etc.

Task scheduling for multiple users is particularly important:
• system maintains a list of active processes
• each process gets given its turn for execution for a few milliseconds, and then is put to the back of the queue to wait for its next turn
• multithreaded processes are usually executed on a corresponding number of processors

Static and Stack Memory Management

To understand some aspects of shared-memory programming, need to know how compilers handle data within programs.

Static allocation means that the compiler decides at compilation time where the data will sit within the user’s virtual memory.

Stack allocation means it’s handled on-the-fly during execution, as needed.


In C, static allocation is specified through the use of the static keyword

  #include <stdio.h>

  void counter(int n) {
    static int kount;        /* persists between calls */

    if (n==0)
      kount = 0;             /* reset the count */
    else if (n==1)
      kount = kount + 1;     /* increment the count */
    else
      printf("%d", kount);   /* print the current count */
  }


In C, stack allocation is the default, enabling routines to be used recursively

  int factorial(int n) {
    int fact, nm;            /* fresh copies for every call */

    if (n<=1)
      fact = 1;              /* base case (also guards n<1) */
    else {
      nm = n-1;
      fact = n*factorial(nm);
    }

    return fact;
  }


These two examples show the key aspects of each approach.

Static allocation is persistent, continuing after a routine finishes – may be more efficient because no run-time allocation is needed.

Stack allocation is transient, with fresh allocation each time a routine starts, disappearing when it finishes.


So far, have considered only sequential processes – what about multi-threading?

Simple example – suppose two threads want to print something out at the same time.

The libraries that handle printing have internal data. To avoid conflict, each call needs stack allocation giving independent private data – “thread-safe” libraries (often not the default because they’re less efficient).


More generally, in multithreaded applications there is the important concept of shared and private data:

Private data belongs to a particular thread
• it is allocated on its own private stack
• it can be seen and changed only by that thread

Shared data is visible to all threads
• it is either statically allocated, or allocated on a master stack
• any of the threads can change its value

Shared-memory Programming

In general terms, there are two levels of shared-memory programming.

At a low-level, one can start several threads and then explicitly tell each what to do – in this case, the code will have instructions such as “if this is thread 3 then do the following ... ”.


This is very flexible, but involves tedious programming.

In general I would recommend it only to experienced programmers wanting to do an application in which different threads are doing entirely different things – e.g. one thread is managing network i/o, one is running an experiment, one is handling terminal i/o.

For C programs, POSIX pthreads is the standard, but I have very little experience of using it.


The higher-level approach is to tell the compiler what can be done in parallel, and let it automatically generate the code to handle the multiple threads.

In this case, typically there is a master thread which is always active, and a bunch of other threads which spring into action for parallel loops, and hibernate in between.

OpenMP is the standard for this higher-level approach, superseding the many vendor-specific versions that used to exist.


Lecture 6: OpenMP Programming



Overview

The code is executed sequentially by a master thread except for regions (e.g. loops) which are explicitly declared to be done in parallel.

The extra threads hibernate during the sequential sections, are activated during the parallel sections, then get suspended again. This is all handled by the compiler and the run-time execution environment.

The programmer is responsible for saying what is to be done in parallel. If the programmer makes a mistake, execution may be slow and/or incorrect.

parallel for

The parallel for directive says the next loop is to be executed in parallel:

  #pragma omp parallel for \
          private(i,du) shared(u,v)
  for (i=0; i<imax; i++) {
    du = v[i]*v[i];
    u[i] += du;
  }

Note the specification of private and shared variables. The default is that loop indices are private, and everything else is shared.
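For reference, the same loop wrapped in a complete function might look as follows. The function name add_squares is an illustrative choice. If the compiler is run without OpenMP support the pragma is ignored and the loop simply runs sequentially, giving the same result.

```c
#include <assert.h>

/* u[i] += v[i]*v[i] for every i. The iterations are independent,
   so they can be split between threads in any way. */
void add_squares(double *u, const double *v, int imax)
{
    int i;
    double du;
    assert(imax >= 0);

#pragma omp parallel for private(i, du) shared(u, v)
    for (i = 0; i < imax; i++) {
        du = v[i] * v[i];   /* du is private: one copy per thread */
        u[i] += du;         /* each u[i] is written by one iteration only */
    }
}
```

With gcc, OpenMP is typically enabled with the -fopenmp flag; other compilers use different flags.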


parallel for

Private variables are defined to exist transiently within the loop:
• uninitialised on entry to the loop
• undefined on exit from the loop

If there is a pre-existing global variable with the same name, it is undefined what happens to this – avoid this!

Conceptually, the du variable in the previous example becomes du_n where n is the thread number, making these variables different from any global variable du.

parallel for

There is control over how the loop iterations are divided between the threads, through an optional schedule argument.

schedule(static) splits the loop range into (almost) equal chunks, one for each thread. This is the usual default, and the best for simple loops with equal work per iteration

schedule(static,n) uses chunks of size n, assigned to threads in simple rotation.

parallel for

schedule(dynamic,n) uses chunks of size n assigned to threads when they complete the previous chunk. This is the best choice when the work per loop iteration varies considerably.

  #pragma omp parallel for \
          private(i) shared(u) schedule(dynamic,n)
  for (i=0; i<imax; i++) {
    if (u[i] < 0) {
      u[i] = small_work(u[i]);
    }
    else {
      u[i] = big_work(u[i]);
    }
  }

parallel for

With nested loops, remember it is the loop immediately after the directive that is parallelised.

  #pragma omp parallel for \
          private(i,j) shared(u)
  for (j=0; j<jmax; j++) {
    for (i=0; i<imax; i++) {
      u[id(i,j)] = ...
    }
  }

Here the j loop is parallelised.

parallel for

  for (j=0; j<jmax; j++) {
  #pragma omp parallel for \
          private(i) shared(j,u)
    for (i=0; i<imax; i++) {
      u[id(i,j)] = ...
    }
  }

Here the i loop is parallelised.

In general, parallelising the outer loop is best (less starting and suspending of threads) except when the outer loop is over a small range (poor load balancing – e.g. when there are 4 threads and jmax = 5.)

parallel for

What can go wrong? What about

  sum = 0;

  #pragma omp parallel for \
          private(i,ds) shared(u,sum)
  for (i=0; i<imax; i++) {
    ds = u[i]*u[i];
    sum += ds;
  }

This is likely to give incorrect results because of the accumulation into sum.

parallel for

What’s the problem? Consider two threads.

  time   Thread 1      Thread 2
    |    load sum
    |                  load sum
    |    add ds
    |                  add ds
    |    store sum
    v                  store sum

The overlapped additions to sum mean that the first thread’s contribution gets lost.

parallel for

First solution uses critical directive to say only one thread at a time can work with sum

  sum = 0;

  #pragma omp parallel for \
          private(i,ds) shared(u,sum)
  for (i=0; i<imax; i++) {
    ds = u[i]*u[i];
  #pragma omp critical
    {sum += ds;}
  }

This will give valid results.

parallel for

Second solution uses atomic update, making the load, add, store sequence act as a single instruction – only possible for single operations.

  sum = 0;

  #pragma omp parallel for \
          private(i,ds) shared(u,sum)
  for (i=0; i<imax; i++) {
    ds = u[i]*u[i];
  #pragma omp atomic
    sum += ds;
  }

This will give valid results.

parallel for

Although both of these solutions will give valid results, the performance will be appalling because the different threads will fight over access to the cache line holding the shared sum variable.

Instead, use the special reduction clause

  sum = 0;

  #pragma omp parallel for \
          private(i,ds) shared(u) reduction(+:sum)
  for (i=0; i<imax; i++) {
    ds = u[i]*u[i];
    sum += ds;
  }

parallel for

How does the compiler get good performance?

It creates temporary private variables sum_local to accumulate the partial sums for each thread, then at the end combines them with the shared variable sum.

Works with other reduction operators such as min, max, -, *.
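A minimal complete version of the reduction, for experimentation. The function name sum_of_squares is an illustrative choice. Note that the order in which partial sums are combined can vary with the number of threads, so results may differ in the last few bits between runs on non-trivial data.

```c
#include <assert.h>

/* Returns the sum of u[i]*u[i]. Each thread accumulates into its own
   private copy of sum; the copies are combined at the end of the loop. */
double sum_of_squares(const double *u, int imax)
{
    int i;
    double ds, sum = 0.0;
    assert(imax >= 0);

#pragma omp parallel for private(i, ds) shared(u) reduction(+:sum)
    for (i = 0; i < imax; i++) {
        ds = u[i] * u[i];
        sum += ds;
    }
    return sum;
}
```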


parallel for

Another example of data dependencies is Gauss-Seidel iteration.

  #pragma omp parallel for private(i,j) shared(u)
  for (j=0; j<jmax; j++) {
    for (i=0; i<imax; i++) {
      u[id(i,j)]=0.25*(u[id(i-1,j)]+u[id(i+1,j)]
                      +u[id(i,j-1)]+u[id(i,j+1)]);
    }
  }

This will produce incorrect results because it does not respect the fact that u[id(10,10)] should be updated after u[id(9,9)]

parallel for

To parallelise Gauss-Seidel correctly, first need to identify inherent parallelism – all entries along i + j = const can be updated in parallel.

[Figure: 9×9 grid of points with diagonal lines marking the sets i + j = const]

parallel for

Hence parallelise first half of loop as

  for (k=0; k<imax; k++) {
  #pragma omp parallel for private(i,j) shared(k,u)
    for (i=0; i<=k; i++) {
      j = k - i;
      u[id(i,j)]=0.25*(u[id(i-1,j)]+u[id(i+1,j)]
                      +u[id(i,j-1)]+u[id(i,j+1)]);
    }
  }

and do the second half similarly.
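A sketch covering both halves in one loop over the diagonals, under some assumptions that are not in the slides: the grid is square (n × n), only interior points 1..n-2 are updated so that the stencil never reads outside the array, and the index mapping is the ID macro below. A sequential sweep is included for comparison; both update each point from already-updated values at (i-1,j) and (i,j-1) and old values at (i+1,j) and (i,j+1).

```c
#include <assert.h>

#define ID(i, j, n) ((i) + (j) * (n))   /* assumed layout, i fastest */

/* Reference: one Gauss-Seidel sweep in the usual sequential order. */
void gs_sweep_sequential(double *u, int n)
{
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < n - 1; i++)
            u[ID(i, j, n)] = 0.25 * (u[ID(i - 1, j, n)] + u[ID(i + 1, j, n)]
                                   + u[ID(i, j - 1, n)] + u[ID(i, j + 1, n)]);
}

/* Same sweep by diagonals k = i + j. Points on one diagonal depend only
   on diagonals k-1 (already updated) and k+1 (not yet updated). */
void gs_sweep_wavefront(double *u, int n)
{
    assert(n >= 3);
    for (int k = 2; k <= 2 * (n - 2); k++) {
        /* range of i for which both i and j = k - i are interior */
        int ilo = (k - (n - 2) > 1) ? k - (n - 2) : 1;
        int ihi = (k - 1 < n - 2) ? k - 1 : n - 2;
        int i, j;

#pragma omp parallel for private(i, j) shared(u, k, n, ilo, ihi)
        for (i = ilo; i <= ihi; i++) {
            j = k - i;
            u[ID(i, j, n)] = 0.25 * (u[ID(i - 1, j, n)] + u[ID(i + 1, j, n)]
                                   + u[ID(i, j - 1, n)] + u[ID(i, j + 1, n)]);
        }
    }
}
```

Short diagonals near the corners give little work per parallel loop, so in practice the benefit depends on the grid being large.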


Other OpenMP Directives

• parallel sections and section
  Defines a number of sections of code to be handled by multiple threads, one per section.

• parallel
  Most general parallel construct, defining code to be executed by multiple threads, often with low-level control over what each thread does, based on its thread number.

Financial Applications

For financial applications (and most others too) parallel for with shared, private and reduction clauses should be all that is needed.

Monte Carlo
• use parallel for for parallel execution of paths, with reduction to combine the results to get average value
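A toy sketch of this pattern. Everything model-specific here is an assumption made for illustration: the asset follows random up/down multiplicative steps, there is no discounting, and the random numbers come from a small linear congruential generator seeded per path, which is not of production quality. The point is only the structure: independent paths in a parallel for, combined with reduction.

```c
#include <assert.h>
#include <stdint.h>

/* Small LCG; each path has its own state, so threads share no RNG data. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* One path: nsteps random up/down moves, then call payoff max(S-K,0). */
static double path_payoff(uint32_t seed, int nsteps, double s0,
                          double up, double down, double strike)
{
    uint32_t state = seed;
    double s = s0;
    for (int n = 0; n < nsteps; n++)
        s *= (lcg_next(&state) >> 31) ? up : down;   /* use the top bit */
    return (s > strike) ? s - strike : 0.0;
}

/* Average payoff over npaths independent paths. */
double mc_average_payoff(int npaths, int nsteps, double s0,
                         double up, double down, double strike)
{
    int p;
    double sum = 0.0;
    assert(npaths > 0 && nsteps >= 0);

#pragma omp parallel for private(p) reduction(+:sum)
    for (p = 0; p < npaths; p++)
        sum += path_payoff(12345u + 7919u * (uint32_t)p, nsteps,
                           s0, up, down, strike);
    return sum / npaths;
}
```

Seeding from the path index keeps the set of simulated paths the same whatever the number of threads; only the order of summation changes.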


Financial Applications

Multi-dimensional Black-Scholes solution
• for explicit FD methods, use parallel for for outermost grid dimension
• for implicit FD methods, details depend on the iterative solver
  – methods like GMRES and BiCGstab will need reduction for vector dot products
  – Gauss-Seidel and ILU preconditioners will require careful re-writing to expose inherent parallelism


Lecture 7: Distributed Memory Multiprocessors



Idealisation

BSP hardware model (Valiant, McColl)

[Figure: row of six processor (P) / memory (M) nodes, each linked to a common network]

• a number of processor/memory nodes connected by a ‘network’
• each processor has fast access to local memory and slow access to remote memory
• real hardware differs in having usual memory/cache/register hierarchy

Network

IBM’s Blue Gene uses a hypercube generalisation of a 2D network array

Network

Clusters use a commodity switch (Gigabit Ethernet, Myrinet, Infiniband)

Key performance measures are:
• latency – minimum time to communicate between two processors
• bandwidth per processor

Network

Gigabit Ethernet
• latency: 1–2 ms if using TCP/IP; 50 µs if using custom drivers
• bandwidth per processor: 1 Gb/s ≈ 100 MB/s
• now standard for PCs/servers

10Gig Ethernet
• same latency as Gigabit Ethernet
• bandwidth per processor: 10 Gb/s ≈ 1 GB/s
• starting to be used for servers

Network

Myrinet (from Myricom)
• latency: 10 µs
• bandwidth per processor: 2–10 Gb/s ≈ 250 MB/s–1 GB/s
• the current proprietary market leader for distributed-memory systems

Infiniband
• latency: 10 µs
• bandwidth per processor: 10–40 Gb/s ≈ 1–5 GB/s
• a new standard being adopted by major manufacturers, including IBM and SUN

Distributed-memory Computing

On commodity clusters, each node has its own independent Unix operating system kernel
• completely independent computers, connected by a network
• each handles its own file i/o, network i/o, process scheduling
• if one machine “dies”, the rest carry on regardless


Slightly different on the IBM Blue Gene
• micro-kernel on each node
• specialised functions such as file i/o only performed on certain nodes
• not clear what happens when one node “dies”


User applications involve coordination between multiple processes
• problem data is split up between multiple processes
• typically, each process has unique use of one node, to avoid scheduling difficulties
• during program development, can run multiple processes on one node to test code
• processes communicate by sending messages to each other


Basic process loop:
• do some work using local data
• communicate between processes
• repeat

Message Passing

• standard software (MPI, PVM) for all systems
• simple, crude, effective
• requires action by both processes
• sending:
  – write message (put data into an array)
  – send to other process
• receiving:
  – receive message
  – read message (copy into another array)

Message Passing

MPI (Message Passing Interface) is the standard:
• FORTRAN, C and C++ implementations available on all major platforms
• designed by committee, so lots of options, but in practice few are needed
• highly optimised
• safe for use in parallel libraries

PVM (Parallel Virtual Machine) is an older library, now obsolete.

Message Passing Concepts

Two key concepts: buffering and blocking

Message passing is buffered if the message is transferred via a message buffer, and not directly from/to the process memory

Buffering is less efficient because of copying the data, but it is usually simpler and safer (less scope for the programmer to make mistakes)

Message Passing

Message passing is blocking if the process waits to complete the send or receive operation before continuing.

Non-blocking operations can be more efficient (allowing possible overlap of computation and communication) but can be more confusing, and error prone.

Message Passing

When using buffering, it is simplest to use non-blocking send (like sending a letter) and blocking receive (wait for the postman to deliver the post)

Without buffering, it is simplest to use blocking send/receive leading to a synchronous transfer (like sending a fax)

Message Passing

• buffered, non-blocking send
  – task A continues after sending message
• buffered, blocking receive
  – task B waits until it gets message

  task A            task B
    .                 .
    .                 .
  send(B,msg)         .
    .                 .
    .               recv(A,msg)
    .                 .

Message Passing

The big problem to be avoided is deadlock, in which all processors are waiting for someone else to send a message – a common error for beginners

  task A            task B
    .                 .
  recv(B,msg1)      recv(A,msg2)
  send(B,msg2)      send(A,msg1)
    .                 .

Message Passing

For buffered transfers with a non-blocking send, the following works correctly,

  task A            task B
    .                 .
  send(B,msg1)      send(A,msg2)
  recv(B,msg2)      recv(A,msg1)
    .                 .

but it still leads to deadlock for sends which are blocking/synchronous.

Message Passing

For synchronous transfers, must use the following:

  task A            task B
    .                 .
  send(B,msg1)      recv(A,msg1)
  recv(B,msg2)      send(A,msg2)
    .                 .


Lecture 8: BSP Model of Distributed Computing



BSP Hardware Model

[Figure: row of six processor (P) / memory (M) nodes, each linked to a common network]

• a number of processor/memory nodes connected by a ‘network’
• each processor has fast access to local memory and slow access to remote memory

Aim is to predict likely performance on real hardware, and make choices about alternative implementation strategies

BSP Parameters

p = number of processors

s = processor speed (Mflops)

$$ l = \frac{\text{latency/synchronisation time}}{\text{time for 1 floating point op}} $$

$$ g = \frac{\text{time to get/send 1 f.p. variable}}{\text{time to do 1 floating point op}} $$

Note:
• p, l, g are non-dimensional
• estimated execution time will be $s^{-1} f(p, l, g)$
• local memory access times are neglected – no modelling of cache performance

BSP Parameters

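For a cluster with Intel/AMD processors and Myrinet networking:

$$ l \approx \frac{10\,\mu\text{s}}{0.5\,\text{ns}} = 2\times 10^4, \qquad g \approx \frac{50\,\text{ns}}{0.5\,\text{ns}} = 100 $$

For an IBM Blue Gene system with slower processors and faster networking:

$$ l \approx \frac{10\,\mu\text{s}}{1\,\text{ns}} = 10^4, \qquad g \approx \frac{25\,\text{ns}}{1\,\text{ns}} = 25 $$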

BSP Computation Model

Execution proceeds in supersteps separated by synchronisations

[Figure: timeline alternating superstep, synch, superstep, synch]

Each superstep consists of each process doing some calculations using local data then communicating some data to other processors

BSP Cost Modelling

The cost of a single superstep is

$$ s^{-1} \left( n_o + l + n_c\, g \right) $$

where
• $n_o$ = max number of f.p. operations
• $n_c$ = max number of real variables communicated by one process

For a given application and problem size, $n_o$ and $n_c$ will depend on p.

The BSP cost of the whole task is just the sum of the individual supersteps.
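The cost formula is simple enough to evaluate directly. The sketch below uses illustrative names (superstep, bsp_superstep_cost, bsp_total_cost); with s given in Mflops the result is in microseconds.

```c
#include <assert.h>

/* Work and communication counts for one superstep. */
typedef struct {
    double n_o;   /* max number of f.p. operations on any process */
    double n_c;   /* max number of reals communicated by any process */
} superstep;

/* Cost of one superstep: (n_o + l + n_c*g) / s. */
double bsp_superstep_cost(double s, double l, double g, superstep st)
{
    assert(s > 0.0);
    return (st.n_o + l + st.n_c * g) / s;
}

/* Cost of a whole task: the sum over its supersteps. */
double bsp_total_cost(double s, double l, double g,
                      const superstep *steps, int nsteps)
{
    double total = 0.0;
    for (int k = 0; k < nsteps; k++)
        total += bsp_superstep_cost(s, l, g, steps[k]);
    return total;
}
```

For example, with s = 2000 Mflops, l = 2×10^4 and g = 100, a superstep with n_o = 10^6 and n_c = 1000 costs (10^6 + 2×10^4 + 10^5)/2000 = 560 µs.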


Explicit FD Calculation

Suppose we want to perform a 2D explicit FD calculation on a grid which is $N_1 \times N_2$.

To do this, we will partition the grid using a “processor grid” which is $p_1 \times p_2$ (with $p = p_1 p_2$)




To minimise memory requirements, each processor works with just its part of the overall grid, of size $\frac{N_1}{p_1} \times \frac{N_2}{p_2}$, plus a copy of the neighbouring nodes from adjacent partitions – often known as “halo nodes”


If each timestep requires m operations per grid point, the total number of operations per superstep is

$$ n_o = m \, \frac{N_1 N_2}{p_1 p_2} $$

The new values of halo nodes then have to be communicated to the neighbours on all four sides, so

$$ n_c = 2 \left( \frac{N_1}{p_1} + \frac{N_2}{p_2} \right) $$

and the total BSP cost is

$$ T = s^{-1} \left( m \, \frac{N_1 N_2}{p_1 p_2} + l + 2g \left( \frac{N_1}{p_1} + \frac{N_2}{p_2} \right) \right) $$


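Re-writing it as

$$ T = s^{-1} \left( m \, \frac{N_1 N_2}{p} + l + 2g \left( \frac{N_1}{p_1} + \frac{p_1 N_2}{p} \right) \right) $$

and treating $p_1$ as continuous, with p fixed, we find this is minimised (setting the derivative with respect to $p_1$ to zero gives $-N_1/p_1^2 + N_2/p = 0$) when

$$ \frac{N_1}{p_1} = \frac{N_2}{p_2} $$

This gives us our first result using BSP modelling – time is minimised by using square partitions (minimum ratio of surface to volume)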


If we now define

$$ N_{local} = \frac{N_1}{p_1} = \frac{N_2}{p_2} $$

then the total cost is

$$ T = s^{-1} \left( m N_{local}^2 + l + 4 g N_{local} \right) $$

For good efficiency, want communication and latency costs to be small compared to computation, so require

$$ N_{local} \gg \max\left( \sqrt{l/m},\; 4g/m \right) $$

This is our second BSP result – the minimum problem size for effective parallelisation


Suppose we now consider a d-dimensional problem, with each partition of size $N_{local}^d$.

In this case, the total BSP cost per timestep is

$$ T = s^{-1} \left( m N_{local}^d + l + 2 d g N_{local}^{d-1} \right) $$

For good efficiency require

$$ N_{local} \gg \max\left( \left( \frac{l}{m} \right)^{1/d},\; \frac{2 d g}{m} \right) $$

– probably best satisfied for d=3.


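In general, we define parallel efficiency as

$$ \text{Parallel efficiency} = \frac{\text{sequential time}}{p \times \text{parallel time}} $$

In the 2D explicit FD case,

$$ \text{sequential time} = m s^{-1} N_1 N_2 = m s^{-1} p N_{local}^2 $$

so we get

$$ \text{Parallel efficiency} = \left( 1 + \frac{l}{m N_{local}^2} + \frac{4g}{m N_{local}} \right)^{-1} $$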
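To get a feel for the numbers, the efficiency formula for the 2D case can be evaluated directly. The function name is illustrative, and the value m = 10 operations per grid point used in the example below is an assumption, not a figure from the course.

```c
#include <assert.h>

/* Parallel efficiency of the 2D explicit FD calculation:
   1 / (1 + l/(m*N^2) + 4g/(m*N)), with N the local partition width. */
double parallel_efficiency_2d(double m, double l, double g, double n_local)
{
    assert(m > 0.0 && n_local > 0.0);
    return 1.0 / (1.0 + l / (m * n_local * n_local)
                      + 4.0 * g / (m * n_local));
}
```

With l = 2×10^4 and g = 100 (the Myrinet cluster values) and m = 10, a local partition width of 1000 gives 1/(1 + 0.002 + 0.04), about 96%, while a width of 100 gives 1/(1 + 0.2 + 0.4) = 62.5%.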


Parallel Efficiency and Scalability

Scalability concerns what happens as you increase the number of processors. However, one has to be careful with how it is defined:
• Fixed overall problem size: as p increases, $N_{local}$ decreases so the parallel efficiency decreases.
• Fixed problem size per processor: as p increases, $N_{local}$ remains fixed and so does the parallel efficiency.

Personally, I think the second definition is more appropriate – the point of using lots of processors is to be able to tackle really big problems.

Halos for FD Calculations

[Figure: 7-point stencil in the interior of the grid]

The FD approximation to the 2D Black-Scholes equation uses a 7-point stencil because of the cross-derivative.


At first sight, it looks as if this will require halo transfers from 2 of the diagonal neighbours, as well as the four immediate neighbours.

However, with a little care, extra transfers can be avoided.

The key is to complete halo exchange in the x-direction before starting halo exchange in the y-direction.


[Figure: grid of nodes for one partition and its halo, with the up-to-date nodes marked by dots]

After the exchange in the x-direction, with the immediate neighbours on either side, the nodes with dots have up-to-date values.


[Figure: grid of nodes for one partition and its halo, with all nodes marked as up-to-date]

After the exchange in the y-direction, with the immediate neighbours on either side, all nodes have up-to-date values. The corner values come from copying the neighbours’ halos.


Lecture 9: An Introduction to MPI Message-Passing



Key Reference

Using MPI: portable parallel programming with the message-passing interface (second edition) by Gropp, Lusk and Skjellum is excellent!

• starts with basics and adds to them slowly
• emphasises that most people need only a limited subset of MPI
• lots of examples of direct relevance
• I suggest you stick to Chapters 1–4:
  – Background
  – Introduction
  – Using MPI in Simple Programs
  – Intermediate MPI.

Some Basics

A program using MPI must be compiled with a special command, usually mpicc or mpcc.

One thing this does is to provide a link to a header file mpi.h which must be included in each C file using the line

  #include "mpi.h"

When run interactively, it is executed by a special command of the form

  mprun -np n program

where n is the number of processes to be used.

Some Basics

An MPI program usually starts with the lines

  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

• MPI_Init initialises things
• MPI_Comm_size gives the number of processes
• MPI_Comm_rank gives the “rank” within the group (0 ≤ myid < nprocs)

Some Basics

MPI_COMM_WORLD is a communicator which in this case is a constant defined in mpi.h to denote the entire set of processes.

It is possible to construct other communicators, e.g. for communication between a subset of processes, or to protect/isolate communication within a library.

Some Basics

The first routines to learn about are:
• MPI_Bcast to broadcast data from one process to the others;
• MPI_Reduce to reduce data from all processes to one;
• MPI_Send, MPI_Recv, MPI_Sendrecv to send messages between processes
• MPI_Finalize terminates all MPI communication

You can go a long way using just these routines.

MPI_Bcast

The syntax of the broadcast subroutine is

  MPI_Bcast(*data,size,type,origin,communicator)

• data is the data to be sent
• size is the number of pieces of data
• type is its type (e.g. MPI_INT or MPI_DOUBLE)
• origin is the rank of the process doing the broadcast

MPI_Reduce

Similarly, the syntax of the reduction subroutine is

  MPI_Reduce(*input,*output,size,type,
             operation,destination,communicator)

• input is the data to be reduced
• output is where the result is put on the process given by destination; use MPI_Allreduce instead to send the output to all processes
• operation is the reduction operation to be performed (e.g. MPI_SUM or MPI_MAX)
• the others are the same as for MPI_Bcast

MPI_Send

  MPI_Send(*data,size,type,destination,tag,communicator)

• destination is where the message is to be sent, and tag is a user-chosen integer label
• to be safe, think of this as a blocking synchronous send; for small messages it may have a non-blocking implementation using a system buffer

MPI_Recv

  MPI_Recv(*data,size,type,origin,tag,communicator,*status)

• a blocking receive, will wait for a message with correct origin and tag, but these can be set to MPI_ANY_SOURCE and MPI_ANY_TAG
• status is a variable of special type MPI_Status with additional information
• note that incoming messages do not have to be read in the order in which they arrive

MPI_Sendrecv

  MPI_Sendrecv(*data1,size1,type1,dest,tag1,
               *data2,size2,type2,orig,tag2,
               communicator,*status)

• a combined blocking send and receive; my personal favourite when most processes need to both send and receive
• can use MPI_PROC_NULL as destination (or origin) if there is no message to be sent (or received)
• combining operations enables MPI implementation to be more efficient

Vector datatypes

So far, all of the send/receive routines have dealt with contiguous blocks of data. However, in practice the data to be communicated is often not contiguous (e.g. 2D halo exchange).

What to do?
• Option 1: copy everything into a contiguous array, then send
• Option 2: use MPI’s capability to define new vector datatypes
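Option 1 can be sketched in plain C, without any MPI calls. The helper name pack_strided and its argument names are illustrative; the arguments mirror the count/block size/stride description of a strided layout.

```c
#include <assert.h>

/* Copy `count` blocks of `blocklen` doubles into the contiguous buffer
   dst. Successive blocks start `stride` doubles apart in src.
   Returns the number of doubles copied, which is the size to send. */
int pack_strided(const double *src, int count, int blocklen, int stride,
                 double *dst)
{
    assert(count >= 0 && blocklen >= 0 && stride >= blocklen);
    int n = 0;
    for (int b = 0; b < count; b++)
        for (int k = 0; k < blocklen; k++)
            dst[n++] = src[b * stride + k];
    return n;
}
```

For a 5×7 array stored row by row, pack_strided(&data[3], 5, 1, 7, buf) gathers one column (the 4th entry of each row) into buf, which can then be sent as 5 contiguous doubles. The receiver needs the matching unpack loop.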


Vector datatypes

1 2 3 4 5 6 7

8 9 10 11 12 13 14

15 16 17 18 19 20 21

22 23 24 25 26 27 28

29 30 31 32 33 34 35

Simplest to show an example from “Using MPI”

  MPI_Type_vector(5,1,7,MPI_DOUBLE,&newtype)

Vector datatypes

After the new data type has been defined it has to be “committed” using the command

  MPI_Type_commit(&newtype)

and then it can be used, as in

  MPI_Send(&data[3],1,newtype,destination,tag,communicator)

Note this specifies just one item, of type newtype with data[3] being the start of the item

Vector datatypes

The general syntax of MPI_Type_vector is

  MPI_Type_vector(count,size,stride,oldtype,*newtype)

• count is the number of blocks
• size is the size of each block (often 1) composed of type oldtype
• stride is the offset between each block (≥ size)
• newtype is the label for the new datatype, of type MPI_Datatype

Note: oldtype can itself be a derived datatype, so you can build up very complex datatypes.


Lecture 10: Explicit and Implicit FD Methods




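To recap, the explicit B-S discretisation is

$$ V^{n+1} = (1 - r\Delta t)\, V^n + \frac{r^* \Delta t}{2\Delta\eta} \left( \delta_{2\eta_1} + \delta_{2\eta_2} \right) V^n + \frac{\sigma^2 \Delta t}{2\Delta\eta^2} \left( (1-\rho)\,\delta^2_{\eta_1} + \rho\,\delta^2_{\eta_1\eta_2} + (1-\rho)\,\delta^2_{\eta_2} \right) V^n $$

giving a 7-point stencil

       x  x
    x  x  x
    x  x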


The computational grid is broken into partitions ...



... and each timestep involves calculations on each partition followed by an updating of the halo data – a single BSP superstep


One practical point to note: each MPI task should only allocate memory for the partition and its halo, not the entire grid.

There are two options for handling indices and arrays within each partition:
• use usual “global” indices with an adjustment to the definition of id(i,j) so that
  id(i,j) = offset+i+j*imax_local
• use “local” indices with standard arrays without offsets – this is my personal preference.
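The two indexing options can be compared side by side in a small sketch. It assumes a strip partition in the j direction, and the names id_local, id_global and joff (the global index of the first locally stored row) are illustrative choices rather than anything fixed by the lectures.

```c
#include <assert.h>

/* Local indexing: (i, jlocal) addresses a partition stored with
 * imax_local points per row, halo rows included. */
int id_local(int i, int jlocal, int imax_local)
{
    assert(i >= 0 && i < imax_local && jlocal >= 0);
    return i + jlocal * imax_local;
}

/* Global indexing: the same storage addressed with the global row index j.
 * Assumption: joff is the global index of the first stored row, so the
 * offset in id(i,j) = offset + i + j*imax_local is -joff*imax_local. */
int id_global(int i, int j, int joff, int imax_local)
{
    int offset = -joff * imax_local;
    return offset + i + j * imax_local;
}
```

Both functions address the same memory location whenever j = jlocal + joff, so the choice between them is one of convenience rather than of storage.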


Implicit FD Calculation

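The implicit discretisation is

$$(1 + r\Delta t)\,V^{n+1} - \frac{r^*\Delta t}{2\Delta\eta}\left(\delta_{2\eta_1} + \delta_{2\eta_2}\right)V^{n+1} - \frac{\sigma^2\Delta t}{2\Delta\eta^2}\left((1-\rho)\,\delta^2_{\eta_1} + \rho\,\delta^2_{\eta_1\eta_2} + (1-\rho)\,\delta^2_{\eta_2}\right)V^{n+1} = V^n$$

which may be written collectively as

$$A V^{n+1} = b$$

giving a system of simultaneous equations to be solved iteratively.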


Implicit FD Calculation

Note that the operations necessary to evaluate the matrix-vector product AV are essentially the same as for an explicit timestep.

Assuming the halo data is up-to-date, one can compute on each partition the elements of the product AV which correspond to grid points in that partition.


Jacobi iteration

The difficulties involved in parallelising an implicit FD calculation depend on how the implicit equations are solved.

The simplest approach would be to use Jacobi iteration in which each point is updated using old values of its neighbours.

$$A_{k,k}\,V_k^{(m+1)} = b_k - \sum_{l \neq k} A_{k,l}\,V_l^{(m)}$$

Each Jacobi iteration step requires just one superstep to update the interior points and exchange halo data with neighbouring partitions.
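As a serial illustration of the update formula, the sketch below performs one Jacobi sweep. For brevity it assumes a small dense matrix stored row-major rather than the sparse stencil used in practice, and the name jacobi_sweep is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* One Jacobi sweep for the n x n dense system A v = b (A stored row-major).
 * Reads only v_old, writes v_new, and returns the largest absolute change. */
double jacobi_sweep(size_t n, const double *A, const double *b,
                    const double *v_old, double *v_new)
{
    double max_change = 0.0;
    for (size_t k = 0; k < n; k++) {
        assert(A[k * n + k] != 0.0);
        double sum = b[k];
        for (size_t l = 0; l < n; l++)
            if (l != k)
                sum -= A[k * n + l] * v_old[l];   /* old neighbour values */
        v_new[k] = sum / A[k * n + k];
        double change = v_new[k] - v_old[k];
        if (change < 0.0)
            change = -change;
        if (change > max_change)
            max_change = change;
    }
    return max_change;
}
```

Because every new value depends only on v_old, the loop over k can be split across partitions in any order, which is why a single halo exchange per sweep is enough.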


CG Iteration

If a Krylov iterative solver with a simple diagonal preconditioner is used, then it is also straightforward.

To see this, we will consider the use of CG to solve

$$Ax = b$$

with A being symmetric and positive definite, with a sparse 5-point stencil in 2D.


CG Algorithm

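$x_0 = 0$; $k = 0$; $r_0 = b - A x_0$
while $|r_k| >$ tolerance
    $k = k + 1$
    if $k = 1$
        $p_1 = r_0$
    else
        $\beta_k = r_{k-1}^T r_{k-1} / r_{k-2}^T r_{k-2}$
        $p_k = r_{k-1} + \beta_k p_{k-1}$
    end
    $\alpha_k = p_k^T r_{k-1} / p_k^T A p_k$
    $x_k = x_{k-1} + \alpha_k p_k$
    $r_k = r_{k-1} - \alpha_k A p_k$
end
$x = x_k$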
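The algorithm translates almost line by line into serial C. The sketch below assumes a small dense symmetric positive definite matrix stored row-major and fixed-size work arrays; cg_solve, dot and CG_MAX_N are illustrative names, and the stopping test compares the squared residual norm with the squared tolerance to avoid a square root.

```c
#include <assert.h>
#include <stddef.h>

#define CG_MAX_N 64   /* illustrative size limit for the work arrays */

static double dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Solve A x = b for a dense symmetric positive definite n x n matrix
 * (row-major) using the recurrences above.
 * Returns the number of iterations performed. */
int cg_solve(size_t n, const double *A, const double *b, double *x,
             double tol, int max_iter)
{
    assert(n > 0 && n <= CG_MAX_N);
    double r[CG_MAX_N], p[CG_MAX_N], Ap[CG_MAX_N];
    for (size_t i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = 0.0; }

    double rr = dot(n, r, r);     /* r_{k-1}^T r_{k-1}                    */
    double rr_prev = 1.0;         /* r_{k-2}^T r_{k-2}, unused when k = 1 */
    int k = 0;
    while (rr > tol * tol && k < max_iter) {
        k++;
        double beta = (k == 1) ? 0.0 : rr / rr_prev;
        for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];

        for (size_t i = 0; i < n; i++)          /* Ap = A p_k */
            Ap[i] = dot(n, &A[i * n], p);

        double alpha = dot(n, p, r) / dot(n, p, Ap);
        for (size_t i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        rr_prev = rr;
        rr = dot(n, r, r);
    }
    return k;
}
```

In a parallel version the matrix-vector product needs up-to-date halo values of p, and each call to dot becomes a local sum followed by a global reduction; those are the communication points that the superstep counts refer to.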


CG Algorithm

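The core part of the algorithm, after some re-arrangement, is

$\alpha_k = p_k^T r_{k-1} / p_k^T A p_k$
$x_k = x_{k-1} + \alpha_k p_k$
$r_k = r_{k-1} - \alpha_k A p_k$
$\beta_{k+1} = r_k^T r_k / r_{k-1}^T r_{k-1}$
$p_{k+1} = r_k + \beta_{k+1} p_k$

which can be calculated in three supersteps as follows: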


CG Algorithm

Superstep 1:
• compute $A p_k$ on local partition
• compute local contributions to $p_k^T r_{k-1}$ and $p_k^T A p_k$ and send to others

Superstep 2:
• compute $\alpha_k$ and update $x_k$ and $r_k$
• compute local contribution to $r_k^T r_k$ and send to others

Superstep 3:
• compute $\beta_{k+1}$ and update $p_{k+1}$
• exchange $p_{k+1}$ halos


CG Algorithm

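Alternatively, if we write it as

$p_k = r_{k-1} + \beta_k p_{k-1}$
$\alpha_k = p_k^T r_{k-1} / p_k^T A p_k$
$x_k = x_{k-1} + \alpha_k p_k$
$r_k = r_{k-1} - \alpha_k A p_k$
$\beta_{k+1} = r_k^T r_k / r_{k-1}^T r_{k-1}$

then it can be done in two supersteps as follows: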


CG Algorithm

Superstep 1:
• finish computing $\beta_k$ and update $p_k$ (including halo copies)
• compute $A p_k$ on local partition
• compute local contributions to $p_k^T r_{k-1}$ and $p_k^T A p_k$ and send to others with $A p_k$ halo

Superstep 2:
• compute $\alpha_k$ and update $x_k$ and $r_k$ (including halo copies)
• compute local contribution to $r_k^T r_k$ and send to others


CG Algorithm

This second approach is unusual:
• usually don’t modify halo values – just “read-only” copies from the neighbouring partition
• works in this case because they are updated in exactly the same way as on the “master” partition

Shows a little creativity can reduce the execution time.



Lecture 11: More on Implicit Methods



Gauss-Seidel

If Gauss-Seidel is used to solve the equations, or as a preconditioner, it is harder to parallelise.
• start by partitioning the grid into strips


Gauss-Seidel: first approach

• first superstep: start with first row, first strip and work across to first partition boundary to send ‘halo’ point to neighbour

[Figure: first row of the first strip updated]



• next superstep: do second row of first strip, and first row of second strip

[Figure: second row of the first strip and first row of the second strip updated]



• additional supersteps: continue the process until the grid is completed

[Figure: staircase of updated rows advancing across the strips]
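The staircase schedule can be written down directly: strip q lags q supersteps behind the first strip. The following sketch uses 0-based rows, strips and supersteps, and its function names are invented for illustration.

```c
#include <assert.h>

/* Row (0-based) that strip q updates during superstep t (0-based) in the
 * row-by-row pipelined Gauss-Seidel sweep, or -1 if the strip is idle.
 * Strip q can only start once strip q-1 has sent its first halo point,
 * so it lags one superstep behind its left neighbour. */
int gs_row_in_superstep(int t, int q, int nrows)
{
    assert(t >= 0 && q >= 0 && nrows > 0);
    int row = t - q;
    return (row >= 0 && row < nrows) ? row : -1;
}

/* Count supersteps until every strip has swept all of its rows. */
int gs_count_supersteps(int nrows, int nstrips)
{
    int t = 0;
    for (;;) {
        int busy = 0;
        for (int q = 0; q < nstrips; q++)
            if (gs_row_in_superstep(t, q, nrows) >= 0)
                busy = 1;
        if (!busy)
            return t;
        t++;
    }
}
```

For N rows and p strips the count comes out as N + p − 1; for example 20 rows and 4 strips need 23 supersteps.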



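If the grid size is N×N, then

$$\#\,\text{supersteps} = N + p - 1 \approx N, \quad \text{assuming } p \ll N$$

$$\text{cost of single step} = s^{-1}\left(\frac{14N}{p} + l + 2g\right)$$

$$\Longrightarrow\quad \text{Total cost} = s^{-1}\left(\frac{14N^2}{p} + Nl + 2Ng\right) \approx s^{-1}\left(\frac{14N^2}{p} + Nl\right)$$

since $l \gg g$.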



For good parallel efficiency, need

$$\frac{N}{p} \gg \frac{l}{14}$$

For real hardware this implies huge problems are needed to make good use of parallelism.

What to do?

1) Use Jacobi iteration instead – tempting but lazy.
Personal view: start with best numerical algorithm and then worry about how to parallelise it.

2) Reduce number of supersteps


Gauss-Seidel: second approach

Same as before, except do m rows before transferring boundary data to neighbouring partition

[Figure: staircase of updated rows, advancing in blocks of m rows]



$$\#\,\text{supersteps} = \frac{N}{m} + p - 1 \approx \frac{N}{m} + p$$

$$\text{superstep cost} = s^{-1}\left(\frac{14mN}{p} + l + 2mg\right)$$

$$\text{Total time } T \approx s^{-1}\left(\frac{N}{m} + p\right)\left(\frac{14mN}{p} + l + 2mg\right)$$

Note that $N \gg pg$ is necessary for communication time to be negligible compared to computation. This condition is satisfied for large problems on hardware with high bandwidth.
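Multiplying out the product shows which terms the condition $N \gg pg$ makes negligible. This is only a worked restatement of the estimate for T, with no new assumptions:

```latex
\begin{align*}
\left(\frac{N}{m} + p\right)\left(\frac{14mN}{p} + l + 2mg\right)
  &= \frac{14N^2}{p} + \frac{Nl}{m} + 2Ng + 14mN + pl + 2mpg
\end{align*}
```

The communication terms $2Ng$ and $2mpg$ are smaller than the computation terms $14N^2/p$ and $14mN$ respectively by the same factor $pg/(7N)$, so both can be dropped when $N \gg pg$.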



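If it is satisfied, then

$$T \approx s^{-1}\left(14mN + pl + \frac{14N^2}{p} + \frac{Nl}{m}\right)$$

For fixed values of N, p, s, g, l the total time is a minimum when

$$\frac{dT}{dm} = 0 \;\Longrightarrow\; 14N - \frac{Nl}{m^2} = 0 \;\Longrightarrow\; m = \sqrt{l/14}$$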



For this optimum value for m we get

$$T \approx s^{-1}\left(\frac{14N^2}{p} + 2N\sqrt{14\,l} + pl\right) = s^{-1}\,\frac{14N^2}{p}\left(1 + \frac{p\sqrt{l/14}}{N}\right)^2$$

and so for good parallel efficiency we require $N \gg p\sqrt{l}$ in addition to $N \gg pg$.

These restrictions are now achievable with reasonably large problem sizes on real hardware.
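The trade-off in m is easy to explore numerically. The sketch below evaluates the unsimplified estimate T = s⁻¹(N/m + p)(14mN/p + l + 2mg) and searches integer block sizes for the minimum; the function names and the parameter values in the usage note are illustrative, not measurements of any real machine.

```c
#include <assert.h>

/* BSP estimate of the total time for the blocked Gauss-Seidel sweep:
 * (N/m + p) supersteps, each costing (14 m N / p + l + 2 m g) / s. */
double gs_total_time(double N, double p, double m,
                     double s, double l, double g)
{
    assert(N > 0 && p > 0 && m > 0 && s > 0);
    double supersteps = N / m + p;
    double step_cost  = (14.0 * m * N / p + l + 2.0 * m * g) / s;
    return supersteps * step_cost;
}

/* Integer block size in [1, m_max] with the smallest estimated time. */
int gs_best_block_size(double N, double p, int m_max,
                       double s, double l, double g)
{
    int best = 1;
    for (int m = 2; m <= m_max; m++)
        if (gs_total_time(N, p, m, s, l, g) <
            gs_total_time(N, p, best, s, l, g))
            best = m;
    return best;
}
```

With the made-up values N = 100000, p = 10, l = 1400 and g = 1, the search returns m = 10, which agrees with √(l/14) = 10.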


Gauss-Seidel

The lessons to be learned from this are:
• don’t settle for the most obvious solution; if it doesn’t give good performance work out why and try to find a solution
• the optimal parallel algorithm may depend on hardware BSP parameters; the attraction of BSP cost modelling is that it allows you to model the tradeoffs


ILU

ILU (incomplete LU factorisation) is sometimes used as a preconditioner for iterative solvers such as GMRES.

It involves solving two systems of equations with triangular matrices

$$LUx = b \;\Longrightarrow\; Ly = b, \quad Ux = y$$

The L solution is like the forward sweep in G-S; the U solution is like the reverse sweep.
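The two triangular solves look as follows in serial form. This sketch uses small dense row-major matrices as a stand-in for the sparse ILU factors, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Forward substitution: solve L y = b for lower-triangular L (row-major). */
void forward_solve(size_t n, const double *L, const double *b, double *y)
{
    for (size_t i = 0; i < n; i++) {
        assert(L[i * n + i] != 0.0);
        double sum = b[i];
        for (size_t j = 0; j < i; j++)
            sum -= L[i * n + j] * y[j];      /* uses already-computed y_j */
        y[i] = sum / L[i * n + i];
    }
}

/* Back substitution: solve U x = y for upper-triangular U (row-major). */
void backward_solve(size_t n, const double *U, const double *y, double *x)
{
    for (size_t i = n; i-- > 0; ) {
        assert(U[i * n + i] != 0.0);
        double sum = y[i];
        for (size_t j = i + 1; j < n; j++)
            sum -= U[i * n + j] * x[j];      /* uses already-computed x_j */
        x[i] = sum / U[i * n + i];
    }
}
```

Each unknown depends on the ones computed just before it, which is the same sequential dependence that makes the Gauss-Seidel sweeps hard to parallelise.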


ADI

Parallelisation of ADI preconditioners is complicated because of the tri-diagonal equations to be solved.

Start by dividing the N×N grid into $\sqrt{p} \times \sqrt{p}$ partitions to minimise communication costs.


ADI

Using the Thomas algorithm to solve the equations, in the first superstep, begin m columns and communicate appropriate data to neighbours:

[Figure: first block of m columns]


ADI

In second superstep, do the next m columns:

[Figure: next block of m columns]

Repeat until the forward sweep is complete.


ADI

Use a similar procedure for the reverse sweep.

Optimum value for m can be deduced from BSP cost analysis.
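For reference, a serial version of the Thomas algorithm with its forward and reverse sweeps is sketched below. The array naming (a, b, c for the sub-, main and super-diagonals, d for the right-hand side) is a common convention assumed here, not something defined in the lectures.

```c
#include <assert.h>
#include <stddef.h>

/* Thomas algorithm for a tridiagonal system
 *   a[i] x[i-1] + b[i] x[i] + c[i] x[i+1] = d[i],  i = 0..n-1,
 * where a[0] and c[n-1] should be set to 0. c and d are overwritten as
 * scratch; the solution is returned in x. No pivoting is performed. */
void thomas_solve(size_t n, const double *a, const double *b,
                  double *c, double *d, double *x)
{
    assert(n > 0 && b[0] != 0.0);
    /* forward sweep: eliminate the sub-diagonal */
    c[0] /= b[0];
    d[0] /= b[0];
    for (size_t i = 1; i < n; i++) {
        double denom = b[i] - a[i] * c[i - 1];
        assert(denom != 0.0);
        c[i] /= denom;
        d[i] = (d[i] - a[i] * d[i - 1]) / denom;
    }
    /* reverse sweep: back substitution */
    x[n - 1] = d[n - 1];
    for (size_t i = n - 1; i-- > 0; )
        x[i] = d[i] - c[i] * x[i + 1];
}
```

Step i of the forward sweep needs the result of step i−1, so a partition cannot start until its neighbour has passed on its last values; processing m lines per superstep pipelines this dependence in the same way as the blocked Gauss-Seidel sweep.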



Lecture 12: More on MPI



Cartesian Grids

Most finite difference methods use structured grids with i,j,k indexing (as opposed to finite element methods that often use unstructured grids composed of triangles/tetrahedra with a very general connectivity).

MPI calls them Cartesian grids, and provides a number of special routines to make it easy to work with them.


Cartesian Grids

MPI_Dims_create(nprocs,ndim,*pdims)

This routine creates a process grid to partition a multi-dimensional Cartesian grid
• nprocs is the number of processes (input)
• ndim is the number of dimensions (input)
• pdims is an array containing the dimensions of the process grid (output) with the product being equal to nprocs; entries should be set to 0 on input to let MPI choose them


Cartesian Grids

MPI_Cart_create(oldcomm, ndim, *pdims, *periodic, reorder, *newcomm)

This routine assigns processes to the process grid and creates a new communicator
• oldcomm is the old communicator (usually MPI_COMM_WORLD); newcomm is the new one
• ndim and pdims are same as before
• periodic is an array defining whether the grid is to be periodic
• reorder is an integer flag specifying whether to give MPI full freedom in how to assign processors


Cartesian Grids

MPI_Cart_coords(newcomm,myid,ndim,*coords)

This routine gives the coordinates of the process within the Cartesian process grid
• myid is the rank obtained by calling MPI_Comm_rank(newcomm,*myid)
• coords is an integer array of size ndim giving the coordinates


Cartesian Grids

MPI_Cart_shift(newcomm,dir,shift,*src,*dest)

Often want to shift data from a processor to its neighbour in a particular direction. This routine gives the IDs of the two neighbouring processes:
• 0 ≤ dir < ndim is the direction
• shift is the size of shift (usually 1)
• src is the ID of the process below (the source of shifted messages)
• dest is the ID of the process above (the destination of shifted messages)
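To see what the coordinate and shift routines compute, the plain C sketch below emulates them without calling MPI. It assumes the row-major rank ordering that MPI uses for Cartesian topologies; grid_coords, grid_rank, grid_shift and PROC_NULL are hypothetical stand-ins, not MPI names.

```c
#include <assert.h>

#define PROC_NULL (-1)   /* stand-in for MPI_PROC_NULL in this sketch */

/* Rank -> coordinates in a row-major process grid (last index fastest). */
void grid_coords(int rank, int ndim, const int *pdims, int *coords)
{
    assert(rank >= 0 && ndim > 0);
    for (int d = ndim - 1; d >= 0; d--) {
        coords[d] = rank % pdims[d];
        rank /= pdims[d];
    }
}

/* Coordinates -> rank, or PROC_NULL if a non-periodic coordinate is
 * outside the grid. Periodic coordinates wrap around. */
int grid_rank(int ndim, const int *pdims, const int *periodic,
              const int *coords)
{
    int rank = 0;
    for (int d = 0; d < ndim; d++) {
        int c = coords[d];
        if (periodic[d])
            c = ((c % pdims[d]) + pdims[d]) % pdims[d];
        else if (c < 0 || c >= pdims[d])
            return PROC_NULL;
        rank = rank * pdims[d] + c;
    }
    return rank;
}

/* Neighbours of `rank` for a shift along direction dir:
 * src is the process `shift` below, dest the process `shift` above. */
void grid_shift(int rank, int ndim, const int *pdims, const int *periodic,
                int dir, int shift, int *src, int *dest)
{
    int coords[8];                       /* sketch supports ndim <= 8 */
    assert(ndim <= 8 && dir >= 0 && dir < ndim);
    grid_coords(rank, ndim, pdims, coords);
    int c = coords[dir];
    coords[dir] = c - shift;
    *src = grid_rank(ndim, pdims, periodic, coords);
    coords[dir] = c + shift;
    *dest = grid_rank(ndim, pdims, periodic, coords);
}
```

On a 2×3 grid that is periodic only in the second direction, rank 4 sits at coordinates (1,1); shifting by 1 in direction 0 gives src = 1 and no destination, because the grid ends there.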


Cartesian Grids

These routines provide all the key capabilities for working with Cartesian grids – all are used in Practical 5.

The one thing not provided is a simple routine to exchange halos – this you have to program yourself using a vector datatype.


Cartesian Grids

[Figure: 2D partition of size nx × ny, with halo points along one edge in each direction]

In 2D, exchange in x-direction uses a stride of nx, and exchange in y-direction is a simple contiguous transfer.


Cartesian Grids

In 3D, it is hard to visualise, but
• in x-direction, halo has ny*nz elements with a stride of nx
• in z-direction, halo is a single contiguous block of size nx*ny
• in y-direction, halo has nz blocks of size nx with stride nx*ny

Practical 5 generalises this to an arbitrary number of dimensions
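The pattern behind these three cases can be captured in one small routine. The sketch below is an assumed generalisation, not taken from Practical 5: it computes the (count, size, stride) triple for a halo face in any direction, with index 0 varying fastest, and the names halo_layout and halo_face_layout are invented here.

```c
#include <assert.h>

/* Strided layout (count blocks of `size` elements, `stride` apart) of one
 * face of an n[0] x n[1] x ... array with index 0 varying fastest.
 * dir is the direction normal to the face. */
typedef struct { long count, size, stride; } halo_layout;

halo_layout halo_face_layout(int ndim, const int *n, int dir)
{
    assert(ndim > 0 && dir >= 0 && dir < ndim);
    halo_layout h = {1, 1, 1};
    for (int d = 0; d < dir; d++)          /* faster-varying dimensions */
        h.size *= n[d];
    for (int d = dir + 1; d < ndim; d++)   /* slower-varying dimensions */
        h.count *= n[d];
    h.stride = h.size * n[dir];
    return h;
}
```

The three numbers are in the form expected by MPI_Type_vector(count,size,stride,...). For nx = 4, ny = 5, nz = 6 the x-direction face has 30 blocks of size 1 with stride 4, and the y-direction face has 6 blocks of size 4 with stride 20.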


Blocking and Buffering

In lecture 7, discussed the concepts of blocking and buffering
• blocking means the program waits until the operation has completed before continuing
• buffering means the data is copied to a temporary buffer during transmission

MPI provides 5 different combinations of send and receive – enough to confuse anyone!
• MPI_Send, MPI_Ssend, MPI_Bsend all pair with MPI_Recv
• MPI_Sendrecv works on its own
• MPI_Isend pairs with MPI_Irecv



MPI_Recv is a blocking receive
• program can only continue once the message has arrived

MPI_Ssend is a blocking send
• synchronous transfer like sending a fax
• simple, but generally not efficient due to unnecessary waiting



MPI_Bsend is a non-blocking buffered send
• copies the data into a buffer before continuing
• user must supply the buffer – see documentation
• generally good for efficiency (less waiting) but copying data costs time, and supplying the buffer is tedious and error-prone

MPI_Send is a cross between MPI_Ssend and MPI_Bsend – a reasonable compromise
• uses internal buffer for small messages
• uses synchronous transfer for large ones



MPI_Sendrecv is a blocking send/recv pair
• very well suited to halo exchange
• the MPI system decides the order of sending and receiving
• no buffering so no time wasted on copying
• very easy to use



MPI_Isend and MPI_Irecv are quite different, non-blocking operations producing an asynchronous transfer

Continuing the letter/fax analogy, these are like shipping a piano using a courier company:
• sender says “here’s where the piano is”
• receiver says “here’s where I want it to go when it arrives”
• the courier ships directly, when both ready
• sender and receiver continue as usual, occasionally checking to see if the piano has gone/arrived



The syntax for MPI_Isend is:

MPI_Isend(*data,size,type,destination,tag,communicator,*request)

The one extra argument compared to MPI_Send is request. This is a handle which can be used later to check if the send operation has been completed, or to wait for it to complete.



Similarly, the syntax for MPI_Irecv is:

MPI_Irecv(*data,size,type,origin,tag,communicator,*request)

Compared to MPI_Recv the argument status has been replaced by the handle request.



The status of a request can be tested with the command

MPI_Test(*request,*flag,*status)

with flag being true if it has been completed.

Alternatively, can wait for it to be completed using

MPI_Wait(*request,*status)

There are also MPI_Waitall and MPI_Waitany variants for handling multiple requests; they do what their names suggest.



When using MPI_Isend and MPI_Irecv the important thing is not to touch the data after the start of the transfer and before its completion.

The whole point of using MPI_Test and MPI_Wait is to know when it is safe to start using the data on the receiving side, and to re-use the storage on the sending side.

“Using MPI” advocates this form of sending messages – I agree in principle, but I think MPI_Sendrecv is simpler / more intuitive.


Other MPI Capabilities

• more general datatypes, useful for handling structures or a mix of real and integer variables
• support for parallel libraries, to make sure message-passing within the library does not conflict with the user’s own message-passing
• MPI error handling
• various scatter/gather operations in addition to broadcast
• routines for constructing new communicators, e.g. to enable different groups of processes to do different tasks


Other MPI Capabilities

MPI-2 (new standard, not yet fully implemented):
• dynamic spawning of new processes, and their inclusion into new communicators
• parallel file I/O – better performance than all file I/O being done by one process
• remote memory operations put/get, directly accessing remote memory without any action by remote process


Final Advice

• get a solid understanding of the basics
• read the early chapters of “Using MPI” carefully, maybe skim through the rest
• if using MPI_Send, check it works if you use MPI_Ssend instead
• only use advanced capabilities if you’re sure they will greatly simplify the programming (e.g. the Cartesian utilities) or greatly improve performance
• keep the MPI code as isolated as possible from the main application code
• if it’s an important application, discuss it with others with more experience


Practical 4

Global view of data partitioning:

[Figure: global grid with rows j = 0, j = 1, ..., j = m−2, j = m−1]


Practical 4

Local view of data partitioning:

j = jlower − 1     jlocal = 0           (halo)
j = jlower         jlocal = 1
j = jupper         jlocal = jmax − 2
j = jupper + 1     jlocal = jmax − 1    (halo)

jlower = ((m−2) ∗ myid)/nprocs + 1
jupper = ((m−2) ∗ (myid+1))/nprocs
jmax = jupper − jlower + 3
joff = jlower − 1
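These four formulas can be wrapped in a small helper, sketched here with integer division exactly as written above. The struct and function names are illustrative and not taken from the practical's code.

```c
#include <assert.h>

typedef struct { int jlower, jupper, jmax, joff; } strip;

/* Rows owned by process myid when the m-2 interior rows j = 1..m-2 are
 * shared among nprocs processes; jmax counts the two halo rows as well. */
strip strip_partition(int m, int myid, int nprocs)
{
    assert(m > 2 && nprocs > 0 && myid >= 0 && myid < nprocs);
    strip s;
    s.jlower = ((m - 2) * myid) / nprocs + 1;
    s.jupper = ((m - 2) * (myid + 1)) / nprocs;
    s.jmax   = s.jupper - s.jlower + 3;
    s.joff   = s.jlower - 1;
    return s;
}
```

For m = 12 and nprocs = 3 the three processes own rows 1–3, 4–6 and 7–10. Consecutive strips meet without gaps or overlap because jlower of process myid+1 is always jupper of process myid plus one.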
