Numerically Intensive Computing in Finance Lecture 1: Introduction and Prototype Applications Mike Giles [email protected] Lecture 1 1

Source: people.maths.ox.ac.uk/gilesm/talks/nicf06.pdf


Numerically Intensive Computing in Finance

Lecture 1: Introduction and Prototype Applications

Mike Giles [email protected]


Objectives

This course is motivated by the fact that large-scale computations are becoming a standard part of mathematical finance.

By the end of this course you should have:
• some understanding of computer hardware, and the trends for the future;
• a good understanding of the different kinds of parallel computing;
• an understanding of the different kinds of parallelism inherent in financial applications;
• some practical experience with parallel codes!


Course Structure

Week 8: MM&SC and ACM MSc’s, and OSC users and others within Oxford University

Lectures in the mornings, Monday-Thursday:
• 9:30 – 10:30
• 10:45 – 11:45
• 12:00 – 13:00

Practicals in the afternoons, Monday-Friday, 2-6. There is no need to be present all the time, just work at your own pace to complete assignments.

For those doing the course as a Special Topic there will be additional projects afterwards.


Course Structure

Week 9: MSc in Mathematical Finance (module 10)

Lectures in the mornings, Tuesday-Friday:
• 9:00 – 10:00
• 10:15 – 11:15
• 11:30 – 12:30

Practicals in the afternoons, Tuesday-Thursday, 1:30-6. There is no need to be present all the time, just work at your own pace to complete practicals 1-4.


Lecture Outline

Day 1: Introduction
1 Two prototype problems: Monte-Carlo and Black-Scholes financial applications
2 The “big picture” overview of high performance computing
3 Distributed resource management and web services

Day 2: Shared-memory Parallelism
4 Processor and memory technology
5 Shared-memory multiprocessors
6 OpenMP multi-threaded computing


Lecture Outline

Day 3: Distributed-memory parallelism
7 Distributed-memory systems
8 BSP model of distributed computing, and parallelisation of explicit approximations
9 Introduction to MPI message passing

Day 4: Distributed-memory applications
10 Parallelisation of explicit approximations
11 Parallelisation of implicit approximations
12 More on MPI


Practicals

The practicals are a very important part of the course. Anyone taking the course for credit as part of one of the MSc’s must complete practicals 1-4 and hand in a write-up showing that they have gone through all of the exercises.
• Using NAG libraries, Grid Engine and web services for parallel Monte-Carlo calculations
• Using OpenMP multithreading for an explicit finite difference Black-Scholes discretisation
• An introduction to MPI message passing, including for Monte-Carlo calculations
• Using MPI for an explicit B-S FD method
• Using MPI for an implicit B-S FD method


Reading Material

Hardware
• Web references for current hardware (see links from course webpage)
• John L. Hennessy and David A. Patterson, Computer Architecture: a Quantitative Approach, 3rd edition, Morgan Kaufmann, 2003.


Reading Material

Software
• R. Chandra et al, Parallel Programming in OpenMP, Morgan Kaufmann, 2001.
• W. Gropp, E. Lusk and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface (second edition), MIT Press, 2000.


Reading Material

Mathematical Finance
• Lecture notes for MSc courses
• P. Wilmott, S. D. Howison and J. Dewynne, Mathematics of Financial Derivatives, CUP, 1995.
• D. Duffy, Finite Difference Methods in Financial Engineering: A Partial Differential Equation Approach, John Wiley and Sons, 2006.
• P. Glasserman, Monte Carlo Methods in Financial Engineering, Springer, 2004.


Monte Carlo Model Problem

Stochastic differential models in mathematical finance have the form

$$dS = a(S, t)\,dt + b(S, t)\,dW$$

where $S$ and $a$ are vectors, $b$ is a matrix, and $dW$ is an increment of a vector Wiener path with correlation $\Sigma(S, t)$.

These are to be solved subject to some initial conditions at time $t = 0$, and the aim is to determine the expected (discounted) value of a payoff function of the state at final time $t = T$.


Monte Carlo Model Problem

In Monte-Carlo simulations, the expected value is estimated by averaging the values obtained by doing lots of different path calculations with different random inputs.

Using Forward Euler time discretisation, each path is calculated from

$$S^{n+1} = S^n + a(S^n, t_n)\,\Delta t + b(S^n, t_n)\,\Delta W^n$$

where $\Delta W^n$ is a vector of normally distributed random variables with zero mean, variance $\Delta t$ and correlation $\Sigma(S^n, t_n)$.
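The Forward Euler update can be sketched for a single scalar path as follows. This is a minimal illustration using only the Python standard library, not code from the course; the function name `euler_path` and the example parameter values are assumptions made here.

```python
import math
import random

def euler_path(s0, a, b, t_end, n_steps, rng):
    """Advance one scalar path of dS = a(S,t) dt + b(S,t) dW with Forward Euler."""
    dt = t_end / n_steps
    s, t = s0, 0.0
    for _ in range(n_steps):
        # Wiener increment: normally distributed, zero mean, variance dt
        dw = rng.gauss(0.0, math.sqrt(dt))
        s = s + a(s, t) * dt + b(s, t) * dw
        t += dt
    return s

# Example: geometric Brownian motion with assumed values r = 0.05, sigma = 0.2
rng = random.Random(42)
final_value = euler_path(1.0, lambda s, t: 0.05 * s, lambda s, t: 0.2 * s, 1.0, 100, rng)
```

The vector case replaces the scalar products by matrix-vector products and draws a correlated vector of increments.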


Monte Carlo Model Problem

Because of the Central Limit Theorem, the error in the estimated value is proportional to $N^{-1/2}$, where $N$ is the number of paths calculated.

There are various techniques for reducing the constant of proportionality (control variates and other variance reduction methods) and improving the exponent (quasi-random sequences), but for the purposes of this course these are not important.
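As a worked illustration of the $N^{-1/2}$ behaviour (a standard result; the symbols $P_i$ for the payoff of path $i$ and $\hat{\sigma}$ for the sample standard deviation of the payoffs are notation introduced here, not in the slides):

```latex
\hat{V} = \frac{1}{N} \sum_{i=1}^{N} P_i ,
\qquad
\text{standard error of } \hat{V} \approx \frac{\hat{\sigma}}{\sqrt{N}}
```

So reducing the error by a factor of 10 needs roughly 100 times as many paths, which is why the computational cost matters.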


Monte Carlo Model Problem

What is important?
• Each path calculation is entirely independent
  – “trivially parallel”, just run a number of paths on each machine in a “cluster” or “farm” and average the output
• Need to generate lots of random numbers
• Each path needs a completely independent set of random numbers


Monte Carlo Model Problem

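Our two-asset model problem is

$$dS_1 = r\,S_1\,dt + \sigma\,S_1\,dW_1$$

$$dS_2 = r\,S_2\,dt + \sigma\,S_2\,dW_2$$

with correlation matrix

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

The initial conditions at $t = 0$ are $S_1 = S_2 = 1$ and the discounted payoff at $t = 1$ is

$$P(S_1, S_2) = \begin{cases} e^{-r}, & \max(|S_1 - 1|, |S_2 - 1|) < 0.1 \\ 0, & \text{otherwise} \end{cases}$$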


Random Number Generation

• use standard numerical libraries, so don’t need to know how they’re generated
• uniformly distributed random numbers on [0,1] are generated by a recurrence relation
• converted into Normally distributed variables with zero mean and unit variance through:
  – Box-Muller method
  – Marsaglia-Bray method
  – inverting cumulative probability distribution

See Glasserman’s book for more details.
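For orientation, the Box-Muller conversion listed above can be sketched in a few lines. This is a minimal illustration using Python's standard `random` module as the uniform generator; in the course a numerical library does this for you, and the function names here are assumptions.

```python
import math
import random

def box_muller(u1, u2):
    """Map two independent uniforms (u1 in (0,1], u2 in [0,1)) to two independent N(0,1) values."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def normal_samples(n, seed=0):
    """Return n standard normal samples generated by Box-Muller."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()  # shift [0,1) to (0,1] so that log(u1) is finite
        u2 = rng.random()
        out.extend(box_muller(u1, u2))
    return out[:n]

samples = normal_samples(20000, seed=1)
```

Each pair of uniforms yields a pair of normals, so the generator is called in pairs and any surplus value is discarded at the end.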


Random Number Generation

If $X$ is a vector of independent normally distributed random variables with zero mean and unit variance, then the vector $Y$ defined by

$$Y = L\,X$$

is a vector of normally distributed variables with zero mean and covariance matrix

$$\Sigma = L\,L^T$$
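The covariance result follows in one line, since the components of $X$ are independent with unit variance, so that $E[X X^T] = I$:

```latex
E[Y Y^T] = E[L X X^T L^T] = L \, E[X X^T] \, L^T = L \, I \, L^T = L L^T
```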


Random Number Generation

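Given a particular desired $\Sigma$, the simplest choice for $L$ is a Cholesky factorisation in which $L$ is lower-triangular.

For our model problem

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

this gives

$$L = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{pmatrix}$$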
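Putting the pieces together, a Monte Carlo estimate for the two-asset model problem might look like the sketch below. It is an illustration only, not the course's practical code: the parameter values for r, σ and ρ are not given in the slides and are assumptions here, as are the function names. It uses Forward Euler time stepping and the Cholesky factor $L$ to correlate the two Wiener increments.

```python
import math
import random

def cholesky_2x2(rho):
    """Lower-triangular L with L L^T = [[1, rho], [rho, 1]]."""
    return [[1.0, 0.0], [rho, math.sqrt(1.0 - rho * rho)]]

def mc_two_asset(n_paths, n_steps, r, sigma, rho, seed=0):
    """Forward Euler Monte Carlo estimate of the discounted payoff and its standard error."""
    rng = random.Random(seed)
    L = cholesky_2x2(rho)
    dt = 1.0 / n_steps              # final time is t = 1
    sqrt_dt = math.sqrt(dt)
    discount = math.exp(-r)
    total = total_sq = 0.0
    for _ in range(n_paths):
        s1 = s2 = 1.0               # initial conditions S1 = S2 = 1
        for _ in range(n_steps):
            x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            # Correlated increments dW = sqrt(dt) * L X
            dw1 = sqrt_dt * (L[0][0] * x1)
            dw2 = sqrt_dt * (L[1][0] * x1 + L[1][1] * x2)
            s1 += r * s1 * dt + sigma * s1 * dw1
            s2 += r * s2 * dt + sigma * s2 * dw2
        payoff = discount if max(abs(s1 - 1.0), abs(s2 - 1.0)) < 0.1 else 0.0
        total += payoff
        total_sq += payoff * payoff
    mean = total / n_paths
    variance = max(total_sq / n_paths - mean * mean, 0.0)
    return mean, math.sqrt(variance / n_paths)

# Assumed parameter values, chosen only for illustration
estimate, std_error = mc_two_asset(2000, 10, r=0.05, sigma=0.2, rho=0.5)
```

The second return value is the sample standard error, which shrinks like $N^{-1/2}$ as the number of paths grows.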


Finite Difference Model Problem

The Black-Scholes equation for our two-asset model problem can be written in the form

$$V_t + r S_1 V_{S_1} + r S_2 V_{S_2} + \sigma^2 \left( \tfrac{1}{2} S_1^2 V_{S_1 S_1} + \rho\, S_1 S_2 V_{S_1 S_2} + \tfrac{1}{2} S_2^2 V_{S_2 S_2} \right) = rV$$

This is solved backwards in time from the final value equal to the payoff function, to get the value at the initial time $t = 0$.


Finite Difference Model Problem

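Switching to new variables $\eta = \log S$, $\tau = 1 - t$, and defining

$$r^* = r - \tfrac{1}{2}\sigma^2,$$

the equation becomes

$$V_\tau = r^* \left( V_{\eta_1} + V_{\eta_2} \right) + \sigma^2 \left( \tfrac{1}{2} V_{\eta_1 \eta_1} + \rho\, V_{\eta_1 \eta_2} + \tfrac{1}{2} V_{\eta_2 \eta_2} \right) - rV$$

which is to be solved forward in time from $\tau = 0$ to $\tau = 1$.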
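The intermediate step, not spelled out on the slide, is the chain rule for $\eta_i = \log S_i$ and $\tau = 1 - t$:

```latex
S_i \frac{\partial V}{\partial S_i} = \frac{\partial V}{\partial \eta_i}, \qquad
S_i^2 \frac{\partial^2 V}{\partial S_i^2} = \frac{\partial^2 V}{\partial \eta_i^2} - \frac{\partial V}{\partial \eta_i}, \qquad
S_1 S_2 \frac{\partial^2 V}{\partial S_1 \partial S_2} = \frac{\partial^2 V}{\partial \eta_1 \partial \eta_2}, \qquad
\frac{\partial V}{\partial t} = -\frac{\partial V}{\partial \tau}
```

The extra first-derivative terms that come from the second derivatives combine with the $r$ terms to give the coefficient $r^* = r - \tfrac{1}{2}\sigma^2$.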


Finite Difference Model Problem

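A simple Explicit Euler central space discretisation on a uniform Cartesian grid is

$$V^{n+1} = (1 - r\Delta t)\, V^n + \frac{r^* \Delta t}{2\Delta\eta} \left( \delta_{2\eta_1} + \delta_{2\eta_2} \right) V^n + \frac{\sigma^2 \Delta t}{2\Delta\eta^2} \left( (1-\rho)\, \delta^2_{\eta_1} + \rho\, \delta^2_{\eta_1 \eta_2} + (1-\rho)\, \delta^2_{\eta_2} \right) V^n$$

where

$$\delta_{2\eta_1} V_{i,j} \equiv V_{i+1,j} - V_{i-1,j}$$

$$\delta_{2\eta_2} V_{i,j} \equiv V_{i,j+1} - V_{i,j-1}$$

and . . .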


Finite Difference Model Problem

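$$\delta^2_{\eta_1} V_{i,j} \equiv V_{i+1,j} - 2V_{i,j} + V_{i-1,j}$$

$$\delta^2_{\eta_1 \eta_2} V_{i,j} \equiv V_{i+1,j+1} - 2V_{i,j} + V_{i-1,j-1}$$

$$\delta^2_{\eta_2} V_{i,j} \equiv V_{i,j+1} - 2V_{i,j} + V_{i,j-1}$$

making it a 7-point stencil:

[Figure: the 7-point stencil, drawn as rows of 2, 3 and 2 grid points]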
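One explicit time step with this 7-point stencil can be sketched as below. It is a plain-Python illustration, not the course's practical code; the function name is an assumption, the boundary treatment (boundary values simply copied) is an assumption because the slides do not specify one, and no stability check on the time step is made.

```python
def explicit_step(V, r, sigma, rho, dt, deta):
    """One Explicit Euler step on the interior of a 2D grid V[i][j].

    Boundary values are copied unchanged (an assumption; the slides do not
    specify boundary conditions).
    """
    r_star = r - 0.5 * sigma * sigma
    c1 = r_star * dt / (2.0 * deta)
    c2 = sigma * sigma * dt / (2.0 * deta * deta)
    ni, nj = len(V), len(V[0])
    new = [row[:] for row in V]
    for i in range(1, ni - 1):
        for j in range(1, nj - 1):
            v = V[i][j]
            d1 = V[i + 1][j] - V[i - 1][j]                      # delta_{2 eta1}
            d2 = V[i][j + 1] - V[i][j - 1]                      # delta_{2 eta2}
            d11 = V[i + 1][j] - 2.0 * v + V[i - 1][j]           # delta^2_{eta1}
            d22 = V[i][j + 1] - 2.0 * v + V[i][j - 1]           # delta^2_{eta2}
            d12 = V[i + 1][j + 1] - 2.0 * v + V[i - 1][j - 1]   # delta^2_{eta1 eta2}
            new[i][j] = ((1.0 - r * dt) * v + c1 * (d1 + d2)
                         + c2 * ((1.0 - rho) * d11 + rho * d12 + (1.0 - rho) * d22))
    return new

# A constant field only decays through the -rV term
grid = [[1.0] * 5 for _ in range(5)]
stepped = explicit_step(grid, r=0.05, sigma=0.2, rho=0.5, dt=0.01, deta=0.1)
```

Every interior point is updated from old values only, so the loop iterations are independent of each other.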


Finite Difference Model Problem

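If we instead use Backward Euler time differencing, giving

$$(1 + r\Delta t)\, V^{n+1} - \frac{r^* \Delta t}{2\Delta\eta} \left( \delta_{2\eta_1} + \delta_{2\eta_2} \right) V^{n+1} - \frac{\sigma^2 \Delta t}{2\Delta\eta^2} \left( (1-\rho)\, \delta^2_{\eta_1} + \rho\, \delta^2_{\eta_1 \eta_2} + (1-\rho)\, \delta^2_{\eta_2} \right) V^{n+1} = V^n$$

then the question is how to solve the system of simultaneous equations for $V^{n+1}$.

Jacobi, Gauss-Seidel and CG-like iterative solution methods will be considered later.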
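To give a flavour of the simplest of these, here is a generic Jacobi iteration on a small dense system. This is only a sketch under stated assumptions: the function name and the 2×2 example are made up for illustration, and the real system would be large and sparse, with one row per grid point of the 7-point stencil.

```python
def jacobi(A, b, n_iter=200, tol=1e-12):
    """Solve A x = b by Jacobi iteration (suitable for diagonally dominant A)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(n_iter):
        # Every component is updated from the previous iterate only
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        change = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        if change < tol:
            break
    return x

solution = jacobi([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

As with the explicit scheme, each component is updated from old values only.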


Numerically Intensive Computing in Finance

Lecture 2: Computing – the “Big Picture”

Mike Giles [email protected]


The Driving Forces

Money and economics are what drive computing, not technology.

Money: if there’s a big enough market, someone will develop the product.

Economics: cost per unit item is minimised by producing huge numbers of the same item – particularly important in computing where the costs of development and fabrication plant are huge (measured in $bn’s).


Technological Trends

Moore’s Law (from Gordon Moore of Intel, 30 years ago): CPU speed doubles every 18-24 months (strictly, Moore’s observation was about the number of transistors on a chip, which has fed through into CPU speed)

There is similar growth in all other hardware aspects: memory size, memory bandwidth, disk size, network speed, . . .

Safe to assume that this will continue for at least the next 10 years, driven by:
• multimedia applications
• anti-virus/firewall/anti-spam software
• image processing
• “intelligent” software


The Hardware Pyramid

[Figure: pyramid of hardware tiers labelled embedded systems, laptops, PC’s, servers, supercomputers]


Hardware

• almost all computing is now done on systems built of commodity components, benefiting from economies of scale
  – the days of highly-specialised “vector” supercomputers are over
• roughly 4 : 2 : 1 ratio in performance for CPUs in servers : PCs : embedded systems
• Intel is the dominant force in CPUs; only AMD, IBM, Sun are left in competition


Multi-level Parallelism

• instruction parallelism (e.g. addition)
• pipeline parallelism, overlapping different instructions
• multiple pipelines, each with own capabilities
• multiple CPU’s within a single “multicore” chip
• multiple chips within a single shared-memory computer
• multiple computers within a distributed-memory system
• multiple systems within an organisation


Hardware

Lecture 4 will look at CPUs to understand the lower levels of parallelism, and at how data is moved between the CPU and the main memory using caches.

An understanding of both is required to get the best execution speed from sequential processes, and the memory hierarchy also has major consequences for parallel computing.


Hardware for high-end computing

1) shared-memory multiprocessor
• the modern mainframe – top products from Sun and IBM are used widely in banks, especially for database applications
• single very large memory (up to 250GB?) accessed by multiple processors (up to 72 dual-core chips)
• hardware challenge is high bandwidth memory access – costly
• often has high-reliability features such as hot-swap disks, redundant power supplies – adds to cost


Hardware for high-end computing

Oxford Supercomputing Centre plans:
• spend 20% of budget on shared-memory systems for specific applications (e.g. Gaussian, molecular modelling package)
• each with probably 8 dual-core processors, and maybe 32GB memory
• probably no high-reliability features to minimise cost


Hardware for high-end computing

2) tightly-coupled distributed-memory system
• multiple nodes (with 1 – 4 processors) each with own memory
• high-bandwidth low-latency network connection (Gigabit Ethernet, Myrinet, Infiniband)
• in academia, used to be collections of PCs on a shelf, but now there are tailored packages from the leading vendors
• a key issue is system management; you lose all of the price/performance benefits if you have to employ lots of system managers


Hardware for high-end computing

Oxford Supercomputing Centre plans:
• spend 80% of budget on several large clusters
• each with probably 128 “nodes” containing 2 dual-core chips, so a total of 128 × 2 × 2 = 512 cores per cluster
• probably Gigabit Ethernet with custom drivers for networking, except for one cluster with higher-spec custom networking


Hardware for high-end computing

Another example is the OCCF cluster for the Oxford Centre for Computational Finance and the Computing Laboratory
• 24 Sun Ultra-80 nodes each with 4 UltraSPARC processors and 2GB memory
• connected by Myrinet for parallel computing, and 100Mb/s Ethernet for file i/o and external network access
• very old now – about to be shut down


Hardware for high-end computing

3) loosely-coupled PC/workstation “farms”
• similar to 2) but with relatively low-speed interconnect (100Mb/s or Gigabit Ethernet with TCP/IP software)
• ideally suited for “trivially-parallel” applications like Monte-Carlo
• system management and resource management are again the key issues


Hardware for high-end computing

Dedicated farms:
• racks of up to 4000 “pizza box” servers at bio-informatics companies

Collection of “idle” resources:
• traders’ workstations/PCs which are idle overnight and at weekends
• computer teaching labs in the university, unused most of the time!


Hardware for high-end computing

Comparison:
• 8 : 2 : 1 cost/performance ratio for three categories
• shared-memory and distributed-memory systems built of high-end processors because of cost of interconnect
• PC/workstation farms built of low-end processors for lowest cost/performance ratio


Hardware for high-end computing

Trends:
• business/finance is replacing science as main user of “supercomputers”
• big shared-memory systems (> 64 procs) are in decline because database software has been re-written for distributed-memory systems
• concerns over power consumption caused move to lower-frequency multicore chips – increasingly the aim is to maximise CPU performance per watt!


Electrical Power

• new OSC computer room will have 600kW supply for computers (plus an additional 400kW to keep them cool)
• total power consumption: 1MW
  total electricity bill: £400k/yr
• averaged over a 3-year lifetime, electricity cost is roughly 40% of the purchase price

• Intel Pentium 4 Extreme Edition had clock frequency up to 3.8GHz and used up to 130W
• new Intel multicore chips run at up to 3GHz and use 65-75W
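The figures above can be cross-checked with a little arithmetic. The sketch below assumes the machines draw full power all year round, which is an assumption made here; the unit price and purchase price it prints are implied by the slide's numbers, not quoted on it.

```python
HOURS_PER_YEAR = 24 * 365

def implied_price_per_kwh(total_kw, annual_bill_gbp):
    """Unit electricity price implied by a constant load and an annual bill."""
    return annual_bill_gbp / (total_kw * HOURS_PER_YEAR)

def implied_purchase_price(annual_bill_gbp, years, fraction_of_purchase):
    """Purchase price implied if lifetime electricity is a given fraction of it."""
    return annual_bill_gbp * years / fraction_of_purchase

total_kw = 600 + 400                                   # computers plus cooling
price = implied_price_per_kwh(total_kw, 400_000)       # pounds per kWh
purchase = implied_purchase_price(400_000, 3, 0.40)    # pounds

print(f"total load: {total_kw} kW")
print(f"implied unit price: {price * 100:.1f} p/kWh")
print(f"implied purchase price: {purchase / 1e6:.1f} million pounds")
```

Under these assumptions, 3 years at £400k/yr is £1.2M of electricity, which is 40% of a purchase price of about £3M.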


Final “Big Picture” Considerations

• driving force is vast market for PCs and servers with a price tag of £500 - £1500
• consequence is that a compute cluster costing £1M may have up to 1000 chips (2000 cores) if the interconnect is not too expensive
• move to clusters with multi-core chips means we may have to exploit both shared-memory and distributed-memory parallel computing in high-end applications.


Hardware vs. Software

Hardware:
• phenomenal technological advances, driven by user needs
• new products every year, new architectures every 10 years

Software:
• disappointingly slow progress, limited by “people” issues
• new languages and standards every 10 years or so


Software “People” Issues

• need global standards, agreed by committee which takes time – Fortran 90 was motivated by vector computing, which was on the way out by the time the standard was agreed!
• need (re)training of staff – in the worst case have to wait for existing staff to retire!
• staff can be reluctant to learn new skills which might not be transferable to a new employer – another reason for standards
• companies have been happier investing in hardware than software – changing these days?


Languages

Fortran and C:
• for those who want the highest performance
• closest to the level of the operating system (written in C)

C++, Java, C#:
• object-oriented computing for better software design and re-use of code (in principle)

Visual Basic, Matlab:
• “niche” languages with very strong following
• emphasis on ease of use


Parallel Computing Standards

• OpenMP for multithreaded computing on shared-memory systems
• MPI for message-passing on distributed-memory systems
• both support Fortran, C and C++ and provide portability across all major vendors

However, both are rather low-level and MPI involves tedious programming; I’d like to see more research on developing parallel libraries to handle parallelism automatically


Numerically Intensive Computing in Finance

Lecture 3: Distributed resource management and web services

Mike Giles [email protected]


“Trivial” Parallelism

Monte-Carlo applications are a good example of trivial parallelism
• $10^6$ independent random paths can be grouped into 100 jobs, each with $10^4$ paths
• Each job is independent and has very few inputs and outputs
• Given lots of machines, want “something” to decide where the jobs should be run to give the fastest turnaround time.

Only tricky bit for user is making sure each job uses independent random number generation – see practical 1.
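The job structure can be sketched as follows. Everything here is illustrative: the payoff is a placeholder, the path counts are scaled down, the jobs run one after another rather than on separate machines, and giving each job a distinct seed is a simplification, since distinct seeds alone do not strictly guarantee independent streams (practical 1 covers the proper treatment).

```python
import random

def run_job(job_id, n_paths, base_seed=12345):
    """One independent job: returns (sum of payoffs, number of paths)."""
    # Assumption: a distinct seed per job stands in for independent streams
    rng = random.Random(base_seed * 1000 + job_id)
    total = 0.0
    for _ in range(n_paths):
        # Placeholder payoff: 1 if a standard normal draw lies in (-1, 1), else 0
        total += 1.0 if abs(rng.gauss(0.0, 1.0)) < 1.0 else 0.0
    return total, n_paths

def combine(results):
    """Average over all jobs; each job only reports two numbers."""
    total = sum(t for t, _ in results)
    count = sum(n for _, n in results)
    return total / count

# 100 jobs of 1000 paths each (scaled down from 100 jobs of 10^4 paths)
results = [run_job(job_id, 1000) for job_id in range(100)]
estimate = combine(results)
```

Each job takes a job number and a path count as input and returns two numbers, which is what makes it easy to farm out.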


Distributed Resource Management

Using loosely-coupled PC/workstation farms for Monte Carlo calculations needs distributed resource management:
• which machines are available?
• how heavily are they being used?
• do they have the necessary software/licenses?
• what rights do I have to use them?


Distributed Resource Management

Grid Engine (Sun), LSF (Platform Computing) and Condor (Univ. of Wisconsin) deal with this through users submitting tasks to a unified queue which dispatches jobs based on:
• matching job requirements to machine properties
• taking account of current interactive/batch usage of machine
• taking account of different priorities of different user groups
• doing charging if necessary

Maybe sounds simple – but very important
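The first two dispatch criteria can be illustrated with a toy matcher. This is purely hypothetical: the dictionary fields and the function name are invented for the example, and it says nothing about how Grid Engine, LSF or Condor are implemented. User-group priorities and charging are left out.

```python
def pick_machine(job, machines):
    """Return the name of the least-loaded machine that meets the job's needs, or None."""
    candidates = [
        m for m in machines
        if m["memory_gb"] >= job["memory_gb"]      # match job requirements
        and job["licence"] in m["licences"]        # necessary software/licences
        and not m["interactive_user"]              # respect interactive usage
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m["load"])["name"]

# Hypothetical machine descriptions
machines = [
    {"name": "a", "memory_gb": 2, "licences": {"nag"}, "load": 0.9, "interactive_user": False},
    {"name": "b", "memory_gb": 4, "licences": {"nag"}, "load": 0.1, "interactive_user": False},
    {"name": "c", "memory_gb": 4, "licences": set(), "load": 0.0, "interactive_user": False},
    {"name": "d", "memory_gb": 8, "licences": {"nag"}, "load": 0.0, "interactive_user": True},
]
chosen = pick_machine({"memory_gb": 2, "licence": "nag"}, machines)
```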


Distributed Resource Management

Some DRM software can also work in a hierarchy:
• each department has Grid Engine to manage its own cluster
• the overall organisation has Grid Engine queues which can feed into the departmental queues
• users normally use departmental resources, but can go to higher level queues for extra resources

This is a more robust solution than having a single control point for the entire organisation.


Web Services

Distributed resource management is one aspect of Grid Computing.

Another is the use of Web Services to link separate applications running on different machines, possibly under different operating systems, even within different organisations.

Because of the requirements of eCommerce, there is a huge development effort with well established standards supported by all of the major companies (Microsoft, IBM, SUN)


Web Services

At its simplest, a web service follows an RPC (Remote Procedure Call) approach:
• a client process sends a request to a server
• the server process returns a response

• the client and server processes usually “belong” to different users (different userid)
• the server process is usually a persistent service, running indefinitely waiting for client requests


Web Services

Within the basic client/server arrangement, there are a number of subtle distinctions.

A standard web server can offer web services through CGI executables: it listens to port 80, and if a request asks for a particular CGI to be executed to generate a response then it does it.

Alternatively, can have a standalone web service which listens to a particular port and deals with requests.


Note on Ports

When an application “talks” to an application on another machine, it does so through numbered “ports”.

There is at most one application listening to each port, with reserved port numbers for particular services (/etc/services on a Unix system)
• 21 ftp
• 22 ssh
• 80 http

Firewalls restrict which ports are left open, and hence control external communication.


Web Services

What about handling multiple requests from different clients?

• could queue them up and process them one at a time
• could spin off a separate thread (or fork a separate process) to deal with each one
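The two options can be contrasted in a small in-process sketch. No networking is involved and the `handle` function is a stand-in for real request processing; all names are assumptions made for the example.

```python
import threading

def handle(request):
    """Stand-in for the real work of answering one request."""
    return request.upper()

def serve_sequentially(requests):
    """Option 1: queue the requests and process them one at a time."""
    return [handle(r) for r in requests]

def serve_with_threads(requests):
    """Option 2: spin off a separate thread for each request."""
    responses = [None] * len(requests)

    def worker(index, request):
        responses[index] = handle(request)

    threads = [threading.Thread(target=worker, args=(i, r))
               for i, r in enumerate(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return responses

incoming = ["price", "quote", "status"]
sequential = serve_sequentially(incoming)
threaded = serve_with_threads(incoming)
```

Both give the same responses; the difference is that a slow request holds up the whole queue in the first version but not in the second.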


Web Services

What about handling multiple requests from the same client?

If the history of the interaction needs to be maintained (persistence), this can be done by opening a communication channel and maintaining it (keepalive) until the client closes it, or there’s a timeout.

(In this case, should use a separate thread or process for each client.)


Web Services

Standards are crucial for interoperability of web services.

SOAP (Simple Object Access Protocol) defines the RPC interaction:
• XML for the main content (request and response)
• optional MIME attachment (just like email)
• http/https to send the SOAP messages

There is no restriction on the choice of language for implementing the server or client application.


Web Services

Language-specific support for creating web services includes:
• Java: IBM Websphere, Sun ONE, Borland JBuilder, lots of others
• C#: Microsoft .NET
• Python: ZSI (Zolera Soap Infrastructure)
• C/C++: gSOAP


gSOAP

gSOAP is a package for generating web service servers and clients in C/C++:
• a pre-processor generates additional C/C++ files given a header file specification of the RPC routines
• there are also some gSOAP files which contain the code to do all the conversion of data to/from XML
• the distribution includes 150 pages of documentation and lots of example applications


gSOAP

The example is a web service calculator which takes two numbers and adds or subtracts them.

For this application, the user writes 3 files:
• calc.h: a header file defining the RPC routines
• calcserver.c: the server code
• calcclient.c: the client code


calc.h

//gsoap ns service name: calc
//gsoap ns schema namespace: urn:calc

int ns__add(double a, double b, double *result);

int ns__sub(double a, double b, double *result);

The ns__ prefix and the gSOAP declarations avoid ambiguities if an application needs to use two services with the same RPC names.


calcserver.c

#include <math.h>
#include "soapH.h"
#include "calc.nsmap"

int main(int argc, char **argv)
{ int m, s; /* master and slave sockets */
  struct soap soap;
  soap_init(&soap);

  m = soap_bind(&soap, NULL, 80, 100);

  for ( ; ; )
  { s = soap_accept(&soap);
    soap_serve(&soap);
    soap_end(&soap);
  }

  return 0;
}


calcserver.c

int ns__add(struct soap *soap, double a, double b, double *result)
{ *result = a + b;
  return SOAP_OK;
}

int ns__sub(struct soap *soap, double a, double b, double *result)
{ *result = a - b;
  return SOAP_OK;
}


calcclient.c

#include "soapH.h"
#include "calc.nsmap"

const char server[] = "http://booth10.ecs.ox.ac.uk:80";

int main(int argc, char **argv)
{ struct soap soap;
  double a, b, result;

  soap_init(&soap);

  a = strtod(argv[2], NULL);
  b = strtod(argv[3], NULL);


calcclient.c

  switch (*argv[1])
  { case 'a':
      soap_call_ns__add(&soap, server, "", a, b, &result);
      break;
    case 's':
      soap_call_ns__sub(&soap, server, "", a, b, &result);
      break;
  }

  if (soap.error)
    soap_print_fault(&soap, stderr);
  else
    printf("result = %g\n", result);

  return 0;
}


gSOAP

Additional features:
• multiple results handled by a result structure
• dynamic arrays handled by a structure with size and pointer
• keepalive for services needing persistence
• https and SSL for security
• zlib and gzip compression
• MIME attachments


Final comments

Web services are likely to become very useful in linking Windows PCs on the desktop to Unix servers in the back-office:
• web clients on Windows PCs written in Java/C#/Visual Basic using Microsoft’s
• web services on Unix servers written in C/C++ using gSOAP
• much more dynamic/responsive than using software like Grid Engine.


Numerically Intensive Computing in Finance

Lecture 4: Processor and Memory Technology

Mike Giles


Processor Technology

Why discuss processor technology?
• interesting to learn how Moore’s Law is being upheld
• interesting, because there’s lots of parallelism in the CPU hidden from the programmer/user
• important, because a better understanding enables an expert programmer to get better performance
• important, because it affects whether higher-level parallelism involves 10’s of processors, or 1000’s


Ideal Von Neumann Processor

• each cycle, CPU takes data from registers, does an operation, and puts the result back
• load/store operations (memory ←→ registers) also take one cycle
• CPU can do different operations each cycle
• output of one operation can be input to next

[Diagram: timeline showing op1, op2 and op3 executed one after another]

CPU’s haven’t been this simple for a long time!


Pipelining

Pipelining is a technique in which multiple instructions are overlapped in execution.

[Diagram: timeline showing three instructions, each passing through stages 1 to 5, overlapped in time]

• 1 result per cycle after pipeline fills up
• improved utilisation of hardware
• major complication – an output can only be used as input for an operation starting later
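The "1 result per cycle after pipeline fills up" point can be made concrete with a small cycle count. This is a sketch for an idealised pipeline with no stalls and one new instruction started every cycle; the function names are hypothetical.

```c
/* Cycles to complete n instructions when each one must finish
   all of its stages before the next one starts. */
long unpipelined_cycles(long n, long stages)
{
    return n * stages;
}

/* Cycles to complete n instructions in an ideal pipeline:
   'stages' cycles for the first result, then one more per instruction. */
long pipelined_cycles(long n, long stages)
{
    return (n > 0) ? stages + (n - 1) : 0;
}
```

For three instructions and five stages this gives 7 cycles pipelined against 15 unpipelined; for long instruction streams the ratio approaches the number of stages.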


Superscalar Processors

Most processors have multiple pipelines for different tasks, and can start a number of different operations each cycle.

Example: Sun Microsystems UltraSPARC III
• 2 integer pipes
• 1 floating-point (FP) multiply pipe
• 1 FP addition/subtraction pipe
• in principle, capable of producing 2 integer and 2 FP results per cycle
• FP division uses both FP pipes and is very slow (29 cycles)


Technical Challenges

• compiler to extract best performance, reordering instructions if necessary
• controller to handle multiple pipelines (sometimes with out-of-order execution)
• memory hierarchy to deliver data to registers fast enough to feed the processor
• tricks to avoid delays waiting for data (pipeline stall)
• tricks to avoid delays due to conditional branching (loops, logical tests)

These all limit the number of pipelines that can be used effectively.


Programmer Assistance

The programmer can help the compiler by providing more scope for re-ordering operations – common trick is loop unrolling with added benefit of less branching.

for (i=0; i<1000; i++) {
  x += sqdt*rand[i];
}

Problem: each multiply must complete before addition, and looping probably forces addition to complete before next multiply.


Programmer Assistance

for (i=0; i<1000; i+=4) {
  x += sqdt*rand[i];
  x += sqdt*rand[i+1];
  x += sqdt*rand[i+2];
  x += sqdt*rand[i+3];
}

Each addition must complete before next addition, but multiplies are now almost fully overlapped.

Note: need a “remainder” loop when loop range is not perfectly divisible by unrolling factor.
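A minimal sketch of the remainder loop for a general loop length n, assuming n is non-negative. The function name path_increment is hypothetical, and the array is called rnd here rather than rand to avoid clashing with the C library function of that name.

```c
/* Accumulate sqdt*rnd[i] for i = 0..n-1, unrolled by 4. */
double path_increment(const double *rnd, int n, double sqdt)
{
    double x = 0.0;
    int i;
    int n4 = n - (n % 4);   /* largest multiple of 4 not exceeding n */

    /* main unrolled loop */
    for (i = 0; i < n4; i += 4) {
        x += sqdt*rnd[i];
        x += sqdt*rnd[i+1];
        x += sqdt*rnd[i+2];
        x += sqdt*rnd[i+3];
    }

    /* remainder loop: at most 3 leftover iterations */
    for (i = n4; i < n; i++) {
        x += sqdt*rnd[i];
    }

    return x;
}
```

When n is a multiple of 4, as in the example with 1000 iterations, the remainder loop does no work.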


Programmer Assistance

To get more speedup, do 2 Monte-Carlo paths at same time:

for (i=0; i<1000; i+=2) {
  x1 += sqdt*rand[i];
  x2 += sqdt*rand[i+1000];
  x1 += sqdt*rand[i+1];
  x2 += sqdt*rand[i+1001];
}

Now enough scope for overlap to get almost full utilisation of a processor with a single 3-stage pipeline.


Compiler Optimisation

Need even more unrolling for multiple pipelines. Fortunately, the compiler will perform innermost loop unrolling, but sometimes needs to be told to do so – compiler directive.

Sun’s cc compiler also has different optimisation levels, giving a trade-off between compilation speed and code speed.

-fast does a variety of optimisations including multiplying by a reciprocal instead of dividing repeatedly by the same number, and optimisation for native hardware.


Current Trends

• clock cycle no longer reducing, due to problems with power consumption (up to 130W per chip)
• gates/chip still doubling every 24 months
  ⇒ more on-chip memory and MMU (memory management units)
  ⇒ specialised hardware (e.g. multimedia, encryption)
  ⇒ multi-core (multiple CPU’s on one chip)
• peak performance of chip still doubling every 12-18 months


Intel chips

“Conroe” desktop chip:
• dual-core chip running at 2.66GHz
• 14-stage pipelines capable of 4 operations per cycle

Others:
• dual-core Core Duo already out in laptops
• dual-core “Woodcrest” for servers soon
• 90% of all sales dual-core by end of 2006
• quad-core chips by early 2007


AMD chips

Athlon X2 desktop chip:
• dual-core chip running at up to 2.4GHz, using 90-110W
• quad-core in early 2007

Dual-core Opterons:
• dual-core chip running at up to 2.6GHz, using 55-95W
• up to 8-way (16-core) SMP systems
• quad-core due in early 2007


IBM chips

Power 5:
• 2 cores on a single chip
• each core runs two threads simultaneously, overlapping on different pipelines

IBM/Sony/Toshiba Cell chip:
• originally designed for new Sony Playstation
• has one Power 4 core plus 8 graphics cores
• now to be used as multi-core chip in new IBM blade system


SUN chips

Sparc VI:
• 2 cores, running at 2.4GHz using 120W
• Fujitsu developing quad-core variant for 2008?

UltraSparc T1 (“Niagara”) chip:
• 8 cores, running at up to 1.2GHz
• extra bits for encryption and data compression
• limited floating point performance
• intended for file servers / web servers


ClearSpeed

• startup company
• PCI-Express board with 2 compute chips
• each has 96 cores, running at 133MHz(?), using 10W
• ideally suited for Monte Carlo applications
• best performance/watt in marketplace?


Memory Hierarchy

Why discuss memory?
• more and more, it is the bottleneck in modern computer systems
• in some cases, it is possible to get much greater performance through minor changes to a code
• understanding how caches work is vital to understanding the operation and programming of shared-memory parallel computers


Memory Hierarchy

registers
   ↕ 2 cycle access
L1 Cache: 64KB, 2GHz SRAM
   ↕ 12 cycle access, 20GB/s
L2 Cache: 1 – 4 MB, 1GHz SRAM
   ↕ 100+ cycle access, 5GB/s
Main memory: 1 – 8 GB, 400MHz DDR2

Moving up towards the registers: faster, more expensive, smaller.


Memory Hierarchy

Execution speed relies on exploiting data locality:
• temporal locality: a data item just accessed is likely to be used again in the near future, so keep it in the cache
• spatial locality: neighbouring data is also likely to be used soon, so load them into the cache at the same time using a ‘wide’ bus (like a multi-lane motorway)


Caches

The cache line is the basic unit of data transfer; typical size is 128 bytes ≡ 16 × 8-byte items.

In a single cache system, when the CPU loads data into a register:
• looks for line in cache
• if there (hit), get data
• if not (miss), get entire line from main memory, displacing an existing line in cache (usually least recently used)

When the CPU stores data from a register:
• same procedure
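The hit/miss procedure can be sketched as a toy simulation in C. This is an illustration under stated assumptions, not a model of any real processor: the cache is fully associative with only 4 lines (NUM_LINES is a hypothetical value chosen to keep the example small), replacement is strictly least recently used, and the names cache_t, cache_init and cache_access are invented here.

```c
#define LINE_BYTES 128   /* line size quoted above */
#define NUM_LINES  4     /* hypothetical, tiny cache for illustration */

typedef struct {
    long tag[NUM_LINES];        /* memory line held in each slot, -1 if empty */
    long last_used[NUM_LINES];  /* time of most recent access, for LRU */
    long clock;                 /* counts accesses */
    long hits, misses;
} cache_t;

void cache_init(cache_t *c)
{
    for (int k = 0; k < NUM_LINES; k++) {
        c->tag[k] = -1;
        c->last_used[k] = 0;
    }
    c->clock = 0;
    c->hits = 0;
    c->misses = 0;
}

/* Access one byte address: a hit if its line is already in the cache,
   otherwise the whole line is loaded into the least recently used slot. */
void cache_access(cache_t *c, long addr)
{
    long line = addr / LINE_BYTES;
    int lru = 0;

    c->clock++;
    for (int k = 0; k < NUM_LINES; k++) {
        if (c->tag[k] == line) {            /* hit */
            c->hits++;
            c->last_used[k] = c->clock;
            return;
        }
        if (c->last_used[k] < c->last_used[lru]) lru = k;
    }

    c->misses++;                            /* miss: displace the LRU line */
    c->tag[lru] = line;
    c->last_used[lru] = c->clock;
}
```

Reading 32 consecutive 8-byte items touches only 2 lines, so there are 2 misses and 30 hits; reading 32 items spaced 128 bytes apart misses every time.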


Caches

What happens when a cache line is modified?

Write-through cache:
• modified line is immediately written to the main memory
• main memory stays up-to-date
• generates lots of memory traffic

Write-back cache:
• modified line is only written to main memory when it gets displaced from the cache
• much less memory traffic
• main memory may not have latest values – potential problem for parallel computing


Caches

Multi-level caches

All major processors use at least two levels of cache:
• primary cache is small (e.g. 64KB), on-chip and write-through
• secondary cache is larger (e.g. 2MB), usually on-chip and write-back
• if there is a third level cache, then it is even larger, off-chip and write-back


Importance of Locality

Typical workstation:
2 Gflops CPU
5 GB/s memory ←→ L2 cache bandwidth
128 bytes/line

5GB/s ≡ 40M lines/s ≡ 600M reals/s

At worst, each flop requires 2 inputs and has 1 output, forcing loading of 3 lines =⇒ 13 Mflops

If all 16 variables/line are used, then this increases to 200 Mflops.

To get up to 2 Gflops needs temporal locality, re-using data already in the cache.
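The arithmetic behind these figures can be written out as a short C function. It is a sketch using the same simple model as above (3 memory operands per flop, performance limited only by memory bandwidth); the name mflops_bound and its parameters are hypothetical.

```c
/* Bandwidth-limited flop rate in Mflops, assuming each flop needs
   3 reals moved to or from memory (2 inputs, 1 output).
   reals_used_per_line is how many of the 8-byte items in each
   transferred line are actually used (1 at worst, 16 at best). */
double mflops_bound(double bytes_per_s, double line_bytes,
                    double reals_used_per_line)
{
    double lines_per_s = bytes_per_s / line_bytes;
    double reals_per_s = lines_per_s * reals_used_per_line;
    return reals_per_s / 3.0 / 1.0e6;
}
```

With 5 GB/s and 128-byte lines this gives about 39M lines/s, hence roughly 13 Mflops when one item per line is used and roughly 208 Mflops when all 16 are used; the figures of 40M, 13 and 200 quoted above are these values rounded.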


Loop Ordering

A 2D finite difference code typically has loops of the form

for (i=0; i<1000; i++) {
  for (j=0; j<1000; j++) {
    u[id(i,j)] = ...
  }
}

where id(i,j) maps the indices (i,j) to a unique element of u.

Question: would it be more efficient to re-order the loops?


Loop Ordering

The answer depends on the function id(i,j).

If we use

id(i,j) = i + j*imax

then id(3,7) is next to id(4,7), but not id(3,8).

Multiple dimensions are handled similarly, with the lower dimensions varying most rapidly.


Loop Ordering

Consequently, in the FD example, it is best to have the i loop innermost, to access the elements of u[id(i,j)] sequentially.

If the j loop is innermost, then the cache line with element u[id(i,j)] may have been displaced by the time that u[id(i+1,j)] is to be computed.

This can have very dramatic consequences!
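A minimal sketch of the preferred ordering, using the mapping id(i,j) = i + j*imax. Here imax is passed as an extra argument so that the example is self-contained, and fill_grid, together with the values it stores, is purely hypothetical.

```c
/* Map 2D indices to a 1D offset, with i varying most rapidly. */
int id(int i, int j, int imax)
{
    return i + j*imax;
}

/* Visit every grid point with the i loop innermost, so that
   successive iterations touch neighbouring elements of u. */
void fill_grid(double *u, int imax, int jmax)
{
    for (int j = 0; j < jmax; j++) {
        for (int i = 0; i < imax; i++) {
            u[id(i, j, imax)] = i + 10.0*j;   /* placeholder computation */
        }
    }
}
```

With imax = 1000, id(4,7) - id(3,7) is 1 whereas id(3,8) - id(3,7) is 1000, i.e. 8000 bytes for 8-byte reals, which is many cache lines apart.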


Current Trends

Memory hierarchy seems likely to remain – very high speed memory is too expensive.

Importance of cache lines and data locality is likely to remain – transferring multiple bits of data in parallel is only way to get high throughput.

Best we can hope for is that compilers will handle code optimisation, but remember, high-end numerical computing is not a big driver.


Numerically Intensive Computing in Finance

Lecture 5: Shared-memory Multiprocessors

Mike Giles


Shared-memory Multiprocessors

CPU CPU CPU CPU CPU

cache cache cache cache cache

Main Memory

Conceptual arrangement:
• multiple CPU’s, each with own cache
• all linked to a unified main memory by a very high bandwidth interconnect


Shared-memory Multiprocessors

For historical reasons, they are also referred to as SMP systems – Symmetric Multi-Processors.

“Symmetric” refers to the fact that all processors are equal.

An asymmetric system is one in which there is a master processor, and a number of slaves – like the ClearSpeed card.


Interconnect

One challenge in building shared-memory systems is achieving sufficient bandwidth between all of the processors and multiple “memory ports” (points of entry into the main memory):
• traditional PC bus is not scalable – fixed bandwidth shared between more and more processors
• scalable performance is achieved using commodity crossbar (full interconnect) chips originally developed for network switches


Cache Coherency

The other challenge in shared-memory multiprocessors is maintaining coherency with write-back caches.

CPU1 CPU2 CPU3 CPU4 CPU5

cache cache cache cache cache

Main Memory

Suppose CPU2 loads and modifies variable X, and then CPU4 needs to load X – what happens?


Cache Coherency

The solution is a “snoopy bus” linking the caches; CPU2 spots the request from CPU4 and supplies the newer value for X.

CPU1 CPU2 CPU3 CPU4 CPU5

cache cache cache cache cache

Main Memory


Cache Coherency

In the MESI cache coherency protocol, a cache line can be in one of 4 states:
• Modified: sole owner of modified line
• Exclusive: sole owner, not modified
• Shared: shared ownership, not modified
• Invalid: incorrect data

[State diagram: transitions between the states M, E, S and I, with arrows labelled “write”, “write by other” and “read by other”]
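The state changes can be sketched as a small transition function in C. This follows the usual textbook description of MESI rather than any particular processor, and it is an assumption that this matches the original diagram in every detail; the type and function names are hypothetical, and the bus traffic (write-backs, invalidation messages) is not modelled.

```c
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

typedef enum {
    LOCAL_READ,      /* this CPU reads the line */
    LOCAL_WRITE,     /* this CPU writes the line */
    READ_BY_OTHER,   /* another CPU reads the line */
    WRITE_BY_OTHER   /* another CPU writes the line */
} mesi_event;

/* New state of one cache's copy of a line after an event.
   others_have_copy is only consulted when loading an INVALID line. */
mesi_state mesi_next(mesi_state s, mesi_event e, int others_have_copy)
{
    switch (e) {
    case LOCAL_WRITE:
        return MODIFIED;                     /* become sole owner of modified line */
    case WRITE_BY_OTHER:
        return INVALID;                      /* our copy is now out of date */
    case READ_BY_OTHER:
        return (s == INVALID) ? INVALID : SHARED;
    case LOCAL_READ:
        if (s != INVALID) return s;          /* hit: no change */
        return others_have_copy ? SHARED : EXCLUSIVE;
    }
    return s;
}
```

In the example with CPU2 and CPU4, CPU2's copy of X goes to Modified when it writes, then to Shared when CPU4 reads it, and CPU4's copy starts life as Shared.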


Cache Coherency

Note: don’t want different processors “fighting” for ownership of the same cache line – can give very bad performance.

As with the main system bus, the snoopy bus has problems scaling to large numbers of processors.

There have been alternative methods used in large shared-memory NUMA (Non-Uniform Memory Access) machines, but they were expensive.


Shared-memory Computing

A key distinction: processors and processes

A processor is a piece of hardware which can execute instructions.

A process is a program consisting of a set of instructions.

At any instant, there is precisely one process executing on each processor, but the same process may be executing on more than one processor.


Shared-memory Computing

In a shared-memory system, a user application is a single Unix process, with a number of virtual memory pages holding the user’s data, and a number of “threads” working on it.

Very like having a project being carried out by a pool of “workers”:
• some tasks can only be done by a single worker, while the rest wait around
• other tasks can be carried out in parallel by many workers
• key is deciding what can be done in parallel, avoiding conflicts between workers


Shared-memory Computing

The operating system is itself a multithreaded application, with perhaps one thread handling disk i/o, one network i/o, one task scheduling, etc.

Task scheduling for multiple users is particularly important:
• system maintains a list of active processes
• each process gets given its turn for execution for a few milliseconds, and then is put to the back of the queue to wait for its next turn
• multithreaded processes are usually executed on a corresponding number of processors


Static and Stack Memory Management

To understand some aspects of shared-memory programming, need to know how compilers handle data within programs.

Static allocation means that the compiler decides at compilation time where the data will sit within the user’s virtual memory.

Stack allocation means it’s handled on-the-fly during execution, as needed.


Static and Stack Memory Management

In C, static allocation is specified through the use of the static keyword

#include <stdio.h>

void counter(int n)
{
  static int kount;   /* persists between calls */

  if (n==0)
    kount = 0;
  else if (n==1)
    kount = kount + 1;
  else
    printf("%d", kount);
}


Static and Stack Memory Management

In C, stack allocation is the default, enabling routines to be used recursively

int factorial(int n)
{
  int fact, nm;   /* fresh copies on the stack for each call */

  if (n<=1)
    fact = 1;
  else {
    nm = n-1;
    fact = n*factorial(nm);
  }

  return fact;
}


Static and Stack Memory Management

These two examples show the key aspects of each approach.

Static allocation is persistent, continuing after a routine finishes – may be more efficient because no run-time allocation is needed.

Stack allocation is transient, with fresh allocation each time a routine starts, disappearing when it finishes.


Static and Stack Memory Management

So far, have considered only sequential processes – what about multi-threading?

Simple example – suppose two threads want to print something out at the same time.

The libraries that handle printing have internal data. To avoid conflict, each call needs stack allocation giving independent private data – “thread-safe” libraries (often not the default because they’re less efficient).
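The difference can be illustrated with two versions of a small formatting routine. This is a sketch only; label_static and label_private are hypothetical names, not routines from any real printing library.

```c
#include <stdio.h>
#include <stddef.h>

/* Not thread-safe: every caller shares one statically allocated buffer,
   so two threads calling this at the same time can overwrite each other. */
const char *label_static(int n)
{
    static char buf[32];
    snprintf(buf, sizeof buf, "path %d", n);
    return buf;
}

/* Thread-safe: the caller supplies the buffer, typically a local
   variable on its own stack, so each call has independent private data. */
const char *label_private(int n, char *buf, size_t len)
{
    snprintf(buf, len, "path %d", n);
    return buf;
}
```

Even in a sequential program the problem is visible: a second call to label_static changes the string returned by the first call, because both results point at the same buffer.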


Static and Stack Memory Management

More generally, in multithreaded applications there is the important concept of shared and private data:

Private data belongs to a particular thread
• it is allocated on its own private stack
• it can be seen and changed only by that thread

Shared data is visible to all threads
• it is either statically allocated, or allocated on a master stack
• any of the threads can change its value


Shared-memory Programming

In general terms, there are two levels of shared-memory programming.

At a low-level, one can start several threads and then explicitly tell each what to do – in this case, the code will have instructions such as “if this is thread 3 then do the following ... ”.


Shared-memory Programming

This is very flexible, but involves tedious programming.

In general I would recommend it only to experienced programmers wanting to do an application in which different threads are doing entirely different things – e.g. one thread is managing network i/o, one is running an experiment, one is handling terminal i/o.

For C programs, POSIX pthreads is the standard, but I have very little experience of using it.
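For orientation, a minimal pthreads sketch of the low-level style is given here, using only pthread_create and pthread_join. The work done by each thread is a placeholder, and worker, run_workers, partial and NTHREADS are hypothetical names.

```c
#include <pthread.h>

#define NTHREADS 4

/* Shared data: statically allocated, visible to all threads.
   Each thread writes only its own element, so there is no conflict. */
static double partial[NTHREADS];

static void *worker(void *arg)
{
    int id = *(int *)arg;          /* private data, on this thread's stack */

    if (id == 0)
        partial[id] = 100.0;       /* "if this is thread 0 then do the following ..." */
    else
        partial[id] = (double)id;  /* the other threads do something different */

    return NULL;
}

/* Start the threads, wait for them all, and combine their results.
   Returns -1.0 if a thread could not be started. */
double run_workers(void)
{
    pthread_t thread[NTHREADS];
    int ids[NTHREADS];
    int started = 0;
    double total = 0.0;

    for (int k = 0; k < NTHREADS; k++) {
        ids[k] = k;
        if (pthread_create(&thread[k], NULL, worker, &ids[k]) != 0) break;
        started++;
    }
    for (int k = 0; k < started; k++)
        pthread_join(thread[k], NULL);
    if (started < NTHREADS) return -1.0;

    for (int k = 0; k < NTHREADS; k++)
        total += partial[k];
    return total;
}
```

On most systems this needs to be compiled and linked with the compiler's threading option (for example -pthread with gcc).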


Shared-memory Programming

The higher-level approach is to tell the compiler what can be done in parallel, and let it automatically generate the code to handle the multiple threads.

In this case, typically there is a master thread which is always active, and a bunch of other threads which spring into action for parallel loops, and hibernate in between.

OpenMP is the standard for this higher-level approach, superseding the many vendor-specific versions that used to exist.


Numerically Intensive Computing in Finance

Lecture 6: OpenMP Programming

Mike Giles


Overview

The code is executed sequentially by a master thread except for regions (e.g. loops) which are explicitly declared to be done in parallel.

The extra threads hibernate during the sequential sections, are activated during the parallel sections, then get suspended again. This is all handled by the compiler and the run-time execution environment.

The programmer is responsible for saying what is to be done in parallel. If the programmer makes a mistake, execution may be slow and/or incorrect.


parallel for

The parallel for directive says the next loop is to be executed in parallel:

    #pragma omp parallel for \
            private(i,du) shared(u,v)
    for (i=0; i<imax; i++) {
      du = v[i]*v[i];
      u[i] += du;
    }

Note the specification of private and shared variables. The default is that loop indices are private, and everything else is shared.


parallel for

Private variables are defined to exist transiently within the loop:
• uninitialised on entry to the loop
• undefined on exit from the loop

If there is a pre-existing global variable with the same name, it is undefined what happens to this, so avoid this!

Conceptually, the du variable in the previous example becomes du_n where n is the thread number, making these variables different from any global variable du.
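One way to sidestep the name clash is sketched below. This is an illustration rather than something from the slides, and the function name add_squares is hypothetical. Declaring the temporary inside the loop body makes it private automatically, so no private() clause is needed and no outer du exists to collide with.

```c
#include <assert.h>

/* Hypothetical helper: u[i] += v[i]*v[i], written so that the
   temporary is private without a private() clause. */
void add_squares(double *u, const double *v, int imax)
{
    assert(imax >= 0);

    #pragma omp parallel for shared(u, v)
    for (int i = 0; i < imax; i++) {
        /* du is declared inside the loop body, so each thread
           (indeed each iteration) has its own copy */
        double du = v[i] * v[i];
        u[i] += du;
    }
}
```

If the code is compiled without OpenMP support the pragma is ignored and the loop simply runs sequentially, giving the same result.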


parallel for

There is control over how the loop iterations are divided between the threads, through an optional schedule argument.

schedule(static) splits the loop range into (almost) equal chunks, one for each thread. This is usually the default, and the best for simple loops with equal work per iteration.

schedule(static,n) uses chunks of size n, assigned to threads in simple rotation.
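The "simple rotation" can be made concrete with a small sketch. The helper below is hypothetical (it is not an OpenMP routine); it assumes chunk 0 goes to thread 0, chunk 1 to thread 1, and so on, wrapping round after the last thread.

```c
#include <assert.h>

/* Hypothetical helper: thread that executes iteration i under
   schedule(static,n) with nthreads threads, assuming chunks of n
   consecutive iterations are dealt out in simple rotation. */
int static_chunk_owner(int i, int n, int nthreads)
{
    assert(i >= 0 && n > 0 && nthreads > 0);
    int chunk = i / n;          /* index of the chunk holding i   */
    return chunk % nthreads;    /* chunks rotate over the threads */
}
```

For example, with n = 2 and 4 threads, iterations 0 and 1 go to thread 0, iterations 2 and 3 to thread 1, and iteration 8 comes back round to thread 0.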


parallel for

schedule(dynamic,n) uses chunks of size n, assigned to threads when they complete the previous chunk. This is the best choice when the work per loop iteration varies considerably.

    #pragma omp parallel for \
            private(i) shared(u) schedule(dynamic,n)
    for (i=0; i<imax; i++) {
      if (u[i] < 0) {
        u[i] = small_work(u[i]);
      }
      if (u[i] >= 0) {
        u[i] = big_work(u[i]);
      }
    }


parallel for

With nested loops, remember it is the loop immediately after the directive that is parallelised.

    #pragma omp parallel for \
            private(i,j) shared(u)
    for (j=0; j<jmax; j++) {
      for (i=0; i<imax; i++) {
        u[id(i,j)] = ...
      }
    }

Here the j loop is parallelised.


parallel for

    for (j=0; j<jmax; j++) {
      #pragma omp parallel for \
              private(i) shared(j,u)
      for (i=0; i<imax; i++) {
        u[id(i,j)] = ...
      }
    }

Here the i loop is parallelised.

In general, parallelising the outer loop is best (less starting and suspending of threads) except when the outer loop is over a small range (poor load balancing, e.g. when there are 4 threads and jmax = 5).


parallel for

What can go wrong? What about

    sum = 0;

    #pragma omp parallel for \
            private(i,ds) shared(u,sum)
    for (i=0; i<imax; i++) {
      ds = u[i]*u[i];
      sum += ds;
    }

This is likely to give incorrect results because of the accumulation into sum.


parallel for

What’s the problem? Consider two threads.

    time    Thread 1      Thread 2
      |     load sum
      |                   load sum
      |     add ds
      |                   add ds
      |     store sum
      v                   store sum

The overlapped additions to sum mean that the first thread’s contribution gets lost.
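The lost update can be replayed deterministically in a single thread. The sketch below is purely illustrative (the function name and the local "register" variables are hypothetical); it performs the six steps in the overlapped order, with both loads happening before either store.

```c
#include <assert.h>

/* Replays the interleaving by hand in a single thread:
   both "threads" load sum before either stores it. */
double lost_update_demo(double sum, double ds1, double ds2)
{
    double reg1 = sum;      /* thread 1: load sum  */
    double reg2 = sum;      /* thread 2: load sum  */
    reg1 += ds1;            /* thread 1: add ds    */
    reg2 += ds2;            /* thread 2: add ds    */
    sum = reg1;             /* thread 1: store sum */
    sum = reg2;             /* thread 2: store sum, overwriting thread 1 */
    return sum;
}
```

Starting from sum = 0 with contributions 1 and 2, the function returns 2 rather than 3: the value stored by thread 1 is overwritten.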


parallel for

First solution uses the critical directive to say only one thread at a time can work with sum:

    sum = 0;

    #pragma omp parallel for \
            private(i,ds) shared(u,sum)
    for (i=0; i<imax; i++) {
      ds = u[i]*u[i];
      #pragma omp critical
      {sum += ds;}
    }

This will give valid results.


parallel for

Second solution uses an atomic update, making the load, add, store sequence act as a single instruction. This is only possible for single operations.

    sum = 0;

    #pragma omp parallel for \
            private(i,ds) shared(u,sum)
    for (i=0; i<imax; i++) {
      ds = u[i]*u[i];
      #pragma omp atomic
      sum += ds;
    }

This will give valid results.


parallel for

Although both of these solutions will give valid results, the performance will be appalling because the different threads will fight over access to the cache line holding the shared sum variable.

Instead, use the special reduction clause:

    sum = 0;

    #pragma omp parallel for \
            private(i,ds) shared(u) reduction(+:sum)
    for (i=0; i<imax; i++) {
      ds = u[i]*u[i];
      sum += ds;
    }


parallel for

How does the compiler get good performance?

It creates temporary private variables sum_local to accumulate the partial sums for each thread, then at the end combines them with the shared variable sum.

Works with other reduction operators such as min, max, -, *.
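Written out by hand, the same idea might look roughly like the sketch below. This is only an approximation of what a compiler generates, and the function name sum_squares is hypothetical. Each thread accumulates into its own sum_local, and the shared sum is touched just once per thread.

```c
#include <assert.h>

/* Hand-written equivalent of reduction(+:sum): a private partial
   sum per thread, combined into the shared total at the end. */
double sum_squares(const double *u, int imax)
{
    assert(imax >= 0);
    double sum = 0.0;

    #pragma omp parallel shared(u, sum)
    {
        double sum_local = 0.0;          /* private to each thread */

        #pragma omp for
        for (int i = 0; i < imax; i++) {
            sum_local += u[i] * u[i];
        }

        /* one protected update per thread, not one per iteration */
        #pragma omp critical
        sum += sum_local;
    }
    return sum;
}
```

The critical section is still there, but it is entered once per thread instead of once per loop iteration, so contention for the cache line holding sum is negligible.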


parallel for

Another example of data dependencies is Gauss-Seidel iteration.

    #pragma omp parallel for private(i,j) shared(u)
    for (j=0; j<jmax; j++) {
      for (i=0; i<imax; i++) {
        u[id(i,j)]=0.25*(u[id(i-1,j)]+u[id(i+1,j)]
                        +u[id(i,j-1)]+u[id(i,j+1)]);
      }
    }

This will produce incorrect results because it does not respect the fact that u[id(10,10)] should be updated after u[id(9,9)].


parallel for

To parallelise Gauss-Seidel correctly, first need to identify inherent parallelism: all entries along i + j = const can be updated in parallel.

[Figure: 9×9 grid of nodes with diagonal lines drawn along i + j = const]


parallel for

Hence parallelise first half of loop as

    for (k=0; k<imax; k++) {
      #pragma omp parallel for private(i,j) shared(k,u)
      for (i=0; i<=k; i++) {
        j = k - i;
        u[id(i,j)]=0.25*(u[id(i-1,j)]+u[id(i+1,j)]
                        +u[id(i,j-1)]+u[id(i,j+1)]);
      }
    }

and do the second half similarly.
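Both halves can be covered by a single loop over diagonals, as in the sketch below. Its layout is an assumption made for illustration: a square grid with n interior points per direction, stored in an (n+2) × (n+2) array whose outer layer holds fixed boundary values, and a three-argument id function that differs from the two-argument one used above. A sequential sweep is included only for comparison.

```c
#include <assert.h>

/* Hypothetical index function for an (n+2) x (n+2) array whose
   outermost layer holds fixed boundary values. */
static int id(int i, int j, int n) { return i + j * (n + 2); }

/* One Gauss-Seidel sweep over the interior points 1..n, visiting
   the diagonals i + j = k in increasing order of k. */
void gauss_seidel_wavefront(double *u, int n)
{
    assert(n >= 1);
    for (int k = 2; k <= 2 * n; k++) {
        int ilo = (k - n > 1) ? k - n : 1;   /* keeps j = k - i <= n */
        int ihi = (k - 1 < n) ? k - 1 : n;   /* keeps j = k - i >= 1 */

        #pragma omp parallel for shared(u)
        for (int i = ilo; i <= ihi; i++) {
            int j = k - i;
            u[id(i, j, n)] = 0.25 * (u[id(i - 1, j, n)] + u[id(i + 1, j, n)]
                                   + u[id(i, j - 1, n)] + u[id(i, j + 1, n)]);
        }
    }
}

/* Ordinary sequential sweep, for comparison only. */
void gauss_seidel_sequential(double *u, int n)
{
    for (int j = 1; j <= n; j++)
        for (int i = 1; i <= n; i++)
            u[id(i, j, n)] = 0.25 * (u[id(i - 1, j, n)] + u[id(i + 1, j, n)]
                                   + u[id(i, j - 1, n)] + u[id(i, j + 1, n)]);
}
```

Points on one diagonal only read values from the neighbouring diagonals k-1 (already updated) and k+1 (not yet updated), so the wavefront sweep should reproduce the sequential sweep. The price is a thread start-up and synchronisation for every diagonal, and short diagonals near the corners give little work to share.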


Other OpenMP Directives

• parallel sections and section
  Defines a number of sections of code to be handled by multiple threads, one per section.

• parallel
  Most general parallel construct, defining code to be executed by multiple threads, often with low-level control over what each thread does, based on its thread number.


Financial Applications

For financial applications (and most others too) parallel for with shared, private and reduction clauses should be all that is needed.

Monte Carlo
• use parallel for for parallel execution of paths, with reduction to combine the results to get average value
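A minimal sketch of this pattern is given below. The function toy_path_payoff is a hypothetical, deterministic stand-in for simulating one path; a real code would draw random numbers there, and each thread would then need its own independent random number stream.

```c
#include <assert.h>

/* Hypothetical stand-in for simulating one path and returning its
   discounted payoff; a real code would generate random numbers here. */
static double toy_path_payoff(int path)
{
    return (path % 2 == 0) ? 1.0 : 3.0;
}

/* Average payoff over npaths independent paths. */
double mc_average(int npaths)
{
    assert(npaths > 0);
    double sum = 0.0;

    /* paths are independent, so the loop can be split between
       threads; reduction combines the per-thread partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int p = 0; p < npaths; p++) {
        sum += toy_path_payoff(p);
    }
    return sum / npaths;
}
```

The same structure extends to accumulating the sum of squared payoffs in a second reduction variable when a variance estimate is wanted.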


Financial Applications

Multi-dimensional Black-Scholes solution
• for explicit FD methods, use parallel for for outermost grid dimension
• for implicit FD methods, details depend on the iterative solver
  – methods like GMRES and BiCGstab will need reduction for vector dot products
  – Gauss-Seidel and ILU preconditioners will require careful re-writing to expose inherent parallelism
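For the explicit case, a sketch of the loop structure might look as follows. The stencil is a generic 5-point diffusion update chosen only to keep the example short; it is not the Black-Scholes discretisation. The function name, the row-by-row storage and the parameter lambda are all assumptions.

```c
#include <assert.h>

/* Hypothetical explicit update on an ni x nj grid stored row by row;
   lambda is an assumed constant combining timestep and grid spacing.
   Boundary values are simply copied. */
void explicit_step(const double *u_old, double *u_new,
                   int ni, int nj, double lambda)
{
    assert(ni >= 3 && nj >= 3);

    #pragma omp parallel for shared(u_old, u_new)
    for (int j = 0; j < nj; j++) {          /* outermost dimension in parallel */
        for (int i = 0; i < ni; i++) {
            int c = i + j * ni;
            if (i == 0 || i == ni - 1 || j == 0 || j == nj - 1) {
                u_new[c] = u_old[c];
            } else {
                u_new[c] = u_old[c] + lambda * (u_old[c - 1] + u_old[c + 1]
                         + u_old[c - ni] + u_old[c + ni] - 4.0 * u_old[c]);
            }
        }
    }
}
```

Because every new value is computed from the old array only, there are no dependencies between iterations and no reduction or critical section is needed.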


Numerically Intensive Computing in Finance

Lecture 7: Distributed Memory Multiprocessors

Mike Giles


Idealisation

BSP hardware model (Valiant, McColl)

[Figure: six processor (P) / memory (M) nodes attached to a network]

• a number of processor/memory nodes connected by a ‘network’
• each processor has fast access to local memory and slow access to remote memory
• real hardware differs in having usual memory/cache/register hierarchy


Network

IBM’s Blue Gene uses a hypercube generalisation of a 2D network array


Network

Clusters use a commodity switch (Gigabit Ethernet, Myrinet, Infiniband)

Key performance measures are:
• latency – minimum time to communicate between two processors
• bandwidth per processor


Network

Gigabit Ethernet
• latency: 1–2 ms if using TCP/IP; 50 µs if using custom drivers
• bandwidth per processor: 1Gb/s ≈ 100MB/s
• now standard for PCs/servers

10Gig Ethernet
• same latency as Gigabit Ethernet
• bandwidth per processor: 10Gb/s ≈ 1GB/s
• starting to be used for servers


Network

Myrinet (from Myricom)
• latency: 10 µs
• bandwidth per processor: 2–10Gb/s ≈ 250MB–1GB/s
• the current proprietary market leader for distributed-memory systems

Infiniband
• latency: 10 µs
• bandwidth per processor: 10–40Gb/s ≈ 1–5GB/s
• a new standard being adopted by major manufacturers, including IBM and SUN


Distributed-memory Computing

On commodity clusters, each node has its own independent Unix operating system kernel
• completely independent computers, connected by a network
• each handles its own file i/o, network i/o, process scheduling
• if one machine “dies”, the rest carry on regardless


Distributed-memory Computing

Slightly different on the IBM Blue Gene
• micro-kernel on each node
• specialised functions such as file i/o only performed on certain nodes
• not clear what happens when one node “dies”


Distributed-memory Computing

User applications involve coordination between multiple processes
• problem data is split up between multiple processes
• typically, each process has unique use of one node, to avoid scheduling difficulties
• during program development, can run multiple processes on one node to test code
• processes communicate by sending messages to each other


Distributed-memory Computing

Basic process loop (repeated):

    do some work using local data
              ↓
    communicate between processes
              ↓
    (back to the start)


Message Passing

• standard software (MPI, PVM) for all systems
• simple, crude, effective
• requires action by both processes
• sending:
  – write message (put data into an array)
  – send to other process
• receiving:
  – receive message
  – read message (copy into another array)


Message Passing

MPI (Message Passing Interface) is the standard:
• FORTRAN, C and C++ implementations available on all major platforms
• designed by committee, so lots of options, but in practice few are needed
• highly optimised
• safe for use in parallel libraries

PVM (Parallel Virtual Machine) is an older library, now obsolete.


Message Passing Concepts

Two key concepts: buffering and blocking

Message passing is buffered if the message is transferred via a message buffer, and not directly from/to the process memory.

Buffering is less efficient because of copying the data, but it is usually simpler and safer (less scope for the programmer to make mistakes).


Message Passing

Message passing is blocking if the process waits to complete the send or receive operation before continuing.

Non-blocking operations can be more efficient (allowing possible overlap of computation and communication) but can be more confusing, and error prone.


Message Passing

When using buffering, it is simplest to use non-blocking send (like sending a letter) and blocking receive (wait for the postman to deliver the post).

Without buffering, it is simplest to use blocking send/receive leading to a synchronous transfer (like sending a fax).


Message Passing

• buffered, non-blocking send
  – task A continues after sending message
• buffered, blocking receive
  – task B waits until it gets message

    task A            task B
      .                 .
      .                 .
    send(B,msg)         .
      .                 .
      .               recv(A,msg)
      .                 .


Message Passing

The big problem to be avoided is deadlock, in which all processors are waiting for someone else to send a message. This is a common error for beginners.

    task A            task B
      .                 .
    recv(B,msg1)      recv(A,msg2)
    send(B,msg2)      send(A,msg1)
      .                 .

Each recv is waiting for the send that follows the other task's recv.


Message Passing

For buffered transfers with a non-blocking send, the following works correctly,

    task A            task B
      .                 .
    send(B,msg1)      send(A,msg2)
    recv(B,msg2)      recv(A,msg1)
      .                 .

but it still leads to deadlock for sends which are blocking/synchronous.


Message Passing

For synchronous transfers, must use the following:

    task A            task B
      .                 .
    send(B,msg1)  →   recv(A,msg1)
    recv(B,msg2)  ←   send(A,msg2)
      .                 .


Numerically Intensive Computing in Finance

Lecture 8: BSP Model of Distributed Computing

Mike Giles


BSP Hardware Model

[Figure: six processor (P) / memory (M) nodes attached to a network]

• a number of processor/memory nodes connected by a ‘network’
• each processor has fast access to local memory and slow access to remote memory

Aim is to predict likely performance on real hardware, and make choices about alternative implementation strategies


BSP Parameters

p = number of processors

s = processor speed (Mflops)

$$ l = \frac{\text{latency/synchronisation time}}{\text{time for 1 floating point op}} $$

$$ g = \frac{\text{time to get/send 1 f.p. variable}}{\text{time to do 1 floating point op}} $$

Note:
• p, l, g are non-dimensional
• estimated execution time will be $s^{-1} f(p, l, g)$
• local memory access times are neglected – no modelling of cache performance


BSP Parameters

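For a cluster with Intel/AMD processors and Myrinet networking:

$$ l \approx \frac{10\,\mu\text{s}}{0.5\,\text{ns}} = 2\times 10^4, \qquad g \approx \frac{50\,\text{ns}}{0.5\,\text{ns}} = 100 $$

For an IBM Blue Gene system with slower processors and faster networking:

$$ l \approx \frac{10\,\mu\text{s}}{1\,\text{ns}} = 10^4, \qquad g \approx \frac{25\,\text{ns}}{1\,\text{ns}} = 25 $$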


BSP Computation Model

Execution proceeds in supersteps separated by synchronisations

[Figure: timeline alternating superstep, synch, superstep, synch]

Each superstep consists of each process doing some calculations using local data then communicating some data to other processors


BSP Cost Modelling

The cost of a single superstep is

$$ s^{-1} (n_o + l + n_c\, g) $$

where
$n_o$ = max number of f.p. operations performed by one process
$n_c$ = max number of real variables communicated by one process

For a given application and problem size, $n_o$ and $n_c$ will depend on p.

The BSP cost of the whole task is just the sum of the costs of the individual supersteps.
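The cost formula translates directly into code. The two functions below are a hypothetical sketch (the names and the array-based interface are assumptions); they return the cost in microseconds when s is given in Mflops, since one operation then takes 1/s microseconds.

```c
#include <assert.h>

/* Cost of one superstep, s^{-1} (n_o + l + n_c g).
   With s in Mflops the result is in microseconds. */
double bsp_superstep_cost(double s, double l, double g,
                          double n_o, double n_c)
{
    assert(s > 0.0);
    return (n_o + l + n_c * g) / s;
}

/* Cost of a whole task: the sum over its supersteps. */
double bsp_total_cost(double s, double l, double g,
                      const double *n_o, const double *n_c, int nsteps)
{
    double total = 0.0;
    for (int k = 0; k < nsteps; k++) {
        total += bsp_superstep_cost(s, l, g, n_o[k], n_c[k]);
    }
    return total;
}
```

For instance, with s = 2000 Mflops, l = 2×10^4 and g = 100, a superstep with n_o = 10^6 and n_c = 1000 costs (10^6 + 2×10^4 + 10^5)/2000 = 560 microseconds, of which the computation accounts for 500.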


Explicit FD Calculation

Suppose we want to perform a 2D explicit FD calculation on a grid which is $N_1 \times N_2$.

To do this, we will partition the grid using a “processor grid” which is $p_1 \times p_2$ (with $p = p_1 p_2$).


Explicit FD Calculation

[Figure: grid of nodes divided into rectangular partitions, one per processor]


Explicit FD Calculation

[Figure: a single partition surrounded by its halo nodes]

To minimise memory requirements, each processor works with just its part of the overall grid, of size $\frac{N_1}{p_1} \times \frac{N_2}{p_2}$, plus a copy of the neighbouring nodes from adjacent partitions, often known as “halo nodes”.


Explicit FD Calculation

If each timestep requires m operations per grid point, the number of operations per processor in each superstep is

$$ n_o = m\, \frac{N_1 N_2}{p_1 p_2} $$

The new values of halo nodes then have to be communicated to the neighbours on all four sides, so

$$ n_c = 2 \left( \frac{N_1}{p_1} + \frac{N_2}{p_2} \right) $$

and the total BSP cost is

$$ T = s^{-1} \left( m\, \frac{N_1 N_2}{p_1 p_2} + l + 2g \left( \frac{N_1}{p_1} + \frac{N_2}{p_2} \right) \right) $$


Explicit FD Calculation

Re-writing it as

$$ T = s^{-1} \left( m\, \frac{N_1 N_2}{p} + l + 2g \left( \frac{N_1}{p_1} + \frac{p_1 N_2}{p} \right) \right) $$

and treating $p_1$ as continuous, with p fixed, we find this is minimised when

$$ \frac{N_1}{p_1} = \frac{N_2}{p_2} $$

This gives us our first result using BSP modelling: time is minimised by using square partitions (minimum ratio of surface to volume).
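The minimisation step can be spelled out as follows (a worked version of the argument above, holding $p$, $N_1$ and $N_2$ fixed and using $p = p_1 p_2$):

```latex
\[
\frac{\partial T}{\partial p_1}
  = 2 g\, s^{-1} \left( -\frac{N_1}{p_1^2} + \frac{N_2}{p} \right) = 0
\quad\Longrightarrow\quad
\frac{N_1}{p_1} = \frac{p_1 N_2}{p} = \frac{N_2}{p_2},
\]
\[
\frac{\partial^2 T}{\partial p_1^2} = 4 g\, s^{-1}\, \frac{N_1}{p_1^3} > 0 ,
\]
```

so the stationary point is indeed a minimum. Only the communication term depends on $p_1$; the computation and latency terms are unaffected by the shape of the partitions.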


Explicit FD Calculation

If we now define

$$ N_{\mathrm{local}} = \frac{N_1}{p_1} = \frac{N_2}{p_2} $$

then the total cost is

$$ T = s^{-1} \left( m N_{\mathrm{local}}^2 + l + 4 g N_{\mathrm{local}} \right) $$

For good efficiency, want communication and latency costs to be small compared to computation, so require

$$ N_{\mathrm{local}} \gg \max\left( \sqrt{l/m},\; 4g/m \right) $$

This is our second BSP result: the minimum problem size for effective parallelisation.
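As a rough worked example of this bound (a sketch only: $l = 2\times10^4$ and $g = 100$ are the Myrinet cluster estimates quoted earlier, while $m = 10$ operations per grid point is purely an assumed figure):

```latex
\[
\sqrt{l/m} = \sqrt{2\times 10^{4}/10} = \sqrt{2000} \approx 45,
\qquad
4g/m = 4 \times 100 / 10 = 40,
\]
\[
\text{so we need } N_{\mathrm{local}} \gg 45 .
\]
```

Under these assumptions the latency and bandwidth terms impose similar limits, and partitions need to be at least a few hundred points per side before the overheads become small.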


Explicit FD Calculation

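Suppose we now consider a d-dimensional problem, with each partition of size $N_{\mathrm{local}}^d$.

In this case, the total BSP cost per timestep is

$$ T = s^{-1} \left( m N_{\mathrm{local}}^d + l + 2\, d\, g\, N_{\mathrm{local}}^{d-1} \right) $$

For good efficiency require

$$ N_{\mathrm{local}} \gg \max\left( \left( \frac{l}{m} \right)^{1/d},\; \frac{2\, d\, g}{m} \right) $$

– probably best satisfied for d=3.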


Explicit FD Calculation

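In general, we define parallel efficiency as

$$ \text{Parallel efficiency} = \frac{\text{sequential time}}{p \times \text{parallel time}} $$

In the 2D explicit FD case,

$$ \text{sequential time} = m\, s^{-1} N_1 N_2 = m\, s^{-1} p\, N_{\mathrm{local}}^2 $$

so we get

$$ \text{Parallel efficiency} = \left( 1 + \frac{l}{m N_{\mathrm{local}}^2} + \frac{4g}{m N_{\mathrm{local}}} \right)^{-1} $$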
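To get a feel for the numbers, the efficiency formula can be evaluated with a small helper. The function below is a hypothetical sketch, and the value m = 10 used in the example afterwards is an assumption, not a figure from the lectures.

```c
#include <assert.h>

/* Parallel efficiency of the 2D explicit FD calculation,
   (1 + l/(m N^2) + 4g/(m N))^{-1}, with N = N_local. */
double parallel_efficiency_2d(double l, double g, double m, double n_local)
{
    assert(m > 0.0 && n_local > 0.0);
    return 1.0 / (1.0 + l / (m * n_local * n_local)
                      + 4.0 * g / (m * n_local));
}
```

With the Myrinet estimates l = 2×10^4 and g = 100, and assuming m = 10, a partition with N_local = 100 gives an efficiency of 1/1.6 = 0.625, while N_local = 1000 gives about 0.96.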


Parallel Efficiency and Scalability

Scalability concerns what happens as you increase the number of processors. However, one has to be careful with how it is defined:
• Fixed overall problem size: as p increases, $N_{\mathrm{local}}$ decreases so the parallel efficiency decreases.
• Fixed problem size per processor: as p increases, $N_{\mathrm{local}}$ remains fixed and so does the parallel efficiency.

Personally, I think the second definition is more appropriate – the point of using lots of processors is to be able to tackle really big problems.


Halos for FD Calculations

[Figure: 7-point stencil (rows of 2, 3 and 2 nodes) in the interior of a partition]

The FD approximation to the 2D Black-Scholes equation uses a 7-point stencil because of the cross-derivative.


Halos for FD Calculations

At first sight, it looks as if this will require halo transfers from 2 of the diagonal neighbours, as well as the four immediate neighbours.

However, with a little care, extra transfers can be avoided.

The key is to complete halo exchange in the x-direction before starting halo exchange in the y-direction.


Halos for FD Calculations

[Figure: 4 rows of 6 dotted nodes: the interior plus the left and right halo columns]

After the exchange in the x-direction, with the immediate neighbours on either side, the nodes with dots have up-to-date values.


Halos for FD Calculations

[Figure: 6 rows of 6 dotted nodes: the interior plus all halo nodes, including the corners]

After the exchange in the y-direction, with the immediate neighbours on either side, all nodes have up-to-date values. The corner values come from copying the neighbours’ halos.


Numerically Intensive Computing in Finance

Lecture 9: An Introduction to MPI Message-Passing

Mike Giles


Key Reference

Using MPI: portable parallel programming with the message-passing interface (second edition) by Gropp, Lusk and Skjellum is excellent!

• starts with basics and adds to them slowly
• emphasises that most people need only a limited subset of MPI
• lots of examples of direct relevance
• I suggest you stick to Chapters 1–4:
  – Background
  – Introduction
  – Using MPI in Simple Programs
  – Intermediate MPI.


Some Basics

A program using MPI must be compiled with a special command, usually mpicc or mpcc.

One thing this does is to provide a link to a header file mpi.h which must be included in each C file using the line

    #include "mpi.h"

When run interactively, it is executed by a special command of the form

    mprun -np n program

where n is the number of processes to be used.

Some Basics

An MPI program usually starts with the lines

MPI_Init(&argc,&argv);

MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

MPI_Comm_rank(MPI_COMM_WORLD, &myid);

• MPI_Init initialises things
• MPI_Comm_size gives the number of processes
• MPI_Comm_rank gives the “rank” within the group (0 ≤ myid < nprocs)

Some Basics

MPI_COMM_WORLD is a communicator which in this case is a constant defined in mpi.h to denote the entire set of processes.

It is possible to construct other communicators, e.g. for communication between a subset of processes, or to protect/isolate communication within a library.

Some Basics

The first routines to learn about are:
• MPI_Bcast to broadcast data from one process to the others;
• MPI_Reduce to reduce data from all processes to one;
• MPI_Send, MPI_Recv, MPI_Sendrecv to send messages between processes;
• MPI_Finalize terminates all MPI communication.

You can go a long way using just these routines.

MPI_Bcast

The syntax of the broadcast subroutine is

MPI_Bcast(*data,size,type,origin,communicator)

• data is the data to be sent
• size is the number of pieces of data
• type is its type (e.g. MPI_INT or MPI_DOUBLE)
• origin is the rank of the process doing the broadcast

MPI_Reduce

Similarly, the syntax of the reduction subroutine is

MPI_Reduce(*input,*output,size,type,operation,destination,communicator)

• input is the data to be reduced
• output is where the result is put on the process given by destination; use MPI_Allreduce instead to send the output to all processes
• operation is the reduction operation to be performed (e.g. MPI_SUM or MPI_MAX)
• the others are the same as for MPI_Bcast

MPI_Send

MPI_Send(*data,size,type,destination,tag,communicator)

• destination is where the message is to be sent, and tag is a user-chosen integer label
• to be safe, think of this as a blocking synchronous send; for small messages it may have a non-blocking implementation using a system buffer

MPI_Recv

MPI_Recv(*data,size,type,origin,tag,communicator,*status)

• a blocking receive, will wait for a message with correct origin and tag, but these can be set to MPI_ANY_SOURCE and MPI_ANY_TAG
• status is a variable of special type MPI_Status with additional information
• note that incoming messages do not have to be read in the order in which they arrive

MPI_Sendrecv

MPI_Sendrecv(*data1,size1,type1,dest,tag1,
             *data2,size2,type2,orig,tag2,
             communicator,*status)

• a combined blocking send and receive; my personal favourite when most processes need to both send and receive
• can use MPI_PROC_NULL as destination (or origin) if there is no message to be sent (or received)
• combining operations enables the MPI implementation to be more efficient

Vector datatypes

So far, all of the send/receive routines have dealt with contiguous blocks of data. However, in practice the data to be communicated is often not contiguous (e.g. 2D halo exchange).

What to do?
• Option 1: copy everything into a contiguous array, then send
• Option 2: use MPI’s capability to define new vector datatypes

Vector datatypes

1 2 3 4 5 6 7

8 9 10 11 12 13 14

15 16 17 18 19 20 21

22 23 24 25 26 27 28

29 30 31 32 33 34 35

Simplest to show an example from “Using MPI”:

MPI_Type_vector(5,1,7,MPI_DOUBLE,&newtype)

Vector datatypes

After the new data type has been defined it has to be “committed” using the command

MPI_Type_commit(&newtype)

and then it can be used, as in

MPI_Send(&data[3],1,newtype,destination,tag,communicator)

Note this specifies just one item, of type newtype, with data[3] being the start of the item.

Vector datatypes

The general syntax of MPI_Type_vector is

MPI_Type_vector(count,size,stride,oldtype,*newtype)

• count is the number of blocks
• size is the size of each block (often 1) composed of type oldtype
• stride is the offset between each block (≥ size)
• newtype is the label for the new datatype, of type MPI_Datatype

Note: oldtype can itself be a derived datatype, so you can build up very complex datatypes.
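To make the meaning of count, size and stride concrete, here is a minimal serial sketch in plain C of which elements such a vector datatype selects. The helper gather_vector is hypothetical (it is not an MPI routine); it simply copies the selected elements into a contiguous array, which is also what "Option 1" would do by hand before a send.

```c
#include <stddef.h>

/* Hypothetical helper (not part of MPI): copies the elements that a
   datatype built with (count, size, stride) would select, starting at
   base, into the contiguous array out. Returns the number copied. */
size_t gather_vector(const double *base, int count, int size, int stride,
                     double *out)
{
    size_t n = 0;
    for (int b = 0; b < count; b++)        /* loop over blocks */
        for (int k = 0; k < size; k++)     /* elements within one block */
            out[n++] = base[(size_t)b * stride + k];
    return n;
}
```

With the 5×7 array of values 1..35 stored row by row, the arguments (5,1,7) applied at &data[3] pick out the column 4, 11, 18, 25, 32.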

Numerically Intensive Computing in Finance

Lecture 10: Explicit and Implicit FD Methods

Mike Giles

Explicit FD Calculation

To recap, the explicit B-S discretisation is

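$$V^{n+1} = (1 - r\Delta t)\,V^n + \frac{r^*\Delta t}{2\Delta\eta}\left(\delta_{2\eta_1} + \delta_{2\eta_2}\right)V^n + \frac{\sigma^2\Delta t}{2\Delta\eta^2}\left((1-\rho)\,\delta^2_{\eta_1} + \rho\,\delta^2_{\eta_1\eta_2} + (1-\rho)\,\delta^2_{\eta_2}\right)V^n$$

giving a 7-point stencil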

Explicit FD Calculation

The computational grid is broken into partitions ...

Explicit FD Calculation

... and each timestep involves calculations on each partition followed by an updating of the halo data – a single BSP superstep

Explicit FD Calculation

One practical point to note: each MPI task should only allocate memory for the partition and its halo, not the entire grid.

There are two options on handling indices and arrays within each partition:
• use usual “global” indices with an adjustment to the definition of id(i,j) so that id(i,j) = offset+i+j*imax_local
• use “local” indices with standard arrays without offsets – this is my personal preference.
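The two indexing options can be sketched as follows. The Partition struct, the function names, and the particular definition of offset (chosen so that the first stored point maps to index 0) are assumptions made for illustration; the slides only give the form of id(i,j).

```c
/* Hypothetical description of the points stored by one MPI task
   (partition plus halo). */
typedef struct {
    int i0, j0;        /* global indices of the first stored point */
    int imax_local;    /* number of stored points in the i-direction */
    int jmax_local;    /* number of stored points in the j-direction */
} Partition;

/* Option 1: global indices, with an offset folded into id(i,j). */
int id_global(const Partition *p, int i, int j)
{
    int offset = -(p->i0 + p->j0 * p->imax_local);
    return offset + i + j * p->imax_local;
}

/* Option 2: local indices 0..imax_local-1 and 0..jmax_local-1, no offset. */
int id_local(const Partition *p, int i, int j)
{
    return i + j * p->imax_local;
}
```

Either way the array only needs imax_local*jmax_local entries, rather than one entry per point of the whole grid.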

Implicit FD Calculation

The implicit discretisation is

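$$(1 + r\Delta t)\,V^{n+1} - \frac{r^*\Delta t}{2\Delta\eta}\left(\delta_{2\eta_1} + \delta_{2\eta_2}\right)V^{n+1} - \frac{\sigma^2\Delta t}{2\Delta\eta^2}\left((1-\rho)\,\delta^2_{\eta_1} + \rho\,\delta^2_{\eta_1\eta_2} + (1-\rho)\,\delta^2_{\eta_2}\right)V^{n+1} = V^n$$

which may be written collectively as

$$A\,V^{n+1} = b$$

giving a system of simultaneous equations to be solved iteratively.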

Implicit FD Calculation

Note that the operations necessary to evaluate the matrix-vector product AV are essentially the same as for an explicit timestep.

Assuming the halo data is up-to-date, one can compute on each partition the elements of the product AV which correspond to grid points in that partition.

Jacobi iteration

The difficulties involved in parallelising an implicit FD calculation depend on how the implicit equations are solved.

The simplest approach would be to use Jacobi iteration in which each point is updated using old values of its neighbours.

$$A_{k,k}\,V_k^{(m+1)} = b_k - \sum_{l \neq k} A_{k,l}\,V_l^{(m)}$$

Each Jacobi iteration step requires just one superstep to update the interior points and exchange halo data with neighbouring partitions.
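A serial sketch of one Jacobi sweep may help to show why it parallelises so easily. It assumes, purely for brevity, a small dense matrix stored row by row; in the FD setting A is sparse and only the stencil neighbours contribute. The name jacobi_sweep is illustrative.

```c
/* One Jacobi sweep for the n-by-n system A v = b (A stored row-major).
   Each new value depends only on old values, so every point can be
   updated independently of the others. */
void jacobi_sweep(int n, const double *A, const double *b,
                  const double *v_old, double *v_new)
{
    for (int k = 0; k < n; k++) {
        double sum = b[k];
        for (int l = 0; l < n; l++)
            if (l != k)
                sum -= A[k * n + l] * v_old[l];
        v_new[k] = sum / A[k * n + k];
    }
}
```

In the parallel version each partition would run this loop over its own points, reading neighbouring values from its halo, and then exchange halos before the next sweep.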

CG Iteration

If a Krylov iterative solver with a simple diagonal preconditioner is used, then parallelisation is also straightforward.

To see this, we will consider the use of CG to solve

$$Ax = b$$

with A being symmetric and positive definite, with a sparse 5-point stencil in 2D.

CG Algorithm

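    $x_0 = 0$;  $k = 0$;  $r_0 = b - A x_0$
    while $|r_k| >$ tolerance
        $k = k + 1$
        if $k = 1$
            $p_1 = r_0$
        else
            $\beta_k = r_{k-1}^T r_{k-1} / r_{k-2}^T r_{k-2}$
            $p_k = r_{k-1} + \beta_k p_{k-1}$
        end
        $\alpha_k = p_k^T r_{k-1} / p_k^T A p_k$
        $x_k = x_{k-1} + \alpha_k p_k$
        $r_k = r_{k-1} - \alpha_k A p_k$
    end
    $x = x_k$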
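The algorithm translates almost line for line into C. The sketch below is serial and assumes a small dense matrix stored row by row, with no preconditioner; the names cg_solve, dot and matvec are illustrative, and the max_iter guard is an addition not present in the pseudocode.

```c
#include <stdlib.h>

static double dot(int n, const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Dense matrix-vector product y = A x, with A stored row-major. */
static void matvec(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++) y[i] += A[i * n + j] * x[j];
    }
}

/* Conjugate gradients for symmetric positive definite A, starting from
   x = 0. Returns the number of iterations, or -1 if allocation fails. */
int cg_solve(int n, const double *A, const double *b, double *x,
             double tol, int max_iter)
{
    double *r = malloc(3 * (size_t)n * sizeof *r);
    if (!r) return -1;
    double *p = r + n, *Ap = p + n;
    double rr_old = 0.0;
    int k = 0;

    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; } /* r0 = b - A*0 */
    double rr = dot(n, r, r);

    while (rr > tol * tol && k < max_iter) {   /* |r| > tolerance */
        k++;
        if (k == 1) {
            for (int i = 0; i < n; i++) p[i] = r[i];
        } else {
            double beta = rr / rr_old;         /* r_{k-1}.r_{k-1} / r_{k-2}.r_{k-2} */
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        }
        matvec(n, A, p, Ap);
        double alpha = dot(n, p, r) / dot(n, p, Ap);
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        rr_old = rr;
        rr = dot(n, r, r);
    }
    free(r);
    return k;
}
```

In a parallel code the two kinds of operation to watch are the matrix-vector product (which needs up-to-date halos) and the dot products (which need a global reduction).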

CG Algorithm

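The core part of the algorithm, after some re-arrangement, is

$$\begin{aligned}
\alpha_k &= p_k^T r_{k-1} / p_k^T A p_k \\
x_k &= x_{k-1} + \alpha_k p_k \\
r_k &= r_{k-1} - \alpha_k A p_k \\
\beta_{k+1} &= r_k^T r_k / r_{k-1}^T r_{k-1} \\
p_{k+1} &= r_k + \beta_{k+1} p_k
\end{aligned}$$

which can be calculated in three supersteps as follows: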

CG Algorithm

Superstep 1:
• compute $A p_k$ on local partition
• compute local contributions to $p_k^T r_{k-1}$ and $p_k^T A p_k$ and send to others

Superstep 2:
• compute $\alpha_k$ and update $x_k$ and $r_k$
• compute local contribution to $r_k^T r_k$ and send to others

Superstep 3:
• compute $\beta_{k+1}$ and update $p_{k+1}$
• exchange $p_{k+1}$ halos

CG Algorithm

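Alternatively, if we write it as

$$\begin{aligned}
p_k &= r_{k-1} + \beta_k p_{k-1} \\
\alpha_k &= p_k^T r_{k-1} / p_k^T A p_k \\
x_k &= x_{k-1} + \alpha_k p_k \\
r_k &= r_{k-1} - \alpha_k A p_k \\
\beta_{k+1} &= r_k^T r_k / r_{k-1}^T r_{k-1}
\end{aligned}$$

then it can be done in two supersteps as follows: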

CG Algorithm

Superstep 1:
• finish computing $\beta_k$ and update $p_k$ (including halo copies)
• compute $A p_k$ on local partition
• compute local contributions to $p_k^T r_{k-1}$ and $p_k^T A p_k$ and send to others with $A p_k$ halo

Superstep 2:
• compute $\alpha_k$ and update $x_k$ and $r_k$ (including halo copies)
• compute local contribution to $r_k^T r_k$ and send to others

CG Algorithm

This second approach is unusual:
• usually don’t modify halo values – just “read-only” copies from the neighbouring partition
• works in this case because they are updated in exactly the same way as on the “master” partition

Shows a little creativity can reduce the execution time.

Numerically Intensive Computing in Finance

Lecture 11: More on Implicit Methods

Mike Giles

Gauss-Seidel

If Gauss-Seidel is used to solve the equations, or as a preconditioner, it is harder to parallelise.
• start by partitioning the grid into strips

Gauss-Seidel: first approach

• first superstep: start with first row, first strip and work across to first partition boundary to send ‘halo’ point to neighbour

Gauss-Seidel: first approach

• next superstep: do second row of first strip, and first row of second strip

Gauss-Seidel: first approach

• additional supersteps: continue the process until the grid is completed

Gauss-Seidel: first approach

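If the grid size is $N \times N$, then

$$\#\text{ supersteps} = N + p - 1 \approx N, \quad \text{assuming } p \ll N$$

$$\text{cost of single step} = s^{-1}\left(\frac{14N}{p} + l + 2g\right)$$

$$\Longrightarrow\quad \text{Total cost} = s^{-1}\left(\frac{14N^2}{p} + Nl + 2Ng\right) \approx s^{-1}\left(\frac{14N^2}{p} + Nl\right)$$

since $l \gg g$.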
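The superstep count can be checked with a small sketch of the schedule. It assumes rows and strips are numbered from 0, that strip q can start row j only after strip q-1 has finished row j, and that supersteps are counted from 1; the function names are illustrative.

```c
/* Superstep (counting from 1) in which strip q updates row j. */
int gs_superstep(int j, int q)
{
    return j + q + 1;
}

/* Total number of supersteps for N rows and p strips: N + p - 1. */
int gs_total_supersteps(int N, int p)
{
    return gs_superstep(N - 1, p - 1);
}

/* Number of strips that have a row to work on during superstep t. */
int gs_active_strips(int N, int p, int t)
{
    int count = 0;
    for (int q = 0; q < p; q++) {
        int j = t - 1 - q;            /* row strip q would handle now */
        if (j >= 0 && j < N) count++;
    }
    return count;
}
```

Only the first and last p-1 supersteps have idle strips, which is why the count is approximately N when p is much smaller than N.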

Gauss-Seidel: first approach

For good parallel efficiency, need

$$\frac{N}{p} \gg \frac{l}{14}$$

For real hardware this implies huge problems are needed to make good use of parallelism.

What to do?

1) Use Jacobi iteration instead – tempting but lazy. Personal view: start with best numerical algorithm and then worry about how to parallelise it.

2) Reduce number of supersteps

Gauss-Seidel: second approach

Same as before, except do m rows before transferring boundary data to neighbouring partition

Gauss-Seidel: second approach

$$\#\text{ supersteps} = \frac{N}{m} + p - 1 \approx \frac{N}{m} + p$$

$$\text{superstep cost} = s^{-1}\left(\frac{14mN}{p} + l + 2mg\right)$$

$$\text{Total time } T \approx s^{-1}\left(\frac{N}{m} + p\right)\left(\frac{14mN}{p} + l + 2mg\right)$$

Note that $N \gg pg$ is necessary for communication time to be negligible compared to computation. This condition is satisfied for large problems on hardware with high bandwidth.

Gauss-Seidel: second approach

If it is satisfied, then

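$$T \approx s^{-1}\left(14mN + pl + \frac{14N^2}{p} + \frac{Nl}{m}\right)$$

For fixed values of $N, p, s, g, l$ the total time is a minimum when

$$\frac{dT}{dm} = 0 \;\Longrightarrow\; 14N - \frac{Nl}{m^2} = 0 \;\Longrightarrow\; m = \sqrt{l/14}$$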
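For reference, the approximation for T used here comes from expanding the product of the superstep count and the superstep cost, then dropping the two terms in g, which are small when N ≫ pg. A sketch of that step, together with a check that the stationary point is a minimum:

```latex
\begin{align*}
T &\approx s^{-1}\left(\frac{N}{m}+p\right)\left(\frac{14mN}{p}+l+2mg\right) \\
  &= s^{-1}\left(\frac{14N^2}{p}+\frac{Nl}{m}+2Ng+14mN+pl+2mpg\right) \\
  &\approx s^{-1}\left(14mN+pl+\frac{14N^2}{p}+\frac{Nl}{m}\right)
     \quad\text{since } 2Ng \ll \frac{14N^2}{p}
     \text{ and } 2mpg \ll 14mN \text{ when } N \gg pg, \\
\frac{d^2T}{dm^2} &= s^{-1}\,\frac{2Nl}{m^3} > 0 .
\end{align*}
```

The second derivative is positive for all m > 0, so the value of m at which dT/dm vanishes does give the minimum time.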

Gauss-Seidel: second approach

For this optimum value for m we get

$$T \approx s^{-1}\left(\frac{14N^2}{p} + 2N\sqrt{14\,l} + pl\right) = s^{-1}\,\frac{14N^2}{p}\left(1 + \frac{p\sqrt{l/14}}{N}\right)^2$$

and so for good parallel efficiency we require $N \gg p\sqrt{l}$ in addition to $N \gg pg$.

These restrictions are now achievable with reasonably large problem sizes on real hardware.

Gauss-Seidel

The lessons to be learned from this are:
• don’t settle for the most obvious solution; if it doesn’t give good performance work out why and try to find a solution
• the optimal parallel algorithm may depend on hardware BSP parameters; the attraction of BSP cost modelling is that it allows you to model the tradeoffs

ILU

ILU (incomplete LU factorisation) is sometimes used as a preconditioner for iterative solvers such as GMRES.

It involves solving two systems of equations with triangular matrices

$$LUx = b \;\Longrightarrow\; Ly = b, \quad Ux = y$$

The L solution is like the forward sweep in G-S; the U solution is like the reverse sweep.
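The two triangular solves can be sketched serially as below. Dense row-major storage is assumed only to keep the example short (ILU factors are sparse in practice), and the function names are illustrative. The point to notice is the data dependence: each unknown needs the ones computed before it, which is what makes these sweeps behave like Gauss-Seidel when parallelised.

```c
/* Forward substitution: solve L y = b for lower-triangular L. */
void forward_subst(int n, const double *L, const double *b, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++) s -= L[i * n + j] * y[j];
        y[i] = s / L[i * n + i];
    }
}

/* Backward substitution: solve U x = y for upper-triangular U. */
void backward_subst(int n, const double *U, const double *y, double *x)
{
    for (int i = n - 1; i >= 0; i--) {
        double s = y[i];
        for (int j = i + 1; j < n; j++) s -= U[i * n + j] * x[j];
        x[i] = s / U[i * n + i];
    }
}
```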

ADI

Parallelisation of ADI preconditioners is complicated because of the tri-diagonal equations to be solved.

Start by dividing the $N \times N$ grid into $\sqrt{p} \times \sqrt{p}$ partitions to minimise communication costs.

ADI

Using Thomas algorithm to solve the equations, in first superstep, begin m columns and communicate appropriate data to neighbours:

ADI

In second superstep, do the next m columns:

Repeat until the forward sweep is complete.

ADI

Use a similar procedure for the reverse sweep.

Optimum value for m can be deduced from BSP cost analysis.
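For readers who have not met it, the Thomas algorithm for a single tridiagonal system is sketched below in serial form. The array names a, b, c, d and the function name thomas_solve are illustrative, and no pivoting is done (the sketch assumes a well-conditioned, e.g. diagonally dominant, system). The forward and reverse sweeps are the parts that get pipelined across partitions in the ADI procedure.

```c
/* Thomas algorithm for the tridiagonal system
       a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],   i = 0..n-1,
   where a[0] and c[n-1] are not used.
   c and d are overwritten as workspace. */
void thomas_solve(int n, const double *a, const double *b,
                  double *c, double *d, double *x)
{
    /* forward sweep: eliminate the sub-diagonal */
    c[0] /= b[0];
    d[0] /= b[0];
    for (int i = 1; i < n; i++) {
        double denom = b[i] - a[i] * c[i - 1];
        if (i < n - 1) c[i] /= denom;
        d[i] = (d[i] - a[i] * d[i - 1]) / denom;
    }

    /* reverse sweep: back-substitute */
    x[n - 1] = d[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x[i] = d[i] - c[i] * x[i + 1];
}
```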

Numerically Intensive Computing in Finance

Lecture 12: More on MPI

Mike Giles

Cartesian Grids

Most finite difference methods use structured grids with i,j,k indexing (as opposed to finite element methods that often use unstructured grids composed of triangles/tetrahedra with a very general connectivity).

MPI calls them Cartesian grids, and provides a number of special routines to make it easy to work with them.

Cartesian Grids

MPI_Dims_create(nprocs,ndim,*pdims)

This routine creates a process grid to partition a multi-dimensional Cartesian grid
• nprocs is the number of processes (input)
• ndim is the number of dimensions (input)
• pdims is an array containing the dimensions of the process grid (output) with the product being equal to nprocs

Cartesian Grids

MPI_Cart_create(oldcomm, ndim, *pdims, *periodic, reorder, *newcomm)

This routine assigns processes to the process grid and creates a new communicator
• oldcomm is the old communicator (usually MPI_COMM_WORLD); newcomm is the new one
• ndim and pdims are same as before
• periodic is an array defining whether the grid is to be periodic
• reorder is an integer flag (passed by value) specifying whether to give MPI full freedom in how to assign processors

Cartesian Grids

MPI_Cart_coords(newcomm,myid,ndim,*coords)

This routine gives the coordinates of the process within the Cartesian process grid
• myid is the rank obtained by calling MPI_Comm_rank(newcomm,*myid)
• coords is an integer array of size ndim giving the coordinates

Cartesian Grids

MPI_Cart_shift(newcomm,dir,shift,*src,*dest)

Often want to shift data from a processor to its neighbour in a particular direction. This routine gives the IDs of the two neighbouring processes:
• 0 ≤ dir < ndim is the direction
• shift is the size of shift (usually 1)
• src is the ID of the process below (the source of shifted messages)
• dest is the ID of the process above (the destination of shifted messages)
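As an aid to intuition, the following plain-C sketch mimics what the coordinate and shift routines compute for a non-periodic process grid. It is not MPI code: the function names are hypothetical, NO_NEIGHBOUR is a stand-in for MPI_PROC_NULL, and it assumes the row-major rank ordering that MPI uses for Cartesian topologies.

```c
#define NO_NEIGHBOUR (-1)   /* hypothetical stand-in for MPI_PROC_NULL */
#define MAX_DIM 8           /* sketch only handles ndim <= MAX_DIM */

/* Row-major mapping from a rank to its coordinates in the process grid. */
void rank_to_coords(int rank, int ndim, const int *pdims, int *coords)
{
    for (int d = ndim - 1; d >= 0; d--) {
        coords[d] = rank % pdims[d];
        rank /= pdims[d];
    }
}

/* Inverse mapping from coordinates back to a rank. */
int coords_to_rank(int ndim, const int *pdims, const int *coords)
{
    int rank = 0;
    for (int d = 0; d < ndim; d++)
        rank = rank * pdims[d] + coords[d];
    return rank;
}

/* Ranks of the neighbours a distance shift below (src) and above (dest)
   in direction dir; NO_NEIGHBOUR if that would fall outside the grid. */
void shift_neighbours(int rank, int ndim, const int *pdims, int dir,
                      int shift, int *src, int *dest)
{
    int coords[MAX_DIM];
    *src = *dest = NO_NEIGHBOUR;
    if (ndim > MAX_DIM) return;

    rank_to_coords(rank, ndim, pdims, coords);
    int c = coords[dir];

    coords[dir] = c - shift;
    if (coords[dir] >= 0 && coords[dir] < pdims[dir])
        *src = coords_to_rank(ndim, pdims, coords);

    coords[dir] = c + shift;
    if (coords[dir] >= 0 && coords[dir] < pdims[dir])
        *dest = coords_to_rank(ndim, pdims, coords);
}
```

A process on the edge of the grid gets NO_NEIGHBOUR on that side, which corresponds to passing MPI_PROC_NULL to MPI_Sendrecv so that no message is sent or received there.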

Cartesian Grids

These routines provide all the key capabilities for working with Cartesian grids – all are used in Practical 5.

The one thing not provided is a simple routine to exchange halos – this you have to program yourself using a vector datatype.

Cartesian Grids

[Figure: a 2D partition with nx points in the x-direction and ny points in the y-direction, showing its halo nodes]

In 2D, exchange in x-direction uses a stride of nx, and exchange in y-direction is a simple contiguous transfer.

Cartesian Grids

In 3D, it is hard to visualise, but
• in x-direction, halo has ny*nz elements with a stride of nx
• in z-direction, halo is a single contiguous block of size nx*ny
• in y-direction, halo has nz blocks of size nx with stride nx*ny

Practical 5 generalises this to an arbitrary number of dimensions.
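The three 3D cases can be written down as (count, size, stride) triples of the kind passed to MPI_Type_vector. The sketch below is plain C, assumes the array is stored with x varying fastest, and uses hypothetical names (VecLayout, halo_layout_3d, count_on_plane); the second function is only there to check that a layout really does pick out one plane of the block.

```c
typedef struct { int count, size, stride; } VecLayout;

/* Layout of one face of an nx*ny*nz block (x fastest) normal to
   direction dir = 0 (x), 1 (y) or 2 (z). */
VecLayout halo_layout_3d(int nx, int ny, int nz, int dir)
{
    VecLayout v;
    if (dir == 0)      { v.count = ny * nz; v.size = 1;       v.stride = nx; }
    else if (dir == 1) { v.count = nz;      v.size = nx;      v.stride = nx * ny; }
    else               { v.count = 1;       v.size = nx * ny; v.stride = nx * ny; }
    return v;
}

/* Counts how many of the elements selected by layout v, starting at
   element start, have coordinate `plane` in direction dir. */
int count_on_plane(int nx, int ny, VecLayout v, int start, int dir, int plane)
{
    int hits = 0;
    for (int b = 0; b < v.count; b++)
        for (int k = 0; k < v.size; k++) {
            int idx = start + b * v.stride + k;
            int c[3] = { idx % nx, (idx / nx) % ny, idx / (nx * ny) };
            if (c[dir] == plane) hits++;
        }
    return hits;
}
```

The starting element selects which plane is described: for example, start = nx-1 gives the last x-plane, and start = nx gives the y-plane j = 1.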

Blocking and Buffering

In lecture 7, we discussed the concepts of blocking and buffering:
• blocking means the program waits until the operation has completed before continuing
• buffering means the data is copied to a temporary buffer during transmission

MPI provides 5 different combinations of send and receive, enough to confuse anyone!
• MPI_Send, MPI_Ssend, MPI_Bsend all pair with MPI_Recv
• MPI_Sendrecv works on its own
• MPI_Isend pairs with MPI_Irecv

Blocking and Buffering

MPI_Recv is a blocking receive
• program can only continue once the message has arrived

MPI_Ssend is a blocking send
• synchronous transfer like sending a fax
• simple, but generally not efficient due to unnecessary waiting

Blocking and Buffering

MPI_Bsend is a non-blocking buffered send
• copies the data into a buffer before continuing
• user must supply the buffer – see documentation
• generally good for efficiency (less waiting) but copying data costs time, and supplying the buffer is tedious and error-prone

MPI_Send is a cross between MPI_Ssend and MPI_Bsend, a reasonable compromise
• uses internal buffer for small messages
• uses synchronous transfer for large ones

Blocking and Buffering

MPI_Sendrecv is a blocking send/recv pair
• very well suited to halo exchange
• the MPI system decides the order of sending and receiving
• no buffering so no time wasted on copying
• very easy to use

Blocking and Buffering

MPI_Isend and MPI_Irecv are quite different, non-blocking operations producing an asynchronous transfer

Continuing the letter/fax analogy, these are like shipping a piano using a courier company:
• sender says “here’s where the piano is”
• receiver says “here’s where I want it to go when it arrives”
• the courier ships directly, when both ready
• sender and receiver continue as usual, occasionally checking to see if the piano has gone/arrived

Blocking and Buffering

The syntax for MPI_Isend is:

MPI_Isend(*data,size,type,destination,tag,communicator,*request)

The one extra argument compared to MPI_Send is request. This is a handle which can be used later to check if the send operation has been completed, or to wait for it to complete.

Blocking and Buffering

Similarly, the syntax for MPI_Irecv is:

MPI_Irecv(*data,size,type,origin,tag,communicator,*request)

Compared to MPI_Recv the argument status has been replaced by the handle request.

Blocking and Buffering

The status of a request can be tested with the command

MPI_Test(*request,*flag,*status)

with flag being true if it has been completed.

Alternatively, can wait for it to be completed using

MPI_Wait(*request,*status)

There are also MPI_Waitall and MPI_Waitany variants for handling multiple requests; they do what their names suggest.

Blocking and Buffering

When using MPI_Isend and MPI_Irecv the important thing is not to touch the data after the start of the transfer and before its completion.

The whole point of using MPI_Test and MPI_Wait is to know when it is safe to start using the data on the receiving side, and to re-use the storage on the sending side.

Using MPI advocates this form of sending messages; I agree in principle, but I think MPI_Sendrecv is simpler / more intuitive.

Other MPI Capabilities

• more general datatypes, useful for handling structures or a mix of real and integer variables
• support for parallel libraries, to make sure message-passing within the library does not conflict with the user’s own message-passing
• MPI error handling
• various scatter/gather operations in addition to broadcast
• routines for constructing new communicators, e.g. to enable different groups of processes to do different tasks

Other MPI Capabilities

MPI-2 (new standard, not yet fully implemented):
• dynamic spawning of new processes, and their inclusion into new communicators
• parallel file I/O – better performance than all file I/O being done by one process
• remote memory operations put/get, directly accessing remote memory without any action by the remote process

Final Advice

• get a solid understanding of the basics
• read the early chapters of Using MPI carefully, maybe skim through the rest
• if using MPI_Send, check it works if you use MPI_Ssend instead
• only use advanced capabilities if you’re sure they will greatly simplify the programming (e.g. the Cartesian utilities) or greatly improve performance
• keep the MPI code as isolated as possible from the main application code
• if it’s an important application, discuss it with others with more experience

Practical 4

Global view of data partitioning:

[Figure: global grid, with the index j running from j = 0 and j = 1 at one end to j = m − 2 and j = m − 1 at the other.]

Practical 4

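Local view of data partitioning:

[Figure: local grid on one process, with the correspondence between global and local indices shown below.]

global index        local index
j = jlower − 1      jlocal = 0           (halo)
j = jlower          jlocal = 1
j = jupper          jlocal = jmax − 2
j = jupper + 1      jlocal = jmax − 1    (halo)

jlower = ((m − 2) ∗ myid)/nprocs + 1

jupper = ((m − 2) ∗ (myid + 1))/nprocs

jmax = jupper − jlower + 3

joff = jlower − 1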
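The formulas above can be sketched in C as follows. This is an illustration only: the struct Partition and the functions make_partition and to_local are hypothetical names, and the sketch assumes that the divisions are integer (truncating) divisions and that jlocal = j − joff, which is consistent with the index correspondence in the local view.

```c
#include <assert.h>

/* Index range owned by one process, plus local array information. */
typedef struct {
    int jlower;  /* first global index owned by this process */
    int jupper;  /* last global index owned by this process */
    int jmax;    /* local array length, including the two halo entries */
    int joff;    /* offset between global and local index */
} Partition;

/* Assumes integer division, m >= 2 and 0 <= myid < nprocs. */
Partition make_partition(int m, int myid, int nprocs)
{
    assert(m >= 2 && nprocs >= 1 && myid >= 0 && myid < nprocs);
    Partition p;
    p.jlower = ((m - 2) * myid) / nprocs + 1;
    p.jupper = ((m - 2) * (myid + 1)) / nprocs;
    p.jmax   = p.jupper - p.jlower + 3;
    p.joff   = p.jlower - 1;
    return p;
}

/* Convert a global index j to the local index on this process
   (assumed relation: jlocal = j - joff). */
int to_local(const Partition *p, int j)
{
    return j - p->joff;
}
```

For example, with m = 10 and nprocs = 3 the owned ranges are j = 1 to 2, j = 3 to 5 and j = 6 to 8, so the interior points j = 1, ..., m − 2 are each owned by exactly one process, and the points jlower − 1 and jupper + 1 map to the halo entries jlocal = 0 and jlocal = jmax − 1.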
