
University of Tartu Institute of Computer Science

Scientific Computing (Teadusarvutused)

[email protected]

Tartu, 2013

Practical information

Lectures: Liivi 2 - 207, Wed 10:15
Computer classes: Liivi 2 - 205, Thu 16:15 – Oleg Batrasev, [email protected]
EAP (credit points)
The final grade is formed from:

1. Active participation at lectures ≈ 10%

2. Exam

3. Computer class exercises

4. Project work


Course homepage (http://www.ut.ee/~eero/SC/)

1 Introduction

Scientific Computing = Science + Engineering (Computer Science) + Mathematics

1.1 Course overview

(Overview diagram of Scientific Computing and its neighbouring fields, by Prof. Rob Scheichl)


Lectures:

• NumPy, SciPy

• Scientific Computing - an Overview


• Large problems in Linear Algebra, condition number

• Algorithm Complexity Analysis

• Memory hierarchies and making use of it

• BLAS and LAPACK libraries with some examples

• Numerical integration and differentiation

• Numerical solution of differential and integral equations

• Using parallel computers for large problems

• Parallel programming models

• MPI (Message Passing Interface)


1.2 Literature

Scientific Computing:

1. Victor Eijkhout, Introduction to High Performance Scientific Computing, 2010 (available online, e.g. through http://www.lulu.com)

2. RH Landau, A First Course in Scientific Computing. Symbolic, Graphic, and Numeric Modeling Using Maple, Java, Mathematica, and Fortran90. Princeton University Press, 2005.

3. LR Scott, T Clark, B Bagheri. Scientific Parallel Computing. Princeton University Press, 2005.

4. MT Heath, Scientific Computing; ISBN: 007112229X, McGraw-Hill Companies, 2001.


5. JW Demmel, Applied Numerical Linear Algebra; ISBN: 0898713897, Society for Industrial & Applied Mathematics, Paperback, 1997.

Python:

1. Hans Petter Langtangen, Python Scripting for Computational Science. Third Edition, Springer 2008. Book website (http://folk.uio.no/hpl/scripting/).

2. Neeme Kahusk, Sissejuhatus Pythonisse (Introduction to Python; http://www.cl.ut.ee/inimesed/nkahusk/sissejuhatus-pythonisse/).

3. Brad Dayley, Python Phrasebook, Sams 2007.

4. Travis E. Oliphant, Guide to NumPy (http://www.tramy.us), Trelgol Publishing, 2006.


MPI:

1. W Gropp, E Lusk, A Skjellum, Using MPI, MIT Press, 1994.

2. MPI Commands. (http://www-unix.mcs.anl.gov/mpi/www/index.html)


1.3 Scripting vs programming

1.3.1 What is a script?

• Very high-level, often short, program written in a high-level scripting language

• Scripting languages:

– Unix shells,

– Tcl,

– Perl,

– Python,

– Ruby,

– Scheme,

– Rexx,

– JavaScript,

– VisualBasic,

– ...

• This course: Python


1.3.2 Characteristics of a script

• Glue other programs together

• Extensive text processing

• File and directory manipulation

• Often special-purpose code

• Many small interacting scripts may yield a big system

• Perhaps a special-purpose GUI on top

• Portable across Unix, Windows, Mac

• Interpreted program (no compilation+linking)


1.3.3 Why not stick to Java, C/C++ or Fortran?

Features of Perl and Python compared with Java, C/C++ and Fortran:

• shorter, more high-level programs

• much faster software development

• more convenient programming

• you feel more productive

Two main reasons:

• no variable declarations, but lots of consistency checks at run time (see the short example below)

• lots of standardized libraries and tools
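For illustration, a minimal sketch (my own addition, not from the slides; the function name add is arbitrary): Python accepts any argument types without declarations and reports a mismatch only when the offending line actually runs.

def add(a, b):
    return a + b              # no declared types: works for numbers, strings, lists, ...

print(add(2, 3))              # 5
print(add("ab", "cd"))        # abcd
try:
    add(2, "cd")              # inconsistent argument types
except TypeError:             # the check happens at run time, not at compile time
    print("TypeError caught at run time")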


1.4 Scripts yield short code

Consider reading real numbers from a file, where each line can contain an arbitrary number of real numbers:

1.1 9 5.2
1.762543E-02
0 0.01 0.001 9 3 7

Python solution:

F = open(filename, 'r')
n = F.read().split()

Perl solution:

open F, $filename;
$s = join "", <F>;
@n = split ' ', $s;

Doing this in C++ or Java requires at least a loop, and in Fortran and C quite some code lines are necessary.

1.4.1 Using regular expressions

Suppose we want to read complex numbers written as text: (-3, 1.4) or (-1.437625E-9, 7.11) or ( 4, 2 )

Python solution:

import re
m = re.search(r'\(\s*([^,]+)\s*,\s*([^,]+)\s*\)',
              '(-3,1.4)')
re, im = [float(x) for x in m.groups()]   # NB: this rebinds the name re

Perl solution:

$s = "(-3, 1.4)";
($re, $im) = $s =~ /\(\s*([^,]+)\s*,\s*([^,]+)\s*\)/;


Regular expressions like \(\s*([^,]+)\s*,\s*([^,]+)\s*\)

constitute a powerful language for specifying text patterns

• Doing the same thing, without regular expressions, in Fortran and C requires quite some low-level code at the character array level

Remark: we could read pairs (-3, 1.4) without using regular expressions:

s = '(-3, 1.4 )'
re, im = s[1:-1].split(',')


1.5 Classification of languages

Many criteria to classify computer languages

• Dynamically vs statically typed languages

Python (dynamic):

c = 1           # c is an integer
c = [1,2,3]     # c is a list

C (static):

double c; c = 5.2;     // c can only hold doubles
c = "a string...";     // compiler error


• Weakly vs strongly typed languages

Perl (weak):

$b = '1.2';
$c = 5*$b;    # implicit type conversion: '1.2' -> 1.2

Python (strong):

b = '1.2'
c = 5*b       # '1.21.21.21.21.2' - no implicit type conversion

• Interpreted vs compiled languages

• High-level vs low-level languages (Python-C)

• Very high-level vs high-level languages (Python-C)


• Scripting vs system languages

• Object-oriented vs functional vs procedural


1.6 Performance issues

1.6.1 Scripts can be slow

• Perl and Python scripts are first compiled to byte-code

• The byte-code is then interpreted

• Text processing is usually as fast as in C

• Loops over large data structures might be very slow

for i in range(len(A)):

A[i] = ...


• Fortran, C and C++ compilers are good at optimizing such loops at compile time and produce very efficient assembly code (e.g. 100 times faster)

• Fortunately, long loops in scripts can easily be migrated to Fortran or C

1.6.2 Scripts may be fast enough

Read 100 000 (x,y) data from file and write (x,f(y)) out again

• Pure Python: 4s

• Pure Perl: 3s

• Pure Tcl: 11s

• Pure C (fscanf/fprintf): 1s

• Pure C++ (iostream): 3.6s

• Pure C++ (buffered streams): 2.5s


• Numerical Python modules: 2.2s (!)

• Remark: in practice, 100 000 data points are written and read in binary format, resulting in much smaller differences

1.6.3 When scripting is convenient

• The application’s main task is to connect together existing components

• The application includes a graphical user interface

• The application performs extensive string/text manipulation

• The design of the application code is expected to change significantly

• CPU-time intensive parts can be migrated to C/C++ or Fortran

• The application can be made short if it operates heavily on list or hash structures


• The application is supposed to communicate with Web servers

• The application should run without modifications on Unix, Windows, and Macintosh computers, also when a GUI is included

1.6.4 When to use C, C++, Java, Fortran

• Does the application implement complicated algorithms and data structures?

• Does the application manipulate large datasets so that execution speed is critical?

• Are the application’s functions well-defined and changing slowly?

• Will type-safe languages be an advantage, e.g., in large development teams?


2 Python in Scientific Computing

2.1 Numerical Python (NumPy)

• NumPy enables efficient numerical computing in Python

• NumPy is a package of modules, which offers efficient arrays (contiguous storage) with associated array operations coded in C or Fortran

• There are three implementations of Numerical Python

• Numeric from the mid 90s (still widely used)

• numarray from about 2000

• numpy from 2006 (the new and leading implementation)

• We recommend using numpy (by Travis Oliphant)


# A taste of NumPy: a least-squares procedure
from scitools.all import *
n = 20; x = linspace(0.0, 1.0, n)        # coordinates
y_line = -2*x + 3
y = y_line + random.normal(0, 0.25, n)   # line with noise
# create and solve least squares system:
A = array([x, ones(n)])
A = A.transpose()
result = linalg.lstsq(A, y)
# result is a 4-tuple, the solution (a,b) is the 1st entry:
a, b = result[0]
plot(x, y, 'o',          # data points w/noise
     x, y_line, 'r',     # original line
     x, a*x + b, 'b')    # fitted line
legend('data points', 'original line', 'fitted line')
hardcopy('vahimruut.png')


Resulting plot:


2.1.1 NumPy: making arrays

>>> from numpy import *
>>> n = 4

>>> a = zeros(n) # one-dim. array of length n

>>> print a # str(a), float (C double) is default type

[ 0. 0. 0. 0.]

>>> a # repr(a)

array([ 0., 0., 0., 0.])

>>> p = q = 2

>>> a = zeros((p,q,3)) # p*q*3 three-dim. array

>>> print a

[[[ 0. 0. 0.]

[ 0. 0. 0.]]

[[ 0. 0. 0.]

[ 0. 0. 0.]]]

>>> a.shape # a’s dimension

(2, 2, 3)


2.1.2 NumPy: making float, int, complex arrays

>>> a = zeros(3)
>>> print a.dtype            # a's data type
float64
>>> a = zeros(3, int)
>>> print a
[0 0 0]
>>> print a.dtype
int32
>>> a = zeros(3, float32)    # single precision
>>> print a
[ 0.  0.  0.]
>>> print a.dtype
float32
>>> a = zeros(3, complex); a
array([ 0.+0.j,  0.+0.j,  0.+0.j])
>>> a.dtype
dtype('complex128')


• Given an array a, make a new array of same dimension and data type:

>>> x = zeros(a.shape, a.dtype)

2.1.3 Array with a sequence of numbers

• linspace(a, b, n) generates n uniformly spaced coordinates, starting with a and ending with b

>>> x = linspace(-5, 5, 11)

>>> print x

[-5. -4. -3. -2. -1. 0. 1. 2. 3. 4. 5.]


• A special compact syntax is also available:

>>> a = r_[-5:5:11j]    # same as linspace(-5, 5, 11)
>>> print a
[-5. -4. -3. -2. -1.  0.  1.  2.  3.  4.  5.]

• arange works like range (xrange):

>>> x = arange(-5, 5, 1, float)
>>> print x   # upper limit 5 is not included
[-5. -4. -3. -2. -1.  0.  1.  2.  3.  4.]

2.1.4 Array construction from a Python list

array(list, [datatype]) generates an array from a list:


>>> pl = [0, 1.2, 4, -9.1, 5, 8]

>>> a = array(pl)


• The array elements are of the simplest possible type:

>>> z = array([1, 2, 3])

>>> print z # int elements possible

[1 2 3]

>>> z = array([1, 2, 3], float)

>>> print z

[ 1.  2.  3.]

• A two-dim. array from two one-dim. lists:

>>> x = [0, 0.5, 1]; y = [-6.1, -2, 1.2] # Python lists

>>> a = array([x, y]) # form array with x and y as rows


• From array to list: alist = a.tolist()

2.1.5 From “anything” to a NumPy array

• Given an object a, a = asarray(a)

converts a to a NumPy array (if possible/necessary)

• Arrays can be ordered as in C (default) or Fortran:

a = asarray(a, order='Fortran')
isfortran(a)   # returns True if a's order is Fortran


• Use asarray to, e.g., allow flexible arguments in functions:

def myfunc(some_sequence, ...):

a = asarray(some_sequence)

# work with a as array

myfunc([1,2,3], ...)

myfunc((-1,1), ...)

myfunc(zeros(10), ...)


2.1.6 Changing array dimensions

>>> a = array([0, 1.2, 4, -9.1, 5, 8])

>>> a.shape = (2,3) # turn a into a 2x3 matrix

>>> a.shape

(2, 3)

>>> a.size

6

>>> a.shape = (a.size,)   # turn a into a vector of length 6 again

>>> a.shape

(6,)

>>> a = a.reshape(2,3) # same effect as setting a.shape

>>> a.shape

(2, 3)


2.1.7 Array initialization from a Python function

>>> def myfunc(i, j):

... return (i+1)*(j+4-i)

...

>>> # make 3x6 array where a[i,j] = myfunc(i,j):

>>> a = fromfunction(myfunc, (3,6))

>>> a

array([[ 4., 5., 6., 7., 8., 9.],

[ 6., 8., 10., 12., 14., 16.],

[ 6., 9., 12., 15., 18., 21.]])


2.1.8 Basic array indexing

a = linspace(-1, 1, 6)

# array([-1. , -0.6, -0.2, 0.2, 0.6, 1. ])

a[2:4] = -1 # set a[2] and a[3] equal to -1

a[-1] = a[0] # set last element equal to first one

a[:] = 0 # set all elements of a equal to 0

a.fill(0) # set all elements of a equal to 0

a.shape = (2,3) # turn a into a 2x3 matrix

print a[0,1] # print element (0,1)

a[i,j] = 10 # assignment to element (i,j)

a[i][j] = 10 # equivalent syntax (slower)

print a[:,k] # print column with index k

print a[1,:] # print second row

a[:,:] = 0 # set all elements of a equal to 0


2.1.9 More advanced array indexing

>>> a = linspace(0, 29, 30)
>>> a.shape = (5,6)
>>> a
array([[  0.,   1.,   2.,   3.,   4.,   5.],
       [  6.,   7.,   8.,   9.,  10.,  11.],
       [ 12.,  13.,  14.,  15.,  16.,  17.],
       [ 18.,  19.,  20.,  21.,  22.,  23.],
       [ 24.,  25.,  26.,  27.,  28.,  29.]])

>>> a[1:3,:-1:2] # a[i,j] for i=1,2 and j=0,2,4

array([[ 6., 8., 10.],

[ 12., 14., 16.]])

>>> a[::3,2:-1:2] # a[i,j] for i=0,3 and j=2,4

array([[ 2., 4.],

[ 20., 22.]])


>>> i = slice(None, None, 3); j = slice(2, -1, 2)

>>> a[i,j]

array([[ 2., 4.],

[ 20., 22.]])

2.1.10 Slices refer the array data

• With a as list, a[:] makes a copy of the data

• With a as array, a[:] is a reference to the data!!!

>>> b = a[1,:]    # extract 2nd row of a

>>> print a[1,1]

12.0

>>> b[1] = 2

>>> print a[1,1]


2.0        # change in b is reflected in a

• Take a copy to avoid referencing via slices:

>>> b = a[1,:].copy()

>>> print a[1,1]

12.0

>>> b[1] = 2 # b and a are two different arrays now

>>> print a[1,1]

12.0 # a is not affected by change in b


2.1.11 Integer arrays as indices

• An integer array or list can be used as a (vectorized) index:

>>> a = linspace(1, 8, 8)

>>> a

array([ 1., 2., 3., 4., 5., 6., 7., 8.])

>>> a[[1,6,7]] = 10

>>> a # ?

array([ 1., 10., 3., 4., 5., 6., 10., 10.])

>>> a[range(2,8,3)] = -2

>>> a # ?

array([ 1., 10., -2., 4., 5., -2., 10., 10.])

>>> a[a < 0] # pick out the negative elements of a

array([-2., -2.])

>>> a[a < 0] = a.max()

>>> a # ?


array([  1.,  10.,  10.,   4.,   5.,  10.,  10.,  10.])

• Such array indices are important for efficient vectorized code

2.1.12 Loops over arrays

• Standard loop over each element:

for i in xrange(a.shape[0]):

for j in xrange(a.shape[1]):

a[i,j] = (i+1)*(j+1)*(j+2)

print ’a[%d,%d]=%g ’ % (i,j,a[i,j]),

print # newline after each row


• A standard for loop iterates over the first index:

>>> print a

[[ 2. 6. 12.]

[ 4. 12. 24.]]

>>> for e in a:

... print e

...

[ 2. 6. 12.]

[ 4. 12. 24.]

• View array as one-dimensional and iterate over all elements:

for e in a.flat:

print e


• For loop over all index tuples and values:

>>> for index, value in ndenumerate(a):

... print index, value

...

(0, 0) 2.0

(0, 1) 6.0

(0, 2) 12.0

(1, 0) 4.0

(1, 1) 12.0

(1, 2) 24.0


2.1.13 Array computations

• Arithmetic operations can be used with arrays:

b = 3*a - 1    # a is array, b becomes array

This is computed as: 1) compute t1 = 3*a, 2) compute t2 = t1 - 1, 3) set b = t2

• Array operations are much faster than element-wise operations:

>>> import time    # module for measuring CPU time

>>> a = linspace(0, 1, 1E+07) # create some array

>>> t0 = time.clock()

>>> b = 3*a -1

>>> t1 = time.clock() # t1-t0 is the CPU time of 3*a-1

>>> for i in xrange(a.size): b[i] = 3*a[i] - 1

>>> t2 = time.clock()


>>> print ’3*a-1: %g sec, loop: %g sec’ % (t1-t0, t2-t1)

3*a-1: 2.09 sec, loop: 31.27 sec

2.1.14 In-place array arithmetics

• Expressions like 3*a-1 generate temporary arrays

• With in-place modifications of arrays, we can avoid temporary arrays (to some extent)

b = a
b *= 3    # or multiply(b, 3, b)
b -= 1    # or subtract(b, 1, b)

Note: a is changed too (b is just a reference to a); use b = a.copy() to avoid that.


• In-place operations:

a *= 3.0    # multiply a's elements by 3
a -= 1.0    # subtract 1 from each element
a /= 3.0    # divide each element by 3
a += 1.0    # add 1 to each element
a **= 2.0   # square all elements

• Assign values to all elements of an existing array:

a[:] = 3*c - 1


2.1.15 Standard math functions can take array arguments

# let b be an array

c = sin(b)

c = arcsin(c)

c = sinh(b)

# same functions for the cos and tan families

c = b**2.5 # power function

c = log(b)

c = exp(b)

c = sqrt(b)


2.1.16 Other useful array operations

# a is an array

a.clip(min=3, max=12) # clip elements

a.mean(); mean(a) # mean value

a.var(); var(a) # variance

a.std(); std(a) # standard deviation

median(a)

cov(x,y) # covariance

trapz(a) # Trapezoidal integration

diff(a)                # finite differences (da/dx)

# more Matlab-like functions:
corrcoef, cumprod, diag, eig, eye, fliplr, flipud, max, min,
prod, ptp, rot90, squeeze, sum, svd, tri, tril, triu


2.1.17 Temporary arrays

• Let us evaluate f1(x) for a vector x:

def f1(x):

    return exp(-x*x)*log(1+x*sin(x))

Evaluating f1 creates a chain of temporary arrays (an in-place variant is sketched after this list):

1. temp1 = -x
2. temp2 = temp1*x
3. temp3 = exp(temp2)
4. temp4 = sin(x)
5. temp5 = x*temp4
6. temp6 = 1 + temp5
7. temp7 = log(temp6)

8. result = temp3*temp7
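A rough sketch (added here, not in the original notes) of how the number of temporaries can be reduced with NumPy's in-place operations and ufunc output arguments; the name f1_inplace is mine:

from numpy import exp, log, sin, multiply, empty_like, linspace

def f1_inplace(x, out=None):
    # Same mathematics as f1, but reusing two buffers instead of letting
    # every subexpression allocate a fresh temporary array.
    if out is None:
        out = empty_like(x)
    work = empty_like(x)
    multiply(x, sin(x), work)   # work = x*sin(x)   (sin(x) is still one temporary)
    work += 1                   # work = 1 + x*sin(x)
    log(work, work)             # work = log(1 + x*sin(x))
    multiply(x, x, out)         # out  = x*x
    out *= -1.0                 # out  = -x*x
    exp(out, out)               # out  = exp(-x*x)
    out *= work                 # out  = exp(-x*x)*log(1 + x*sin(x))
    return out

x = linspace(0.1, 1.0, 5)
print(f1_inplace(x))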


2.1.18 More useful array methods and attributes

>>> a = zeros(4) + 3

>>> a

array([ 3., 3., 3., 3.]) # float data

>>> a.item(2) # more efficient than a[2]

3.0

>>> a.itemset(3,-4.5) # more efficient than a[3]=-4.5

>>> a

array([ 3. , 3. , 3. , -4.5])

>>> a.shape = (2,2)

>>> a

array([[ 3. , 3. ],

[ 3. , -4.5]])


>>> a.ravel() # from multi-dim to one-dim

array([ 3. , 3. , 3. , -4.5])

>>> a.ndim # no of dimensions

2

>>> len(a.shape) # no of dimensions

2

>>> rank(a) # no of dimensions

2

>>> a.size # total no of elements

4

>>> b = a.astype(int) # change data type

>>> b

array([3, 3, 3, 3])


2.1.19 Complex number computing

>>> from math import sqrt

>>> sqrt(-1) # ?

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

ValueError: math domain error

>>> from numpy import sqrt

>>> sqrt(-1) # ?

Warning: invalid value encountered in sqrt

nan

>>> from cmath import sqrt # complex math functions

>>> sqrt(-1) # ?

1j

>>> sqrt(4) # cmath functions always return complex...

(2+0j)


>>> from numpy.lib.scimath import sqrt

>>> sqrt(4)

2.0 # real when possible

>>> sqrt(-1)

1j # otherwise complex


2.1.20 A root function

# Goal: compute roots of a parabola, return real when possible,

# otherwise complex

def roots(a, b, c):

# compute roots of a*x^2 + b*x + c = 0

from numpy.lib.scimath import sqrt

q = sqrt(b**2 - 4*a*c) # q is real or complex

r1 = (-b + q)/(2*a)

r2 = (-b - q)/(2*a)

return r1, r2

>>> a = 1; b = 2; c = 100

>>> roots(a, b, c) # complex roots

((-1+9.94987437107j), (-1-9.94987437107j))

>>> a = 1; b = 4; c = 1

>>> roots(a, b, c) # real roots

(-0.267949192431, -3.73205080757)


2.1.21 Array type and data type

>>> import numpy

>>> a = numpy.zeros(5)

>>> type(a)

<type ’numpy.ndarray’>

>>> isinstance(a, ndarray) # is a of type ndarray?

True

>>> a.dtype # data (element) type object

dtype(’float64’)

>>> a.dtype.name

’float64’

>>> a.dtype.char # character code

’d’

>>> a.dtype.itemsize # no of bytes per array element

8


>>> b = zeros(6, float32)

>>> a.dtype == b.dtype # do a and b have the same data type?

False

>>> c = zeros(2, float)

>>> a.dtype == c.dtype

True


2.1.22 Matrix objects

• NumPy has an array type, matrix, much like Matlab's array type:

>>> x1 = array([1, 2, 3], float)

>>> x2 = matrix(x1) # or just mat(x1)

>>> x2 # row vector

matrix([[ 1., 2., 3.]])

>>> x3 = mat(x1).transpose() # column vector

>>> x3

matrix([[ 1.],

[ 2.],

[ 3.]])

>>> type(x3)

<class ’numpy.core.defmatrix.matrix’>

>>> isinstance(x3, matrix)

True


• Only 1- and 2-dimensional arrays can be matrix

• For matrix objects, the * operator means matrix-matrix or matrix-vector multiplication (not elementwise multiplication):

>>> A = eye(3) # identity matrix

>>> A = mat(A) # turn array to matrix

>>> A

matrix([[ 1., 0., 0.],

[ 0., 1., 0.],

[ 0., 0., 1.]])

>>> y2 = x2*A # vector-matrix product

>>> y2

matrix([[ 1., 2., 3.]])

>>> y3 = A*x3 # matrix-vector product

>>> y3

matrix([[ 1.],

[ 2.],

[ 3.]])


2.2 NumPy: Vectorisation

• Loops over an array run slowly

• Vectorization = replace explicit loops by function calls such that the whole loop is implemented in C (or Fortran)

• Explicit loops:

r = zeros(x.shape, x.dtype)
for i in xrange(x.size):
    r[i] = sin(x[i])

• Vectorised version:

r = sin(x)


• Arithmetic expressions work for both scalars and arrays

• Many fundamental functions work for scalars and arrays

• Ex: x**2 + abs(x) works for x scalar or array

A mathematical function written for scalar arguments can (normally) take array arguments:

>>> def f(x):

... return x**2 + sinh(x)*exp(-x) + 1

...

>>> # scalar argument:

>>> x = 2

>>> f(x)

5.4908421805556333

>>> # array argument:

>>> y = array([2, -1, 0, 1.5])


>>> f(y)

array([ 5.49084218, -1.19452805,  1.        ,  3.72510647])

2.2.1 Vectorisation of functions with if tests; problem

• Consider a function with an if test:

def somefunc(x):

if x < 0:

return 0

else:

return sin(x)

# or

def somefunc(x): return 0 if x < 0 else sin(x)


• This function works with a scalar x but not an array

• Problem: x<0 results in a boolean array, not a boolean value that can be used in the if test

>>> x = linspace(-1, 1, 3); print x

[-1. 0. 1.]

>>> y = x < 0

>>> y

array([ True, False, False], dtype=bool)

>>> 'ok' if y else 'not ok'    # test of y in a scalar boolean context
...
ValueError: The truth value of an array with more than one
element is ambiguous. Use a.any() or a.all()


2.2.2 Vectorisation of functions with if tests; solutions

A. Simplest remedy: call NumPy's vectorize function to allow array arguments to a function:

>>> somefuncv = vectorize(somefunc, otypes='d')

>>> # test:

>>> x = linspace(-1, 1, 3); print x

[-1. 0. 1.]

>>> somefuncv(x) # ?

array([ 0.        ,  0.        ,  0.84147098])

Note: The data type must be specified as a character

• The speed of somefuncv is unfortunately quite slow


B. A better solution, using where:

def somefunc_NumPy2(x):

x1 = zeros(x.size, float)

x2 = sin(x)

return where(x < 0, x1, x2)


2.2.3 General vectorization of if-else tests

def f(x):                 # scalar x
    if condition:
        x = <expression1>
    else:
        x = <expression2>
    return x

def f_vectorized(x):      # scalar or array x
    x1 = <expression1>
    x2 = <expression2>
    return where(condition, x1, x2)


2.2.4 Vectorization via slicing

• Consider a recursion scheme (which arises from a one-dimensional diffusion equation)

• Straightforward (slow) Python implementation:

n = size(u)-1
for i in xrange(1,n,1):
    u_new[i] = beta*u[i-1] + (1-2*beta)*u[i] + beta*u[i+1]

• Slices enable us to vectorize the expression:

u[1:n] = beta*u[0:n-1] + (1-2*beta)*u[1:n] + beta*u[2:n+1]


2.3 NumPy: Random numbers

• Drawing scalar random numbers:

import random
random.seed(2198)              # control the seed
print 'uniform random number on (0,1):', random.random()
print 'uniform random number on (-1,1):', random.uniform(-1,1)
print 'Normal(0,1) random number:', random.gauss(0,1)

• Vectorized drawing of random numbers (arrays):

from numpy import random
random.seed(12)                # set seed
u = random.random(n)           # n uniform numbers on (0,1)
u = random.uniform(-1, 1, n)   # n uniform numbers on (-1,1)
u = random.normal(m, s, n)     # n numbers from N(m,s)


• Note that both modules have the name random! A remedy:

import random as random_number # rename random for scalars

from numpy import * # random is now numpy.random


2.4 NumPy: Basic linear algebra

NumPy contains the linalg module for

• solving linear systems

• computing the determinant of a matrix

• computing the inverse of a matrix

• computing eigenvalues and eigenvectors of a matrix

• solving least-squares problems

• computing the singular value decomposition of a matrix

• computing the Cholesky decomposition of a matrix


2.4.1 A linear algebra session

from numpy import *            # includes import of linalg
# fill matrix A and vectors x and b
b = dot(A, x)                  # matrix-vector product
y = linalg.solve(A, b)         # solve A*y = b
if allclose(x, y, atol=1.0E-12, rtol=1.0E-12):
    print '--correct solution'
d = linalg.det(A)
B = linalg.inv(A)
# check result:
R = dot(A, B) - eye(n)         # residual
R_norm = linalg.norm(R)        # Frobenius norm of matrix R
print 'Residual R = A*A-inverse - I:', R_norm
A_eigenvalues = linalg.eigvals(A)                  # eigenvalues only
A_eigenvalues, A_eigenvectors = linalg.eig(A)
for e, v in zip(A_eigenvalues, A_eigenvectors.T):  # eigenvectors are the columns
    print 'eigenvalue %g has corresponding vector\n%s' % (e, v)


2.5 Python: Modules for curve plotting and 2D/3D visualization

• Interface to Gnuplot (curve plotting, 2D scalar and vector fields)

• Matplotlib (curve plotting, 2D scalar and vector fields)

• Interface to Vtk (2D/3D scalar and vector fields)

• Interface to OpenDX (2D/3D scalar and vector fields)

• Interface to IDL

• Interface to Grace

• Interface to Matlab

• Interface to R

• Interface to Blender

• PyX (PostScript/TEX-like drawing)


2.5.1 Curve plotting with Easyviz

• Easyviz is a light-weight interface to many plotting packages, using a Matlab-like syntax

• Goal: write your program using Easyviz ("Matlab") syntax and postpone your choice of plotting package

• Note: some powerful plotting packages (Vtk, R, matplotlib, ...) may be troublesome to install, while Gnuplot is easily installed on all platforms

• Easyviz supports (only) the most common plotting commands

• Easyviz is part of SciTools (Simula development)

from scitools.all import *

(imports all of numpy, all of easyviz, plus scitools)


2.5.2 Basic Easyviz example

from scitools.all import *   # import numpy and plotting

t = linspace(0, 3, 51) # 51 points between 0 and 3

y = t**2*exp(-t**2) # vectorized expression

plot(t, y)

hardcopy(’tmp1.eps’) # make PostScript image for reports

hardcopy(’tmp1.png’) # make PNG image for web pages


2.5.3 Decorating the plot

plot(t, y)
xlabel('t')
ylabel('y')
legend('t^2*exp(-t^2)')
axis([0, 3, -0.05, 0.6])   # [tmin, tmax, ymin, ymax]
title('My First Easyviz Demo')

or:

plot(t, y, xlabel='t', ylabel='y',
     legend='t^2*exp(-t^2)',
     axis=[0, 3, -0.05, 0.6],
     title='My First Easyviz Demo',
     hardcopy='easyviz_demo.png',
     show=True)   # display on the screen (default)


The resulting plot


2.5.4 Plotting several curves in one plot

from scitools.all import *   # for curve plotting

def f1(t):

return t**2*exp(-t**2)

def f2(t):

return t**2*f1(t)

t = linspace(0, 3, 51)

y1 = f1(t); y2 = f2(t)

plot(t, y1)

hold(’on’) # continue plotting in the same plot

plot(t, y2)

xlabel(’t’); ylabel(’y’)

legend(’t^2*exp(-t^2)’,’t^4*exp(-t^2)’)

title(’Plotting two curves in the same plot’)

hardcopy(’two_curves.png’)


The resulting plot


2.5.5 Example: plot a function given on the command line

• Specify f(x) and the x interval as text on the command line:

Unix/DOS> python plotf.py "exp(-0.2*x)*sin(2*pi*x)" 0 4*pi

Program:

import sys
from scitools.all import *

formula = sys.argv[1]

xmin = eval(sys.argv[2])

xmax = eval(sys.argv[3])

x = linspace(xmin, xmax, 101)

y = eval(formula)

plot(x, y, title=formula)

Thanks to eval, input (text) with correct Python syntax can be turned into running

code on the fly


2.5.6 Plotting 2D scalar fields

from scitools.all import *
x = y = linspace(-5, 5, 21)

xv, yv = ndgrid(x, y)

values = sin(sqrt(xv**2 + yv**2))

surf(xv, yv, values)


2.5.7 Adding plot features

# Matlab style commands:

setp(interactive=False)

surf(xv, yv, values)

shading(’flat’)

colorbar()

colormap(hot())

axis([-6,6,-6,6,-1.5,1.5])

view(35,45)

show()

# Optional Easyviz

# (Pythonic) short cut:

surf(xv, yv, values,

shading=’flat’,

colorbar=’on’,

colormap=hot(),

axis=[-6,6,-6,6,-1.5,1.5],

view=[35,45])


The resulting plot


2.5.8 Other commands for visualizing 2D scalar fields

• contour (standard contours), contourf (filled contours), contour3 (elevated contours)

• mesh (elevated mesh), meshc (elevated mesh with contours in the xy plane)

• surf (colored surface), surfc (colored surface with contours in the xy plane)

• pcolor (colored cells in a 2D mesh)

2.5.9 Commands for visualizing 3D fields

Scalar fields:

• isosurface

• slice_ (colors in slice plane), contourslice (contours in slice plane)


Vector fields:

• quiver3 (arrows), (quiver for 2D vector fields)

• streamline, streamtube, streamribbon (flow sheets)

2.5.10 More info about Easyviz

• A plain text version of the Easyviz manual:

pydoc scitools.easyviz

• The HTML version:

http://folk.uio.no/hpl/easyviz/

• Download SciTools (incl. Easyviz):

http://code.google.com/p/scitools/


2.6 I/O

2.6.1 File I/O with arrays; plain ASCII format

• Plain text output to file (just dump repr(array)):

a = linspace(1, 20, 20); a.shape = (2,10)

file = open("tmp.dat", "w")

file.write("Here is an array a:\n")

file.write(repr(a)) # dump string representation of a

file.close()


• Plain text input (just take eval on input line):

file = open("tmp.dat", "r")

file.readline() # load the first line (a comment)

b = eval(file.read())

file.close()


2.6.2 File I/O with arrays; binary pickling

• Dump (serialized) arrays with cPickle:

# a1 and a2 are two arrays

import cPickle

file = open("tmp.dat", "wb")

file.write("This is the array a1:\n")

cPickle.dump(a1, file)

file.write("Here is another array a2:\n")

cPickle.dump(a2, file)

file.close()


Read in the arrays again (in correct order):

file = open("tmp.dat", "rb")

file.readline() # swallow the initial comment line

b1 = cPickle.load(file)

file.readline() # swallow next comment line

b2 = cPickle.load(file)

file.close()


2.7 ScientificPython

• ScientificPython (by Konrad Hinsen)

• Modules for automatic differentiation, interpolation, data fitting via nonlinear least-squares, root finding, numerical integration, basic statistics, histogram computation, visualization, parallel computing (via MPI or BSP), physical quantities with dimension (units), 3D vectors/tensors, polynomials, I/O support for Fortran files and netCDF

• Very easy to install


2.7.1 ScientificPython: numbers with units

>>> from Scientific.Physics.PhysicalQuantities \

import PhysicalQuantity as PQ

>>> m = PQ(12, ’kg’) # number, dimension

>>> a = PQ(’0.88 km/s**2’) # alternative syntax (string)

>>> F = m*a

>>> F # ?

PhysicalQuantity(10.56,’kg*km/s**2’)

>>> F = F.inBaseUnits()

>>> F # ?

PhysicalQuantity(10560.0,’m*kg/s**2’)

>>> F.convertToUnit(’MN’) # convert to Mega Newton

>>> F # ?

PhysicalQuantity(0.01056,’MN’)

>>> F = F + PQ(0.1, ’kPa*m**2’) ; F # kilo Pascal m^2

PhysicalQuantity(0.010759999999999999,’MN’)

>>> F.getValue()

0.010759999999999999


2.8 SciPy

2.8.1 Overview

• SciPy is a comprehensive package (by Eric Jones, Travis Oliphant, Pearu Peterson) for scientific computing with Python

• Much overlap with ScientificPython

• SciPy interfaces many classical Fortran packages from Netlib (QUADPACK,ODEPACK, MINPACK, ...)

• Functionality: special functions, linear algebra, numerical integration, ODEs,random variables and statistics, optimization, root finding, interpolation, ...

• May require some installation effort (uses ATLAS)

• See www.scipy.org


2.8.2 An example - using SciPy for verifying mathematical theory about a numerical method for solving weakly singular integral equations

A SciPy example - calculations of weakly singular Fredholm integral equations with a change of variables (http://www.ut.ee/~eero/WSIE/Fredholm/forpaper.py.html)

It calls the Fortran90 program quad.f90 (http://www.ut.ee/~eero/WSIE/Fredholm/quad.f90.html)

• For generating the Python interface to quad.f90:

f2py --fcompiler=intel -c quad.f90 -m quad

• (The project background is described in the paper: http://math.ut.ee/~eero/publ/EVainikkoGVainikkoSINUM.pdf)


2.9 SymPy: symbolic computing in Python

• SymPy is a Python package for symbolic computing

• Easy to install, easy to extend

• Easy to use:

>>> from sympy import *
>>> x = Symbol('x')

>>> f = cos(acos(x))

>>> f

cos(acos(x))

>>> sin(x).series(x, 4) # 4 terms of the Taylor series

x - 1/6*x**3 + O(x**4)

>>> dcos = diff(cos(2*x), x)

>>> dcos

-2*sin(2*x)


>>> dcos.subs(x, pi).evalf() # x=pi, float evaluation

0

>>> I = integrate(log(x), x)

>>> print I

-x + x*log(x)


2.10 Python + Matlab = true

• A Python module, pymat, enables communication with Matlab:

from numpy import *
import pymat

x = arange(0, 4*pi, 0.1)

m = pymat.open()

# can send numpy arrays to Matlab:

pymat.put(m, ’x’, x);

pymat.eval(m, ’y = sin(x)’)

pymat.eval(m, ’plot(x,y)’)

# get a new numpy array back:

y = pymat.get(m, ’y’)


Another environment for Scientific computing: sage

www.sagemath.org

An example of advanced usage of sage

http://kheiron.at.mt.ut.ee/home/pub/0/


3 What is Scientific Computing?

3.1 Introduction to Scientific Computing

• Scientific computing – a subject at the crossroads of

  – the physical, [chemical, social, engineering, ...] sciences

– problems typically translated into

* linear algebraic problems

* sometimes combinatorial problems

• a computational scientist needs knowledge of some aspects of

– numerical analysis

– linear algebra

– discrete mathematics


• An efficient implementation needs some understanding of

– computer architecture

* both on the CPU level

* on the level of parallel computing

– some specific skills of software management


This course's focus – Scientific Computing as:

a field of study concerned with constructing mathematical models and numerical solution techniques, and using computers to analyse and solve scientific, social science and engineering problems.

• typically – application of computer simulation and other forms of computation to problems in various scientific disciplines.

Main purpose of SC:

• mirroring

• prediction

of real-world processes'

  – characteristics

  – behaviour

  – development


Example of Computational Simulation

ASTROPHYSICS: what happens in a collision of two black holes in the universe?

A situation which is

• impossible to observe in nature,

• impossible to test in a lab,

• barely possible to estimate theoretically

Computer simulation CAN HELP. But what is needed for simulation?:

• adequate mathematical model (Einstein’s general relativity theory)

• algorithm for numerical solution of the equations

• a big enough computer for the actual realisation of the algorithms


Frequently there is a need to simulate situations that could be performed experimentally, but simulation on computers is needed because of:

• HIGH COST OF THE REAL EXPERIMENT. Examples:

– car crash-tests

– simulation of gas explosions

– nuclear explosion simulation

– behaviour of ships in Ocean waves

– airplane aerodynamics

– strength calculations in big constructions (for example oil-platforms)

– oil-field simulations


• TIME FACTOR. Examples:

– Climate change predictions

– Geological development of the Earth (including oil-fields)

– Glacier flow model

– Weather prediction

• SCALE OF THE PROBLEM. Examples:

– Fish quantity in Norwegian fjords depending on the amount of [rain/snow]-fall

– modeling chemical reactions on the molecular level

– development of biological ecosystems


• PROCESSES THAT CANNOT BE INTERFERED WITH

– human heart model

– global economy model

• OTHER FACTORS


3.2 Specifics of computational problems

Usually, computer simulation consists of:

1. Creation of a mathematical model – usually in the form of equations – describing the physical properties and dependencies of the subject

2. Algorithm creation for numerical solution of the equations

3. Application of the algorithms in computer software

4. Using the created software on a computer in a particular simulation process

5. Visualizing the results in an understandable way using computer graphics, forexample

6. Integration of the results and repetition/redesign of any given step above


Most often

• Algorithm

– often written down in an intuitive way

– and/or using special modeling software

• computer program written, based on the algorithm

• testing

• iterating


3.3 General strategy

REPLACE A DIFFICULT PROBLEM WITH A SIMPLER ONE

  – which has the same solution

  – or at least an approximate solution

    * but still reflecting the most important features of the problem

SOME EXAMPLES OF SUCH TECHNIQUES:

• Replacing infinite spaces with finite ones (in maths sense)

• Replacing infinite processes with finite ones

– replacing integrals with finite sums

– derivatives replaced by finite differences


• Replacing differential equations with algebraic equations

• Nonlinear equations replaced by linear equations

• replacing higher order systems with lower order ones

• Replacing complicated functions with more simple ones (like polynomials)

• Replacing arbitrarily structured matrices with more simply structured matrices


MORE PARTICULARLY, THIS COURSE IS ABOUT:

Methods and analysis for the development of reliable and efficient software for Scientific Computing

Reliability means here both the reliability of the software as well as the adequacy of the results – how much one can rely on the achieved results:

• Is the solution acceptable at all? Is it a real solution? (an extraneous solution? instability of the solution? etc.) Does the solution algorithm guarantee a solution at all?

• How big is the calculated solution's deviation from the real solution? How well does the simulation reflect the real world?

Another aspect: software reliability


Efficiency is expressed on various levels of the solution process:

• speed

• amount of used resources

Resources can be:

– Time

– Cost

– Number of CPU cycles

– Number of processes

– Amount of RAM

– Human labour


The most often used general criterion is:

$$\min \;(\text{approximation error} \cdot \text{time})$$

Even more generally: our task is to minimise the time of the solution.

Efficient method requires:

(i) good discretisation

(ii) good computer implementation

(iii) depends on the computer architecture (processor speed, RAM size, memory bus speed, availability of cache, number of cache levels and other properties)


4 Approximation in Scientific Computing

4.1 Sources of approximation error

4.1.1 Error sources that are under our control

MODELLING ERRORS – some physical entities in the model are simplified or even not taken into account at all (for example: air resistance, viscosity, friction etc)

(Usually it is OK but sometimes it fails...)


MEASUREMENT ERRORS – laboratory equipment has its precision

Errors come also out of

• random measurement deviation

• background noise

As an example, the Newton and Planck constants are used with 8-9 decimal places, while laboratory measurements are performed with much less precision!

THE EFFECT OF PREVIOUS CALCULATIONS – the input for a calculation is often already the output of some previous calculation, with its own computational errors


4.1.2 Errors created during the calculations

Discretisation

As an example:

• replacing derivatives with finite differences

• finite sums used instead of infinite series

• etc

Round-off errors – errors created during the calculations due to the limited precision with which the calculations are performed


Example 4.1

Suppose a computer program can find the function value $f(x)$ for arbitrary $x$.

Task: find an algorithm for calculating an approximation to the derivative $f'(x)$.

Algorithm: Choose a small $h > 0$ and approximate:

$$f'(x) \approx [f(x+h) - f(x)]/h$$

The discretisation error is:

$$T := \left| f'(x) - [f(x+h) - f(x)]/h \right|.$$

Using the Taylor series, we get the estimate:

$$T \le \frac{h}{2} \, \| f'' \|_\infty.$$


The computational error is created by finite precision arithmetic, approximating the real $f(x)$ with an approximation $\tilde f(x)$. The computational error $C$ is:

$$C = \left| \frac{\tilde f(x+h) - \tilde f(x)}{h} - \frac{f(x+h) - f(x)}{h} \right|
    = \left| \frac{[\tilde f(x+h) - f(x+h)] - [\tilde f(x) - f(x)]}{h} \right| ,$$

which gives the estimate:

$$C \le \frac{2}{h} \, \| \tilde f - f \|_\infty. \qquad (1)$$

The resulting error is

$$\left| f'(x) - \frac{\tilde f(x+h) - \tilde f(x)}{h} \right| ,$$

which can be estimated by:

$$T + C \le \frac{h}{2} \, \| f'' \|_\infty + \frac{2}{h} \, \| \tilde f - f \|_\infty.$$


$\Longrightarrow$ if $h$ is large, the discretisation error dominates; if $h$ is small, the computational error starts to dominate.

Illustration. Let $h = 10^{-j}$, $j = 0, 1, ..., 16$; $f = x^3$, $x = 1,\; 10^3,\; 10^{-3}$. (We compare the above scheme also with the central difference scheme:)

$$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$


[Plot: the error $E = |F(x,h) - f'(x)|$ as a function of $h$ for the forward and central difference schemes.]
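The numbers behind this illustration can be reproduced with a few lines of Python (a sketch I have added; not part of the original slides), tabulating the errors of the forward and central difference approximations for f(x) = x^3 at x = 1:

f  = lambda x: x**3
df = lambda x: 3*x**2          # exact derivative, used to compute the error E

x = 1.0
for j in range(17):
    h = 10.0**(-j)
    forward = (f(x+h) - f(x)) / h
    central = (f(x+h) - f(x-h)) / (2*h)
    print("h=1e-%02d  E_forward=%.3e  E_central=%.3e"
          % (j, abs(forward - df(x)), abs(central - df(x))))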


4.1.3 Computational error – forward error and input error – backward error

The condition number is an upper limit of their ratio:

$$\text{forward error} \le \text{condition number} \times \text{backward error}$$

From (1) it follows that in Example 4.1 the value of the condition number is $2/h$.

In the calculations above all the values are absolute: the actual magnitudes of the approximated quantities are not considered. The relative forward error and the relative backward error are in this case:

$$\frac{C}{|(f(x+h) - f(x))/h|} \qquad \text{and} \qquad \frac{\| \tilde f - f \|_\infty}{\| f \|_\infty}.$$


Assuming that $\min_x |f'(x)| > 0$, it follows easily from (1) that:

$$\frac{C}{|(f(x+h) - f(x))/h|} \le \left[ \frac{2 \, \| f \|_\infty}{h \, \min_x |f'(x)|} \right] \frac{\| \tilde f - f \|_\infty}{\| f \|_\infty}.$$

The value in the brackets $[\cdot]$ is called the relative condition number of the problem. In general:

• If (absolute or relative) condition number is small,

  – then a small (absolute or relative) error in the input data can produce only a small error in the result.

• If the condition number is large

  – then a large error in the result can be caused even by a small error in the input data

– such problems are said to be ill-conditioned


• Sometimes, in the case of finite precision arithmetic:

  – the backward error is much simpler to estimate than the forward error

  – the backward error combined with the condition number makes it possible to estimate the forward error (absolute or relative)


Example 4.2 (One of the key problems in Scientific Computing)

Consider solving the system of linear equations:

$$Ax = b, \qquad (2)$$

where the input consists of

• a nonsingular $n \times n$ matrix $A$

• $b \in \mathbb{R}^n$

The task is to calculate an approximate solution $x \in \mathbb{R}^n$.

Suppose that instead of the exact matrix $A$ we are given its approximation $\tilde A = A + \delta A$, but (for simplicity) $b$ is known exactly. The solution $\tilde x = x + \delta x$ satisfies the system of equations

$$(A + \delta A)(x + \delta x) = b. \qquad (3)$$


Then from (2), (3) it follows that

$$(A + \delta A)\,\delta x = -(\delta A)\,x.$$

Multiplying by $(A + \delta A)^{-1}$ and taking norms, we estimate

$$\|\delta x\| \le \|(A + \delta A)^{-1}\| \, \|\delta A\| \, \|x\|.$$

It follows that if $x \ne 0$ and $A \ne 0$, we have:

$$\frac{\|\delta x\|}{\|x\|} \le \|(A + \delta A)^{-1}\| \, \|A\| \, \frac{\|\delta A\|}{\|A\|}
   \cong \|A^{-1}\| \, \|A\| \, \frac{\|\delta A\|}{\|A\|}, \qquad (4)$$

which holds for $\delta A$ sufficiently small.


• $\Longrightarrow$ for the calculation of $x$ an important factor is the relative condition number $\kappa(A) := \|A^{-1}\| \, \|A\|$.

  – It is usually called the condition number of matrix $A$.

– Depends on norm ‖ · ‖

Therefore, common practice for forward error estimation is to:

• find an estimate to the backward error

• use the estimate (4)
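A small NumPy sketch (my addition; the 2x2 matrix and the perturbation are arbitrary examples) illustrating kappa(A) = ||A^-1|| ||A|| and the use of estimate (4):

from numpy import array
from numpy.linalg import norm, inv, solve, cond

A  = array([[1.0, 2.0],
            [3.0, 4.0]])
b  = array([1.0, 1.0])
dA = 1e-8 * array([[1.0, -1.0],
                   [2.0,  1.0]])      # a small perturbation of A

kappa = norm(inv(A)) * norm(A)        # condition number (Frobenius norm here)
print(kappa)
print(cond(A, 'fro'))                 # the same value via numpy.linalg.cond

x  = solve(A, b)                      # solution of the exact system
dx = solve(A + dA, b) - x             # change caused by the perturbation
print(norm(dx) / norm(x))             # relative forward error
print(kappa * norm(dA) / norm(A))     # bound (4): kappa * relative backward error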


4.2 Floating-Point Numbers

The number −3.1416 in scientific notation is $-0.31416 \times 10^{1}$ or (as computer output) -0.31416E01. Here "−" is the sign, .31416 the mantissa, 10 the base and 1 the exponent – this is how floating point numbers are written in computer notation. Usually the base is 2 (with a few exceptions: the IBM 370 had base 16; base 10 is used in most hand-held calculators; base 3 in an ill-fated Russian computer).

For example, $.10101_2 \times 2^3 = 5.25_{10}$.

Formally, a floating-point number system $F$ is characterised by four integers:

• Base (or radix) β > 1

• Precision p > 0

• Exponent range $[L, U]$ – $L < 0 < U$


Any floating-point number $x \in F$ has the form

$$x = \pm \left( d_0 + d_1 \beta^{-1} + ... + d_{p-1} \beta^{1-p} \right) \beta^E, \qquad (5)$$

where the integers $d_i$ satisfy

$$0 \le d_i \le \beta - 1, \quad i = 0, ..., p-1,$$

and $E \in [L, U]$ ($E$ is a positive, zero or negative integer). The number $E$ is called the exponent and the part in the brackets $(\cdot)$ is called the mantissa.

Example. The number 2347 is represented as

$$\left( 2 + 3 \times 10^{-1} + 4 \times 10^{-2} + 7 \times 10^{-3} \right) 10^3$$

in arithmetic with precision 4 and base 10. Note that an exact representation of this number with precision 3 and base 10 is not possible!
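A tiny Python check of this representation (added here as an illustration; not from the slides), evaluating formula (5) for beta = 10, p = 4, E = 3 and digits (2, 3, 4, 7):

beta, E = 10, 3
d = [2, 3, 4, 7]
mantissa = sum(di * beta**(-i) for i, di in enumerate(d))
print(mantissa * beta**E)    # -> 2347.0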


4.3 Normalised floating-point numbers

A number is normalised if $d_0 > 0$.

Example. The number $.10101_2 \times 2^3$ is normalised, but $.010101_2 \times 2^4$ is not.

Floating point systems are usually normalised because:

• Representation of each number is then unique

• No digits are wasted on leading zeros

• In a binary ($\beta = 2$) system, the leading bit is always 1 $\Longrightarrow$ no need to store it!

The smallest positive normalised number in form (5) is $1 \times \beta^L$ – the underflow threshold. (In case of underflow, the result is smaller than the smallest representable floating-point number.)


The largest positive normalised number in form (5) is

$$(\beta - 1)\left( 1 + \beta^{-1} + ... + \beta^{1-p} \right) \beta^U = (1 - \beta^{-p})\, \beta^{U+1}$$

– the overflow threshold.

If the result of an arithmetic operation is an exact number not representable in the floating-point number system $F$, the result is represented by a (hopefully close) element of $F$. (Rounding)


4.4 IEEE (Normalised) Arithmetics

• β = 2 (binary) ,

• d0 = 1 always – not stored

Single precision:

• $p = 24$, $L = -126$, $U = 127$

• Underflow threshold $= 2^{-126} \approx 10^{-38}$

• Overflow threshold $= 2^{127} \cdot (2 - 2^{-23}) \approx 2^{128} \approx 10^{38}$

• One bit for sign, 23 for mantissa and 8 for exponent:

  1 | 23 | 8

  – a 32-bit word.


Double precision:

• $p = 53$, $L = -1022$, $U = 1023$

• Underflow threshold $= 2^{-1022} \approx 10^{-308}$

• Overflow threshold $= 2^{1023} \cdot (2 - 2^{-52}) \approx 2^{1024} \approx 10^{308}$

• One bit for sign, 52 for mantissa and 11 for exponent:

  1 | 52 | 11

  – a 64-bit word

• IEEE arithmetic standard – rounding towards the nearest element of $F$.

• (If the result is exactly between two elements, the rounding is towards the number whose least significant bit equals 0 – rounding towards the closest even number)
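These parameters can be queried directly from NumPy; the following small sketch (my addition, not from the slides) prints the underflow threshold, overflow threshold and machine epsilon for single and double precision:

from numpy import finfo, float32, float64

for t in (float32, float64):
    info = finfo(t)
    # tiny ~ underflow threshold, max ~ overflow threshold, eps ~ machine epsilon
    print("%s: tiny=%g  max=%g  eps=%g"
          % (t.__name__, info.tiny, info.max, info.eps))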


IEEE subnormal numbers - unnormalised numbers with the minimal possible exponent.

• Between 0 and the smallest normalised floating point value.

• Guarantees that $fl(x - y)$ (the result of the operation $x - y$ in floating point arithmetic) is never zero when $x \ne y$ – to avoid underflow in such situations

IEEE symbols Inf and NaN – Inf ($\pm\infty$), NaN (Not a Number)

• Inf - in case of overflow

  – $x / \pm\infty = 0$ for arbitrary finite floating-point $x$

  – $+\infty + \infty = +\infty$, etc.

• NaN is returned when an operation does not have a well-defined finite or infinite value, for example


– ∞ − ∞

– 0/0

– √−1

– NaN ⊙ x (where ⊙ is one of the operations +, −, *, /), etc

IEEE also defines double extended floating-point values

• 64 bit mantissa; 15 bit exponent

• most compilers do not support it

• Many platforms also support quadruple precision (double*16)

– often emulated with lower-precision operations and therefore slow
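A quick way to inspect these limits on a given machine is sketched below (Python with NumPy; the exact printed values are platform-dependent):

    import sys
    import numpy as np

    fi = np.finfo(np.float64)          # IEEE double precision parameters
    print(fi.eps)                      # machine epsilon, about 2.22e-16
    print(fi.tiny)                     # underflow threshold, about 2.2e-308
    print(fi.max)                      # overflow threshold, about 1.8e+308

    print(1.0 + fi.eps / 2 == 1.0)     # True: the sum rounds back to 1.0
    print(fi.tiny / 2)                 # a subnormal number, not yet zero
    print(sys.float_info)              # same information from the standard library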

Page 133: Scientific Computing lectures

133 Systems of LE 5.1 Systems of Linear Equations

5 Solving Systems of Linear Equations

5.1 Systems of Linear Equations

System of linear equations:

a11x1 +a12x2 + ...+a1nxn = b1

a21x1 +a22x2 + ...+a2nxn = b2

. . . . . . . . . . . . . .

am1x1 +am2x2 + ...+amnxn = bm,

Matrix form: A – given matrix, vector b – given; x – vector of unknowns:

    [ a11  a12  ···  a1n ] [ x1 ]   [ b1 ]
    [ a21  a22  ···  a2n ] [ x2 ]   [ b2 ]
    [  ⋮    ⋮    ⋱    ⋮  ] [ ⋮  ] = [ ⋮  ]
    [ am1  am2  ···  amn ] [ xn ]   [ bm ]

or Ax = b  (6)

Page 134: Scientific Computing lectures

134 Systems of LE 5.1 Systems of Linear Equations

Suppose,

• m > n – overdetermined system. Does this system have a solution? A system with more equations than unknowns usually has no solution

• m < n – underdetermined system. How many solutions does it have? A system with fewer equations than unknowns usually has infinitely many solutions

• m = n – what about the solution in this case? A system with the same number of equations and unknowns usually has a single unique solution ←− we will deal only with this case now

Page 135: Scientific Computing lectures

135 Systems of LE 5.2 Classification

5.2 Classification

Two main types of systems of linear equations: systems with

• full matrix

– most of the values are nonzero

– how to store it? storage in a 2D array

• sparse matrix

– most of the matrix values are zero

– How to store such matrices? storing them in a full matrix would be a waste of memory

* different sparse matrix storage schemes (later in this course)

Quite different strategies for solution of systems with full or sparse matrices

Page 136: Scientific Computing lectures

136 Systems of LE 5.2 Classification

5.2.1 Problem Transformation

• Common strategy is to modify the problem (6) such that

– the solution remains the same

– modified problem – more easy to solve

What kind of transformations do not change the solution?

• It is possible to multiply both sides of equation (6) by an arbitrary nonsingular matrix M without changing the solution.

– To check it, one can notice that the solution of MAz = Mb is:

z = (MA)−1Mb = A−1M−1Mb = A−1b = x.

Page 137: Scientific Computing lectures

137 Systems of LE 5.2 Classification

– For example, M = D – diagonal matrix,

– or M = P permutation matrix

NB! Although theoretically the multiplication of (6) by a nonsingular matrix M does not change the solution, we will see later that it may change the numerical process of the solution and the accuracy of the solution...

The next question we ask: what type of systems are easy to solve?

Page 138: Scientific Computing lectures

138 Systems of LE 5.3 Triangular linear systems

5.3 Triangular linear systems

If the system matrix A has a row i with a nonzero only on the diagonal, it is easy to calculate x_i = b_i/a_ii; if there is another row j where, apart from the diagonal a_jj ≠ 0, the only nonzero is at position a_ji, we find that x_j = (b_j − a_ji x_i)/a_jj; and again, if there exists a row k such that a_kk ≠ 0 and a_kl = 0 for l ≠ i, j, we can compute x_k = (b_k − a_ki x_i − a_kj x_j)/a_kk, etc.

• Such systems – easy to solve

• called triangular systems.

With rearranging of rows it is possible to transform such a system to Lower Triangular form L or Upper Triangular form U:

    L = [ l11   0   ···   0  ]      U = [ u11  u12  ···  u1n ]
        [ l21  l22  ···   0  ]          [  0   u22  ···  u2n ]
        [  ⋮    ⋮    ⋱    ⋮  ]          [  ⋮    ⋮    ⋱    ⋮  ]
        [ ln1  ln2  ···  lnn ]          [  0    0   ···  unn ]

Page 139: Scientific Computing lectures

139 Systems of LE 5.3 Triangular linear systems


Solving the system Lx = b is called Forward Substitution:

    x1 = b1/l11,
    xi = (bi − Σ_{j=1}^{i−1} lij xj) / lii

Solving the system Ux = b is called Back Substitution:

    xn = bn/unn,
    xi = (bi − Σ_{j=i+1}^{n} uij xj) / uii
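A minimal sketch of both substitutions in Python/NumPy (dense storage, no error checking; array indices start from 0, unlike the 1-based formulas above):

    import numpy as np

    def forward_substitution(L, b):
        n = len(b)
        x = np.zeros(n)
        for i in range(n):                       # x_i uses already computed x_0..x_{i-1}
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
        return x

    def back_substitution(U, b):
        n = len(b)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):           # x_i uses already computed x_{i+1}..x_{n-1}
            x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
        return x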

But how to transform an arbitrary matrix to a triangular form?

Page 140: Scientific Computing lectures

140 Systems of LE 5.4 Elementary Elimination Matrices

5.4 Elementary Elimination Matrices

Given a vector a = [a1, a2]^T, for a1 ≠ 0:

    [   1      0 ] [ a1 ]   [ a1 ]
    [ −a2/a1   1 ] [ a2 ] = [  0 ].

In the general case, if a = [a1, a2, ..., an]^T and ak ≠ 0:

    Mk a = [ 1  ···    0      0   ···  0 ] [   a1    ]   [ a1 ]
           [ ⋮   ⋱     ⋮      ⋮        ⋮ ] [    ⋮    ]   [  ⋮ ]
           [ 0  ···    1      0   ···  0 ] [   ak    ]   [ ak ]
           [ 0  ···  −m_{k+1} 1   ···  0 ] [ a_{k+1} ] = [  0 ]
           [ ⋮   ⋱     ⋮      ⋮    ⋱   ⋮ ] [    ⋮    ]   [  ⋮ ]
           [ 0  ···   −m_n    0   ···  1 ] [   a_n   ]   [  0 ],

where mi = ai/ak, i = k+1, ..., n.

Page 141: Scientific Computing lectures

141 Systems of LE 5.4 Elementary Elimination Matrices

The divisor ak is called the pivot.
The matrix Mk is also called an elementary elimination matrix or Gauss transformation.

1. Mk is nonsingular ⇐= Why? being lower triangular with unit diagonal

2. Mk = I − m e_k^T, where m = [0, ..., 0, m_{k+1}, ..., m_n]^T and e_k is column k of the unit matrix

3. Lk =(def) Mk^{−1} = I + m e_k^T.

4. If Mj, j > k is some other elementary elimination matrix with multiplier vector t, then

    Mk Mj = I − m e_k^T − t e_j^T + m e_k^T t e_j^T = I − m e_k^T − t e_j^T,

due to e_k^T t = 0.

Page 142: Scientific Computing lectures

142 Systems of LE 5.5 Gauss Elimination and LU Factorisation

5.5 Gauss Elimination and LU Factorisation

Solving the system Ax = b

• apply a series of Gauss elimination matrices from the left: M1, M2, ..., M_{n−1}, taking M = M_{n−1} ··· M1

– we get the linear system:

MAx = Mn−1 · · ·M1Ax = Mn−1 · · ·M1b = Mb

– upper triangular =⇒

* easy to solve.

The process is called Gauss Elimination Method (GEM)

Page 143: Scientific Computing lectures

143 Systems of LE 5.5 Gauss Elimination and LU Factorisation

• Denoting U = MA and L = M^{−1}, we get that

    L = M^{−1} = (M_{n−1} ··· M1)^{−1} = M1^{−1} ··· M_{n−1}^{−1} = L1 ··· L_{n−1}

is lower unit triangular (ones on the diagonal)

• =⇒ A = LU.

Expressed in an algorithm:

Page 144: Scientific Computing lectures

144 Systems of LE 5.5 Gauss Elimination and LU Factorisation

Algorithm 5.1. LU-factorisation using the Gauss elimination method (GEM)

    do k=1,...,n-1                    # cycle over matrix columns
        if akk == 0 then stop         # stop in case pivot == 0
        do i=k+1,n
            mik = aik/akk             # multiplier calculation in column k
        enddo
        do j=k+1,n
            do i=k+1,n                # applying transformations to
                aij = aij − mik*akj   # the rest of the matrix
            enddo
        enddo
    enddo

NB! In a practical implementation: for storing mik use the corresponding elements of A (they will be zeroes anyway)
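A direct transcription of Algorithm 5.1 into Python/NumPy (a sketch without pivoting, storing the multipliers in the lower triangle of A as suggested above):

    import numpy as np

    def lu_gem(A):
        """LU factorisation by GEM (no pivoting); returns (L, U) with A = L @ U."""
        A = A.astype(float).copy()
        n = A.shape[0]
        for k in range(n - 1):                              # cycle over matrix columns
            if A[k, k] == 0.0:
                raise ZeroDivisionError("zero pivot encountered")
            A[k+1:, k] = A[k+1:, k] / A[k, k]               # multipliers stored in column k
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])   # rank-1 update of the rest
        L = np.tril(A, -1) + np.eye(n)
        U = np.triu(A)
        return L, U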

Page 145: Scientific Computing lectures

145 Systems of LE 5.6 Number of operations in GEM

5.6 Number of operations in GEM

Finding operation counts for Alg. 5.1:

• Replace loops with corresponding sums over number of particular operations:

    Σ_{i=1}^{n−1} ( Σ_{j=i+1}^{n} 1 + Σ_{j=i+1}^{n} Σ_{k=i+1}^{n} 2 ) = Σ_{i=1}^{n−1} ( (n−i) + 2(n−i)^2 ) = (2/3) n^3 + O(n^2)

• used that Σ_{i=1}^{m} i^k = m^{k+1}/(k+1) + O(m^k) (which is enough for finding the number of operations of the highest order)

• How many operations are needed for solving a triangular system? The number of operations for forward and backward substitution with L and U is O(n^2)

• =⇒ the whole solution of the system Ax = b takes (2/3) n^3 + O(n^2) operations

Page 146: Scientific Computing lectures

146 Systems of LE 5.7 GEM with row permutations

5.7 GEM with row permutations

• If pivot == 0 GEM won’t work

• Row permutations or partial pivoting may help

• For numerical stability, also the pivot must not be small

Example 5.1

• Consider matrix

    A = [ 0  1 ]
        [ 1  0 ]

– non-singular

– but LU-factorisation impossible without row permutations

Page 147: Scientific Computing lectures

147 Systems of LE 5.7 GEM with row permutations

• But, on the contrary, the matrix

    A = [ 1  1 ]
        [ 1  1 ]

– has the LU-factorisation

    A = [ 1  1 ] = [ 1  0 ] [ 1  1 ] = LU
        [ 1  1 ]   [ 1  1 ] [ 0  0 ]

– with A actually being a singular matrix!

Page 148: Scientific Computing lectures

148 Systems of LE 5.7 GEM with row permutations

Example 5.2. Small pivots

• Consider

A =

[ε 11 1

],

with ε such that 0 < ε < εmach in given floating point system

– (i.e. 1+ ε = 1 in floating point arithmetics, (see lab 4 Exercise 1 (http://www.ut.ee/~eero/SC/lab4.pdf))

– Without row permutation we get (in floating-point arithmetic):

    M = [  1    0 ]  =⇒  L = [  1   0 ],  U = [ ε     1    ] = [ ε    1   ]
        [ −1/ε  1 ]          [ 1/ε  1 ]       [ 0  1 − 1/ε ]   [ 0  −1/ε ]

• But then

    LU = [  1   0 ] [ ε    1   ] = [ ε  1 ] ≠ A
         [ 1/ε  1 ] [ 0  −1/ε ]   [ 1  0 ]

Page 149: Scientific Computing lectures

149 Systems of LE 5.7 GEM with row permutations

Using row permutation

• the pivot is 1;

• multiplier −ε =⇒

    M = [  1  0 ]  =⇒  L = [ 1  0 ],  U = [ 1    1   ] = [ 1  1 ]
        [ −ε  1 ]          [ ε  1 ]       [ 0  1 − ε ]   [ 0  1 ]

in floating point arithmetic

• =⇒

    LU = [ 1  0 ] [ 1  1 ] = [ 1  1 ] ← OK!
         [ ε  1 ] [ 0  1 ]   [ ε  1 ]
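The effect is easy to reproduce numerically; the sketch below factorises the ε-matrix with and without the row swap and compares LU against A (assumes NumPy and the lu_gem routine sketched earlier):

    import numpy as np

    eps = 1e-20                                # well below machine epsilon for float64
    A = np.array([[eps, 1.0],
                  [1.0, 1.0]])

    L, U = lu_gem(A)                           # no pivoting: the tiny pivot eps is used
    print(L @ U)                               # [[eps, 1], [1, 0]]  -- not A!

    P = np.array([[0.0, 1.0],                  # swap the two rows first
                  [1.0, 0.0]])
    L, U = lu_gem(P @ A)
    print(L @ U)                               # reproduces P @ A to full accuracy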

Page 150: Scientific Computing lectures

150 Systems of LE 5.7 GEM with row permutations

Algorithm 5.2. LU-factorisation with GEM using row permutations

    do k=1,...,n-1                          # cycle over matrix columns
        Find index p such that:             # looking for the best pivot
            |apk| ≥ |aik|, k ≤ i ≤ n        # in the given column
        if p ≠ k then interchange rows k and p
        if akk == 0 then continue with next k   # skip such a column
        do i=k+1,n
            mik = aik/akk                   # multiplier calculation in column k
        enddo
        do j=k+1,n
            do i=k+1,n                      # transformation application
                aij = aij − mik*akj         # to the rest of the matrix
            enddo
        enddo
    enddo

Page 151: Scientific Computing lectures

151 Systems of LE 5.7 GEM with row permutations

As a result, MA = U, where U is upper-triangular. OK so far? But actually

    M = M_{n−1} P_{n−1} ··· M1 P1

• Is M^{−1} still lower-triangular? It is not lower-triangular any more, although it is still denoted by L

– but is it triangular? We still have a triangular L

– knowing the permutations P = P_{n−1} ··· P1 in advance would give

    PA = LU,

where L is indeed a lower triangular matrix

Page 152: Scientific Computing lectures

152 Systems of LE 5.7 GEM with row permutations

• But do we really need to perform the row exchanges explicitly? Instead of row exchanges we can perform just the appropriate mapping of matrix (and vector) indices

– We start with the unit index mapping p = [1, 2, 3, 4, ..., n]

– If rows i and j need to be exchanged, we exchange the corresponding values p[i] and p[j]

– In the algorithm, take everywhere a_{p[i],j} instead of a_{i,j} (and treat other arrays correspondingly)

To solve the system Ax = b (6), how does the whole algorithm look now?

• Solve the lower triangular system Ly = Pb with forward substitution

• Solve the upper triangular system Ux = y with backward substitution

The term partial pivoting comes from the fact that we are looking for the best pivot only in the current column of the matrix
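In practice one rarely codes this by hand; a short sketch with SciPy's LAPACK-based routines, which implement exactly this PA = LU factorisation with partial pivoting:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    A = np.array([[0.0, 1.0],
                  [1.0, 1.0]])
    b = np.array([2.0, 3.0])

    lu, piv = lu_factor(A)        # factorisation PA = LU, pivot rows recorded in piv
    x = lu_solve((lu, piv), b)    # forward + backward substitution with Pb
    print(x, np.allclose(A @ x, b))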

Page 153: Scientific Computing lectures

153 Systems of LE 5.7 GEM with row permutations

• Complete pivoting – the best pivot is looked for from the whole remaining partof the matrix

– This means exchanging both rows and columns of the matrix

PAQ = LU,

where P and Q are permutation matrices.

– The system is solved in three stages: Ly = Pb; Uz = y and x = Qz

Although numerical stability is better in case of complete pivoting, it is seldom used, because

• much more costly

• actually rarely needed

Page 154: Scientific Computing lectures

154 Systems of LE 5.8 Reliability of the LU-factorisation with partial pivoting

5.8 Reliability of the LU-factorisation with partial pivoting

Introduce the vector norm

    ‖x‖∞ = max_i |x_i|,  x ∈ R^n

and the corresponding matrix norm

    ‖A‖∞ = sup_{x ≠ 0} ‖Ax‖∞ / ‖x‖∞.

Looking at the rounding errors in GEM, it can be shown that we actually find L and U which satisfy the relation:

PA = LU−E,

where error E can be estimated:

‖E‖∞ ≤ nε‖L‖∞‖U‖∞ (7)

Page 155: Scientific Computing lectures

155 Systems of LE 5.8 Reliability of the LU-factorisation with partial pivoting

where ε is machine epsilon.
In practice we replace the system PAx = Pb with the system LUx = Pb =⇒ the system we are actually solving is

    (PA + E)x = Pb

for finding the approximate solution x.

• How far is it from the real solution?

From the matrix perturbation theory:Let us solve the system

Ax = b (8)

where A ∈ R^{n×n} and b ∈ R^n are given and x ∈ R^n is unknown. Suppose A is given with an error δA, so that we actually work with A + δA. The perturbed solution satisfies the system of linear equations

(A+δA)(x+δx) = b. (9)

Page 156: Scientific Computing lectures

156 Systems of LE 5.8 Reliability of the LU-factorisation with partial pivoting

Theorem 5.1. Let A be nonsingular and δA be sufficiently small, such that

    ‖δA‖∞ ‖A^{−1}‖∞ ≤ 1/2.  (10)

Then (A + δA) is nonsingular and

    ‖δx‖∞ / ‖x‖∞ ≤ 2 κ(A) ‖δA‖∞ / ‖A‖∞,  (11)

where κ(A) = ‖A‖∞ ‖A^{−1}‖∞ is the condition number.
Proof of the theorem (http://www.ut.ee/~eero/SC/konspekt/perturb-toest/perturb-proof.pdf)
(EST) (http://www.ut.ee/~eero/SC/konspekt/perturb-toest/perturb-toest.pdf)

Page 157: Scientific Computing lectures

157 Systems of LE 5.8 Reliability of the LU-factorisation with partial pivoting

Remarks

1. The result is true for an arbitrary matrix norm derived from the correspondingvector norm

2. Theorem 5.1 says that in case of a small condition number a small relative error in matrix A can cause – a small or a large error? – only a small error in the solution x

3. If the condition number is big, what can happen? Everything can happen

4. It is not simple to calculate condition number but still, it can be estimated

Combining the result (7) with Theorem 5.1, we see that the GEM forward error can be estimated as follows:

    ‖δx‖∞ / ‖x‖∞ ≤ 2 n ε κ(A) G,

where the coefficient G = ‖L‖∞ ‖U‖∞ / ‖A‖∞ is called the growth factor

Page 158: Scientific Computing lectures

158 Systems of LE 5.8 Reliability of the LU-factorisation with partial pivoting

Conclusions

• If G is not large, then a well-conditioned matrix gives a fairly good answer (i.e. error O(nε)).

• In case of partial pivoting, the elements of matrix L are ≤ 1.

– Nevertheless, there exist examples where, with a well-conditioned matrix A, the elements of U are exponentially large in comparison with the elements of A.

– But such examples are rather “academic” – in practice one rarely finds them (and the method can be used without any fear)

Page 159: Scientific Computing lectures

159 Systems of LE 5.9 Iterative refinement of the solution

5.9 Iterative refinement of the solution

Having found an approximate solution x0 to the system Ax = b (6) using LU-factorisation, it is possible to calculate the residual

factorisation, it is possible to calculate the residual

r0 = b−Ax0.

• We want the residual r0 to be as small as possible,

• but we saw that with a large condition number this might not be the case.

• It is possible to use calculated LU-factorisation, for solving the system

As0 = r0

finding vector s0 and taking

Page 160: Scientific Computing lectures

160 Systems of LE 5.9 Iterative refinement of the solution

x1 = x0 + s0

for the refined solution, because

Ax1 = A(x0 + s0) = Ax0 +As0 = (b− r0)+ r0 = b.

• Usually a longer floating point representation is needed for iterative refinement

• if the residual r0 is large, a lower-precision floating point number system will also do
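A sketch of a few refinement steps on top of the SciPy LU factorisation used earlier (the same factors are reused for every correction solve):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def refine(A, b, n_steps=2):
        lu, piv = lu_factor(A)
        x = lu_solve((lu, piv), b)          # initial approximate solution x0
        for _ in range(n_steps):
            r = b - A @ x                   # residual (ideally computed in higher precision)
            s = lu_solve((lu, piv), r)      # correction from the same LU factors
            x = x + s
        return x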

Page 161: Scientific Computing lectures

161 BLAS 6.1 Motivation

6 BLAS (Basic Linear Algebra Subroutines)

6.1 Motivation

How to optimise programs that use a lot of linear algebra operations?

• Efficiency depends on

– processor speed

– number of arithmetic operations

• but also on:

– the speed of memory references

• Hierarchical memory structure:

Page 162: Scientific Computing lectures

162 BLAS 6.1 Motivation

[Figure: memory hierarchy – tape / CD / DVD / network storage, hard disk, RAM, cache, registers; from slow, large and cheap at the bottom to fast, small and expensive at the top.]

• Where are arithmetic operations performed on the picture? Useful arithmetic operations happen only at the top of the hierarchy

• Before the operations: what is the direction of data movement? Data needs to be moved up; after the operations the data is moved down the hierarchy

• information movement is faster towards the top

• What is faster, the speed of arithmetic operations or data movement? As a rule: the speed of arithmetic operations is faster than data movement

Page 163: Scientific Computing lectures

163 BLAS 6.1 Motivation

Consider an arbitrary algorithm. Denote:

• f – flops (# arithmetic operations: +, - ×, /)

• m – # memory references

Introduce q = f/m.

Why is this number important?

• t_f – time spent on 1 flop; t_m – time for one memory access.

Then the calculation time is:

    f · t_f + m · t_m = f · t_f (1 + (1/q)(t_m / t_f))

In general t_m ≫ t_f, and therefore the total time reflects the processor speed only if q is – large or small? – large

Page 164: Scientific Computing lectures

164 BLAS 6.1 Motivation

Example. Gauss elimination method – for each i – key operations:

A(i+1 : n, i) = A(i+1 : n, i)/A(i, i), (12)

A(i+1 : n, i+1 : n) = A(i+1 : n, i+1 : n)−A(i+1 : n, i)∗A(i, i+1 : n). (13)

• Operation (12) represents following general operation:

y = ax+y, x,y ∈ Rn, a ∈ R (14)

– Operation (14) called saxpy (sum of a times x plus y)

• (13) represents: A = A − v w^T,  A ∈ R^{n×n}, v, w ∈ R^n  (15)

– (15) – a rank-1 update of matrix A (the matrix v w^T has rank 1 – but why? – because each row of it is a multiple of the vector w)

Page 165: Scientific Computing lectures

165 BLAS 6.1 Motivation

Operation (14) analysis:

• m = 3n + 1 memory references:
  – 2n + 1 reads (vectors x, y and the scalar a)
  – n writes (the new y)
• Computations take f = 2n flops
• =⇒ q = 2/3 + O(1/n) ≈ 2/3 for large n

Operation (15) analysis:

• m = 2n^2 + 2n memory references:
  – n^2 + 2n reads (matrix A and vectors v, w)
  – n^2 writes (the updated A)
• Computations – f = 2n^2 flops
• =⇒ q = 1 + O(1/n) ≈ 1 for large n

(14) – 1st order operation (O(n) flops); (15) – 2nd order operation (O(n^2) flops). Note that the coefficient q is O(1) in both cases

Page 166: Scientific Computing lectures

166 BLAS 6.1 Motivation

Faster results are obtained with 3rd order operations (O(n^3) operations with O(n^2) memory references). For example, matrix multiplication:

    C = AB + C,  A, B, C ∈ R^{n×n}.  (16)

Here m = 4n^2 and f = n^2(2n−1) + n^2 = 2n^3 (check it!) =⇒ q = n/2 → ∞ as n → ∞. This operation can keep the processor working near peak performance, with good algorithm scheduling!

Page 167: Scientific Computing lectures

167 BLAS 6.2 BLAS implementations

6.2 BLAS implementations

• BLAS – standard library for simple 1st, 2nd and 3rd order operations

– BLAS – freeware, available for example from netlib (http://www.netlib.org/blas/)

– Processor vendors often supply their own implementation

– ATLAS (http://math-atlas.sourceforge.net/) – a self-optimising BLAS implementation

Example of using BLAS (fortran90):

• LU factorisation using BLAS3 operations (http://www.ut.ee/~eero/SC/konspekt/Naited/lu1blas3.f90.html)

• main program for testing different BLAS levels (http://www.ut.ee/~eero/SC/konspekt/Naited/testblas3.f90.html)

Page 168: Scientific Computing lectures

168 Optimisation 6.2 BLAS implementations

7 Optimisation methods

Minimising (or maximising) a (possibly multivariate) function F, F : A → R, where A is a subset of the Euclidean space R^n

Objective function F called differently in different cases:

• loss function

• cost function (minimization)

• indirect utility function (minimiza-tion)

• utility function (maximization)

• an energy function

• energy functional

Page 169: Scientific Computing lectures

169 Optimisation 6.2 BLAS implementations

Classification

• number of independent controls (dimensionality)

– one variable

– several variables (two to ten)

– dozens

– hundreds

– tens of thousands

• an important special class – discrete optimization – a large number of controls with boolean or integer values

– requires combinatorial methods

• calculus of variations – infinitely many controls

Page 170: Scientific Computing lectures

170 Optimisation 6.2 BLAS implementations

• cost of evaluation of the function (realistic goals and methods are different)

– from milliseconds of computer time

– too costly measurements in natural experiments

• sharpness of the maximum

• The presence of constraints and their character

• Stochastic optimization -

– noise in measurements or

– uncertainty of some factors

• Structural optimization

• Multi-objective optimization (e.g. Light and strong)

• multi-modal optimization (many good solutions)

Page 171: Scientific Computing lectures

171 Optimisation 6.2 BLAS implementations

How much knowledge is there about F?

    Type of x   | not much                                   | a lot of knowledge
    ------------+--------------------------------------------+-------------------
    Discrete    | Combinatorial search: brute-force,         | Algorithmic
                | stepwise, MCMC, population-based, ...      |
    Continuous  | Numerical methods: gradient descent,       | Analytic
                | (quasi-)Newton, population-based, ...      |

Here – short insight only into the unconstrained optimization of problems with afew independent controls

Page 172: Scientific Computing lectures

172 Optimisation 7.1 Gradient descent methods

7.1 Gradient descent methods

• First order optimization algorithm

• finding local minimum

• special case of steepest descentmethod

• if F(x) is a multivariate function differentiable at x =⇒ the vector −∇F(x) points towards smaller values (at least locally)

• =⇒ for small enough α: F(x−α∇F(x))≤ F(x)

• Method: make a sequence x0,x1,x2, ... such that

xn+1 = xn−αn∇F(xn)

Page 173: Scientific Computing lectures

173 Optimisation 7.1 Gradient descent methods

Algorithm

1. Compute the new direction ∆xn =−∇F(xn)

2. Choose the stepsize αn according to line search or backtracking

3. Update xn+1 = xn +αn∆xn

Page 174: Scientific Computing lectures

174 Optimisation 7.1 Gradient descent methods

Exact line search

• Minimise F along the ray {x + α∆x | α ≥ 0}:  α = argmin_{s ≥ 0} F(x + s∆x)

• used when the cost of the line minimization is low compared to the cost of computing the search direction itself

• In some special cases the minimizer can be found analytically

• and in others it can be computed efficiently

Page 175: Scientific Computing lectures

175 Optimisation 7.1 Gradient descent methods

Backtracking line search

• The step size α is chosen to approximately minimise F along the ray {x + α∆x | α ≥ 0}

• or even just to reduce F “enough”

given a descent direction ∆x for F at x ∈ dom F, and parameters β ∈ (0, 0.5), γ ∈ (0, 1):
    α := 1
    while F(x + α∆x) > F(x) + β α ∇F(x)^T ∆x:  α := γα
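A compact sketch of the gradient descent loop with the backtracking rule above (Python/NumPy; the test function, gradient, starting point and parameter values are illustrative choices, not from the lecture):

    import numpy as np

    def gradient_descent(F, gradF, x0, beta=0.3, gamma=0.5, tol=1e-8, max_iter=500):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            d = -gradF(x)                            # descent direction
            if np.linalg.norm(d) < tol:
                break
            alpha = 1.0
            # Armijo condition with dx = -gradF(x), so gradF(x)^T dx = -(d @ d)
            while F(x + alpha * d) > F(x) - beta * alpha * (d @ d):
                alpha *= gamma                       # backtracking: shrink the step
            x = x + alpha * d
        return x

    # example: minimise F(x) = (x1 - 1)^2 + 10*(x2 + 2)^2
    F = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2
    gradF = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
    print(gradient_descent(F, gradF, [0.0, 0.0]))    # approaches (1, -2)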

Page 176: Scientific Computing lectures

176 Optimisation 7.2 Newton method

7.2 Newton method

A. Single variable case

Method: construct a sequence x_n starting from an initial guess x0 such that x_n converges to x* for which F′(x*) = 0.
Assumption: F is twice differentiable. Let ∆x = x − x_n.
The second order Taylor expansion F(x) = F(x_n + ∆x) = F(x_n) + F′(x_n)∆x + (1/2)F″(x_n)∆x² has an extremum when its derivative with respect to ∆x equals zero =⇒ if the second order Taylor expansion approximates the function F(x) well enough and x0 is chosen close enough to x*, then the sequence (x_n) defined by

    ∆x = x − x_n = −F′(x_n)/F″(x_n),
    x_{n+1} = x_n − F′(x_n)/F″(x_n),  n = 0, 1, 2, ...

converges towards a root x* of F′, for which F′(x*) = 0

Page 177: Scientific Computing lectures

177 Optimisation 7.2 Newton method

Geometric interpretation:

• at each iteration one approximates F(x) by a quadratic function around xn, andthen takes a step towards the maximum/minimum of that quadratic function

• Note that if F(x) happens to be a quadratic function, then the exact extremum is found in one step

B. Multivariable case

• replace derivative with the gradient ∇F

• replace reciprocal of the second derivative by inverse of the Hessian matrix

Given a real-valued function F(x) = F(x1, x2, ..., xn). If all the partial derivatives exist and are continuous =⇒ the Hessian matrix of F is defined as H(F)_{ij}(x) = D_i D_j F(x) (here D_k denotes the differential operator with respect to x_k)

Page 178: Scientific Computing lectures

178 Optimisation 7.2 Newton method

    H(F)(x) = [ ∂²F/∂x1²     ∂²F/∂x1∂x2  ···  ∂²F/∂x1∂xn ]
              [ ∂²F/∂x2∂x1   ∂²F/∂x2²    ···  ∂²F/∂x2∂xn ]  = H(x)
              [     ⋮             ⋮       ⋱       ⋮      ]
              [ ∂²F/∂xn∂x1   ∂²F/∂xn∂x2  ···  ∂²F/∂xn²   ]

Newton method:

    x_{n+1} = x_n − [H(F)(x_n)]^{−1} ∇F(x_n),  n ≥ 0

usually with a small step size γ > 0 instead of γ = 1:

    x_{n+1} = x_n − γ [H(F)(x_n)]^{−1} ∇F(x_n),  n ≥ 0

• Finding the inverse of the Hessian in high dimensions can be expensive

• Finding the inverse of the Hessian in high dimensions can be expensive

Page 179: Scientific Computing lectures

179 Optimisation 7.2 Newton method

– instead of directly inverting the Hessian – calculate the vector p_n = [H(F)(x_n)]^{−1} ∇F(x_n) as the solution of the system of linear equations

    [H(F)(x_n)] p_n = ∇F(x_n)

• this can be solved by various factorizations or approximately (but to great accuracy) using iterative methods

– the Cholesky factorization and conjugate gradients will only work if [H(F)(x_n)] is a positive definite matrix

– if not, this is often a useful indicator that something has gone wrong

* for example, if a minimization problem is being approached and [H(F)(x_n)] is not positive definite =⇒ the iterations are converging to a saddle point and not a minimum

Page 180: Scientific Computing lectures

180 Optimisation 7.2 Newton method

• Quasi-Newton methods – an approximation to the Hessian (or directly to its inverse) is built up from changes in the gradient
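A sketch of the multivariable Newton iteration, solving the Hessian system instead of inverting it (the test function and damping value γ are illustrative assumptions):

    import numpy as np

    def newton_minimise(gradF, hessF, x0, gamma=1.0, tol=1e-10, max_iter=50):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = gradF(x)
            if np.linalg.norm(g) < tol:
                break
            p = np.linalg.solve(hessF(x), g)   # solve H p = grad F instead of inverting H
            x = x - gamma * p
        return x

    # example: F(x) = (x1 - 1)^2 + 10*(x2 + 2)^2 (quadratic: one Newton step suffices)
    gradF = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
    hessF = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
    print(newton_minimise(gradF, hessF, [5.0, 5.0]))   # (1, -2)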

Page 181: Scientific Computing lectures

181 DE 8.1 Ordinary Differential Equations (ODE)

8 Numerical Solution of Differential Equations

Differential equation – a mathematical equation for an unknown function of one or several variables that relates the values of the function itself and its derivatives of various orders

Example: the velocity of a ball falling through the air, considering only gravity and air resistance.

order – the order of the highest derivative of the dependent variable with respect to the independent variable

8.1 Ordinary Differential Equations (ODE)

Ordinary Differential Equation (ODE) is a differential equation in which theunknown function (also known as the dependent variable) is a function of a singleindependent variable

Page 182: Scientific Computing lectures

182 DE 8.1 Ordinary Differential Equations (ODE)

Initial value problem

y′(t) = f (t,y(t)), y(t0) = y0,

where the function f : [t0, ∞) × R^d → R^d, and y0 ∈ R^d is the initial condition

• (Boundary value problem - giving the solution at more than one point (onboundaries))

We consider here only first-order ODEs

• A higher order ODE can be converted into a system of first order ODEs

– Example: y″ = −y can be rewritten as two first-order equations: y′ = z and z′ = −y.

8.1.1 Numerical methods for solving ODEs

Page 183: Scientific Computing lectures

183 DE 8.1 Ordinary Differential Equations (ODE)

Euler method (or forward Euler method)

• finite difference approximation

    y′(t) ≈ (y(t + h) − y(t)) / h

⇒ y(t + h) ≈ y(t) + h y′(t) ⇒ y(t + h) ≈ y(t) + h f(t, y(t))
Start with t0, t1 = t0 + h, t2 = t0 + 2h, etc.

    y_{n+1} = y_n + h f(t_n, y_n).

• Explicit method – the new value (yn+1) depends on values already known (yn)

Page 184: Scientific Computing lectures

184 DE 8.1 Ordinary Differential Equations (ODE)

Backward Euler method

• A different finite difference version:

    y′(t) ≈ (y(t) − y(t − h)) / h

⇒ y_{n+1} = y_n + h f(t_{n+1}, y_{n+1})

• Implicit method – need to solve an equation to find yn+1!

Page 185: Scientific Computing lectures

185 DE 8.1 Ordinary Differential Equations (ODE)

Comparison of the methods

• Implicit methods computationally more complex

• Explicit methods can be unstable – in case of stiff equations

stiff equation – differential equation for which certain numerical methods for solvingthe equation are numerically unstable, unless the step size is taken to be extremelysmall

Page 186: Scientific Computing lectures

186 DE 8.1 Ordinary Differential Equations (ODE)

Examples 1

Initial value problem y′(t) = −15 y(t), t ≥ 0, y(0) = 1.
Exact solution: y(t) = e^{−15t} ⇒ y(t) → 0 as t → ∞.

Explicit Euler schemes with h = 1/4 and h = 1/8.
Adams-Moulton scheme (trapezoidal method): y_{n+1} = y_n + (1/2) h ( f(t_n, y_n) + f(t_{n+1}, y_{n+1}) )

[Figure: numerical solutions on t ∈ [0, 1] – Euler h=1/4, Euler h=1/8, Adams-Moulton h=1/8, and the exact solution.]
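The behaviour is easy to reproduce; a small sketch comparing forward Euler and the trapezoidal rule on this example (for the linear f(t, y) = −15y the implicit update can be solved in closed form):

    import numpy as np

    lam = -15.0                                   # y' = lam * y, y(0) = 1

    def forward_euler(h, T=1.0):
        n = int(round(T / h))
        y = np.empty(n + 1); y[0] = 1.0
        for k in range(n):
            y[k+1] = y[k] + h * lam * y[k]        # explicit update
        return y

    def trapezoidal(h, T=1.0):
        n = int(round(T / h))
        y = np.empty(n + 1); y[0] = 1.0
        for k in range(n):
            # y_{k+1} = y_k + h/2*(lam*y_k + lam*y_{k+1}), solved for y_{k+1}
            y[k+1] = y[k] * (1 + h * lam / 2) / (1 - h * lam / 2)
        return y

    print(forward_euler(0.25)[-1])   # grows in magnitude: |1 + h*lam| > 1
    print(trapezoidal(0.125)[-1])    # decays towards 0 like the exact e^{-15t}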

Page 187: Scientific Computing lectures

187 DE 8.1 Ordinary Differential Equations (ODE)

Example 2

Partial differential equation (see below): wave equation in (1D and) 2D

Wave equation

    −( ∂²u/∂x² + ∂²u/∂y² ) + ∂²u/∂t² = f(x, y, t),

where

• u(x, y, t) – height of a surface (e.g. water level) at point (x, y) at time t

• f(x, y, t) – external force applied to the surface at time t (for simplicity here f(x, y, t) = 0)

• solving on the domain (x, y) ∈ Ω = [0,1] × [0,1] for time t ∈ [0, T]

• Dirichlet boundary conditions u(x, y, t) = 0, (x, y) ∈ ∂Ω, and the values of the derivatives ∂u/∂t |_{t=0} = 0 for (x, y) ∈ Ω

Page 188: Scientific Computing lectures

188 DE 8.1 Ordinary Differential Equations (ODE)

Some examples on comparison of explicit vs implicit schemes (http://courses.cs.ut.ee/2009/sc/Main/FDMschemes)

1D wave equation failing with larger h value (http://www.ut.ee/~eero/SC/1DWaveEqExpl-failure.avi)

Page 189: Scientific Computing lectures

189 DE 8.2 Partial Differential Equations (PDE)

8.2 Partial Differential Equations (PDE)

8.2.1 PDE overview

Examples of PDE-s:

• Laplace’s equation

– important in many fields of science,

* electromagnetism

* astronomy

* fluid dynamics

– behaviour of electric, gravitational, and fluid potentials

– The general theory of solutions to Laplace’s equation – potential theory

– In the study of heat conduction, the Laplace equation – the steady-stateheat equation

Page 190: Scientific Computing lectures

190 DE 8.2 Partial Differential Equations (PDE)

• Maxwell’s equations – relationships between electric and magnetic fields – a set of four partial differential equations

– describe the properties of the electric and magnetic fields and relate themto their sources, charge density and current density

• Navier-Stokes equations – fluid dynamics (dependencies between pressure,speed of fluid particles and fluid viscosity)

• Equations of linear elasticity – vibrations in elastic materials with given prop-erties and in case of compression and stretching out

• Schrödinger equations – quantum mechanics – how the quantum state of aphysical system changes in time. It is as central to quantum mechanics as New-ton’s laws are to classical mechanics

• Einstein field equations – set of ten equations in Einstein’s theory of generalrelativity – describe the fundamental interaction of gravitation as a result ofspacetime being curved by matter and energy.

Page 191: Scientific Computing lectures

191 DE 8.3 2nd order PDEs

8.3 2nd order PDEs

We consider now only the single equation case. In many practical cases 2nd order PDE-s occur, for example:

• Heat equation: ut = uxx

• Wave equation: utt = uxx

• Laplace’s equation: uxx +uyy = 0.

A general second order PDE has the (canonical) form:

    a u_xx + b u_xy + c u_yy + d u_x + e u_y + f u + g = 0.

Assuming not all of a, b and c are zero, then depending on the discriminant b² − 4ac:
b² − 4ac > 0: hyperbolic equation, typical representative – the wave equation;
b² − 4ac = 0: parabolic equation, typical representative – the heat equation;
b² − 4ac < 0: elliptic equation, typical representative – the Poisson equation

Page 192: Scientific Computing lectures

192 DE 8.3 2nd order PDEs

• In case of changing coefficient in time, equations can change their type

• In case of equation systems, each equation can be of different type

• Of course, problem can be non-linear or higher order as well

In general,

• Hyperbolic PDE-s describe time-dependent conservative physical processeslike wave propagation

• Parabolic PDE-s describe time-dependent dissipative (or scattering) physicalprocesses like diffusion, which move towards some fixed-point

• Elliptic PDE-s describe systems that have reached a fixed-point and are there-fore independent of time

Page 193: Scientific Computing lectures

193 DE 8.4 Time-independent PDE-s

8.4 Time-independent PDE-s

8.4.1 Finite Difference Method (FDM)

• Discrete mesh in solving region

• Derivatives replaced with approximation by finite differences

Example. Consider Poisson equation in 2D:

−uxx−uyy = f , 0≤ x≤ 1, 0≤ y≤ 1, (17)

• boundary values as on the figure on the left:

Page 194: Scientific Computing lectures

194 DE 8.4 Time-independent PDE-s

[Figure: left – the unit square 0 ≤ x, y ≤ 1 with the prescribed boundary values (0 on three sides, 1 on the remaining side, as used in the example below); right – the discrete mesh of nodes used in the computation.]

• Define discrete nodes as on the figure on right

• Inner nodes, where computations are carried out are defined with

(xi,y j) = (ih, jh), i, j = 1, ...,n

– (in our case n = 2 and h = 1/(n+1) = 1/3)

Page 195: Scientific Computing lectures

195 DE 8.4 Time-independent PDE-s

Consider here that f = 0.
Replacing the 2nd order derivatives with standard 2nd order differences at the inner mesh points, we get

    (u_{i+1,j} − 2u_{i,j} + u_{i−1,j}) / h² + (u_{i,j+1} − 2u_{i,j} + u_{i,j−1}) / h² = 0,  i, j = 1, ..., n,

where u_{i,j} is the approximation of the real solution u = u(x_i, y_j) at the point (x_i, y_j) and includes a boundary value if i or j equals 0 or n+1. As a result we get:

    4u_{1,1} − u_{0,1} − u_{2,1} − u_{1,0} − u_{1,2} = 0
    4u_{2,1} − u_{1,1} − u_{3,1} − u_{2,0} − u_{2,2} = 0
    4u_{1,2} − u_{0,2} − u_{2,2} − u_{1,1} − u_{1,3} = 0
    4u_{2,2} − u_{1,2} − u_{3,2} − u_{2,1} − u_{2,3} = 0.

In matrix form:

Page 196: Scientific Computing lectures

196 DE 8.4 Time-independent PDE-s

    Ax = [  4  −1  −1   0 ] [ u_{1,1} ]   [ u_{0,1} + u_{1,0} ]   [ 0 ]
         [ −1   4   0  −1 ] [ u_{2,1} ]   [ u_{3,1} + u_{2,0} ]   [ 0 ]
         [ −1   0   4  −1 ] [ u_{1,2} ] = [ u_{0,2} + u_{1,3} ] = [ 1 ] = b.
         [  0  −1  −1   4 ] [ u_{2,2} ]   [ u_{3,2} + u_{2,3} ]   [ 1 ]

This positive definite system can be solved directly with Cholesky factorisation (Gauss elimination for a symmetric matrix, where the factorisation A = LL^T is found) or iteratively. The exact solution of the problem is:

    x = [ u_{1,1} ]   [ 0.125 ]
        [ u_{2,1} ] = [ 0.125 ]
        [ u_{1,2} ]   [ 0.375 ]
        [ u_{2,2} ]   [ 0.375 ].
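This small system can be assembled and solved directly; a sketch for the general n×n interior grid (building the 5-point matrix with Kronecker products is a common shortcut, not from the lecture itself):

    import numpy as np
    from scipy.sparse import identity, kron, diags
    from scipy.sparse.linalg import spsolve

    def poisson2d_matrix(n):
        """n^2 x n^2 matrix of the 5-point Laplacian on an n x n interior grid."""
        T = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))     # 1D second difference
        I = identity(n)
        return kron(I, T) + kron(T, I)                       # block form with 4 on the diagonal

    n = 2
    A = poisson2d_matrix(n).tocsr()
    b = np.array([0.0, 0.0, 1.0, 1.0])    # boundary contributions from the example above
    print(spsolve(A, b))                   # [0.125 0.125 0.375 0.375]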

Page 197: Scientific Computing lectures

197 DE 8.4 Time-independent PDE-s

In the general case the n²×n² Laplace matrix has the form:

    A = [  B  −I   0  ···  0 ]
        [ −I   B  −I   ⋱   ⋮ ]
        [  0  −I   B   ⋱   0 ]      (18)
        [  ⋮   ⋱   ⋱   ⋱  −I ]
        [  0  ···  0  −I   B ],

where the n×n matrix B is of the form:

    B = [  4  −1   0  ···  0 ]
        [ −1   4  −1   ⋱   ⋮ ]
        [  0  −1   4   ⋱   0 ]
        [  ⋮   ⋱   ⋱   ⋱  −1 ]
        [  0  ···  0  −1   4 ].

It means that most of the elements of matrix A are zero. What are such matrices called? – it is a sparse matrix

Page 198: Scientific Computing lectures

198 DE 8.5 FEM – 1D example

8.4.2 Finite Element Method (FEM)

8.5 FEM – 1D example

Consider the two-point boundary value problem

    −u″(x) = f(x) for 0 < x < 1,  (19)
     u(0) = u(1) = 0,

where v′ = dv/dx and f is a given continuous function

Variational formulation

Multiply −u′′ = f by arbitrary function v ∈V defined as:

• v(0) = v(1) = 0

• v – continuous on [0,1]

• v′ – piecewise continuous andbounded on [0,1]

Page 199: Scientific Computing lectures

199 DE 8.5 FEM – 1D example

and integrate:

    −∫₀¹ u″ v dx = ∫₀¹ f v dx.

Integrate the left-hand side by parts:

    −∫₀¹ u″ v dx = −u′v |₀¹ + ∫₀¹ u′v′ dx = ∫₀¹ u′v′ dx

=>

Variational Formulation of (19):

    ∫₀¹ u′v′ dx = ∫₀¹ f v dx,  ∀v ∈ V  (20)

Page 200: Scientific Computing lectures

200 DE 8.5 FEM – 1D example

FEM with piecewise linear functions

V_h – set of piecewise linear functions defined on the points 0 = x_0 < x_1 < ... < x_M < x_{M+1} = 1;
h_j = x_j − x_{j−1}, j = 1, ..., M+1;  h = max h_j

FEM reformulation of the boundary value problem for piecewise linear functions:

Find u_h ∈ V_h such that

    ∫₀¹ u_h′ v′ dx = ∫₀¹ f v dx,  ∀v ∈ V_h  (21)

Define linear basis functions (or test functions) by:

Page 201: Scientific Computing lectures

201 DE 8.5 FEM – 1D example

    φ_j(x_i) = { 1 if i = j
               { 0 if i ≠ j        i, j = 1, ..., M

=> ∀v ∈ V_h:

    v(x) = Σ_{i=1}^{M} η_i φ_i(x),  x ∈ [0,1],  (22)

where η_i = v(x_i) (a linear combination of the basis functions φ_i)

(21) =>

    ∫₀¹ u_h′ φ_j′ dx = ∫₀¹ f φ_j dx,  j = 1, ..., M

From (22) we have u_h = Σ_{i=1}^{M} ξ_i φ_i,  ξ_i = u_h(x_i)

=> we can write

Page 202: Scientific Computing lectures

202 DE 8.5 FEM – 1D example

    Σ_{i=1}^{M} ξ_i ∫₀¹ φ_i′ φ_j′ dx = ∫₀¹ f φ_j dx,  j = 1, ..., M

What is actually written here? – it is a linear system of equations!
M unknowns (ξ_1, ..., ξ_M); M equations. Matrix form:

Aξ = b, (23)

where

• A = (a_ij) – M×M matrix with elements a_ij = ∫₀¹ φ_i′ φ_j′ dx;

• M-vectors ξ = (ξ_1, ..., ξ_M), b = (b_1, ..., b_M): b_i = ∫₀¹ f φ_i dx

Page 203: Scientific Computing lectures

203 DE 8.5 FEM – 1D example

    A = [ a_11  ···  a_1M ]      ξ = [ ξ_1 ]      b = [ b_1 ]
        [  ⋮     ⋱    ⋮   ]          [  ⋮  ]          [  ⋮  ]
        [ a_M1  ···  a_MM ]          [ ξ_M ]          [ b_M ]

A – stiffness matrix, b – load vector (terms coming from structural mechanics)

Computation of a_ij

• ∫₀¹ φ_i′ φ_j′ dx = 0 if |i − j| > 1. Why? Since at each point then either φ_i or φ_j = 0

• => the matrix A is tridiagonal

• for j = 1, ...,M:

    ∫₀¹ φ_j′ φ_j′ dx = ∫_{x_{j−1}}^{x_j} (1/h_j²) dx + ∫_{x_j}^{x_{j+1}} (1/h_{j+1}²) dx = 1/h_j + 1/h_{j+1}

• for j = 2, ...,M:

Page 204: Scientific Computing lectures

204 DE 8.5 FEM – 1D example

    ∫₀¹ φ_j′ φ_{j−1}′ dx = ∫₀¹ φ_{j−1}′ φ_j′ dx = ∫_{x_{j−1}}^{x_j} (−1/h_j²) dx = −1/h_j

• A is symmetric and positive definite, since with v(x) = Σ_{j=1}^{M} η_j φ_j(x) we have

    Σ_{i,j=1}^{M} η_i ( ∫₀¹ φ_i′ φ_j′ dx ) η_j = ∫₀¹ ( Σ_{i=1}^{M} η_i φ_i′ )( Σ_{j=1}^{M} η_j φ_j′ ) dx = ∫₀¹ v′ v′ dx ≥ 0

With uniform h_j = h = 1/(M+1) the system (23) has the form:

    (1/h) [  2  −1   0   0  ···  0 ] [ ξ_1 ]   [ b_1 ]
          [ −1   2  −1   0  ···  0 ] [ ξ_2 ]   [ b_2 ]
          [  0  −1   2  −1   ⋱   ⋮ ] [  ⋮  ] = [  ⋮  ]
          [  ⋮   ⋱   ⋱   ⋱   ⋱   0 ] [  ⋮  ]   [  ⋮  ]
          [  ⋮       ⋱  −1   2  −1 ] [  ⋮  ]   [  ⋮  ]
          [  0  ··· ···  0  −1   2 ] [ ξ_M ]   [ b_M ]
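A sketch assembling and solving this 1D FEM system for a uniform mesh (the right-hand side f ≡ 1 is an illustrative choice; the load vector uses the simple approximation b_i ≈ h f(x_i)):

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import spsolve

    def fem_1d(f, M):
        h = 1.0 / (M + 1)
        x = np.linspace(h, 1.0 - h, M)                                # inner nodes x_1..x_M
        A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(M, M)) / h    # stiffness matrix
        b = h * f(x)                                                  # approximate load vector
        return x, spsolve(A.tocsr(), b)

    x, xi = fem_1d(lambda x: np.ones_like(x), M=9)        # -u'' = 1, u(0) = u(1) = 0
    print(np.max(np.abs(xi - x * (1 - x) / 2)))           # close to the exact solution at the nodes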

Page 205: Scientific Computing lectures

205 DE 8.6 2D FEM for Poisson equation

8.6 2D FEM for Poisson equation

Consider as an example the Poisson equation

    −∆u(x) = f(x), ∀x ∈ Ω,
     u(x) = g(x), ∀x ∈ Γ,

where the Laplacian ∆ is defined by

    (∆u)(x) = ( ∂²u/∂x² + ∂²u/∂y² )(x),  x = (x, y)

u can be e.g.:

• temperature

• electro-magnetic potential

• displacement of an elastic membrane fixed at the boundary under a transversal load

Page 206: Scientific Computing lectures

206 DE 8.6 2D FEM for Poisson equation

Divergence theorem

    ∫_Ω div F dx = ∫_Γ F · n ds,

where F = (F1, F2) is a vector-valued function defined on Ω,

    div F = ∂F1/∂x1 + ∂F2/∂x2,

and n = (n1, n2) is the outward unit normal to Γ.
Here dx – element of area in R²; ds – arc length along Γ.
Applying the divergence theorem to F = (vw, 0) and F = (0, vw) (details not shown here) – Green’s first identity

    ∫_Ω ∇u · ∇v dx = ∫_Γ v ∂u/∂n ds − ∫_Ω v ∆u dx  (24)

we come to a variational formulation of the Poisson problem on V :

Page 207: Scientific Computing lectures

207 DE 8.6 2D FEM for Poisson equation

V = { v : v is continuous on Ω, ∂v/∂x1 and ∂v/∂x2 are piecewise continuous on Ω and v = 0 on Γ }

Poisson problem in Variational Formulation: Find u ∈ V such that

    a(u, v) = (f, v),  ∀v ∈ V  (25)

where in case of the Poisson equation

    a(u, v) = ∫_Ω ∇u · ∇v dx

and (u, v) = ∫_Ω u(x) v(x) dx. The gradient ∇ of a scalar function f(x, y) is defined by:

    ∇f = ( ∂f/∂x, ∂f/∂y )

Page 208: Scientific Computing lectures

208 DE 8.6 2D FEM for Poisson equation

Again, the variational formulation of the Poisson equation is:

    ∫_Ω [ ∂u/∂x ∂v/∂x + ∂u/∂y ∂v/∂y ] dx = ∫_Ω f v dx

With discretisation we replace the space V with V_h – the space of piecewise linear functions. Each function in V_h can be written as

    v_h(x, y) = Σ η_i φ_i(x, y),

where φ_i(x, y) – basis functions (or ’hat’ functions)

Page 209: Scientific Computing lectures

209 DE 8.6 2D FEM for Poisson equation

[Figure: a ’hat’ basis function φ_i over a triangle T of the triangulation of Ω – it equals 1 at node N_i and 0 at all other nodes; a piecewise linear function v_h ∈ V_h on Ω.]

With 2D FEM we demand that the equation in the variational formulation is satisfied for the M basis functions φ_i ∈ V_h, i.e.

    ∫_Ω ∇u_h · ∇φ_i dx = ∫_Ω f φ_i dx,  i = 1, ..., M

But we have:

    u_h = Σ_{j=1}^{M} ξ_j φ_j  =⇒  ∇u_h = Σ_{j=1}^{M} ξ_j ∇φ_j

Page 210: Scientific Computing lectures

210 DE 8.6 2D FEM for Poisson equation

M linear equations with respect to the unknowns ξ_j:

    ∫_Ω Σ_{j=1}^{M} ( ξ_j ∇φ_j ) · ∇φ_i dx = ∫_Ω f φ_i dx,  i = 1, ..., M

The stiffness matrix A = (a_ij) elements and right-hand side b = (b_i) are calculated as:

    a_ij = a(φ_i, φ_j) = ∫_Ω ∇φ_i · ∇φ_j dx

    b_i = (f, φ_i) = ∫_Ω f φ_i dx

The integrals are computed only where the pairs ∇φ_i · ∇φ_j get in touch (have mutual support)

Page 211: Scientific Computing lectures

211 DE 8.6 2D FEM for Poisson equation

Example

Two basis functions φ_i and φ_j for nodes N_i and N_j. Their common support is τ ∪ τ′, so that

    a_ij = ∫_Ω ∇φ_i · ∇φ_j dx = ∫_τ ∇φ_i · ∇φ_j dx + ∫_{τ′} ∇φ_i · ∇φ_j dx

[Figure: two adjacent triangles τ and τ′ sharing the edge N_i N_j (third vertex N_k); the hat functions φ_i and φ_j overlap only on τ ∪ τ′.]

Element matrices

Consider a single element τ.

Pick two basis functions φ_i and φ_j (out of three).

Element matrices

Consider single element τ

Pick two basis functions ϕi and ϕ j (out of three).

Page 212: Scientific Computing lectures

212 DE 8.6 2D FEM for Poisson equation

φ_k is piecewise linear =⇒ denoting by p_k(x, y) = φ_k|_τ:

    p_i(x, y) = α_i x + β_i y + γ_i,
    p_j(x, y) = α_j x + β_j y + γ_j.

=⇒ ∇φ_i|_τ = (∂p_i/∂x, ∂p_i/∂y) = (α_i, β_i) =⇒ their dot product gives

    ∫_τ ∇φ_i · ∇φ_j dx = ∫_τ (α_i α_j + β_i β_j) dx.

Finding the coefficients α and β – put the three points (x_i, y_i, 1), (x_j, y_j, 0), (x_k, y_k, 0) into the plane equation and solve the system

    D [ α_i ]   [ x_i  y_i  1 ] [ α_i ]   [ 1 ]
      [ β_i ] = [ x_j  y_j  1 ] [ β_i ] = [ 0 ]
      [ γ_i ]   [ x_k  y_k  1 ] [ γ_i ]   [ 0 ]

Page 213: Scientific Computing lectures

213 DE 8.6 2D FEM for Poisson equation

The integral is computed by multiplication with the triangle area (1/2)|det D|:

    a^τ_ij = ∫_τ ∇φ_i · ∇φ_j dx = (α_i α_j + β_i β_j) · (1/2) |det D|,   D = [ x_i  y_i  1 ]
                                                                             [ x_j  y_j  1 ]
                                                                             [ x_k  y_k  1 ]

The element matrix for τ is

    Aτ = [ a^τ_ii  a^τ_ij  a^τ_ik ]
         [ a^τ_ji  a^τ_jj  a^τ_jk ]
         [ a^τ_ki  a^τ_kj  a^τ_kk ]

Assembled stiffness matrix

• created by adding all the element matrices together appropriately

Assembled stiffness matrix

• created by adding appropriately all the element matrices together

Page 214: Scientific Computing lectures

214 DE 8.6 2D FEM for Poisson equation

• Different types of boundary values used in the problem setup result in slightlydifferent stiffness matrices

• Most typical boundary conditions:

– Dirichlet’

– Neumann

• but also:

– free boundary condition

– Robin boundary condition

Right-hand side

Page 215: Scientific Computing lectures

215 DE 8.6 2D FEM for Poisson equation

Right-hand side b: b_i = ∫_Ω f φ_i dx. Approximate calculation of the integral through the value of f at the centre of the triangle τ:

    f^τ = f( (x_i + x_j + x_k)/3, (y_i + y_j + y_k)/3 )

• support – the area of the triangle

• φ_i is a pyramid (integrating φ_i over τ gives the pyramid volume)

    b^τ_i = ∫_τ f^τ φ_i dx = (1/6) f^τ |det D^τ|

• Assembling procedure for the RHS values b from the b^τ_i (somewhat similar to the matrix assembly, but simpler)

• Introduction of Dirichlet boundary condition just eliminates the rows.
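A sketch computing the 3×3 element stiffness matrix of a single triangle exactly as above (the vertex coordinates are an illustrative assumption):

    import numpy as np

    def element_stiffness(xy):
        """xy: 3x2 array of triangle vertex coordinates; returns the 3x3 matrix A_tau."""
        D = np.hstack([xy, np.ones((3, 1))])         # rows (x_k, y_k, 1)
        area = 0.5 * abs(np.linalg.det(D))
        coeffs = np.linalg.solve(D, np.eye(3))       # column k holds (alpha_k, beta_k, gamma_k)
        grads = coeffs[:2, :]                        # gradients (alpha_k, beta_k) of the three p_k
        return area * (grads.T @ grads)              # a_ij = (alpha_i*alpha_j + beta_i*beta_j)*area

    A_tau = element_stiffness(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
    print(A_tau)       # rows sum to zero; for this right triangle the diagonal is [1, 0.5, 0.5]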

Page 216: Scientific Computing lectures

216 DE 8.6 2D FEM for Poisson equation

EXAMPLE

Consider the unit square.

V_h – space of piecewise linear functions on Ω with value zero on Γ and values defined at the inner nodes

[Figure: uniform triangulation of the unit square in the (x, y) plane.]

Discretised problem in Variational Formulation: Find u_h ∈ V_h such that

    a(u_h, v) = (f, v),  ∀v ∈ V_h  (26)

where in case of the Poisson equation

    a(u, v) = ∫_Ω ∇u · ∇v dx

Page 217: Scientific Computing lectures

217 DE 8.6 2D FEM for Poisson equation

• In FEM the equation (26) needs to be satisfied on a set of test functions φ_i = φ_i(x),

– which are defined such that

    φ_i(x) = { 1, x = x_i
             { 0, x = x_j, j ≠ i

• and it is demanded that (26) is satisfied with each φ_i (i = 1, ..., N)

• As a result, a matrix of the linear equations is obtained

• The resulting matrix is identical with the matrix from (18)!

Page 218: Scientific Computing lectures

218 DE 8.7 FEM for more general cases

8.7 FEM for more general cases

Poisson equation with a varying coefficient (different materials, varying conductivity, elasticity etc.)

    −∇ · (λ(x, y) ∇u) = f

• assume that λ is constant across each element τ: λ(x, y) = λ^τ, (x, y) ∈ τ

• =⇒ the element matrices A^τ get multiplied by λ^τ.

• Refinement methods for finiteelements

– static

– adaptive refinement

Page 219: Scientific Computing lectures

219 DE 8.8 FEM software

Other equations

• For higher order and mixed systems of PDEs, more complex finite elements

– higher order finite elements for discretization

– special type (mixed) finite elements

8.8 FEM software

• Input of data, f , g (boundary cond.), Ω, coefficients λ (x,y) etc.

• Triangulation construction

• Computation of element stiffness matrices Aτ

• Assembly of the global stiffness matrix A and load vector b

Page 220: Scientific Computing lectures

220 DE 8.8 FEM software

• Solution of the system of equations Ax = b

• Presentation of the result

Page 221: Scientific Computing lectures

221 DE 8.9 Sparse matrix storage schemes

8.9 Sparse matrix storage schemes

• As we saw, different discretisation schemes give systems with similar matrix structures

• (In addition to FDM and FEM, often also other discretisation schemes are used, like the Finite Volume Method (but we do not consider it here))

• In each case, the result is a system of linear equations with a sparse matrix.

How to store sparse matrices?

Page 222: Scientific Computing lectures

222 DE 8.9 Sparse matrix storage schemes

8.9.1 Triple storage format

• n×m matrix A each nonzero with 3 values: integers i and j and (in mostapplications) real matrix element ai j. =⇒ three arrays:

indi(1:nz), indj(nz), vals(nz)of length nz, – number of matrix A nonzeroes

Advantages of the scheme:

• Easy to refer to a particular element

• Freedom to choose the order of the elements

Disadvantages :

• Nontrivial to find, for example, all nonzeroes of a particular row or column andtheir positions

Page 223: Scientific Computing lectures

223 DE 8.9 Sparse matrix storage schemes

8.9.2 Column-major storage format

For each matrix A column k a vector row_ind(j) – giving row numbers i forwhich ai j 6= 0.

• To store the whole matrix, each column nonzeros

– added into a 1-dimensional array row_ind(1:nz)

– introduce cptr(1:M) referring to column starts of each column inrow_ind.

row_ind(1:nz), cptr(1:M), vals(1:nz)

Advantages:

• Easy to find matrix column nonzeroes together with their positions

Page 224: Scientific Computing lectures

224 DE 8.9 Sparse matrix storage schemes

Disadvantages:

• Algorithms become more difficult to read

• Difficult to find nonzeroes in a particular row

8.9.3 Row-major storage format

For each matrix A row k a vector row(i) giving column numbers j for whichai j 6= 0.

• To store the whole matrix, each row nonzero

– added into a 1-dimensional array col_ind(1:nz)

– introduce rptr(1:N) referring to row starts of each row in col_ind.

col_ind(1:nz), rptr(1:N), vals(1:nz)

Page 225: Scientific Computing lectures

225 DE 8.9 Sparse matrix storage schemes

Advantages:

• Easy to find matrix row nonzeroes together with their positions

Disadvantages:

• Algorithms become more difficult to read.

• Difficult to find nonzeroes in a particular column.

8.9.4 Combined schemes

Triple format enhanced with cols(1:nz), cptr(1:M), rows(1:nz),

rptr(1:N). Here cols and rows refer to the corresponding matrix A values in triple format. E.g., to access row-major type structures, one has to index through rows(1:nz)

Advantages:

Page 226: Scientific Computing lectures

226 DE 8.9 Sparse matrix storage schemes

• All operations easy to perform

Disadvantages:

• More memory needed.

• Reference through indexing in all cases
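These are exactly the COO (triple), CSC (column-major) and CSR (row-major) formats provided by scipy.sparse; a small sketch:

    import numpy as np
    from scipy.sparse import coo_matrix

    A = coo_matrix((np.array([4.0, -1.0, -1.0, 4.0]),      # vals
                    (np.array([0, 0, 1, 1]),               # indi
                     np.array([0, 1, 0, 1]))),             # indj
                   shape=(2, 2))

    csr = A.tocsr()     # row-major: data, indices (column numbers), indptr (row starts)
    csc = A.tocsc()     # column-major: data, indices (row numbers), indptr (column starts)
    print(csr.indptr, csr.indices, csr.data)
    print(csc.indptr, csc.indices, csc.data)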

Page 227: Scientific Computing lectures

227 Iterative methods 9.1 Problem setup

9 Iterative methods

9.1 Problem setup

Iterative methods for solving systems of linear equations with sparse matrices

Consider system of linear equations

Ax = b, (27)

where N×N matrix A

• is sparse,

– number of elements for which Ai j 6= 0 is O(N).

• Typical example: Poisson equation discretisation on n×n mesh, (N = n×n)

– on average 5 nonzeroes per row of A

Page 228: Scientific Computing lectures

228 Iterative methods 9.1 Problem setup

In case of direct methods, like LU-factorisation

• memory consumption (together with fill-in): O(N2) = O(n4).

• flops: 2/3 ·N3 +O(N2) = O(n6).

Banded matrix LU-decomposition

• memory consumption (together with fill-in): O(N · L) = O(n3), where L isbandwidth

• flops: 2/3 ·N ·L2 +O(N ·L) = O(n4).

Page 229: Scientific Computing lectures

229 Iterative methods 9.2 Jacobi Method

9.2 Jacobi Method

• An iterative method for solving (27)

• With a given initial approximation x^(0), the approximate solutions x^(k), k = 1, 2, 3, ... of the real solution x of (27) are calculated as follows:

– the i-th component of x^(k+1), x_i^(k+1), is obtained by taking from (27) only the i-th row:

    A_{i,1} x_1 + ··· + A_{i,i} x_i + ··· + A_{i,N} x_N = b_i

– solving this with respect to x_i, an iterative scheme is obtained:

    x_i^(k+1) = (1/A_{i,i}) ( b_i − Σ_{j≠i} A_{i,j} x_j^(k) )  (28)

Page 230: Scientific Computing lectures

230 Iterative methods 9.2 Jacobi Method

The calculations are in essence parallel with respect to i – there is no dependence on the other components x_j^(k+1), j ≠ i. As the iteration stop criterion one can take, for example:

    ‖x^(k+1) − x^(k)‖ < ε  or  k+1 ≥ k_max,  (29)

– ε – given error tolerance

– k_max – maximal number of iterations

• memory consumption (no fill-in): N_{A≠0} – the number of nonzeroes of matrix A

• Number of iterations to reduce ‖x^(k) − x‖₂ < ε ‖x^(0) − x‖₂:

    #IT ≥ (2 ln ε^{−1} / π²) (n+1)² = O(n²)

Page 231: Scientific Computing lectures

231 Iterative methods 9.2 Jacobi Method

• flops/iteration ≈ 10 · N = O(n²), =⇒

    #IT · flops/iteration = C n⁴ + O(n³) = O(n⁴).

The coefficient C in front of n⁴ is:

    C ≈ (2 ln ε^{−1} / π²) · 10 ≈ 2 ln ε^{−1}

• Is this good or bad? This is not very good at all... We need some better methods, because

– for LU-decomposition (banded matrices) we had C = 2/3
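A minimal Jacobi iteration in Python/NumPy following (28) and the stopping rule (29), shown here with a dense matrix for brevity (the example matrix is the small Poisson system from above):

    import numpy as np

    def jacobi(A, b, eps=1e-8, kmax=10000):
        x = np.zeros_like(b)
        D = np.diag(A)                           # diagonal entries A_ii
        R = A - np.diagflat(D)                   # off-diagonal part
        for k in range(kmax):
            x_new = (b - R @ x) / D              # component-wise update (28)
            if np.linalg.norm(x_new - x) < eps:  # stop criterion (29)
                return x_new, k + 1
            x = x_new
        return x, kmax

    A = np.array([[4.0, -1.0, -1.0, 0.0],
                  [-1.0, 4.0, 0.0, -1.0],
                  [-1.0, 0.0, 4.0, -1.0],
                  [0.0, -1.0, -1.0, 4.0]])
    b = np.array([0.0, 0.0, 1.0, 1.0])
    print(jacobi(A, b))                          # converges to [0.125 0.125 0.375 0.375]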

Page 232: Scientific Computing lectures

232 Iterative methods 9.3 Conjugate Gradient Method (CG)

9.3 Conjugate Gradient Method (CG)

    Calculate r(0) = b − A x(0) with a given starting vector x(0)
    for i = 1, 2, ...
        solve M z(i−1) = r(i−1)              # we assume here that M = I for now
        ρ_{i−1} = r(i−1)^T z(i−1)
        if i == 1
            p(1) = z(0)
        else
            β_{i−1} = ρ_{i−1} / ρ_{i−2}
            p(i) = z(i−1) + β_{i−1} p(i−1)
        endif
        q(i) = A p(i);   α_i = ρ_{i−1} / (p(i)^T q(i))
        x(i) = x(i−1) + α_i p(i);   r(i) = r(i−1) − α_i q(i)
        check convergence; continue if needed
    end

Page 233: Scientific Computing lectures

233 Iterative methods 9.3 Conjugate Gradient Method (CG)

• memory consumption (no fill-in):

    N_{A≠0} + O(N) = O(n²),

where N_{A≠0} – # nonzeroes of A

• Number of iterations to achieve ‖x^(k) − x‖₂ < ε ‖x^(0) − x‖₂:

    #IT ≈ (ln ε^{−1} / 2) √κ(A) = O(n)

• Flops/iteration ≈ 24 · N = O(n²), =⇒

    #IT · flops/iteration = C n³ + O(n²) = O(n³),

Page 234: Scientific Computing lectures

234 Iterative methods 9.3 Conjugate Gradient Method (CG)

where C ≈ 12 ln ε^{−1} · √κ₂(A).

=⇒ C depends on the condition number of A! This paves the way for the preconditioning technique
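In practice CG is rarely implemented by hand; a sketch with scipy.sparse.linalg.cg, optionally with a diagonal (Jacobi) preconditioner of the kind discussed in the next sections:

    import numpy as np
    from scipy.sparse import diags, identity, kron
    from scipy.sparse.linalg import cg

    n = 50                                              # 2D Poisson matrix, N = n*n
    T = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    A = (kron(identity(n), T) + kron(T, identity(n))).tocsr()
    b = np.ones(A.shape[0])

    x, info = cg(A, b)                       # plain CG (M = I); info == 0 on success
    M = diags(1.0 / A.diagonal())            # Jacobi (diagonal scaling) preconditioner
    x_p, info_p = cg(A, b, M=M)              # here the diagonal is constant, so little gain
    print(info, info_p, np.linalg.norm(b - A @ x))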

Page 235: Scientific Computing lectures

235 Iterative methods 9.4 Preconditioning

9.4 Preconditioning

Idea: replace Ax = b with the system M^{−1}Ax = M^{−1}b.
Apply CG to

    Bx = c,  (30)

where B = M^{−1}A and c = M^{−1}b.
But how to choose M? The preconditioner M = M^T is to be chosen such that

Page 236: Scientific Computing lectures

236 Iterative methods 9.4 Preconditioning

Then

    #IT(30) = O(√κ₂(B)) < O(√κ₂(A)) = #IT(27)

but

    flops/iteration(30) = flops/iteration(27) + (i) > flops/iteration(27)

• =⇒We need to make a compromise!

• (In extreme cases M = I or M = A)

• Preconditioned Conjugate Gradients (PCG) Method

– obtained if to take in previous algorithm M 6= I

Page 237: Scientific Computing lectures

237 Iterative methods 9.5 Preconditioner examples

9.5 Preconditioner examples

Diagonal scaling (or Jacobi method)

    M = diag(A)

(i) flops/iteration = N

(ii) κ₂(B) = κ₂(A)

⇒ Is this good? No large improvement is to be expected

Page 238: Scientific Computing lectures

238 Iterative methods 9.5 Preconditioner examples

Incomplete LU-factorisation

    M = L̃Ũ,

• L̃ and Ũ – approximations to the actual factors L and U of the LU-decomposition

– nonzeroes in L̃_ij and Ũ_ij only where A_ij ≠ 0 (i.e. fill-in is ignored in the LU-factorisation algorithm)

(i) flops/iteration = O(N)

(ii) κ₂(B) < κ₂(A)

How good is this preconditioner? Some improvement is at least expected!

    κ₂(B) = O(n²)

Page 239: Scientific Computing lectures

239 Iterative methods 9.5 Preconditioner examples

Gauss-Seidel method

    do k=1,2,...
        do i=1,...,n

            x_i^(k+1) = (1/A_{i,i}) ( b_i − Σ_{j=1}^{i−1} A_{i,j} x_j^(k+1) − Σ_{j=i+1}^{n} A_{i,j} x_j^(k) )  (31)

        enddo
    enddo

Note that in a real implementation the method is carried out like this:

    do k=1,2,...
        do i=1,...,n

            x_i = (1/A_{i,i}) ( b_i − Σ_{j≠i} A_{i,j} x_j )  (32)

        enddo
    enddo

Do you see a problem with this preconditioner (with the PCG method)? The preconditioner is not symmetric, which makes CG fail to converge!

Page 240: Scientific Computing lectures

240 Iterative methods 9.5 Preconditioner examples

Symmetric Gauss-Seidel method

To get a symmetric preconditioner, another step is added:

    do k=1,2,...
        do i=1,...,n
            x_i = (1/A_{i,i}) ( b_i − Σ_{j≠i} A_{i,j} x_j )
        enddo
        do i=n,...,1 step=-1
            x_i = (1/A_{i,i}) ( b_i − Σ_{j≠i} A_{i,j} x_j )
        enddo
    enddo

Page 241: Scientific Computing lectures

241 Integral Equations 10.1

10 Solving Integral Equations

10.1

Quadrature Formula Method for Solving Integral Equations (http://www.ut.ee/~eero/SC/IntV6rrLah-eng.pdf)

Kvadratuurvalemite meetod integraalvõrrandi lahendamiseks (http://www.ut.ee/~eero/SC/IntV6rrLah.pdf)

Page 242: Scientific Computing lectures

242 Examples 11.1 Graph partitioning

11 Examples

11.1 Graph partitioning

• Consider a graph G = (T, K), T – set of nodes, K – set of edges (each connecting two nodes)

ing two nodes)

• One of the key tasks: dividing graph G into two subgraphs Gi = (Ti,Ki), i = 1,2without intersection by erasing a set of edges of G

• The subgraphs need to fulfil certain criteria:

– A. more or less equal sizes – #T1 ≈ #T2

* (# denoting number of elements in a given set)

* For example, in parallel processing, this means equal load in algo-rithms associated with the graph

– B. minimum # of erased edges. (In parallel program – minimising theamount of communication between different processors)

Page 243: Scientific Computing lectures

243 Examples 11.1 Graph partitioning

Example. Creating subdomains of equal size with a minimal cut

Let the region consist of triangles (“finite elements”)

• each triangle may have a connection to its neighbours through nodes or edges.

The aim is to divide the triangles into two sets with a minimal cut

• A typical problem with parallelisation of PDE-s

• A dual graph can be constructed

– placing nodes at the centres of each triangle

– connecting nodes with a common side

Page 244: Scientific Computing lectures

244 Examples 11.1 Graph partitioning

Graph partitioning is applied to the dual graph to obtain two approximately equalsets of triangles.

Note. The method can be applied recursively to obtain other numbers of subdo-mains

Page 245: Scientific Computing lectures

245 Examples 11.1 Graph partitioning

11.1.1 Mathematical formulation

Let N = #T be the number of nodes. Assign to each node t_i of the graph G a real number x_i – obtaining a vector x = (x_i) ∈ R^N. Define the set of allowed vectors

    V = { x ∈ R^N : x_i = +1 or −1 ∀ t_i ∈ T }.

Introduce the vector e = (1, 1, ..., 1)^T ∈ V. Assuming now that N is even, the partitioning problem can be written as follows:

Find x ∈ V for which

    e^T x = 0  (33)

and which minimises the functional

    φ(x) := Σ_{(k,s)∈K} (x_k − x_s)²  (34)

over the whole set of allowed vectors V.

Page 246: Scientific Computing lectures

246 Examples 11.1 Graph partitioning

• Solving this problem, the corresponding graph is divided into two subgraphs by setting each node t for which

– x_t = 1 into the first set

– x_t = −1 into the second set

• At the same time, the number (34) gives the number of erased edges multiplied by 4

• Equal size of the subgraphs follows from (33)

• The minimum of erased edges is guaranteed by (34)

Page 247: Scientific Computing lectures

247 Examples 11.1 Graph partitioning

Alternative formula for φ(x)

Introduce the adjacency matrix A_G = (a_ij) of the graph G, which is an N×N matrix consisting only of values 0 or 1:

    a_ij = { 1, ∃ edge from node i to node j,
           { 0, otherwise.

• Note that the matrix A_G is symmetric

• Denote by d_t the number of connections from node t ∈ T and introduce the diagonal matrix

    D_G = diag{ d_t : t ∈ T }.

Define the graph's Laplacian matrix L_G:

    L_G = D_G − A_G.  (35)

Page 248: Scientific Computing lectures

248 Examples 11.1 Graph partitioning

Lemma 9.1. The functional φ and the Laplacian matrix L_G are connected with each other as follows:

    (1/2) φ(x) = x^T L_G x.

Proof. Exercise!
Using Lemma 9.1, the partitioning problem can be reformulated as follows: Find x ∈ V such that

    e^T x = 0  (36)

and which minimises the expression

    x^T L_G x over all elements x ∈ V.  (37)

From Lemma 9.1 it follows that

Page 249: Scientific Computing lectures

249 Examples 11.1 Graph partitioning

• L_G is

– symmetric

– non-negative definite

• All eigenvalues λ_i of the matrix L_G satisfy λ_i ≥ 0

– Indeed, from Lemma 9.1 we see that L_G x = 0 iff x_t is constant on the whole set t ∈ T, i.e. 0 is an eigenvalue of the matrix L_G with the corresponding eigenspace span{e}.

The exact solution of the discrete optimisation problem (36), (37) is difficult to find. Therefore, in practice, the minimisation over the allowed vector set V is replaced with a minimisation over a subset of R^N.

Page 250: Scientific Computing lectures

250 Examples 11.1 Graph partitioning

11.1.2 Modified problem

Find x ∈ R^N such that

    e^T x = 0  (38)

and which minimises the expression

    x^T L_G x over all elements x ∈ R^N with ‖x‖₂ = N^{1/2}.  (39)

Eigenvalue problem formulation

In the eigenvalue problem, pairs (λ, v) are searched for which the following equation is satisfied:

    L_G v = λ v.  (40)

• In case of symmetric matrix all the eigenvalues λi, i = 1, ...,N are real.

Page 251: Scientific Computing lectures

251 Examples 11.1 Graph partitioning

• The eigenvectors are orthogonal to each other (i.e. v_i^T v_j = 0 if i ≠ j) and form a basis of the space R^N. From here it follows that each vector x ∈ R^N can be represented as

    x = Σ_{i=1}^{N} w_i v_i with multipliers w_i ∈ R.  (41)

Lemma 9.2. The solution of problem (38), (39) is given by the eigenvector of matrix L_G corresponding to the smallest positive eigenvalue λ₂ of L_G.

Proof. Exercise. (Using (35) show that an eigenvector of matrix L_G is v₁ = e with corresponding eigenvalue λ₁ = 0. Then from (38) it follows that w₁ = 0 in the expansion (41). To find the minimum of the expression x^T L_G x use (41) and (40) and the nonnegative definiteness of the matrix.)


11.1.3 Method of Spectral Bisectioning

In the spectral bisection method for a graph G an eigenvector x corresponding to the smallest positive eigenvalue λ2 of the matrix LG is found.

• x consists of real numbers, not integers

• For obtaining the bisection, the arithmetic average a of the elements of the vector x is found.

  – All nodes t of the graph G for which xt < a are assigned to the first set,

  – those with xt ≥ a are assigned to the second set.

Despite the heuristic nature of the algorithm, it often results in bisections of good quality.

Details of the algorithm were published in:
N. P. Kruyt, A conjugate gradient algorithm for the spectral partitioning of graphs, Parallel Computing, 22 (1997), 1493 - 1502.


11.1.4 Finding the smallest positive eigenvalue of the matrix LG

As mentioned above, to find the graph bisection, the arithmetic average a of the eigenvector x needs to be found (based on which the graph is divided). But as the result does not depend on the norm of the vector x, we are free to normalise ‖x‖2 = 1.

11.1.5 Power Method

Choose x0 as an initial approximation
for k = 0, 1, ...
    yk := LG xk
    xk+1 := yk / ‖yk‖2     # normalisation
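A minimal NumPy sketch of this iteration (assuming a Laplacian matrix L_G like the one built in the earlier snippet); it converges towards the eigenvector of the largest eigenvalue, cf. Theorem 9.1 below.

import numpy as np

def power_method(A, x0, iters=100):
    # Power iteration: y_k := A x_k, x_{k+1} := y_k / ||y_k||_2.
    # Converges towards the eigenvector of the dominant (largest) eigenvalue.
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        y = A.dot(x)                   # y_k := L_G x_k
        x = y / np.linalg.norm(y)      # normalisation
    return x

# e.g. with the L_G constructed in the earlier snippet:
# x = power_method(L_G, np.random.rand(L_G.shape[0]))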


Theorem 9.1. Let the eigenvalues of the matrix LG be

    0 = λ1 < λ2 ≤ ··· ≤ λN

with corresponding orthonormal eigenvectors vi, i = 1, ..., N. Assume that the initial approximation x0 has the direction vN represented, i.e.

    x0 = ∑_{i=1}^{N} αi vi,  where αN ≠ 0.      (42)

If k → ∞, then xk converges to the vector vN – the eigenvector corresponding to the largest eigenvalue λN of the matrix LG.

Proof. Exercise. (Using (42) and (40) write out LG^k x0. Taking λN in front of the summation, show that the multipliers in front of vi, i = 1, ..., N−1 tend to zero.)


• Assuming that σ is not an eigenvalue of LG, then replacing the matrix LG with the matrix (LG − σI)^(−1),

  – we get an algorithm that converges to the dominant eigenvector of the matrix (LG − σI)^(−1).

  – This is the eigenvector corresponding to the eigenvalue of LG closest to σ.


11.1.6 Inverse Power Method

Choose x0 as an initial approximation vector
Choose σ which is close, but not equal, to the eigenvalue we are searching for
for k = 0, 1, ...
    solve (LG − σI) yk = xk  for yk
    xk+1 := yk / ‖yk‖2     # normalisation

In our case, if we choose σ < 0, convergence will be towards e, the eigenvector corresponding to the eigenvalue λ1 = 0.


To find the eigenvector corresponding to the smallest positive eigenvalue of LG, we can use the inverse power method while at the same time ensuring that the iteration vector xk has no component in the direction of e. For that we notice that

    P(x) = x − ( (e^T x) / (e^T e) ) e

makes the vector x orthogonal to the vector e.


11.1.7 Inverse Power Method with Projection

Choose x0 as an initial approximation vector
Choose σ which is close, but not equal, to the eigenvalue we are searching for
Normalise x0 := x0 / ‖x0‖2
Apply the projection x0 ← P(x0)
for k = 0, 1, ...
    solve (LG − σI) yk = xk  for yk
    apply the projection yk ← P(yk)
    normalise xk+1 := yk / ‖yk‖2

Remark. The number

    R(xk) = xk^T LG xk / xk^T xk

is called the Rayleigh Quotient. In our case xk^T xk = 1, and then


    R(xk) = xk^T LG xk.

• If xk converges to an eigenvector vi of LG, then the Rayleigh Quotient R(xk) converges to the corresponding eigenvalue λi.

• The Rayleigh Quotient makes it possible to check how good an approximate eigenpair (λi, vi) actually is.

  – For this purpose one can calculate the residual norm ‖LG vi − R(vi) vi‖2.
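Putting §11.1.5–11.1.7 together, a hedged NumPy sketch of the whole procedure could look as follows; a dense direct solve stands in for the conjugate gradient solver of the cited paper, and the shift σ = −0.1 is an arbitrary illustrative choice.

import numpy as np

def fiedler_vector(L, sigma=-0.1, iters=50):
    # Inverse power method with projection: approximates the eigenvector of
    # the smallest positive eigenvalue lambda_2 of L (assumes sigma < 0 is
    # not an eigenvalue).  A dense direct solve stands in here for the
    # conjugate gradient solver used in the cited paper.
    n = L.shape[0]
    e = np.ones(n)
    def P(v):                                   # projection orthogonal to e
        return v - (np.dot(e, v) / np.dot(e, e)) * e
    x = P(np.random.rand(n))
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = np.linalg.solve(L - sigma * np.eye(n), x)
        y = P(y)
        x = y / np.linalg.norm(y)
    rq = np.dot(x, L.dot(x))                    # Rayleigh quotient, ||x||_2 = 1
    return x, rq

# bisection by the arithmetic average of the eigenvector components, e.g.:
# x, lam2 = fiedler_vector(L_G)
# first_set  = np.where(x <  x.mean())[0]
# second_set = np.where(x >= x.mean())[0]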


12 Parallel Computing

12.1 Motivation

Why do we need ever more computing power?

Much-cited legend: in 1947 computer engineer Howard Aiken said that the USA will need at most 6 computers in the future!

Gordon Moore's (founder of Intel) law:

• 1965: the number of switches doubles every second year

• 1975 refinement: the number of switches on a CPU doubles every 18 months

Continuing this way, by 2020 or 2030 we would reach the atomic level (or the quantum computer).


Flops:

    first computers           10^2   (100 Flops)
    desktop computers today   10^9   (Gigaflops, GFlops)
    supercomputers nowadays   10^12  (Teraflops, TFlops)
    the aim of today          10^15  (Petaflops, PFlops)
    next step                 10^18  (Exaflops, EFlops)

• How can parallel processing help to achieve these goals?

• Why should we talk about it in the Scientific Computing course?

Example 12.1. Why are speeds of order 10^12 not enough?

Weather forecast in Europe for the next 48 hours, from sea level up to 20 km – we need to solve a PDE in 3D (xyz and t).

The volume of the region: 5000 ∗ 4000 ∗ 20 km^3. Stepsize in the xy-plane: 1 km (Cartesian mesh); z-direction: 0.1 km. Timestep: 1 min. Around 1000 flop per mesh point per timestep. As a result, ca


    5000 ∗ 4000 ∗ 20 km^3 × 10 cells per km^3 = 4 ∗ 10^9 mesh points

and

    4 ∗ 10^9 mesh points × 48 ∗ 60 timesteps × 10^3 flop ≈ 1.15 ∗ 10^15 flop.

If a computer can do 10^9 flops, it will take around

    1.15 ∗ 10^15 flop / 10^9 flops = 1.15 ∗ 10^6 seconds ≈ 13 days !!

But with 10^12 flops: 1.15 ∗ 10^3 seconds < 20 min.

It is not hard to imagine "small" changes in the given scheme such that 10^12 flops is not enough:

• Reduce the mesh stepsize to 0.1 km in each direction and the timestep to 10 seconds, and the total time would grow from 20 min to 8 days.

• We could replace the Europe model with a whole-World model (the area: 2 ∗ 10^7 km^2 −→ 5 ∗ 10^8 km^2) and combine the model with an ocean model.

Therefore, we must say: the needs of science and technology grow faster than the available possibilities – one only needs to change ε and h to get unsatisfied again!

But again, why do we need a parallel computer to achieve this goal?


Example 12.2. Have a look at the following piece of code:

do i=1,1000000000000
   z(i) = x(i) + y(i)    ! i.e. 3 ∗ 10^12 memory accesses
end do

Assuming that data travels with the speed of light, 3 ∗ 10^8 m/s, for finishing the operation in 1 second the average distance of a memory cell must be

    r = (3 ∗ 10^8 m/s × 1 s) / (3 ∗ 10^12) = 10^−4 m.

Typical computer memory is laid out as a rectangular mesh. The length of each edge would be

    2 ∗ 10^−4 m / √(3 ∗ 10^12) ≈ 10^−10 m,

– the size of a small atom!
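A quick Python check of this estimate (not part of the original notes):

# Quick check of the numbers above
c = 3e8                     # speed of light, m/s
accesses = 3e12             # memory accesses within 1 second
r = c * 1.0 / accesses
print(r)                    # 1e-4 m: allowed average distance of a memory cell
edge = 2 * r / (3e12) ** 0.5
print(edge)                 # ~1.15e-10 m: cell spacing in a mesh of 3e12 cells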


Why is parallel computing not yet the answer to everything?

Three main reasons: hardware, algorithms and software.

Hardware: the speed of

• networking

• peripheral devices

• memory access

does not keep up with the growth in processor capacity.

Algorithms: an enormous number of different algorithms exist, but

• problems start with their implementation in some real-life application

Software: development in progress;

• no good enough automatic parallelisers

• everything is done by hand

• no easily portable libraries for different architectures

• does the parallelising effort pay off at all?


12.2 Classification of Parallel Computers

• Architecture

– Single processor computer

– Multicore processor

– distributed system

– shared memory system

• Network topology

– ring, array, hypercube, . . .

• Memory access

– shared , distributed , hybrid

• operating system

– UNIX,

– LINUX,

– (OpenMosix)

– WIN*

• scope

– supercomputing

– distributed computing

– real-time systems

– mobile systems

– grid and cloud computing

– etc


But it is impossible to ignore (implicit or explicit) parallelism in a computer or a set of computers.

List of top parallel computer systems (http://www.top500.org).

12.3 Parallel Computational Models

• Data parallelism: e.g. vector processor

• Control Parallelism: e.g. shared memory

– Each process has access to all data. Access to this data is coordinated by some form of locking procedure

• Message Passing: e.g. MPI

– Independent processes with independent memory. Communication between the processes via sending messages.


• Other models:

– Hybrid versions of the above models,

– remote memory operations (“one-sided” sends and receives)

13 Parallel programming models

Many different models for expressing parallelism in programming languages

• Actor model

– Erlang

– Scala

• Coordination languages

– Linda

• CSP-based (Communicating Sequential Processes)

– FortranM

– Occam


• Dataflow

– SISAL (Streams and Iteration in a Single Assignment Language)

• Distributed

– Sequoia

– Bloom

• Event-driven and hardware description

– Verilog hardware description language (HDL)

• Functional

– Concurrent Haskell

• GPU languages

– CUDA

– OpenCL

– OpenACC

– OpenHMPP (HMPP for Hybrid Multicore Parallel Programming)

• Logic programming

– Parlog

• Multi-threaded

– Clojure


• Object-oriented

– Charm++

– Smalltalk

• Message-passing

– MPI

– PVM

• Partitioned global address space(PGAS)

– High Performance Fortran(HPF)


13.1 HPF

Partitioned global address space parallel programming model; a Fortran90 extension.

• SPMD (Single Program Multiple Data) model

• each process operates with its own part of data

• HPF commands specify which processor gets which part of the data

• Concurrency is defined by HPF commands based on Fortran90

• HPF directives as comments:!HPF$ <directive>


HPF example: matrix multiplication

PROGRAM ABmult
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION (N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
!     replicate copies of row A(i,*) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
!     replicate copies of column B(*,j) onto processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
      ! All the work is local due to ALIGNs
      C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
    END DO
  END DO
  WRITE(*,*) C
END

HPF programming methodology

• Need to find a balance between concurrency and communication

• the more processes, the more communication

• aiming at

  – a balanced load based on the owner-computes rule

  – data locality

It is easy to write a program in HPF but difficult to gain good efficiency. Programming in HPF goes more or less like this:


1. Write a correctly working serial program, test and debug it

2. Add distribution directives, introducing as little communication as possible


13.2 OpenMP

OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openMP/)

• Programming model based on thread parallelism

• Helping tool for a programmer

• Based on compiler directives; C/C++, Fortran

• Nested parallelism is supported (though not all implementations support it)

• Dynamic threads

• OpenMP has become a standard


• Fork-Join model


OpenMP Example: Matrix Multiplication:

! http://www.llnl.gov/computing/tutorials/openMP/exercise.html
C******************************************************************************
C OpenMP Example - Matrix Multiply - Fortran Version
C FILE: omp_mm.f
C DESCRIPTION:
C   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
C   according to a predefined chunk size.
C LAST REVISED: 1/5/04  Blaise Barney
C******************************************************************************

      PROGRAM MATMULT

      INTEGER NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK,
     +        OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
C     number of rows in matrix A
      PARAMETER (NRA=62)
C     number of columns in matrix A
      PARAMETER (NCA=15)
C     number of columns in matrix B
      PARAMETER (NCB=7)

      REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)

C     Set loop iteration chunk size
      CHUNK = 10

C     Spawn a parallel region explicitly scoping all variables
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Starting matrix multiple example with', NTHREADS,
     +           'threads'
        PRINT *, 'Initializing matrices'
      END IF

C     Initialize matrices
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 30 I=1, NRA
      DO 30 J=1, NCA
        A(I,J) = (I-1)+(J-1)
   30 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 40 I=1, NCA
      DO 40 J=1, NCB
        B(I,J) = (I-1)*(J-1)
   40 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 50 I=1, NRA
      DO 50 J=1, NCB
        C(I,J) = 0
   50 CONTINUE

C     Do matrix multiply sharing iterations on outer loop
C     Display who does which iterations for demonstration purposes
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
        PRINT *, 'Thread', TID, 'did row', I
      DO 60 J=1, NCB
      DO 60 K=1, NCA
        C(I,J) = C(I,J) + A(I,K) * B(K,J)
   60 CONTINUE

C     End of parallel region
!$OMP END PARALLEL

C     Print results
      PRINT *, '******************************************************'
      PRINT *, 'Result Matrix:'
      DO 90 I=1, NRA
        DO 80 J=1, NCB
          WRITE(*,70) C(I,J)
   70     FORMAT(2x,f8.2,$)
   80   CONTINUE
        PRINT *, ' '
   90 CONTINUE
      PRINT *, '******************************************************'
      PRINT *, 'Done.'

      END


13.3 GPGPU

General-purpose computing on graphics processing units

13.3.1 CUDA

• Parallel programming platform and model created by NVIDIA

• CUDA gives access to

– virtual instruction set

– memory on parallel computational elements on GPUs

• based on executing a large numberof threads simultaneously

• CUDA-accelerated libraries

• CUDA-extended programming languages (compiled with nvcc)

– C, C++, Fortran


CUDA Fortran Matrix Multiplication example (for more information, see http://www.pgroup.com/lit/articles/insider/v1n3a2.htm)

Host (CPU) code:

subroutine mmul( A, B, C )
  use cudafor
  real, dimension(:,:) :: A, B, C
  integer :: N, M, L
  real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
  type(dim3) :: dimGrid, dimBlock
  N = size(A,1); M = size(A,2); L = size(B,2)
  allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
  Adev = A(1:N,1:M)
  Bdev = B(1:M,1:L)
  dimGrid = dim3( N/16, L/16, 1 )
  dimBlock = dim3( 16, 16, 1 )
  call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
  C(1:N,1:L) = Cdev
  deallocate( Adev, Bdev, Cdev )
end subroutine


GPU code:

attributes(global) subroutine MMUL_KERNEL( A, B, C, N, M, L )
  real, device :: A(N,M), B(M,L), C(N,L)
  integer, value :: N, M, L
  integer :: i, j, kb, k, tx, ty
  real, shared :: Ab(16,16), Bb(16,16)
  real :: Cij
  tx = threadidx%x; ty = threadidx%y
  i = (blockidx%x-1) * 16 + tx
  j = (blockidx%y-1) * 16 + ty
  Cij = 0.0
  do kb = 1, M, 16
    ! Fetch one element each into Ab and Bb; note that 16x16 = 256 threads
    ! in this thread block are fetching separate elements of Ab and Bb
    Ab(tx,ty) = A(i,kb+ty-1)
    Bb(tx,ty) = B(kb+tx-1,j)
    ! Wait until all elements of Ab and Bb are filled
    call syncthreads()
    do k = 1, 16
      Cij = Cij + Ab(tx,k) * Bb(k,ty)
    enddo
    ! Wait until all threads in the thread block finish with this
    ! iteration's Ab and Bb
    call syncthreads()
  enddo
  C(i,j) = Cij
end subroutine

13.3.2 OpenCL

Open Computing Language (OpenCL):

• Heterogeneous systems of

– CPUs (central processing units)

– GPUs (graphics processing units)

– DSPs (digital signal processors)

– FPGAs (field-programmable gate arrays)

– and other processors

• language (based on C99)

– kernels (functions that execute on OpenCL devices)


– plus application programming interfaces (APIs)

– ( fortrancl )

• parallel computing using

– task-based and data-based parallelism

– GPGPU

• open standard by Khronos Group (Apple, AMD, IBM, Intel and Nvidia)

– + adopted by Altera, Samsung, Vivante and ARM Holdings

Example: Matrix Multiplication


/* kernel.cl
 * Matrix multiplication: C = A * B.  (Device code)
 * (http://gpgpu-computing4.blogspot.co.uk/2009/09/matrix-multiplication-2-opencl.html)
 */
// OpenCL Kernel
__kernel void
matrixMul( __global float* C,
           __global float* A,
           __global float* B,
           int wA, int wB)
{
   // 2D Thread ID
   // Old CUDA code
   // int tx = blockIdx.x * TILE_SIZE + threadIdx.x;
   // int ty = blockIdx.y * TILE_SIZE + threadIdx.y;
   int tx = get_global_id(0);
   int ty = get_global_id(1);

   // value stores the element that is computed by the thread
   float value = 0;
   for (int k = 0; k < wA; ++k)
   {
      float elementA = A[ty * wA + k];
      float elementB = B[k * wB + tx];
      value += elementA * elementB;
   }

   // Write the matrix to device memory; each thread writes one element
   C[ty * wA + tx] = value;
}

13.3.3 OpenACC

• standard (Cray, CAPS, Nvidia and PGI) to simplify parallel programming of heterogeneous CPU/GPU systems

• uses annotations (like OpenMP) in C, C++ and Fortran source code

• annotated regions are offloaded from the CPU to the GPU automatically

• OpenACC to merge into OpenMP in a future release of OpenMP


! A simple OpenACC kernel for Matrix Multiplication
!$acc kernels
do k = 1, n1
   do i = 1, n3
      c(i,k) = 0.0
      do j = 1, n2
         c(i,k) = c(i,k) + a(i,j) * b(j,k)
      enddo
   enddo
enddo
!$acc end kernels

! http://www.bu.edu/tech/research/computation/linux-cluster/katana-cluster/gpu-computing/openacc-fortran/matrix-multiply-fortran/

program matrix_multiply
   use omp_lib
   use openacc
   implicit none
   integer :: i, j, k, myid, m, n, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs

   open(fd, file='wallclocktime', form='formatted')
   option = compiled_for(fd)   ! 1-serial, 2-OpenMP, 3-OpenACC, 4-both

!$omp parallel
!$ myid = OMP_GET_THREAD_NUM()
!$ if (myid .eq. 0) then
!$    write(fd,"('Number of procs is ',i4)") OMP_GET_NUM_THREADS()
!$ endif
!$omp end parallel

   call system_clock(count_max=count_max, count_rate=count_rate)

   do m = 1, 4   ! compute for different size matrix multiplies
      call system_clock(t1)
      n = 1000*2**(m-1)   ! 1000, 2000, 4000, 8000
      allocate( a(n,n), b(n,n), c(n,n) )

      ! Initialize matrices
      do j = 1, n
         do i = 1, n
            a(i,j) = real(i + j)
            b(i,j) = real(i - j)
         enddo
      enddo

!$omp parallel do shared(a,b,c,n,tmp) reduction(+: tmp)
!$acc data copyin(a,b) copy(c)
!$acc kernels
      ! Compute matrix multiplication.
      do j = 1, n
         do i = 1, n
            tmp = 0.0   ! enables ACC parallelism for k-loop
            do k = 1, n
               tmp = tmp + a(i,k) * b(k,j)
            enddo
            c(i,j) = tmp
         enddo
      enddo
!$acc end kernels
!$acc end data
!$omp end parallel do

      call system_clock(t2)
      dt = t2 - t1
      secs = real(dt)/real(count_rate)
      write(fd,"('For n = ',i4,', wall clock time is ',f12.2,' seconds')") &
            n, secs
      deallocate(a, b, c)
   enddo
   close(fd)
end program matrix_multiply

integer function compiled_for(fd)
   implicit none
   integer :: fd
#if defined _OPENMP && defined _OPENACC
   compiled_for = 4
   write(fd,"('This code is compiled with OpenMP & OpenACC')")
#elif defined _OPENACC
   compiled_for = 3
   write(fd,"('This code is compiled with OpenACC')")
#elif defined _OPENMP
   compiled_for = 2
   write(fd,"('This code is compiled with OpenMP')")
#else
   compiled_for = 1
   write(fd,"('This code is compiled for serial operations')")
#endif
end function compiled_for


13.4 Automatic parallelisation

Automatic parallelisation for the general case is too difficult! (at least for now)

DSL – Domain-Specific Language

• experimental work at Stanford PPL (Pervasive Parallelism Lab, http://ppl.stanford.edu/main/)

– virtualized Scala allows embedded DSLs

– Delite - compiler and parallel runtime


– some DSLs:

* Liszt – a DSL for solving mesh-based PDEs

* OptiML - a DSL for machine learning

Automatic parallelisation of computing-intensive vector codes

We are developing a DSL, which is based on Virtualized Scala

• ZPara (Oleg Batrašev) - DSL for expressing iterative solvers

– sequential code with various array operations


• given the code and initial distribution of arrays

– scan the program

– decide how rest of the arrays are distributed

– decide where and how ghost values are needed

– localize indices, insert communication code

– generate parallelized code, run on MPI


14 MPI (The Message Passing Interface)

14.1 Concept

What is MPI?

• A message-passing library specification

– extended message-passing model

– not a language or compiler specification

– not a specific implementation or product

• For parallel computers, clusters, and heterogeneous networks

• Full-featured

• Designed to provide access to advanced parallel hardware for end users, library writers, and tool developers


MPI as a STANDARD

Goals of the MPI standard. MPI's prime goals are:

• To provide source-code portability

• To allow efficient implementations

MPI also offers:

• A great deal of functionality

• Support for heterogeneous parallel architectures


4 types of MPI calls

1. Calls used to initialize, manage, and terminate communications

2. Calls used to communicate between pairs of processors. (Pair communication)

3. Calls used to communicate among groups of processors. (Collective communication)

4. Calls to create data types.


14.2 MPI basic subroutines (functions)

MPI_Init: initialise MPI

MPI_Comm_size: how many processes (PEs) are there?

MPI_Comm_rank: identify the process (PE)

MPI_Send

MPI_Recv

MPI_Finalize: close MPI

Example (Fortran90) 11.1 Greetings (http://www.ut.ee/~eero/SC/konspekt/Naited/greetings.f90.html)
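The linked example is in Fortran90; a rough mpi4py analogue of the same "greetings" pattern (an illustrative sketch, not the original code):

# greetings.py -- run with e.g.:  mpirun -np 4 python greetings.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                  # cf. MPI_Comm_rank
size = comm.Get_size()                  # cf. MPI_Comm_size

if rank != 0:
    # every non-root process sends a greeting to process 0
    comm.send("Greetings from process %d!" % rank, dest=0, tag=1)
else:
    for source in range(1, size):
        print(comm.recv(source=source, tag=1))   # cf. MPI_Recv
# MPI_Init / MPI_Finalize are handled automatically by mpi4py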


• Wildcards exist:

– MPI_ANY_SOURCE

– MPI_ANY_TAG

MPI message types

MPI_CHARACTER

MPI_INTEGER

MPI_REAL

MPI_DOUBLE_PRECISION

MPI blocking / non-blocking commands

• blocking - program execution is stopped until return from the call

• non-blocking - the operation is performed in the background

As a rule, all collective commands are blocking


14.3 Blocking MPI commands

• MPI_Barrier(comm,ierr)

• MPI_Bcast(buffer,len,type,root,comm,ierr) – process root sends each process in comm a message


• MPI_Reduce(sndbuf,rcvbuf,len,type,MPI_op,root,comm,ierr)

– collect all values of sndbuf to root performing operation op.

– (rcvbuf value is evaluated only on root...)

The operation can be:

    MPI_op     Operation (op)
    MPI_MAX    maximum
    MPI_MIN    minimum
    MPI_PROD   product
    MPI_SUM    sum
    MPI_LAND   logical .AND.
    MPI_LOR    logical .OR.
    MPI_LXOR   logical .XOR.
    MPI_BAND   bitwise .AND.
    MPI_BOR    bitwise .OR.
    MPI_BXOR   bitwise .XOR.


• MPI_Allreduce(sndbuf,rcvbuf,len,type,op,comm,ierr) – as above, but rcvbuf is evaluated on all processes in comm

• MPI_Gather(sndbuf,sndlen,<MPI_send_type>,

rcvbuf,recvlen,<MPI_recv_type>,root,comm,ierr)

• MPI_Allgather(sndbuf,sndlen,<MPI_send_type>,

rcvbuf,recvlen,<MPI_recv_type>,comm,ierr)

[Figure: MPI_Gather collects the pieces from processes P(0)–P(3) onto the root process; MPI_Allgather delivers the gathered result to all processes P(0)–P(3).]


The opposite of MPI_Gather is MPI_Scatter:

• MPI_Scatter(sndbuf,sndlen,<MPI_send_type>,
              rcvbuf,recvlen,<MPI_recv_type>,root,comm,ierr)

In the computer class we use MPI for Python: http://mpi4py.scipy.org/docs/usrman/tutorial.html

Note that most of the MPI example code was given in Fortran90, but it can easily be translated to mpi4py.
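For instance, a hedged mpi4py sketch of the collective calls above, using NumPy buffers as in the computer classes (the names and array sizes are made up for the example):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

# MPI_Bcast: the root distributes an array to every process
buf = np.arange(4, dtype='d') if rank == 0 else np.empty(4, dtype='d')
comm.Bcast(buf, root=0)

# MPI_Reduce with op=MPI.SUM: local values are summed onto the root
local = np.array([float(rank)])
total = np.empty(1) if rank == 0 else None
comm.Reduce(local, total, op=MPI.SUM, root=0)

# MPI_Scatter / MPI_Gather: split an array over processes and collect it back
chunks = np.arange(2 * size, dtype='d') if rank == 0 else None
mine = np.empty(2, dtype='d')
comm.Scatter(chunks, mine, root=0)
back = np.empty(2 * size, dtype='d') if rank == 0 else None
comm.Gather(mine, back, root=0)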


MPI: different ways to communicate

• MPI has different "send modes":

  – MPI_SSEND: synchronous send: returns control only when the whole message has been received

  – MPI_ISEND: non-blocking: starts the communication and returns control

  – MPI_BSEND: buffered send: creates a buffer, copies the data and returns control

• In the same way there are different MPI receive modes:

  – MPI_IRECV etc...


14.4 Non-Blocking Send and Receive, Avoiding Deadlocks

• Non-blocking communication allows separating the initiation of the communication from its completion

Advantages: between initiation and completion the program can do other useful computation (latency hiding)

Disadvantages: the programmer has to insert code to check for completion

• Non-blocking commands start communication in the background:

  – MPI_ISEND    – MPI_IRECV

• For testing the completion there are 2 options:

  – blocking MPI_WAIT    – non-blocking MPI_TEST

All non-blocking commands have an additional parameter request for checking the message status.


14.4.1 Unidirectional communication between two processors

In the following examples we use:

from mpi4py import MPI

import numpy as np

size = MPI.COMM_WORLD.size

rank = MPI.COMM_WORLD.rank

comm = MPI.COMM_WORLD

len = 100

data = np.arange(len,dtype=float) # (or similar)

tag = 99

...

In principle, there are 4 possibilities:


Blocking send and blocking receive

if rank==0:

print "[0] Sending: ", data

comm.Send([data, MPI.FLOAT], 1, tag)

else:

print "[1] Receiving..."

comm.Recv([data, MPI.FLOAT], 0, tag)

print "[1] Data: ", data


Non-blocking send and blocking receive

if rank==0:

print "[0] Sending: ", data

request = comm.Isend([data, MPI.FLOAT], 1, tag)

... # calculate or do something useful...

request.Wait()

else:

print "[1] Receiving..."

comm.Recv([data, MPI.FLOAT], 0, tag)

print "[1] Data: ", data


Blocking send and non-blocking receive

if rank==0:

print "[0] Sending: ", data

comm.Send([data, MPI.FLOAT], 1, tag)

else:

print "[1] Receiving..."

request=comm.Irecv([data, MPI.FLOAT], 0, tag)

... # calculate or do something useful...

request.Wait()

print "[1] Data: ", data


Non-blocking send and non-blocking receive

if rank==0:

print "[0] Sending: ", data

request = comm.Isend([data, MPI.FLOAT], 1, tag)

... # calculate or do something useful...

request.Wait()

else:

print "[1] Receiving..."

request=comm.Irecv([data, MPI.FLOAT], 0, tag)

... # calculate or do something useful...

request.Wait()

print "[1] Data: ", dataThe command request.Wait() can be put in arbitrary place in program be-

fore next reference of data


14.4.2 Mutual communication and avoiding deadlock

Non-blocking operations can also be used for avoiding deadlocks.

Deadlock is a situation where processes wait for each other without any of them being able to do anything useful. Deadlocks can occur:

• due to a wrong ordering of send and receive calls

• due to system send-buffer fill-up

In the case of mutual communication there are 3 possibilities:

1. Both processes start with send followed by receive

2. Both processes start with receive followed by send

3. One process starts with send followed by receive, the other vice versa

Depending on blocking there are different possibilities:


1. Send followed by receive. Analyse the following code:

if rank==0:

comm.Send([sendbuf, MPI.FLOAT], 1, tag)

comm.Recv([recvbuf, MPI.FLOAT], 1, tag)

else:

comm.Send([sendbuf, MPI.FLOAT], 0, tag)

comm.Recv([recvbuf, MPI.FLOAT], 0, tag)

Does this work? It works OK for small messages only (if sendbuf is smaller than the system message send buffer).

But what about large messages? Large messages produce a deadlock. Why? Would using Irecv help out?


The following is deadlock-free:

if rank==0:

request=comm.Isend([sendbuf, MPI.FLOAT], 1, tag)

comm.Recv([recvbuf, MPI.FLOAT], 1, tag)

request.Wait()

else:

request=comm.Isend([sendbuf, MPI.FLOAT], 0, tag)

comm.Recv([recvbuf, MPI.FLOAT], 0, tag)

request.Wait()

Question: Why can Wait() not follow right after Isend(...)?


2. Receive followed by send

if rank==0:

comm.Recv([recvbuf, MPI.FLOAT], 1, tag)

comm.Send([sendbuf, MPI.FLOAT], 1, tag)

else:

comm.Recv([recvbuf, MPI.FLOAT], 0, tag)

comm.Send([sendbuf, MPI.FLOAT], 0, tag)

Is this OK? This produces a deadlock with any system message buffer size.


But have a look at:

if rank==0:

request=comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)

comm.Send([sendbuf, MPI.FLOAT], 1, tag)

request.Wait()

else:

request=comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)

comm.Send([sendbuf, MPI.FLOAT], 0, tag)

request.Wait()

Is it deadlock-free? (Yes, this is deadlock-free.)


3. One process starts with send, the other one with receive

if rank==0:

comm.Send([sendbuf, MPI.FLOAT], 1, tag)

comm.Recv([recvbuf, MPI.FLOAT], 1, tag)

else:

comm.Recv([recvbuf, MPI.FLOAT], 0, tag)

comm.Send([sendbuf, MPI.FLOAT], 0, tag)

Could we use non-blocking commands instead? (Non-blocking commands can be used in whichever call here as well.)


Generally, the following communication pattern is advised:

if rank==0:

req1=comm.Isend([sendbuf, MPI.FLOAT], 1, tag)

req2=comm.Irecv([recvbuf, MPI.FLOAT], 1, tag)

else:

req1=comm.Isend([sendbuf, MPI.FLOAT], 0, tag)

req2=comm.Irecv([recvbuf, MPI.FLOAT], 0, tag)

req1.Wait()

req2.Wait()


15 Design and Evaluation of Parallel Programs

15.1 Two main approaches in parallel program design

• Data-parallel approach

  – Data is divided between processes so that each process does the same operations but with different data

• Control-parallel approach

  – Each process has access to all pieces of data, but they perform different operations on them

The majority of parallel programs use a mix of the above. In this course the data-parallel approach dominates.


What obstacles can be encountered?

• Data dependencies

Example. We need to solve a 2×2 system of linear equations

    [ a  0 ] [ x ]   [ f ]
    [ b  c ] [ y ] = [ g ]

Data partitioning: the first row goes to process 0, the second row to process 1.

Isn't it easy! −→ NOT!

Write it out as a system of two equations:

    ax = f
    bx + cy = g.

Computation of y on process 1 depends on x computed on process 0. Therefore we need:


• Communication (more expensive than arithmetic operations or memory references)

• Synchronisation (processes that are just waiting are useless)

• Extra work (needed to avoid communication or to write a parallel program)

Which way to go?

• Divide data evenly between machines −→ load balancing

• Reducing data dependencies −→ good partitioning (geometric, for example)

• Make communication-free parts of the calculations as large as possible −→ algorithms with large granularity

The given requirements are not easily achievable. Some problems are easily parallelisable, others not.


15.2 Assessing parallel programs

15.2.1 Speedup

    S(N,P) := tseq(N) / tpar(N,P)

• tseq(N) – time for solving the same problem with the best known sequential algorithm

  – =⇒ often the sequential algorithm differs from the parallel one!

  – If the same parallel algorithm is timed on one processor (despite the existence of a better sequential algorithm), the corresponding ratio is called relative speedup

• 0 < S(N,P) ≤ P

• If S(N,P) = P – the parallel program has linear or optimal speedup. (Example: calculating π with a parallel quadrature formula)


• Sometimes it may happen that S(N,P) > P. How is it possible? Due to processor cache effects – this is called superlinear speedup

• But sometimes S(N,P) < 1 – what does this mean? Slowdown instead of speedup!

15.2.2 Efficiency

    E(N,P) := tseq(N) / ( P · tpar(N,P) ).

Presumably, 0 < E(N,P) ≤ 1.


15.2.3 Amdahl's law

Each algorithm has some part(s) that cannot be parallelised.

• Let σ (0 < σ ≤ 1) denote the sequential part of a parallel program that cannot be parallelised

• Assume the remaining 1−σ part is parallelised optimally, =⇒

    S(N,P) = tseq / ( (σ + (1−σ)/P) tseq ) = 1 / (σ + (1−σ)/P) ≤ 1/σ.

If, e.g., 5% of the algorithm is not parallelisable (i.e. σ = 0.05), we get:

    P      S(N,P)
    2      1.9
    4      3.5
    10     6.9
    20     10.3
    100    16.8
    ∞      20

=⇒ using a large number of processors seems useless for gaining any reasonable speedup increase!
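The table can be reproduced with a few lines of Python (a small check, not part of the original notes):

def amdahl_speedup(sigma, P):
    # Speedup bound 1/(sigma + (1 - sigma)/P) for sequential fraction sigma
    return 1.0 / (sigma + (1.0 - sigma) / P)

for P in (2, 4, 10, 20, 100):
    print(P, round(amdahl_speedup(0.05, P), 1))   # 1.9, 3.5, 6.9, 10.3, 16.8
print("limit for P -> infinity:", 1 / 0.05)       # 20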


15.2.4 Validity of Amdahl’s law

John Gustavson & Ed Barsis (Scania Laboratory):1024-processor nCube/10 bet the Amdahl’s law! Is it possible at all?In their problem they had:σ ≈ 0.004...0.008Got S≈ 1000(According to Amdahl’s law S should have been only 125...250 !!!)Does Amdahl’s law hold?Mathematically, Amdahl’s law holds, of courseDoes it make sense to solve a problem with fixed problem size N on arbitrarily

large number of processors? (What if only 1000 operations at all in the whole calcu-lation – does it helpusing 1001 processors? ;-)

• The point is, that usually σ is not constant, but is reducing with N growing

• It is said that algorithm is efficiently parallel, if σ → 0 with N→ ∞


To avoid problems with the terms, often the term scaled efficiency is used.

15.2.5 Scaled efficiency

    ES(N,P) := tseq(N) / tpar(P·N, P)

• Does the solution time remain the same when the problem size changes?

• 0 < ES(N,P) ≤ 1

• If ES(N,P) = 1, we have linear speedup


16 Parallel preconditioners

16.1 Comparing different solution algorithms

Poisson problem; sequential time [number of iterations] for n = 200 and n = 400, and parallel time with P = 8 for n = 400:

    Method      Memory  Flops     n=200              n=400        P=8, n=400  Speedup
    LU          O(n^4)  O(n^6)    -                  -            -           -
    Banded*)    O(n^3)  O(n^4)    3.6s        x25.1  90.4s        92.4s       0.98
    Jacobi      O(n^2)  O(n^4)    >1000s             >1000s       >1000s      -
    CG          O(n^2)  O(n^3)    3.8s [357]  x7.9   30.1s [702]  6.1s        4.9
    PCG(ILU)    O(n^2)  O(n^3)    2.9s [147]  x6.8   19.6s [249]  3.9s        5.0
    DOUG*)      O(n^2)  O(n^7/3)  6.5s [15]   x4.9   32.1s [17]   4.3s        7.5
    BoomerAMG   O(n^2)  O(n^2)    1.8s [3]    x4     7.3s [3]     4.1s        1.8

*) Time taken on a 1.5-2 times slower machine

• In all cases the stopping criterion was chosen as ε := 10^−8


• BoomerAMG is a parallel multigrid algorithm (Lawrence Livermore Nat. Lab., CA, Hypre http://www.llnl.gov/CASC/hypre/). For the Poisson equation the multigrid preconditioner is optimal (regarding flops and the number of unknowns)

• DOUG – Domain Decomposition on Unstructured Grids (University of Tartu and University of Bath (UK)) http://dougdevel.org


16.2 PCG

Recall the preconditioned CG method:

Calculate r(0) = b − A x(0) with given starting vector x(0)
for i = 1, 2, ...
    solve M z(i−1) = r(i−1)            # M−1 is called the preconditioner
    ρ(i−1) = r(i−1)T z(i−1)
    if i == 1
        p(1) = z(0)
    else
        β(i−1) = ρ(i−1)/ρ(i−2)
        p(i) = z(i−1) + β(i−1) p(i−1)
    endif
    q(i) = A p(i);  α(i) = ρ(i−1) / p(i)T q(i)
    x(i) = x(i−1) + α(i) p(i);  r(i) = r(i−1) − α(i) q(i)
    check convergence; continue if needed
end
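A compact NumPy sketch of this pseudocode (assuming A supports the dot product and M_solve applies the preconditioner M−1; both names are placeholders):

import numpy as np

def pcg(A, b, M_solve, x0=None, tol=1e-8, maxit=1000):
    # Preconditioned CG following the pseudocode above.
    # M_solve(r) must return z with M z = r (application of M^-1).
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A.dot(x)
    rho_old, p = 0.0, None
    for i in range(1, maxit + 1):
        z = M_solve(r)
        rho = np.dot(r, z)
        if i == 1:
            p = z.copy()
        else:
            p = z + (rho / rho_old) * p
        q = A.dot(p)
        alpha = rho / np.dot(p, q)
        x = x + alpha * p
        r = r - alpha * q
        rho_old = rho
        if np.linalg.norm(r) < tol * np.linalg.norm(b):   # convergence check
            break
    return x, i

# example use with a Jacobi (diagonal) preconditioner:
# x, its = pcg(A, b, lambda r: r / np.diag(A))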


16.3 Block-Jacobi method

The preconditioner is block-diagonal, with one block per subdomain Ω1, Ω2:

    M = [ M1  0  ]
        [ 0   M2 ]

where M1 and M2 are preconditioners for the diagonal blocks A11 and A22 of the matrix

    A = [ A11  A12 ]
        [ A21  A22 ]

(for example, M1 = A11^−1, M2 = A22^−1).
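A minimal NumPy sketch of applying such a Block-Jacobi preconditioner (the index sets below are hypothetical); it can be plugged into the pcg sketch above as M_solve:

import numpy as np

def block_jacobi_solve(A, r, blocks):
    # Apply the Block-Jacobi preconditioner: solve A_ii z_i = r_i on every
    # index block independently (blocks = list of index arrays, one per subdomain).
    z = np.zeros_like(r)
    for idx in blocks:
        z[idx] = np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
    return z

# e.g. two equally sized subdomains for an n-by-n matrix A:
# n = A.shape[0]
# blocks = [np.arange(0, n // 2), np.arange(n // 2, n)]
# z = block_jacobi_solve(A, r, blocks)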


16.4 Overlapping Additive Schwarz method

Here the blocks M1 and M2 correspond to overlapping subdomains Ω1 and Ω2, so the diagonal blocks of M overlap as well:

    M = [ M1      ]
        [     M2  ]    (overlapping diagonal blocks M1 and M2)

Both last examples are Domain Decomposition methods

• Schwarz method

– dates back to year 1870

– was meant to solve a PDE in a region composed of two regions


16.5 Domain Decomposition Method (DDM)

Domain Decomposition – a class of methods for parallelisation where the solution domain is partitioned into subdomains

Partitioning examples


DDM Classification

• Non-overlapping methods

– Block-Jacobi method

– Additive Average method

• Overlapping methods

– Additive Schwarz method

– Substructuring method

• One-level method / Two-level (or multiple-level) methods


Non-overlapping, Block-Jacobi method:

1 2 3 4 5 6 7 8 9

10 11 12 13 14 15 16 17 18

19 20 21 22 23 24 25 26 27

28 29 30 31 32 33 34 35 36

37 38 39 40 41 42 43 44 45

46 47 48 49 50 51 52 53 54

55 56 57 58 59 60 61 62 63

64 65 66 67 68 69 70 71 72

73 74 75 76 77 78 79 80 81


Method with minimal overlap (optimal overlap):

[Figure: the domain Ω divided into subdomains Ω1, ..., Ω9 with minimal overlap.]


Subdomain restriction operators

The restriction operator Ri is a matrix which, applied to a global vector x, returns the components xi = Ri x belonging to the subdomain Ωi.

Define

    Mi^−1 := Ri^T (Ri A Ri^T)^−1 Ri     (i = 1, ..., P)

The one-level Additive Schwarz preconditioner is defined by:

    M_1AS^−1 := ∑_{i=1}^{P} Mi^−1
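A minimal NumPy sketch of applying M_1AS^−1, with each restriction operator Ri realised simply as selection of a (possibly overlapping) subdomain index set; like the Block-Jacobi sketch above, it can serve as M_solve in the pcg sketch:

import numpy as np

def additive_schwarz_solve(A, r, subdomains):
    # One-level Additive Schwarz: z = sum_i R_i^T (R_i A R_i^T)^{-1} R_i r.
    # subdomains = list of (possibly overlapping) index arrays, one per Omega_i.
    z = np.zeros_like(r)
    for idx in subdomains:
        A_i = A[np.ix_(idx, idx)]               # R_i A R_i^T
        z[idx] += np.linalg.solve(A_i, r[idx])  # add R_i^T (.)^{-1} R_i r
    return z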


16.6 Multilevel methods

Introduce coarse grid


Coarse grid Restriction and Interpolation operators

The restriction operator R0 is defined as the transpose of the interpolation operator R0^T, which interpolates values linearly (or bilinearly) from the coarse grid to the fine grid nodes.

Coarse grid matrix A0

• assembled

  – analytically, or

  – through the formula A0 = R0 A R0^T

• Define

    M0^−1 := R0^T A0^−1 R0 = R0^T (R0 A R0^T)^−1 R0

The two-level Additive Schwarz method is defined as:

    M_2AS^−1 = ∑_{i=0}^{P} Mi^−1


Requirements for the coarse grid:

• Coarse grid should cover all the fine grid nodes

• Each coarse grid cell must contain fine grid nodes

• If the fine grid has uneven density, the coarse grid must adapt to the fine grid in terms of the number of nodes in each cell


If all the requirements are fulfilled, it can be shown that

• the condition number κ(B) of the Additive Schwarz preconditioned system does not depend on the discretisation parameter n.

16.7 Parallelising PCG

In addition to the parallel preconditioner (given above) we need to parallelise:

• the dot product operation

• the Ax-operation

The first is easily done with MPI_ALLREDUCE. (Attention is needed only in the case of overlaps.)

Parallelising the Ax-operation depends on the existence of overlap. In case of no overlap (as in the Block-Jacobi method), usually a technique with shadow nodes is used:


[Figure: fine-grid nodes near a subdomain boundary; nodes 3, 12 and 21 belong to process P0 and nodes 4, 13 and 22 to process P1, each mirrored as shadow nodes on the neighbouring process.]


Before the Ax-operation, neighbours exchange values at nodes for which the matrix A has elements Aij ≠ 0 with i ∈ Ωk, j ∈ Ωl, k ≠ l. For example, process P1 sends the values at nodes 4, 13, 22 (in global numeration) to process P0 and receives the values at nodes 3, 12 and 21 (saving these into shadow-node variables).

Non-blocking communication can be used!

RealKind.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/RealKind.f90.html)
sparse_mat.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/sparse_mat.f90.html)
par_sparse_mat.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/par_sparse_mat.f90.html)
par_sparse_lahenda.f90 (http://www.ut.ee/~eero/F95jaMPI/Kood/mpi_CG/HTML/par_sparse_lahenda.f90.html)
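A hedged mpi4py sketch of such a shadow-node exchange using non-blocking communication (the neighbour lists and array sizes below are made-up placeholders, not taken from the linked Fortran code):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank

# Hypothetical setup: interface values this process sends to each neighbour,
# and receive buffers for the neighbour's values (the shadow nodes).
neighbours = [r for r in (rank - 1, rank + 1) if 0 <= r < comm.size]
send_vals = dict((nb, float(rank) * np.ones(3)) for nb in neighbours)
shadow_vals = dict((nb, np.empty(3)) for nb in neighbours)

# Non-blocking exchange of shadow-node values before the local y = A x part
reqs = []
for nb in neighbours:
    reqs.append(comm.Isend(send_vals[nb], dest=nb, tag=7))
    reqs.append(comm.Irecv(shadow_vals[nb], source=nb, tag=7))
MPI.Request.Waitall(reqs)
# ... the local matrix-vector product can now use shadow_vals ...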


Contents

1 Introduction
  1.1 Course overview
  1.2 Literature
  1.3 Scripting vs programming
    1.3.1 What is a script?
    1.3.2 Characteristics of a script
    1.3.3 Why not stick to Java, C/C++ or Fortran?
  1.4 Scripts yield short code
    1.4.1 Using regular expressions
  1.5 Classification of languages
  1.6 Performance issues
    1.6.1 Scripts can be slow
    1.6.2 Scripts may be fast enough
    1.6.3 When scripting is convenient
    1.6.4 When to use C, C++, Java, Fortran

2 Python in Scientific Computing
  2.1 Numerical Python (NumPy)
    2.1.1 NumPy: making arrays
    2.1.2 NumPy: making float, int, complex arrays
    2.1.3 Array with a sequence of numbers
    2.1.4 Array construction from a Python list
    2.1.5 From "anything" to a NumPy array
    2.1.6 Changing array dimensions
    2.1.7 Array initialization from a Python function
    2.1.8 Basic array indexing
    2.1.9 More advanced array indexing
    2.1.10 Slices refer the array data
    2.1.11 Integer arrays as indices
    2.1.12 Loops over arrays
    2.1.13 Array computations
    2.1.14 In-place array arithmetics
    2.1.15 Standard math functions can take array arguments
    2.1.16 Other useful array operations
    2.1.17 Temporary arrays
    2.1.18 More useful array methods and attributes
    2.1.19 Complex number computing
    2.1.20 A root function
    2.1.21 Array type and data type
    2.1.22 Matrix objects
  2.2 NumPy: Vectorisation
    2.2.1 Vectorisation of functions with if tests; problem
    2.2.2 Vectorisation of functions with if tests; solutions
    2.2.3 General vectorization of if-else tests
    2.2.4 Vectorization via slicing
  2.3 NumPy: Random numbers
  2.4 NumPy: Basic linear algebra
    2.4.1 A linear algebra session
  2.5 Python: Modules for curve plotting and 2D/3D visualization
    2.5.1 Curve plotting with Easyviz
    2.5.2 Basic Easyviz example
    2.5.3 Decorating the plot
    2.5.4 Plotting several curves in one plot
    2.5.5 Example: plot a function given on the command line
    2.5.6 Plotting 2D scalar fields
    2.5.7 Adding plot features
    2.5.8 Other commands for visualizing 2D scalar fields
    2.5.9 Commands for visualizing 3D fields
    2.5.10 More info about Easyviz
  2.6 I/O
    2.6.1 File I/O with arrays; plain ASCII format
    2.6.2 File I/O with arrays; binary pickling
  2.7 ScientificPython
    2.7.1 ScientificPython: numbers with units
  2.8 SciPy
    2.8.1 Overview
    2.8.2 An example - using SciPy for verifying mathematical theory about a numerical method for solving weakly singular integral equations
  2.9 SymPy: symbolic computing in Python
  2.10 Python + Matlab = true

3 What is Scientific Computing?
  3.1 Introduction to Scientific Computing
  3.2 Specifics of computational problems
  3.3 General strategy

4 Approximation in Scientific Computing
  4.1 Sources of approximation error
    4.1.1 Error sources that are under our control
    4.1.2 Errors created during the calculations
    4.1.3 Computational error – forward error and input error – backward error
  4.2 Floating-Point Numbers
  4.3 Normalised floating-point numbers
  4.4 IEEE (Normalised) Arithmetics

5 Solving Systems of Linear Equations
  5.1 Systems of Linear Equations
  5.2 Classification
    5.2.1 Problem Transformation
  5.3 Triangular linear systems
  5.4 Elementary Elimination Matrices
  5.5 Gauss Elimination and LU Factorisation
  5.6 Number of operations in GEM
  5.7 GEM with row permutations
  5.8 Reliability of the LU-factorisation with partial pivoting
  5.9 Iterative refinement of the solution

6 BLAS (Basic Linear Algebra Subroutines)
  6.1 Motivation
  6.2 BLAS implementations

7 Optimisation methods
  7.1 Gradient descent methods
  7.2 Newton method

8 Numerical Solution of Differential Equations
  8.1 Ordinary Differential Equations (ODE)
    8.1.1 Numerical methods for solving ODEs
  8.2 Partial Differential Equations (PDE)
    8.2.1 PDE overview
  8.3 2nd order PDEs
  8.4 Time-independent PDE-s
    8.4.1 Finite Difference Method (FDM)
    8.4.2 Finite Element Method (FEM)
  8.5 FEM – 1D example
  8.6 2D FEM for Poisson equation
  8.7 FEM for more general cases
  8.8 FEM software
  8.9 Sparse matrix storage schemes
    8.9.1 Triple storage format
    8.9.2 Column-major storage format
    8.9.3 Row-major storage format
    8.9.4 Combined schemes

9 Iterative methods
  9.1 Problem setup
  9.2 Jacobi Method
  9.3 Conjugate Gradient Method (CG)
  9.4 Preconditioning
  9.5 Preconditioner examples

10 Solving Integral Equations

11 Examples
  11.1 Graph partitioning
    11.1.1 Mathematical formulation
    11.1.2 Modified problem
    11.1.3 Method of Spectral Bisectioning
    11.1.4 Finding smallest positive eigenvalue of matrix LG
    11.1.5 Power Method
    11.1.6 Inverse Power Method
    11.1.7 Inverse Power Method with Projection

12 Parallel Computing
  12.1 Motivation
  12.2 Classification of Parallel Computers
  12.3 Parallel Computational Models

13 Parallel programming models
  13.1 HPF
  13.2 OpenMP
  13.3 GPGPU
    13.3.1 CUDA
    13.3.2 OpenCL
    13.3.3 OpenACC
  13.4 Automatic parallelisation

14 MPI (The Message Passing Interface)
  14.1 Concept
  14.2 MPI basic subroutines (functions)
  14.3 Blocking MPI commands
  14.4 Non-Blocking Send and Receive, Avoiding Deadlocks
    14.4.1 Unidirectional communication between two processors
    14.4.2 Mutual communication and avoiding deadlock

15 Design and Evaluation of Parallel Programs
  15.1 Two main approaches in parallel program design
  15.2 Assessing parallel programs
    15.2.1 Speedup
    15.2.2 Efficiency
    15.2.3 Amdahl's law
    15.2.4 Validity of Amdahl's law
    15.2.5 Scaled efficiency

16 Parallel preconditioners
  16.1 Comparing different solution algorithms
  16.2 PCG
  16.3 Block-Jacobi method
  16.4 Overlapping Additive Schwarz method
  16.5 Domain Decomposition Method (DDM)
  16.6 Multilevel methods
  16.7 Parallelising PCG