These notes represent the lecture contents in Computational Methods in Nuclear Technology. Errors and inaccuracies are possible. I. Christoskov, January 2015
COMPUTATIONAL METHODS IN NUCLEAR TECHNOLOGY, 2014/15
1. Sets of homogeneous linear ordinary differential equations
   Example: Xenon poisoning
2. Fourier transform
   Definitions
   Properties
   Fourier transform of discretely sampled data
   Fast Fourier transform (FFT)
      FFT of real data
      FFT of functions of two or more variables
   Application: computed tomography
3. Eigenvalues and eigenvectors of a matrix
   Characteristic equation for the eigenvalues
   Search for isolated eigenvalues and eigenvectors
      Power iteration
      Inverse power iteration (Wielandt’s method)
   Jacobi transformations for the diagonalisation of symmetric matrices
   Householder reduction
      Eigenvalue problem for the reduced matrix
      The QR algorithm
   Application: Schrödinger equation
      Formulation
      Solving
      Example
4. Singular value decomposition
   Application: the least squares problem
      Linear model
      Non-linear model
      Solving the problem
      Example: Analysis of a gamma spectrum
5. Orthogonal polynomials. Approximation of functions. Gaussian quadrature
   Approximation of functions
      Linear least squares
   Example: Polynomial approximation of the Runge function
   Data smoothing
   Gaussian quadrature
6. Monte Carlo methods
   Generation of random deviates with a chosen probability distribution
      Uniform distribution
      Normal distribution
      Central limit theorem
      The transformation method for generating deviates with a specified probability distribution
         Sampling from the normal distribution (Box-Muller method)
      The rejection method for generating deviates with a specified probability distribution
   Assessing the quality of the sample
   Monte Carlo integration
      An example
   Monte Carlo for particle transport problems
      Variance reduction methods
   Application: integral form of the neutron transport equation
7. Partial differential equations
   von Neumann stability analysis
      Lax scheme
   Diffusion initial value problem
      Explicit scheme
      Implicit scheme
      Crank-Nicholson scheme
      Multidimensional case
   An example: The one-dimensional heat equation
   Application: the diffusion equation in nuclear reactor physics
      Example: one-dimensional two-group problem
Further reading
These notes represent the lecture contents in Computational Methods in Nuclear Technology. Errors and inaccuracies are possible. I. Christoskov, January 2015
4/150
1. Sets of homogeneous linear ordinary differential equations
Consider the set:
dy_1(x)/dx = a_11 y_1(x) + ... + a_1n y_n(x)
...
dy_n(x)/dx = a_n1 y_1(x) + ... + a_nn y_n(x) ,
or:
dy/dx = A y .    (1.1)
with an initial condition y(0) = y_0 and constant coefficients a_ij, i = 1,...,n, j = 1,...,n.
An example of large sets of the form (1.1) is given by the equations describing the evolution of the nuclide composition in materials subject to neutron irradiation, including the nuclide evolution of nuclear fuel.
The balance equation for the concentration N_i of the i-th nuclide (i = 1,...,N) in such a material is:

dN_i(t)/dt = Φ Σ_{j≠i} γ_{j→i} σ_j N_j(t) + Σ_{j≠i} f_{j→i} λ_j N_j(t) − (Φ σ_i + λ_i) N_i(t) ,    (1.2)
where Φ is the one-group scalar neutron flux, σ_j is a one-group neutron absorption cross-section, λ_j is a decay constant, γ_{j→i} is the yield of the i-th nuclide as a result of neutron absorption by the j-th nuclide (including the process of neutron-induced fission), and f_{j→i} is the yield of the i-th nuclide from spontaneous decay of the j-th nuclide.
Note:
Inhomogeneous equations and equations of higher order can be reduced to the form of (1.1). Let, for example, the equation be:

d²y(x)/dx² = a y(x) + b y'(x) + c x + d

with an initial condition:

y(x_0) = y_0 ;  y'(x_0) = y_0' .
The following dependent variables are introduced:

y_1(x) ≡ y(x) ;  y_2(x) ≡ y'(x) ;  y_3(x) ≡ x ;  y_4(x) ≡ 1

Thus, the considered equation is reformulated as a set:
dy_1(x)/dx = y_2(x)
dy_2(x)/dx = a y_1(x) + b y_2(x) + c y_3(x) + d y_4(x)
dy_3(x)/dx = y_4(x)
dy_4(x)/dx = 0
with initial conditions:
y_1(x_0) = y_0
y_2(x_0) = y_0'
y_3(x_0) = x_0
y_4(x_0) = 1
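As an illustration of this reduction, the four-component first-order set can be propagated by any standard ODE routine. In the sketch below the coefficients a, b, c, d and the initial data are arbitrary illustrative values, chosen so that the equation reduces to y'' = −y with solution cos x, and a classical 4th-order Runge-Kutta step is used:

```python
import numpy as np

# Illustrative coefficients for y'' = a*y + b*y' + c*x + d (arbitrary values;
# this choice reduces the equation to y'' = -y, with solution y = cos(x))
a, b, c, d = -1.0, 0.0, 0.0, 0.0

def rhs(x, y):
    """Right-hand side of the equivalent first-order set:
    y1 = y, y2 = y', y3 = x, y4 = 1."""
    return np.array([y[1],
                     a*y[0] + b*y[1] + c*y[2] + d*y[3],
                     y[3],
                     0.0])

# Classical 4th-order Runge-Kutta propagation from x0 = 0 to x = 1
y = np.array([1.0, 0.0, 0.0, 1.0])   # y(0)=1, y'(0)=0, y3=x0=0, y4=1
x, h = 0.0, 0.01
while x < 1.0 - 1e-12:
    k1 = rhs(x, y)
    k2 = rhs(x + h/2, y + h/2*k1)
    k3 = rhs(x + h/2, y + h/2*k2)
    k4 = rhs(x + h, y + h*k3)
    y = y + h/6*(k1 + 2*k2 + 2*k3 + k4)
    x += h

print(y[0], np.cos(1.0))   # y1(1) vs the exact value cos(1)
```

Note how the auxiliary components y_3 and y_4 carry the inhomogeneous terms c x + d without changing the homogeneous form of the set.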
Through direct substitution it can be verified that
d^n y/dx^n = A^n y .    (2)
Indeed:
d²y_i(x)/dx² = d/dx (dy_i(x)/dx) = d/dx Σ_j a_ij y_j = Σ_j a_ij dy_j(x)/dx = Σ_j a_ij Σ_k a_jk y_k = Σ_k [Σ_j a_ij a_jk] y_k = Σ_k (A²)_ik y_k ,

or:

d²y/dx² = A(A y) = A² y .
By setting B ≡ A², in an analogous fashion it is demonstrated that:

d/dx (d²y_i(x)/dx²) = d/dx Σ_k b_ik y_k = Σ_k b_ik Σ_j a_kj y_j = Σ_j [Σ_k b_ik a_kj] y_j = Σ_j (A³)_ij y_j ,

i.e. d³y/dx³ = A³ y , etc.
On the other hand, for each dependent variable the following representation can be employed (Taylor series expansion):

y_i(x) = y_i(0) + (dy_i/dx)(0) x + (1/2!) (d²y_i/dx²)(0) x² + ... + (1/k!) (d^k y_i/dx^k)(0) x^k + ... = Σ_{n=0..∞} (y_i^(n)(0)/n!) x^n    (3)
And indeed, if the function y_i = f(x) has n derivatives at the point x_0, then there exists a polynomial P_n(x) for which:

a) P_n(x_0) = f(x_0), P_n'(x_0) = f'(x_0), ..., P_n^(n)(x_0) = f^(n)(x_0), and    (4.1)

b) f(x) = P_n(x) + o((x − x_0)^n), x → x_0 .    (4.2)
Proof:
a) Let the polynomial be sought in the form:

P_n(x) = A_0 + A_1 (x − x_0) + ... + A_n (x − x_0)^n    (5)

Since from (5) it follows that P_n(x_0) = A_0, the first requirement in (4.1) gives A_0 = f(x_0). Further, since P_n'(x) = A_1 + 2 A_2 (x − x_0) + ... + n A_n (x − x_0)^(n−1), the second requirement in (4.1) gives A_1 = f'(x_0).

In the same fashion, since P_n''(x) = 2·1 A_2 + ... + n(n−1) A_n (x − x_0)^(n−2), then A_2 = f''(x_0)/2!. In the general case the result is A_k = f^(k)(x_0)/k!. Thus, in fulfilment of the conditions (4.1), the polynomial in (4.2) will have the form:

P_n(x) = f(x_0) + f'(x_0)(x − x_0) + ... + (f^(k)(x_0)/k!)(x − x_0)^k + ... + (f^(n)(x_0)/n!)(x − x_0)^n .
b) The next step is to confirm that this polynomial satisfies the relation (4.2). Let r_n(x) ≡ f(x) − P_n(x). From (4.1) it follows that r_n(x_0) = r_n'(x_0) = ... = r_n^(n)(x_0) = 0. Then, by applying L’Hôpital’s rule for resolving the indeterminate form r_n(x)/(x − x_0)^n at x → x_0, one obtains:

lim_{x→x_0} r_n(x)/(x − x_0)^n = lim_{x→x_0} r_n'(x)/(n (x − x_0)^(n−1)) = ... = lim_{x→x_0} r_n^(n−1)(x)/(n! (x − x_0)) = r_n^(n)(x_0)/n! = 0 ,

i.e. in reality r_n(x) = o((x − x_0)^n), x → x_0 .
L’Hôpital’s rule:
If lim_{x→a} f(x) = lim_{x→a} g(x) = 0 or lim_{x→a} f(x) = lim_{x→a} g(x) = ±∞, then

lim_{x→a} f(x)/g(x) = lim_{x→a} f'(x)/g'(x) .
Proof for the “0/0” case (a heuristic argument):

lim_{x→a} f'(x)/g'(x) = lim_{x→a} [lim_{h→0} (f(x+h) − f(x))/h] / [lim_{h→0} (g(x+h) − g(x))/h]
= lim_{x→a} lim_{h→0} (f(x+h) − f(x)) / (g(x+h) − g(x))
= lim_{h→0} (f(a+h) − f(a)) / (g(a+h) − g(a)) = lim_{h→0} f(a+h)/g(a+h) = lim_{x→a} f(x)/g(x) ,

where f(a) = g(a) = 0 has been used.
Through combining (2) with the joint formulation of (3) for all dependent variables, the following representation of the solution of the set (1.1) is obtained:

y(x) = y(0) + x (dy/dx)(0) + (1/2!) x² (d²y/dx²)(0) + ...
= [1 + A x + (1/2!) A² x² + ... + (1/k!) A^k x^k + ...] y(0)
= Σ_{n=0..∞} (A^n x^n / n!) y(0)    (6)
On the other hand, for the scalar case dy(x)/dx = a y it can be easily verified that:

y(x) = y(0) Σ_{n=0..∞} (a^n x^n / n!) = y(0) exp(a x)    (7)
Thus, by analogy to (7), the solution of the set (1.1) is concisely denoted as (matrix exponential):

y(x) = [exp(A x)] y(0) , where exp(A x) ≡ Σ_{n=0..∞} A^n x^n / n! .    (8)
In the particular case of decoupled equations:

dy_i(x)/dx = a_ii y_i(x) , i = 1,...,n    (9)

the coefficient matrix is diagonal, A = diag(a_ii), and the equation solutions are y_i(x) = y_i(0) exp(a_ii x), i = 1,...,n. Or, in short notation:

y(x) = diag(exp(a_ii x)) y(0)    (10)

Since for this diagonal matrix A^k = diag(a_ii^k), through comparison between (3), (7), (6) and (8) it can be directly seen that:
exp(diag(a_ii) x) = diag(exp(a_ii x)) ,    (11)

with the same definition of the matrix exponential exp(A x) as in (8).

Based on that, the following procedure for evaluating exp(A x) with a general matrix A can be applied:

a) diagonalisation of A (cf. Topic 3), i.e. finding a matrix Z and a diagonal matrix D, such that Z⁻¹ A Z = D, and respectively A = Z D Z⁻¹;

b) evaluation of exp(A x) as:

exp(A x) = Z exp(D x) Z⁻¹ = Z diag(exp(d_ii x)) Z⁻¹ .    (12)
The last equality can be proved by accounting that the application of the defining expression (8) for exp(A x) requires computing the powers A^k x^k, and

A² = Z D Z⁻¹ Z D Z⁻¹ = Z D² Z⁻¹ , ... , A^k = Z D^k Z⁻¹ .    (13)

The above approach is, however, restricted to problems, i.e. matrices A, of comparatively small dimensions.
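A minimal sketch of this diagonalisation route, using numpy’s general eigensolver (the test matrices below are illustrative, and A is assumed diagonalisable):

```python
import numpy as np

def expm_eig(A, x):
    """exp(A x) via diagonalisation A = Z D Z^-1, as in eq. (12).
    Assumes A is diagonalisable."""
    d, Z = np.linalg.eig(A)                    # columns of Z are eigenvectors
    return (Z * np.exp(d * x)) @ np.linalg.inv(Z)   # Z diag(exp(d_ii x)) Z^-1

# Check 1: a decoupled (diagonal) case, eq. (11)
A = np.diag([-1.0, -2.0])
E = expm_eig(A, 0.5)        # should be diag(exp(-0.5), exp(-1.0))

# Check 2: a non-diagonal case; exp(B*pi) with B a rotation generator is -1
B = np.array([[0.0, 1.0], [-1.0, 0.0]])
R = np.real(expm_eig(B, np.pi))
print(np.real(E), R, sep="\n")
```

The broadcast `Z * np.exp(d * x)` multiplies the j-th column of Z by exp(d_j x), i.e. it forms Z diag(exp(d_ii x)) without building the diagonal matrix explicitly.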
In the general case a method which directly follows from the defining expression (8) is
preferred. The practical procedure will be outlined as follows.
Thus, after returning to the series (6) and introducing the vector c_0 ≡ y(0), it can be directly confirmed that the successive terms in this series will have the form:

c_1 = (x/1) A c_0 ;  c_2 = (x/2) A c_1 ;  c_3 = (x/3) A c_2 ;  ... ;  c_{k+1} = (x/(k+1)) A c_k    (14)

The solution of the set (1.1) is accumulated as a sum:

y = Σ_{k=0..∞} c_k    (15)
The computational algorithm is as follows:
• initialise c_i^(0) = y_i(0) and y_i^(0)(x) = c_i^(0), i = 1,...,n

• for k = 1, 2, ...: evaluate c_i^(k) = (x/k) Σ_{j=1..n} a_ij c_j^(k−1) and update the partial sum y_i^(k)(x) = y_i^(k−1)(x) + c_i^(k). If ‖c^(k)‖ / ‖y^(k)(x)‖ ≤ δ, terminate the iteration on k.
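The algorithm above can be sketched as follows (a straightforward implementation; the step x, the threshold δ and the two-nuclide test matrix are illustrative):

```python
import numpy as np

def propagate(A, y0, x, delta=1e-10, kmax=200):
    """Advance y' = A y by one step x by summing the Taylor series (6)
    with the recursion (14) and the criterion ||c_k|| / ||y|| <= delta."""
    c = y0.astype(float).copy()
    y = c.copy()
    for k in range(1, kmax):
        c = (x / k) * (A @ c)          # c_k = (x/k) A c_{k-1}
        y += c
        if np.abs(c).sum() <= delta * np.abs(y).sum():
            break
    return y

# Illustrative decay chain A -> B with lambda_A = 1, lambda_B = 10
A = np.array([[-1.0,   0.0],
              [ 1.0, -10.0]])
y = propagate(A, np.array([1.0, 0.0]), 0.1)
print(y)   # N_A = exp(-0.1); N_B follows the Bateman solution
```

For this chain the exact solution is N_A(t) = e^(−t) and N_B(t) = (e^(−t) − e^(−10t))/9, which the series sum reproduces to within the chosen threshold.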
The convergence of iterations is in principle guaranteed by property (4.2) of the Taylor
series. Nevertheless, in order to reduce the computational effort and avoid the accumulation of
excessive roundoff errors in the process of summing the series, it is strongly desirable to
minimise the number of iteration steps.
The chosen termination criterion is without a practical alternative, and its fulfilment is guaranteed by property (4.2), provided that the solution propagation step x is sufficiently small. Actually, however, this criterion imposes on the series the more general condition that ‖A^k x^k y_0 / k!‖ decreases monotonically with increasing k. The latter, in its turn, can be ensured if the matrix norm ‖A x‖ = ‖A‖ x is reduced below a given limit. While an obvious way to achieve this aim is to choose a sufficiently small propagation step x, with a large norm ‖A‖ such an approach to solving the set (1.1) would be rendered impracticable.

A more realistic strategy is to reformulate the problem so that the norm ‖A‖ is reduced to a level which is appropriate for the desired propagation step x.
A suitable matrix norm, corresponding to the vector norm ‖c^(k)‖ ≡ Σ_{i=1..n} |c_i^(k)|, is ‖A‖ = max_j Σ_{i=1..n} |a_ij|. For the matrix generated by problem (1.2) this is the quantity 2 max_j (Φ σ_j + λ_j), and the physical explanation of this result is the fact that the rate of depletion of a given nuclide is equal to the sum of the rates of production of all its daughter nuclides. Therefore, a way to reduce the norm ‖A‖ in the considered case is to exclude the nuclides with the highest depletion rates (i.e. the highest effective decay constants) from the set (1.2). The exclusion is effected e.g. through replacing a chain of the form A → B → C by a chain A → C, where B is the short-lived nuclide subject to exclusion. After solving the reduced set of equations and finding the concentrations of the precursors of these excluded nuclides, the balance equations for the latter can easily be solved analytically.
For example, if nuclide B is much shorter-lived than nuclide A, i.e. λ_B >> λ_A, then after a relatively short period of time (several mean lifetimes of nuclide B) its concentration reaches a so-called secular equilibrium with equated rates of production and depletion, i.e.

λ*_{A→B} N_A(t) = λ*_B N_B(t) ,

where λ* are the above-mentioned effective decay constants. Thus, with a known concentration N_A(t), the sought concentration of nuclide B is

N_B(t) = (λ*_{A→B} / λ*_B) N_A(t) .
In the more general case the balance equation for the nuclide B is:

dN_B(t)/dt = λ*_{A→B} N_A(t) − λ*_B N_B(t)

with a solution

N_B(t) = N_B(0) exp(−λ*_B t) + λ*_{A→B} ∫_0^t exp(−λ*_B (t − t')) N_A(t') dt' .

With a known initial condition and a known concentration N_A(t), the sought concentration N_B(t) can be obtained e.g. through numerical integration.
In particular, if in the expression for N_B(t) one can assume that N_A(t) ≅ const, it will take the form

N_B(t) = (λ*_{A→B} / λ*_B) (1 − exp(−λ*_B t)) N_A

and after several mean lifetimes of the nuclide B its concentration will approach the above-mentioned secular equilibrium level.
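A small numerical check of this expression (the effective decay constants and N_A below are arbitrary illustrative values, with N_A held constant):

```python
import numpy as np

# Illustrative constants: B much shorter-lived than A (arbitrary units)
lam_AB, lam_B = 0.02, 5.0        # lambda*_{A->B} and lambda*_B
NA = 1.0e3                       # N_A assumed ~constant on this time scale

def NB(t):
    """N_B(t) for constant N_A and zero initial N_B."""
    return lam_AB / lam_B * (1.0 - np.exp(-lam_B * t)) * NA

# After several mean lifetimes of B the secular level lam_AB/lam_B * NA
# is reached; after one mean lifetime about 63% of it
print(NB(1.0 / lam_B), NB(10.0 / lam_B), lam_AB / lam_B * NA)
```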
The exclusion criterion can be chosen empirically. For example, it can turn out that with a chosen propagation step x the series (6) will converge within a reasonably small number of iteration steps if x does not exceed 5 effective half-lives of the shortest-lived nuclide in the system.
Example: Xenon poisoning
The 135Xe nuclide is distinguished by an exceptionally large thermal neutron capture cross-section. Its accumulation in nuclear reactor fuel leads to a significant deterioration of the multiplying properties of the reactor medium. For this reason the problem of modelling the evolution of the 135Xe concentration is of special importance in reactor physics.
The production and depletion chain of 135Xe is:

fission →(γ = 0.061) 135Te →(β−, T1/2 = 19 s) 135I →(β−, T1/2 = 6.57 h) 135Xe →(β−, T1/2 = 9.14 h) 135Cs →(β−, T1/2 ≈ 2.3×10⁶ y) 135Ba (stable)

In addition, 135Xe has a direct fission yield of about 0.003 (partly via isomeric transition), and it is removed by neutron capture, 135Xe + n → 136Xe, with σ ≈ 2.7×10⁶ b.
In a simplified form (accounting only for the fission of 235U and omitting 135Te from the chain of nuclear transitions) the balance equations for the concentrations of 235U, 135I and 135Xe are:
dU(t)/dt = −σ_a5 Φ U(t)
dI(t)/dt = γ_I σ_f5 Φ U(t) − λ_I I(t)
dXe(t)/dt = γ_Xe σ_f5 Φ U(t) + λ_I I(t) − [σ_a^Xe Φ + λ_Xe] Xe(t)    (1)
A set of exemplary values of the coefficients in (1) for a WWER-1000 at rated power can be produced as follows.

From the relation

σ_f5 [cm²] × Φ [cm⁻²·s⁻¹] × E_f [J] × N_5 [#] = P [W]    (2)

with known

σ_f5 = 337.73 b = 3.3773×10⁻²² cm²
E_f = 200 MeV = 3.204×10⁻¹¹ J    (3)
N_5 = m_5 N_A / A = 35300 g/tHM × 6.022×10²³ / 235 = 9.046×10²⁵ per tHM
P = 50 MW/tHM
for the one-group scalar flux one can obtain:

Φ = 5.11×10¹³ cm⁻²·s⁻¹    (4)
With

σ_a5 = 416.1 b ;  γ_{Te+I} = 0.0615 ;  γ_Xe = 7.04×10⁻⁴ ;
σ_a^Xe = 1.516×10⁶ b ;  λ_Xe = 2.1068×10⁻⁵ s⁻¹ ;  λ_I = 2.9309×10⁻⁵ s⁻¹    (5)
the final form of the balance equations is:

dU(t)/dt = −2.13×10⁻⁸ U(t)
dI(t)/dt = 1.06×10⁻⁹ U(t) − 2.93×10⁻⁵ I(t)
dXe(t)/dt = 1.22×10⁻¹¹ U(t) + 2.93×10⁻⁵ I(t) − [7.74×10⁻⁵ + 2.11×10⁻⁵] Xe(t)    (6)
Or:

dN_1(t)/dt = −2.13×10⁻⁸ N_1(t)
dN_2(t)/dt = 1.06×10⁻⁹ N_1(t) − 2.93×10⁻⁵ N_2(t)
dN_3(t)/dt = 1.22×10⁻¹¹ N_1(t) + 2.93×10⁻⁵ N_2(t) − 9.85×10⁻⁵ N_3(t)    (7)

N_1(0) = 9.046×10²⁵ ;  N_2(0) = 0 ;  N_3(0) = 0 .
The analytical solution of these equations is:

N_1(t) = N_1(0) exp(a_11 t)

N_2(t) = N_2(0) exp(a_22 t) + N_1(0) (a_21/(a_11 − a_22)) (exp(a_11 t) − exp(a_22 t))    (8)

N_3(t) = N_3(0) exp(a_33 t)
 + N_1(0) (a_31/(a_11 − a_33)) (exp(a_11 t) − exp(a_33 t))
 + N_2(0) (a_32/(a_22 − a_33)) (exp(a_22 t) − exp(a_33 t))
 + N_1(0) (a_21 a_32/(a_11 − a_22)) [ (exp(a_11 t) − exp(a_33 t))/(a_11 − a_33) − (exp(a_22 t) − exp(a_33 t))/(a_22 − a_33) ]
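A sketch evaluating formulas (8) with the coefficient values of (7) and zero initial 135I and 135Xe (only the 135Xe concentration is computed; the coefficients are transcribed from (7) above):

```python
import numpy as np

# Coefficients of set (7); N1 = U-235, N2 = I-135, N3 = Xe-135 (per tHM)
a11, a21, a22 = -2.13e-8, 1.06e-9, -2.93e-5
a31, a32, a33 = 1.22e-11, 2.93e-5, -9.85e-5
N10 = 9.046e25          # N1(0); N2(0) = N3(0) = 0

def N3(t):
    """Analytical 135Xe concentration from (8), with zero initial I and Xe."""
    e1, e2, e3 = np.exp(a11*t), np.exp(a22*t), np.exp(a33*t)
    term_dir = N10 * a31 * (e1 - e3) / (a11 - a33)        # direct fission yield
    term_via = N10 * a21 * a32 / (a11 - a22) * (          # production via I-135
        (e1 - e3) / (a11 - a33) - (e2 - e3) / (a22 - a33))
    return term_dir + term_via

for hour in (1, 10, 30):
    print(hour, N3(3600.0 * hour))   # quasi-equilibrium is approached by ~30 h
```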
Let, with these constants, the problem of tabulating the 135Xe concentration with a time step of one hour be solved up to 30 hours after starting the reactor at full power (zero initial conditions for all nuclides except 235U). The matrix norm is ‖A‖ = 1.497×10⁻⁴ s⁻¹, and the application of the above-mentioned empirical relation for the limiting time step gives a value of x_max = 35185 s ≈ 9.8 h, i.e. much higher than the desired tabulation step. With a relative error threshold δ = 1×10⁻⁴, the maximum number of Taylor expansion terms is 6 (in the beginning of the transient process), whereas the typical number is 3-4.
After including 135Te in the set of balance equations, the matrix norm becomes ‖A‖ = 7.23×10⁻² s⁻¹, so that the limiting time step decreases to x_max ≈ 95 s. With the same relative error threshold, the maximum number of Taylor expansion terms is 15 (in the beginning of the transient process), and the typical number is 6. Since a single tabulation step is completed via a considerable number of intermediate steps, and each intermediate step requires a separate Taylor expansion, the number of matrix-vector multiplications per tabulation step is typically about 220, as compared with approximately 3 in the previous case. The accuracy of the solution for the 135Xe concentration in both cases is equally good, although in the second case the risk of excessive accumulation of roundoff errors is in principle higher.
[Figure: q_Xe (0.000-0.040) versus t, h (0-30); two curves: without Te-135 and with Te-135]

Figure 1. Xenon poisoning (relative neutron absorption rate in Xe-135) without and with the inclusion of Te-135 in the set of balance equations.
2. Fourier transform
Definitions
Let h(t) be a function of the independent variable t ∈ (−∞, ∞) (e.g. time).

A Fourier image of this function is the following function H(f) of the independent variable f ∈ (−∞, ∞) (with the meaning of frequency, if t has the meaning of time):

H(f) = ∫_{−∞}^{∞} h(t) exp(2πift) dt .    (1)

With a known image H(f), the original h(t) can be restored through the inverse transform:

h(t) = ∫_{−∞}^{∞} H(f) exp(−2πift) df .    (2)

(Actually, statement (2) needs substantiation, and this will be done below.)

Or, with ω ≡ 2πf (ω is angular frequency, if f is (linear) frequency):

H(ω) = ∫_{−∞}^{∞} h(t) exp(iωt) dt and h(t) = (1/2π) ∫_{−∞}^{∞} H(ω) exp(−iωt) dω .
The original can be either a real or a complex function of a real independent variable. As seen from (1), the image of a real function is in the general case complex.

Let, for example, h(t) = C exp(−2πif_0 t), i.e. the original is a single-frequency harmonic oscillation.

The application of (1) leads to:

H(f) = ∫_{−∞}^{∞} h(t) exp(2πift) dt = C ∫_{−∞}^{∞} exp(2πi(f − f_0)t) dt = { C × 0, f ≠ f_0 ; C × ∞, f = f_0 }

Let the integral on the right be denoted by

δ(f − f_0) ≡ ∫_{−∞}^{∞} exp(2πi(f − f_0)t) dt ,    (i)

which is a function of f with the following property:
δ(f − f_0) = { 0, f ≠ f_0 ; ∞, f = f_0 }    (ii)

The inverse transform (2) for this image H(f) = C δ(f − f_0) must restore the original, i.e. must satisfy the equality:

C exp(−2πif_0 t) = ∫_{−∞}^{∞} C δ(f − f_0) exp(−2πift) df .    (iii)

In particular, if t = 0, then:

C = C ∫_{−∞}^{∞} δ(f − f_0) df .    (iv)
A function with the property (ii) and the additional properties required by (iii) and (iv), i.e. more generally:

• δ(x − x_0) = { 0, x ≠ x_0 ; ∞, x = x_0 }

• lim_{a→∞} ∫_{x_0−a}^{x_0+a} δ(x − x_0) dx = 1

• lim_{a→∞} ∫_{x_0−a}^{x_0+a} f(x) δ(x − x_0) dx = f(x_0) ,

is known as the Dirac delta function.
It should be explicitly noted that the above considerations do not prove that (i) has all the properties of the Dirac delta function (only property (ii) is shown). It is instead only demonstrated that if (i) has these properties, then the Fourier transform will be invertible, i.e. statement (2) will be true.

Actually, expression (i) is one of the valid representations of the Dirac δ-function. Thus, based on (i), the general statement (2) for the invertibility of the Fourier transform can be corroborated:
∫_{−∞}^{∞} H(f) exp(−2πift) df = ∫_{−∞}^{∞} df exp(−2πift) ∫_{−∞}^{∞} dt' h(t') exp(2πift')
= ∫_{−∞}^{∞} dt' h(t') ∫_{−∞}^{∞} df exp(2πif(t' − t))
= ∫_{−∞}^{∞} dt' h(t') δ(t' − t) = h(t) .    (v)
Properties
a) symmetries (“*” denotes complex conjugate)

original h(t) → image H(f):
• real → H(−f) = H(f)*
• imaginary → H(−f) = −H(f)*
• even → H(−f) = H(f), i.e. even
• odd → H(−f) = −H(f), i.e. odd
• real and even → real and even
• real and odd → imaginary and odd
• imaginary and even → imaginary and even
• imaginary and odd → real and odd

b) scaling and shifting (“⇔” denotes the bi-unique correspondence between original and image)

h(at) ⇔ (1/|a|) H(f/a)
(1/|b|) h(t/b) ⇔ H(bf)
h(t − t_0) ⇔ H(f) exp(2πift_0)
h(t) exp(−2πif_0 t) ⇔ H(f − f_0)

c) convolution: g*h ≡ ∫_{−∞}^{∞} g(τ) h(t − τ) dτ

g*h ⇔ G(f) H(f)
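The convolution theorem can be checked numerically on sampled data. (numpy.fft uses the opposite sign convention to (1), but the theorem holds for either convention; for finite sequences the convolution is circular.)

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(64)
h = rng.standard_normal(64)

# Circular convolution computed directly from the definition...
direct = np.array([sum(g[m] * h[(n - m) % 64] for m in range(64))
                   for n in range(64)])

# ...and via the product of the Fourier images
via_fft = np.real(np.fft.ifft(np.fft.fft(g) * np.fft.fft(h)))

print(np.max(np.abs(direct - via_fft)))   # round-off level: the theorem holds
```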
d) correlation: corr(g, h) ≡ ∫_{−∞}^{∞} g(t + τ) h(τ) dτ

corr(g, h) ⇔ G(f) H(f)*, if g and h are real functions. For the particular case of autocorrelation: corr(g, g) ⇔ |G(f)|²

e) total power in a signal, P_t:

P_t ≡ ∫_{−∞}^{∞} |h(t)|² dt = ∫_{−∞}^{∞} |H(f)|² df

f) differentiation (with the sign convention of (1)):

dh(t)/dt ⇔ −2πif H(f)

g) Dirac delta function:

1 ⇔ δ(f) , δ(t) ⇔ 1
Fourier transform of discretely sampled data
Let the original be represented through a sequence of function values at equidistant values of the independent variable, t_n = nΔ, n = ..., −3, −2, −1, 0, 1, 2, 3, ..., where Δ ≡ Δt is the independent-variable tabulation step, i.e. let there exist the sequence:

h_n ≡ h(nΔ) , n = ..., −3, −2, −1, 0, 1, 2, 3, ...    (3)

Nyquist-Shannon sampling theorem

“Let

f_c ≡ 1/(2Δ) .    (4)

(this is the so-called Nyquist critical frequency)

If the continuous function h(t), the values of which are sampled at an interval Δ, is bandwidth limited to frequencies smaller in magnitude than the critical frequency f_c, i.e. H(f) = 0 for |f| ≥ f_c, then this function is completely determined by its sampled values h_n:

h(t) = Δ Σ_{n=−∞..∞} h_n sin(2πf_c(t − nΔ)) / (π(t − nΔ)) ”    (5)
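Formula (5) can be tried numerically on a band-limited original. (The infinite sum must be truncated, so the agreement is limited by the truncation; the sampling step, signal frequency and test point below are illustrative.)

```python
import numpy as np

dt = 0.1                      # sampling interval Delta
fc = 1.0 / (2.0 * dt)         # Nyquist critical frequency, eq. (4)
f0 = 2.0                      # signal frequency, safely below fc = 5

def h(t):
    return np.cos(2.0 * np.pi * f0 * t)

# Truncated version of the infinite sum (5). Since 2*fc*dt = 1, each term
# Delta*sin(2 pi fc (t - n dt)) / (pi (t - n dt)) equals sinc((t - n dt)/dt),
# with numpy's sinc(u) = sin(pi u)/(pi u).
n = np.arange(-2000, 2001)
t = 0.537                     # arbitrary test point well inside the range
rec = np.sum(h(n * dt) * np.sinc((t - n * dt) / dt))
print(rec, h(t))              # reconstruction close to the original value
```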
Let h(t) be represented by the finite sequence of sampled values:

h_k = h(t_k) , t_k = kΔ , k = 0,...,N−1 .    (6)

If the sampling interval Δ satisfies the requirements of the sampling theorem, and if h(t) = 0 for t < t_0 and t > t_{N−1}, or h(t) is a periodic function and the interval [t_0, t_{N−1}] contains one of its periods, then the sequence (6) will carry the entire information content of h(t). Let, for further simplicity, N be even.
The sequence of original function values (6) can be used for finding a sequence of frequency amplitudes H(f_n). In principle their number can be arbitrary, but since the Fourier transform is a linear operation, only N of them can be mutually independent. For a representative description of the spectrum they must span the frequency range [−f_c, +f_c], in general uniformly. These requirements are met by the set

H(f_n) , f_n = n/(NΔ) , n = −N/2, ..., +N/2 ,    (7)

representing the image H(f) in the frequency range [−f_c, +f_c].
The set (7) is computed through approximating the integral (1) by the sum:

H(f_n) ≡ ∫_{−∞}^{∞} h(t) exp(2πif_n t) dt ≈ Σ_{k=0..N−1} h_k exp(2πif_n t_k) Δ = Δ Σ_{k=0..N−1} h_k exp(2πikn/N) ≡ Δ × H_n , n = −N/2, ..., +N/2    (8)
The sequence H_n is periodic in n with a period N: H_{−n} = H_{N−n}, n = 1, 2, ....

Because of that, H_n is usually indexed in n from 0 to N−1. Thus, n = 0 corresponds to zero frequency, 1 ≤ n ≤ N/2 − 1 corresponds to 0 < f < f_c, N/2 + 1 ≤ n ≤ N − 1 to −f_c < f < 0, and n = N/2 to f = ±f_c.
The inverse discrete transform is computed in an analogous way:

h_k = (1/N) Σ_{n=0..N−1} H_n exp(−2πikn/N)    (9)
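The pair (8)-(9) can be written down directly as an O(N²) matrix-vector product (a sketch; numpy.fft implements the same sums with the opposite sign in the exponent):

```python
import numpy as np

def dft(h):
    """Direct O(N^2) evaluation of (8): H_n = sum_k h_k exp(+2 pi i k n / N)."""
    N = len(h)
    k = np.arange(N)
    return np.exp(2j * np.pi * np.outer(k, k) / N) @ h

def idft(H):
    """Inverse transform (9): h_k = (1/N) sum_n H_n exp(-2 pi i k n / N)."""
    N = len(H)
    k = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(k, k) / N) @ H / N

h = np.random.default_rng(1).standard_normal(16)
print(np.max(np.abs(idft(dft(h)) - h)))   # round-trip error at round-off level
```

As stated above, although (8) only approximates the integral (1), the pair (8)-(9) recovers the sampled sequence exactly (to round-off).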
Here it is important to note that although expression (8) for the frequency amplitudes is approximate, their employment in the inverse transform (9) leads to an exact recovery of the sequence of original function values.

Indeed, let

W ≡ exp(2πi/N) .    (10)

Then (8) takes the form:

H_n = Σ_{k=0..N−1} W^{nk} h_k , or H = W h ,    (11)

where (W)_{nk} ≡ W^{nk}, and (9) translates into:

h = (1/N) W⁺ H    (11a)
From (11a) it becomes evident that in order to fulfil the transform invertibility requirement, the matrix W must be unitary up to a scaling factor N: W⁺W = N·1. By making use of the definition (10), this can be directly verified. Thus the statement about the exact restoration of the original, made in conjunction with (9), is also corroborated. Moreover, through the limit approach Δ → 0, and therefore N → ∞, the transform invertibility is proved in the continuous case (2) as well, without resorting to the representation (i) of the Dirac δ-function.
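The property W⁺W = N·1 can be verified directly from definition (10), e.g. for N = 8:

```python
import numpy as np

# The matrix (W)_{nk} = W^{nk} of eq. (11), here for N = 8
N = 8
k = np.arange(N)
W = np.exp(2j * np.pi * np.outer(k, k) / N)
print(np.max(np.abs(W.conj().T @ W - N * np.eye(N))))   # ~0, i.e. W+ W = N 1
```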
Fast Fourier transform (FFT)
Expression (11) would imply that N² linear operations on complex numbers are required for computing the discrete Fourier transform.

There exist, however, algorithms like the FFT (Fast Fourier Transform; Danielson and Lanczos, 1942) through which the required number of operations is reduced to O(N log₂ N). The approach is described below.
The discrete transform of a sequence of N function values can be represented as a sum of two transforms, computed separately for the even- and odd-indexed data subsets, each of length N/2:
H_n = Σ_{k=0..N−1} h_k exp(2πikn/N)
 = Σ_{k=0..N/2−1} h_{2k} exp(2πi(2k)n/N) + Σ_{k=0..N/2−1} h_{2k+1} exp(2πi(2k+1)n/N)
 = Σ_{k=0..N/2−1} h_{2k} exp(2πikn/(N/2)) + W^n Σ_{k=0..N/2−1} h_{2k+1} exp(2πikn/(N/2))
 = H_n^0 + W^n H_n^1 , n = 0, ..., N−1    (12)
Here it is important to note that the Fourier images $H_n^0$ and $H_n^1$ are periodic in $n$ with a period $N/2$, and also that the subdivision (12) can be applied recursively. Thus, for example, $H_n^0$ can be subdivided into an even and an odd component, $H_n^{00}$ and $H_n^{01}$, each with a period $N/4$:
$$
\begin{aligned}
H_n^{0} &= \sum_{k=0}^{N/2-1} \exp\left(\frac{2\pi i k n}{N/2}\right) h_{2k} \\
&= \sum_{k=0}^{N/4-1} \exp\left(\frac{2\pi i (2k) n}{N/2}\right) h_{2(2k)} + \sum_{k=0}^{N/4-1} \exp\left(\frac{2\pi i (2k+1) n}{N/2}\right) h_{2(2k+1)} \\
&= \sum_{k=0}^{N/4-1} \exp\left(\frac{2\pi i k n}{N/4}\right) h_{4k} + W^{2n} \sum_{k=0}^{N/4-1} \exp\left(\frac{2\pi i k n}{N/4}\right) h_{4k+2} \\
&= H_n^{00} + W^{2n} H_n^{01}, \qquad n = 0, \ldots, N-1, \qquad (12.a)
\end{aligned}
$$
and $H_n^1$ can also be subdivided into an even and an odd component, $H_n^{10}$ and $H_n^{11}$, each with a period $N/4$:
$$
\begin{aligned}
W^{n} H_n^{1} &= W^{n} \sum_{k=0}^{N/2-1} \exp\left(\frac{2\pi i k n}{N/2}\right) h_{2k+1} \\
&= W^{n} \left[ \sum_{k=0}^{N/4-1} \exp\left(\frac{2\pi i k n}{N/4}\right) h_{4k+1} + W^{2n} \sum_{k=0}^{N/4-1} \exp\left(\frac{2\pi i k n}{N/4}\right) h_{4k+3} \right] \\
&= W^{n} H_n^{10} + W^{3n} H_n^{11}, \qquad n = 0, \ldots, N-1. \qquad (12.b)
\end{aligned}
$$
In this fashion, if $N$ is an integer power of 2, the recursion can proceed down to components with a period $N/N = 1$, i.e.:
$$H_n^{0100101\ldots} = h_k \quad \text{for some value of } k. \qquad (13)$$
(And indeed, if $H_n^{0100101\ldots}$ is periodic in $n$ with a period 1, it does not depend on $n$.)
The correspondence between $k$ and the Fourier image component index, i.e. the superscript sequence '0100101...', can be established through the following observation.
If the leftmost superscript of $H_n$ is '0', i.e. $H^{0\ldots}$, then $k$ is even, i.e. the least significant (rightmost) bit in the binary representation of $k$ is 0: $k = [\ldots 0]$. Conversely, if the leftmost superscript is '1', i.e. $H^{1\ldots}$, then $k$ is odd, i.e. the least significant (rightmost) bit in the binary representation of $k$ is 1: $k = [\ldots 1]$.
Further, if the second symbol of the superscript of $H_n$ is '0', i.e. $H^{x0\ldots}$, then the position of $h_k$ in the new (half-length) list is even, i.e. the second least significant bit of $k$ is 0: $k = [\ldots 0x]$. Similarly, if the second symbol is '1', i.e. $H^{x1\ldots}$, then the second least significant bit of $k$ is 1: $k = [\ldots 1x]$.
In other words, the successive subdivision of the data into even- and odd-numbered subsets is equivalent to successive testing, from right to left, of the bits in the binary representation of $k$. In particular, the binary record of the index $k$ in expression (13) will be ...1010010, i.e. the superscript sequence read in reverse.
Based on (12) and (13), the fast Fourier transform algorithm (the Cooley-Tukey algorithm) is as follows (the example is for N = 8):
− Rearrangement of the array of $h_k$ according to the bit-reversal rule: $h_0\,(h_{000}) \leftrightarrow h_0\,(h_{000})$, $h_1\,(h_{001}) \leftrightarrow h_4\,(h_{100})$, $h_2\,(h_{010}) \leftrightarrow h_2\,(h_{010})$, $h_3\,(h_{011}) \leftrightarrow h_6\,(h_{110})$, .... The rearranged array will contain in successive positions pairs of single-component Fourier images $(H^{xx0}, H^{xx1})$.
− Combining each pair of single-component Fourier images $(H^{xx0}, H^{xx1})$ according to (12) and writing the two values of the resultant two-component image in the same pair of adjacent array locations (the two-component images $H^{x0}$ and $H^{x1}$ have a period $N/4 = 2$ in $n$). The expressions are $H_n^{00} = H_n^{000} + W^{4n} H_n^{001}$, ..., $H_n^{11} = H_n^{110} + W^{4n} H_n^{111}$, $n = 0, 1$. The values of $W^{4n}$ also have a period $N/4 = 2$: $W^0 = 1$, $W^4 = -1$.
− Combining each pair of two-component images $(H^{x0}, H^{x1})$ according to (12) and writing the four values of the resultant four-component image in the four adjacent array locations previously occupied by the two-component images ($H^0$ and $H^1$ have a period $N/2 = 4$ in $n$). The expressions are $H_n^{0} = H_n^{00} + W^{2n} H_n^{01}$ and $H_n^{1} = H_n^{10} + W^{2n} H_n^{11}$, $n = 0, 1, 2, 3$. The values of $W^{2n}$ also have a period $N/2 = 4$: $W^0 = 1$, $W^2 = i$, $W^4 = -1$, $W^6 = -i$, and they alternate in sign with half that period, i.e. 2.
These notes represent the lecture contents in Computational Methods in Nuclear Technology. Errors and inaccuracies are possible. I. Christoskov, January 2015
22/150
The in-place array update proceeds as follows: a) read the values of $H_0^{00}$ and $H_0^{01}$ from array positions 1 and 3; b) compute $H_0^{0} = H_0^{00} + H_0^{01}$ and $H_2^{0} = H_0^{00} - H_0^{01}$, and write these in array positions 1 and 3; then proceed similarly with $H_1^{00}$ and $H_1^{01}$ from positions 2 and 4 in order to produce $H_1^{0}$ and $H_3^{0}$ in the same positions, i.e. 2 and 4, etc.
− Finally, combining the two four-component images $(H^0, H^1)$ according to (12) and writing the result $H_n$, $n = 0, \ldots, N-1$, in the eight adjacent array locations previously occupied by the four-component images. The expression is $H_n = H_n^{0} + W^{n} H_n^{1}$, $n = 0, \ldots, 7$. The values of $W^n$ are correspondingly: $W^0 = 1$, $W^1 = \frac{1}{\sqrt{2}} + \frac{i}{\sqrt{2}}$, $W^2 = i$, $W^3 = -\frac{1}{\sqrt{2}} + \frac{i}{\sqrt{2}}$, $W^4 = -1$, $W^5 = -\frac{1}{\sqrt{2}} - \frac{i}{\sqrt{2}}$, $W^6 = -i$, $W^7 = \frac{1}{\sqrt{2}} - \frac{i}{\sqrt{2}}$. These alternate in sign with half the current period, i.e. 4, and the updating of the data array values is done in place as illustrated above.
Each application of (12) requires $N$ operations (one multiplication and one addition per output element), and the number of recursion levels is $\log_2 N$; therefore the total number of operations for implementing the FFT algorithm is $O(N \log_2 N)$.
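The splitting (12), applied recursively, can be sketched as follows. This is a minimal, non-in-place illustration in Python; it uses the $\exp(+2\pi i/N)$ sign convention of these notes and assumes the length is a power of two:

```python
import cmath

def fft(h):
    """Recursive FFT following the even/odd splitting of (12).

    Uses the convention H_n = sum_k exp(+2*pi*i*k*n/N) h_k of these
    notes; len(h) must be a power of two.
    """
    N = len(h)
    if N == 1:
        return list(h)
    He = fft(h[0::2])           # H^0: transform of the even-indexed data
    Ho = fft(h[1::2])           # H^1: transform of the odd-indexed data
    H = [0j] * N
    for n in range(N // 2):
        w = cmath.exp(2j * cmath.pi * n / N)   # W^n
        H[n] = He[n] + w * Ho[n]               # H_n = H^0_n + W^n H^1_n
        H[n + N // 2] = He[n] - w * Ho[n]      # uses W^(n + N/2) = -W^n
    return H
```

For example, `fft([1, 2, 3, 4])` returns values matching the direct evaluation of the sum in (12).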
FFT of real data
Since the real array $f_k$, $k = 0, \ldots, N-1$, is half as long (in number of machine words) as a complex array $h_k$, $k = 0, \ldots, N-1$, while the image $F_n$, $n = 0, \ldots, N-1$, is nevertheless complex, the question arises whether the transform can be done "in place" and, more generally, whether the operation count can be roughly half that of a full-length complex transform. The answer is affirmative, as expected, because $F_{N-n} = F_n^*$, i.e. the frequency spectrum contains only half as much independent information.
The algorithm for real data is implemented as follows.
The data are subdivided into two sets, with even and odd sequential numbers. The first set is interpreted as the real part, and the second as the imaginary part, of a half-length set of complex numbers: $h_j = f_{2j} + i f_{2j+1}$, $j = 0, \ldots, N/2-1$. This synthetic complex array is subjected to the standard Fourier transform routine. The output is a complex array $H_n = F_n^0 + i F_n^1$, $n = 0, \ldots, N/2-1$, with the following components (both complex):
$$F_n^0 = \sum_{k=0}^{N/2-1} \exp\left(\frac{2\pi i k n}{N/2}\right) f_{2k} \quad \text{and} \quad F_n^1 = \sum_{k=0}^{N/2-1} \exp\left(\frac{2\pi i k n}{N/2}\right) f_{2k+1}. \qquad (14)$$
By virtue of (12), the final result is obtained as:
$$F_n = F_n^0 + \exp\left(\frac{2\pi i n}{N}\right) F_n^1, \qquad n = 0, \ldots, N-1. \qquad (15)$$
The task of extracting $F_n^0$ and $F_n^1$ from $H_n$ and simultaneously producing $F_n$ is solved in the following way:
$$F_n = \frac{1}{2}\left(H_n + H_{N/2-n}^*\right) - \frac{i}{2}\left(H_n - H_{N/2-n}^*\right)\exp\left(\frac{2\pi i n}{N}\right), \qquad n = 0, \ldots, N-1. \qquad (15a)$$
The array $F$ is complex and twice as long as the real array $f$. Since $F_{-n} = F_{N-n} = F_n^*$, only the amplitudes at non-negative frequencies are needed, and they can be written in place of the original data. Because the values $H_n$, $n = 0, \ldots, N/2$, are needed in (15a), while the Fourier image of the synthetic complex array provides $H_n$, $n = 0, \ldots, N/2-1$, one employs the periodicity $H_{N/2} = H_0$.
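The packing step and the unpacking formula (15a) can be sketched in plain Python. A direct $O(N^2)$ transform stands in for the FFT routine here, to keep the sign convention of these notes explicit; the function names are ours:

```python
import cmath

def dft(h):
    """Direct O(N^2) transform with the exp(+2*pi*i) sign convention
    of these notes (a stand-in for the FFT routine)."""
    N = len(h)
    return [sum(h[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N))
            for n in range(N)]

def real_dft(f):
    """Transform of real data f (even length N) through one length-N/2
    complex transform, following (14), (15) and (15a)."""
    N = len(f)
    M = N // 2
    # pack even samples as real parts, odd samples as imaginary parts
    h = [complex(f[2 * j], f[2 * j + 1]) for j in range(M)]
    H = dft(h)
    F = []
    for n in range(N):
        Hn = H[n % M]                      # H is periodic with period N/2
        Hc = H[(M - n) % M].conjugate()    # H*_{N/2 - n}
        F.append(0.5 * (Hn + Hc)
                 - 0.5j * (Hn - Hc) * cmath.exp(2j * cmath.pi * n / N))
    return F

print(real_dft([1.0, 2.0, 3.0, 4.0]))  # matches the full-length transform of f
```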
The inverse transform is organised as follows.
• Construct
$$F_n^{(1)} = \frac{1}{2}\left(F_n + F_{N/2-n}^*\right), \qquad F_n^{(2)} = \frac{1}{2}\exp\left(-\frac{2\pi i n}{N}\right)\left(F_n - F_{N/2-n}^*\right), \qquad n = 0, \ldots, N/2-1. \qquad (16)$$
• Find the inverse transform of $H_n = F_n^{(1)} + i F_n^{(2)}$.
FFT of functions of two or more variables
Let, similarly to the one-dimensional case:
$$h(k_1, k_2) \equiv h(k_1 \Delta_x, k_2 \Delta_y), \qquad k_1 = 0, \ldots, N_1-1, \quad k_2 = 0, \ldots, N_2-1. \qquad (17)$$
The two-dimensional Fourier image is defined by the complex function:
$$
\begin{aligned}
H(n_1, n_2) &\equiv \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} \exp\left(\frac{2\pi i k_1 n_1}{N_1}\right) \exp\left(\frac{2\pi i k_2 n_2}{N_2}\right) h(k_1, k_2) \\
&= \sum_{k_2=0}^{N_2-1} \exp\left(\frac{2\pi i k_2 n_2}{N_2}\right) \left[ \sum_{k_1=0}^{N_1-1} \exp\left(\frac{2\pi i k_1 n_1}{N_1}\right) h(k_1, k_2) \right] \\
&= \sum_{k_1=0}^{N_1-1} \exp\left(\frac{2\pi i k_1 n_1}{N_1}\right) \left[ \sum_{k_2=0}^{N_2-1} \exp\left(\frac{2\pi i k_2 n_2}{N_2}\right) h(k_1, k_2) \right], \qquad (18)
\end{aligned}
$$
or:
$H(n_1, n_2)$ = FFT by the second index of [FFT by the first index of $h(k_1, k_2)$]
= FFT by the first index of [FFT by the second index of $h(k_1, k_2)$].
The inverse transform amounts to inverting the signs of the exponentials and multiplying the final result by $\frac{1}{N_1 N_2}$.
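The factorisation (18) can be checked numerically. Note that numpy's forward FFT uses the $\exp(-2\pi i \cdot)$ sign convention, i.e. the conjugate of the one used in these notes, but the row-column factorisation property is exactly the same:

```python
import numpy as np

# A 2-D transform factorises into 1-D transforms along each axis, as in (18).
rng = np.random.default_rng(0)
h = rng.standard_normal((8, 16)) + 1j * rng.standard_normal((8, 16))

by_rows_then_cols = np.fft.fft(np.fft.fft(h, axis=1), axis=0)
by_cols_then_rows = np.fft.fft(np.fft.fft(h, axis=0), axis=1)
direct_2d = np.fft.fft2(h)

assert np.allclose(by_rows_then_cols, direct_2d)
assert np.allclose(by_cols_then_rows, direct_2d)
```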
Application: computed tomography
Computed tomography is a technique of studying the internal structure of objects by
means of penetrating radiation (X-rays, but generally also light, gamma rays, etc.). In one of
the possible geometries of measurement the object is placed between a line source and a line
detector which are parallel to each other in the plane of the examined slice of the object. Then,
if the source of length $L_x$ is located along the x axis (e.g. between $x = 0$ and $x = L_x$ at $y = 0$), and the detector at a distance $L_y$ from the source (i.e. between $x = 0$ and $x = L_x$ at $y = L_y$), then the registered intensity of the transmitted parallel beam will be:
$$I(x, L_y) = I_0 \exp\left(-\int_0^{L_y} \mu(x, y)\, dy\right), \qquad (19)$$
where $I_0$ is the constant linear density of the source intensity, and the linear attenuation coefficient $\mu(x, y)$ of the penetrating radiation completely characterises the internal structure of the slice. Insofar as it can be assumed that $\mu(x, y) = 0$ outside the object, and that because of the parallel beam the distance between the source and the detector does not affect the registered intensity, then:
$$I(x, L_y, \theta = 0) = I_0 \exp\left(-\int_{-\infty}^{+\infty} \mu(x, y)\, dy\right), \qquad -\infty < x < +\infty. \qquad (20)$$
Here the parameter θ represents the angle of rotation of the source-detector frame (or of
the object) around the z axis which is transverse to the plane where the source, the examined
slice and the detector lie (cf. Fig. 1). This parameter is introduced because, as it will be seen
below, the sought distribution ( )yx,µ is reconstructed from transmitted intensity measure-
ments at a series of rotation angles θ.
Figure 1. Mutual arrangement of the source, the object and the detector in a computed
tomography measurement
Further it is convenient to assume that the measured quantity is actually:
$$p(x', \theta) \equiv -\ln\frac{I(x', \theta)}{I_0} = \int_{-\infty}^{+\infty} \mu(x', y')\, dy', \qquad (21)$$
where $x'$, $y'$ are coordinates in a system rotated at an angle θ with respect to that in expression (20).
The normalisation and the taking of the logarithm in (21) are trivial operations which, without any restriction, can be assumed to be performed by the detector system. The $x'$ and $y'$ coordinates are in a system rotated at an angle θ, so that in a particular measurement the source is located along the $x'$ axis, e.g. between $x' = 0$ and $x' = L_x$ at $y' = 0$, and the detector between $x' = 0$ and $x' = L_x$ at $y' = L_y$. The relation between $x'$, $y'$ and $x$, $y$ (corresponding to θ = 0) is:
$$\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \cdot \begin{pmatrix} x' \\ y' \end{pmatrix}. \qquad (22)$$
For the Fourier image of $p(x', \theta)$, using (21) and (22), one obtains:
$$
\begin{aligned}
P(\kappa', \theta) &\equiv \int_{-\infty}^{+\infty} p(x', \theta) \exp(2\pi i \kappa' x')\, dx' \\
&= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \mu(x', y') \exp(2\pi i \kappa' x')\, dx'\, dy' \\
&= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \mu(x(x', y'), y(x', y')) \exp(2\pi i \kappa' (x\cos\theta + y\sin\theta))\, dx'\, dy' \\
&= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \mu(x, y) \exp(2\pi i (\kappa x + \lambda y))\, dx\, dy \\
&= \mathrm{M}(\kappa, \lambda), \qquad (23)
\end{aligned}
$$
where:
$$\kappa = \kappa' \cos\theta, \qquad \lambda = \kappa' \sin\theta, \qquad (24)$$
and $\mathrm{M}(\kappa, \lambda) \equiv \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \mu(x, y) \exp(2\pi i (\kappa x + \lambda y))\, dx\, dy$ is the two-dimensional Fourier image of $\mu(x, y)$. The conversions in (23) employ the fact that coordinate system rotations like (22) conserve the volume element and the integration limits: $dx'\, dy' = dx\, dy$.
The structure $\mu(x, y)$ of the examined object can be reconstructed via an inverse Fourier transform of (23):
$$\mu(x, y) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \mathrm{M}(\kappa, \lambda) \exp(-2\pi i (\kappa x + \lambda y))\, d\kappa\, d\lambda. \qquad (25)$$
However, it is important to note that in (23) the variables κ and λ are interrelated through (24), i.e. they are not mutually independent, and at a given θ they span only a small portion of the value range needed for performing the inverse transform (25). Therefore, the inverse transform is possible at all only if a sufficiently large number of projections $p(x', \theta)$ exist at different values of θ.
With this in mind, expression (25) can provide a basis for the following algorithm:
• Collect an array of measurement results:
$$p(x'_n, \theta_m), \qquad n = 1, \ldots, N, \quad m = 1, \ldots, M.$$
• Compute M discrete one-dimensional Fourier transforms (FFT) in order to produce the quantities
$$P(\kappa'_l, \theta_m) = \int_{-\infty}^{+\infty} p(x', \theta_m) \exp(2\pi i \kappa' x')\, dx', \qquad l = 1, \ldots, L, \quad m = 1, \ldots, M.$$
• Build a correspondence map $P(\kappa'_l, \theta_m) \Rightarrow \mathrm{M}(\kappa_i, \lambda_j)$ using the relations (24).
• Compute $\mu(x, y)$ through an inverse discrete two-dimensional Fourier transform (FFT) of $\mathrm{M}(\kappa, \lambda)$.
A major drawback of this algorithm is the mapping $P(\kappa'_l, \theta_m) \Rightarrow \mathrm{M}(\kappa_i, \lambda_j)$. Because of (24), the points $\mathrm{M}(\kappa_i, \lambda_j)$ will be arranged along radial lines in the $(\kappa, \lambda)$ plane at angles $\theta_m$ with respect to the κ axis, instead of forming an equidistant Cartesian grid, as would be needed in order to ensure the proper invertibility of the Fourier transform. Although the available data can in principle be interpolated to the desired Cartesian grid, the process would introduce significant noise into the recovered two-dimensional distribution $\mu(x, y)$, especially due to the sparsely scattered data points far from the origin of the $(\kappa, \lambda)$ coordinate system. For this reason, the following approach (based on the Radon transform) is always preferred.
Since the Fourier image $P(\kappa', \theta)$ of the measured quantity $p(x', \theta)$ is given in $(\kappa', \theta)$ coordinates, which according to (24) relate to $(\kappa, \lambda)$ as polar to Cartesian coordinates, the integration in (25) can be done after a corresponding change of variables and of the integration limits: the volume element $d\kappa\, d\lambda$ is replaced by $\kappa'\, d\kappa'\, d\theta$, the limits in κ' by $(0, \infty)$, and the limits in θ by $(0, 2\pi)$. Thus:
$$
\begin{aligned}
\mu(x, y) &= \int_0^{2\pi}\int_0^{\infty} P(\kappa', \theta) \exp(-2\pi i \kappa' (x\cos\theta + y\sin\theta))\, \kappa'\, d\kappa'\, d\theta \\
&= \int_0^{2\pi}\left[\int_0^{\infty} \kappa' P(\kappa', \theta) \exp(-2\pi i \kappa' x')\, d\kappa'\right] d\theta \equiv \int_0^{2\pi} C(x', \theta)\, d\theta. \qquad (26)
\end{aligned}
$$
$C(x', \theta)$ differs in form from the inverse Fourier transform of the function $\kappa' P(\kappa', \theta)$ only by the integration limits: $(0, +\infty)$ instead of $(-\infty, +\infty)$.
Bringing this integral to the form of an inverse Fourier transform is based on the circumstance that exchanging the places of the source and the detector (or rotating the source-detector frame by 180°) does not change the measurement conditions, i.e. the detector response. Namely:
$$p(x', \theta) = p(-x', \theta + \pi), \qquad (27)$$
and consequently $P(\kappa', \theta) = P(-\kappa', \theta + \pi)$.
Therefore:
$$
\begin{aligned}
\mu(x, y) &= \int_0^{2\pi}\int_0^{\infty} \kappa' P(\kappa', \theta) \exp(-2\pi i \kappa' x')\, d\kappa'\, d\theta \\
&= \int_0^{\pi}\int_0^{\infty} \kappa' P(\kappa', \theta) \exp(-2\pi i \kappa' (x\cos\theta + y\sin\theta))\, d\kappa'\, d\theta \\
&\quad + \int_0^{\pi}\int_0^{\infty} \kappa' P(\kappa', \theta + \pi) \exp(-2\pi i \kappa' (x\cos(\theta + \pi) + y\sin(\theta + \pi)))\, d\kappa'\, d\theta \\
&= \int_0^{\pi}\left[\int_0^{\infty} \kappa' P(\kappa', \theta) \exp(-2\pi i \kappa' x')\, d\kappa' + \int_0^{\infty} \kappa' P(-\kappa', \theta) \exp(2\pi i \kappa' x')\, d\kappa'\right] d\theta \\
&= \int_0^{\pi}\left[\int_0^{\infty} \kappa' P(\kappa', \theta) \exp(-2\pi i \kappa' x')\, d\kappa' + \int_{-\infty}^{0} (-\kappa') P(\kappa', \theta) \exp(-2\pi i \kappa' x')\, d\kappa'\right] d\theta \\
&= \int_0^{\pi}\int_{-\infty}^{+\infty} |\kappa'|\, P(\kappa', \theta) \exp(-2\pi i \kappa' x')\, d\kappa'\, d\theta \\
&= \int_0^{\pi} C(x', \theta)\, d\theta, \qquad (28)
\end{aligned}
$$
where
$$C(x', \theta) \equiv \int_{-\infty}^{+\infty} |\kappa'|\, P(\kappa', \theta) \exp(-2\pi i \kappa' x')\, d\kappa' = \int_{-\infty}^{+\infty} B(\kappa')\, P(\kappa', \theta) \exp(-2\pi i \kappa' x')\, d\kappa' \qquad (29)$$
is an inverse Fourier transform of the function $B(\kappa) P(\kappa, \theta)$, and $B(\kappa) \equiv |\kappa|$ plays the role of a „frequency filter”.
In practice, the measurements are performed at a sequence of discrete rotation angles $\theta_m$, $m = 0, \ldots, M-1$, and the image $\mu(x, y)$ is reconstructed in a discrete grid of cells (pixels) bounded by $(x_i, y_j)$, $i = 1, \ldots, I$, $j = 1, \ldots, J$. The algorithm based on (23), (28) and (29) takes the form:
For each $\theta_m$, $m = 0, \ldots, M-1$:
1. Discretisation of $p(x', \theta)$ in $x'$ with a step Δ:
$$p(x'_k, \theta_m), \qquad k = 0, \ldots, N-1, \quad x'_k = k\Delta.$$
2. Finding the Fourier image (FFT) of $p(x', \theta_m)$:
$$P(\kappa'_n, \theta_m) = \mathrm{FFT}\left(p(x'_k, \theta_m)\right), \qquad n = 0, \ldots, N-1, \quad \kappa'_n = \frac{n}{N\Delta}.$$
3. Inverse Fourier transform (IFFT) of the product $\kappa' P(\kappa', \theta_m)$:
$$C(x'_k, \theta_m) = \mathrm{IFFT}\left(\kappa'_n P(\kappa'_n, \theta_m)\right), \qquad k = 0, \ldots, N-1.$$
4. Mapping of $C(x'_k, \theta_m)$ into the pixel grid in which $\mu(x, y)$ is to be reconstructed. This is effected as follows. First, for each pixel $(x_i, y_j)$, the coordinate $x' = x_i\cos\theta_m + y_j\sin\theta_m$ is computed. Then the index $k$ is found for which $x'_k \leq x' \leq x'_{k+1}$. Finally, the quantity $C_{ij}(\theta_m)$ is evaluated through interpolation between $C(x'_k, \theta_m)$ and $C(x'_{k+1}, \theta_m)$.
5. Adding of $C_{ij}(\theta_m)\,\Delta\theta$ to the current value of $\mu_{ij}$ (with a starting value $\mu_{ij} = 0$), i.e. integration over θ according to (28).
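The five steps above can be sketched as follows. This is an illustrative numpy implementation: the centered detector coordinate, the uniform-disc test object and all names are ours, and no additional smoothing filter is applied:

```python
import numpy as np

def filtered_back_projection(p, thetas, xs):
    """Minimal filtered back-projection following steps 1-5 above.

    p[m, k] -- projection p(x'_k, theta_m); thetas -- equidistant angles
    on [0, pi); xs -- detector coordinates x'_k (uniform step).
    Returns mu on the Cartesian grid xs x xs. A sketch, not an
    optimised or carefully windowed implementation.
    """
    M, N = p.shape
    dtheta = np.pi / M
    # |kappa'| filter on the FFT frequency grid (steps 2-3)
    kappa = np.abs(np.fft.fftfreq(N, d=xs[1] - xs[0]))
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    mu = np.zeros((N, N))
    for m, theta in enumerate(thetas):
        C = np.real(np.fft.ifft(kappa * np.fft.fft(p[m])))  # C(x', theta_m)
        xp = X * np.cos(theta) + Y * np.sin(theta)          # step 4: x' per pixel
        mu += np.interp(xp, xs, C, left=0.0, right=0.0) * dtheta  # step 5
    return mu

# Demo: a uniform disc of radius 0.5 (mu = 1 inside), whose parallel
# projection 2*sqrt(r^2 - x'^2) is the same at every angle.
N, M = 64, 90
xs = np.linspace(-1.0, 1.0, N)
proj = 2.0 * np.sqrt(np.clip(0.25 - xs ** 2, 0.0, None))
thetas = np.arange(M) * np.pi / M
mu = filtered_back_projection(np.tile(proj, (M, 1)), thetas, xs)
```

The reconstructed values approach 1 inside the disc and 0 well outside it, up to discretisation noise.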
Example
Let the original image be the so-called Shepp-Logan tomographic phantom, composed of ellipses with different optical density (Fig. 2). Then, for example, the application of the above algorithm with a 512 × 512 pixel grid and 256 equidistant discrete angles between 0° and 180°, i.e. 256 linear projections $p(x'_k, \theta_m)$, $k = 0, \ldots, 511$, $m = 0, \ldots, 255$, gives the reconstructed image in Fig. 3. The observed noise effects can additionally be filtered by applying a suitable frequency filter in (29) of the general form $B(\kappa) \equiv |\kappa|\,\varphi(\kappa)$.
Figure 2. The Shepp-Logan phantom in A. C. Kak, M. Slaney, Principles of Computer-
ized Tomographic Imaging, IEEE Press, 1988.
Figure 3. A reconstruction of the image from Fig. 2 in a 512 × 512 pixel grid, based on
256 linear projections – above. Linear projections at 0° and 90° – below.
3. Eigenvalues and eigenvectors of a matrix
The scalar λ and the non-zero vector x are an eigenvalue and an eigenvector of the
square matrix A, if:
$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}. \qquad (1)$$
It is seen that an eigenvector may have an arbitrary scaling. It is also seen from (1) that non-zero eigenvectors can exist only if the following equality is fulfilled:
$$\det(\mathbf{A} - \lambda\mathbf{1}) = 0. \qquad (2)$$
The left side of (2) is a polynomial of degree n (where n is the number of rows and columns of A) with respect to λ, and its roots are the eigenvalues of A. These eigenvalues are not necessarily distinct or real-valued. From (1) and (2) it also follows that between the eigenvalues and the eigenvectors there exists a unique correspondence $(\lambda_i, \mathbf{x}_i)$, $i = 1, \ldots, n$, which does not imply that the eigenvectors are always distinct. The addition, e.g., of $\tau\mathbf{x}$ to both sides of (1) shifts the eigenvalues by an additive constant τ without changing the eigenvectors. Therefore the occurrence of a zero eigenvalue is not a special case, because through such shifting any eigenvalue can be brought to zero, or conversely, made non-zero.
It can easily be verified that:
− If the matrix B is a polynomial of degree m of the matrix A, i.e. $\mathbf{B} = P_m(\mathbf{A})$, then the eigenvalues of B are the same polynomial of the eigenvalues of A, i.e. $\mu_i = P_m(\lambda_i)$, $i = 1, \ldots, n$, where $\lambda_i$ and $\mu_i$ are correspondingly the eigenvalues of A and B, and the eigenvectors of A and B coincide.
− If $\lambda_i$, $i = 1, \ldots, n$, are the eigenvalues of A, then $\mu_i = \frac{1}{\lambda_i}$, $i = 1, \ldots, n$, are the eigenvalues of $\mathbf{B} = \mathbf{A}^{-1}$, and the eigenvectors of A and B coincide.
Here it is appropriate to remind some definitions and statements.
A matrix is symmetric if it coincides with its transpose:
$$\mathbf{A} = \mathbf{A}^T, \quad \text{or} \quad a_{ij} = a_{ji}. \qquad (3)$$
A matrix is Hermitian if it coincides with the complex conjugate of its transpose (also called its Hermitian conjugate):
$$\mathbf{A} = \mathbf{A}^{+}, \quad \text{or} \quad a_{ij} = a_{ji}^*. \qquad (4)$$
A matrix is orthogonal if its transpose is equal to its inverse:
$$\mathbf{A}^T\mathbf{A} = \mathbf{A}\mathbf{A}^T = \mathbf{1}, \qquad (5)$$
and unitary if its Hermitian conjugate equals its inverse. From (5) it is seen that the columns of an orthogonal (unitary) matrix, interpreted as vectors, are mutually orthonormal.
A matrix is normal if it commutes with its Hermitian conjugate:
$$\mathbf{A}^{+}\mathbf{A} = \mathbf{A}\mathbf{A}^{+}. \qquad (6)$$
It is evident that symmetric (Hermitian) and orthogonal (unitary) matrices are normal.
All eigenvalues of a symmetric/Hermitian matrix are real.
The eigenvectors of a normal matrix with distinct eigenvalues are mutually orthogonal and form a basis in the n-dimensional vector space. (If a normal matrix has coinciding (i.e. degenerate) eigenvalues, then the eigenvectors corresponding to the set of degenerate eigenvalues can be replaced by linear combinations of themselves (these combinations will also be eigenvectors corresponding to this eigenvalue set), so that a Gram-Schmidt orthogonalisation can be performed in order to produce a complete and orthogonal set of eigenvectors, the same as in the case of distinct eigenvalues.)
If a matrix is not normal (e.g. some random real matrix), then its eigenvectors in general cannot be brought to an orthonormal basis, but they usually, although not always, form a complete set, i.e. an arbitrary non-zero n-element vector, where n is the matrix dimension, can be represented as a linear combination of the matrix eigenvectors.
A matrix which is columnwise assembled from a set of orthonormal eigenvectors is obviously unitary. Also, a matrix which is columnwise assembled from the orthonormalised eigenvectors of a real symmetric matrix is orthogonal, since the eigenvectors of a real symmetric matrix are all real.
λ and x satisfying the equality
$$\mathbf{x}^T\mathbf{A} = \lambda\mathbf{x}^T \qquad (7)$$
are correspondingly a left eigenvalue and a left eigenvector. It is seen that the transposed left eigenvectors of A are right eigenvectors of $\mathbf{A}^T$. Therefore, the left and right eigenvectors of a symmetric/Hermitian matrix are mutually transpose/Hermitian conjugate. The left and right eigenvalues of A coincide, because expression (7) is equivalent to $(\mathbf{A}^T - \lambda\mathbf{1})\mathbf{x} = \mathbf{0}$, and $\det\left((\mathbf{A} - \lambda\mathbf{1})^T\right) = \det(\mathbf{A} - \lambda\mathbf{1})$ (a basic property of determinants), i.e. the characteristic polynomial of (7) coincides with that of (1).
In general, a left and a right eigenvector corresponding to distinct eigenvalues are mutually orthogonal, and with degenerate eigenvalues they can be made such. This will be shown below.
Let $\mathbf{X}_R$ be a matrix columnwise assembled from the right eigenvectors of A, and $\mathbf{X}_L$ a matrix rowwise assembled from the left eigenvectors of A. Then (1) and (7) can be written as:
$$\mathbf{A}\mathbf{X}_R = \mathbf{X}_R\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n), \qquad \mathbf{X}_L\mathbf{A} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)\,\mathbf{X}_L. \qquad (8)$$
If the first equality is multiplied on the left by $\mathbf{X}_L$, the second on the right by $\mathbf{X}_R$, and the resulting expressions are subtracted, one obtains:
$$(\mathbf{X}_L\mathbf{X}_R)\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n) = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)\,(\mathbf{X}_L\mathbf{X}_R). \qquad (9)$$
Since the only matrices which commute with a diagonal matrix with distinct elements are themselves diagonal, if the eigenvalues of A are distinct the left and right eigenvectors of A will be mutually orthogonal. They can always be normalised (eigenvectors are unique only up to a multiplier) in order to obtain:
$$\mathbf{X}_L\mathbf{X}_R = \mathbf{1}, \quad \text{i.e.} \quad \mathbf{X}_L = \mathbf{X}_R^{-1}. \qquad (10)$$
(In the case of degenerate eigenvalues the corresponding left or right eigenvectors can be linearly combined in a procedure similar to Gram-Schmidt, so that relation (10) will also hold. An exception is the case of an incomplete set of eigenvectors, when the matrix $\mathbf{X}_L\mathbf{X}_R$ will have zero elements on the diagonal.)
By multiplying the first equality in (8) on the left by $\mathbf{X}_L$ and using the result (10), the following is obtained:
$$\mathbf{X}_R^{-1}\mathbf{A}\mathbf{X}_R = \mathrm{diag}(\lambda_1, \ldots, \lambda_n). \qquad (11)$$
This expression is a particular case of a similarity transformation on matrix A:
$$\mathbf{A} \to \mathbf{Z}^{-1}\mathbf{A}\mathbf{Z}, \qquad (12)$$
where in the general case the transformation matrix Z can be arbitrary.
Similarity transformations conserve the eigenvalues. Indeed:
$$\det(\mathbf{Z}^{-1}\mathbf{A}\mathbf{Z} - \lambda\mathbf{1}) = \det\left(\mathbf{Z}^{-1}(\mathbf{A} - \lambda\mathbf{1})\mathbf{Z}\right) = \det(\mathbf{Z}^{-1})\det(\mathbf{A} - \lambda\mathbf{1})\det(\mathbf{Z}) = \det(\mathbf{A} - \lambda\mathbf{1}). \qquad (13)$$
Since the last expression matches the left side of (2), the eigenvalues of $\mathbf{Z}^{-1}\mathbf{A}\mathbf{Z}$ coincide with those of A.
Taking (11)-(13) into account, it is seen that every matrix A with a complete set of eigenvectors (i.e. every normal matrix and most random matrices) can be diagonalised through similarity transformations. In this case, the columns of the applied transformation matrix will be right eigenvectors of A, and the rows of the inverse of this matrix will be left eigenvectors of A. (Of course, the elements of the resultant diagonal matrix will be the corresponding eigenvalues of A.)
The eigenvectors of a real symmetric matrix are real and orthonormal. This follows from the fact that since $\mathbf{A}^T = \mathbf{A}$, if $\mathbf{A}\mathbf{X}_R = \mathbf{X}_R\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, then $\mathbf{X}_R^T\mathbf{A} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)\,\mathbf{X}_R^T$, i.e. $\mathbf{X}_L = \mathbf{X}_R^T$ (compare with the second expression in (8)). Then, according to (10), $\mathbf{X}_R^{-1} = \mathbf{X}_R^T$.
In this case the diagonalising matrix of the similarity transformation will be orthogonal.
A similarity transformation of this kind is known as orthogonal:
$$\mathbf{A} \to \mathbf{Z}^T\mathbf{A}\mathbf{Z}, \qquad (14)$$
where Z is an orthogonal matrix.
Based on the above, the general strategy of the most commonly applied methods for finding eigenvalues and eigenvectors consists in bringing the matrix A to a diagonal form through a sequence of similarity transformations:
$$\mathbf{A} \to \mathbf{P}_1^{-1}\mathbf{A}\mathbf{P}_1 \to \mathbf{P}_2^{-1}\mathbf{P}_1^{-1}\mathbf{A}\mathbf{P}_1\mathbf{P}_2 \to \mathbf{P}_3^{-1}\mathbf{P}_2^{-1}\mathbf{P}_1^{-1}\mathbf{A}\mathbf{P}_1\mathbf{P}_2\mathbf{P}_3 \to \ldots \to \mathrm{diag}(\lambda_1, \ldots, \lambda_n). \qquad (15)$$
When this diagonal (or practically diagonal) form is achieved, the eigenvectors will be found in the columns of the cumulative transformation matrix:
$$\mathbf{X}_R = \mathbf{P}_1\mathbf{P}_2\mathbf{P}_3\ldots \qquad (16)$$
Characteristic equation for the eigenvalues
One of the possible approaches for finding the eigenvalues of A is to search for the roots of the polynomial equation for λ, i.e. $P_n(\lambda) = \det(\mathbf{A} - \lambda\mathbf{1}) = 0$, commonly known as the characteristic equation. Since the size of the matrix is usually large, it is desirable that the characteristic polynomial be represented in a form suitable for evaluation and for applying standard root-finding algorithms for polynomials, namely through the coefficients of the powers of λ: $P_n(\lambda) = \sum_{i=0}^{n} a_i \lambda^i$.
The polynomial coefficients can be determined via the method of Krylov, based on the following considerations.
If the eigenvectors $\mathbf{x}_i$, $i = 1, \ldots, n$, of matrix A form a complete set, then an arbitrary non-zero vector y can be represented as their linear combination: $\mathbf{y} = \sum_{i=1}^{n} \alpha_i\mathbf{x}_i$.
Hence, if the polynomial $Q_n$ with a non-zero set of coefficients satisfies the equality $Q_n(\mathbf{A}) = \mathbf{0}$, then the equality $Q_n(\mathbf{A})\mathbf{y} = \sum_{i=1}^{n} \alpha_i Q_n(\lambda_i)\mathbf{x}_i = \mathbf{0}$ will also hold, where $\lambda_i$ are the eigenvalues of A. Since the vectors $\mathbf{x}_i$, $i = 1, \ldots, n$, are presumed to be mutually linearly independent and the set of coefficients $\alpha_i$, $i = 1, \ldots, n$, is non-zero, from the last equality it follows that $Q_n(\lambda_i) = 0$, $i = 1, \ldots, n$.
On the other hand, $\lambda_i$ are zeroes of the characteristic polynomial $P_n$, which is also of degree n, and two polynomials of the same degree with coinciding zeroes can differ only by a common multiplier. Thus, in the context of the problem being solved, the polynomial $Q_n$ coincides with the characteristic polynomial of matrix A.
Since the characteristic polynomial coefficients are sought up to a common multiplier, the characteristic equation can be written in the form
$$P_n(\lambda) = \lambda^n + \sum_{i=1}^{n} b_i \lambda^{n-i} = 0.$$
Then, with a sufficiently arbitrarily chosen fixed non-zero vector y, the equality
$$P_n(\mathbf{A})\mathbf{y} = \mathbf{A}^n\mathbf{y} + \sum_{i=1}^{n} b_i\,\mathbf{A}^{n-i}\mathbf{y} = \mathbf{0} \qquad (17)$$
can be regarded as a set of n linear algebraic equations for the n unknowns $b_1, \ldots, b_n$. The coefficients $c_{ki} = \left(\mathbf{A}^{n-i}\mathbf{y}\right)_k$, $k = 1, \ldots, n$, multiplying the unknowns $b_i$, $i = 1, \ldots, n$, as well as the free terms $d_k = \left(\mathbf{A}^n\mathbf{y}\right)_k$, $k = 1, \ldots, n$, can be evaluated through the recursive build-up of the sequence
$$\mathbf{v}^{(0)} \equiv \mathbf{y}, \quad \mathbf{v}^{(1)} = \mathbf{A}\mathbf{v}^{(0)}, \quad \ldots, \quad \mathbf{v}^{(n)} = \mathbf{A}\mathbf{v}^{(n-1)}. \qquad (18)$$
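Krylov's method can be sketched as follows. This is an illustrative numpy implementation; the choice of starting vector y is ours and must keep the Krylov vectors linearly independent:

```python
import numpy as np

def krylov_char_poly(A, y=None):
    """Coefficients [1, b_1, ..., b_n] of the characteristic polynomial
    lambda^n + b_1 lambda^(n-1) + ... + b_n via Krylov's method (17)-(18).

    A sketch: assumes the Krylov vectors y, Ay, ..., A^(n-1)y are
    linearly independent (otherwise the system is singular).
    """
    n = A.shape[0]
    if y is None:
        y = np.arange(1, n + 1, dtype=float)
    # recursive build-up (18): v^(0) = y, v^(k) = A v^(k-1)
    v = [y]
    for _ in range(n):
        v.append(A @ v[-1])
    # columns A^(n-1)y, ..., Ay, y multiply the unknowns b_1, ..., b_n in (17)
    C = np.column_stack(v[n - 1::-1])
    b = np.linalg.solve(C, -v[n])
    return np.concatenate(([1.0], b))

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3
print(krylov_char_poly(A))               # lambda^2 - 4 lambda + 3, up to rounding
```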
For finding the zeroes of the characteristic polynomial there exist standard methods, similar to those for finding the zeroes of a function. A necessary condition for the successful and efficient finding of these zeroes is their initial localisation (bracketing). The following relations can be helpful in solving this task:
$$|\lambda| \leq \min(P, Q), \qquad (19)$$
where $P = \max_i \sum_j |a_{ij}|$ and $Q = \max_i \sum_j |a_{ji}|$;
$$\min_i\left(\mathrm{Re}(a_{ii}) - P_i\right) \leq \mathrm{Re}\,\lambda \leq \max_i\left(\mathrm{Re}(a_{ii}) + P_i\right),$$
$$\min_i\left(\mathrm{Im}(a_{ii}) - P_i\right) \leq \mathrm{Im}\,\lambda \leq \max_i\left(\mathrm{Im}(a_{ii}) + P_i\right),$$
or \qquad (20)
$$\min_i\left(\mathrm{Re}(a_{ii}) - Q_i\right) \leq \mathrm{Re}\,\lambda \leq \max_i\left(\mathrm{Re}(a_{ii}) + Q_i\right),$$
$$\min_i\left(\mathrm{Im}(a_{ii}) - Q_i\right) \leq \mathrm{Im}\,\lambda \leq \max_i\left(\mathrm{Im}(a_{ii}) + Q_i\right),$$
where $P_i = \sum_{j \neq i} |a_{ij}|$ and $Q_i = \sum_{j \neq i} |a_{ji}|$.
If the matrix is diagonally dominant, i.e. $|a_{ii}| > P_i$ or $|a_{ii}| > Q_i$, then $|\lambda| \geq \min_i\left(|a_{ii}| - P_i\right)$ or $|\lambda| \geq \min_i\left(|a_{ii}| - Q_i\right)$, respectively.
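The row-sum variant of (20) can be sketched as follows (numpy; the function name is ours):

```python
import numpy as np

def gershgorin_bounds(A):
    """Row-sum localisation rectangles of (20): bounds on Re(lambda)
    and Im(lambda) covering all eigenvalues of A."""
    P = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))  # P_i = sum_{j != i} |a_ij|
    d = np.diag(A)
    re = (float(np.min(d.real - P)), float(np.max(d.real + P)))
    im = (float(np.min(d.imag - P)), float(np.max(d.imag + P)))
    return re, im

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
re, im = gershgorin_bounds(A)
print(re)  # (1.0, 5.0): all eigenvalues lie in [1, 5] on the real axis
```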
Search for isolated eigenvalues and eigenvectors
Power iteration
Let the eigenvectors $\mathbf{x}_i$, $i = 1, \ldots, n$, of A be a complete set. Then any vector $\mathbf{v}_0$ can be represented as $\mathbf{v}_0 = \sum_i \alpha_i\mathbf{x}_i$. Thus, the sequence of vectors $\mathbf{v}_k \equiv \mathbf{A}^k\mathbf{v}_0$, $k = 1, 2, \ldots$, will have the following representation:
$$\mathbf{v}_1 = \sum_i \alpha_i\lambda_i\mathbf{x}_i, \quad \ldots, \quad \mathbf{v}_m = \sum_i \alpha_i\lambda_i^m\mathbf{x}_i, \quad \ldots, \qquad (21)$$
where $\lambda_i$, $i = 1, \ldots, n$, are the eigenvalues of A.
Let $|\lambda_1| \geq |\lambda_2| \geq \ldots \geq |\lambda_n|$. Then, if $\alpha_1 \neq 0$,
$$\lim_{m \to \infty} \frac{1}{\lambda_1^m}\mathbf{A}^m\mathbf{v}_0 = \alpha_1\mathbf{x}_1, \quad \text{or} \quad \lim_{m \to \infty} \frac{\left(\mathbf{v}_{m+1}\right)_i}{\left(\mathbf{v}_m\right)_i} = \lambda_1, \quad \text{or} \quad \lim_{m \to \infty} \frac{\mathbf{y}\cdot\mathbf{v}_{m+1}}{\mathbf{y}\cdot\mathbf{v}_m} = \lambda_1, \qquad (22)$$
where y is an arbitrary vector which is not orthogonal to $\mathbf{x}_1$. In practice it is convenient to choose y with 1 in the position corresponding to the element of $\mathbf{v}_m$ with the largest absolute value and 0's everywhere else. This is so because at large m this element of the iterate vector will tend to stabilise and will become representative of the magnitude of the eigenvector $\mathbf{x}_1$.
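A minimal sketch of the power iteration (21)-(22) follows (numpy; the rescaling strategy is one common choice, not prescribed by the notes):

```python
import numpy as np

def power_iteration(A, iters=200):
    """Power iteration (21)-(22): estimate the dominant eigenvalue and
    its eigenvector. A minimal sketch; assumes |lambda_1| > |lambda_2|
    and a starting vector with alpha_1 != 0."""
    v = np.arange(1, A.shape[0] + 1, dtype=float)
    lam = 0.0
    for _ in range(iters):
        w = A @ v
        k = np.argmax(np.abs(w))   # y picks out the largest-magnitude element
        lam = w[k] / v[k]          # the ratio (y.v_{m+1}) / (y.v_m) of (22)
        v = w / np.abs(w[k])       # rescale to avoid overflow
    return lam, v

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # eigenvalues 3 and 1
lam, v = power_iteration(A)
print(lam)  # -> 3.0 (to machine precision)
```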
Accelerating the convergence
Since $\mathbf{v}_m = \lambda_1^m\left[\alpha_1\mathbf{x}_1 + \sum_{i=2}^{n} \alpha_i\left(\frac{\lambda_i}{\lambda_1}\right)^m\mathbf{x}_i\right]$, the convergence rate of the power method will depend on the ratio $\left|\frac{\lambda_2}{\lambda_1}\right|$ (known as the dominance ratio). If this ratio is too close to 1, one of the following acceleration methods can be applied.
1) Let $\mathbf{e}_k$ be the k-th column of the identity matrix. Then
$$R_m \equiv \frac{\mathbf{e}_k\cdot\mathbf{v}_{m+1}}{\mathbf{e}_k\cdot\mathbf{v}_m} = \frac{\alpha_1\lambda_1^{m+1} + \sum_{i=2}^{n}\alpha_i\lambda_i^{m+1}}{\alpha_1\lambda_1^{m} + \sum_{i=2}^{n}\alpha_i\lambda_i^{m}} \approx \lambda_1\left(1 + \beta\left(\frac{\lambda_2}{\lambda_1}\right)^{m}\right) \quad \text{for large } m, \qquad (23)$$
or $\lambda_1 - R_{m+1} = r \times (\lambda_1 - R_m)$, where $r = \frac{\lambda_2}{\lambda_1}$. This allows applying Aitken's δ²-process:
$$\frac{\lambda_1 - R_{m+2}}{\lambda_1 - R_{m+1}} = \frac{\lambda_1 - R_{m+1}}{\lambda_1 - R_m} \quad \Rightarrow \quad \lambda_1 = \frac{R_m R_{m+2} - R_{m+1}^2}{R_{m+2} - 2R_{m+1} + R_m} = R_{m+2} - \frac{\left(R_{m+2} - R_{m+1}\right)^2}{R_{m+2} - 2R_{m+1} + R_m} = R_{m+2} - \frac{\left(\Delta R_{m+1}\right)^2}{\Delta^2 R_m}. \qquad (24)$$
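Applied to a model sequence with a purely geometric error term, as assumed after (23), Aitken's δ²-process (24) recovers the limit from just three terms (a minimal sketch, names ours):

```python
def aitken(R):
    """Aitken delta^2 extrapolation (24) applied to the last three terms
    of a sequence of ratios R_m."""
    r0, r1, r2 = R[-3], R[-2], R[-1]
    return r2 - (r2 - r1) ** 2 / (r2 - 2 * r1 + r0)

# Model approach to lambda_1 = 3 with geometric error ratio r = 1/3:
lam1, r = 3.0, 1.0 / 3.0
R = [lam1 - r ** m for m in range(1, 6)]   # R_m = lambda_1 - r^m
print(aitken(R))  # recovers lambda_1 = 3.0 up to rounding
```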
2) The power iteration can be performed with $\mathbf{A} - \tau\mathbf{1}$ instead of A. The corresponding eigenvalues will be $\mu_i = \lambda_i - \tau$, and through a suitable choice of τ the dominance ratio $\left|\frac{\mu_2}{\mu_1}\right|$ can be reduced, thereby increasing the convergence rate.
For example, if the eigenvalues are real and positive and $\lambda_n$ is the smallest of them, then with $\tau = \lambda_n$ the convergence will accelerate, because $\left|\frac{\lambda_2 - \tau}{\lambda_1 - \tau}\right| < \left|\frac{\lambda_2}{\lambda_1}\right|$. Actually the optimum value of τ is $\frac{\lambda_2 + \lambda_n}{2}$. If the eigenvalues can be either positive or negative, τ can be chosen so that the second and third largest in magnitude among the numbers $\lambda_i - \tau$ have approximately equal magnitudes and opposite signs.
3) If A is symmetric, then its eigenvectors will be mutually orthogonal and
$$\mathbf{v}_m\cdot\mathbf{v}_{m+1} = \mathbf{v}_m\cdot\mathbf{A}\mathbf{v}_m = \sum_{i=1}^{n}\alpha_i^2\lambda_i^{2m+1}, \quad \text{and} \quad \mathbf{v}_m\cdot\mathbf{v}_m = \sum_{i=1}^{n}\alpha_i^2\lambda_i^{2m}, \qquad (25)$$
so that
$$\frac{\mathbf{v}_m\cdot\mathbf{v}_{m+1}}{\mathbf{v}_m\cdot\mathbf{v}_m} = \lambda_1\left(1 + O\left(\left(\frac{\lambda_2}{\lambda_1}\right)^{2m}\right)\right). \qquad (26)$$
Inverse power iteration (Wielandt’s method)
Let there exist a good estimate τ for a given eigenvalue, e.g. $\lambda_j$, i.e. $\left|\lambda_j - \tau\right| \ll \left|\lambda_k - \tau\right|,\ k \neq j$, and let the eigenvectors of A, $\mathbf{x}_i,\ i = 1,\ldots,n$, form a complete set. Then, beginning with an arbitrary non-zero vector $\mathbf{v}_0$, the following iteration process can be organised:

$$\left(\mathbf{A} - \tau\mathbf{1}\right)\mathbf{v}_k = \mathbf{v}_{k-1}, \qquad k = 1,2,\ldots \quad (27)$$

If $\tau \neq \lambda_i,\ i = 1,\ldots,n$, then $\left(\mathbf{A} - \tau\mathbf{1}\right)^{-1}$ exists and the iteration process has the form

$$\mathbf{v}_k = \left(\mathbf{A} - \tau\mathbf{1}\right)^{-1}\mathbf{v}_{k-1}, \qquad k = 1,2,\ldots \quad (28)$$

of a power iteration with the matrix $\left(\mathbf{A} - \tau\mathbf{1}\right)^{-1}$, whose eigenvalues are

$$\mu_i \equiv \frac{1}{\lambda_i - \tau}, \qquad i = 1,\ldots,n, \quad (29)$$

with $\left|\mu_j\right| \gg \left|\mu_i\right|,\ i \neq j$. This will clearly result in a very fast convergence to $\mu_j$.
At that, the vector sequence $\mathbf{v}_k,\ k = 1,2,\ldots$ will converge to the eigenvector $\mathbf{x}_j$ which corresponds to $\mu_j$. Indeed, if $\mathbf{v}_0 = \sum_{i=1}^{n}\alpha_i\mathbf{x}_i$, then $\mathbf{v}_m = \left(\mathbf{A} - \tau\mathbf{1}\right)^{-m}\mathbf{v}_0 = \sum_{i=1}^{n}\alpha_i\left(\lambda_i - \tau\right)^{-m}\mathbf{x}_i$, i.e.

$$\left(\lambda_j - \tau\right)^{m}\mathbf{v}_m = \alpha_j\mathbf{x}_j + \sum_{i\neq j}\alpha_i\left(\frac{\lambda_j - \tau}{\lambda_i - \tau}\right)^{m}\mathbf{x}_i \xrightarrow[m\to\infty]{} \alpha_j\mathbf{x}_j. \quad (30)$$
The practical procedure is as follows. Let at a given stage the estimates of $\mathbf{x}_j$ and $\lambda_j$ be $\mathbf{v}_m$ and $\tau_m$, and let the equation $\left(\mathbf{A} - \tau_m\mathbf{1}\right)\mathbf{v}_{m+1} = \mathbf{v}_m$ be solved, where $\mathbf{v}_m$ is normalised so that $\mathbf{v}_m\cdot\mathbf{v}_m = 1$. Since $\mathbf{v}_{m+1}$ is an improved estimate of $\mathbf{x}_j$, it can be assumed that $\mathbf{v}_{m+1} \approx \mathbf{x}_j$, i.e. $\left(\mathbf{A} - \tau_m\mathbf{1}\right)\mathbf{v}_{m+1} \approx \left(\lambda_j - \tau_m\right)\mathbf{v}_{m+1}$. (This is so because $\mathbf{A}\mathbf{x}_j = \lambda_j\mathbf{x}_j$ and correspondingly $\left(\mathbf{A} - \tau_m\mathbf{1}\right)\mathbf{x}_j = \left(\lambda_j - \tau_m\right)\mathbf{x}_j$.) Then $1 = \mathbf{v}_m\cdot\mathbf{v}_m \approx \left(\lambda_j - \tau_m\right)\mathbf{v}_{m+1}\cdot\mathbf{v}_m$ and therefore:

$$\lambda_j \approx \tau_m + \frac{1}{\mathbf{v}_{m+1}\cdot\mathbf{v}_m}. \quad (31)$$

This evaluation of $\lambda_j$ is taken as a new estimate $\tau_{m+1}$; prior to the next iteration step $\mathbf{v}_{m+1}$ is normalised to unit length, and the algorithm is repeated.
The described procedure extends directly to complex eigenvalues and eigenvectors. Regardless of whether a given eigenvalue is real or complex, the numerical stability of the implementation is better if the initial vector $\mathbf{v}_0$ is chosen to be real. Expression (31) then has the form $\lambda_j \approx \tau_m + \dfrac{\mathbf{v}_m^{*}\cdot\mathbf{v}_m}{\mathbf{v}_m^{*}\cdot\mathbf{v}_{m+1}}$, and here it is also expedient to normalise $\mathbf{v}_m$ so that $\mathbf{v}_m^{*}\cdot\mathbf{v}_m = 1$.
Jacobi transformations for the diagonalisation of symmetric matrices
An implementation of the previously outlined general approach of similarity transformations (15) is the following sequence of orthogonal transformations, leading to asymptotic diagonalisation of a real symmetric matrix.
The orthogonal matrices for the consecutive transformations are chosen to be of the form:

$$\mathbf{P}_{pq} = \begin{pmatrix} 1 & & & & & \\ & \ddots & & & & \\ & & c & \cdots & s & \\ & & \vdots & \ddots & \vdots & \\ & & -s & \cdots & c & \\ & & & & & 1 \end{pmatrix}, \quad (32)$$
where the numbers c and s are the cosine and sine of some angle φ, so that $c^2 + s^2 = 1$; all diagonal elements are 1, except $P_{pp} = P_{qq} = c$; all off-diagonal elements are 0, except $P_{pq} = s$ and $P_{qp} = -s$.
Each of the orthogonal transformations will be:

$$\mathbf{A}^{(k)} = \left(\mathbf{P}_{pq}^{(k)}\right)^{T}\mathbf{A}^{(k-1)}\,\mathbf{P}_{pq}^{(k)}. \quad (33)$$

The index $k = 1,2,\ldots$ denotes the sequential number of the transformation, and $\mathbf{A}^{(0)}$ is the original matrix A.
It is seen from (32) and (33) that the transformation will affect only the matrix elements
in the p-th and q-th rows and columns (generally different for each application of (33)).
From (33) it directly follows that these transformations will preserve the symmetry of the matrix:

$$\left(\mathbf{A}^{(k)}\right)^{T} = \left(\left(\mathbf{P}_{pq}^{(k)}\right)^{T}\mathbf{A}^{(k-1)}\,\mathbf{P}_{pq}^{(k)}\right)^{T} = \left(\mathbf{P}_{pq}^{(k)}\right)^{T}\left(\mathbf{A}^{(k-1)}\right)^{T}\mathbf{P}_{pq}^{(k)} = \left(\mathbf{P}_{pq}^{(k)}\right)^{T}\mathbf{A}^{(k-1)}\,\mathbf{P}_{pq}^{(k)} = \mathbf{A}^{(k)}.$$
The expressions for the altered matrix elements are as follows:

$$\begin{aligned}
\tilde a_{ip} &= c\,a_{ip} - s\,a_{iq}; \qquad \tilde a_{iq} = c\,a_{iq} + s\,a_{ip}, \qquad i \neq p,\ i \neq q; \\
\tilde a_{pp} &= c^2 a_{pp} + s^2 a_{qq} - 2sc\,a_{pq}; \qquad \tilde a_{qq} = s^2 a_{pp} + c^2 a_{qq} + 2sc\,a_{pq}; \\
\tilde a_{pq} &= \tilde a_{qp} = \left(c^2 - s^2\right)a_{pq} + sc\left(a_{pp} - a_{qq}\right).
\end{aligned} \quad (34)$$
The technique of Jacobi's method consists in zeroing the off-diagonal matrix elements through a sequence of transformations of the type of (33)/(34). In particular, as follows from the last expression in (34), zeroing $\tilde a_{pq}$ requires that the angle φ be the solution of the equation:

$$\operatorname{ctg} 2\varphi \equiv \frac{c^2 - s^2}{2sc} = \frac{a_{qq} - a_{pp}}{2a_{pq}}. \quad (35)$$
Unfortunately, any subsequent transformation which affects some of the already altered rows or columns will in general make the previously zeroed off-diagonal elements non-zero again. In spite of this, Jacobi's method does converge, as the following considerations show.
Let after the (k−1)-th step the current sums of the squares of the diagonal and the off-diagonal elements be, correspondingly,

$$S_d^{(k-1)} = \sum_i a_{ii}^2 \quad\text{and}\quad S_n^{(k-1)} = \sum_{i\neq j} a_{ij}^2. \quad (36)$$
Then, from relations (34) and the condition $\tilde a_{pq} = 0$ it can be demonstrated that:

$$S_n^{(k)} = S_n^{(k-1)} - 2a_{pq}^2 \quad\text{and}\quad S_d^{(k)} = S_d^{(k-1)} + 2a_{pq}^2. \quad (37)$$

And indeed,

$$S_n^{(k)} = \tilde S_n^{(k-1)} + 2\sum_{i\neq p,q}\left(\tilde a_{ip}^2 + \tilde a_{iq}^2\right) + 2\,\tilde a_{pq}^2 = \tilde S_n^{(k-1)} + 2\left(c^2 + s^2\right)\sum_{i\neq p,q}\left(a_{ip}^2 + a_{iq}^2\right) + 0 = S_n^{(k-1)} - 2a_{pq}^2, \quad (38)$$

where $\tilde S_n^{(k-1)}$ is the sum of squares of those off-diagonal elements which do not belong to rows/columns p and q (so that $S_n^{(k-1)} = \tilde S_n^{(k-1)} + 2\sum_{i\neq p,q}\left(a_{ip}^2 + a_{iq}^2\right) + 2a_{pq}^2$). The second statement in (37) can also be verified directly, but it also follows from the fact that orthogonal transformations preserve the sum of squares of all matrix elements.
In the particular case of the matrix (32) the last assertion can be proved simply. Let x be an arbitrary non-zero vector. For the elements of the vector $\mathbf{y} = \mathbf{P}_{pq}\mathbf{x}$ it can be immediately checked that $y_i = x_i,\ i \neq p,q$; $y_p = c\,x_p + s\,x_q$; $y_q = -s\,x_p + c\,x_q$. Like in (38), it can be shown that $\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} x_i^2$. Similarly it can be demonstrated that the same result holds for $\mathbf{y} = \mathbf{P}_{pq}^{T}\mathbf{x}$. Then, let $\mathbf{B} = \mathbf{P}_{pq}^{T}\mathbf{A}$. From the above it follows that the sum of squares of the matrix elements of B will be equal to the sum of squares of the matrix elements of A. Let now $\mathbf{C} = \mathbf{B}\mathbf{P}_{pq}$, i.e. $\mathbf{C}^{T} = \mathbf{P}_{pq}^{T}\mathbf{B}^{T}$. In the same way it follows that the sum of squares of the matrix elements of $\mathbf{C}^{T}$ will be equal to the sum of squares of the matrix elements of $\mathbf{B}^{T}$. The same, of course, is true for C and B. Looking back at (33), it is seen that C relates to A as $\mathbf{A}^{(k)}$ to $\mathbf{A}^{(k-1)}$ in (33).
Therefore, in the process of orthogonal transformations the ratio of the off-diagonal norm to the diagonal norm will decrease monotonically, and thus the sequence of such ratios will converge to zero.
In practice, after some number of iterations $\mathbf{A}^{(k)}$ will become diagonal within machine precision, and its diagonal elements will hold the eigenvalues of A. The columns of the matrix $\mathbf{V} = \mathbf{P}_{pq}^{(1)}\mathbf{P}_{pq}^{(2)}\ldots\mathbf{P}_{pq}^{(k)}$ will contain the eigenvectors of A, because from $\mathbf{V}^{T}\mathbf{A}\mathbf{V} = \mathbf{D}$, where V is orthogonal and D is diagonal with the same eigenvalues as A, it follows that $\mathbf{A}\mathbf{V} = \mathbf{V}\mathbf{D}$, which for V coincides with the definition (8) of right eigenvectors. This matrix is built up in the process of transformations, $\mathbf{V} \leftarrow \mathbf{V}\mathbf{P}_{pq}^{(i)}$, beginning with $\mathbf{V} = \mathbf{1}$. Or, more specifically:
$$\tilde v_{ip} = c\,v_{ip} - s\,v_{iq}; \qquad \tilde v_{iq} = c\,v_{iq} + s\,v_{ip}, \qquad i = 1,\ldots,n. \quad (39)$$
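A compact numpy sketch of cyclic Jacobi sweeps; for clarity it applies the full rotation matrices of (32) rather than the element-wise updates (34), which is wasteful but easy to check. The test matrix is an arbitrary example:

```python
import numpy as np

def jacobi_eig(A, sweeps=10):
    """Cyclic Jacobi method: rotations (33) with the angle chosen by (35),
    and the eigenvector matrix accumulated as in (39)."""
    Ak = A.astype(float).copy()
    n = Ak.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(Ak[p, q]) < 1e-14:
                    continue
                # from ctg(2 phi) = (a_qq - a_pp)/(2 a_pq), eq. (35)
                phi = 0.5 * np.arctan2(2.0 * Ak[p, q], Ak[q, q] - Ak[p, p])
                c, s = np.cos(phi), np.sin(phi)
                P = np.eye(n)
                P[p, p] = P[q, q] = c
                P[p, q] = s
                P[q, p] = -s
                Ak = P.T @ Ak @ P        # eq. (33)
                V = V @ P                # eq. (39)
    return np.diag(Ak), V

A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.5],
              [2.0, 0.5, 5.0]])
w, V = jacobi_eig(A)
```

A production implementation would update only the affected rows and columns via (34) and store c, s instead of forming P.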
Householder reduction
An alternative to diagonalisation through similarity transformations (which is possible
only for symmetric matrices) is the following two-stage strategy.
− First apply an economic algorithm to reduce the matrix to a simpler form through a finite number of similarity transformations. For symmetric matrices this form is usually tri-diagonal, and for non-symmetric matrices it is the so-called Hessenberg form. (A Hessenberg matrix contains only zeroes below its first subdiagonal or above its first superdiagonal.)
− Then find the eigenvalues and eigenvectors of the reduced matrix. One possible approach is to find (or localise) the zeroes of the characteristic polynomial, followed by refining the values of these zeroes (the eigenvalues of the matrix) and finding their corresponding eigenvectors, e.g. through inverse power iteration (Wielandt's method). Another approach is the so-called QR or QL methods, which will be examined below and which for a symmetric matrix also lead to finding the eigenvectors.
A standard technique for implementing the first stage of this strategy is Householder's method.

The method leads to tri-diagonalisation of a symmetric matrix, or reduction of a non-symmetric matrix to Hessenberg form, after n−2 orthogonal transformations effected by means of the so-called Householder matrices. For a symmetric matrix each orthogonal transformation zeroes a corresponding portion of a column and a row; for a non-symmetric matrix, only of a column.
The Householder matrix has the form $\mathbf{P} = \mathbf{1} - 2\,\mathbf{w}\otimes\mathbf{w}$, where w is a real vector for which $\left|\mathbf{w}\right|^2 = 1$. The symbol ⊗ denotes the so-called outer (or matrix) product: $\left[\mathbf{w}\otimes\mathbf{w}\right]_{ij} = w_i w_j$. This matrix is symmetric and orthogonal, because

$$\mathbf{P}^2 = \left(\mathbf{1} - 2\,\mathbf{w}\otimes\mathbf{w}\right)\left(\mathbf{1} - 2\,\mathbf{w}\otimes\mathbf{w}\right) = \mathbf{1} - 4\,\mathbf{w}\otimes\mathbf{w} + 4\left(\mathbf{w}\otimes\mathbf{w}\right)\left(\mathbf{w}\otimes\mathbf{w}\right) = \mathbf{1} - 4\,\mathbf{w}\otimes\mathbf{w} + 4\,\mathbf{w}\otimes\mathbf{w} = \mathbf{1}, \quad (40)$$

where $\left[\left(\mathbf{w}\otimes\mathbf{w}\right)\left(\mathbf{w}\otimes\mathbf{w}\right)\right]_{ij} = \sum_k w_i w_k w_k w_j = w_i w_j$ because $\left|\mathbf{w}\right|^2 = 1$.
Therefore $\mathbf{P} = \mathbf{P}^{-1}$; but since $\mathbf{P} = \mathbf{P}^{T}$, then $\mathbf{P}^{-1} = \mathbf{P}^{T}$, i.e. P is orthogonal.
Further, P will be used in the form $\mathbf{P} = \mathbf{1} - \dfrac{\mathbf{u}\otimes\mathbf{u}}{H}$, where u is an arbitrary real non-zero vector and $H = \dfrac{1}{2}\left|\mathbf{u}\right|^2$.
Let $\mathbf{u} = \mathbf{x} - \left|\mathbf{x}\right|\mathbf{e}_1$, where $\mathbf{e}_1$ is the first column of the identity matrix and x is some vector. Then:

$$\mathbf{P}\mathbf{x} = \mathbf{x} - \frac{\mathbf{u}\otimes\mathbf{u}}{H}\,\mathbf{x} = \mathbf{x} - \frac{\mathbf{u}\left[\left(\mathbf{x} - \left|\mathbf{x}\right|\mathbf{e}_1\right)\cdot\mathbf{x}\right]}{H} = \mathbf{x} - \frac{\left(\left|\mathbf{x}\right|^2 - \left|\mathbf{x}\right| x_1\right)\mathbf{u}}{H}. \quad (40)$$

Since

$$H = \frac{1}{2}\left|\mathbf{u}\right|^2 = \frac{1}{2}\left(\mathbf{x} - \left|\mathbf{x}\right|\mathbf{e}_1\right)\cdot\left(\mathbf{x} - \left|\mathbf{x}\right|\mathbf{e}_1\right) = \left|\mathbf{x}\right|^2 - \left|\mathbf{x}\right| x_1, \quad (41)$$

then

$$\frac{\left(\left|\mathbf{x}\right|^2 - \left|\mathbf{x}\right| x_1\right)\mathbf{u}}{H} = \mathbf{u} \quad\text{and}\quad \mathbf{P}\mathbf{x} = \mathbf{x} - \mathbf{u} = \left|\mathbf{x}\right|\mathbf{e}_1. \quad (42)$$

This means that the considered matrix P zeroes all elements of the vector x except the first one, which becomes equal to the norm of x.
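A numpy sketch of eqs (40)–(42); the vector is an arbitrary example. (Production codes pick the sign of the $|\mathbf{x}|\mathbf{e}_1$ term to avoid cancellation when x is nearly along $\mathbf{e}_1$; that refinement is ignored here.)

```python
import numpy as np

def householder(x):
    """P = 1 - (u⊗u)/H with u = x - |x| e1 and H = |u|^2 / 2, so that
    P @ x = |x| e1 (eqs 40-42)."""
    x = np.asarray(x, dtype=float)
    u = x.copy()
    u[0] -= np.linalg.norm(x)            # u = x - |x| e1
    H = 0.5 * (u @ u)
    if H == 0.0:                         # x is already a multiple of e1
        return np.eye(len(x))
    return np.eye(len(x)) - np.outer(u, u) / H

x = np.array([3.0, 4.0, 12.0])           # |x| = 13
P = householder(x)
```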
In the context of consecutive similarity transformations, let the first transformation matrix $\mathbf{P}_1$ contain in its first row/column the first row/column of the identity matrix, and below and to the right a Householder matrix $^{(n-1)}\mathbf{P}_1$ of dimension $(n-1)\times(n-1)$, for which the vector x consists of the lower n−1 elements of the first column of A.
It is seen that $\mathbf{P}_1$ will zero the lower n−2 elements in the first column of A; here k denotes the norm of the vector $\left(a_{21},\ldots,a_{n1}\right)$.
It is clear that with a symmetric matrix A the similarity (orthogonal) transformation $\mathbf{A}' = \mathbf{P}_1\mathbf{A}\mathbf{P}_1 = \mathbf{P}_1^{T}\mathbf{A}\mathbf{P}_1$ will zero the corresponding elements of the first row as well.
The matrix $\mathbf{P}_2$ for the next step is composed similarly to $\mathbf{P}_1$, but from a Householder matrix $^{(n-2)}\mathbf{P}_2$ of dimension $(n-2)\times(n-2)$, based on the lower n−2 elements of the second column of A′, supplemented to the left and above with the first two columns/rows of the identity matrix.
The identity block in the upper left corner ensures preservation of the already achieved tri-diagonal form, and the (n−2)-sized Householder matrix $^{(n-2)}\mathbf{P}_2$ tri-diagonalises the second column/row. It is evident that n−2 orthogonal transformations of this type will bring the symmetric matrix A to a tri-diagonal form.
As can easily be concluded, with non-symmetric matrices the elements above the diagonal will not be zeroed, and the transformed matrix A will be of Hessenberg form.
In both cases, if at the next stage the eigenvectors of the reduced matrix are sought, the eigenvectors of A can be obtained from them by applying the resultant transformation matrix $\mathbf{Q} = \mathbf{P}_1\mathbf{P}_2\ldots\mathbf{P}_{n-2}$.
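The n−2 transformations can be sketched with numpy as follows (dense and unoptimised; a real implementation works on subblocks and never forms P explicitly — the matrix here is an arbitrary symmetric example):

```python
import numpy as np

def householder_tridiagonalise(A):
    """n-2 similarity transformations P_k A P_k with embedded Householder
    blocks, bringing a symmetric A to tri-diagonal form."""
    A = A.astype(float).copy()
    n = A.shape[0]
    Q = np.eye(n)
    for k in range(n - 2):
        x = A[k+1:, k]
        u = x.copy()
        u[0] -= np.linalg.norm(x)           # u = x - |x| e1
        H = 0.5 * (u @ u)
        if H == 0.0:                        # column already in the right form
            continue
        P = np.eye(n)
        P[k+1:, k+1:] -= np.outer(u, u) / H
        A = P @ A @ P                       # P is symmetric and orthogonal
        Q = Q @ P
    return A, Q

A = np.array([[4.0, 1.0, 2.0, 0.5],
              [1.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 2.0, 1.0],
              [0.5, 1.0, 1.0, 5.0]])
T, Q = householder_tridiagonalise(A)
```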
Eigenvalue problem for the reduced matrix
The eigenvalues and eigenvectors of the simplified matrix (after a Householder reduction) can be determined by finding the roots of the characteristic polynomial (approximate estimates suffice) and applying Wielandt's method for refining the eigenvalues and/or finding their corresponding eigenvectors.

Thus, for example, for a tri-diagonal symmetric matrix J (resulting from a Householder reduction of a symmetric matrix) the polynomial $p_n(\mu) = \det\left(\mathbf{J} - \mu\mathbf{1}\right)$ can be obtained through expanding the determinant in minors, using the sequence of recurrence relations:

$$p_0(\mu) = 1,\qquad p_1(\mu) = J_{11} - \mu,\qquad p_i(\mu) = \left(J_{ii} - \mu\right)p_{i-1}(\mu) - J_{i,i-1}^2\,p_{i-2}(\mu),\qquad i = 2,3,\ldots,n. \quad (43)$$

In the process of building the polynomial sequence, information is gathered about the intervals which bracket the zeroes of $p_n(\mu)$, so that these zeroes can easily be localised, e.g. by the bisection method.
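A numpy sketch of the recurrence (43) together with a simple bisection on a hand-picked bracket; the matrix and bracket are arbitrary examples (a full implementation would use the sign counts of the polynomial sequence to bracket every zero automatically):

```python
import numpy as np

def p_n(J, mu):
    """det(J - mu*1) for a symmetric tri-diagonal J via recurrence (43)."""
    n = J.shape[0]
    p_prev, p = 1.0, J[0, 0] - mu              # p_0 and p_1
    for i in range(1, n):
        p_prev, p = p, (J[i, i] - mu) * p - J[i, i - 1]**2 * p_prev
    return p

J = (np.diag([2.0, 3.0, 4.0])
     + np.diag([1.0, 1.0], 1) + np.diag([1.0, 1.0], -1))

lo, hi = 0.0, 2.0                              # brackets the smallest zero
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if p_n(J, lo) * p_n(J, mid) <= 0.0:
        hi = mid
    else:
        lo = mid
mu_min = 0.5 * (lo + hi)
```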
The QR algorithm
The QR algorithm is an infinite (like Jacobi's method) sequence of similarity transformations, used to diagonalise a symmetric tri-diagonal matrix or to reduce a non-symmetric matrix to an upper triangular form. In both cases the eigenvalues are the diagonal elements of the transformed matrix, and in the first case the eigenvectors are found in the columns of the resultant transformation matrix. For non-symmetric matrices the eigenvectors are found through inverse power iteration (Wielandt's method); since accurate estimates of the eigenvalues are already available, the convergence of this iteration is rapid. Also, because of the simple form of the transformed matrix, its inversion is significantly facilitated.

The QR algorithm is based on the possibility to represent each matrix as the product $\mathbf{A} = \mathbf{Q}\mathbf{R}$, where Q is an orthogonal matrix and R is an upper triangular matrix.
This possibility is due to the fact that through a sequence of Householder reductions the result $\mathbf{P}_{n-1}\mathbf{P}_{n-2}\ldots\mathbf{P}_1\mathbf{A} = \mathbf{T}\mathbf{A} = \mathbf{R}$ can be obtained, where the $\mathbf{P}_i$ are Householder matrices as defined above. Since they are orthogonal and symmetric, i.e. $\mathbf{P}_i = \mathbf{P}_i^{T} = \mathbf{P}_i^{-1}$, the matrix $\mathbf{Q} = \mathbf{T}^{-1} = \left(\mathbf{P}_{n-1}\ldots\mathbf{P}_1\right)^{-1} = \mathbf{P}_1^{-1}\ldots\mathbf{P}_{n-1}^{-1} = \mathbf{P}_1\ldots\mathbf{P}_{n-1}$ will also be orthogonal. On the other hand, since $\mathbf{T}\mathbf{A} = \mathbf{R}$, then $\mathbf{A} = \mathbf{Q}\mathbf{R}$.

Let now, after finding Q and R, the matrix $\mathbf{A}' = \mathbf{R}\mathbf{Q}$ be formed. Since $\mathbf{R} = \mathbf{Q}^{T}\mathbf{A}$, then $\mathbf{A}' = \mathbf{Q}^{T}\mathbf{A}\mathbf{Q}$ is similar to A (i.e. it has the same eigenvalues, with eigenvectors related through Q). An important circumstance is that this similarity transformation conserves the symmetric, tri-diagonal or Hessenberg form of the matrix.
Thus, the QR algorithm consists of the following infinite iterative procedure:

$$\mathbf{A}_k = \mathbf{Q}_k\mathbf{R}_k; \qquad \mathbf{A}_{k+1} = \mathbf{R}_k\mathbf{Q}_k = \mathbf{Q}_k^{T}\mathbf{A}_k\mathbf{Q}_k; \qquad k = 1,2,\ldots \quad (44)$$

As is seen, this is a direct analogue of the Householder reduction, although with the following two important differences:
− the applied Householder matrices zero all subdiagonal elements (the Householder reduction does not zero the first subdiagonal);
− the orthogonal transformations are infinite in number (with the Householder reduction their number is n−2).
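A numpy sketch of the iteration (44), relying on numpy's own QR factorisation; the symmetric test matrix and iteration count are arbitrary choices:

```python
import numpy as np

def qr_algorithm(A, iters=200):
    """QR iteration (44); accumulates the Q_k so that for a symmetric A
    the eigenvectors can be read off the columns of V."""
    Ak = A.astype(float).copy()
    V = np.eye(A.shape[0])
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                       # A_{k+1} = R_k Q_k = Q_k^T A_k Q_k
        V = V @ Q
    return np.diag(Ak), V

A = np.array([[6.0, 2.0, 1.0],
              [2.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])
w, V = qr_algorithm(A)
```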
The following statement holds, the proof of which is outside the scope of this text:

If A has eigenvalues of different absolute values, i.e. $\left|\lambda_1\right| > \left|\lambda_2\right| > \ldots > \left|\lambda_n\right|$, then as $k\to\infty$, $\mathbf{A}_k$ approaches an upper triangular form (in the symmetric case, a diagonal form), and at that its diagonal elements approach its eigenvalues (the eigenvalues of a triangular matrix always coincide with its diagonal elements). If A has an eigenvalue $\lambda_i$ of multiplicity p, then as $k\to\infty$, $\mathbf{A}_k$ will also converge to an upper triangular matrix with diagonal elements converging to its eigenvalues, except for a diagonal block matrix of order p, the eigenvalues of which converge to $\lambda_i$. Also, the subdiagonal elements of $\mathbf{A}_k$ (in the symmetric case, the superdiagonal elements as well) approach zero like

$$a_{ij}^{(k)} \sim \left(\frac{\lambda_i}{\lambda_j}\right)^{k} \qquad \left(i > j,\ \left|\lambda_i\right| < \left|\lambda_j\right|\right).$$
In the context of the above, convergence can be accelerated through shifting of the eigenvalues, i.e. through applying the QR step to $\mathbf{A}_k - k_k\mathbf{1}$, where $k_k$ is a suitably chosen constant. The convergence rate will then be determined by the ratios $\left|\dfrac{\lambda_i - k_k}{\lambda_j - k_k}\right| < \left|\dfrac{\lambda_i}{\lambda_j}\right|$. The choice of $k_k$ is made on the basis of the current eigenvalue estimates.
Application: Schrödinger equation
Formulation
The Schrödinger equation is a quantum mechanical analogue of the equation of motion
in classical mechanics. Its underpinning can be explained as follows.
The general expression for the amplitude of a plane wave is $\psi(\mathbf{r},t) = A\exp\left(i\,\mathbf{k}\cdot\mathbf{r} - i\omega t\right)$, where k is a wave vector.

De Broglie's wave associated with a free particle is $\psi(\mathbf{r},t) = \exp\left(\dfrac{i}{\hbar}\,\mathbf{p}\cdot\mathbf{r} - i\omega t\right)$, where p is the momentum of the particle, and its energy is $E = \hbar\omega$.
Schrödinger's wave function for a non-relativistic free particle, which is a direct counterpart of de Broglie's wave function, is $\psi(\mathbf{r},t) = \exp\left(\dfrac{i}{\hbar}\,\mathbf{p}\cdot\mathbf{r} - \dfrac{i}{\hbar}\,\dfrac{p^2}{2m}\,t\right)$. (The second term in the exponent follows from $E = \hbar\omega$ and $E = \dfrac{p^2}{2m}$.)
The differentiation of this wave function in time and coordinates yields

$$i\hbar\,\frac{\partial}{\partial t}\psi(\mathbf{r},t) = \frac{p^2}{2m}\,\psi(\mathbf{r},t) \quad\text{and}\quad -\hbar^2\,\frac{\partial^2}{\partial x_i^2}\psi(\mathbf{r},t) = p_i^2\,\psi(\mathbf{r},t); \qquad -\hbar^2\nabla^2\psi(\mathbf{r},t) = p^2\,\psi(\mathbf{r},t). \quad (45)$$

Therefore,

$$i\hbar\,\frac{\partial}{\partial t}\psi(\mathbf{r},t) = -\frac{\hbar^2}{2m}\,\nabla^2\psi(\mathbf{r},t) \quad (46)$$
is a differential equation satisfied by this wave function.
Let the particle be in a potential field, i.e. the potential energy of the particle is $V(\mathbf{r})$. Then the kinetic energy of the particle will be $E_k = \dfrac{p^2}{2m} = E - V(\mathbf{r})$, where $E = \hbar\omega$ is the full energy. Let the full energy be everywhere higher than the potential energy.
It is also assumed that the wave function of such a particle will be a superposition of plane waves of the considered form, with a time dependence $\exp\left(-\dfrac{iE}{\hbar}\,t\right)$ and a coordinate dependence $\psi(\mathbf{r}) = \exp\left(\dfrac{i}{\hbar}\,\mathbf{p}\cdot\mathbf{r}\right)$. Then, for each of these waves:

$$i\hbar\,\frac{\partial}{\partial t}\psi(\mathbf{r},t) = E\,\psi(\mathbf{r},t) \quad\text{and}\quad \nabla^2\psi(\mathbf{r},t) = -\frac{p^2}{\hbar^2}\,\psi(\mathbf{r},t) = -\frac{2m\left(E - V(\mathbf{r})\right)}{\hbar^2}\,\psi(\mathbf{r},t). \quad (47)$$
With constant full energy E, the coordinate and time dependences are separable and the second expression in (47) can be applied independently:

$$-\frac{\hbar^2}{2m}\,\nabla^2\psi(\mathbf{r},t) = \left(E - V(\mathbf{r})\right)\psi(\mathbf{r},t), \quad\text{or}\quad -\frac{\hbar^2}{2m}\,\nabla^2\psi(\mathbf{r},t) + V(\mathbf{r})\,\psi(\mathbf{r},t) = E\,\psi(\mathbf{r},t). \quad (48)$$
This is the time-independent Schrödinger equation for the wave function of a particle
with constant full energy E.
After substituting $E\,\psi(\mathbf{r},t)$ by $i\hbar\,\dfrac{\partial}{\partial t}\psi(\mathbf{r},t)$, the above equation takes the form:

$$i\hbar\,\frac{\partial}{\partial t}\psi(\mathbf{r},t) = -\frac{\hbar^2}{2m}\,\nabla^2\psi(\mathbf{r},t) + V(\mathbf{r})\,\psi(\mathbf{r},t). \quad (49)$$

This is the time-dependent Schrödinger equation, in which the full energy E is not explicitly included. It is valid for any time dependence of the full energy, and therefore for any wave function.
It is clear that the time-independent equation will be satisfied by the spatial part of the wave function $f(\mathbf{r})$ alone, $\psi(\mathbf{r},t) = f(\mathbf{r})\exp\left(-\dfrac{iE}{\hbar}\,t\right)$, so that

$$-\frac{\hbar^2}{2m}\,\nabla^2 f(\mathbf{r}) + V(\mathbf{r})\,f(\mathbf{r}) = E\,f(\mathbf{r}), \quad\text{or, after nondimensionalisation:} \quad (50)$$
$$-\nabla^2 f(\mathbf{r}) + V(\mathbf{r})\,f(\mathbf{r}) = E\,f(\mathbf{r}). \quad (51)$$
Since $P(\mathbf{r}) = \left|\psi(\mathbf{r})\right|^2$ is the probability of finding the particle in the vicinity of r, a natural boundary condition will be $\psi(\mathbf{r}) = 0$ far from the region where the potential energy is non-zero.
Solving
1) If the derivatives are approximated by finite differences, the one-dimensional problem will take the form:

$$-\frac{1}{h^2}\,f_{j-1} + \left(\frac{2}{h^2} + V_j\right)f_j - \frac{1}{h^2}\,f_{j+1} = E\,f_j, \quad (52)$$

where $f_i \equiv f(x_i)$, $h \equiv x_{i+1} - x_i = \mathrm{const}$, $V_i \equiv V(x_i)$. Or

$$\mathbf{H}\mathbf{f} = E\,\mathbf{f}, \quad (53)$$

which is a standard eigenvalue problem for the eigenvalues $E_i$ and the eigenvectors $\mathbf{f}_i$; the matrix H is symmetric and tri-diagonal, so that the QR method can be applied directly. In the two- or three-dimensional case the matrix is symmetric and has a band structure.
2) An alternative approach is to expand the solution in basis functions $\varphi_k(\mathbf{r})$, selected on the grounds of an analysis of the physical problem: $f(\mathbf{r}) = \sum_k a_k\,\varphi_k(\mathbf{r})$. Thus:

$$\sum_k a_k\left(-\nabla^2 + V(\mathbf{r})\right)\varphi_k(\mathbf{r}) = E\sum_k a_k\,\varphi_k(\mathbf{r}), \quad (54)$$

or, after multiplying by $\varphi_l^{*}(\mathbf{r})$ and integrating over the problem space (to the boundaries where the boundary conditions are imposed):

$$\sum_k\left[\int d^3r\,\varphi_l^{*}(\mathbf{r})\left(-\nabla^2 + V(\mathbf{r})\right)\varphi_k(\mathbf{r})\right]a_k = E\sum_k\left[\int d^3r\,\varphi_l^{*}(\mathbf{r})\,\varphi_k(\mathbf{r})\right]a_k, \quad\text{i.e.}\quad \mathbf{H}\mathbf{a} = E\,\mathbf{S}\mathbf{a}. \quad (55)$$
If the basis functions are orthonormal, then $S_{ij} = \delta_{ij}$ and the problem is $\mathbf{H}\mathbf{a} = E\,\mathbf{a}$.

The problem $\mathbf{H}\mathbf{a} = E\,\mathbf{S}\mathbf{a}$ is known as a generalised eigenvalue problem. It can be represented in the standard form as $\mathbf{S}^{-1}\mathbf{H}\mathbf{a} = E\,\mathbf{a}$. However, a serious computational difficulty will arise from the fact that even if H and S are symmetric, $\mathbf{S}^{-1}\mathbf{H}$ will not preserve this property.
In the presently considered particular case of a symmetric matrix S the following remedy can be applied. If S is also diagonally dominant, then it will be positive-definite. For symmetric positive-definite matrices there exists the decomposition $\mathbf{S} = \mathbf{L}\mathbf{L}^{+}$, where L is a lower triangular matrix (this decomposition is performed economically by the so-called Cholesky method). Then the matrix G, similar to $\mathbf{S}^{-1}\mathbf{H}$, will be:

$$\mathbf{G} = \mathbf{L}^{+}\left(\mathbf{S}^{-1}\mathbf{H}\right)\left(\mathbf{L}^{+}\right)^{-1} = \mathbf{L}^{+}\left(\mathbf{L}\mathbf{L}^{+}\right)^{-1}\mathbf{H}\left(\mathbf{L}^{+}\right)^{-1} = \mathbf{L}^{-1}\mathbf{H}\left(\mathbf{L}^{+}\right)^{-1} = \mathbf{L}^{-1}\mathbf{H}\left(\mathbf{L}^{-1}\right)^{+}. \quad (56)$$

The equality $\left(\mathbf{L}^{+}\right)^{-1} = \left(\mathbf{L}^{-1}\right)^{+}$, on which the final result in (56) is based, follows from $\mathbf{L}^{-1}\mathbf{L} = \mathbf{1}$: taking the conjugate transpose, $\mathbf{L}^{+}\left(\mathbf{L}^{-1}\right)^{+} = \mathbf{1}$, i.e. $\left(\mathbf{L}^{-1}\right)^{+} = \left(\mathbf{L}^{+}\right)^{-1}$. \quad (57)

Thus, by construction the matrix G is similar to $\mathbf{S}^{-1}\mathbf{H}$, and because of the last equality in the chain (56) it will inherit the symmetry of H. Therefore it is advantageous to replace the problem $\mathbf{S}^{-1}\mathbf{H}\mathbf{a} = E\,\mathbf{a}$ by the problem $\mathbf{G}\mathbf{a} = E\,\mathbf{a}$.
Example
The one-dimensional time-independent Schrödinger equation has the form:

$$-\frac{d^2 f(x)}{dx^2} + V(x)\,f(x) = E\,f(x), \quad (58)$$

and with $V(x) = x^2$:

$$\frac{d^2 f(x)}{dx^2} + \left(E - x^2\right)f(x) = 0. \quad (59)$$
Solutions of this equation with $E_n = 2n + 1,\ n = 0,1,\ldots$ are the so-called Hermite functions $\psi_n(x)$:

$$\frac{d^2\psi_n(x)}{dx^2} + \left(2n + 1 - x^2\right)\psi_n(x) = 0, \quad (60)$$

i.e. they are eigenfunctions for the Schrödinger equation in the considered case, and $E_n = 2n + 1,\ n = 0,1,\ldots$ are their corresponding eigenvalues.

The Hermite functions are:

$$\psi_n(x) = \left(2^n\,n!\,\sqrt{\pi}\right)^{-1/2}\exp\left(-x^2/2\right)H_n(x), \quad (61)$$
where $H_n(x)$ are the Hermite polynomials.
Hermite functions form an orthonormal set:

$$\int_{-\infty}^{\infty}\psi_n(x)\,\psi_m(x)\,dx = \delta_{nm}. \quad (62)$$

Hermite polynomials can be evaluated through the recurrence relation:

$$H_0(x) = 1,\qquad H_1(x) = 2x,\qquad H_{n+1}(x) = 2x\,H_n(x) - 2n\,H_{n-1}(x),\qquad n = 1,2,\ldots \quad (63)$$
Therefore, if method 1), i.e. (52), is implemented with a given spatial discretisation $x_i,\ i = 1,\ldots,N$, then the eigenvalues are expected to converge to $E_n = 2n + 1,\ n = 0,1,\ldots$, and the eigenvectors to $\psi_n(x_i),\ i = 1,\ldots,N$.

In particular, e.g. with $x_1 = -10$, $x_N = 10$ and $N = 400$, the discretisation approach in x (finite differencing) gives the following first 10 eigenvalues:

n     0     1     2     3     4     5     6     7     8     9
E_n   1.00  3.00  5.00  7.00  8.99  11.0  13.0  15.0  17.0  19.0
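The table above can be reproduced with a short numpy script implementing (52) for V(x) = x² (here via a dense symmetric eigensolver rather than QR on the tri-diagonal matrix):

```python
import numpy as np

N = 400
x = np.linspace(-10.0, 10.0, N)               # x_1 = -10, x_N = 10
h = x[1] - x[0]

# tri-diagonal H of eq. (52): 2/h^2 + V_j on the diagonal, -1/h^2 beside it
H = (np.diag(2.0 / h**2 + x**2)
     - np.diag(np.ones(N - 1) / h**2, 1)
     - np.diag(np.ones(N - 1) / h**2, -1))

E = np.linalg.eigvalsh(H)                     # ascending; expect E_n ≈ 2n + 1
```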
The figures below illustrate some of the eigenvectors and their corresponding Hermite
functions.
[figure] Figure 1. The eigenvector and its corresponding Hermite function at n = 0 (the graphic representations fully match), together with $P(x) = \psi(x)^2$.
[figure] Figure 2. The eigenvector and its corresponding Hermite function at n = 3 (the graphic representations fully match), together with $P(x) = \psi(x)^2$.
4. Singular value decomposition
Let $\mathbf{A}_{m\times n}$ be an arbitrary matrix. Since for any vector x the following inequality is valid:

$$\mathbf{x}^{+}\left(\mathbf{A}^{+}\mathbf{A}\right)\mathbf{x} = \left(\mathbf{A}\mathbf{x}\right)^{+}\left(\mathbf{A}\mathbf{x}\right) = \left|\mathbf{A}\mathbf{x}\right|^2 \geq 0, \quad (1)$$

the matrix $\mathbf{A}^{+}\mathbf{A}$ is positive-semidefinite.

(A symmetric (Hermitian) matrix A is said to be positive-definite if for any non-zero vector x the quadratic form $\mathbf{x}^{+}\mathbf{A}\mathbf{x} > 0$. Such a matrix is negative-definite if $\mathbf{x}^{+}\mathbf{A}\mathbf{x} < 0$, and respectively positive- or negative-semidefinite if $\mathbf{x}^{+}\mathbf{A}\mathbf{x} \geq 0$ or $\mathbf{x}^{+}\mathbf{A}\mathbf{x} \leq 0$.)
On the other hand, the maximum and minimum eigenvalues of a Hermitian (symmetric) matrix H, like $\mathbf{A}^{+}\mathbf{A}$, are:

$$\lambda_{\max} = \max_{\mathbf{x}\neq 0}\frac{\mathbf{x}^{+}\mathbf{H}\mathbf{x}}{\mathbf{x}^{+}\mathbf{x}} \quad\text{and}\quad \lambda_{\min} = \min_{\mathbf{x}\neq 0}\frac{\mathbf{x}^{+}\mathbf{H}\mathbf{x}}{\mathbf{x}^{+}\mathbf{x}}. \quad (2)$$

Therefore, the eigenvalues of the matrix $\left(\mathbf{A}^{+}\mathbf{A}\right)_{n\times n}$ are non-negative and can be written in the form $\lambda_i = \sigma_i^2,\ i = 1,\ldots,n$.
The quantities $\sigma_i \geq 0$ are known as the singular values of the matrix A.

It can be shown that for an arbitrary matrix $\mathbf{A}_{m\times n}$ of rank r there exist orthogonal (unitary) matrices $\mathbf{U}_{m\times n}$ and $\mathbf{V}_{n\times n}$ such that $\mathbf{U}^{+}_{n\times m}\mathbf{A}_{m\times n}\mathbf{V}_{n\times n} = \mathbf{\Sigma}_{n\times n}$, where $\mathbf{\Sigma}_{n\times n}$ is a diagonal matrix, the first $r \leq n$ diagonal elements of which, i.e. $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0$, are the non-zero singular values of A, and its remaining n−r diagonal elements are zeroes.

From the relation $\mathbf{U}^{+}\mathbf{A}\mathbf{V} = \mathbf{\Sigma}$ it follows that $\mathbf{A}_{m\times n} = \mathbf{U}_{m\times n}\mathbf{\Sigma}_{n\times n}\mathbf{V}^{+}_{n\times n}$.
Singular value decomposition (SVD) consists in finding the matrices Σ, U and V for a given matrix A. The process has two stages.

At the first stage, through a sequence of Householder reductions the matrix A is brought to upper bidiagonal form: $\mathbf{J}_{n\times n} = \mathbf{P}_{n\times m}\mathbf{A}_{m\times n}\mathbf{Q}_{n\times n}$, where $\mathbf{P} = \mathbf{P}_n\mathbf{P}_{n-1}\ldots\mathbf{P}_1$ and $\mathbf{Q} = \mathbf{Q}_1\mathbf{Q}_2\ldots\mathbf{Q}_{n-2}$ are products of Householder matrices. Since P and Q are orthogonal (unitary), then $\mathbf{J}^{+}\mathbf{J} = \mathbf{Q}^{+}\mathbf{A}^{+}\mathbf{P}^{+}\mathbf{P}\mathbf{A}\mathbf{Q} = \mathbf{Q}^{+}\left(\mathbf{A}^{+}\mathbf{A}\right)\mathbf{Q}$. Therefore the matrix $\mathbf{J}^{+}\mathbf{J}$ is similar to $\mathbf{A}^{+}\mathbf{A}$, i.e. it has the same eigenvalues. Thus the singular values of the upper bidiagonal matrix J coincide with those of the matrix A.
At the second stage the superdiagonal elements of J are brought to vanishing magnitude through an infinite iterative procedure analogous to the QR method. The general form of the transformation is $\mathbf{\Sigma}_{n\times n} = \mathbf{S}_{n\times n}\mathbf{J}_{n\times n}\mathbf{T}_{n\times n}$, where Σ is practically diagonal (with negligible off-diagonal elements), and S and T are orthogonal (unitary). In a way analogous to the previous proof, it is shown that the singular values of J and Σ coincide. The resultant decomposition $\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{+}$ is effected through the orthogonal (unitary) matrices $\mathbf{U}_{m\times n} = \mathbf{P}^{+}_{n\times m}\mathbf{S}^{+}_{n\times n}$ and $\mathbf{V}_{n\times n} = \mathbf{Q}_{n\times n}\mathbf{T}^{+}_{n\times n}$.
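The decomposition, and the relation $\lambda_i = \sigma_i^2$ between the eigenvalues of $\mathbf{A}^{+}\mathbf{A}$ and the singular values, can be checked directly with numpy's built-in SVD on a random example matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, sigma, Vh = np.linalg.svd(A, full_matrices=False)   # A = U Sigma V^+

evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]     # eigenvalues of A^+ A
A_rec = U @ np.diag(sigma) @ Vh                        # reconstruction
```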
Application: the least squares problem
Let $y_i,\ i = 1,\ldots,m$ be a set of observations, each with a variance $s_i^2,\ i = 1,\ldots,m$.

Linear model

Let these observations be compared against a model

$$\tilde y_i \equiv \sum_{j=1}^{n} a_j\,X_j(x_i), \quad (3)$$

where $x_i$ are values of the independent variable x, which in some general way describes the measurement conditions, and $X_j(x),\ j = 1,\ldots,n$ are the so-called basis functions of the model. The number of basis functions n is not larger than the number of observations m, and usually $n \ll m$.
The model parameters $a_j,\ j = 1,\ldots,n$ are so chosen as to minimise the deviation between the observations and the respective values predicted by the model. A measure of this deviation is the quantity

$$R^2(\mathbf{a}) \equiv \sum_{i=1}^{m}\left(\frac{y_i - \tilde y_i}{s_i}\right)^2 = \sum_{i=1}^{m}\left(\frac{y_i}{s_i} - \sum_{j=1}^{n}a_j\,\frac{X_j(x_i)}{s_i}\right)^2 = \sum_{i=1}^{m}\left(b_i - \sum_{j=1}^{n}C_{ij}\,a_j\right)^2, \quad (4)$$

where $C_{ij} \equiv \dfrac{X_j(x_i)}{s_i}$ are the elements of a matrix $\mathbf{C}_{m\times n}$, and b is a vector composed of $b_i \equiv \dfrac{y_i}{s_i},\ i = 1,\ldots,m$.
These notes represent the lecture contents in Computational Methods in Nuclear Technology. Errors and inaccuracies are possible. I. Christoskov, January 2015
56/150
The form of $R^2(\mathbf{a})$ follows from the assumption that the observations are mutually independent random variables with Gaussian distributions, with expected values $\tilde y_i$ and variances $s_i^2$. This assumption is in turn based on the central limit theorem in probability theory.

$R^2(\mathbf{a})$ is the squared norm of the residual $\mathbf{R} = \mathbf{C}\mathbf{a} - \mathbf{b}$ of the set of linear equations $\mathbf{C}\mathbf{a} = \mathbf{b}$. This set is usually overdetermined and its residual is expectedly non-zero. The data modelling problem consists in minimising this residual.
A frequently employed method of minimising $R^2(\mathbf{a})$ is through solving the set

$$\frac{\partial R^2(\mathbf{a})}{\partial a_j} = 0, \qquad j = 1,\ldots,n,$$

which leads to the following inhomogeneous linear system for the parameters a:

$$\mathbf{D}\mathbf{a} = \mathbf{b}',$$

where:

$$D_{kj} = \sum_{i=1}^{m}\frac{X_k(x_i)\,X_j(x_i)}{s_i^2}, \qquad b'_j = \sum_{i=1}^{m}\frac{y_i\,X_j(x_i)}{s_i^2}, \qquad k,j = 1,\ldots,n.$$

This is the so-called „method of normal equations”, as discussed e.g. in the lecture notes on Programming and Computational Physics.
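A numpy sketch of the normal equations for a straight-line model $\tilde y = a_1 + a_2 x$; the data points and uncertainties are made up for illustration:

```python
import numpy as np

# weighted linear least squares through the normal equations D a = b'
# model: y ≈ a1 * 1 + a2 * x   (basis functions X1(x) = 1, X2(x) = x)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 8.8])
s = np.array([0.1, 0.1, 0.2, 0.1, 0.3])       # standard deviations s_i

X = np.column_stack([np.ones_like(x), x])     # X_j(x_i), an m x n matrix
W = 1.0 / s**2
D = X.T @ (W[:, None] * X)                    # D_kj = sum_i X_k X_j / s_i^2
bp = X.T @ (W * y)                            # b'_j = sum_i y_i X_j / s_i^2
a = np.linalg.solve(D, bp)
```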
Non-linear model
Let the model compared against the available observations be

$$\tilde y_i \equiv f\left(x_i;\,a_1,\ldots,a_n\right). \quad (3a)$$

The number n of model parameters is not larger than the number of observations m, and usually $n \ll m$.

The quantity $R^2(\mathbf{a})$ of (4) can be represented through linearisation of the model with respect to its parameters, using the Taylor expansion

$$f\left(x_i;\,\mathbf{a}\right) \approx f\left(x_i;\,\mathbf{a}^{(k)}\right) + \sum_{j=1}^{n}\frac{\partial f}{\partial a_j}\left(x_i;\,\mathbf{a}^{(k)}\right)\times\left(a_j - a_j^{(k)}\right),$$

where $\mathbf{a}^{(k)}$ is some known estimate of the sought parameters.
Following the considerations in the linear model case, the task of minimising $R^2(\mathbf{a})$ is solved iteratively, starting with an initial estimate of the sought parameters, and reduces to minimising the squared residuals of sets of the form $\mathbf{C}\,\delta\mathbf{a} = \mathbf{b}$, where:

$$\delta\mathbf{a} \equiv \mathbf{a}^{(k+1)} - \mathbf{a}^{(k)}, \qquad C_{ij} \equiv \frac{1}{s_i}\,\frac{\partial f}{\partial a_j}\left(x_i;\,\mathbf{a}^{(k)}\right), \qquad b_i \equiv \frac{y_i - f\left(x_i;\,\mathbf{a}^{(k)}\right)}{s_i}, \qquad i = 1,\ldots,m,\ j = 1,\ldots,n.$$

The normal equations method is also iterative and consists in solving equation sets of the form

$$\mathbf{D}\,\delta\mathbf{a} = \mathbf{b}',$$

where:

$$D_{lj} = \sum_{i=1}^{m}\frac{1}{s_i^2}\,\frac{\partial f}{\partial a_l}\left(x_i;\,\mathbf{a}^{(k)}\right)\frac{\partial f}{\partial a_j}\left(x_i;\,\mathbf{a}^{(k)}\right), \qquad b'_l = \sum_{i=1}^{m}\frac{y_i - f\left(x_i;\,\mathbf{a}^{(k)}\right)}{s_i^2}\,\frac{\partial f}{\partial a_l}\left(x_i;\,\mathbf{a}^{(k)}\right), \qquad l,j = 1,\ldots,n.$$
Solving the problem
And so, the data modelling problem is expressed in minimising the square $R^2(\mathbf{a})$ of the residual $\mathbf{R}_m = \mathbf{C}_{m\times n}\,\mathbf{a}_n - \mathbf{b}_m$ of the equation set $\mathbf{C}\mathbf{a} = \mathbf{b}$. (It is assumed that this set is generated by a linear model or by a linearised non-linear model; in the latter case the role of the basis functions is played by $\partial f/\partial a_j\left(x;\,\mathbf{a}\right),\ j = 1,\ldots,n$.)

This set is usually overdetermined and has no exact solution. The vector a which minimises the square (or the norm) of the residual is interpreted as an approximate solution of this set.
If the matrix C is subjected to SVD, i.e. the matrices U, Σ and V are found for which $\mathbf{C} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{+}$, then the system can be transformed to a formally determined system with a diagonal matrix in the following way:

$$\mathbf{C}\mathbf{a} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{+}\mathbf{a} = \mathbf{b} \;\rightarrow\; \mathbf{\Sigma}\left(\mathbf{V}^{+}\mathbf{a}\right) = \mathbf{U}^{+}\mathbf{b} \;\rightarrow\; \mathbf{\Sigma}\mathbf{z} = \mathbf{d}, \quad\text{or:} \quad (5)$$

$$\sigma_i z_i = d_i,\quad i = 1,\ldots,r; \qquad 0\cdot z_i = d_i,\quad i = r+1,\ldots,n, \quad (6)$$
where r is the rank of C.
The solution of (6) is $z_j = d_j/\sigma_j,\ j = 1,\ldots,r$, and, formally, $z_j = \infty,\ j = r+1,\ldots,n$. The reverse transition to the original model parameters is obvious: $\mathbf{a} = \mathbf{V}\mathbf{z}$.
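A numpy sketch of this prescription, with the reciprocals of the vanishing singular values replaced by zero (anticipating the $\tilde{\Sigma}^{-1}$ discussed below); the rank-deficient design matrix (third column equal to the sum of the first two) is an artificial example:

```python
import numpy as np

def svd_solve(C, b, eps=1e-12):
    """Minimal-norm least-squares solution a0 = V Sigma~^{-1} U^+ b."""
    U, sigma, Vh = np.linalg.svd(C, full_matrices=False)
    omega = np.zeros_like(sigma)
    keep = sigma > eps * sigma[0]
    omega[keep] = 1.0 / sigma[keep]          # 1/sigma_i, but 0 for sigma_i ~ 0
    return Vh.T @ (omega * (U.T @ b))

C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.0, 3.0, 5.0])
a0 = svd_solve(C, b)
```

numpy's own `lstsq` returns the same minimal-norm solution, which makes a convenient cross-check.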
It turns out that all $\sigma_j,\ j = 1,\ldots,n$ are non-zero only when the basis functions are linearly independent at $x_i,\ i = 1,\ldots,m$, i.e. when the vectors $\mathbf{X}_j \equiv \left(X_j(x_i),\ i = 1,\ldots,m\right),\ j = 1,\ldots,n$ are mutually linearly independent.

The potential linear dependence between some of the vectors $\mathbf{X}_j$ can be due either to a deficiency of the model, i.e. „redundant” basis functions (in turn either by design, or because their unique contributions are masked by numerical errors), or to the fact that with the available observations, accounting for their uncertainties, the contributions of some of the basis functions cannot be distinguished.
If the rank of C is smaller than the number of model parameters (because of the discussed essential or effective linear dependence between some basis functions), the choice of the respective $z_j$, $j = r+1,\dots,n$ in (6) is not unique, and hence the solution $a$ of the least squares problem, i.e. the solution which minimises the squared residual of the set $Ca = b$, is not unique.
In this case it is desirable to choose, among all possibilities, the model parameter set $a$ with the smallest norm $\|a\| \equiv \left(\sum_{i=1}^{n} a_i^2\right)^{1/2}$. This is so because otherwise some of the basis functions would enter the model with excessively large contributions $a_i$ which would ultimately cancel each other. A result of this kind is always less stable numerically and, which is worse, would mislead about the contribution of certain factors through which the observations are attempted to be explained.
It will be chosen below that $z_j = 0$, $j = r+1,\dots,n$ is substituted in (6), i.e. the matrix $\Sigma^{-1}$ is replaced by $\tilde{\Sigma}^{-1} = \mathrm{diag}\left(\omega_i,\; i = 1,\dots,r;\; 0,\; i = r+1,\dots,n\right)$ where $\omega_i \equiv 1/\sigma_i$, and the solution of $Ca = b$ is evaluated as $a_0 = V\tilde{z} = V\tilde{\Sigma}^{-1}d = V\tilde{\Sigma}^{-1}U^{+}b$. Then:
a) The solution $a_0$ has the smallest norm among those which minimise the norm of the residual of $Ca = b$;
b) The solution $a_0$ ensures the smallest norm of the residual of $Ca = b$.
The first statement is proved as follows.
1. Any solution of $Ca = b$ can be represented in the form:

$a = Vz = V\Sigma^{-1}d = V\Sigma^{-1}U^{+}b = \sum_{i=1}^{n}\omega_i\left(U_{(i)}\cdot b\right)V_{(i)}$, (7)

where $\omega_i \equiv 1/\sigma_i$, and $U_{(i)}$ and $V_{(i)}$ are correspondingly the i-th columns of U and V.
That is, $a$ is a linear combination of the columns of V with coefficients $\omega_i\left(U_{(i)}\cdot b\right)$.
This result is verified as follows. First, the matrix-vector product $y = Ax$ can be represented as $y = \sum_{i=1}^{n}x_i A_{(i)}$, where $A_{(i)}$, $i = 1,\dots,n$ are the columns of A. Thus $a = Vz = \sum_{i=1}^{n}z_i V_{(i)}$, where $z_i$ are the elements of the vector $z = \Sigma^{-1}U^{+}b$. Second, the product of a diagonal matrix and a vector, $\mathrm{diag}(x_i)\,y$, is the vector with elements $x_i y_i$. That is, $\Sigma^{-1}U^{+}b$ is a vector with elements $z_i = \omega_i x_i$, where $x_i$ are the elements of the vector $U^{+}b$:

$x_i = \sum_{j=1}^{m}U^{+}_{ij}b_j = \sum_{j=1}^{m}U_{ji}b_j = U_{(i)}\cdot b$.

Thus, finally: $a = V\Sigma^{-1}U^{+}b = \sum_{i=1}^{n}\omega_i\left(U_{(i)}\cdot b\right)V_{(i)}$.
2. Important properties of SVD, the proof of which is outside the scope of this text, are:
− The columns of V with sequential numbers coinciding with those of the zero singular values form a complete basis in the subspace of vectors $x \neq 0$ for which $Cx = 0$ (this subspace is known as the kernel (or nullspace) of the matrix);
− The columns of U with sequential numbers coinciding with those of the non-zero singular values form a complete basis in the subspace of vectors $y$ for which $y = Cx \neq 0$ with $x \neq 0$ (this subspace is known as the image (or range) of the matrix).
3. Therefore, if C has zero singular values, then $a$ can be written in the form:

$a = \sum_{i=1}^{r}\alpha_i V_{(i)} + \sum_{i=r+1}^{n}\beta_i V_{(i)} = a_0 + \delta$, (8)
where the form of $\alpha_i$ and $\beta_i$ is according to (7). At that, for the (generally non-zero) vector $\delta$ it is always true that $C\delta = 0$ (this follows from the above statements about the corresponding columns of V), and thus $Ca = Ca_0$ with an arbitrary choice of $\delta$ through the $\beta_i$. That is, both $a_0$ and any other vector $a$ of the form (8) are solutions to the set $Ca = b$.
From (5) through (7) it follows that, formally, $\beta_i = \infty$, $i = r+1,\dots,n$. By virtue of the freedom of choice for $\delta$ through the $\beta_i$, let $\delta = 0$ be selected by setting $\beta_i = 0$, $i = r+1,\dots,n$, i.e. $\omega_i \equiv 1/\sigma_i = 0$ when $\sigma_i = 0$ (!). As will be shown immediately below, this choice ensures the minimum norm of $a$.
4. And so, let for the purpose of this proof $\delta$ be some non-zero vector for which $C\delta = 0$, and which consequently can be represented as $\delta = \sum_{i=r+1}^{n}\gamma_i V_{(i)}$. Let also $\tilde{\Sigma}^{-1} \equiv \mathrm{diag}(\omega_i)$ be with zero elements where the singular values are zero, i.e. where $\sigma_i = 0$. Then the norm of any of the possible vectors $a$ (model parameter sets in the context of the examined problem) will be:

$\|a\| = \|a_0 + \delta\| = \left\|V\tilde{\Sigma}^{-1}U^{+}b + \delta\right\| = \left\|V\left(\tilde{\Sigma}^{-1}U^{+}b + V^{+}\delta\right)\right\| = \left\|\tilde{\Sigma}^{-1}U^{+}b + V^{+}\delta\right\|$, (9)
The second equality from the right is based on the orthogonality of V, and the last equality on the fact that orthogonal matrices preserve the vector norm: $\|Vz\| = \|z\|$. Also, from the choice of $\tilde{\Sigma}^{-1}$ it immediately follows that $V\tilde{\Sigma}^{-1}U^{+}b = a_0$, where $a_0$ is according to (8).
Because of the mentioned choice of $\delta$, $V^{+}\delta$ can be represented as:

$\mathrm{col}_j\left(V^{+}\delta\right) = \mathrm{col}_j\left(\sum_{i=r+1}^{n}\gamma_i V^{+}V_{(i)}\right) = \sum_{i=r+1}^{n}\gamma_i\left(V_{(j)}\cdot V_{(i)}\right) = 0$ for $j = 1,\dots,r$, and $= \gamma_j$ for $j = r+1,\dots,n$,

i.e. the vector $V^{+}\delta$ has zero elements if the respective singular values are non-zero and potentially non-zero elements if the respective singular values are zero.
With accounting of the form of $\tilde{\Sigma}^{-1}$ it is seen that the vector $\tilde{z} = \tilde{\Sigma}^{-1}U^{+}b$ has the following structure:

$\mathrm{col}_j(\tilde{z}) = \mathrm{col}_j\left(\tilde{\Sigma}^{-1}U^{+}b\right) = \omega_j\left(U^{+}b\right)_j$ for $j = 1,\dots,r$, and $= 0$ for $j = r+1,\dots,n$,

i.e. the vector $\tilde{z}$ has zero elements if the respective singular values are zero and potentially non-zero elements if the respective singular values are non-zero.
By the way, the vector $\tilde{z}$ is actually the solution $z$ of the set (6) for which the choice $z_i = 0$, $i = r+1,\dots,n$ is made. Here this is achieved automatically through the substitution $\omega_i \equiv 1/\sigma_i = 0$ when $\sigma_i = 0$.
5. From the above analysis of (9) it can be concluded that the norm of $a = a_0 + \delta$ is formed by the sum of the squares of the non-zero elements of $\tilde{z}$ and of $V^{+}\delta$, taken separately (i.e. there is no instance of summing the corresponding vector elements before the squaring operation, since their non-zero positions do not overlap). The conclusion from this result, in its turn, is that the smallest norm of $a = a_0 + \delta$ is achieved with $\gamma_i = 0$, $i = r+1,\dots,n$, i.e. with $\delta = 0$.
Thus it is finally proved that the substitution $\omega_i \equiv 1/\sigma_i = 0$ when $\sigma_i = 0$ ensures the smallest norm of the model parameter set $a$, and that this solution with the smallest norm is $a_0$, which is obtained from the solution of the set (6) by setting $z_i = 0$, $i = r+1,\dots,n$.
In the context of the discussion about the reasons for the emergence of zero singular values (including almost-zero values because of an effective linear dependence between the basis functions over the grid of points in which the model is evaluated), a good practice in applying SVD is to choose a threshold $\tau$ for the singular values, depending on the accuracy of the observations and of the arithmetic operations needed for evaluating the model. Thus, if $\sigma_j > \tau$, then $z_j = d_j/\sigma_j$, and if $\sigma_j \leq \tau$, then $z_j = 0$.
From the general logic of the problem it follows that the emergence of zero or practi-
cally zero singular values demands a reformulation of the model (removal or redefinition of
certain basis functions).
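A minimal numeric sketch of this thresholded solution (the function name `svd_solve` is illustrative, not from the notes):

```python
import numpy as np

def svd_solve(C, b, tau):
    """Minimal-norm least-squares solution of C a = b: singular values
    not exceeding the threshold tau are treated as zero, i.e. the
    corresponding omega_i = 1/sigma_i are replaced by 0."""
    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    omega = np.zeros_like(sigma)
    keep = sigma > tau
    omega[keep] = 1.0 / sigma[keep]
    d = U.T @ b                # d = U+ b
    return Vt.T @ (omega * d)  # a0 = V Sigma~^-1 U+ b

# Rank-deficient example: the third column equals the sum of the first
# two, so one singular value is numerically zero.
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])
b = np.array([1.0, 1.0, 2.0, 3.0])
a0 = svd_solve(C, b, tau=1e-8)
```

The result agrees with numpy's pseudo-inverse-based minimal-norm solution; any $a_0 + \delta$ with $\delta$ in the nullspace of C gives the same residual but a larger norm.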
The second statement, namely that SVD ensures the smallest norm (square) of the residual $R_0 \equiv Ca_0 - b$ of the set $Ca = b$, is proved as follows.
1. Let now $\delta$ be an arbitrary non-zero vector for which $C\delta = g \neq 0$. Therefore, $g$ can be represented as $g = \sum_{i=1}^{r}\gamma_i U_{(i)}$. Let $a = a_0 + \delta$, where $a_0$ is the above-commented solution with smallest norm. The norm of the residual of $Ca = b$ in this case will be:

$\|R\| = \|Ca - b\| = \left\|U\Sigma V^{+}\,V\tilde{\Sigma}^{-1}U^{+}b - b + g\right\| = \left\|U\left(\Sigma\tilde{\Sigma}^{-1}U^{+}b - U^{+}b + U^{+}g\right)\right\| = \left\|\left(\Sigma\tilde{\Sigma}^{-1} - 1\right)U^{+}b + U^{+}g\right\|$ (10)
2. Since $\mathrm{col}_j\left(\Sigma\tilde{\Sigma}^{-1}\right) = 1$ for $j = 1,\dots,r$ and $= 0$ for $j = r+1,\dots,n$, the matrix $\Sigma\tilde{\Sigma}^{-1} - 1$ will have non-zero elements only where the corresponding singular values are zero. Therefore, the vector $\left(\Sigma\tilde{\Sigma}^{-1} - 1\right)U^{+}b$ will have potentially non-zero elements only in the positions of the zero singular values.
On the other hand, because of the already mentioned choice of $g$, $U^{+}g$ can be represented as follows:

$\mathrm{col}_j\left(U^{+}g\right) = \mathrm{col}_j\left(\sum_{i=1}^{r}\gamma_i U^{+}U_{(i)}\right) = \sum_{i=1}^{r}\gamma_i\left(U_{(j)}\cdot U_{(i)}\right) = \gamma_j$ for $j = 1,\dots,r$, and $= 0$ for $j = r+1,\dots,n$,

i.e. the vector $U^{+}g$ has potentially non-zero elements only if the corresponding singular values are non-zero.
The conclusion from this analysis of (10) is that the norm of the residual of the set is formed from the sum of the squares of the non-zero elements of $\left(\Sigma\tilde{\Sigma}^{-1} - 1\right)U^{+}b$ and of $U^{+}g$, taken separately (i.e. there is no instance of summing the corresponding vector elements before the squaring operation). The conclusion from this result, in its turn, is that the smallest residual norm is achieved with $\gamma_i = 0$, $i = 1,\dots,r$, i.e. with $g = 0$ and hence $\delta = 0$. This smallest norm will be $\left\|R(a_0)\right\|^2 = \sum_{j=r+1}^{n}d_j^2$, where $d = U^{+}b$.
Thus it was proved that the solution $a_0$ selected above, which has the smallest norm, also ensures the smallest norm of the residual of the set $Ca = b$, i.e. this set is solved in the least-squares sense.
In summary, the usefulness of SVD and the introduction of a suitably chosen positive
threshold τ is as follows:
• The occurrence of zero or effectively zero singular values is an indicator which can
provide guidelines for improving or simplifying the model;
• The accuracy of finding the parameters $a_j$ is improved. This is so because the ratio $\sigma_{\max}/\sigma_{\min}$ can be regarded as a condition number $\mathrm{cond}(C)$ of the matrix C. Since

$\frac{\|\delta a\|}{\|a\|} \leq \mathrm{cond}(C)\,\frac{\|\delta b\|}{\|b\|}$,

where $\delta a$ is the vector of errors of the model parameters and $\delta b$ is the vector of errors of the observations, the condition number of the coefficient matrix is a measure of the sensitivity of the solution to errors in the vector of constant terms, i.e. typically to errors in the input data. (It can be shown that the condition number is also a measure of the sensitivity of the solution to errors in the matrix elements.) Thus, the zeroing of those $\sigma_j$ which are smaller than $\tau$ reduces the condition number to $\sigma_{\max}/\tau$. Since the condition number is essentially the error amplification factor, this reduction improves the reliability of determining the sought parameters $a_j$.
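The effect can be illustrated numerically (a constructed example, not from the notes): with a data error aimed along the direction of a tiny singular value, the full pseudo-inverse amplifies it by $1/\sigma_{\min}$, while the truncated one does not.

```python
import numpy as np

rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.normal(size=(8, 4)))   # 8x4 with orthonormal columns (plays U)
Q2, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # 4x4 orthogonal (plays V)
sigma = np.array([5.0, 1.0, 1e-2, 1e-10])
C = Q1 @ np.diag(sigma) @ Q2.T

a_true = np.array([1.0, 2.0, -1.0, 0.5])
b = C @ a_true
db = 1e-8 * Q1[:, 3]                            # data error along the worst direction

pinv_full = Q2 @ np.diag(1.0 / sigma) @ Q1.T    # keeps omega_4 = 1/1e-10
mask = sigma > 1e-6                             # threshold tau = 1e-6
pinv_trunc = Q2 @ np.diag(np.where(mask, 1.0 / sigma, 0.0)) @ Q1.T

err_full = np.linalg.norm(pinv_full @ (b + db) - a_true)    # amplified: ~1e-8/1e-10
err_trunc = np.linalg.norm(pinv_trunc @ (b + db) - a_true)  # bounded by |a_true|
```

The truncated solution trades a bounded bias (the dropped component of a_true) for the removal of a huge error amplification.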
It can additionally be shown that the variances of the model parameters are given by the expression:

$\sigma^2(a_j) = \sum_{i=1}^{n}\left(\frac{V_{ji}}{\sigma_i}\right)^2, \quad j = 1,\dots,n$ (12)

(Here, same as above, $\sigma_i$ are the diagonal elements of the matrix Σ.)
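Expression (12) can be checked against the standard covariance formula for a well-conditioned problem (`parameter_variances` is an illustrative helper name, and unit observation weights are assumed):

```python
import numpy as np

def parameter_variances(C):
    """Parameter variances from the SVD factors:
    var(a_j) = sum_i (V_ji / sigma_i)^2."""
    U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
    V = Vt.T
    return ((V / sigma) ** 2).sum(axis=1)   # divides column i of V by sigma_i

# For a well-conditioned design matrix this must agree with the
# diagonal of the usual covariance matrix (C^T C)^-1.
C = np.vander(np.linspace(0.0, 1.0, 12), 3)   # quadratic model, 12 nodes
var_svd = parameter_variances(C)
var_ref = np.diag(np.linalg.inv(C.T @ C))
```

The agreement follows from $(C^{T}C)^{-1} = V\,\mathrm{diag}(\sigma_i^{-2})\,V^{+}$.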
Example: Analysis of a gamma spectrum
The model is
$\tilde{y}(x;a) \equiv b_1 x + b_2 + \sum_{k=1}^{N}A_k\exp\left(-\frac{(x - c_k)^2}{2\sigma_k^2}\right)$,

where:
− x is a measure of the energy of gamma-quanta;
− $b_1 x + b_2$ is the background;
− N is the number of instrument lines;
− $A_k$, $c_k$ and $\sigma_k$ are respectively the amplitudes, positions and standard deviations of those lines.
The line full width at half maximum (FWHM) is related to the standard deviation as follows: $\mathrm{FWHM} = 2\sqrt{2\ln 2}\,\sigma$. The relation between amplitude A and area S is: $S = \sqrt{2\pi}\,A\sigma$.
The model is obviously non-linear and the equations for its parameters are of the form explained in the respective section above. This model is simplified and in particular relies on the assumption that the non-uniform energy sensitivity of the detector is accounted for in advance.
The application of SVD to this problem is illustrated below. The choice is m = 100, $x_i = i$, $i = 1,\dots,m$, N = 2, and the observations are synthesised with the parameter values from Table 1 in the following way:

$y_i = \tilde{y}(x_i;a) + \sqrt{\tilde{y}(x_i;a)}\;\xi_i, \quad i = 1,\dots,m$,

where $\xi_i$ is a random number sampled from the standard Gaussian distribution N(0, 1), i.e. $y_i \in N\left(\tilde{y}(x_i;a),\,\tilde{y}(x_i;a)\right)$.
This way of synthesising the observations corresponds to the meaning of $y_i$ as a number of registered gamma-quanta with an energy corresponding to the i-th channel; this number has a random value sampled from the Poisson distribution, which at a sufficiently large mathematical expectation converges to the normal (Gaussian) distribution with a variance equal to the expectation.
Table 1. Model parameters

k    z_k    b_k       A_k     c_k    σ_k
1    300    -2.02     8000    41     5
2    100    302.02    9000    55     5

$z_k$, $k = 1, 2$ are the background levels at $x_1$ and $x_m$, and the coefficients $b_k$, $k = 1, 2$ are obtained through linear interpolation between these two values.
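A sketch of this synthesis with the Table 1 values (function and variable names are illustrative, not from the notes):

```python
import numpy as np

def y_model(x, b1, b2, A, c, s):
    """Linear background plus N Gaussian lines, as in the model above."""
    y = b1 * x + b2
    for Ak, ck, sk in zip(A, c, s):
        y = y + Ak * np.exp(-((x - ck) ** 2) / (2.0 * sk ** 2))
    return y

rng = np.random.default_rng(42)
x = np.arange(1, 101, dtype=float)    # m = 100 channels, x_i = i
mu = y_model(x, -2.02, 302.02, [8000.0, 9000.0], [41.0, 55.0], [5.0, 5.0])
# Noise with variance equal to the expectation (Gaussian limit of Poisson).
y_obs = mu + np.sqrt(mu) * rng.standard_normal(x.size)
```

At channel 1 the expectation is essentially the background level 300; the first line peaks near 8000 counts above the background at channel 41.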
The initial parameter estimates with which the iterative minimisation of $R^2$ is started, as well as the parameter estimates during the minimisation process, are shown in Table 2. The course of minimisation of $R^2$ is illustrated in Figures 1 and 2.
Table 2. Minimisation steps. k is the step sequential number. k = 0 refers to the initial parameter estimates.

k    R²       z1      z2      A1       c1     σ1     A2        c2     σ2
0    202145   0       0       10000    35     1      10000     50     1
1    150419   459.4   145.5   2913.5   35.3   1.50   5024.3    50.1   1.98
2    53074    412.2   137.5   3314.0   37.2   4.10   4885.1    51.0   6.19
3    33696    285.7   88.9    6199.1   42.7   6.20   8185.6    56.0   6.62
4    26726    320.7   101.6   3882.0   34.9   2.74   11508.6   50.4   7.34
5    10924    280.8   74.1    2780.4   37.3   4.22   8884.3    50.6   7.65
6    39229    305.9   102.7   9307.4   45.8   8.28   7809.6    56.3   4.75
7    11855    306.9   121.2   6975.9   42.4   6.25   9116.5    57.0   4.88
8    744.6    308.6   96.5    7776.6   40.5   4.73   8950.9    55.0   5.25
9    96.6     301.8   98.4    7961.4   41.0   5.00   9045.6    55.0   5.02
10   93.3     301.4   98.2    8008.1   41.0   5.02   9021.9    55.1   4.99
11   93.3     301.4   98.2    8008.3   41.0   5.02   9022.1    55.1   4.99
Figure 1. R²(a) in the course of minimisation (R² plotted against the step number k; plot omitted).
Figure 2. Consecutive steps of the minimisation of R²(a).
The advantages of SVD over the normal equations method are better exhibited when the model is in large discrepancy with the observations.
Let, in a modification of the previous example, the observations consist only of the second line, whereas the model remains with two lines and its initial parameter estimates are the same as in the previous example.
In this case the course of iterations through SVD is shown in Table 3, and the initial and final steps are illustrated in Figure 3. It is seen that the implausible first line can easily be identified and excluded from the model, and even without doing this the remaining parameters are found accurately enough. Also, in this case three of the singular values effectively approach zero in the course of iterations, whereas in the previous example all singular values remain significantly non-zero throughout the iteration process.
Table 3. Minimisation steps through SVD. k is the sequential step number. k = 0 refers to the initial parameter estimates.

k    R²       z1      z2      A1       c1       σ1       A2       c2     σ2
0    820110   0.0     0.0     10000    35       1.0      10000    50     1.0
1    82627    376.3   144.0   -44.5    35.0     0.99     4143.8   50.3   1.70
...  ...      ...     ...     ...      ...      ...      ...      ...    ...
8    94.4     374.7   147.9   -280.3   -751.8   -459.2   9027.7   55.0   5.01
Figure 3. Initial and final step of the minimisation of R²(a) through SVD.
The course of iterations with the “normal equations” method is shown in Table 4, and
the initial and final steps are illustrated in Figure 4. The iterations are stopped at step 7 when
the matrix of the linear system becomes singular.
Table 4. Minimisation steps through solving the "normal equations". k is the sequential number of the iteration step. k = 0 refers to the initial parameter estimates.

k    R²         z1       z2       A1       c1        σ1       A2       c2     σ2
0    820110     0.0      0.0      10000    35        1.0      10000    50     1.0
1    82622      376.2    144.1    -43.1    35.0      0.99     4145.1   50.3   1.70
...  ...        ...      ...      ...      ...       ...      ...      ...    ...
6    1.44E+07   6553.1   5821.1   259613   70475.4   3840.1   9011.2   55.1   4.99
Figure 4. Initial and final step of the minimisation of R²(a) through solving the "normal equations".
5. Orthogonal polynomials. Approximation of functions.
Gaussian quadrature
Approximation of functions
A typical problem in computational modelling is the approximation of a known function $f(x)$ by a combination (most often linear) of functions belonging to a suitably chosen class. A frequently preferred class is $\{p_n(x)\}$, where $p_n(x)$ are polynomials of degree $n = 0, 1, 2,\dots$ An especially common instance of polynomial approximation is the Taylor series expansion. Another standard class of basis functions, evidently related to the Fourier transform, are $\sin nx,\ \cos nx$, $n = 0, 1,\dots$ Still another popular choice are the exponential functions. Many other sets of linearly independent functions with suitable properties are also used for the purpose of function approximation.
Linear least squares
A customary measure of the proximity between the approximating and the approximated functions is the weighted sum of the squared differences between the respective functional values. More specifically, let $f(x)$ be the approximated function and $x_i$, $i = 1,\dots,n$ be a sequence of chosen values of the independent variable (nodes) at which the values of $f(x)$ are known (e.g. observed), usually with some uncertainty. Let $f(x_i)$ be the exact values of $f(x)$ at $x_i$, and let the corresponding observed values be $f_i$. Let $\Phi_j(x)$, $j = 0, 1,\dots$ be a set of basis functions defined at each $x_i$; the approximating function at the nodes is $\tilde{f}_i \equiv \sum_{j=0}^{m}a_j\Phi_j(x_i)$. Then the coefficients $a_j$ must be determined so as to minimise the quantity

$S^2(a) = \sum_{i=1}^{n}w(x_i)\left(f_i - \sum_{j=0}^{m}a_j\Phi_j(x_i)\right)^2 \equiv \sum_{i=1}^{n}w(x_i)\,R_i^2$, (1)

where $w_i \equiv w(x_i)$ are suitable weights associated with the corresponding observations and/or nodes.
The least squares measure, although most frequently applied, is not an exclusive option in solving function approximation problems. Another standard possibility is to minimise the quantity $\max_x R(x) \equiv \max_x\left|f(x) - \tilde{f}(x)\right|$ over the interval where the approximation is constructed.
Despite the existence of different ways of defining the function approximation problem, here only the case of polynomial approximation in the least squares sense will be considered.
A standard analytical approach to the minimisation of $S^2(a)$ is to find a solution of the set of equations

$\frac{\partial S^2}{\partial a_k}(a) \equiv -2\sum_{i=1}^{n}w_i\left(f_i - \sum_{j=0}^{m}a_j\Phi_j(x_i)\right)\Phi_k(x_i) = 0, \quad k = 0,\dots,m$. (2)

Or:

$\sum_j g_{kj}a_j = \rho_k$, where (3)

$g_{kj} \equiv \sum_{i=1}^{n}w_i\,\Phi_k(x_i)\,\Phi_j(x_i)$ and $\rho_k \equiv \sum_{i=1}^{n}w_i f_i\,\Phi_k(x_i)$. (4)
The following remarks can be made about this system of m + 1 linear equations:
• If m + 1 = n, its solution $a$ leads to $S^2(a) = 0$, i.e. the function $\tilde{f}(x) = \sum_j a_j\Phi_j(x)$ interpolates between the data points $x_i, f_i$, $i = 1,\dots,n$. If m + 1 > n the problem is ill-posed and no unique solution $a$ can be found.
• The matrix $G \equiv (g_{kj})$ is symmetric. If the basis functions $\Phi_j(x)$, $j = 0, 1,\dots$ are orthogonal over the set of data points $x_i$, $i = 1,\dots,n$ with a weight function $w(x)$, i.e.

$\sum_{i=1}^{n}w(x_i)\,\Phi_k(x_i)\,\Phi_j(x_i) = 0,\; k \neq j; \quad \neq 0,\; k = j$,

then the matrix G will be diagonal.
• If the matrix G is not diagonal and the basis functions are polynomial, then G may tend to be ill-conditioned and consequently the roundoff errors in finding the solution $a$ may become inadmissibly high. Indeed, let for simplicity $\Phi_j(x) = x^j$, $w(x) = 1$ and all nodes $x_i$ be equidistant in the interval [0, 1]. Then, with large n,

$g_{kj} \equiv \sum_{i=1}^{n}x_i^{k+j} \approx n\int_0^1 x^{k+j}\,dx = \frac{n}{k+j+1}$, i.e. $G = nH$ where:
$H = \begin{pmatrix} 1 & \frac{1}{2} & \frac{1}{3} & \dots & \frac{1}{m+1} \\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \dots & \frac{1}{m+2} \\ \dots & \dots & \dots & \dots & \dots \\ \frac{1}{m+1} & \dots & \dots & \dots & \frac{1}{2m+1} \end{pmatrix}$.
(the factor n arises from the quadrature expression $\int_0^1 x^{k+j}\,dx \approx \sum_{i=1}^{n}x_i^{k+j}\,\Delta x$ where $\Delta x = 1/n$)
It is seen that with the increase of m the determinant of H will decrease, i.e. the matrix will approach singularity, and the elements of its inverse will grow in absolute value. Thus, for example, with m = 9, $H^{-1}$ will have elements of the order of $3\times 10^{12}$. Accounting for the fact that $a = G^{-1}\rho = n^{-1}H^{-1}\rho$ and that the elements of $\rho$ will contain inevitable roundoff errors, it is evident that the errors in $a$ will become unacceptably high.
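The growth of cond(H) is easy to check numerically (an illustration, not part of the notes; `hilbert` is a helper defined here):

```python
import numpy as np

def hilbert(m):
    """(m+1) x (m+1) Hilbert matrix, H_kj = 1/(k + j + 1), k, j = 0..m."""
    k = np.arange(m + 1)
    return 1.0 / (k[:, None] + k[None, :] + 1)

for m in (3, 6, 9):
    print(m, np.linalg.cond(hilbert(m)))   # grows by orders of magnitude with m
```

With m = 9 the condition number is of the order of 10^13, consistent with inverse elements of the order 3·10^12 quoted above.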
Orthogonal polynomials
The above considerations show that it is especially desirable that the polynomial basis functions be constructed as orthogonal over the set of nodes $x_i$, $i = 1,\dots,n$, i.e.

$\sum_{i=1}^{n}w_i\,p_j(x_i)\,p_k(x_i) = 0, \quad j \neq k$. (5)

Then the system of equations for the coefficients $a$ will take the form:

$d_k a_k = \omega_k, \quad k = 0,\dots,m$, where (6)

$d_k = \sum_{i=1}^{n}w_i\,p_k^2(x_i)$ and $\omega_k \equiv \sum_{i=1}^{n}w_i f_i\,p_k(x_i)$. (7)

The solution of this system, $a_k = \omega_k/d_k$, $k = 0,\dots,m$, is found immediately, and all difficulties arising from the ill-conditioned matrix G as per the above discussion are avoided. (Essentially this approach can in part be regarded as an analogue of the SVD technique in the particular case of linear data modelling with polynomial basis functions.)
Also, if m is replaced by m + 1, it will be sufficient to complement the already existing solution with $a_{m+1} = \omega_{m+1}/d_{m+1}$. (With a non-diagonal matrix G, the matrix would have to be completely re-evaluated and a new linear system for the entire set $a$ would have to be solved.) This
circumstance simplifies the choice of a degree M for the approximation. Indeed, if the approximated function $f(x)$ is exactly (or almost exactly) a polynomial of degree M and the approximation is of degree m > M, then the values of $a_j$, $j = M+1,\dots,m$ should statistically approach zero. The above assertion is verified directly. First, the orthogonal polynomials are mutually linearly independent (this is straightforwardly demonstrated through the way of their construction as described further below), so that any polynomial can be represented as a linear combination of them. Thus, if the approximated function $f(x)$ is exactly a polynomial of degree M, it can be represented as $f(x) = \sum_{k=0}^{M}\alpha_k p_k(x)$. Let also the observed function values be exact, i.e. $f_i = f(x_i)$. Then, according to (7):
$\omega_m \equiv \sum_{i=1}^{n}w_i f_i\,p_m(x_i) = \sum_{i=1}^{n}w_i\left(\sum_{k=0}^{M}\alpha_k p_k(x_i)\right)p_m(x_i) = \sum_{k=0}^{M}\alpha_k\sum_{i=1}^{n}w_i\,p_k(x_i)\,p_m(x_i) = \alpha_m d_m$ for $0 \leq m \leq M$, and $= 0$ for $m > M$.
With this the assertion is proved for exact observations. If these contain uncertainties, then by virtue of the central limit theorem in probability theory (insofar as the prerequisites for its validity are fulfilled) the observations will be normally distributed around the exact function values: $f_i \in N\left(f(x_i), \sigma_i\right)$, and from the above expression it becomes clear that the random variables $\omega_m$ will have a zero expectation at m > M. This is so because

$E(\omega_m) = \sum_{i=1}^{n}w_i\,E(f_i)\,p_m(x_i) = \sum_{i=1}^{n}w_i\,f(x_i)\,p_m(x_i) = \sum_{i=1}^{n}w_i\left(\sum_{k=0}^{M}\alpha_k p_k(x_i)\right)p_m(x_i) = \dots = 0$,

where $E(\cdot)$ is the mathematical expectation operator, which is linear.
It can also be demonstrated that in the examined case the quantity $S^2(a)/(n - m - 1)$ will be independent of m for $m \geq M$.
Thus, for finding M (which in general is not known in advance) it is expedient to solve the equations $Ga = \rho$ (or $Da = \omega$) successively for $m = 0, 1, 2,\dots$ until the corresponding values of $S^2(a)/(n - m - 1)$ stop decreasing significantly with the increase of m.
The general procedure of constructing a set of orthogonal polynomials (the Gram-Schmidt process) consists in setting $p_0 = 1$ and finding the higher-degree polynomials through the recurrence relation
$p_{n+1}(x) = x\,p_n(x) + \sum_{i=0}^{n}\alpha_i\,p_i(x)$, (8)

at which the coefficients $\alpha_i$ (a separate set for each n) are determined from the conditions

$\left(p_{n+1}, p_i\right) = 0, \quad i = 0,\dots,n$, where $\left(p_i, p_j\right) \equiv \int_a^b w(x)\,p_i(x)\,p_j(x)\,dx$. (9)

More specifically, since already $\left(p_i, p_j\right) = \delta_{ij}$ for i and j ≤ n, the orthogonality condition is

$\left(p_{n+1}, p_k\right) = \left(x\,p_n, p_k\right) + \alpha_k\left(p_k, p_k\right) = 0$. Therefore (10)

$\alpha_k = -\frac{\left(x\,p_n, p_k\right)}{\left(p_k, p_k\right)}, \quad k = 0,\dots,n$. (11)
Considering again the mutual orthogonality of the already constructed polynomials and using the defined recurrence relation, the numerators of $\alpha_k$ become:

$\left(x\,p_n, p_k\right) = \left(p_n, x\,p_k\right) = \left(p_n,\; p_{k+1} - \sum_{i=0}^{k}\alpha_i p_i\right) = 0, \quad k = 0,\dots,n-2$. (12)

Therefore, the recurrence relation is simplified to

$p_{n+1}(x) = x\,p_n(x) + \alpha_n p_n(x) + \alpha_{n-1}p_{n-1}(x) = \left(x - \beta_n\right)p_n(x) - \gamma_n p_{n-1}(x)$, (13)

where $\beta_n = -\alpha_n = \frac{\left(x\,p_n, p_n\right)}{\left(p_n, p_n\right)}$ and $\gamma_n = -\alpha_{n-1} = \frac{\left(x\,p_n, p_{n-1}\right)}{\left(p_{n-1}, p_{n-1}\right)}$. (14)
The last expression is simplified further:

$\left(x\,p_n, p_{n-1}\right) = \left(p_n, x\,p_{n-1}\right) = \left(p_n,\; p_n - \alpha_{n-1}p_{n-1} - \alpha_{n-2}p_{n-2} - \dots\right) = \left(p_n, p_n\right)$, (15)

so that $\gamma_n = \frac{\left(p_n, p_n\right)}{\left(p_{n-1}, p_{n-1}\right)}$. (16)
The orthogonalised polynomials can be normalised: $p_i \leftarrow \frac{p_i}{\sqrt{\left(p_i, p_i\right)}}$.
The mutual linear independence between the constructed set of polynomials follows
immediately from (8).
After defining $\left(p_i, p_j\right) \equiv \sum_{k=1}^{n}w_k\,p_i(x_k)\,p_j(x_k)$, the discrete counterpart of this process becomes completely analogous to the one described above.
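A compact sketch of the discrete construction via the three-term recurrence (13)-(16), combined with the diagonal solve $a_k = \omega_k/d_k$ (the function name `orthogonal_fit` is illustrative, not from the notes):

```python
import numpy as np

def orthogonal_fit(x, f, w, m):
    """Values at the nodes of the degree-m least-squares approximation,
    built in the basis of polynomials orthogonal over (x_i, w_i)."""
    dot = lambda u, v: np.sum(w * u * v)
    p_prev = np.zeros_like(x)               # p_{-1} = 0
    p = np.ones_like(x)                     # p_0 = 1
    gamma = 0.0
    approx = np.zeros_like(x)
    for k in range(m + 1):
        d = dot(p, p)                       # d_k = (p_k, p_k)
        approx = approx + (dot(f, p) / d) * p   # add a_k p_k with a_k = omega_k / d_k
        beta = dot(x * p, p) / d            # beta_k = (x p_k, p_k) / (p_k, p_k)
        p_next = (x - beta) * p - gamma * p_prev
        gamma = dot(p_next, p_next) / d     # gamma_{k+1} = (p_{k+1}, p_{k+1}) / (p_k, p_k)
        p_prev, p = p, p_next
    return approx

x = np.linspace(-1.0, 1.0, 100)
runge = 1.0 / (1.0 + 25.0 * x ** 2)
approx = orthogonal_fit(x, runge, np.ones_like(x), 20)
```

A degree-m fit of this kind reproduces any polynomial of degree ≤ m at the nodes exactly, and raising the degree only adds the new term $a_{m+1}p_{m+1}$.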
Example: Polynomial approximation of the Runge function
The Runge function, $f(x) = \frac{1}{1 + 25x^2}$, is noteworthy for the fact that its interpolation by a polynomial of a relatively high degree at equidistant nodes $x_i$ within the interval [−1, +1] tends to oscillate near the ends of this interval. A similar behaviour is exhibited by the polynomial approximation of this function.
Let the number of nodes be n = 100 and the maximum degree of orthogonal polynomial approximation with a weight function $w(x) = 1$ be M = 30. The dependence of $S^2(m)$, $m = 0,\dots,M$ on the degree m is shown below.
Figure 1. Behaviour of $S^2(m)$ in the course of approximating the Runge function by orthogonal polynomials ($S^2$ versus m, logarithmic ordinate; plot omitted).
It is seen that $S^2(m)$ decreases exponentially (the ordinate scale is logarithmic). The stepwise behaviour is due to the fact that Runge's function is even, and the orthogonal polynomials over the chosen interval with the chosen weight function are Legendre's polynomials. These polynomials are even functions of x when of an even degree, and odd when of an odd degree. Therefore, the contributions of the odd-degree polynomials must be zero. This is indeed so: the coefficients of the odd-degree polynomials take on negligible values as compared with those of even degree. In this context it is appropriate to build the approximation with only the even-degree polynomials belonging to this orthogonal set.
The approximations at some selected degrees are shown below.
Figure 2. Approximations of the Runge function with orthogonal polynomials over 100 equi-
distant nodes in the interval [-1,+1].
The examined example shows that a polynomial approximation of a relatively low degree can eliminate the deficiencies of an attempted global polynomial interpolation (the interpolating polynomial in this case would deviate significantly from the true function between the nodes, especially near the ends of the interval).
Below a general polynomial approximation following the standard linear least squares approach is illustrated. The basis functions are $\Phi_k(x) = x^k$.
Figure 3. Behaviour of $S^2(m)$ in the course of a general polynomial approximation of Runge's function ($S^2$ versus m, logarithmic ordinate; plot omitted).
Figure 4. General polynomial approximations of Runge’s function over 100 equidistant nodes
in the interval [-1,+1].
It is seen that up to relatively low degrees the two approaches are equivalent and fail to produce a consistent approximation. At higher degrees the general polynomial approximation with basis functions $\Phi_k(x) = x^k$ is again unsuccessful because of numerical instability, whereas the approximation with orthogonal polynomials leads to a satisfactory solution of the approximation (and effective interpolation) problem.
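The stable branch of this comparison can be reproduced with numpy's built-in Legendre least-squares fit (an illustration; `Legendre.fit` maps the data interval to [-1, 1] internally):

```python
import numpy as np
from numpy.polynomial import Legendre

x = np.linspace(-1.0, 1.0, 100)
f = 1.0 / (1.0 + 25.0 * x ** 2)   # Runge's function

# Degree-30 least-squares fit expressed in the orthogonal Legendre basis.
leg = Legendre.fit(x, f, 30)
err = np.max(np.abs(leg(x) - f))
```

At this degree the fit tracks Runge's function closely at the nodes, with no sign of the instability that plagues the monomial basis.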
Data smoothing
The least squares approximation by polynomials for the purpose of data modelling is normally performed over the entire set of observations $f_i$, $i = 1,\dots,n$, and the weights $w_i$ are assumed to be reciprocal to the observation variances: $w_i = 1/\sigma_i^2$. However, the technique of least squares approximation by orthogonal polynomials is also applied for the purpose of so-called data "smoothing", which aims at filtering out the potential noise in order to reveal (at least visually) the trends in the examined functional dependence. Smoothing usually encompasses a restricted number n of adjacent nodes, the approximating polynomial degree m is low, and the weight function is $w(x) = 1$. If the nodes are equidistant and the number n of the included nodes is chosen to be odd, i.e. $n = 2L + 1$, then it is convenient to introduce a new independent variable

$s = \frac{x - x_{L+1}}{x_n - x_{L+1}}$,

which varies from −1 to 1 (the node numbering is changed for simplicity). Under these restrictions and with this choice of independent variable, the set of orthogonal polynomials $p_j(s)$, $j = 0,\dots,m$ will be determined solely by the values of n and m. Thus the expressions for evaluating the smoothed data $\tilde{f}_i$ will have a fixed form. For example, with n = 3 (L = 1) and m = 1 they will be:

$\tilde{f}_1 = \frac{1}{6}\left(5f_1 + 2f_2 - f_3\right)$, $\quad\tilde{f}_2 = \frac{1}{3}\left(f_1 + f_2 + f_3\right)$, $\quad\tilde{f}_3 = \frac{1}{6}\left(-f_1 + 2f_2 + 5f_3\right)$. (17)

Since at each step the n-point data set which is subjected to the smoothing procedure is shifted forward by one position, the expressions for $\tilde{f}_1$ and $\tilde{f}_3$ are applied to the first and the last node respectively, and the expression for $\tilde{f}_2$ to all interior nodes.
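The fixed expressions (17) in code (a sketch; `smooth3` is an illustrative name). Because the underlying local fit is a straight line, exactly linear data pass through the filter unchanged:

```python
import numpy as np

def smooth3(f):
    """3-point, degree-1 least-squares smoothing, expressions (17):
    sliding mean for interior nodes, end formulas at the boundaries."""
    f = np.asarray(f, dtype=float)
    g = np.empty_like(f)
    g[0] = (5.0 * f[0] + 2.0 * f[1] - f[2]) / 6.0        # first node
    g[1:-1] = (f[:-2] + f[1:-1] + f[2:]) / 3.0           # interior nodes
    g[-1] = (-f[-3] + 2.0 * f[-2] + 5.0 * f[-1]) / 6.0   # last node
    return g

x = np.arange(10, dtype=float)
line = 2.0 + 0.5 * x
smoothed = smooth3(line)   # identical to line, since the data are linear
```

For noisy data the interior formula is simply a running 3-point mean, while the end formulas extrapolate the fitted line to the boundary nodes.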
Gaussian quadrature
The general approach in numerical integration, i.e. the evaluation of $I_{ab} \equiv \int_a^b f(x)\,dx$, ultimately reduces to interpolation of the integrand and subsequent approximation of $I_{ab}$ by an analytical expression for the integral of the interpolating function. The final result can always be represented in the form $I_{ab} \cong \sum_{i=1}^{n}a_i f(x_i)$, where the nodes $x_i$, $i = 1,\dots,n$ are chosen in a certain way, and the coefficients (weights) $a_i$, $i = 1,\dots,n$ depend on this choice. It is evident that in the case of polynomial interpolation and an arbitrary choice of distinct nodes, the quadrature formula will be exact for integrands which are polynomials of a degree not higher than n − 1.
On the other hand, if some special conditions (n in number) are imposed on the selection of nodes, it can be expected that the total number of 2n conditions (n for the nodes + n for the interpolation requirement) will be sufficient to construct a quadrature formula which will be exact for integrands which are polynomials of a degree not higher than 2n − 1. This is
These notes represent the lecture contents in Computational Methods in Nuclear Technology. Errors and inaccuracies are possible. I. Christoskov, January 2015
81/150
namely the idea behind the so-called Gaussian quadrature. This method is aimed at finding a
quadrature formula of the form:
$$\int_a^b w(x)\,f(x)\,dx \cong \sum_{i=1}^{n} a_i f(x_i); \qquad w(x) \ge 0, \quad \int_a^b w(x)\,dx > 0. \qquad (18)$$
The rationale for introducing a weight function $w(x)$ is that it allows the integrand $g(x)$ to be decomposed into a product of a function $w(x)$, which may be rather "complex" and, in particular, not even nearly reducible to a polynomial of a comparatively low degree, and a "simple" function $f(x)$, which may be exactly or approximately represented as a polynomial of a low degree: $g(x) = w(x)\,f(x)$. In this sense a quadrature as defined in (18) can provide a powerful tool for economical and precise integration of a wide range of functions.
Let $1 \equiv Q_0, Q_1, \dots, Q_n, \dots$ be a set of orthogonal polynomials with respect to $w(x)$ on $[a, b]$. For such polynomials:

$$\int_a^b w(x)\,Q_i(x)\,Q_j(x)\,dx = 0,\ i \ne j; \qquad \int_a^b w(x)\,Q_n(x)\,q(x)\,dx = 0, \qquad (19)$$

where $q(x)$ is any polynomial of degree lower than n.

(The latter is due to the fact that orthogonal polynomials form a basis and allow the expansion $q(x) = \sum_{i=0}^{n-1} \alpha_i Q_i(x)$.)
Let $x_i,\ i = 1, \dots, n$ be the zeros (roots) of the orthogonal polynomial $Q_n(x)$. For certain appropriate weight functions it can be shown that these roots are real, distinct and lie within the interval $(a, b)$.

Let:

a) the nodes in (18) be the roots of $Q_n(x)$;

b) the weights $a_i,\ i = 1, \dots, n$ be chosen so that the quadrature (18) is interpolational, i.e. exact for all polynomials of degree $n-1$.
For example, the Lagrange polynomials

$$L_j(x) = \frac{\prod_{i \ne j} (x - x_i)}{\prod_{i \ne j} (x_j - x_i)}$$

allow the representation $f(x) = \sum_{i=1}^{n} f(x_i)\,L_i(x) + R(f; x)$. If $f(x)$ is precisely a polynomial of degree $n-1$, then the remainder $R(f; x) = 0$ and

$$\int_a^b w(x)\,f(x)\,dx = \sum_{i=1}^{n} f(x_i) \int_a^b w(x)\,L_i(x)\,dx = \sum_{i=1}^{n} a_i f(x_i),$$

i.e. the interpolational weights are $a_i = \int_a^b w(x)\,L_i(x)\,dx$.
Let $\Phi_{2n-1}(x)$ be an arbitrary polynomial of degree $2n-1$. It can always be represented in the form:

$$\Phi_{2n-1}(x) = Q_n(x)\,q_{n-1}(x) + r_{n-1}(x), \qquad (20)$$

where the quotient $q_{n-1}(x)$ and the remainder $r_{n-1}(x)$ are polynomials of degree not higher than $n-1$.
Then:

$$\int_a^b w(x)\,\Phi_{2n-1}(x)\,dx = \int_a^b w(x)\,r_{n-1}(x)\,dx = \sum_{i=1}^{n} a_i\,r_{n-1}(x_i) = \sum_{i=1}^{n} a_i \left[ Q_n(x_i)\,q_{n-1}(x_i) + r_{n-1}(x_i) \right] = \sum_{i=1}^{n} a_i\,\Phi_{2n-1}(x_i). \qquad (21)$$

The second expression follows from (20) and (19) (the term with $Q_n q_{n-1}$ integrates to zero), the third is equivalent to the second because of the choice of weights, the fourth is equivalent to the third because of the choice of nodes ($Q_n(x_i) = 0$), and the fifth is an alternative representation of the fourth (because of (20)).
Thus it was proved that with the above choice of n weights and nodes the quadrature formula (18) is exact for polynomials of degree $2n-1$.

The sets of orthogonal polynomials needed for evaluating the nodes (and weights) can be explicitly constructed for each particular weight function $w(x)$ and integration limits. In most cases, however, it is expedient to employ standard quadrature sets for a selection of weight functions and to convert the integration limits to the standard ones through a linear change of variables. Some of the orthogonal polynomial sets corresponding to frequently chosen weight functions and integration limits are:
− Legendre polynomials: $[a, b] = [-1, 1]$ and $w(x) = 1$.
− Chebyshev polynomials of the first kind: $(a, b) = (-1, 1)$ and $w(x) = \dfrac{1}{\sqrt{1 - x^2}}$.

− Chebyshev polynomials of the second kind: $[a, b] = [-1, 1]$ and $w(x) = \sqrt{1 - x^2}$.

− Laguerre polynomials: $[a, b) = [0, \infty)$ and $w(x) = x^{\alpha} \exp(-x),\ \alpha > -1$.

− Hermite polynomials: $(a, b) = (-\infty, \infty)$ and $w(x) = \exp(-x^2)$.
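The exactness property for degree $2n-1$ is easy to verify numerically; standard Gauss–Legendre node and weight sets are available, for instance, in NumPy (a sketch added here for illustration; `leggauss` returns the nodes and weights for $w(x) = 1$ on $[-1, 1]$):

```python
import numpy as np

# n = 5 Gauss-Legendre nodes x_i and weights a_i: the quadrature
# sum(a_i * f(x_i)) is exact for polynomials up to degree 2n-1 = 9.
x, a = np.polynomial.legendre.leggauss(5)

# Integrate f(x) = x^8 over [-1, 1]; the exact value is 2/9.
approx = np.sum(a * x**8)
print(approx)
```

With only five evaluations of the integrand the degree-8 polynomial is integrated to machine precision, in contrast with an interpolational rule on five arbitrary nodes, which is only guaranteed exact up to degree 4.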
6. Monte Carlo methods
Monte Carlo (MC) methods are a widely applied class of computational algorithms for simulating the behaviour of various physical and mathematical systems, as well as for solving some general computational problems. The principal characteristic feature of MC methods is their stochasticity, expressed in the employment of random (or, most often, pseudorandom) numbers for performing a simulation or solving a numerical problem.
A typical application of MC is the modelling of transport phenomena – neutron transport for the purposes of nuclear reactor analysis, photon transport for radiation shielding analyses and for determining the detection efficiency of detector systems, etc. In these problems the MC approach essentially consists in finding certain quantitative characteristics of a given macroscopic system by simulating the microscopic interactions therein (these interactions are most often of an intrinsically stochastic nature).

Based on a more general mathematical formulation of this class of problems, it can be said that Monte Carlo methods represent a technique for solving systems of integro-differential equations for the sought macroscopic characteristics (e.g. radiation fields).
Another self-standing application of MC methods is numerical integration. Deterministic methods of numerical integration require evaluating the integrand at a set of selected points (most often equidistant) in the space of the function arguments. With a large number of independent variables this approach may be impracticable – for example, a case with 100 independent variables and 10 grid points in each dimension would require $10^{100}$ evaluations of the integrand. This example is not purely artificial, because in many physical problems there exists a unique correspondence between their dimensionality and the number of degrees of freedom. With Monte Carlo methods, on the other hand, the information about the integrand is collected through random sampling of points in the space of independent variables, and the integral is evaluated as an appropriately defined average of the function values. By the law of large numbers such a method will converge as $1/\sqrt{N}$, i.e. a quadruple increase of the sample volume will result in halving the uncertainty of the estimated value of the integral, independently of the problem dimensionality. The efficiency of the method can be further improved if the points in the arguments space are sampled from a probability distribution with a density function which resembles in shape the magnitude of the integrand. Thus the evaluated functional values will be predominantly among those with a larger contribution to the integral.
MC methods are also a powerful and widely used technique for numerical optimisation (finding the global minimum of a function, often of a large number of independent variables). These methods are based on the so-called random walk. The itinerary through the space of function arguments follows a principal downhill trend, but with a possibility of random excursions in the direction of function growth, thus reducing the chance of getting stuck in a local minimum instead of the global one.
From the above considerations it follows that a prerequisite for the employment of MC methods is the capability to generate samples from various and practically arbitrary probability distributions. An obvious example in support of this statement is numerical integration. Another clear example is presented by the task of solving transport problems. For instance, if the energy of an incident particle is $E'$, then its energy $E$ after a scattering event is a random quantity sampled from the distribution

$$f(E', E) = \frac{\Sigma_s(\mathbf{r}, E' \to E)}{\Sigma_s(\mathbf{r}, E')},$$

where $\mathbf{r}$ is the point of scattering. This point in its turn is also random, and it is sampled from the distribution

$$f(\mathbf{r}, E') = \Sigma_t(\mathbf{r}, E') \exp\left[ -T(\mathbf{r}, \mathbf{\Omega}, R, E') \right], \qquad T(\mathbf{r}, \mathbf{\Omega}, R, E') \equiv \int_0^R \Sigma_t(\mathbf{r} - \mathbf{\Omega} R'', E')\,dR'',$$

where $T$ is the so-called optical thickness and $R$ is the free path length for a particle travelling in direction $\mathbf{\Omega}$ from its point of emergence (previous collision) $\mathbf{r}_0$ to the current collision point, i.e. $\mathbf{r} = \mathbf{r}_0 + R\,\mathbf{\Omega}$.
Generation of random deviates with a chosen probability distribution
Here it is appropriate to recall some of the properties of random variables and of their probability distributions.

Let $x$ be a random variable and $P(a \le x \le b)$ the probability that $x$ takes a value between $a$ and $b$. The probability density function $f(x)$ of this random variable is defined by the property:

$$f(x_0)\,\Delta x = P(x_0 \le x \le x_0 + \Delta x) \quad \text{at } \Delta x \to 0. \qquad (1)$$

Therefore:

$$f(x) \ge 0 \quad \text{and} \quad \int_{x_-}^{x_+} f(x)\,dx = 1,$$

where $x_-$ and $x_+$ are the bounds of the possible values of $x$, which in particular can be $-\infty$ and $+\infty$, and

$$P(a \le x \le b) = \int_a^b f(x)\,dx. \qquad (2)$$

The cumulative distribution function $F(x)$ is defined as:

$$F(x_0) = P(x \le x_0). \qquad (3)$$

Therefore:

$$F(x) = \int_{x_-}^{x} f(x')\,dx'; \qquad F(x) \to 1 \text{ as } x \to x_+; \qquad F(x) \to 0 \text{ as } x \to x_-;$$

$$P(a \le x \le b) = F(b) - F(a); \qquad \frac{dF(x)}{dx} = f(x). \qquad (4)$$
Uniform distribution

The uniform distribution $U(a, b)$ has the following probability density function:

$$f(x)\,dx = \begin{cases} \dfrac{dx}{b - a}, & a \le x \le b \\ 0, & x < a \text{ or } x > b \end{cases} \qquad (5)$$

If $x \in U(0, 1)$, then the expectation of this random variable is $\mu = \int_{-\infty}^{\infty} x\,f(x)\,dx = \frac{1}{2}$ and its variance is

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \frac{1}{12}. \qquad (6)$$
Normal distribution

The normal (Gaussian) distribution $N(\mu, \sigma)$ has the following probability density function:

$$f(x)\,dx = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] dx, \qquad (7)$$

where $\mu$ and $\sigma^2$ are respectively the expectation and the variance of this distribution.

If the random variable $x$ has a distribution $N(\mu, \sigma)$, then the random variable $\xi \equiv \frac{x - \mu}{\sigma}$ will have the distribution $N(0, 1)$. And conversely, if the random variable $x$ has a distribution $N(0, 1)$, then the random variable $\xi \equiv \mu + \sigma x$ will have a distribution $N(\mu, \sigma)$. These statements follow from the definitions of mathematical expectation and of variance and are valid for any probability density function. More specifically, for the expectation:

$$E(\mu + \sigma x) \equiv \int (\mu + \sigma x)\,f(x)\,dx = \mu + \sigma\,E(x),$$

and for the variance:

$$D(\mu + \sigma x) \equiv \int \left[ \mu + \sigma x - E(\mu + \sigma x) \right]^2 f(x)\,dx = \sigma^2 D(x).$$
Central limit theorem

If $x_1, x_2, \dots$ are mutually independent random variables, all belonging to the same distribution with an expectation $\mu$ and variance $\sigma^2$, then at $n \to \infty$ the random variable $s_n \equiv \sum_{i=1}^{n} x_i$ will belong to the distribution $N(n\mu, \sqrt{n}\,\sigma)$.

In particular, if $x_i,\ i = 1, \dots$ are sampled from the distribution $U(0, 1)$, then, for instance, the random variable $\xi \equiv \sum_{i=1}^{12} x_i - 6$ will have a distribution approximating $N(0, 1)$.
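The 12-uniform recipe above is easy to try in code (a minimal sketch; the function name and sample sizes are illustrative choices):

```python
import random

def gauss12(rng=random):
    """Approximate N(0,1) deviate: the sum of 12 U(0,1) deviates has
    expectation 12*(1/2) = 6 and variance 12*(1/12) = 1, so subtracting
    6 gives an approximately standard normal random variable."""
    return sum(rng.random() for _ in range(12)) - 6.0

random.seed(0)
sample = [gauss12() for _ in range(100_000)]
mean = sum(sample) / len(sample)
var = sum((s - mean)**2 for s in sample) / len(sample)
print(mean, var)   # sample mean near 0, sample variance near 1
```

The approximation is adequate for rough work but its tails are truncated at ±6; the Box-Muller transform discussed below produces exact normal deviates.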
The transformation method for generating deviates with a specified probability distribution
Let the random variable $y = \varphi(x)$, with a probability density function $g(y)$, be some prescribed (fully deterministic) function of the random variable $x$ with a probability density function $f(x)$. Then the following relation holds (fundamental transformation law of probabilities):

$$g(y)\,dy = f(x)\,dx, \qquad (8)$$

where $dx$ is an arbitrary small increment, whereas $dy = \varphi(x + dx) - \varphi(x)$. Or,

$$g(y) = f(x)\,\frac{dx}{dy} = f\!\left( \varphi^{-1}(y) \right) \frac{dx}{dy}. \qquad (9)$$

In particular, if $y = F(x)$, where $F$ is the cumulative distribution function (3) corresponding to $f(x)$, then from (4), (9) and the fact that $F$ is a monotonically increasing function of $x$, it follows that

$$g(y) = g(F) = f(x)\,\frac{dx}{dy} = \frac{dF}{dx}\,\frac{dx}{dF} = 1.$$

Since at that $0 \le F \le 1$, the random variable $y = F(x)$ will have a uniform distribution between 0 and 1.

Thus if $\xi$ is sampled from the uniform distribution between 0 and 1, then

$$x = F^{-1}(\xi) \qquad (10)$$

will be a random deviate from the specified distribution $f(x)$.

Let, for example, $f(x) = e^{-x}$, $0 < x < \infty$. Therefore $F(x) = 1 - e^{-x}$, so that the application of (10) results in the prescription to evaluate the random deviate $x$ belonging to the specified distribution $f(x)$ according to the expression $x = -\ln(1 - \xi)$, where $\xi \in U(0, 1)$. Since, on the other hand, $\zeta \equiv 1 - \xi$ is also from the distribution $U(0, 1)$, finally:

$$x = -\ln \zeta, \quad \text{where } \zeta \in U(0, 1). \qquad (11)$$
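Prescription (11) translates directly into code. A minimal sketch (the expression `1.0 - random.random()` is used so that the argument of the logarithm stays in (0, 1]):

```python
import math
import random

random.seed(1)

def exp_deviate():
    """Transformation method, eq. (11): x = -ln(zeta), zeta ~ U(0,1),
    gives a deviate from f(x) = exp(-x), x > 0."""
    zeta = 1.0 - random.random()   # value in (0, 1], avoids log(0)
    return -math.log(zeta)

sample = [exp_deviate() for _ in range(200_000)]
mean = sum(sample) / len(sample)   # the expectation of f(x) = exp(-x) is 1
print(mean)
```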
Sampling from the normal distribution (Box-Muller method)
Let the random variables $x_1, x_2, \dots, x_n$ have the joint probability density function $f(x_1, \dots, x_n)$.

Let there also exist the prescribed, i.e. completely deterministic, functional dependences:

$$y_1 = \varphi_1(x_1, \dots, x_n), \quad \dots, \quad y_n = \varphi_n(x_1, \dots, x_n). \qquad (12)$$
Then, analogously to (8), the following relation holds:
$$p(y_1, \dots, y_n)\,dy_1 \dots dy_n = f(x_1, \dots, x_n) \left| \det \frac{\partial (x_1, \dots, x_n)}{\partial (y_1, \dots, y_n)} \right| dy_1 \dots dy_n, \qquad (13)$$

where $\left| \det \frac{\partial (x_1, \dots, x_n)}{\partial (y_1, \dots, y_n)} \right|$ is the absolute value of the determinant of the Jacobian matrix,

$$\frac{\partial (x_1, \dots, x_n)}{\partial (y_1, \dots, y_n)} \equiv \begin{pmatrix} \dfrac{\partial x_1}{\partial y_1} & \dots & \dfrac{\partial x_1}{\partial y_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial x_n}{\partial y_1} & \dots & \dfrac{\partial x_n}{\partial y_n} \end{pmatrix}.$$

Now let in particular:

$$y_1 = \sqrt{-2 \ln x_1}\,\cos 2\pi x_2, \qquad y_2 = \sqrt{-2 \ln x_1}\,\sin 2\pi x_2. \qquad (14)$$
Then:

$$x_1 = \exp\left[ -\frac{1}{2} \left( y_1^2 + y_2^2 \right) \right], \qquad x_2 = \frac{1}{2\pi}\,\mathrm{arctg}\,\frac{y_2}{y_1}. \qquad (15)$$

The determinant of the Jacobian matrix will be:

$$\det \frac{\partial (x_1, x_2)}{\partial (y_1, y_2)} = -\left[ \frac{1}{\sqrt{2\pi}}\,e^{-y_1^2/2} \right] \left[ \frac{1}{\sqrt{2\pi}}\,e^{-y_2^2/2} \right].$$

From the above and from (13) it follows that $y_1$ and $y_2$ are mutually independent random variables, each with a probability density function $N(0, 1)$.
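A direct implementation of (14) can be sketched as follows (function name and sample sizes are illustrative):

```python
import math
import random

def box_muller(rng=random):
    """Box-Muller transform, eq. (14): two independent U(0,1)
    deviates x1, x2 -> two independent N(0,1) deviates y1, y2."""
    x1 = 1.0 - rng.random()          # value in (0, 1], avoids log(0)
    x2 = rng.random()
    r = math.sqrt(-2.0 * math.log(x1))
    return r * math.cos(2.0 * math.pi * x2), r * math.sin(2.0 * math.pi * x2)

random.seed(2)
ys = [y for _ in range(50_000) for y in box_muller()]
mean = sum(ys) / len(ys)
var = sum((y - mean)**2 for y in ys) / len(ys)
print(mean, var)   # sample mean near 0, sample variance near 1
```

Each call consumes two uniform deviates and produces two exact (not approximate) standard normal deviates, unlike the 12-uniform recipe mentioned earlier.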
The rejection method for generating deviates with a specified probability distribution
Let the desired probability density function be $p(x)$, and let $f(x)$ be a suitably chosen enveloping function (cf. Fig. 1), such that $f(x) \ge p(x)$ and $\int_{-\infty}^{\infty} f(x)\,dx$ has a finite value.

If the above described transformation method is applied for sampling from the distribution $p(x)$, and the generated sample is plotted on the x axis, then the number of random deviates in a given interval of this axis will be statistically proportional to the area under the curve of $p(x)$ in this interval. If these deviates $x_i,\ i = 1, \dots$ are regarded as the abscissae of points in the $(x, p(x))$ plane, and their corresponding ordinates are chosen as random deviates from the uniform distributions $U(0, p(x_i))$, then the density (number per unit area) of these points under the curve of $p(x)$ will be a statistically constant quantity.
An analogous procedure applied to the function $f(x)$ will differ from the above described only in the necessity to account for the fact that $\int_{-\infty}^{+\infty} f(x)\,dx = A > 1$. Thus, in the algorithm for this method the random deviate $\xi$ must be sampled from the distribution $U(0, A)$ instead of the distribution $U(0, 1)$. Of course, the ordinates of the points in the plane of the plot of $f(x)$ must be sampled from the uniform distributions $U(0, f(x_i))$. In this case the abscissae of those points with ordinates not larger than $p(x_i)$ will form a sample from the distribution $p(x)$. This is so because they will meet all the requirements of the transformation method as if it were applied directly to $p(x)$.
And so, let the definition interval of $f(x)$ and $p(x)$ be $[x_-, x_+]$ (this interval can as well be infinite), and let the analogue of the cumulative distribution function for $f(x)$ be $F(x) \equiv \int_{x_-}^{x} f(x')\,dx'$. Let $A \equiv F(x_+) = \int_{x_-}^{x_+} f(x')\,dx'$. In this case the algorithm of sampling from $p(x)$ by the so-called rejection method is as follows:

1. Sample $\xi \in U(0, A)$. Compute $x_0 = F^{-1}(\xi)$. Sample $y_0 \in U(0, f(x_0))$.

2. Compare $y_0$ with $p(x_0)$. If $y_0 \le p(x_0)$, the abscissa $x_0$ is accepted as a random deviate $x$ from the distribution $p(x)$. Otherwise steps 1 and 2 are repeated.

It is clear that the relative share of "successful" hits will be equal to the ratio of the area under $p(x)$ to the area under $f(x)$. Therefore, for a better efficiency of the algorithm it is desirable that $f(x)$ compactly envelopes $p(x)$ and has an easily computable inverse of its primitive function (antiderivative). Thus the rejection method can be regarded as a substitute for the transformation method when the evaluation of the inverse of the antiderivative of the desired probability density function $p(x)$ is difficult or computationally expensive.
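The two-step algorithm can be sketched for a hypothetical target density $p(x) = 2x$ on $[0, 1]$ with the constant envelope $f(x) = 2$, so that $A = 2$, $F(x) = 2x$ and $F^{-1}(\xi) = \xi/2$; these particular choices are made here only for illustration:

```python
import random

random.seed(3)

def p(x):
    """Hypothetical target density on [0, 1]; integrates to 1."""
    return 2.0 * x

def sample_p():
    """Rejection method with the constant envelope f(x) = 2 on [0, 1]."""
    while True:
        xi = random.uniform(0.0, 2.0)     # step 1: xi ~ U(0, A), A = 2
        x0 = xi / 2.0                     #         x0 = F^{-1}(xi)
        y0 = random.uniform(0.0, 2.0)     #         y0 ~ U(0, f(x0)), f(x0) = 2
        if y0 <= p(x0):                   # step 2: accept, else repeat
            return x0

sample = [sample_p() for _ in range(100_000)]
mean = sum(sample) / len(sample)          # exact expectation of p is 2/3
print(mean)
```

Here the acceptance rate is the ratio of the areas, $1/2$; a more compact envelope would waste fewer trials.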
Figure 1. The rejection method. The illustration is from NUMERICAL RECIPES IN FORTRAN 77: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43064-X) Copyright (C) 1986-1992 by Cambridge University Press.
Assessing the quality of the sample
The quality of a sample from a given probability distribution can be assessed by means
of the so-called Pearson’s 2χ test.
Let the continuous random variable x has a probability density function ( )xf and let
Lixi ,...,1, = (1)
is a sample from this distribution.
Let the range of possible values of x is subdivided into M intervals with boundaries
1,...,1, += Mjbj . (2)
Let further
MjN j ,...,1, = (3)
are the numbers of observations in the corresponding intervals, i.e. kN is the number of
elements of the sample (1) with values within the interval [ ]1, +kk bb . It is clear that LNM
jj =∑
=1
.
On the other hand, let MjF j ,...,1, = are the expected numbers of observations:
( )∫+
=1k
k
b
b
k dxxfLF . (4)
If the hypothesis that $x_i,\ i = 1, \dots, L$ is a sample from the distribution $f(x)$ is true, then the observation numbers $N_j,\ j = 1, \dots, M$ will have a Poisson distribution with a probability density function:

$$p(N_j) = \exp(-F_j)\,\frac{F_j^{N_j}}{N_j!}. \qquad (5)$$

$p(N_j)$ is the probability of occurrence of $N_j$ successful outcomes if their expected number is $F_j$, if this expected number is strictly proportional to the total number of trials (here the total number of trials is the sample volume $L$), and if the occurrence of a successful outcome does not depend on the number of trials after the occurrence of the previous successful outcome. It is seen that the process of sampling from $f(x)$ fully corresponds to these conditions.

The Poisson distribution (5) has an expectation $\mu_j = F_j$ and a variance $\sigma_j^2 = F_j$. When the expectation is large, e.g. $F_j > 10$, the Poisson distribution converges to the Gaussian (normal) one: $N(F_j, \sqrt{F_j})$.

By virtue of the last two statements (if the prerequisite for the second of them is fulfilled), the quantity

$$z_j = \frac{N_j - F_j}{\sqrt{F_j}} \qquad (6)$$

will have a standard Gaussian distribution $N(0, 1)$, and therefore the sum

$$S_0^2 \equiv \sum_{j=1}^{M} z_j^2 = \sum_{j=1}^{M} \frac{(N_j - F_j)^2}{F_j} \qquad (7)$$

will have a $\chi_M^2$ distribution with $M$ degrees of freedom.

Then, if the probability $P(\chi_M^2 > S_0^2)$ is lower than e.g. 0.05, the above mentioned hypothesis is rejected, i.e. the quality of the tested sample is not satisfactory.
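A sketch of the test for the simplest case of a hypothetical $U(0, 1)$ sample with $M$ equal bins (so $F_j = L/M$). The 0.05 critical value 18.31 for $\chi^2$ with $M = 10$ degrees of freedom is taken from standard tables, following the degrees-of-freedom convention used in the text:

```python
import random

random.seed(4)

L, M = 100_000, 10
sample = [random.random() for _ in range(L)]      # hypothesis: U(0, 1)

# Observed counts N_j in M equal bins; expected counts F_j = L/M.
N = [0] * M
for x in sample:
    N[min(int(x * M), M - 1)] += 1
F = L / M

S0_sq = sum((Nj - F)**2 / F for Nj in N)          # eq. (7)

# Reject the hypothesis at the 5 % level if S0_sq exceeds the upper
# 0.05 quantile of chi^2 with M = 10 degrees of freedom (about 18.31).
print(S0_sq, S0_sq < 18.31)
```

With the tabulated quantile replaced by the appropriate value for a different $M$, the same few lines test a sample against any hypothesised $f(x)$ once the $F_j$ are computed from (4).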
Monte Carlo integration
The simplest approach can be explained through an analogy with the elementary algorithms with equidistant abscissae (e.g. the rectangle rule). Let the task be to evaluate $I \equiv \int_0^1 f(x)\,dx$. According to the rectangle rule, if the interval $[0, 1]$ is subdivided into $N$ equal subintervals, the estimate of the integral will be

$$I \cong R \equiv \frac{1}{N} \sum_{i=1}^{N} f(x_i),$$

where $x_i$ are the centres of these subintervals, uniformly distributed in the interval $[0, 1]$. Thus $R$ can be interpreted as an estimate of the average function value in this interval: $R \cong \bar{f}$.

It is clear that with a sufficiently large number of points $N$ this estimate will not depend on the specific choice of these points, provided that they are uniformly distributed in the interval $[0, 1]$. Thus, if $\xi_i,\ i = 1, \dots, N$ is a random sample with a volume $N$ from the uniform distribution $U(0, 1)$, then the sample average

$$\bar{f} \cong \frac{1}{N} \sum_{i=1}^{N} f(\xi_i)$$

will also be an estimate of the integral $I \equiv \int_0^1 f(x)\,dx$.
These considerations are directly generalised for arbitrary integration limits, as well as
for a higher dimensionality of the integral.
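The plain sample-average estimator can be sketched as follows (the example integrand $x^2$ and its exact integral $1/3$ are chosen here only for illustration):

```python
import random

random.seed(5)

def mc_integral(f, N):
    """Plain Monte Carlo estimate of the integral of f over [0, 1]:
    the sample average of f evaluated at N deviates from U(0, 1)."""
    return sum(f(random.random()) for _ in range(N)) / N

est = mc_integral(lambda x: x * x, 1_000_000)   # exact value is 1/3
print(est)
```

For integration limits $[a, b]$ it suffices to sample $x = a + (b - a)\,\xi$ and multiply the average by $(b - a)$; the generalisation to several variables is equally direct.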
In the present context $f(x)$ is a random variable because it is a function of a random argument. Let this random variable have a probability density function $p(f)$. Let

$$\mu_p = E(f) \equiv \int_{f_-}^{f_+} f\,p(f)\,df \qquad (1)$$

and

$$\sigma_p^2 = D(f) \equiv \int_{f_-}^{f_+} (f - \mu_p)^2\,p(f)\,df \qquad (2)$$

be respectively the expectation and the variance of $f$.

In this sense $f_i \equiv f(\xi_i),\ i = 1, \dots, N$ is a random sample from the distribution $p(f)$, and its sample average $\bar{f} \cong \frac{1}{N} \sum_{i=1}^{N} f_i$ is obviously also a random quantity. The sample elements $f_i$ are mutually independent values of the random variable $f$, i.e. they are mutually independent random variables, each with a distribution $p(f)$. Therefore, for each of them the following equalities are valid: $E(f_i) = \mu_p$ and $D(f_i) = \sigma_p^2$.
Thus, the expectation of the sample average is:

$$E(\bar{f}) = E\!\left( \frac{1}{N} \sum_{i=1}^{N} f_i \right) = \frac{N \mu_p}{N} = \mu_p, \qquad (4)$$

and the variance of this average:

$$D(\bar{f}) = D\!\left( \frac{1}{N} \sum_{i=1}^{N} f_i \right) = \frac{1}{N^2} \sum_{i=1}^{N} D(f_i) = \frac{N \sigma_p^2}{N^2} = \frac{\sigma_p^2}{N}. \qquad (5)$$
The first equality in the above chain is due to the mutual independence of the sample elements:

$$D(\bar{f}) = E\!\left[ \left( \frac{1}{N} \sum_{i=1}^{N} f_i - \mu_p \right)^2 \right] = \frac{1}{N^2}\,E\!\left[ \left( \sum_{i=1}^{N} (f_i - \mu_p) \right)^2 \right] = \frac{1}{N^2} \left\{ \sum_{i=1}^{N} E\!\left[ (f_i - \mu_p)^2 \right] + \sum_{i \ne j} E\!\left[ (f_i - \mu_p)(f_j - \mu_p) \right] \right\} = \frac{1}{N^2} \left\{ N \sigma_p^2 + \sum_{i \ne j} E\!\left[ (f_i - \mu_p)(f_j - \mu_p) \right] \right\}, \qquad (6)$$

where the cross terms vanish:

$$E\!\left[ (f_i - \mu_p)(f_j - \mu_p) \right] = \int_{f_-}^{f_+} \! \int_{f_-}^{f_+} (f_i - \mu_p)(f_j - \mu_p)\,p(f_i, f_j)\,df_i\,df_j = \left[ \int_{f_-}^{f_+} (f_i - \mu_p)\,p(f_i)\,df_i \right] \left[ \int_{f_-}^{f_+} (f_j - \mu_p)\,p(f_j)\,df_j \right] = 0. \qquad (7)$$

(The last chain of equalities relies on the definition of the expectation of a function of a multidimensional (in particular two-dimensional) random variable with a probability density function $p(x_1, x_2)$:

$$E\!\left( \varphi(x_1, x_2) \right) = \int_{x_{1-}}^{x_{1+}} \! \int_{x_{2-}}^{x_{2+}} \varphi(x_1, x_2)\,p(x_1, x_2)\,dx_1\,dx_2,$$

and on the fact that for the mutually independent random variables $f_i$ and $f_j$: $p(f_i, f_j) = p(f_i)\,p(f_j)$.)
Expression (5) deserves a special emphasis because it actually pertains to all estimates obtained through any Monte Carlo method.
The possibility to choose a different probability distribution (besides the uniform) for the random arguments $x$ in the integration problem can be justified and commented on in conjunction with the already discussed transformation method for sampling from a specified distribution. There the algorithm reduced to populating the area under the curve of a non-negative function $f(x)$ with points, the density (number per unit area) of which is statistically constant everywhere under this curve. Thus the number $N_{ab}$ of points with abscissae between two fixed boundaries, e.g. $a$ and $b$, would be proportional to the integral of the function within these limits, $F_{ab} \equiv \int_a^b f(x)\,dx$. In this sense, if there exists a function $g(x) \le f(x)$ and the number of points with ordinates $y_i \le g(x_i)$ ($x_i$ are the abscissae of these points) is $M_{ab}$, then the integral $G_{ab} \equiv \int_a^b g(x)\,dx$ can be estimated through the expression

$$G_{ab} = \frac{M_{ab}}{N_{ab}}\,F_{ab},$$

provided that the value of $F_{ab}$ is known.

This approach has the following two extreme variants.

The first one is $f(x) = A \ge \max_{[a,b]} g(x)$, at which $F_{ab} = A \times (b - a)$ and the abscissae and ordinates of the respective points are sampled from the uniform distributions $U(a, b)$ and $U(0, A)$. A big disadvantage in this case is that if $g(x) \ll A$ in most of the interval $[a, b]$ (a typical situation), then with a reasonably large number of points $N_{ab}$ the number of hits $M_{ab}$ under the curve of $g(x)$ will be quite small and correspondingly will have a very large relative uncertainty (this follows from the discussion about the Poisson distribution made above in relation to the assessment of the quality of samples). This large relative uncertainty is directly inherited by the estimate of $G_{ab}$.
The second extreme is to choose $f(x) \approx g(x)$ in the interval $[a, b]$ (observing the requirement that $f(x) \ge g(x)$ in this interval). In this case, of course, the value of $F_{ab}$ must be known in advance, the abscissae $x_i$ must be evaluated by the transformation method for $f(x)$, and the ordinates must be sampled from the uniform distributions $U(0, f(x_i))$. This ensures $M_{ab} \approx N_{ab}$, and the statistical uncertainty of the estimate $G_{ab}$ is minimised. In the absolutely extreme case of $f(x) = g(x)$, and correspondingly $M_{ab} = N_{ab}$, the estimate $G_{ab}$ will coincide with the actual integral, but the employment of a Monte Carlo procedure will be completely meaningless.
The above considerations can be generalised and further specified as follows.
The problem of evaluating $I_{ab} = \int_a^b f(x)\,dx$ is equivalent to the problem of evaluating

$$\int_a^b g(x)\,\frac{f(x)}{g(x)}\,dx = \int_a^b g(x)\,\varphi(x)\,dx.$$

Let here $g(x)$ play the role of a probability distribution from which the arguments $x_i$ of the function values $\varphi(x_i)$ are sampled. In full agreement with the considerations so far and with the initial particular example, an estimate of the integral $I_{ab}$ will be the sample average $\bar{\varphi} = \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i)$. (In the initial example the integration limits were $[0, 1]$, $g(x) = 1 = U(0, 1)$, and obviously $\varphi(x) = f(x)$.)

In order that $g(x)$ can play the role of a probability distribution it is necessary that $g(x) \ge 0,\ x \in [a, b]$, and $\int_a^b g(x)\,dx = 1$. Let, in particular, $[a, b]$ be such that in it $f(x) \ge 0$, too. This can always be ensured by subdividing the original integration interval into regions where the integrand does not change its sign and, where the integrand is negative, by assigning a negative sign to the final result.

In this way the distribution $g(x)$ can be chosen to resemble in shape the actual integrand, i.e. $C g(x) \cong f(x)$, where $C$ is a multiplier through which the requirement for a unit integral value of $g(x)$ is fulfilled. Thus, in contrast with the actual integrand (which in particular may consist of one or several peaks, so that only the functional values calculated in their close vicinities will have a significant contribution to the evaluated integral), the effective integrand $\varphi(x) = \frac{f(x)}{g(x)} \cong C$ will have a practically uniform contribution to the integral everywhere within the integration limits $[a, b]$.
As was already shown, an estimate of the integral $I_{ab}$ is the sample average $\bar{\varphi} = \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i)$, where $x_i,\ i = 1, \dots, N$ is a sample from the distribution $g(x)$. It was also shown that the variance of $\bar{\varphi}$ is $\sigma^2(\bar{\varphi}) = \frac{\sigma^2(p)}{N}$, where $\sigma^2(p)$ is the variance of the probability distribution $p(\varphi)$ of $\varphi(x)$ (with a random argument $x$ the functional value $\varphi(x)$ is also random). In the considered case $\varphi(x) \cong C$ and $p(\varphi) \cong \delta(\varphi - C)$, where $\delta$ is the Dirac delta function, so that $\sigma^2(p) \cong 0$.

This is namely the desired effect of reducing the uncertainty of the estimated value of the integral. Of course, the extreme case of $\varphi(x) = C$ would lead to $\sigma^2(p) = 0$ and hence to zero uncertainty, but then it would be necessary that the normalising constant $C$ be precisely equal to the sought integral $I_{ab}$ (!) (this follows from $C g(x) = f(x)$ and the requirement for a unit integral of the probability distribution $g(x)$), and the entire stochastic procedure is rendered meaningless.
The general conclusion is that the application of variance reduction methods requires
the preliminary knowledge of an approximate solution to the integration problem, whereas the
role of the Monte Carlo procedure is reduced to a potentially small refinement of this solution.
This conclusion is also valid for the application considered below.
An example

Let the task be to evaluate the integral $I \equiv \int_0^{\pi/2} \cos x\,dx$. The exact result with which the numerical estimate can be compared is 1.0.
1. Let the probability distribution $g(x)$ be uniform between 0 and $\frac{\pi}{2}$, i.e. $g(x) = \frac{2}{\pi}$. The cumulative distribution function is $G(x) = \frac{2}{\pi} x$, and its inverse is $x = G^{-1}(\xi) = \frac{\pi}{2}\,\xi$. The integral estimate is $I \cong \bar{\varphi}$, where $\bar{\varphi} = \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i)$ is the sample average and

$$\varphi(x) = \frac{f(x)}{g(x)} = \frac{\pi}{2} \cos x.$$

The respective estimate of the standard deviation of the result is

$$\sigma(\bar{\varphi}) = \sqrt{\frac{\overline{\varphi^2} - \bar{\varphi}^2}{N}}, \quad \text{where} \quad \overline{\varphi^2} = \frac{1}{N} \sum_{i=1}^{N} \varphi^2(x_i).$$

The arguments $x_i$ are sampled from the distribution $g(x)$ by the transformation method, i.e. $x_i = G^{-1}(\xi_i)$, $\xi_i \in U(0, 1)$.
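Case 1 can be sketched as follows (a minimal illustration of the formulas above; the value of N and the seed are arbitrary choices):

```python
import math
import random

random.seed(6)

N = 100_000
# Case 1: g(x) = 2/pi on [0, pi/2]; the transformation method gives
# x = G^{-1}(xi) = (pi/2)*xi, and phi(x) = f(x)/g(x) = (pi/2)*cos(x).
phi = [(math.pi / 2) * math.cos((math.pi / 2) * random.random())
       for _ in range(N)]

I_est = sum(phi) / N
sigma = math.sqrt((sum(p * p for p in phi) / N - I_est**2) / N)
print(I_est, sigma)
```

The estimate approaches 1.0 and the uncertainty estimate can be compared with the uniform row of Table 1 below.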
2. A better choice would be a distribution $g(x)$ which resembles in shape the integrand $f(x)$. Since the implementation of the transformation method for sampling from $g(x)$ is often a complicated task, $g(x)$ is chosen as a trade-off between the shape requirement and the ease of implementation of the transformation method. An example of such a trade-off might be

$$g(x) = \frac{1}{c} \exp(-x), \quad x \in \left[ 0, \frac{\pi}{2} \right],$$

where the denominator $c = 1 - \exp\left( -\frac{\pi}{2} \right)$ ensures normalisation to a unit integral of $g(x)$. This choice is by no means optimal and is justified only by the fact that $f(x)$ is a decreasing function within the integration limits. The integral is evaluated as $I \cong \bar{\varphi}$, where

$$\varphi(x) = \frac{f(x)}{g(x)} = c\,\exp(x)\,\cos x,$$

and the standard deviation of this result is estimated analogously to the previous case. The rule of sampling from $g(x)$, as per the transformation method, is $x_i = -\ln(1 - c\,\xi_i)$, where $\xi_i \in U(0, 1)$.
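Case 2 differs from Case 1 only in the sampling rule and in $\varphi(x)$; the resulting uncertainty is visibly smaller (again a sketch with arbitrary N and seed):

```python
import math
import random

random.seed(7)

N = 100_000
c = 1.0 - math.exp(-math.pi / 2)
# Case 2: g(x) = exp(-x)/c on [0, pi/2]; the transformation method
# gives x = -ln(1 - c*xi), and phi(x) = f(x)/g(x) = c*exp(x)*cos(x).
phi = []
for _ in range(N):
    x = -math.log(1.0 - c * random.random())
    phi.append(c * math.exp(x) * math.cos(x))

I_est = sum(phi) / N
sigma = math.sqrt((sum(p * p for p in phi) / N - I_est**2) / N)
print(I_est, sigma)
```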
Another choice of a probability distribution $g(x)$ can be e.g. the linear function

$$g(x) = \frac{1}{c} \left( a x + b \right), \quad x \in \left[ 0, \frac{\pi}{2} \right].$$

The constants $a$ and $b$ are chosen so that $g(x) > 0,\ x \in \left[ 0, \frac{\pi}{2} \right]$, with e.g. $g(0) = \frac{1}{c}$ and $g\!\left( \frac{\pi}{2} \right) = \frac{0.01}{c}$. The denominator

$$c = \frac{a}{2} \left( \frac{\pi}{2} \right)^2 + b\,\frac{\pi}{2}$$

ensures normalisation to a unit integral of $g(x)$. The cumulative distribution function is

$$G(x) = \frac{1}{c} \left( \frac{a}{2}\,x^2 + b x \right).$$

Its inversion is already not quite trivial, since it requires finding the roots of a second-degree polynomial and selecting the root which lies in the interval $\left[ 0, \frac{\pi}{2} \right]$. The result is

$$x = G^{-1}(\xi) = \frac{-b + \sqrt{b^2 + 2 a c\,\xi}}{a}.$$

The integral is evaluated as $I \cong \bar{\varphi}$, where

$$\varphi(x) = \frac{f(x)}{g(x)} = \frac{c \cos x}{a x + b},$$

and the standard deviation of this result is estimated analogously to the previous case. The rule of sampling from $g(x)$, as per the transformation method, is $x_i = G^{-1}(\xi_i)$, where $\xi_i \in U(0, 1)$.

This choice of $g(x)$ is only slightly more efficient than the previous one. It was made solely for the purpose of illustrating the potential difficulties of implementing the transformation method.
3. The above illustrated difficulties can be overcome through interpolating between a set of stored values G(x_i), i = 1, ..., M, in order to provide a quick and simple (although approximate) solution of the equation G(x) = ξ for x with a known ξ. Here it is convenient and appropriate to evaluate g(x), which is needed for the formation of φ(x), by means of an analogous interpolation procedure. This approach allows a great degree of freedom in the choice of g(x). In the considered case it was chosen to approximate f(x) = cos(x) at x ∈ [0, π/2] by g(x) = (1/c)(a + b exp(x)) with the additional condition that g(x) ≥ 0, x ∈ [0, π/2]. With a = 1.4 and b = −0.28457 the similarity between f(x) and g(x) is quite good. The constant c, as previously, ensures a unit integral of g(x) from 0 to π/2. The results below are obtained through piecewise linear interpolation.
With an arbitrary but fixed random sequence ξ_i ∈ U(0,1), i = 1, ..., N, the results are as follows.

Table 1. Monte Carlo evaluation of ∫₀^{π/2} cos(x) dx

                                                      N=10                N=100 000
  g(x)                                              I       σ           I       σ
  (1/c)               (uniform)                   0.8886  0.1527      0.9984  0.0015
  (1/c) exp(−x)       (exponential)               1.0449  0.0490      0.9996  0.0007
  (1/c)(ax + b)       (linear)                    1.0239  0.0431      1.0004  0.0004
  (1/c)(a + b exp(x)) (approximation,
                       interpolation)             0.9948  0.0132      1.0035  0.0002
N is the size of the sample from the uniform distribution, which in the first case is used directly and in all other cases is employed for generating a sample from g(x). For comparability, all results with a given N are produced using the same sample. As should be expected, with large N all results converge to the same correct estimate. The relative advantage of choosing non-uniform distributions g(x) which in a certain sense resemble the integrand is also seen. It must be noted that with a very good approximation (e.g. similar to the attempt in
the last case) the numerical integration (by Monte Carlo, or any other method) can successfully be replaced by an analytical integration of the approximating function.
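As a cross-check, the uniform and exponential rows of Table 1 can be reproduced with a short sketch (a minimal illustration with our own function names and seed, not the original course code; the exact numbers depend on the random sequence used):

```python
import math
import random

def mc_uniform(n, rng):
    # g(x) = 1/c, uniform on [0, pi/2]; c = pi/2
    c = math.pi / 2
    total = total_sq = 0.0
    for _ in range(n):
        x = c * rng.random()
        phi = c * math.cos(x)                   # phi(x) = f(x)/g(x)
        total += phi
        total_sq += phi * phi
    mean = total / n
    sigma = math.sqrt((total_sq / n - mean * mean) / n)
    return mean, sigma

def mc_exponential(n, rng):
    # g(x) = exp(-x)/c on [0, pi/2]; c = 1 - exp(-pi/2)
    c = 1.0 - math.exp(-math.pi / 2)
    total = total_sq = 0.0
    for _ in range(n):
        x = -math.log(1.0 - c * rng.random())   # transformation method
        phi = c * math.exp(x) * math.cos(x)     # phi(x) = f(x)/g(x)
        total += phi
        total_sq += phi * phi
    mean = total / n
    sigma = math.sqrt((total_sq / n - mean * mean) / n)
    return mean, sigma

rng = random.Random(1)
print(mc_uniform(100_000, rng))
print(mc_exponential(100_000, rng))
```

The shape-matched exponential g(x) yields a visibly smaller σ for the same sample size, as in Table 1.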
Figure 2. Monte Carlo evaluation of ∫₀^{π/2} cos(x) dx: the integrand and the probability distributions g(x) (curves: 1: cos(x); 2: const.; 3: exp(−x); 4: 1 − ax; 5: a + b·exp(x)).
Monte Carlo for particle transport problems
In its simplest form this MC method consists in tracking the histories of a finite number of particles (N) while accounting for the stochastic nature of the events during a history. This implies sampling from various probability distributions, e.g. of the scattering angle, the free path to a collision, etc.
The procedure in the particular case of a stationary problem with an external source in a non-multiplying medium would be as follows. A history is started through sampling a random set of coordinates, an initial energy and a direction from some known probability distributions which characterise the source. After determining the free path to the first collision, again as a random number with a known probability distribution, the collision location is found and the material zone to which this location belongs is identified. In a random way, based on data about the total cross-sections interpreted as a probability distribution, the target nuclide and the reaction type (absorption or scattering) are determined. In the case of absorption the particle history is terminated. In the opposite case (scattering), the direction of the scattered particle is sampled from the probability distribution of the scattering angle, and its energy is determined uniquely from the requirement to conserve the energy and momentum of the system (this would be so with elastic scattering, whereas with inelastic scattering the energy is also sampled from a respective probability distribution). The history of the scattered particle is tracked analogously for the subsequent collisions and is terminated in the case of absorption or of leaving the problem boundaries.
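The history loop just described can be sketched for the simplest possible configuration, a monoenergetic particle in a one-dimensional slab with isotropic scattering (all names and parameter values here are illustrative choices, not taken from the notes):

```python
import math
import random

def slab_history(rng, sigma_t, sigma_a, thickness):
    """Track one particle born at x = 0 travelling into a 1-D slab.
    Returns True if it leaks through the far face (x > thickness)."""
    x, mu = 0.0, 1.0                                # position, direction cosine
    while True:
        path = -math.log(1.0 - rng.random()) / sigma_t  # free path to collision
        x += mu * path
        if x < 0.0:
            return False                            # leaked back out of the slab
        if x > thickness:
            return True                             # transmitted
        if rng.random() < sigma_a / sigma_t:
            return False                            # absorbed: history terminated
        mu = 2.0 * rng.random() - 1.0               # isotropic scattering direction

rng = random.Random(7)
n = 50_000
transmitted = sum(slab_history(rng, 1.0, 0.5, 2.0) for _ in range(n))
print(transmitted / n)                              # transmission probability estimate
```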
Let the purpose of the described procedure be to estimate the mathematical expectation of a given quantity, e.g. a flux, current, reaction rate, etc. This quantity usually matches none of the random microscopic parameters of the particle histories (scattering angles, locations, free paths between successive collisions, etc.).
Thus the question arises of what statistic of which of those parameters shall be used as an estimator for the desired quantity. Let, for example, the desired quantity be the scalar flux Φ.
Further, if c = V Σ_t Φ is the average number of collisions in some volume V, then the mean value (the expectation) of the scalar flux in this volume will be Φ = c/(V Σ_t), where Σ_t is the macroscopic total cross-section. A sample estimate of this quantity will be Φ ≈ c̄/(V Σ_t), where

c̄ = (1/N) Σ_{n=1}^{N} c_n    (1)

is the sample average of the number of collisions in volume V per unit time, and c_n is the respective number of collisions contributed by the n-th history. Since usually the flux of particles with a certain energy (i.e. with energy within the interval [E, E + dE]) is sought, the contribution of an individual history will comprise only the events at which the particle energy has been in the desired interval. Similarly, if an estimate of the directional flux is sought, then only the events caused by particles with directions of travel within a given solid angle are accounted.
On the other hand, the scalar flux is defined as Φ ≡ v N_p, where v is the particle velocity and N_p is their number in a given phase volume. Based on this, the sample estimate of the flux will be Φ ≈ l̄/V, where

l̄ = (1/N) Σ_{n=1}^{N} l_n    (2)

is the average track length of the particles in volume V. The track length contributed by the n-th history is l_n = Σ_{i=1}^{I} v_i, where I is the number of traversals of volume V per unit time having occurred while following the n-th history, and v_i is the particle velocity during each traversal. Here also the contribution l_n typically comprises only the events at which the energy (and/or the direction of travel) of the particle has been within the desired interval. This scalar flux estimator is known as "track type" because the respective events resemble those in a track detector.
Since the number of collisions in a given volume is expectedly smaller than the number of crossings of this volume, the statistic (2) will be more stable than (1) and is usually preferred.
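The two estimators can be compared in a deliberately stripped-down setting: monoenergetic particles start at x = 0 and travel in the +x direction through a purely absorbing medium, so each history has exactly one collision (a toy sketch with our own parameter choices; the "volume" is the segment [a, b] of unit cross-sectional area):

```python
import math
import random

def estimators(n, a, b, sigma_t, rng):
    """Collision estimator (1) and track-length estimator (2) of the
    mean scalar flux in the segment [a, b], pure absorption, Sigma_t."""
    coll_sum = track_sum = 0.0
    for _ in range(n):
        d = -math.log(1.0 - rng.random()) / sigma_t  # site of the only collision
        if a < d < b:
            coll_sum += 1.0                          # collision inside the volume
        if d > a:
            track_sum += min(d, b) - a               # track length inside the volume
    vol = b - a
    flux_coll = coll_sum / n / (vol * sigma_t)       # c / (V Sigma_t)
    flux_track = track_sum / n / vol                 # l / V
    return flux_coll, flux_track

rng = random.Random(3)
print(estimators(200_000, 0.5, 1.5, 1.0, rng))
# both approach the exact mean flux (exp(-0.5) - exp(-1.5)) / (b - a)
```

Both estimators are unbiased here, while the track-length statistic fluctuates less, in line with the remark above.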
From the above examples it can be seen that the useful result from Monte Carlo simulations of transport phenomena is estimated as a sample average

x̄ = (1/N) Σ_{n=1}^{N} x_n,    (3)

where x_n is the contribution of the n-th history to the estimated quantity.
From those examples it is also seen that there are two extreme types of estimators, the statistics of which are used for evaluating the desired physical quantity: parameters with a binomial distribution, with which the contributions of the individual particles to the estimated result are either 0 or 1 (e.g. the number of collisions of a particle with a given energy while crossing a given volume), and parameters with all histories having non-zero contributions to the sought estimate (e.g. a "track type" statistic in a large volume for a wide energy range).
Estimators with a binomial distribution are almost never used in practice, but it is also impossible to find an estimator with which all histories would have a non-zero contribution to the estimated result. Most often the probability distribution of the contribution of the chosen estimator x to the evaluated result is of the form:

f(x) = c δ(x − 0) + g(x),    (4)

where δ is the Dirac delta function.
The delta-function term in (4) is due to the histories with no contribution to the estimated result, whereas g(x) describes the distribution of the non-zero contributions of particle histories. A similar distribution is observed e.g. in shielding problems, where a large relative share c of particles cannot penetrate the shield and correspondingly cannot contribute to the flux on its outer surface.
The expectation x̄ = ∫ x f(x) dx = ∫ x g(x) dx is not affected by the contribution of the δ-function. This is so because of the general property ∫ h(x) δ(x − 0) dx = h(0). With a correct normalisation, the sample average is also not biased by the zero-contribution histories, i.e. by the presence of c δ(x − 0) in the distribution (4).
This is unfortunately not the case with the variance:

σ_f² = ∫ (x − x̄)² f(x) dx = c ∫ (x − x̄)² δ(x − 0) dx + ∫ (x − x̄)² g(x) dx = σ_δ² + σ_g²,    (5)

where σ_δ² = c x̄² is attributable to the inefficiency of the MC procedure, i.e. to the fact that not all particles (histories) have a contribution to the estimated result, and σ_g² is the variance inherent to the distribution of the estimators with a non-zero contribution to this result.
With the above illustrated physically realistic approach to modelling the particle histories, usually referred to as "analogue MC", the relative share c of non-productive histories can be very high and may lead to an unacceptable increase of the variance of the estimated quantity. This is so because the variance of the sample average (through which the useful result is evaluated) is σ_x̄² = σ_f²/N.
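The decomposition (5) can be checked numerically. In the sketch below the non-zero contributions are taken exponential with unit mean (an arbitrary illustrative choice), for which x̄ = 1 − c and, combining the two terms of (5), σ_f² = (1 − c)(1 + c) exactly:

```python
import random

def moments(n, c, rng):
    """Sample mean and variance of contributions drawn from
    f(x) = c*delta(x) + g(x), g = (1 - c) * Exp(1) density."""
    total = total_sq = 0.0
    for _ in range(n):
        x = 0.0 if rng.random() < c else rng.expovariate(1.0)
        total += x
        total_sq += x * x
    mean = total / n
    return mean, total_sq / n - mean * mean

rng = random.Random(11)
c = 0.6
mean, var = moments(400_000, c, rng)
print(mean, var)   # close to 1 - c = 0.4 and (1 - c)(1 + c) = 0.64
```

The empty histories leave the mean unbiased but visibly inflate the variance, exactly as (5) states.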
Variance reduction methods
Different methods, formally expressed in some departure from physical reality in the course of modelling particle histories, are employed for reducing the variance while observing the obligatory requirement to preserve an unbiased estimate of the sought result. The common designation of any procedures which involve such variance reduction methods is "non-analogue MC".
Similarly to the conclusions made in relation to function integration, a fruitful approach to the present task would be to sample the initial phase coordinates of the particles predominantly from those regions where the contribution of particle histories to the estimated quantity is larger. That is, to sample the initial phase coordinates from a probability density function which is proportional to the contribution of particles with such initial coordinates to the estimated quantity. The desired distribution can be constructed as described below.
The directional flux is a solution of the following equation (in the particular case of a stationary system with a fixed source):

Lφ + Q = 0,    (6)

The boundary condition for (6) is

φ(r_s, Ω, E) = 0,  Ω·ds < 0,    (6a)

at all positions on the convex outer surface of the problem.
In (6) Q = Q(r, Ω, E), and the operator L is such that:

Lφ = −Ω·∇φ(r, Ω, E) − Σ_t(r, E) φ(r, Ω, E) + ∫∫ Σ_n(r, Ω′→Ω, E′→E) φ(r, Ω′, E′) dΩ′ dE′,    (7)

where the cross-section Σ_n characterises jointly the scattering and the fission sources (the latter in the case of neutron transport in a multiplying medium).
This operator has its adjoint counterpart L⁺, defined as:

L⁺φ⁺ = +Ω·∇φ⁺(r, Ω, E) − Σ_t(r, E) φ⁺(r, Ω, E) + ∫∫ Σ_n(r, Ω→Ω′, E→E′) φ⁺(r, Ω′, E′) dΩ′ dE′    (8)

Let the virtual detector in the MC simulation be described through its response cross-section Σ_d(r, Ω, E). (It is clear that outside the detector this cross-section is zero.)
Let the so-called adjoint flux φ⁺ be a solution of the equation

L⁺φ⁺ + Σ_d = 0.    (9)
The boundary condition for (9) is

φ⁺(r_s, Ω, E) = 0,  Ω·ds > 0,    (9a)

at all positions on the convex outer surface of the problem.
The adjoint operator L⁺ has the following property:

∫∫∫ d³r ∫ dΩ ∫ dE φ⁺ (Lφ) = ∫∫∫ d³r ∫ dΩ ∫ dE φ (L⁺φ⁺)    (10)

This property can be proved as follows.
a) For the second terms in (7) and (8):

∫∫∫ d³r ∫ dΩ ∫ dE φ⁺ [−Σ_t(r, E) φ(r, Ω, E)] = ∫∫∫ d³r ∫ dΩ ∫ dE φ [−Σ_t(r, E) φ⁺(r, Ω, E)]    (10a)
b) For the third terms in (7) and (8):

∫∫∫ d³r ∫ dΩ ∫ dE φ⁺(r, Ω, E) ∫∫ Σ_n(r, Ω′→Ω, E′→E) φ(r, Ω′, E′) dΩ′ dE′
= ∫∫∫ d³r ∫ dΩ′ ∫ dE′ φ(r, Ω′, E′) ∫∫ Σ_n(r, Ω′→Ω, E′→E) φ⁺(r, Ω, E) dΩ dE
= ∫∫∫ d³r ∫ dΩ ∫ dE φ(r, Ω, E) ∫∫ Σ_n(r, Ω→Ω′, E→E′) φ⁺(r, Ω′, E′) dΩ′ dE′,    (10b)

where the order of integration has been exchanged and the primed and unprimed integration variables renamed.
c) For the first terms in (7) and (8):
Let

Δ ≡ ∫∫∫ d³r ∫ dΩ ∫ dE [φ Ω·∇φ⁺(r, Ω, E) + φ⁺ Ω·∇φ(r, Ω, E)]    (10c)

be the difference between the first term in (8) and the first term in (7).
The integrand in (10c) can be represented as:

φ Ω·∇φ⁺ + φ⁺ Ω·∇φ
= φ (Ω_x ∂φ⁺/∂x + Ω_y ∂φ⁺/∂y + Ω_z ∂φ⁺/∂z) + φ⁺ (Ω_x ∂φ/∂x + Ω_y ∂φ/∂y + Ω_z ∂φ/∂z)
= Ω_x ∂(φφ⁺)/∂x + Ω_y ∂(φφ⁺)/∂y + Ω_z ∂(φφ⁺)/∂z
= ∇·(Ω φφ⁺)    (10d)
After applying the divergence theorem (Gauss's theorem, or Ostrogradsky's theorem):

Δ = ∫ dE ∫ dΩ ∫∫∫ ∇·(Ω φφ⁺) d³r = ∫ dE ∫ dΩ ∮ (Ω·ds) φφ⁺(r_s, Ω, E)    (10e)

From the last results and from the boundary conditions (6a) for φ and (9a) for φ⁺ it follows that Δ = 0.
After multiplying the forward transport equation by φ⁺, the adjoint one by φ, and after integrating over all phase variables, the following equality is obtained:

R ≡ ∫∫∫ d³r ∫ dΩ ∫ dE [Σ_d(r, Ω, E) φ(r, Ω, E)] = ∫∫∫ d³r ∫ dΩ ∫ dE [φ⁺(r, Ω, E) Q(r, Ω, E)],    (11)

where R is the detector response.
The choice of a source Q is in principle arbitrary. Let, in particular:

Q(r, Ω, E) = δ(r − r₀) δ(Ω − Ω₀) δ(E − E₀)    (12)

Then:

φ⁺(r₀, Ω₀, E₀) = R(r₀, Ω₀, E₀),    (13)

where R(r₀, Ω₀, E₀) = ∫∫∫ d³r ∫ dΩ ∫ dE [Σ_d(r, Ω, E) φ(r, Ω, E)] is the detector response caused by the flux arising from the unit source (12).
Therefore the adjoint flux, which is a solution of (9), has the meaning of a distribution of the importance of the source particles with respect to the invoked detector response.
Of course, the task of solving (9) and convolving the source with the obtained solution is tantamount to the effort of achieving an exact solution of the transport problem, which renders the MC simulation exercise worthless.
The practical usefulness of the described approach consists in employing an approximate estimate of the importance function, evaluated through solving a simplified form of equation (9), e.g. after reducing the dimensionality, homogenisation, simplifying the energy dependence, etc. If such an approximate estimate of the importance function is supplied, then the phase coordinates of a tracked particle (both the initial ones and those after a collision or fission event, the latter in a multiplying medium) can be sampled from the distribution φ⁺(r, Ω, E) (obtained from the original importance function after proper normalisation), at which the contribution of this particle to the useful statistic (i.e. the detector response) must be multiplied by the weight factor w_{n,i} = 1/φ⁺_{n,i}, referring to the n-th history after the i-th event in
the process of its tracking. The effective result is a replacement of the distribution f(x) in (4) by a distribution f̃(x) for which σ_f̃² < σ_f².
Often the described systematic approach, known as importance sampling, is substituted or augmented by heuristic techniques with the same general property, namely:
1) Instead of sampling from the "natural" distribution f, a sample from the modified distribution f̃ is drawn.
2) A weight function w(x) = f(x)/f̃(x) is introduced, through which an unbiased estimate of the sought mean value is obtained:

x̄ = ∫ x w(x) f̃(x) dx.

Thus, instead of accumulating the statistic x̄ (3) from a sample from the distribution f(x), a statistic

x̄_w = (1/N) Σ_n w(x_n) x_n

is formed from a sample w₁x₁, ..., w_n x_n, ... from the distribution f̃(x). Both statistics have the mathematical expectation x̄.
Of course, the purpose of the described substitution is that the modified estimator w(x)x will have a smaller variance than the original estimator x.
Some of the standard variance reduction techniques will be discussed below. Here it should only be mentioned that, in the context of (4) and (5), the purpose of non-analogue MC consists in increasing the share of particles (histories) with a non-zero contribution to the desired statistic and in reducing the variance of the distribution of these contributions, while the accumulated statistic remains unbiased with respect to the analogue case.
Implicit absorption
With analogue MC, after choosing the collision location, a random number ξ ∈ U(0,1) is generated and is compared to the ratio Σ_a/Σ_t. If ξ < Σ_a/Σ_t, then absorption is assumed and the particle history is terminated. In the opposite case it is assumed that the collision event has resulted in scattering. A new energy and travel direction are sampled for the scattered particle and the tracking of its history is continued. According to the currently considered approach, however, the particle history is not terminated because of absorption. Instead, with non-analogue MC the weight associated with the particle after the i-th collision is reduced so as to correspond to the probability of escaping absorption:

w_{n,i+1} = w_{n,i} × (1 − Σ_a/Σ_t)

This procedure raises the computational expense, since the total number of collisions increases, but it can achieve a significant reduction of the variance by decreasing the share of histories with a zero contribution to the estimated quantity. The total efficiency of the method can increase, provided that a suitable criterion for terminating a history is introduced.
Russian roulette for the termination of histories
It is clear that with implicit absorption and a large system size (i.e. relatively low leakage) the efficiency of accumulating the useful statistics will be low, because after a sufficient number of collisions the particle weight will drop to an insignificant level. One of the possible solutions to this problem is to terminate the history when the weight passes a certain low threshold. This, however, will result in underestimating the statistics and may bias the sought quantities. (In shielding problems, for example, a physically justified measure of the importance of a history is the energy of the slowing-down particle. Insofar as only particles with an energy above a certain threshold will have a contribution to the evaluated quantity (e.g. the dose rate), the history can be terminated after a scattering event to an energy below this threshold without biasing the dose rate estimate.)
A universal and reliable method of terminating a particle history, which does not bias the estimates of the sought quantities, is the so-called Russian roulette. Following this approach, first the particle weight is checked for dropping below a given threshold. If so, a random number ξ ∈ U(0,1) is generated and compared with 1/Ξ, where Ξ is a constant, typically between 2 and 10. If ξ > 1/Ξ, then the history is terminated. In the opposite case its tracking is continued, but for preserving an unbiased estimate the particle weight is multiplied by Ξ. This method increases the variance but decreases the time for tracking a history. With an appropriate choice of Ξ an overall increase of the efficiency of the MC procedure can be achieved.
Russian roulette for particle splitting
This method is applied to particles with a weight which has grown above a certain threshold. An obvious cause for a weight increase could be the use of an importance function φ⁺ for the formation of weights.
The implementation is analogous to the previous case, except that here Ξ functions as a multiplication constant. If for a given particle such multiplication occurs, each of the Ξ daughter particles is assigned a weight equal to 1/Ξ of the weight of the cloned parent particle.
Additional comments on the Russian roulette technique
The Russian roulette for terminating particles with low weights increases the share of the delta function at x = 0 in the distribution f(w(x)x) (4) (because of the terminated histories), as well as the values of the continuous distribution g(w(x)x) at large w(x)x, far above its mathematical expectation (due to the relative increase of the weights of the non-terminated histories). Thus the overall variance of f(w(x)x) increases. Conversely, the Russian roulette for splitting of particles with high weights increases the values of g(w(x)x) at smaller w(x)x, nearer its mathematical expectation. Thus the overall variance of w(x)x decreases, however at the expense of an overhead computational effort for tracking the additional histories.
The importance sampling approach is often implemented in conjunction with Russian roulette through the introduction of so-called weight windows. A weight window is typically centred at the importance value φ⁺(r, Ω, E) (in this context referred to as a target weight) and is assigned some chosen width. The current particle weight is compared with the weight window boundaries. If this current weight is below the lower weight window boundary, a Russian roulette procedure for terminating the history is invoked. If it is above the upper boundary, a Russian roulette for particle splitting is applied. This replaces the need for sampling from modified (biased) probability distributions for the new particle's parameters after a collision event, with the task of sampling from a biased source distribution still retained.
Path stretching
This technique is used as an alternative to particle splitting for deep-penetration problems and consists in increasing the free path between successive collisions for particles travelling in a direction which is in some way preferred (i.e. important for the accumulation of the useful statistics). Let, for example, this be the positive direction of the Ox axis. Then the total cross-section Σ_t is replaced with:

Σ_ex = Σ_t (1 − p Ω·e_x),

where e_x is the unit vector of Ox, and p, 0 ≤ p < 1, is some chosen path stretching parameter.
It is clear that this non-physical free path elongation must be compensated through multiplying the particle weight by a suitable factor w_ex in order to conserve the actual reaction rates. Since the collision probability over a distance between u and u + du along the particle's travel trajectory is Σ_t exp(−Σ_t u) du, for preserving the actual value of this probability it is necessary that:

Σ_t exp(−Σ_t u) du = w_ex(u) Σ_ex exp(−Σ_ex u) du

Therefore:

w_ex(u) = [Σ_t exp(−Σ_t u)] / [Σ_ex exp(−Σ_ex u)]
        = [Σ_t exp(−Σ_t u)] / [Σ_t (1 − p Ω·e_x) exp(−Σ_t (1 − p Ω·e_x) u)]
        = exp(−p Σ_t (Ω·e_x) u) / (1 − p Ω·e_x)
Forced collisions
This correction is especially useful if e.g. a reaction rate in some small volume has to be estimated. It is clear that the accumulated statistic will improve if the number of collisions in this volume can be artificially increased, however without biasing the sought estimate. The forced collisions technique consists in splitting a particle into two particles with lower weights, then letting the first one traverse the volume while enforcing a collision for the other one. Let the weight of the initial particle be w and its forthcoming path through the volume be u. The collision escape probability for this initial particle will be exp(−Σ_t u). When this particle is split in two, the one which is let to traverse the volume is assigned a weight w_e = w exp(−Σ_t u), whereas the one for which a collision is enforced gets a weight w_c = w (1 − exp(−Σ_t u)). The history of the particle with weight w_e is continued in the standard way, i.e. as it would be for the initial particle, while for the particle with weight w_c a collision location within the volume is sought as follows. The probability distribution of the free path 0 ≤ x ≤ u before a collision in the volume is:

f(x) = Σ_t exp(−Σ_t x) / ∫₀^u Σ_t exp(−Σ_t x′) dx′ = Σ_t exp(−Σ_t x) / (1 − exp(−Σ_t u)).

According to the transformation method, the random number with the desired distribution will be:

x = −(1/Σ_t) ln[1 − ξ (1 − exp(−Σ_t u))],

where ξ ∈ U(0,1). The history of this particle after the forced collision is followed in the standard way.
Although this method may cause the emergence of particles with low weights, no Russian roulette for terminating the history within the volume is applied.
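Sampling the forced-collision site with the inverse-CDF formula above can be checked against the analytic mean of the truncated exponential distribution (a sketch with arbitrary parameter values):

```python
import math
import random

def forced_collision_site(u, sigma_t, rng):
    """Free path of the forced-collision particle, 0 <= x <= u."""
    xi = rng.random()
    return -math.log(1.0 - xi * (1.0 - math.exp(-sigma_t * u))) / sigma_t

sigma_t, u, n = 1.0, 1.0, 200_000
rng = random.Random(4)
xs = [forced_collision_site(u, sigma_t, rng) for _ in range(n)]
assert all(0.0 <= x <= u for x in xs)   # every site lies inside the volume
# analytic mean of the exponential distribution truncated to [0, u]
exact = 1.0 / sigma_t - u * math.exp(-sigma_t * u) / (1.0 - math.exp(-sigma_t * u))
print(sum(xs) / n, exact)
```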
Application: integral form of the neutron transport equation
The purpose of this example is to show that the general Monte Carlo approach of sampling from chosen probability distributions can be regarded as a method of solving integral, and hence differential, equations.
The neutron transport equation has the form:

(1/v) ∂φ(r, Ω, E, t)/∂t + Ω·∇φ(r, Ω, E, t) + Σ_t(r, E, t) φ(r, Ω, E, t)
= S(r, Ω, E, t) + ∫_{E′} ∫_{Ω′} Σ_s(r, Ω′→Ω, E′→E, t) φ(r, Ω′, E′, t) dΩ′ dE′    (14)

The total source, which combines the external, scattering and fission sources, is:

q(r, Ω, E, t) = S(r, Ω, E, t) + ∫_{E′} ∫_{Ω′} Σ_s(r, Ω′→Ω, E′→E, t) φ(r, Ω′, E′, t) dΩ′ dE′    (15)
With the assumption of an isotropic transport medium, neglecting the time dependence of the cross-sections and introducing energy groups, the neutron transport equation becomes:

(1/v_g) ∂φ_g(r, Ω, t)/∂t + Ω·∇φ_g(r, Ω, t) + Σ_{t,g}(r, t) φ_g(r, Ω, t) = q_g(r, Ω, t),    (16)
where

Σ_{t,g}(r, t) = [∫_{ΔE_g} Σ_t(r, E) φ(r, Ω, E, t) dE] / [∫_{ΔE_g} φ(r, Ω, E, t) dE]    (17)
The representation

r = r′ + uΩ,    (18)

is introduced, where r is an arbitrary but fixed observation point, r′ is some chosen starting point along the linear trajectory of free flight of neutrons with a travel direction Ω (e.g. the point of the last collision, after which a neutron emerges with energy in group g and travel direction Ω), and u is the free path length from the starting point r′ to the observation point r.
Figure 3. A geometric representation for constructing the integral form of the neutron transport equation.
Then, after leaving out the time dependence and accounting that Ω·∇φ_g(r, Ω) is a derivative along Ω, (16) can be rewritten as follows (here and below the group index g will be omitted for simplicity):

dφ(r′ + uΩ, Ω)/du + Σ_t(r′ + uΩ) φ(r′ + uΩ, Ω) = q(r′ + uΩ, Ω),    (19)
Further it will be more convenient to represent the spatial dependence in terms of the observation point r. Formally, based on (19), this reduces to a change of the independent variable, R ≡ −u, from which it follows that d/du = −d/dR, and to substituting the notation r′ by r. These changes are equivalent to following the particle trajectories from the point of observation back to the point of emergence:

−dφ(r − RΩ, Ω)/dR + Σ_t(r − RΩ) φ(r − RΩ, Ω) = q(r − RΩ, Ω),    (20)

Or, if the arbitrary but fixed parameters r and Ω are omitted:

−dφ(R)/dR + Σ_t(R) φ(R) = q(R)    (21)
Such an ordinary differential equation can be converted to an integral form by applying the integrating factor:

μ(R) = exp[−T(R; r, Ω)],    (22)

where:

T(R; r, Ω) = ∫₀^R Σ_t(r − R′Ω) dR′    (23)

Namely:

−d[μ(R)φ(R)]/dR = μ(R) [−dφ(R)/dR + Σ_t(r − RΩ) φ(R)] = μ(R) q(R)    (24)

(The derivative of the integrating factor is obtained according to the differentiation rule for a definite integral:

d/dx ∫_{a(x)}^{b(x)} F(x, x′) dx′ = F(x, b(x)) db/dx − F(x, a(x)) da/dx + ∫_{a(x)}^{b(x)} ∂F(x, x′)/∂x dx′.)
The integration of (24) in limits from 0 to R leads to:

−∫₀^R {d[μ(R′)φ(R′)]/dR′} dR′ = ∫₀^R μ(R′) q(R′) dR′

μ(0)φ(0) − μ(R)φ(R) = ∫₀^R μ(R′) q(R′) dR′

φ(0) = μ(R)φ(R) + ∫₀^R μ(R′) q(R′) dR′, i.e.

φ(r, Ω) = φ(r − RΩ, Ω) exp[−∫₀^R Σ_t(r − R′Ω) dR′] + ∫₀^R q(r − R′Ω, Ω) exp[−∫₀^{R′} Σ_t(r − R″Ω) dR″] dR′    (25)

(Here the fact that μ(0) = 1 is used.)
If on physical grounds it is assumed that for all r and Ω

lim_{R→∞} φ(r − RΩ, Ω) exp[−∫₀^R Σ_t(r − R′Ω) dR′] = 0,

then the final form of the integral formulation of the neutron transport equation is:

φ(r, Ω) = ∫₀^∞ q(r − RΩ, Ω) exp[−T(R; r, Ω)] dR    (26)
The neglected time dependence supposes a conditional-criticality formulation of the fission source. Also, in correspondence with the most frequently solved conditional criticality problems, it is assumed that there is no external source. Under these conditions and with restoring of the group index g, the neutron source has the form:

q_g(r, Ω) = Σ_{g′} ∫_{4π} Σ_{g′→g}(r, Ω′·Ω) φ_{g′}(r, Ω′) dΩ′ + (χ_g/4π) (1/k) Σ_{g′} νΣ_{f,g′}(r) ∫_{4π} φ_{g′}(r, Ω′) dΩ′
Thus the integral equation of neutron transport (26) takes the form:

φ_g(r, Ω) = ∫₀^∞ dR exp[−T_g(R; r, Ω)] { (χ_g/4π) (1/k) Σ_{g′} νΣ_{f,g′}(r − RΩ) ∫_{4π} φ_{g′}(r − RΩ, Ω′) dΩ′ + Σ_{g′} ∫_{4π} Σ_{g′→g}(r − RΩ, Ω′·Ω) φ_{g′}(r − RΩ, Ω′) dΩ′ }    (27)
A simplified and generalised formulation of such an integral equation (a homogeneous Fredholm equation of the second kind) is:

f(t) = (1/λ) ∫_a^b K(t, s) f(s) ds + ∫_a^b L(t, s) f(s) ds    (28)
The numerical approach to solving such an equation is based on approximating the integrals with quadrature formulae, which are always of the form ∫_a^b φ(x) dx ≅ Σ_{j=1}^N w_j φ_j, where φ_j ≡ φ(x_j). Or:

f(t) = (1/λ) Σ_{j=1}^N w_j K(t, s_j) f(s_j) + Σ_{j=1}^N w_j L(t, s_j) f(s_j)    (29)

If the right-hand side is evaluated at the quadrature nodes, then for the function values f_i ≡ f(s_i), i = 1, ..., N, the following linear system is obtained:

f_i = (1/λ) Σ_{j=1}^N K_ij f_j + Σ_{j=1}^N L_ij f_j,  i = 1, ..., N,  where K_ij ≡ w_j K(s_i, s_j) and L_ij ≡ w_j L(s_i, s_j)    (30)

Or, in matrix-vector notation:

f = (1/λ) K f + L f  ⇒  A f = λ f,  where A = (1 − L)⁻¹ K    (31)

This is a standard eigenproblem, and well-developed methods exist for its solution. In order to evaluate the integral equation solution f(t) at arguments different from the quadrature nodes, expression (29) is directly applied.
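The discretisation (29)-(31) combined with power iteration for the maximum eigenvalue can be sketched compactly; the test kernel below is a hypothetical choice (separable, with L = 0), for which the only non-zero eigenvalue is known in closed form, ∫₀^{π/2} cos²s ds = π/4:

```python
import numpy as np

def fredholm_max_eig(K, L, a, b, n):
    """Discretise f = (1/lambda) K f + L f on [a, b] with the
    trapezoidal rule and find the maximum eigenvalue of
    A = (I - L)^(-1) K by power iteration, as in (29)-(31)."""
    s, h = np.linspace(a, b, n, retstep=True)
    w = np.full(n, h)
    w[0] = w[-1] = h / 2                              # trapezoidal weights
    Kmat = w[None, :] * K(s[:, None], s[None, :])     # K_ij = w_j K(s_i, s_j)
    Lmat = w[None, :] * L(s[:, None], s[None, :])     # L_ij = w_j L(s_i, s_j)
    A = np.linalg.solve(np.eye(n) - Lmat, Kmat)
    f = np.ones(n)
    lam = 0.0
    for _ in range(200):                              # power iteration
        g = A @ f
        lam = np.linalg.norm(g)
        f = g / lam
    return lam, f

# hypothetical test kernel: L = 0 and K(t, s) = cos(t) cos(s) on [0, pi/2]
lam, f = fredholm_max_eig(lambda t, s: np.cos(t) * np.cos(s),
                          lambda t, s: 0.0 * (t * s),
                          0.0, np.pi / 2, 201)
print(lam)   # close to pi/4
```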
The representation (28) can be viewed as an invitation to apply an accurate and economical method of numerical integration. In this context an especially good choice would be the above described Monte Carlo approach, according to which the quadrature nodes (the argument values of the integrand) are chosen predominantly in the regions where the contribution of the integrand to the evaluated integral is higher.
Applied to the considered physical problem, this would require sampling the phase space with a density distribution which is proportional to the probability of populating with neutrons the point of observation (i.e. energy group g, travel direction Ω and position r). Insofar as the observation points also span the entire phase space (or a large portion thereof), the above condition is equivalent to the requirement to follow the histories of individual particles through sampling of new phase coordinates after each collision in correspondence with the natural probability distribution of these coordinates. This distribution is expressed through the respective macroscopic cross-sections.
An alternative formulation of this conclusion is that if the purpose is to obtain a solution of the transport equation everywhere in the phase space, an analogue Monte Carlo simulation is required. As will be demonstrated below, this is the case when searching for the effective multiplication factor – except for the possible introduction of implicit absorption and of an importance function proportional to the probability of causing fission. Of course, if a fixed-source problem is solved for the reaction rates (the flux) in restricted regions of the phase space (virtual detectors), then the correct choice is non-analogue MC with suitable variance reduction techniques.
The analysis of the conditionally critical formulation of the homogeneous neutron transport equation for a multiplying medium shows that the effective multiplication factor k is the maximum eigenvalue of the integral equation (27). With the conventional introduction of "neutron generations", which corresponds to the power method of finding the maximum eigenvalue, the effective multiplication factor can be regarded as a ratio of the number of neutrons in the (n+1)-th generation to the number of neutrons in the n-th generation. This way of estimating k in practice reduces to iterations on the fission neutron source.
Thus, after introducing an indexing of the generations (i.e. of the iterations), multiplying and dividing some terms by $\Sigma_{tg'}$ and multiplying both sides of equation (27) by $\Sigma_{tg}(\mathbf{r})$, it can be written in the form:

$$ \psi_g^{(n)}(\mathbf{r},\mathbf{\Omega}) = \int_0^{\infty} dR\, \exp\left[-T(\mathbf{r},R;\mathbf{\Omega})\right] \Sigma_{tg}(\mathbf{r}) \left\{ \frac{\chi_g}{4\pi}\,\frac{1}{k} \sum_{g'} \int_{4\pi} d\mathbf{\Omega}'\, \frac{\nu\Sigma_{g'}(\mathbf{r}-R\mathbf{\Omega})}{\Sigma_{tg'}(\mathbf{r}-R\mathbf{\Omega})}\, \psi_{g'}^{(n-1)}(\mathbf{r}-R\mathbf{\Omega},\mathbf{\Omega}') + \sum_{g'} \int_{4\pi} d\mathbf{\Omega}'\, \frac{\Sigma_{g'\to g}(\mathbf{r}-R\mathbf{\Omega},\mathbf{\Omega}'\cdot\mathbf{\Omega})}{\Sigma_{tg'}(\mathbf{r}-R\mathbf{\Omega})}\, \psi_{g'}^{(n)}(\mathbf{r}-R\mathbf{\Omega},\mathbf{\Omega}') \right\}, \qquad (32) $$

where $\psi_g^{(n)}(\mathbf{r},\mathbf{\Omega}) \equiv \Sigma_{tg}(\mathbf{r})\, \varphi_g^{(n)}(\mathbf{r},\mathbf{\Omega})$ is the so-called collision density.
The expression on the left is the rate of emergence of fission neutrons in volume d³r around point r, owing to neutrons from generation (n) with energy in group g and travel direction Ω, which have reached this volume without intermediate collisions and have emerged at any point r′ along the trajectory of free flight because of:
a) fission caused by neutrons from the (n-1)-th generation with incident energy in
any group g’ and any travel direction Ω’, or
b) scattering of neutrons from the n-th generation with incident energy in any
group g’ and any travel direction Ω’.
For the sake of compactness it is convenient to introduce the following operator notation:

– integral operator of particle transport:

$$ \mathbf{L}_g(\mathbf{r}' \to \mathbf{r}) \equiv \int_0^{\infty} dR\, \Sigma_{tg}(\mathbf{r})\, P(\mathbf{r}' \to \mathbf{r}) \qquad (33) $$

– integral operator of particle collisions:

$$ \mathbf{C}_{g' \to g}(\mathbf{r}, \mathbf{\Omega}' \to \mathbf{\Omega}) \equiv \sum_{g'=1}^{G} \int_{4\pi} d\mathbf{\Omega}'\, \frac{\Sigma_{sg'}(\mathbf{r})}{\Sigma_{tg'}(\mathbf{r})}\, \frac{\Sigma_{g' \to g}(\mathbf{r}, \mathbf{\Omega}' \cdot \mathbf{\Omega})}{\Sigma_{sg'}(\mathbf{r})}, \qquad (34) $$
where:

− $P(\mathbf{r}' \to \mathbf{r}) \equiv \exp\left[-T(\mathbf{r},R;\mathbf{\Omega})\right] dR$ is the probability of reaching the volume d³r around point r through free flight in direction Ω from the point of emergence r′, and $R = |\mathbf{r} - \mathbf{r}'|$ is the travelled path;

− $\dfrac{\Sigma_{g' \to g}(\mathbf{r}, \mathbf{\Omega}' \cdot \mathbf{\Omega})}{\Sigma_{sg'}(\mathbf{r})}$ is a normalised probability distribution for sampling a new energy group and a new travel direction of the scattered neutron, and $\dfrac{\Sigma_{sg'}(\mathbf{r})}{\Sigma_{tg'}(\mathbf{r})}$ is the probability to escape absorption.
With these, analogously to (32), the following can be written:

$$ \psi_g^{(n)}(\mathbf{r},\mathbf{\Omega}) = \mathbf{L}_g(\mathbf{r}' \to \mathbf{r}) \left[ S_g^{f,n-1}(\mathbf{r}',\mathbf{\Omega}) + \mathbf{C}_{g' \to g}(\mathbf{r}', \mathbf{\Omega}' \to \mathbf{\Omega})\, \psi_{g'}^{(n)}(\mathbf{r}',\mathbf{\Omega}') \right], \qquad (35) $$

where

$$ S_g^{f,n-1}(\mathbf{r},\mathbf{\Omega}) \equiv \frac{\chi_g}{4\pi}\, \frac{1}{k^{(n-1)}} \sum_{g'} \int_{4\pi} d\mathbf{\Omega}'\, \frac{\nu\Sigma_{g'}(\mathbf{r})}{\Sigma_{tg'}(\mathbf{r})}\, \psi_{g'}^{(n-1)}(\mathbf{r},\mathbf{\Omega}') $$

is the source of neutrons in group g at point r, normalised to the multiplication factor of the (n−1)-th generation,
due to fission caused by neutrons from the previous (n−1)-th generation, and $\dfrac{\nu\Sigma_{g}(\mathbf{r})}{\Sigma_{tg}(\mathbf{r})}$ is the average number of fission neutrons born after a collision of a neutron with energy in group g.
The term $S_g^{c,n}(\mathbf{r},\mathbf{\Omega}) = \mathbf{L}_g(\mathbf{r}' \to \mathbf{r})\, S_g^{f,n-1}(\mathbf{r}',\mathbf{\Omega})$ is the contribution to $\psi_g^{(n)}(\mathbf{r},\mathbf{\Omega})$ from the first collision of fission neutrons.
After isolating this term, (35) is written in the form:

$$ \psi_g^{(n)}(\mathbf{r},\mathbf{\Omega}) = S_g^{c,n}(\mathbf{r},\mathbf{\Omega}) + \mathbf{L}_g(\mathbf{r}' \to \mathbf{r})\, \mathbf{C}_{g' \to g}(\mathbf{r}', \mathbf{\Omega}' \to \mathbf{\Omega})\, \psi_{g'}^{(n)}(\mathbf{r}',\mathbf{\Omega}') \qquad (36) $$
Expression (36) is a practical basis for implementing the following MC procedure:

− The modelling of an individual history starts with the introduction of a particle with phase coordinates sampled from the source distribution $S_g^{f,n-1}(\mathbf{r},\mathbf{\Omega})$.

− The location of its first collision is determined through sampling the kernel of the transport operator L.

− The particle weight is multiplied by the absorption escape probability, and the kernel of the collision operator C is used for sampling its new energy and direction of travel.

− Further, the kernels of the transport operator L and of the collision operator C are successively sampled to determine the locations of the second, third, etc. collision and the weight, energy and travel direction of the particle after the respective collision.

− The history is followed until the particle weight drops below a certain threshold or until the particle leaves the system.
− The integration required by the operators (33) and (34) corresponds to the following summation (the generation index (n) is omitted):

$$ \psi_g(\mathbf{r},\mathbf{\Omega}) = \sum_{j=0}^{\infty} \psi_{g,j}(\mathbf{r},\mathbf{\Omega}), \qquad (37) $$

where $\psi_{g,j}(\mathbf{r},\mathbf{\Omega})$ is the contribution of particles which after j collisions have emerged with an energy in group g and a travel direction Ω, and undergo their next collision at point r.
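A minimal sketch of such a weighted history in Python, assuming a homogeneous one-group slab with illustrative cross-sections; the multigroup kernels L and C of the text reduce here to an exponential free-path distribution and an isotropic scattering law:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical one-group data for a homogeneous slab 0 <= x <= H; the values
# are illustrative assumptions, not taken from the text.
Sig_t, Sig_s, H = 1.0, 0.6, 10.0
p_scatter = Sig_s / Sig_t            # absorption-escape probability
W_MIN = 1e-3                         # weight threshold terminating the history

def history(x0, mu0):
    """Follow one weighted history; return (position, weight) per collision."""
    x, mu, w = x0, mu0, 1.0
    collisions = []
    while w > W_MIN:
        R = -np.log(rng.random()) / Sig_t   # free path from Sig_t * exp(-Sig_t R)
        x = x + R * mu
        if not (0.0 <= x <= H):             # leakage terminates the history
            break
        collisions.append((x, w))           # score the collision density, cf. (37)
        w *= p_scatter                      # weight times absorption-escape probability
        mu = 2.0 * rng.random() - 1.0       # new direction cosine (isotropic scattering)
    return collisions

hist = history(x0=H / 2, mu0=1.0)
```

The j-th entry of a history carries weight $p_s^{\,j}$, i.e. the successive terms of the collision expansion (37) are accumulated with geometrically decreasing weights.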
Namely:

$$ \psi_{g,0}(\mathbf{r},\mathbf{\Omega}) = S_g^{c,n}(\mathbf{r},\mathbf{\Omega}) = \mathbf{L}_g(\mathbf{r}' \to \mathbf{r})\, S_g^{f,n-1}(\mathbf{r}',\mathbf{\Omega}); $$

…

$$ \psi_{g,j}(\mathbf{r},\mathbf{\Omega}) = \mathbf{L}_g(\mathbf{r}' \to \mathbf{r})\, \mathbf{C}_{g' \to g}(\mathbf{r}', \mathbf{\Omega}' \to \mathbf{\Omega})\, \psi_{g',j-1}(\mathbf{r}',\mathbf{\Omega}') $$
The specific steps for solving the full problem (which includes finding the effective
multiplication factor) are as follows:
− The initial phase coordinates $(g_0, \mathbf{r}_0, \mathbf{\Omega}_0)$ are sampled from the distribution $S_g^{f,n-1}(\mathbf{r},\mathbf{\Omega})$.
− For finding the point of first collision $\mathbf{r}_1 = \mathbf{r}_0 + R\mathbf{\Omega}_0$, a free path R is sampled from the distribution $\Sigma_{tg_0}(\mathbf{r}_0 + R\mathbf{\Omega}_0) \exp\left[ -\int_0^R \Sigma_{tg_0}(\mathbf{r}_0 + R'\mathbf{\Omega}_0)\, dR' \right]$. The probability of scattering (escaping absorption) there is $\dfrac{\Sigma_{sg_0}(\mathbf{r}_1)}{\Sigma_{tg_0}(\mathbf{r}_1)}$. Scattering is enforced, and the particle weight after this event is multiplied by the scattering probability.
− The new energy group $g_1$ is sampled from the distribution $\dfrac{\Sigma_{g_0 \to g_1}(\mathbf{r}_1)}{\Sigma_{sg_0}(\mathbf{r}_1)}$, where $\Sigma_{g_0 \to g_1}(\mathbf{r}_1) \equiv \int_{4\pi} d\mathbf{\Omega}\, \Sigma_{g_0 \to g_1}(\mathbf{r}_1, \mathbf{\Omega}_0 \cdot \mathbf{\Omega})$, after which the new travel direction $\mathbf{\Omega}_1$ is sampled from the distribution $\dfrac{\Sigma_{g_0 \to g_1}(\mathbf{r}_1, \mathbf{\Omega}_0 \cdot \mathbf{\Omega}_1)}{\Sigma_{g_0 \to g_1}(\mathbf{r}_1)}$. The subsequent collisions are modelled in an analogous way until the particle history is terminated.
− The accumulated statistic (37) is used to form a new estimate of the effective multiplication factor:

$$ k^{(n)} = \frac{\displaystyle \sum_g \int d^3r \int_{4\pi} d\mathbf{\Omega}\; S_g^{f,n}(\mathbf{r},\mathbf{\Omega})}{\displaystyle \sum_g \int d^3r \int_{4\pi} d\mathbf{\Omega}\; S_g^{f,n-1}(\mathbf{r},\mathbf{\Omega})}, \qquad (38) $$
where the new fission source is:
$$ S_g^{f,n}(\mathbf{r},\mathbf{\Omega}) = \frac{\chi_g}{4\pi} \sum_{g'} \int_{4\pi} d\mathbf{\Omega}'\, \frac{\nu\Sigma_{g'}(\mathbf{r})}{\Sigma_{tg'}(\mathbf{r})}\, \psi_{g'}^{(n)}(\mathbf{r},\mathbf{\Omega}') \qquad (39) $$
Before being used in (35), this source is normalised:

$$ S_g^{f,n}(\mathbf{r},\mathbf{\Omega}) \leftarrow \frac{1}{k^{(n)}}\, S_g^{f,n}(\mathbf{r},\mathbf{\Omega}) \qquad (40) $$
− The preceding steps are repeated until the estimate (38) is stabilised.
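The structure of steps (38)–(40) is that of a power iteration on the fission source. The sketch below illustrates it with a small matrix standing in for the transport-and-collision kernels; the matrix values are an assumption for illustration only:

```python
import numpy as np

# The source iteration (38)-(40), sketched with a 2x2 "generation" matrix M
# that maps one generation's normalised fission source to the next (M is an
# illustrative stand-in for the transport and collision kernels of the text).
M = np.array([[0.9, 0.3],
              [0.2, 0.8]])

S = np.array([1.0, 1.0])        # initial fission source guess
k = 1.0
for n in range(200):
    S_new = M @ S                # source produced by one generation, cf. (39)
    k = S_new.sum() / S.sum()    # generation-ratio estimate of k, cf. (38)
    S = S_new / k                # normalisation, cf. (40)

# k converges to the dominant eigenvalue of M, and S to its eigenvector.
```

This is exactly the power method of Chapter 3 applied to the fission-source operator; the MC simulation only replaces the matrix-vector product by a statistical estimate.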
In the context of (37) it is seen that, from the viewpoint of solving the integral equation (32)/(37), the above described choice of values for the phase variables g, r, Ω as samples from their probability distributions has the effect of an economical procedure of numerical integration, in which the density of sampling of elementary volumes in the phase space is proportional to the probability of populating these volumes.
7. Partial differential equations
According to their characteristics, i.e. curves along which a partial differential equation
transforms into an ordinary differential equation, partial differential equations are subdivided
into the following three categories:
− hyperbolic, a prototypical example being the one-dimensional wave equation:

$$ \frac{\partial^2 u(x,t)}{\partial t^2} = v^2\, \frac{\partial^2 u(x,t)}{\partial x^2} $$

− parabolic, a prototypical example being the one-dimensional diffusion equation:

$$ \frac{\partial u(x,t)}{\partial t} = \frac{\partial}{\partial x}\left( D\, \frac{\partial u(x,t)}{\partial x} \right) $$

− elliptic, a prototypical example being the Poisson equation:

$$ \frac{\partial^2 u(x,y)}{\partial x^2} + \frac{\partial^2 u(x,y)}{\partial y^2} = \rho(x,y) $$
From a computational point of view, however, a different classification is more important – whether the respective equation defines only a boundary value problem, or both a boundary and an initial value problem.

Although in either case the general approach to solving the equation can be reduced to representing the derivatives as finite differences, i.e. to the construction of a differencing scheme, and thus to transforming the equation into a system of linear algebraic equations for the values of the dependent variable at a set of fixed argument values (or for the average values of the dependent variable over the grid cells), in the case of solving an initial value problem there exist special additional restrictions with respect to the proportions between the discretisation steps in space and in time.
An example of the latter is the following equation:

$$ \frac{\partial u}{\partial t} = -v\, \frac{\partial u}{\partial x}, \qquad (1) $$

the solution of which has the form:

$$ u(x,t) = f(x - vt) \qquad (2) $$

(f is an arbitrary function and v is a constant)
The direct approach to constructing a difference scheme for (1) involves the choice of grid points $x_j = x_0 + j\Delta x,\ j = 0, 1, \dots, J$ and $t_n = t_0 + n\Delta t,\ n = 0, 1, \dots, N$.
Let $u_j^n \equiv u(x_j, t_n)$ and let the time derivative be represented by the forward difference:

$$ \left. \frac{\partial u}{\partial t} \right|_{x_j,\, t_n} = \frac{u_j^{n+1} - u_j^n}{\Delta t} + O(\Delta t) \qquad (3) $$

The difference expression (3) refers to the beginning $t_n$ of the current time step, so that it can be directly related to the already known solution of equation (1) at this moment, $u(x, t_n)$.
The spatial derivative can be represented by the more accurate central difference, again based only on quantities which are known at time $t_n$:

$$ \left. \frac{\partial u}{\partial x} \right|_{x_j,\, t_n} = \frac{u_{j+1}^n - u_{j-1}^n}{2\Delta x} + O(\Delta x^2) \qquad (4) $$
...................................
The truncation error of finite differencing is estimated through the Taylor expansion of the function:

$$ f(x+h) = f(x) + h f'(x) + \frac{1}{2} h^2 f''(x) + \frac{1}{6} h^3 f'''(x) + \dots $$
$$ f(x-h) = f(x) - h f'(x) + \frac{1}{2} h^2 f''(x) - \frac{1}{6} h^3 f'''(x) + \dots $$

Therefore:

for the forward difference: $\dfrac{f(x+h) - f(x)}{h} = f'(x) + \dfrac{1}{2} h f''(x) + \dots$

for the central difference: $\dfrac{f(x+h) - f(x-h)}{2h} = f'(x) + \dfrac{1}{6} h^2 f'''(x) + \dots$
....................................
Thus, the difference approximation of (1) leads to the following explicit relation for evaluating $u_j^{n+1}$:

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = -v\, \frac{u_{j+1}^n - u_{j-1}^n}{2\Delta x}, \quad \text{or} \quad u_j^{n+1} = u_j^n - \frac{v\,\Delta t}{2\Delta x} \left( u_{j+1}^n - u_{j-1}^n \right) \qquad (5) $$
Unfortunately, however, a difference scheme of this kind, known as FTCS (Forward in
Time, Centred in Space), turns out to be unstable. This can be demonstrated as follows.
von Neumann stability analysis
A differencing scheme for solving an initial value problem is stable if the roundoff error accumulated with time remains bounded (neutral stability) or decreases (full stability). If this error grows without restriction with time, the differencing scheme is unstable.
Let the roundoff error of the solution of the difference equation (e.g. (5)) be:

$$ \varepsilon_j^n \equiv U_j^n - u_j^n, \qquad (6) $$

where $U_j^n$ is the exact, and $u_j^n$ is the numerical solution of the difference equation.
The exact solution must satisfy the difference equation, i.e. the equality

$$ U_j^{n+1} = U_j^n + r \left( U_{j+1}^n - U_{j-1}^n \right); \qquad r \equiv -\frac{v\,\Delta t}{2\Delta x} $$

is satisfied exactly.
If equation (5) is subtracted from this, the result will be:

$$ \varepsilon_j^{n+1} = \varepsilon_j^n + r \left( \varepsilon_{j+1}^n - \varepsilon_{j-1}^n \right) \qquad (7) $$
The full coincidence between (7) and (5) shows that the time behaviour of the roundoff error and of the numerical solution will be the same.
Under sufficiently general conditions the spatial behaviour of the roundoff error can be expanded into a Fourier series (similar to the discrete Fourier transform discussed in Chapter 2):

$$ \varepsilon(x,t) = \sum_m A_m(t) \exp(i k_m x) \qquad (8) $$
Under equally general conditions the typical time behaviour of the roundoff error can be assumed to be exponential (i.e. the relative change of this error after each time integration step remains practically constant), so that finally:

$$ \varepsilon(x,t) \cong \sum_m C_m \exp(\alpha_m t) \exp(i k_m x) \qquad (9) $$
For the purpose of a stability analysis the behaviour of the series (9) can be inferred from the behaviour of any of its terms. Thus it can further be assumed that:

$$ \varepsilon_j^n = \exp(\alpha t_n) \exp\left( i k (j\Delta x) \right) = \xi^n \exp\left( i k (j\Delta x) \right), \qquad (10) $$

where the amplification factor $\xi \equiv \exp(\alpha \Delta t)$ will depend on the wave number k through the index m in (9).

Thus, if $|\xi(k)| > 1$ for some k, the differencing scheme will be unstable.
Through substituting (10) in (7), the amplification factor $\xi(k)$ is found from:

$$ \xi^{n+1} \exp(i k j \Delta x) = \xi^n \exp(i k j \Delta x) \times \left[ 1 + r \left( \exp(i k \Delta x) - \exp(-i k \Delta x) \right) \right], $$

or:

$$ \xi(k) = 1 + 2 i r \sin(k \Delta x) \qquad (11) $$

Therefore, for an arbitrary k, $|\xi(k)| = \sqrt{1 + \left( \dfrac{v\,\Delta t}{\Delta x} \sin(k \Delta x) \right)^2} > 1$ and the FTCS scheme is unconditionally unstable.
The instability of FTCS can be remedied through the adoption of the so-called
Lax scheme
This scheme consists in the following substitution of $u_j^n$ on the right-hand side of (5):

$$ u_j^n \to \frac{1}{2} \left( u_{j+1}^n + u_{j-1}^n \right), \qquad (12) $$

which is equivalent to a linear interpolation in space and leads to:

$$ u_j^{n+1} = \frac{1}{2} \left( u_{j+1}^n + u_{j-1}^n \right) - \frac{v\,\Delta t}{2\Delta x} \left( u_{j+1}^n - u_{j-1}^n \right) \qquad (13) $$
Thus, analogously to (11):

$$ \xi(k) = \cos(k \Delta x) + i r \sin(k \Delta x) \quad \text{and} \quad |\xi(k)|^2 = \cos^2(k \Delta x) + r^2 \sin^2(k \Delta x) \qquad (14) $$

Therefore the stability condition $|\xi(k)| \le 1$ will be:

$$ r \equiv \frac{v\,\Delta t}{\Delta x} \le 1. \qquad (15) $$

This restriction is known as the Courant condition.
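The amplification factors (11) and (14) can be checked empirically. The sketch below advects a single sine mode with periodic boundaries; the grid size, mode number and Courant number are my own illustrative choices:

```python
import numpy as np

# Empirical comparison of FTCS (5) and Lax (13) at Courant number v*dt/dx = 0.5.
J, steps = 50, 200
x = np.arange(J)
u0 = np.sin(2 * np.pi * 5 * x / J)      # single mode with k*dx = 2*pi*5/J
C = 0.5                                  # v*dt/dx

u_ftcs = u0.copy()
u_lax = u0.copy()
for n in range(steps):
    up, um = np.roll(u_ftcs, -1), np.roll(u_ftcs, 1)   # u_{j+1}, u_{j-1}
    u_ftcs = u_ftcs - 0.5 * C * (up - um)              # FTCS, eq. (5)
    up, um = np.roll(u_lax, -1), np.roll(u_lax, 1)
    u_lax = 0.5 * (up + um) - 0.5 * C * (up - um)      # Lax, eq. (13)

# FTCS amplifies the mode by |xi(k)|^steps; Lax keeps it bounded by 1.
```

After 200 steps the FTCS amplitude has grown by several orders of magnitude even though the Courant condition is satisfied, while the Lax amplitude never exceeds its initial value.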
Diffusion initial value problem
The simplest example is the one-dimensional problem with constant coefficients:

$$ \frac{\partial u}{\partial t} = D\, \frac{\partial^2 u}{\partial x^2} \qquad (16) $$
Explicit scheme
The counterpart of the above examined FTCS scheme will be:

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = D\, \frac{u_{j+1}^n - 2 u_j^n + u_{j-1}^n}{\Delta x^2}, \quad \text{or} \quad u_j^{n+1} = u_j^n + \frac{D\,\Delta t}{\Delta x^2} \left( u_{j+1}^n - 2 u_j^n + u_{j-1}^n \right) \qquad (17) $$
(The difference expression for the second derivative is a central difference)
The von Neumann analysis leads to the following relation for the amplification factor:

$$ \xi = 1 - 2r \left( 1 - \cos(k \Delta x) \right) = 1 - 4r \sin^2\!\left( \frac{k \Delta x}{2} \right), \qquad (18) $$

where $r \equiv \dfrac{D\,\Delta t}{\Delta x^2}$.
Thus the stability condition $|\xi(k)| \le 1$ becomes:

$$ r \le \frac{1}{2}, \quad \text{or} \quad \Delta t \le \frac{\Delta x^2}{2D} \qquad (19) $$
This condition can in principle be met, but it often imposes an excessively severe re-
striction on the time integration step and may render the problem practically impossible to
solve.
Implicit scheme
If the expected final state of the modelled quantity (the solution of the differential equation) is stationary, i.e. $\dfrac{\partial u(x,t)}{\partial t} \xrightarrow[t \to \infty]{} 0$ and hence also $\dfrac{\partial^2 u(x,t)}{\partial x^2} \xrightarrow[t \to \infty]{} 0$, unconditional stability can be ensured through the following differencing scheme, implicit with respect to time (BTCS):

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = D\, \frac{u_{j+1}^{n+1} - 2 u_j^{n+1} + u_{j-1}^{n+1}}{\Delta x^2} \qquad (20) $$
In contrast with the explicit scheme (17), here the finite difference representation of the time derivative, $\dfrac{u_j^{n+1} - u_j^n}{\Delta t} = \left. \dfrac{\partial u}{\partial t} \right|_{x_j,\, t_{n+1}} + O(\Delta t)$, is referred to the end $t_{n+1}$ of the time step, and this imposes the same requirement on the right-hand side of equation (16).
The adoption of this implicit scheme involves the task of solving the following inhomogeneous linear system with a tridiagonal matrix:

$$ -r u_{j-1}^{n+1} + (1 + 2r)\, u_j^{n+1} - r u_{j+1}^{n+1} = u_j^n, \qquad j = 2, \dots, J-1, \qquad (21) $$

where $r \equiv \dfrac{D\,\Delta t}{\Delta x^2}$.
It is seen that at $\Delta t \to \infty$ ($r \to \infty$) expression (20) will be a finite difference counterpart of $\dfrac{\partial^2 u(x,t)}{\partial x^2} = 0$, which correctly reflects the expected final state of the solution of the differential equation.
The von Neumann analysis produces the following equation for the amplification factor:

$$ -\xi r \exp(-i k \Delta x) + \xi (1 + 2r) - \xi r \exp(i k \Delta x) = 1, $$

or

$$ \xi(k) = \frac{1}{1 + 2r \left( 1 - \cos(k \Delta x) \right)} = \frac{1}{1 + 4r \sin^2\!\left( \frac{k \Delta x}{2} \right)} \qquad (22) $$

Therefore $|\xi(k)| < 1$ with any step Δt and the scheme (20) is unconditionally stable.
Crank-Nicolson scheme
Through combining the FTCS and BTCS schemes, a difference scheme can be obtained which is both unconditionally stable and of second order of accuracy (i.e. with a truncation error $O(\Delta t^2)$) with respect to the finite difference representation of the time derivative:

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = \frac{D}{2}\, \frac{\left( u_{j+1}^{n+1} - 2 u_j^{n+1} + u_{j-1}^{n+1} \right) + \left( u_{j+1}^n - 2 u_j^n + u_{j-1}^n \right)}{\Delta x^2} \qquad (23) $$
The latter will be true if the right-hand side can be interpreted as an estimate of $\dfrac{\partial^2 u}{\partial x^2}$ at time $t_n + \frac{1}{2}\Delta t$, so that the finite difference representation of the time derivative can acquire the meaning of a central difference:

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = \left. \frac{\partial u}{\partial t} \right|_{x_j,\, t_n + \frac{1}{2}\Delta t} + O(\Delta t^2). $$
The requirement that the right-hand side of equation (16) is referred to the centre of the time step is fulfilled through linear interpolation:

$$ \frac{\partial^2 u}{\partial x^2}\!\left( x_j,\, t_n + \tfrac{1}{2}\Delta t \right) = \frac{1}{2} \left[ \frac{\partial^2 u}{\partial x^2}\!\left( x_j, t_n \right) + \frac{\partial^2 u}{\partial x^2}\!\left( x_j, t_{n+1} \right) \right] $$
Similarly to the above, it can be shown that with this differencing scheme the amplification factor will be:

$$ \xi(k) = \frac{1 - 2r \sin^2\!\left( \frac{k \Delta x}{2} \right)}{1 + 2r \sin^2\!\left( \frac{k \Delta x}{2} \right)}, \qquad (24) $$

so that the method is unconditionally stable with any step Δt.
All of the discussed differencing schemes can be directly generalised for the case $D(x)$ on the basis of the following representation of $\left. \dfrac{\partial}{\partial x}\left( D(x)\, \dfrac{\partial u(x,t)}{\partial x} \right) \right|_{x = x_j}$:

$$ \frac{D_{j+1/2} \left( u_{j+1} - u_j \right) - D_{j-1/2} \left( u_j - u_{j-1} \right)}{\Delta x^2} \qquad (25) $$
Multidimensional case
The treatment below will be made on the example of the two-dimensional problem:

$$ \frac{\partial u}{\partial t} = D \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) \qquad (26) $$

The generalisation of explicit methods like FTCS is straightforward, and the computational procedure is analogous to the one-dimensional case.
The application of an implicit method, e.g. of the Crank-Nicolson scheme, leads to the following relation:

$$ u_{j,l}^{n+1} = u_{j,l}^n + \frac{r}{2} \left( \delta_x^2 u_{j,l}^{n+1} + \delta_x^2 u_{j,l}^n + \delta_y^2 u_{j,l}^{n+1} + \delta_y^2 u_{j,l}^n \right), \qquad (27) $$

where $r \equiv \dfrac{D\,\Delta t}{\Delta^2}$, $\Delta \equiv \Delta x = \Delta y$, and $\delta_x^2 u_{j,l}^n \equiv u_{j+1,l}^n - 2 u_{j,l}^n + u_{j-1,l}^n$, and similarly for $\delta_y^2 u_{j,l}^n$.
After the introduction of one-dimensional indexing in space, the relations (27) take the form of a large inhomogeneous linear system of equations with a sparse matrix for the solution values at the grid nodes. Unlike the one-dimensional case of a tridiagonal system, for the solution of which there exist direct and economical methods, here usually specialised iterative methods are required.
A possible alternative, which makes it possible to circumvent or simplify the iterative solving of the algebraic problem, is the so-called alternating-direction implicit method (ADI). The method can be regarded as an equivalent of the Crank-Nicolson scheme and has the same second order of accuracy with respect to time and space, as well as the same unconditional stability. With this method each time step is subdivided into two steps of size Δt/2 (in the three-dimensional Cartesian case – three steps of size Δt/3). At each step a one-dimensional implicit scheme along the respective direction is applied:

$$ u_{j,l}^{n+1/2} = u_{j,l}^n + \frac{r}{2} \left( \delta_x^2 u_{j,l}^{n+1/2} + \delta_y^2 u_{j,l}^n \right) $$
$$ u_{j,l}^{n+1} = u_{j,l}^{n+1/2} + \frac{r}{2} \left( \delta_x^2 u_{j,l}^{n+1/2} + \delta_y^2 u_{j,l}^{n+1} \right) \qquad (28) $$
It is clear that each substep will involve the solving of a linear system with a tridiagonal
matrix.
An example: The one-dimensional heat equation
The equation is parabolic and has the same form as the diffusion equation:

$$ \frac{\partial T}{\partial t}(x,t) = \alpha\, \frac{\partial^2 T}{\partial x^2}(x,t), $$
where T is temperature and α is the so-called thermal diffusivity.
The selected example is in slab geometry, $x \in \left[ 0, \frac{\pi}{2} \right]$, and the boundary conditions correspond to an adiabatic process:

$$ \frac{\partial T}{\partial x}(0,t) = \frac{\partial T}{\partial x}\!\left( \frac{\pi}{2}, t \right) = 0 $$
The selected initial condition is:

$$ T(x,0) = \cos(x) $$
The assumed value of α is 0.061644.
The asymptotic solution is:

$$ T(x, t \to \infty) = \frac{2}{\pi} \int_0^{\pi/2} \cos(x)\, dx = \frac{2}{\pi} $$
With a spatial step of $\frac{1}{100} \times \frac{\pi}{2}$ and a time step of 1 s, the solution by the implicit time-differencing scheme (BTCS) is illustrated in the figures below. The asymptotic distribution of temperature is reached after 28 s (maximum absolute deviation from the asymptotic value below 0.001). For comparison, the maximum allowed time step for the explicit scheme (FTCS) is 2×10⁻³ s.
Solution of the one-dimensional heat equation at different moments of time
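A possible BTCS implementation of this example in Python: the tridiagonal system (21) is solved with the Thomas algorithm, and the reflective (ghost-point) treatment of the adiabatic boundaries is my own choice of discretisation:

```python
import numpy as np

# Grid and physical data from the example: alpha = 0.061644, slab [0, pi/2],
# 100 spatial steps, time step 1 s.
J = 101
x = np.linspace(0.0, np.pi / 2, J)
dx = x[1] - x[0]
alpha = 0.061644
dt = 1.0
r = alpha * dt / dx**2

def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    u = np.empty(n)
    u[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        u[i] = dp[i] - cp[i] * u[i + 1]
    return u

# BTCS matrix (21) with reflective rows for the adiabatic boundary conditions:
a = np.full(J, -r); b = np.full(J, 1 + 2 * r); c = np.full(J, -r)
a[0] = 0.0;  c[0] = -2 * r       # ghost point T_{-1} = T_1 (zero gradient at x = 0)
a[-1] = -2 * r; c[-1] = 0.0      # ghost point on the right boundary

T = np.cos(x)                    # initial condition T(x, 0) = cos(x)
for n in range(60):
    T = thomas(a, b, c, T)       # one implicit time step of 1 s
```

After a few tens of steps the temperature profile flattens to the asymptotic value 2/π, in agreement with the figure described above.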
Application: the diffusion equation in nuclear reactor physics
With known group constants and boundary conditions, the conditionally critical multi-
group stationary diffusion equation is:
$$ -\nabla \cdot D_g(\mathbf{r}) \nabla \Phi_g(\mathbf{r}) + \Sigma_{rg}(\mathbf{r}) \Phi_g(\mathbf{r}) = \frac{\chi_g}{\lambda} \sum_{g'=1}^{G} \nu\Sigma_{g'}(\mathbf{r}) \Phi_{g'}(\mathbf{r}) + \sum_{g'=1}^{G} \Sigma_{g' \to g}(\mathbf{r}) \Phi_{g'}(\mathbf{r}) \qquad (g = 1, \dots, G), $$

where $\Sigma_{rg} \equiv \Sigma_{tg} - \Sigma_{g \to g}$ is the removal cross-section from group g.
Since epithermal neutrons cannot gain energy from scattering, if all thermal neutrons are aggregated in a single group (as a rule with an upper boundary of 0.625 eV), then the scattering source will be:

$$ \sum_{g'=1}^{g-1} \Sigma_{g' \to g}(\mathbf{r}) \Phi_{g'}(\mathbf{r}). $$

(Here it should be recalled that the groups are numbered by decreasing energy.)
All further treatment will refer to this case, although the final results are directly generalised for any energy group structure. If we also assume that the fission neutron source

$$ F_g(\mathbf{r}) \equiv \frac{\chi_g}{\lambda} \sum_{g'=1}^{G} \nu\Sigma_{g'}(\mathbf{r}) \Phi_{g'}(\mathbf{r}) $$

is estimated in advance (e.g. from a previous approximation of the group fluxes), then each equation in the above system can be solved separately, provided that this is done in the order g = 1, g = 2, ..., g = G.
And so, the diffusion equation in group g will have the form:

$$ -\nabla \cdot D_g(\mathbf{r}) \nabla \Phi_g(\mathbf{r}) + \Sigma_{rg}(\mathbf{r}) \Phi_g(\mathbf{r}) = S_g(\mathbf{r}), $$

where the source

$$ S_g(\mathbf{r}) \equiv \sum_{g'=1}^{g-1} \Sigma_{g' \to g}(\mathbf{r}) \Phi_{g'}(\mathbf{r}) + F_g(\mathbf{r}) $$

is assumed to be known.
After introducing a discretisation in space, such that all boundaries between conditionally homogenised regions of the reactor medium coincide with boundaries between nodes, a one-dimensional numbering of the nodes (e.g. from top to bottom, from inside to outside and from left to right), and after integrating over the node volumes and dividing by these volumes $V_k$, the diffusion equation in group g is represented as a set of balance equations:
$$ \frac{1}{V_k} \sum_{k'} J_{kk'} + \Sigma_r^k \Phi_k = S_k, \qquad k = 1, \dots, K, $$

where the group index g is omitted and the leakage term is transformed using the divergence theorem (Gauss's theorem or Ostrogradsky's theorem). The summation over k′ involves the nodes with which the node k has a common interface, and the net currents across these common interfaces are:

$$ J_{kk'} \equiv -\int_{S_{kk'}} D(\mathbf{r}_s)\, \nabla\Phi \cdot d\mathbf{s}. $$
The source

$$ S_k = \frac{1}{V_k} \int_{V_k} d^3r\, S_g(\mathbf{r}) $$

is, of course, known, and the nodal group fluxes

$$ \Phi_k \equiv \frac{1}{V_k} \int_{V_k} d^3r\, \Phi_g(\mathbf{r}) $$

are subject to determination.
It is evident that the resulting set of balance equations can be solved for the nodal fluxes $\Phi_k$ only if a way of expressing the currents $J_{kk'}$ through the fluxes is found. There exist various ways of introducing the sought relations, commonly referred to as nodal models.

Below the simplest nodal method will be considered, and only in one dimension, i.e. with a spatial dependence only on the x coordinate.
Let $x_{k-1}$, $x_k$ and $x_{k+1}$ be the coordinates of the centres of three neighbouring nodes (here with the shape of infinite slabs), and $x_{k-1/2}$ and $x_{k+1/2}$ be the left and right boundary coordinates of node k (common with nodes k−1 and k+1, respectively). Let also all nodes be of an equal width h.
In this case the normal projections of the current across the left and right interfaces of node k will be, correspondingly,

$$ J_k^{k-1} = D_k\, \frac{d\Phi}{dx}\!\left( x_{k-1/2} \right) \quad \text{and} \quad J_k^{k+1} = -D_k\, \frac{d\Phi}{dx}\!\left( x_{k+1/2} \right). $$
With approximating the derivatives by central differences, e.g.

$$ \frac{d\Phi}{dx}\!\left( x_{k+1/2} \right) \cong \frac{\Phi(x_{k+1}) - \Phi(x_k)}{h}, $$

and with the assumption that the flux at the centre of the node coincides with the node-averaged value, i.e.

$$ \Phi(x_k) = \Phi_k, \qquad k = 1, \dots, K, $$
the result would be:

$$ J_k^{k-1} = -\frac{D_k}{h} \left( \Phi_{k-1} - \Phi_k \right) \quad \text{and} \quad J_k^{k+1} = -\frac{D_k}{h} \left( \Phi_{k+1} - \Phi_k \right). $$
It is seen that these relations do not ensure a continuity of the current at the interfaces between neighbouring nodes, e.g.

$$ J_k^{k+1} = -\frac{D_k}{h} \left( \Phi_{k+1} - \Phi_k \right) \;\ne\; J_{k+1}^{k} = -\frac{D_{k+1}}{h} \left( \Phi_k - \Phi_{k+1} \right). $$
This makes them unfit for representing the neutron balance in the system.

The correct approach is to express the currents through the flux values at the node interfaces. For example, for the k-th node:

$$ J_k^{k-1} = -\frac{2 D_k}{h} \left( \Phi_k^{k-1} - \Phi_k \right) \quad \text{and} \quad J_k^{k+1} = -\frac{2 D_k}{h} \left( \Phi_k^{k+1} - \Phi_k \right), $$

where $\Phi_k^{k-1}$ and $\Phi_k^{k+1}$ are the flux values at the left and the right interface of node k. The introduction of such interface values automatically ensures continuity of the flux at the interfaces between neighbouring nodes, which corresponds to physical reality.
The freedom of choosing the interface flux values makes it possible to ensure the continuity of the currents across the interfaces:

$$ -\frac{2 D_k}{h} \left( \Phi_k^{k-1} - \Phi_k \right) = -\frac{2 D_{k-1}}{h} \left( \Phi_{k-1} - \Phi_k^{k-1} \right) \quad \text{and} \quad -\frac{2 D_k}{h} \left( \Phi_k^{k+1} - \Phi_k \right) = -\frac{2 D_{k+1}}{h} \left( \Phi_{k+1} - \Phi_k^{k+1} \right). $$
The above relations are equations for the interface fluxes. Their solutions are:

$$ \Phi_k^{k-1} = \frac{D_k \Phi_k + D_{k-1} \Phi_{k-1}}{D_k + D_{k-1}} \quad \text{and} \quad \Phi_k^{k+1} = \frac{D_k \Phi_k + D_{k+1} \Phi_{k+1}}{D_k + D_{k+1}}. $$
After substituting these solutions in the expressions for the interface currents, the sought relation between those currents and the nodal fluxes is obtained:

$$ J_k^{k-1} = -\frac{2}{h}\, \frac{D_{k-1} D_k}{D_{k-1} + D_k} \left( \Phi_{k-1} - \Phi_k \right) \quad \text{and} \quad J_k^{k+1} = -\frac{2}{h}\, \frac{D_k D_{k+1}}{D_k + D_{k+1}} \left( \Phi_{k+1} - \Phi_k \right). $$
A disadvantage of this simplest nodal scheme is that the two underlying assumptions (namely, the replacement of the derivatives by finite differences and of the central fluxes by the node-averaged ones) are too crude for nodes as large as the transverse dimension of a fuel assembly. For this reason actual coarse-mesh diffusion calculations are founded on more accurate but much more sophisticated nodal schemes. On the other hand, this nodal scheme is fully adequate and commonly applied for the so-called fine-mesh diffusion calculations, where the transverse dimension of the nodes does not exceed the pitch of the fuel pin grid.
With the adopted nodal scheme the set of balance equations obtains the standard form of a system of inhomogeneous linear equations:

$$ a_{k,k-1} \Phi_{k-1} + a_{k,k} \Phi_k + a_{k,k+1} \Phi_{k+1} = S_k, $$

where

$$ a_{k,k-1} = -\frac{2}{h^2}\, \frac{D_{k-1} D_k}{D_{k-1} + D_k}, \qquad a_{k,k+1} = -\frac{2}{h^2}\, \frac{D_k D_{k+1}}{D_k + D_{k+1}} \qquad \text{and} \qquad a_{k,k} = \Sigma_r^k - \sum_{k'} a_{k,k'}. $$
The first and the last equations, which refer to the two external problem boundaries, are completed using the boundary conditions. The technique is analogous to the one already discussed, and the result is certain special expressions for the coefficients $a_{1,1}$ and $a_{N,N}$, where N is the number of nodes.
In particular, let the boundary condition be of the logarithmic type and refer to the right boundary of the considered problem: $J_K^R = \alpha \Phi_K^R$, i.e.

$$ -\frac{2 D_K}{h} \left( \Phi_K^R - \Phi_K \right) = \alpha \Phi_K^R $$

This condition, taken as an equation for the boundary flux, has the solution:

$$ \Phi_K^R = \frac{\Phi_K}{1 + \dfrac{\alpha h}{2 D_K}} $$

After substituting in the expression for the boundary current:

$$ J_K^R = \frac{\alpha D_K}{D_K + \dfrac{\alpha h}{2}}\, \Phi_K. $$
If the nodes are of unequal size, $h_k,\ k = 1, \dots, N$, the matrix elements for the set of balance equations will have the form:

$$ a_{1,2} = -\frac{1}{h_1}\, \frac{\tilde D_1 \tilde D_2}{\tilde D_1 + \tilde D_2}, \qquad a_{1,1} = \Sigma_r^1 + \frac{1}{h_1}\, \frac{\alpha \tilde D_1}{\tilde D_1 + \alpha} - a_{1,2} $$

$$ a_{k,k-1} = -\frac{1}{h_k}\, \frac{\tilde D_{k-1} \tilde D_k}{\tilde D_{k-1} + \tilde D_k}, \qquad a_{k,k+1} = -\frac{1}{h_k}\, \frac{\tilde D_k \tilde D_{k+1}}{\tilde D_k + \tilde D_{k+1}} \qquad \text{and} \qquad a_{k,k} = \Sigma_r^k - \sum_{k'} a_{k,k'} $$

$$ a_{N,N-1} = -\frac{1}{h_N}\, \frac{\tilde D_{N-1} \tilde D_N}{\tilde D_{N-1} + \tilde D_N}, \qquad a_{N,N} = \Sigma_r^N + \frac{1}{h_N}\, \frac{\alpha \tilde D_N}{\tilde D_N + \alpha} - a_{N,N-1}, $$

where $\tilde D_k = \dfrac{2 D_k}{h_k}$, and α is the ratio $J/\Phi$ on the external boundary.
In matrix-vector notation the resulting inhomogeneous linear system has the form $\mathbf{A}\boldsymbol{\Phi} = \mathbf{S}$. If the nodes are of equal size, then, as follows from the definitions of the matrix elements, the matrix A will be symmetric. In all cases it will also be diagonally dominant, which means that each of its diagonal elements is larger than the sum of the magnitudes of the off-diagonal elements in the respective row. It can be shown that real symmetric diagonally dominant matrices are positive definite, i.e. all their eigenvalues are real and positive, and also that the eigenvectors of real symmetric matrices are mutually linearly independent.
Two- or three-dimensional spatial discretisation leads to analogous expressions for the
leakage term and to a matrix with the same properties as in the discussed one-dimensional
case.
From the viewpoint of solving the resultant inhomogeneous linear system, the following
additional characteristics of the matrix A are important:
− It is especially large. Thus, for example, with a coarse-mesh discretisation of the core of WWER-1000 the number of equations is typically about 5000, whereas with a fine-mesh grid it is about two million.

− The matrix is sparse, i.e. it has a very small number of non-zero elements. As can be seen from the above one-dimensional example, each of its rows contains only two non-zero off-diagonal elements, and these are adjacent to the non-zero diagonal element.
The number and the locations of non-zero off-diagonal elements in the case of two- or
three-dimensional problems will depend on the node shape. For reactors with a square
grid of assemblies (or fuel pins) the single-index numbering of nodes will lead in the
two-dimensional case to the emergence of four non-zero off-diagonal elements, and in
the three-dimensional case – to six such elements. In both cases the matrix has a band structure, i.e. all non-zero elements are confined to a strip around its diagonal. For solving systems with such matrices there exist efficient direct methods which require on the order of N arithmetic operations. For reactors with a triangular grid of assemblies (or fuel pins), such as those of the WWER type, the single-index numbering of nodes will lead in the two-dimensional case to the emergence of six non-zero off-diagonal elements, and in the three-dimensional case – to eight such elements. Unfortunately, however, with such grids the matrix of the system can never be brought to a band shape, so the efficient direct methods for solving the respective linear system are not applicable. In all cases, for solving systems with large sparse matrices it is practical to employ methods which require storing only the non-zero matrix elements (usually they are not even stored at all, and are instead evaluated in the course of the computational process). It is also desirable that the number and the locations of the non-zero elements do not change during the computational process.
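The idea of storing only the non-zero elements can be illustrated with a minimal Python sketch of a matrix-vector product for the one-dimensional tridiagonal case (the function name and array layout are our own choices, not from an actual reactor code):

```python
import numpy as np

def matvec_tridiag(main, lower, upper, phi):
    """y = A @ phi for a tridiagonal A stored as three diagonals only."""
    y = main * phi
    y[1:] += lower * phi[:-1]    # sub-diagonal contribution
    y[:-1] += upper * phi[1:]    # super-diagonal contribution
    return y

n = 6
main = np.full(n, 2.0)           # illustrative diffusion-like values
lower = np.full(n - 1, -1.0)
upper = np.full(n - 1, -1.0)
phi = np.arange(1.0, n + 1.0)

# Compare with the dense product to confirm the compact storage is equivalent.
A = np.diag(main) + np.diag(lower, -1) + np.diag(upper, 1)
print(np.allclose(matvec_tridiag(main, lower, upper, phi), A @ phi))  # True
```

The compact form stores 3N numbers instead of N² and performs O(N) operations per product, which is what makes iterative methods practical for such systems.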
In the terminology of reactor core calculations it is conventional to refer to the stage of solving the fixed-source one-group boundary value problem (in diffusion or in another approximation) as inner iterations.
A distinctive characteristic of all methods for solving linear systems with large sparse matrices is that they are iterative, i.e. the solution is obtained through a succession of approximations.
______________________________________________________________________
Another, more sophisticated nodal scheme for the diffusion problem can be constructed
as follows.
In energy-group representation the diffusion equation for node n has the form:

$-\hat{D}_n\nabla^2\Phi_n + \hat{\Sigma}_{r,n}\Phi_n = \hat{\Sigma}_{s,n}\Phi_n + \frac{1}{k_{\mathrm{eff}}}\hat{F}_n\Phi_n$, (1)

where:

$\Phi_n(\mathbf{r}) \equiv \mathrm{col}\left(\Phi_n^1(\mathbf{r}),\ldots,\Phi_n^G(\mathbf{r})\right)$ is the sought multigroup flux;

$\hat{D}_n \equiv \mathrm{diag}\left(D_n^1,\ldots,D_n^G\right)$ is the matrix of diffusion coefficients;
$\hat{\Sigma}_{r,n} \equiv \mathrm{diag}\left(\Sigma_{r,n}^1,\ldots,\Sigma_{r,n}^G\right)$ is the matrix of removal cross-sections;

$\hat{\Sigma}_{s,n}$ is the matrix of in-scatter cross-sections, with elements $\Sigma_{s,gg'}^n = \Sigma_{s,g'\to g}^n$ (with one thermal group this matrix is upper triangular);

$\hat{F}_n$ is the fission-source cross-sections matrix, with elements $F_{gg'}^n = \chi_g\,\nu\Sigma_{g'}^n$.
Equation (1) has the following more general formulation:

$\nabla^2\Phi_n - \hat{A}_n\Phi_n = 0$, (2)

where:

$\hat{A}_n \equiv (\hat{D}_n)^{-1}\left(\hat{\Sigma}_{r,n} - \hat{\Sigma}_{s,n} - \frac{1}{k_{\mathrm{eff}}}\hat{F}_n\right)$
Usually, the matrix $\hat{A}_n$ can be diagonalised through similarity transformations (cf. Chapter 3). This means that there exists a matrix $\hat{Z}_n$ such that $(\hat{Z}_n)^{-1}\hat{A}_n\hat{Z}_n = \hat{\Lambda}_n$, where $\hat{\Lambda}_n = \mathrm{diag}\left(\lambda_n^1,\ldots,\lambda_n^G\right)$ contains the eigenvalues of $\hat{A}_n$, and the columns of $\hat{Z}_n$ are the eigenvectors of $\hat{A}_n$. (Here it must be noted that although the matrix $\hat{A}_n$ is real, in the general case the elements of $\hat{\Lambda}_n$ and $\hat{Z}_n$ are complex.)

Therefore, $\hat{A}_n = \hat{Z}_n\hat{\Lambda}_n(\hat{Z}_n)^{-1}$.
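The diagonalisation can be illustrated with a short numerical sketch (the 2×2 matrix stands in for a two-group $\hat{A}_n$; its entries are purely illustrative):

```python
import numpy as np

# A 2x2 stand-in for a two-group nodal matrix A_n (purely illustrative values).
A = np.array([[ 0.04, -0.02 ],
              [-0.015, 0.09 ]])

eigenvalues, Z = np.linalg.eig(A)     # columns of Z are the eigenvectors
Lambda = np.diag(eigenvalues)

# Verify the similarity relation A = Z Lambda Z^{-1}.
print(np.allclose(A, Z @ Lambda @ np.linalg.inv(Z)))  # True
```

For a matrix with complex eigenvalues `np.linalg.eig` returns complex arrays, which is consistent with the remark above that $\hat{\Lambda}_n$ and $\hat{Z}_n$ are in general complex.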
The substitution of the latter in (2) leads to $\nabla^2\Phi_n - \hat{Z}_n\hat{\Lambda}_n(\hat{Z}_n)^{-1}\Phi_n = 0$, and after multiplying on the left by $(\hat{Z}_n)^{-1}$ the result is:

$\nabla^2\Psi_n - \hat{\Lambda}_n\Psi_n = 0$, (3)

where:

$\Psi_n \equiv (\hat{Z}_n)^{-1}\Phi_n$
The form of the leakage term in (3) is due to the fact that the elements of the matrix $\hat{A}_n$ are spatially constant within node n, because the problem (1) is formulated after an appropriate homogenisation of the group constants.
The system (3) can be solved for $\Psi_n$, and afterwards the solution of (1) can be obtained through the reverse transformation:

$\Phi_n(\mathbf{r}) = \hat{Z}_n\Psi_n(\mathbf{r})$ (4)
The equations in (3) are separated, i.e.

$\nabla^2\Psi_g^n(\mathbf{r}) = \lambda_g^n\Psi_g^n(\mathbf{r}),\quad g = 1,\ldots,G$, (5)

and each of them has particular solutions of the form:

$\Psi_g^n(\mathbf{r}) = \psi_g^n\exp\left(\boldsymbol{\kappa}_g^n\cdot\mathbf{r}\right)$, (6)

where:

$(\kappa_g^n)^2 = \lambda_g^n$
It is evident that with a fixed eigenvalue $\lambda_g^n$ there is an infinite number of possible combinations of the three components of the vector $\boldsymbol{\kappa}_g^n$, whereas any practically constructible solution of (3) can include only a part of them.
If the problem is one-dimensional, the particular solutions of (5) are two:

$\Psi_g^n(x) = \psi_g^n\exp\left(\pm\kappa_g^n x\right)$, (7)

where the quantity $\kappa_g^n = \sqrt{\lambda_g^n}$ is also generally complex.
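A quick symbolic check that the form (7) indeed solves the separated equation (5), also for a generally complex eigenvalue (a sketch using sympy):

```python
import sympy as sp

x = sp.symbols('x')
lam = sp.symbols('lam')            # a (generally complex) eigenvalue
Psi = sp.exp(sp.sqrt(lam) * x)     # candidate particular solution of type (7)

# Residual of the separated equation (5): Psi'' - lam * Psi.
residual = sp.simplify(sp.diff(Psi, x, 2) - lam * Psi)
print(residual)  # 0
```

For negative real $\lambda$ the square root is imaginary and the exponentials become oscillatory, which is the familiar trigonometric flux shape.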
The form of the solutions (6) and (7) is for Cartesian coordinates. In other coordinate systems these solutions have a different form but analogous general properties.

The functions $\Psi_g^n(\mathbf{r})$ are conventionally known as modes of the scalar flux, and correspondingly the methods for solving the diffusion equation based on (2)–(4) are known as modal methods.
Returning to the diffusion equation (1), the following can be noted:
1) From (4) it follows that the solution of (1) will also be a linear combination of the
particular solutions of (3).
2) Any isolated equation from the system (1) can be formally written as an inhomogeneous one-group equation, i.e.
$-D_g^n\nabla^2\Phi_g^n(\mathbf{r}) + \Sigma_{r,g}^n\Phi_g^n(\mathbf{r}) = Q_g^n(\mathbf{r})$ (8)
(The source on the right-hand side is estimated iteratively within the iterative process for evaluating $k_{\mathrm{eff}}$, which is generally inherent to the eigenproblem (1)/(3).)
In this case the homogeneous part of (8) has a form identical to (5):

$\nabla^2\Phi_g^n(\mathbf{r}) = (B_g^n)^2\,\Phi_g^n(\mathbf{r})$, (9)

where $(B_g^n)^2 \equiv \Sigma_{r,g}^n / D_g^n$, and correspondingly it has particular solutions identical in form and properties to those of (6)/(7). The only, albeit important, difference is that the quantity $(B_g^n)^2$, which is the counterpart of $\lambda_g^n$ from (2), is guaranteed to be real and positive.
The general solution of the inhomogeneous equation (8) can be obtained as a linear combination of its particular solutions and of particular solutions of its homogeneous part (9). The inclusion of particular solutions of the homogeneous part is expedient by virtue of statement 1) above. As regards the particular solutions of the inhomogeneous equation, their inclusion is, of course, mandatory, and their functional form will in principle be determined by the functional form of $Q_g^n(\mathbf{r})$. On the other hand, the original equation (1) is actually homogeneous, and $Q_g^n(\mathbf{r})$ is essentially a linear combination of the sought general solutions $\Phi_g^n(\mathbf{r}),\ g = 1,\ldots,G$. Therefore, no formal restrictions are imposed on the choice of a functional form for the particular solutions of the inhomogeneous equation (8). This freedom can be used for reducing the number of particular solutions of the homogeneous problem needed in constructing the general solution of (8).
Based on the above considerations, the following comparative assessment of the direct
(8) and the modal (3) formulations of the multigroup diffusion problem can be made:
Except for the one-dimensional problem and for simplified cases with nodes whose boundaries coincide with coordinate surfaces, none of the practically constructible solutions can be exact in either of the two formulations (cf. the comment on (6), which is also valid for (9)).
The selection of a finite (and not too large) number of wave vectors for (6)/(9), and the determination of the coefficients in the respective linear combinations, is made so as to satisfy important balances between reaction rates, to preserve certain symmetries of the flux and current, and to provide a sufficient number of definite values of representative particular solutions or of their moments (weighted integrals) at the node interfaces. This selection also depends on the number and the boundaries of the energy groups. One of the criteria for representativeness and sufficiency is the physical requirement for continuity of the directional fluxes and currents at the node interfaces.
The modal formulation (3) is conceptually simpler and does not involve an arbitrary choice of a solution to the inhomogeneous equation (8). This advantage is somewhat neutralised by the need to include a larger number of solutions of the form (6). Because of the complicated relation between modes and group fluxes (currents), the implementation of the continuity requirements for the flux and the current is rather sophisticated. These drawbacks, along with the conceptual inconsistency of dimensionality reduction through transverse integration, make the modal formulations more cumbersome to implement and computationally more expensive.
The direct formulation (1)/(8), for its part, requires a formally arbitrary construction of a solution to the inhomogeneous equation. The process is heuristic, and its impact on the accuracy of the general diffusion problem is difficult to assess in advance. On the other hand, the freedom in constructing such a solution can advantageously be used for reducing the number of components of the general solution, and hence the number of equations for determining the free parameters on which this solution depends. Also, the inhomogeneous form of the one-group equations harmonises well with the technique of transverse integration for reducing the problem dimensionality, and thence the mathematical and computational complexity of the resultant implementation. An additional advantage is the simplicity of formulating the inner and outer boundary conditions.
With the traditional two energy groups both approaches ensure practically equal attainable accuracy, while the direct approach has the advantage of a large number of successful and, as a rule, computationally simple implementations.
The transverse-integration technique, which is normally characteristic of the direct approach, has the advantages and disadvantages inherent to it – a simpler mathematical and computational apparatus, but also an ambiguity and a potential inaccuracy arising from the representation of the transverse leakage.
Below the so-called direct approach will be illustrated with the example of a one-dimensional formulation for two energy groups.
The two-group diffusion equations are:

$\nabla\cdot\mathbf{J}_1^n(\mathbf{r}) + \Sigma_{r,1}^n\Phi_1^n(\mathbf{r}) = \frac{1}{k_{\mathrm{eff}}}\sum_{g=1}^{2}\nu\Sigma_g^n\Phi_g^n(\mathbf{r})$

$\nabla\cdot\mathbf{J}_2^n(\mathbf{r}) + \Sigma_{a,2}^n\Phi_2^n(\mathbf{r}) = \Sigma_{s,1\to 2}^n\Phi_1^n(\mathbf{r})$ (10)
Let the hexagonal prismatic nodes have a transverse dimension $H_r$ (between the centres of adjacent nodes) and height $H_z$. The node volume V and the area of its base $F_{hex}$ are then

$V = F_{hex}H_z,\qquad F_{hex} = \frac{\sqrt{3}}{2}H_r^2$.
The transverse averaging of (10) leads to the following one-dimensional formulation:

$\frac{1}{F_{hex}}\iint_{F_{hex}}dx\,dy\;(10) \;\Rightarrow\; -D_g\frac{d^2}{dz^2}\Phi_g(z) + \Sigma_g\Phi_g(z) = Q_g(z)$, (11)

where:

$\Phi_g(z) = \frac{1}{F_{hex}}\iint_{F_{hex}}\Phi_g(\mathbf{r})\,dx\,dy$ and $Q_g(z) = S_g(z) - L_g(z)$, and

$L_g(z) = -\frac{D_g}{F_{hex}}\iint_{F_{hex}}\left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}\right)\Phi_g(\mathbf{r})\,dx\,dy$

is the so-called transverse leakage.
The form of the group source $S_g(z)$ follows directly from (10).
According to the results obtained above, the particular solutions of the homogeneous form of (11) are $\exp(\pm Bz)$ (here and below the group indices will be omitted). Let the general solution be sought in the form:

$\Phi(z) \approx \sum_{i=0}^{2}c_i p_i(z) + a_1\exp(Bz) + a_2\exp(-Bz)$, (12)

where the coefficients $c_i$ and $a_i$ are subject to determination. $P_2(z) = \sum_{i=0}^{2}c_i p_i(z)$ is a second-degree polynomial which represents the particular solution of the inhomogeneous equation (11), and $p_i(z)$ are polynomials of degree i.
The polynomial representation of the inhomogeneous-equation solution is of course not the only possibility; it is preferred, however, due to its simple form. The substitution of (12) in (11) leads to the following result:

$-C + B^2 P_2(z) = \frac{1}{D}Q(z)$, (13)

where the constant C is the second derivative of $P_2(z)$.
It is seen that the second degree chosen for the polynomial particular solution of the inhomogeneous equation is the highest which allows the source $Q(z)$ to be expanded in the same polynomial terms $p_i(z)$: $Q(z) = \sum_{i=0}^{2}q_i p_i(z)$. The additive constant which, according to (13), distinguishes the representations of $P_2(z)$ and $Q(z)$ is accommodated in a natural way by the coefficient before $p_0(z)$. (Although it is possible to choose a lower degree of the polynomial expansion, this would unnecessarily reduce the capability of the model (12) to reproduce sufficiently accurately the shape of the actual solution of the diffusion problem.)
The process of finding the coefficients in the polynomial expansion of the source (which is always assumed known and is updated iteratively) is strongly facilitated if the polynomial terms $p_i(z)$ are constructed as orthonormal: $\int_{-H/2}^{+H/2}p_i(z)p_j(z)\,dz = \delta_{ij}$. Then:

$q_i = \int_{-H/2}^{+H/2}p_i(z)Q(z)\,dz$.
Since the exponential terms in the flux model satisfy the homogeneous diffusion equation, with the chosen polynomial expansions the inhomogeneous equation (11) takes the form:

$-D\sum_{i=0}^{2}c_i\frac{d^2}{dz^2}p_i(z) + \Sigma\sum_{i=0}^{2}c_i p_i(z) - \sum_{i=0}^{2}q_i p_i(z) = 0$ (14)
In addition, because of the orthonormality of the polynomial terms:

$\int_{-H/2}^{+H/2}p_k(z)\,(14)\,dz \;\Rightarrow\; -D\sum_{i=0}^{2}c_i\int_{-H/2}^{+H/2}p_k(z)\frac{d^2}{dz^2}p_i(z)\,dz + \Sigma c_k - q_k = 0$ (15)
The second derivative $\frac{d^2}{dz^2}p_i(z)$ is zero for i = 0, 1 and a constant for i = 2; in the latter case this makes it different from $p_0(z)$ only by a constant multiplier.

Therefore, for k = 1, 2 the form of (15) is $\Sigma c_k - q_k = 0$ and

$c_k = \frac{q_k}{\Sigma}$ (16)
For k = 0 the form of (15) is

$-Dc_2\int_{-H/2}^{+H/2}p_0(z)\frac{d^2}{dz^2}p_2(z)\,dz + \Sigma c_0 - q_0 = 0$

and

$c_0 = \frac{q_0}{\Sigma} + \frac{\alpha}{B^2}c_2$, (17)

where $\alpha = \int_{-H/2}^{+H/2}p_0(z)\frac{d^2}{dz^2}p_2(z)\,dz$.
Expressions (16) and (17) fully determine the coefficients before the polynomial terms
in the flux representation (12).
The coefficients before the exponential terms in (12) are found from the inner and outer
boundary conditions.
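Relations (16) and (17) can be verified numerically: for a quadratic source the polynomial part $P_2(z)$ must satisfy the inhomogeneous equation exactly. A sketch with illustrative one-group constants (the values are made up):

```python
import numpy as np
from numpy.polynomial import legendre

H, D, Sigma = 10.0, 1.2, 0.05            # illustrative one-group constants
B2 = Sigma / D

def p_coeffs(i):
    """Legendre-series coefficients of the orthonormal p_i on [-H/2, +H/2]."""
    c = np.zeros(i + 1)
    c[i] = np.sqrt((2 * i + 1) / H)
    return c

# Gauss-Legendre quadrature mapped to [-H/2, +H/2].
t, w = legendre.leggauss(8)
z, wz = t * H / 2, w * H / 2
P = np.array([legendre.legval(2 * z / H, p_coeffs(i)) for i in range(3)])

# Second derivative of p_2 at the nodes (chain rule: d/dz = (2/H) d/dt).
P2dd = (2 / H) ** 2 * legendre.legval(t, legendre.legder(p_coeffs(2), 2))

Q = 1.0 + 0.3 * z - 0.05 * z ** 2        # a sample quadratic source
q = P @ (wz * Q)                         # projections q_i

alpha = np.sum(wz * P[0] * P2dd)         # alpha from (17)
c = np.array([0.0, q[1] / Sigma, q[2] / Sigma])   # relation (16)
c[0] = q[0] / Sigma + alpha * c[2] / B2           # relation (17)

# The residual of  -D P2'' + Sigma P2 - Q  must vanish for a quadratic Q.
P2 = c @ P
print(np.allclose(-D * c[2] * P2dd + Sigma * P2, Q))  # True
```

The exactness holds only because Q is itself quadratic; in the actual scheme Q additionally contains the transverse leakage and is updated iteratively.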
Let $J_0^\pm$ and $J_1^\pm$ be the average outgoing and incoming partial currents on the bottom and top interfaces of the node. With account of Fick's law, the applicable expressions are:

$J_0^\pm = \frac{1}{4}\Phi\left(-\frac{H}{2}\right) \pm \frac{D}{2}\frac{d\Phi}{dz}\left(-\frac{H}{2}\right);\qquad J_1^\pm = \frac{1}{4}\Phi\left(+\frac{H}{2}\right) \mp \frac{D}{2}\frac{d\Phi}{dz}\left(+\frac{H}{2}\right)$ (18)
After comparing these relations with the flux model (12) it is seen that the partial currents will be linear combinations of the coefficients $c_i,\ i = 0,\ldots,2$ and $a_k,\ k = 1,2$. The coupling of these linear combinations through the inner boundary conditions between adjacent nodes, supplemented with the outer boundary conditions, will define a linear algebraic system for the nodal coefficients $a_k,\ k = 1,2$. Solving this system is equivalent to solving the inhomogeneous diffusion equation (11) with a known source. The practical procedure for obtaining a solution of the described problem is as follows.
Let $\mathbf{C} \equiv \mathrm{col}(c_0,\ldots,c_2)$, $\mathbf{A} \equiv \mathrm{col}(a_1,a_2)$, $\mathbf{J}^\pm \equiv \mathrm{col}(J_0^\pm, J_1^\pm)$. With this notation the above-mentioned linear combinations are written as:

$\mathbf{J}^\pm = \hat{P}^\pm\mathbf{C} + \hat{Q}^\pm\mathbf{A}$ (19)

a) Let at first the incoming partial currents for the node, $\mathbf{J}^-$, be assumed known. This allows (19) to be solved for $\mathbf{A}$:

$\mathbf{A} = (\hat{Q}^-)^{-1}\left(\mathbf{J}^- - \hat{P}^-\mathbf{C}\right)$ (20)

b) These incoming partial currents are actually also subject to determination. The equations for them are expressed as continuity conditions for the current across the node interfaces:

$J_{n,0}^- = J_{m_0,1}^+,\qquad J_{n,1}^- = J_{m_1,0}^+$, (21)

where the indices $m_0$ and $m_1$ denote the bottom and, respectively, the top neighbour of node n.

Through (19), equations (21) – two per node in total – are essentially formulated for the coefficients before the exponential terms, which are also two per node.

The implementation of (21) requires a relation between the incoming and the outgoing partial currents. Based on (19) and (20), this relation is:

$\mathbf{J}^+ = \left(\hat{P}^+ - \hat{Q}^+(\hat{Q}^-)^{-1}\hat{P}^-\right)\mathbf{C} + \hat{Q}^+(\hat{Q}^-)^{-1}\mathbf{J}^-$ (22)
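The consistency of (22) with (19) and (20) is easy to confirm numerically; in the sketch below the response matrices are random stand-ins (in the actual scheme they follow from (12) and (18)):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random, invertible stand-ins for the response matrices of (19);
# in the actual scheme they follow from (12) and (18).
Pp, Pm = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
Qp = rng.normal(size=(2, 2))
Qm = rng.normal(size=(2, 2)) + 3 * np.eye(2)   # keep Q^- well-conditioned
C = rng.normal(size=3)                         # polynomial coefficients
Jm = rng.normal(size=2)                        # incoming partial currents

# Relation (20): solve (19) for the exponential coefficients A.
A = np.linalg.solve(Qm, Jm - Pm @ C)

# Outgoing currents via (19) and via the combined relation (22) must agree.
Jp_19 = Pp @ C + Qp @ A
T = Qp @ np.linalg.inv(Qm)
Jp_22 = (Pp - T @ Pm) @ C + T @ Jm
print(np.allclose(Jp_19, Jp_22))  # True
```

The agreement is a pure algebraic identity, so it holds for any invertible $\hat{Q}^-$, regardless of the particular matrix entries.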
The overall diffusion problem is therefore solved through the following two-tier iterative procedure.

a) Inner iterations for updating the coefficients before the exponential terms in the flux expansion (12), based on relations (21) and (22).

The inner-iteration cycle begins with finding the coefficients before the polynomial components in (12) through (16) and (17).

The basis of the inner iterations is the evaluation of the outgoing partial currents through the incoming ones and the polynomial coefficients, using (22).

In preparation for the next inner iteration, the incoming currents are expressed through the outgoing ones using (21) for the inner interfaces, or using the outer boundary conditions for the external problem surfaces.
The cycle is completed by calculating new polynomial coefficients for the scattering source in the next group, or for the fission source in the next outer iteration. Since this source is evidently proportional to the flux, the calculation reduces to finding the so-called flux moments

$f_i = \int_{-H/2}^{+H/2}p_i(z)\Phi(z)\,dz$.

For this purpose the coefficients before the exponential terms in the expansion (12) are expressed through the partial currents using (20).
b) The outer iterations (source iterations) involve updating the transverse leakage (if the problem is actually three-dimensional) and the fission source, and a calculation of the effective multiplication factor.
______________________________________________________________________
The joint solving of the group equations with a dependent source, i.e. the search for the fission neutron source and the effective multiplication factor, has the following general characteristics.
The generalised matrix-vector representation of the spatially discretised multigroup diffusion criticality problem is:

$\hat{A}\Phi = \frac{1}{\lambda}\hat{F}\Phi$, (1)

where the vector $\Phi \equiv \mathrm{col}(\Phi_1,\ldots,\Phi_g,\ldots,\Phi_G)$ is a concatenation of the vectors $\Phi_g$ introduced above as solutions of the one-group inhomogeneous equations; the matrix $\hat{A}$ has a block structure assembled from the matrices $\hat{A}_g$ of these equations; the matrix $\hat{F}$ is an analogous result from the spatial and energy discretisation of the fission neutron source

$\chi(E)\int_0^\infty \nu\Sigma(\mathbf{r},E')\,\Phi(\mathbf{r},E')\,dE'$.
The form of this generalised representation does not depend on the applied nodal scheme. The one-group inhomogeneous equations are usually solved through inner iterations with a fixed fission source estimated from a previous flux approximation, and the solving of the inhomogeneous equations is tantamount to inverting the matrix $\hat{A}$. The factor λ has the meaning of an effective multiplication factor and is actually included in the normalisation of the fixed fission source, most often to

$\frac{1}{\lambda}\sum_g \chi_g \sum_{g'}\sum_n \nu\Sigma_{g'}^n\Phi_{g'}^n = 1$.

It is evident that this normalisation is effectively imposed on the previous flux approximation. With this normalisation
of the right-hand side of the multigroup equation, the result from the j-th cycle of inner iterations is:

$\Phi^{(j)} = \hat{A}^{-1}\hat{F}\Phi^{(j-1)}$ (2)
It is seen that the succession of those cycles, which is essentially the outer-iteration process, is a power iteration with the matrix $\hat{B} \equiv \hat{A}^{-1}\hat{F}$. Therefore, with a sufficiently large j the relation (2) is equivalent to:

$\lambda_1\Phi = \hat{B}\Phi$, (3)

where $\lambda_1$ is the largest eigenvalue of $\hat{B}$ and Φ is its corresponding eigenvector. It can be shown that only this eigenvector has real and non-negative values of all of its components, which is actually the physical requirement on the neutron flux. Therefore, only the largest eigenvalue of $\hat{B}$ has the physical meaning of an effective multiplication factor. And of course, (3) is an alternative formulation of (1).
Within the framework of the power-iteration process (Chapter 3), the new estimate of $\lambda_1$ is obtained e.g. through the expression $\lambda_1^{(j)} = \frac{\mathbf{y}\cdot\Phi^{(j)}}{\mathbf{y}\cdot\Phi^{(j-1)}}$, where y can in principle be any non-zero vector. The choice $\mathbf{y} = \Phi^{(j)}$, i.e.

$\lambda_1^{(j)} = \frac{\Phi^{(j)}\cdot\Phi^{(j)}}{\Phi^{(j)}\cdot\Phi^{(j-1)}}$,

will accelerate convergence if the eigenvectors of $\hat{B}$ are mutually orthogonal (Chapter 3). Although such orthogonality is not guaranteed, this choice is standard for solving the described problem. Moreover, for evaluating $\lambda_1^{(j)}$ it is sufficient that the numerator and the denominator be the same linear combinations of the elements of $\Phi^{(j)}$ and $\Phi^{(j-1)}$ respectively. Based on this, and in accordance with the commonly adopted definition of the effective multiplication factor, the latter is most often evaluated through the ratio:
$k_{\mathrm{eff}}^{(j)} = \frac{(\hat{F}\Phi)^{(j)}\cdot(\hat{F}\Phi)^{(j)}}{(\hat{F}\Phi)^{(j)}\cdot(\hat{F}\Phi)^{(j-1)}} = \frac{\sum_n V_n\left(\sum_{g=1}^{2}\nu\Sigma_g^n\Phi_g^{n,(j)}\right)^2}{\sum_n V_n\left(\sum_{g=1}^{2}\nu\Sigma_g^n\Phi_g^{n,(j)}\right)\left(\sum_{g=1}^{2}\nu\Sigma_g^n\Phi_g^{n,(j-1)}\right)}$, (4)

where $V_n$ is the volume of node n.
The rightmost expression in (4) is for the particular (however commonplace) case of two energy groups, with account of the fact that $\chi_1 = 1$ and $\chi_2 = 0$.

Here it must be recalled that before the next cycle of inner iterations the fission neutron source is normalised: $\hat{F}\Phi^{(j)} \leftarrow \frac{1}{k_{\mathrm{eff}}^{(j)}}\hat{F}\Phi^{(j)}$. It is also worth mentioning that the term inner iterations is only conventional and actually refers to the effective inversion of the matrix $\hat{A}$, which in some cases does not involve an iterative procedure.

The outer iterations can be accelerated using certain specialised methods, which however will not be discussed here.
The outer iterations can be embedded in criticality-search iterations, which consist in varying the material composition of the reactor medium (e.g. the boron concentration and/or the position of chosen control rods) in order to achieve $k_{\mathrm{eff}} = 1$, and the latter in their own turn can be embedded in a cycle of so-called burnup iterations. Burnup iterations are a means of accounting for the effect of the evolution of the nuclide composition of the fuel during reactor operation. The process is iterative because the material properties, and hence the matrices $\hat{A}$ and $\hat{F}$, depend on the sought nuclide composition (often aggregately characterised by a quantity known as fuel burnup). These iterations usually account separately for the so-called poisoning (generation, decay and neutron-induced depletion of fission products with large neutron-absorption cross-sections).
Example: one-dimensional two-group problem
The example consists in solving the one-dimensional two-group stationary diffusion equation in plane geometry with diffusion constants representative of the fuel and the reflector regions of WWER-1000. The boundary conditions are void (α = 0.5). The nodal scheme is based on finite differencing and the power iteration is not accelerated. The convergence criteria are $\varepsilon_k = 10^{-7}$ for the multiplication factor and $\varepsilon_f = 10^{-6}$ for the fission-source shape.

The first example is for a problem with a single fuel material (with relatively high multiplying properties) without a reflector. The effect of the discretisation step h (constant for the entire problem) is studied. The thickness of the fuel layer is 50 cm and is close to the critical value.
The effect of the discretisation step h (with a step of 16.7 cm the problem is subdivided into only 3 nodes) is demonstrated to be quite significant from a physical point of view. It is also seen that the fission-rate shape is very close to a cosine, as should be expected for a bare homogeneous slab.
Bare homogeneous slab. Effect of the discretisation step

h, cm   keff       δkeff, pcm   # of source iterations
16.7    1.011556   1520         33
10      1.00215     580         23
5       0.99791     155         18
1       0.996355      0         23

(1 pcm = 10^-5)
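The trend in the table – $k_{\mathrm{eff}}$ decreasing towards its fine-mesh value as h shrinks – can be reproduced qualitatively with a much simpler sketch: a one-group bare slab with zero-flux edges and made-up constants (the two-group WWER data used for the actual table are not reproduced here):

```python
import numpy as np

# One-group bare slab, zero-flux edges, made-up constants (cm, cm^-1).
D, Sa, nuSf, a = 1.0, 0.06, 0.07, 50.0

def keff(n):
    """Finite-difference k_eff via source (power) iteration; n interior nodes."""
    h = a / (n + 1)
    A = (np.diag(np.full(n, 2 * D / h**2 + Sa))
         + np.diag(np.full(n - 1, -D / h**2), 1)
         + np.diag(np.full(n - 1, -D / h**2), -1))
    phi, k = np.ones(n), 1.0
    for _ in range(500):
        src = nuSf * phi                      # fission source F phi
        phi = np.linalg.solve(A, src / k)     # "inner" solve, fixed source
        k *= np.sum(nuSf * phi) / np.sum(src)
    return k

for n in (3, 9, 49):
    print(n, keff(n))   # k_eff decreases towards nuSf/(Sa + D*(pi/a)**2)
```

The converged flux is the discrete sine (cosine about the midplane) mode, in line with the figure below, and the fine-mesh $k_{\mathrm{eff}}$ approaches the analytic value $\nu\Sigma_f/(\Sigma_a + DB^2)$ with $B = \pi/a$.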
[Figure: Bare homogeneous slab. Fission rate – normalised fission rate (rel. units) vs. position (cm)]
[Figure: Bare homogeneous slab. Two-group flux – thermal and epithermal flux (rel. units) vs. position (cm)]
The second example differs from the first by the addition of a 20 cm thick reflector. The observed effect is a large increase of the multiplication factor and a higher sensitivity to the discretisation step h. The latter is due to the complicated flux shape in the vicinity of the border between the fuel region and the reflector. Another important effect is the flattened fission-density shape. The characteristic peaking of the thermal neutron flux in the reflector close to the fuel region is observed as well.
Reflected homogeneous slab. Effect of the discretisation step

h, cm                    keff       δkeff, pcm   # of source iterations
16.7 (20 in reflector)   1.092594   4605         12
10                       1.063492   1695         14
5                        1.050144    360         16
1                        1.046545      0         21
[Figure: Reflected homogeneous slab. Fission rate – normalised fission rate (rel. units) vs. position (cm)]
[Figure: Reflected homogeneous slab. Two-group flux – thermal and epithermal flux (rel. units) vs. position (cm)]
The third example differs from the second in that the central 10 cm contain fuel with the weakest multiplying properties, surrounded by a 10 cm intermediate layer with better multiplying properties and a 10 cm outer layer with the strongest multiplying properties (identical to that of the preceding examples). It is seen that the fission-rate shape is flattened further, although the choice of fuel regions is far from optimal.
Reflected heterogeneous slab. Effect of the discretisation step

h, cm   keff       δkeff, pcm   # of source iterations
10      0.98103    1852         18
5       0.966429    392         17
1       0.962509      0         18
[Figure: Reflected heterogeneous slab. Fission rate – normalised fission rate (rel. units) vs. position (cm)]
[Figure: Reflected heterogeneous slab. Two-group flux – thermal and epithermal flux (rel. units) vs. position (cm)]
Further reading

1. W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes in FORTRAN, 2nd ed., Cambridge University Press, 1992.
2. A. C. Kak, M. Slaney, Principles of Computerized Tomographic Imaging, IEEE Press, 1988.
3. E. E. Lewis, W. F. Miller, Computational Methods of Neutron Transport, John Wiley & Sons, 1984.
4. R. Stammler, M. J. Abbate, Methods of Steady-State Reactor Physics in Nuclear Design, Academic Press, 1983.
5. M. Hjorth-Jensen, Computational Physics, University of Oslo, 2013 (http://www.physics.ohio-state.edu/~ntg/6810/readings/Hjorth-Jensen_lectures2013.pdf).