
MA385 – Numerical Analysis I (“NA1”)

Niall Madden

November 22, 2017

0 MA385: Preliminaries 2

0.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

0.2 Taylor’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1 Solving nonlinear equations 8

1.1 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.2 The Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.3 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.4 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.5 LAB 1: the bisection and secant methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.6 What has Newton’s method ever done for me? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2 Initial Value Problems 22

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.2 Euler’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.3 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4 Runge-Kutta 2 (RK2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.5 LAB 2: Euler’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.6 Runge-Kutta 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.7 From IVPs to Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.8 LAB 3: RK2 and RK3 methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3 Solving Linear Systems 39

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.3 LU-factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.4 Solving linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.5 Vector and Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.6 Condition Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.7 Gerschgorin’s theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


Chapter 0

MA385: Preliminaries

0.1 Introduction

0.1.1 Welcome to MA385 (“NA1”)

This is a Semester 1, upper-level module on numerical analysis. It is often taken in conjunction with MA378 (“NA2”). You may be taking this course as part of your degree in

• Bachelor of Science in Mathematics, Applied Mathematics and/or Computer Science;

• Denominated B.Sc. in Mathematical Science, Financial Mathematics, or Computer Science and Information Technology;

• Graduate programme in Applied Science or Data Analytics.

The basic information for the course is as follows.

Lecturer: Dr Niall Madden, School of Maths. My office

is in room ADB-1013, Aras de Brun.

Email: [email protected]

Lectures: Monday at 9 and Thursday at 3, in AC201.

Tutorial/Lab: TBA (to begin during Week 3)

Assessment:

• Two written assignments (20%)

• Three computer labs (10%)

• One in-class test in Week 6 [tbc] (10%)

• Two-hour exam at the end of semester (60%)

The main textbook is Süli and Mayers, An Introduction to Numerical Analysis, Cambridge University Press [1]. This is available from the library at 519.4 MAY, and there are copies in the bookshop. It is very well suited to this course: though it does not overcomplicate the material, it approaches topics with a reasonable amount of rigour. There is a good selection of interesting problems. The scope of the book is almost perfect for the course, especially for those students taking both semesters. You should buy this book.

Other useful books include

• G.W. Stewart, Afternotes on Numerical Analysis, SIAM [3]. In the library at 519.4 STE. Moreover, the full text is freely available online to NUI Galway users! This book is very readable, and suited to students who would enjoy a bit more discussion.

• Cleve Moler, Numerical Computing with MATLAB [2]. The emphasis is on the implementation of algorithms in MATLAB, but the techniques are well explained and there are some nice exercises. Also, it is freely available online.

• James F. Epperson, An Introduction to Numerical Methods and Analysis [5]. There are five copies in the library at 519.4. It is particularly good at motivating why we study particular numerical methods.

• Stoer and Bulirsch, Introduction to Numerical Analysis [6]. A very complete reference for this course.

Web site:

The on-line content for the course will be hosted at http://www.maths.nuigalway.ie/MA385. There you’ll find various pieces of these notes, copies of slides, problem sets, and lab sheets.

We will also use the MA385 module on Blackboard for announcements, emails, and the Grade Book. If you are registered for MA385, you should be automatically enrolled onto the Blackboard site. If you are enrolled in MA530, please send an email to me.

These notes are a synopsis of the course material. My aim is to provide these in three main sections, and always in advance of the class. They contain most of the main remarks, statements of theorems, results and exercises. However, they will not contain proofs of theorems, examples, solutions to exercises, etc. Please let me know of any typos and mistakes that you spot.

You should try to bring these notes to class. It will make following the lecture easier, and you’ll know what notes to take down.


0.1.2 What is Numerical Analysis?

Numerical analysis is the design, analysis and implementation of numerical methods that yield exact or approximate solutions to mathematical problems.

It does not involve long, tedious calculations. We won’t (usually) implement Newton’s Method by hand, or manually do the arithmetic of Gaussian Elimination, etc.

The Design of a numerical method is perhaps the most interesting part; it is often about finding a clever way of swapping the problem for one that is easier to solve, but has the same or a similar solution. If the two problems have the same solution, then the method is exact. If they are similar (but not the same), then it is approximate.

The Analysis is the mathematical part; it usually culminates in proving a theorem that tells us (at least) one of the following:

• the method will work: that our algorithm will yield the solution we are looking for;

• how much effort is required;

• if the method is approximate, how close the computed solution is to the true one. A description of this aspect of the course, to quote Epperson [5], is being “rigorously imprecise or approximately precise”.

We’ll look at the implementation of the methods in labs.

Topics

0. We’ll preface the course with a review of Taylor’s theorem. It is central to the algorithms of the following sections.

1. Root-finding and solving nonlinear equations.

2. Initial value ordinary differential equations.

3. Matrix Algorithms: solving systems of linear equations and estimating eigenvalues.

We will also see how these methods can be applied to real-world problems, including in Financial Mathematics.

Learning outcomes

When you have successfully completed this course, you will be able to demonstrate your factual knowledge of the core topics (root-finding, solving ODEs, solving linear systems, estimating eigenvalues), using appropriate mathematical syntax and terminology.

Moreover, you will be able to describe the fundamental principles of the concepts (e.g., Taylor’s Theorem) underpinning Numerical Analysis. Then, you will apply these principles to design algorithms for solving mathematical problems, and discover the properties of these algorithms.

You will also gain the ability to use MATLAB to implement these algorithms, and to adapt the codes for more general problems, and for new techniques.

Mathematical Preliminaries

Anyone who can remember their first and second years of analysis and algebra should be well prepared for this module. Students who know a little about differential equations (initial value and boundary value) will find certain sections (particularly in Semester II) somewhat easier than those who don’t.

If it has been a while since you covered basic calculus, you will find it very helpful to revise the following: the Intermediate Value Theorem; Rolle’s Theorem; the Mean Value Theorem; Taylor’s Theorem; and the triangle inequality: |a + b| ≤ |a| + |b|. You’ll find them in any good textbook, e.g., Appendix 1 of Süli and Mayers.

You’ll also find it helpful to recall some basic linear algebra, particularly relating to eigenvalues and eigenvectors. Consider the statement: “all the eigenvalues of a real symmetric matrix are real”. If you are unsure of the meaning of any of the terms used, or if you didn’t know that it is true, you should have a look at a book on Linear Algebra.

0.1.3 Why take this course?

Many industry and academic environments require graduates who can solve real-world problems using a mathematical model, but these models can often only be resolved using numerical methods. To quote one Financial Engineer: “We prefer approximate (numerical) solutions to exact models rather than exact solutions to simplified models”.

Another expert, who leads a group in fund management with DB London, when asked “what sort of graduates would you hire?”, gave a list of specific skills that included:

• a programming language, and a 4th-generation language such as MATLAB (or S-PLUS);

• Numerical Analysis.

Graduates of our Financial Mathematics, Computing and Mathematics degrees often report to us that they were hired because they had some numerical analysis background, or were required to go and learn some before they could do some proper work. This is particularly true in the financial sector, games development, and the mathematical civil service (e.g., Met Office, CSO).


Bibliography

[1] E. Süli and D. Mayers, An Introduction to Numerical Analysis, Cambridge University Press, 2003. 519.4 MAY.

[2] Cleve Moler, Numerical Computing with MATLAB, SIAM. Also available free from http://www.mathworks.com/moler

[3] G.W. Stewart, Afternotes on Numerical Analysis, SIAM, 1996. 519.4 STE.

[4] G.W. Stewart, Afternotes goes to Graduate School, SIAM, 1998. 519.4 STE.

[5] James F. Epperson, An Introduction to Numerical Methods and Analysis. 519.4 EPP.

[6] Stoer and Bulirsch, Introduction to Numerical Analysis, Springer.

[7] Michelle Schatzman, Numerical Analysis: A Mathematical Introduction. 515 SCH.

0.2 Taylor’s Theorem

Taylor’s theorem is perhaps the most important mathematical tool in Numerical Analysis. Provided we can evaluate the derivatives of a given function at some point, it gives us a way of approximating the function by a polynomial.

Working with polynomials, particularly ones of degree 3 or less, is much easier than working with arbitrary functions. For example, polynomials are easy to differentiate and integrate. Most importantly for the next section of this course, their zeros are easy to find.

Our study of Taylor’s theorem starts with one of the first theoretical results you learned in university mathematics: the Mean Value Theorem.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Theorem 0.2.1 (Mean Value Theorem). If f is a function that is continuous and differentiable for all a ≤ x ≤ b, then there is a point c ∈ [a, b] such that

f′(c) = (f(b) − f(a)) / (b − a).

This is just a consequence of Rolle’s Theorem, and it has a few different interpretations. One is that the slope of the line that intersects f at the points a and b is equal to the slope of the tangent to f at some point between a and b.

Take notes:

There are many important consequences of the MVT, some of which we’ll return to later. Right now, we are interested in the fact that the MVT tells us that we can approximate the value of a function by a near-by value, with accuracy that depends on f′:

Take notes:


Or we can think of it as approximating f by a line:

Take notes:

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

But what if we want a better approximation? We could replace our function with, say, a quadratic polynomial. Let p2(x) = b0 + b1(x − a) + b2(x − a)² and solve for the coefficients b0, b1 and b2 so that

p2(a) = f(a),   p2′(a) = f′(a),   p2″(a) = f″(a).

Take notes:

This gives that

p2(x) = f(a) + f′(a)(x − a) + (x − a)² f″(a)/2.

Next, suppose we try to construct an approximating cubic of the form

p3(x) = b0 + b1(x − a) + b2(x − a)² + b3(x − a)³ = Σ_{k=0}^{3} bk(x − a)^k,

with the property that

p3(a) = f(a),  p3′(a) = f′(a),  p3″(a) = f″(a),  p3‴(a) = f‴(a).  (0.2.1)

Note: we can write (0.2.1) in a more succinct way, using the mathematical short-hand:

p3^(k)(a) = f^(k)(a) for k = 0, 1, 2, 3.

Again we find that

bk = f^(k)(a)/k! for k = 0, 1, 2, 3.

As you can probably guess, this formula can be easily extended for arbitrary k, giving us the Taylor Polynomial.

Definition 0.2.2 (Taylor Polynomial). The Taylor¹ Polynomial of degree k (also called the Truncated Taylor Series) that approximates the function f about the point x = a is

pk(x) = f(a) + (x − a)f′(a) + ((x − a)²/2!)f″(a) + ((x − a)³/3!)f‴(a) + · · · + ((x − a)^k/k!)f^(k)(a).

¹ Brook Taylor, 1685–1731, England. He (re)discovered this polynomial approximation in 1712, though its importance was not realised for another 50 years.

We’ll return to this topic later, with a particular emphasis on quantifying the “error” in the Taylor Polynomial.

Example 0.2.3. Write down the Taylor polynomial of degree k that approximates f(x) = e^x about the point x = 0.

Take notes:

Fig. 0.1: Taylor polynomials for f(x) = ex about x = 0

As Figure 0.1 suggests, in this case p3(x) is a more accurate estimate of e^x than p2(x), which is more accurate than p1(x). This is demonstrated in Figure 0.2, where the difference between f and pk is shown.

Fig. 0.2: Error in Taylor polys for f(x) = ex about x = 0
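The behaviour in Figures 0.1 and 0.2 is easy to reproduce numerically. The following sketch (in Python here, rather than the MATLAB used in our labs) evaluates the degree-k Taylor polynomial of e^x about a = 0 at x = 1, and prints the error for increasing k:

```python
import math

def taylor_exp(x, k):
    """Degree-k Taylor polynomial of exp about a = 0: sum of x^j / j! for j = 0..k."""
    return sum(x**j / math.factorial(j) for j in range(k + 1))

# The error |e - p_k(1)| shrinks rapidly as the degree k grows.
for k in range(1, 6):
    print(k, abs(math.e - taylor_exp(1.0, k)))
```

Each extra term makes the error at x = 1 much smaller, in line with the remainder formula studied in the next section.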


0.2.1 The Remainder

We now want to examine the accuracy of the Taylor polynomial as an approximation. In particular, we would like to find a formula for the remainder or error:

Rk(x) := f(x) − pk(x).

With a little bit of effort one can prove that

Rk(x) := ((x − a)^(k+1)/(k + 1)!) f^(k+1)(σ), for some σ between a and x.

We won’t prove this in class, since it is quite standard and features in other courses you have taken. But for the sake of completeness, a proof is included in Section 0.2.4 below.

Example 0.2.4. With f(x) = e^x and a = 0, we get that

Rk(x) = (x^(k+1)/(k + 1)!) e^σ, for some σ ∈ [0, x].

Example 0.2.5. How many terms are required in the Taylor Polynomial for e^x about x = 0 to ensure that the error at x = 1 is

• no more than 10⁻¹?

• no more than 10⁻²?

• no more than 10⁻⁶?

• no more than 10⁻¹⁰?

Take notes:

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
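For a computational check of this example, the remainder formula of Example 0.2.4 gives the bound |Rk(1)| ≤ e/(k + 1)!, since σ ∈ [0, 1]. A short Python sketch then finds the smallest degree k meeting each tolerance:

```python
import math

def terms_needed(tol):
    """Smallest degree k for which the bound e/(k+1)! on |R_k(1)| is at most tol."""
    k = 0
    while math.e / math.factorial(k + 1) > tol:
        k += 1
    return k

for tol in (1e-1, 1e-2, 1e-6, 1e-10):
    print(tol, terms_needed(tol))
```

Note that this uses the slightly pessimistic bound e^σ ≤ e; the actual error at x = 1 can be a little smaller, so one or two fewer terms may suffice in practice.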

There are other ways of representing the remainder, including the Integral Representation of the Remainder:

Rn(x) = ∫ from a to x of (f^(n+1)(t)/n!) (x − t)^n dt.  (0.2.2)

0.2.2 An application of Taylor’s Theorem

The reasons for emphasising Taylor’s theorem so early in this course are that

• it introduces us to the concepts of approximation and error estimation, but in a very simple setting;

• it is the basis for deriving methods for solving both nonlinear equations, and initial value ordinary differential equations.

With the last point in mind, we’ll now outline how to derive Newton’s method for nonlinear equations. (This is just a taster: we’ll return to this topic in the next section.)

Take notes:
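As a sketch of where that derivation leads — assuming we truncate the Taylor series at degree 1 about xk and solve f(xk) + (x − xk)f′(xk) = 0 for x — the resulting iteration can be coded as follows. This is only a preview of Section 1.3, in Python rather than MATLAB:

```python
def newton(f, df, x0, steps=6):
    """Newton's method: replace f by its degree-1 Taylor polynomial about x_k,
    and take x_{k+1} to be the zero of that linear approximation."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x

# A test problem: solve x^2 - 2 = 0, starting from x0 = 2.
root = newton(lambda x: x * x - 2, lambda x: 2 * x, 2.0)
print(root)
```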


0.2.3 Exercises

Exercise 0.1. Write down the formula for the Taylor Polynomial for

(i) f(x) = √(1 + x) about the point x = 0,

(ii) f(x) = sin(x) about the point x = 0,

(iii) f(x) = log(x) about the point x = 1.

Exercise 0.2. Prove the Integral Mean Value Theorem: there exists a point c ∈ [a, b] such that

f(c) = (1/(b − a)) ∫ from a to b of f(x) dx.

Exercise 0.3. The Fundamental Theorem of Calculus tells us that ∫ from a to x of f′(t) dt = f(x) − f(a). This can be rearranged to get f(x) = f(a) + ∫ from a to x of f′(t) dt. Use this and integration by parts to deduce (0.2.2) for the case n = 1. (Hint: check Wikipedia!)

0.2.4 A proof of Taylor’s Theorem

Here is a proof of Taylor’s theorem. It wasn’t covered in class. One of the ingredients needed is the Generalised Mean Value Theorem: if the functions F and G are continuous and differentiable, etc., then, for some point c ∈ [a, b],

(F(b) − F(a))/(G(b) − G(a)) = F′(c)/G′(c).  (0.2.3)

Theorem 0.2.6 (Taylor’s Theorem). Suppose we have a function f that is sufficiently differentiable on the interval [a, x], and a Taylor polynomial for f about the point x = a,

pn(x) = f(a) + (x − a)f′(a) + ((x − a)²/2!)f″(a) + ((x − a)³/3!)f‴(a) + · · · + ((x − a)^n/n!)f^(n)(a).  (0.2.4)

If the remainder is written as Rn(x) := f(x) − pn(x), then

Rn(x) := ((x − a)^(n+1)/(n + 1)!) f^(n+1)(σ),  (0.2.5)

for some point σ between a and x.

Proof. We want to prove that, for any n = 0, 1, 2, . . . , there is a point σ ∈ [a, x] such that

f(x) = pn(x) + Rn(x).

If x = a then this is clearly the case, because f(a) = pn(a) and Rn(a) = 0.

For the case x ≠ a, we will use a proof by induction. The Mean Value Theorem tells us that there is some point σ ∈ [a, x] such that

(f(x) − f(a))/(x − a) = f′(σ).

Using that p0(x) = f(a) and writing R0(x) = (x − a)f′(σ), we can rearrange to get

f(x) = p0(x) + R0(x),

as required; this is the case n = 0.

Now we will assume that (0.2.4)–(0.2.5) are true for the case n = k − 1, and use this to show that they are true for n = k. From the Generalised Mean Value Theorem (0.2.3), there is some point c such that

Rk(x)/(x − a)^(k+1) = (Rk(x) − Rk(a))/((x − a)^(k+1) − (a − a)^(k+1)) = Rk′(c)/((k + 1)(c − a)^k),

where we have used that Rk(a) = 0. Rearranging, we see that we can write Rk in terms of its own derivative:

Rk(x) = Rk′(c) (x − a)^(k+1)/((k + 1)(c − a)^k).  (0.2.6)

So now we need an expression for Rk′. This is done by noting that it also happens to be the remainder for the Taylor polynomial of degree k − 1 for the function f′:

Rk′(x) = f′(x) − d/dx ( f(a) + (x − a)f′(a) + ((x − a)²/2!)f″(a) + · · · + ((x − a)^k/k!)f^(k)(a) )

= f′(x) − ( f′(a) + (x − a)f″(a) + · · · + ((x − a)^(k−1)/(k − 1)!)f^(k)(a) ).

But the expression in the brackets on the last line is the formula for the Taylor Polynomial of degree k − 1 for f′. By our inductive hypothesis,

Rk′(c) := ((c − a)^k/k!) f^(k+1)(σ),

for some σ. Substitute this into (0.2.6) above and we are done.


Chapter 1

Solving nonlinear equations

1.1 Bisection

1.1.1 Introduction

Linear equations are of the form:

find x such that ax + b = 0,

and are easy to solve. Some nonlinear problems are also easy to solve, e.g.,

find x such that ax² + bx + c = 0.

Cubic and quartic equations also have solutions for which we can obtain a formula. But most equations do not have simple formulae for their solutions, so numerical methods are needed.

References

• Süli and Mayers [1, Chapter 1]. We’ll follow this pretty closely in lectures.

• Stewart (Afternotes . . . ) [3, Lectures 1–5]. A well-presented introduction, with lots of diagrams to give an intuitive introduction.

• Moler (Numerical Computing with MATLAB) [2, Chap. 4]. Gives a brief introduction to the methods we study, and a description of MATLAB functions for solving these problems.

• The proof of the convergence of Newton’s Method is based on the presentation in [5, Thm 3.2].

Our generic problem is:

Let f be a continuous function on the interval [a, b]. Find τ ∈ [a, b] such that f(τ) = 0.

Here f is some specified function, and τ is the solution to f(x) = 0.

This leads to two natural questions:

(1) How do we know there is a solution?

(2) How do we find it?

The following gives sufficient conditions for the existence of a solution:

Proposition 1.1.1. Let f be a real-valued function that is defined and continuous on a bounded closed interval [a, b] ⊂ R. Suppose that f(a)f(b) ≤ 0. Then there exists τ ∈ [a, b] such that f(τ) = 0.

Take notes:

OK – now we know there is a solution τ to f(x) = 0, but how do we actually find it? Usually we don’t! Instead we construct a sequence of estimates {x0, x1, x2, x3, . . . } that converges to the true solution. So now we have to answer these questions:

(1) How can we construct the sequence x0, x1, . . . ?

(2) How do we show that lim_{k→∞} xk = τ?

There are some subtleties here, particularly with part (2). What we would like to say is that at each step the error is getting smaller. That is,

|τ − xk| < |τ − xk−1| for k = 1, 2, 3, . . . .

But we can’t. Usually all we can say is that the bound on the error is getting smaller. That is: letting εk be a bound on the error at step k,

|τ − xk| < εk,

we have εk+1 < µεk for some number µ ∈ (0, 1). It is easiest to explain this in terms of an example, so we’ll study the simplest method: Bisection.

1.1.2 Bisection

The most elementary algorithm is the “Bisection Method”

(also known as “Interval Bisection”). Suppose that we

know that f changes sign on the interval [a,b] = [x0, x1]

and, thus, f(x) = 0 has a solution, τ, in [a,b]. Proceed

as follows

1. Set x2 to be the midpoint of the interval [x0, x1].8

Page 9: NA1) - National University of Ireland, Galwayniall/MA385/MA385.pdf ·  · 2017-11-22... { Numerical Analysis I (\NA1") Niall Madden November 22, ... tation of numerical methods that

Bisection 9 Solving nonlinear equations

2. Choose one of the sub-intervals [x0, x2] and [x2, x1]

where f change sign;

3. Repeat Steps 1–2 on that sub-interval, until f suffi-

ciently small at the end points of the interval.

This may be expressed more precisely using some

pseudocode.

Method 1.1.2 (Bisection).

Set eps to be the stopping criterion.
If |f(a)| ≤ eps, return a. Exit.
If |f(b)| ≤ eps, return b. Exit.
Set x0 = a and x1 = b.
Set xL = x0 and xR = x1.
Set k = 1.
while( |f(xk)| > eps )
    xk+1 = (xL + xR)/2;
    if ( f(xL)f(xk+1) < 0 )
        xR = xk+1;
    else
        xL = xk+1;
    end if;
    k = k + 1;
end while;
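The pseudocode above translates directly into code. Here is a sketch in Python (the labs use MATLAB), with the sign test written as f(xL)f(xk+1) < 0:

```python
def bisection(f, a, b, eps=1e-6):
    """Bisection: repeatedly halve an interval on which f changes sign,
    stopping once |f| at the current midpoint is at most eps."""
    if abs(f(a)) <= eps:
        return a
    if abs(f(b)) <= eps:
        return b
    xL, xR = a, b
    x = b
    while abs(f(x)) > eps:
        x = (xL + xR) / 2        # bisect the current bracket
        if f(xL) * f(x) < 0:     # sign change on [xL, x]: keep the left half
            xR = x
        else:                    # otherwise the root is in [x, xR]
            xL = x
    return x

# Example 1.1.3: estimate sqrt(2) by solving x^2 - 2 = 0 on [0, 2].
print(bisection(lambda x: x * x - 2, 0, 2))
```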

Example 1.1.3. Find an estimate for √2 that is correct to 6 decimal places.

Solution: Try to solve the equation f(x) := x² − 2 = 0 on the interval [0, 2]. Then proceed as shown in Figure 1.1 and Table 1.1.

Fig. 1.1: Solving x² − 2 = 0 with the Bisection Method (plot of f(x) = x² − 2, with the iterates x0 = a, x1 = b, x2 = 1 and x3 = 1.5 marked)

Note that at steps 4 and 10 in Table 1.1 the error actually increases, although the bound on the error is decreasing.

1.1.3 The bisection method works

The main advantages of the Bisection method are:

• It will always work.

• After k steps we know the error bound given in the following theorem.

Theorem 1.1.4. |τ − xk| ≤ (1/2)^(k−1) |b − a|, for k = 2, 3, 4, . . . .

k    xk         |xk − τ|    |xk − xk−1|
0    0.000000   1.41
1    2.000000   5.86e-01
2    1.000000   4.14e-01    1.00
3    1.500000   8.58e-02    5.00e-01
4    1.250000   1.64e-01    2.50e-01
5    1.375000   3.92e-02    1.25e-01
6    1.437500   2.33e-02    6.25e-02
7    1.406250   7.96e-03    3.12e-02
8    1.421875   7.66e-03    1.56e-02
9    1.414062   1.51e-04    7.81e-03
10   1.417969   3.76e-03    3.91e-03
...
22   1.414214   5.72e-07    9.54e-07

Table 1.1: Solving x² − 2 = 0 with the Bisection Method

Take notes:

A disadvantage of bisection is that it is not as efficient as some other methods we’ll investigate later.

1.1.4 Improving upon bisection

The bisection method is not very efficient. Our next goals will be to derive better methods, particularly the Secant Method and Newton’s Method. We also have to come up with some way of expressing what we mean by “better”; and we’ll have to use Taylor’s theorem in our analyses.

1.1.5 Exercises

Exercise 1.1. Does Proposition 1.1.1 mean that, if there is a solution to f(x) = 0 in [a, b], then f(a)f(b) ≤ 0? That is, is f(a)f(b) ≤ 0 a necessary condition for there being a solution to f(x) = 0? Give an example that supports your answer.

Exercise 1.2. Suppose we want to find τ ∈ [a, b] such that f(τ) = 0 for some given f, a and b. Write down an estimate for the number of iterations K required by the bisection method to ensure that, for a given ε, we know |xk − τ| ≤ ε for all k ≥ K. In particular, how does this estimate depend on f, a and b?

Exercise 1.3. How many (decimal) digits of accuracy are gained at each step of the bisection method? (If you prefer: how many steps are needed to gain a single (decimal) digit of accuracy?)

Exercise 1.4. Let f(x) = e^x − 2x − 2. Show that there is a solution to the problem: find τ ∈ [0, 2] such that f(τ) = 0.

Taking x0 = 0 and x1 = 2, use 6 steps of the bisection method to estimate τ. Give an upper bound for the error |τ − x6|. (You may use a computer program to do this.)


1.2 The Secant Method

1.2.1 Motivation

Looking back at Table 1.1 we notice that, at step 4 the

error increases rather decreases. You could argue that

this is because we didn’t take into account how close x3 is

to the true solution. We could improve upon the bisection

method as described below. The idea is, given xk−1 and

xk, take xk+1 to be the zero of the line intersects the

points(xk−1, f(xk−1)

)and

(xk, f(xk)

). See Figure 1.2.

Fig. 1.2: The Secant Method for Example 1.2.2 (two steps: the secant line through (x0, f(x0)) and (x1, f(x1)) gives x2; the secant line through (x1, f(x1)) and (x2, f(x2)) gives x3)

Method 1.2.1 (Secant).¹ Choose x0 and x1 so that there is a solution in [x0, x1]. Then define

xk+1 = xk − f(xk) (xk − xk−1)/(f(xk) − f(xk−1)).  (1.2.1)
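Formula (1.2.1) translates directly into code. A sketch in Python (the labs use MATLAB):

```python
def secant(f, x0, x1, steps=8):
    """Secant method (1.2.1): each new iterate is the zero of the line
    through (x_{k-1}, f(x_{k-1})) and (x_k, f(x_k))."""
    for _ in range(steps):
        x0, x1 = x1, x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
    return x1

# Example 1.2.2: solving x^2 - 2 = 0 with x0 = 0, x1 = 2.
print(secant(lambda x: x * x - 2, 0.0, 2.0))
```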

Example 1.2.2. Use the Secant Method to solve the nonlinear problem x² − 2 = 0 in [0, 2]. The results are shown in Table 1.2. By comparing Tables 1.1 and 1.2, we see that, for this example, the Secant Method is much more efficient than Bisection. We’ll return to why this is later.

The Method of Bisection could be written as the weighted average

xk+1 = (1 − σk)xk + σkxk−1, with σk = 1/2.

¹ The name comes from the name of the line that intersects a curve at two points. There is a related method called “false position” which was known in India in the 3rd century BC, and China in the 2nd century BC.

k   xk         |xk − τ|
0   0.000000   1.41
1   2.000000   5.86e-01
2   1.000000   4.14e-01
3   1.333333   8.09e-02
4   1.428571   1.44e-02
5   1.413793   4.20e-04
6   1.414211   2.12e-06
7   1.414214   3.16e-10
8   1.414214   4.44e-16

Table 1.2: Solving x² − 2 = 0 using the Secant Method

We can also think of the Secant Method as being a weighted average, but with σk chosen to obtain faster convergence to the true solution. Looking at Figure 1.2 above, you could say that we should choose σk depending on which is smaller: |f(xk−1)| or |f(xk)|. If (for example) |f(xk−1)| < |f(xk)|, then probably |τ − xk−1| < |τ − xk|. This gives another formulation of the Secant Method:

xk+1 = (1 − σk)xk + σkxk−1,  (1.2.2)

where

σk = f(xk)/(f(xk) − f(xk−1)).

When it is written in this form, it is sometimes called a relaxation method.

1.2.2 Order of Convergence

To compare different methods, we need the following concept:

Definition 1.2.3 (Linear Convergence). Suppose that τ = lim_{k→∞} xk. Then we say that the sequence {xk}_{k=0}^∞ converges to τ at least linearly if there is a sequence of positive numbers {εk}_{k=0}^∞, and µ ∈ (0, 1), such that

lim_{k→∞} εk = 0,  (1.2.3a)

and

|τ − xk| ≤ εk for k = 0, 1, 2, . . . ,  (1.2.3b)

and

lim_{k→∞} εk+1/εk = µ.  (1.2.3c)

So, for example, the bisection method converges at least linearly.

The reason for the expression “at least” is because we

usually can only show that a set of upper bounds for the

errors converges linearly. If (1.2.3b) can be strengthened

to the equality |τ − xk| = εk, then the sequence {xk}∞k=0 converges linearly (not just "at least" linearly).
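As a quick numerical illustration (my own sketch, using the bound |τ − xk| ≤ (1/2)^(k−1)|b − a| quoted in §1.1), the bisection error bounds behave exactly as Definition 1.2.3 requires, with µ = 1/2:

```python
# Upper bounds for the bisection error on an interval of width b - a = 2:
# eps_k = (b - a) / 2**(k-1), so eps_{k+1}/eps_k = 1/2 exactly.
b_minus_a = 2.0
eps = [b_minus_a / 2 ** (k - 1) for k in range(1, 12)]
ratios = [e1 / e0 for e0, e1 in zip(eps, eps[1:])]  # every ratio is 0.5
```

Here the limit (1.2.3c) is attained at every step, since the bounds form an exact geometric sequence.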

As we have seen, there are methods that converge

more quickly than bisection. We state this more precisely:

Page 11: NA1) - National University of Ireland, Galwayniall/MA385/MA385.pdf ·  · 2017-11-22... { Numerical Analysis I (\NA1") Niall Madden November 22, ... tation of numerical methods that

The Secant Method 11 Solving nonlinear equations

Definition 1.2.4 (Order of Convergence). Let τ = limk→∞ xk. Suppose there exists µ > 0 and a sequence of positive numbers {εk}∞k=0 such that (1.2.3a) and (1.2.3b) both hold. Then we say that the sequence {xk}∞k=0 converges with at least order q if

limk→∞ εk+1/(εk)^q = µ.

Two particular values of q are important to us:

(i) If q = 1, and we further have that 0 < µ < 1, then

the rate is linear.

(ii) If q = 2, the rate is quadratic for any µ > 0.

1.2.3 Analysis of the Secant Method

Our next goal is to prove that the Secant Method con-

verges. We’ll be a little lazy, and only prove a suboptimal

linear convergence rate. Then, in our MATLAB class,

we’ll investigate exactly how rapidly it really converges.

One simple mathematical tool that we use is the Mean Value Theorem (Theorem 0.2.1); see also [1, p. 420].

Theorem 1.2.5. Suppose that f and f′ are real-valued functions, continuous and defined in an interval I = [τ − h, τ + h] for some h > 0. If f(τ) = 0 and f′(τ) ≠ 0, then the sequence (1.2.1) converges at least linearly to τ.

Before we prove this, we note the following:

• We wish to show that |τ − xk+1| < |τ − xk|.

• From Theorem 0.2.1, there is a point wk ∈ [xk−1, xk] such that

(f(xk) − f(xk−1))/(xk − xk−1) = f′(wk). (1.2.4)

• Also by the MVT, there is a point zk ∈ [xk, τ] such that

(f(xk) − f(τ))/(xk − τ) = f(xk)/(xk − τ) = f′(zk). (1.2.5)

Therefore f(xk) = (xk − τ)f′(zk).

• Using (1.2.4) and (1.2.5), we can show that

τ − xk+1 = (τ − xk)(1 − f′(zk)/f′(wk)).

Therefore

|τ − xk+1|/|τ − xk| ≤ |1 − f′(zk)/f′(wk)|.

• Suppose that f′(τ) > 0. (If f′(τ) < 0, just tweak the arguments accordingly.) Saying that f′ is continuous in the region [τ − h, τ + h] means that, for any ε > 0 there is a δ > 0 such that

|f′(x) − f′(τ)| < ε for any x ∈ [τ − δ, τ + δ].

Take ε = f′(τ)/4. Then |f′(x) − f′(τ)| < f′(τ)/4. Thus

(3/4)f′(τ) ≤ f′(x) ≤ (5/4)f′(τ) for any x ∈ [τ − δ, τ + δ].

Then, so long as wk and zk are both in [τ − δ, τ + δ],

f′(zk)/f′(wk) ≤ 5/3.

Take notes:

(See also details in Section 1.2.5).

Given enough time and effort we could show that the Secant Method converges faster than linearly. In particular, the order of convergence is q = (1 + √5)/2 ≈ 1.618. This number arises as the only positive root of q² − q − 1 = 0. It is called the Golden Mean, and arises in many areas of Mathematics, including finding an explicit expression for the Fibonacci Sequence: f0 = 1, f1 = 1, fk+1 = fk + fk−1 for k = 1, 2, 3, . . . . That gives f0 = 1, f1 = 1, f2 = 2, f3 = 3, f4 = 5, f5 = 8, f6 = 13, . . . .

A rigorous proof depends on, among other things, an error bound for polynomial interpolation, which is the first topic in MA378. With that, one can show that εk+1 ≤ Cεkεk−1. Repeatedly using this we get:

• Let r = |x1 − x0|, so that ε0 ≤ r and ε1 ≤ r.

• Then ε2 ≤ Cε1ε0 ≤ Cr².

• Then ε3 ≤ Cε2ε1 ≤ C(Cr²)r = C²r³.

• Then ε4 ≤ Cε3ε2 ≤ C(C²r³)(Cr²) = C⁴r⁵.

• Then ε5 ≤ Cε4ε3 ≤ C(C⁴r⁵)(C²r³) = C⁷r⁸.

• And in general, εk ≤ C^(fk − 1) r^(fk), where the exponents follow the Fibonacci numbers.
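A small script (my own sketch) confirms this bookkeeping: each application of εk+1 ≤ Cεkεk−1 adds one to the sum of the two previous C-exponents, while the r-exponent obeys the Fibonacci recurrence itself, so the C-exponent at step k is fk − 1:

```python
# Track the exponents in eps_{k+1} <= C * eps_k * eps_{k-1}, starting
# from eps_0 <= r and eps_1 <= r (C-exponent 0, r-exponent 1).
fib = [1, 1]
c_exp, r_exp = [0, 0], [1, 1]
for k in range(1, 9):
    fib.append(fib[k] + fib[k - 1])
    c_exp.append(1 + c_exp[k] + c_exp[k - 1])  # one extra C per application
    r_exp.append(r_exp[k] + r_exp[k - 1])      # Fibonacci recurrence
```

The lists reproduce the pattern above: C-exponents 1, 2, 4, 7, . . . and r-exponents 2, 3, 5, 8, . . . .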

1.2.4 Exercises

Exercise 1.5. ? Suppose we define the Secant Method as follows.

Choose any two points x0 and x1. For k = 1, 2, . . . , set xk+1 to be the point where the line through (xk−1, f(xk−1)) and (xk, f(xk)) intersects the x-axis.

Show how to derive the formula for the secant method.


Exercise 1.6. ?

(i) Is it possible to construct a problem for which the

bisection method will work, but the secant method

will fail? If so, give an example.

(ii) Is it possible to construct a problem for which the

secant method will work, but bisection will fail? If

so, give an example.

1.2.5 Appendix (Proof of convergence of the secant method)

Here are the full details on the proof of the fact that

the Secant Method converges at least linearly (Theo-

rem 1.2.5). Before you read it, take care to review the

notes from that section, particularly (1.2.4) and (1.2.5).

Proof. The method is

xk+1 = xk − f(xk) (xk − xk−1)/(f(xk) − f(xk−1)).

We'll use this to derive an expression for the error at step k + 1 in terms of the error at step k. In particular,

τ − xk+1 = τ − xk + f(xk) (xk − xk−1)/(f(xk) − f(xk−1))
         = τ − xk + f(xk)/f′(wk)
         = τ − xk + (xk − τ) f′(zk)/f′(wk)
         = (τ − xk)(1 − f′(zk)/f′(wk)).

Therefore

|τ − xk+1|/|τ − xk| ≤ |1 − f′(zk)/f′(wk)|.

So it remains to be shown that

|1 − f′(zk)/f′(wk)| < 1.

Let's first assume that f′(τ) = α > 0. (If f′(τ) = α < 0 the following arguments still hold, just with a few small changes.) Because f′ is continuous in the region [τ − h, τ + h], for any given ε > 0 there is a δ > 0 such that |f′(x) − α| < ε for any x ∈ [τ − δ, τ + δ]. Take ε = α/4. Then |f′(x) − α| < α/4. Thus

(3/4)α ≤ f′(x) ≤ (5/4)α for any x ∈ [τ − δ, τ + δ].

Then, so long as wk and zk are both in [τ − δ, τ + δ],

f′(zk)/f′(wk) ≤ 5/3.

This gives

|τ − xk+1|/|τ − xk| ≤ 2/3,

which is what we needed.


1.3 Newton’s Method

1.3.1 Motivation

These notes are loosely based on Section 1.4 of [1] (i.e.,

Suli and Mayers, Introduction to Numerical Analysis).

See also, [3, Lecture 2], and [5, §3.5] The Secant method

is often written as

xk+1 = xk − f(xk)φ(xk, xk−1),

where the function φ is chosen so that xk+1 is the root

of the secant line joining the points(xk−1, f(xk−1)

)and(

xk, f(xk)). A related idea is to construct a method

xk+1 = xk−f(xk)λ(xk), where we choose λ so that xk+1

is the point where the tangent line to f at (xk, f(xk)) cuts

the x-axis. This is shown in Figure 1.3. We attempt to solve x² − 2 = 0, taking x0 = 2. Taking x1 to be the zero of the tangent to f(x) at x = 2, we get x1 = 1.5. Taking x2 to be the zero of the tangent to f(x) at x = 1.5, we get x2 = 1.4167, which is very close to the true solution of τ = 1.4142.

[Figure: two plots of f(x) = x² − 2. The first shows x0, the point (x0, f(x0)), the tangent line there, and its zero x1; the second zooms in to show x1, the point (x1, f(x1)), the tangent line there, and its zero x2.]

Fig. 1.3: Estimating √2 by solving x² − 2 = 0 using Newton's Method

Method 1.3.1 (Newton's Method²).

1. Choose any x0 in [a, b].

2. For k = 0, 1, . . . , set xk+1 to be the root of the line through xk with slope f′(xk).

By writing down the equation for the line at (xk, f(xk)) with slope f′(xk), one can show (see Exercise 1.7-(i)) that the formula for the iteration is

xk+1 = xk − f(xk)/f′(xk). (1.3.6)

²Sir Isaac Newton, 1643–1727, England. Easily one of the greatest scientists of all time. The method we are studying appeared in his celebrated Principia Mathematica in 1687, but it is believed he had used it as early as 1669.

Example 1.3.2. Use Newton's Method to solve the nonlinear problem x² − 2 = 0 in [0, 2]. The results are shown in Table 1.3. For this example, the method becomes

xk+1 = xk − f(xk)/f′(xk) = xk − (xk² − 2)/(2xk) = xk/2 + 1/xk.

k   xk         |xk − τ|   |xk − xk−1|
0   2.000000   5.86e-01
1   1.500000   8.58e-02   5.00e-01
2   1.416667   2.45e-03   8.33e-02
3   1.414216   2.12e-06   2.45e-03
4   1.414214   1.59e-12   2.12e-06
5   1.414214   2.34e-16   1.59e-12

Table 1.3: Solving x² − 2 = 0 using Newton's Method

By comparing Table 1.2 and Table 1.3, we see that for this example, Newton's method is more efficient again than the Secant method.
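As with the secant method, a Python sketch of iteration (1.3.6) is only a few lines (names are my own); applied to x² − 2 = 0 with x0 = 2 it reproduces the behaviour of Table 1.3:

```python
def newton(f, df, x, tol=1e-14, max_iter=20):
    """Newton iteration: x_{k+1} = x_k - f(x_k)/f'(x_k)."""
    for _ in range(max_iter):
        step = f(x) / df(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

root = newton(lambda x: x * x - 2, lambda x: 2 * x, 2.0)  # converges to sqrt(2)
```

Note that, unlike the secant method, Newton's method needs the derivative f′ supplied explicitly.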

Deriving Newton’s method geometrically certainly has

an intuitive appeal. However, to analyse the method, we

need a more abstract derivation based on a Truncated

Taylor Series.

Take notes:

1.3.2 Newton Error Formula

We saw in Table 1.3 that Newton’s method can be much

more efficient than, say, Bisection: it yields estimates

that converge far more quickly to τ. Bisection converges


(at least) linearly, whereas Newton’s converges quadrat-

ically, that is, with at least order q = 2.

In order to prove that this is so, we need to

1. write down a recursive formula for the error;

2. show that it converges;

3. then find the limit of |τ− xk+1|/|τ− xk|2.

Step 2 is usually the crucial part.

There are two parts to the proof. The first involves

deriving the so-called “Newton Error formula”. Then

we’ll apply this to prove (quadratic) convergence. In all

cases we’ll assume that the functions f, f ′ and f ′′ are

defined and continuous on the an interval Iδ = [τ−δ, τ+

δ] around the root τ. The proof we’ll do in class comes

directly from the above derivation (see also Epperson [5,

Thm 3.2]).

Theorem 1.3.3 (Newton Error Formula). If f(τ) = 0 and

xk+1 = xk − f(xk)/f′(xk),

then there is a point ηk between τ and xk such that

τ − xk+1 = −((τ − xk)²/2) · (f′′(ηk)/f′(xk)).

Take notes:

Example 1.3.4. As an application of Newton’s error for-

mula, we’ll show that the number of correct decimal dig-

its in the approximation doubles at each step.

Take notes:

1.3.3 Convergence of Newton’s Method

We’ll now complete our analysis of this section by proving

the convergence of Newton’s method.

Theorem 1.3.5. Let us suppose that f is a function such that

• f is continuous and real-valued, with continuous f′′, defined on some closed interval Iδ = [τ − δ, τ + δ],

• f(τ) = 0 and f′(τ) ≠ 0,

• there is some positive constant A such that

|f′′(x)|/|f′(y)| ≤ A for all x, y ∈ Iδ.

Let h = min{δ, 1/A}. If |τ − x0| ≤ h then Newton's Method converges quadratically.

Take notes:


1.3.4 Exercises

Exercise 1.7. ? Write down the equation of the line

that is tangential to the function f at the point xk. Give

an expression for its zero. Hence show how to derive

Newton’s method.

Exercise 1.8. (i) Is it possible to construct a problem for which the bisection method will work, but Newton's method will fail? If so, give an example.

(ii) Is it possible to construct a problem for which Newton's method will work, but bisection will fail? If so, give an example.

Exercise 1.9. (i) Write down Newton's Method as applied to the function f(x) = x³ − 2. Simplify the computation as much as possible. What is achieved if we find the root of this function?

(ii) Do three iterations by hand of Newton's Method applied to f(x) = x³ − 2 with x0 = 1.

Exercise 1.10. (This is taken from Exercise 3.5.1 of Epperson.) If f is such that |f′′(x)| ≤ 3 and |f′(x)| > 1 for all x, and if the initial error in Newton's Method is less than 1/2, give an upper bound for the error at each of the first 3 steps.

Exercise 1.11. Here is (yet) another scheme, called Steffensen's Method: Choose x0 ∈ [a, b] and set

xk+1 = xk − (f(xk))² / ( f(xk + f(xk)) − f(xk) ) for k = 0, 1, 2, . . . .

(a) ? Explain how this method relates to Newton’s Method.

(b) [Optional] Write a program, in MATLAB, or your

language of choice, to implement this method. Ver-

ify it works by using it to estimate the solution to

e^x = (2 − x)³ with x0 = 0. Submit your code and test harness as a Blackboard assignment. No credit is

available for this part, but feedback will be given on

your code. Also, it will help you prepare for the final

exam.
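For part (a), notice what plays the role of f′(xk) in Steffensen's formula. As a hedged sketch (Python rather than MATLAB, with my own names, so not a substitute for your own submission), the iteration applied to the suggested test problem might look like:

```python
import math

def steffensen(f, x, tol=1e-12, max_iter=50):
    """Steffensen iteration: x_{k+1} = x_k - f(x_k)^2 / (f(x_k + f(x_k)) - f(x_k))."""
    for _ in range(max_iter):
        fx = f(x)
        if fx == 0:                      # already exactly at a root
            break
        x_new = x - fx ** 2 / (f(x + fx) - fx)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Test problem from the exercise: exp(x) = (2 - x)^3, starting at x0 = 0
root = steffensen(lambda x: math.exp(x) - (2 - x) ** 3, 0.0)
```

The quotient (f(x + f(x)) − f(x))/f(x) is a difference approximation of f′(x), which is the connection to Newton's method asked for in part (a).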

Exercise 1.12. ? (This is Exercise 1.6 from Suli and Mayers.) The proof of the convergence of Newton's method given in Theorem 1.3.5 uses that f′(τ) ≠ 0. Suppose that it is the case that f′(τ) = 0.

(i) What can we say about the root, τ?

(ii) Starting from the Newton Error formula, show that

τ − xk+1 = ((τ − xk)/2) · (f′′(ηk)/f′′(µk)),

for some µk between τ and xk. (Hint: try using the MVT.)

(iii) What does the above error formula tell us about the convergence of Newton's method in this case?


1.4 Fixed Point Iteration

1.4.1 Introduction

Newton’s method can be considered to be a particular

instance of a very general approach called Fixed Point

Iteration or Simple Iteration.

The basic idea is:

If we want to solve f(x) = 0 in [a,b], find

a function g(x) such that, if τ is such that

f(τ) = 0, then g(τ) = τ.

Next, choose x0 and set xk+1 = g(xk) for

k = 0, 1, 2 . . . .

Example 1.4.1. Suppose that f(x) = e^x − 2x − 1 and

we are trying to find a solution to f(x) = 0 in [1, 2]. We

can reformulate this problem as

For g(x) = ln(2x + 1), find τ ∈ [1, 2] such

that g(τ) = τ.

If we take the initial estimate x0 = 1, then Simple Itera-

tion gives the following sequence of estimates.

k    xk       |τ − xk|
0    1.0000   2.564e-1
1    1.0986   1.578e-1
2    1.1623   9.415e-2
3    1.2013   5.509e-2
4    1.2246   3.187e-2
5    1.2381   1.831e-2
...
10   1.2558   6.310e-4

To make this table, I used a numerical scheme to solve the

problem quite accurately to get τ = 1.256431. (In gen-

eral we don’t know τ in advance–otherwise we wouldn’t

need such a scheme). I’ve given the quantities |τ − xk|

here so we can observe that the method is converging,

and get an idea of how quickly it is converging.
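In code, the whole scheme is a single loop. A Python sketch (my own naming) reproducing the iteration behind this table:

```python
import math

g = lambda x: math.log(2 * x + 1)  # fixed-point map for f(x) = exp(x) - 2x - 1
x = 1.0                            # x0
for _ in range(60):
    x = g(x)                       # x_{k+1} = g(x_k)
# x is now an accurate approximation of tau = 1.256431...
```

The slow but steady shrinkage of the error column above is typical of fixed point iteration when |g′(τ)| is well inside (0, 1).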

We have to be quite careful with this method: not every choice of g is suitable.

Suppose we want the solution to f(x) = x² − 2 = 0 in [1, 2]. We could choose g(x) = x² + x − 2. Taking x0 = 1 we get the iterations shown opposite.

k   xk
0    1
1    0
2   −2
3    0
4   −2
5    0
...

This sequence doesn't converge!

We need to refine the method to ensure that it will

converge. Before we do that in a formal way, consider

the following...

Example 1.4.2. Use the Mean Value Theorem to show

that the fixed point method xk+1 = g(xk) converges if

|g ′(x)| < 1 for all x near the fixed point.

Take notes:

This is an important example, mostly because it in-

troduces the “tricks” of using that g(τ) = τ and g(xk) =

xk+1. But it is not a rigorous theory. That requires some

ideas such as the contraction mapping theorem.

1.4.2 A short tour of fixed points and con-tractions

A variant of the famous Fixed Point Theorem³ is:

Suppose that g(x) is defined and continuous

on [a,b], and that g(x) ∈ [a,b] for all x ∈ [a,b]. Then there exists a point τ ∈ [a,b]

such that g(τ) = τ. That is, g(x) has a fixed

point in the interval [a,b].

Try to convince yourself that it is true, by sketching the

graphs of a few functions that send all points in the in-

terval, say, [1, 2] to that interval, as in Figure 1.4.

[Figure: sketch of a function g(x) that maps every point of [a, b] back into [a, b].]

Fig. 1.4: Sketch of a function g(x) such that, if a ≤ x ≤ b then a ≤ g(x) ≤ b

The next ingredient we need is to observe that g is a contraction. That is, g(x) is continuous and defined on [a, b] and there is a number L ∈ (0, 1) such that

|g(α) − g(β)| ≤ L|α − β| for all α, β ∈ [a, b]. (1.4.7)

³L. E. J. Brouwer, 1881–1966, Netherlands


Theorem 1.4.3 (Contraction Mapping Theorem). Suppose that the function g is real-valued, defined and continuous on [a, b], and that

(a) it maps every point in [a, b] to some point in [a, b];

(b) it is a contraction on [a, b].

Then

(i) g has a fixed point τ ∈ [a, b],

(ii) the fixed point is unique,

(iii) the sequence {xk}∞k=0 defined by x0 ∈ [a, b] and xk = g(xk−1) for k = 1, 2, . . . converges to τ.

Proof:

Take notes:

1.4.3 Convergence of Fixed Point Itera-tion

We now know how to apply the Fixed-Point Method and

to check if it will converge. Of course we can’t perform

an infinite number of iterations, and so the method will

yield only an approximate solution. Suppose we want the

solution to be accurate to say 10−6, how many steps are

needed? That is, how large must k be so that

|xk − τ| ≤ 10⁻⁶?

The answer is obtained by first showing that

|τ − xk| ≤ (L^k / (1 − L)) |x1 − x0|. (1.4.8)

Take notes:

Example 1.4.4. If g(x) = ln(2x + 1) and x0 = 1, and we want |xk − τ| ≤ 10⁻⁶, then we can use (1.4.8) to determine the number of iterations required.

Take notes:
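For instance (a sketch under the assumption, suggested by Exercise 1.14, that L = 2/3 serves as a contraction constant for g(x) = ln(2x + 1) on [1, 2]), one can evaluate the bound (1.4.8) directly and find the first k for which it drops below 10⁻⁶:

```python
import math

L = 2 / 3                      # assumed contraction constant on [1, 2]
x0, x1 = 1.0, math.log(3.0)    # x0 = 1 and x1 = g(x0) = ln(3)
k = 1
while (L ** k / (1 - L)) * abs(x1 - x0) > 1e-6:
    k += 1
# k is now the smallest value for which the bound (1.4.8) is below 1e-6
```

This gives k = 32, comfortably above the 23 iterations observed in practice: the bound is safe but not sharp.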

This calculation only gives an upper bound for the number of iterations. It is correct, but not necessarily sharp. In practice, one finds that 23 iterations are sufficient to ensure that the error is less than 10⁻⁶. Even so, 23 iterations is quite a lot for such a simple problem. So we can conclude that this method is not as fast as, say, Newton's Method. However, it is perhaps the most generalizable.

1.4.4 Knowing When to Stop

Suppose you wish to program one of the above methods. You will get your computer to repeat one of the iterative methods until your solution is sufficiently close to the true solution:

x[0] = 0
tol = 1e-6
i = 0
while (abs(tau - x[i]) > tol)   // This is the
                                // stopping criterion
    x[i+1] = g(x[i])            // Fixed point iteration
    i = i + 1
end

All very well, except you don’t know τ. If you did, you

wouldn’t need a numerical method. Instead, we could

choose the stopping criterion based on how close succes-

sive estimates are:

while (abs(x[i-1] - x[i]) > tol)

This is fine if the solution is not close to zero. E.g., if it's about 1, we should get roughly 6 accurate figures. But if τ = 10⁻⁷ then it is quite useless: xk could be ten times larger than τ. The problem is that we are estimating the absolute error.

Instead, we usually work with relative error:

while (abs((x[i-1] - x[i])/x[i]) > tol)
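Putting the pieces together, here is a Python sketch of fixed point iteration with a relative-error stopping criterion (names my own, applied to the g from Example 1.4.1):

```python
import math

g = lambda x: math.log(2 * x + 1)
tol = 1e-6
x_old, x = 1.0, math.log(3.0)      # x0 = 1 and x1 = g(x0)
while abs((x_old - x) / x) > tol:  # relative-error stopping criterion
    x_old, x = x, g(x)
# x approximates tau to roughly 6 significant figures
```

Dividing by x[i] is safe here because the iterates stay in [1, 2]; in general one must guard against the current iterate being zero.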


1.4.5 Exercises

Exercise 1.13. Is it possible for g to be a contraction

on [a,b] but not have a fixed point in [a,b]? Give an

example to support your answer.

Exercise 1.14. Show that g(x) = ln(2x + 1) is a con-

traction on [1, 2]. Give an estimate for L. (Hint: Use the

Mean Value Theorem).

Exercise 1.15. Consider the function g(x) = x²/4 + 5x/4 − 1/2.

(i) It has two fixed points – what are they?

(ii) For each of these, find the largest region around

them such that g is a contraction on that region.

Exercise 1.16. Although we didn't prove it in class, it turns out that, if the fixed point method given by

xk+1 = g(xk)

converges to the point τ (where g(τ) = τ), and

g′(τ) = g′′(τ) = · · · = g^(p−1)(τ) = 0,

then it converges with order p.

(i) Use a Taylor Series expansion to prove this.

(ii) We can think of Newton’s Method for the problem

f(x) = 0 as fixed point iteration with g(x) = x −

f(x)/f ′(x). Use this, and Part (i), to show that, if

Newton’s method converges, it does so with order

2, providing that f′(τ) ≠ 0.


1.5 LAB 1: the bisection and se-

cant methods

The goal of this section is to help you gain familiarity

with the fundamental tasks that can be accomplished

with MATLAB: defining vectors, computing functions,

and plotting. We’ll then see how to implement and anal-

yse the Bisection and Secant schemes in MATLAB.

You’ll find many good MATLAB references online. I

particularly recommend:

� Cleve Moler, Numerical Computing with MATLAB,

which you can access at http://uk.mathworks.

com/moler/chapters

� Tobin Driscoll, Learning MATLAB, which you can

access through the NUI Galway library portal.

MATLAB is an interactive environment for mathematical and scientific computing. It is the standard tool for numerical computing in industry and research.

MATLAB stands for Matrix Laboratory. It specialises

in matrix and vector computations, but includes func-

tions for graphics, numerical integration and differentia-

tion, solving differential equations, etc.

MATLAB differs most significantly from, say, Maple, by not having a facility for abstract computation.

1.5.1 The Basics

MATLAB is an interpretive environment – you type a command and it will execute it immediately.

The default data-type is a matrix of double precision

floating-point numbers. A scalar variable is an instance

of a 1× 1 matrix. To check this set,

>> t=10 and use >> size(t)

to find the numbers of rows and columns of t.

A vector may be declared as follows:

>> x = [1 2 3 4 5 6 7]

This generates a vector, x, with x1 = 1, x2 = 2, etc.

However, this could also be done with x=1:7

More generally, if we want to define a vector x =

(a,a+ h,a+ 2h, . . . ,b), we could use x = a:h:b;

For example

>> x=10:-2:0 gives x = (10, 8, 6, 4, 2, 0).

If h is omitted, it is assumed to be 1.

The ith element of a vector is accessed by typing x(i). The element in row i and column j of a matrix is given by A(i,j).

Most “scalar” functions return a matrix when given

a matrix as an argument. For example, if x is a vector of

length n, then y = sin(x) sets y to be a vector, also

of length n, with yi = sin(xi).

MATLAB has most of the standard mathematical

functions: sin, cos, exp, log, etc.

In each case, write the function name followed by the

argument in round brackets, e.g.,

>> exp(x) for e^x.

The * operator performs matrix multiplication. For

element-by-element multiplication use .*

For example,

y = x.*x sets yi = (xi)2.

So does y = x.^2. Similarly, y=1./x set yi = 1/xi.

If you put a semicolon at the end of a line of MAT-

LAB, the line is executed, but the output is not shown.

(This is useful if you are dealing with large vectors). If no

semicolon is used, the output is shown in the command

window.

1.5.2 Plotting functions

Define a vector

>> x=[0 1 2 3] and then set >> f = x.^2 -2

To plot these vectors use:

>> plot(x, f)

If the picture isn’t particularly impressive, then this might

be because Matlab is actually only printing the 4 points

that you defined. To make this more clear, use

>> plot(x, f, ’-o’)

This means to plot the vector f as a function of the vector

x, placing a circle at each point, and joining adjacent

points with a straight line.

Try instead: >> x=0:0.1:3 and f = x.^2 -2

and plot them again.

To define function in terms of any variable, type:

>> F = @(x)(x.^2 -2);

Now you can use this function as follows:

>> plot(x, F(x));

Take care to note that MATLAB is case sensitive.

In this last case, it might be helpful to also observe

where the function cuts the x-axis. That can be done

by also plotting the line joining, for example, the points

(0, 0), and (3, 0):

>> plot(x,F(x), [0,3], [0,0]);

Tip: Use the >> help menu to find out what the

ezplot function is, and how to use it.

1.5.3 Programming the Bisection Method

Revise the lecture notes on the Bisection Method.

Suppose we want to find a solution to e^x − (2 − x)³ = 0

in the interval [0, 5] using Bisection.

� Define the function f as:

>> f = @(x)(exp(x) - (2-x).^3);

� Taking x1 = 0 and x2 = 5, do 8 iterations of the

Bisection method.

� Complete the table below. You may use that the

solution is (approximately)

τ = 0.7261444658054950.


k   xk   |τ − xk|
1
2
3
4
5
6
7
8

Implementing the Bisection method by hand is very

tedious. Here is a program that will do it for you. You

don’t need to type it all in; you can download it from

www.maths.nuigalway.ie/MA385/lab1/Bisection.m

 3  clear; % Erase all stored variables
 4  fprintf('\n\n---------\n Using Bisection\n');
 5  % The function is
 6  f = @(x)(exp(x) - (2-x).^3);
 7  fprintf('Solving f=0 with the function\n');
 8  disp(f);
 9
10
11  tau = 0.72614446580549503614; % true solution
12  fprintf('The true solution is %12.8f\n', tau);
13
14  %% Our initial guesses are x_1 = 0 and x_2 = 5;
15  x(1)=0;
16  fprintf('%2d | %14.8e | %9.3e \n', ...
17     1, x(1), abs(tau - x(1)));
18  x(2)=5;
19  fprintf('%2d | %14.8e | %9.3e \n', ...
20     2, x(2), abs(tau - x(2)));
21  for k=2:8
22     x(k+1) = (x(k-1)+x(k))/2;
23     if ( f(x(k+1))*f(x(k-1)) < 0)
24        x(k)=x(k-1);
25     end
26     fprintf('%2d | %14.8e | %9.3e\n', ...
27        k+1, x(k+1), abs(tau - x(k+1)));
28  end

Read the code carefully. If there is a line you do not

understand, then ask a tutor, or look up the on-line help.

For example, find out what that clear on Line 3 does

by typing >> doc clear

Q1. Suppose we wanted an estimate xk for τ so that

|τ− xk| 6 10−10.

(i) In §1.1 we saw that |τ − xk| ≤ (1/2)^(k−1) |b − a|. Use this to estimate how many iterations are required in theory.

(ii) Use the program above to find how many iter-

ations are required in practice.

1.5.4 The Secant method

Recall the Secant Method in (1.2.1).

Q2 (a) Adapt the program above to implement the

secant method.

(b) Use it to find a solution to e^x − (2 − x)³ = 0

in the interval [0, 5].

(c) How many iterations are required to ensure

that the error is less than 10−10?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q3 Recall from Definition 1.2.4 that the order of convergence of a sequence {ε0, ε1, ε2, . . . } is q if

limk→∞ εk+1/(εk)^q = µ,

for some constant µ.

We would like to verify that q = (1 +√5)/2 ≈

1.618. This is difficult to do computationally be-

cause, after a relatively small number of iterations,

the round-off error becomes significant. But we

can still try!

Adapt the program above so that at each iteration it displays

|τ − xk+1| / |τ − xk|,   |τ − xk+1| / |τ − xk|^1.618,   |τ − xk+1| / |τ − xk|²,

and so deduce that the order of convergence is greater than 1 (so better than bisection), less than 2, and roughly (1 + √5)/2.

1.5.5 To Finish

Before you leave the class, upload your MATLAB code for Q3 (only) to "Lab 1" in the "Assignments and Labs" section of Blackboard. This file must include your name and ID number as comments. Ideally, it should incorporate your name or ID into the file name (e.g., Lab1_Donal_Duck.m). Include your answers to Q1 and Q2 as comments in that file.

1.5.6 Extra

The bisection method is popular because it is robust: it

will always work subject to minimal constraints. How-

ever, it is slow: if the Secant works, then it converges

much more quickly. How can we combine these two al-

gorithms to get a fast, robust method? Consider the

following problem:

Solve 1 − 2/(x² − 2x + 2) = 0 on [−10, 1].

You should find that the bisection method works (slowly)

for this problem, but the Secant method will fail. So write

a hybrid algorithm that switches between the bisection

method and the secant method as appropriate.

Take care to document your code carefully, to show

which algorithm is used when.

How many iterations are required?


1.6 What has Newton’s method ever

done for me?

When studying a numerical method (or indeed any piece of Mathematics) it is important to ask: "why are we doing this?". Sometimes it is so that you can understand other topics later. Sometimes it is because it is interesting/beautiful in its own right. Most commonly, it is because it is useful. Here are some instances of each of these:

1. The analyses we have used in this section allowed us

to consider some important ideas in a simple setting.

Examples include

� Convergence, including rates of convergence.

� Fixed-point theory, and contractions. We’ll be

seeing analogous ideas in the next section (Lips-

chitz conditions).

� The approximation of functions by polynomials

(Taylor’s Theorem). This point will reoccur in the

next section, and all throughout next semester.

2. Applications come from lots of areas of science

and engineering. Less obvious might be applications to

financial mathematics.

The celebrated Black-Scholes equation for pricing a put option can be written as

∂V/∂t − (1/2)σ²S² ∂²V/∂S² − rS ∂V/∂S + rV = 0,

where

• V(S, t) is the current value of the right (but not the obligation) to sell ("put") or buy ("call") an asset at a future time T;

� S is the current value of the underlying asset;

� r is the current interest rate (because the value of

the option has to be compared with what we would

have gained by investing the money we paid for it);

� σ is the volatility of the asset’s price.

Often one knows S, T and r, but not σ. The method of

implied volatility is when we take data from the market

and then find the value of σ which would give the data

as the solution to the Black-Scholes equation. This is a

nonlinear problem and so Newton’s method can be used.

See Chapters 13 and 14 of Higham’s “An Introduction

to Financial Option Valuation” for more details.

(We will return to the Black-Scholes problem again

at the end of the next section).

3. Some of these ideas are interesting and beautiful. Consider Newton's method. Suppose that we want to find the complex nth roots of unity: the set of numbers {z0, z1, z2, . . . , zn−1} whose nth power is 1. For example, the 4th roots of unity are 1, i, −1 and −i.

The nth roots of unity have a simple expression:

zk = e^(iθ) where θ = 2kπ/n,

for k ∈ {0, 1, 2, . . . , n − 1} and i = √−1. Plotted in the Argand Plane, these points form a regular polygon.

But pretend that we don’t have this formula, and

want to use Newton’s method to find a given root. We

could try to solve f(z) = 0 with f(z) = zn − 1. The

iteration is:

zk+1 = zk −(zk)

n − 1

n(zk)n−1.
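In complex floating-point arithmetic this iteration is a one-liner. A Python sketch (n = 5 and the starting point are arbitrary choices of mine, not from the notes):

```python
n = 5
z = complex(1.1, 0.2)              # a starting point near the root z = 1
for _ in range(50):
    z = z - (z ** n - 1) / (n * z ** (n - 1))
# z is now, to machine accuracy, one of the 5th roots of unity
```

Re-running this over a grid of starting points and colouring each by the root it reaches is exactly how the Julia set pictures below are produced.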

However, there are n possible solutions; given a particular starting point, which root will the method converge to?

If we take a number of points in a region of space, iterate

on each of them, and then colour the points to indicate

the ones that converge to the same root, we get the fa-

mous Julia 4 set, an example of a fractal. One such Julia

set, generated by the MATLAB script Julia.m, which you

can download from the course website, is shown below in

Figure 1.5.

Fig. 1.5: A contour plot of a Julia set with n = 5

⁴Gaston Julia, French mathematician, 1893–1978. The famous paper which introduced these ideas was published in 1918 when he was just 25. Interest later waned until the 1970s when Mandelbrot's computer experiments reinvigorated interest.

Page 22: NA1) - National University of Ireland, Galwayniall/MA385/MA385.pdf ·  · 2017-11-22... { Numerical Analysis I (\NA1") Niall Madden November 22, ... tation of numerical methods that

Chapter 2

Initial Value Problems

2.1 Introduction

The first part of this introduction is based on [5, Chap.

6]. The rest of the notes mostly follow [1, Chap. 12].

The growth of some tumours can be modelled as

R′(t) = −(1/3)SiR(t) + 2λσ/(µR + √(µ²R² + 4σ)),    (2.1.1)

subject to the initial condition R(t0) = a, where R is the

radius of the tumour at time t. Clearly, it would be use-

ful to know the value of R at certain times in the future.

Though it’s essentially impossible to solve for R exactly,

we can accurately estimate it. The equation in (2.1.1) is an example of an initial value differential equation or, simply, an initial value problem: we are given the solution at some initial time, and must solve a differential equation to get the solution at later times. In this section, we’ll study techniques for approximating solutions to such problems.

Initial Value Problems (IVPs) are differential equations of the form: Find y(t) such that

dy/dt = f(t, y) for t > t0, with y(t0) = y0.    (2.1.2)

Here y′ = f(t, y) is the differential equation and y(t0) = y0 is the initial value.

Some IVPs are easy to solve. For example:

y ′ = t2 with y(1) = 1.

Just integrate the differential equation to get that

y(t) = t³/3 + C,

and use the initial value to find the constant of integration. This gives the solution y(t) = (t³ + 2)/3. However, most problems are much harder, and some don’t have solutions at all.

In many cases, it is possible to determine that a given

problem does indeed have a solution, even if we can’t

write it down. The idea is that the function f should be

“Lipschitz”, a notion closely related to that of a contrac-

tion (1.4.7).

Definition 2.1.1. A function f satisfies a Lipschitz1 Condition (with respect to its second argument) in the rectangular region D if there is a positive real number L such that

|f(t, u) − f(t, v)| ≤ L|u − v|,    (2.1.3)

for all (t,u) ∈ D and (t, v) ∈ D.

Example 2.1.2. For each of the following functions f, show that it satisfies a Lipschitz condition, and give an upper bound on the Lipschitz constant L.

(i) f(t, y) = y/(1 + t)² for 0 ≤ t < ∞.

(ii) f(t, y) = 4y − e^{−t} for all t.

(iii) f(t, y) = −(1 + t²)y + sin(t) for 1 ≤ t ≤ 2.

Take notes:
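A numerical spot-check of (2.1.3) can complement the pen-and-paper work. The Python sketch below (the helper name lipschitz_ratio is made up) samples the ratio |f(t, u) − f(t, v)|/|u − v| for case (i), where L = 1 works because 1/(1 + t)² ≤ 1 for t ≥ 0:

```python
def lipschitz_ratio(f, t_samples, y_samples):
    """Largest observed |f(t,u) - f(t,v)| / |u - v|: a lower bound on L."""
    worst = 0.0
    for t in t_samples:
        for u in y_samples:
            for v in y_samples:
                if u != v:
                    worst = max(worst, abs(f(t, u) - f(t, v)) / abs(u - v))
    return worst

# Example 2.1.2 (i): f(t, y) = y/(1 + t)^2 on t >= 0, where L = 1 works.
f = lambda t, y: y / (1 + t) ** 2
ts = [i * 0.1 for i in range(50)]          # t in [0, 4.9]
ys = [-2 + i * 0.2 for i in range(21)]     # y in [-2, 2]
ratio = lipschitz_ratio(f, ts, ys)         # equals 1, attained at t = 0
```

The sampled ratio never exceeds 1, consistent with L = 1 being a valid Lipschitz constant for this f.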

The reason we are interested in functions satisfying

Lipschitz conditions is as follows:

1 Rudolf Otto Sigismund Lipschitz, Germany, 1832–1903. Made many important contributions to science in areas that include differential equations, number theory, Fourier series, celestial mechanics, and analytic mechanics.


Proposition 2.1.3 (Picard’s2). Suppose that the real-valued function f(t, y) is continuous for t ∈ [t0, tM] and y ∈ [y0 − C, y0 + C]; that |f(t, y0)| ≤ K for t0 ≤ t ≤ tM; and that f satisfies the Lipschitz condition (2.1.3). If

C ≥ (K/L)(e^{L(tM−t0)} − 1),

then (2.1.2) has a unique solution on [t0, tM]. Further,

|y(t) − y(t0)| ≤ C for t0 ≤ t ≤ tM.

You are not required to know this theorem for this course.

However, it’s important to be able to determine when a

given f satisfies a Lipschitz condition.

2.1.1 Exercises

Exercise 2.1. For the following functions show that they

satisfy a Lipschitz condition on the corresponding do-

main, and give an upper-bound for L:

(i) f(t, y) = 2yt^{−4} for t ∈ [1, ∞),

(ii) f(t, y) = 1 + t sin(ty) for 0 ≤ t ≤ 2.

Exercise 2.2. Many textbooks, instead of giving the version of the Lipschitz condition we use, give the following: there is a finite, positive, real number L such that

|∂f/∂y (t, y)| ≤ L for all (t, y) ∈ D.

Is this statement stronger than (i.e., more restrictive than), equivalent to, or weaker than (i.e., less restrictive than) the usual Lipschitz condition? Justify your answer.

2Charles Emile Picard, France, 1856–1941. He made impor-

tant discoveries in the fields of analysis, function theory, differen-

tial equations and geometry. He supervised the Ph.D. of Padraig

de Brun, Professor of Mathematics (and President) of University

College Galway/NUI Galway.


2.2 Euler’s method

2.2.1 The idea

Classical numerical methods for IVPs attempt to generate approximate solutions at a finite set of discrete points t0 < t1 < t2 < · · · < tn. The simplest is Euler’s Method,3 which may be motivated as follows.

Suppose we know y(ti), and want to find y(ti+1). From the differential equation we know the slope of the tangent to y at ti. So, if this is similar to the slope of the line joining (ti, y(ti)) and (ti+1, y(ti+1)):

y′(ti) = f(ti, y(ti)) ≈ (yi+1 − yi)/(ti+1 − ti).

[Figure: the tangent to y(t) at t = ti, and the secant line joining the points on y(t) at ti and ti+1, i.e. (ti, yi) and (ti+1, yi+1); the step from ti to ti+1 has width h.]

2.2.2 The formula

Method 2.2.1 (Euler’s Method). Choose equally spaced points t0, t1, . . . , tn so that

ti − ti−1 = h = (tn − t0)/n for i = 1, . . . , n.

We call h the “time step”. Let yi denote the approximation for y(t) at t = ti. Set

yi+1 = yi + hf(ti, yi), i = 0, 1, . . . , n − 1.    (2.2.1)
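A minimal implementation of (2.2.1), sketched in Python rather than the MATLAB used in the labs:

```python
def euler(f, t0, tn, y0, n):
    """Euler's method (2.2.1): n equal steps of size h = (tn - t0)/n."""
    h = (tn - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t, y)   # y_{i+1} = y_i + h f(t_i, y_i)
        t = t + h
    return y

# Example 2.2.2 below: y' = y/(1 + t^2), y(0) = 1, h = 1 (n = 4 steps to t = 4)
approx = euler(lambda t, y: y / (1 + t ** 2), 0.0, 4.0, 1.0, 4)  # gives 3.96
```

With n = 1 (so h = 4) the same call returns 5.0, the one-step value discussed in the example.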

2.2.3 An example

Example 2.2.2. Taking h = 1, estimate y(4) where

y′(t) = y/(1 + t²), y(0) = 1.    (2.2.2)

The true solution to this is y(t) = e^{arctan(t)}.

Take notes:

3 Leonhard Euler, 1707–1783, Switzerland. One of the greatest mathematicians of all time, he made vital contributions to geometry, calculus, number theory, finite differences, special functions, differential equations, continuum mechanics, astronomy, elasticity, acoustics, light, hydraulics, music, cartography, and much more.

If we had chosen h = 4 we would have required only one step: yn = y0 + 4f(t0, y0) = 5. However, this would not be very accurate. With a little work one can show that the solution to this problem is y(t) = e^{tan⁻¹(t)}, and so y(4) = 3.7652. Hence the computed solution with h = 1 is much more accurate than the computed solution with h = 4. This is also demonstrated in Figure 2.1 below, and in Table 2.1, where we see that the error seems to be proportional to h.

Fig. 2.1: Euler’s method for Example 2.2.2 with h = 4,

h = 2, h = 1 and h = 1/2

n     h     yn      |y(tn) − yn|
1     4     5.0     1.235
2     2     4.2     0.435
4     1     3.960   0.195
8     1/2   3.881   0.115
16    1/4   3.831   0.065
32    1/8   3.800   0.035

Table 2.1: Error in Euler’s method for Example 2.2.2
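The halving pattern in Table 2.1 can be reproduced numerically; this Python sketch compares Euler approximations of y(4) against the exact value e^{arctan 4}:

```python
import math

def euler(f, t0, tn, y0, n):
    # Euler's method (2.2.1) with n equal steps
    h = (tn - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y, t = y + h * f(t, y), t + h
    return y

f = lambda t, y: y / (1 + t ** 2)
exact = math.exp(math.atan(4.0))        # y(4) = e^{arctan 4} = 3.7652...
errors = [abs(exact - euler(f, 0.0, 4.0, 1.0, n)) for n in (4, 8, 16, 32)]
ratios = [errors[i] / errors[i + 1] for i in range(3)]  # each ratio near 2
```

Each halving of h roughly halves the error, as the table suggests.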


2.2.4 Exercises

Exercise 2.3. As a special case in which the error of

Euler’s method can be analysed directly, consider Euler’s

method applied to

y ′(t) = y(t), y(0) = 1.

The true solution is y(t) = et.

(i) Show that the solution to Euler’s method can be written as

yi = (1 + h)^{ti/h}, i ≥ 0.

(ii) Show that

lim_{h→0} (1 + h)^{1/h} = e.

This then shows that, if we denote by yn(T) the approximation for y(T) obtained using Euler’s method with n intervals between t0 and T, then

lim_{n→∞} yn(T) = e^T.

Hint: Let w = (1 + h)^{1/h}, so that

log w = (1/h) log(1 + h).

Now use l’Hospital’s rule to find lim_{h→0} w.
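Before attempting the proof, both claims are easy to check numerically; a Python sketch:

```python
import math

# Euler for y' = y, y(0) = 1: each step multiplies y by (1 + h),
# so after i steps y_i = (1 + h)^i = (1 + h)^{t_i / h}.
def euler_exp(T, n):
    h = T / n
    y = 1.0
    for _ in range(n):
        y = y + h * y
    return y

vals = [euler_exp(1.0, n) for n in (10, 100, 1000, 10000)]
# vals climb towards e = 2.71828..., consistent with part (ii)
```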


2.3 Error Analysis

2.3.1 General one-step methods

Euler’s method is an example of a one-step method. These have the general form:

yi+1 = yi + hΦ(ti, yi; h).    (2.3.1)

To get Euler’s method, just take Φ(ti,yi;h) = f(ti,yi).

In the introduction, we motivated Euler’s method

with a geometrical argument. An alternative, more math-

ematical way of deriving Euler’s Method is to use a Trun-

cated Taylor Series.

Take notes:

This again motivates formula (2.2.1), and also suggests that at each step the method introduces a (local) error of h²y″(η)/2. (More on this later.)

2.3.2 Two types of error

We’ll now give an error analysis for general one-step

methods, and then look at Euler’s Method as a specific

example. First, some definitions.

Definition 2.3.1. Global Error : Ei = y(ti) − yi.

Definition 2.3.2. Truncation Error:

Ti := (y(ti+1) − y(ti))/h − Φ(ti, y(ti); h).    (2.3.2)

It can be helpful to think of Ti as representing how

much the difference equation (2.2.1) differs from the dif-

ferential equation. We can also determine the truncation

error for Euler’s method directly from a Taylor Series.

Take notes:

The relationship between the global error and trun-

cation errors is explained in the following (important!)

result, which in turn is closely related to Theorem 2.1.3:

Theorem 2.3.3 (Thm 12.2 in Suli & Mayers). Let Φ() be Lipschitz with constant L. Then

|En| ≤ T(e^{L(tn−t0)} − 1)/L,    (2.3.3)

where T = max_{i=0,1,...,n} |Ti|.

(Part of the following proof uses the fact that, if |Ei+1| ≤ |Ei|(1 + hL) + h|Ti|, then

|Ei| ≤ (T/L)[(1 + hL)^i − 1], i = 0, 1, . . . , N.

Showing that this is indeed the case is an exercise.)

Take notes:

2.3.3 Analysis of Euler’s method

For Euler’s method, we get

T = max_{0≤j≤n} |Tj| ≤ (h/2) max_{t0≤t≤tn} |y″(t)|.

Example 2.3.4. Given the problem:

y′ = 1 + t + y/t for t > 1; y(1) = 1,

find an approximation for y(2).

(i) Give an upper bound for the global error taking n = 4 (i.e., h = 1/4).

(ii) What n should you take to ensure that the global error is no more than 0.1?

To answer these questions we need to use (2.3.3),

which requires that we find L and an upper bound for T .

In this instance, L is easy:


Take notes:

(This is a particularly easy example. Often we need to

employ the mean value theorem. See [1, Eg 12.2].)

To find T we need an upper bound for |y ′′(t)| on

[1, 2], even though we don’t know y(t). However, we do

know y ′(t)....

Take notes:

With these values of L and T, using (2.3.3) we find En ≤ 0.644. In fact, the true answer is 0.43, so we see that (2.3.3) is somewhat pessimistic.

To answer (ii): what n should you take to ensure that the global error is no more than 0.1? (We should get n = 26. This is not that sharp: n = 19 will do.)
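The arithmetic behind these numbers can be checked in a few lines. The sketch below assumes the in-class values L = 1 (since |∂f/∂y| = 1/t ≤ 1 on [1, 2]) and max|y″| = 3 (since y″(t) = 2 + 1/t on [1, 2]):

```python
import math

L = 1.0   # |df/dy| = 1/t <= 1 on [1, 2]
M = 3.0   # |y''(t)| = 2 + 1/t <= 3 on [1, 2]

def error_bound(h):
    """Bound (2.3.3) with T <= h*M/2 and t_n - t_0 = 1."""
    T = h * M / 2
    return (T / L) * (math.exp(L * 1.0) - 1)

bound = error_bound(0.25)   # about 0.644, as in part (i)
n = next(k for k in range(1, 1000) if error_bound(1.0 / k) <= 0.1)  # n = 26
```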

Take notes:

2.3.4 Convergence and Consistency

We are often interested in the convergence of a method.

That is, is it true that

limh→0

yn = y(tn)?

Or equivalently that,

limh→0

En = 0?

Given that the global error for Euler’s method can be bounded:

|En| ≤ (h max|y″(t)|/(2L))(e^{L(tn−t0)} − 1) = hK,    (2.3.4)

we can say it converges.

Definition 2.3.5. The order of accuracy of a numerical method is p if there is a constant K so that

|En| ≤ Kh^p.

(The term order of convergence is often used instead of order of accuracy.)

In light of this definition, we read from (2.3.4) that Euler’s method is first-order.

One of the requirements for convergence is Consis-

tency :

Definition 2.3.6. A one-step method yn+1 = yn + hΦ(tn, yn; h) is consistent with the differential equation y′(t) = f(t, y(t)) if f(t, y) ≡ Φ(t, y; 0).

It is quite trivial to see that Euler’s method is consistent. In the next section, we’ll try to develop methods that are of higher order than Euler’s method. That is, we will study methods for which one can show

|En| ≤ Kh^p for some p > 1.

For these, showing consistency is a little (but only a little) more work.

2.3.5 Exercises

Exercise 2.4. An important step in the proof of Theorem 2.3.3, one we didn’t do in class, requires the observation that if |Ei+1| ≤ |Ei|(1 + hL) + h|Ti|, then

|Ei| ≤ (T/L)[(1 + hL)^i − 1], i = 0, 1, . . . , N.

Use induction to show that this is indeed the case.

Exercise 2.5. Suppose we use Euler’s method to find an

approximation for y(2), where y solves

y(1) = 1, y ′ = (t− 1) sin(y).

(i) Give an upper bound for the global error taking n = 4 (i.e., h = 1/4).

(ii) What n should you take to ensure that the global error is no more than 10^{−3}?


2.4 Runge-Kutta 2 (RK2)

The goal of this section is to develop some techniques

to help us derive our own methods for accurately solving

Initial Value Problems. Rather than using formal theory,

the approach will be based on carefully chosen examples.

As motivation, suppose we numerically solve some differential equation and estimate the error. If we think this error is too large, we could redo the calculation with a smaller value of h. Or we could use a better method, where the error is proportional to h², or h³, etc. These higher-order “Runge–Kutta” methods rely on evaluating f(t, y) a number of times at each step in order to improve accuracy.

First, in Section 2.4.1, we’ll motivate one such method.

Then, in 2.4.2, we’ll look at the general framework.

2.4.1 Modified Euler Method

Recall the motivation for Euler’s method from §2.2. We

can do something similar to derive what’s often called the

Modified Euler’s method, or, less commonly, the Mid-

point Euler’s method.

In Euler’s method, we use the slope of the tangent to

y at ti as an approximation for the slope of the secant

line joining the points (ti,y(ti)) and (ti+1,y(ti+1)).

One could argue, given the diagram below, that the slope of the tangent to y at t = (ti + ti+1)/2 = ti + h/2 would be a better approximation. This would give

y(ti+1) ≈ yi + hf(ti + h/2, y(ti + h/2)).    (2.4.1)

However, we don’t know y(ti + h/2), but we can approximate it using Euler’s Method: y(ti + h/2) ≈ yi + (h/2)f(ti, yi). Substituting this into (2.4.1) gives

yi+1 = yi + hf(ti + h/2, yi + (h/2)f(ti, yi)).    (2.4.2)
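Formula (2.4.2) is easy to implement; the Python sketch below applies it to Example 2.4.1 below and reproduces the first two Modified-method errors:

```python
import math

def modified_euler(f, t0, tn, y0, n):
    """Modified Euler (2.4.2): evaluate f at the midpoint of each step."""
    h = (tn - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t + h / 2, y + (h / 2) * f(t, y))
        t = t + h
    return y

# Example 2.4.1: y(0) = 1, y' = y*log(1 + t^2); exact y(1) = 2 e^{-2 + pi/2}
f = lambda t, y: y * math.log(1 + t ** 2)
exact = 2.0 * math.exp(-2.0 + math.pi / 2)
e1 = abs(exact - modified_euler(f, 0.0, 1.0, 1.0, 1))   # about 7.89e-02
e2 = abs(exact - modified_euler(f, 0.0, 1.0, 1.0, 2))   # about 2.90e-02
```

Doubling n cuts the error by a factor close to 4, the second-order behaviour reported in Table 2.2.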

[Figure: y(t) with the secant line joining the points on y(t) at ti and ti+1, and the tangent to y(t) at t = ti + h/2; the step from ti to ti+1 has width h.]

Example 2.4.1. Use the Modified Euler Method to approximate y(1) where

y(0) = 1, y′(t) = y log(1 + t²).

This has the solution y(t) = (1 + t²)^t exp(−2t + 2 tan⁻¹ t). In Table 2.2, and also Figure 2.2, we compare the error in the solution to this problem using Euler’s Method (left) and the Modified Euler’s Method (right) for various values of n.

Table 2.2: Errors in solutions to Example 2.4.1 using Euler’s and Modified Euler’s Methods

             Euler                   Modified
n       En         En/En−1     En         En/En−1
1       3.02e-01               7.89e-02
2       1.90e-01   1.59        2.90e-02   2.72
4       1.11e-01   1.72        8.20e-03   3.54
8       6.02e-02   1.84        2.16e-03   3.79
16      3.14e-02   1.91        5.55e-04   3.90
32      1.61e-02   1.95        1.40e-04   3.95
64      8.13e-03   1.98        3.53e-05   3.98
128     4.09e-03   1.99        8.84e-06   3.99

Clearly we get a much more accurate result using the

Modified Euler Method. Even more importantly, we get

a higher order of accuracy : if we reduce h by a factor

of two, the error in the Modified method is reduced by a

factor of four.

[Fig. 2.2: Log-log plot of the errors (“Euler error” and “RK2 error”) when Euler’s and Modified Euler’s methods are used to solve Example 2.4.1.]

2.4.2 General RK2

The “Modified Euler Method” we have just studied is an example of one of the (large) family of 2nd-order Runge–Kutta (RK2) methods. Recalling that one-step methods are written as

yi+1 = yi + hΦ(ti, yi; h),

the general RK2 method is

k1 = f(ti, yi),
k2 = f(ti + αh, yi + βhk1),
Φ(ti, yi; h) = ak1 + bk2.    (2.4.3)

If we choose a, b, α and β in the right way, then the error for the method will be bounded by Kh², for some constant K.


An (uninteresting) example of such a method is to take a = 1 and b = 0: then it reduces to Euler’s Method. If we choose α = β = 1/2, a = 0, b = 1, we get the “Modified” method above.

Our aim now is to deduce general rules for choosing a,

b, α and β. We’ll see that if we pick any one of these four

parameters, then the requirement that the method be

consistent and second-order determines the other three.

2.4.3 Using consistency

By demanding that RK2 be consistent we get that a + b = 1.

Take notes:

2.4.4 Ensuring that RK2 is 2nd-order

Next we need to know how to choose α and β. The

formal way is to use a two-dimensional Taylor series ex-

pansion. It is quite technical, and not suitable for doing

in class. Detailed notes on it are given in Section 2.4.6

below. Instead we’ll take a less rigorous, heuristic ap-

proach.

Because we expect that, for a second-order accurate method, |En| ≤ Kh² where K depends on y‴(t), if we choose a problem for which y‴(t) ≡ 0, we expect no error...

Take notes:

In the above example, the right-hand side of the dif-

ferential equation, f(t,y), depended only on t. Now we’ll

try the same trick: using a problem with a simple known

solution (and zero error), but for which f depends explic-

itly on y.

Consider the DE y(1) = 1, y′(t) = y(t)/t. It has a simple solution: y(t) = t. We now use the fact that any RK2 method should be exact for this problem to deduce that α = β.

Take notes:

Now we collect the above results together and show that the second-order Runge–Kutta (RK2) methods are:

yi+1 = yi + h(ak1 + bk2),
k1 = f(ti, yi), k2 = f(ti + αh, yi + βhk1),

where we choose any b ≠ 0 and then set

a = 1 − b, α = 1/(2b), β = α.

It is easy to verify that the Modified method satisfies

these criteria.
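The check can also be automated. The Python sketch below implements the general RK2 family (choose b, then a = 1 − b and α = β = 1/(2b)) and confirms it is exact for y(1) = 1, y′ = y/t, whose solution is y(t) = t:

```python
def rk2(f, t0, tn, y0, n, b=0.75):
    """General RK2: pick any b != 0, then a = 1 - b, alpha = beta = 1/(2b)."""
    a, alpha = 1.0 - b, 1.0 / (2.0 * b)
    h = (tn - t0) / n
    t, y = t0, y0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + alpha * h, y + alpha * h * k1)   # beta = alpha
        y = y + h * (a * k1 + b * k2)
        t = t + h
    return y

# y(1) = 1, y'(t) = y/t has solution y(t) = t; any valid RK2 is exact here.
approx = rk2(lambda t, y: y / t, 1.0, 2.0, 1.0, 4)
```

Changing b (e.g. b = 1 for the Modified method) leaves the result exact, illustrating why consistency and exactness on this problem pin down only three of the four parameters.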

2.4.5 Exercises

Exercise 2.6. A popular RK2 method, called the Im-

proved Euler Method, is obtained by choosing α = 1.

(i) Use the Improved Euler Method to find an approximation for y(4) when

y(0) = 1, y′ = y/(1 + t²),

taking n = 2. (If you wish, use MATLAB.)

(ii) Using a diagram similar to the one in Figure 2.2

for the Modified Euler Method, justify the assertion

that the Improved Euler Method is more accurate

than the basic Euler Method.


(iii) Show that the method is consistent.

(iv) Write out what this method would be for the prob-

lem: y ′(t) = λy for a constant λ. How does this re-

late to the Taylor series expansion for y(ti+1) about

the point ti?

Exercise 2.7. In his seminal paper of 1901, Carl Runge gave the following example of what we now call a Runge–Kutta 2 method:

yi+1 = yi + (h/4)[f(ti, yi) + 3f(ti + (2/3)h, yi + (2/3)hf(ti, yi))].

(i) Show that it is consistent.

(ii) Show how this method fits into the general frame-

work of RK2 methods. That is, what are a, b, α,

and β? Do they satisfy the following conditions?

β = α, b = 1/(2α), a = 1 − b.    (2.4.4)

(iii) Use it to estimate the solution at the point t = 2

to y(1) = 1, y ′ = 1 + t + y/t taking n = 2 time

steps.

2.4.6 Formal Derivation of RK2

This section is based on [1, p422]. We won’t actually cover this in class, instead opting to deduce the same result in an easier, but less rigorous, way in Section 2.4.4. If we were to do it properly, this is how we would do it.

We will require that the Truncation Error (2.3.2) be second-order. More precisely, we want to be able to say that

|Tn| = |(y(tn+1) − y(tn))/h − Φ(tn, y(tn); h)| ≤ Ch²,

where C is some constant that does not depend on n or h.

So the problem is to find expressions (using Taylor series) for both (y(tn+1) − y(tn))/h and Φ(tn, y(tn); h) that only have O(h²) remainders. To do this we need to recall two ideas from 2nd year calculus:

• To differentiate a function f(a(t), b(t)) with respect to t:

df/dt = (∂f/∂a)(da/dt) + (∂f/∂b)(db/dt);

• The Taylor Series for a function of 2 variables, truncated at the 2nd term, is:

f(x + η1, y + η2) = f(x, y) + η1 ∂f/∂x (x, y) + η2 ∂f/∂y (x, y) + C(max{η1, η2})²,

for some constant C. See [1, p422] for details.

for some constant C. See [1, p422] for details.

To get an expression for y(tn+1) − y(tn), use a Taylor Series:

y(tn+1) = y(tn) + hy′(tn) + (h²/2)y″(tn) + O(h³)
= y(tn) + hf(tn, y(tn)) + (h²/2)(f(tn, y(tn)))′ + O(h³)
= y(tn) + hf(tn, y(tn)) + (h²/2)(∂f/∂t(tn, y(tn)) + y′(tn) ∂f/∂y(tn, y(tn))) + O(h³)
= y(tn) + hf(tn, y(tn)) + (h²/2)[∂f/∂t + f ∂f/∂y](tn, y(tn)) + O(h³).

This gives

(y(tn+1) − y(tn))/h = f(tn, y(tn)) + (h/2)[∂f/∂t + f ∂f/∂y](tn, y(tn)) + O(h²).    (2.4.5)

Next we expand the expression f(ti + αh, yi + βhf(ti, yi)) using a (two dimensional) Taylor Series:

f(tn + αh, yn + βhf(tn, yn)) = f(tn, yn) + hα ∂f/∂t(tn, yn) + hβf(tn, yn) ∂f/∂y(tn, yn) + O(h²).

This leads to the following expansion for Φ(tn, y(tn); h):

Φ(tn, y(tn); h) = (a + b)f(tn, y(tn)) + h[bα ∂f/∂t + bβ f ∂f/∂y](tn, y(tn)) + O(h²).    (2.4.6)

So now, if we subtract (2.4.6) from (2.4.5) and leave only terms of O(h²), we have to choose a + b = 1 and bα = bβ = 1/2. That is, choose α and let

β = α, b = 1/(2α), a = 1 − b.    (2.4.7)

(For a more detailed exposition, see [1, Chap 12].)
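The conclusion can be confirmed numerically: with the choices (2.4.7), the truncation error Tn is O(h²). For y′ = y at t = 0 (where y(t) = e^t), Tn/h² tends to 1/6 as h → 0. A Python sketch, where phi is a hypothetical helper evaluating Φ(t, y; h):

```python
import math

def phi(t, y, h, f, b=0.5):
    """Phi for an RK2 satisfying (2.4.7): a = 1 - b, alpha = beta = 1/(2b)."""
    a, alpha = 1.0 - b, 1.0 / (2.0 * b)
    k1 = f(t, y)
    k2 = f(t + alpha * h, y + alpha * h * k1)
    return a * k1 + b * k2

f = lambda t, y: y                                    # y' = y, y(t) = e^t
T = lambda h: (math.exp(h) - 1.0) / h - phi(0.0, 1.0, h, f)
ratios = [T(h) / h ** 2 for h in (0.1, 0.05, 0.025)]  # each close to 1/6
```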


2.5 LAB 2: Euler’s Method

In this session you’ll develop your knowledge of MATLAB by using it to implement Euler’s Method for IVPs, and study its order of accuracy.

You should be able to complete at least Section 2.5.8

today. At the end of the class, verify your participa-

tion by uploading your results to Blackboard (go to

“Assignments and Labs” and then “Lab 2”). In Lab 3,

you’ll implement higher-order schemes.

2.5.1 Four ways to define a vector

We know that the most fundamental object in MATLAB is a matrix. The simplest (nontrivial) example of a matrix is a vector. So we need to know how to define vectors. Here are several ways we could define the vector

x = (0, 0.2, 0.4, . . . , 1.8, 2.0).    (2.5.1)

x = 0:0.2:2 % From 0 to 2 in steps of 0.2

x = linspace(0, 2, 11); % 11 equally spaced

% points with x(1)=0, x(11)=2.

x = [0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, ...

1.6, 1.8, 2.0]; % Define points individually

The last way is rather tedious, but this one is worse:

x(1)=0.0; x(2)=0.2; x(3)=0.4; x(4)=0.6; ...

We’ll see a less tedious way of doing this last approach

in Section 2.5.3 below.

2.5.2 Script files

MATLAB is an interpretative environment: if you type

a (correct) line of MATLAB code, and hit return, it will

execute it immediately. For example: try >> exp(1)

to get a decimal approximation of e.

However, we usually want to string together a collection of MATLAB operations and run them repeatedly. To do that, it is best to store these commands in a script file. This is done by making a file called, for example, Lab2.m and placing in it a series of MATLAB commands. A script file is run from the MATLAB command window by typing the name of the script, e.g., >> Lab2

Try putting some of the above commands for defining

a vector into a script file and run it.

2.5.3 for–loops

When we want to run a particular set of operations a

fixed number of times we use a for–loop.

It works by iterating over a vector; at each iteration

the iterand takes each value stored in the vector in turn.

For example, here is another way to define the vector

in (2.5.1):

for i=0:10 % 0:10 is the vector [0,1,...,10]

x(i+1) = i*0.2;

end

2.5.4 Functions

In MATLAB we can define a function in a way that is

quite similar to the mathematical definition of a function.

The syntax is >> Name = @(Var)(Formula);

Examples:

f = @(x)(exp(x) - 2*x -1);

g = @(x)(log(2*x +1));

Now, for example, if we call g(1), it evaluates as log(3) =

1.098612288668110. Furthermore, if x is a vector, so too

is g(x).

A more interesting example would be to try

>> xk = 1;

and then repeat the line

>> xk = g(xk)

Try this and observe that the values of xk seem to be converging. This is because we are using Fixed Point Iteration.
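The same experiment, sketched in Python (g and the loop mirror the MATLAB session above):

```python
import math

g = lambda x: math.log(2 * x + 1)

xk = 1.0
for _ in range(60):     # repeatedly apply xk = g(xk), as in the MATLAB session
    xk = g(xk)
# xk has settled at a fixed point: g(xk) = xk (about 1.2564)
```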

Later we’ll need to know how to define functions of

two variables. This can be done as:

F = @(y,t)(y./(1 + t.^2));

2.5.5 Plotting functions

MATLAB has two ways to plot functions. The easiest

way is to use a function called ezplot:

ezplot(f, [0, 2]);

plots the function f(x) for 0 ≤ x ≤ 2. A more flexible

approach is to use the plot function, which can plot

one vector as a function of another. Try these examples

below, making sure you first have defined the vector x

and the functions f and g:

figure(1);

plot(x,f(x));

figure(2);

plot(x,f(x), x, g(x), ’--’, x,x, ’-.’);

Can you work out what the syntax ’--’ and ’-.’ does?

If not, ask a tutor. Also try

plot(x,f(x), ’g-o’, x, g(x), ’r--x’, ...

x,x, ’-.’);


2.5.6 How to learn more

These notes are not an encyclopedic guide to MATLAB –

they have just enough information to get started. There

are many good references online.

Exercise: access Learning MATLAB by Tobin Driscoll

through the NUI Galway library portal. Read Section 1.6:

“Things about MATLAB that are very nice to know, but

which often do not come to the attention of beginners”.

2.5.7 Initial Value Problems

The particular example of an IVP that we’ll look at in this lab is one that we had earlier in (2.2.2): estimate y(4) given that

y′(t) = y/(1 + t²), for t > 0, and y(0) = 1.

The true solution to this is y(t) = e^{arctan(t)}. If you don’t want to solve this problem by hand, you could use Maple. The Maple command is:

dsolve({D(y)(t)=y(t)/(1+t^2),y(0)=1},y(t));

2.5.8 Euler’s Method

Euler’s Method is:

• Choose n, the number of points at which you will estimate y(t). Let h = (tn − t0)/n, and ti = t0 + ih.

• For i = 0, 1, 2, . . . , n − 1, set yi+1 = yi + hf(ti, yi).

Then y(tn) ≈ yn. As shown in Section 2.3.3, the global error for Euler’s method can be bounded:

|En| := |y(T) − yn| ≤ Kh,

for some constant K that does not depend on h (or n). That is, if h is halved (i.e., n is doubled), the error is halved as well.

Download the MATLAB script file Euler.m. It can

be run in MATLAB simply by typing >> Euler

It implements Euler’s method for n = 1. Read the file

carefully and make sure you understand it.

The program computes a vector y that contains the

estimates for y at the time-values specified in the vector

t. However, MATLAB indexes all vectors from 1, and

not 0. So t(1) = t0, t(2) = t1, ... t(n+1) = tn.
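Euler.m is a MATLAB script and isn't reproduced in these notes; the loop it implements can be sketched in Python as follows (euler is my own name for the function; f is the right-hand side y/(1 + t²) from this lab):

```python
import math

def euler(f, t0, T, y0, n):
    """Euler's method: y[i+1] = y[i] + h*f(t[i], y[i]); returns y_n ~ y(T)."""
    h = (T - t0) / n
    t, y = t0, y0
    for i in range(n):
        y = y + h * f(t, y)
        t = t0 + (i + 1) * h
    return y

f = lambda t, y: y / (1 + t * t)     # the IVP from this lab
exact = math.exp(math.atan(4.0))     # true value of y(4)
for n in (2, 4, 8, 16):
    yn = euler(f, 0.0, 4.0, 1.0, n)
    print(n, yn, abs(exact - yn))
```

For n = 2 this gives yn = 4.2, and for n = 4 it gives yn = 3.96, matching the first two rows of Table 2.3.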

By changing the value of n, complete the table.

We want to use this table to verify that Euler’s Method

is 1st-order accurate. That is:

|En| ≤ K h^ρ with ρ = 1.

A computational technique that verifies the order of the

method is to estimate ρ by

ρ ≈ log2( |En| / |E2n| ).    (2.5.2)
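Applied to the two errors already given in Table 2.3, formula (2.5.2) gives a first estimate of ρ (a two-line Python sketch):

```python
import math

E2, E4 = 4.347e-1, 1.947e-1   # errors for n = 2 and n = 4 from Table 2.3
rho = math.log2(E2 / E4)      # estimate of the convergence order, by (2.5.2)
print(rho)
```

The estimate is above 1 here; as n grows it should settle towards ρ = 1.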

Table 2.3: Complete this table showing the convergence of Euler's method

  n     yn      En             ρ
  2     4.2     4.347 × 10^−1
  4     3.96    1.947 × 10^−1
  8
  16
  32
  64
  128
  256
  512

Use the data in the table to verify that ρ ≈ 1 for Euler's

Method. Upload these results to the Assignments

-- Lab 2 section of Blackboard. For example, take a

photo of the table and upload that. However, it would

be better to upload a modified version of the Euler.m

script that computes ρ for each n.

2.5.9 More MATLAB: formatted output

When you run the Euler.m script file, you’ll see that the

output is not very pretty. In particular, the data are not

nicely tabulated. The script uses the fprintf function

to display messages and values. Its basic syntax is:

fprintf('Here is a message\n');

where the \n indicates a new line.

Using fprintf to display the contents of a variable

is a little more involved, and depends on how we want

the data displayed. For example:

- To display an integer:

fprintf('Using n=%d steps\n', n);

- To display an integer, padded to a field width of 8 characters:

fprintf('Using n=%8d steps\n', n);

- To display a floating point number:

fprintf('y(n)=%f\n', Y(n+1));

- To display in exponential notation:

fprintf('Error=%e\n', Error);

- To display 3 decimal places:

fprintf('y(n)=%.3f\n', Y(n+1));
fprintf('Error=%.3e\n', Error);

Use these to improve the formatting of the output from

your Euler.m script so that the output is nicely tabulated. To get more information on fprintf, type

>> doc fprintf

or ask one of the tutors.


2.6 Runge-Kutta 4

2.6.1 RK4

It is possible to construct methods that have rates of

convergence that are higher than RK2 methods, such as

the RK4 methods, for which

|y(tn) − yn| ≤ Ch^4.

However, writing down the general form of the RK4 method, and then deriving conditions on the parameters, is rather complicated. Therefore, we'll simply state the most popular RK4 method, without proving that it is 4th-order.

4th-Order Runge Kutta Method (RK4):

k1 = f(ti, yi),
k2 = f(ti + h/2, yi + (h/2)k1),
k3 = f(ti + h/2, yi + (h/2)k2),
k4 = f(ti + h, yi + hk3),
yi+1 = yi + (h/6)(k1 + 2k2 + 2k3 + k4).

It can be interpreted as

k1 is the slope of y(t) at ti.

k2 is an approximation for the slope of y(t) at ti + h/2

(using Euler’s Method).

k3 is an improved approximation for the slope of y(t) at

ti + h/2.

k4 is an approximation for the slope of y(t) at ti+1, computed using the slope at ti + h/2.

Finally, the slope of the secant line joining y(t) at the points ti and ti+1 is approximated using a weighted average of the above values.
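In code, one RK4 step is a direct transcription of the four formulas above (a Python sketch; rk4_step is my own name, and the one-step check against y′ = y is an illustrative choice):

```python
def rk4_step(f, t, y, h):
    """One step of the classical 4th-order Runge-Kutta method."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# One step for y' = y, y(0) = 1: the result should be very close to e^0.1.
print(rk4_step(lambda t, y: y, 0.0, 1.0, 0.1))
```

For this test problem the one-step error is O(h^5), so with h = 0.1 the printed value agrees with e^0.1 to about 7 digits.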

Example 2.6.1. Recall the test problem from Example 2.4.1. Table 2.4 and Table 2.6.1 give the errors in the solutions computed using various methods and values of n.

2.6.2 RK4: consistency and convergence

Although we won’t do a detailed analysis of RK4, we can

do a little. In particular, we would like to show it is

(i) consistent,

(ii) convergent and fourth-order, at least for some examples.

Example 2.6.2. It is easy to see that RK4 is consistent:

Table 2.4: Errors |y(tn) − yn| in solutions to Example 2.4.1 using Euler's, Modified, and RK4 methods

  n     Euler      Modified   RK4
  1     3.02e-01   7.89e-02   8.14e-04
  2     1.90e-01   2.90e-02   1.08e-04
  4     1.11e-01   8.20e-03   5.07e-06
  8     6.02e-02   2.16e-03   2.44e-07
  16    3.14e-02   5.55e-04   1.27e-08
  32    1.61e-02   1.40e-04   7.11e-10
  64    8.13e-03   3.53e-05   4.18e-11
  128   4.09e-03   8.84e-06   2.53e-12
  256   2.05e-03   2.21e-06   1.54e-13
  512   1.03e-03   5.54e-07   7.33e-15

Fig. 2.3: Log-log plot of the errors when Euler's, Modified Euler's, and RK4 methods are used to solve Example 2.4.1. (Legend: Euler error, RK2 error, RK4 error.)

Take notes:

Example 2.6.3. In general, showing the rate of convergence is tricky. Instead, we'll demonstrate how the method relates to a Taylor Series expansion for the problem y′ = λy, where λ is a constant.

Take notes:


2.6.3 The (Butcher) Tableau

A great number of RK methods have been proposed and used through the years. A unified approach to representing and studying them was developed by John Butcher (Auckland, New Zealand). In his notation, we write an s-stage method as

s-stage method as

Φ(ti, yi; h) = Σ_{j=1}^{s} bj kj,   where

k1 = f(ti + α1h, yi),
k2 = f(ti + α2h, yi + β21hk1),
k3 = f(ti + α3h, yi + β31hk1 + β32hk2),
...
ks = f(ti + αsh, yi + βs1hk1 + · · · + βs,s−1hks−1).

The most convenient way to represent the coefficients

is in a tableau:

α1
α2    β21
α3    β31    β32
...
αs    βs1    βs2    · · ·    βs,s−1
      b1     b2     · · ·    bs−1    bs

The tableau for the basic Euler method is trivial:

0
      1

The two best-known RK2 methods are probably the

“Modified Euler method” and the “Improved Euler Method”.

Their tableaux are:

0
1/2    1/2
       0      1

and

0
1      1
       1/2    1/2

A three-stage method, sometimes called "RK3-1", has the tableau

0
2/3    2/3
2/3    1/3    1/3
       1/4    0      3/4

The tableau for the RK4 method above is:

0
1/2    1/2
1/2    0      1/2
1      0      0      1
       1/6    2/6    2/6    1/6

You should now convince yourself that these tableaux

do indeed correspond to the methods we did in class.
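The tableau is exactly the data a program needs: below is a sketch of a generic explicit s-stage step driven by arrays (alpha, beta, b) (the function and variable names are mine). Feeding it the RK4 tableau above reproduces the classical RK4 step, and the trivial Euler tableau reproduces Euler's method.

```python
def rk_step(f, t, y, h, alpha, beta, b):
    """One explicit Runge-Kutta step defined by its Butcher tableau.

    alpha: the s nodes; beta: strictly lower-triangular rows (row i has
    i entries); b: the s weights.
    """
    k = []
    for i in range(len(b)):
        yi = y + h * sum(beta[i][j] * k[j] for j in range(i))
        k.append(f(t + alpha[i] * h, yi))
    return y + h * sum(w * kj for w, kj in zip(b, k))

# The RK4 tableau from above:
alpha = [0.0, 0.5, 0.5, 1.0]
beta = [[], [0.5], [0.0, 0.5], [0.0, 0.0, 1.0]]
b = [1/6, 2/6, 2/6, 1/6]
print(rk_step(lambda t, y: y, 0.0, 1.0, 0.1, alpha, beta, b))
```

Swapping in the RK3-1 or either RK2 tableau needs no change to rk_step, which is precisely the point of Butcher's notation.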

2.6.4 Even higher-order methods?

A Runge-Kutta method has s stages if it involves s evaluations of the function f. (That is, its formula features k1, k2, . . . , ks.) We've seen a one-stage method that is

1st-order, a two-stage method that is 2nd-order, ..., and

a four-stage method that is 4th-order. It is tempting to

think that for any s we can get a method of order s using

s stages. However, it can be shown that, for example, to

get a 5th-order method, you need at least 6 stages; for a

7th-order method, you need at least 9 stages. The theory involved is both intricate and intriguing, and involves

aspects of group theory, graph theory, and differential

equations. Students in third year might consider this as

a topic for their final year project.

2.6.5 Exercises

Exercise 2.8. We claim that, for RK4:

|EN| = |y(tN) − yN| ≤ Kh^4,

for some constant K. How could you verify that the statement is true using the data of Table 2.3, at least for the test problem in Example 2.4.2? Give an estimate for K.

Exercise 2.9. Recall the problem in Example 2.2.2: Es-

timate y(2) given that

y(1) = 1, y′ = f(t, y) := 1 + t + y/t,

(i) Show that f(t,y) satisfies a Lipschitz condition and

give an upper bound for L.

(ii) Use Euler’s method with h = 1/4 to estimate y(2).

Using the true solution, calculate the error.

(iii) Repeat this for the RK2 method of your choice (with a ≠ 0), taking h = 1/2.

(iv) Use RK4 with h = 1 to estimate y(2).

Exercise 2.10. Here is the tableau for a three stage

Runge-Kutta method:

0
α2    1/2
1     β31    2
      1/6    b2    1/6

1. Use that the method is consistent to determine b2.

2. The method is exact when used to compute the

solution to

y(0) = 0, y ′(t) = 2t, t > 0.

Use this to determine α2.

3. The method should agree with an appropriate Tay-

lor series for the solution to y ′(t) = λy(t), up to

terms that are O(h3). Use this to determine β31.


2.7 From IVPs to Linear Systems

In this final theoretical section, we highlight some of the

many important aspects of the numerical solution of IVPs

that are not covered in detail in this course:

- Systems of ODEs;

- Higher-order equations;

- Implicit methods; and

- Problems in two dimensions.

We have the additional goal of seeing how these methods relate to the earlier section of the course (nonlinear problems) and the next section (linear equation solving).

2.7.1 Systems of ODEs

So far we have solved only single IVPs. However, most interesting problems are coupled systems: find functions

y and z such that

y ′(t) = f1(t,y, z),

z ′(t) = f2(t,y, z).

This does not present much of a problem to us. For

example, the Euler Method is extended to

yi+1 = yi + hf1(ti, yi, zi),
zi+1 = zi + hf2(ti, yi, zi).

Example 2.7.1. In pharmacokinetics, the flow of drugs between the blood and major organs can be modelled by

dy/dt(t) = k21 z(t) − (k12 + kelim) y(t),
dz/dt(t) = k12 y(t) − k21 z(t),
y(0) = d, z(0) = 0,

where y is the concentration of a given drug in the bloodstream and z is its concentration in another organ. The parameters k21, k12 and kelim are determined from physical experiments.

Euler's method for this is:

yi+1 = yi + h( k21 zi − (k12 + kelim) yi ),
zi+1 = zi + h( k12 yi − k21 zi ).

2.7.2 Higher-Order problems

So far we've only considered first-order initial value problems:

y′(t) = f(t, y); y(t0) = y0.

However, the methods can easily be extended to higher-order problems:

y′′(t) + a(t)y′(t) = f(t, y); y(t0) = y0, y′(t0) = y1.

We do this by converting the problem to a system: set

z(t) = y ′(t). Then:

z ′(t) = −a(t)z(t) + f(t,y), z(t0) = y1,

y ′(t) = z(t), y(t0) = y0.

Example 2.7.2. Consider the following 2nd-order IVP

y ′′(t) − 3y ′(t) + 2y(t) + et = 0,

y(1) = e, y ′(1) = 2e.

Let z = y′; then

z′(t) = 3z(t) − 2y(t) + e^t, z(1) = 2e,
y′(t) = z(t), y(1) = e.

Euler's Method is

zi+1 = zi + h(3zi − 2yi + e^{ti}),
yi+1 = yi + hzi.

2.7.3 Implicit methods

Although we won't dwell on the point, there are many problems for which the one-step methods we have seen will give a useful solution only when the step size, h, is small enough. For larger h, the solution can be very unstable. Such problems are called "stiff" problems. They

can be solved, but are best done with so-called “implicit

methods”, the simplest of which is the Implicit Euler

Method:

yi+1 = yi + hf(ti+1,yi+1).

Note that yi+1 appears on both sides of the equation.

To implement this method, we need to be able to solve

this non-linear problem. The most common method for

doing this is Newton’s method.
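For the scalar test problem y′ = λy the implicit equation can be solved for yi+1 by hand, so no Newton iteration is needed, and the stability difference is easy to demonstrate. A Python sketch (λ = −50, h = 0.1 and n = 20 are illustrative choices of mine):

```python
lam, h, n = -50.0, 0.1, 20   # stiff test problem y' = lam*y, y(0) = 1

ye = 1.0                     # explicit (forward) Euler
yi = 1.0                     # implicit (backward) Euler
for _ in range(n):
    ye = ye + h * lam * ye        # y[i+1] = y[i] + h*lam*y[i]
    yi = yi / (1 - h * lam)       # solve y[i+1] = y[i] + h*lam*y[i+1]

print(ye, yi)
```

The true solution e^{λt} decays to zero; the implicit iterates do too, for any h > 0, while the explicit iterates oscillate and blow up because |1 + hλ| = 4 > 1 here.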

2.7.4 Towards Partial Differential Equations

So far we've only considered ordinary differential equations: these are DEs which involve functions of just one variable. In our examples above, this variable was time. But of course many physical phenomena vary in space and time, and so the solutions to the differential equations that model them depend on two or more variables. The derivatives expressed in the equations are partial derivatives, and so they are called partial differential equations (PDEs).

We will take a brief look at how to solve these (and

how not to solve them). This will motivate the following

section, on solving systems of linear equations.

Students of financial mathematics will be familiar with

the Black-Scholes equations for pricing an option, which

we mentioned in Section 1.6:

∂V/∂t − (1/2)σ²S² ∂²V/∂S² − rS ∂V/∂S + rV = 0.


With a little effort (see, e.g., Chapter 5 of "The Mathematics of Financial Derivatives: a student introduction", by Wilmott, Howison, and Dewynne) this can be transformed to the simpler-looking heat equation:

∂u/∂t(t, x) = ∂²u/∂x²(t, x), for (x, t) ∈ [0, L] × [0, T],

and with the initial and boundary conditions

u(0, x) = g(x) and u(t, 0) = a(t),u(t,L) = b(t).

Example 2.7.3. If L = π, g(x) = sin(x), a(t) = b(t) ≡ 0, then u(t, x) = e^{−t} sin(x) (see Figure 2.4).

Fig. 2.4: The true solution to the heat equation

In general, however, for arbitrary g, a, and b, an explicit solution to this problem is not available, so a numerical scheme must be used. Suppose we somehow knew ∂²u/∂x²; then we could just use Euler's method:

u(ti+1, x) = u(ti, x) + h (∂²u/∂x²)(ti, x).

Although we don't know (∂²u/∂x²)(ti, x), we can approximate it. The algorithm is as follows:

1. Divide [0, T ] into N intervals of width h, giving the

grid {0 = t0 < t1 < · · · < tN−1 < tN = T }, with

ti = t0 + ih.

2. Divide [0,L] into M intervals of width H, giving

the grid {0 = x0 < x1 < · · · < xM = L} with

xj = x0 + jH.

3. Denote by ui,j the approximation for u(t, x) at

(ti, xj).

4. For each i = 0, 1, . . . , N − 1, use the following approximation for (∂²u/∂x²)(ti, xj):

δ²x ui,j = (1/H²)( ui,j−1 − 2ui,j + ui,j+1 ),

for j = 1, 2, . . . , M − 1, and then take

ui+1,j := ui,j + h δ²x ui,j.

This scheme is called an explicit method : if we know

ui,j−1, ui,j and ui,j+1 then we can explicitly calculate

ui+1,j. See Figure 2.5.

Fig. 2.5: A finite difference grid (t = time, x = space): if we know ui,j−1, ui,j and ui,j+1 we can calculate ui+1,j.
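For the test problem of Example 2.7.3 the explicit scheme is easy to code and check. A Python sketch, using the update ui+1,j = ui,j + h·δ²x ui,j. The grid sizes M = 20, N = 200 are illustrative choices of mine, picked so that h ≤ H²/2 holds; that restriction is the standard stability condition for this scheme, which the notes allude to only as choosing h and H "carefully":

```python
import math

L, T, M, N = math.pi, 1.0, 20, 200
H, h = L / M, T / N                     # here h = 0.005 <= H^2/2 ~ 0.0123
x = [j * H for j in range(M + 1)]
u = [math.sin(xj) for xj in x]          # u(0, x) = sin(x); boundaries stay 0

for i in range(N):
    # compute delta^2_x at the current time level first, then update
    d2 = [(u[j-1] - 2*u[j] + u[j+1]) / H**2 for j in range(1, M)]
    for j in range(1, M):
        u[j] = u[j] + h * d2[j-1]

# compare with the exact solution u(T, x) = exp(-T) sin(x)
err = max(abs(u[j] - math.exp(-T) * math.sin(x[j])) for j in range(M + 1))
print(err)
```

Rerunning with a larger h that violates h ≤ H²/2 makes the computed u oscillate wildly, which is the instability discussed next.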

Unfortunately, as we will see in class, this method is not very stable. Unless we are very careful choosing h and H, huge errors occur in the approximation for larger i (time steps).

Instead one might use an implicit method: if we know ui−1,j, we compute ui,j−1, ui,j and ui,j+1 simultaneously. More precisely: solve ui,j − h[δ²x ui,j] = ui−1,j for i = 1, then i = 2, i = 3, etc. Expanding the δ²x term we get, for each i = 1, 2, 3, . . . , the set of simultaneous equations

ui,0 = a(ti),
α ui,j−1 + β ui,j + α ui,j+1 = ui−1,j,   j = 1, 2, . . . , M − 1,
ui,M = b(ti),

where α = −h/H² and β = 2h/H² + 1. This could be

expressed more clearly as the matrix-vector equation:

Ax = f,

where

A =

[ 1  0  0  0  ...  0  0  0  0 ]
[ α  β  α  0  ...  0  0  0  0 ]
[ 0  α  β  α  ...  0  0  0  0 ]
[              ...             ]
[ 0  0  0  0  ...  α  β  α  0 ]
[ 0  0  0  0  ...  0  α  β  α ]
[ 0  0  0  0  ...  0  0  0  1 ]

and

x = ( ui,0, ui,1, ui,2, . . . , ui,M−2, ui,M−1, ui,M )^T,
f = ( a(ti), ui−1,1, ui−1,2, . . . , ui−1,M−2, ui−1,M−1, b(ti) )^T.


So “all” we have to do now is solve this system of

equations. That is what the next section of the course is

about.
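Each implicit time step above is one tridiagonal solve. A minimal pure-Python sketch of such a solver (this is the Thomas algorithm, a specialisation of the Gaussian Elimination studied next; solve_tridiag is my own name):

```python
def solve_tridiag(a, b, c, f):
    """Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal."""
    n = len(b)
    b, f = b[:], f[:]                 # work on copies
    for i in range(1, n):             # forward elimination
        m = a[i - 1] / b[i - 1]
        b[i] -= m * c[i - 1]
        f[i] -= m * f[i - 1]
    x = [0.0] * n
    x[-1] = f[-1] / b[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        x[i] = (f[i] - c[i] * x[i + 1]) / b[i]
    return x

# tiny check: the 2x2 system [[2,1],[1,3]] x = (3,5) has solution (0.8, 1.4)
print(solve_tridiag([1.0], [2.0, 3.0], [1.0], [3.0, 5.0]))
```

For the heat equation one would call this once per time level with b = (1, β, . . . , β, 1), a = c = (0, α, . . . , α, 0) arranged to match the matrix A above; the point of the next chapter is why and when such eliminations work.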

Exercise 2.11. Write down the Euler Method for the

following 3rd-order IVP

y ′′′ − y ′′ + 2y ′ + 2y = x2 − 1,

y(0) = 1,y ′(0) = 0,y ′′(0) = −1.

Exercise 2.12. Use a Taylor series to provide a derivation

for the formula

∂²u/∂x²(ti, xj) ≈ (1/H²)( ui,j−1 − 2ui,j + ui,j+1 ).

Exercise 2.13. ⋆ (Your own RK3 method). Here are some entries for 3-stage Runge-Kutta method tableaux.

Method 0: α2 = 2/3, α3 = 0, b1 = 1/12, b2 = 3/4,

β32 = 3/2

Method 1: α2 = 1/4, α3 = 1, b1 = −1/6, b2 = 8/9,

β32 = 12/5

Method 2: α2 = 1/4, α3 = 1/2, b1 = 2/3, b2 = −4/3,

β32 = 2/5

Method 3: α2 = 1/4, α3 = 1/3, b1 = 3/2, b2 = −8,

β32 = 4/45

Method 4: α2 = 1, α3 = 1/4, b1 = −1/6, b2 = 5/18,

β32 = 3/16

Method 5: α2 = 1, α3 = 1/5, b1 = −1/3, b2 = 7/24,

β32 = 4/25

Method 6: α2 = 1, α3 = 1/6, b1 = −1/2, b2 = 3/10,

β32 = 5/36

Method 7: α2 = 1/2, α3 = 1/7, b1 = 7/6, b2 = 22/15, β32 = −10/49

Method 8: α2 = 1/2, α3 = 1/8, b1 = 4/3, b2 = 13/9,

β32 = −3/16

Method 9: α2 = 1/3, α3 = 1/9, b1 = 4, b2 = 15/4,

β32 = −2/27

Answer the following questions for Method K, where

K is the last digit of your ID number. For example, if

your ID number is 01234567, use Method 7.

(a) Assuming that the method is consistent, determine

the value of b3.

(b) Consider the initial value problem:

y(0) = 1, y ′(t) = λy(t).

Using that the solution is y(t) = eλt, write out a

Taylor series for y(ti+1) about y(ti) up to terms of

order h4 (use that h = ti+1 − ti).

Using that your method should agree with the Taylor

Series expansion up to terms of order h3, determine

β21 and β31.

Exercise 2.14. (Attempt this exercise after completing Lab 3). Write a MATLAB program that implements your

method from Exercise 2.13.

Use this program to check the order of convergence

of the method. Have it compute the error for n = 2,

n = 4, . . . , n = 1024. Then produce a log-log plot of

the errors as a function of n.

Exercise 2.15. (Tutorial problem). Suppose that a 3-

stage Runge-Kutta method tableaux has the following

entries:

α2 = 1/3, α3 = 1/9, b1 = 4, b2 = 15/4, β32 = −2/27.

(a) Assuming that the method is consistent, determine

the value of b3.

(b) Consider the initial value problem:

y(0) = 1, y ′(t) = λy(t).

Using that the solution is y(t) = eλt, write out a

Taylor series for y(ti+1) about y(ti) up to terms of

order h4 (use that h = ti+1 − ti).

Using that your method should agree with the Taylor

Series expansion up to terms of order h3, determine

β21 and β31.


2.8 LAB 3: RK2 and RK3 methods

In this lab, you will extend the code for Euler’s method

from Lab 2 to implement higher-order methods to solve

IVPs of the form

y(t0) = y0, y ′(t) = f(t,y) for t > t0.

In particular, you will write programs to implement

certain RK2 and RK3 methods.

2.8.1 RK2

A generic one-step method is written as

yi+1 = yi + hΦ(ti, yi; h) for i = 0, 1, . . . , n − 1.

To get a Runge-Kutta 2 (“RK2”) method, set

k1 = f(ti,yi), (2.8.1a)

k2 = f(ti + αh,yi + βhk1), (2.8.1b)

Φ(ti,yi;h) = ak1 + bk2. (2.8.1c)

In Section 2.4, we saw that if we pick any b ≠ 0, and let

a = 1 − b,   α = 1/(2b),   β = α,    (2.8.2)

then we get a second-order method: |En| ≤ Kh².

For example, if we choose b = 1, we get the so-called

Modified or mid-point Euler Method from Section 2.4.

However, any value of b other than b = 0 should give a second-order method.

Download the MATLAB script Euler Solution.m

and run it. Make sure you understand how it works.

Next, adapt it to implement an RK2 method as follows.

1. Take b in (2.8.1c) to be the last digit of your ID

number, unless that is 0, in which case take b = −1.

(For example, if your ID number is 01234567, take

b = 7. If your ID number is 76543210, take b = −1).

Compute the values of a, α and β according to (2.8.2).

2. Choose an initial value problem to solve, and for which you know the exact solution. To avoid having a problem that is too simple,

- your solution should involve trigonometric, logarithmic or nth-root functions.

- f should depend explicitly on both t and y.

(Hint: decide on the solution first, and then differentiate that to get f.) You also need to choose an initial time, t0, and a final time for the simulation, tn.

3. The MATLAB program should approximate the solution to this IVP using your RK2 method for n = 2,

n = 4, n = 8, . . . , n = 512 (at least). For each n

it should output the estimate for y(tn) and the error

|En| = |y(tn) − yn|.

4. The program should produce a figure displaying a log-log plot of these errors against the corresponding values of n, as well as n^−2 against n. If your method is second-order, then these two lines should be parallel.
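The lab itself is in MATLAB, but the shape of the program can be sketched in Python. Here b = 2 and the IVP y′ = y cos t, y(0) = 1 on [0, 2] (exact solution e^{sin t}) are purely illustrative choices of mine; your own b and IVP come from the instructions above:

```python
import math

b = 2.0                              # illustrative; yours comes from your ID
a, alpha = 1.0 - b, 1.0 / (2.0 * b)  # the relations (2.8.2)
beta = alpha

def rk2(f, t0, T, y0, n):
    """RK2: y[i+1] = y[i] + h*(a*k1 + b*k2), as in (2.8.1)."""
    h = (T - t0) / n
    t, y = t0, y0
    for i in range(n):
        k1 = f(t, y)
        k2 = f(t + alpha * h, y + beta * h * k1)
        y = y + h * (a * k1 + b * k2)
        t = t0 + (i + 1) * h
    return y

f = lambda t, y: math.cos(t) * y     # exact solution: y(t) = exp(sin t)
exact = math.exp(math.sin(2.0))
errs = []
for n in (2, 4, 8, 16, 32, 64, 128, 256, 512):
    errs.append(abs(exact - rk2(f, 0.0, 2.0, 1.0, n)))
    print(n, errs[-1])
```

Doubling n should roughly quarter the error, which is exactly what the parallel lines on the log-log plot express.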

2.8.2 RK3

Next write a program that implements the RK3-1 method

given in Section 2.6.3:

α1
α2    β21
α3    β31    β32
      b1     b2     b3

=

0
2/3    2/3
2/3    1/3    1/3
       1/4    0      3/4

The method is then

k1 = f(ti,yi),

k2 = f(ti + α2h,yi + β21hk1),

k3 = f(ti + α3h,yi + β31hk1 + β32hk2),

and

Φ(ti,yi;h) = b1k1 + b2k2 + b3k3.

As with the RK2 method, the MATLAB program

should approximate the solution to this IVP using this

RK3 method for n = 2, n = 4, n = 8, . . . . For each

n it should output the estimate for y(tn) and the error

|En| = |y(tn) − yn|.

The program should also produce a figure displaying a log-log plot of these errors against the corresponding values of n, as well as n^−3 against n.

2.8.3 What to upload

When your RK2 and RK3 programs are complete, upload them to "Lab 3", under the "Assignments and Labs" tab on Blackboard. This can be done as two separate files, but it is okay to combine them into a single file if you prefer. (After all, every RK2 method is an example of an RK3 method.)

Add appropriate comments to the top of your file(s) indicating who wrote it, when they wrote it, what it does, and how it does it ("Who/When/What/How?"). Include your ID number, and give the program a sensible name, which includes something distinctive like your name and ID number (so that I don't end up with 50 programmes all called RK2.m).

Make sure your programmes run as-is before uploading. If they don't, you might have given them an invalid name, such as one containing spaces or mathematical symbols.

The deadline for uploading your code is Friday, 17 November.


Chapter 3

Solving Linear Systems

3.1 Introduction

This section of the course is concerned with solving systems of linear equations ("simultaneous equations"). All problems are of the form: find the set of real numbers x1, x2, . . . , xn such that

a11x1 + a12x2 + · · ·+ a1nxn = b1

a21x1 + a22x2 + · · ·+ a2nxn = b2

...

an1x1 + an2x2 + · · ·+ annxn = bn

where the aij and bi are real numbers.

It is natural to rephrase this using the language of

linear algebra: Find x = (x1, x2, . . . xn)T ∈ Rn such

that

Ax = b, (3.1.1)

where A ∈ Rn×n is an n × n matrix and b ∈ Rn is a (column) vector with n entries.

In this section, we'll try to find a clever way of solving this system. In particular,

1. we'll argue that it's unnecessary and (more importantly) expensive to try to compute A−1;

2. we’ll have a look at Gaussian Elimination, but will

study it in the context of LU-factorisation;

3. after a detour to talk about matrix norms, we calculate the condition number of a matrix;

4. This last task will require us to be able to estimate

the eigenvalues of matrices, so we will finish the

module by studying how to do that.

3.1.1 A short review of linear algebra

- A vector x ∈ Rn is an ordered n-tuple of real numbers, x = (x1, x2, . . . , xn)^T.

- A matrix A ∈ Rm×n is a rectangular array of real numbers, with m rows and n columns. (But we will only deal with square matrices, i.e., ones where m = n.)

- You already know how to multiply a matrix by a vector, and a matrix by a matrix.

- You can write the scalar product of two vectors x and y as (x, y) = x^T y = Σ_{i=1}^{n} xi yi. (This is also called the "(usual) inner product" or the "dot product".)

- Recall that (Ax)^T = x^T A^T, so (x, Ay) = x^T Ay = (A^T x)^T y = (A^T x, y).

- Letting I be the identity matrix, if there is a matrix B such that AB = BA = I, then we call B the inverse of A and write B = A−1. If there is no such matrix, we say that A is singular.

Lemma 3.1.1. The following statements are equivalent.

1. For any b, Ax = b has a solution.

2. If there is a solution to Ax = b, it is unique.

3. If Ax = 0 then x = 0.

4. The columns (or rows) of A are linearly independent.

5. There exists a matrix A−1 ∈ Rn×n such that AA−1 = I = A−1A.

6. det(A) ≠ 0.

Item 5 of the above list suggests that we could solve (3.1.1) as follows: find A−1 and then compute x = A−1b. However, usually, this is a very bad idea, because we only need x, and computing A−1 requires quite a bit of work.

3.1.2 The time it takes to compute det(A)

In an introductory linear algebra course, we would compute

A−1 = (1/det(A)) A∗,

where A∗ is the adjoint matrix. But it turns out that computing det(A) directly can be very time-consuming.

Let dn be the number of multiplications required to compute the determinant of an n × n matrix using the


Method of Minors (also known as “Laplace’s Formula”).

We know that if

A = [ a  b ]
    [ c  d ]

then det(A) = ad − bc. So d2 = 2. If

A = [ a  b  c ]
    [ d  e  f ]
    [ g  h  j ]

then

det(A) = a |e f; h j| − b |d f; g j| + c |d e; g h|,

so d3 = 3d2. Next:

Take notes:
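Continuing the pattern d2 = 2, d3 = 3d2 suggests the recurrence dn = n·dn−1, so dn grows at least factorially. A quick Python sketch of what that means in practice (treating the recurrence as exact, which slightly undercounts, and taking 2 × 10^9 multiplications per second as an illustrative machine speed):

```python
def d(n):
    """Multiplication count for the method of minors: d(2)=2, d(n)=n*d(n-1)."""
    return 2 if n == 2 else n * d(n - 1)

ops_per_sec = 2e9   # illustrative machine speed
for n in (10, 20, 30):
    print(n, d(n), d(n) / ops_per_sec, "seconds (at least)")
```

Since d(n) = n! under this recurrence, the 20 × 20 case already takes decades, which is the point of Exercise 3.1 below.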

3.1.3 Exercises

Exercise 3.1. Suppose you had a computer that can perform 2 billion operations per second. Give a lower bound for the amount of time required to compute the determinant of a 10-by-10, a 20-by-20, and a 30-by-30 matrix using the method of minors.

Exercise 3.2. Show that det(σA) = σn det(A) for any

A ∈ Rn×n and any scalar σ ∈ R.

Note: this exercise gives us another reason to avoid trying to calculate the determinant of the coefficient matrix in order to find the inverse, and thus the solution to the problem. For example, if A ∈ R10×10 and the system is rescaled by σ = 0.1, then det(A) is rescaled by 10^−10 – suddenly the matrix looks almost singular!

3.2 Gaussian Elimination

3.2.1 The basic idea

In the earlier sections of this course, we studied approximate methods for solving a problem: we replaced the problem (e.g., finding a zero of a nonlinear function) with an easier problem (finding the zero of a line) that has a similar solution. The method we'll use here, Gaussian Elimination¹, is exact: it replaces the problem with one that is easier to solve and has the same solution.

Example 3.2.1. Consider the problem:

[ −1   3  −1 ] [ x1 ]   [  5 ]
[  3   1  −2 ] [ x2 ] = [  0 ]
[  2  −2   5 ] [ x3 ]   [ −9 ]

We can perform a sequence of elementary row operations to yield the system:

[ −1   3  −1 ] [ x1 ]   [  5 ]
[  0   2  −1 ] [ x2 ] = [  3 ]
[  0   0   5 ] [ x3 ]   [ −5 ].

In the latter form, the problem is much easier: from the 3rd row it is clear that x3 = −1; substitute into row 2 to get x2 = 1, and then into row 1 to get x1 = −1.
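The row operations and the back substitution can be carried out mechanically. A pure-Python sketch of elimination plus back substitution, applied to the system of Example 3.2.1 (gauss_solve is my own name; no pivoting, so it assumes the diagonal entries encountered are non-zero):

```python
def gauss_solve(A, b):
    """Gaussian elimination (no pivoting) followed by back substitution."""
    n = len(b)
    A = [row[:] for row in A]             # work on copies
    b = b[:]
    for k in range(n - 1):                # eliminate below column k
        for i in range(k + 1, n):
            mu = -A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] += mu * A[k][j]
            b[i] += mu * b[k]
    x = [0.0] * n                         # back substitution
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

A = [[-1.0, 3.0, -1.0], [3.0, 1.0, -2.0], [2.0, -2.0, 5.0]]
b = [5.0, 0.0, -9.0]
print(gauss_solve(A, b))   # the solution is x = (-1, 1, -1)
```

Note each elimination step is exactly a multiplication by a unit lower triangular matrix, as explained next.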

3.2.2 Elementary row operations as matrix multiplication

Recall that in Gaussian elimination, at each step in the process, we perform an elementary row operation, such as

A = [ a11  a12  a13 ]
    [ a21  a22  a23 ]
    [ a31  a32  a33 ]

being replaced by

[ a11            a12            a13           ]         [  0    0    0  ]
[ a21 + µ21a11   a22 + µ21a12   a23 + µ21a13  ] = A + µ21 [ a11  a12  a13 ]
[ a31            a32            a33           ]         [  0    0    0  ]

where µ21 = −a21/a11. Because

[  0    0    0  ]   [ 0  0  0 ] [ a11  a12  a13 ]
[ a11  a12  a13 ] = [ 1  0  0 ] [ a21  a22  a23 ]
[  0    0    0  ]   [ 0  0  0 ] [ a31  a32  a33 ]

we can write the row operation as (I + µ21E^(21))A, where E^(pq) is the matrix of all zeros, except for epq = 1.

¹Carl Friedrich Gauß, Germany, 1777–1855. Although he produced many very important original ideas, this wasn't one of them. The Chinese knew of "Gaussian Elimination" about 2000 years ago. His actual contributions included major discoveries in the areas of number theory, geometry, and astronomy.


In general each of the row operations in Gaussian Elimination can be written as

(I + µpq E^(pq)) A,   where 1 ≤ q < p ≤ n,    (3.2.2)

and (I + µpq E^(pq)) is an example of a Unit Lower Triangular Matrix.

As we will see, each step of the process will involve

multiplying A by a unit lower triangular matrix, resulting

in an upper triangular matrix. It turns out that triangular

matrices have important properties. So we’ll study these,

and see how they relate to Gaussian Elimination.

3.2.3 Unit Lower Triangular and Upper Triangular Matrices

Definition 3.2.2. L ∈ Rn×n is a Lower Triangular (LT) Matrix if the only non-zero entries are on or below the main diagonal, i.e., if lij = 0 for 1 ≤ i < j ≤ n. It is a Unit Lower Triangular Matrix if, in addition, lii = 1.

U ∈ Rn×n is an Upper Triangular (UT) Matrix if uij = 0 for 1 ≤ j < i ≤ n. It is a Unit Upper Triangular Matrix if uii = 1.

Example 3.2.3.

Take notes:

Triangular matrices have many important properties. One of these is: the determinant of a triangular matrix is the product of the diagonal entries (for proof, see Exercise 3.4). Another is that the eigenvalues of a triangular matrix are just its diagonal entries (see Exercise 3.5). Some more properties are noted in Theorem 3.2.6 below. The style of proof used in that theorem recurs throughout this section of the course, and so is worth studying for its own sake. The key idea is to partition the matrix and then apply inductive arguments.

Definition 3.2.4. X is a submatrix of A if it can be

obtained by deleting some rows and columns of A.

Definition 3.2.5. The Leading Principal Submatrix of order k of A ∈ Rn×n is A(k) ∈ Rk×k obtained by deleting all but the first k rows and columns of A. (Simply put, it's the k × k matrix in the top left-hand corner of A.)

Next recall that if A and V are matrices of the same size, and each is partitioned as

A = [ B  C ]     V = [ W  X ]
    [ D  E ],        [ Y  Z ],

where B is the same size as W, C is the same size as X, etc., then

AV = [ BW + CY   BX + CZ ]
     [ DW + EY   DX + EZ ].

Armed with this, we are now ready to state our theorem on lower triangular matrices.

Theorem 3.2.6 (Properties of Lower Triangular Matrices). For any integer n ≥ 2:

(i) If L1 and L2 are n × n Lower Triangular (LT) Matrices, then so too is their product L1L2.

(ii) If L1 and L2 are n × n Unit LT matrices, then so too is their product L1L2.

(iii) An LT matrix L is nonsingular if and only if all the lii ≠ 0. In particular, all unit LT matrices are nonsingular.

(iv) The inverse of an LT matrix is an LT matrix. The inverse of a unit LT matrix is a unit LT matrix.

In class, we’ll cover the main ideas needed to prove

part (iv) of Theorem 3.2.6, and for LT matrices (the

arguments for unit LT matrices are almost identical: if

anything, a little easier). The other parts are left to

Exercise 3.6. We restate Part (iv) as follows:

Suppose that L ∈ Rn×n is a lower triangular matrix, with n ≥ 2, and that there is a matrix L−1 ∈ Rn×n such that L−1L = In. Then L−1 is also a lower triangular matrix.

Proof. Here are the main ideas. Some more details are

given in Section 3.2.4 below.

Take notes:


Theorem 3.2.7. Statements analogous to Theorem 3.2.6 hold for upper triangular and unit upper triangular matrices. (For proof, see Exercise 3.7.)

3.2.4 Details of the proof of Theorem 3.2.6

The proof is by induction. First show that the statement is true for n = 2 by writing down the inverse of a 2 × 2 nonsingular lower triangular matrix.

Next, suppose that the result holds true for k = 3, 4, . . . , n − 1. For the case where L ∈ Rn×n, partition L by the last row and column:

L = [ L(n−1)  0    ]
    [ rT      ln,n ]

where L(n−1) is the LT matrix with n − 1 rows and columns obtained by deleting the last row and column of L, 0 = (0, 0, . . . , 0)^T, and r ∈ Rn−1. Do the same for L−1:

L−1 = [ X    y ]
      [ zT   β ],

and use the fact that their product is In:

LL−1 = [ L(n−1)  0    ] [ X    y ]   [ L(n−1)X + 0zT    L(n−1)y + 0β ]   [ In−1  0 ]
       [ rT      ln,n ] [ zT   β ] = [ rTX + ln,n zT    rTy + ln,n β ] = [ 0T    1 ].

This gives that:

- L(n−1)X = In−1, so, by the inductive hypothesis, X is an LT matrix.

- L(n−1)y = 0. Since L(n−1) is invertible, we get that y = (L(n−1))−1 0 = 0.

- zT = (−rTX)/ln,n, which exists because, since det(L) ≠ 0, we know that ln,n ≠ 0.

- Similarly, β = (1 − rTy)/ln,n.

3.2.5 Exercises

Exercise 3.3. Every step of Gaussian Elimination can

be thought of as a left multiplication by a unit lower

triangular matrix. That is, we obtain an upper triangular

matrix U by multiplying A by k unit lower triangular

matrices: L_k L_{k−1} · · · L_2 L_1 A = U, where each L_i = I + µ_{pq}E^{(pq)}, and E^{(pq)} is the matrix whose only non-zero entry is e_{pq} = 1. Give an expression for k in terms

of n.

Exercise 3.4. (⋆) Let L be a lower triangular n×n matrix. Show that det(L) = ∏_{j=1}^{n} l_{jj}. Hence give a necessary and

sufficient condition for L to be invertible. What does that

tell us about Unit Lower Triangular Matrices?

Exercise 3.5. Let L be a lower triangular matrix. Show

that each diagonal entry of L, ljj is an eigenvalue of L.

Exercise 3.6. Prove Parts (i)–(iii) of Theorem 3.2.6.

Exercise 3.7. Prove Theorem 3.2.7.

Exercise 3.8. Suppose the n × n matrices A and C

are both lower triangular matrices, and that there is a

n × n matrix B such that AB = C. Must B be a lower

triangular matrix?

Suppose A and C are unit lower triangular matrices, and

AB = C. Must B be a unit lower triangular matrix?

Why?

Exercise 3.9. (⋆) Construct an alternative proof of the first part of Theorem 3.2.6 (iv) as follows: Suppose that L is a non-singular lower triangular matrix. If b ∈ R^n is such that b_i = 0 for i = 1, . . . , k ≤ n, and y solves Ly = b, then y_i = 0 for i = 1, . . . , k. (Hint:

partition L by the first k rows and columns.)

Now use this to give an alternative proof of the fact

that the inverse of a lower triangular matrix is itself lower

triangular.


3.3 LU-factorisation

In this section we'll see that applying Gaussian elimination to solve the linear system Ax = b is equivalent to factoring A as the product of a unit lower triangular (LT) matrix and an upper triangular (UT) matrix.

3.3.1 Factorising A

In Section 3.2 we noted that each elementary row oper-

ation in Gaussian Elimination (GE) involves replacing A

with (I + µ_{rs}E^{(rs)})A. But (I + µ_{rs}E^{(rs)}) is a unit LT

matrix. Also, when we are finished we have an upper

triangular matrix. So we can write the whole process as

L_k L_{k−1} · · · L_2 L_1 A = U,

where each of the L_i is a unit LT matrix. But Theorem 3.2.6 tells us that the product of unit LT matrices is itself a unit LT matrix. So we can write the whole process as

MA = U,

where M = L_k · · · L_1 is a unit LT matrix. Theorem 3.2.6 also tells us that the inverse of a unit LT matrix exists and is a unit LT matrix, so, setting L = M^{−1}, we can write

A = LU,

where L is unit lower triangular and U is upper triangular.

This is called...

Definition 3.3.1. The LU-factorization of a matrix, A,

is a unit lower triangular matrix L and an upper triangular

matrix U such that LU = A.

Example 3.3.2. If A =

(  3  2
  −1  2 )
then:

Take notes:

Example 3.3.3. If A =

( 3 −1  1
  2  4  3
  0  2 −4 )
then:

Take notes:

3.3.2 A formula for LU-factorisation

In the above examples, we deduced the factorisation by inspection. But the process suggests an algorithm/formula. That is, we need to work out formulae for L and U where

a_{ij} = (LU)_{ij} = ∑_{k=1}^{n} l_{ik}u_{kj},   1 ≤ i, j ≤ n.

Since L and U are triangular:

If i ≤ j then a_{ij} = ∑_{k=1}^{i} l_{ik}u_{kj}.

If j < i then a_{ij} = ∑_{k=1}^{j} l_{ik}u_{kj}.

The first of these equations can be written as

a_{ij} = ∑_{k=1}^{i−1} l_{ik}u_{kj} + l_{ii}u_{ij}.

But l_{ii} = 1, so:

u_{ij} = a_{ij} − ∑_{k=1}^{i−1} l_{ik}u_{kj},   i = 1, . . . , n,  j = i, . . . , n.   (3.3.3a)

And from the second:

l_{ij} = (1/u_{jj}) ( a_{ij} − ∑_{k=1}^{j−1} l_{ik}u_{kj} ),   i = 2, . . . , n,  j = 1, . . . , i − 1.   (3.3.3b)

Example 3.3.4. Find the LU-factorisation of

A =
( −1  0  1  2
  −2 −2  1  4
  −3 −4 −2  4
  −4 −6 −5  0 )

Details of Example 3.3.4: First, using (3.3.3a) with i = 1 we have u_{1j} = a_{1j}:

U =
( −1  0       1       2
   0  u_{22}  u_{23}  u_{24}
   0  0       u_{33}  u_{34}
   0  0       0       u_{44} ).

Then (3.3.3b) with j = 1 we have l_{i1} = a_{i1}/u_{11}:

L =
( 1  0       0       0
  2  1       0       0
  3  l_{32}  1       0
  4  l_{42}  l_{43}  1 ).

Next (3.3.3a) with i = 2 we have u_{2j} = a_{2j} − l_{21}u_{1j}:

U =
( −1  0   1       2
   0 −2  −1       0
   0  0   u_{33}  u_{34}
   0  0   0       u_{44} ),


then (3.3.3b) with j = 2 we have l_{i2} = (a_{i2} − l_{i1}u_{12})/u_{22}:

L =
( 1  0  0       0
  2  1  0       0
  3  2  1       0
  4  3  l_{43}  1 )

Etc....
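Formulae (3.3.3a) and (3.3.3b) translate directly into code. Here is a minimal Python sketch (the function name lu_factorise is mine, not from the notes), applied to the matrix of Example 3.3.4; it fills in the remaining entries of L and U automatically.

```python
def lu_factorise(A):
    """LU-factorisation via (3.3.3a)/(3.3.3b): A = LU, L unit lower triangular."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0
        for j in range(i, n):                 # (3.3.3a): row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for r in range(i + 1, n):             # (3.3.3b): column i of L
            L[r][i] = (A[r][i] - sum(L[r][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

# The matrix of Example 3.3.4
A = [[-1, 0, 1, 2], [-2, -2, 1, 4], [-3, -4, -2, 4], [-4, -6, -5, 0]]
L, U = lu_factorise(A)
```

Multiplying L and U back together recovers A, which is a useful sanity check for any implementation.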

3.3.3 Existence of an LU-factorisation

Question: Is this factorisation possible for all matrices? The answer is no. For example, the formulae break down if one of the u_{jj} = 0 in (3.3.3b). We'll now characterise the matrices which can be factorised.

To prove the next theorem we need the Binet-Cauchy

Theorem: det(AB) = det(A) det(B).

Theorem 3.3.5. If n ≥ 2 and A ∈ R^{n×n} is such that its leading principal submatrix of order k is nonsingular for each 1 ≤ k < n, then A has an LU-factorisation.

Take notes:

3.3.4 Exercises

Exercise 3.10. (⋆) Many textbooks and computing systems compute the factorisation A = LDU, where L and U are unit lower and unit upper triangular matrices respectively, and D is a diagonal matrix. Show that such a factorisation exists provided that n ≥ 2, A ∈ R^{n×n}, and the leading principal submatrix of A of order k is nonsingular for each 1 ≤ k < n.

Exercise 3.11. Suppose that A ∈ R^{n×n} is nonsingular and symmetric, and that every leading principal submatrix of A is nonsingular. Use Exercise 3.10 to show that

A can be factorised as A = LDLT . (This is called an

“LDL-factorisation”).


3.4 Solving linear systems

Now that we know how to construct the LU-factorisation

of a matrix, we can use it to solve a linear system. We

also need to consider the computational efficiency of the

method.

3.4.1 Solving LUx = b

From Theorem 3.3.5 we can factorise an A ∈ R^{n×n} (that satisfies the theorem's hypotheses) as A = LU. But our overarching goal is to solve the problem: "find x ∈ R^n such that Ax = b, for some b ∈ R^n". We do this by first solving Ly = b for y ∈ R^n and then Ux = y. Because L and U are triangular, this is easy. Solving Ly = b is called forward-substitution, and solving Ux = y is called back-substitution.

Example 3.4.1. Use LU-factorisation to solve

( −1  0  1  2
  −2 −2  1  4
  −3 −4 −2  4
  −4 −6 −5  0 )
( x_1
  x_2
  x_3
  x_4 )
=
( −2
  −3
  −1
   1 )

Solution: Take the LU-factorisation from Example 3.3.4.

Then...

Take notes:
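The two triangular solves are easy to program. A Python sketch (the function names are mine), using the factors of the matrix of Example 3.3.4, with L written out in full (completing the computation started there):

```python
def forward_sub(L, b):
    """Solve Ly = b for lower triangular L (forward-substitution)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y

def back_sub(U, y):
    """Solve Ux = y for upper triangular U (back-substitution)."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

# Completed factors of the matrix in Example 3.3.4
L = [[1, 0, 0, 0], [2, 1, 0, 0], [3, 2, 1, 0], [4, 3, 2, 1]]
U = [[-1, 0, 1, 2], [0, -2, -1, 0], [0, 0, -3, -2], [0, 0, 0, -4]]
b = [-2, -3, -1, 1]
x = back_sub(U, forward_sub(L, b))   # the computed x satisfies LUx = b
```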

3.4.2 Pivoting

We did not cover this section in class; please read it in

your own time.

Example 3.4.2. Suppose we want to compute the LU-

factorisation of

A =

( 0  2 −4
  2  4  3
  3 −1  1 ).

We can't compute l_{21} because u_{11} = 0. But if we swap rows 1 and 3 we get the matrix in Example 3.3.3 and so can form an LU-factorisation. This is like changing the order of the linear equations we want to solve. If

P =
( 0 0 1
  0 1 0
  1 0 0 )
then PA =
( 3 −1  1
  2  4  3
  0  2 −4 ).

This is called pivoting, and P is a permutation matrix.

Definition 3.4.3. P ∈ Rn×n is a Permutation Matrix if

every entry is either 0 or 1 (it is a Boolean Matrix) and

if all the row and column sums are 1.

Theorem 3.4.4. For any A ∈ Rn×n there exists a per-

mutation matrix P such that PA = LU.

For a proof, see p. 53 of the textbook.
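The row swap in Example 3.4.2 can be checked numerically; a small Python sketch:

```python
# The row swap in Example 3.4.2, written as multiplication by a permutation matrix P
P = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
A = [[0, 2, -4], [2, 4, 3], [3, -1, 1]]
# Row i of PA is the row of A selected by the single 1 in row i of P
PA = [[sum(P[i][k] * A[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
```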

3.4.3 The computational cost

How efficient is the method of LU-factorization for solv-

ing Ax = b? That is, how many computational steps

(additions and multiplications) are required? In Section

2.6 of the textbook [1], you’ll find a discussion that goes

roughly as follows:

Suppose we want to compute l_{i,j}. From the formula (3.3.3b), we see that this would take j − 2 additions, j − 1

multiplications, 1 subtraction and 1 division: a total of

2j− 1 operations. We know (see Exercise 3.13) that

1 + 2 + · · · + k = k(k + 1)/2, and

1² + 2² + · · · + k² = k(k + 1)(2k + 1)/6.

So the number of operations required for computing L is

∑_{i=2}^{n} ∑_{j=1}^{i−1} (2j − 1) = ∑_{i=2}^{n} (i − 1)² = (n − 1)n(2n − 1)/6 ≤ Cn³

for some C. A similar (slightly smaller) number of opera-

tions is required for computing U. (For a slightly different

approach that yields cruder estimates, but requires a little

less work, have a look at Lecture 10 of Afternotes [3]).

This doesn’t tell us how long a computer program will

take to run, but it does tell us how the execution time

grows with n. For example, if n = 100 and the program

takes a second to execute, then if n = 1000 we’d expect

it to take about a thousand seconds.
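We can check the operation count empirically: summing 2j − 1 over the strictly lower-triangular positions gives ∑_{i=2}^{n} (i − 1)², which should agree with the closed form (n − 1)n(2n − 1)/6. A Python sketch:

```python
def ops_for_L(n):
    """Total operations to compute the strictly lower part of L,
    counting 2j - 1 operations for each entry l_{i,j}."""
    return sum(2 * j - 1 for i in range(2, n + 1) for j in range(1, i))

# Each inner sum is (i-1)^2, so the total is 1^2 + ... + (n-1)^2 = (n-1)n(2n-1)/6
for n in (2, 5, 10, 100):
    assert ops_for_L(n) == (n - 1) * n * (2 * n - 1) // 6
```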

3.4.4 Towards an error analysis

Unlike the other methods we studied so far in this course,

we shouldn’t have to do an error analysis, in the sense of

estimating the difference between the true solution, and

our numerical one. That is because the LU-factorisation

approach should give us exactly the true solution.

However, things are not that simple. Unlike the meth-

ods in earlier sections, the effects of (inexact) floating

point computations become very pronounced. In the next

section, we’ll develop the ideas needed to quantify these

effects.

3.4.5 Exercises

Exercise 3.12. Suppose that A is symmetric and has an

LDL-factorisation (see Exercises 3.11). How could this

factorization be used to solve Ax = b?

Exercise 3.13. Prove that

1 + 2 + · · · + k = k(k + 1)/2, and

1² + 2² + · · · + k² = k(k + 1)(2k + 1)/6.


3.5 Vector and Matrix Norms

Motivation: error analysis for linear solvers

As mentioned in Section 3.4.4, all computer implemen-

tations of algorithms that involve floating-point num-

bers (roughly, finite decimal approximations of real num-

bers) contain errors due to round-off error. We saw

this, for example, in labs when convergence of Newton’s

method stagnated when the approximation was within

about 10^{−16} of the true solution.

It transpires that computer implementations of LU-

factorization, and related methods, can lead to these

round-off errors being greatly magnified: this phenomenon

is the main focus of this section of the course.

You might remember from earlier sections of the course that we had to assume functions were well-behaved in the sense that

|f(x) − f(y)| / |x − y| ≤ L,

for some number L, so that our numerical schemes (e.g.,

fixed point iteration, Euler’s method, etc) would work. If

a function doesn’t satisfy a condition like this, we say it is

“ill-conditioned”. One of the consequences is that a small

error in the inputs gives a large error in the outputs. We’d

like to be able to express similar ideas about matrices:

that A(u − v) = Au − Av is not too “large” compared

to u− v. To do this we used the notion of a “norm” to

describing the relatives sizes of the vectors u and Au.

3.5.1 Three vector norms

When we want to consider the size of a real number,

without regard to sign, we use the absolute value. Im-

portant properties of this function are:

1. |x| ≥ 0 for all x.

2. |x| = 0 if and only if x = 0.

3. |λx| = |λ||x|.

4. |x + y| ≤ |x| + |y| (triangle inequality).

This notion can be extended to vectors and matrices.

Definition 3.5.1. Let R^n be the set of all vectors of length n of real numbers. The function ‖ · ‖ : R^n → R is called a norm on R^n if, for all u, v ∈ R^n,

1. ‖v‖ ≥ 0,

2. ‖v‖ = 0 if and only if v = 0,

3. ‖λv‖ = |λ|‖v‖ for any λ ∈ R,

4. ‖u + v‖ ≤ ‖u‖ + ‖v‖ (triangle inequality).

The norms of a vector give us some information about the size of the vector. But there are different ways of measuring the size: you could take the absolute value of the largest entry, you could look at the "distance" from the origin, etc. There are three important examples.

Definition 3.5.2. Let v ∈ Rn: v = (v1, v2, . . . , vn−1, vn)T .

(i) The 1-norm (also known as the Taxi cab or Man-

hattan norm) is

‖v‖_1 = ∑_{i=1}^{n} |v_i|.

(ii) The 2-norm (a.k.a. the Euclidean norm)

‖v‖_2 = ( ∑_{i=1}^{n} v_i² )^{1/2}.

Note, if v is a vector in R^n, then

v^Tv = v_1² + v_2² + · · · + v_n² = ‖v‖_2².

(iii) The ∞-norm (also known as the max-norm)

‖v‖_∞ = max_{i=1,...,n} |v_i|.

Example 3.5.3. If v = (−2, 4,−4)T then

Take notes:
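The three norms are one-liners in code. A Python sketch (function names are mine), evaluated at the vector of Example 3.5.3:

```python
import math

def norm_1(v):
    return sum(abs(x) for x in v)        # taxi-cab norm

def norm_2(v):
    return math.sqrt(sum(x * x for x in v))   # Euclidean norm

def norm_inf(v):
    return max(abs(x) for x in v)        # max-norm

v = [-2, 4, -4]
print(norm_1(v), norm_2(v), norm_inf(v))  # 10 6.0 4
```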

[Fig. 3.1: The unit "balls" in R²: the points with ‖x‖_1 = 1 form a diamond, those with ‖x‖_2 = 1 a circle, and those with ‖x‖_∞ = 1 a square; each passes through (1, 0), (0, 1), (−1, 0) and (0, −1).]


In Figure 3.1, the first diagram shows the unit ball in R² given by the 1-norm: the vectors x = (x_1, x_2) in R² with ‖x‖_1 = |x_1| + |x_2| = 1 all lie on the diamond (top left). In the second diagram, the vectors have √(x_1² + x_2²) = 1 and so are arranged in a circle (top right). The bottom diagram gives the unit ball in ‖ · ‖_∞, for which the largest component of each vector is 1.

It is easy to show that ‖·‖1 and ‖·‖∞ are norms. And

it is not hard to show that ‖ · ‖2 satisfies conditions (1),

(2) and (3) of Definition 3.5.1. But it takes a little bit of

effort to show that ‖ · ‖2 satisfies the triangle inequality.

Details are given in Section 3.5.9.

3.5.2 Matrix Norms

Definition 3.5.4. Given any norm ‖ · ‖ on Rn, there is

a subordinate matrix norm on Rn×n defined by

‖A‖ = max_{v ∈ R^n_⋆} ‖Av‖/‖v‖,   (3.5.1)

where A ∈ R^{n×n} and R^n_⋆ = R^n \ {0}.

You might wonder why we define a matrix norm like

this. The reason is that we like to think of A as an

operator on Rn: if v ∈ Rn then Av ∈ Rn. So rather

than the norm giving us information about the “size” of

the entries of a matrix, it tells us how much the matrix

can change the size of a vector.

3.5.3 Computing Matrix Norms

The formula for a subordinate matrix norm in Definition 3.5.4 is sensible, but not much use if we actually want to compute, say, ‖A‖_1, ‖A‖_∞ or ‖A‖_2. For a given A, we'd have to calculate ‖Av‖/‖v‖ for all v, and there are rather a lot of them. Fortunately, there are some easier

ways of computing the more important norms:

• The ∞-norm of a matrix is also the largest absolute-value row sum.

• The 1-norm of a matrix is also the largest absolute-value column sum.

• The 2-norm of the matrix A is the square root of the largest eigenvalue of A^TA.
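The first two characterisations are easy to implement. A Python sketch (function names are mine):

```python
def norm_inf_mat(A):
    """The infinity-norm: largest absolute-value row sum."""
    return max(sum(abs(x) for x in row) for row in A)

def norm_1_mat(A):
    """The 1-norm: largest absolute-value column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

A = [[1, -7], [-2, -3]]
# row sums are 8 and 5; column sums are 3 and 10
```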

3.5.4 The max-norm on Rn×n

Theorem 3.5.5. For any A ∈ Rn×n the subordinate

matrix norm associated with ‖ · ‖∞ on Rn can be com-

puted by

‖A‖_∞ = max_{i=1,...,n} ∑_{j=1}^{n} |a_{ij}|.

Take notes:

A similar result holds for the 1-norm; the proof is left to Exercise 3.17.

Theorem 3.5.6.

‖A‖_1 = max_{j=1,...,n} ∑_{i=1}^{n} |a_{i,j}|.   (3.5.2)

3.5.5 Computing ‖A‖_2

Computing the 2-norm of a matrix is a little harder than

computing the 1- or ∞-norms. However, later we’ll need

estimates not just for ‖A‖, but also ‖A−1‖. And, unlike

the 1- and ∞-norms, we can estimate ‖A−1‖2 without

explicitly forming A−1.

3.5.6 Eigenvalues

We begin by recalling some important facts about eigen-

values and eigenvectors.

Definition 3.5.7. Let A ∈ Rn×n. We call λ ∈ C an

eigenvalue of A if there is a non-zero vector x ∈ Cn

such that

Ax = λx.

We call any such x an eigenvector of A associated with λ.

Some properties of eigenvalues:

(i) If A is a real symmetric matrix (i.e., A = AT ), its

eigenvalues and eigenvectors are all real-valued.

(ii) If A is nonsingular and λ is an eigenvalue of A, then 1/λ is an eigenvalue of A^{−1}.

(iii) If x is an eigenvector associated with the eigenvalue

λ then so too is ηx for any non-zero scalar η.

(iv) An eigenvector may be normalised so that ‖x‖_2² = x^Tx = 1.

(v) There are n eigenvalues λ_1, λ_2, . . . , λ_n associated with the real symmetric matrix A. Let x^{(1)}, x^{(2)},

..., x(n) be the associated normalised eigenvectors.

Then the eigenvectors are linearly independent and


so form a basis for Rn. That is, any vector v ∈ Rn

can be written as a linear combination:

v =

n∑i=1

αix(i).

(vi) Furthermore, these eigenvectors are orthonormal:

(x^{(i)})^T x^{(j)} = 1 if i = j, and 0 if i ≠ j.

3.5.7 Singular values

The singular values of a matrix A are the square roots of

the eigenvalues of ATA. They play a very important role

in matrix analysis and in areas of applied linear algebra,

such as image and text processing. Our interest here is

in their relationship to ‖A‖2.

Lemma 3.5.8. For any matrix A, the eigenvalues of

ATA are real and non-negative.

Proof:

Take notes:

The importance of this result is that the square root of an eigenvalue of A^TA is real and non-negative. Furthermore, part of the above proof involved showing that, if (A^TA)x = λx, then

√λ = ‖Ax‖_2 / ‖x‖_2.

This at the very least tells us that

‖A‖_2 := max_{x ∈ R^n_⋆} ‖Ax‖_2/‖x‖_2 ≥ max_{i=1,...,n} √λ_i.

With a bit more work, we can show that if λ_1 ≤ λ_2 ≤ · · · ≤ λ_n are the eigenvalues of B = A^TA, then ‖A‖_2 = √λ_n.

Theorem 3.5.9. Let A ∈ R^{n×n}, and let λ_1 ≤ λ_2 ≤ · · · ≤ λ_n be the eigenvalues of B = A^TA. Then

‖A‖_2 = max_{i=1,...,n} √λ_i = √λ_n.

Important: here we will just outline the main idea. You should study the full proof yourself in the textbook.
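One practical consequence: ‖A‖_2 can be estimated without any eigenvalue library, by applying the power method to B = AᵀA. A Python sketch (this is an illustration, not the textbook's proof; it assumes B has a dominant eigenvalue):

```python
import math

def two_norm_estimate(A, iters=200):
    """Estimate ||A||_2 = sqrt(largest eigenvalue of B = A^T A)
    by the power method on B."""
    n = len(A)
    B = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        s = math.sqrt(sum(v * v for v in y))
        x = [v / s for v in y]               # keep the iterate normalised
    # The Rayleigh quotient x^T B x approximates the largest eigenvalue lambda_n
    lam = sum(x[i] * sum(B[i][j] * x[j] for j in range(n)) for i in range(n))
    return math.sqrt(lam)

# Sanity check on a diagonal matrix, where ||A||_2 is the largest |diagonal entry|
est = two_norm_estimate([[3, 0], [0, 1]])
```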

3.5.8 Exercises

Exercise 3.14. Show that, for any vector x ∈ R^n, ‖x‖_∞ ≤ ‖x‖_2 and ‖x‖_2² ≤ ‖x‖_1‖x‖_∞. For each of these inequalities, give an example for which the equality holds. Deduce that ‖x‖_∞ ≤ ‖x‖_2 ≤ ‖x‖_1.

Exercise 3.15. Show that if x ∈ R^n, then ‖x‖_1 ≤ n‖x‖_∞ and that ‖x‖_2 ≤ √n ‖x‖_∞.

Exercise 3.16. Show that, for any subordinate matrix

norm on Rn×n, the norm of the identity matrix is 1.

Exercise 3.17. Prove that

‖A‖_1 = max_{j=1,...,n} ∑_{i=1}^{n} |a_{i,j}|.

Hint: Suppose that

∑_{i=1}^{n} |a_{ij}| ≤ C,   j = 1, 2, . . . , n;

show that for any vector x ∈ R^n,

∑_{i=1}^{n} |(Ax)_i| ≤ C‖x‖_1.

Now find a vector x such that ∑_{i=1}^{n} |(Ax)_i| = C‖x‖_1, and deduce the result.

3.5.9 Appendix: 2-norm

As mentioned in Section 3.5.1, it takes a little effort to show that ‖ · ‖_2 is indeed a norm on R^n; in particular, to show that it satisfies the triangle inequality we need the Cauchy-Schwarz inequality.

Lemma 3.5.10 (Cauchy-Schwarz).

| ∑_{i=1}^{n} u_i v_i | ≤ ‖u‖_2‖v‖_2,   for all u, v ∈ R^n.

The proof can be found in any text-book on analysis.

We can now apply it to show that

‖u + v‖_2 ≤ ‖u‖_2 + ‖v‖_2.

This can be done as follows:

‖u + v‖_2² = (u + v)^T(u + v)
= u^Tu + 2u^Tv + v^Tv
≤ u^Tu + 2|u^Tv| + v^Tv   (since u^Tv ≤ |u^Tv|)
≤ u^Tu + 2‖u‖_2‖v‖_2 + v^Tv   (by Cauchy-Schwarz)
= (‖u‖_2 + ‖v‖_2)².

It follows directly that

Corollary 3.5.11. ‖ · ‖2 is a norm.


3.6 Condition Numbers

3.6.1 Consistency of matrix norms

It should be clear from (3.5.1) that, if ‖ · ‖ is a subordinate matrix norm then, for any u ∈ R^n and A ∈ R^{n×n},

‖Au‖ ≤ ‖A‖‖u‖.

Take notes:

It is an important result: we’ll need it later. There is an

analogous statement for the product of two matrices:

Definition 3.6.1 (Consistent norm). A matrix norm ‖ · ‖ is consistent (or "sub-multiplicative") if

‖AB‖ ≤ ‖A‖‖B‖,   for all A, B ∈ R^{n×n}.

Theorem 3.6.2. Any subordinate matrix norm is consis-

tent.

The proof is left to Exercise 3.18. You should note

that there are matrix norms which are not consistent.

See Exercise 3.19.

3.6.2 A short note on computer representation of numbers

This course is concerned with the analysis of numerical

methods to solve problems. It is implicit that these meth-

ods are implemented on a computer. However, comput-

ers are finite machines: there are limits on the amount of

information they can store. In particular there are limits

on the number of digits that can be stored for each num-

ber, and there are limits on the size of number that can

be stored. Because of this, just working with a computer

introduces an error.

Modern computers don’t store numbers in decimal

(base 10), but in binary (base 2) "floating point numbers" of the form ±a × 2^{b−1024}. Most often 8 bytes (64 bits, or binary digits) are used to store the three components of this number: the sign, a (the "significand" or "mantissa"), and b (which gives the exponent b − 1024). One bit is used for the sign, 52 for a, and 11 for b. This is called

double precision. Note that a has roughly 16 decimal

digits.

(Some older computer systems use single precision, where a has 23 bits (giving about 8 decimal digits) and b has 8; so too do many new GPU-based systems.)

When we try to store a real number x on a computer,

we actually store the nearest floating-point number. That

is, we end up storing x+δx, where |δx| is the “round-off”

error and |δx|/|x| is the relative error.

Since this is not a course on computer architecture,

we'll simplify a little and just take it that single and double precision systems lead to relative errors of 10^{−8} and 10^{−16}, respectively.
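The figure 10^{−16} for double precision can be observed directly: repeatedly halving ε until 1 + ε is rounded back to 1 recovers the unit roundoff 2^{−52}. A Python sketch:

```python
# Find the "unit roundoff" of double precision by halving until 1 + eps rounds to 1
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)  # 2.220446049250313e-16, i.e. 2**-52
```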

3.6.3 Condition Numbers

(Please refer to p68–70 of [1] for a thorough development of the concepts of local condition number and relative local condition number.)

Definition 3.6.3. The condition number of a matrix is

κ(A) = ‖A‖‖A−1‖.

If κ(A) ≫ 1 then we say A is ill-conditioned.

The following theorem tells us how the condition num-

ber of a matrix is related to (numerical) error.

Theorem 3.6.4. Suppose that A ∈ Rn×n is nonsingular

and that b, x ∈ Rn are non-zero vectors. If Ax = b and

A(x+ δx) = (b+ δb) then

‖δx‖/‖x‖ ≤ κ(A) ‖δb‖/‖b‖.

Take notes:

Theorem 3.6.4 means that, if we are solving the system

Ax = b but because of (e.g.) round-off error, there

is an error in the right-hand side, then the relative er-

ror in the solution is bounded by the condition number

of A multiplied by the relative error in the right-hand

side. Also, this result shows that the numerical error in

solving a linear system can be quite dependent on error

introduced by just using floating point approximations of

the entries in the right-hand side vector, rather than the

solution process itself. This important observation, and

the analysis above can be easily extended to the case

where, instead of solving A(x+δx) = (b+δb), we solve

(A+ δA)(a+ δx) = (b+ δb). That is, where the coeffi-

cient matrix contains errors due to round-off. A famous

example where this occurs is the Hilbert Matrix. See

the nice blog post by Cleve Moler on the issue: Cleve’s

Corner, Feb 2nd, 2013.


3.6.4 Calculating κ∞ and κ1

Example 3.6.5. Suppose we are using a computer to

solve Ax = b where

A =
( 10    12
  0.08  0.1 )
and b =
( 1
  1 ).

But, due to round-off error, the right-hand side has a relative error (in the ∞-norm) of 10^{−6}. Then we can bound the relative error, in the max-norm, as follows.

Take notes:
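As a check on the arithmetic of Example 3.6.5 (done here in code, using the cofactor formula for a 2×2 inverse): κ_∞(A) works out at 22 × 302.5 = 6655, so Theorem 3.6.4 bounds the relative error in x by about 6.7 × 10^{−3}.

```python
def inv2(A):
    """Inverse of a 2x2 matrix by the cofactor formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def norm_inf(A):
    """Largest absolute-value row sum."""
    return max(sum(abs(x) for x in row) for row in A)

A = [[10, 12], [0.08, 0.1]]
kappa = norm_inf(A) * norm_inf(inv2(A))   # kappa_inf(A) = 22 * 302.5 = 6655
bound = kappa * 1e-6                      # bound on ||dx||/||x|| from Theorem 3.6.4
```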

For every matrix norm we get a different condition

number. Consider the following example:

Example 3.6.6. Let A be the n × n matrix

A =
( 1 0 0 . . . 0
  1 1 0 . . . 0
  1 0 1 . . . 0
  ...
  1 0 0 . . . 1 ).

What are κ_1(A) and κ_∞(A)?

Take notes:

3.6.5 Estimating κ2

In the previous example we computed κ(A) by finding

the inverse of A. In general, that is not practical. How-

ever, we can estimate the condition number of a matrix.

In particular, this is usually possible for κ2(A). Recall

that Theorem 3.5.9 tells us that ‖A‖2 =√λn where λn

is the largest eigenvalue of B = ATA.

Next, observe that A^TA and AA^T are "similar", i.e., they have the same eigenvalues. This can be seen by noting that, if A^TAx = λx, then (AA^T)Ax = A(A^TA)x =

λ(Ax). That is, if λ is an eigenvalue of ATA, with cor-

responding eigenvector x, then λ is also an eigenvalue of

AAT with corresponding eigenvector Ax.

Then, because, for any non-singular matrix X, (XT )−1 =

(X−1)T , it follows that, if λ1 is the smallest eigenvalue of

ATA, then 1/λ1 is the largest eigenvalue of (ATA)−1.

We can conclude that

κ_2(A) = (λ_n/λ_1)^{1/2},

where λ1 and λn are, respectively, the smallest and largest

eigenvalues of B = ATA. Section 3.7 is dedicated to es-

timating λ1 and λn.

3.6.6 Exercises

Exercise 3.18. Prove that, if ‖·‖ is a subordinate matrix

norm, then it is consistent, i.e., for any pair of n × n matrices A and B,

‖AB‖ ≤ ‖A‖‖B‖.

Exercise 3.19. One might think it intuitive to define the

“max” norm of a matrix as follows:

‖A‖_∞ = max_{i,j} |a_{ij}|.

Show that this is indeed a norm on Rn×n. Show that,

however, it is not consistent.

Exercise 3.20. Let A be the matrix

A =

( 2 0 0 0 0
  1 2 0 0 0
  1 0 2 0 0
  1 0 0 2 0
  1 0 0 0 2 ).

What is κ1(A)? and κ∞(A)? (Hint: You’ll need to find

A−1. Recall that the inverse of a lower triangular matrix

is itself lower triangular).

Note: to do this using Maple:

> with(linalg):

> n:=5;

> f:= (i,j) -> piecewise( (i=j),2, (j=1),1);

> A := matrix(5,5, f);

> evalf(cond(A,1)); evalf(cond(A,2));

> evalf(cond(A,infinity));

To do this with MATLAB, try:

>> A = diag([2,2,2,2,2])

>> A(2:5,1)=1

>> cond(A,1)

>> cond(A,2)

>> cond(A,inf)

Exercise 3.21. Let A be the matrix

A =
( 0.1   0    0
  10    0.1  10
  0     0    0.1 )

Compute κ_∞(A). Suppose we wish to solve the system

of equations Ax = b on a single precision computer system (i.e., the relative error in any stored number is approximately 10^{−8}). Give an upper bound on the relative error

in the computed solution x.


3.7 Gerschgorin’s theorems

The goal of this final section is to learn a simple but very useful approach to estimating the eigenvalues of matrices. This can be used, for example, when computing

‖A‖2 and κ2(A).

This involves two theorems which together provide

quick estimates of the eigenvalues of a matrix.

3.7.1 Gerschgorin’s First Theorem

(See Section 5.4 of [1]).

Definition 3.7.1. Given a matrix A ∈ R^{n×n}, the Gerschgorin² discs³ D_i are the discs in the complex plane with centre a_{ii} and radius

r_i = ∑_{j=1, j≠i}^{n} |a_{ij}|.

So D_i = {z ∈ C : |a_{ii} − z| ≤ r_i}.

Theorem 3.7.2 (Gerschgorin’s First Theorem). All the

eigenvalues of A are contained in the union of the Ger-

schgorin discs.

Proof. Let λ be an eigenvalue of A, so Ax = λx for the corresponding eigenvector x. Suppose that x_i is the entry of x with largest absolute value, so that |x_i| = ‖x‖_∞.

Looking at the ith entry of the vector Ax we see that

(Ax)_i = λx_i  =⇒  ∑_{j=1}^{n} a_{ij}x_j = λx_i.

This can be rewritten as

a_{ii}x_i + ∑_{j=1, j≠i}^{n} a_{ij}x_j = λx_i,

which gives

(a_{ii} − λ)x_i = − ∑_{j=1, j≠i}^{n} a_{ij}x_j.

By the triangle inequality,

|a_{ii} − λ||x_i| = | ∑_{j=1, j≠i}^{n} a_{ij}x_j | ≤ ∑_{j=1, j≠i}^{n} |a_{ij}||x_j| ≤ |x_i| ∑_{j=1, j≠i}^{n} |a_{ij}|,

² Semyon Aranovich Gerschgorin, 1901–1933, Belarus. The ideas here appeared in a paper in 1931. They were subsequently popularised by others, most notably Olga Taussky, the pioneering matrix theorist (A recurring theorem on determinants, The American Mathematical Monthly, vol 56, p672–676, 1949).

³ We say "disc", rather than "circle", to make clear that it is the region bounded by a circle, and not just the circle itself.

since |x_i| ≥ |x_j| for all j. Dividing by |x_i| gives

|a_{ii} − λ| ≤ ∑_{j=1, j≠i}^{n} |a_{ij}|,

as required.

Note that we made no assumption on the symmetry

of A. However, if A is symmetric, then its eigenvalues are

real and so the theorem can be simplified: the eigenvalues

of A are contained in the union of the intervals I_i = [a_{ii} − r_i, a_{ii} + r_i], for i = 1, . . . , n.

Example 3.7.3. Let

A =
(  4 −2  1
  −2 −3  0
   1  0  2 )

Take notes:

3.7.2 Gerschgorin’s 2nd Theorem

The first theorem proves that every eigenvalue is con-

tained in a disc. The converse, that every disc contains

an eigenvalue, is not true in general. However, something

similar does hold.

Theorem 3.7.4 (Gerschgorin's Second Theorem). If k of the discs are disjoint from the others (that is, their union has an empty intersection with the remaining discs), then their union contains exactly k eigenvalues.

Proof. We won’t do the proof in class, and you are not

expected to know it. Here is a sketch of it: let B(ε) be

the matrix with entries

b_{ij} = a_{ij} if i = j, and b_{ij} = εa_{ij} if i ≠ j.

So B(1) = A and B(0) is the diagonal matrix whose entries are the diagonal entries of A.

The eigenvalues of B(0) are its diagonal entries, and (obviously) they coincide with the (degenerate) Gerschgorin discs of B(0), which are the centres of the Gerschgorin discs of A.

The eigenvalues of B(ε) are the zeros of the characteristic polynomial det(B(ε) − λI). Since the coefficients

of this polynomial depend continuously on ε, so too do

the eigenvalues.

Now as ε varies from 0 to 1, the eigenvalues of B(ε)

trace a path in the complex plane, and at the same time


the radii of the Gerschgorin discs of A increase from 0 to

the radii of the discs of A. If a particular eigenvalue was

in a certain disc for ε = 0, the corresponding eigenvalue

is in the corresponding disc for all ε.

Thus if one of the discs of A is disjoint from the

others, it must contain an eigenvalue.

The same reasoning applies if k of the discs of A

are disjoint from the others; their union must contain k

eigenvalues.

3.7.3 Using Gerschgorin’s theorems

Example 3.7.5. Locate the regions containing the eigenvalues of

A =
( −3  1  2
   1  4  0
   2  0 −6 )

(The actual eigenvalues are approximately −7.018, −2.130 and 4.144.)

Take notes:
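Computing the discs is mechanical; a Python sketch for the matrix of Example 3.7.5 (the function name is mine), checking that the quoted approximate eigenvalues do lie in the union of the discs:

```python
def gerschgorin_discs(A):
    """Return (centre, radius) for each Gerschgorin disc of A."""
    n = len(A)
    return [(A[i][i], sum(abs(A[i][j]) for j in range(n) if j != i))
            for i in range(n)]

A = [[-3, 1, 2], [1, 4, 0], [2, 0, -6]]
discs = gerschgorin_discs(A)             # [(-3, 3), (4, 1), (-6, 2)]
for lam in (-7.018, -2.130, 4.144):      # approximate eigenvalues, as quoted above
    assert any(abs(lam - c) <= r for c, r in discs)
```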

Example 3.7.6. Use Gerschgorin’s Theorems to find an

upper and lower bound for the Singular Values of the

matrix

A =
( 4 −1  2
  2  3  1
  1  1  4 )

and hence find an upper bound for κ_2(A).

Take notes:

Example 3.7.7. Let D ∈ Rn×n be the matrix with the

same diagonal entries as A, and zeros for all the off-

diagonals. That is:

D =
( a_{1,1}  0        . . .  0            0
  0        a_{2,2}  . . .  0            0
  ...               . . .
  0        0        . . .  a_{n−1,n−1}  0
  0        0        . . .  0            a_{n,n} )

Some iterative schemes (e.g., Jacobi’s and Gauss-Seidel)

for solving systems of equations involve successive multi-

plication by the matrix T = D−1(A −D). Proving that

these methods work often involves showing that if λ is

an eigenvalue of T then |λ| < 1. Using Gerschgorin’s

theorem, we can show that this is indeed the case if A is

strictly diagonally dominant.
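A quick numerical illustration of that last claim (the matrix A below is my own example): for strictly diagonally dominant A, every row of T = D^{−1}(A − D) has zero diagonal and absolute row sum less than 1, so by Gerschgorin's First Theorem every eigenvalue of T satisfies |λ| < 1.

```python
def jacobi_iteration_matrix(A):
    """T = D^{-1}(A - D): zero the diagonal, scale each row i by 1/a_ii."""
    n = len(A)
    return [[0.0 if i == j else A[i][j] / A[i][i] for j in range(n)]
            for i in range(n)]

# A strictly diagonally dominant example
A = [[4, -1, 0], [-1, 4, -1], [0, -1, 3]]
T = jacobi_iteration_matrix(A)
radii = [sum(abs(x) for x in row) for row in T]
# Every Gerschgorin disc of T is centred at 0 with radius < 1,
# so all eigenvalues of T satisfy |lambda| < 1
assert all(r < 1 for r in radii)
```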

3.7.4 Exercises

Exercise 3.22. A real matrix A = {ai,j} is Strictly Di-

agonally Dominant if

|a_{ii}| > ∑_{j=1, j≠i}^{n} |a_{i,j}|   for i = 1, . . . , n.

Show that all strictly diagonally dominant matrices are

nonsingular.

Exercise 3.23. Let

A =
(  4 −1  0
  −1  4 −1
   0 −1 −3 )

Use Gerschgorin’s theorems to give an upper bound for

κ2(A).