The University of Sussex – Department of Mathematics
G1110 & 852G1 – Numerical Linear Algebra
Lecture Notes – Autumn Term 2010
Kerstin Hesse
[Figure 1: Geometric explanation of the Householder matrix H(w): the reflection of a vector a in the hyperplane Sw orthogonal to w.]
Lecture notes and course material by Holger Wendland, David Kay, and others, who taught the course 'Numerical Linear Algebra' at the University of Sussex, served as a starting point for the current lecture notes. The current notes are about twice as long as the previous version. Apart from corrections and improvements, many new examples and some linear algebra revision sections have been added compared to the previous lecture notes.
Contents

Introduction and Motivation
0.1 Motivation: An Interpolation Problem
0.2 Motivation: A Boundary Value Problem

1 Revision: Some Linear Algebra
1.1 Vectors in Rn and Cn
1.2 Matrices
1.3 Determinants of Square Matrices
1.4 Inverse Matrix of a Square Matrix
1.5 Eigenvalues and Eigenvectors of a Square Matrix
1.6 Other Notation: The Landau Symbol

2 Matrix Theory
2.1 The Eigensystem of a Square Matrix
2.2 Upper Triangular Matrices and Back Substitution
2.3 Schur Factorization: A Triangular Canonical Form
2.4 Vector Norms
2.5 Matrix Norms
2.6 Spectral Radius of a Matrix

3 Floating Point Arithmetic and Stability
3.1 Condition Numbers of Matrices
3.2 Floating Point Arithmetic
3.3 Conditioning
3.4 Stability
3.5 An Example of a Backward Stable Algorithm: Back Substitution

4 Direct Methods for Linear Systems
4.1 Standard Gaussian Elimination
4.2 The LU Factorization
4.3 Pivoting
4.4 Cholesky Factorisation
4.5 QR Factorization

5 Iterative Methods for Linear Systems
5.1 Introduction
5.2 Fixed Point Iterations
5.3 The Jacobi and Gauss-Seidel Iterations
5.4 Relaxation

6 The Conjugate Gradient Method
6.1 The Generic Minimization Algorithm
6.2 Minimization with A-Conjugate Search Directions
6.3 Convergence of the Conjugate Gradient Method

7 Calculation of Eigenvalues
7.1 Basic Localisation Techniques
7.2 The Power Method
7.3 Inverse Iteration
7.4 The Jacobi Method
7.5 Householder Reduction to Hessenberg Form
7.6 QR Algorithm
Introduction and Motivation
The topics of this course center around the numerical solution of linear systems and the
computation of eigenvalues.
Let A ∈ Rm×n be an m×n matrix and let b ∈ Rm be a vector. How do we find the (approximate)
solution x ∈ Rn to the linear system
Ax = b
if the matrix A is very large, say, if A is a 10^6 × 10^6 matrix? Unlike in linear algebra, where
we have learnt under what assumptions on A and b a (unique) solution exists, here the focus
is on how this system should be solved with the help of a computer. In devising
algorithms for the numerical solution of such linear systems, we will exploit the properties
of the matrix A.
If the matrix A ∈ Rn×n is a square matrix, then we may want to find the eigenvalues λ and
the corresponding eigenvectors x ∈ Rn, that is,
Ax = λx.
While we have learnt in linear algebra results on the existence of the eigenvalues and correspond-
ing eigenvectors, numerical linear algebra is concerned with the numerical computation of
the eigenvalues on a computer for large square matrices A.
The numerical solution of large linear systems and the numerical computation of eigenvalues are
some of the most important topics in numerical analysis. For example, the approximation or
interpolation of measured data, or the discretization of a differential equation lead to a linear
system. The discretization of a differential equation can also lead to the problem of finding the
eigenvalues of a matrix.
We discuss two motivating examples that illustrate how the problem of the (numerical) solution
of a large linear system might arise in applications.
0.1 Motivation: An Interpolation Problem
Suppose we are given N data sites a ≤ x1 < x2 < . . . < xN ≤ b in [a, b] and corresponding
observations f1, f2, . . . , fN ∈ R. Suppose further that the observations follow an unknown
generation process, that is, there is an unknown function f such that f(xi) = fi. (For example,
the data could be temperatures measured at a fixed time at equidistant locations along a thin
metal rod. In this case we would like to use the measured temperature data to derive a function
of the location along the thin metal rod that describes the temperature along the rod at the
fixed time.)
One possibility to reconstruct the unknown function f is to choose a set of N continuous
basis functions ϕ1, . . . , ϕN ∈ C([a, b]) and to approximate f by a function s of the form
s(x) = ∑_{j=1}^{N} αj ϕj(x),

where the coefficients are determined by the interpolation conditions

fi = s(xi) = ∑_{j=1}^{N} αj ϕj(xi), i = 1, 2, . . . , N.
This leads to a linear system, which can be written in matrix form as

[ ϕ1(x1) ϕ2(x1) · · · ϕN(x1) ] [ α1 ]   [ f1 ]
[ ϕ1(x2) ϕ2(x2) · · · ϕN(x2) ] [ α2 ]   [ f2 ]
[   ⋮       ⋮     ⋱     ⋮    ] [ ⋮  ] = [ ⋮  ]
[ ϕ1(xN) ϕ2(xN) · · · ϕN(xN) ] [ αN ]   [ fN ]   (0.1)
The matrix in the linear system is an N × N matrix, and if N is large, say N ≥ 10^6, then such a system is not easy to solve.

The focus of this course is on how to solve linear systems, such as, for example, the one in the interpolation problem (0.1) above. In this course, we will not discuss interpolation problems themselves, although they are a very interesting area of study and research. Nevertheless, we make two comments on the problem above.
Remark 0.1 (comments on the interpolation problem)
(i) One crucial issue is the choice of the basis functions ϕ1, ϕ2, . . . , ϕN . For example,
possible choices could be:
(a) polynomials: ϕj(x) = x^{j−1}, j = 1, 2, . . . , N, or

(b) shifted Gaussians: ϕj(x) = e^{−(x−xj)²}, j = 1, 2, . . . , N, where the xj in the definition of ϕj is the data site xj.
However, the choice of the basis functions is not arbitrary but will, in applications, be
determined by information about the kind of measured data that is approximated.
(ii) Also, if the data f1, f2, . . . , fN is measured data, then it is usually not exact but contains measurement errors (noise), such that fi = f(xi) + εi, where εi is the measurement error. In this case interpolation usually leads to a rather poor approximation of the data, and it would be better to use an approximation scheme that imposes conditions demanding only s(xi) ≈ fi, i = 1, 2, . . . , N.
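To make this concrete, here is a minimal MATLAB sketch that assembles the matrix in (0.1) and solves for the coefficients; the data sites, the observations, and the choice of the shifted Gaussians from Remark 0.1(i)(b) are purely illustrative assumptions:

N = 50;
x = linspace(0, 1, N)';      % data sites x_1 < x_2 < ... < x_N (illustrative)
f = sin(2*pi*x);             % observations f_i = f(x_i) (illustrative)

% interpolation matrix M with M(i,j) = phi_j(x_i) for the shifted
% Gaussians phi_j(x) = exp(-(x - x_j)^2); implicit expansion (R2016b+)
M = exp(-(x - x').^2);

alpha = M \ f;               % solve the linear system (0.1)

% evaluate the interpolant s(t) = sum_j alpha_j phi_j(t) at t = 0.3
t = 0.3;
s_t = exp(-(t - x').^2) * alpha;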
0.2 Motivation: A Boundary Value Problem
Consider the following one-dimensional boundary value problem: find a function u(x) that is
twice continuously differentiable and satisfies the differential equation
− d²u/dx²(x) + u(x) = f(x) on (0, 1), (0.2)
subject to the boundary conditions
u(0) = 0 and u(1) = 0. (0.3)
The function f in (0.2) is continuous on [0, 1].
One way to solve this boundary value problem numerically is to use finite differences. The
basic idea here is to approximate the derivative by difference quotients:
u′(x) ≈ (u(x + h) − u(x))/h   or   u′(x) ≈ (u(x) − u(x − h))/h. (0.4)
The first formula in (0.4) is a forward difference while the second one is a backward dif-
ference. Using first a forward and then a backward rule for the second derivative leads to the
centred second difference
u′′(x) ≈ (u′(x + h) − u′(x))/h ≈ (1/h) ( (u(x + h) − u(x))/h − (u(x) − u(x − h))/h )
       = (u(x + h) − 2 u(x) + u(x − h))/h². (0.5)
Finite Difference Method:
When finding a numerical approximation using finite differences we divide the interval [0, 1]
into n+1 subintervals of equal length h = 1/(n+1) with endpoints at the equally spaced nodes
xi = i h = i/(n + 1), i = 0, 1, . . . , n, n + 1.
Our aim is to construct a vector u = uh = (u0, u1, . . . , un, un+1)T such that uj is an approxi-
mation of u(xj), j = 0, 1, . . . , n, n + 1, where u denotes the (exact) solution to the boundary
value problem (0.2) and (0.3).
Expressing (0.2) and (0.3) on the grid x0, x1, . . . , xn, xn+1 and replacing the derivatives by finite
differences, with the help of (0.5), we obtain
− (u(xi+1) − 2 u(xi) + u(xi−1))/h² + u(xi) ≈ f(xi), i = 1, 2, . . . , n, (0.6)
and
u(x0) = 0 and u(xn+1) = 0. (0.7)
Replacing in (0.6) and (0.7) the values u(xj) by the approximations uj and using the abbrevi-
ation fi := f(xi), we get the equations
− (ui+1 − 2 ui + ui−1)/h² + ui = fi, i = 1, 2, . . . , n, (0.8)
and
u0 = 0 and un+1 = 0. (0.9)
These equations (0.8) and (0.9) lead to the following linear system for the computation of the
finite difference approximation u = uh = (u0, u1, . . . , un, un+1)T :
       [  1     0     0     0   · · ·   0     0   ] [ u0   ]   [ 0    ]
       [ −1   2+h²   −1     0   · · ·   0     0   ] [ u1   ]   [ f1   ]
       [  0    −1   2+h²   −1     ⋱     ⋮     0   ] [ u2   ]   [ f2   ]
1/h² · [  ⋮     0     ⋱     ⋱     ⋱     0     ⋮   ] [ ⋮    ] = [ ⋮    ]
       [  0     ⋮     ⋱    −1   2+h²   −1     0   ] [ un−1 ]   [ fn−1 ]
       [  0     0   · · ·   0    −1   2+h²   −1   ] [ un   ]   [ fn   ]
       [  0     0     0   · · ·   0     0     1   ] [ un+1 ]   [ 0    ]   (0.10)
The involved matrix, which we denote by A, is in R(n+2)×(n+2) and is tridiagonal. With
f := (0, f1, f2, . . . , fn, 0)T , we can write (0.10) as Au = f .
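As an illustration, here is a minimal MATLAB sketch that assembles the matrix A of (0.10) in sparse storage and solves Au = f; the number of nodes and the right-hand side (chosen so that the exact solution is sin(πx)) are illustrative assumptions:

n = 100;                               % number of interior nodes (illustrative)
h = 1/(n+1);
x = (0:n+1)' * h;                      % grid x_0, ..., x_{n+1}
f = @(t) pi^2*sin(pi*t) + sin(pi*t);   % illustrative right-hand side

% interior rows: (1/h^2) * [-1, 2+h^2, -1] on the three diagonals
e = ones(n+2,1);
A = spdiags([-e, (2+h^2)*e, -e], -1:1, n+2, n+2) / h^2;
A(1,:)   = 0;  A(1,1)     = 1;   % boundary row enforcing u_0 = 0
A(end,:) = 0;  A(end,end) = 1;   % boundary row enforcing u_{n+1} = 0

rhs = f(x);  rhs(1) = 0;  rhs(end) = 0;
u = A \ rhs;                     % finite difference approximation u_h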
Remark 0.2 (comments on the finite difference approximation)
(i) This system of equations is sparse, that is, the number of non-zero entries is much smaller than (n+2)². This sparsity can be exploited for more efficient storage, by storing only the non-zero entries of the matrix. The sparsity also reduces the number of operations required for matrix-matrix and matrix-vector multiplications.
(ii) To obtain an accurate approximation to the true solution u we may have to choose h very
small (very fine step size), thereby increasing the size of the linear system.
(iii) For a general boundary value problem in N dimensions the size of the linear system can grow rapidly; for example, in three dimensions the linear system grows by a factor of about 8 with each uniform refinement of the domain.
In this course we will learn about direct methods (for example, Gaussian elimination) and
iterative methods (that is, the construction of a sequence of improving approximations to
the solution) that are used to numerically solve linear systems of equations (such as the ones
encountered in this section and the previous section).
We will look at how efficient (how much time and memory are required?) and stable (do they
give good approximations and do they converge and under what conditions?) these methods
are.
Chapter 1
Revision: Some Linear Algebra
In this chapter we first introduce common notation and give a brief revision of some definitions
and results from linear algebra that we will frequently use in this course.
In this course capital (upper-case) letters A, B, C, . . . usually denote matrices, and bold-face
lower-case letters a,b,x,y, . . . denote vectors. Functions are denoted by lower-case letters.
1.1 Vectors in Rn and Cn
A vector x in Rn (or Cn) is a column vector

x = (x1, x2, . . . , xn)T, where x1, x2, . . . , xn ∈ R (or x1, x2, . . . , xn ∈ C).
The vector 0 is the zero vector, where all entries are zero.
We denote by ei in Rn (or in Cn) the standard ith basis vector containing a one in the ith component and zeros elsewhere. For example, in R3 and C3, the standard basis vectors are

e1 = (1, 0, 0)T, e2 = (0, 1, 0)T, e3 = (0, 0, 1)T.
The vectors x1,x2, . . . ,xm in Rn (or in Cn) are linearly independent if the following holds true: If

∑_{j=1}^{m} aj xj = a1 x1 + a2 x2 + . . . + am xm = 0 (1.1)
with the real (complex) numbers a1, a2, . . . , am, then the numbers aj , j = 1, 2, . . . , m, are all
zero.
In other words, x1,x2, . . . ,xm are linearly independent if the only real (complex) numbers
a1, a2, . . . , am for which (1.1) holds are a1 = a2 = . . . = am = 0.
The vectors x1,x2, . . . ,xm in Rn (or in Cn) are linearly dependent if they are not linearly independent. This means x1,x2, . . . ,xm in Rn (or in Cn) are linearly dependent if there exist real (complex) numbers a1, a2, . . . , am, not all zero, such that (1.1) holds.
Any m > n vectors in Rn (in Cn) are linearly dependent.
Any set of n linearly independent vectors in Rn (in Cn) is a basis for Rn (for Cn). If
v1,v2, . . . ,vn is a basis for Rn (for Cn), then the following holds: For every vector x in Rn
(in Cn), there exist uniquely determined real (complex) numbers a1, a2, . . . , an such that
x = ∑_{j=1}^{n} aj vj = a1 v1 + a2 v2 + . . . + an vn.
For a column vector x in Rn or in Cn we denote by xT the transposed (row) vector, that is,

x = (x1, x2, . . . , xn)T and xT = (x1, x2, . . . , xn).
Likewise the transpose of a row vector y = (y1, y2, . . . , yn) is the corresponding column vector yT = (y1, y2, . . . , yn)T.
For a column vector x ∈ Cn we denote by x∗ := x̄T the conjugate (row) vector, that is,

x = (x1, x2, . . . , xn)T and x∗ = x̄T = (x̄1, x̄2, . . . , x̄n).
Here, ȳ indicates taking the complex conjugate of y ∈ C, that is, if y = a + i b with a, b ∈ R and i the imaginary unit, then ȳ = a − i b. Likewise the conjugate of a complex row vector y is the corresponding conjugate column vector y∗ := ȳT, that is,

y = (y1, y2, . . . , yn) and y∗ = ȳT = (ȳ1, ȳ2, . . . , ȳn)T.
For complex numbers y = a + i b ∈ C with a, b ∈ R, we have

|y| = √(ȳ y) = √((a − i b)(a + i b)) = √(a² + b²).
The Euclidean inner product of two real-valued vectors x,y ∈ Rn is given by

xT y = xT · y = ∑_{j=1}^{n} xj yj = x1 y1 + x2 y2 + . . . + xn yn.
We note that the Euclidean inner product for Rn is symmetric, that is, xT y = yT x for any
x,y ∈ Rn.
The Euclidean inner product of two complex vectors x,y ∈ Cn is given by

x∗ y = x∗ · y = x̄T · y = ∑_{j=1}^{n} x̄j yj = x̄1 y1 + x̄2 y2 + . . . + x̄n yn.

For any x,y ∈ Cn, the Euclidean inner product for Cn satisfies that x∗ y is the complex conjugate of y∗ x.
The Euclidean norm of a vector x ∈ Rn (or x ∈ Cn) is defined by

‖x‖2 = √(xT x) = (∑_{j=1}^{n} |xj|²)^{1/2} or ‖x‖2 = √(x∗ x) = (∑_{j=1}^{n} |xj|²)^{1/2}.
The geometric interpretation of the Euclidean norm ‖x‖2 of a vector x in Rn or Cn is that ‖x‖2
measures the length of x.
We say that two vectors x and y in Rn (or in Cn) are orthogonal (to each other) if xT y = 0 (or if x∗ y = 0, respectively). For vectors in Rn, this means geometrically that the angle between these two vectors is π/2, that is, 90°. A set of m vectors x1,x2, . . . ,xm in Rn (or in Cn) is called orthogonal if they are mutually orthogonal, that is, if xj is orthogonal to xk whenever j ≠ k. It is easily checked that a set of orthogonal non-zero vectors is, in particular, also linearly independent.

A basis v1,v2, . . . ,vn of Rn is called an orthonormal basis for Rn if the basis vectors all have length one and are mutually orthogonal, that is,

‖vj‖2 = 1 for j = 1, 2, . . . , n, and vjT vk = 0 for j, k = 1, 2, . . . , n with j ≠ k.

Likewise, a basis v1,v2, . . . ,vn of Cn is called an orthonormal basis for Cn if the basis vectors all have length one and are mutually orthogonal, that is,

‖vj‖2 = 1 for j = 1, 2, . . . , n, and vj∗ vk = 0 for j, k = 1, 2, . . . , n with j ≠ k.
An orthonormal basis v1,v2, . . . ,vn of Rn has a very useful property: Any vector x ∈ Rn has the representation

x = ∑_{j=1}^{n} (vjT x) vj (1.2)
as a linear combination of the basis vectors v1,v2, . . . ,vn.
The validity of (1.2) is easily established: Assume that x = ∑_{j=1}^{n} aj vj, and take the Euclidean inner product with vk. Because v1,v2, . . . ,vn form an orthonormal basis, vkT vj = 0 if j ≠ k and vkT vk = ‖vk‖2² = 1. Thus

vkT x = vkT (∑_{j=1}^{n} aj vj) = ∑_{j=1}^{n} aj vkT vj = ak vkT vk = ak ‖vk‖2² = ak.

Replacing aj = vjT x in x = ∑_{j=1}^{n} aj vj now verifies (1.2).
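As a quick numerical check of (1.2), here is a small MATLAB sketch; the orthonormal basis of R², obtained by rotating the standard basis, is an illustrative assumption:

theta = pi/6;                       % rotate the standard basis by 30 degrees
v1 = [cos(theta); sin(theta)];      % orthonormal basis of R^2
v2 = [-sin(theta); cos(theta)];

x = [3; -1];                        % arbitrary test vector
x_rec = (v1'*x)*v1 + (v2'*x)*v2;    % representation (1.2)
norm(x - x_rec)                     % should be (numerically) zero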
Analogously to (1.2), an orthonormal basis v1,v2, . . . ,vn of Cn has the following property: Any vector x ∈ Cn has the representation

x = ∑_{j=1}^{n} (vj∗ x) vj (1.3)
as a linear combination with respect to the orthonormal basis v1,v2, . . . ,vn.
Exercise 1 State the properties of an inner product/scalar product for a complex vector space
V , and verify that the Euclidean inner product for Cn has these properties.
Exercise 2 Show formula (1.3).
1.2 Matrices
The matrix A ∈ Cm×n (or A ∈ Rm×n) is an m × n (m rows and n columns) matrix with complex-valued (or real-valued) entries:

A := (ai,j) = (ai,j)1≤i≤m; 1≤j≤n := [ a1,1 a1,2 · · · a1,n ]
                                    [ a2,1 a2,2 · · · a2,n ]
                                    [  ⋮    ⋮          ⋮  ]
                                    [ am,1 am,2 · · · am,n ] ,

where ai,j ∈ C (or ai,j ∈ R, respectively).
Occasionally, we will denote the column vectors of a matrix A = (ai,j) in Cm×n (or in Rm×n) by aj, j = 1, 2, . . . , n, that is,

A := (a1, a2, . . . , an), with aj = (a1,j, a2,j, . . . , am,j)T ∈ Cm (or ∈ Rm), j = 1, 2, . . . , n.
To denote the (i, j)th entry ai,j of A = (ai,j) we may occasionally also write Ai,j := ai,j or
A(i, j) := ai,j.
A matrix is called square if it has the same number of rows and columns. Thus square matrices are matrices in Cn×n or Rn×n. The diagonal of a square matrix A = (ai,j) in Cn×n or in Rn×n consists of the entries aj,j, j = 1, 2, . . . , n.

Vectors are special cases of matrices, and a column vector x ∈ Cn (or x ∈ Rn) is just an n × 1 matrix. Likewise a row vector in Cn (or Rn) is just a 1 × n matrix.
Two matrices of special importance are the m × n zero matrix, and, among the square matrices, the n × n identity matrix: The zero matrix in Cm×n and in Rm×n is the m × n matrix that has all entries zero. We denote the m × n zero matrix by 0. The identity matrix in Cn×n and in Rn×n is the n × n matrix in which the entries on the diagonal are all one and all other entries are zero. We denote the n × n identity matrix by I. For example, in C3×3 and R3×3 we have

0 = [ 0 0 0 ]       I = [ 1 0 0 ]
    [ 0 0 0 ]  and      [ 0 1 0 ]
    [ 0 0 0 ]           [ 0 0 1 ] .
The scalar multiplication of a matrix A = (ai,j) in Cm×n (or in Rm×n) with a complex (or
real) number µ is defined componentwise, that is, µ A in Cm×n (or in Rm×n, respectively) is
defined by
(µ A)i,j := µ ai,j, i = 1, 2, . . . , m; j = 1, 2, . . . , n. (1.4)
The addition of two m× n matrices A = (ai,j) and B = (bi,j) in Cm×n (or in Rm×n) is defined
componentwise, that is, A + B in Cm×n (or in Rm×n, respectively) is defined by
(A + B)i,j := ai,j + bi,j, i = 1, 2, . . . , m; j = 1, 2, . . . , n. (1.5)
The set Cm×n (or Rm×n) of complex (or real) m×n matrices with the scalar multiplication (1.4)
and the addition (1.5) forms a complex vector space (or real vector space, respectively).
The matrix multiplication A B of A = (ai,j) ∈ Cm×n and B = (bi,j) ∈ Cn×p (or A = (ai,j) ∈ Rm×n and B = (bi,j) ∈ Rn×p) gives the matrix C = (ci,j) ∈ Cm×p (or C = (ci,j) ∈ Rm×p, respectively), with the entries

ci,j = ∑_{k=1}^{n} ai,k bk,j = ai,1 b1,j + ai,2 b2,j + . . . + ai,n bn,j, i = 1, 2, . . . , m; j = 1, 2, . . . , p.
In words, ci,j is computed by taking the Euclidean inner product of the ith row vector of A
with the jth column vector of B.
Note that for square matrices A and B in Cn×n (or in Rn×n), both A B and B A are defined, but in general A B ≠ B A, that is, matrix multiplication is not commutative.
Thus the Euclidean inner product x∗ y of two vectors x and y in Cn (and the Euclidean inner product xT y of two vectors x and y in Rn) is just the matrix multiplication of the 1 × n matrix x∗ = x̄T (and xT, respectively) with the n × 1 matrix y.

The outer product B = (bi,j) ∈ Cn×n of two vectors x,y ∈ Cn is given by

B = x y∗, where bi,j := xi ȳj, i = 1, 2, . . . , n; j = 1, 2, . . . , n.

Analogously, the outer product of x and y in Rn is x yT = (xi yj) ∈ Rn×n.
In analogy to the transpose of a vector in Rn and the conjugate of a vector in Cn, we define the transposed matrix of a matrix A ∈ Rm×n and the Hermitian conjugate matrix of a matrix A ∈ Cm×n.
The transposed matrix or transpose of a matrix A = (ai,j) in Rm×n (or in Cm×n) is the matrix AT in Rn×m (or in Cn×m) whose (i, j)th entry is given by

(AT)i,j = aj,i.
A square matrix A = (ai,j) in Rn×n (or in Cn×n) is called symmetric if AT = A, that is,
ai,j = aj,i for all i, j = 1, 2, . . . , n.
The Hermitian conjugate (matrix) (or adjoint (matrix)) of A = (ai,j) ∈ Cm×n is the matrix A∗ := ĀT ∈ Cn×m whose (i, j)th entry is

(A∗)i,j = āj,i.

A square matrix A = (ai,j) ∈ Cn×n is called Hermitian (or self-adjoint) if A∗ = A, that is, āj,i = ai,j for all i, j = 1, 2, . . . , n.
For A ∈ Rm×n and B ∈ Rn×p, we have
(A B)T = BT AT ,
and for A ∈ Cm×n and B ∈ Cn×p, we have
(A B)∗ = B∗ A∗.
Let A = (ai,j) ∈ Cm×n be a complex m × n matrix. The null space or kernel of the matrix A is defined by

null(A) = ker(A) := {x ∈ Cn : Ax = 0}. (1.6)

The range of the matrix A is defined by

range(A) := {y ∈ Cm : Ax = y for some x ∈ Cn}. (1.7)
The range of A is the linear space spanned by the column vectors of A.
Analogous statements hold for real matrices A ∈ Rm×n with the only difference that Cn and Cm in (1.6) and (1.7) need to be replaced by Rn and Rm, respectively.
The rank of a matrix A ∈ Cm×n or A ∈ Rm×n is the dimension of the range of A, that is,

rank(A) := dim(range(A)).
The trace of a square matrix A = (ai,j) ∈ Cn×n or A = (ai,j) ∈ Rn×n, denoted by trace(A), is the sum of its diagonal entries, that is,

trace(A) = ∑_{j=1}^{n} aj,j = a1,1 + a2,2 + . . . + an,n.
Symmetric matrices in Rn×n and Hermitian matrices in Cn×n may have the following useful
properties:
A square matrix A = (ai,j) ∈ Rn×n is called positive definite if A is symmetric (that is, A satisfies AT = A) and

xT Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} ai,j xi xj > 0 for all x ∈ Rn with x ≠ 0.

A square matrix A = (ai,j) ∈ Rn×n is called positive semidefinite if A is symmetric and

xT Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} ai,j xi xj ≥ 0 for all x ∈ Rn.
A square matrix A = (ai,j) ∈ Cn×n is said to be positive definite if A is Hermitian (that is, A satisfies A∗ = A) and

x∗ Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} ai,j x̄i xj > 0 for all x ∈ Cn with x ≠ 0.

A square matrix A = (ai,j) ∈ Cn×n is said to be positive semidefinite if A is Hermitian and

x∗ Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} ai,j x̄i xj ≥ 0 for all x ∈ Cn.
We have the following useful property of positive definite matrices:
If A = (ai,j) in Rn×n (or in Cn×n) is positive definite, then the upper principal submatrices Ap :=
(ai,j)1≤i,j≤p, p ∈ {1, 2, . . . , n}, are positive definite, and det(Ap) > 0 for all p ∈ {1, 2, . . . , n}.
Theorem 1.1 (characterization of positive definite matrices)
(i) A symmetric matrix A ∈ Rn×n (that is, A = AT ) is positive definite if and only if all
its eigenvalues are positive.
(ii) An Hermitian matrix A ∈ Cn×n (that is, A = A∗) is positive definite if and only if all
its eigenvalues are positive.
Exercise 3 Show that Cm×n with the usual matrix addition and scalar multiplication is a com-
plex vector space.
Exercise 4 Find the range and the null space of the following matrix:

A := [  1  0  −1  2 ]
     [  0  1   3  1 ]
     [ −1  1   5  0 ] .
Exercise 5 For a matrix A ∈ Cm×n show that the range of A is the linear space spanned by
the column vectors of A.
Exercise 6 Which, if any, of the following square matrices are symmetric or Hermitian? (Note here i is always the imaginary unit!)

A := [ −1  2  i ]
     [  2  i  3 ]
     [ −i  3  1 ] ,

B := [ −1  2  −i ]
     [  2  7   5 ]
     [  i  5   3 ] ,

C := [  2  −2   8 ]
     [ −2   7  −1 ]
     [  8  −1   3 ] .
Exercise 7 Show that (A B)T = BT AT for any A ∈ Rm×n and B ∈ Rn×p. Show that (A B)∗ = B∗ A∗ for any A ∈ Cm×n and B ∈ Cn×p.
Exercise 8 Compute the trace of the 3 × 3 matrix

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] . (1.8)
Exercise 9 Show that the symmetric real 3×3 matrix A given by (1.8) in the previous question
is positive definite.
Exercise 10 Let A ∈ Cn×n be a positive definite matrix. If C ∈ Cn×m, show that:

(a) C∗ A C is positive semidefinite.

(b) rank(C∗ A C) = rank(C).

(c) C∗ A C is positive definite if and only if rank(C) = m.
1.3 Determinants of Square Matrices
In this subsection let A be a square matrix with either real or complex entries.
The determinant det(A) of a 2 × 2 matrix

A = [ a b ]
    [ c d ]

is defined by

det(A) = | a b |
         | c d | = a d − b c.
The determinant det(A) of a 3 × 3 matrix

A = [ a1,1 a1,2 a1,3 ]
    [ a2,1 a2,2 a2,3 ]
    [ a3,1 a3,2 a3,3 ]

is defined by

det(A) = a1,1 C1,1 − a1,2 C1,2 + a1,3 C1,3,

where C1,1, −C1,2, and C1,3 are the so-called cofactors of a1,1, a1,2, and a1,3, respectively, and are defined by

C1,1 = | a2,2 a2,3 |    C1,2 = | a2,1 a2,3 |    C1,3 = | a2,1 a2,2 |
       | a3,2 a3,3 | ,         | a3,1 a3,3 | ,         | a3,1 a3,2 | .
We observe that C1,j is the determinant of the 2×2 submatrix of A that is obtained by deleting
the 1st row and jth column of A.
The procedure for computing the determinant of a 3 × 3 matrix can be generalized to the following formula for computing the determinant of an n × n matrix, where n ≥ 2:

The determinant det(A) of the n × n matrix A = (ai,j) is given by

det(A) = ∑_{j=1}^{n} ai,j (−1)^{i+j} Ci,j,

for any i ∈ {1, 2, . . . , n}, where Ci,j is the determinant of the (n−1) × (n−1) submatrix of A obtained by deleting the ith row and jth column from A.
Equivalently, we can also compute the determinant of an n × n matrix A = (ai,j) by expanding with respect to a column:

det(A) = ∑_{i=1}^{n} ai,j (−1)^{i+j} Ci,j,

for any j ∈ {1, 2, . . . , n}, where Ci,j is the determinant of the (n−1) × (n−1) submatrix of A obtained by deleting the ith row and jth column from A.
We note that for all n × n matrices A,
det(A) = det(AT ).
Let A and B be n × n matrices. Then the determinant of a product of two matrices satisfies
det(A B) = det(A) det(B).
Exercise 11 Compute the determinant of the 3 × 3 matrix

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] .
Exercise 12 Prove the following statement: If A = (ai,j) in Rn×n is positive definite, then the
upper principal submatrices Ap := (ai,j)1≤i,j≤p, p ∈ {1, 2, . . . , n}, are positive definite.
Exercise 13 Prove that for all square n × n matrices A we have det(A) = det(AT ).
1.4 Inverse Matrix of a Square Matrix
A square matrix A in Cn×n (or in Rn×n) is called invertible or non-singular if there exists a matrix X in Cn×n (or in Rn×n, respectively) such that
A X = X A = I,
where I is the n× n identity matrix. The matrix X is then called the inverse (matrix) of A,
and we usually denote the inverse matrix of A by A−1.
If a square matrix A in Cn×n or Rn×n is not invertible, then we call A singular.
A fundamental result about inverse matrices is the following: A square matrix A is invertible (or non-singular) if and only if det(A) ≠ 0.

Equivalently, a square matrix A is singular if and only if det(A) = 0.
The easiest way to compute the inverse of a non-singular matrix A by hand is to write the
augmented matrix (A|I) and then transform this system with elementary row operations such
that we have the identity matrix on the left-hand side; then we obtain (I|A−1) with the inverse
matrix A−1 on the right-hand side.
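In MATLAB the same row reduction of the augmented matrix (A|I) can be carried out with rref; a small sketch for an illustrative 2 × 2 matrix:

A = [2 1; 1 1];              % illustrative non-singular matrix
R = rref([A, eye(2)]);       % row-reduce the augmented matrix (A|I)
Ainv = R(:, 3:4);            % the right-hand block is the inverse of A
norm(A*Ainv - eye(2))        % check: should be (numerically) zero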
With the help of the inverse matrix we can solve linear systems as follows. Assume A is a
non-singular square n×n matrix in Cn×n and b is a given vector in Cn. Then the linear system
Ax = b
has the solution x = A−1 b, which follows easily from

A−1 b = A−1 (Ax) = (A−1 A)x = I x = x,
where we have used the fact that matrix multiplication is associative. We will see in this
course that computing the inverse of a large matrix is a rather expensive process, and therefore
computing the inverse and then using x = A−1 b is numerically not a good way to solve large
linear systems.
For two invertible (or non-singular) square n × n matrices A and B we have
(A B)−1 = B−1A−1,
and, for invertible matrices A ∈ Rn×n and B ∈ Cn×n, we have
(AT )−1 = (A−1)T and (B∗)−1 = (B−1)∗.
For a non-singular matrix A in Cn×n (or in Rn×n), we have

det(A−1) = (det(A))−1.
A square matrix Q ∈ Rn×n is said to be orthogonal (or an orthogonal matrix) if

QT = Q−1 ⇔ (QT Q = I and Q QT = I). (1.9)

A square matrix Q ∈ Cn×n is said to be unitary (or a unitary matrix) if

Q∗ = Q−1 ⇔ (Q∗ Q = I and Q Q∗ = I). (1.10)

The second characterization in (1.9) and (1.10), respectively, tells us that the column vectors qi, i = 1, 2, . . . , n, of Q are an orthonormal basis of Rn and Cn, respectively. That is, (1.9) is equivalent to

qiT qj = δi,j, i, j = 1, 2, . . . , n,

and (1.10) is equivalent to

qi∗ qj = q̄iT qj = δi,j, i, j = 1, 2, . . . , n,

respectively, where δi,j is the Kronecker delta, defined to be one if i = j and zero otherwise. Likewise, the row vectors of Q form an orthonormal basis of Rn and Cn, respectively.
Exercise 14 Show that the 3 × 3 matrix

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ]
is non-singular. Compute the inverse matrix A−1 of A.
Exercise 15 Let A and B in Cn×n be invertible matrices. Prove the following statements:
(a) (A B)−1 = B−1A−1.
(b) (A−1)∗ = (A∗)−1. Use this result to conclude that (AT )−1 = (A−1)T if A is in Rn×n.
(c) det(A−1) = (det(A))−1.
Exercise 16 Show that a positive definite matrix A ∈ Rn×n is non-singular and that the inverse
matrix A−1 is also positive definite.
Exercise 17 Show that the inverse of a unitary matrix is unitary. Use this result to show that the unitary matrices in Cn×n with the matrix multiplication form a (multiplicative) group.
Exercise 18 Consider the matrix A ∈ R3×3 and the vector b ∈ R3 given by

A = [  1  −1  0 ]
    [ −1   2  1 ]       and b = (2, −2, 4)T.
    [  0   1  3 ]
(a) Compute the determinant of A.
(b) Is A invertible? Why?
(c) Compute the inverse matrix A−1 of A.
(d) Use the inverse matrix A−1 to solve the linear system Ax = b.
(e) Show that A is positive definite. (Hint: use Theorem 1.1.)
1.5 Eigenvalues and Eigenvectors of a Square Matrix

In this section, we consider Rn×n as a subset of Cn×n, so that all definitions for complex n × n matrices also apply to real n × n matrices.
Let A be a square matrix in Cn×n. A complex number λ ∈ C is an eigenvalue of A if there
exists a non-zero vector x ∈ Cn \ {0} such that
Ax = λx. (1.11)
The vector x in (1.11) is then called an eigenvector of A with the eigenvalue λ.
By writing (1.11) equivalently as
λx − Ax = 0 ⇔ (λ I − A)x = 0,
we see that a non-zero vector x satisfying (1.11) exists if and only if det(λ I − A) = 0. The
determinant
p(A, λ) := det(λ I − A) (1.12)
is a polynomial in λ of exact degree n, and (1.12) is called the characteristic polynomial
of A. Clearly, the (complex) roots of the characteristic polynomial are the eigenvalues of the
matrix A. By the fundamental theorem of algebra, any (complex) polynomial of exact degree
n has n complex roots, counted with multiplicity. Therefore, any square matrix A ∈ Cn×n
has n complex eigenvalues, counted with multiplicity.
To compute the eigenvalues and the corresponding eigenvectors of a square matrix
A by hand, we proceed as follows: First we compute the characteristic polynomial p(A, λ) =
det(λ I − A) and find its roots. These roots are the eigenvalues of A. For each eigenvalue λ of
A, we solve the linear system (λ I − A)x = 0 to find the eigenvectors x corresponding to λ.
For a real n × n matrix A ∈ Rn×n, the characteristic polynomial p(A, λ) = det(λ I − A) has n complex roots (counted with multiplicity). In general, some (or even all) of these roots may not be in R. Thus, for the special case A ∈ Rn×n, we can in general not conclude that A has n real eigenvalues, counted with multiplicity.
It is clear that for large n the computation of the eigenvalues is a far from trivial problem.
1.6 Other Notation: The Landau Symbol
For two functions f, g : N → R, we will write f = O(g) if there is a constant C > 0 and N ∈ N
such that
|f(n)| ≤ C|g(n)| for all n ≥ N.
The symbol O is called the Landau symbol.
For example, consider the matrix-vector multiplication Ax, where A ∈ Cn×n and x ∈ Cn. Since the ith component of Ax is given by

(Ax)i = ∑_{j=1}^{n} ai,j xj, (1.13)

the number of multiplications in (1.13) is n and the number of additions in (1.13) is n − 1, so that the number of elementary operations to compute (Ax)i is 2n − 1, that is, O(n). The total number of operations to compute a matrix-vector multiplication is therefore n(2n − 1) = 2n² − n, that is, O(n²).
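The count can be read off from a naive implementation; here is a MATLAB sketch of the matrix-vector product with an explicit operation counter (the function name matvec_count is, of course, just an illustrative choice):

function [y, ops] = matvec_count(A, x)
% naive matrix-vector product y = A*x with an elementary operation count
n = size(A,1);
y = zeros(n,1);
ops = 0;
for i = 1:n
    for j = 1:n
        y(i) = y(i) + A(i,j) * x(j);   % one multiplication, one addition
        ops = ops + 2;
    end
end
% ops equals 2*n^2, i.e. O(n^2); the exact count 2*n^2 - n from the text
% is obtained if the first term of each row sum is assigned, not added
end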
In numerical linear algebra the number of elementary operations needed to execute an algorithm
is of great interest, since it determines the runtime and efficiency of the algorithm. Usually
information about the cost or number of elementary operations is not given as an exact
figure but rather by listing it as O(n), O(n2), O(n3), . . . , as appropriate, in terms of the
dimension n of the problem. In the example above the dimension is the number of components
n of the involved vectors.
Chapter 2
Matrix Theory
In this chapter we learn various basics from matrix theory and encounter the first numerical
algorithm.
In Section 2.1, we revise some facts and results about the eigenvalues and eigenvectors of a
square matrix in Cn×n. These facts are needed throughout the course, and we will come back
to them at various stages later-on. In Section 2.2, we introduce upper triangular matrices
(and lower triangular matrices), and we learn how a linear system with an upper triangular
matrix can be easily solved by back substitution. In Section 2.3, we learn about the Schur
factorization: Given a square matrix A in Cn×n, the Schur factorization guarantees that
there exists a unitary matrix S, such that the matrix
U = S∗ A S = S−1 A S
is an upper triangular matrix. This result will be exploited at various stages later in the course.
In Section 2.4, we revise some facts about vector norms, and in Section 2.5, we introduce a
variety of matrix norms that will be used throughout the course. In Section 2.6, we define the
spectral radius of a square matrix A ∈ Cn×n which is the maximum of the absolute values of the
n complex eigenvalues of A. In formulas, let λ1, λ2, . . . , λn ∈ C be the n complex eigenvalues
(counted with multiplicity) of A ∈ Cn×n; then the spectral radius of A ∈ Cn×n is defined by
ρ(A) := max{ |λ1|, |λ2|, . . . , |λn| }.
We also learn a result about the relation between the spectral radius and matrix norms.
2.1 The Eigensystem of a Square Matrix
The material in this section on the eigensystem, that is, the eigenvalues and eigenvectors, of
a square matrix should be familiar from linear algebra. However, it is strongly recommended
that you carefully go through this section, revise the material, and solve the exercises!
In this section real square matrices A ∈ Rn×n are considered as the special case of matrices in Cn×n with real entries; so all results for matrices in Cn×n hold also for matrices in Rn×n.

Consider a square matrix A in Cn×n,

A = (ai,j) = [ a1,1 a1,2 · · · a1,n ]
             [ a2,1 a2,2 · · · a2,n ]
             [  ⋮    ⋮    ⋱     ⋮  ]
             [ an,1 an,2 · · · an,n ] .
Definition 2.1 (eigenvalues and eigenvectors)
A complex number λ ∈ C is called an eigenvalue of A = (ai,j) ∈ Cn×n if there exists a
non-zero vector x ∈ Cn such that
Ax = λx ⇔ λx − Ax = (λ I − A)x = 0, (2.1)
where I is the n × n identity matrix. A vector x satisfying (2.1) is called an eigenvector
corresponding to the eigenvalue λ.
From (2.1) it is clear that, once we know an eigenvalue λ, any corresponding eigenvector x can
be computed by solving the linear system (λ I − A)x = 0. The linear system (λ I − A)x = 0
has non-zero solutions if and only if det(λ I − A) = 0. Thus we have found a criterion for
determining whether a complex number is an eigenvalue: λ is an eigenvalue of A if and only if
det(λ I − A) = 0. From the properties of the determinant it is easily seen that
det(λ I − A) = λ^n + c_{n−1} λ^{n−1} + . . . + c_1 λ + c_0, (2.2)
with suitable complex coefficients c0, c1, . . . , cn−1, that is, det(λ I − A) is a polynomial in λ of
exact degree n. Thus any eigenvalue of A is a root of the polynomial det(λ I − A).
Theorem 2.2 (roots of the characteristic polynomial are eigenvalues)
Let A ∈ Cn×n. A complex number λ ∈ C is an eigenvalue of A if and only if it is a root
of the characteristic polynomial of A, defined by
p(A, λ) := det(λ I − A).
Counted with multiplicity, the characteristic polynomial has exactly n complex roots,
that is, A has exactly n complex eigenvalues, counted with multiplicity.
The last statement in the theorem above follows from the fundamental theorem of algebra.
Definition 2.3 (spectrum of a matrix)
Let A ∈ Cn×n. The set of all eigenvalues of A is called the spectrum of A and is denoted
by

Λ(A) := {λ ∈ C : det(λ I − A) = 0}.
Once we have found the n complex eigenvalues of A ∈ Cn×n, we can compute the corresponding
eigenvectors x to an eigenvalue λ by solving the linear system (λ I − A)x = 0.
If an eigenvalue occurs with a multiplicity k > 1, there can be at most k linearly independent eigenvectors. It is not difficult to show that the set of eigenvectors to a given eigenvalue λ forms a vector space, called the eigenspace of the eigenvalue λ. Indeed let λ be an eigenvalue of A and define the eigenspace of λ by

Eλ(A) := {x ∈ Cn : Ax = λx}.

Now consider two elements x and y from Eλ(A). Then the linear combination αx + β y is also in Eλ(A), since

A(αx + β y) = α Ax + β Ay = α λx + β λy = λ (αx + β y).
This guarantees closure under vector addition and scalar multiplication, and hence Eλ ⊂ Cn is
a vector space.
Example 2.4 (eigenvalues and eigenvectors)
Compute the eigenvalues and eigenvectors of the matrix

A = [  0  −2  2 ]
    [ −2  −3  2 ]
    [ −3  −6  5 ] .
Solution: We compute the roots of the characteristic polynomial

det(λ I − A) = det [ λ   2    −2  ]
                   [ 2  λ+3   −2  ]
                   [ 3   6   λ−5  ]

             = λ (λ + 3) (λ − 5) − 12 − 24 + 6 (λ + 3) + 12 λ − 4 (λ − 5)

             = λ³ − 2 λ² − λ + 2 = (λ − 2) (λ − 1) (λ + 1),

where the roots were determined by guessing that λ = 1 is a root and then using polynomial long division and factorizing the resulting quadratic. Thus we have found that λ1 = 2, λ2 = 1 and λ3 = −1 are the eigenvalues of A.
To compute the corresponding eigenvectors xj, we solve for each eigenvalue λj the linear system
(λj I−A)xj = 0, which we write in augmented matrix form as (λj I−A|0) and solve by Gaussian
elimination.
For λ1 = 2, we have

(2 I − A)x1 = 0 ⇔ [ 2 2 −2 ]
                  [ 2 5 −2 ] x1 = 0,
                  [ 3 6 −3 ]

which we write in augmented matrix form as

[ 2 2 −2 | 0 ]
[ 2 5 −2 | 0 ]
[ 3 6 −3 | 0 ] .
We multiply the first row of the augmented matrix by 1/2, and in the next step, we add (−2) times the new first row to the second row and (−3) times the new first row to the third row:

⇔ [ 1 1 −1 | 0 ]    ⇔ [ 1 1 −1 | 0 ]
  [ 2 5 −2 | 0 ]      [ 0 3  0 | 0 ]
  [ 3 6 −3 | 0 ]      [ 0 3  0 | 0 ] .
Now we subtract the new second row from the new third row, and subsequently we divide the new second row by 3:

⇔ [ 1 1 −1 | 0 ]    ⇔ [ 1 1 −1 | 0 ]
  [ 0 3  0 | 0 ]      [ 0 1  0 | 0 ]
  [ 0 0  0 | 0 ]      [ 0 0  0 | 0 ] .
Finally, we subtract the second row from the first row and obtain

⇔ [ 1 0 −1 | 0 ]    ⇔ [ 1 0 −1 ]
  [ 0 1  0 | 0 ]      [ 0 1  0 ] x1 = 0 ⇔ x1 = α (1, 0, 1)T,
  [ 0 0  0 | 0 ]      [ 0 0  0 ]

where α ∈ R. Thus all eigenvectors x1 corresponding to λ1 = 2 are of the form x1 = α (1, 0, 1)T, where α ∈ R \ {0}.
For λ2 = 1, we have

(I − A)x2 = 0 ⇔ [ 1 2 −2 ]
                [ 2 4 −2 ] x2 = 0,
                [ 3 6 −4 ]

which we write in augmented matrix form as

[ 1 2 −2 | 0 ]
[ 2 4 −2 | 0 ]
[ 3 6 −4 | 0 ] .
We multiply the first row by (−2) and add it to the second row, and we multiply the first row by (−3) and add it to the third row. Subsequently we subtract the new second row from the new third row. Then we divide the new second row by 2:

⇔ [ 1 2 −2 | 0 ]    ⇔ [ 1 2 −2 | 0 ]    ⇔ [ 1 2 −2 | 0 ]
  [ 0 0  2 | 0 ]      [ 0 0  2 | 0 ]      [ 0 0  1 | 0 ]
  [ 0 0  2 | 0 ]      [ 0 0  0 | 0 ]      [ 0 0  0 | 0 ] .
Finally we add 2 times the second row to the first row. Thus

⇔ [ 1 2 0 | 0 ]    ⇔ [ 1 2 0 ]
  [ 0 0 1 | 0 ]      [ 0 0 1 ] x2 = 0 ⇔ x2 = α (2, −1, 0)T,
  [ 0 0 0 | 0 ]      [ 0 0 0 ]

with α ∈ R. Thus all eigenvectors corresponding to the eigenvalue λ2 = 1 are of the form x2 = α (2, −1, 0)T with α ∈ R \ {0}.
For λ3 = −1, we have

(−I − A)x3 = 0 ⇔ [ −1 2 −2 ]
                 [  2 2 −2 ] x3 = 0,
                 [  3 6 −6 ]

which we write in augmented matrix form as

[ −1 2 −2 | 0 ]
[  2 2 −2 | 0 ]
[  3 6 −6 | 0 ] .
We multiply the first row by 2 and add it to the second row, and we multiply the first row by 3 and add it to the third row. Subsequently we multiply the new second row by (−2) and add it to the third row. Afterwards we divide the new second row by 6:

⇔ [ −1  2  −2 | 0 ]    ⇔ [ −1 2 −2 | 0 ]    ⇔ [ −1 2 −2 | 0 ]
  [  0  6  −6 | 0 ]      [  0 6 −6 | 0 ]      [  0 1 −1 | 0 ]
  [  0 12 −12 | 0 ]      [  0 0  0 | 0 ]      [  0 0  0 | 0 ] .
Finally we multiply the first row by (−1) and add 2 times the second row to it:

⇔ [ 1 0  0 | 0 ]    ⇔ [ 1 0  0 ]
  [ 0 1 −1 | 0 ]      [ 0 1 −1 ] x3 = 0 ⇔ x3 = α (0, 1, 1)T,
  [ 0 0  0 | 0 ]      [ 0 0  0 ]

where α ∈ R. Thus all eigenvectors corresponding to the eigenvalue λ3 = −1 are of the form x3 = α (0, 1, 1)T, with α ∈ R \ {0}.
We summarize what we have found so far: the spectrum of A is

Λ(A) = {−1, 1, 2}, (2.3)

and the eigenspaces Eλ(A) of the eigenvalues λ are

E2 = {α (1, 0, 1)T : α ∈ R},
E1 = {α (2, −1, 0)T : α ∈ R},
E−1 = {α (0, 1, 1)T : α ∈ R}. (2.4)

This completes the example. □
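We can quickly check the computed spectrum in MATLAB (eig returns the eigenvalues, not necessarily in any particular order):

A = [0 -2 2; -2 -3 2; -3 -6 5];
lambda = eig(A)          % expected: -1, 1, 2 (in some order)
[V, D] = eig(A);         % columns of V are (normalized) eigenvectors
norm(A*V - V*D)          % should be (numerically) zero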
Exercise 19 Compute the eigenvalues and corresponding eigenvectors of the 3 × 3 matrix

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] .
Exercise 20 Consider the matrix A ∈ R3×3 given by

A = [  1  −1  0 ]
    [ −1   2  1 ]
    [  0   1  3 ] .
(a) Compute the eigenvalues of A by hand.
(b) Compute all eigenvectors to the eigenvalue that is an integer by hand.
An important property of eigenvalues is that they are invariant under so-called basis transfor-
mations or similarity transformations.
Lemma 2.5 (eigenvalues are invariant under basis transformation)
Let A ∈ Cn×n, and let S ∈ Cn×n with det(S) ≠ 0. Then S−1 A S is called a basis transformation or similarity transformation of A, and

det(λ I − S−1 A S) = det(λ I − A),

so that A and S−1 A S have the same eigenvalues. We say the eigenvalues of A are invariant under a basis transformation or a similarity transformation.
Proof of Lemma 2.5. From det(B C) = det(B) det(C) and S−1 I S = S−1 S = I,

det(λ I − S−1 A S) = det(S−1 (λ I − A) S) = det(S−1) det(λ I − A) det(S),

and, since det(S−1) = (det(S))−1, the result follows. □
Lemma 2.5 gives us the following idea: If we are only interested in the eigenvalues of a square
matrix A ∈ Cn×n, then it would be useful to find a suitable non-singular matrix S ∈ Cn×n,
such that the eigenvalues of S−1 A S are easier to compute.
In order to execute this idea, we first need to understand the nature of eigenvectors better.
The following lemma is elementary but has far reaching consequences.
Lemma 2.6 (eigenvectors to different eigenvalues are linearly independent)
Let A ∈ Cn×n. Eigenvectors to different eigenvalues of A are linearly independent. More
precisely, let λ1, λ2, . . . , λm be m distinct eigenvalues of A, and let x1,x2, . . . ,xm be cor-
responding eigenvectors, that is, Axj = λj xj for j = 1, 2, . . . , m. Then the eigenvectors
x1,x2, . . . ,xm are linearly independent.
Proof of Lemma 2.6. The proof is given by induction on m ≥ 2.
Initial step m = 2: Consider two different eigenvalues λ1 and λ2 and let x1 and x2 be two
corresponding eigenvectors, that is, Axi = λi xi, i = 1, 2. To show the linear independence of
x1 and x2 consider
α1 x1 + α2 x2 = 0. (2.5)
If (2.5) implies α1 = α2 = 0, then we have shown that x1 and x2 are linearly independent.
Assume therefore that at least one of the coefficients α1 and α2 in (2.5) is non-zero, say α1 ≠ 0.
Then from (2.5)

x1 = − (α2/α1) x2. (2.6)
Multiplying from the left with A on both sides of (2.6) yields

Ax1 = − (α2/α1) Ax2 ⇒ λ1 x1 = − (α2/α1) λ2 x2 = λ2 x1 ⇒ (λ1 − λ2)x1 = 0, (2.7)
where we have used (2.6) in the middle equation. Since λ1 − λ2 ≠ 0 and x1 ≠ 0, we have (λ1 − λ2)x1 ≠ 0, and the last formula in (2.7) is a contradiction. We see that only α1 = α2 = 0 is possible in (2.5), and hence x1 and x2 are linearly independent.
Induction step m → m + 1: The induction step is left as an exercise. □
Exercise 21 Show the induction step in the proof of Lemma 2.6.
After these preparations we can present one of the major theorems of linear algebra.
Theorem 2.7 (basis transformation into diagonal form)
Let A ∈ Cn×n, and let λ1, λ2, . . . , λn ∈ C denote its n complex eigenvalues. If there are
n linearly independent corresponding eigenvectors x1,x2, . . . ,xn (that is, Axj = λj xj
for j = 1, 2, . . . , n), then the n eigenvectors x1,x2, . . . ,xn form a basis of Cn. Under this
assumption, let S denote the matrix that contains the eigenvectors x1,x2, . . . ,xn as column
vectors, that is,
S := (x1,x2, . . . ,xn).
Then the basis transformation (or similarity transformation) S−1 A S yields the diagonal
matrix with the eigenvalues λ1, λ2, . . . , λn along the diagonal. In formulas,
S−1 A S = [ λ1  0   0  · · ·  0 ]
          [ 0   λ2  0  · · ·  0 ]
          [ 0   0   λ3   ⋱    ⋮ ]
          [ ⋮   ⋮    ⋱    ⋱   0 ]
          [ 0   0  · · ·  0  λn ] .
The proof of this theorem is surprisingly simple and very intuitive, and greatly helps in under-
standing Theorem 2.7.
Proof of Theorem 2.7. We first consider A S. Since the jth column vector of S = (si,j) is
the eigenvector xj (that is, xj = (s1,j, s2,j, . . . , sn,j)T ) and Axj = λj xj, we have that
(A S)k,j = ∑_{i=1}^{n} ak,i si,j = λj sk,j, k = 1, 2, . . . , n. (2.8)
In other words, A S is the matrix whose jth column is given by λj xj . The (i, j)th entry in
S−1 A S = S−1 (A S) is given by

(S−1 A S)i,j = ∑_{k=1}^{n} (S−1)i,k (A S)k,j = ∑_{k=1}^{n} (S−1)i,k λj sk,j = λj ∑_{k=1}^{n} (S−1)i,k sk,j = λj δi,j,

where we have used in the last step that S−1 S = I. Thus S−1 A S is indeed the diagonal matrix with the eigenvalues λ1, λ2, . . . , λn along the diagonal. □
Example 2.8 We illustrate Theorem 2.7 for our matrix from Example 2.4,

A = [  0  −2  2 ]
    [ −2  −3  2 ]
    [ −3  −6  5 ] .
In Example 2.4, we found that the spectrum of A is (see (2.3))
Λ(A) = {−1, 1, 2},
and that the eigenspaces of the eigenvalues are (see (2.4))

E2 = {α (1, 0, 1)T : α ∈ R},
E1 = {α (2, −1, 0)T : α ∈ R},
E−1 = {α (0, 1, 1)T : α ∈ R}.
Hence we choose the matrix S to be

S = [ 1   2  0 ]
    [ 0  −1  1 ]
    [ 1   0  1 ] ,

and we expect that

S−1 A S = [ 2 0  0 ]
          [ 0 1  0 ]
          [ 0 0 −1 ] . (2.9)
To verify (2.9), we compute S−1 and then execute the matrix multiplication S−1 A S. We write the augmented linear system (S|I), and use elementary row operations to transform it into (I|S−1), and we find (details left as an exercise)

[ 1   2  0 | 1 0 0 ]    [ 1 0 0 | −1 −2  2 ]
[ 0  −1  1 | 0 1 0 ] ⇔ [ 0 1 0 |  1  1 −1 ]
[ 1   0  1 | 0 0 1 ]    [ 0 0 1 |  1  2 −1 ] .

Thus the inverse S−1 is given by

S−1 = [ −1 −2  2 ]
      [  1  1 −1 ]
      [  1  2 −1 ] ,
and executing the matrix multiplications in (2.9) shows that (2.9) is indeed true.
Note that in the definition of the matrix S the normalization of the basis vectors plays no role.
That is, if we choose instead of S

T = [ α1  2α2   0 ]
    [  0  −α2  α3 ]
    [ α1    0  α3 ] ,

with any non-zero numbers α1, α2, α3 ∈ R, then we also have

T−1 A T = [ 2 0  0 ]
          [ 0 1  0 ]
          [ 0 0 −1 ] .

A permutation of the columns of S or T results in the corresponding permutation of the eigenvalues in the diagonal matrix. □
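The verification of (2.9) can also be carried out numerically; a short MATLAB check:

A = [0 -2 2; -2 -3 2; -3 -6 5];
S = [1 2 0; 0 -1 1; 1 0 1];
inv(S) * A * S            % expected: diag(2, 1, -1)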
Exercise 22 For the matrix A from Exercise 19,

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] ,

find a matrix S such that

S−1 A S = [ λ1  0   0 ]
          [  0  λ2  0 ]
          [  0  0  λ3 ] ,

with λ1 ≥ λ2 ≥ λ3. Compute S−1 and execute the matrix multiplication S−1 A S to verify that you have chosen S correctly.
Whether a matrix A ∈ Cn×n has n linearly independent eigenvectors is a non-trivial problem. The following lemma gives a sufficient but not a necessary condition for the existence of n linearly independent eigenvectors.
Lemma 2.9 (sufficient cond. for the existence of n lin. indep. eigenvectors)
Let A ∈ Cn×n have n distinct complex eigenvalues λ1, λ2, . . . , λn (that is, the eigenvalues
λ1, λ2, . . . , λn are all different). Then A has n linearly independent eigenvectors.
Proof of Lemma 2.9. For each eigenvalue λj choose one eigenvector xj . From Lemma 2.6,
the eigenvectors x1,x2, . . . ,xn are linearly independent, since the eigenvalues λ1, λ2, . . . , λn are
distinct. □
From det(S−1 A S) = det(S−1) det(A) det(S) = (det(S))−1 det(A) det(S) = det(A) it is clear
that the determinant is invariant under a basis transformation. Thus if λ1, λ2, . . . , λn are the n
complex eigenvalues of A and if A has n linearly independent eigenvectors, then Theorem 2.7
implies that
det(A) = λ1 λ2 · · · λn. (2.10)
We mention here that the trace of the square matrix (defined as the sum of its diagonal
elements) is also invariant under basis transformations, that is,
trace(S−1 A S) = trace(A).
Hence, we see with the help of Theorem 2.7 that if A ∈ Cn×n has n linearly independent
eigenvectors, then the trace of A is given by
trace(A) = λ1 + λ2 + · · · + λn, (2.11)
where λ1, λ2, . . . , λn denote the n complex eigenvalues of A.
In fact, (2.10) and (2.11) even hold without the assumption that A has n linearly independent
eigenvectors, but the proof of this needs more advanced linear algebra.
Later on we need the following special cases of Theorem 2.7 above, where A ∈ Rn×n is symmetric or A ∈ Cn×n is Hermitian. For these results, remember the following definitions: A matrix S ∈ Cn×n is called unitary if S∗ = S̄T = S−1, and a matrix S ∈ Rn×n is called orthogonal if ST = S−1.
Theorem 2.10 (basis transformation into diagonal form for Hermitian matrices)
Let A ∈ Cn×n be Hermitian, that is, A = A∗. Then there exists a unitary matrix S ∈ Cn×n such that

S∗ A S = S−1 A S = [ λ1  0   0  · · ·  0 ]
                   [ 0   λ2  0  · · ·  0 ]
                   [ 0   0   λ3   ⋱    ⋮ ]
                   [ ⋮   ⋮    ⋱    ⋱   0 ]
                   [ 0   0  · · ·  0  λn ] .
The values λ1, λ2, . . . , λn are real and are the eigenvalues of A, and the columns of the
matrix S are an orthonormal basis of eigenvectors. More precisely, if we denote the
jth column vector of S by xj, j = 1, 2, . . . , n, then x1,x2, . . . ,xn is an orthonormal basis of
Cn and Axj = λj xj, j = 1, 2, . . . , n.
For the special case of real matrices, we have the following result.
Theorem 2.11 (basis transformation into diagonal form for symmetric matrices)
Let A ∈ Rn×n be symmetric, that is, A = AT. Then there exists an orthogonal matrix S ∈ Rn×n such that

ST A S = S−1 A S = [ λ1  0   0  · · ·  0 ]
                   [ 0   λ2  0  · · ·  0 ]
                   [ 0   0   λ3   ⋱    ⋮ ]
                   [ ⋮   ⋮    ⋱    ⋱   0 ]
                   [ 0   0  · · ·  0  λn ] .
The values λ1, λ2, . . . , λn are real and are the eigenvalues of A, and the columns of the
matrix S are an orthonormal basis of eigenvectors. More precisely, if we denote the
jth column vector of S by xj, j = 1, 2, . . . , n, then x1,x2, . . . ,xn is an orthonormal basis of
Rn and Axj = λj xj, j = 1, 2, . . . , n.
Proof of Theorem 2.10. A proof of Theorem 2.10 will be discussed in Exercise 32 with the
help of the Schur factorization. □
Exercise 23 For the matrix A from Exercises 19 and 22,

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] ,

find an orthogonal matrix S such that

S−1 A S = [ λ1  0   0 ]
          [  0  λ2  0 ]
          [  0  0  λ3 ] , (2.12)
with λ1 ≥ λ2 ≥ λ3. Verify that your matrix S is orthogonal. Verify that (2.12) is true by
executing the matrix multiplications in (2.12).
Exercise 24 Let A ∈ Cn×n, and let S ∈ Cn×n be a non-singular matrix. Show that
trace(S−1 A S) = trace(A).
Exercise 25 Consider the real 2 × 2 matrix

A = (  3  −1 )
    ( −1   3 ) .
(a) Calculate the eigenvalues λ1 and λ2 (where λ1 ≥ λ2) of A by hand.
(b) Calculate the eigenspaces corresponding to the eigenvalues from (a) by hand.

(c) Find an orthogonal 2 × 2 matrix S (that is, ST = S−1) such that

ST A S = ( λ1   0 )
         (  0  λ2 ) , where λ1 > λ2.

Execute the matrix multiplication ST A S to verify that your choice of S is correct.
2.2 Upper Triangular Matrices and Back Substitution
Two important classes of matrices are upper triangular matrices and lower triangular matrices.
Definition 2.12 (upper and lower triangular matrices)
A square matrix A = (ai,j) in Cn×n or Rn×n is said to be upper triangular (or an upper
triangular matrix) if ai,j = 0 for i > j. Thus an n × n upper triangular matrix is of the
form
A = (ai,j) = [ a1,1 a1,2 a1,3 · · · a1,n ]
             [  0   a2,2 a2,3 · · · a2,n ]
             [  0    0   a3,3 · · · a3,n ]
             [  ⋮    ⋮     ⋱    ⋱    ⋮  ]
             [  0    0   · · ·  0   an,n ] . (2.13)
Similarly, an n × n matrix A = (ai,j) is said to be lower triangular (or a lower trian-
gular matrix) if ai,j = 0 for i < j.
Example 2.13 (upper and lower triangular matrices)
Let

A = [ 1 2 3 ]       B = [ −1  0  0 ]
    [ 0 1 2 ]  and      [  4 −2  0 ]
    [ 0 0 1 ]           [  5  6 −3 ] .

Then A is a 3 × 3 upper triangular matrix, and B is a 3 × 3 lower triangular matrix. □
The following lemma establishes some important properties of lower triangular and upper tri-
angular matrices.
Lemma 2.14 (properties of upper/lower triangular matrices)
The set of upper triangular matrices in Cn×n (or Rn×n) with the usual matrix addition and usual scalar multiplication with complex (or real) numbers forms a complex (or real) vector space.

Let A, B ∈ Cn×n (or A, B ∈ Rn×n) be upper triangular matrices. Then A B is also an upper triangular matrix in Cn×n (or Rn×n, respectively). If A is non-singular, then A−1 is also an upper triangular matrix in Cn×n (or in Rn×n, respectively).

Analogous statements hold for lower triangular matrices.

Let A = (ai,j) in Cn×n or in Rn×n be an upper triangular or lower triangular matrix. Then the following holds:

(i) det(A) = a1,1 a2,2 · · · an,n.

(ii) A is non-singular/invertible if and only if aj,j ≠ 0 for all j = 1, 2, . . . , n.

(iii) The eigenvalues of A are the entries a1,1, a2,2, . . . , an,n on the diagonal of A.
Proof of Lemma 2.14. The set of all n×n matrices in Cn×n (or Rn×n) with matrix addition
and scalar multiplication with complex (or real) numbers forms a complex (or real) vector
space. Therefore, to verify that the upper triangular matrices form a vector space, it is enough
to check the closure under addition and scalar multiplication. This is easily done and is left
as an exercise. The next statements are covered in the exercises, but we prove (i) to (iii) for
upper triangular matrices.
Let A be an n × n upper triangular matrix (2.13). Computing the determinant det(A) by expansion with respect to the first column yields

det(A) = a1,1 det [ a2,2 a2,3 · · · a2,n ]
                  [  0   a3,3 · · · a3,n ]
                  [  ⋮     ⋱    ⋱    ⋮  ]
                  [  0   · · ·  0   an,n ] .
The resulting submatrix whose determinant needs to be computed is again upper triangular, and repeating the procedure finally yields

det(A) = a1,1 a2,2 · · · an,n, (2.14)

which proves (i). Since a matrix is invertible if and only if det(A) ≠ 0, (2.14) immediately implies (ii). To see statement (iii), consider the matrix λ I − A. Since A is upper triangular and λ I is diagonal, the matrix λ I − A is also upper triangular and its entries on the diagonal are λ − aj,j, j = 1, 2, . . . , n. Thus from (2.14),

p(A, λ) = det(λ I − A) = (λ − a1,1) (λ − a2,2) · · · (λ − an,n),

and we see that the eigenvalues of A are indeed a1,1, a2,2, . . . , an,n. □
For an invertible upper triangular matrix A = (ai,j) in Cn×n (or in Rn×n), the linear system Ax = b,

[ a1,1 a1,2 a1,3 · · · a1,n ] [ x1 ]   [ b1 ]
[  0   a2,2 a2,3 · · · a2,n ] [ x2 ]   [ b2 ]
[  0    0   a3,3 · · · a3,n ] [ x3 ] = [ b3 ]
[  ⋮    ⋮     ⋱    ⋱    ⋮  ] [ ⋮  ]   [ ⋮  ]
[  0    0   · · ·  0   an,n ] [ xn ]   [ bn ] ,
is easily solved by observing that the last equation gives

an,n xn = bn ⇒ xn = bn/an,n. (2.15)
Then, the second last equation reads

an−1,n−1 xn−1 + an−1,n xn = bn−1,

and it can be solved for xn−1 via

xn−1 = (1/an−1,n−1) (bn−1 − an−1,n xn),
where we now use that xn was computed via (2.15) in the previous step. Proceeding in this way, the jth equation reads

aj,j xj + aj,j+1 xj+1 + · · · + aj,n xn = aj,j xj + ∑_{k=j+1}^{n} aj,k xk = bj,

and can be solved for xj as follows:

xj = (1/aj,j) (bj − ∑_{k=j+1}^{n} aj,k xk),
where we use that xn, xn−1, . . . , xj+1 are known from previous steps. We can continue this
procedure until we have computed all xj , j = n, n − 1, . . . , 2, 1. This procedure is called back
substitution and the following theorem summarizes what we have derived just now.
Theorem 2.15 (back substitution algorithm)
Let A = (a_{i,j}) in C^{n×n} (or in R^{n×n}) be an upper triangular matrix that is also invertible.
Then the solution to Ax = b can be computed with O(n^2) elementary operations via

\[
x_j = \frac{1}{a_{j,j}} \Big( b_j - \sum_{k=j+1}^{n} a_{j,k} x_k \Big), \qquad j = n, n-1, \ldots, 2, 1. \tag{2.16}
\]
For the definition of the Landau symbol O see Section 1.6.
Proof of Theorem 2.15. Essentially we have derived the theorem above before stating
it; the only part of the statement that needs some consideration is the count of O(n^2)
elementary operations. Let us consider (2.16) for a given fixed j. The computation of x_j
involves n − j additions/subtractions and n − j + 1 multiplications/divisions, that is,
2n + 1 − 2j elementary operations. Thus the back substitution algorithm needs in total

\[
\sum_{j=1}^{n} (2n + 1 - 2j) = (2n+1)\,n - 2 \sum_{j=1}^{n} j = (2n+1)\,n - (n+1)\,n = n^2,
\]

that is, O(n^2) elementary operations. □
Example 2.16 (back substitution)
Solve the following linear system with an upper triangular matrix with back substitution:

\[
\begin{pmatrix} 1 & 0 & 2 \\ 0 & 1 & -1 \\ 0 & 0 & -3 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} 3 \\ 0 \\ 6 \end{pmatrix}.
\]
Solution: Using back substitution, we have

\[
x_3 = \frac{6}{-3} = -2, \quad
x_2 = \frac{1}{1} \big( 0 - (-1)\,x_3 \big) = x_3 = -2, \quad
x_1 = \frac{1}{1} \big( 3 - 0\,x_2 - 2\,x_3 \big) = 3 - 2\,(-2) = 7.
\]

Thus the solution is x = (7, −2, −2)^T. □
The MATLAB code for the implementation of the back substitution algorithm is:
function x = back_sub(U,b)
% executes the back substitution algorithm (2.16) for solving U x = b
% input:  U = n by n invertible upper triangular matrix
%         b = right-hand side with n entries
% output: x = 1 by n (row) vector containing the solution
n = size(U,1);
x = zeros(1,n);
x(n) = b(n) / U(n,n);
for i = n-1:-1:1
    % subtract the contributions of the already computed x(i+1),...,x(n)
    x(i) = (b(i) - U(i,i+1:n) * x(i+1:n)') / U(i,i);
end
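As a quick check, the function reproduces the hand computation of Example 2.16 (a usage
sketch, assuming the file back_sub.m above is on the MATLAB path):

U = [1 0 2; 0 1 -1; 0 0 -3];
b = [3; 0; 6];
x = back_sub(U,b)    % returns x = [7 -2 -2], matching Example 2.16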
Exercise 26 Solve the following linear system by hand with the back substitution algorithm:

\[
\begin{pmatrix} 1 & 1 & 1 \\ 0 & 2 & 2 \\ 0 & 0 & 3 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} -1 \\ 3 \\ 6 \end{pmatrix}.
\]
Exercise 27 Solve the following linear system by hand with the back substitution algorithm:

\[
\begin{pmatrix} 2 & -1 & 3 & 1 \\ 0 & 1 & 2 & -1 \\ 0 & 0 & -2 & 1 \\ 0 & 0 & 0 & 3 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} 12 \\ -3 \\ 1 \\ 9 \end{pmatrix}.
\]
Exercise 28 Show that the upper triangular matrices with diagonal elements all different from
zero, with the usual matrix multiplication, form a (multiplicative) group.
Exercise 29 Forward substitution: Consider a linear system Ax = b, where A ∈ Rn×n
is a lower triangular matrix, b ∈ Rn the given right-hand side, and x ∈ Rn the unknown
solution. Analogously to the back substitution algorithm we can formulate a forward sub-
stitution algorithm to compute xj, j = 1, 2, . . . , n − 1, n. Derive the forward substitution
algorithm.
2.3 Schur Factorization: A Triangular Canonical Form
In this section we encounter the Schur factorization, which guarantees that for any matrix
A ∈ C^{n×n} there exists a unitary matrix S such that S* A S is an upper triangular matrix. Since
the matrix S is unitary, we have S^{-1} = S* and therefore S* A S = S^{-1} A S, that is, we have
a basis transformation (or similarity transformation) with a unitary matrix that transforms A
into an upper triangular matrix. The proof of the Schur factorization is constructive in that
we will explicitly construct the matrix S with the help of so-called Householder matrices or
elementary Hermitian matrices.
Definition 2.17 (Householder matrix or elementary Hermitian matrix)
A Householder matrix or elementary Hermitian matrix is any matrix of the form

\[
H(w) := I - 2\,w w^*, \quad \text{where } w \in \mathbb{C}^n \text{ with } w^* w = \|w\|_2^2 = 1 \text{ or } w = 0. \tag{2.17}
\]
Figure 2.1 illustrates that the Householder matrix H(w) with w ≠ 0 represents a reflection
in the hyperplane

\[
S_w = \{ z \in \mathbb{C}^n : w^* z = 0 \}
\]

that is orthogonal to w. Indeed, consider any vector a ∈ C^n and decompose it into the
component in the direction of w and the orthogonal part (which lies in S_w):

\[
a = (w^* a)\,w + \big( a - (w^* a)\,w \big). \tag{2.18}
\]
If we apply H(w) to the vector a, then, from (2.18) and w^* w = 1,

\begin{align*}
H(w)\,a &= (I - 2\,w w^*)\,a = a - 2\,w w^* a \\
&= (w^* a)\,w + \big( a - (w^* a)\,w \big) - 2\,w w^* \Big( (w^* a)\,w + \big( a - (w^* a)\,w \big) \Big) \\
&= (w^* a)\,w + \big( a - (w^* a)\,w \big) - 2\,w w^* \big( (w^* a)\,w \big) \\
&= (w^* a)\,w + \big( a - (w^* a)\,w \big) - 2\,(w^* a)\,(w^* w)\,w \\
&= -(w^* a)\,w + \big( a - (w^* a)\,w \big),
\end{align*}

where we have used that a − (w^* a) w is orthogonal to w. From the last representation we see
that H(w) a is indeed the reflection of a in the hyperplane S_w.
[Figure 2.1: Householder transformation. In the picture, w̃ denotes the projection of a onto
the hyperplane S_w. Then a = (w^* a) w + w̃, and since w^* w̃ = 0 and w^* w = 1, we find
H(w) a = (I − 2 w w^*)((w^* a) w + w̃) = (w^* a) w + w̃ − 2 (w^* a) w = −(w^* a) w + w̃.]
Example 2.18 (Householder matrix)
Let w = (0, −3/5, 4/5)^T. Then ‖w‖_2 = 1, and the matrix

\[
H(w) = I - 2 \begin{pmatrix} 0 \\ -\tfrac{3}{5} \\ \tfrac{4}{5} \end{pmatrix}
\big( 0, -\tfrac{3}{5}, \tfrac{4}{5} \big)
= I - 2 \begin{pmatrix} 0 & 0 & 0 \\ 0 & \tfrac{9}{25} & -\tfrac{12}{25} \\ 0 & -\tfrac{12}{25} & \tfrac{16}{25} \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{7}{25} & \tfrac{24}{25} \\ 0 & \tfrac{24}{25} & -\tfrac{7}{25} \end{pmatrix}
\]

is a 3 × 3 Householder matrix. □
The next lemma states some properties of Householder matrices.
Lemma 2.19 (properties of Householder matrices)
A Householder matrix H(w), given by (2.17), has the following properties:
(i) H(w) is Hermitian, that is, (H(w))^* = H(w).

(ii) H(w) is invertible/non-singular.

(iii) det(H(w)) = −1 for w ≠ 0.

(iv) H(w) is unitary, that is, (H(w))^{-1} = (H(w))^*. Hence, a product of Householder
matrices is unitary.

(v) Storing H(w) only requires the n elements of w.
Proof of Lemma 2.19. From (A B)^* = B^* A^*, (A + B)^* = A^* + B^*, and (A^*)^* = A we find

\[
(H(w))^* = (I - 2\,w w^*)^* = I^* - 2\,(w w^*)^* = I - 2\,(w^*)^* w^* = I - 2\,w w^* = H(w),
\]

thus proving (i).

Next we work out det(H(w)). If w = 0, then H(w) = I and hence det(H(w)) = 1. If
w ≠ 0, then we will show that the eigenvalues of H(w) are 1, with multiplicity n − 1, and −1,
with multiplicity 1. Therefore, we have that det(H(w)) = (−1) · 1^{n−1} = −1, which proves (iii).
Consider as before the hyperplane S_w = {z ∈ C^n : w^* z = 0}, which is an (n − 1)-dimensional
subspace of C^n. For any vector a ∈ S_w, we have w^* a = 0, and hence for a ∈ S_w,

\[
H(w)\,a = (I - 2\,w w^*)\,a = a - 2\,w\,(w^* a) = a,
\]

that is, a is an eigenvector of H(w) corresponding to the eigenvalue λ = 1. Since dim(S_w) =
n − 1, the eigenvalue λ = 1 has at least n − 1 linearly independent eigenvectors and hence it
has at least multiplicity n − 1. Now consider the vector w itself. Then, since w^* w = 1,

\[
H(w)\,w = (I - 2\,w w^*)\,w = w - 2\,w\,(w^* w) = -w,
\]

that is, w is an eigenvector corresponding to the eigenvalue λ = −1. Combining these results,
we see that the eigenvalue λ = 1 has multiplicity n − 1 and the eigenvalue λ = −1 has
multiplicity 1.
From the proof so far it is clear that H(w) is invertible, since we have established that its
determinant is different from zero. Thus (ii) holds true.
To show that H(w) is unitary, we have to show that

\[
(H(w))^* H(w) = H(w)\,(H(w))^* = I.
\]

Since (H(w))^* = H(w) from (i), it is enough to show that H(w) H(w) = I. Indeed,

\begin{align*}
H(w)\,H(w) &= (I - 2\,w w^*)(I - 2\,w w^*) \\
&= I - 4\,w w^* + 4\,(w w^*)(w w^*) \\
&= I - 4\,w w^* + 4\,w\,(w^* w)\,w^* \\
&= I - 4\,w w^* + 4\,w w^* = I,
\end{align*}

where we have used the associativity of matrix multiplication and w^* w = 1.

That the product of Householder matrices is also unitary follows from the following general
statement: if A and B in C^{n×n} are unitary, then A B is unitary. Indeed, (A B)^* = B^* A^* =
B^{-1} A^{-1}, and we know that B^{-1} A^{-1} is the inverse matrix of A B.

Statement (v) is evident. □
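The properties in Lemma 2.19 can also be checked numerically; the following MATLAB
sketch (our own illustration, not part of the original notes) verifies (i), (iii), and (iv) for
the vector w from Example 2.18:

w = [0; -3/5; 4/5];
H = eye(3) - 2*(w*w');       % Householder matrix H(w), see (2.17)
norm(H - H', inf)            % (i)   Hermitian (here: real symmetric), returns 0
det(H)                       % (iii) determinant, returns -1
norm(H*H - eye(3), inf)      % (iv)  unitary, since H(w) H(w) = I, returns ~0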
Lemma 2.20 (construction of Householder matrices)
Let x and y be given vectors in Cn such that y∗ y = x∗ x and y∗ x = x∗ y. Then there exists
a Householder matrix (or an elementary Hermitian matrix) H(w), such that H(w)x = y
and H(w)y = x. If x and y are real then so is w.
Proof of Lemma 2.20. If x = y, then w = 0 and H(w) = I. If x ≠ y, we define

\[
w = \frac{x - y}{\sqrt{(x-y)^* (x-y)}} = \frac{x - y}{\|x - y\|_2}. \tag{2.19}
\]

Clearly, w^* w = 1, and from (2.19),

\[
H(w)\,x = (I - 2\,w w^*)\,x = x - (x - y)\,\frac{2\,(x-y)^* x}{(x-y)^* (x-y)}, \tag{2.20}
\]

and, using x^* x = y^* y and x^* y = y^* x,

\[
2\,(x-y)^* x = 2\,x^* x - 2\,y^* x = (x^* x + y^* y) - (y^* x + x^* y) = (x-y)^* (x-y). \tag{2.21}
\]

Substituting (2.21) into (2.20) now yields H(w) x = y. From (H(w))^{-1} = (H(w))^* = H(w)
(see (i) and (iv) in Lemma 2.19) and H(w) x = y, we have

\[
H(w)\,y = H(w)\,\big( H(w)\,x \big) = \big( H(w)\,H(w) \big)\,x = I\,x = x.
\]

If the vectors x and y are real, then, from the definition (2.19), the vector w is clearly also
real. □
Lemma 2.20 is often used to map a given vector x = (x_1, x_2, . . . , x_n)^T onto a scalar multiple
of the first unit vector e_1 = (1, 0, 0, . . . , 0)^T, that is, we want to find a Householder matrix
H(w) such that

\[
H(w)\,x = c\,e_1
\]

with a suitable complex number c. From the conditions in Lemma 2.20, we find

\[
(c\,e_1)^* (c\,e_1) = (\bar{c}\,c)\,(e_1^* e_1) = |c|^2 = x^* x = \|x\|_2^2 \quad \Rightarrow \quad |c| = \|x\|_2
\]

and

\[
(c\,e_1)^* x = \bar{c}\,x_1 = x^* (c\,e_1) = c\,\bar{x}_1 = \overline{\bar{c}\,x_1} \quad \Rightarrow \quad \bar{c}\,x_1 \in \mathbb{R},
\]

and thus c is a real multiple of x_1. Combining both properties, we see that

\[
c = \frac{x_1}{|x_1|}\,\|x\|_2,
\]

and we choose the vector w of the Householder matrix from (2.19) as

\[
w = \frac{x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1}{\sqrt{\Big( x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1 \Big)^* \Big( x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1 \Big)}}
= \frac{x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1}{\Big\| x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1 \Big\|_2}.
\]
We can now use Lemma 2.20 to prove the following Theorem.
Theorem 2.21 (Schur factorization)
Let A be a matrix in Cn×n. Then there exists a unitary matrix S ∈ Cn×n such that S∗ A S
is an upper triangular matrix. This is known as the Schur factorization of A.
Proof of Theorem 2.21. The proof is given by induction over n.
Initial step: Clearly the result holds for n = 1.
Induction step n − 1 → n: Assume now the result holds for all (n − 1)× (n − 1) matrices. We
need to show that the result also holds for all n × n matrices.
Let x be an eigenvector to some eigenvalue λ of A, that is,

\[
A\,x = \lambda\,x, \quad \text{where } x \neq 0. \tag{2.22}
\]

By Lemma 2.20 and our considerations after this lemma, there exists a Householder matrix
H(w_1) such that

\[
H(w_1)\,x = c_1\,e_1 \quad \text{and} \quad H(w_1)\,e_1 = \frac{1}{c_1}\,x, \tag{2.23}
\]

with |c_1| = ‖x‖_2 ≠ 0. Using (H(w_1))^* = H(w_1) and (2.22) and (2.23), we have

\[
(H(w_1))^* A\,H(w_1)\,e_1 = \frac{1}{c_1}\,H(w_1)\,A\,x = \frac{\lambda}{c_1}\,H(w_1)\,x
= \frac{\lambda}{c_1}\,c_1\,e_1 = \lambda\,e_1.
\]

Since (H(w_1))^* A H(w_1) e_1 is the first column of (H(w_1))^* A H(w_1), we may write the
matrix (H(w_1))^* A H(w_1) in the form

\[
(H(w_1))^* A\,H(w_1) = H(w_1)\,A\,H(w_1) =
\begin{pmatrix} \lambda & a^* \\ 0 & A^{(1)} \end{pmatrix}, \tag{2.24}
\]

for some vector a ∈ C^{n−1} and some (n − 1) × (n − 1) matrix A^{(1)}.
By the induction hypothesis there exists an (n − 1) × (n − 1) unitary matrix V such that
V^* A^{(1)} V = T, where T is an upper triangular (n − 1) × (n − 1) matrix. Consider the matrix

\[
S = H(w_1) \begin{pmatrix} 1 & 0^T \\ 0 & V \end{pmatrix}
\quad \text{with} \quad
S^* = \begin{pmatrix} 1 & 0^T \\ 0 & V^* \end{pmatrix} (H(w_1))^*.
\]

From V^* V = V V^* = I and (H(w_1))^* H(w_1) = H(w_1) (H(w_1))^* = I, we get S S^* = S^* S = I,
that is, the matrix S is unitary. We will now show that S^* A S is an upper triangular matrix.
From (2.24),

\begin{align*}
S^* A\,S &= \begin{pmatrix} 1 & 0^T \\ 0 & V^* \end{pmatrix} (H(w_1))^* A\,H(w_1) \begin{pmatrix} 1 & 0^T \\ 0 & V \end{pmatrix}
= \begin{pmatrix} 1 & 0^T \\ 0 & V^* \end{pmatrix} \begin{pmatrix} \lambda & a^* \\ 0 & A^{(1)} \end{pmatrix} \begin{pmatrix} 1 & 0^T \\ 0 & V \end{pmatrix} \\
&= \begin{pmatrix} 1 & 0^T \\ 0 & V^* \end{pmatrix} \begin{pmatrix} \lambda & a^* V \\ 0 & A^{(1)} V \end{pmatrix}
= \begin{pmatrix} \lambda & a^* V \\ 0 & V^* A^{(1)} V \end{pmatrix}
= \begin{pmatrix} \lambda & a^* V \\ 0 & T \end{pmatrix}.
\end{align*}

Hence, S^* A S is upper triangular. □
Example 2.22 (Schur factorization)
Find the Schur factorization of the matrix

\[
A = \begin{pmatrix} 3 & 0 & 1 \\ 0 & 3 & 0 \\ 1 & 0 & 3 \end{pmatrix}.
\]
Solution: To do this we follow the steps of the proof of Theorem 2.21. First we find the
eigenvalues of A:

\[
p(A, \lambda) = \det(\lambda I - A)
= \det \begin{pmatrix} \lambda-3 & 0 & -1 \\ 0 & \lambda-3 & 0 \\ -1 & 0 & \lambda-3 \end{pmatrix}
= (\lambda-3)^3 - (\lambda-3) = (\lambda-3)\big[ (\lambda-3)^2 - 1 \big] = (\lambda-3)(\lambda-2)(\lambda-4),
\]

and we see that the eigenvalues are λ_1 = 4, λ_2 = 3, and λ_3 = 2.
We choose λ_2 = 3, find a corresponding eigenvector, and determine the Householder matrix
that maps the eigenvector onto c e_1. For λ = λ_2 = 3, the eigenvector x_2 satisfies the linear
system

\[
\begin{pmatrix} 0 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{pmatrix} x_2 = 0
\quad \Rightarrow \quad
x_2 = \alpha \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad \alpha \in \mathbb{R}.
\]

For α = 1, ‖e_1‖_2 = ‖x_2‖_2 = 1 and e_1^* x_2 = x_2^* e_1 = 0. Hence, we choose the vector w_1 for
the first Householder matrix as

\[
w_1 = \frac{x_2 - e_1}{\|x_2 - e_1\|_2} = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 \\ 1 \\ 0 \end{pmatrix}.
\]
The corresponding Householder matrix is given by

\[
H(w_1) = I - 2\,w_1 w_1^*
= I - \frac{2}{(\sqrt{2})^2} \begin{pmatrix} -1 \\ 1 \\ 0 \end{pmatrix} (-1, 1, 0)
= I - \begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
= \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\]

and it is easily verified that indeed H(w_1) x_2 = e_1 and H(w_1) e_1 = x_2. Now we execute the
matrix multiplication (H(w_1))^* A H(w_1) = H(w_1) A H(w_1) and obtain

\[
H(w_1)\,A\,H(w_1)
= \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 3 & 0 & 1 \\ 0 & 3 & 0 \\ 1 & 0 & 3 \end{pmatrix}
\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 0 & 3 & 1 \\ 3 & 0 & 0 \\ 0 & 1 & 3 \end{pmatrix}
= \begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 1 \\ 0 & 1 & 3 \end{pmatrix}.
\]
Now we consider the 2 × 2 submatrix

\[
A^{(1)} = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix},
\]

and determine its eigenvalues:

\[
p(A^{(1)}, \lambda) = \det(\lambda I - A^{(1)})
= \det \begin{pmatrix} \lambda-3 & -1 \\ -1 & \lambda-3 \end{pmatrix}
= (\lambda-3)^2 - 1 = (\lambda-2)(\lambda-4).
\]
The eigenvalues of A^{(1)} are λ_1 = 4 and λ_2 = 2. We choose λ_2 = 2 and find a corresponding
eigenvector x^{(1)}_2:

\[
\begin{pmatrix} -1 & -1 \\ -1 & -1 \end{pmatrix} x^{(1)}_2 = 0
\quad \Rightarrow \quad
x^{(1)}_2 = \alpha \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad \alpha \in \mathbb{R}.
\]

For α = 1, ‖x^{(1)}_2‖_2 = ‖\sqrt{2}\,e_1‖_2, where e_1 is now the first unit vector in R^2, and we
have (x^{(1)}_2)^* (\sqrt{2}\,e_1) = (\sqrt{2}\,e_1)^* x^{(1)}_2. The vector w^{(1)}_2 of the Householder
matrix in R^2 is given by

\[
w^{(1)}_2 = \frac{x^{(1)}_2 - \sqrt{2}\,e_1}{\big\| x^{(1)}_2 - \sqrt{2}\,e_1 \big\|_2}
= \big( (1-\sqrt{2})^2 + 1 \big)^{-1/2} \begin{pmatrix} 1-\sqrt{2} \\ -1 \end{pmatrix}
= \big( 2\,(2-\sqrt{2}) \big)^{-1/2} \begin{pmatrix} 1-\sqrt{2} \\ -1 \end{pmatrix}.
\]
Thus the Householder matrix in R^2 is given by

\begin{align*}
H(w^{(1)}_2) = I - 2\,w^{(1)}_2 \big( w^{(1)}_2 \big)^*
&= I - (2-\sqrt{2})^{-1} \begin{pmatrix} 1-\sqrt{2} \\ -1 \end{pmatrix} \big( 1-\sqrt{2},\, -1 \big) \\
&= I + \big( \sqrt{2}\,(1-\sqrt{2}) \big)^{-1} \begin{pmatrix} (1-\sqrt{2})^2 & -(1-\sqrt{2}) \\ -(1-\sqrt{2}) & 1 \end{pmatrix} \\
&= I + (\sqrt{2})^{-1} \begin{pmatrix} 1-\sqrt{2} & -1 \\ -1 & (1-\sqrt{2})^{-1} \end{pmatrix} \\
&= \begin{pmatrix} (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}
= -\frac{1}{\sqrt{2}} \begin{pmatrix} -1 & 1 \\ 1 & 1 \end{pmatrix}.
\end{align*}
The corresponding unitary matrix in R^3 is then given by

\[
H_2 := \begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix},
\]

and

\begin{align*}
H_2 \big( H(w_1)\,A\,H(w_1) \big) H_2
&= \begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}
\begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 1 \\ 0 & 1 & 3 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix} \\
&= \begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}
\begin{pmatrix} 3 & 0 & 0 \\ 0 & \sqrt{2} & -2\sqrt{2} \\ 0 & -\sqrt{2} & -2\sqrt{2} \end{pmatrix}
= \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix}.
\end{align*}
Thus we have

\[
S^* A\,S = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix}
\]

with the unitary matrix

\[
S := H(w_1)\,H_2 =
\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}
=
\begin{pmatrix} 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 1 & 0 & 0 \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}.
\]

In this example the Schur factorization has brought A into diagonal form, but in general this
is not the case, and we only obtain an upper triangular matrix. □
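The factorization just derived is easy to confirm numerically; this MATLAB sketch checks
that S is unitary and that S* A S is the diagonal matrix above:

A = [3 0 1; 0 3 0; 1 0 3];
r = 1/sqrt(2);
S = [0 r -r; 1 0 0; 0 -r -r];
norm(S'*S - eye(3), inf)   % ~0, so S is unitary
S'*A*S                     % ~diag(3,2,4), as derived in Example 2.22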
We note here that the Schur factorization is mainly of 'theoretical interest' and is not used
directly as a practical algorithm. For example, you can use the Schur factorization to prove
the statements investigated in the remark and the exercise below.
Remark 2.23 (special case of Theorem 2.21 for A∗ = A)
An important consequence of Theorem 2.21 is that if A is Hermitian, that is, A∗ = A, then the
upper triangular matrix S∗ A S is also Hermitian, and thus S∗ A S = S−1 A S must be a real
diagonal matrix. Furthermore, A is Hermitian if and only if all eigenvalues are real
and there are n orthonormal eigenvectors!
Exercise 30 Construct a Householder matrix that maps the vector (2, 0, 1)^T onto the vector
(\sqrt{5}, 0, 0)^T.
Exercise 31 Let A ∈ C^{n×n} be an Hermitian matrix, that is, A^* = \bar{A}^T = A. Without
using any known results, but by just exploiting the definition of an Hermitian matrix, show
that A has only real eigenvalues.
Exercise 32 Let A be a Hermitian matrix.
(a) By using the Schur factorization, show that there exists a unitary matrix S such that
S∗ A S = U , where U is a real diagonal matrix.
(b) Use the Schur factorization to show that Cn has an orthonormal basis of eigenvectors of A.
(c) Show that A is positive definite if and only if all eigenvalues are positive.
(d) Show that if A is positive definite, then det(A) > 0.
(e) Show that A is positive definite if and only if A = Q∗ Q, with some matrix Q satisfying
det(Q) 6= 0.
2.4 Vector Norms
In this section we briefly revise some material on norms that has been covered in ‘Further
Analysis’.
Definition 2.24 (norm, normed linear space, and unit vector)
Let V be a complex (or real) vector space. A norm for V is a function ‖ · ‖ : V → R with
the following properties: For all x,y ∈ V and α ∈ C (or α ∈ R) we have
(i) ‖x‖ ≥ 0; and ‖x‖ = 0 if and only if x = 0,
(ii) ‖αx‖ = |α| ‖x‖, and
(iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).
A vector space V with a norm ‖ · ‖ is called a normed vector space or normed linear
space.
A unit vector with respect to the norm ‖ · ‖ is a vector x ∈ V such that ‖x‖ = 1.
Example 2.25 (norms on C^n and R^n)
Here is a list of the most important vector norms for C^n (or R^n): for x = (x_1, x_2, . . . , x_n)^T
in C^n or R^n, define the p-norms by

(a) 1-norm: ‖x‖_1 := \sum_{j=1}^{n} |x_j| = |x_1| + |x_2| + . . . + |x_n|,

(b) 2-norm or Euclidean norm: ‖x‖_2 := \Big( \sum_{j=1}^{n} |x_j|^2 \Big)^{1/2},

(c) p-norm: ‖x‖_p := \Big( \sum_{j=1}^{n} |x_j|^p \Big)^{1/p} = \big( |x_1|^p + |x_2|^p + . . . + |x_n|^p \big)^{1/p}, for 1 ≤ p < ∞,

(d) ∞-norm: ‖x‖_∞ := \max_{1 \le j \le n} |x_j|.

We will use these p-norms frequently in this course, in particular, for p ∈ {1, 2, ∞}. □
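In MATLAB the p-norms above are available through the built-in function norm, e.g. for a
sample vector:

x = [3; -4; 0];
norm(x,1)      % 1-norm:   |3| + |-4| + |0| = 7
norm(x,2)      % 2-norm:   sqrt(9 + 16) = 5
norm(x,inf)    % inf-norm: max{3, 4, 0} = 4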
Example 2.26 (unit balls with respect to some norms on R^n)
The unit ball in R^n with respect to the p-norm ‖ · ‖_p is defined to be the set

\[
B_p := \{ x \in \mathbb{R}^n : \|x\|_p \le 1 \}.
\]

For R^2, Figure 2.2 below shows the unit balls B_1, B_2, and B_∞ with respect to the 1-norm,
the Euclidean norm, and the ∞-norm, respectively. Only the unit ball with respect to the
Euclidean norm looks like we imagine a ball. □
The following theorem generalizes the Cauchy-Schwarz inequality.
[Figure 2.2: The unit balls B_p in R^2 with respect to the p-norms ‖ · ‖_1 (blue), ‖ · ‖_2 (black),
and ‖ · ‖_∞ (red), respectively.]
Theorem 2.27 (Hölder's inequality)
The p-norms ‖ · ‖_p, where 1 ≤ p ≤ ∞, for C^n (and R^n), defined in Example 2.25 above,
satisfy Hölder's inequality:

\[
|x^* y| \le \|x\|_p \, \|y\|_q, \quad \text{where } \frac{1}{p} + \frac{1}{q} = 1.
\]

The special case p = q = 2 is known as the Cauchy-Schwarz inequality.
The next definition has far reaching consequences.
Definition 2.28 (equivalent norms)
Two norms ‖ · ‖ and ‖| · ‖| for a (real or complex) vector space V are called equivalent if
there are two positive constants c_1 and c_2 such that

\[
c_1 \|x\| \le \||x\|| \le c_2 \|x\| \quad \text{for all } x \in V.
\]
A norm allows us to define the notion of convergence of sequences.
Definition 2.29 (convergence with respect to a norm)
Let V be a (real or complex) vector space, and let ‖ · ‖ : V → R be a norm for V . A sequence
{x_k} ⊂ V converges with respect to ‖ · ‖ to x ∈ V if

\[
\lim_{k \to \infty} \|x_k - x\| = 0.
\]
If two norms are equivalent, then they define the same notion of convergence.
Theorem 2.30 (equivalence of all norms on Cn (or Rn))
On C^n (or R^n) all norms are equivalent.
Proof of Theorem 2.30. It suffices to show that all norms are equivalent to the 1-norm
‖ · ‖_1. Let ‖ · ‖ be any norm on C^n. The representation x = \sum_{j=1}^{n} x_j e_j of any vector
x = (x_1, x_2, . . . , x_n)^T shows that

\[
\|x\| = \Big\| \sum_{j=1}^{n} x_j e_j \Big\| \le \sum_{j=1}^{n} \|x_j e_j\| = \sum_{j=1}^{n} |x_j| \, \|e_j\|
\le \Big( \max_{1 \le j \le n} \|e_j\| \Big) \|x\|_1 =: M \|x\|_1,
\]

with M := \max_{1 \le j \le n} \|e_j\|. From this we can conclude that the norm ‖ · ‖ is Lipschitz
continuous with respect to the 1-norm ‖ · ‖_1, that is,

\[
\big| \|x\| - \|y\| \big| \le \|x - y\| \le M \|x - y\|_1 \quad \text{for all } x, y \in \mathbb{C}^n.
\]

(See Exercise 35 for the first inequality.) Since the unit sphere S_1 = {x ∈ C^n : ‖x‖_1 = 1}
in C^n is closed and bounded, the unit sphere S_1 is compact. Hence the norm ‖ · ‖ attains its
minimum and maximum on S_1 (because it is a continuous function), that is, there are positive
constants c_1 and c_2 such that

\[
c_1 \le \|x\| \le c_2 \quad \text{for all } x \in \mathbb{C}^n \text{ with } \|x\|_1 = 1. \tag{2.25}
\]

For general x ∈ C^n \ {0}, we apply (2.25) to the unit vector y = x/‖x‖_1 and obtain

\[
c_1 \le \frac{1}{\|x\|_1} \|x\| \le c_2 \quad \Rightarrow \quad c_1 \|x\|_1 \le \|x\| \le c_2 \|x\|_1. \tag{2.26}
\]

The second estimate in (2.26) holds for all x ∈ C^n \ {0} and trivially also for x = 0. Thus we
have verified that ‖ · ‖ and ‖ · ‖_1 are equivalent. □
Example 2.31 (equivalent norms on R^n)
For example, we have for all x ∈ R^n:

\[
\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2, \qquad
\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty, \qquad
\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty.
\]

These estimates will be proved in Exercise 36 below. □
Exercise 33 Show that ‖ · ‖1 and ‖ · ‖∞ are norms for Cn.
Exercise 34 By considering the inequality

\[
0 \le (\alpha\,x + \beta\,y)^T (\alpha\,x + \beta\,y)
\]

and choosing α, β ∈ R appropriately, for x, y ∈ R^n, prove the Cauchy-Schwarz inequality.
Exercise 35 Prove the lower triangle inequality: If ‖ · ‖ is a norm for a vector space V , then

\[
\big| \|x\| - \|y\| \big| \le \|x - y\| \quad \text{for all } x, y \in V.
\]

Hint: Consider writing y = x + (y − x), and use the triangle inequality.
Exercise 36 Show the inequalities in Example 2.31.
2.5 Matrix Norms
When analyzing matrix algorithms we will require the use of matrix norms, since they allow
us to estimate whether a matrix is 'well-conditioned'. If the matrix is not 'well-conditioned'
(which usually means that the matrix is close to being singular), then the quality of a
numerically computed solution can be poor.
We recall that the set of all complex (or real) m × n matrices with the usual component-wise
matrix addition and scalar multiplication forms a complex (or real) vector space.
Lemma 2.32 (m × n matrices form a vector space)
The set Cm×n (or Rm×n) of all complex (or real) m × n matrices with the component-wise
addition of A = (ai,j) and B = (bi,j), defined by
(A + B)i,j := ai,j + bi,j , i = 1, 2, . . . , m; j = 1, 2, . . . , n,
and the component-wise scalar multiplication of α ∈ C (or α ∈ R) and A = (ai,j), defined
by
(α A)i,j := α ai,j , i = 1, 2, . . . , m; j = 1, 2, . . . , n,
is a complex (or real) vector space.
Proof of Lemma 2.32. The proof is straightforward and was covered in Exercise 3. □
Since C^{m×n} (and R^{m×n}) are vector spaces, we can introduce norms on them. For many
purposes it is convenient to introduce norms which are 'induced' by given vector norms on
C^m and C^n (and R^m and R^n, respectively). We will now explore the concept of an 'induced
matrix norm'.
In this section, we write ‖·‖(m) to indicate that ‖·‖(m) is a norm on an m-dimensional vector
space (usually Cm or Rm).
Let A ∈ C^{m×n}. Since the unit sphere S = {x ∈ C^n : ‖x‖_{(n)} = 1} in the finite dimensional
space C^n with norm ‖ · ‖_{(n)} is closed and bounded, and hence compact, the real-valued
function ‖Ax‖_{(m)} assumes its supremum on the unit sphere S. Thus

\[
\sup_{x \in \mathbb{C}^n,\, \|x\|_{(n)} = 1} \|A x\|_{(m)} = \|A x'\|_{(m)} =: C
\quad \text{for some } x' \in \mathbb{C}^n \text{ with } \|x'\|_{(n)} = 1, \tag{2.27}
\]

and, in particular, the supremum has a finite value C. We also see from (2.27) that

\[
\|A x\|_{(m)} \le C \text{ for all } x \in \mathbb{C}^n \text{ with } \|x\|_{(n)} = 1
\quad \Leftrightarrow \quad
\|A y\|_{(m)} \le C\,\|y\|_{(n)} \text{ for all } y \in \mathbb{C}^n,
\]

where the equivalence follows by setting x = y/‖y‖_{(n)} for y ≠ 0. This motivates the definition
of the induced matrix norm.
Definition 2.33 (induced matrix norm)
Let ‖ · ‖_{(m)} and ‖ · ‖_{(n)} be norms on C^m (or R^m) and C^n (or R^n), respectively, and let
A ∈ C^{m×n} (or R^{m×n}). The induced matrix norm of A is defined by

\[
\|A\|_{(m,n)} := \sup_{x \in \mathbb{C}^n,\, x \neq 0} \frac{\|A x\|_{(m)}}{\|x\|_{(n)}}
= \sup_{x \in \mathbb{C}^n,\, \|x\|_{(n)} = 1} \|A x\|_{(m)}.
\]

Obviously, for any x ∈ C^n (or R^n) with x ≠ 0,

\[
\frac{\|A x\|_{(m)}}{\|x\|_{(n)}} \le \|A\|_{(m,n)},
\]

and hence,

\[
\|A x\|_{(m)} \le \|A\|_{(m,n)} \, \|x\|_{(n)} \quad \text{for all } x \in \mathbb{C}^n \text{ (or } x \in \mathbb{R}^n\text{)}. \tag{2.28}
\]
Exercise 37 Let ‖ · ‖_{(m)} and ‖ · ‖_{(n)} be norms for C^m and C^n, respectively, and let
A ∈ C^{m×n}. Show that the induced matrix norm ‖ · ‖_{(m,n)} satisfies the properties of a norm.
Definition 2.34 (compatible matrix norm)
Let ‖ · ‖_{(m)} and ‖ · ‖_{(n)} be norms on C^m (or R^m) and C^n (or R^n), respectively, and let
‖ · ‖_{(m,n)} denote the induced matrix norm on C^{m×n} (or R^{m×n}). Let ‖| · ‖| denote another
matrix norm on C^{m×n} (or R^{m×n}). We say that the matrix norm ‖| · ‖| is compatible with
the induced matrix norm ‖ · ‖_{(m,n)} if

\[
\|A x\|_{(m)} \le \||A\|| \, \|x\|_{(n)} \quad \text{for all } x \in \mathbb{C}^n \text{ (or } x \in \mathbb{R}^n\text{)}.
\]

We observe that by the definition of the induced matrix norm ‖ · ‖_{(m,n)}, any compatible
matrix norm ‖| · ‖| clearly satisfies

\[
\|A\|_{(m,n)} \le \||A\|| \quad \text{for all } A \in \mathbb{C}^{m\times n} \text{ (or } A \in \mathbb{R}^{m\times n}\text{)}, \tag{2.29}
\]

and, in fact, (2.29) characterizes a compatible matrix norm.
The next two definitions introduce important matrix norms.
Definition 2.35 (Frobenius norm)
The Frobenius norm for an m × n matrix A = (a_{i,j}) (in C^{m×n} or R^{m×n}) is given by

\[
\|A\|_F = \Big( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{i,j}|^2 \Big)^{1/2}. \tag{2.30}
\]
Definition 2.36 (induced matrix p-norms)
Let A ∈ C^{m×n} (or A ∈ R^{m×n}). For 1 ≤ p ≤ ∞, equip C^m (or R^m) and C^n (or R^n) with
the corresponding p-norm, where

\[
\|x\|_p := \Big( \sum_{j=1}^{d} |x_j|^p \Big)^{1/p}, \quad x \in \mathbb{C}^d \text{ (or } x \in \mathbb{R}^d\text{)}, \quad 1 \le p < \infty,
\]

and

\[
\|x\|_\infty := \max_{1 \le j \le d} |x_j|, \quad x \in \mathbb{C}^d \text{ (or } x \in \mathbb{R}^d\text{)}.
\]

Then the induced matrix p-norms for A ∈ C^{m×n} (or A ∈ R^{m×n}) are given by

\[
\|A\|_p = \sup_{x \in \mathbb{C}^n,\, x \neq 0} \frac{\|A x\|_p}{\|x\|_p}
= \sup_{x \in \mathbb{C}^n,\, \|x\|_p = 1} \|A x\|_p.
\]
The next Theorem gives more explicit formulas for some induced matrix p-norms.
Theorem 2.37 (important induced matrix p-norms)
Let A = (a_{i,j}) be an m × n matrix in C^{m×n} (or R^{m×n}), and let σ_1, σ_2, . . . , σ_n be the
non-negative eigenvalues of the Hermitian matrix A^* A (or of the symmetric matrix A^T A,
respectively). Then the induced matrix p-norms, for p ∈ {1, 2, ∞}, are given by

\[
\|A\|_1 = \max_{1 \le j \le n} \Big( \sum_{i=1}^{m} |a_{i,j}| \Big) = \max_{1 \le j \le n} \|a_j\|_1, \tag{2.31}
\]

\[
\|A\|_2 = \sqrt{\max_{1 \le j \le n} \sigma_j}, \tag{2.32}
\]

\[
\|A\|_\infty = \max_{1 \le i \le m} \Big( \sum_{j=1}^{n} |a_{i,j}| \Big), \tag{2.33}
\]

where the vector a_j denotes the jth column vector of A.
In Theorem 2.37, note that since A^* A (or A^T A) is Hermitian (or symmetric, respectively),
due to Theorem 2.10 (and Theorem 2.11, respectively), the matrix A^* A (and A^T A) has only
real eigenvalues. Let σ be an eigenvalue of A^* A (or A^T A), and let x ≠ 0 be a corresponding
eigenvector. Then for A ∈ C^{m×n},

\[
A^* A\,x = \sigma\,x \;\Rightarrow\; x^* A^* A\,x = \sigma\,x^* x \;\Rightarrow\; \|A x\|_2^2 = \sigma\,\|x\|_2^2
\;\Rightarrow\; \sigma = \frac{\|A x\|_2^2}{\|x\|_2^2} \ge 0,
\]

and the same computation with A^T A in place of A^* A applies for A ∈ R^{m×n}. Thus we see
that the eigenvalues of A^* A (and A^T A, respectively) are indeed non-negative.
Proof of Theorem 2.37. To derive the induced matrix 1-norm of a matrix A ∈ C^{m×n},
consider x ∈ C^n with ‖x‖_1 = \sum_{j=1}^{n} |x_j| = 1. Then for such x, with
a_j = (a_{1,j}, a_{2,j}, . . . , a_{m,j})^T,

\[
\|A x\|_1 = \Big\| \sum_{j=1}^{n} x_j a_j \Big\|_1
\le \sum_{j=1}^{n} \|x_j a_j\|_1 = \sum_{j=1}^{n} |x_j| \, \|a_j\|_1
\le \Big( \max_{1 \le j \le n} \|a_j\|_1 \Big) \sum_{k=1}^{n} |x_k|
= \Big( \max_{1 \le j \le n} \|a_j\|_1 \Big) \|x\|_1 = \max_{1 \le j \le n} \|a_j\|_1, \tag{2.34}
\]

where we have used ‖x‖_1 = 1 in the last step. Furthermore, we may choose x = e_k, where
j = k maximizes ‖a_j‖_1, that is, we have ‖a_k‖_1 = \max_{1 \le j \le n} \|a_j\|_1. Then

\[
\|A e_k\|_1 = \|a_k\|_1 = \max_{1 \le j \le n} \|a_j\|_1. \tag{2.35}
\]

From (2.35) we see that the upper bound in (2.34) is attained for x = e_k, and hence from
(2.34) and (2.35),

\[
\max_{1 \le j \le n} \|a_j\|_1 = \|a_k\|_1 = \|A e_k\|_1
\le \sup_{x \in \mathbb{C}^n,\, \|x\|_1 = 1} \|A x\|_1 \le \max_{1 \le j \le n} \|a_j\|_1,
\]

which (using the sandwich theorem) verifies (2.31).
In the case of the induced matrix 2-norm of a matrix A ∈ C^{m×n}, we have

\[
\|A\|_2 = \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \|A x\|_2
= \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \sqrt{(A x)^* (A x)}
= \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \sqrt{x^* A^* A\,x}. \tag{2.36}
\]

Since A^* A is Hermitian, there are n orthonormal eigenvectors z_1, z_2, . . . , z_n corresponding
to the real non-negative eigenvalues σ_1, σ_2, . . . , σ_n. Let x = α_1 z_1 + α_2 z_2 + · · · + α_n z_n,
so that x^* x = |α_1|^2 + |α_2|^2 + · · · + |α_n|^2. Then

\begin{align*}
x^* A^* A\,x &= \Big( \sum_{j=1}^{n} \alpha_j z_j \Big)^* A^* A \Big( \sum_{k=1}^{n} \alpha_k z_k \Big)
= \Big( \sum_{j=1}^{n} \alpha_j z_j \Big)^* \Big( \sum_{k=1}^{n} \alpha_k \sigma_k z_k \Big) \\
&= \sum_{j=1}^{n} \sum_{k=1}^{n} \bar{\alpha}_j \alpha_k \sigma_k\, z_j^* z_k
= \sigma_1 |\alpha_1|^2 + \sigma_2 |\alpha_2|^2 + \ldots + \sigma_n |\alpha_n|^2 \\
&\le \Big( \max_{1 \le j \le n} \sigma_j \Big) \big( |\alpha_1|^2 + |\alpha_2|^2 + \ldots + |\alpha_n|^2 \big)
= \Big( \max_{1 \le j \le n} \sigma_j \Big) x^* x
= \Big( \max_{1 \le j \le n} \sigma_j \Big) \|x\|_2^2, \tag{2.37}
\end{align*}

where we have used A^* A z_k = σ_k z_k, k = 1, 2, . . . , n, and the orthonormality z_j^* z_k = δ_{j,k}
of the eigenvectors z_1, z_2, . . . , z_n. Hence, from (2.36) and (2.37),

\[
\|A\|_2 = \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \sqrt{x^* A^* A\,x}
\le \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \sqrt{\max_{1 \le j \le n} \sigma_j}\; \|x\|_2
= \sqrt{\max_{1 \le j \le n} \sigma_j}. \tag{2.38}
\]

Finally, let k be such that σ_k = \max_{1 \le j \le n} \sigma_j. Then choosing x = z_k gives equality,
since

\[
\|A z_k\|_2^2 = z_k^* A^* A\,z_k = z_k^* (\sigma_k z_k) = \sigma_k \|z_k\|_2^2 = \sigma_k = \max_{1 \le j \le n} \sigma_j. \tag{2.39}
\]

Thus, from (2.39), (2.38), and ‖z_k‖_2 = 1,

\[
\sqrt{\max_{1 \le j \le n} \sigma_j} = \|A z_k\|_2
\le \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \|A x\|_2 = \|A\|_2
\le \sqrt{\max_{1 \le j \le n} \sigma_j}, \tag{2.40}
\]

and we see from the sandwich theorem that ‖A‖_2 = \sqrt{\max_{1 \le j \le n} \sigma_j}.
Finally, for the induced matrix ∞-norm, we get, using |x_j| ≤ ‖x‖_∞ for all j = 1, 2, . . . , n,

\[
\|A\|_\infty = \sup_{x \in \mathbb{C}^n,\, \|x\|_\infty = 1} \|A x\|_\infty
= \sup_{x \in \mathbb{C}^n,\, \|x\|_\infty = 1} \max_{1 \le i \le m} \Big| \sum_{j=1}^{n} a_{i,j} x_j \Big|
\le \sup_{x \in \mathbb{C}^n,\, \|x\|_\infty = 1} \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{i,j}| \, |x_j|
\le \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{i,j}| = \sum_{j=1}^{n} |a_{k,j}|, \tag{2.41}
\]

for some k attaining the maximum. To show that this upper bound is attained we may choose
the vector x = (x_1, x_2, . . . , x_n)^T with components

\[
x_j = \frac{\overline{a_{k,j}}}{|a_{k,j}|}, \quad j = 1, 2, \ldots, n
\]

(with x_j := 1 if a_{k,j} = 0), which satisfies ‖x‖_∞ = 1 since |x_j| = 1 for all j = 1, 2, . . . , n.
Then

\[
\|A\|_\infty \ge \|A x\|_\infty
= \max_{1 \le i \le m} \Big| \sum_{j=1}^{n} a_{i,j}\,\frac{\overline{a_{k,j}}}{|a_{k,j}|} \Big|
\ge \Big| \sum_{j=1}^{n} a_{k,j}\,\frac{\overline{a_{k,j}}}{|a_{k,j}|} \Big|
= \sum_{j=1}^{n} \frac{|a_{k,j}|^2}{|a_{k,j}|} = \sum_{j=1}^{n} |a_{k,j}|. \tag{2.42}
\]

Since the lower bound in (2.42) and the upper bound in (2.41) coincide, we see from the
sandwich theorem that (2.33) holds true. □
Example 2.38 (matrix norms)
Consider the real 3 × 3 matrix A, defined by

\[
A = \begin{pmatrix} 1 & 0 & -4 \\ 2 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix}.
\]

Then

\[
\|A\|_1 = \max_{1 \le j \le 3} \Big( \sum_{i=1}^{3} |a_{i,j}| \Big) = \max\{3, 3, 6\} = 6,
\]

and

\[
\|A\|_\infty = \max_{1 \le i \le 3} \Big( \sum_{j=1}^{3} |a_{i,j}| \Big) = \max\{5, 4, 3\} = 5.
\]

We compute A^T A and obtain

\[
A^T A = \begin{pmatrix} 1 & 2 & 0 \\ 0 & 0 & 3 \\ -4 & 2 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 & -4 \\ 2 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix}
= \begin{pmatrix} 5 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 20 \end{pmatrix}.
\]

Since the matrix A^T A is diagonal, its eigenvalues are the elements on the diagonal. Hence
A^T A has the eigenvalues σ_1 = 20, σ_2 = 9, and σ_3 = 5. Hence we have

\[
\|A\|_2 = \sqrt{\max\{20, 9, 5\}} = \sqrt{20}. \; \Box
\]
Let ‖ · ‖_{(ℓ)}, ‖ · ‖_{(m)}, and ‖ · ‖_{(n)} be norms on C^ℓ, C^m, and C^n, respectively. Let A be an
ℓ × m matrix and B be an m × n matrix. For any x ∈ C^n we have

\[
\|A B\,x\|_{(\ell)} \le \|A\|_{(\ell,m)} \, \|B x\|_{(m)} \le \|A\|_{(\ell,m)} \, \|B\|_{(m,n)} \, \|x\|_{(n)}.
\]

Hence, the induced norm for matrices in C^{ℓ×n} satisfies

\[
\|A B\|_{(\ell,n)} \le \|A\|_{(\ell,m)} \, \|B\|_{(m,n)}. \tag{2.43}
\]

Note: In general (2.43) is not an equality.
Lemma 2.39 (norms of product of matrix with unitary matrix)
For any A ∈ C^{m×n} and any unitary matrix Q ∈ C^{m×m}, we have

\[
\|Q A\|_2 = \|A\|_2 \quad \text{and} \quad \|Q A\|_F = \|A\|_F. \tag{2.44}
\]
Proof of Lemma 2.39. Since Q is unitary (that is, Q^* Q = Q Q^* = I), for any x ∈ C^n,

\[
\|Q A\,x\|_2 = \big( (Q A x)^* (Q A x) \big)^{1/2}
= \big( x^* A^* Q^* Q A\,x \big)^{1/2}
= \big( x^* A^* A\,x \big)^{1/2}
= \big( (A x)^* (A x) \big)^{1/2} = \|A x\|_2. \tag{2.45}
\]

From (2.45), the result ‖Q A‖_2 = ‖A‖_2 follows directly from the definition of an induced
matrix norm.

The second result follows from the fact that ‖B‖_F^2 = trace(B^* B) for any matrix B ∈ C^{m×n}
(see Exercise 39 below). Hence, since Q is unitary,

\[
\|Q A\|_F^2 = \mathrm{trace}\big( (Q A)^* (Q A) \big) = \mathrm{trace}\big( A^* Q^* Q\,A \big)
= \mathrm{trace}(A^* A) = \|A\|_F^2,
\]

which proves ‖Q A‖_F = ‖A‖_F. □
Exercise 38 For the matrix A from Exercises 19, 22, and 23,

\[
A = \begin{pmatrix} \tfrac{3}{2} & 0 & \tfrac{1}{2} \\ 0 & 3 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{3}{2} \end{pmatrix},
\]
compute the induced matrix p-norms for p ∈ {1, 2,∞} and the Frobenius norm.
Exercise 39 Let B = (b_{i,j}) be in C^{m×n}. Show that ‖B‖_F = \sqrt{\mathrm{trace}(B^* B)}. Conclude
that for B in R^{m×n} we have ‖B‖_F = \sqrt{\mathrm{trace}(B^T B)}.
Exercise 40 Consider the matrix A := u v^*, where u ∈ C^m and v ∈ C^n. Show that the
induced matrix 2-norm satisfies ‖A‖_2 = ‖u‖_2 ‖v‖_2.
Exercise 41 Define the matrix norm ‖ · ‖ : R^{n×n} → R by

\[
\|A\| := 7 \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{i,j}|, \quad A = (a_{i,j}) \in \mathbb{R}^{n\times n}.
\]

(a) Show that ‖ · ‖ defines a matrix norm for R^{n×n} by verifying the norm properties.

(b) Show that there exists no vector norm ‖| · ‖| for R^n that induces the matrix norm ‖ · ‖,
that is, show that there exists no vector norm ‖| · ‖| : R^n → R such that

\[
\|A\| = \sup_{x \in \mathbb{R}^n,\, \||x\|| = 1} \||A x\|| \quad \text{for all } A \in \mathbb{R}^{n\times n}.
\]
2.6 Spectral Radius of a Matrix
In this section we consider only square matrices in C^{n×n}. We introduce the spectral radius,
which gives a lower bound for any induced and any compatible matrix norm for C^{n×n}.
Definition 2.40 (spectral radius)
Let A ∈ C^{n×n} with eigenvalues λ_1, λ_2, . . . , λ_n ∈ C. The spectral radius of A is defined by

\[
\rho(A) := \max_{1 \le j \le n} |\lambda_j|.
\]
Note that, if λ is an eigenvalue of A with eigenvector x, then λ^r is an eigenvalue of A^r,
r = 1, 2, 3, . . . , with eigenvector x. (Indeed, A^r x = A^{r-1} (A x) = λ A^{r-1} x = . . . = λ^r x.)
Hence, since g(x) = x^r, r ∈ N, is strictly monotonically increasing for x ≥ 0,

\[
\rho(A^r) = \big[ \rho(A) \big]^r. \tag{2.46}
\]
Example 2.41 (spectral radius)
For the matrix A from Examples 2.4 and 2.8,

\[
A = \begin{pmatrix} 0 & -2 & 2 \\ -2 & -3 & 2 \\ -3 & -6 & 5 \end{pmatrix},
\]

we found the spectrum Λ(A) = {−1, 1, 2}. Thus the spectral radius of this matrix is

\[
\rho(A) = \max\{ |-1|, |1|, |2| \} = 2. \; \Box
\]
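Numerically, the spectral radius is just the largest modulus of an eigenvalue; a one-line
MATLAB check for the matrix of Example 2.41:

A = [0 -2 2; -2 -3 2; -3 -6 5];
rho = max(abs(eig(A)))   % spectral radius, returns 2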
Exercise 42 For the matrix A from Exercises 19, 22, 23, and 38,

\[
A = \begin{pmatrix} \tfrac{3}{2} & 0 & \tfrac{1}{2} \\ 0 & 3 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{3}{2} \end{pmatrix},
\]
determine the spectral radius.
Exercise 43 Calculate by hand the induced matrix p-norms ‖A‖_1, ‖A‖_2, ‖A‖_∞, and the
spectral radius for the matrices

\[
B := \begin{pmatrix} 0 & 0 \\ 1 & -2 \end{pmatrix}
\quad \text{and} \quad
A = \begin{pmatrix}
\tfrac{5}{4} & -\tfrac{1}{2\sqrt{2}} & -\tfrac{1}{4} \\
-\tfrac{1}{2\sqrt{2}} & \tfrac{3}{2} & \tfrac{1}{2\sqrt{2}} \\
-\tfrac{1}{4} & \tfrac{1}{2\sqrt{2}} & \tfrac{5}{4}
\end{pmatrix}.
\]
Exercise 44 Calculate by hand the induced matrix p-norms ‖A‖_1, ‖A‖_2, ‖A‖_∞, the
Frobenius norm, and the spectral radius for the matrix

\[
A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & 1 \\ 0 & 1 & 3 \end{pmatrix}.
\]
Exercise 45 For A ∈ R^{m×n} show that ‖A‖_2 ≤ \sqrt{\|A\|_\infty \, \|A\|_1}.
The next lemma expresses the induced matrix 2-norm in terms of the spectral radius.
Lemma 2.42 (relation between induced matrix 2-norm and spectral radius)
Let A ∈ C^{m×n} (or A ∈ R^{m×n}). Then the induced matrix 2-norm is given by

\[
\|A\|_2 = \sqrt{\rho(A^* A)} \quad \text{(or } \|A\|_2 = \sqrt{\rho(A^T A)} \text{, respectively)}.
\]

If A ∈ C^{n×n} (or A ∈ R^{n×n}) is Hermitian (or symmetric, respectively), then the induced
matrix 2-norm is given by ‖A‖_2 = ρ(A).
Proof of Lemma 2.42. We give the proof only for the case of complex matrices. The first
statement follows directly from Theorem 2.37 and the definition of the spectral radius. Indeed,
from Theorem 2.37, for A ∈ C^{m×n},

\[
\|A\|_2 = \sqrt{\max \{ |\sigma| : \sigma \text{ is an eigenvalue of } A^* A \}} = \sqrt{\rho(A^* A)}.
\]

If A ∈ C^{n×n} is Hermitian, then A^* = A and the eigenvalues of A are real (see Theorem 2.10).
Hence A^* A = A A = A^2. Let λ be an eigenvalue of A, and let x be a corresponding
eigenvector. Then

\[
A^* A\,x = A^2 x = A\,(\lambda x) = \lambda^2 x,
\]

that is, if λ is an eigenvalue of A, then σ = λ^2 = |λ|^2 is an eigenvalue of the square matrix
A^* A. Thus, from Theorem 2.37,

\[
\|A\|_2 = \sqrt{\max \{ \sigma : \sigma \text{ is an eigenvalue of } A^* A \}}
= \sqrt{\max \{ |\lambda|^2 : \lambda \text{ is an eigenvalue of } A \}}
= \max \{ |\lambda| : \lambda \text{ is an eigenvalue of } A \} = \rho(A),
\]

which proves the second statement. □
The next theorem establishes a connection between any induced (or compatible) matrix norm
and the spectral radius. This theorem will be very useful for establishing various theoretical
results.
Theorem 2.43 (relation between induced norms and the spectral radius)
(i) Let ‖ · ‖_{(n)} be a norm on C^n, and let ‖ · ‖ be any matrix norm compatible with the
induced matrix norm ‖ · ‖_{(n,n)}. Then

\[
\rho(A) \le \|A\| \quad \text{for all } A \in \mathbb{C}^{n\times n}.
\]

(ii) For any A ∈ C^{n×n} and any ε > 0 there exists a norm ‖ · ‖_{(n)} on C^n (depending on A
and ε) such that the corresponding induced matrix norm ‖ · ‖_{(n,n)} satisfies

\[
\rho(A) \le \|A\|_{(n,n)} \le \rho(A) + \epsilon.
\]
It is clear that an analogous statement holds for norms on Rn and real square matrices in Rn×n.
Proof of Theorem 2.43. Let ‖ · ‖_{(n)} be an arbitrary norm on C^n, and let ‖ · ‖_{(n,n)} denote
the induced matrix norm. Then we know that any compatible matrix norm ‖ · ‖ satisfies
‖A‖_{(n,n)} ≤ ‖A‖ for all A ∈ C^{n×n} (see formula (2.29)). Hence it is enough to prove

\[
\rho(A) \le \|A\|_{(n,n)} \quad \text{for all } A \in \mathbb{C}^{n\times n}.
\]

Let λ_1, λ_2, . . . , λ_n ∈ C be the eigenvalues of A, where we may assume that

\[
\rho(A) = \max_{1 \le i \le n} |\lambda_i| = |\lambda_1|,
\]

but do not assume any ordering among λ_2, . . . , λ_n. Let x be an eigenvector corresponding to
the eigenvalue λ_1. Then A x = λ_1 x, and, using (2.28),

\[
|\lambda_1| \, \|x\|_{(n)} = \|\lambda_1 x\|_{(n)} = \|A x\|_{(n)} \le \|A\|_{(n,n)} \, \|x\|_{(n)}
\quad \Rightarrow \quad |\lambda_1| \le \|A\|_{(n,n)}.
\]

Hence, ρ(A) = |λ_1| ≤ ‖A‖_{(n,n)}, which verifies (i).

From the Schur factorization (see Theorem 2.21) there exists a unitary matrix W such that
W^* A W is upper triangular, with the eigenvalues λ_1, λ_2, . . . , λ_n of A as the diagonal
elements. Here we may again assume that |λ_1| = ρ(A). Inspection of the proof of Theorem 2.21
shows that we can choose W such that λ_1 is the (1,1)th entry of W^* A W. Since we do not
assume any ordering among λ_2, . . . , λ_n, we may assume (after renumbering λ_2, . . . , λ_n as
required) that λ_j is the (j,j)th entry of W^* A W. Thus

\[
W^* A\,W = \begin{pmatrix}
\lambda_1 & r_{1,2} & \cdots & r_{1,n} \\
0 & \lambda_2 & \cdots & r_{2,n} \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & \lambda_n
\end{pmatrix}. \tag{2.47}
\]
Let D be the diagonal matrix D = diag(1, δ, δ^2, . . . , δ^{n-1}), where δ ≤ 2^{-1} min{1, ε/r}
and r = \max_{1 \le i < j \le n} |r_{i,j}|. Let S = W D; then from (2.47) and
D^{-1} = diag(1, δ^{-1}, δ^{-2}, . . . , δ^{-(n-1)}),

\[
S^{-1} A\,S = \begin{pmatrix}
\lambda_1 & \delta r_{1,2} & \cdots & \delta^{n-1} r_{1,n} \\
0 & \lambda_2 & \cdots & \delta^{n-2} r_{2,n} \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & \lambda_n
\end{pmatrix},
\]

that is, (S^{-1} A S)_{i,i+k} = δ^k r_{i,i+k} for i = 1, 2, . . . , n−1 and k = 1, 2, . . . , n−i.
Furthermore, using Theorem 2.37 and ρ(A) = |λ_1|,

\begin{align*}
\|S^{-1} A\,S\|_\infty &= \max_{1 \le i \le n} \sum_{j=1}^{n} |(S^{-1} A S)_{i,j}|
= \max_{1 \le i \le n} \big( |\lambda_i| + |\delta\,r_{i,i+1}| + |\delta^2 r_{i,i+2}| + \ldots + |\delta^{n-i} r_{i,n}| \big) \\
&= \max_{1 \le i \le n} \Big( |\lambda_i| + \delta \big( |r_{i,i+1}| + \delta\,|r_{i,i+2}| + \ldots + \delta^{n-i-1} |r_{i,n}| \big) \Big) \\
&\le \max_{1 \le i \le n} \Big( |\lambda_i| + \delta\,r \big( 1 + \delta + \ldots + \delta^{n-i-1} \big) \Big)
= |\lambda_1| + \delta\,r \big( 1 + \delta + \cdots + \delta^{n-2} \big),
\end{align*}

where the second term is a geometric progression. Hence, using δ ≤ 1/2 and δ ≤ ε/(2r) by
the definition of δ,

\[
\|S^{-1} A\,S\|_\infty \le \rho(A) + \delta\,r\,\frac{1 - \delta^{n-1}}{1 - \delta}
\le \rho(A) + \delta\,r\,\frac{1}{1 - \delta}
\le \rho(A) + \frac{\epsilon}{2r}\,r \cdot 2 = \rho(A) + \epsilon. \tag{2.48}
\]

Since ‖S^{-1} A S‖_∞ is the matrix norm induced by the vector norm ‖x‖_{S^{-1},∞} := ‖S^{-1} x‖_∞,
x ∈ C^n (see Exercise 46 below), from (i) and (2.48),

\[
\rho(A) \le \|S^{-1} A\,S\|_\infty \le \rho(A) + \epsilon,
\]

and the proof is complete. □
Exercise 46 Let S ∈ C^{n×n} be an invertible matrix. Show that ‖x‖ := ‖S^{-1} x‖_∞ defines a
norm for C^n. Show that this norm induces the matrix norm

\[
\|A\| := \|S^{-1} A\,S\|_\infty, \quad A \in \mathbb{C}^{n\times n}.
\]

The next result (see Theorem 2.45 below) considers a particular type of series of matrices, the
so-called Neumann series, and conditions for its convergence. The Neumann series can be seen
as a generalization of the geometric series. Theorem 2.45 below also gives a condition on A
under which the matrix I − A is non-singular (or invertible).
Definition 2.44 (convergence of sequences of matrices w.r.t. matrix norm)
Let {X_r} be a sequence of m × n matrices in C^{m×n} or R^{m×n}. We say that the sequence
{X_r} converges with respect to a given matrix norm ‖ · ‖, with limit X ∈ C^{m×n} or
X ∈ R^{m×n}, respectively, if for every ε > 0 there is an N = N(ε) such that

\[
\|X - X_r\| < \epsilon \quad \text{for all } r \ge N.
\]

Equivalently, {X_r} in C^{m×n} (or in R^{m×n}) converges with respect to ‖ · ‖ to X ∈ C^{m×n}
(or X ∈ R^{m×n}) if \lim_{r \to \infty} \|X_r - X\| = 0.
In ‘Further Analysis’, we have learned that convergence can be defined for any normed linear
space; the definition above is just one special case, where the linear space is the set of real or
complex m×n matrices and where the norm is the given matrix norm. From the general notion
of convergence in a normed linear space it is known that the limit of a convergent sequence is
unique. However, we can also easily verify this for the concrete case in Definition 2.44 above.
Assume the sequence {X_r} has two limits X and Y . Then, given ε > 0, there exist N = N(ε)
and M = M(ε) such that

\[
\|X - X_r\| < \frac{\epsilon}{2} \text{ for all } r \ge N,
\quad \text{and} \quad
\|Y - X_r\| < \frac{\epsilon}{2} \text{ for all } r \ge M.
\]

Choose R := max{N, M}; then from the triangle inequality, for r ≥ R,

\[
\|X - Y\| = \|(X - X_r) - (Y - X_r)\| \le \|X - X_r\| + \|Y - X_r\| < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon. \tag{2.49}
\]

Since ε > 0 was arbitrary, we see from (2.49) that X = Y and the limit is unique.

We have proved in Theorem 2.30 that all norms on C^k are equivalent. Letting k = m · n, we
see that all matrix norms on C^{m×n} are equivalent. Thus the notion of convergence
and the limit do not depend on the choice of the matrix norm, since equivalent matrix
norms induce the same notion of convergence.
Theorem 2.45 (Neumann series)
Let A ∈ C^{n×n} (or A ∈ R^{n×n}). Let {X_r} be defined by

\[
X_r := I + A + \cdots + A^r = \sum_{k=0}^{r} A^k, \quad r = 0, 1, 2, \ldots.
\]

Then {X_r} converges if and only if ρ(A) < 1. If ρ(A) < 1, then I − A is non-singular and
the limit of {X_r} is

\[
X = \sum_{k=0}^{\infty} A^k = (I - A)^{-1}. \tag{2.50}
\]

The series in (2.50) is called the Neumann series of the matrix A.
Proof of Theorem 2.45. First we show that I − A is non-singular if ρ(A) < 1.

If λ_1, λ_2, . . . , λ_n are the eigenvalues of A, then 1 − λ_1, 1 − λ_2, . . . , 1 − λ_n are the
eigenvalues of I − A. A matrix is non-singular if and only if its determinant is different from
zero, and

\[
\det(I - A) = (1 - \lambda_1)(1 - \lambda_2) \cdots (1 - \lambda_n). \tag{2.51}
\]

If ρ(A) = \max_{1 \le j \le n} |\lambda_j| < 1, then |λ_j| < 1 for all j = 1, 2, . . . , n, and (2.51)
implies det(I − A) ≠ 0. Thus I − A is non-singular if ρ(A) < 1.

Next we show that ρ(A) < 1 implies the convergence of the sequence {X_r} to (I − A)^{-1}.
Assume ρ(A) < 1, and choose ε := (1 − ρ(A))/2. From Theorem 2.43, given ε > 0, there exists
some norm for C^n such that for the corresponding induced matrix norm ‖ · ‖ we have

\[
\|A\| \le \rho(A) + \epsilon = \rho(A) + \frac{1 - \rho(A)}{2} = \frac{1 + \rho(A)}{2} =: \alpha < 1
\quad \text{(since } \rho(A) < 1\text{)}.
\]

Because ρ(A) < 1, the matrix I − A is invertible, and because ‖A‖ ≤ α < 1,

\[
\big\| (I - A)^{-1} - X_r \big\|
= \big\| (I - A)^{-1} \big( I - (I - A) X_r \big) \big\|
= \big\| (I - A)^{-1} A^{r+1} \big\|
\le \big\| (I - A)^{-1} \big\| \, \|A\|^{r+1}
\le \big\| (I - A)^{-1} \big\| \, \alpha^{r+1}. \tag{2.52}
\]

In (2.52), we have used in the second step the definition of X_r and

\[
I - (I - A) X_r = I - X_r + A X_r
= I - \sum_{k=0}^{r} A^k + \sum_{k=0}^{r} A^{k+1}
= I - I - \sum_{k=1}^{r} A^k + \sum_{\ell=1}^{r} A^\ell + A^{r+1}
= A^{r+1}.
\]

Since 0 < α < 1, the left-hand side in (2.52) tends to zero for r → ∞, and so {X_r} converges
to the limit (I − A)^{-1}.

Finally, we prove that if {X_r} converges then ρ(A) < 1, by showing the equivalent negated
statement: if ρ(A) ≥ 1, then {X_r} diverges.

Let ρ(A) ≥ 1, and let ‖ · ‖ be some induced matrix norm on C^{n×n}. Choose ε = 1: then for
every N ∈ N,

\[
\|X_{N+1} - X_N\| = \Big\| \sum_{k=0}^{N+1} A^k - \sum_{k=0}^{N} A^k \Big\|
= \big\| A^{N+1} \big\| \ge \rho(A^{N+1}) = \big[ \rho(A) \big]^{N+1} \ge 1 = \epsilon, \tag{2.53}
\]

where we have used (2.46). From (2.53), we see that {X_r} is not a Cauchy sequence, and
hence {X_r} diverges. □
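The convergence statement can be observed numerically; the following MATLAB sketch
(with a small sample matrix of our own choosing satisfying ρ(A) < 1) sums the partial sums
X_r and compares with (I − A)^{-1}:

A = [0.5 0.2; 0.1 0.3];          % sample matrix with rho(A) < 1
X = eye(2); P = eye(2);
for r = 1:50
    P = P*A;                     % P = A^r
    X = X + P;                   % X = I + A + ... + A^r
end
norm(X - inv(eye(2)-A), inf)     % ~0: the Neumann series converges to (I-A)^{-1}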
A useful consequence of Theorem 2.45 is the following estimate for the norm of (I ± A)^{-1},
where ‖A‖ < 1 in some induced (or compatible) matrix norm ‖ · ‖.

Corollary 2.46 (estimate for ‖(I ± A)^{-1}‖ if ‖A‖ < 1)
Let A be an n × n matrix in C^{n×n} or R^{n×n}, and assume that in some induced (or
compatible) matrix norm ‖ · ‖ we have ‖A‖ < 1. Then I ± A is invertible and

\[
\|(I \pm A)^{-1}\| \le \big( 1 - \|A\| \big)^{-1}. \tag{2.54}
\]
Proof of Corollary 2.46. If for some induced matrix norm ‖A‖ = ‖±A‖ < 1, then, from
Theorem 2.43, ρ(±A) ≤ ‖±A‖ < 1. Thus I ± A = I − (∓A) is invertible, and (I ± A)^{-1} is
given by the Neumann series

\[
(I \pm A)^{-1} = \big( I - (\mp A) \big)^{-1} = \sum_{k=0}^{\infty} (\mp A)^k = \sum_{k=0}^{\infty} (\mp 1)^k A^k.
\]

Thus, from the triangle inequality,

\[
\big\| (I \pm A)^{-1} \big\| = \Big\| \sum_{k=0}^{\infty} (\mp A)^k \Big\|
\le 1 + \|{\mp}A\| + \|{\mp}A\|^2 + \ldots
= 1 + \|A\| + \|A\|^2 + \ldots = \sum_{k=0}^{\infty} \|A\|^k
= (1 - \|A\|)^{-1},
\]

where the last step follows from the geometric series, since ‖A‖ < 1. □
Exercise 47 Show that the following series of matrices converges and compute its limit:

\[
\sum_{k=0}^{\infty} A^k, \quad \text{where } A = \begin{pmatrix} \tfrac{1}{2} & -2 & -1 \\ 0 & \tfrac{1}{3} & 0 \\ 0 & 0 & -\tfrac{1}{2} \end{pmatrix}.
\]
Exercise 48 Show that the following series of matrices converges and compute its limit:

\[
\sum_{k=0}^{\infty} A^k, \quad \text{where } A = \begin{pmatrix} \tfrac{1}{4} & 0 & 0 \\ 1 & -\tfrac{1}{2} & 0 \\ -1 & -2 & \tfrac{1}{3} \end{pmatrix}.
\]
Exercise 49 Let A ∈ Cn×n, and assume that the n complex eigenvalues λ1, λ2, . . . , λn of A
satisfy |λj| > 1 for all j = 1, 2, . . . , n. Is the matrix I − A invertible? Give a proof of your
answer.
Exercise 50 Let A ∈ C^{n×n} satisfy ρ(A) < 1, and let S ∈ C^{n×n} be a unitary matrix (that
is, S^* = \bar{S}^T = S^{-1}). Show that I − S^* A S is invertible and find its inverse in two
different ways.
Chapter 3
Floating Point Arithmetic and Stability
In this chapter we learn about the condition numbers of matrices, floating point arithmetic,
and the stability of numerical algorithms.
3.1 Condition Numbers of Matrices
Suppose A is a square matrix, and we want to solve the linear system Ax = b. Unfortunately,
our input data A and b have been perturbed such that we are actually given A + ∆A and
b+∆b, and we do not know A and b. We can try to model the error in the output by assuming
that x + ∆x solves
(A + ∆A) (x + ∆x) = b + ∆b.
Multiplying this out and using A x = b yields

\[
(A + \Delta A)\,\Delta x = \Delta b - (\Delta A)\,x.
\]

For the time being, let us assume that A + ∆A is invertible, which is true if A is invertible
and ∆A is only a small perturbation. Then,

\[
\Delta x = (A + \Delta A)^{-1} \big( \Delta b - (\Delta A)\,x \big),
\]

that is,

\[
\|\Delta x\| \le \|(A + \Delta A)^{-1}\| \big( \|\Delta b\| + \|\Delta A\| \, \|x\| \big), \tag{3.1}
\]

which gives a first estimate on the absolute error ‖∆x‖.
However, more relevant is the relative error ‖∆x‖/‖x‖, and, from (3.1), we find

\begin{align*}
\frac{\|\Delta x\|}{\|x\|}
&\le \|(A + \Delta A)^{-1}\| \, \|A\| \left( \frac{\|\Delta b\|}{\|A\| \, \|x\|} + \frac{\|\Delta A\|}{\|A\|} \right) \\
&\le \|(A + \Delta A)^{-1}\| \, \|A\| \left( \frac{\|\Delta b\|}{\|b\|} + \frac{\|\Delta A\|}{\|A\|} \right), \tag{3.2}
\end{align*}

where the last line follows from ‖b‖ = ‖A x‖ ≤ ‖A‖ ‖x‖. The last estimate in (3.2) shows
that the relative input error, magnified by the factor

\[
\|(A + \Delta A)^{-1}\| \, \|A\|,
\]

gives an upper bound for the relative output error. This demonstrates the problem of
conditioning. The latter expression can be modified further to incorporate the so-called
condition number of a matrix.
Definition 3.1 (condition number)
The condition number of an invertible square matrix A with respect to an induced or
compatible matrix norm ‖ · ‖ is defined as
κ(A) := ‖A‖ ‖A−1‖.
The next theorem estimates the relative error in terms of the condition number.
Theorem 3.2 (estimate of the relative error)
Let ‖ · ‖ be an induced or compatible matrix norm. Suppose that A is an invertible square
matrix, let x be the solution to Ax = b, and assume that the perturbation matrix ∆A
satisfies ‖A−1‖ ‖∆A‖ < 1. Then, there exists a unique vector ∆x such that
(A + ∆A) (x + ∆x) = b + ∆b,
and for b 6= 0, we have
‖∆x‖‖x‖ ≤ κ(A)
(1 − κ(A)
‖∆A‖‖A‖
)−1(‖∆b‖‖b‖ +
‖∆A‖‖A‖
). (3.3)
Proof of Theorem 3.2. Essentially we need to estimate the factor ‖(A + ∆A)−1‖ ‖A‖ in
(3.2). However, we first need to verify that A + ∆A is non-singular, and hence ∆x exists and
is uniquely determined by (A + ∆A) (x + ∆x) = b + ∆b.
We have

\[
A + \Delta A = A \big( I + A^{-1} \Delta A \big), \tag{3.4}
\]

and, by the assumption ‖A^{-1}‖ ‖∆A‖ < 1,

\[
\big\| -A^{-1} \Delta A \big\| = \|A^{-1} \Delta A\| \le \|A^{-1}\| \, \|\Delta A\| < 1. \tag{3.5}
\]

Thus, by Theorem 2.45 we know that I + A^{-1} ∆A = I − (−A^{-1} ∆A) is invertible. Since
I + A^{-1} ∆A and A are invertible, we see from (3.4) that A + ∆A is invertible. Furthermore,
from (3.5) and (2.54) in Corollary 2.46,

\[
\big\| (I + A^{-1} \Delta A)^{-1} \big\| \le \frac{1}{1 - \|A^{-1} \Delta A\|} \le \frac{1}{1 - \|A^{-1}\| \, \|\Delta A\|}. \tag{3.6}
\]

Since, from (3.4),

\[
(A + \Delta A)^{-1} = \big( I + A^{-1} \Delta A \big)^{-1} A^{-1},
\]

the estimate (3.6) implies

\[
\big\| (A + \Delta A)^{-1} \big\| \le \big\| (I + A^{-1} \Delta A)^{-1} \big\| \, \|A^{-1}\|
\le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \, \|\Delta A\|}. \tag{3.7}
\]

Using (3.7) to estimate ‖(A + ∆A)^{-1}‖ ‖A‖ yields

\[
\|(A + \Delta A)^{-1}\| \, \|A\| \le \frac{\|A^{-1}\| \, \|A\|}{1 - \|A^{-1}\| \, \|\Delta A\|}
= \left( 1 - \|A^{-1}\| \, \|A\| \, \frac{\|\Delta A\|}{\|A\|} \right)^{-1} \kappa(A)
= \left( 1 - \kappa(A)\,\frac{\|\Delta A\|}{\|A\|} \right)^{-1} \kappa(A).
\]

Substitution of this estimate into (3.2) yields the desired estimate of the relative error. □
Example 3.3 (condition numbers)
For the non-singular matrix A, given by

\[
A = \begin{pmatrix} 2 & 0 & -1 \\ 1 & 0 & 2 \\ 0 & 2 & 0 \end{pmatrix},
\]

compute the condition numbers with respect to the induced p-norms, for p ∈ {1, 2, ∞}.

Solution: First we need to find the inverse matrix A^{-1}. Transforming the augmented system
(A|I) with elementary row operations to the form (I|A^{-1}), we find

\[
A^{-1} = \begin{pmatrix} \tfrac{2}{5} & \tfrac{1}{5} & 0 \\ 0 & 0 & \tfrac{1}{2} \\ -\tfrac{1}{5} & \tfrac{2}{5} & 0 \end{pmatrix}.
\]

Using Theorem 2.37, we find

\[
\|A\|_1 = 3 \text{ and } \|A^{-1}\|_1 = \tfrac{3}{5}
\quad \Rightarrow \quad \kappa_1(A) = \|A\|_1 \, \|A^{-1}\|_1 = \frac{9}{5},
\]

and

\[
\|A\|_\infty = 3 \text{ and } \|A^{-1}\|_\infty = \tfrac{3}{5}
\quad \Rightarrow \quad \kappa_\infty(A) = \|A\|_\infty \, \|A^{-1}\|_\infty = \frac{9}{5}.
\]

For the condition number with respect to the induced matrix 2-norm we need to find the
eigenvalues of A^T A and (A^{-1})^T A^{-1}. We have

\[
A^T A = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 0 & 2 \\ -1 & 2 & 0 \end{pmatrix}
\begin{pmatrix} 2 & 0 & -1 \\ 1 & 0 & 2 \\ 0 & 2 & 0 \end{pmatrix}
= \begin{pmatrix} 5 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 5 \end{pmatrix},
\]

and thus A^T A has the eigenvalues λ_1 = 5 and λ_2 = 4. Hence, using Theorem 2.37,

\[
\|A\|_2 = \sqrt{\rho(A^T A)} = \sqrt{\max\{4, 5\}} = \sqrt{5}.
\]

Likewise,

\[
(A^{-1})^T A^{-1} =
\begin{pmatrix} \tfrac{2}{5} & 0 & -\tfrac{1}{5} \\ \tfrac{1}{5} & 0 & \tfrac{2}{5} \\ 0 & \tfrac{1}{2} & 0 \end{pmatrix}
\begin{pmatrix} \tfrac{2}{5} & \tfrac{1}{5} & 0 \\ 0 & 0 & \tfrac{1}{2} \\ -\tfrac{1}{5} & \tfrac{2}{5} & 0 \end{pmatrix}
= \begin{pmatrix} \tfrac{1}{5} & 0 & 0 \\ 0 & \tfrac{1}{5} & 0 \\ 0 & 0 & \tfrac{1}{4} \end{pmatrix},
\]

and thus (A^{-1})^T A^{-1} has the eigenvalues λ_1 = 1/4 and λ_2 = 1/5. Hence, from Theorem 2.37,

\[
\|A^{-1}\|_2 = \sqrt{\rho\big( (A^{-1})^T A^{-1} \big)} = \sqrt{\max\{1/4,\, 1/5\}} = \sqrt{1/4} = \frac{1}{2}.
\]

Hence, κ_2(A) = ‖A‖_2 ‖A^{-1}‖_2 = \sqrt{5}/2. □
Remark 3.4 (condition number for a positive definite symmetric matrix)
Note that if A ∈ R^{n×n} is symmetric and positive definite, all eigenvalues of A are positive
(see Theorem 1.1). If these eigenvalues are denoted by λ_1 ≥ λ_2 ≥ · · · ≥ λ_n > 0, then the
eigenvalues of A^{-1} are given by 1/λ_j, j = 1, 2, . . . , n. In the induced 2-norm we have
‖A‖_2 = \sqrt{\rho(A^T A)} (see Theorem 2.37 and Lemma 2.42) for any symmetric positive
definite matrix A ∈ R^{n×n}. Thus

\[
\|A\|_2 = \sqrt{\rho(A^T A)} = \rho(A) = \lambda_1,
\qquad
\|A^{-1}\|_2 = \sqrt{\rho\big( (A^{-1})^T A^{-1} \big)} = \rho(A^{-1}) = \lambda_n^{-1},
\]

where we have used the fact that the inverse of a positive definite symmetric matrix is also
a positive definite symmetric matrix. Thus

\[
\kappa_2(A) = \|A\|_2 \, \|A^{-1}\|_2 = \frac{\lambda_1}{\lambda_n}. \tag{3.8}
\]

We see from (3.8) that if λ_1 ≫ λ_n, then the condition number κ_2(A) is very large. This
means, from (3.3) in Theorem 3.2, that in the upper bound for the relative error the input
error is amplified by a very large factor. The problem of solving Ax = b is then poorly
conditioned.
Exercise 51 For the matrix A from Exercises 19, 22, 23, and 38,

\[
A = \begin{pmatrix} \tfrac{3}{2} & 0 & \tfrac{1}{2} \\ 0 & 3 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{3}{2} \end{pmatrix},
\]

compute the condition numbers with respect to the induced p-norms, for p ∈ {1, 2, ∞}.
Exercise 52 Compute the condition numbers with respect to the induced p-norms, where
p ∈ {1, 2,∞}, of the matrix
\[
A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & 1 \\ 0 & 1 & 3 \end{pmatrix}.
\]
Exercise 53 Let A = (a_{i,j}) in R^{n×n} be invertible and satisfy

\[
\sum_{j=1}^{n} |a_{i,j}| = 1, \quad 1 \le i \le n.
\]

Let D ∈ R^{n×n} be an arbitrary diagonal matrix with det(D) ≠ 0. Let κ_∞(A) denote the
condition number of A with respect to the induced matrix ∞-norm ‖ · ‖_∞. Show that

\[
\kappa_\infty(D A) \ge \kappa_\infty(A).
\]
3.2 Floating Point Arithmetic
Since computers use a finite number of bits to represent a real number, they can only represent
a finite subset of the real numbers. In general the range of representable numbers is large
enough; the major problem is the gaps between them.
Hence, it is time to discuss the representation of a number in a computer. We are used to
representing a number in decimal digits, e.g.,

\[
105.67 = 10^3 \cdot 0.10567
= +10^3 \big( 1 \cdot 10^{-1} + 0 \cdot 10^{-2} + 5 \cdot 10^{-3} + 6 \cdot 10^{-4} + 7 \cdot 10^{-5} \big).
\]
However, there is no need to represent numbers with respect to a base 10. A more general
representation is the B-adic representation.
Definition 3.5 (B-adic representation of floating point numbers)
A B-adic, normalized floating point number of precision m is either x = 0 or

\[
x = \pm B^e \sum_{k=-m}^{-1} x_k B^k, \quad x_{-1} \neq 0, \quad x_k \in \{0, 1, 2, \ldots, B-1\}.
\]

Here, e denotes the exponent with e_{min} ≤ e ≤ e_{max}, B ∈ N with B ≥ 2 is the base, and
\sum_{k=-m}^{-1} x_k B^k denotes the mantissa.
The corresponding floating point number system consists of all numbers that can be
represented in this form.
For a computer, the different formats for representing a number, like single and double
precision, are defined by the IEEE 754 standard. For example, it states that for a double
precision number we have B = 2 and m = 52. Hence, if one bit is used to store the sign and
if 11 bits are reserved for representing the exponent, a double precision number can be
stored in 64 bits.
As a consequence, the distribution of floating point numbers is not uniform. Furthermore, every
real number and the result of every operation has to be rounded. There are different rounding
strategies, but the default one is rounding to the nearest machine number.
Definition 3.6 (machine precision)
The machine precision, denoted by eps, is the smallest number which satisfies

\[
|x - \mathrm{rd}(x)| \le \mathrm{eps}\,|x| \tag{3.9}
\]

for all x ∈ R within the range of the floating point number system, where rd(x) denotes
rounding to the nearest machine number.
Note that (3.9) is a relative error bound, since for x ∈ R \ {0} (within the range of the
floating point number system)

\[
\frac{|x - \mathrm{rd}(x)|}{|x|} \le \mathrm{eps}.
\]
Theorem 3.7 (precision of floating point numbers)
For a floating point number system with base B and precision m we have

\[
|x - \mathrm{rd}(x)| \le |x| \, B^{1-m} \tag{3.10}
\]

for all x within the range of the floating point number system, which means eps ≤ B^{1-m}.
Proof of Theorem 3.7. For every real number x ≠ 0 in the range of the floating point
system there exists an integer e between e_{min} and e_{max} such that

\[
x = \pm B^e \sum_{k=-\infty}^{-1} x_k B^k
= \pm B^e \sum_{k=-m}^{-1} x_k B^k + \Big( \pm B^e \sum_{k=-\infty}^{-m-1} x_k B^k \Big),
\]

where x_{-1} ≠ 0. Hence, we have

\begin{align*}
|x - \mathrm{rd}(x)| &\le B^e \sum_{k=-\infty}^{-m-1} x_k B^k
\le B^e \sum_{k=-\infty}^{-m-1} (B-1)\,B^k
= B^e (B-1) \sum_{\ell=m+1}^{\infty} B^{-\ell} \\
&= B^e (B-1) \left( \frac{1}{1 - B^{-1}} - \frac{1 - B^{-(m+1)}}{1 - B^{-1}} \right)
= B^e (B-1) \left( \frac{B}{B-1} - \frac{B - B^{-m}}{B-1} \right) \\
&= B^e (B-1)\,\frac{B^{-m}}{B-1} = B^{e-m}, \tag{3.11}
\end{align*}

where we have used 0 ≤ x_k ≤ B − 1 and the formulas for the geometric sum and the geometric
series. On the other hand, we have, because x_{-1} ≠ 0 and thus x_{-1} ≥ 1,

\[
|x| \ge B^e x_{-1} B^{-1} \ge B^e B^{-1} = B^{e-1}, \tag{3.12}
\]

such that, from (3.11) and (3.12),

\[
\frac{|x - \mathrm{rd}(x)|}{|x|} \le \frac{B^{e-m}}{B^{e-1}} = B^{1-m},
\]

which gives the desired result (3.10) by multiplying with |x|. □
For the operations +,−, ·, /, we will denote the corresponding floating point operations
by ⊕,⊖,⊙,⊘.
Theorem 3.8 (accuracy of floating point operations)
Let ∗ be one of the operations +, −, ·, /, and let ⊛ be the equivalent floating point operation.
Then for all x, y from the floating point system, there exists ε with |ε| ≤ eps such that

\[
x \circledast y = (x * y)(1 + \epsilon). \tag{3.13}
\]

If x ∗ y ≠ 0 in (3.13), then (3.13) implies

\[
\frac{|(x \circledast y) - (x * y)|}{|x * y|} = |\epsilon| \le \mathrm{eps}. \tag{3.14}
\]

In words, (3.14) means that every operation in floating point arithmetic is exact up to
a relative error of size ≤ eps.
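In IEEE double precision arithmetic (B = 2, m = 52) the machine precision is available in
MATLAB as eps, and a relative error of the size described by (3.14) can be observed directly
(note that 0.1, 0.2, and 0.3 are themselves rounded to machine numbers):

eps                             % machine precision, about 2.2204e-16
abs((0.1 + 0.2) - 0.3) / 0.3    % relative error of the order of eps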
3.3 Conditioning
Consider a problem f : X → Y from a normed vector space X of data to a normed vector
space Y of solutions. For a particular point x ∈ X we say that the problem f(x) is
well-conditioned if all small perturbations of x lead only to small changes in f(x). A problem
is ill-conditioned if there exists a small perturbation that leads to a large change in f(x).

In general, one distinguishes between the absolute condition number, which is the smallest
number κ̂(x) which satisfies

\[
\|f(x + \Delta x) - f(x)\| \le \hat{\kappa}(x) \, \|\Delta x\| \quad \text{for all } \Delta x,
\]

and the more useful relative condition number, which is the smallest number κ(x) satisfying

\[
\frac{\|f(x + \Delta x) - f(x)\|}{\|f(x)\|} \le \kappa(x) \, \frac{\|\Delta x\|}{\|x\|} \quad \text{for all } \Delta x.
\]

Note that the absolute condition number κ̂(x) and the relative condition number κ(x) depend
on x. As we will see in the examples below, the condition number can vary a lot depending
on x.
We discuss some examples of well-conditioned and ill-conditioned problems.
Example 3.9 (matrix-vector multiplication)
Let A ∈ R^{m×n} or A ∈ C^{m×n}, and let ‖ · ‖_{(m)} and ‖ · ‖_{(n)} be norms on R^m and R^n, or
C^m and C^n, respectively. Let f(x) = A x be the problem of matrix-vector multiplication.
Then the relative error satisfies

\[
\frac{\|f(x + \Delta x) - f(x)\|_{(m)}}{\|f(x)\|_{(m)}}
= \frac{\|A (x + \Delta x) - A x\|_{(m)}}{\|A x\|_{(m)}}
= \frac{\|A\,\Delta x\|_{(m)}}{\|A x\|_{(m)}}
\le \frac{\|A\|_{(m,n)} \, \|\Delta x\|_{(n)}}{\|A x\|_{(m)}}
= \frac{\|A\|_{(m,n)} \, \|x\|_{(n)}}{\|A x\|_{(m)}} \cdot \frac{\|\Delta x\|_{(n)}}{\|x\|_{(n)}}
= \kappa(x)\,\frac{\|\Delta x\|_{(n)}}{\|x\|_{(n)}},
\]

with the relative condition number

\[
\kappa(x) := \frac{\|A\|_{(m,n)} \, \|x\|_{(n)}}{\|A x\|_{(m)}}.
\]

If m = n and if A is invertible, we have ‖x‖_{(n)} = ‖A^{-1} A x‖_{(n)} ≤ ‖A^{-1}‖_{(n,n)} ‖A x‖_{(n)},
such that

\[
\kappa(x) = \frac{\|A\|_{(n,n)} \, \|x\|_{(n)}}{\|A x\|_{(n)}}
\le \|A\|_{(n,n)} \, \|A^{-1}\|_{(n,n)} = \kappa(A). \; \Box
\]
In practice, we often do not apply the definitions of the absolute and relative condition
number exactly, but investigate the conditioning of a problem in a slightly less formalized
way, as illustrated in the following examples.
Example 3.10 (addition of two numbers)
Let f : R^2 → R be defined as f(x, y) = x + y. Then,

\begin{align*}
\frac{f(x + \Delta x, y + \Delta y) - f(x, y)}{f(x, y)}
&= \frac{x + \Delta x + y + \Delta y - (x + y)}{x + y}
= \frac{\Delta x + \Delta y}{x + y}
= \frac{\Delta x}{x + y} + \frac{\Delta y}{x + y} \\
&= \frac{x}{x + y} \cdot \frac{\Delta x}{x} + \frac{y}{x + y} \cdot \frac{\Delta y}{y}.
\end{align*}

Hence, the addition of two numbers is well-conditioned if both numbers have the same sign
and are sufficiently far away from zero. However, if both numbers are of the same magnitude
with different sign, that is, x ≈ −y, then adding these numbers is highly ill-conditioned
because of cancellation (that is, x + y ≈ 0). □
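Cancellation is easy to provoke in floating point arithmetic; in the following MATLAB
sketch the two summands are of the same magnitude with opposite sign, and the computed
sum loses almost all correct digits:

x = 1 + 1e-15;  y = -1;           % x is approximately -y
s = x + y;                        % exact answer would be 1e-15
abs(s - 1e-15) / 1e-15            % relative error ~1e-1, vastly larger than eps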
Example 3.11 (multiplying two numbers)
For the multiplication f : R^2 → R, f(x, y) = x y, of two real numbers x and y, we have

\[
\frac{f(x + \Delta x, y + \Delta y) - f(x, y)}{f(x, y)}
= \frac{(x + \Delta x)(y + \Delta y) - x y}{x y}
= \frac{x\,\Delta y + y\,\Delta x + \Delta x\,\Delta y}{x y}
= \frac{\Delta y}{y} + \frac{\Delta x}{x} + \frac{\Delta x}{x}\,\frac{\Delta y}{y},
\]

which shows that the multiplication of two numbers is always well-conditioned. □
Example 3.12 (function evaluation)
Let f(x) = e^{x^2}. Since f'(x) = 2x\,e^{x^2} = 2x\,f(x), we can use Taylor's formula to represent

\[
f(x + \Delta x) = f(x) + f'(x)\,\Delta x + O\big( (\Delta x)^2 \big).
\]

Substituting this expression into [f(x + ∆x) − f(x)]/f(x) yields

\[
\frac{f(x + \Delta x) - f(x)}{f(x)}
= \frac{\big[ e^{x^2} + 2x\,e^{x^2} \Delta x + O\big( (\Delta x)^2 \big) \big] - e^{x^2}}{e^{x^2}}
= 2x\,\Delta x + \frac{1}{e^{x^2}}\,O\big( (\Delta x)^2 \big)
= 2x^2\,\frac{\Delta x}{x} + O\big( (\Delta x)^2 \big).
\]

The problem is well-conditioned for small x and ill-conditioned for large x. □
3.4 Stability
Consider a problem f : X → Y from a vector space X of data to a vector space Y of solutions.
A numerical algorithm for the computation of f(x) = y is defined to be another map
f̃ : X → Y such that f̃(x) 'approximates' f(x). That is, given data x ∈ X, this data will be
rounded to floating point precision and supplied to the algorithm f̃. What results is a floating
point number f̃(x) ∈ Y . The major question is how accurate this output f̃(x) is, compared
to the true answer f(x). We will look at the relative error of the numerical solution,

\[
\mathrm{err}_{\mathrm{rel}} := \frac{\|\tilde{f}(x) - f(x)\|}{\|f(x)\|}.
\]

We would like this error to be small, of order eps.

We say that an algorithm f̃ for a problem f is stable if for each x ∈ X,

\[
\frac{\|\tilde{f}(x) - f(\tilde{x})\|}{\|f(\tilde{x})\|} = O(\mathrm{eps})
\quad \text{for some } \tilde{x} \text{ with } \frac{\|\tilde{x} - x\|}{\|x\|} = O(\mathrm{eps}).
\]

'A stable algorithm gives nearly the right answer to nearly the right question.'

A stronger and simpler numerical linear algebra condition is that of backward stability. We
say that an algorithm f̃ is backward stable if for each x ∈ X,

\[
\tilde{f}(x) = f(\tilde{x}) \quad \text{for some } \tilde{x} \text{ with } \frac{\|\tilde{x} - x\|}{\|x\|} = O(\mathrm{eps}). \tag{3.15}
\]
‘A backward stable algorithm gives exactly the right answer to nearly the right question.’
3.5 An Example of a Backward Stable Algorithm: Back Substitution
In this section we will verify that back substitution (see Theorem 2.15) is backward stable.
Recall that an n × n matrix U = (u_{i,j}) is said to be upper triangular if u_{i,j} = 0 for i > j.
For example, for n = 4, an upper triangular matrix is of the form

\[
U = \begin{pmatrix}
u_{1,1} & u_{1,2} & u_{1,3} & u_{1,4} \\
0 & u_{2,2} & u_{2,3} & u_{2,4} \\
0 & 0 & u_{3,3} & u_{3,4} \\
0 & 0 & 0 & u_{4,4}
\end{pmatrix}.
\]
Similarly, an n × n matrix L = (ℓi,j) is said to be lower triangular if ℓi,j = 0 for i < j.
Suppose we wish to solve an upper triangular linear system of equations U x = b, where U ∈ R^{n×n} is a non-singular upper triangular matrix, and x, b ∈ R^n. Then we use back substitution, as discussed in Section 2.2. The vector x = (x_1, x_2, . . . , x_n)^T is computed via
\[
x_j = \frac{1}{u_{j,j}} \Bigl( b_j - \sum_{k=j+1}^{n} u_{j,k}\, x_k \Bigr),
\qquad j = n, n-1, \ldots, 2, 1. \tag{3.16}
\]
From Theorem 2.15, the total number of elementary operations of this algorithm is O(n²).
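For concreteness, formula (3.16) can be implemented in a few lines of MATLAB. The following sketch is our own, written in the style of the codes appearing later in these notes; it assumes that U is non-singular and upper triangular:

function [x] = back_substitution(U,b)
%
% solves U*x = b by back substitution, see (3.16)
% input:  U = real non-singular n by n upper triangular matrix
%         b = real n by 1 vector
% output: x = solution of U*x = b
%
n = size(U,1);
x = zeros(n,1);
for j = n:-1:1
    x(j) = ( b(j) - U(j,j+1:n) * x(j+1:n) ) / U(j,j);
end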
Theorem 3.13 (backward stability of back substitution)
Consider the equation U x = b, where b ∈ R^n and U ∈ R^{n×n} is a non-singular upper triangular matrix. The back substitution algorithm (3.16), when executed in a floating point number system, is backward stable in the sense that the computed solution x̃ ∈ R^n satisfies
\[
(U + \Delta U)\,\tilde{\mathbf{x}} = \mathbf{b}
\]
for some upper triangular matrix ∆U ∈ R^{n×n} with
\[
\frac{\|\Delta U\|}{\|U\|} = O(\mathrm{eps}).
\]
Specifically, for each i, j,
\[
\frac{|\Delta U_{i,j}|}{|U_{i,j}|} \le n \cdot \mathrm{eps} + O(\mathrm{eps}^2).
\]
Proof of Theorem 3.13. We only show this for the cases n = 1, 2.

Case n = 1: For this case back substitution consists of the single step
\[
\tilde{x}_1 = b_1 \oslash U_{1,1}.
\]
Using the fundamental property of floating point arithmetic (see Theorem 3.8) gives
\[
\tilde{x}_1 = \frac{b_1}{U_{1,1}}\,(1 + \epsilon_1), \qquad |\epsilon_1| \le \mathrm{eps}.
\]
This is now used to express the error as if it resulted from a perturbation of U, by writing
\[
\tilde{x}_1 = \frac{b_1}{U_{1,1}\,(1 + \epsilon_1')}, \qquad |\epsilon_1'| \le \mathrm{eps} + O(\mathrm{eps}^2),
\]
where ε′₁ = −ε₁/(1 + ε₁). From this we have that x̃₁ is the correct solution to the perturbed problem
\[
(U_{1,1} + \Delta U_{1,1})\,\tilde{x}_1 = b_1, \qquad\text{with } \Delta U_{1,1} = \epsilon_1'\,U_{1,1}.
\]
Hence,
\[
\frac{|\Delta U_{1,1}|}{|U_{1,1}|} \le \mathrm{eps} + O(\mathrm{eps}^2).
\]

Case n = 2: For the case of a 2 × 2 matrix U, we have the first step as before,
\[
\tilde{x}_2 = b_2 \oslash U_{2,2} = \frac{b_2}{U_{2,2}\,(1 + \epsilon_1')}, \qquad |\epsilon_1'| \le \mathrm{eps} + O(\mathrm{eps}^2). \tag{3.17}
\]
The second step is
\[
\tilde{x}_1 = \bigl( b_1 \ominus (\tilde{x}_2 \odot U_{1,2}) \bigr) \oslash U_{1,1}.
\]
Again, using the fundamental property of floating point systems (see Theorem 3.8), we have
\[
\tilde{x}_1 = \bigl( b_1 \ominus \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2) \bigr) \oslash U_{1,1}
= \bigl( b_1 - \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2) \bigr)(1+\epsilon_3) \oslash U_{1,1}
= \frac{\bigl( b_1 - \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2) \bigr)(1+\epsilon_3)(1+\epsilon_4)}{U_{1,1}},
\]
and we have |εᵢ| ≤ eps, i = 2, 3, 4. Using the same method as in the first step we may shift the ε terms into the denominator to obtain
\[
\tilde{x}_1 = \frac{b_1 - \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2)}{(1+\epsilon_3')(1+\epsilon_4')\,U_{1,1}},
\]
with |ε′₃|, |ε′₄| ≤ eps + O(eps²). Multiplying the denominator terms out gives
\[
\tilde{x}_1 = \frac{b_1 - \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2)}{(1 + 2\epsilon_5)\,U_{1,1}}, \tag{3.18}
\]
with |ε₅| ≤ eps + O(eps²).

We have shown that x̃ is the exact solution to a problem involving the perturbations (1 + 2ε₅), (1 + ε₂) and (1 + ε′₁) to the entries U_{1,1}, U_{1,2} and U_{2,2}, respectively. Rewriting (3.17) and (3.18), this may be stated in the form
\[
(U + \Delta U)\,\tilde{\mathbf{x}} = \mathbf{b},
\qquad\text{where}\qquad
\Delta U = \begin{pmatrix} 2\epsilon_5\,U_{1,1} & \epsilon_2\,U_{1,2}\\ 0 & \epsilon_1'\,U_{2,2} \end{pmatrix}.
\]
This gives ‖∆U‖/‖U‖ = O(eps). Hence, for 2 × 2 upper triangular matrices back substitution is backward stable.

The proof for n > 2 is similar. □
We want to relate the backward stability in Theorem 3.13 to the definition of backward stability given in the last section. In the terminology of Section 3.4, the problem is to solve the linear system U x = b, and, due to the floating point arithmetic, we solve instead (U + ∆U) x̃ = b. Thus, in the language of the previous section, the problem is
\[
f(\mathbf{y}) := U^{-1}\,\mathbf{y}
\]
and we solve instead the problem
\[
\tilde{f}(\mathbf{y}) := (U + \Delta U)^{-1}\,\mathbf{y}.
\]
Thus the first condition in (3.15) reads
\[
\tilde{\mathbf{x}} := \tilde{f}(\mathbf{b}) = (U + \Delta U)^{-1}\,\mathbf{b} = f(\tilde{\mathbf{b}}) = U^{-1}\,\tilde{\mathbf{b}}
\quad\Rightarrow\quad
(U + \Delta U)\,\tilde{\mathbf{x}} = \mathbf{b} \;\text{ and }\; U\,\tilde{\mathbf{x}} = \tilde{\mathbf{b}}.
\]
Thus b − b̃ = ∆U x̃ and hence, using again x̃ = (U + ∆U)⁻¹ b,
\[
\frac{\|\tilde{\mathbf{b}} - \mathbf{b}\|}{\|\mathbf{b}\|}
= \frac{\|\Delta U\,\tilde{\mathbf{x}}\|}{\|\mathbf{b}\|}
= \frac{\bigl\|\Delta U\,(U + \Delta U)^{-1}\,\mathbf{b}\bigr\|}{\|\mathbf{b}\|}
\le \frac{\bigl\|\Delta U\,(U + \Delta U)^{-1}\bigr\|\,\|\mathbf{b}\|}{\|\mathbf{b}\|}
= \bigl\|\Delta U\,(U + \Delta U)^{-1}\bigr\|
\le \frac{\|\Delta U\|}{\|U\|}\,\Bigl( \|U\|\,\bigl\|(U + \Delta U)^{-1}\bigr\| \Bigr).
\]
The second factor on the right-hand side can be considered as a constant that depends on the conditioning of U and of the perturbed matrix U + ∆U, and the first factor is the relative error ‖∆U‖/‖U‖ = O(eps). Hence ‖b̃ − b‖/‖b‖ = O(eps), which is exactly the backward stability condition (3.15) for the data b.
Chapter 4
Direct Methods for Linear Systems
In this chapter we discuss so-called direct methods for solving linear systems Ax = b. The
most basic method for solving linear systems directly is Gaussian elimination which you will
all have practiced in linear algebra when you solved small linear systems (see Section 4.1). A
formalized way of describing Gaussian elimination via left-multiplication with elementary lower
triangular matrices leads us to the LU factorization of a matrix A into A = L U , where L is
a normalized lower triangular matrix and U an upper triangular matrix (see Section 4.2). In
Section 4.3, we learn that Gaussian elimination and hence the LU factorization can be enhanced by interchanging rows (or columns) of A to stabilize the process; this is referred to as pivoting.
In Section 4.4, we prove that a Hermitian matrix A ∈ Cn×n (that is, A∗ = A) that is positive
definite has a so-called Cholesky factorization A = L L∗, where L is a lower triangular
matrix. In Section 4.5, we show that every matrix A ∈ Cn×n has a QR factorization, that is,
A = Q R where Q is a unitary matrix and R is an upper triangular matrix.
The advantage of any of these factorizations of A is that, once the matrices in the factorization are known, the linear system Ax = b can be solved more economically: Indeed, any of these factorizations of A is of the form A = B R, where R is an upper triangular matrix and B is either a lower triangular matrix or a unitary matrix. Then Ax = b becomes
\[
B\,R\,\mathbf{x} = \mathbf{b} \quad\Leftrightarrow\quad B\,\mathbf{y} = \mathbf{b} \;\text{ and }\; R\,\mathbf{x} = \mathbf{y}.
\]
Since B is either a lower triangular or a unitary matrix, B y = b can be solved inexpensively, and R x = y can subsequently be solved with back substitution. This is particularly useful if we want to solve several systems Ax = b_j, j = 1, 2, . . . , m, with the same matrix but different right-hand sides.
The methods discussed in this chapter are called direct methods because we solve (in some
clever way) the linear system directly. Direct methods are to be seen in contrast to so-called
iterative methods, where we compute a sequence of approximations of the solution. An example
of an iterative method that you will have seen in your second year classes is Banach’s fixed
point iteration. Iterative methods will be discussed in Chapter 5.
4.1 Standard Gaussian Elimination
The aim of Gaussian elimination is to reduce a linear system of equations Ax = b to an
equivalent upper triangular system U x = b′ (with an upper triangular matrix U and a new
right-hand side b′) by applying very simple linear transformations to it. Once we have the
system in the form U x = b′, where U is an upper triangular matrix, we then can use back
substitution (see Section 2.2) to solve U x = b′.
Suppose we have a linear system Ax = b, where A = (a_{i,j}) is in C^{n×n} or R^{n×n}, that is,
\[
\begin{aligned}
a_{1,1}\,x_1 + a_{1,2}\,x_2 + \ldots + a_{1,n}\,x_n &= b_1\\
a_{2,1}\,x_1 + a_{2,2}\,x_2 + \ldots + a_{2,n}\,x_n &= b_2\\
&\;\;\vdots\\
a_{n,1}\,x_1 + a_{n,2}\,x_2 + \ldots + a_{n,n}\,x_n &= b_n.
\end{aligned}
\]
Then, assuming that a_{1,1} ≠ 0, we can multiply the first row by a_{2,1}/a_{1,1} and subtract the resulting row from the second row, which cancels the coefficient of x_1 in that row. Then, we can multiply the first row by a_{3,1}/a_{1,1} and subtract it from the third row, which, again, cancels the coefficient of x_1 in that row. Continuing like this all the way down, we end up with an equivalent linear system of the form
\[
\begin{aligned}
a_{1,1}\,x_1 + a_{1,2}\,x_2 + \ldots + a_{1,n}\,x_n &= b_1\\
a^{(2)}_{2,2}\,x_2 + \ldots + a^{(2)}_{2,n}\,x_n &= b^{(2)}_2\\
&\;\;\vdots\\
a^{(2)}_{n,2}\,x_2 + \ldots + a^{(2)}_{n,n}\,x_n &= b^{(2)}_n.
\end{aligned}
\]
Assuming now that a^{(2)}_{2,2} ≠ 0, we can repeat the whole process using the second row to eliminate x_2 from rows 3 to n, and so on. Hence, after k − 1 steps the system can be written in matrix form as
\[
A^{(k)}\,\mathbf{x} =
\begin{pmatrix}
a_{1,1} & * & \cdots & * & \cdots & *\\
0 & a^{(2)}_{2,2} & \ddots & & & \vdots\\
\vdots & & \ddots & * & \cdots & *\\
0 & \cdots & 0 & a^{(k)}_{k,k} & \cdots & a^{(k)}_{k,n}\\
\vdots & & \vdots & \vdots & & \vdots\\
0 & \cdots & 0 & a^{(k)}_{n,k} & \cdots & a^{(k)}_{n,n}
\end{pmatrix}
\begin{pmatrix}
x_1\\ x_2\\ \vdots\\ x_k\\ \vdots\\ x_n
\end{pmatrix}
=
\begin{pmatrix}
b_1\\ b^{(2)}_2\\ \vdots\\ b^{(k)}_k\\ \vdots\\ b^{(k)}_n
\end{pmatrix}. \tag{4.1}
\]
If in each step a^{(k)}_{k,k} is different from zero, Gaussian elimination produces, after executing all n − 1 steps, an upper triangular matrix.
Example 4.1 (Gaussian elimination)
Use Gaussian elimination to transform the following linear system into triangular form:
\[
A\,\mathbf{x} = \mathbf{b}, \qquad\text{where}\quad
A = \begin{pmatrix} 1 & 0 & 2\\ 2 & 1 & 3\\ 1 & -1 & 0 \end{pmatrix},
\quad
\mathbf{b} = \begin{pmatrix} 3\\ 6\\ 9 \end{pmatrix}.
\]
Solution: We write the linear system as an augmented matrix, and then we multiply the first row by 2 and subtract it from the second row. Subsequently we subtract the first row from the third row:
\[
\left(\begin{array}{ccc|c} 1 & 0 & 2 & 3\\ 2 & 1 & 3 & 6\\ 1 & -1 & 0 & 9 \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{ccc|c} 1 & 0 & 2 & 3\\ 0 & 1 & -1 & 0\\ 1 & -1 & 0 & 9 \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{ccc|c} 1 & 0 & 2 & 3\\ 0 & 1 & -1 & 0\\ 0 & -1 & -2 & 6 \end{array}\right).
\]
In the second step of the Gaussian elimination, we add the new second row to the new third row, and obtain
\[
\left(\begin{array}{ccc|c} 1 & 0 & 2 & 3\\ 0 & 1 & -1 & 0\\ 0 & 0 & -3 & 6 \end{array}\right)
\;\Leftrightarrow\;
U\,\mathbf{x} = \mathbf{b}' \quad\text{with}\quad
U = \begin{pmatrix} 1 & 0 & 2\\ 0 & 1 & -1\\ 0 & 0 & -3 \end{pmatrix}
\quad\text{and}\quad
\mathbf{b}' = \begin{pmatrix} 3\\ 0\\ 6 \end{pmatrix},
\]
which is an equivalent linear system with an upper triangular matrix. This is the linear system which was solved in Example 2.16 with back substitution. □
For theoretical (not numerical) reasons, it is useful to rewrite this process using matrices.
Definition 4.2 (elementary lower triangular matrix)
For 1 ≤ i ≤ n, let x ∈ C^n be such that e_j^T x = 0 for j ≤ i, that is, x is of the form
\[
\mathbf{x} = (0, 0, \ldots, 0, x_{i+1}, \ldots, x_n)^T.
\]
The elementary lower triangular matrix L_i(x) is given by
\[
L_i(\mathbf{x}) := I - \mathbf{x}\,\mathbf{e}_i^T =
\begin{pmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 & & & \\
& & -x_{i+1} & 1 & & \\
& & \vdots & & \ddots & \\
& & -x_n & & & 1
\end{pmatrix},
\]
where the entries −x_{i+1}, . . . , −x_n stand in column i below the diagonal and all other off-diagonal entries are zero. Note that in L_i(x) the vector x has been subtracted from the ith column of the identity matrix.
74 4.1. Standard Gaussian Elimination
Example 4.3 (elementary lower triangular matrix)
For x = (0, 0, 3, −2)^T we have e_1^T x = e_2^T x = 0, and the elementary lower triangular matrices L_1(x) and L_2(x) in R^{4×4} are given by
\[
L_1(\mathbf{x}) = I - \mathbf{x}\,\mathbf{e}_1^T =
\begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ -3 & 0 & 1 & 0\\ 2 & 0 & 0 & 1 \end{pmatrix},
\qquad
L_2(\mathbf{x}) = I - \mathbf{x}\,\mathbf{e}_2^T =
\begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & -3 & 1 & 0\\ 0 & 2 & 0 & 1 \end{pmatrix}. \qquad\square
\]
You should make sure that the explicit formula for the elementary lower triangular matrix is
clear to you. You should also verify the properties of the matrix Li(x) given in the lemma
below for yourself.
Lemma 4.4 (properties of elementary lower triangular matrices)
An elementary lower triangular n × n matrix L_i(x), where x = (0, . . . , 0, x_{i+1}, . . . , x_n)^T, has the following properties:

(i) det(L_i(x)) = 1.

(ii) (L_i(x))^{-1} = L_i(−x).

(iii) Multiplying a matrix A = (a_{i,j}) in C^{n×n} (or in R^{n×n}) with L_i(x) from the left leaves the first i rows unchanged and, for each row j with j > i, subtracts the row vector x_j (a_{i,1}, a_{i,2}, . . . , a_{i,n}) from row j of A.

We note that statement (ii) in Lemma 4.4 is intuitively clear from the interpretation given in (iii): multiplication of A with L_i(x) from the left subtracts for j > i the row vector x_j (a_{i,1}, a_{i,2}, . . . , a_{i,n}) from the jth row of A, and multiplication of B with L_i(−x) from the left adds for j > i the row vector x_j (b_{i,1}, b_{i,2}, . . . , b_{i,n}) to the jth row of B; thus L_i(−x) L_i(x) A gives back the matrix A. (Note that this is not a proper proof, but it is very helpful for your understanding!)

Proof of Lemma 4.4. This proof is very straightforward and is left as an exercise. □
From statement (iii) in Lemma 4.4 it is clear that we can describe each step in the Gaussian elimination by multiplication with an elementary lower triangular matrix: Indeed, with suitable elementary lower triangular matrices L_1(m_1), L_2(m_2), . . . , L_{n−1}(m_{n−1}), the result of Gaussian elimination for the linear system Ax = b can be written as U x = b′, where
\[
U = L_{n-1}(\mathbf{m}_{n-1}) \cdots L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,A,
\qquad
\mathbf{b}' = L_{n-1}(\mathbf{m}_{n-1}) \cdots L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,\mathbf{b},
\]
and where U is now upper triangular.
Example 4.5 (Gaussian elimination with elementary lower triangular matrices)
Consider the linear system from Example 4.1,
\[
A\,\mathbf{x} = \mathbf{b}, \qquad\text{where}\quad
A = \begin{pmatrix} 1 & 0 & 2\\ 2 & 1 & 3\\ 1 & -1 & 0 \end{pmatrix},
\quad
\mathbf{b} = \begin{pmatrix} 3\\ 6\\ 9 \end{pmatrix}.
\]
For executing the steps of the Gaussian elimination in Example 4.1, we have used the elementary lower triangular matrices
\[
L_1(\mathbf{m}_1) = I - \mathbf{m}_1\,\mathbf{e}_1^T =
\begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ -1 & 0 & 1 \end{pmatrix}
\qquad\text{and}\qquad
L_2(\mathbf{m}_2) = I - \mathbf{m}_2\,\mathbf{e}_2^T =
\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1 \end{pmatrix},
\]
where m_1 = (0, 2, 1)^T and m_2 = (0, 0, −1)^T. Thus we have the new system U x = b′ with
\[
U = L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,A =
\begin{pmatrix} 1 & 0 & 2\\ 0 & 1 & -1\\ 0 & 0 & -3 \end{pmatrix}
\qquad\text{and}\qquad
\mathbf{b}' = L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,\mathbf{b} =
\begin{pmatrix} 3\\ 0\\ 6 \end{pmatrix}. \qquad\square
\]
Exercise 54 Solve the following linear system by hand with Gaussian elimination:
\[
A\,\mathbf{x} = \mathbf{b}, \qquad
A = \begin{pmatrix} 1 & 2 & 1\\ -1 & 1 & 2\\ 2 & 2 & 4 \end{pmatrix},
\quad
\mathbf{b} = \begin{pmatrix} 1\\ 8\\ 4 \end{pmatrix}. \tag{4.2}
\]
Exercise 55 Write down three 4 × 4 elementary lower triangular matrices, and explain for each matrix what left-multiplication with this matrix does. Verify the properties (i) and (ii) from Lemma 4.4 explicitly for your three elementary lower triangular matrices.

Exercise 56 Find the elementary lower triangular matrices L_1(m_1) and L_2(m_2) which describe the Gaussian elimination to bring the linear system (4.2) into the form U x = b′ with an upper triangular matrix U.
Exercise 57 Prove Lemma 4.4.
4.2 The LU Factorization
On a computer, Gaussian elimination is realized by programming the corresponding operations directly. This leads to an O(n³) complexity. In addition, we have an O(n²) complexity for solving the final linear system by back substitution.
However, there is another way to look at the process. From Lemma 4.4, we know that multiplication with a suitable elementary lower triangular matrix realizes exactly one step of the Gaussian elimination process. Therefore, we can construct a sequence of n − 1 elementary lower triangular matrices L_j = L_j(m_j), j = 1, 2, . . . , n − 1, such that
\[
L_{n-1}\,L_{n-2} \cdots L_2\,L_1\,A = U,
\]
where U is an upper triangular matrix. Since elementary lower triangular matrices are invertible, we have
\[
A = L_1^{-1}\,L_2^{-1} \cdots L_{n-2}^{-1}\,L_{n-1}^{-1}\,U.
\]
Using L_i^{-1} = (L_i(m_i))^{-1} = L_i(−m_i) (see Lemma 4.4 (ii)) and e_j^T m_k = 0 for j ≤ k, we have
\[
\begin{aligned}
A &= L_1(-\mathbf{m}_1)\,L_2(-\mathbf{m}_2) \cdots L_{n-2}(-\mathbf{m}_{n-2})\,L_{n-1}(-\mathbf{m}_{n-1})\,U\\
&= \bigl(I + \mathbf{m}_1 \mathbf{e}_1^T\bigr) \bigl(I + \mathbf{m}_2 \mathbf{e}_2^T\bigr) \cdots \bigl(I + \mathbf{m}_{n-2} \mathbf{e}_{n-2}^T\bigr) \bigl(I + \mathbf{m}_{n-1} \mathbf{e}_{n-1}^T\bigr)\,U\\
&= \bigl(I + \mathbf{m}_1 \mathbf{e}_1^T + \mathbf{m}_2 \mathbf{e}_2^T + \ldots + \mathbf{m}_{n-2} \mathbf{e}_{n-2}^T + \mathbf{m}_{n-1} \mathbf{e}_{n-1}^T\bigr)\,U\\
&= L\,U,
\end{aligned} \tag{4.3}
\]
where the matrix
\[
L := I + \mathbf{m}_1 \mathbf{e}_1^T + \mathbf{m}_2 \mathbf{e}_2^T + \ldots + \mathbf{m}_{n-2} \mathbf{e}_{n-2}^T + \mathbf{m}_{n-1} \mathbf{e}_{n-1}^T \tag{4.4}
\]
is lower triangular with all diagonal elements equal to one. In the second-last step of (4.3), we have used
\[
\bigl(I + \mathbf{m}_1 \mathbf{e}_1^T\bigr) \bigl(I + \mathbf{m}_2 \mathbf{e}_2^T\bigr)
= I + \mathbf{m}_1 \mathbf{e}_1^T + \mathbf{m}_2 \mathbf{e}_2^T + \mathbf{m}_1 \mathbf{e}_1^T \mathbf{m}_2 \mathbf{e}_2^T
= I + \mathbf{m}_1 \mathbf{e}_1^T + \mathbf{m}_2 \mathbf{e}_2^T,
\]
since e_1^T m_2 = 0 by assumption, and analogously for the remaining matrix multiplications.

Inspection of (4.4) shows that L has the following form: if m_j = (0, 0, . . . , 0, m^{(j)}_{j+1}, . . . , m^{(j)}_n)^T (note that the first j entries of m_j are zero by the definition of the elementary lower triangular matrix L_j(m_j)), then the jth column vector of L is the vector (0, 0, . . . , 0, 1, m^{(j)}_{j+1}, . . . , m^{(j)}_n)^T.
Definition 4.6 (LU factorization/decomposition)
Let A ∈ C^{n×n} (or A ∈ R^{n×n}). The LU factorization (or LU decomposition) of a matrix A is the factorization A = L U into the product of a normalized lower triangular matrix L and an upper triangular matrix U. (A normalized lower triangular matrix is a lower triangular matrix whose diagonal entries are all equal to one.)
Theorem 4.7 (LU factorization)
Let A = (a_{i,j}) be an n × n matrix in C^{n×n} (or R^{n×n}), and let A_p = (a_{i,j})_{1≤i,j≤p} be the pth upper principal submatrix. If det(A_p) ≠ 0 for p = 1, 2, . . . , n, then A has an LU factorization A = L U, with a normalized lower triangular matrix L and an invertible upper triangular matrix U.
Proof of Theorem 4.7. From the considerations above, we have seen that Gaussian elimination, if it can be successfully applied, leads to an LU factorization. Thus the proof is given by induction over the steps of the Gaussian elimination.

Initial step: For k = 1 we have det A_1 = det(a_{1,1}) = a_{1,1} ≠ 0, hence the first step of the Gaussian elimination is possible.

Induction step: For k → k + 1 assume our matrix is in the form (4.1). Then we have, with suitable elementary lower triangular matrices L_1, L_2, . . . , L_{k−1}, L_k,
\[
A^{(k)} = L_k\,L_{k-1} \cdots L_2\,L_1\,A.
\]
From the fact that multiplication with L_j from the left subtracts multiples of the jth row from rows with row number > j, and from the properties of the determinant, the following is easily seen: multiplication of A with elementary lower triangular matrices from the left does not alter the value of the determinant of the pth principal submatrix. Hence
\[
0 \ne \det(A_k) = \det\bigl((L_k\,L_{k-1} \cdots L_2\,L_1\,A)_k\bigr) = \det\bigl((A^{(k)})_k\bigr) = a_{1,1}\,a^{(2)}_{2,2} \cdots a^{(k)}_{k,k},
\]
where the last step follows from (4.1). So, in particular, a^{(k)}_{k,k} ≠ 0. Hence, the next step of the Gaussian elimination is possible as well, and after n − 1 steps we have the LU factorization of A as derived at the beginning of this section. □
Theorem 4.8 (LU factorization of invertible matrix is unique)
If an invertible matrix A has an LU factorization, then its LU factorization is unique.

Proof of Theorem 4.8. Suppose A = L_1 U_1 and A = L_2 U_2 are two LU factorizations of a non-singular matrix A. As A is non-singular, we have 0 ≠ det(A) = det(L_i) det(U_i) for i = 1, 2. Thus we know that det(U_i) ≠ 0 and det(L_i) ≠ 0 for i = 1, 2, and hence L_i and U_i, i = 1, 2, are invertible. Therefore,
\[
L_1\,U_1 = L_2\,U_2 = A \quad\Rightarrow\quad L_2^{-1}\,L_1 = U_2\,U_1^{-1}.
\]
Furthermore, L_2^{-1} L_1 is a product of normalized lower triangular matrices; hence, it is a normalized lower triangular matrix. However, U_2 U_1^{-1} is upper triangular, since U_2 and U_1^{-1} are upper triangular. The only possibility is that both L_2^{-1} L_1 and U_2 U_1^{-1} are diagonal matrices and equal to the identity matrix (since L_2^{-1} L_1 is normalized, that is, has entries one on the diagonal). Thus
\[
L_2^{-1}\,L_1 = U_2\,U_1^{-1} = I \quad\Rightarrow\quad L_1 = L_2 \;\text{ and }\; U_2 = U_1,
\]
and we see that the LU factorization is unique. □
Example 4.9 (LU factorization)
For the matrix A from Examples 4.1 and 4.5,
\[
A = \begin{pmatrix} 1 & 0 & 2\\ 2 & 1 & 3\\ 1 & -1 & 0 \end{pmatrix},
\]
we found in Example 4.5 that
\[
L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,A = U \qquad\text{with}\quad
U := \begin{pmatrix} 1 & 0 & 2\\ 0 & 1 & -1\\ 0 & 0 & -3 \end{pmatrix}, \tag{4.5}
\]
with the elementary lower triangular matrices
\[
L_1(\mathbf{m}_1) = \begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ -1 & 0 & 1 \end{pmatrix}
\qquad\text{and}\qquad
L_2(\mathbf{m}_2) = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1 \end{pmatrix},
\]
where m_1 = (0, 2, 1)^T and m_2 = (0, 0, −1)^T. From (4.5),
\[
A = \bigl(L_1(\mathbf{m}_1)\bigr)^{-1} \bigl(L_2(\mathbf{m}_2)\bigr)^{-1} U = L\,U,
\qquad\text{with}\quad
L := \bigl(L_1(\mathbf{m}_1)\bigr)^{-1} \bigl(L_2(\mathbf{m}_2)\bigr)^{-1}.
\]
We compute L as follows, with the help of Lemma 4.4:
\[
L = \bigl(L_1(\mathbf{m}_1)\bigr)^{-1} \bigl(L_2(\mathbf{m}_2)\bigr)^{-1} = L_1(-\mathbf{m}_1)\,L_2(-\mathbf{m}_2)
= \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -1 & 1 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 1 & -1 & 1 \end{pmatrix}.
\]
Thus the LU factorization of A reads
\[
\begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 1 & -1 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 2\\ 0 & 1 & -1\\ 0 & 0 & -3 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 2\\ 2 & 1 & 3\\ 1 & -1 & 0 \end{pmatrix}. \qquad\square
\]
The LU factorization of a real square matrix via standard Gaussian elimination can be implemented with the following MATLAB code:
function [L,U] = LU_factorization_1(A)
%
% algorithm computes LU factorization (without pivoting),
% if it exists for matrix A
% input: A = real n by n matrix, that allows LU factorization without pivoting
% output: L = real n by n normalized lower triangular matrix
% U = real n by n upper triangular matrix
%
n = size(A,1);
U = A;
B = zeros(n,n);
L = eye(n,n);
for i = 1:n-1
L(i+1:n,i) = U(i+1:n,i)/U(i,i);
B = U;
U(i+1:n,i:n) = B(i+1:n,i:n) - (L(i+1:n,i)) * B(i,i:n);
end
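For instance, applied to the matrix from Examples 4.1 and 4.9, the function should reproduce the factors computed there by hand (a usage sketch; the expected output is given in the comment):

A = [1 0 2; 2 1 3; 1 -1 0];
[L,U] = LU_factorization_1(A)
% expected: L = [1 0 0; 2 1 0; 1 -1 1] and U = [1 0 2; 0 1 -1; 0 0 -3]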
Exercise 58 For the matrix A from Exercises 54 and 56,
\[
A = \begin{pmatrix} 1 & 2 & 1\\ -1 & 1 & 2\\ 2 & 2 & 4 \end{pmatrix},
\]
find the unique LU factorization of A.
Remark 4.10 (computing the determinant with LU factorization)
A numerically reasonable way of calculating the determinant of a matrix A is to first compute the LU factorization of A and then use the fact that
\[
\det(A) = \det(L\,U) = \det(L)\,\det(U) = \det(U) = u_{1,1}\,u_{2,2} \cdots u_{n,n},
\]
where we have used det(L) = 1, since L is a normalized lower triangular matrix.
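A minimal MATLAB sketch of Remark 4.10 (our own, using the function LU_factorization_1 from above and the matrix from Example 4.9):

A = [1 0 2; 2 1 3; 1 -1 0];
[L,U] = LU_factorization_1(A);
det_A = prod(diag(U))    % = 1 * 1 * (-3) = -3, the determinant of A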
A possible way of deriving an algorithm for calculating the LU factorization starts with the component-wise formulation of A = L U, that is,
\[
a_{i,j} = \sum_{k=1}^{n} \ell_{i,k}\,u_{k,j}, \qquad 1 \le i, j \le n.
\]
Since U = (u_{i,j}) is upper triangular and L = (ℓ_{i,j}) is lower triangular, the upper limit of the sum is actually given by min{i, j}. Taking the two possible cases separately gives
\[
a_{i,j} = \sum_{k=1}^{j} \ell_{i,k}\,u_{k,j}, \qquad 1 \le j \le i \le n, \tag{4.6}
\]
\[
a_{i,j} = \sum_{k=1}^{i} \ell_{i,k}\,u_{k,j}, \qquad 1 \le i \le j \le n. \tag{4.7}
\]
For our convenience it is useful to swap the indices i and j in (4.6), giving
\[
a_{j,i} = \sum_{k=1}^{i} \ell_{j,k}\,u_{k,i}, \qquad 1 \le i \le j \le n. \tag{4.8}
\]
Rearranging these equations, and using the fact that ℓ_{i,i} = 1 for all 1 ≤ i ≤ n, (4.8) implies (4.9) below and (4.7) implies (4.10) below:
\[
\ell_{j,i} = \frac{1}{u_{i,i}} \Bigl( a_{j,i} - \sum_{k=1}^{i-1} \ell_{j,k}\,u_{k,i} \Bigr), \qquad 1 \le i \le n, \; i \le j \le n, \tag{4.9}
\]
\[
u_{i,j} = a_{i,j} - \sum_{k=1}^{i-1} \ell_{i,k}\,u_{k,j}, \qquad 1 \le i \le n, \; i \le j \le n. \tag{4.10}
\]
To see how the algorithm works, and in which order the equations are solved, let us work out all 9 equations for the case of a 3 × 3 matrix. By assumption, we have ℓ_{1,1} = ℓ_{2,2} = ℓ_{3,3} = 1. From (4.10), we find for i = 1 the first row of U:
\[
u_{1,1} = a_{1,1}, \qquad u_{1,2} = a_{1,2}, \qquad u_{1,3} = a_{1,3}.
\]
For i = 1, we find from (4.9) the first column of L:
\[
\ell_{1,1} = \frac{a_{1,1}}{u_{1,1}} = 1, \qquad \ell_{2,1} = \frac{a_{2,1}}{u_{1,1}}, \qquad \ell_{3,1} = \frac{a_{3,1}}{u_{1,1}}.
\]
For i = 2, we find from (4.10) the second row of U:
\[
u_{2,2} = a_{2,2} - \ell_{2,1}\,u_{1,2}, \qquad u_{2,3} = a_{2,3} - \ell_{2,1}\,u_{1,3}.
\]
For i = 2, we find from (4.9) the second column of L:
\[
\ell_{2,2} = \frac{1}{u_{2,2}} \bigl( a_{2,2} - \ell_{2,1}\,u_{1,2} \bigr) = 1, \qquad
\ell_{3,2} = \frac{1}{u_{2,2}} \bigl( a_{3,2} - \ell_{3,1}\,u_{1,2} \bigr).
\]
For i = 3, we find from (4.10) the third row of U:
\[
u_{3,3} = a_{3,3} - \ell_{3,1}\,u_{1,3} - \ell_{3,2}\,u_{2,3}.
\]
For i = 3, we find from (4.9) the third column of L:
\[
\ell_{3,3} = \frac{1}{u_{3,3}} \bigl( a_{3,3} - \ell_{3,1}\,u_{1,3} - \ell_{3,2}\,u_{2,3} \bigr) = 1.
\]
Note that each step uses only entries of L and U that have been computed in previous steps; this shows that we have arranged the equations (4.9) and (4.10) in the right order. We have included the diagonal elements of L to show more clearly how the algorithm works, but in any implementation we would of course not compute these, since we know they have the value one.

We have seen that the algorithm computes the rows of U and the columns of L in the following order:

row 1 of U, column 1 of L, row 2 of U, column 2 of L, . . . , row n−1 of U, column n−1 of L, row n of U,

where we need not compute column n of L, since this contains only the diagonal entry ℓ_{n,n} = 1. In pseudo-algorithmic form this algorithm can be formulated as follows for real matrices:
Algorithm 1 LU Factorization
1: input: real n × n matrix A = (a_{i,j}) that has an LU factorization (without pivoting)
2: initialize L = (ℓ_{i,j}) = I ∈ R^{n×n}, U = (u_{i,j}) = 0 ∈ R^{n×n}
3: for i = 1, 2, . . . , n do
4:   for j = i, i + 1, . . . , n do
5:     u_{i,j} = a_{i,j} − Σ_{k=1}^{i−1} ℓ_{i,k} u_{k,j}
6:   end for
7:   for j = i + 1, i + 2, . . . , n do
8:     ℓ_{j,i} = (1/u_{i,i}) ( a_{j,i} − Σ_{k=1}^{i−1} ℓ_{j,k} u_{k,i} )
9:   end for
10: end for
The algorithm for computing the LU factorization for real matrices in this way can be imple-
mented with the following MATLAB code:
function [L,U] = LU_factorization_2(A)
%
% algorithm for computing the LU factorization directly from A = L*U
% input: A = real n by n matrix, that has LU factorization without pivoting
% output: L = real n by n normalized lower triangular matrix
% U = real n by n upper triangular matrix
%
n = size(A,1);
L = eye(n,n);
U = zeros(n,n);
for i=1:n
U(i,i:n) = A(i,i:n) - L(i,1:i-1) * U(1:i-1,i:n);
L(i+1:n,i) = ( A(i+1:n,i) - L(i+1:n,1:i-1) * U(1:i-1,i) ) / U(i,i);
end
The complexity of this procedure is O(n³). Indeed, consider the LU factorization in its standard form (not the version above): in the jth step we need to perform (n − j)(n − j + 1) elementary operations, where we now for simplicity consider an elementary operation to be one multiplication/division together with one addition/subtraction. Hence we have
\[
\sum_{j=1}^{n-1} (n-j)(n-j+1)
= \sum_{j=1}^{n-1} \bigl( n^2 + n - (2n+1)\,j + j^2 \bigr)
= n^2 (n-1) + n(n-1) - (2n+1)\,\frac{n(n-1)}{2} + \frac{1}{6}\,(n-1)\,n\,(2n-1)
= \frac{1}{3}\,n\,(n^2-1) = \frac{1}{3}\,n^3 - \frac{1}{3}\,n.
\]
Thus we need to execute O(n³) elementary operations, each consisting of one multiplication plus one addition.
Remark 4.11 (how to solve a linear system with the LU factorization)
If A = L U, then the linear system Ax = L U x = b can be solved in two steps: first, we solve L y = b by forward substitution, and then U x = y by back substitution. Both are possible in O(n²) time, once the LU factorization of A is known.
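A sketch of Remark 4.11 in MATLAB (our own code; it assumes that the factors L and U of A have already been computed, for example with LU_factorization_2):

function [x] = solve_with_LU(L,U,b)
%
% solves A*x = b, given the factorization A = L*U (see Remark 4.11)
% first forward substitution for L*y = b, then back substitution for U*x = y
%
n = size(L,1);
y = zeros(n,1);
for i = 1:n
    y(i) = ( b(i) - L(i,1:i-1) * y(1:i-1) ) / L(i,i);
end
x = zeros(n,1);
for j = n:-1:1
    x(j) = ( y(j) - U(j,j+1:n) * x(j+1:n) ) / U(j,j);
end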
Exercise 59 A matrix A ∈ C^{n×n} is called strictly row diagonally dominant if
\[
\sum_{k=1,\,k \ne i}^{n} |a_{i,k}| < |a_{i,i}| \qquad\text{for all } i = 1, 2, \ldots, n.
\]
Show that a strictly row diagonally dominant matrix is invertible and possesses an LU decomposition.
Exercise 60 Let A ∈ R^{n×n} be a tridiagonal matrix, that is, a matrix of the form
\[
A = \begin{pmatrix}
a_1 & c_1 & & & 0\\
b_2 & a_2 & c_2 & & \\
& \ddots & \ddots & \ddots & \\
& & \ddots & \ddots & c_{n-1}\\
0 & & & b_n & a_n
\end{pmatrix},
\]
with
\[
|a_1| > |c_1| > 0, \qquad
|a_i| \ge |b_i| + |c_i|, \quad b_i, c_i \ne 0, \quad 2 \le i \le n-1, \qquad
|a_n| \ge |b_n| > 0.
\]
Show that A is invertible and has an LU decomposition of the form
\[
A = \begin{pmatrix}
1 & & & 0\\
\ell_2 & 1 & & \\
& \ddots & \ddots & \\
0 & & \ell_n & 1
\end{pmatrix}
\begin{pmatrix}
r_1 & c_1 & & 0\\
& r_2 & \ddots & \\
& & \ddots & c_{n-1}\\
0 & & & r_n
\end{pmatrix},
\]
where the vectors ℓ = (ℓ_2, ℓ_3, . . . , ℓ_n)^T ∈ R^{n−1} and r = (r_1, r_2, . . . , r_n)^T ∈ R^n can be computed via r_1 = a_1 and ℓ_i = b_i/r_{i−1} and r_i = a_i − ℓ_i c_{i−1} for 2 ≤ i ≤ n.
Exercise 61 Use Gaussian elimination to find the LU factorization of the following matrix:
\[
A = \begin{pmatrix}
2 & -2 & 0 & 0\\
2 & -4 & 2 & 0\\
0 & -2 & 4 & -2\\
0 & 0 & 2 & -4
\end{pmatrix}.
\]
Why is the LU factorization unique?
4.3 Pivoting
Consider the matrix
\[
A = \begin{pmatrix} 0 & 1\\ 1 & 1 \end{pmatrix}.
\]
This matrix is non-singular and well-conditioned, with κ₂(A) = (1 + √5)²/4 ≈ 2.6 in the 2-norm. But Gaussian elimination fails at the first step, since a_{1,1} = 0. Note that a simple interchange of the rows gives an upper triangular matrix; furthermore, a simple interchange of the columns leads to a lower triangular matrix. The first corresponds to rearranging the equations in the linear system, the latter corresponds to rearranging the unknowns.
Furthermore, consider the slightly perturbed matrix
\[
A = \begin{pmatrix} 10^{-20} & 1\\ 1 & 1 \end{pmatrix}. \tag{4.11}
\]
Using Gaussian elimination on this matrix gives the LU factorization with the matrices
\[
L = \begin{pmatrix} 1 & 0\\ 10^{20} & 1 \end{pmatrix},
\qquad
U = \begin{pmatrix} 10^{-20} & 1\\ 0 & 1 - 10^{20} \end{pmatrix}.
\]
However, suppose we are using floating point arithmetic with eps ≈ 10⁻¹⁶. Then the number 1 − 10²⁰ will not be represented exactly, but by its nearest floating point number; let us say that this is the number −10²⁰. Using this number produces the factorization
\[
\tilde{L} = \begin{pmatrix} 1 & 0\\ 10^{20} & 1 \end{pmatrix},
\qquad
\tilde{U} = \begin{pmatrix} 10^{-20} & 1\\ 0 & -10^{20} \end{pmatrix}.
\]
The matrix Ũ is relatively close to the correct U with respect to ‖U‖. But on calculating the product we obtain
\[
\tilde{L}\,\tilde{U} = \begin{pmatrix} 10^{-20} & 1\\ 1 & 0 \end{pmatrix}.
\]
We see that this matrix is not close to A, since a_{2,2} = 1 has been replaced by (L̃ Ũ)_{2,2} = 0. Suppose we wish to solve Ax = b with A given by (4.11) and b = (1, 0)^T using floating point arithmetic. Then we would obtain x̃ = (0, 1)^T, while the true answer is approximately (−1, 1)^T. The explanation for this is that LU factorization without pivoting is not backward stable.
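This failure can be reproduced directly in MATLAB; the following sketch of the experiment above is our own (MATLAB's backslash operator, which uses partial pivoting, is shown for comparison):

% LU factorization of the matrix (4.11) without pivoting, in double precision
A = [1e-20 1; 1 1];  b = [1; 0];
L = [1 0; 1e20 1];
U = [1e-20 1; 0 1-1e20];     % the entry 1-1e20 rounds to -1e20
y = L \ b;                   % forward substitution
x_nopiv = U \ y              % gives approximately ( 0, 1)^T
x_piv   = A \ b              % gives approximately (-1, 1)^T, the true solution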
To avoid such problems we modify the Gaussian elimination such that rows and columns
of the matrix may be interchanged during the elimination process. The exchange of
rows or columns during the Gaussian elimination process is referred to as pivoting. Here is
how it works:
Assume we have performed k − 1 steps and now have a matrix of the form
\[
A^{(k)} =
\begin{pmatrix}
a_{1,1} & * & \cdots & \cdots & \cdots & *\\
0 & a^{(2)}_{2,2} & \ddots & & & \vdots\\
\vdots & \ddots & \ddots & * & \cdots & *\\
\vdots & & 0 & a^{(k)}_{k,k} & \cdots & a^{(k)}_{k,n}\\
\vdots & & \vdots & \vdots & & \vdots\\
0 & \cdots & 0 & a^{(k)}_{n,k} & \cdots & a^{(k)}_{n,n}
\end{pmatrix}. \tag{4.12}
\]
Now, if det A = det A^{(k)} ≠ 0, not all the entries in the column vector (a^{(k)}_{k,k}, a^{(k)}_{k+1,k}, . . . , a^{(k)}_{n,k})^T can be zero. Hence, we can pick one non-zero entry and swap the corresponding row with the kth row. This is usually referred to as partial (row) pivoting. For numerical reasons it is reasonable to pick a row ℓ with
\[
\bigl| a^{(k)}_{\ell,k} \bigr| = \max_{k \le i \le n} \bigl| a^{(k)}_{i,k} \bigr|.
\]
In a similar way, partial column pivoting can be defined. Finally, we could use total pivoting, that is, we could pick indices ℓ, m ∈ {k, k + 1, . . . , n} such that
\[
\bigl| a^{(k)}_{\ell,m} \bigr| = \max_{k \le i, j \le n} \bigl| a^{(k)}_{i,j} \bigr|.
\]
Usually, partial row pivoting is implemented.
To understand the process of pivoting in an abstract form, it is useful to introduce permutation
matrices.
Definition 4.12 (elementary permutation matrix)
An n × n elementary permutation matrix is any matrix of the form
\[
P_{i,j} = I - (\mathbf{e}_i - \mathbf{e}_j)(\mathbf{e}_i - \mathbf{e}_j)^T.
\]
That is, P_{i,j} is an identity matrix with rows i and j (or, equivalently, columns i and j) interchanged. A permutation matrix is a (finite) product of elementary permutation matrices.
Example 4.13 (elementary permutation matrix)
The 5 × 5 elementary permutation matrix P_{2,5} is given by
\[
P_{2,5} =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1\\
0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 0\\
0 & 1 & 0 & 0 & 0
\end{pmatrix}.
\]
The properties of elementary permutation matrices are stated in the next lemma. They also
explain the name elementary permutation matrix.
Lemma 4.14 (properties of elementary permutation matrices)
An elementary permutation matrix has the following properties:

(i) P_{i,j}^{-1} = P_{i,j} = P_{j,i} = P_{i,j}^T. In particular, P_{i,j} is an orthogonal matrix.

(ii) det(P_{i,j}) = −1 for i ≠ j, and P_{i,i} = I.

(iii) Pre-multiplication of a matrix A by P_{i,j} interchanges rows i and j. Similarly, post-multiplication interchanges columns i and j.
Proof of Lemma 4.14. The proof is fairly straightforward and is left as an exercise. □
Exercise 62 Prove Lemma 4.14.
Now we can describe Gaussian elimination with row pivoting by performing in each step first
a multiplication from the left with an elementary permutation matrix (for the pivoting) and
subsequently a multiplication from the left with an elementary lower triangular matrix (for
performing the step in the Gaussian elimination). This leads to the following theorem.
Theorem 4.15 (Gaussian elimination with row pivoting)
Let A be an n × n matrix. There exist elementary lower triangular matrices L^{(i)} = L_i(m_i) and elementary permutation matrices P^{(i)} = P_{r_i,i} with r_i ≥ i, i = 1, 2, . . . , n − 1, such that the matrix
\[
U := L^{(n-1)}\,P^{(n-1)}\,L^{(n-2)}\,P^{(n-2)} \cdots L^{(2)}\,P^{(2)}\,L^{(1)}\,P^{(1)}\,A \tag{4.13}
\]
is an upper triangular matrix.

Proof of Theorem 4.15. This follows from the previous considerations for Gaussian elimination. Note that, if the matrix is singular, we might come into a situation (4.12) where the column vector (a^{(k)}_{k,k}, . . . , a^{(k)}_{n,k})^T is the zero vector. In such a situation, we pick the corresponding elementary lower triangular matrix and permutation matrix as the identity matrix and proceed with the next column. □
The permutation matrices in (4.13) can be 'moved' to the right in the following sense. The elementary lower triangular matrix L^{(j)} is of the form L^{(j)} = I − m^{(j)} e_j^T, with m^{(j)} = (0, . . . , 0, m_{j+1}, . . . , m_n)^T, that is, the first j components of m^{(j)} are zero. If P is an elementary permutation matrix which acts only on rows (and columns) with index > j, then we have, using P² = I since P = P^{-1} = P^T,
\[
P\,L^{(j)}\,P = P \bigl( I - \mathbf{m}^{(j)} \mathbf{e}_j^T \bigr) P
= I - P\,\mathbf{m}^{(j)} (P\,\mathbf{e}_j)^T
= I - \tilde{\mathbf{m}}^{(j)} \mathbf{e}_j^T =: \tilde{L}^{(j)}, \tag{4.14}
\]
where m̃^{(j)} = P m^{(j)} and P e_j = e_j (since P acts only on components with index > j). Obviously, the first j components of m̃^{(j)} = P m^{(j)} are also zero, since m^{(j)} has this property and since the permutation matrix P acts only on components with index > j. Thus L̃^{(j)} := I − m̃^{(j)} e_j^T is a proper elementary lower triangular matrix, and from (4.14),
\[
P\,L^{(j)} = \bigl( P\,L^{(j)}\,P \bigr)\,P^{-1} = \tilde{L}^{(j)}\,P^{-1} = \tilde{L}^{(j)}\,P, \tag{4.15}
\]
since elementary permutation matrices P satisfy P = P^{-1}. Using (4.15), the equation (4.13) can be rewritten as
\[
U = L^{(n-1)}\,P^{(n-1)}\,L^{(n-2)}\,P^{(n-2)}\,L^{(n-3)} \cdots L^{(2)}\,P^{(2)}\,L^{(1)}\,P^{(1)}\,A
= L^{(n-1)}\,\tilde{L}^{(n-2)}\,P^{(n-1)}\,P^{(n-2)}\,L^{(n-3)} \cdots L^{(2)}\,P^{(2)}\,L^{(1)}\,P^{(1)}\,A.
\]
In the same way, we can move P^{(n−1)} P^{(n−2)} to the right of L^{(n−3)}. Continuing with this procedure leads to
\[
U = \bigl( L^{(n-1)}\,\tilde{L}^{(n-2)} \cdots \tilde{L}^{(1)} \bigr)\,\bigl( P^{(n-1)}\,P^{(n-2)} \cdots P^{(1)} \bigr)\,A, \tag{4.16}
\]
which establishes the following theorem.
Theorem 4.16 (LU factorization with pivoting)
For every n × n matrix A there exists an n × n permutation matrix P such that P A possesses an LU factorization, that is, there exist a normalized n × n lower triangular matrix L and an n × n upper triangular matrix U such that P A = L U.

Proof of Theorem 4.16. This follows essentially from the considerations above. From (4.16), we see that there exist an upper triangular matrix U, a permutation matrix P and a normalized lower triangular matrix L̃ such that
\[
U = \tilde{L}\,P\,A \quad\Leftrightarrow\quad \tilde{L}^{-1}\,U = P\,A.
\]
Since the inverse of a normalized lower triangular matrix is again a normalized lower triangular matrix, we have P A = L U, where U is an upper triangular matrix and L = L̃^{-1} is a normalized lower triangular matrix. □
The LU factorization with row pivoting can be implemented with the following MATLAB code:
function [L,U,P] = LU_factorisation_row_piv(A)
%
% algorithm computes LU-factorization with pivoting P*A = L*U
% input: real n by n matrix A
% output: L = n by n real normalized lower triangular matrix
% U = n by n real upper triangular matrix
% P = n by n permutation matrix
%
n = size(A,1);
U = A;
L = eye(n);
P = eye(n);
for i = 1:n-1
u = U(i,:);
l = L(i,1:i-1);
p = P(i,:);
%
[r,q] = find(abs(U(i:n,i)) == max(abs(U(i:n,i))));
r = r(1)+i-1;   % pivot row index; take the first row attaining the maximum
U(i,:) = U(r,:);
U(r,:) = u;
L(i,1:i-1) = L(r,1:i-1);
L(r,1:i-1) = l;
P(i,:) = P(r,:);
P(r,:) = p;
for j = i+1:n
L(j,i) = U(j,i)/U(i,i);
U(j,i:n) = U(j,i:n) - L(j,i)*U(i,i:n);
end
end
4.4 Cholesky Factorisation
In this section we want to exploit the LU factorization to derive a factorization for Hermitian
matrices A of the form A = L L∗ with a lower triangular matrix L. Such a factorization will
not exist for arbitrary Hermitian matrices A, but it does exist for those Hermitian matrices
that are positive definite.
Let us assume that A has a unique LU factorization A = L U and that A is Hermitian, that is, A* = A. Then we have
\[
L\,U = A = A^* = (L\,U)^* = U^*\,L^*.
\]
Since U is an upper triangular matrix, U* is a lower triangular matrix. Since L is a normalized lower triangular matrix, L* is a normalized upper triangular matrix. We would like to use the uniqueness of the LU factorization to conclude that L = U* and U = L*. Unfortunately, this is not possible, since the uniqueness requires the lower triangular matrix to be normalized, that is, to have only ones as diagonal entries, and this will in general not be the case for U*.
But if A is invertible, the LU factorization is unique and we can write
\[
U = D\,\tilde{U}
\]
with a diagonal matrix D = diag(d_{1,1}, d_{2,2}, . . . , d_{n,n}) and a normalized upper triangular matrix Ũ. Then, we can conclude that
\[
A = L\,D\,\tilde{U}
\]
and hence
\[
A^* = \bigl( L\,D\,\tilde{U} \bigr)^* = \tilde{U}^*\,D^*\,L^* = A = L\,D\,\tilde{U} = L\,U, \tag{4.17}
\]
so that we can now apply the uniqueness result to derive
\[
L = \tilde{U}^* \qquad\text{and}\qquad U = D\,\tilde{U} = D^*\,L^*, \tag{4.18}
\]
which gives, from (4.17) and (4.18),
\[
A = L\,D^*\,L^*. \tag{4.19}
\]
Applying now A* = A again in (4.19) and using that L is invertible, we find
\[
A^* = \bigl( L\,D^*\,L^* \bigr)^* = L\,D\,L^* = A = L\,D^*\,L^*
\quad\Rightarrow\quad L\,D^*\,L^* = L\,D\,L^*
\quad\Rightarrow\quad D^* = D,
\]
that is, the diagonal matrix has only real entries. Thus (4.19) can be written as
\[
A = L\,D\,L^*, \tag{4.20}
\]
where L is a normalized lower triangular matrix and D is a diagonal matrix with real entries. Let us finally make the assumption that all diagonal entries of D are positive. Then, we can define the square root of D = diag(d_{1,1}, d_{2,2}, . . . , d_{n,n}) by setting
\[
D^{1/2} := \operatorname{diag}\bigl( \sqrt{d_{1,1}}, \sqrt{d_{2,2}}, \ldots, \sqrt{d_{n,n}} \bigr),
\]
which, from (4.20) and (D^{1/2})* = D^{1/2}, leads to the decomposition
\[
A = L\,D^{1/2}\,D^{1/2}\,L^* = \bigl( L\,D^{1/2} \bigr) \bigl( L\,D^{1/2} \bigr)^* = \tilde{L}\,\tilde{L}^*,
\qquad\text{with}\quad \tilde{L} = L\,D^{1/2}. \tag{4.21}
\]
Definition 4.17 (Cholesky factorization)
A factorization of an n × n matrix A of the form A = L L∗ with an n × n lower triangular
matrix L is called a Cholesky factorization of A.
It remains to verify for what kind of matrices a Cholesky factorization exists. From (L L∗)∗ =
(L∗)∗ L∗ = L L∗, we see that only Hermitian matrices can have a Cholesky factorization.
Theorem 4.18 (positive definite matrix has Cholesky factorization)
Suppose A ∈ Cn×n is Hermitian (that is, A = A∗) and positive definite. Then, A possesses
a Cholesky factorization.
Proof of Theorem 4.18. If a matrix A is positive definite, then the determinant of every upper principal submatrix A_p is positive (see Subsection 1.2). From Theorem 4.7 and Theorem 4.8 we can therefore conclude that A has a unique LU factorization A = L U, with L a normalized lower triangular matrix and U an upper triangular matrix. From the argumentation at the beginning of this section we therefore find from (4.20) that
\[
A = L\,D\,L^* \quad\Leftrightarrow\quad L^{-1}\,A\,(L^{-1})^* = D,
\]
with a diagonal matrix D = diag(d_{1,1}, d_{2,2}, . . . , d_{n,n}), where d_{i,i} ∈ R for all i = 1, 2, . . . , n. Finally, since A is positive definite and L^{-1} is non-singular,
\[
d_{i,i} = \mathbf{e}_i^*\,D\,\mathbf{e}_i
= \mathbf{e}_i^* \bigl( L^{-1}\,A\,(L^{-1})^* \bigr) \mathbf{e}_i
= \bigl( (L^{-1})^* \mathbf{e}_i \bigr)^* A \bigl( (L^{-1})^* \mathbf{e}_i \bigr) > 0,
\]
that is, the diagonal matrix D has positive diagonal entries. As explained above, we can then take the square root D^{1/2} = diag(√d_{1,1}, √d_{2,2}, . . . , √d_{n,n}) of D, and from (4.21) we obtain the Cholesky factorization of A. □
For the numerical algorithm we restrict ourselves to the case of real-valued matrices. Then the Cholesky factorization reads A = L L^T with a lower triangular matrix L. It is again useful to write A = L L^T component-wise, taking into account that L is lower triangular. For any i ≥ j we have
\[
a_{i,j} = \sum_{k=1}^{j} \ell_{i,k}\,\ell_{j,k} = \sum_{k=1}^{j-1} \ell_{i,k}\,\ell_{j,k} + \ell_{i,j}\,\ell_{j,j}.
\]
This can first be resolved for i = j:
\[
\ell_{j,j} = \Bigl( a_{j,j} - \sum_{k=1}^{j-1} \ell_{j,k}^2 \Bigr)^{1/2}. \tag{4.22}
\]
With this, we can successively compute the other coefficients. For i > j we have
\[
\ell_{i,j} = \frac{1}{\ell_{j,j}} \Bigl( a_{i,j} - \sum_{k=1}^{j-1} \ell_{i,k}\,\ell_{j,k} \Bigr). \tag{4.23}
\]
Thus we proceed as follows. We first compute the diagonal entry ℓ_{1,1} from (4.22) with j = 1:
\[
\ell_{1,1} = (a_{1,1})^{1/2}.
\]
Then we use (4.23) with j = 1 to compute the rest of the first column of L:
\[
\ell_{2,1} = \frac{a_{2,1}}{\ell_{1,1}}, \qquad
\ell_{3,1} = \frac{a_{3,1}}{\ell_{1,1}}, \qquad \ldots, \qquad
\ell_{n,1} = \frac{a_{n,1}}{\ell_{1,1}}.
\]
Next we use (4.22) with j = 2 to compute
\[
\ell_{2,2} = \bigl( a_{2,2} - (\ell_{2,1})^2 \bigr)^{1/2},
\]
and then we use (4.23) with j = 2 to compute the rest of the second column of L:
\[
\ell_{3,2} = \frac{1}{\ell_{2,2}} \bigl( a_{3,2} - \ell_{3,1}\,\ell_{2,1} \bigr), \qquad
\ell_{4,2} = \frac{1}{\ell_{2,2}} \bigl( a_{4,2} - \ell_{4,1}\,\ell_{2,1} \bigr), \qquad \ldots, \qquad
\ell_{n,2} = \frac{1}{\ell_{2,2}} \bigl( a_{n,2} - \ell_{n,1}\,\ell_{2,1} \bigr).
\]
We continue this process to compute the third column of L, and so on, until the nth column of L has been computed.

We can thus describe the algorithm for the Cholesky factorization of a real positive definite symmetric matrix in the following pseudo-algorithmic form:
Algorithm 2 Cholesky Factorization
1: input: real positive definite symmetric n × n matrix A = (a_{i,j})
2: initialize L = (ℓ_{i,j}) as n × n zero matrix
3: for j = 1, 2, . . . , n do
4:   ℓ_{j,j} = ( a_{j,j} − Σ_{k=1}^{j−1} ℓ²_{j,k} )^{1/2}
5:   for i = j + 1, . . . , n do
6:     ℓ_{i,j} = (1/ℓ_{j,j}) ( a_{i,j} − Σ_{k=1}^{j−1} ℓ_{i,k} ℓ_{j,k} )
7:   end for
8: end for
The Cholesky factorization can be implemented with the following MATLAB code:
function [L] = cholesky(A)
%
% algorithm computes Cholesky factorization A= L L^T, if it exists
% input: A = symmetric real n by n matrix
% output: L = n by n lower triangular matrix
%
n = size(A,1);
L = zeros(n,n);
for j = 1:n
L(j,j) = sqrt( A(j,j) - L(j,1:j-1) * L(j,1:j-1)' );
L(j+1:n,j) = ( A(j+1:n,j) - L(j+1:n,1:j-1) * L(j,1:j-1)' ) / L(j,j);
end
Example 4.19 (Cholesky factorization)
Find the Cholesky factorization of the following positive definite matrix:
\[
A = \begin{pmatrix} 1 & -1 & 2\\ -1 & 5 & 0\\ 2 & 0 & 6 \end{pmatrix}.
\]
Solution: We apply Algorithm 2. Remember that we first compute the first column, then the second column, and so on. Computing the first column, we find
\[
\ell_{1,1} = \sqrt{a_{1,1}} = \sqrt{1} = 1, \qquad
\ell_{2,1} = \frac{a_{2,1}}{\ell_{1,1}} = \frac{-1}{1} = -1, \qquad
\ell_{3,1} = \frac{a_{3,1}}{\ell_{1,1}} = \frac{2}{1} = 2.
\]
Now we can compute the second column,
\[
\ell_{2,2} = \bigl( a_{2,2} - \ell_{2,1}^2 \bigr)^{1/2} = \bigl( 5 - (-1)^2 \bigr)^{1/2} = \sqrt{4} = 2, \qquad
\ell_{3,2} = \frac{1}{\ell_{2,2}} \bigl( a_{3,2} - \ell_{3,1}\,\ell_{2,1} \bigr) = \frac{1}{2} \bigl( 0 - 2\,(-1) \bigr) = 1,
\]
and finally the third column,
\[
\ell_{3,3} = \bigl( a_{3,3} - \ell_{3,1}^2 - \ell_{3,2}^2 \bigr)^{1/2} = \bigl( 6 - 2^2 - 1^2 \bigr)^{1/2} = \sqrt{1} = 1.
\]
Thus the lower triangular matrix L is given by
\[
L = \begin{pmatrix} 1 & 0 & 0\\ -1 & 2 & 0\\ 2 & 1 & 1 \end{pmatrix},
\]
and evaluation of L L^T shows easily that indeed A = L L^T. □
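This hand computation can be checked with the MATLAB function cholesky from above (a usage sketch; the expected output is given in the comment):

A = [1 -1 2; -1 5 0; 2 0 6];
L = cholesky(A)
% expected: L = [1 0 0; -1 2 0; 2 1 1], and L*L' reproduces A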
It can be shown that the computational cost of the Cholesky factorization is approximately n³/6 + O(n²) elementary operations, where we count one multiplication plus one addition as one elementary operation. This is about half the complexity of the standard Gaussian elimination process. However, here we also have to take n square roots.
Exercise 63 Compute the Cholesky factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 1 & 2 & 3\\ 2 & 5 & 8\\ 3 & 8 & 14 \end{pmatrix}.
\]
(Note that A is positive definite, so that we know that a Cholesky factorization exists. You do not have to verify that A is positive definite.)
Exercise 64 Show that the computational cost of the Cholesky factorization is approximately n³/6 + O(n²) elementary operations, where we count one multiplication plus one addition as one elementary operation.
Exercise 65 Is the Cholesky factorization of a positive definite n × n matrix unique? Give a
proof of your answer.
Exercise 66 Compute the Cholesky factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 4 & 2 & 6\\ 2 & 5 & 5\\ 6 & 5 & 14 \end{pmatrix}.
\]
Once you have computed the Cholesky factorization, use it to conclude that the matrix A is positive definite.
4.5 QR Factorization
A typical step in the Gaussian elimination process multiplies the given matrix A with a lower triangular matrix L from the left. For the condition number of the resulting matrix, this means
\[
\kappa(L\,A) = \|L\,A\|\,\|(L\,A)^{-1}\| \le \|L\|\,\|A\|\,\|A^{-1}\|\,\|L^{-1}\| = \kappa(A)\,\kappa(L),
\]
so that we may end up with a worse condition number. From this point of view it would be good to have a process which transforms the matrix A into an upper triangular matrix without altering the condition number. Such a process is given by the QR factorization.

Recall from Section 2.3 that a Householder matrix is a matrix of the form H = H(w) = I − 2ww*, where w*w = 1, and that the vector w ∈ C^n can be chosen such that H x = c e_1 for a given vector x ∈ C^n. The coefficient c ∈ C satisfies |c| = ‖x‖₂. Also remember that H(w) is Hermitian and unitary and that det(H(w)) = −1 for w ≠ 0.
Hence, if we let x be the first column of A, we find a first Householder matrix H_1 ∈ C^{n×n} such that the first column vector of A is mapped by H_1 onto α_1 e_1, that is,
\[
H_1\,A = \begin{pmatrix} \alpha_1 & * & \cdots & *\\ 0 & & & \\ \vdots & & A_1 & \\ 0 & & & \end{pmatrix},
\]
with |α_1| = ‖x‖₂ and an (n−1) × (n−1) matrix A_1. Considering the first column x̃ ∈ C^{n−1} of A_1, we can find a second Householder matrix H̃_2 ∈ C^{(n−1)×(n−1)} such that H̃_2 x̃ = α_2 e_1, where now e_1 ∈ C^{n−1} and |α_2| = ‖x̃‖₂. The matrix
\[
H_2 = \begin{pmatrix} 1 & 0 & \cdots & 0\\ 0 & & & \\ \vdots & & \tilde{H}_2 & \\ 0 & & & \end{pmatrix}
\]
is easily seen to be unitary, and we get
\[
H_2\,H_1\,A = \begin{pmatrix} \alpha_1 & * & * & \cdots & *\\ 0 & \alpha_2 & * & \cdots & *\\ 0 & 0 & & & \\ \vdots & \vdots & & A_2 & \\ 0 & 0 & & & \end{pmatrix},
\]
with an (n−2) × (n−2) matrix A_2. We can proceed in this fashion to derive
\[
H_{n-1}\,H_{n-2} \cdots H_2\,H_1\,A = R, \tag{4.24}
\]
where R is an upper triangular matrix. (Note that if we find in the jth step that the first column vector of the submatrix A_{j−1} is the zero vector, then we choose H_j = I.) Since all Householder matrices H_j are unitary, so is their product, and so are their inverses. Therefore, from (4.24),
\[
A = Q\,R, \qquad\text{with the unitary matrix}\quad Q := H_1^{-1}\,H_2^{-1} \cdots H_{n-2}^{-1}\,H_{n-1}^{-1}.
\]
From the construction of the unitary matrices H_j, it is easily seen that H_j = H_j^* = H_j^{-1}, and thus Q can be written as
\[
Q = H_1\,H_2 \cdots H_{n-2}\,H_{n-1}.
\]
Thus we have proven the following result.
Theorem 4.20 (QR factorization)
Every matrix A ∈ Cn×n possesses a QR factorization A = Q R with a unitary matrix Q
and an upper triangular matrix R.
Example 4.21 (QR factorization)
Compute the QR factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 2 & -3 & 3\\ -2 & 6 & 6\\ 1 & 0 & 3 \end{pmatrix}.
\]
Solution: Since for the first column vector (2, −2, 1)^T of A we have ‖(2, −2, 1)^T‖₂ = 3, we choose w_1 := z/‖z‖₂ with
\[
\mathbf{z} = \begin{pmatrix} 2\\ -2\\ 1 \end{pmatrix} - 3 \begin{pmatrix} 1\\ 0\\ 0 \end{pmatrix}
= \begin{pmatrix} -1\\ -2\\ 1 \end{pmatrix}, \qquad \|\mathbf{z}\|_2 = \sqrt{6}.
\]
Thus
\[
\mathbf{w}_1 = \frac{1}{\sqrt{6}} \begin{pmatrix} -1\\ -2\\ 1 \end{pmatrix},
\]
and the first Householder matrix is given by
\[
H_1 := H(\mathbf{w}_1) = I - 2\,\mathbf{w}_1\,\mathbf{w}_1^T
= I - \frac{1}{3} \begin{pmatrix} -1\\ -2\\ 1 \end{pmatrix} (-1, -2, 1)
= I - \frac{1}{3} \begin{pmatrix} 1 & 2 & -1\\ 2 & 4 & -2\\ -1 & -2 & 1 \end{pmatrix}
= \begin{pmatrix} \tfrac{2}{3} & -\tfrac{2}{3} & \tfrac{1}{3}\\ -\tfrac{2}{3} & -\tfrac{1}{3} & \tfrac{2}{3}\\ \tfrac{1}{3} & \tfrac{2}{3} & \tfrac{2}{3} \end{pmatrix}.
\]
We find that
\[
H_1\,A = \begin{pmatrix} \tfrac{2}{3} & -\tfrac{2}{3} & \tfrac{1}{3}\\ -\tfrac{2}{3} & -\tfrac{1}{3} & \tfrac{2}{3}\\ \tfrac{1}{3} & \tfrac{2}{3} & \tfrac{2}{3} \end{pmatrix}
\begin{pmatrix} 2 & -3 & 3\\ -2 & 6 & 6\\ 1 & 0 & 3 \end{pmatrix}
= \begin{pmatrix} 3 & -6 & -1\\ 0 & 0 & -2\\ 0 & 3 & 7 \end{pmatrix}.
\]
In the next step we consider the submatrix
\[
A_1 = \begin{pmatrix} 0 & -2\\ 3 & 7 \end{pmatrix},
\]
and choose for this submatrix w_2 = v/‖v‖₂ with
\[
\mathbf{v} = \begin{pmatrix} 0\\ 3 \end{pmatrix} - 3 \begin{pmatrix} 1\\ 0 \end{pmatrix}
= \begin{pmatrix} -3\\ 3 \end{pmatrix}, \qquad \|\mathbf{v}\|_2 = 3\sqrt{2}.
\]
Hence
\[
\mathbf{w}_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} -1\\ 1 \end{pmatrix},
\]
and the 2 × 2 Householder matrix is given by
\[
\tilde{H}_2 := H(\mathbf{w}_2) = I - \begin{pmatrix} -1\\ 1 \end{pmatrix} (-1, 1)
= I - \begin{pmatrix} 1 & -1\\ -1 & 1 \end{pmatrix}
= \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}.
\]
Thus the unitary matrix H_2 is given by
\[
H_2 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0 \end{pmatrix},
\]
and we have
\[
R := H_2\,H_1\,A
= \begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} 3 & -6 & -1\\ 0 & 0 & -2\\ 0 & 3 & 7 \end{pmatrix}
= \begin{pmatrix} 3 & -6 & -1\\ 0 & 3 & 7\\ 0 & 0 & -2 \end{pmatrix}.
\]
We have A = Q R with the unitary matrix Q defined by
\[
Q = H_1^{-1}\,H_2^{-1} = H_1^T\,H_2^T = H_1\,H_2
= \begin{pmatrix} \tfrac{2}{3} & \tfrac{1}{3} & -\tfrac{2}{3}\\ -\tfrac{2}{3} & \tfrac{2}{3} & -\tfrac{1}{3}\\ \tfrac{1}{3} & \tfrac{2}{3} & \tfrac{2}{3} \end{pmatrix}. \qquad\square
\]
The QR factorization of a real matrix can be implemented with the following MATLAB code:
function [Q,R] = QR_factorization(A)
%
% algorithm computes the QR factorization of A, that is, A=Q*R
% input: A = real n by n matrix
% output: Q = real orthogonal n by n matrix
% R = real n by n upper triangular matrix
%
n = size(A,1);
Q = eye(n,n);
R = A;
%
for j=1:n-1
if max(abs(R(j:n,j))) > 0   % if the column is already zero, H_j = I and nothing is done
v = R(j:n,j) - norm(R(j:n,j)) * [1 ; zeros(n-j,1)];
w = [ zeros(j-1,1) ; v/norm(v) ];
R = R - 2 * w * (w' * R);   % apply the Householder reflection to R
Q = Q - 2 * w * (w' * Q);   % accumulate the reflections in Q
end
end
Q = Q’;
Exercise 67 Compute the QR factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 1 & 0 & 3\\ 2 & -6 & 3\\ -2 & 3 & -3 \end{pmatrix}.
\]
The above-mentioned advantage of a stable condition number follows from the following facts, where we now assume that A is non-singular: From A = Q R we have R = Q^{-1} A = Q* A, since Q is unitary (that is, Q* = Q^{-1}). The induced matrix 2-norm of R = Q* A satisfies
\[
\|R\|_2 = \|Q^* A\|_2
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \|Q^* A\,\mathbf{x}\|_2
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \sqrt{(Q^* A\,\mathbf{x})^* (Q^* A\,\mathbf{x})}
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \bigl( \mathbf{x}^* A^* Q\,Q^* A\,\mathbf{x} \bigr)^{1/2}
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \bigl( \mathbf{x}^* A^* A\,\mathbf{x} \bigr)^{1/2}
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \|A\,\mathbf{x}\|_2 = \|A\|_2, \tag{4.25}
\]
where we have used that Q Q* = I since Q is unitary. Likewise, we have A^{-1} = (Q R)^{-1} = R^{-1} Q^{-1}, and thus R^{-1} = A^{-1} Q, and
\[
\|R^{-1}\|_2 = \|A^{-1} Q\|_2
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \|A^{-1} Q\,\mathbf{x}\|_2
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|Q\mathbf{x}\|_2 = 1} \|A^{-1} Q\,\mathbf{x}\|_2
= \sup_{\mathbf{y} \in \mathbb{C}^n,\, \|\mathbf{y}\|_2 = 1} \|A^{-1}\,\mathbf{y}\|_2 = \|A^{-1}\|_2, \tag{4.26}
\]
where we have used that ‖Qx‖₂ = (x* Q* Q x)^{1/2} = (x* x)^{1/2} = ‖x‖₂ (since Q is unitary) and afterwards substituted y = Q x. From (4.25) and (4.26) we see that, with respect to the induced matrix 2-norm, the condition number of the upper triangular matrix R is the same as the condition number of A:
\[
\kappa_2(R) = \|R\|_2\,\|R^{-1}\|_2 = \|A\|_2\,\|A^{-1}\|_2 = \kappa_2(A).
\]
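This invariance of the condition number is easy to check numerically; a small MATLAB sketch (our own, using the built-in functions qr and cond, where cond computes the condition number with respect to the 2-norm):

A = rand(50);            % a random (almost surely non-singular) test matrix
[Q,R] = qr(A);
cond(A) - cond(R)        % should be of the order of the rounding error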
Since the multiplication of a Householder matrix with a vector can be performed in O(n) time, each step in deriving the upper triangular matrix costs about O(n²) time, so that the total complexity is again O(n³).
Application 4.22 (least squares solution)
An application of the QR factorization is the computation of least squares solutions. Assume that the matrix A is a real m × n matrix with m > n, which means that we have more equations than unknowns and which makes the problem Ax = b in general unsolvable. A possible remedy is to look for the vector x which minimizes the error norm ‖Ax − b‖₂². It is possible to show that this vector x has to satisfy the normal equations
\[
A^T A\,\mathbf{x} = A^T \mathbf{b},
\]
which are uniquely solvable if A has full rank n. Unfortunately, the normal equations are usually very ill-conditioned, so that solving the normal equations directly is not really an option.

But we can use the QR factorization to compute the solution of this minimization problem. To this end, it is important to note that even if A is not a square matrix, we can find a factorization of the form A = Q R, where Q is now an m × m orthogonal matrix and R ∈ R^{m×n} has the form
\[
R = \begin{pmatrix} R_1\\ 0 \end{pmatrix}
\]
with an upper triangular matrix R_1 ∈ R^{n×n}. With this, and using that Q^T = Q^{-1}, we can write
\[
\|A\,\mathbf{x} - \mathbf{b}\|_2^2
= \bigl\| Q\,R\,\mathbf{x} - \mathbf{b} \bigr\|_2^2
= \bigl\| Q \bigl( R\,\mathbf{x} - Q^T \mathbf{b} \bigr) \bigr\|_2^2
= \bigl( R\,\mathbf{x} - Q^T \mathbf{b} \bigr)^T Q^T Q \bigl( R\,\mathbf{x} - Q^T \mathbf{b} \bigr)
= \bigl\| R\,\mathbf{x} - Q^T \mathbf{b} \bigr\|_2^2
= \left\| \begin{pmatrix} R_1\,\mathbf{x}\\ \mathbf{0} \end{pmatrix} - \begin{pmatrix} \mathbf{c}\\ \mathbf{d} \end{pmatrix} \right\|_2^2
= \|R_1\,\mathbf{x} - \mathbf{c}\|_2^2 + \|\mathbf{d}\|_2^2,
\]
where we have split the vector Q^T b into the two components (c^T, d^T)^T, with c ∈ R^n and d ∈ R^{m−n}. From the last equation we see that there is an unavoidable error ‖d‖₂². However, we can minimize the error by choosing x as the solution of R_1 x = c. Since R_1 is an upper triangular matrix, this can be done by back substitution. (Note that R_1 is invertible, since we have assumed that A has full rank.) □
Exercise 68 Compute the QR factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 2 & 5 & 3\\ 4 & 4 & -3\\ -4 & 2 & 3 \end{pmatrix}.
\]
Exercise 69 Let A ∈ R^{m×n}, with m > n, and assume that A has full rank. Let b ∈ R^m. Consider the functional
\[
f(\mathbf{x}) = \|A\,\mathbf{x} - \mathbf{b}\|_2^2, \qquad \mathbf{x} \in \mathbb{R}^n.
\]
Show that there exists a unique vector x* ∈ R^n that minimizes the functional f and that this vector x* is the unique solution of the so-called normal equations
\[
A^T A\,\mathbf{x}^* = A^T \mathbf{b}.
\]
(Hint: Use calculus to investigate the minimum of the functional f. Do not forget to explain why the solution to the normal equations is unique.)
Exercise 70 Let A ∈ R^{m×n} with m ≥ n be given. Assume that A = U Σ V^T with orthogonal matrices U ∈ R^{m×m}, V ∈ R^{n×n}, and a diagonal matrix Σ ∈ R^{m×n} that has only non-negative diagonal entries. (By a diagonal matrix Σ ∈ R^{m×n}, where m ≥ n, we mean that Σ = (s_{i,j}) and all entries of Σ except s_{1,1}, s_{2,2}, . . . , s_{n,n} are zero.) Use the decomposition A = U Σ V^T to compute the solution to
\[
\min_{\mathbf{x} \in \mathbb{R}^n} \bigl\| A\,\mathbf{x} - \mathbf{b} \bigr\|_2.
\]
Chapter 5
Iterative Methods for Linear Systems
In this chapter, we discuss iterative methods for solving linear systems. In contrast to a direct
method (as discussed in the previous chapter), an iterative method constructs a sequence of
approximate solutions {x(j)} that should ideally converge to the true solution x of Ax = b.
In Section 5.1 we explain the main idea behind iterative methods. In Section 5.2, we discuss one
of the most basic iterative methods that you have already encountered in Applied Mathematics,
namely, Banach’s fixed point iteration. In Section 5.3, we introduce the Jacobi iteration
and the Gauss-Seidel iteration, and in Section 5.4, we learn how these methods may be
improved with relaxation.
5.1 Introduction
The general idea behind iterative methods is the following. Suppose we want to solve a linear system Ax = b, where A is an invertible n × n matrix. If A is very large, say n = 10⁶, then a direct solver as discussed in the previous chapter is no longer feasible for solving the linear system Ax = b (since the number of elementary operations is O(n³)). Instead we want to use a so-called iterative method, which constructs a sequence {x^{(j)}} of vectors in R^n that approximate the true solution x of Ax = b. For j → ∞, we should have that
\[
\lim_{j \to \infty} \mathbf{x}^{(j)} = \mathbf{x},
\]
although in reality this will not always be the case, due to rounding errors and stability issues.

What do the 'approximations' x^{(j)} look like? The most common form is
\[
\mathbf{x}^{(j+1)} = \mathbf{x}^{(j)} + B^{-1} \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr), \tag{5.1}
\]
where B is an invertible matrix such that B^{-1} is an approximation of the inverse A^{-1}. Usually, B will be a 'simplified version' of A whose inverse can be easily computed.
To understand (5.1), replace B^{-1} by A^{-1}; we obtain (multiplying from the left by A in the second step)
\[
\mathbf{x}^{(j+1)} = \mathbf{x}^{(j)} + A^{-1} \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr)
\quad\Leftrightarrow\quad
A\,\mathbf{x}^{(j+1)} = A\,\mathbf{x}^{(j)} + \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr) = \mathbf{b},
\]
that is, x^{(j+1)} would then solve Ax = b. In (5.1), we have instead of A^{-1} the approximation B^{-1} of A^{-1}, and we can interpret this as follows: Our previous iteration step yielded the approximation x^{(j)} of x, and the term
\[
\mathbf{r}^{(j)} := \mathbf{b} - A\,\mathbf{x}^{(j)} = A \bigl( \mathbf{x} - \mathbf{x}^{(j)} \bigr)
\]
is the residual of this approximation (which measures the discrepancy, mapped into the range of A, between x and x^{(j)}). We solve the equation
\[
A\,\mathbf{y} = A \bigl( \mathbf{x} - \mathbf{x}^{(j)} \bigr) = \mathbf{b} - A\,\mathbf{x}^{(j)} = \mathbf{r}^{(j)}
\]
approximately with the help of the approximation B^{-1} of A^{-1}, giving
\[
\mathbf{y} = \mathbf{x} - \mathbf{x}^{(j)} \approx B^{-1} \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr),
\]
and add this approximate correction to the previous approximation; thus
\[
\mathbf{x}^{(j+1)} = \mathbf{x}^{(j)} + B^{-1} \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr),
\]
which is just (5.1).

Sometimes it is not obvious that a complicated iterative method can be interpreted in this sense, but we will see that all the iterative methods discussed in this chapter follow this general idea.
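As a first illustration, the following MATLAB sketch (our own; the matrix, the right-hand side and the choice B = diag(diag(A)) are arbitrary test data, the latter anticipating the Jacobi iteration of Section 5.3) performs a few steps of the generic iteration (5.1):

A = [4 1; 1 3];  b = [1; 2];     % a small test system
B = diag(diag(A));               % a 'simplified version' of A
x = zeros(2,1);                  % starting point
for j = 1:25
    x = x + B \ (b - A*x);       % the iteration step (5.1)
end
x                                % close to the true solution A\b = (1/11, 7/11)^T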
5.2 Fixed Point Iterations
One way of approaching the solution of a linear system Ax = b is to write it as a fixed point equation. We will see a little further below that this again leads to an iterative algorithm of the form discussed in the previous section.
Definition 5.1 (fixed point and fixed point equation)
Consider a function F : Cn → Cn (or F : Rn → Rn). A point x ∈ Cn (or x ∈ Rn) is called
a fixed point of F , if F (x) = x.
The equation F (x) = x is also called a fixed point equation.
Let us write the matrix A as A = A − B + B with an invertible matrix B, which can and will be chosen suitably. Then the equation Ax = b can be reformulated as
\[
\mathbf{b} = A\,\mathbf{x} = (A - B)\,\mathbf{x} + B\,\mathbf{x}
\quad\Leftrightarrow\quad
B\,\mathbf{x} = \mathbf{b} - (A - B)\,\mathbf{x} = (B - A)\,\mathbf{x} + \mathbf{b},
\]
and hence as the fixed point equation
\[
\mathbf{x} = B^{-1} (B - A)\,\mathbf{x} + B^{-1}\,\mathbf{b} =: C\,\mathbf{x} + \mathbf{c}, \tag{5.2}
\]
where C := B^{-1} (B − A) and c := B^{-1} b. Thus the solution x of Ax = b is a fixed point of the mapping
\[
F(\mathbf{x}) := C\,\mathbf{x} + \mathbf{c}.
\]
To calculate this fixed point, we can use the following simple iterative process. We first pick a starting point x^{(0)} and then form
\[
\mathbf{x}^{(j+1)} := F\bigl( \mathbf{x}^{(j)} \bigr), \qquad j = 0, 1, 2, \ldots. \tag{5.3}
\]
From Banach’s fixed point theorem below, we see that under certain assumptions on F the
sequence {x(j)} converges to a fixed point of F .
Definition 5.2 (contraction mapping)
A function F : C^n → C^n (or F : R^n → R^n) is called a contraction with respect to a norm ‖ · ‖ on C^n (or R^n) if there exists a constant 0 < q < 1 such that
\[
\bigl\| F(\mathbf{x}) - F(\mathbf{y}) \bigr\| \le q\,\|\mathbf{x} - \mathbf{y}\| \qquad\text{for all } \mathbf{x}, \mathbf{y} \text{ in } \mathbb{C}^n \text{ (or in } \mathbb{R}^n\text{)}.
\]
Note that a contraction mapping is Lipschitz-continuous with Lipschitz-constant q < 1.
Theorem 5.3 (Banach’s fixed point theorem)
Let C^n (or R^n) be equipped with the norm ‖·‖. Assume that F : C^n → C^n (or F : R^n → R^n) is a contraction with respect to the norm ‖ · ‖, that is, ‖F(x) − F(y)‖ ≤ q ‖x − y‖ for all x, y in C^n (or in R^n), with some constant 0 < q < 1. Then F has exactly one fixed point x. The sequence {x^{(j)}}, defined recursively by x^{(j+1)} := F(x^{(j)}), converges for every starting point x^{(0)} ∈ C^n (or x^{(0)} ∈ R^n) to the fixed point x. Furthermore, we have the error estimates
\[
\|\mathbf{x} - \mathbf{x}^{(j)}\| \le \frac{q}{1-q}\,\|\mathbf{x}^{(j)} - \mathbf{x}^{(j-1)}\| \qquad\text{(a-posteriori estimate)},
\]
\[
\|\mathbf{x} - \mathbf{x}^{(j)}\| \le \frac{q^j}{1-q}\,\|\mathbf{x}^{(1)} - \mathbf{x}^{(0)}\| \qquad\text{(a-priori estimate)}.
\]
If we try to apply this theorem to the function F(x) = C x + c, where C = B^{-1} (B − A) is the iteration matrix and c = B^{-1} b, we see that
\[
\bigl\| F(\mathbf{x}) - F(\mathbf{y}) \bigr\|
= \bigl\| C\,\mathbf{x} + \mathbf{c} - \bigl( C\,\mathbf{y} + \mathbf{c} \bigr) \bigr\|
= \bigl\| C\,(\mathbf{x} - \mathbf{y}) \bigr\|
\le \|C\|\,\|\mathbf{x} - \mathbf{y}\|. \tag{5.4}
\]
(Note that the norms in (5.4) are the vector norm ‖·‖ on C^n (or R^n) and its induced matrix norm, also denoted by ‖ · ‖.) So from Theorem 5.3, we have convergence of Banach's fixed point iteration if ‖C‖ < 1. Unfortunately, this condition depends on the chosen vector norm and the corresponding induced matrix norm. However, since all norms on C^n are equivalent (see Theorem 2.30), whether the sequence {x^{(j)}} converges or not does not depend on the choice of the norm. In other words, having ‖C‖ < 1 in some induced matrix norm is sufficient for convergence but not necessary. A sufficient and necessary condition for convergence can be stated using the spectral radius of the iteration matrix.
Theorem 5.4 (convergence of Banach’s fixed point iteration if F (x) = C x + c)
Let F(x) := C x + c with an n × n matrix C in C^{n×n} (or in R^{n×n}) and a vector c in C^n (or in R^n). Banach's fixed point iteration {x^{(j)}}, defined by
\[
\mathbf{x}^{(j+1)} := F\bigl( \mathbf{x}^{(j)} \bigr) = C\,\mathbf{x}^{(j)} + \mathbf{c},
\]
converges for every starting point x^{(0)} in C^n (or in R^n) to the same vector x if and only if ρ(C) < 1. If ρ(C) < 1, then the limit of {x^{(j)}} is the unique fixed point x of F.
Proof of Theorem 5.4. Assume first that ρ(C) < 1. Then we can pick an ε > 0 such that ρ(C) + ε < 1. By Theorem 2.43, we can find a norm ‖ · ‖ on C^n (or R^n) such that in the corresponding induced matrix norm (which we also denote by ‖ · ‖)
\[
\|C\| \le \rho(C) + \epsilon < 1.
\]
From (5.4), we then have
\[
\bigl\| F(\mathbf{x}) - F(\mathbf{y}) \bigr\| \le \|C\|\,\|\mathbf{x} - \mathbf{y}\|, \qquad\text{with } \|C\| < 1,
\]
and thus F is a contraction. From Banach's fixed point theorem (see Theorem 5.3 above), the sequence {x^{(j)}} converges for every starting point x^{(0)} to the unique fixed point x of F.

Assume now that, for every starting point x^{(0)}, the iteration {x^{(j)}} converges to the same vector x. Since F is continuous, we find
\[
\mathbf{x} = \lim_{j \to \infty} \mathbf{x}^{(j+1)} = \lim_{j \to \infty} F\bigl( \mathbf{x}^{(j)} \bigr) = F(\mathbf{x}),
\]
and we see that the limit x is a fixed point of F. If we pick the starting point x^{(0)} such that z := x^{(0)} − x is an eigenvector of C with eigenvalue λ (that is, given the eigenvector z, we choose x^{(0)} = x + z), then
\[
\mathbf{x}^{(j)} - \mathbf{x} = F\bigl( \mathbf{x}^{(j-1)} \bigr) - F(\mathbf{x})
= C \bigl( \mathbf{x}^{(j-1)} - \mathbf{x} \bigr)
= \ldots = C^j \bigl( \mathbf{x}^{(0)} - \mathbf{x} \bigr)
= \lambda^j \bigl( \mathbf{x}^{(0)} - \mathbf{x} \bigr).
\]
Since the expression on the left-hand side tends to the zero vector for j → ∞, so does the expression on the right-hand side. This, however, is only possible if |λ| < 1. Since λ was an arbitrary eigenvalue of C, this shows that all eigenvalues of C satisfy |λ| < 1, and hence ρ(C) < 1. □
Example 5.5 (Banach’s fixed point iteration)
Consider the affine linear function f : R² → R² defined by
\[
f(\mathbf{x}) = C\,\mathbf{x} + \mathbf{c}
= \begin{pmatrix} \tfrac{1}{4} & 1\\ 0 & -\tfrac{1}{2} \end{pmatrix} \mathbf{x}
+ \begin{pmatrix} -\tfrac{7}{4}\\ \tfrac{3}{2} \end{pmatrix}.
\]
(a) Does Banach's fixed point iteration converge for the given function f? Give a proof of your answer.

(b) If f has any fixed points, then find these fixed points.

(c) For the starting point x^{(0)} = 0 = (0, 0)^T, compute the first four approximations of Banach's fixed point iteration by hand.

Solution:

(a) Since the matrix C is an upper triangular matrix, its diagonal entries are its eigenvalues. Thus we see that the eigenvalues of C are λ₁ = 1/4 and λ₂ = −1/2, and the spectral radius is ρ(C) = max{1/4, |−1/2|} = 1/2. We have thus verified that Banach's fixed point iteration converges for the given function f, since ρ(C) = 1/2 < 1 (see Theorem 5.4).

(b) If x is a fixed point of f, then f(x) = C x + c = x, or equivalently (I − C) x = c. We solve the linear system (I − C) x = c:
\[
\left(\begin{array}{cc|c} \tfrac{3}{4} & -1 & -\tfrac{7}{4}\\[2pt] 0 & \tfrac{3}{2} & \tfrac{3}{2} \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{cc|c} \tfrac{3}{4} & -1 & -\tfrac{7}{4}\\[2pt] 0 & 1 & 1 \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{cc|c} \tfrac{3}{4} & 0 & -\tfrac{3}{4}\\[2pt] 0 & 1 & 1 \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{cc|c} 1 & 0 & -1\\ 0 & 1 & 1 \end{array}\right),
\]
and find that the unique fixed point is given by x = (−1, 1)^T.

(c) The approximations of Banach's fixed point iteration are defined recursively by x^{(j)} = f(x^{(j−1)}). We find
\[
\mathbf{x}^{(1)} = \begin{pmatrix} -\tfrac{7}{4}\\[2pt] \tfrac{3}{2} \end{pmatrix}, \qquad
\mathbf{x}^{(2)} = \begin{pmatrix} \tfrac{1}{4}\bigl(-\tfrac{7}{4}\bigr) + 1\cdot\tfrac{3}{2} - \tfrac{7}{4}\\[2pt] -\tfrac{1}{2}\cdot\tfrac{3}{2} + \tfrac{3}{2} \end{pmatrix}
= \begin{pmatrix} -\tfrac{11}{16}\\[2pt] \tfrac{3}{4} \end{pmatrix},
\]
\[
\mathbf{x}^{(3)} = \begin{pmatrix} \tfrac{1}{4}\bigl(-\tfrac{11}{16}\bigr) + 1\cdot\tfrac{3}{4} - \tfrac{7}{4}\\[2pt] -\tfrac{1}{2}\cdot\tfrac{3}{4} + \tfrac{3}{2} \end{pmatrix}
= \begin{pmatrix} -\tfrac{75}{64}\\[2pt] \tfrac{9}{8} \end{pmatrix}, \qquad
\mathbf{x}^{(4)} = \begin{pmatrix} \tfrac{1}{4}\bigl(-\tfrac{75}{64}\bigr) + 1\cdot\tfrac{9}{8} - \tfrac{7}{4}\\[2pt] -\tfrac{1}{2}\cdot\tfrac{9}{8} + \tfrac{3}{2} \end{pmatrix}
= \begin{pmatrix} -\tfrac{235}{256}\\[2pt] \tfrac{15}{16} \end{pmatrix}
\approx \begin{pmatrix} -0.91797\\ 0.9375 \end{pmatrix}.
\]
After four iteration steps we obtain the approximate solution x^{(4)} ≈ (−0.91797, 0.9375)^T. □
For real fixed point equations, Banach’s fixed point iteration can be implemented with the
following MATLAB code:
function [x] = Banach_fixed_point_iteration(C,c,z,J)
%
% computes Banach’s fixed point iteration x^{(J)} for f(x) = C*x + c
% approximations are given by x^{(j)} = f(x^{(j-1)})
%
% input: C = real n by n matrix with rho(C)<1 (that is, f is a contraction)
% c = real n by 1 vector
% z = real n by 1 vector that is starting point for iteration
% J = number of iteration steps
% output: x = approximation after J steps
%
x = z;
for j=1:J
y = C * x + c;
x = y;
end
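
For instance, with the data of Example 5.5, the function can be called as follows (a usage sketch; it assumes the function file above is saved on the MATLAB path):

C = [1/4, 1; 0, -1/2];      % matrix C from Example 5.5
c = [-7/4; 3/2];            % vector c from Example 5.5
x4 = Banach_fixed_point_iteration(C, c, [0; 0], 4)
% returns approximately (-0.91797, 0.9375)^T, as computed by hand above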
Now we come back to the linear system Ax = b which we had rewritten as the fixed point
equation (see (5.2) above)
x = B^{-1}(B − A) x + B^{-1} b =: F(x),
with some suitable invertible matrix B. We would like to solve Ax = b by computing Banach’s
fixed point iteration for the function F defined above:
x(j+1) = F(x(j)) = B^{-1}(B − A) x(j) + B^{-1} b.   (5.5)
Before we explore particular choices of the invertible matrix B, we investigate why this iteration is of the form discussed in Section 5.1. We can rewrite (5.5) as
x(j+1) = x(j) − B^{-1} A x(j) + B^{-1} b = x(j) + B^{-1}( b − A x(j) ),   (5.6)
and from (5.6) we see that x(j+1) is indeed of the form (5.1). From the motivation given in Section 5.1, it is clear that we would like to choose the invertible matrix B such that B^{-1} is 'close' to A^{-1} (and by implication B is 'close' to A). Then the matrix

C = B^{-1}(B − A) = I − B^{-1} A

should be 'close' to the zero matrix, and we may therefore expect that ρ(C) < 1 for judicious choices of B.
Exercise 71 Let C ∈ Rn×n and let c ∈ Rn. Show that the function F : Rn → Rn, defined by
F (x) := C x + c, x ∈ Rn,
is continuous on Rn.
Exercise 72 Prove Banach’s fixed point theorem (see Theorem 5.3 above).
Exercise 73 Find the fixed points (if they have any) of each of the following three functions f : R → R, g : R3 → R3, and h : R → R, defined by

(a) f(x) = x^3,   (b) g(x) = [ 3 2 1 ; 2 3 2 ; 1 2 3 ] x + ( −1, −2, −1 )^T,   (c) h(x) = exp(x) = e^x.
Exercise 74 Consider the affine linear mapping f : R3 → R3, given by

f(x) = [ −1/2 −1 0 ; 0 1/2 0 ; 1 1 1/2 ] x + ( −1, 1, 2 )^T.
(a) Does Banach’s fixed point iteration converge for the given function f? Give a proof of your
answer.
(b) Find the unique fixed point of f . Show your work.
(c) For the starting point x(0) = 0 = (0, 0, 0)^T, compute the first four approximations of Banach's fixed point iteration for the given function f by hand.
(d) Find a closed (non-recursive) formula for the approximations x(j). Prove your formula.
Exercise 75 Find the fixed points of the following functions:

(a) f(x) = sin(x),   (b) g(x) = [ 1 1 ; −1 2 ] x + ( 0, 1 )^T.
Exercise 76 Consider the affine linear function f : R2 → R2 defined by

f(x) = C x + c = [ 1/2 1/2 ; 1/2 −1/2 ] x + ( −1, 2 )^T.
(a) Does Banach’s fixed point iteration converge for the given function f? Give a proof of your
answer.
(b) Find the unique fixed point of f . Show your work.
(c) For the starting point x(0) = 0 = (0, 0)T , compute the first five approximations of Banach’s
fixed point iteration by hand.
5.3 The Jacobi and Gauss-Seidel Iterations
After this general discussion, we return to the question of how to choose B and hence the iteration matrix C in (5.2). Our initial approach yields the fixed point equation

x = B^{-1}(B − A) x + B^{-1} b =: C x + c

with the iteration matrix

C = B^{-1}(B − A) = I − B^{-1} A,   (5.7)

with an invertible matrix B, which should be sufficiently close to A but also easily invertible.
From now on, we will assume that the diagonal elements of A are all nonzero. This can
be achieved by exchanging rows and/or columns as long as the matrix A is non-singular.
Next, we decompose A in its lower-left sub-diagonal part L, its diagonal part D, and its upper-
right super-diagonal part R, that is,
A = L + D + R,
where L = (ℓi,j) with ℓi,j = ai,j for 1 ≤ j < i ≤ n and zero otherwise, D = (di,j) with di,i = ai,i
for 1 ≤ i ≤ n and zero otherwise, and R = (ri,j) with ri,j = ai,j for 1 ≤ i < j ≤ n and zero
otherwise. The simplest possible approximation to A is then given by choosing its diagonal
part D for B so that the iteration matrix CJ := C = (ci,j) becomes
CJ := I − D^{-1} A = I − D^{-1}(L + D + R) = −D^{-1}(L + R),   (5.8)

with entries

c_{i,j} = −a_{i,j}/a_{i,i} if i ≠ j,   and   c_{i,j} = 0 if i = j.   (5.9)
Hence, we obtain the fixed point iteration

x(j+1) = −D^{-1}(L + R) x(j) + D^{-1} b
       = −D^{-1}[ (L + D + R) − D ] x(j) + D^{-1} b
       = x(j) + D^{-1}( b − A x(j) ).   (5.10)

Rewriting the first line of (5.10) as

x(j+1) = D^{-1}( b − (L + R) x(j) ),

x(j+1) is componentwise given by

x_i^{(j+1)} = (1/a_{i,i}) ( b_i − Σ_{k=1, k≠i}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.
Definition 5.6 (Jacobi method)
Let A = (a_{i,j}) in Cn×n (or in Rn×n) be invertible, and assume that all diagonal elements of A are non-zero. Let b ∈ Cn (or b ∈ Rn). The iteration {x(j)} defined by

x_i^{(j+1)} = (1/a_{i,i}) ( b_i − Σ_{k=1, k≠i}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n,   (5.11)
is called the Jacobi method.
Obviously, one can only expect convergence of the Jacobi method if the original matrix A
‘resembles a diagonal matrix’.
Definition 5.7 (strictly row diagonally dominant)
A matrix A in Cn×n (or in Rn×n) is called strictly row diagonally dominant if

Σ_{k=1, k≠i}^{n} |a_{i,k}| < |a_{i,i}|   for all i = 1, 2, . . . , n.   (5.12)
Example 5.8 (strictly row diagonally dominant matrix)
The matrix

A = (a_{i,j}) = [ 1 1/2 1/3 ; −1 2 1/2 ; 1/2 1/4 5/6 ]

is strictly row diagonally dominant, since

1/2 + 1/3 = 5/6 < 1 = |a_{1,1}|,   |−1| + 1/2 = 3/2 < 2 = |a_{2,2}|,   1/2 + 1/4 = 3/4 < 5/6 = |a_{3,3}|.  2
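
Strict row diagonal dominance is also easy to check in MATLAB. The following small function is a sketch (the function name is chosen here for illustration and does not appear elsewhere in these notes); it returns 1 if the test (5.12) holds for every row and 0 otherwise:

function [flag] = is_strictly_row_diag_dominant(A)
% checks whether the square matrix A satisfies (5.12)
n = size(A,1);
flag = 1;
for i = 1:n
    offdiag = sum(abs(A(i,:))) - abs(A(i,i));   % sum of |a_{i,k}| over k ~= i
    if offdiag >= abs(A(i,i))
        flag = 0;
        return;
    end
end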
The next theorem gives a sufficient condition for the convergence of the Jacobi method.
Theorem 5.9 (convergence of the Jacobi method)
Let the assumptions be the same as in Definition 5.6. The Jacobi method (5.11) converges
for every starting point x(0) in Cn (or in Rn) if the invertible matrix A in Cn×n (or in Rn×n)
is strictly row diagonally dominant.
Proof of Theorem 5.9. We use the induced matrix ∞-norm to calculate the norm of the iteration matrix CJ = (c_{i,j}) given by (5.8) and (5.9):

‖CJ‖∞ = max_{1≤i≤n} Σ_{j=1}^{n} |c_{i,j}| = max_{1≤i≤n} Σ_{j=1, j≠i}^{n} |a_{i,j}|/|a_{i,i}| = max_{1≤i≤n} (1/|a_{i,i}|) Σ_{j=1, j≠i}^{n} |a_{i,j}| < 1,
where the last estimate follows from the assumption that A is strictly row diagonally dominant
(see (5.12)). Since ρ(CJ) ≤ ‖CJ‖∞ < 1, the convergence of the Jacobi method follows from
Theorem 5.4. 2
Example 5.10 (Jacobi iteration)
Consider the linear system

Ax = b,   where A = [ 2 1/2 1/2 ; 1 3 1 ; 2 0 3 ]   and   b = ( 3/2, −2, 2 )^T.
(a) Find the solution x to Ax = b by hand.
(b) Verify that A is strictly row diagonally dominant.
(c) For the starting point x(0) = 0 = (0, 0, 0)T , compute the first four approximations of the
Jacobi method by hand.
Solution:
(a) We solve the linear system Ax = b. We multiply the first row by 1/2 and subtract it from
the second row. Then we subtract the first row from the third row. Finally, we multiply the
first row and the new third row by 2 and multiply the new second row by 4.
[ 2 1/2 1/2 | 3/2 ; 1 3 1 | −2 ; 2 0 3 | 2 ]  ⇔  [ 2 1/2 1/2 | 3/2 ; 0 11/4 3/4 | −11/4 ; 0 −1/2 5/2 | 1/2 ]  ⇔  [ 4 1 1 | 3 ; 0 11 3 | −11 ; 0 −1 5 | 1 ].

Now we multiply the new second row by 1/11 and add it to the new third row, and in the next step we multiply the new third row by 11/58.

⇔  [ 4 1 1 | 3 ; 0 11 3 | −11 ; 0 0 58/11 | 0 ]  ⇔  [ 4 1 1 | 3 ; 0 11 3 | −11 ; 0 0 1 | 0 ].

In the next step we subtract the new third row from the first row and subtract 3 times the new third row from the new second row. Subsequently we divide the new second row by 11. After that we subtract the new second row from the new first row.

⇔  [ 4 1 0 | 3 ; 0 11 0 | −11 ; 0 0 1 | 0 ]  ⇔  [ 4 1 0 | 3 ; 0 1 0 | −1 ; 0 0 1 | 0 ]  ⇔  [ 4 0 0 | 4 ; 0 1 0 | −1 ; 0 0 1 | 0 ].

Finally we divide the new first row by 4.

⇔  [ 1 0 0 | 1 ; 0 1 0 | −1 ; 0 0 1 | 0 ]  ⇔  x = ( 1, −1, 0 )^T,
that is, the solution to Ax = b is given by x = (1,−1, 0)T .
(b) Since

1/2 + 1/2 < 2 = |a_{1,1}|,   1 + 1 < 3 = |a_{2,2}|,   and   2 + 0 < 3 = |a_{3,3}|,
the matrix A is clearly strictly row diagonally dominant.
(c) The first approximation is

x_1^{(1)} = (1/2)( 3/2 − (1/2)·0 − (1/2)·0 ) = 3/4,
x_2^{(1)} = (1/3)( −2 − 1·0 − 1·0 ) = −2/3,
x_3^{(1)} = (1/3)( 2 − 2·0 ) = 2/3,

so that x(1) = ( 3/4, −2/3, 2/3 )^T. The second approximation is

x_1^{(2)} = (1/2)( 3/2 − (1/2)·(−2/3) − (1/2)·(2/3) ) = 3/4,
x_2^{(2)} = (1/3)( −2 − 1·(3/4) − 1·(2/3) ) = −41/36,
x_3^{(2)} = (1/3)( 2 − 2·(3/4) ) = 1/6,

so that x(2) = ( 3/4, −41/36, 1/6 )^T. The third approximation is

x_1^{(3)} = (1/2)( 3/2 − (1/2)·(−41/36) − (1/2)·(1/6) ) = 143/144,
x_2^{(3)} = (1/3)( −2 − 1·(3/4) − 1·(1/6) ) = −35/36,
x_3^{(3)} = (1/3)( 2 − 2·(3/4) ) = 1/6,

so that x(3) = ( 143/144, −35/36, 1/6 )^T. The fourth approximation is

x_1^{(4)} = (1/2)( 3/2 − (1/2)·(−35/36) − (1/2)·(1/6) ) = 137/144,
x_2^{(4)} = (1/3)( −2 − 1·(143/144) − 1·(1/6) ) = −455/432,
x_3^{(4)} = (1/3)( 2 − 2·(143/144) ) = 1/216,

so that x(4) = ( 137/144, −455/432, 1/216 )^T ≈ ( 0.95139, −1.0532, 0.0046296 )^T.
After four iteration steps, we already have a reasonable approximation of the solution. 2
For a real linear system Ax = b, the Jacobi method can be implemented with the following
MATLAB code:
function [x] = jacobi_1(A,b,z,J)
%
% algorithm computes the Jacobi iteration x^{(J)}
% for approximating the solution of A*x = b
%
% input: A = real invertible n by n matrix with non-zero diagonal entries
% b = real n by 1 vector, right-hand side of linear system
% z = real n by 1 vector, starting point x^{(0)} for Jacobi iteration
% J = number of iteration steps
% output: x = x^{(J)} = n by 1 vector of the Jth approximation of the Jacobi iteration
%
n = size(A,1);
x = z;
y = zeros(n,1);
for j = 1:J
y = x;
for i = 1:n
x(i) = ( b(i) - A(i,1:i-1) * y(1:i-1) - A(i,i+1:n) * y(i+1:n) ) / A(i,i);
end
end
If we exploit the matrix-vector structures provided by MATLAB, then we get the following
more economical MATLAB code for the Jacobi method:
function [x] = jacobi_2(A,b,z,J)
%
% algorithm computes the Jacobi iteration x^{(J)}
% for approximating the solution of A*x = b
% this version exploits the matrix-vector structures of MATLAB
%
% input: A = real invertible n by n matrix with non-zero diagonal entries
% b = real n by 1 vector, right-hand side of linear system
% z = real n by 1 vector, starting point x^{(0)} for Jacobi iteration
% J = number of iteration steps
% output: x = x^{(J)} = n by 1 vector of Jth approximation of Jacobi iteration
%
n = size(A,1);
x = z;
d = diag(A);
for i=1:n
A(i,i) = 0;
end
y = zeros(n,1);
for j = 1:J
y = x;
x = ( b - A * y ) ./ d;
end
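
As an illustration (a hypothetical call, assuming jacobi_2 is on the MATLAB path), the fourth Jacobi approximation from Example 5.10 above can be reproduced by:

A = [2, 1/2, 1/2; 1, 3, 1; 2, 0, 3];
b = [3/2; -2; 2];
x4 = jacobi_2(A, b, zeros(3,1), 4)
% returns approximately (0.95139, -1.0532, 0.0046296)^T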
Exercise 77 Which of the following matrices are strictly row diagonally dominant?

A = [ 3 1 −1 ; 1 2 −1 ; 0 −1 5 ],   B = [ 5 2 −2 ; 0 3 −2 ; −1 1 3 ],   C = [ 2 1/2 −1 ; −1 5/2 1 ; 3/2 −1 1/2 ].
Exercise 78 Consider the linear system

Ax = b,   where A = [ 2 0 −1 ; 0 −2 1 ; 1 1 3 ]   and   b = ( 1, −1, 5 )^T.
(a) Compute the solution x of Ax = b by hand.
(b) Verify that the matrix A is strictly row diagonally dominant.
(c) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first four approximations x(j), for
j = 1, 2, 3, 4, of the Jacobi iteration by hand.
Exercise 79 Consider the linear system

Ax = b,   where A = [ 2 0 0 ; 0 2 −1 ; 0 −1 2 ]   and   b = ( 2, 1, 1 )^T.
(a) Compute the solution x of Ax = b by hand.
(b) Verify that the matrix A is strictly row diagonally dominant.
(c) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first four approximations x(j), for
j = 1, 2, 3, 4, of the Jacobi iteration by hand.
(d) Formulate a closed formula for x(j) for general j and prove it.
A closer inspection of the Jacobi method (5.11) shows that the computation of x_i^{(j+1)} is independent of any other x_ℓ^{(j+1)} with ℓ ≠ i. This means that, on a parallel or vector computer, all components of the new approximation x(j+1) can be computed simultaneously.

However, it also gives us the possibility to improve the process. For example, to calculate x_2^{(j+1)} we could already employ the newly computed x_1^{(j+1)}. Then, for computing x_3^{(j+1)} we could use x_1^{(j+1)} and x_2^{(j+1)}, and so on.
This leads to the following iteration scheme.
Definition 5.11 (Gauss-Seidel iteration method)
Let A = (ai,j) in Cn×n (or in Rn×n) be invertible, and assume that A has non-zero diagonal
elements. Let b ∈ Cn (or b ∈ Rn). The Gauss-Seidel method is given by the iteration
scheme {x(j)} with
x_i^{(j+1)} = (1/a_{i,i}) ( b_i − Σ_{k=1}^{i−1} a_{i,k} x_k^{(j+1)} − Σ_{k=i+1}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.   (5.13)

We note for later use that in matrix notation (5.13) reads

x(j+1) = D^{-1}( b − L x(j+1) − R x(j) )  ⇔  D x(j+1) = b − L x(j+1) − R x(j).   (5.14)
To analyze the convergence of this scheme, we have to find the iteration matrix C = B^{-1}(B − A) with a suitable invertible matrix B. To this end, we rewrite the last equation in (5.14) as

(L + D) x(j+1) = −R x(j) + b  ⇔  x(j+1) = −(L + D)^{-1} R x(j) + (L + D)^{-1} b.

Thus, the iteration matrix of the Gauss-Seidel method is given by

CGS := −(L + D)^{-1} R = (L + D)^{-1}( (L + D) − A ),   (5.15)
that is, we have chosen B = L + D as our approximation of A.
Later on, we will prove a more general version of the following theorem.
Theorem 5.12 (convergence of Gauss-Seidel method)
Let A in Cn×n or in Rn×n be invertible with non-zero diagonal elements, and let b be in Cn or in Rn, respectively. If A ∈ Cn×n is Hermitian and positive definite, then the Gauss-Seidel
method converges. If A ∈ Rn×n is symmetric and positive definite, then the Gauss-Seidel
method converges.
Example 5.13 (Gauss-Seidel method)
Consider the linear system

Ax = b,   where A = [ 2 1 0 ; 1 2 0 ; 0 0 1 ]   and   b = ( −1, 1, 2 )^T.
(a) Show that the matrix A is positive definite.
(b) Compute the solution x of Ax = b.
(c) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first three steps of the Gauss-Seidel
iteration by hand.
Solution:
(a) We note that A is symmetric. We evaluate x^T A x:

x^T A x = 2 x_1^2 + 2 x_2^2 + 2 x_1 x_2 + x_3^2 = x_1^2 + x_2^2 + x_3^2 + (x_1 + x_2)^2 > 0   for all x ∈ R3 \ {0}.
Thus A is positive definite. Alternatively we could have computed the eigenvalues of A which
yields λ1 = 3 and λ2 = λ3 = 1. Since the eigenvalues are all positive, the matrix A is positive
definite.
(b) We solve Ax = b. In the first step we subtract 1/2 times the first row from the second row. Subsequently, we divide the new second row by 3/2.

[ 2 1 0 | −1 ; 1 2 0 | 1 ; 0 0 1 | 2 ]  ⇔  [ 2 1 0 | −1 ; 0 3/2 0 | 3/2 ; 0 0 1 | 2 ]  ⇔  [ 2 1 0 | −1 ; 0 1 0 | 1 ; 0 0 1 | 2 ].

Finally, we subtract the new second row from the first row. After that we divide the new first row by 2.

⇔  [ 2 0 0 | −2 ; 0 1 0 | 1 ; 0 0 1 | 2 ]  ⇔  [ 1 0 0 | −1 ; 0 1 0 | 1 ; 0 0 1 | 2 ]  ⇔  x = ( −1, 1, 2 )^T.
The unique solution to Ax = b is x = (−1, 1, 2)T .
(c) We compute the first three approximations of the Gauss-Seidel iteration for the starting vector x(0) = 0 = (0, 0, 0)^T:

x_1^{(1)} = (1/2)( −1 − 1·0 − 0·0 ) = −1/2,
x_2^{(1)} = (1/2)( 1 − 1·(−1/2) − 0·0 ) = 3/4,
x_3^{(1)} = (1/1)( 2 − 0·(−1/2) − 0·(3/4) ) = 2,

so that x(1) = ( −1/2, 3/4, 2 )^T;

x_1^{(2)} = (1/2)( −1 − 1·(3/4) − 0·2 ) = −7/8,
x_2^{(2)} = (1/2)( 1 − 1·(−7/8) − 0·2 ) = 15/16,
x_3^{(2)} = (1/1)( 2 − 0·(−7/8) − 0·(15/16) ) = 2,

so that x(2) = ( −7/8, 15/16, 2 )^T;

x_1^{(3)} = (1/2)( −1 − 1·(15/16) − 0·2 ) = −31/32,
x_2^{(3)} = (1/2)( 1 − 1·(−31/32) − 0·2 ) = 63/64,
x_3^{(3)} = (1/1)( 2 − 0·(−31/32) − 0·(63/64) ) = 2,

so that x(3) = ( −31/32, 63/64, 2 )^T ≈ ( −0.96875, 0.98438, 2 )^T.
After three iteration steps, we obtain the approximation x(3) ≈ (−0.96875, 0.98438, 2)T . 2
For real linear systems Ax = b, the Gauss-Seidel method can be implemented with the following
MATLAB code:
function [x] = gauss_seidel(A,b,z,J)
%
% algorithm computes Gauss-Seidel iteration for a given linear system A*x = b
% x^{(j)} = jth approximation in the Gauss-Seidel method
%
% input: A = real n by n invertible matrix with non-zero diagonal elements
% b = real n by 1 vector, right-hand side of linear system
% z = real n by 1 vector,
% starting point x^{(0)} for Gauss-Seidel iteration
% J = number of iteration steps
% output: x = x^{(J)} Gauss-Seidel approximation after J iteration steps
%
n = size(A,1);
x = z;
for j=1:J
for i=1:n
x(i) = ( b(i) - A(i,1:i-1) * x(1:i-1) - A(i,i+1:n) * x(i+1:n) ) / A(i,i);
end
end
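
For instance (a usage sketch, assuming the function above is on the path), the third Gauss-Seidel approximation from Example 5.13 above is reproduced by:

A = [2, 1, 0; 1, 2, 0; 0, 0, 1];
b = [-1; 1; 2];
x3 = gauss_seidel(A, b, zeros(3,1), 3)
% returns approximately (-0.96875, 0.98438, 2)^T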
Exercise 80 Consider the linear system

Ax = b,   where A = [ 2 0 1 ; 0 1 0 ; 1 0 2 ]   and   b = ( 1, 1, −1 )^T.
(i) Show that the matrix A is positive definite.
(ii) Compute the solution x of Ax = b by hand.
(iii) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first three steps of the Gauss-
Seidel iteration by hand.
(iv) Find a closed formula for the jth approximation x(j), j ∈ N. Prove that your formula is
correct.
Exercise 81 Consider the linear system

Ax = b,   where A = [ 2 0 0 ; 0 2 −1 ; 0 −1 2 ]   and   b = ( 2, 1, 1 )^T.
(i) Show that the matrix A is positive definite.
(ii) Compute the solution x of Ax = b by hand.
(iii) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first three steps of the Gauss-
Seidel iteration by hand.
(iv) Find a closed formula for the jth approximation x(j), j ∈ N. Prove that your formula is
correct.
5.4 Relaxation
A further improvement of both the Jacobi method and the Gauss-Seidel method can be achieved
by relaxation.
We start by looking at the Jacobi method. In (5.10), we have seen that the approximations
can be written as
x(j+1) = D^{-1} b − D^{-1}(L + R) x(j)
       = x(j) + D^{-1} b − D^{-1}(L + R + D) x(j)
       = x(j) + D^{-1}( b − A x(j) ).
The last representation shows that the new approximation x(j+1) is given by the old approximation x(j) plus D^{-1} applied to the residual b − A x(j). In practice, one often notices that the correction term D^{-1}( b − A x(j) ) deviates from the 'ideal' correction term by a fixed factor. Hence, it makes sense to introduce a relaxation parameter ω ∈ R+ and to form the new approximation as

x(j+1) = x(j) + ω D^{-1}( b − A x(j) ).   (5.16)
(Note that, in principle, ω could have any positive real value.) Formula (5.16) gives the following
componentwise scheme:
Definition 5.14 (Jacobi relaxation)
Let A = (a_{i,j}) in Cn×n (or in Rn×n) be invertible with non-zero diagonal elements, and let b in Cn (or in Rn). The Jacobi relaxation with relaxation parameter ω > 0 is given by

x_i^{(j+1)} = x_i^{(j)} + (ω/a_{i,i}) ( b_i − Σ_{k=1}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.
Of course, the relaxation parameter ω should be chosen such that the convergence improves
compared to the original Jacobi method. To investigate this, we work out the iteration matrix
of the Jacobi relaxation. By replacing A = L + D + R in (5.16),
x(j+1) = x(j) + ω D^{-1} b − ω D^{-1}(L + D + R) x(j)
       = [ (1 − ω) I − ω D^{-1}(L + R) ] x(j) + ω D^{-1} b,   (5.17)
and we see that the iteration matrix is now
CJ(ω) := (1 − ω) I − ω D^{-1}(L + R) = (1 − ω) I + ω CJ,   (5.18)

where CJ = −D^{-1}(L + R) is the matrix of the classical Jacobi iteration (without relaxation),
see (5.8). As expected, for ω = 1 we find CJ(1) = CJ .
By Theorem 5.4 we know that the Jacobi relaxation converges if ρ(CJ(ω)) < 1, and an inspection of the proof of Theorem 5.4 shows that the smaller ρ(CJ(ω)) is, the faster the Jacobi relaxation converges. Hence it makes sense to determine ω such that ρ(CJ(ω)) is minimized.
Theorem 5.15 (convergence of Jacobi relaxation)
Let the assumptions be the same as in Definition 5.14. Furthermore, let L be the lower-left
sub-diagonal part of A, D the diagonal part of A, and R the upper-right super-diagonal part
of A, and assume that CJ = −D−1 (L + R) has only real eigenvalues λ1 ≤ λ2 ≤ . . . ≤ λn
in the interval (−1, 1) with corresponding eigenvectors z(1), z(2), . . . , z(n). Then, CJ(ω) =
(1 − ω) I + ω CJ has the same eigenvectors z(1), z(2), . . . , z(n), but with the corresponding
eigenvalues µj = µj(ω) = 1 − ω + ω λj, 1 ≤ j ≤ n. The spectral radius of CJ(ω) is
minimized by choosing ω to be

ω* = 2/(2 − λ1 − λn).   (5.19)

If λ1 ≠ −λn, then the Jacobi relaxation converges faster than the Jacobi method.
Figure 5.1: Determination of the relaxation parameter for the Jacobi relaxation. (The figure shows the lines fω(λ) = 1 − ω + ω λ, all passing through the point (1, 1), evaluated at λ1 and λn.)
Proof of Theorem 5.15. First note that the assumption −1 < λ1 ≤ λ2 ≤ . . . ≤ λn < 1
on the eigenvalues of the iteration matrix CJ of the classical Jacobi method guarantees that
ρ(CJ) < 1 and thus the classical Jacobi method converges. Furthermore, since CJ(1) = CJ , we
know that there exists an ω > 0 for which ρ(CJ(ω)) < 1.
For every eigenvector z(j) of CJ it follows that

CJ(ω) z(j) = ( (1 − ω) I + ω CJ ) z(j) = (1 − ω) z(j) + ω λj z(j) = ( 1 − ω + ω λj ) z(j),

that is, z(j) is an eigenvector of CJ(ω) with the eigenvalue 1 − ω + ω λj =: µj(ω). Thus, the spectral radius of CJ(ω) is given by

ρ( CJ(ω) ) = max_{1≤j≤n} |µj(ω)| = max_{1≤j≤n} |1 − ω + ω λj|.
The aim is to choose ω > 0 such that ρ(CJ(ω)) is minimized. For a fixed ω let us have a look
at the function
fω(λ) := 1 − ω + ω λ,
which, as a function of λ, is a straight line with fω(1) = 1. For different choices of ω we get in this way a collection of lines all going through (1, 1) (see Figure 5.1), and it follows
that the maximum in the definition of ρ(CJ(ω)) can only be attained for the indices j = 1 and
j = n, since these belong to the smallest and largest eigenvalues λ1 and λn of CJ , respectively.
Moreover, it follows from Figure 5.1 that ω is optimally chosen if
fω(λ1) = −fω(λn). (5.20)
Writing (5.20) explicitly and solving for ω yields
1 − ω + ω λ1 = −( 1 − ω + ω λn )  ⇔  ω = 2/(2 − λ1 − λn).
This gives (5.19). Since the spectral radius ρ(CJ(ω)) is minimized for ω = ω*, it follows (see Theorem 5.4 and its proof) that the Jacobi relaxation converges fastest if ω = ω*. Since the classical Jacobi method corresponds to ω = 1, the classical Jacobi method is already optimal exactly when ω* = 1, which is equivalent to λ1 = −λn. 2
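
If the eigenvalues of CJ are available (or can be computed), the optimal relaxation parameter (5.19) is obtained directly. A minimal MATLAB sketch (the variable names are chosen for illustration; it assumes, as in Theorem 5.15, that CJ has only real eigenvalues in (−1, 1)):

D = diag(diag(A));                 % diagonal part of A
CJ = eye(size(A)) - D \ A;         % CJ = I - D^{-1} A = -D^{-1} (L + R)
lam = sort(real(eig(CJ)));         % eigenvalues, assumed real (Theorem 5.15)
omega_star = 2 / (2 - lam(1) - lam(end));   % formula (5.19)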
An alternative interpretation of the relaxation can be derived from
x(j+1) = x(j) + ω D^{-1} b − ω D^{-1}(L + D + R) x(j)
       = (1 − ω) x(j) − ω D^{-1}(L + R) x(j) + ω D^{-1} b
       = (1 − ω) x(j) + ω CJ x(j) + ω D^{-1} b
       = (1 − ω) x(j) + ω ( CJ x(j) + D^{-1} b ),
which follows from (5.17) and CJ = −D^{-1}(L + R). Hence, if we define

z(j+1) = CJ x(j) + D^{-1} b,
which is one step of the classical Jacobi method, the next approximation of the Jacobi relaxation
method is
x(j+1) = (1 − ω)x(j) + ω z(j+1). (5.21)
Formula (5.21) can be interpreted as a linear interpolation between the old approximation
x(j) and the new classical Jacobi approximation z(j+1).
This idea can be used to introduce relaxation for the Gauss-Seidel method as well. From
(5.14) the classical Gauss-Seidel iteration is given by
z(j+1) = D^{-1}( b − L z(j+1) − R z(j) ).   (5.22)
In analogy to (5.21), we can now form a new approximation by linear interpolation between z(j+1), given by (5.22), and the previous approximation (where we now rename the approximations to be x(j) and x(j+1) as usual):
x(j+1) = (1 − ω) x(j) + ω D^{-1}( b − L x(j+1) − R x(j) ).   (5.23)
Multiplying (5.23) with D yields

D x(j+1) = (1 − ω) D x(j) + ω b − ω L x(j+1) − ω R x(j).

Hence

(D + ω L) x(j+1) = [ (1 − ω) D − ω R ] x(j) + ω b,

or equivalently (assuming D + ω L is non-singular)

x(j+1) = (D + ω L)^{-1} [ (1 − ω) D − ω R ] x(j) + ω (D + ω L)^{-1} b.
Thus, the iteration matrix of the Gauss-Seidel relaxation is given by

CGS(ω) := (D + ω L)^{-1} [ (1 − ω) D − ω R ].   (5.24)

(We see that, for ω = 1, we have CGS(1) = CGS with the iteration matrix CGS = −(D + L)^{-1} R of the classical Gauss-Seidel method, see (5.15).) Writing (5.23) equivalently as

x(j+1) = x(j) + ω D^{-1}( b − L x(j+1) − D x(j) − R x(j) )
       = x(j) + ω D^{-1}( b − L x(j+1) − (D + R) x(j) ),   (5.25)
the second line in (5.25) shows that the Gauss-Seidel relaxation can be written component-wise
as
x_i^{(j+1)} = x_i^{(j)} + (ω/a_{i,i}) ( b_i − Σ_{k=1}^{i−1} a_{i,k} x_k^{(j+1)} − Σ_{k=i}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.
Definition 5.16 (Gauss-Seidel relaxation or SOR method)
Let A = (a_{i,j}) in Cn×n (or in Rn×n) be invertible and have non-zero diagonal entries, and let b ∈ Cn (or b ∈ Rn). The Gauss-Seidel relaxation or SOR method (or successive over-relaxation method) is defined by

x_i^{(j+1)} = x_i^{(j)} + (ω/a_{i,i}) ( b_i − Σ_{k=1}^{i−1} a_{i,k} x_k^{(j+1)} − Σ_{k=i}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.   (5.26)
Again, we have to deal with the question of how to choose the relaxation parameter.
Theorem 5.17 (necessary condition for convergence of SOR method)
Let the assumptions be the same as in Definition 5.16. The spectral radius of the iteration
matrix

CGS(ω) = (D + ω L)^{-1} [ (1 − ω) D − ω R ]

of the Gauss-Seidel relaxation or SOR method satisfies

ρ( CGS(ω) ) ≥ |ω − 1|.
Hence, convergence is only possible if ω ∈ (0, 2).
Proof of Theorem 5.17. We rewrite the iteration matrix CGS(ω) as follows:

CGS(ω) = (D + ω L)^{-1} [ (1 − ω) D − ω R ]
       = (D + ω L)^{-1} D D^{-1} [ (1 − ω) D − ω R ]
       = ( D^{-1}(D + ω L) )^{-1} [ (1 − ω) I − ω D^{-1} R ]
       = ( I + ω D^{-1} L )^{-1} [ (1 − ω) I − ω D^{-1} R ].   (5.27)

Consider the representation of CGS(ω) in the last line of (5.27). The first matrix in this product is the inverse of a normalized lower triangular matrix (determinant 1), and the second matrix is an upper triangular matrix with diagonal entries all equal to 1 − ω. Thus the determinant of CGS(ω) is given by (where we use det(AB) = det(A) det(B))

det( CGS(ω) ) = det( (I + ω D^{-1} L)^{-1} ) det( (1 − ω) I − ω D^{-1} R ) = (1 − ω)^n.

Since the determinant of a matrix equals the product of its eigenvalues, we have the following: Denote by λ1, λ2, . . . , λn the n eigenvalues of CGS(ω); then

| det( CGS(ω) ) | = |λ1 λ2 · · · λn| = |1 − ω|^n.   (5.28)

It can easily be shown by proof by contradiction that (5.28) implies that there exists at least one eigenvalue λj of CGS(ω) with |λj| ≥ |1 − ω|. Thus, from the definition of the spectral radius,

ρ( CGS(ω) ) ≥ |1 − ω|.   (5.29)
From Theorem 5.4, we know that the method converges if and only if ρ( CGS(ω) ) < 1. Thus from (5.29) convergence is only possible if ω ∈ (0, 2). 2
We will now show that for a positive definite matrix, ω ∈ (0, 2) is also sufficient for convergence.
Since ω = 1 gives the classical Gauss-Seidel method, we also cover Theorem 5.12 as a special
case.
Theorem 5.18 (sufficient condition for convergence of SOR method)
Let A ∈ Cn×n be Hermitian and positive definite and b ∈ Cn, or let A ∈ Rn×n be symmetric and positive definite and b ∈ Rn. Then the Gauss-Seidel relaxation or SOR method converges for every relaxation parameter ω ∈ (0, 2) and every starting point x(0) ∈ Cn or x(0) ∈ Rn, respectively.
Proof of Theorem 5.18. We first observe that the Gauss-Seidel relaxation is well-defined for a positive definite matrix A = (a_{i,j}), since

a_{j,j} = e_j^* A e_j > 0   for all j = 1, 2, . . . , n,

that is, A has non-zero diagonal entries.
We have to show that ρ( CGS(ω) ) < 1. To this end, we rewrite the iteration matrix CGS(ω) in the form (see the first line in (5.27))

CGS(ω) = (D + ω L)^{-1} [ (1 − ω) D − ω R ]
       = (D + ω L)^{-1} [ D + ω L − ω (L + D + R) ]
       = I − ω (D + ω L)^{-1} A
       = I − ( ω^{-1} D + L )^{-1} A
       = I − B^{-1} A,

with B := ω^{-1} D + L. Let λ ∈ C be an eigenvalue of CGS(ω) with corresponding eigenvector x ∈ Cn, which we assume to be normalized such that ‖x‖2 = 1. Then, we have

CGS(ω) x = ( I − B^{-1} A ) x = λ x  ⇔  (1 − λ) x = B^{-1} A x  ⇔  A x = (1 − λ) B x.
Since A is positive definite, and hence invertible, we must have λ ≠ 1. (Otherwise we would have A x = 0 for the non-zero vector x, which would contradict the fact that A is invertible.) Since A is positive definite, we have

0 < x* A x = (1 − λ) x* B x,

and since λ ≠ 1,

1/(1 − λ) = (x* B x)/(x* A x).
Since A is Hermitian, that is, A∗ = A, we have
L∗ + D∗ + R∗ = L + D + R ⇒ L∗ = R, D = D∗, R∗ = L.
Thus for the matrix B = ω^{-1} D + L introduced earlier in this proof,

B + B* = ( ω^{-1} D + L ) + ( ω^{-1} D + L )* = ω^{-1} D + L + ω^{-1} D* + L*
       = ω^{-1} D + L + ω^{-1} D + R
       = 2 ω^{-1} D − D + ( L + D + R )
       = ( 2/ω − 1 ) D + A.   (5.30)
Since A is positive definite, for x ∈ Cn \ {0}, x* A x is real and positive. The real part of 1/(1 − λ) satisfies

Re( 1/(1 − λ) ) = Re( (x* B x)/(x* A x) ) = Re( x* B x )/(x* A x)
              = (1/2) (1/(x* A x)) ( x* B x + conj(x* B x) )
              = (1/2) (1/(x* A x)) ( x* B x + x* B* x )
              = (1/2) (1/(x* A x)) x* ( B + B* ) x
              = (1/2) (1/(x* A x)) x* [ (2/ω − 1) D + A ] x
              = (1/2) (1/(x* A x)) [ (2/ω − 1) x* D x + x* A x ]
              = (1/2) [ (2/ω − 1) (x* D x)/(x* A x) + 1 ].   (5.31)

In the second line of (5.31) we have used that x* B x is a scalar, so that its complex conjugate satisfies

conj( x* B x ) = ( x* B x )* = x* B* x,
and in the third line of (5.31) we have used the representation (5.30) of B + B∗. Because
ω ∈ (0, 2), the expression 2/ω − 1 is positive. Since A is positive definite, we have
0 < e_j^* A e_j = a_{j,j},   j = 1, 2, . . . , n,
and therefore the diagonal matrix D has only positive entries and hence D is also positive
definite. Therefore (x∗ D x)/(x∗ Ax) is positive for all x ∈ Cn \ {0}, because A and D are
positive definite. Thus
Re( 1/(1 − λ) ) = (1/2) [ (2/ω − 1) (x* D x)/(x* A x) + 1 ] > 1/2,

since (2/ω − 1) (x* D x)/(x* A x) > 0.
If we write λ = u + i v, then we can conclude that

1/2 < Re( 1/(1 − λ) ) = Re( 1/(1 − u − i v) ) = Re( ((1 − u) + i v)/((1 − u)^2 + v^2) ) = (1 − u)/((1 − u)^2 + v^2),
and hence

1/2 < (1 − u)/((1 − u)^2 + v^2)  ⇔  (1 − u)^2 + v^2 < 2 (1 − u)  ⇔  u^2 + v^2 < 1,

that is, |λ|^2 = u^2 + v^2 < 1. Since the eigenvalue λ of the iteration matrix CGS(ω) was arbitrary, we have shown that for ω ∈ (0, 2)

ρ( CGS(ω) ) < 1.
Thus from Theorem 5.4, the Gauss-Seidel relaxation or SOR method converges. 2
For real linear systems Ax = b, the Gauss-Seidel relaxation or SOR method can be imple-
mented with the following MATLAB code:
function [x] = SOR_method(A,b,z,w,J)
%
% algorithm executes the SOR method for solving A*x = b
% with relaxation parameter w;
% x^{(j)} is the approximation after the jth iteration step
%
% input: A = real invertible n by n matrix with non-zero diagonal elements
% b = real n by 1 vector, right-hand side of linear system
% z = real n by 1 vector, starting vector x^{(0)} for SOR iteration
% w = relaxation parameter in (0,2)
% J = number of approximations
%
n = size(A,1);
x = z;
for j = 1:J
for i = 1:n
y = x(i) ...
    + w * ( b(i) - A(i,1:i-1) * x(1:i-1) - A(i,i:n) * x(i:n) ) / A(i,i);
x(i) = y;
end
end
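
For example (a usage sketch), calling the function with w = 1 reproduces the classical Gauss-Seidel iteration from Example 5.13:

A = [2, 1, 0; 1, 2, 0; 0, 0, 1];
b = [-1; 1; 2];
x3 = SOR_method(A, b, zeros(3,1), 1, 3)
% w = 1 gives the classical Gauss-Seidel method:
% returns approximately (-0.96875, 0.98438, 2)^T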
Example 5.19 (Jacobi, Gauss-Seidel, and SOR method)
Suppose we wish to solve the real linear system Ax = b, where
A = [ 1 0 0.25 0.25 ; 0 1 0 0.25 ; 0.25 0 1 0 ; 0.25 0.25 0 1 ],   b = ( 0.25, 0.5, 0.75, 1.0 )^T.
Note that A is symmetric and positive definite and diagonally dominant. Thus the classical
Jacobi method, the classical Gauss-Seidel method, and the Gauss-Seidel relaxation can be
applied to solve Ax = b. The true solution x rounded to 4 decimal places is given by

x ≈ ( −0.1962, 0.2536, 0.7990, 0.9856 )^T.
Using the Jacobi method, with starting vector x(0) = 0 = (0, 0, 0, 0)^T, we have:

approximation        x(0)      x(1)_J     x(2)_J     x(3)_J
                     0         0.25      −0.1875    −0.125
                     0         0.5        0.25       0.2969
                     0         0.75       0.6875     0.7969
                     0         1.0        0.8125     0.9844
‖A x(j)_J − b‖_2     1.3693    0.5413     0.2182     0.0882
‖x(j)_J − x‖_2       1.3087    0.5122     0.2062     0.0833
Using the Gauss-Seidel method, with starting vector x(0) = 0 = (0, 0, 0, 0)^T, we have:

approximation        x(0)      x(1)_GS    x(2)_GS    x(3)_GS
                     0         0.25      −0.125     −0.1846
                     0         0.5        0.2969     0.2607
                     0         0.6875     0.7812     0.7961
                     0         0.8125     0.9570     0.9810
‖A x(j)_GS − b‖_2    1.3693    0.4265     0.0697     0.0114
‖x(j)_GS − x‖_2      1.3087    0.5497     0.0899     0.0147
Using the SOR method with ω = 1.05, with starting vector x(0) = 0 = (0, 0, 0, 0)^T, we have:

approximation        x(0)      x(1)_SOR   x(2)_SOR   x(3)_SOR
                     0         0.2625    −0.1606    −0.1943
                     0         0.5250     0.2774     0.2546
                     0         0.7186     0.7937     0.7988
                     0         0.8433     0.9772     0.9853
‖A x(j)_SOR − b‖_2   1.3693    0.4699     0.0394     0.0020
‖x(j)_SOR − x‖_2     1.3087    0.5575     0.0439     0.0021
We see that the SOR method with ω = 1.05 clearly converges fastest in this example. 2
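
The error columns of the three tables can be reproduced with the MATLAB functions given earlier in this chapter; a sketch of such a driver script (assuming jacobi_2, gauss_seidel, and SOR_method are on the path):

A = [1 0 0.25 0.25; 0 1 0 0.25; 0.25 0 1 0; 0.25 0.25 0 1];
b = [0.25; 0.5; 0.75; 1.0];
xs = A \ b;                                      % reference solution
for j = 1:3
    eJ   = norm(jacobi_2(A,b,zeros(4,1),j)        - xs);
    eGS  = norm(gauss_seidel(A,b,zeros(4,1),j)    - xs);
    eSOR = norm(SOR_method(A,b,zeros(4,1),1.05,j) - xs);
    fprintf('j = %d: %.4f  %.4f  %.4f\n', j, eJ, eGS, eSOR);
end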
Chapter 6
The Conjugate Gradient Method
In this chapter we discuss the conjugate gradient method (or CG method) which is an
iterative method for solving linear systems Ax = b with a positive definite symmetric matrix
A ∈ Rn×n. We will see that the conjugate gradient method converges in at most n steps to the
exact solution x of Ax = b, and so this iterative method is in some sense also a direct method.
However, in practice n is usually large and we will not let the conjugate gradient method run through all steps but let it terminate once a certain accuracy is reached.
6.1 The Generic Minimization Algorithm
We now construct an iterative method for solving linear systems Ax = b with a symmetric
and positive definite real n × n matrix A.
As a reminder, A ∈ Rn×n is positive definite if A is symmetric (that is, AT = A) and if
xT Ax > 0 for all x ∈ Rn \ {0}.
Let A ∈ Rn×n be symmetric and positive definite and b ∈ Rn. We wish to solve the linear
system
Ax = b.
Associated with this system we have a function f : Rn → R, defined by

f(x) := (1/2) x^T A x − x^T b.

We will also refer to f as the conjugate gradient functional. Writing f more explicitly as
f(x) = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j} x_i x_j − Σ_{j=1}^{n} x_j b_j,   (6.1)
we see that f is a polynomial of degree 2 in the entries x1, x2, . . . , xn of the vector x. Thus we
can calculate the first order and second order derivatives of f , and we obtain the gradient and
the Hessian (details left as an exercise)
∇f(x) = Ax − b, (6.2)
Hf(x) = A. (6.3)
Since f is a polynomial of degree 2 in the entries x1, x2, . . . , xn, all higher order derivatives
vanish. Thus we can rewrite f as its Taylor polynomial of degree 2 centred at y (using (6.2)
and (6.3))
f(x) = f(y) + (x − y)^T ∇f(y) + (1/2) (x − y)^T Hf(y) (x − y)
     = f(y) + (x − y)^T ( A y − b ) + (1/2) (x − y)^T A (x − y).   (6.4)
If x̂ minimizes f, then we know from calculus that

0 = ∇f(x̂) = A x̂ − b,

that is, any minimizer is a solution of the linear system Ax = b.

On the other hand, assume that x̂ satisfies A x̂ = b. Then, setting y = x̂ in (6.4) and using A x̂ = b, yields

f(x) = f(x̂) + (x − x̂)^T ( A x̂ − b ) + (1/2) (x − x̂)^T A (x − x̂)
     = f(x̂) + (1/2) (x − x̂)^T A (x − x̂).   (6.5)

Because A is positive definite,

(x − x̂)^T A (x − x̂) > 0   for all x ≠ x̂,

and we see from (6.5) that

f(x) > f(x̂)   for all x ∈ Rn \ {x̂},

and thus x̂ minimizes f.
Since A is positive definite, A is, in particular, non-singular. Therefore Ax = b has a unique so-
lution x and this unique solution is the unique minimizer of the function f . We summarize
this in the theorem below.
Theorem 6.1 (minimizer of CG functional)
Let A ∈ Rn×n be symmetric and positive definite. The uniquely determined solution of the
system Ax = b is the unique minimizer of the functional
f(x) := (1/2) x^T A x − x^T b.
Exercise 82 Let A = (ai,j) in Rn×n be symmetric and positive definite, and let b ∈ Rn. By
writing the functional
f(x) = (1/2) x^T A x − x^T b
explicitly in the form (6.1) and differentiating, show that ∇f(x) = Ax − b. Similarly show
that Hf(x) = A.
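
The formula (6.2) can also be checked numerically before doing the exercise: the following sketch compares A x − b with a central finite-difference approximation of ∇f at a random point (all names are chosen here for illustration):

n = 4;
M = randn(n);  A = M'*M + eye(n);     % a random symmetric positive definite matrix
b = randn(n,1);  x = randn(n,1);
f = @(y) 0.5*y'*A*y - y'*b;           % the CG functional
g = zeros(n,1);  h = 1e-6;
for i = 1:n
    e = zeros(n,1);  e(i) = 1;
    g(i) = ( f(x + h*e) - f(x - h*e) ) / (2*h);   % central difference
end
norm(g - (A*x - b))                   % should be of the order of h^2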
We will now try to find the minimum of f by a simple iterative procedure {xj} of the form
xj+1 = xj + αj pj .
Here, xj is our current position. From this position we want to move into the direction of pj.
The step-length of our move is determined by αj . Of course, it will be our goal to select the
direction pj and the step-length αj in such a way that
f(xj+1) ≤ f(xj).
(Note in this section we will use the notation xj rather than x(j), since it is shorter and there
can be no misunderstanding, since we never need the components of xj .)
Given a new direction pj, the best possible step-length in that direction can be determined by looking at the minimum of f along the line xj + α pj. Hence, if we set

ϕ(α) = f( xj + α pj ) = (1/2) ( xj + α pj )^T A ( xj + α pj ) − ( xj + α pj )^T b,

the necessary condition for a minimum yields, from the chain rule and (6.2),

0 = ϕ′(α) = pj^T ∇f( xj + α pj ) = pj^T [ A ( xj + α pj ) − b ]
  = pj^T A xj + α pj^T A pj − pj^T b = pj^T ( A xj − b ) + α pj^T A pj.   (6.6)
Since A is positive definite, we know that

ϕ″(α) = pj^T A pj > 0,

and thus at any α with ϕ′(α) = 0 we have a minimum. Solving (6.6) for α gives the new step-length αj = α as

αj = pj^T ( b − A xj ) / ( pj^T A pj ) = pj^T rj / ( pj^T A pj ),

where we have defined the residual of the jth step as

rj := b − A xj.
Thus we have the following generic algorithm.
Algorithm 3 Generic Minimisation
1: Choose x1 and p1.
2: Let r1 = b − A x1.
3: for j = 1, 2, . . . do
4:    αj = pj^T rj / ( pj^T A pj )
5:    xj+1 = xj + αj pj
6:    rj+1 = b − A xj+1
7:    Choose next direction pj+1.
8: end for
Note that we pick αj in such a way that (see the last expression in the first line of (6.6) and use xj + αj pj = xj+1)

0 = pj^T [ A ( xj + αj pj ) − b ] = pj^T ( A xj+1 − b ) = −pj^T rj+1,   (6.7)
showing that pj and rj+1 are orthogonal.
Obviously, we still have to determine how to choose the search directions. One possible way is
to pick the direction of steepest descent: It can be shown that this direction is given by
the negative gradient of the target function, that is, by

pj = −∇f(xj) = −( A xj − b ) = b − A xj = rj.
Thus in the jth step, the direction pj is chosen to be the residual rj = b − Axj from the
previous approximation xj . This gives the following algorithm.
Algorithm 4 Steepest Descent
1: Choose x1
2: Set p1 = b − A x1
3: for j = 1, 2, . . . do
4:    αj = pj^T pj / ( pj^T A pj )
5:    xj+1 = xj + αj pj
6:    pj+1 = b − A xj+1
7: end for
The relation (6.7) now becomes
pj^T pj+1 = 0,   (6.8)
which means that two successive search directions are orthogonal.
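
Algorithm 4 translates directly into MATLAB. The following is a minimal sketch (the function name, the tolerance tol, and the iteration cap maxit are chosen here for illustration):

function [x,j] = steepest_descent(A,b,z,tol,maxit)
% method of steepest descent for A*x = b with A symmetric positive definite
x = z;
p = b - A * x;                     % search direction p_j = residual r_j
j = 0;
while norm(p) > tol && j < maxit
    q = A * p;
    alpha = (p' * p) / (p' * q);   % optimal step-length along p
    x = x + alpha * p;
    p = b - A * x;                 % next direction = new residual
    j = j + 1;
end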
Unfortunately, the method of steepest descent often converges rather slowly as illustrated in
the example below.
Example 6.2 (method of steepest descent)
Let us choose the following matrix, right-hand side, and initial position:
A = [ 1 0 ; 0 9 ],   b = ( 0, 0 )^T,   x1 = ( 9, 1 )^T.
Clearly the solution of Ax = b is x = 0 = (0, 0)^T. It can be shown that the method of steepest descent produces the iteration

xj = (0.8)^{j−1} ( 9, (−1)^{j−1} )^T,   (6.9)

which is depicted in Figure 6.1. The details of the computation are left as an exercise.

Clearly (6.9) converges rather slowly and does not terminate after a finite number of steps with the correct solution. This linear system with a 2 × 2 diagonal matrix could have easily been solved with any direct method in a much more economical way.
The problem with this iteration becomes more apparent if we look at the level curves of the corresponding quadratic function

f(x) = (1/2) x^T A x − x^T b = (1/2) ( x_1^2 + 9 x_2^2 ),
which are oblong ellipses, as illustrated in Figure 6.2. Since the new search direction has to
be orthogonal to the previous one, we are always orthogonal to the level curves, and hence the
new direction does not point towards the centre of the ellipses, which is our solution vector. 2
Figure 6.1: Approximations computed by the method of steepest descent in Example 6.2.
Figure 6.2: Illustration of Example 6.2: the level curves of f are oblong ellipses, and the search direction pj at xj is orthogonal to them, so the step to xj+1 does not point towards the centre.
Exercise 83 The method of steepest descent to compute the minimiser of the functional

f(x) = (1/2) x^T A x − x^T b

is given by Algorithm 4. Consider the system Ax = b with

A = [ 1 0 ; 0 9 ]   and   b = ( 0, 0 )^T.
(i) Starting with x1 = (9, 1)T , compute x2 and x3. State all relevant computations.
(ii) Prove that the jth approximation xj is given by

xj = (0.8)^{j−1} ( 9, (−1)^{j−1} )^T.
Exercise 84 State the definition of an inner product for a real linear space.
6.2 Minimization with A-Conjugate Search Directions
In the previous example we have seen that orthogonal search directions can lead to a rather
slow convergence. Therefore we will now modify the algorithm such that it uses so-called
A-conjugate search directions.
Definition 6.3 (A-conjugate vectors)
Let A ∈ Rn×n be a symmetric and positive definite matrix. The vectors p1, p2, . . . , pm ∈ Rn \ {0} are called A-conjugate if

pj^T A pk = 0   for all 1 ≤ j, k ≤ m with j ≠ k.
To understand the meaning of A-conjugate vectors better, we discuss some of their properties.
Lemma 6.4 (properties of A-conjugate vectors)
Let A ∈ Rn×n be a symmetric and positive definite matrix, and let the vectors p1, p2, . . . , pm ∈ Rn \ {0} be A-conjugate. Then the following holds true:

(i) Since A is positive definite, we have pj^T A pj > 0 for j = 1, 2, . . . , m.

(ii) The A-conjugate vectors p1, p2, . . . , pm are linearly independent.

(iii) There can be at most n A-conjugate vectors, that is, m ≤ n.

(iv) We can introduce an A-inner product via

〈x, y〉_A := x^T A y.

With respect to this inner product, the A-conjugate vectors p1, p2, . . . , pm are orthogonal.
Proof of Lemma 6.4. Proof of (i): The statement (i) is clear from the positive definiteness
of the matrix A.
Proof of (ii): To see that the vectors p1, p2, . . . , pm are linearly independent, consider

c1 p1 + c2 p2 + . . . + ck pk + . . . + cm pm = 0,   (6.10)

and multiply from the left with pk^T A. Then

c1 pk^T A p1 + c2 pk^T A p2 + . . . + ck pk^T A pk + . . . + cm pk^T A pm = 0.

Since pk^T A pj = 0 for k ≠ j, we find, using pk^T A pk > 0,

ck pk^T A pk = 0  ⇔  ck = 0.

Since k was arbitrary, we conclude that in (6.10) all coefficients c1, c2, . . . , cm have to be zero. Thus we have verified that the vectors p1, p2, . . . , pm are linearly independent. This proves (ii).
Proof of (iii): Since we know from (ii) that the vectors p1, p2, . . . , pm ∈ Rn are linearly independent, we must have m ≤ n.

Proof of (iv): To see that 〈x, y〉_A = x^T A y is an inner product, we note that 〈x, x〉_A = x^T A x > 0 for all x ∈ Rn \ {0} and 〈0, 0〉_A = 0, since A is positive definite. The other properties of an inner product are easily checked and are left as an exercise. 2
Exercise 85 Let A ∈ Rn×n be symmetric and positive definite. Show that 〈x, y〉_A := x^T A y defines an inner product for Rn. Conclude that ‖x‖_A = √(〈x, x〉_A) = √(x^T A x) defines a vector norm for Rn.
For A-conjugate search directions it turns out that the generic minimization algorithm (see
Algorithm 3) terminates after at most n steps and gives the exact solution. So the iterative
method is in some sense a direct method. However, for large n, say n = 10^6, one would usually
not execute all n steps, but rather stop the algorithm beforehand, once a sufficient accuracy is
reached.
Theorem 6.5 (generic minimization with A-conjugate search directions)
Let A ∈ Rn×n be symmetric and positive definite, and let b ∈ Rn. Let x1 ∈ Rn be given,
and assume that the search directions p1,p2, . . . ,pn ∈ Rn \ {0} are A-conjugate. Then the
generic minimization Algorithm 3 terminates after at most n steps with the solution x of
Ax = b.
Proof of Theorem 6.5. Since the n search directions are A-conjugate and thus linearly
independent, they form a basis of Rn. Therefore, we can represent the vector x − x1 in this
basis, which means that we can find β1, β2, . . . , βn with
x − x1 = Σ_{j=1}^{n} βj pj  ⇔  x = x1 + Σ_{j=1}^{n} βj pj.   (6.11)

Furthermore, from the generic algorithm, we have xj+1 = xj + αj pj, j = 1, 2, . . . , n. Thus we can conclude that the approximations xk have the representation

xk = x1 + Σ_{j=1}^{k−1} αj pj,   1 ≤ k ≤ n + 1.   (6.12)
Comparing (6.11) and (6.12) with k = n + 1, we see that it is sufficient to show that αi = βi
for all 1 ≤ i ≤ n. Then we know that xn+1 = x.
From Algorithm 3, we have

αi = pi^T ri / ( pi^T A pi ),   1 ≤ i ≤ n.   (6.13)
To compute the coefficients βi, we first observe that, using (6.11),

b − A x1 = A x − A x1 = A ( x − x1 ) = A ( Σ_{j=1}^{n} βj pj ) = Σ_{j=1}^{n} βj A pj.   (6.14)
Hence, multiplying (6.14) from the left by pi^T,

pi^T ( b − A x1 ) = pi^T ( Σ_{j=1}^{n} βj A pj ) = Σ_{j=1}^{n} βj pi^T A pj = βi pi^T A pi,

because pi^T A pj = 0 for j ≠ i (as the directions p1, p2, . . . , pn are A-conjugate). This gives the explicit representation

βi = pi^T ( b − A x1 ) / ( pi^T A pi ) = pi^T r1 / ( pi^T A pi ),   1 ≤ i ≤ n,   (6.15)
which differs, at first sight, from αi in the numerator (see (6.13)). Fortunately, from (6.12) we can conclude that

ri = b − A xi = b − A ( x1 + Σ_{j=1}^{i−1} αj pj ) = ( b − A x1 ) − Σ_{j=1}^{i−1} αj A pj = r1 − Σ_{j=1}^{i−1} αj A pj,

and thus, employing again the fact that pi^T A pj = 0 for j ≠ i (as the directions p1, p2, . . . , pn are A-conjugate),

pi^T ri = pi^T ( r1 − Σ_{j=1}^{i−1} αj A pj ) = pi^T r1 − Σ_{j=1}^{i−1} αj pi^T A pj = pi^T r1,   1 ≤ i ≤ n.   (6.16)

Thus, we find from (6.13), (6.15), and (6.16),

βi = pi^T r1 / ( pi^T A pi ) = pi^T ri / ( pi^T A pi ) = αi,   1 ≤ i ≤ n.
Hence, from (6.11) and (6.12) with k = n + 1 and αi = βi, 1 ≤ i ≤ n, we have xn+1 = x. Thus
after at most n steps of the algorithm, we arrive at the true solution x of Ax = b. 2
It remains to answer the question of how the search directions are actually determined. Obviously, we do not want to determine the search directions a priori, but would rather like to determine them during the iteration process, hoping for our iteration to terminate in significantly fewer than n steps.
Assume we have already created search directions p1, p2, . . . , pj that are A-conjugate. In the case of A xj+1 = b we can terminate the algorithm and hence do not need to compute another direction. Otherwise, we could try to compute the next direction by setting

pj+1 = b − A xj+1 + Σ_{k=1}^{j} βj,k pk = rj+1 + Σ_{k=1}^{j} βj,k pk,   (6.17)

that is, the new search direction is the residual rj+1 plus a linear combination of the previous search directions. The coefficients βj,k are determined from the conditions that pj+1 is A-conjugate to the search directions p1, p2, . . . , pj. Thus we obtain the conditions

0 = pj+1^T A pi = rj+1^T A pi + Σ_{k=1}^{j} βj,k pk^T A pi = rj+1^T A pi + βj,i pi^T A pi,   1 ≤ i ≤ j.

Solving for βj,i yields

βj,i = −rj+1^T A pi / ( pi^T A pi ),   1 ≤ i ≤ j.   (6.18)
Surprisingly, βj,1, . . . , βj,j−1 vanish automatically, as we will see later on (see Remark 6.10 in Section 6.3), so that only the coefficient

βj+1 := βj,j = −rj+1^T A pj / ( pj^T A pj )

remains and the new direction is given by

pj+1 = rj+1 + βj+1 pj.
This choice of the search directions in Algorithm 3 gives the first version of the conjugate
gradient (CG) method, see Algorithm 5 below.
We observe that Algorithm 5 is, apart from the choice of the new search direction pj+1, identical
with Algorithm 3; only the steps 2, 7, and 8 in Algorithm 5 differ from Algorithm 3.
Algorithm 5 CG method
1: Choose x1.
2: Let p1 = r1 = b − A x1 and j = 1.
3: while rj ≠ 0 do
4:    αj = pj^T rj / ( pj^T A pj )
5:    xj+1 = xj + αj pj
6:    rj+1 = b − A xj+1
7:    βj+1 = −rj+1^T A pj / ( pj^T A pj )
8:    pj+1 = rj+1 + βj+1 pj
9:    j = j + 1
10: end while
It is now time to show that the chosen search directions are indeed A-conjugate.
Theorem 6.6 (properties of pj and rj)
Let A ∈ Rn×n be symmetric and positive definite, let b ∈ Rn, and let x1 ∈ Rn be an arbitrary
starting point. The vectors pj and rj introduced in the CG method (see Algorithm 5 above)
satisfy the following equations:
(1) pi^T A pj = 0 for 1 ≤ j ≤ i − 1,

(2) ri^T rj = 0 for 1 ≤ j ≤ i − 1,

(3) pj^T rj = rj^T rj for 1 ≤ j ≤ i.
The CG-method terminates at the latest with j = n + 1, and if it terminates with j steps
then rj+1 = b − Axj+1 = 0 and pj+1 = 0, that is, xj+1 is the solution of Ax = b. Hence,
the CG method needs at most n steps to compute the solution x of Ax = b.
Proof of Theorem 6.6. We will prove (1) to (3) by induction over i.

Initial step: For i = 1 there is nothing to show for (1) and (2); (3) follows because of p1 = r1. To get an initial step for all three equations we have to consider also i = 2. We find

p2^T A p1 = ( r2 + β2 p1 )^T A p1 = r2^T A p1 + β2 p1^T A p1 = r2^T A p1 − ( r2^T A p1 / ( p1^T A p1 ) ) p1^T A p1 = 0,

where we have used the definition of β2. This verifies (1) for i = 2. To verify (2), we use p1 = r1 and the definition of α1:

r2^T r1 = ( b − A x2 )^T r1 = ( b − A ( x1 + α1 p1 ) )^T r1 = ( r1 − α1 A p1 )^T r1
        = ( r1 − α1 A p1 )^T p1 = r1^T p1 − α1 p1^T A p1 = r1^T p1 − ( p1^T r1 / ( p1^T A p1 ) ) p1^T A p1 = 0,

which verifies (2) for i = 2. Finally

p2^T r2 = ( r2 + β2 p1 )^T r2 = r2^T r2 + β2 p1^T r2,

and we have proved (3) for i = 2 if we can show that p1^T r2 = 0. To verify this we proceed as follows:

p1^T r2 = p1^T ( b − A x2 ) = p1^T ( b − A ( x1 + α1 p1 ) ) = p1^T ( r1 − α1 A p1 )
        = p1^T r1 − α1 p1^T A p1 = p1^T r1 − ( p1^T r1 / ( p1^T A p1 ) ) p1^T A p1 = 0,

where we have used the definition of α1 in the last step. Thus we have also verified (3) for i = 2, giving a full initial step in which all three statements have been verified for i = 2.
Induction step: Let us now assume everything is satisfied for an arbitrary i with 1 ≤ i ≤ n. We first show that then (2) follows for i + 1. By definition, we have

ri+1 = b − A xi+1 = b − A ( xi + αi pi ) = b − A xi − αi A pi = ri − αi A pi.   (6.19)

Thus, by the induction hypotheses (1) and (2) and rj = pj − βj pj−1, we have for j ≤ i − 1 immediately

ri+1^T rj = ( ri − αi A pi )^T rj = ri^T rj − αi pi^T A rj = ri^T rj − αi pi^T A ( pj − βj pj−1 )
          = ri^T rj − αi pi^T A pj + αi βj pi^T A pj−1 = 0.
For j = i we can use (6.19), the definition of αi, the identity

pi^T A ri = pi^T A ( pi − βi pi−1 ) = pi^T A pi − βi pi^T A pi−1 = pi^T A pi

(where we have used the induction hypothesis (1) in the last step), and the induction hypothesis (3) to conclude that

ri+1^T ri = ri^T ri − αi pi^T A ri = ri^T ri − ( pi^T ri / ( pi^T A pi ) ) pi^T A ri = ri^T ri − pi^T ri = 0,

which finishes the induction step for (2).
We proceed to show the induction step for (1). First of all, we have

pi+1^T A pj = ( ri+1 + βi+1 pi )^T A pj = ri+1^T A pj + βi+1 pi^T A pj.

In the case of j = i this leads to (using the definition of βi+1)

pi+1^T A pi = ri+1^T A pi + βi+1 pi^T A pi = ri+1^T A pi − ( ri+1^T A pi / ( pi^T A pi ) ) pi^T A pi = 0.

In the case of j ≤ i − 1 we first observe that αj = 0 would mean pj^T rj = rj^T rj = 0 (from the definition of αj and (3)), which is equivalent to rj = 0; thus if αj = 0 the iteration would have stopped before. Hence, we may assume that αj ≠ 0, such that we can use (6.19) to gain the representation

A pj = (1/αj) ( rj − rj+1 ).

Together with induction hypothesis (1) and statement (2) for i + 1 (which we have proved already) this leads to

pi+1^T A pj = ( ri+1 + βi+1 pi )^T A pj = ri+1^T A pj + βi+1 pi^T A pj
            = ri+1^T A pj = (1/αj) ri+1^T ( rj − rj+1 ) = (1/αj) ( ri+1^T rj − ri+1^T rj+1 ) = 0,

for 1 ≤ j ≤ i − 1. This proves (1) for i + 1.
Finally, for (3) we use the fact that f(x) = (1/2) x^T A x − b^T x attains its minimum in the direction of pi at xi+1. This means that the function ϕ(t) = f( xi+1 + t pi ) satisfies ϕ′(0) = 0, which, using the chain rule, is equivalent to

0 = ϕ′(0) = pi^T ∇f(xi+1) = pi^T ( A xi+1 − b ) = −pi^T ri+1.

From this and (2) (which we have proved already), we can conclude that

pi+1^T ri+1 = ( ri+1 + βi+1 pi )^T ri+1 = ri+1^T ri+1 + βi+1 pi^T ri+1 = ri+1^T ri+1,

which finalizes our induction step by proving (3) for i + 1.
For the computation of the next iteration step, we need that pi+1 ≠ 0. If this is not the case the method terminates. Indeed, if pi+1 = 0, then 0 = pi+1^T ri+1 = ri+1^T ri+1 = ‖ri+1‖2^2, and hence 0 = ri+1 = b − A xi+1, that is, we have produced the solution in the ith step. If the method does not terminate early then, after n steps, we have created n conjugate directions and Theorem 6.5 shows that xn+1 is the true solution x. 2
On the implementation side, it is possible to improve the first version (Algorithm 5 above) of the CG method. First of all, we can reduce the number of matrix-vector multiplications as follows: In the definition of rj+1 we can use (from (6.19))

rj+1 = rj − αj A pj   (6.20)

instead of rj+1 = b − A xj+1, thus eliminating one matrix-vector multiplication, since A pj was already computed previously (for the step-length αj).
Moreover, from (6.20) it follows that A pj = αj^{-1} ( rj − rj+1 ), such that

βj+1 = −rj+1^T A pj / ( pj^T A pj ) = −(1/αj) ( rj+1^T rj − rj+1^T rj+1 ) / ( pj^T A pj )
     = ( pj^T A pj / ( pj^T rj ) ) · rj+1^T rj+1 / ( pj^T A pj ) = rj+1^T rj+1 / ( rj^T rj ),

where we have used (2) in Theorem 6.6, the definition of αj, and (3) in Theorem 6.6. This leads to a version of the CG method that contains, in each iteration step, only one matrix-vector multiplication, namely A pj, and three inner products and three scalar-vector multiplications. This version of the CG method is stated as Algorithm 6 below.
Algorithm 6 CG method – improved more economical code
1: Choose x1.
2: Set p1 = r1 = b − A x1 and j = 1.
3: while ‖rj‖2 > ε do
4:    αj = ‖rj‖2^2 / ( pj^T A pj )    % store A pj
5:    xj+1 = xj + αj pj
6:    rj+1 = rj − αj A pj
7:    βj+1 = ‖rj+1‖2^2 / ‖rj‖2^2    % store ‖rj+1‖2^2
8:    pj+1 = rj+1 + βj+1 pj
9:    j = j + 1
10: end while
Example 6.7 (conjugate gradient method)
Consider the linear system Ax = b with

A = [ 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ]   and   b = ( 1, 1, −1 )^T.
Solve the linear system by performing the CG method by hand with the starting vector x1 =
0 = (0, 0, 0)T .
Solution: We use Algorithm 6.
1st step: We have

x1 = ( 0, 0, 0 )^T   and   p1 = r1 = b − A x1 = b = ( 1, 1, −1 )^T.
Since ‖r1‖2^2 = r1^T r1 = 3 and

A p1 = [ 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ] ( 1, 1, −1 )^T = ( 1, 3, −1 )^T,   p1^T A p1 = ( 1, 1, −1 ) ( 1, 3, −1 )^T = 5,

we find

α1 = ‖r1‖2^2 / ( p1^T A p1 ) = 3/5.
Thus

x2 = x1 + α1 p1 = ( 0, 0, 0 )^T + (3/5) ( 1, 1, −1 )^T = (3/5) ( 1, 1, −1 )^T

and

r2 = r1 − α1 A p1 = ( 1, 1, −1 )^T − (3/5) ( 1, 3, −1 )^T = (1/5) ( 2, −4, −2 )^T.
We have ‖r2‖2^2 = 24/25, and thus

β2 = ‖r2‖2^2 / ‖r1‖2^2 = (24/25)/3 = 8/25,

and the new search direction is given by

p2 = r2 + β2 p1 = (1/5) ( 2, −4, −2 )^T + (8/25) ( 1, 1, −1 )^T = (1/25) ( 18, −12, −18 )^T = (6/25) ( 3, −2, −3 )^T.
2nd step: Since ‖r2‖2 ≠ 0, we perform a second step of the CG method. We compute

A p2 = (6/25) [ 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ] ( 3, −2, −3 )^T = (6/25) ( 3, −6, −3 )^T = (18/25) ( 1, −2, −1 )^T.
Since from the first iteration step ‖r2‖2^2 = 24/25, and

p2^T A p2 = (6/25)(18/25) ( 3, −2, −3 ) ( 1, −2, −1 )^T = (108/625) · 10 = 216/125,

we find

α2 = ‖r2‖2^2 / ( p2^T A p2 ) = (24/25)/(216/125) = (24/25) · (125/216) = 5/9.
Thus the new approximation is given by

x3 = x2 + α2 p2 = (3/5) ( 1, 1, −1 )^T + (5/9)(6/25) ( 3, −2, −3 )^T = (1/15) ( 9, 9, −9 )^T + (1/15) ( 6, −4, −6 )^T = ( 1, 1/3, −1 )^T.
The new residual is given by

r3 = r2 − α2 A p2 = (1/5) ( 2, −4, −2 )^T − (5/9)(18/25) ( 1, −2, −1 )^T = (1/5) ( 2, −4, −2 )^T − (2/5) ( 1, −2, −1 )^T = ( 0, 0, 0 )^T.
Thus the CG algorithm terminates after two steps and provides the correct solution x = x3 =
(1, 1/3,−1)T . 2
The CG algorithm, as given in Algorithm 6, can be implemented with the following MATLAB
code:
function [x,J] = cg_method(A,b,z)
%
% executes the conjugate gradient method (CG method) for solving A*x=b
%
% input: A = symmetric and positive definite real n by n matrix
% b = n by 1 vector, right-hand side of the linear system
% z = start n by 1 vector for the CG iterations
%
% output: x = the approximate solution for the CG-method
% J = number of iterations
%
x = z;
r = b - A * x;
rnorm = (norm(r))^2;
p = r;
J = 0;                         % iteration counter
while norm(r) > 10^(-6)
    q = A * p;                 % the only matrix-vector product per step
    alpha = rnorm / (p' * q);
    x = x + alpha * p;
    r = r - alpha * q;
    rnorm_new = (norm(r))^2;
    beta = rnorm_new / rnorm;
    rnorm = rnorm_new;
    p = r + beta * p;
    J = J + 1;
end
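
For instance (a usage sketch, assuming the function above is on the path), the hand computation of Example 6.7 is reproduced by:

A = [3/2, 0, 1/2; 0, 3, 0; 1/2, 0, 3/2];
b = [1; 1; -1];
[x, J] = cg_method(A, b, zeros(3,1))
% terminates with x = (1, 1/3, -1)^T after J = 2 iteration steps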
Exercise 86 Consider the following linear system:

Ax = b,   where A = [ 2 −1 0 ; −1 2 −1 ; 0 −1 2 ]   and   b = ( 1, 0, 1 )^T.
(a) Show that A is positive definite.
(b) Apply two iterations of the conjugate gradient method to the problem Ax = b to obtain x3
starting with x1 = 0. Do all the computations by hand and give all relevant scalars and
vectors. Calculate the residual for x3 and comment on your answer.
Exercise 87 Consider the following linear system:

Ax = b,   where A = [ 2 0 1 ; 0 2 0 ; 1 0 2 ]   and   b = ( 1, 1, −1 )^T.
(a) Show that A is positive definite.
(b) Apply two iterations of the conjugate gradient method to the problem Ax = b to obtain x3
starting with x1 = 0. Do all the computations by hand and give all relevant scalars and
vectors. Calculate the residual for x3 and comment on your answer.
6.3 Convergence of the Conjugate Gradient Method
In the following theorems we show some of the properties of the CG algorithm. We will investigate convergence in the A-norm or energy norm, defined by

‖x‖_A = √(〈x, x〉_A) = √(x^T A x).

Since 〈x, y〉_A = x^T A y is an inner product for Rn (see Lemma 6.4 (iv)), ‖ · ‖_A is clearly a norm for Rn.
Definition 6.8 (ith Krylov space)
Let A ∈ Rn×n and p ∈ Rn be given. For i ∈ N we define the ith Krylov space to A and p as

K_i(p, A) = span{ p, A p, A^2 p, . . . , A^{i−1} p }.
In general, we only have dim( K_i(p, A) ) ≤ i, since p could, for example, be from the null space of A, that is, A p = 0, and then dim( K_i(p, A) ) ≤ 1 for any i ∈ N. Or if p is an eigenvector of A with eigenvalue λ, then A^k p = λ^k p for any k ∈ N0 and hence K_i(p, A) = span{p} for all i ∈ N. Fortunately, in our situation things are much better.
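
For small examples the Krylov spaces can be inspected explicitly. The following sketch (function name hypothetical) builds a matrix whose columns span K_i(p, A) and reads off the dimension as its numerical rank:

function [K,d] = krylov_basis(A,p,i)
% columns of K span K_i(p,A) = span{p, A*p, ..., A^{i-1}*p};
% d = dim K_i(p,A), computed as the numerical rank of K
K = zeros(length(p), i);
K(:,1) = p;
for k = 2:i
    K(:,k) = A * K(:,k-1);
end
d = rank(K);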
Lemma 6.9 (Krylov spaces in CG method)
Let A ∈ Rn×n be symmetric and positive definite. Let pi and ri be the vectors created during
the CG iterations, and let m ≤ n be the number of steps of the CG method (that is, rm+1 = 0
but rm ≠ 0). Then

K_i(p1, A) = span{ p1, . . . , pi } = span{ r1, . . . , ri }   (6.21)

for 1 ≤ i ≤ m. In particular, we have dim( K_i(p1, A) ) = i for 1 ≤ i ≤ m.
Proof of Lemma 6.9. The proof is given by induction on i.
Initial step: For i = 1 the statements follow because of p1 = r1 ≠ 0.
Induction step: For general i = j + 1 the claim follows by employing the relation

pj+1 = rj+1 + βj+1 pj = ( rj − αj A pj ) + βj+1 pj = ( rj + βj+1 pj ) − αj A pj,   (6.22)

which is a consequence of (6.19), and from the induction assumption. We will now show this in detail.

Assume that the statement holds for all 1 ≤ i ≤ j < m. Then we have to show the statement for i = j + 1. From (6.21) for i = j, the first term in the last representation in (6.22) is in K_j(p1, A), that is,

rj + βj+1 pj ∈ K_j(p1, A) ⊂ K_{j+1}(p1, A).
From (6.21) for i = j, the vector pj is in Kj(p1, A) and can be written as
pj = ∑_{k=0}^{j−1} γk A^k p1,
with some coefficients γ0, γ1, . . . , γj−1. Thus the vector Apj in the second term in the last
representation in (6.22) can be written as
Apj = A ( ∑_{k=0}^{j−1} γk A^k p1 ) = ∑_{k=0}^{j−1} γk A^{k+1} p1 = ∑_{ℓ=1}^{j} γ_{ℓ−1} A^ℓ p1,
and thus Apj is in Kj+1(p1, A). Thus we see from (6.22) that pj+1 is in Kj+1(p1, A), and
therefore, using (6.21) for i = j,
span{p1,p2, . . . ,pj+1} ⊂ Kj+1(p1, A). (6.23)
By the definition of pj+1,
pj+1 = rj+1 + βj+1 pj ⇔ rj+1 = pj+1 − βj+1 pj, (6.24)
we see, from (6.24) and the assumption that (6.21) holds for i = j, that
span{p1,p2, . . . ,pj+1} = span{r1, r2, . . . , rj+1}. (6.25)
Finally, by the assumption that (6.21) holds for i = j, we have that
A^j p1 = A (A^{j−1} p1) = A ( ∑_{k=1}^{j} μk pk ) = A ( ∑_{k=1}^{j−1} μk pk ) + μj Apj, (6.26)
with some coefficients µ1, µ2, . . . , µj. From the assumption that (6.21) holds for i = j, the first
term in the last expression in (6.26) can be written as
A ( ∑_{k=1}^{j−1} μk pk ) = A ( ∑_{ℓ=0}^{j−2} δℓ A^ℓ p1 ) = ∑_{ℓ=0}^{j−2} δℓ A^{ℓ+1} p1 = ∑_{ℓ=1}^{j−1} δ_{ℓ−1} A^ℓ p1,
with some coefficients δ0, δ1, . . . , δj−2. Thus the first term in the last representation in (6.26) is
in Kj(p1, A) = span{r1, r2, . . . , rj}. Since αj ≠ 0 for j ≤ m, from (6.19),
Apj = (1/αj) (rj − rj+1), (6.27)
and we see that the second term in the last representation of (6.26) is in span{r1, r2, . . . , rj+1}. Thus A^j p1 is in span{r1, r2, . . . , rj+1}, and from (6.21) for i = j, we conclude
Kj+1(p1, A) ⊂ span{r1, r2, . . . , rj+1}. (6.28)
Combining (6.23), (6.25), and (6.28), yields that (6.21) holds also for i = j + 1.
It remains to show that dim(Ki(p1, A)) = i. Since the CG method runs through m steps and i ≤ m, we have ri ≠ 0. Thus by the definition of pi, we have pi ≠ 0, and hence the search directions p1,p2, . . . ,pi are non-zero and, from Theorem 6.6, A-conjugate. Thus, from Lemma 6.4, the search directions p1,p2, . . . ,pi are linearly independent, and hence
dim(Ki(p1, A)) = dim(span{p1,p2, . . . ,pi}) = i,
which concludes the proof. 2
Now we can show that in formula (6.18) on page 133 βj,i = 0 for 1 ≤ i ≤ j − 1.
Remark 6.10 (proof that βj,i = 0 for 1 ≤ i ≤ j − 1 in formula (6.18))
The definition of the new search direction (6.17) was
pj+1 = rj+1 + ∑_{k=1}^{j} βj,k pk,
and we determined, from the demand that the search directions p1,p2, . . . ,pj+1 are A-
conjugate,
βj,i = − ( rj+1^T Api ) / ( pi^T Api ), 1 ≤ i ≤ j.
We claimed that βj,i = 0 for 1 ≤ i ≤ j − 1, and now we are in the position to prove this
claim: From (6.19), using that αi ≠ 0 for 1 ≤ i ≤ j − 1 (since the CG method has not yet
terminated),
Api = (1/αi) (ri − ri+1),
and thus for 1 ≤ i ≤ j − 1
rj+1^T Api = (1/αi) rj+1^T (ri − ri+1) = (1/αi) ( rj+1^T ri − rj+1^T ri+1 ) = 0
from Theorem 6.6 (2).
Recall that the true solution x of Ax = b and the approximations xi of the CG method can
be written as (see (6.12))
x = x1 + ∑_{j=1}^{n} αj pj and xi = x1 + ∑_{j=1}^{i−1} αj pj, 1 ≤ i ≤ n + 1. (6.29)
Moreover, by Lemma 6.9, it is possible to represent an arbitrary element x̃ from the affine linear space x1 + Ki−1(p1, A) by
x̃ = x1 + ∑_{j=1}^{i−1} βj pj, (6.30)
and, in particular, xi ∈ x1 + Ki−1(p1, A). Since the directions p1,p2, . . . ,pn are A-conjugate,
we can conclude from (6.29) and (6.30) that
‖x − xi‖²A = ‖ ∑_{j=i}^{n} αj pj ‖²A = ( ∑_{j=i}^{n} αj pj )^T A ( ∑_{k=i}^{n} αk pk )
= ∑_{j=i}^{n} ∑_{k=i}^{n} αj αk pj^T Apk = ∑_{j=i}^{n} |αj|² pj^T Apj
≤ ∑_{j=i}^{n} |αj|² pj^T Apj + ∑_{j=1}^{i−1} |αj − βj|² pj^T Apj
= ‖ ∑_{j=i}^{n} αj pj + ∑_{j=1}^{i−1} (αj − βj) pj ‖²A
= ‖ ∑_{j=1}^{n} αj pj − ∑_{j=1}^{i−1} βj pj ‖²A
= ‖ x − x1 − ∑_{j=1}^{i−1} βj pj ‖²A = ‖x − x̃‖²A,
where x̃ is the vector given by (6.30). Since this holds for an arbitrary x̃ ∈ x1 + Ki−1(p1, A), we have proved that
‖x − xi‖²A ≤ ‖x − x̃‖²A for all x̃ ∈ x1 + Ki−1(p1, A).
We formulate this as a theorem.
Theorem 6.11 (CG method gives best approximations in affine Krylov spaces)
Let A ∈ Rn×n be symmetric and positive definite and b ∈ Rn. Let the CG method applied to solving Ax = b stop after m ≤ n steps. The approximation xi, i ∈ {1, 2, . . . , m}, from the CG method gives the best approximation to the solution x of Ax = b in the affine Krylov space x1 + Ki−1(p1, A) with respect to the A-norm ‖ · ‖A. That is,
‖x − xi‖A ≤ ‖x − x̃‖A for all x̃ ∈ x1 + Ki−1(p1, A). (6.31)
We note that (6.31) implies that
‖x − xi‖A = min_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A = inf_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A.
The same idea shows that the iteration sequence is ‘monotone’, that is, ‖x − xi‖A is strictly monotonically decreasing as i increases.
Corollary 6.12 (error of CG iterations is decreasing)
Let A ∈ Rn×n be symmetric and positive definite and b ∈ Rn. Let the CG method for the solution of Ax = b stop after m ≤ n steps. The sequence {xi} of approximations xi of the CG method is monotone in the sense that
‖x − xi+1‖A < ‖x − xi‖A for all 1 ≤ i ≤ m. (6.32)
Before we give the formal proof of Corollary 6.12, we observe that (6.32) with ≤ instead of < is
entirely natural from Theorem 6.11: Since xi is the best approximation of x in x1+Ki−1(p1, A),
and since
x1 + Ki−1(p1, A) ⊂ x1 + Ki(p1, A),
the best approximation in the larger affine space x1 +Ki(p1, A) cannot have a larger error than
the best approximation in x1 + Ki−1(p1, A).
Proof of Corollary 6.12. Using the same notation as in the proof of Theorem 6.11, we have
from (6.29)
x − xi = ∑_{j=i}^{n} αj pj = ∑_{j=i+1}^{n} αj pj + αi pi = (x − xi+1) + αi pi.
Thus, using ‖y‖²A = y^T Ay and the fact that pi is A-conjugate to x − xi+1 = ∑_{j=i+1}^{n} αj pj,
‖x − xi‖²A = ‖(x − xi+1) + αi pi‖²A
= [ (x − xi+1) + αi pi ]^T A [ (x − xi+1) + αi pi ]
= (x − xi+1)^T A (x − xi+1) + 2 αi (x − xi+1)^T Api + |αi|² pi^T Api
= ‖x − xi+1‖²A + |αi|² pi^T Api ≥ ‖x − xi+1‖²A, (6.33)
where we have used in the last step that pi^T Api ≥ 0 since A is positive definite.
Since the CG method stops after m steps and since i ≤ m, we have ri ≠ 0, and, from Theorem 6.6 (3), pi ≠ 0 and αi ≠ 0, and thus |αi|² pi^T Api > 0. Thus from (6.33) ‖x − xi‖A > ‖x − xi+1‖A. 2
Next, let us rewrite this approximation problem in the original basis of the Krylov space. In
doing so, we denote the set of all polynomials of degree less than or equal to n by πn.
For a polynomial P(t) = ∑_{j=0}^{i−1} γj t^j in πi−1 and a matrix A ∈ Rn×n we write
P(A) := ∑_{j=0}^{i−1} γj A^j and P(A)x = ∑_{j=0}^{i−1} γj A^j x.
If x is an eigenvector of A with eigenvalue λ, that is, Ax = λx, then, from A^j x = λ^j x, clearly
P(A)x = ∑_{j=0}^{i−1} γj λ^j x = ( ∑_{j=0}^{i−1} γj λ^j ) x = P(λ)x. (6.34)
Theorem 6.13 (estimate of best approximation in affine Krylov space)
Let A ∈ Rn×n be symmetric and positive definite, having the eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn > 0. Let b ∈ Rn. Let x denote the solution of Ax = b, and let xi, i = 1, 2, . . . , m + 1, denote the approximations from the CG method, where we assume that the CG method stops after m ≤ n steps. Then
‖x − xi‖A ≤ ( inf_{P ∈ πi−1, P(0)=1} max_{1≤j≤n} |P(λj)| ) ‖x − x1‖A.
Proof of Theorem 6.13. Let us express an arbitrary x̃ ∈ x1 + Ki−1(p1, A), where i ∈ {1, 2, . . . , m + 1}, as
x̃ = x1 + ∑_{j=1}^{i−1} γj A^{j−1} p1 =: x1 + Q(A)p1, (6.35)
introducing the polynomial
Q(t) = ∑_{j=1}^{i−1} γj t^{j−1}
in πi−2, where we set Q = 0 for i = 1. Using p1 = r1 = b − Ax1 = Ax − Ax1 = A (x − x1), we obtain for x̃, given by (6.35),
x − x̃ = x − (x1 + Q(A)p1) = (x − x1) − Q(A) A (x − x1) = (I − Q(A) A) (x − x1).
Therefore, using (I − Q(A) A)T = I − A Q(A) = I − Q(A) A (since AT = A and since A
commutes with Q(A)),
‖x − x̃‖²A = ‖(I − Q(A) A) (x − x1)‖²A
= (x − x1)^T (I − Q(A) A) A (I − Q(A) A) (x − x1)
=: (x − x1)^T P(A) A P(A) (x − x1), (6.36)
with the polynomial P in πi−1 defined by
P(t) = 1 − t Q(t) = 1 − ∑_{j=1}^{i−1} γj t^j.
Clearly P(0) = 1, and thus we have shown, for every x̃ ∈ x1 + Ki−1(p1, A) with i ∈ {1, 2, . . . , m + 1}, that
‖x − x̃‖²A = (x − x1)^T P(A) A P(A) (x − x1),
with some polynomial P ∈ πi−1 satisfying P(0) = 1.
with some polynomial P ∈ πi−1 satisfying P (0) = 1.
On the other hand, if P ∈ πi−1 with P(0) = 1 is given, then we can define the polynomial Q̃(t) = (1 − P(t))/t in πi−2, which leads to an element x̃ from x1 + Ki−1(p1, A), defined by x̃ = x1 + Q̃(A)p1.
For this x̃ the calculation in (6.36) above holds, with
P̃(t) = 1 − t Q̃(t) = 1 − t · (1 − P(t))/t = P(t).
Thus we see that for every P ∈ πi−1 with P(0) = 1, there exists some x̃ in x1 + Ki−1(p1, A) such that
‖x − x̃‖²A = (x − x1)^T P(A) A P(A) (x − x1).
Thus we have shown that
inf_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A = inf_{P ∈ πi−1, P(0)=1} √( (x − x1)^T P(A) A P(A) (x − x1) ). (6.37)
Theorem 6.11 yields therefore, for i = 1, 2, . . . , m + 1,
‖x − xi‖A = min_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A = inf_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A
= inf_{P ∈ πi−1, P(0)=1} √( (x − x1)^T P(A) A P(A) (x − x1) ). (6.38)
Next, let w1,w2, . . . ,wn be an orthonormal basis of Rn consisting of eigenvectors of the positive
definite symmetric matrix A associated to the eigenvalues λ1, λ2, . . . , λn. Then, we can represent
every vector using this basis. In particular, with such a representation
x − x1 = ∑_{j=1}^{n} ρj wj,
with some coefficients ρ1, ρ2, . . . , ρn ∈ R. Thus we can conclude that
P(A) A P(A) (x − x1) = ∑_{j=1}^{n} ρj P(A) A P(A)wj = ∑_{j=1}^{n} ρj [P(λj)]² λj wj,
where we have used (6.34). This leads to (where we use wTj wk = 0 if j 6= k)
(x − x1
)TP (A) A P (A)
(x − x1
)=
n∑
j=1
n∑
k=1
ρk ρj
[P (λj)
]2λj wT
k wj
=n∑
j=1
|ρj |2 |P (λj)|2 λj
148 6.3. Convergence of the Conjugate Gradient Method
≤(
max1≤ℓ≤n
|P (λℓ)|2) n∑
j=1
|ρj |2 λj
=
(max1≤ℓ≤n
|P (λℓ)|2)( n∑
j=1
ρj wj
)T
A
(n∑
k=1
ρk wk
)
=
(max1≤ℓ≤n
|P (λℓ)|2) ∥∥x − x1
∥∥2
A.
Substituting this into (6.38) and using
max_{1≤ℓ≤n} |P(λℓ)|² = ( max_{1≤ℓ≤n} |P(λℓ)| )²
gives the desired inequality. 2
Since clearly
max_{1≤ℓ≤n} |P(λℓ)| ≤ max_{λ ∈ [λn,λ1]} |P(λ)|,
we have, from Theorem 6.13, the following upper bound for the error:
‖x − xi‖A ≤ ( inf_{P ∈ πi−1, P(0)=1} ‖P‖_{L∞([λn,λ1])} ) ‖x − x1‖A, (6.39)
with the supremum norm
‖P‖_{L∞([a,b])} = sup_{x ∈ [a,b]} |P(x)|.
Note that the smallest eigenvalue λn and the largest eigenvalue λ1 can be replaced by estimates λ̃n ≤ λn and λ̃1 ≥ λ1.
As stated in the theorem below (which we will not prove), the minimum of the term in the round parentheses in (6.39) can be computed explicitly. Its solution involves the Chebychev polynomials defined by
Tn(t) := cos( n arccos(t) ), t ∈ [−1, 1].
We observe that clearly
|Tn(t)| ≤ 1 for all t ∈ [−1, 1], (6.40)
since |cos(ϕ)| ≤ 1 for all ϕ ∈ R.
Theorem 6.14 (minimization problem)
Let λ1 > λn > 0. In the problem
inf{ ‖P‖_{L∞([λn,λ1])} : P ∈ πi−1 with P(0) = 1 }
the infimum is attained by
P*(t) = Ti−1( (2t − λ1 − λn)/(λ1 − λn) ) / Ti−1( (λ1 + λn)/(λ1 − λn) ), t ∈ [λn, λ1].
The Chebychev polynomials satisfy the following inequality:
(1/2) ( (1 + √t)/(1 − √t) )^n ≤ | Tn( (1 + t)/(1 − t) ) |, t ∈ [0, 1). (6.41)
To apply this estimate, and derive the final estimate on the convergence of the CG method, we set γ = λn/λ1 ∈ (0, 1) and use
(λ1 + λn)/(λ1 − λn) = (1 + λn/λ1)/(1 − λn/λ1) = (1 + γ)/(1 − γ). (6.42)
From (6.39), Theorem 6.14, and (6.41), we have
‖x − xi‖A ≤ ( inf_{P ∈ πi−1, P(0)=1} ‖P‖_{L∞([λn,λ1])} ) ‖x − x1‖A
= ( sup_{t ∈ [λn,λ1]} | Ti−1( (2t − λ1 − λn)/(λ1 − λn) ) | / | Ti−1( (λ1 + λn)/(λ1 − λn) ) | ) ‖x − x1‖A
≤ ( 1 / | Ti−1( (1 + γ)/(1 − γ) ) | ) ‖x − x1‖A
≤ 2 ( (1 − √γ)/(1 + √γ) )^{i−1} ‖x − x1‖A, (6.43)
where we have used (6.40) and (6.42) in the third step and the lower bound from (6.41) in the fourth step. Finally, (6.43) and the fact that (see Remark 3.4)
γ = λn/λ1 ⇒ 1/γ = λ1 · (1/λn) = ‖A‖2 ‖A−1‖2 = κ2(A)
show the following result.
Theorem 6.15 (CG method error estimate)
Let A ∈ Rn×n be symmetric and positive definite, and let λ1 ≥ λ2 ≥ . . . ≥ λn > 0 be the eigenvalues of A. Let b ∈ Rn, and let x be the solution of Ax = b. The sequence of iterations {xi} generated by the CG method satisfies the error estimate
‖x − xi‖A ≤ 2 ‖x − x1‖A ( (1 − √γ)/(1 + √γ) )^{i−1} = 2 ‖x − x1‖A ( (√κ2(A) − 1)/(√κ2(A) + 1) )^{i−1},
where γ = 1/κ2(A) = λn/λ1.
The method converges even if the matrix is almost singular (that is, if λd+1 ≈ . . . ≈ λn ≈ 0), because (√κ2(A) − 1)/(√κ2(A) + 1) < 1, but the convergence is then slow. If λd+1 ≈ . . . ≈ λn ≈ 0, then κ2(A) is very large, and one should try to use appropriate matrix manipulations to reduce the condition number of A, such as a transformation of the smallest eigenvalues to zero and of the bigger eigenvalues to an interval of the form (λ, λ(1 + ε)) ⊂ (0,∞) with ε as small as possible. In general it does not matter if, during this process, the rank of the matrix is reduced. Generally, this leads only to a small additional error but also to a significant increase in the convergence rate of the CG method.
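To get a feeling for the error estimate in Theorem 6.15, the following lines (an illustrative sketch, not part of the original course code) evaluate the reduction factor 2 ((√κ2(A) − 1)/(√κ2(A) + 1))^{i−1} for a few condition numbers:

% Evaluate the CG error bound factor from Theorem 6.15 for
% several condition numbers kappa and iteration indices i.
for kappa = [10 100 1000]
    q = (sqrt(kappa) - 1) / (sqrt(kappa) + 1);   % convergence factor
    i = [1 6 11 16 21];                          % iteration indices
    bound = 2 * q.^(i - 1);                      % bounds on the relative A-norm error
    fprintf('kappa = %5d:', kappa); fprintf(' %9.2e', bound); fprintf('\n');
end

The factor deteriorates towards 1 as κ2(A) grows, which quantifies the remark above that a large condition number slows the CG method down.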
Exercise 88 Let A ∈ Rn×n be a real positive definite symmetric n × n matrix. Let λmin denote the smallest eigenvalue of A, and let λmax denote the largest eigenvalue of A. Prove that
λmin = inf_{x ∈ Rn\{0}} (x^T Ax)/(x^T x) and λmax = sup_{x ∈ Rn\{0}} (x^T Ax)/(x^T x).
Chapter 7
Calculation of Eigenvalues
In this last chapter of the lecture notes, we are concerned with the numerical computation of
the eigenvalues and eigenvectors of a square matrix. Let A ∈ Cn×n be a square matrix. A
non-zero vector x ∈ Cn is an eigenvector of A and λ ∈ C is its corresponding eigenvalue if
Ax = λx.
The eigenvalues are the roots of the characteristic polynomial
p(A, λ) = det(λ I − A).
However, for larger matrices it is not practicable to actually compute the characteristic polyno-
mial, let alone its roots. Hence, we will look at more feasible methods to numerically determine
eigenvalues and eigenvectors.
7.1 Basic Localisation Techniques
A very rough way of getting an estimate of the location of the eigenvalues is given by the
following theorem.
Theorem 7.1 (Gershgorin disks)
The eigenvalues of a matrix A in Cn×n (or in Rn×n) are contained in the union ⋃_{j=1}^{n} Kj of the disks
Kj := { λ ∈ C : |λ − aj,j| ≤ ∑_{k=1, k≠j}^{n} |aj,k| }, 1 ≤ j ≤ n.
Proof of Theorem 7.1. For an eigenvalue λ ∈ C of A we can choose an eigenvector
x ∈ Cn \ {0} with ‖x‖∞ = max_{1≤i≤n} |xi| = 1. From Ax − λx = 0 we can conclude that
( ∑_{k=1}^{n} ai,k xk ) − λ xi = (ai,i − λ) xi + ∑_{k=1, k≠i}^{n} ai,k xk = 0, 1 ≤ i ≤ n,
and thus
(ai,i − λ) xi = − ∑_{k=1, k≠i}^{n} ai,k xk, 1 ≤ i ≤ n.
If we pick an index i = j with |xj| = ‖x‖∞ = 1 then the statement follows via
|aj,j − λ| = |(aj,j − λ) xj| = | ∑_{k=1, k≠j}^{n} aj,k xk | ≤ ∑_{k=1, k≠j}^{n} |aj,k| |xk| ≤ ( max_{1≤i≤n} |xi| ) ∑_{k=1, k≠j}^{n} |aj,k| = ∑_{k=1, k≠j}^{n} |aj,k|,
where we have used ‖x‖∞ = 1 by assumption. Therefore, we know that the eigenvalue λ is
contained in the disk Kj where j is an index for which |xj | = ‖x‖∞ = 1. Since the eigen-
value λ was arbitrary, we know that all eigenvalues are contained in the union of the disks Kj ,
j = 1, 2, . . . , n. 2
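The Gershgorin disks are cheap to compute. The following MATLAB sketch (illustrative; the function name gershgorin is not part of the original notes) returns the centres and radii of the disks K1, . . . , Kn:

function [centers,radii] = gershgorin(A)
%
% computes the Gershgorin disks of a square matrix A
%
% output: centers = diagonal entries a_jj (centres of the disks)
%         radii   = off-diagonal absolute row sums (radii of the disks)
%
centers = diag(A);
radii = sum(abs(A),2) - abs(centers);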
Exercise 89 Consider the matrix
A = ( 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ).
(a) Compute the exact eigenvalues of A by hand.
(b) Use Theorem 7.1 on the Gershgorin disks to estimate the location of the eigenvalues.
The Rayleigh quotient introduced below allows us to approximate an eigenvalue, if we are able
to approximate a corresponding eigenvector.
Definition 7.2 (Rayleigh quotient)
The Rayleigh quotient of a vector x ∈ Rn \ {0} with respect to a real matrix A ∈ Rn×n is the scalar
R(x) := (x^T Ax)/(x^T x).
The Rayleigh quotient of a vector x ∈ Cn \ {0} with respect to a complex matrix A ∈ Cn×n is the scalar
R(x) := (x∗ Ax)/(x∗ x).
Obviously, if x is an eigenvector then R(x) is the corresponding eigenvalue: Indeed, if Ax = λx,
then
R(x) = (x^T Ax)/(x^T x) = (λ x^T x)/(x^T x) = λ and R(x) = (x∗ Ax)/(x∗ x) = (λ x∗ x)/(x∗ x) = λ, (7.1)
respectively.
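In MATLAB the Rayleigh quotient is a one-line computation; the sketch below (illustrative, not from the original notes) covers both cases at once, since the operator ' is the conjugate transpose:

function R = rayleigh(A,x)
%
% Rayleigh quotient of the non-zero vector x with respect to A;
% x' is the conjugate transpose, so this covers R^n and C^n alike
%
R = (x' * A * x) / (x' * x);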
We will deal now with symmetric matrices A = AT in Rn×n. For such matrices it is known
(see Theorem 2.11) that they have n real eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn, and that there
are n corresponding orthonormal eigenvectors w1,w2, . . . ,wn ∈ Rn, that is, the eigenvectors
w1,w2, . . . ,wn satisfy
wj^T wk = δj,k, j, k = 1, 2, . . . , n.
In particular, w1,w2, . . . ,wn form an orthonormal basis of Rn. Then, every x ∈ Rn can be
represented as
x = ∑_{k=1}^{n} ck wk, (7.2)
and the coefficients can be determined via
wj^T x = ∑_{k=1}^{n} ck wj^T wk = ∑_{k=1}^{n} ck δj,k = cj ⇒ cj = wj^T x.
Moreover, the Euclidean norm of x, given by (7.2), can be computed via
‖x‖²2 = ( ∑_{j=1}^{n} cj wj )^T ( ∑_{k=1}^{n} ck wk ) = ∑_{j=1}^{n} ∑_{k=1}^{n} cj ck wj^T wk = ∑_{j=1}^{n} ∑_{k=1}^{n} cj ck δj,k = ∑_{k=1}^{n} ck².
Theorem 7.3 (convergence of the Rayleigh quotient)
Suppose A ∈ Rn×n is symmetric. Assume that we have a sequence {x(j)} of vectors in Rn
which converges to an eigenvector wJ of A with eigenvalue λJ , and assume that {x(j)} is
normalized, that is, ‖x(j)‖2 = 1 for all j. Then we have
lim_{j→∞} R(x(j)) = R(wJ) = λJ, (7.3)
and
| R(x(j)) − R(wJ) | = O( ‖x(j) − wJ‖²2 ). (7.4)
Proof of Theorem 7.3. Since the Rayleigh quotient R(x) depends continuously on x, the
first equality in (7.3) is clear and the second equality follows from (7.1).
It remains to show (7.4). For this purpose, we use that, since A is symmetric, there are n
eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn and n corresponding orthonormal eigenvectors w1,w2, . . . ,wn.
With this orthonormal basis of eigenvectors, x(j) has a representation
x(j) = ∑_{k=1}^{n} ck wk, (7.5)
where, to ease the notation, we suppress the fact that the coefficients ck = ck^{(j)} also depend on j. Then, since Awj = λj wj, we find
Ax(j) = A ( ∑_{k=1}^{n} ck wk ) = ∑_{k=1}^{n} ck Awk = ∑_{k=1}^{n} ck λk wk. (7.6)
Since wj^T wk = 0 if j ≠ k, and wj^T wj = ‖wj‖²2 = 1 for all j = 1, 2, . . . , n, and ‖x(j)‖²2 = (x(j))^T x(j) = 1 for all j, from (7.6) and (7.5),
R(x(j)) = ( (x(j))^T Ax(j) ) / ( (x(j))^T x(j) ) = (x(j))^T Ax(j)
= ( ∑_{i=1}^{n} ci wi )^T ( ∑_{k=1}^{n} ck λk wk )
= ∑_{k=1}^{n} ∑_{i=1}^{n} ci ck λk wi^T wk = ∑_{k=1}^{n} λk ck²,
where we have used wi^T wk = δi,k.
Using R(wJ) = λJ from (7.1), this gives
R(x(j)) − R(wJ) = ∑_{k=1}^{n} λk ck² − λJ = λJ (cJ² − 1) + ∑_{k=1, k≠J}^{n} λk ck².
Thus
| R(x(j)) − R(wJ) | ≤ |λJ| |cJ² − 1| + ∑_{k=1, k≠J}^{n} |λk| ck²
≤ ( max_{1≤i≤n} |λi| ) ( |cJ² − 1| + ∑_{k=1, k≠J}^{n} ck² )
= ( max_{1≤i≤n} |λi| ) ( |(2 cJ − 2) + (cJ² − 2 cJ + 1)| + ∑_{k=1, k≠J}^{n} ck² )
= ( max_{1≤i≤n} |λi| ) ( |2 (cJ − 1) + (cJ − 1)²| + ∑_{k=1, k≠J}^{n} ck² )
≤ ( max_{1≤i≤n} |λi| ) ( 2 |cJ − 1| + (cJ − 1)² + ∑_{k=1, k≠J}^{n} ck² ). (7.7)
From ‖x(j)‖2 = ‖wJ‖2 = 1 and (7.5) and wi^T wk = 0 if i ≠ k,
‖x(j) − wJ‖²2 = (x(j) − wJ)^T (x(j) − wJ) = ‖x(j)‖²2 + ‖wJ‖²2 − 2 wJ^T x(j)
= 2 − 2 wJ^T ( ∑_{k=1}^{n} ck wk ) = 2 − 2 ∑_{k=1}^{n} ck wJ^T wk = 2 − 2 cJ = 2 (1 − cJ). (7.8)
On the other hand, from (7.5),
‖x(j) − wJ‖²2 = ‖ ∑_{k=1}^{n} ck wk − wJ ‖²2 = ‖ (cJ − 1)wJ + ∑_{k=1, k≠J}^{n} ck wk ‖²2
= ( (cJ − 1)wJ + ∑_{i=1, i≠J}^{n} ci wi )^T ( (cJ − 1)wJ + ∑_{k=1, k≠J}^{n} ck wk )
= (cJ − 1)² wJ^T wJ + 2 (cJ − 1) ∑_{k=1, k≠J}^{n} ck wJ^T wk + ∑_{i=1, i≠J}^{n} ∑_{k=1, k≠J}^{n} ci ck wi^T wk
= (cJ − 1)² + ∑_{k=1, k≠J}^{n} ck², (7.9)
where we have used wJ^T wJ = 1, wJ^T wk = 0 for k ≠ J, and wi^T wk = δi,k.
Substituting (7.8) and (7.9) into the last line of (7.7) yields
| R(x(j)) − R(wJ) | ≤ ( max_{1≤i≤n} |λi| ) ( ‖x(j) − wJ‖²2 + ‖x(j) − wJ‖²2 ) = 2 ( max_{1≤i≤n} |λi| ) ‖x(j) − wJ‖²2,
which proves (7.4). 2
In analogy to Theorem 7.3, we can prove the following corresponding theorem for Hermitian
matrices in Cn×n.
Theorem 7.4 (convergence of Rayleigh quotient)
Suppose A ∈ Cn×n is Hermitian. Assume that we have a sequence {x(j)} of vectors in Cn
which converges to an eigenvector wJ of A with eigenvalue λJ , and assume that {x(j)} is
normalized, that is, ‖x(j)‖2 = 1 for all j. Then we have
lim_{j→∞} R(x(j)) = R(wJ) = λJ,
and
| R(x(j)) − R(wJ) | = O( ‖x(j) − wJ‖²2 ).
Theorems 7.3 and 7.4 tell us that the Rayleigh quotient converges with a quadratic order to λJ as {x(j)} converges to wJ.
7.2 The Power Method
In this section we no longer assume that A ∈ Rn×n is symmetric. Suppose A has a dominant eigenvalue λ1, that is, we have
|λ1| > |λ2| ≥ |λ3| ≥ · · · ≥ |λn|. (7.10)
For such matrices, it is now our goal to determine the dominant eigenvalue λ1 and a corresponding eigenvector. Also assume that there exist n real eigenvalues λ1, λ2, . . . , λn ∈ R and n linearly independent eigenvectors w1,w2, . . . ,wn ∈ Rn corresponding to the real eigenvalues λ1, λ2, . . . , λn, and satisfying ‖w1‖2 = ‖w2‖2 = . . . = ‖wn‖2 = 1. (Note that this is not always the case, since we have neither assumed that A is symmetric nor that the n eigenvalues are distinct. Since A is not assumed to be symmetric, the eigenvectors w1,w2, . . . ,wn can in general not be chosen orthogonal to each other.) Then the eigenvectors w1,w2, . . . ,wn form a basis of Rn, and we can represent every x ∈ Rn as
x = ∑_{j=1}^{n} cj wj,
with some coefficients c1, c2, . . . , cn ∈ R. Using Awj = λj wj, j = 1, 2, . . . , n, shows that
A^m x = ∑_{j=1}^{n} cj A^m wj = ∑_{j=1}^{n} cj λj^m wj = λ1^m ( c1 w1 + ∑_{j=2}^{n} cj (λj/λ1)^m wj ) =: λ1^m ( c1 w1 + Rm ), (7.11)
with the remainder term
Rm := ∑_{j=2}^{n} cj (λj/λ1)^m wj. (7.12)
The vector sequence {Rm} tends to the zero vector 0 for m → ∞, since the triangle inequality, ‖wj‖2 = 1 for all j = 2, . . . , n, and |λj/λ1| < 1 for j = 2, . . . , n (from (7.10)) imply
‖Rm‖2 = ‖ ∑_{j=2}^{n} cj (λj/λ1)^m wj ‖2 ≤ ∑_{j=2}^{n} |cj| |λj/λ1|^m ‖wj‖2 = ∑_{j=2}^{n} |cj| |λj/λ1|^m → 0 as m → ∞.
If c1 ≠ 0, this means that from (7.11)
(1/λ1^m) A^m x = c1 w1 + Rm → c1 w1 as m → ∞,
and c1 w1 is an eigenvector of A for the eigenvalue λ1. Of course, this is so far only of limited
value, since we do not know the eigenvalue λ1 and hence cannot form the quotient (A^m x)/λ1^m.
Another problem comes from the fact that the norm of Amx converges to zero if |λ1| < 1 and
to infinity if |λ1| > 1. Both problems can be resolved if we normalize appropriately.
For example, if we take the Euclidean norm of A^m x, then from the last step in (7.11) and ‖wj‖2 = 1 for j = 1, 2, . . . , n,
‖A^m x‖²2 = ‖λ1^m (c1 w1 + Rm)‖²2 = λ1^{2m} (c1 w1 + Rm)^T (c1 w1 + Rm)
= λ1^{2m} ( |c1|² w1^T w1 + 2 c1 w1^T Rm + Rm^T Rm ) = λ1^{2m} ( |c1|² + rm ), (7.13)
where rm ∈ R is defined by
rm := 2 c1 w1^T Rm + ‖Rm‖²2. (7.14)
Clearly lim_{m→∞} Rm = 0 implies lim_{m→∞} rm = 0. Thus, from (7.13), we can write
‖A^m x‖2 = |λ1|^m √(|c1|² + rm), with lim_{m→∞} rm = 0. (7.15)
From (7.15), we find
‖A^{m+1} x‖2 / ‖A^m x‖2 = |λ1| √(|c1|² + rm+1) / √(|c1|² + rm) → |λ1| for m → ∞, (7.16)
which gives the real eigenvalue λ1 up to its sign. To determine also its sign and a corresponding
eigenvector, we refine the method as follows.
Definition 7.5 (power method or von Mises iteration)
Let A ∈ Rn×n, and assume that A has n real eigenvalues λ1, λ2, . . . , λn ∈ R (counted with multiplicity) and a corresponding set of real normalized eigenvectors w1,w2, . . . ,wn ∈ Rn that form a basis of Rn. (That is, w1,w2, . . . ,wn are linearly independent, and ‖wj‖2 = 1 and Awj = λj wj, j = 1, 2, . . . , n.) Suppose that A has a dominant eigenvalue λ1, that is, |λ1| > |λj| for j = 2, 3, . . . , n. The power method or von Mises iteration is defined in the following way: First, we choose a starting vector
x(0) = ∑_{j=1}^{n} cj wj, with c1 ≠ 0,
and set y(0) = x(0)/‖x(0)‖2. Then, for m = 1, 2, . . . we compute:
1. x(m) = Ay(m−1),
2. y(m) = σm x(m)/‖x(m)‖2, where the sign σm ∈ {−1, 1} is chosen such that (y(m))^T y(m−1) ≥ 0.
The choice of the sign σm means that the angle between y(m−1) and y(m) is in [0, π/2], which means that we are avoiding a ‘jump’ when moving from y(m−1) to y(m).
The condition c1 ≠ 0 is usually satisfied simply because of numerical rounding errors and is no real restriction. We note that c1 ≠ 0 is equivalent to w1^T x(0) ≠ 0.
The power method can be implemented with the following MATLAB code:
function [lambda,v] = power_method(A,z,J)
%
% executes the power method or von Mises iteration
%
% input: A = real n by n matrix with real eigenvalues
% and n linearly independent real eigenvectors;
% it is assumed that A has a dominant eigenvalue
% z = n by 1 starting vector for the power method
% J = number of iterations
%
% output: lambda = approximation of dominant eigenvalue
% v = approximation of corresponding normalized eigenvector
%
y = z / norm(z);
for j = 1:J
x = A * y;
sigma = sign(x’ * y);
y = sigma * ( x / norm(x) );
end
lambda = sigma * norm(x);
v = y;
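As a usage illustration, the call below reproduces the computation of Example 7.7 below; the output values quoted in the comment are the ones reported there:

% Power method applied to the matrix of Example 7.7.
A = [0 -2 2; -2 -3 2; -3 -6 5];
z = [1; 1; 1];                     % starting vector
[lambda,v] = power_method(A,z,15);
% lambda is approximately 2.0002 and v approximately (0.7071, 0.0001, 0.7071)^T,
% approximating the dominant eigenvalue 2 with eigenvector (1,0,1)^T / sqrt(2).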
The theorem below gives information about the convergence of the power method.
Theorem 7.6 (convergence of power method)
Let A ∈ Rn×n, and assume that A has n real eigenvalues λ1, λ2, . . . , λn ∈ R (counted with
multiplicity) and a corresponding set of real normalized eigenvectors w1,w2, . . . ,wn ∈ Rn
that form a basis of Rn. (That is, w1,w2, . . . ,wn are linearly independent, and ‖wj‖2 = 1
and Awj = λj wj, j = 1, 2, . . . , n.) Suppose that A has a dominant eigenvalue λ1, that is,
|λ1| > |λj| for j = 2, 3, . . . , n. Then the iterations of the power method satisfy:
(i) ‖x(m)‖2 → |λ1| for m → ∞,
(ii) y(m) converges to a normalized eigenvector of A with the eigenvalue λ1,
(iii) σm → sign(λ1) for m → ∞, that is, σm = sign(λ1) for sufficiently large m.
Before we prove the theorem, we give a numerical example.
Example 7.7 (power method)
Consider the real 3 × 3 matrix
A = ( 0 −2 2 ; −2 −3 2 ; −3 −6 5 ).
From Example 2.4 we know that the eigenvalues of this matrix are λ1 = 2, λ2 = 1, and λ3 = −1 and that the eigenvectors are real. A normalized eigenvector to the dominant eigenvalue λ1 = 2 is given by w1 = (1/√2) (1, 0, 1)^T, and normalized eigenvectors to the eigenvalues λ2 = 1 and λ3 = −1 are given by w2 = (1/√5) (2, −1, 0)^T and w3 = (1/√2) (0, 1, 1)^T.
We compute the first two steps of the power method with starting vector x(0) = (1, 1, 1)^T by hand. It can be easily verified that x(0) = −√2 w1 + √5 w2 + 2√2 w3, that is, the condition c1 = −√2 ≠ 0 is satisfied.
We start with
y(0) = x(0)/‖x(0)‖2 = (1/√3) (1, 1, 1)^T.
In the first step of the power method we find
x(1) = Ay(0) = (1/√3) ( 0 −2 2 ; −2 −3 2 ; −3 −6 5 ) (1, 1, 1)^T = (1/√3) (0, −3, −4)^T.
Since
σ1 = sign( (y(0))^T x(1) ) = sign( (1/3) (1, 1, 1) (0, −3, −4)^T ) = sign( (1/3)(−7) ) = −1
and
‖x(1)‖2 = (1/√3) √((−3)² + (−4)²) = 5/√3,
we find
y(1) = σ1 x(1)/‖x(1)‖2 = −(√3/5) · (1/√3) (0, −3, −4)^T = (1/5) (0, 3, 4)^T.
In the second step of the power method we find
x(2) = Ay(1) = (1/5) ( 0 −2 2 ; −2 −3 2 ; −3 −6 5 ) (0, 3, 4)^T = (1/5) (2, −1, 2)^T.
Since
σ2 = sign( (y(1))^T x(2) ) = sign( (1/25) (0, 3, 4) (2, −1, 2)^T ) = sign(5/25) = 1
and
‖x(2)‖2 = (1/5) √(2² + (−1)² + 2²) = √9/5 = 3/5,
we find
y(2) = σ2 x(2)/‖x(2)‖2 = (5/3) · (1/5) (2, −1, 2)^T = (1/3) (2, −1, 2)^T.
Using the MATLAB code given above we obtain the following approximations to λ1 = 2 and the corresponding normalized eigenvector w1 = (1/√2) (1, 0, 1)^T ≈ (0.707107, 0, 0.707107)^T from the first six iterations of the power method:

j                   1         2         3         4         5         6
y(j), 1st comp.     0         0.6667    0.4983    0.7062    0.6602    0.7071
y(j), 2nd comp.     0.6000   −0.3333    0.2491   −0.0504    0.0660   −0.0114
y(j), 3rd comp.     0.8000    0.6667    0.8305    0.7062    0.7482    0.7071
σj ‖x(j)‖2         −2.8868    0.6000    4.0139    1.6463    2.2923    1.9296

We see that while the first few iterations of the power method give very poor results, after only six iterations we already have a reasonable approximation of the eigenvector and eigenvalue.
After 15 iterations we find
λ1 ≈ σ15 ‖x(15)‖2 = 2.0002 and w1 ≈ y(15) = (0.7071, 0.0001, 0.7071)^T,
which gives a very good approximation of the dominant eigenvalue λ1 = 2 and a corresponding normalized eigenvector. 2
Note that the assumption that ‖wj‖2 = 1, j = 1, 2, . . . , n, is only for our convenience. Once
we have n linearly independent eigenvectors w1,w2, . . . ,wn corresponding to λ1, λ2, . . . , λn we
can always normalize them so that they have length one.
Proof of Theorem 7.6. From the definition of the power method, we have for k ≥ 1
y(k) = σk x(k)/‖x(k)‖2 = σk Ay(k−1)/‖Ay(k−1)‖2. (7.17)
Applying (7.17) repeatedly for k = m, m − 1, . . . , 1 yields
y(m) = σm Ay(m−1)/‖Ay(m−1)‖2 = σm σm−1 A² y(m−2)/‖A² y(m−2)‖2 = σm σm−1 · · · σ1 A^m y(0)/‖A^m y(0)‖2 = σm σm−1 · · · σ1 A^m x(0)/‖A^m x(0)‖2, (7.18)
for m = 1, 2, . . . , where we have used y(0) = x(0)/‖x(0)‖2 in the last step. From this, we can conclude
x(m+1) = Ay(m) = σm σm−1 · · · σ1 A^{m+1} x(0)/‖A^m x(0)‖2,
such that (7.16) with x = x(0) immediately leads to
‖x(m+1)‖2 = ‖A^{m+1} x(0)‖2/‖A^m x(0)‖2 = |λ1| √(|c1|² + rm+1)/√(|c1|² + rm) → |λ1| for m → ∞.
Using in (7.18) the representations (7.11) and (7.15) yields for sufficiently large m
y(m) = σm σm−1 · · · σ1 λ1^m (c1 w1 + Rm) / ( |λ1|^m √(|c1|² + rm) )
= σm σm−1 · · · σ1 [sign(λ1)]^m sign(c1) ( |c1| / √(|c1|² + rm) ) w1 + ρm, (7.19)
where
ρm := σm σm−1 · · · σ1 [sign(λ1)]^m Rm / √(|c1|² + rm).
Since c1 ≠ 0 and Rm → 0 and rm → 0 for m → ∞, clearly ρm → 0 for m → ∞. From (7.19) and ρm → 0 for m → ∞, we can conclude that y(m) indeed converges to an eigenvector of A for the eigenvalue λ1, provided that σm = sign(λ1) for all m ≥ m0. The latter follows from (y(m−1))^T y(m) = (y(m))^T y(m−1) ≥ 0 and the first line in (7.19), using ‖w1‖²2 = 1:
0 ≤ (y(m−1))^T y(m) = σm σ²_{m−1} · · · σ²_1 λ1^{2m−1} (c1 w1 + Rm−1)^T (c1 w1 + Rm) / ( |λ1|^{2m−1} √(|c1|² + rm−1) √(|c1|² + rm) )
= σm sign(λ1) ( |c1|² + R^T_{m−1} Rm + c1 w1^T (Rm−1 + Rm) ) / ( |c1|² √( (1 + rm−1/|c1|²)(1 + rm/|c1|²) ) ). (7.20)
Because Rm → 0 and rm → 0 for m → ∞, the fraction in the second line of (7.20) converges to one for m → ∞, and hence (7.20) implies for m large enough that
0 ≤ σm sign(λ1) ⇔ σm = sign(λ1),
which completes the proof. 2
The Euclidean norm involved in the normalization process can be replaced by any other norm
without changing the convergence result. Often, the maximum norm is used, since it is cheaper
to compute.
Exercise 90 Consider the matrix
A = ( 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ).
In Exercise 89 you have computed the eigenvalues of the matrix A.
(a) Compute the eigenvectors corresponding to the dominant eigenvalue of A by hand.
(b) For the starting vector x(0) = (1, 1, 1)T , compute the first three iterations of the power
method by hand.
(c) Comment on the quality of the results from (b).
7.3 Inverse Iteration
Inverse iteration can be used to improve the approximations of eigenvalues obtained by other methods. The idea is to start with a good approximation λ of an eigenvalue λj of A with multiplicity one, and to consider the inverse matrix (A − λ I)−1. If λ is a good approximation of λj, then we expect λj − λ to be small and hence that the eigenvalue μj := (λj − λ)−1 of (A − λ I)−1 is a dominant eigenvalue of (A − λ I)−1. Inverse iteration consists of using the power method to approximate this dominant eigenvalue μj = (λj − λ)−1 and then uses the relation λj = λ + 1/μj to obtain an improved approximation of the eigenvalue λj of A.
General assumption in this section: Throughout this section we assume that the real
matrix A ∈ Rn×n has n real eigenvalues λ1, λ2, . . . , λn ∈ R and corresponding real normalized
eigenvectors w1,w2, . . . ,wn ∈ Rn that form a basis of Rn.
The major drawback of the power method, discussed in the previous section, is the convergence factor: Let |λ1| > |λ2| ≥ |λ3| ≥ . . . ≥ |λn|; if |λ2| is close to |λ1| the convergence will be very slow, since |λ2/λ1|^m tends only very slowly to zero for m → ∞.
Inverse iteration, which is discussed in this section, starts with a good approximation λ of one eigenvalue λj of multiplicity one, and then uses the power method applied to (A − λ I)−1 to compute an improved approximation of λj. We will now see in detail how this works.
Assume now that A has an eigenvalue λj of multiplicity one. If λ is a good approximation to
λj then we have
|λ − λj| < |λ − λi| for all i = 1, 2, . . . , n with i ≠ j. (7.21)
If λ is not an eigenvalue of A, then det(A − λ I) ≠ 0 and we can invert A − λ I. Since
λ1, λ2, . . . , λn are the eigenvalues of A, we have
det( µ I − (A − λ I) ) = det( (µ + λ) I − A ) = (µ + λ − λ1) (µ + λ − λ2) · · · (µ + λ − λn).
Hence A − λ I has the eigenvalues λi − λ, i = 1, 2, . . . , n, and thus (A − λ I)−1 has the eigenvalues (λi − λ)−1 =: μi, i = 1, 2, . . . , n. Rearranging, for i = j,
μj = 1/(λj − λ) ⇔ λj − λ = 1/μj ⇔ λj = λ + 1/μj,
and we see that a good approximation of μj allows us to improve our approximation of λj.
From (7.21), we have
1/|λ − λi| < 1/|λ − λj| = |μj| for all i = 1, 2, . . . , n with i ≠ j,
and we can conclude that μj is the dominant eigenvalue of (A − λ I)−1.
From our assumption that A has n real eigenvalues λ1, λ2, . . . , λn and corresponding normalized eigenvectors w1,w2, . . . ,wn, that form a basis of Rn, the real eigenvalue μi = (λi − λ)−1 of the matrix (A − λ I)−1 has the real eigenvector wi. Indeed, for i = 1, 2, . . . , n,
Awi = λi wi ⇔ (A − λ I)wi = (λi − λ)wi ⇔ (1/(λi − λ)) (A − λ I)wi = wi.
Thus, from left-multiplication with (A − λ I)−1,
(A − λ I)−1 (1/(λi − λ)) (A − λ I)wi = (A − λ I)−1 wi ⇔ (1/(λi − λ)) wi = (A − λ I)−1 wi.
Thus the real eigenvalue μi = (λi − λ)−1 of (A − λ I)−1 has the corresponding normalized eigenvector wi ∈ Rn. From the assumptions on A, we see that the real eigenvectors w1,w2, . . . ,wn of (A − λ I)−1 corresponding to the real eigenvalues μi = (λi − λ)−1, i = 1, 2, . . . , n, form a basis of Rn.
Therefore the assumptions for the power method are satisfied, and the power method can be applied to the computation of the dominant eigenvalue μj of (A − λ I)−1. However, for one iteration step we now have to form
1. x(m) = (A − λ I)−1 y(m−1),
2. y(m) = σm x(m)/‖x(m)‖2, where σm ∈ {−1, 1} is chosen such that (y(m))^T y(m−1) ≥ 0.
The computation of x(m) thus requires solving the linear system
(A − λ I)x(m) = y(m−1).
To this end, we determine, before the iteration starts, an LU factorization of A − λ I (or a
QR factorization) using partial row pivoting, that is, we compute
P (A − λ I) = L U,
with a permutation matrix P, an upper triangular matrix U and a normalized lower triangular matrix L. Then, we can use forward and back substitution to solve
L U x(m) = P y(m−1).
More precisely, we let z = U x(m) and solve first L z = P y(m−1) by forward substitution and then U x(m) = z by back substitution. This reduces the complexity necessary in each step to O(n²) elementary operations and requires only once O(n³) elementary operations to compute the LU factorization at the beginning.
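A possible implementation of inverse iteration along these lines is sketched below (an illustrative sketch, not from the original notes; lambdahat stands for the approximation λ, and at least one iteration is assumed):

function [lambda,v] = inverse_iteration(A,lambdahat,z,J)
%
% inverse iteration: power method applied to (A - lambdahat*I)^(-1)
%
% input: A = real n by n matrix satisfying the assumptions of this section
%        lambdahat = approximation of an eigenvalue of multiplicity one
%        z = n by 1 starting vector, J = number of iterations (J >= 1)
%
% output: lambda = improved eigenvalue approximation, v = eigenvector approx.
%
n = size(A,1);
[L,U,P] = lu(A - lambdahat * eye(n));   % LU factorization, computed only once
y = z / norm(z);
for m = 1:J
    x = U \ (L \ (P * y));              % forward and back substitution
    sigma = sign(x' * y);
    if sigma == 0, sigma = 1; end       % guard against orthogonal iterates
    y = sigma * (x / norm(x));
    mu = sigma * norm(x);               % approximates mu_j = (lambda_j - lambdahat)^(-1)
end
lambda = lambdahat + 1/mu;              % lambda_j = lambdahat + 1/mu_j
v = y;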
The approximation λ of λj can be given by any other method. From this point of view, inverse
iteration is particularly interesting for improving the results of other methods.
For symmetric matrices, it is possible to improve the convergence dramatically by estimating λ in each step using the Rayleigh quotient:
1. Choose x(0) ∈ Rn and set y(0) = x(0)/‖x(0)‖2.
2. For m = 0, 1, 2, . . . do
µm = (y(m))^T Ay(m),
x(m+1) = (A − µm I)−1 y(m),
y(m+1) = x(m+1)/‖x(m+1)‖2.
Of course, x(m+1) is only computed if µm is not an eigenvalue of A, otherwise the method ceases.
The same holds if ‖x(m+1)‖2 becomes too large. Although, for m → ∞, the matrices A − µm I
become singular, there are usually no problems with the computation of x(m+1), provided that
one terminates the computation of the LU decomposition (or QR-decomposition) of A − µm I
in time.
It is possible to show that the convergence is cubic under certain assumptions. This fast convergence is indeed necessary to make the method efficient, since in each step an LU decomposition has to be computed, and this additional computational complexity should be compensated by a faster convergence.
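A minimal sketch of this Rayleigh quotient iteration might look as follows (illustrative only; a production code would stop gracefully as soon as A − µm I becomes numerically singular or ‖x(m+1)‖2 becomes too large):

function [mu,y] = rayleigh_iteration(A,z,J)
%
% Rayleigh quotient iteration for a symmetric real matrix A
%
n = size(A,1);
y = z / norm(z);
for m = 1:J
    mu = y' * A * y;                % Rayleigh quotient as shift
    x = (A - mu * eye(n)) \ y;      % solve the shifted linear system
    y = x / norm(x);
end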
Exercise 91 For a given matrix A ∈ Rn×n and µ ∈ R, we may write the inverse iteration in
the following form:
Given x(0) ∈ Cn for k = 1, 2, . . . do:
(1) solve (A − µ I)w = x(k−1),
(2) normalize x(k) = σk w/‖w‖2, where σk ∈ {−1, 1} is chosen such that (x(k))∗ x(k−1) ≥ 0,
(3) set λ(k) = (x(k))∗ Ax(k).
Assume that A has a basis {z1, z2, . . . , zn} ⊂ Cn of eigenvectors of A, that is, if λ1, λ2, . . . , λn ∈ C
are the eigenvalues of A then there exist n linearly independent eigenvectors z1, z2, . . . , zn ∈ Cn
such that A zj = λj zj for j = 1, 2, . . . , n. Let α1, α2, . . . , αn ∈ C be the coefficients of x(0) with
respect to the basis {z1, z2, . . . , zn}, that is, x(0) =∑n
i=1 αi zi. Under certain conditions, that
are to be stated, do the following:
(a) Describe the eigenvalues and eigenvectors of (A − µ I)−1 in terms of µ and the eigenvalues
and eigenvectors of A.
(b) Show that
x(r) = ( σ1 σ2 · · · σr−1 σr / (K0 K1 K2 · · · Kr−1) ) ∑_{i=1}^{n} αi ( 1/(λi − µ) )^r zi,
where Kj := ‖(A − µ I)−1 x(j)‖2, j ∈ N0.
(c) For any 1 ≤ J ≤ n, show that
‖ ( (K0 K1 K2 · · · Kr−1)/(σ1 σ2 · · · σr−1 σr) ) (λJ − µ)^r x(r) − αJ zJ ‖2 ≤ | (λJ − µ)/(λI − µ) |^r ∑_{i=1, i≠J}^{n} |αi| ‖zi‖2,
where the properties of the index I are to be stated.
7.4 The Jacobi Method
C. G. J. Jacobi stated in 1845/46 a method for dealing with the eigenvalue problem for sym-
metric n × n matrices, which is still feasible if the matrices are not too big. This method is
called the Jacobi method. Since it can easily be run in parallel it can, on specific computers,
even be superior to the QR-method, which will be discussed in Section 7.6. The Jacobi method
computes all eigenvalues and if necessary also all eigenvectors. The Jacobi method is based
upon the following easy fact.
Lemma 7.8 (Frobenius norm is invariant under orthogonal basis transformation)
Let A = (ai,j) in Rn×n be any matrix. For every orthogonal matrix Q ∈ Rn×n both A and Q^T A Q have the same Frobenius norm, that is, ‖A‖F = ‖Q^T A Q‖F. If A is symmetric (that is, A^T = A) and λ1, . . . , λn are the real eigenvalues of A then
∑_{j=1}^{n} |λj|² = ∑_{j=1}^{n} ∑_{k=1}^{n} |aj,k|² = ‖A‖²F. (7.22)
As a preparation for the proof we recall that for any matrix B = (bj,k) in Rn×n
trace(B) := ∑_{j=1}^{n} bj,j.
Since the products A B and B A of A = (aj,k) and B = (bj,k) in Rn×n satisfy
(A B)i,j = ∑_{k=1}^{n} ai,k bk,j and (B A)i,j = ∑_{k=1}^{n} bi,k ak,j,
setting i = j above, we see that
trace(A B) = ∑_{j=1}^{n} ∑_{k=1}^{n} aj,k bk,j = ∑_{k=1}^{n} ∑_{j=1}^{n} bk,j aj,k = trace(B A). (7.23)
In particular, (7.23) implies for B = A^T
trace(A A^T) = trace(A^T A) = ∑_{j=1}^{n} ∑_{k=1}^{n} |aj,k|² = ‖A‖²F. (7.24)
Proof of Lemma 7.8. From (7.24), the square of the Frobenius norm of A equals
‖A‖²F = ∑_{j=1}^{n} ∑_{k=1}^{n} |aj,k|² = trace(A A^T) = trace(A^T A).
Since from (7.23), the trace of A B is the same as the trace of B A for two arbitrary n × n matrices A and B, we can conclude that (using Q Q^T = I since Q is orthogonal)
‖Q^T A Q‖²F = trace( (Q^T A Q) (Q^T A Q)^T ) = trace( (Q^T A Q) (Q^T A^T Q) )
= trace( Q^T A (Q Q^T) A^T Q ) = trace( (Q^T A) (A^T Q) )
= trace( (A^T Q) (Q^T A) ) = trace( A^T (Q Q^T) A ) = trace(A^T A) = ‖A‖²F.
For a symmetric matrix A there exists an orthogonal matrix Q, such that Q^T A Q = D is a diagonal matrix with the real eigenvalues λ1, λ2, . . . , λn of A as diagonal entries. Thus for symmetric A
‖A‖²F = ‖Q^T A Q‖²F = ‖D‖²F = trace(D D^T) = ∑_{j=1}^{n} |λj|².
This concludes the proof. 2
Definition 7.9 (outer norm)
The outer norm of a real n × n matrix A = (ai,j) is defined by
N(A) := ( ∑_{j=1}^{n} ∑_{k=1, k≠j}^{n} |aj,k|² )^{1/2}.
Note that the outer norm is not really a norm in the strict sense of Definition 2.24.
Exercise 92 Investigate which of the properties of a norm are satisfied by the outer norm and which properties of a norm are violated by the outer norm.
From now on assume for the rest of this section that A is symmetric. With the outer norm, (7.22) in Lemma 7.8 gives for symmetric A ∈ Rn×n the decomposition
‖A‖²F = ∑_{j=1}^{n} |λj|² = ∑_{j=1}^{n} ∑_{k=1}^{n} |aj,k|² = ∑_{j=1}^{n} |aj,j|² + [N(A)]². (7.25)
Since the left-hand side is invariant under orthogonal transformations (from Lemma 7.8), it is now our goal to decrease the value of [N(A)]² by choosing appropriate orthogonal transformations Q and thus to increase the value of ∑_{j=1}^{n} |aj,j|². In the limit case [N(A)]² → 0, the transformed matrix Q^T A Q tends to the diagonal matrix D = diag(λ1, λ2, . . . , λn), where λ1, λ2, . . . , λn are the eigenvalues of A (without any ordering).
To find suitable orthogonal transformations Q such that [N(Q^T A Q)]² < [N(A)]², we choose an element ai,j ≠ 0 with i ≠ j and perform a transformation in the plane spanned by ei and ej, which maps ai,j to zero.
To construct such a transformation, consider the following problem in R²: for a symmetric 2 × 2 matrix
Ā = ( ai,i ai,j ; ai,j aj,j ),
find a rotation
R = ( cos α −sin α ; sin α cos α ) (7.26)
with angle α such that the matrix B̄ := R^T Ā R is a diagonal matrix, that is,
B̄ = ( bi,i bi,j ; bi,j bj,j ) = ( cos α sin α ; −sin α cos α ) ( ai,i ai,j ; ai,j aj,j ) ( cos α −sin α ; sin α cos α ), (7.27)
with bi,j = 0. (Note that B̄ is symmetric since B̄^T = (R^T Ā R)^T = R^T Ā^T R = R^T Ā R since Ā is symmetric.) The condition bi,j = 0 reads more explicitly
0 = bi,j = ai,j [ (cos α)² − (sin α)² ] + [ aj,j − ai,i ] cos α sin α = ai,j cos(2α) + [ aj,j − ai,i ] (1/2) sin(2α).
From ai,j ≠ 0 we can conclude that the angle α ∈ [0, π/2] is given by
cot(2α) = (ai,i − aj,j)/(2 ai,j). (7.28)
For the practical computation of the entries cos α and sin α in the rotation matrix R, we do not actually determine the angle α from (7.28), but use
tan(2α) = 2 ai,j/(ai,i − aj,j) if ai,i ≠ aj,j, and α = π/4 if ai,i = aj,j, (7.29)
and the following trigonometric identities:
cos(2α) = 1/√(1 + (tan(2α))²), cos α = √( (1 + cos(2α))/2 ), sin α = √( (1 − cos(2α))/2 ). (7.30)
We note that clearly
R^T R = ( cos α sin α ; −sin α cos α ) ( cos α −sin α ; sin α cos α ) = I,
and hence the rotation R is an orthogonal matrix.
Definition 7.10 (elementary Givens rotation)
An n × n elementary Givens rotation is given by the real n × n matrix
Gi,j(α) := I + (cos α − 1)(ej ej^T + ei ei^T) + sin α (ej ei^T − ei ej^T). (7.31)
This matrix coincides with the identity matrix I, except for the entries
(Gi,j(α))(i, i) = (Gi,j(α))(j, j) = cos α, (Gi,j(α))(i, j) = −(Gi,j(α))(j, i) = −sin α,
where i < j.
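As a sketch (not part of the original notes), an elementary Givens rotation can be set up in MATLAB directly from (7.31):

function G = givens_rotation(n,i,j,alpha)
%
% elementary Givens rotation G_{i,j}(alpha) in R^(n x n) for i < j
%
G = eye(n);
G(i,i) = cos(alpha);   G(j,j) = cos(alpha);
G(i,j) = -sin(alpha);  G(j,i) = sin(alpha);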
Since rotations R in R2, given by (7.26) are orthogonal matrices, it is not surprising to find
that elementary Givens rotations are orthogonal matrices.
Lemma 7.11 (elementary Givens rotations are orthogonal)
An n × n elementary Givens rotation Gi,j(α) is an orthogonal matrix, and we have
(Gi,j(α))−1 = (Gi,j(α))^T = Gj,i(α).
Proof of Lemma 7.11. The proof follows relatively straightforwardly with the help of the rotation matrix (7.26) and is left as an exercise. 2
Exercise 93 Show Lemma 7.11.
Lemma 7.12 (basis transformation with an elementary Givens rotation)
Let A = (ak,ℓ) ∈ Rn×n be symmetric and let Gi,j(α), with α ∈ [0, π/2], be the n × n elementary Givens rotation. Let B = (bk,ℓ) := (Gi,j(α))^T A Gi,j(α). Let R be the 2 × 2 orthogonal matrix
R = ( cos α −sin α ; sin α cos α ),
and let
Ā = ( ai,i ai,j ; ai,j aj,j ) if i < j, and Ā = ( aj,j aj,i ; aj,i ai,i ) if i > j,
where the entries in Ā are those of A with indices (i, i), (i, j), (j, i), (j, j). Let
B̄ = ( b̄i,i b̄i,j ; b̄i,j b̄j,j ) := R^T Ā R if i < j, and B̄ = ( b̄j,j b̄j,i ; b̄j,i b̄i,i ) := R^T Ā R if i > j.
Then bk,ℓ = b̄k,ℓ for k, ℓ ∈ {i, j}. In words, the entries with indices (i, i), (i, j), (j, i), (j, j) of (Gi,j(α))^T A Gi,j(α) can be described by executing the basis transformation R^T Ā R on the 2 × 2 matrix Ā that contains the entries of A with indices (i, i), (i, j), (j, i), (j, j).
Exercise 94 Prove Lemma 7.12.
With the help of Lemma 7.12 we can now prove the following central result.
Proposition 7.13 (basis transformation with elementary Givens rotation)
Let A ∈ Rn×n be symmetric and ai,j ≠ 0 for a pair (i, j) of indices i ≠ j, and let Gi,j(α) be the n × n elementary Givens rotation with angle α ∈ [0, π/2] defined by cot(2α) = (ai,i − aj,j)/(2 ai,j). If
B = (Gi,j(α))^T A Gi,j(α),
then bi,j = bj,i = 0 and
[N(B)]² = [N(A)]² − 2 |ai,j|². (7.32)
We see from (7.32) that through a suitable basis transformation with an elementary Givens
rotation, the outer norm of A can be reduced, as intended.
Proof of Proposition 7.13. In this proof we will use Lemma 7.12, which allows us to switch between the pair of matrices A and B and the pair of matrices Ā and B̄ when we are only interested in the entries with indices (i, i), (i, j), (j, i), (j, j). We are using the invariance property of the Frobenius norm (see Lemma 7.8) twice: On the one hand, from Lemma 7.8, we have
‖A‖F = ‖(Gi,j(α))^T A Gi,j(α)‖F = ‖B‖F,
since Gi,j(α) is an orthogonal matrix. On the other hand, since R given by (7.26) is also orthogonal, we have from Lemma 7.8 the equality of the Frobenius norms for the small matrices in (7.27), which means
|ai,i|² + |aj,j|² + 2 |ai,j|² = |bi,i|² + |bj,j|², (7.33)
since bi,j = 0. From this, we can conclude that
[N(B)]² = ∑_{j=1}^{n} ∑_{k=1, k≠j}^{n} |bj,k|² = ‖B‖²F − ∑_{k=1}^{n} |bk,k|² = ‖A‖²F − ∑_{k=1}^{n} |bk,k|²
= [N(A)]² + ∑_{k=1}^{n} ( |ak,k|² − |bk,k|² ) = [N(A)]² − 2 |ai,j|²,
because ak,k = bk,k for all k ≠ i, j and
|ai,i|² + |aj,j|² − ( |bi,i|² + |bj,j|² ) = −2 |ai,j|²
from (7.33). 2
Iteration of this process gives the classical Jacobi method for computing the eigenvalues of a
symmetric matrix.
Definition 7.14 (classical Jacobi method for eigenvalue computation)
Let A ∈ Rn×n be symmetric. The classical Jacobi method (for eigenvalue computation) defines A(1) = A and then proceeds for m = 1, 2, . . . , as follows:
1. For A(m) = (a^{(m)}_{ℓ,k}) determine a^{(m)}_{i,j} with i ≠ j such that
|a^{(m)}_{i,j}| = max_{1≤ℓ,k≤n, ℓ≠k} |a^{(m)}_{ℓ,k}|
and set G(m) = Gi,j(α), where α ∈ [0, π/2] is such that cot(2α) = (a^{(m)}_{i,i} − a^{(m)}_{j,j})/(2 a^{(m)}_{i,j}).
2. Set A(m+1) = (G(m))^T A(m) G(m).
For practical computations we will again avoid matrix multiplication and code the transformations in step 2 directly. Note that a non-diagonal element, which has been mapped to zero in an earlier step, might be changed again in later steps.
Theorem 7.15 (linear convergence of classical Jacobi method)
The classical Jacobi method converges linearly in the outer norm. More precisely, if A in Rn×n is the symmetric original matrix and A(m+1) = (a^{(m+1)}_{i,j}) denotes the new n × n matrix computed in the mth step of the classical Jacobi method, then for any m ∈ N
| ‖A‖²F − ∑_{j=1}^{n} |a^{(m+1)}_{j,j}|² | = [N(A(m+1))]² ≤ ( 1 − 2/(n(n − 1)) )^m [N(A)]². (7.34)
The sequence {A(m)} converges towards a diagonal matrix with the eigenvalues of A as diagonal elements.
We note that for large n the number
0 < η := 1 − 2/(n(n − 1)) < 1
in (7.34) is close to one and therefore the convergence is rather slow.
Proof of Theorem 7.15. We consider one step in the classical Jacobi method: A(m+1) = (a^{(m+1)}_{ℓ,k}) = (G(m))^T A(m) G(m) with the elementary Givens rotation G(m) = Gi,j(α). Since the Frobenius norm of the symmetric matrix A is invariant under orthogonal basis transformations, we have from Lemma 7.8 that
‖A(m+1)‖F = ‖A(m)‖F = ‖A‖F
and therefore, from the definition of the outer norm,
| ‖A‖²F − ∑_{k=1}^{n} |a^{(m+1)}_{k,k}|² | = | ‖A(m+1)‖²F − ∑_{k=1}^{n} |a^{(m+1)}_{k,k}|² | = [N(A(m+1))]². (7.35)
From (7.32), with B = A(m+1) = (G(m))^T A(m) G(m) and A = A(m),
[N(A(m+1))]² = [N(A(m))]² − 2 |a^{(m)}_{i,j}|². (7.36)
Since a^{(m)}_{i,j} was chosen such that |a^{(m)}_{i,j}| ≥ |a^{(m)}_{ℓ,k}| for all ℓ ≠ k, we have
[N(A(m))]² = ∑_{k=1}^{n} ∑_{ℓ=1, ℓ≠k}^{n} |a^{(m)}_{ℓ,k}|² ≤ n(n − 1) |a^{(m)}_{i,j}|² ⇔ |a^{(m)}_{i,j}|² ≥ (1/(n(n − 1))) [N(A(m))]². (7.37)
Thus, applying (7.37) in (7.36), we can bound the outer norm by
[N(A(m+1))]² = [N(A(m))]² − 2 |a^{(m)}_{i,j}|² ≤ [N(A(m))]² − (2/(n(n − 1))) [N(A(m))]² = ( 1 − 2/(n(n − 1)) ) [N(A(m))]². (7.38)
Applying (7.38) successively with m replaced by m − 1, m − 2, . . . , 1, and combining with (7.35), we find (using A(1) = A)
| ‖A‖²F − ∑_{k=1}^{n} |a^{(m+1)}_{k,k}|² | = [N(A(m+1))]² ≤ ( 1 − 2/(n(n − 1)) )^m [N(A)]²,
which proves (7.34) and means that we have linear convergence.
From (7.34), we see that for m → ∞
[N(A(m))]² = ∑_{k=1}^{n} ∑_{ℓ=1, ℓ≠k}^{n} |a^{(m)}_{ℓ,k}|² → 0 and ∑_{k=1}^{n} |a^{(m)}_{k,k}|² → ‖A‖²F = ∑_{k=1}^{n} |λk|²,
where λ1, λ2, . . . , λn denote the eigenvalues of A. (We have used (7.22) in the second limit.) To see that the sequences of diagonal elements {a^{(m)}_{k,k}} converge to the eigenvalues of A (although we cannot predict for a given k to which eigenvalue λℓ the sequence {a^{(m)}_{k,k}} converges), we use Theorem 7.1: Since
A(m) = (G(m−1))^T A(m−1) G(m−1) = . . . = [G(1) G(2) · · · G(m−1)]^T A [G(1) G(2) · · · G(m−1)],
A(m) is obtained from A with the basis transformation Q^{−1}_{m−1} A Qm−1 with the orthogonal matrix Qm−1 := G(1) G(2) · · · G(m−1). Thus A(m) has the same eigenvalues as A. From Theorem 7.1, we know that each eigenvalue λ of A(m) (and of A) satisfies for at least one k
| λ − a^{(m)}_{k,k} | ≤ ∑_{ℓ=1, ℓ≠k}^{n} |a^{(m)}_{k,ℓ}| ≤ √(n(n − 1)) ( ∑_{ℓ=1, ℓ≠k}^{n} |a^{(m)}_{k,ℓ}|² )^{1/2} ≤ √(n(n − 1)) N(A(m)), (7.39)
where we have used in the second step the first estimate from Example 2.31. Since N(A(m)) → 0 as m → ∞, it is clear from (7.39) that
lim_{m→∞} a^{(m)}_{k,k} = λ
for some index k. This completes the proof. 2
Example 7.16 (Jacobi method for the computation of eigenvalues)
Consider the symmetric 3 × 3 matrix
A = ( 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ),
which has the eigenvalues λ1 = 3, λ2 = 2, and λ3 = 1 (see homework). We compute the first step of the Jacobi method for eigenvalue computation by hand. Above the diagonal of A(1) = A, we have only one non-zero entry, a^{(1)}_{1,3} = 1/2. Thus we find from (7.29) α = π/4 since a^{(1)}_{1,1} = a^{(1)}_{3,3} = 3/2. Thus our elementary Givens rotation is given by
G(1) = G1,3(π/4) = ( cos(π/4) 0 −sin(π/4) ; 0 1 0 ; sin(π/4) 0 cos(π/4) ) = ( 1/√2 0 −1/√2 ; 0 1 0 ; 1/√2 0 1/√2 ).
Thus the first step of the Jacobi method is given by
A(2) = (G(1))^T A(1) G(1)
= ( 1/√2 0 1/√2 ; 0 1 0 ; −1/√2 0 1/√2 ) ( 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ) ( 1/√2 0 −1/√2 ; 0 1 0 ; 1/√2 0 1/√2 )
= ( 1/√2 0 1/√2 ; 0 1 0 ; −1/√2 0 1/√2 ) ( √2 0 −1/√2 ; 0 3 0 ; √2 0 1/√2 )
= ( 2 0 0 ; 0 3 0 ; 0 0 1 ).
In this example we find the exact eigenvalues after only one iteration step. 2
Exercise 95 Consider the symmetric matrix
A = ( 2 0 0 ; 0 2 −1 ; 0 −1 2 ).
(a) Apply one step of the Jacobi method for eigenvalue computation.
(b) Comment on the result.
The Jacobi method for the computation of eigenvalues can be implemented with the MATLAB
code below, where we compute cos α and sin α with the help of (7.29) and the trigonometric
identities (7.30).
function [d,K] = jacobi_eigenvalue(A,J)
%
% Jacobi method for the computation of eigenvalues
%
% input: A = symmetric real n by n matrix
% J = upper bound for number of iterations
%
% output: d = diagonal of symmetric n by n matrix whose diagonal entries
% are approximations of the eigenvalues of A
% K = number of iterations used until required accuracy was reached
%
n = size(A,2);
%
for j = 1:J
C = A;
for i = 1:n;
C(i:n,i) = 0;
end
if max(max(abs(C))) < 10^(-14)
break
else
[R,Q] = find(abs(C) == max(max(abs(C))));
r = R(1,1);
q = Q(1,1);
if A(r,r) == A(q,q)
alpha = pi/4;
cosalpha = cos(alpha);
sinalpha = sin(alpha);
else
tan2alpha = 2*A(r,q) / (A(r,r) - A(q,q));
cos2alpha = 1 / sqrt(1 +(tan2alpha)^2);
cosalpha = sqrt((1+cos2alpha)/2);
sinalpha = sqrt((1-cos2alpha)/2);
end
B = A;
B(1:n,r) = A(1:n,r) * cosalpha + A(1:n,q) * sinalpha;
B(1:n,q) = - A(1:n,r) * sinalpha + A(1:n,q) * cosalpha;
A = B;
A(r,1:n) = cosalpha * B(r,1:n) + sinalpha * B(q,1:n);
A(q,1:n) = - sinalpha * B(r,1:n) + cosalpha * B(q,1:n);
end
end
K = j-1;
d = diag(A);
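As a usage illustration, the computation in Example 7.17 below can be run with a call of the following form (the numbers quoted in the comment are the ones reported in the example):

% Jacobi method applied to the 4 by 4 matrix of Example 7.17,
% with an upper bound of J = 10 iterations.
A = [1 0 0.25 0.25; 0 1 0 0.25; 0.25 0 1 0; 0.25 0.25 0 1];
[d,K] = jacobi_eigenvalue(A,10);
% d approximates the eigenvalues 1.4045, 1.1545, 0.8455, 0.5955 of A
% (in some order), and K = 8 iterations are used.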
We compute the eigenvalues of the symmetric 4 × 4 matrix from Example 5.19 by executing
the algorithm above in MATLAB.
Example 7.17 (Jacobi method for eigenvalue computation)
Consider the symmetric matrix from Example 5.19, given by
A = ( 1 0 0.25 0.25 ; 0 1 0 0.25 ; 0.25 0 1 0 ; 0.25 0.25 0 1 ).
First we find its eigenvalues directly by determining the roots of the characteristic polynomial,
and then we execute the MATLAB code above to find the eigenvalues numerically.
We compute the characteristic polynomial by expansion with respect to the first row:
p(A, λ) = det ( λ−1 0 −1/4 −1/4 ; 0 λ−1 0 −1/4 ; −1/4 0 λ−1 0 ; −1/4 −1/4 0 λ−1 )
= (λ − 1) det ( λ−1 0 −1/4 ; 0 λ−1 0 ; −1/4 0 λ−1 ) − (1/4) det ( 0 λ−1 −1/4 ; −1/4 0 0 ; −1/4 −1/4 λ−1 ) + (1/4) det ( 0 λ−1 0 ; −1/4 0 λ−1 ; −1/4 −1/4 0 )
= (λ − 1) [ (λ − 1)³ − (1/4)² (λ − 1) ] − (1/4) [ (−1/4)³ + (1/4) (λ − 1)² ] + (1/4) [ −(1/4) (λ − 1)² ]
= (λ − 1)⁴ − (3/16) (λ − 1)² + 1/256
= ( (λ − 1)² − 3/32 )² − 9/1024 + 1/256
= ( (λ − 1)² − 3/32 )² − 5/1024
= ( (λ − 1)² − (3 + √5)/32 ) ( (λ − 1)² − (3 − √5)/32 )
= ( λ − 1 − √((3 + √5)/32) ) ( λ − 1 + √((3 + √5)/32) ) ( λ − 1 − √((3 − √5)/32) ) ( λ − 1 + √((3 − √5)/32) ),
where we have used the binomial formulas (a+ b)2 = a2 +2 a b+ b2 and a2− b2 = (a− b) (a+ b).
Thus we find that the eigenvalues are
λ1 = 1 + √((3 + √5)/32) ≈ 1.4045, λ2 = 1 + √((3 − √5)/32) ≈ 1.1545,
λ3 = 1 − √((3 − √5)/32) ≈ 0.8455, λ4 = 1 − √((3 + √5)/32) ≈ 0.5955.
Executing the Jacobi method with the MATLAB code given above with a maximum of J = 10 iterations, we find that the algorithm breaks off after K = 8 iterations, and that the approximations of the eigenvalues in the iterations m = 1, 2, . . . , 8 are given by the following diagonal entries of the matrix A(m+1) = (a^{(m+1)}_{i,k}) listed in the table below:

m                   1       2       3       4       5       6       7       8
a^{(m+1)}_{1,1}   1.2500  1.2500  1.2795  1.2795  1.2795  1.2795  1.1545  1.1545
a^{(m+1)}_{2,2}   1.0000  1.2500  1.2500  1.1677  0.9217  0.7205  0.7205  0.5955
a^{(m+1)}_{3,3}   0.7500  0.7500  0.7500  0.8323  1.0783  1.2795  1.4045  1.4045
a^{(m+1)}_{4,4}   1.0000  0.7500  0.7205  0.7205  0.7205  0.7205  0.7205  0.8455
After only 8 iterations the eigenvalues have been pretty well approximated. 2
Since in every step of the classical Jacobi method we have to search n(n − 1)/2 = O(n²) elements, there are cheaper versions. For example, the cyclic Jacobi method visits all non-diagonal entries regardless of their sizes in a cyclic way, that is, the index pair (i, j) becomes (1, 2), (1, 3), . . . , (1, n), (2, 3), . . . , (2, n), (3, 4), . . . and then the process starts again. A transformation only takes place if ai,j ≠ 0. The cyclic version is convergent.
For small values of ai,j it is not efficient to perform a transformation. Thus, we can restrict ourselves to elements ai,j having a square above a certain threshold, for example N(A)/(2n²).
Finally, let us have a look at the eigenvectors of A. From Theorem 7.15, we know that the sequence of matrices {A(m)} converges towards a diagonal matrix with the eigenvalues of A on the diagonal. Thus let
D = diag(λ1, λ2, . . . , λn),
where λk is the limit of the diagonal element sequence {a^{(m)}_{k,k}} from the matrices A(m) in the classical Jacobi method. (Note that here we cannot make any assumption on the ordering of the eigenvalues.) A(m+1) tends to D for m → ∞, and
A(m+1) = (G(m))^T A(m) G(m) = (G(m))^T · · · (G(2))^T (G(1))^T A G(1) G(2) · · · G(m) = Qm^T A Qm
with the orthogonal matrix
Qm := G(1) · · · G(m).
For sufficiently large m, we have
A(m+1) = Qm^T A Qm ≈ D ⇒ A Qm ≈ Qm D. (7.40)
If uℓ is the ℓth column vector of Qm, then (7.40) implies that
Auℓ ≈ λℓ uℓ,
and we see that the columns of Qm define approximate eigenvectors of A.
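In code, the matrix Qm can be accumulated alongside the iteration. A sketch of the necessary modification to the function jacobi_eigenvalue above (illustrative; the variable name V is chosen to avoid a clash with the index matrix Q used by find in that code):

% Before the loop over j in jacobi_eigenvalue:
V = eye(n);                          % will accumulate V = G^(1) G^(2) ... G^(m)
% Inside the loop, once cosalpha and sinalpha are known:
G = eye(n);
G(r,r) = cosalpha;   G(q,q) = cosalpha;
G(r,q) = -sinalpha;  G(q,r) = sinalpha;
V = V * G;                           % columns of V approximate eigenvectors of A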
7.5 Householder Reduction to Hessenberg Form
In this section we learn how a square matrix A can be brought into so-called upper Hessenberg form. While this is not directly useful for determining the eigenvalues of A, we will see that it serves as a starting point for the QR algorithm discussed in the subsequent section.
Definition 7.18 (upper Hessenberg form)
We say that a matrix A = (ai,j) is an upper Hessenberg matrix (or is upper Hessenberg) if all entries below the first lower sub-diagonal are zero, that is, ai,j = 0 for all i > j + 1:
A = ( ∗ ∗ · · · ∗ ∗ ∗
      ∗ ∗ · · · ∗ ∗ ∗
      0 ∗ · · · ∗ ∗ ∗
      0 0 · · · ∗ ∗ ∗
      .         .  .
      0 0 · · · 0 ∗ ∗ ).
From Theorem 2.21 we know that every square matrix A ∈ Cn×n has a Schur factorisation, that is, there exists a unitary matrix S (that is, S−1 = S∗) such that
S∗ A S = U ⇔ A = S U S∗,
where U is an upper triangular matrix. Since A and U are related via a basis transformation or similarity transformation, the eigenvalues of A and the eigenvalues of U are the same. Since U = (ui,j) is upper triangular, we have that
det(λ I − U) = (λ − u1,1) (λ − u2,2) . . . (λ − un,n),
and hence the eigenvalues of A and U are just the diagonal entries of U. One way to calculate the eigenvalues of A is to bring A into upper triangular form with the help of a basis transformation or similarity transformation. This reduction usually requires two stages.
The first step is to reduce A to upper Hessenberg form. In the second step we apply an iteration that generates a sequence of upper Hessenberg matrices that converge to a matrix in upper triangular form. This is schematically illustrated below for 5 × 5 matrices: step 1 maps a full matrix to upper Hessenberg form, and step 2 maps the upper Hessenberg form to upper triangular form.
( ∗ ∗ ∗ ∗ ∗            ( ∗ ∗ ∗ ∗ ∗            ( ∗ ∗ ∗ ∗ ∗
  ∗ ∗ ∗ ∗ ∗   step 1     ∗ ∗ ∗ ∗ ∗   step 2     0 ∗ ∗ ∗ ∗
  ∗ ∗ ∗ ∗ ∗   −−−→       0 ∗ ∗ ∗ ∗   −−−→       0 0 ∗ ∗ ∗
  ∗ ∗ ∗ ∗ ∗              0 0 ∗ ∗ ∗              0 0 0 ∗ ∗
  ∗ ∗ ∗ ∗ ∗ )            0 0 0 ∗ ∗ )            0 0 0 0 ∗ )
In this section we discuss the Householder reduction, which performs step 1 above and
reduces the original matrix to a matrix in upper Hessenberg form with the help of a sequence
of basis transformations (or similarity transformations) with Householder matrices H(w) =
I − 2 w w^*, where ‖w‖_2 = 1. Remember that Householder matrices are unitary and were
introduced in Section 2.3 in the context of the Schur factorization.
Theorem 7.19 (Householder reduction to upper Hessenberg form)
Let A be an n × n matrix. Then there exist Householder matrices H(w_1), H(w_2), . . . ,
H(w_{n−2}), such that S^* A S, with S := H(w_1) H(w_2) · · · H(w_{n−2}), is in upper Hessenberg
form.
Proof of Theorem 7.19. The proof is given by induction. Let A^{(0)} = A and A^{(k)} =
H(w_k) A^{(k−1)} H(w_k) for k ≥ 1. Suppose that

A^{(k−1)} = [ A_k^{(k−1)} | B^{(k−1)} ]
            [ C^{(k−1)}   | D^{(k−1)} ] ,        (7.41)

where the principal sub-matrix A_k^{(k−1)} ∈ C^{k×k} of order k is in upper Hessenberg form and
C^{(k−1)} = (0, 0, . . . , 0, c_k) is an (n − k) × k matrix. (The matrices B^{(k−1)} and D^{(k−1)} are an
arbitrary k × (n − k) and an arbitrary (n − k) × (n − k) matrix, respectively.) These assumptions
obviously hold for k = 1.
Let H_k = H(u_k) be an (n − k) × (n − k) Householder matrix such that H_k c_k = c_k e_1, where
the scalar c_k = ±‖c_k‖_2 (e_1^* c_k)/|e_1^* c_k| and e_1 is the first Euclidean unit vector in R^{n−k}. Define the vector
w_k := (0^T, u_k^T)^T in C^n, where 0 is the zero vector in C^k. Then it is easily seen that

H(w_k) = I − 2 w_k w_k^* = [ I | 0   ]
                           [ 0 | H_k ] .

Forming the basis transformation with the Householder matrix H(w_k) yields

(H(w_k))^* A^{(k−1)} H(w_k) = H(w_k) A^{(k−1)} H(w_k) = [ A_k^{(k−1)}    | B^{(k−1)} H_k     ]
                                                        [ H_k C^{(k−1)} | H_k D^{(k−1)} H_k ] =: A^{(k)}.

From the choice of the (n − k) × (n − k) Householder matrix H_k, we have H_k C^{(k−1)} =
(0, 0, . . . , 0, c_k e_1). Decomposing A^{(k)} in the form (7.41) with k − 1 replaced by k, we find
that A_{k+1}^{(k)} is an upper Hessenberg matrix and C^{(k)} is an (n − k − 1) × (k + 1) matrix of the
form (0, 0, . . . , 0, c_{k+1}). This completes the proof. □
We can easily construct the algorithm from the above proof. Consider step k of the Householder
reduction to upper Hessenberg form. From the above proof we require a vector u_k ∈ C^{n−k} such
that H(u_k) is an (n − k) × (n − k) Householder matrix with H(u_k) c_k = c_k e_1, where the scalar c_k :=
‖c_k‖_2 (e_1^* c_k)/|e_1^* c_k|. We note that ‖c_k‖_2 = ‖c_k e_1‖_2 and c_k^* (c_k e_1) = (c_k e_1)^* c_k, which
means that the assumptions of Lemma 2.20 are satisfied. Consulting the proof of Lemma 2.20
in Section 2.2, we define u_k ∈ R^{n−k} via

v_k := c_k − c_k e_1   and   u_k := v_k / ‖v_k‖_2.

This gives u_k^T u_k = 1, and from the proof of Lemma 2.20 with x = c_k and y = c_k e_1 we can
conclude that H(u_k) c_k = c_k e_1 and H(u_k) e_1 = c_k^{−1} c_k.
Example 7.20 (Householder reduction to upper Hessenberg form)
Apply the Householder reduction to the matrix

A = [ 1 0 4  0 ]
    [ 0 3 3  4 ]
    [ 4 3 3  4 ]
    [ 0 4 4 −3 ]

to bring it into upper Hessenberg form.
Column 1:

c_1 = (0, 4, 0)^T   and the scalar   c_1 = ‖c_1‖_2 = 4,

v_1 = c_1 − c_1 e_1 = (0, 4, 0)^T − 4 (1, 0, 0)^T = (−4, 4, 0)^T   and   ‖v_1‖_2 = 4 √2,

u_1 = v_1 / ‖v_1‖_2 = (1/√2) (−1, 1, 0)^T.

Thus the 3 × 3 Householder matrix is given by (note that 2 u_1 u_1^T = (−1, 1, 0)^T (−1, 1, 0),
since 2 (1/√2)² = 1)

H_1(u_1) = I − 2 u_1 u_1^T = [ 1 0 0 ]   [ −1 ]                [ 0 1 0 ]
                             [ 0 1 0 ] − [  1 ] (−1, 1, 0)  =  [ 1 0 0 ] ,
                             [ 0 0 1 ]   [  0 ]                [ 0 0 1 ]

and we have indeed

H_1(u_1) c_1 = [ 0 1 0 ] [ 0 ]   [ 4 ]
               [ 1 0 0 ] [ 4 ] = [ 0 ] .
               [ 0 0 1 ] [ 0 ]   [ 0 ]
The corresponding 4 × 4 Householder matrix is then given by

H(w_1) = [ 1 0 0 0 ]
         [ 0 0 1 0 ]
         [ 0 1 0 0 ]
         [ 0 0 0 1 ] ,

and we find that

H(w_1)^T A H(w_1) = [ 1 0 0 0 ] [ 1 0 4  0 ] [ 1 0 0 0 ]
                    [ 0 0 1 0 ] [ 0 3 3  4 ] [ 0 0 1 0 ]
                    [ 0 1 0 0 ] [ 4 3 3  4 ] [ 0 1 0 0 ]
                    [ 0 0 0 1 ] [ 0 4 4 −3 ] [ 0 0 0 1 ]

                  = [ 1 0 4  0 ] [ 1 0 0 0 ]      [ 1 4 0  0 ]
                    [ 4 3 3  4 ] [ 0 0 1 0 ]      [ 4 3 3  4 ]
                    [ 0 3 3  4 ] [ 0 1 0 0 ]  =   [ 0 3 3  4 ] .
                    [ 0 4 4 −3 ] [ 0 0 0 1 ]      [ 0 4 4 −3 ]
Column 2: This is left as an exercise. □
Exercise 96 Perform the second step of the Householder reduction in Example 7.20.
The algorithm for the Householder reduction of a real n × n matrix to upper Hessenberg form
is given by the following MATLAB code:
function [B] = householder_hessenberg(A)
%
% algorithm executes the Householder reduction of any matrix
% to upper Hessenberg form
% input: A = real n by n matrix
%
% output: B = corresponding "reduced" real n by n matrix in
% upper Hessenberg form
%
n = size(A,2);
for k = 1:n-2
% column to be reduced
x = A(k+1:n,k);
% e is unit vector e_1
e = zeros(n-k,1);
e(1) = 1;
v = x - norm(x) * e;
% if x is already a positive multiple of e_1, no reflection is needed
if norm(v) == 0, continue, end
% u_k for Householder matrix H(u_k)
u = v / norm(v);
% matrix multiplication H(u_k)*A
B = A;
A(k+1:n,k:n) = B(k+1:n,k:n) - 2 * u * (u’ * B(k+1:n,k:n));
% matrix multiplication (H(u_k)*A)*H(u_k)
B = A;
A(1:n,k+1:n) = B(1:n,k+1:n) - 2 * (B(1:n,k+1:n) * u) * u’;
end
B = A;
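A minimal test of this function on the matrix from Example 7.20 might look as follows (our
own script; the variable names are arbitrary). The first column of the result must be
(1, 4, 0, 0)^T, as computed by hand above, and the eigenvalues must be unchanged, since only
similarity transformations are applied:

A = [1 0 4 0; 0 3 3 4; 4 3 3 4; 0 4 4 -3];
B = householder_hessenberg(A)
% entries below the first sub-diagonal should vanish (up to rounding)
disp(max(max(abs(tril(B,-2)))))
% the eigenvalues are preserved by the similarity transformations
disp(sort(eig(A)) - sort(eig(B)))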
Note that, since we do not require the product of the Householder matrices, we do not explicitly
form the Householder matrices, thereby reducing the storage requirements significantly.

In the kth step of this method the main work lies in the two update steps (each a matrix-vector
multiplication followed by a rank-one correction; see the MATLAB code above), and each of
these requires O(n(n − k)) floating point operations. This is done for k = 1, 2, . . . , n − 2.
Therefore, the overall number of floating point operations is O(n³).
7.6 QR Algorithm
This method aims at computing all eigenvalues of a given matrix A ∈ R^{n×n} simultaneously.
It benefits from the Hessenberg form of a matrix.
Definition 7.21 (QR method for the computation of eigenvalues)
Let A ∈ R^{n×n} and define A_0 := A. For m = 0, 1, 2, . . . decompose A_m in the form A_m =
Q_m R_m with an orthogonal matrix Q_m and an upper triangular matrix R_m (that is, we use
the QR factorization of A_m). Then form the swapped product

A_{m+1} := R_m Q_m.        (7.42)
Since Q_m is an orthogonal matrix, from A_m = Q_m R_m we have Q_m^T A_m = R_m. Obviously, from
(7.42),

A_{m+1} = Q_m^T A_m Q_m = Q_m^{−1} A_m Q_m.        (7.43)

The representation (7.43) shows that all matrices A_m in the QR method have the same eigen-
values as the initial matrix A_0 = A.
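With MATLAB's built-in qr function, one step of this iteration reads as follows (a two-line
sketch for illustration; the full implementation used in these notes, with an explicit QR
factorization routine, is given at the end of this section):

[Q,R] = qr(A);   % QR factorization A_m = Q_m * R_m
A = R * Q;       % swapped product A_{m+1} = R_m * Q_m, see (7.42)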
Theorem 7.22 (properties of the QR transformation)
Let A_m be the matrices computed in the QR method. The QR transformation A_m ↦ A_{m+1}
respects the upper Hessenberg form of a matrix, that is, if A_m is an upper Hessenberg matrix,
then A_{m+1} is also an upper Hessenberg matrix. In particular, a given symmetric tridiagonal
matrix A remains a symmetric tridiagonal matrix under the QR transformation. If A ∈
R^{n×n} is in upper Hessenberg form, then its QR factorization can be computed in O(n²)
operations.
Proof of Theorem 7.22. From (7.43) we can immediately conclude that if A_m is symmetric,
so is A_{m+1}. Indeed, if A_m = A_m^T, then

A_{m+1}^T = (Q_m^T A_m Q_m)^T = Q_m^T A_m^T (Q_m^T)^T = Q_m^T A_m Q_m = A_{m+1}.

If A_m is in upper Hessenberg form then we can compute the QR factorization of A_m in n − 1
steps. In each step, a transformation is performed by multiplication from the left with a
Householder matrix, where in the jth step the Householder matrix maps the entry with index
(j + 1, j) to zero. If we denote the Householder matrix in the jth step by H_{j+1,j} then we have
that

H_{n,n−1} · · · H_{3,2} H_{2,1} A_m = R_m

with an upper triangular matrix R_m. Since the Householder matrices are orthogonal matrices,

A_m = (H_{2,1}^T H_{3,2}^T · · · H_{n,n−1}^T) R_m = Q_m R_m   with   Q_m = H_{2,1}^T H_{3,2}^T · · · H_{n,n−1}^T.
Thus the matrix A_{m+1} in the QR method is given by

A_{m+1} = R_m Q_m = R_m (H_{2,1}^T H_{3,2}^T · · · H_{n,n−1}^T).        (7.44)

From the properties of the Householder matrices H_{j+1,j}, right-multiplication with the transposes
of the Householder matrices only modifies entries of R_m with indices (i, j), where i ≤ j + 1.
Thus A_{m+1} is also an upper Hessenberg matrix.

If A is a symmetric tridiagonal matrix, then we know from the beginning of this proof that A_m
is also symmetric. In particular, a symmetric tridiagonal matrix is in upper Hessenberg form,
and thus A_m is in upper Hessenberg form. Since A_m is also symmetric, we see that A_m is a
symmetric tridiagonal matrix. □
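The O(n²) count in Theorem 7.22 can be made concrete. The following sketch (our own
illustration, not part of the notes) performs one QR step for an upper Hessenberg matrix using
plane rotations in place of the Householder matrices H_{j+1,j} from the proof; both annihilate
the same entries (j + 1, j), and the matrix Q_m is never formed explicitly:

function A = qr_step_hessenberg(A)
% one step A_m -> A_{m+1} = R_m * Q_m of the QR method for an
% upper Hessenberg matrix A_m, in O(n^2) operations: in the jth
% step a rotation in the (j,j+1) plane maps the entry (j+1,j)
% to zero, playing the role of H_{j+1,j} in Theorem 7.22
n = size(A,1);
c = zeros(n-1,1); s = zeros(n-1,1);
for j = 1:n-1
    r = hypot(A(j,j), A(j+1,j));
    if r == 0
        c(j) = 1; s(j) = 0;        % nothing to annihilate
    else
        c(j) = A(j,j)/r; s(j) = A(j+1,j)/r;
    end
    G = [c(j) s(j); -s(j) c(j)];
    A(j:j+1,j:n) = G * A(j:j+1,j:n);      % A becomes R_m
end
for j = 1:n-1
    G = [c(j) s(j); -s(j) c(j)];
    A(1:j+1,j:j+1) = A(1:j+1,j:j+1) * G'; % forms R_m * Q_m
end

Each rotation touches only two rows (or two columns), so both loops cost O(n) per step and
O(n²) in total, and the result is again upper Hessenberg, in accordance with the theorem.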
For the analysis of the convergence of the QR method, we need two auxiliary results. The first
one concerns the uniqueness of the QR factorization of a matrix A.
Lemma 7.23 (uniqueness of QR factorization)
Let A ∈ R^{n×n} be non-singular. If A = Q R with an orthogonal matrix Q and an upper
triangular matrix R, then Q and R are unique up to the signs of the diagonal entries of R
(and the corresponding signs of the columns of Q).
Proof of Lemma 7.23. Assume that we have two QR decompositions A = Q_1 R_1 and
A = Q_2 R_2. This gives

Q_1 R_1 = Q_2 R_2  ⇔  R_1 R_2^{−1} = Q_1^T Q_2 =: S,

which shows that S is an orthogonal matrix and an upper triangular matrix. (Note that the
inverse of an upper triangular matrix is also upper triangular and that the product of two upper
triangular matrices is also an upper triangular matrix.) The inverse S^{−1} of the upper triangular
matrix S is also upper triangular. Since S^{−1} = S^T as S is an orthogonal matrix, we also have
that S^T is an upper triangular matrix. Hence we have that S is also a lower triangular matrix.
Thus S has to be a diagonal matrix. An orthogonal diagonal matrix can only have diagonal
entries ±1. If we fix the signs of the diagonal entries of R_1 to be the same as those of R_2, it
follows that S = (s_{i,j}) can only be the identity (since s_{i,i} = (R_1 R_2^{−1})_{i,i} = (R_1)_{i,i} (R_2^{−1})_{i,i} because
R_1 and R_2 are upper triangular, and sign((R_2^{−1})_{i,i}) = sign((R_2)_{i,i})). Then

R_1 R_2^{−1} = Q_1^T Q_2 = I  ⇒  R_1 = R_2 and Q_1 = Q_2,

and we see that, apart from the signs of the diagonal entries, the matrices Q and R of the QR
factorization are uniquely determined. □
The second lemma is technical and will be needed in the proof of the convergence of the QR
method.
Lemma 7.24 (technical lemma for QR method)
Let D := diag(d_1, d_2, . . . , d_n) ∈ R^{n×n} be a diagonal matrix with

|d_1| > |d_2| > · · · > |d_j| > |d_{j+1}| > · · · > |d_n| > 0,

and let L = (ℓ_{i,j}) ∈ R^{n×n} be a normalized lower triangular matrix. Let L_m denote the lower
triangular matrix with entries ℓ_{i,j} (d_i/d_j)^m for i ≥ j. Then we have

L_m D^m = D^m L  ⇔  L_m = D^m L (D^{−1})^m,   m ∈ N_0.        (7.45)

Furthermore, L_m converges linearly to the identity matrix for m → ∞.
Proof of Lemma 7.24. First we observe that multiplication of any matrix B from the left
or right with a diagonal matrix D leaves any zero entries in B invariant. Therefore,
since D^m and (D^{−1})^m are diagonal matrices and L is a normalized lower triangular matrix,
the matrix D^m L (D^{−1})^m is also a lower triangular matrix. Next we compute its entries: since
D^m = diag(d_1^m, d_2^m, . . . , d_n^m) and (D^{−1})^m = diag(d_1^{−m}, d_2^{−m}, . . . , d_n^{−m}), we obtain from matrix
multiplication

(D^m L (D^{−1})^m)_{i,j} = d_i^m ℓ_{i,j} d_j^{−m} = ℓ_{i,j} (d_i/d_j)^m,   1 ≤ j ≤ i ≤ n,

and we see that indeed D^m L (D^{−1})^m = L_m. We have (L_m)_{i,i} = ℓ_{i,i} = 1, since L is a normalized
lower triangular matrix, and, since |d_i| < |d_j| if j < i,

|(L_m)_{i,j}| = |ℓ_{i,j}| |d_i/d_j|^m → 0 for m → ∞,   1 ≤ j < i ≤ n,

which shows that L_m converges linearly to the identity matrix. □
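A quick numerical illustration of Lemma 7.24 (our own sketch; the concrete matrices D and
L are arbitrary choices satisfying the assumptions): the matrices L_m = D^m L (D^{−1})^m approach
the identity at a linear rate governed by the ratios |d_i/d_j|:

% illustration of Lemma 7.24
D = diag([4 2 1]);                  % |d_1| > |d_2| > |d_3| > 0
L = [1 0 0; 0.5 1 0; -0.3 0.7 1];   % normalized lower triangular
for m = 1:5
    Lm = D^m * L / D^m;             % L_m = D^m * L * (D^{-1})^m
    fprintf('m = %d, ||L_m - I|| = %.3e\n', m, norm(Lm - eye(3)));
end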
Theorem 7.25 (convergence of QR method)
Let A ∈ R^{n×n} be non-singular, and assume that A has n real eigenvalues λ_1, λ_2, . . . , λ_n ∈
R and n linearly independent corresponding real eigenvectors w_1, w_2, . . . , w_n ∈ R^n. Let
T ∈ R^{n×n} be the matrix of corresponding eigenvectors of A, that is, T = (w_1, w_2, . . . , w_n),
and assume that T^{−1} possesses an LU factorization without pivoting. Then the matrices
A_m = (a^{(m)}_{i,j}) created by the QR method have the following properties:

(i) The sub-diagonal elements converge to zero, that is, a^{(m)}_{i,j} → 0 for m → ∞ for all i > j.

(ii) The sequences {A_{2m}} and {A_{2m+1}} each converge to an upper triangular matrix.

(iii) The diagonal elements converge to the eigenvalues of A, that is, a^{(m)}_{i,i} → λ_{π(i)} for
m → ∞ for all 1 ≤ i ≤ n, where π : {1, 2, . . . , n} → {1, 2, . . . , n} is some permutation
of the numbers 1, 2, . . . , n.

Furthermore, the sequence of the matrices Q_m, created by the QR method, converges to
an orthogonal diagonal matrix, that is, to a diagonal matrix having only 1 or −1 on the
diagonal.
Proof of Theorem 7.25. From (7.43), we already know that all generated matrices A_m have
the same eigenvalues as A. Let Q_m and R_m be the matrices generated in the mth step of the
QR method (that is, A_m = Q_m R_m and A_{m+1} = R_m Q_m), and introduce the notation

R̂_m := R_m R_{m−1} · · · R_1 R_0   and   Q̂_m := Q_0 Q_1 · · · Q_{m−1} Q_m.

From the definition of Q̂_{m−1} as a product of unitary matrices we see that Q̂_{m−1} is a unitary
matrix, and from the definition of R̂_{m−1} as a product of upper triangular matrices, we see that
the matrix R̂_{m−1} is an upper triangular matrix.

From repeated application of (7.43) we have

A_m = Q_{m−1}^T A_{m−1} Q_{m−1} = Q_{m−1}^T Q_{m−2}^T A_{m−2} Q_{m−2} Q_{m−1} = · · · = Q̂_{m−1}^T A Q̂_{m−1},

that is,

A_m = Q̂_{m−1}^T A Q̂_{m−1}.        (7.46)

By induction we can also show that the powers of A satisfy

A^m = Q̂_{m−1} R̂_{m−1}.        (7.47)

Indeed, from the QR factorization A = Q_0 R_0 = Q̂_0 R̂_0, (7.47) holds true for m = 1.
Assume that (7.47) holds true for m. Then for m + 1

A^{m+1} = A A^m = A Q̂_{m−1} R̂_{m−1} = Q̂_{m−1} A_m R̂_{m−1} = Q̂_{m−1} (Q_m R_m) R̂_{m−1} = Q̂_m R̂_m,

where we have used Q̂_{m−1} A_m = A Q̂_{m−1} (from (7.46)) in the third step and the QR factorization
A_m = Q_m R_m in the fourth step.

Since Q̂_{m−1} is unitary and R̂_{m−1} is upper triangular, (7.47) is a QR factorization A^m =
Q̂_{m−1} R̂_{m−1} of A^m. Since A is non-singular, the mth power A^m is also non-singular, and
by Lemma 7.23 the QR factorization of A^m is unique, if we assume that all upper triangular
matrices R̂_i have positive diagonal elements. Thus (7.47) is the unique QR factorization of A^m,
where the upper triangular matrix R̂_{m−1} has positive diagonal entries.

If we define D = diag(λ_1, λ_2, . . . , λ_n), then we have the relation

A T = T D  ⇔  A = T D T^{−1},

which leads to

A^m = T D^m T^{−1}.

By assumption, we have an LU factorization of T^{−1}, that is, T^{−1} = L U, where L is a normalized
lower triangular matrix and U is an upper triangular matrix. This leads to

A^m = T D^m L U.        (7.48)
By Lemma 7.24 there is a sequence {L_m} of lower triangular matrices, defined by L_m :=
D^m L (D^{−1})^m, which converges to I and which satisfies

D^m L = L_m D^m.        (7.49)

Substitution of (7.49) into (7.48) leads to

A^m = T L_m D^m U.        (7.50)

Using the QR factorization

T = Q R        (7.51)

of T with positive diagonal elements in R, we derive from substituting (7.51) into (7.50)

A^m = Q R L_m D^m U.        (7.52)

Since the matrices L_m converge to I as m → ∞, the matrices R L_m have to converge to R as
m → ∞. If we now compute a QR factorization of R L_m as

R L_m = Q̃_m R̃_m,        (7.53)

again with positive diagonal elements in R̃_m, then we can conclude from the uniqueness of
the QR factorization that Q̃_m has to converge to the identity matrix. Substituting (7.53) into
(7.52), the equation (7.52) can hence be rewritten as

A^m = Q Q̃_m R̃_m D^m U.        (7.54)

Next, let us introduce the diagonal and orthogonal matrices

∆_m := diag(s_1^m, . . . , s_n^m)   with   s_i^m := sign(λ_i^m u_{i,i}),

where u_{i,i} are the diagonal elements of U. Then, because of ∆_m² = I, we can rewrite the
representation of A^m in (7.54) as

A^m = (Q Q̃_m ∆_m) (∆_m R̃_m D^m U).        (7.55)

Since the signs of the diagonal elements of the upper triangular matrix R̃_m D^m U coincide with
those of D^m U and hence with those of ∆_m, we see that (7.55) is a QR factorization of A^m with
positive diagonal elements in the upper triangular matrix.

If we compare this QR factorization with the QR factorization (see (7.47))

A^m = Q̂_{m−1} R̂_{m−1},

we note that in both cases the upper triangular matrix in the QR factorization has positive
diagonal entries. We can thus conclude from the uniqueness of the QR factorization that

Q̂_{m−1} = Q Q̃_m ∆_m.        (7.56)
Since Q̃_m converges to the identity matrix, the matrices Q̂_m converge to Q apart from the signs
of the columns. From (7.46), we have

A_m = Q̂_{m−1}^T A Q̂_{m−1},

and from this and (7.56) we can derive (using that Q̃_m^T = Q̃_m^{−1}, since Q̃_m is orthogonal, and
∆_m^{−1} = ∆_m)

A_m = Q̂_{m−1}^{−1} A Q̂_{m−1} = (Q Q̃_m ∆_m)^{−1} A (Q Q̃_m ∆_m)
    = ∆_m Q̃_m^{−1} Q^{−1} A Q Q̃_m ∆_m = ∆_m Q̃_m^{−1} (R D R^{−1}) Q̃_m ∆_m,

with Q̃_m^{−1} → I as m → ∞, where we have used in the last step that Q = T R^{−1} from (7.51)
and hence Q^{−1} A Q = R (T^{−1} A T) R^{−1} = R D R^{−1}. If m tends to infinity then,
because of ∆_{2m} = ∆_0 and ∆_{2m+1} = ∆_1, we can conclude the convergence of {A_{2m}} and {A_{2m+1}}
to ∆_0 R D R^{−1} ∆_0 and ∆_1 R D R^{−1} ∆_1, respectively. Both limit matrices are upper triangular
and have the same elements on the diagonal, which proves (ii). Thus we have that, if i > j,
a^{(m)}_{i,j} → 0 for m → ∞, which proves statement (i). Moreover, we can conclude that for the
diagonal entries of A_m

lim_{m→∞} a^{(m)}_{i,i} = (∆_0 R D R^{−1} ∆_0)_{i,i} = (∆_1 R D R^{−1} ∆_1)_{i,i},   1 ≤ i ≤ n.        (7.57)

Since ∆_0 R D R^{−1} ∆_0 = (∆_0 R) D (∆_0 R)^{−1}, this matrix has the same eigenvalues as D =
diag(λ_1, λ_2, . . . , λ_n). Since ∆_0 R D R^{−1} ∆_0 is an upper triangular matrix, we know that its
diagonal entries are the eigenvalues of A, and thus

(∆_0 R D R^{−1} ∆_0)_{i,i} = (∆_0 (Q^{−1} T) D (Q^{−1} T)^{−1} ∆_0)_{i,i} = (∆_0 Q^T A Q ∆_0)_{i,i},   i = 1, 2, . . . , n,        (7.58)

are the eigenvalues of A. We cannot conclude that (∆_0 R D R^{−1} ∆_0)_{i,i} is the eigenvalue λ_i, since
the unitary matrix Q in the last representation in (7.58) could provide a basis transformation
Q^T A Q that yields a permutation of the eigenvalues. Combining (7.58) and (7.57) finally yields
statement (iii).

Finally, from the definition of Q̂_m and from (7.56),

Q_m = Q̂_{m−1}^{−1} Q̂_m = ∆_m^{−1} Q̃_m^{−1} Q^{−1} Q Q̃_{m+1} ∆_{m+1} = ∆_m Q̃_m^{−1} Q̃_{m+1} ∆_{m+1}.

Using again the convergence of Q̃_m to the identity matrix I, we see that Q_m converges to
∆ := ∆_0 ∆_1 = diag(sign(λ_1), sign(λ_2), . . . , sign(λ_n)). This concludes the proof. □
The QR method for eigenvalue computation can be implemented with the following MATLAB
code:
function [lambda,K] = QR_method(A,J)
%
% QR method for the computation of the eigenvalues of a
% real n by n matrix with real eigenvalues and a
% corresponding set of n linearly independent eigenvectors:
% starting matrix A_1 = A
% m-th step: compute QR factorization A_m = Q_m * R_m,
%            then form A_{m+1} = R_m * Q_m
%
% input: A = real n by n matrix with n real eigenvalues and
%            a corresponding set of n linearly independent
%            eigenvectors
%        J = maximal number of iterations
%
% output: lambda = n by 1 vector with approximate eigenvalues of A
%         K = number of iterations
%
n = size(A,2);
for m = 1:J
    % C = strictly lower triangular part of A_m
    C = A;
    for i = 1:n
        C(i,i:n) = 0;
    end
    % stop once all entries below the diagonal are small
    if max(max(abs(C))) < 10^(-4)
        break
    else
        [Q,R] = QR_factorization(A);
        A = R * Q;
    end
end
lambda = diag(A);
K = m;
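Assuming the QR factorization routine listed next is available on the MATLAB path, the
method can be called as follows (our own driver; compare Example 7.26 below):

A = [1 0 0.25 0.25; 0 1 0 0.25; 0.25 0 1 0; 0.25 0.25 0 1];
[lambda,K] = QR_method(A,50)
% lambda approximates the eigenvalues computed by hand in Example 7.17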
The function QR_method uses the QR factorization implemented with the following MATLAB code:
function [Q,R] = QR_factorization(A)
%
% algorithm computes the QR factorization of A, that is, A = Q*R
% input:  A = real n by n matrix
% output: Q = real orthogonal n by n matrix
%         R = real n by n upper triangular matrix
%
n = size(A,1);
Q = eye(n,n);
R = A;
%
for j = 1:n-1
    if max(abs(R(j:n,j))) > 0
        % Householder vector for column j
        v = R(j:n,j) - norm(R(j:n,j)) * [1 ; zeros(n-j,1)];
        if norm(v) > 0   % if v = 0, column j is already reduced
            w = [ zeros(j-1,1) ; v/norm(v) ];
            % apply H(w) = I - 2*w*w' from the left to R and Q
            R = R - 2 * w * (w' * R);
            Q = Q - 2 * w * (w' * Q);
        end
    end
end
Q = Q';
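As a quick sanity check (our own snippet, not part of the notes), one can verify the factorization
and the orthogonality of Q on a random matrix:

A = rand(5);
[Q,R] = QR_factorization(A);
disp(norm(Q*R - A))        % should be of the order of machine precision
disp(norm(Q'*Q - eye(5)))  % Q is orthogonal
disp(norm(tril(R,-1)))     % R is upper triangular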
We give a numerical example.
Example 7.26 (eigenvalue computation with QR method)
Consider the symmetric matrix from Example 5.19 and Example 7.17, given by

A = [ 1    0    0.25 0.25 ]
    [ 0    1    0    0.25 ]
    [ 0.25 0    1    0    ]
    [ 0.25 0.25 0    1    ] .
In Example 7.17, we computed the eigenvalues by hand and found:

λ_1 = 1 + √((3 + √5)/32) ≈ 1.4045,
λ_2 = 1 + √((3 − √5)/32) ≈ 1.1545,
λ_3 = 1 − √((3 − √5)/32) ≈ 0.8455,
λ_4 = 1 − √((3 + √5)/32) ≈ 0.5955.
We compute the first 8 iterations of the QR method with the MATLAB code listed above
and show the diagonal entries of A_{m+1} found in each iteration step in the table below.
m                  1       2       3       4       5       6       7       8
(A_{m+1})_{1,1}    1.2222  1.3267  1.3657  1.3820  1.3904  1.3953  1.3983  1.4004
(A_{m+1})_{2,2}    1.1056  1.1380  1.1474  1.1525  1.1554  1.1566  1.1568  1.1566
(A_{m+1})_{3,3}    0.9069  0.8837  0.8722  0.8624  0.8555  0.8512  0.8487  0.8472
(A_{m+1})_{4,4}    0.7652  0.6516  0.6147  0.6030  0.5988  0.5970  0.5962  0.5958
After only 8 iterative steps the eigenvalues are accurately approximated up to the second
digit after the decimal point. □