The University of Sussex – Department of Mathematics
G1110 & 852G1 – Numerical Linear Algebra
Lecture Notes – Autumn Term 2010
Kerstin Hesse
[Figure 1: Geometric explanation of the Householder matrix H(w): the reflection of a vector a in the hyperplane Sw orthogonal to w.]
Lecture notes and course material by Holger Wendland, David Kay, and others, who taught the course 'Numerical Linear Algebra' at the University of Sussex, served as a starting point for the current lecture notes. The current notes are about twice as long as the previous version. Apart from corrections and improvements, many new examples and some linear algebra revision sections have been added compared to the previous lecture notes.
Contents

Introduction and Motivation
0.1 Motivation: An Interpolation Problem
0.2 Motivation: A Boundary Value Problem

1 Revision: Some Linear Algebra
1.1 Vectors in Rn and Cn
1.2 Matrices
1.3 Determinants of Square Matrices
1.4 Inverse Matrix of a Square Matrix
1.5 Eigenvalues and Eigenvectors of a Square Matrix
1.6 Other Notation: The Landau Symbol

2 Matrix Theory
2.1 The Eigensystem of a Square Matrix
2.2 Upper Triangular Matrices and Back Substitution
2.3 Schur Factorization: A Triangular Canonical Form
2.4 Vector Norms
2.5 Matrix Norms
2.6 Spectral Radius of a Matrix

3 Floating Point Arithmetic and Stability
3.1 Condition Numbers of Matrices
3.2 Floating Point Arithmetic
3.3 Conditioning
3.4 Stability
3.5 An Example of a Backward Stable Algorithm: Back Substitution

4 Direct Methods for Linear Systems
4.1 Standard Gaussian Elimination
4.2 The LU Factorization
4.3 Pivoting
4.4 Cholesky Factorisation
4.5 QR Factorization

5 Iterative Methods for Linear Systems
5.1 Introduction
5.2 Fixed Point Iterations
5.3 The Jacobi and Gauss-Seidel Iterations
5.4 Relaxation

6 The Conjugate Gradient Method
6.1 The Generic Minimization Algorithm
6.2 Minimization with A-Conjugate Search Directions
6.3 Convergence of the Conjugate Gradient Method

7 Calculation of Eigenvalues
7.1 Basic Localisation Techniques
7.2 The Power Method
7.3 Inverse Iteration
7.4 The Jacobi Method
7.5 Householder Reduction to Hessenberg Form
7.6 QR Algorithm
Introduction and Motivation
The topics of this course center around the numerical solution of linear systems and the
computation of eigenvalues.
Let A ∈ Rm×n be an m×n matrix and let b ∈ Rm be a vector. How do we find the (approximate)
solution x ∈ Rn to the linear system
Ax = b
if the matrix A is very large, say, if A is a 10^6 × 10^6 matrix? Unlike in linear algebra, where
we have learnt under what assumptions on A and b a (unique) solution exists, here the focus
is on how this system should be solved with the help of a computer. In devising
algorithms for the numerical solution of such linear systems, we will exploit the properties
of the matrix A.
If the matrix A ∈ Rn×n is a square matrix, then we may want to find the eigenvalues λ and
the corresponding eigenvectors x ∈ Rn, that is,
Ax = λx.
While we have learnt in linear algebra results on the existence of the eigenvalues and correspond-
ing eigenvectors, numerical linear algebra is concerned with the numerical computation of
the eigenvalues on a computer for large square matrices A.
The numerical solution of large linear systems and the numerical computation of eigenvalues are
some of the most important topics in numerical analysis. For example, the approximation or
interpolation of measured data, or the discretization of a differential equation lead to a linear
system. The discretization of a differential equation can also lead to the problem of finding the
eigenvalues of a matrix.
We discuss two motivating examples that illustrate how the problem of the (numerical) solution
of a large linear system might arise in applications.
0.1 Motivation: An Interpolation Problem
Suppose we are given N data sites a ≤ x1 < x2 < . . . < xN ≤ b in [a, b] and corresponding
observations f1, f2, . . . , fN ∈ R. Suppose further that the observations follow an unknown
generation process, that is, there is an unknown function f such that f(xi) = fi. (For example,
the data could be temperatures measured at a fixed time at equidistant locations along a thin
metal rod. In this case we would like to use the measured temperature data to derive a function
of the location along the thin metal rod that describes the temperature along the rod at the
fixed time.)
One possibility to reconstruct the unknown function f is to choose a set of N continuous
basis functions ϕ1, . . . , ϕN ∈ C([a, b]) and to approximate f by a function s of the form
s(x) = ∑_{j=1}^{N} αj ϕj(x),

where the coefficients are determined by the interpolation conditions

fi = s(xi) = ∑_{j=1}^{N} αj ϕj(xi), i = 1, 2, . . . , N.
This leads to a linear system, which can be written in matrix form as

[ ϕ1(x1) ϕ2(x1) · · · ϕN(x1) ] [ α1 ]   [ f1 ]
[ ϕ1(x2) ϕ2(x2) · · · ϕN(x2) ] [ α2 ]   [ f2 ]
[   ⋮       ⋮     ⋱     ⋮    ] [ ⋮  ] = [ ⋮  ]
[ ϕ1(xN) ϕ2(xN) · · · ϕN(xN) ] [ αN ]   [ fN ]   (0.1)
The matrix in the linear system is an N × N matrix, and if N is large, say N ≥ 10^6, then such a system is not easy to solve.

The focus of this course is on how to solve linear systems, such as, for example, the one in the interpolation problem (0.1) above. In this course, we will not discuss interpolation problems themselves, although they are a very interesting area of study and research. Nevertheless, we make two comments on the problem above.
Remark 0.1 (comments on the interpolation problem)
(i) One crucial issue is the choice of the basis functions ϕ1, ϕ2, . . . , ϕN . For example,
possible choices could be:
(a) polynomials: ϕj(x) = x^{j−1}, j = 1, 2, . . . , N, or

(b) shifted Gaussians: ϕj(x) = e^{−(x−xj)²}, j = 1, 2, . . . , N, where the xj in the definition of ϕj is the data site xj.
However, the choice of the basis functions is not arbitrary but will, in applications, be
determined by information about the kind of measured data that is approximated.
(ii) Also, if the data f1, f2, . . . , fN is measured data, then it is usually not exact but contains measurement errors (noise), such that fi = f(xi) + εi, where εi is the measurement error. In this case interpolation usually leads to a rather poor approximation of the data, and it would be better to use an approximation scheme that imposes conditions demanding only s(xi) ≈ fi, i = 1, 2, . . . , N.
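To make this concrete, here is a minimal MATLAB sketch that assembles the matrix in (0.1) and solves for the coefficients; the data sites, the observations, and the choice of the shifted Gaussians from Remark 0.1(i)(b) are purely illustrative assumptions:

N = 50;
x = linspace(0, 1, N)';      % data sites x_1 < x_2 < ... < x_N (illustrative)
f = sin(2*pi*x);             % observations f_i = f(x_i) (illustrative)

% interpolation matrix M with M(i,j) = phi_j(x_i) for the shifted
% Gaussians phi_j(x) = exp(-(x - x_j)^2); implicit expansion (R2016b+)
M = exp(-(x - x').^2);

alpha = M \ f;               % solve the linear system (0.1)

% evaluate the interpolant s(t) = sum_j alpha_j phi_j(t) at t = 0.3
t = 0.3;
s_t = exp(-(t - x').^2) * alpha;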
0.2 Motivation: A Boundary Value Problem
Consider the following one-dimensional boundary value problem: find a function u(x) that is
twice continuously differentiable and satisfies the differential equation
− d²u/dx²(x) + u(x) = f(x) on (0, 1), (0.2)
subject to the boundary conditions
u(0) = 0 and u(1) = 0. (0.3)
The function f in (0.2) is continuous on [0, 1].
One way to solve this boundary value problem numerically is to use finite differences. The
basic idea here is to approximate the derivative by difference quotients:
u′(x) ≈ (u(x + h) − u(x))/h   or   u′(x) ≈ (u(x) − u(x − h))/h. (0.4)
The first formula in (0.4) is a forward difference while the second one is a backward dif-
ference. Using first a forward and then a backward rule for the second derivative leads to the
centred second difference
u′′(x) ≈ (u′(x + h) − u′(x))/h ≈ (1/h) ( (u(x + h) − u(x))/h − (u(x) − u(x − h))/h )
       = (u(x + h) − 2 u(x) + u(x − h))/h². (0.5)
Finite Difference Method:
When finding a numerical approximation using finite differences we divide the interval [0, 1]
into n+1 subintervals of equal length h = 1/(n+1) with endpoints at the equally spaced nodes
xi = i h = i/(n + 1), i = 0, 1, . . . , n, n + 1.
Our aim is to construct a vector u = uh = (u0, u1, . . . , un, un+1)T such that uj is an approxi-
mation of u(xj), j = 0, 1, . . . , n, n + 1, where u denotes the (exact) solution to the boundary
value problem (0.2) and (0.3).
Expressing (0.2) and (0.3) on the grid x0, x1, . . . , xn, xn+1 and replacing the derivatives by finite
differences, with the help of (0.5), we obtain
− (u(xi+1) − 2 u(xi) + u(xi−1))/h² + u(xi) ≈ f(xi), i = 1, 2, . . . , n, (0.6)
and
u(x0) = 0 and u(xn+1) = 0. (0.7)
Replacing in (0.6) and (0.7) the values u(xj) by the approximations uj and using the abbrevi-
ation fi := f(xi), we get the equations
− (ui+1 − 2 ui + ui−1)/h² + ui = fi, i = 1, 2, . . . , n, (0.8)
and
u0 = 0 and un+1 = 0. (0.9)
These equations (0.8) and (0.9) lead to the following linear system for the computation of the
finite difference approximation u = uh = (u0, u1, . . . , un, un+1)T :
       [  1     0     0     0   · · ·   0     0   ] [ u0   ]   [ 0    ]
       [ −1   2+h²   −1     0   · · ·   0     0   ] [ u1   ]   [ f1   ]
       [  0    −1   2+h²   −1     ⋱     ⋮     0   ] [ u2   ]   [ f2   ]
1/h² · [  ⋮     0     ⋱     ⋱     ⋱     0     ⋮   ] [ ⋮    ] = [ ⋮    ]
       [  0     ⋮     ⋱    −1   2+h²   −1     0   ] [ un−1 ]   [ fn−1 ]
       [  0     0   · · ·   0    −1   2+h²   −1   ] [ un   ]   [ fn   ]
       [  0     0     0   · · ·   0     0     1   ] [ un+1 ]   [ 0    ]   (0.10)
The involved matrix, which we denote by A, is in R(n+2)×(n+2) and is tridiagonal. With
f := (0, f1, f2, . . . , fn, 0)T , we can write (0.10) as Au = f .
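As an illustration, here is a minimal MATLAB sketch that assembles the matrix A of (0.10) in sparse storage and solves Au = f; the number of nodes and the right-hand side (chosen so that the exact solution is sin(πx)) are illustrative assumptions:

n = 100;                               % number of interior nodes (illustrative)
h = 1/(n+1);
x = (0:n+1)' * h;                      % grid x_0, ..., x_{n+1}
f = @(t) pi^2*sin(pi*t) + sin(pi*t);   % illustrative right-hand side

% interior rows: (1/h^2) * [-1, 2+h^2, -1] on the three diagonals
e = ones(n+2,1);
A = spdiags([-e, (2+h^2)*e, -e], -1:1, n+2, n+2) / h^2;
A(1,:)   = 0;  A(1,1)     = 1;   % boundary row enforcing u_0 = 0
A(end,:) = 0;  A(end,end) = 1;   % boundary row enforcing u_{n+1} = 0

rhs = f(x);  rhs(1) = 0;  rhs(end) = 0;
u = A \ rhs;                     % finite difference approximation u_h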
Remark 0.2 (comments on the finite difference approximation)
(i) This system of equations is sparse, that is, the number of non-zero entries is much smaller than (n+2)². This sparsity can be exploited for more efficient storage, by storing only the non-zero entries of the matrix. The sparsity also reduces the number of operations required for matrix-matrix and matrix-vector multiplications.
(ii) To obtain an accurate approximation to the true solution u we may have to choose h very
small (very fine step size), thereby increasing the size of the linear system.
(iii) For a general boundary value problem in N dimensions the size of the linear system can grow rapidly; for example, in three dimensions the linear system grows by a factor of about 8 with each uniform refinement of the domain.
In this course we will learn about direct methods (for example, Gaussian elimination) and
iterative methods (that is, the construction of a sequence of improving approximations to
the solution) that are used to numerically solve linear systems of equations (such as the ones
encountered in this section and the previous section).
We will look at how efficient (how much time and memory are required?) and stable (do they
give good approximations and do they converge and under what conditions?) these methods
are.
Chapter 1
Revision: Some Linear Algebra
In this chapter we first introduce common notation and give a brief revision of some definitions
and results from linear algebra that we will frequently use in this course.
In this course capital (upper-case) letters A, B, C, . . . usually denote matrices, and bold-face
lower-case letters a,b,x,y, . . . denote vectors. Functions are denoted by lower-case letters.
1.1 Vectors in Rn and Cn
A vector x in Rn (or Cn) is a column vector

x = (x1, x2, . . . , xn)T, where x1, x2, . . . , xn ∈ R (or x1, x2, . . . , xn ∈ C).
The vector 0 is the zero vector, where all entries are zero.
We denote by ei in Rn (or in Cn) the standard ith basis vector containing a one in the ith component and zeros elsewhere. For example, in R3 and C3, the standard basis vectors are

e1 = (1, 0, 0)T, e2 = (0, 1, 0)T, e3 = (0, 0, 1)T.
The vectors x1,x2, . . . ,xm in Rn (or in Cn) are linearly independent if the following holds true: If

∑_{j=1}^{m} aj xj = a1 x1 + a2 x2 + . . . + am xm = 0 (1.1)
with the real (complex) numbers a1, a2, . . . , am, then the numbers aj , j = 1, 2, . . . , m, are all
zero.
In other words, x1,x2, . . . ,xm are linearly independent if the only real (complex) numbers
a1, a2, . . . , am for which (1.1) holds are a1 = a2 = . . . = am = 0.
The vectors x1,x2, . . . ,xm in Rn (or in Cn) are linearly dependent if they are not linearly independent. This means x1,x2, . . . ,xm in Rn (or in Cn) are linearly dependent if there exist real (complex) numbers a1, a2, . . . , am, not all zero, such that (1.1) holds.
Any m > n vectors in Rn (in Cn) are linearly dependent.
Any set of n linearly independent vectors in Rn (in Cn) is a basis for Rn (for Cn). If
v1,v2, . . . ,vn is a basis for Rn (for Cn), then the following holds: For every vector x in Rn
(in Cn), there exist uniquely determined real (complex) numbers a1, a2, . . . , an such that
x = ∑_{j=1}^{n} aj vj = a1 v1 + a2 v2 + . . . + an vn.
For a column vector x in Rn or in Cn we denote by xT the transposed (row) vector, that is,

x = (x1, x2, . . . , xn)T and xT = (x1, x2, . . . , xn).
Likewise the transpose of a row vector y = (y1, y2, . . . , yn) is the corresponding column vector yT = (y1, y2, . . . , yn)T.
For a column vector x ∈ Cn we denote by x∗ := x̄T the conjugate (row) vector, that is,

x = (x1, x2, . . . , xn)T and x∗ = x̄T = (x̄1, x̄2, . . . , x̄n).
Here, ȳ indicates taking the complex conjugate of y ∈ C, that is, if y = a + i b with a, b ∈ R and i the imaginary unit, then ȳ = a − i b. Likewise the conjugate of a complex row vector y is the corresponding conjugate column vector y∗ := ȳT, that is,

y = (y1, y2, . . . , yn) and y∗ = ȳT = (ȳ1, ȳ2, . . . , ȳn)T.
For complex numbers y = a + i b ∈ C with a, b ∈ R, we have

|y| = √(ȳ y) = √((a − i b)(a + i b)) = √(a² + b²).
The Euclidean inner product of two real-valued vectors x,y ∈ Rn is given by

xT y = xT · y = ∑_{j=1}^{n} xj yj = x1 y1 + x2 y2 + . . . + xn yn.
We note that the Euclidean inner product for Rn is symmetric, that is, xT y = yT x for any
x,y ∈ Rn.
The Euclidean inner product of two complex vectors x,y ∈ Cn is given by

x∗ y = x∗ · y = x̄T · y = ∑_{j=1}^{n} x̄j yj = x̄1 y1 + x̄2 y2 + . . . + x̄n yn.

For any x,y ∈ Cn, the Euclidean inner product for Cn satisfies that x∗ y is the complex conjugate of y∗ x.
The Euclidean norm of a vector x ∈ Rn (or x ∈ Cn) is defined by

‖x‖2 = √(xT x) = (∑_{j=1}^{n} |xj|²)^{1/2} or ‖x‖2 = √(x∗ x) = (∑_{j=1}^{n} |xj|²)^{1/2}.
The geometric interpretation of the Euclidean norm ‖x‖2 of a vector x in Rn or Cn is that ‖x‖2
measures the length of x.
We say that two vectors x and y in Rn (or in Cn) are orthogonal (to each other) if xT y = 0 (or if x∗ y = 0, respectively). For vectors in Rn, this means geometrically that the angle between these two vectors is π/2, that is, 90°. A set of m vectors x1,x2, . . . ,xm in Rn (or in Cn) is called orthogonal if they are mutually orthogonal, that is, if xj is orthogonal to xk whenever j ≠ k. It is easily checked that a set of orthogonal non-zero vectors is, in particular, also linearly independent.

A basis v1,v2, . . . ,vn of Rn is called an orthonormal basis for Rn if the basis vectors all have length one and are mutually orthogonal, that is,

‖vj‖2 = 1 for j = 1, 2, . . . , n, and vjT vk = 0 for j, k = 1, 2, . . . , n with j ≠ k.

Likewise, a basis v1,v2, . . . ,vn of Cn is called an orthonormal basis for Cn if the basis vectors all have length one and are mutually orthogonal, that is,

‖vj‖2 = 1 for j = 1, 2, . . . , n, and vj∗ vk = 0 for j, k = 1, 2, . . . , n with j ≠ k.
An orthonormal basis v1,v2, . . . ,vn of Rn has a very useful property: Any vector x ∈ Rn has the representation

x = ∑_{j=1}^{n} (vjT x) vj (1.2)
as a linear combination of the basis vectors v1,v2, . . . ,vn.
The validity of (1.2) is easily established: Assume that x = ∑_{j=1}^{n} aj vj, and take the Euclidean inner product with vk. Because v1,v2, . . . ,vn form an orthonormal basis, vkT vj = 0 if j ≠ k and vkT vk = ‖vk‖2² = 1. Thus

vkT x = vkT (∑_{j=1}^{n} aj vj) = ∑_{j=1}^{n} aj vkT vj = ak vkT vk = ak ‖vk‖2² = ak.

Replacing aj = vjT x in x = ∑_{j=1}^{n} aj vj now verifies (1.2).
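As a quick numerical check of (1.2), here is a small MATLAB sketch; the orthonormal basis of R², obtained by rotating the standard basis, is an illustrative assumption:

theta = pi/6;                       % rotate the standard basis by 30 degrees
v1 = [cos(theta); sin(theta)];      % orthonormal basis of R^2
v2 = [-sin(theta); cos(theta)];

x = [3; -1];                        % arbitrary test vector
x_rec = (v1'*x)*v1 + (v2'*x)*v2;    % representation (1.2)
norm(x - x_rec)                     % should be (numerically) zero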
Analogously to (1.2), an orthonormal basis v1,v2, . . . ,vn of Cn has the following property: Any vector x ∈ Cn has the representation

x = ∑_{j=1}^{n} (vj∗ x) vj (1.3)
as a linear combination with respect to the orthonormal basis v1,v2, . . . ,vn.
Exercise 1 State the properties of an inner product/scalar product for a complex vector space
V , and verify that the Euclidean inner product for Cn has these properties.
Exercise 2 Show formula (1.3).
1.2 Matrices
The matrix A ∈ Cm×n (or A ∈ Rm×n) is an m × n (m rows and n columns) matrix with complex-valued (or real-valued) entries:

A := (ai,j) = (ai,j)1≤i≤m; 1≤j≤n := [ a1,1 a1,2 · · · a1,n ]
                                    [ a2,1 a2,2 · · · a2,n ]
                                    [  ⋮    ⋮          ⋮  ]
                                    [ am,1 am,2 · · · am,n ] ,

where ai,j ∈ C (or ai,j ∈ R, respectively).
Occasionally, we will denote the column vectors of a matrix A = (ai,j) in Cm×n (or in Rm×n) by aj, j = 1, 2, . . . , n, that is,

A := (a1, a2, . . . , an), with aj = (a1,j, a2,j, . . . , am,j)T ∈ Cm (or ∈ Rm), j = 1, 2, . . . , n.
To denote the (i, j)th entry ai,j of A = (ai,j) we may occasionally also write Ai,j := ai,j or
A(i, j) := ai,j.
A matrix is called square if it has the same number of rows and columns. Thus square matrices are matrices in Cn×n or Rn×n. The diagonal of a square matrix A = (ai,j) in Cn×n or in Rn×n consists of the entries aj,j, j = 1, 2, . . . , n.

Vectors are special cases of matrices, and a column vector x ∈ Cn (or x ∈ Rn) is just an n × 1 matrix. Likewise a row vector in Cn (or Rn) is just a 1 × n matrix.
Two matrices of special importance are the m × n zero matrix, and, among the square matrices, the n × n identity matrix: The zero matrix in Cm×n and in Rm×n is the m × n matrix that has all entries zero. We denote the m × n zero matrix by 0. The identity matrix in Cn×n and in Rn×n is the n × n matrix in which the entries on the diagonal are all one and all other entries are zero. We denote the n × n identity matrix by I. For example, in C3×3 and R3×3 we have

0 = [ 0 0 0 ]       I = [ 1 0 0 ]
    [ 0 0 0 ]  and      [ 0 1 0 ]
    [ 0 0 0 ]           [ 0 0 1 ] .
The scalar multiplication of a matrix A = (ai,j) in Cm×n (or in Rm×n) with a complex (or
real) number µ is defined componentwise, that is, µ A in Cm×n (or in Rm×n, respectively) is
defined by
(µ A)i,j := µ ai,j, i = 1, 2, . . . , m; j = 1, 2, . . . , n. (1.4)
The addition of two m× n matrices A = (ai,j) and B = (bi,j) in Cm×n (or in Rm×n) is defined
componentwise, that is, A + B in Cm×n (or in Rm×n, respectively) is defined by
(A + B)i,j := ai,j + bi,j, i = 1, 2, . . . , m; j = 1, 2, . . . , n. (1.5)
The set Cm×n (or Rm×n) of complex (or real) m×n matrices with the scalar multiplication (1.4)
and the addition (1.5) forms a complex vector space (or real vector space, respectively).
The matrix multiplication A B of A = (ai,j) ∈ Cm×n and B = (bi,j) ∈ Cn×p (or A = (ai,j) ∈ Rm×n and B = (bi,j) ∈ Rn×p) gives the matrix C = (ci,j) ∈ Cm×p (or C = (ci,j) ∈ Rm×p, respectively), with the entries

ci,j = ∑_{k=1}^{n} ai,k bk,j = ai,1 b1,j + ai,2 b2,j + . . . + ai,n bn,j, i = 1, 2, . . . , m; j = 1, 2, . . . , p.
In words, ci,j is computed by taking the Euclidean inner product of the ith row vector of A
with the jth column vector of B.
Note that for square matrices A and B in Cn×n (or in Rn×n), both A B and B A are defined, but in general A B ≠ B A, that is, matrix multiplication is not commutative.
Thus the Euclidean inner product x∗ y of two vectors x and y in Cn (and the Euclidean inner product xT y of two vectors x and y in Rn) is just the matrix multiplication of the 1 × n matrix x∗ = x̄T (and xT, respectively) with the n × 1 matrix y.

The outer product B = (bi,j) ∈ Cn×n of two vectors x,y ∈ Cn is given by

B = x y∗, where bi,j := xi ȳj, i = 1, 2, . . . , n; j = 1, 2, . . . , n.

Analogously, the outer product of x and y in Rn is x yT = (xi yj) ∈ Rn×n.
In analogy to the transpose of a vector in Rn and the conjugate of a vector in Cn, we define the transposed matrix of a matrix A ∈ Rm×n and the Hermitian conjugate matrix of a matrix A ∈ Cm×n.
The transposed matrix or transpose of a matrix A = (ai,j) in Rm×n (or in Cm×n) is the matrix AT in Rn×m (or in Cn×m) whose (i, j)th entry is given by

(AT)i,j = aj,i.
A square matrix A = (ai,j) in Rn×n (or in Cn×n) is called symmetric if AT = A, that is,
ai,j = aj,i for all i, j = 1, 2, . . . , n.
The Hermitian conjugate (matrix) (or adjoint (matrix)) of A = (ai,j) ∈ Cm×n is the matrix A∗ := ĀT ∈ Cn×m whose (i, j)th entry is

(A∗)i,j = āj,i.

A square matrix A = (ai,j) ∈ Cn×n is called Hermitian (or self-adjoint) if A∗ = A, that is, āj,i = ai,j for all i, j = 1, 2, . . . , n.
For A ∈ Rm×n and B ∈ Rn×p, we have
(A B)T = BT AT ,
and for A ∈ Cm×n and B ∈ Cn×p, we have
(A B)∗ = B∗ A∗.
Let A = (ai,j) ∈ Cm×n be a complex m × n matrix. The null space or kernel of the matrix A is defined by

null(A) = ker(A) := {x ∈ Cn : Ax = 0}. (1.6)

The range of the matrix A is defined by

range(A) := {y ∈ Cm : Ax = y for some x ∈ Cn}. (1.7)
The range of A is the linear space spanned by the column vectors of A.
Analogous statements hold for real matrices A ∈ Rm×n with the only difference that Cn and Cm in (1.6) and (1.7) need to be replaced by Rn and Rm, respectively.
The rank of a matrix A ∈ Cm×n or A ∈ Rm×n is the dimension of the range of A, that is,

rank(A) := dim(range(A)).
The trace of a square matrix A = (ai,j) ∈ Cn×n or A = (ai,j) ∈ Rn×n, denoted by trace(A), is the sum of its diagonal entries, that is,

trace(A) = ∑_{j=1}^{n} aj,j = a1,1 + a2,2 + . . . + an,n.
Symmetric matrices in Rn×n and Hermitian matrices in Cn×n may have the following useful
properties:
A square matrix A = (ai,j) ∈ Rn×n is called positive definite if A is symmetric (that is, A satisfies AT = A) and

xT Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} ai,j xi xj > 0 for all x ∈ Rn with x ≠ 0.

A square matrix A = (ai,j) ∈ Rn×n is called positive semidefinite if A is symmetric and

xT Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} ai,j xi xj ≥ 0 for all x ∈ Rn.
A square matrix A = (ai,j) ∈ Cn×n is said to be positive definite if A is Hermitian (that is, A satisfies A∗ = A) and

x∗ Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} ai,j x̄i xj > 0 for all x ∈ Cn with x ≠ 0.

A square matrix A = (ai,j) ∈ Cn×n is said to be positive semidefinite if A is Hermitian and

x∗ Ax = ∑_{i=1}^{n} ∑_{j=1}^{n} ai,j x̄i xj ≥ 0 for all x ∈ Cn.
We have the following useful property of positive definite matrices:
If A = (ai,j) in Rn×n (or in Cn×n) is positive definite, then the upper principal submatrices Ap :=
(ai,j)1≤i,j≤p, p ∈ {1, 2, . . . , n}, are positive definite, and det(Ap) > 0 for all p ∈ {1, 2, . . . , n}.
Theorem 1.1 (characterization of positive definite matrices)
(i) A symmetric matrix A ∈ Rn×n (that is, A = AT ) is positive definite if and only if all
its eigenvalues are positive.
(ii) An Hermitian matrix A ∈ Cn×n (that is, A = A∗) is positive definite if and only if all
its eigenvalues are positive.
Exercise 3 Show that Cm×n with the usual matrix addition and scalar multiplication is a com-
plex vector space.
Exercise 4 Find the range and the null space of the following matrix:

A := [  1  0  −1  2 ]
     [  0  1   3  1 ]
     [ −1  1   5  0 ] .
Exercise 5 For a matrix A ∈ Cm×n show that the range of A is the linear space spanned by
the column vectors of A.
Exercise 6 Which, if any, of the following square matrices are symmetric or Hermitian? (Note here i is always the imaginary unit!)

A := [ −1  2  i ]
     [  2  i  3 ]
     [ −i  3  1 ] ,

B := [ −1  2  −i ]
     [  2  7   5 ]
     [  i  5   3 ] ,

C := [  2  −2   8 ]
     [ −2   7  −1 ]
     [  8  −1   3 ] .
Exercise 7 Show that (A B)T = BT AT for any A ∈ Rm×n and B ∈ Rn×p. Show that (A B)∗ = B∗ A∗ for any A ∈ Cm×n and B ∈ Cn×p.
Exercise 8 Compute the trace of the 3 × 3 matrix

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] . (1.8)
Exercise 9 Show that the symmetric real 3×3 matrix A given by (1.8) in the previous question
is positive definite.
Exercise 10 Let A ∈ Cn×n be a positive definite matrix. If C ∈ Cn×m, show that:

(a) C∗ A C is positive semidefinite.

(b) rank(C∗ A C) = rank(C).

(c) C∗ A C is positive definite if and only if rank(C) = m.
1.3 Determinants of Square Matrices
In this subsection let A be a square matrix with either real or complex entries.
The determinant det(A) of a 2 × 2 matrix

A = [ a b ]
    [ c d ]

is defined by

det(A) = | a b |
         | c d | = a d − b c.
The determinant det(A) of a 3 × 3 matrix

A = [ a1,1 a1,2 a1,3 ]
    [ a2,1 a2,2 a2,3 ]
    [ a3,1 a3,2 a3,3 ]

is defined by

det(A) = a1,1 C1,1 − a1,2 C1,2 + a1,3 C1,3,

where C1,1, −C1,2, and C1,3 are the so-called cofactors of a1,1, a1,2, and a1,3, respectively, and are defined by

C1,1 = | a2,2 a2,3 |    C1,2 = | a2,1 a2,3 |    C1,3 = | a2,1 a2,2 |
       | a3,2 a3,3 | ,         | a3,1 a3,3 | ,         | a3,1 a3,2 | .
We observe that C1,j is the determinant of the 2×2 submatrix of A that is obtained by deleting
the 1st row and jth column of A.
The procedure for computing the determinant of a 3 × 3 matrix can be generalized to the following formula for computing the determinant of an n × n matrix, where n ≥ 2:

The determinant det(A) of the n × n matrix A = (ai,j) is given by

det(A) = ∑_{j=1}^{n} ai,j (−1)^{i+j} Ci,j,

for any i ∈ {1, 2, . . . , n}, where Ci,j is the determinant of the (n−1) × (n−1) submatrix of A obtained by deleting the ith row and jth column from A.
Equivalently, we can also compute the determinant of an n × n matrix A = (ai,j) by expanding with respect to a column:

det(A) = ∑_{i=1}^{n} ai,j (−1)^{i+j} Ci,j,

for any j ∈ {1, 2, . . . , n}, where Ci,j is the determinant of the (n−1) × (n−1) submatrix of A obtained by deleting the ith row and jth column from A.
We note that for all n × n matrices A,
det(A) = det(AT ).
Let A and B be n × n matrices. Then the determinant of a product of two matrices satisfies
det(A B) = det(A) det(B).
Exercise 11 Compute the determinant of the 3 × 3 matrix

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] .
Exercise 12 Prove the following statement: If A = (ai,j) in Rn×n is positive definite, then the
upper principal submatrices Ap := (ai,j)1≤i,j≤p, p ∈ {1, 2, . . . , n}, are positive definite.
Exercise 13 Prove that for all square n × n matrices A we have det(A) = det(AT ).
1.4 Inverse Matrix of a Square Matrix
A square matrix A in Cn×n (or in Rn×n) is called invertible or non-singular if there exists a matrix X in Cn×n (or in Rn×n, respectively) such that
A X = X A = I,
where I is the n× n identity matrix. The matrix X is then called the inverse (matrix) of A,
and we usually denote the inverse matrix of A by A−1.
If a square matrix A in Cn×n or Rn×n is not invertible, then we call A singular.
A fundamental result about inverse matrices is the following: A square matrix A is invertible (or non-singular) if and only if det(A) ≠ 0.

Equivalently, a square matrix A is singular if and only if det(A) = 0.
The easiest way to compute the inverse of a non-singular matrix A by hand is to write the
augmented matrix (A|I) and then transform this system with elementary row operations such
that we have the identity matrix on the left-hand side; then we obtain (I|A−1) with the inverse
matrix A−1 on the right-hand side.
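In MATLAB the same row reduction of the augmented matrix (A|I) can be carried out with rref; a small sketch for an illustrative 2 × 2 matrix:

A = [2 1; 1 1];              % illustrative non-singular matrix
R = rref([A, eye(2)]);       % row-reduce the augmented matrix (A|I)
Ainv = R(:, 3:4);            % the right-hand block is the inverse of A
norm(A*Ainv - eye(2))        % check: should be (numerically) zero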
With the help of the inverse matrix we can solve linear systems as follows. Assume A is a
non-singular square n×n matrix in Cn×n and b is a given vector in Cn. Then the linear system
Ax = b
has the solution x = A−1 b, which follows easily from

A−1 b = A−1 (Ax) = (A−1 A)x = I x = x,
where we have used the fact that matrix multiplication is associative. We will see in this
course that computing the inverse of a large matrix is a rather expensive process, and therefore
computing the inverse and then using x = A−1 b is numerically not a good way to solve large
linear systems.
For two invertible (or non-singular) square n × n matrices A and B we have
(A B)−1 = B−1A−1,
and, for invertible matrices A ∈ Rn×n and B ∈ Cn×n, we have
(AT )−1 = (A−1)T and (B∗)−1 = (B−1)∗.
For a non-singular matrix A in Cn×n (or in Rn×n), we have

det(A−1) = (det(A))−1.
A square matrix Q ∈ Rn×n is said to be orthogonal (or an orthogonal matrix) if

QT = Q−1 ⇔ (QT Q = I and Q QT = I). (1.9)

A square matrix Q ∈ Cn×n is said to be unitary (or a unitary matrix) if

Q∗ = Q−1 ⇔ (Q∗ Q = I and Q Q∗ = I). (1.10)

The second characterization in (1.9) and (1.10), respectively, tells us that the column vectors qi, i = 1, 2, . . . , n, of Q are an orthonormal basis of Rn and Cn, respectively. That is, (1.9) is equivalent to

qiT qj = δi,j, i, j = 1, 2, . . . , n,

and (1.10) is equivalent to

qi∗ qj = q̄iT qj = δi,j, i, j = 1, 2, . . . , n,

respectively, where δi,j is the Kronecker delta, defined to be one if i = j and zero otherwise. Likewise, the row vectors of Q form an orthonormal basis of Rn and Cn, respectively.
Exercise 14 Show that the 3 × 3 matrix

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ]
is non-singular. Compute the inverse matrix A−1 of A.
Exercise 15 Let A and B in Cn×n be invertible matrices. Prove the following statements:
(a) (A B)−1 = B−1A−1.
(b) (A−1)∗ = (A∗)−1. Use this result to conclude that (AT )−1 = (A−1)T if A is in Rn×n.
(c) det(A−1) = (det(A))−1.
Exercise 16 Show that a positive definite matrix A ∈ Rn×n is non-singular and that the inverse
matrix A−1 is also positive definite.
Exercise 17 Show that the inverse of a unitary matrix is unitary. Use this result to show that the unitary matrices in Cn×n with the matrix multiplication form a (multiplicative) group.
Exercise 18 Consider the matrix A ∈ R3×3 and the vector b ∈ R3 given by

A = [  1  −1  0 ]
    [ −1   2  1 ]       and b = (2, −2, 4)T.
    [  0   1  3 ]
(a) Compute the determinant of A.
(b) Is A invertible? Why?
(c) Compute the inverse matrix A−1 of A.
(d) Use the inverse matrix A−1 to solve the linear system Ax = b.
(e) Show that A is positive definite. (Hint: use Theorem 1.1.)
1.5 Eigenvalues and Eigenvectors of a Square Matrix

In this section, we consider Rn×n as a subset of Cn×n, so that all definitions for complex n × n matrices also apply to real n × n matrices.
Let A be a square matrix in Cn×n. A complex number λ ∈ C is an eigenvalue of A if there
exists a non-zero vector x ∈ Cn \ {0} such that
Ax = λx. (1.11)
The vector x in (1.11) is then called an eigenvector of A with the eigenvalue λ.
By writing (1.11) equivalently as
λx − Ax = 0 ⇔ (λ I − A)x = 0,
we see that a non-zero vector x satisfying (1.11) exists if and only if det(λ I − A) = 0. The
determinant
p(A, λ) := det(λ I − A) (1.12)
is a polynomial in λ of exact degree n, and (1.12) is called the characteristic polynomial
of A. Clearly, the (complex) roots of the characteristic polynomial are the eigenvalues of the
matrix A. By the fundamental theorem of algebra, any (complex) polynomial of exact degree
n has n complex roots, counted with multiplicity. Therefore, any square matrix A ∈ Cn×n
has n complex eigenvalues, counted with multiplicity.
To compute the eigenvalues and the corresponding eigenvectors of a square matrix
A by hand, we proceed as follows: First we compute the characteristic polynomial p(A, λ) =
det(λ I − A) and find its roots. These roots are the eigenvalues of A. For each eigenvalue λ of
A, we solve the linear system (λ I − A)x = 0 to find the eigenvectors x corresponding to λ.
For a real n × n matrix A ∈ Rn×n, the characteristic polynomial p(A, λ) = det(λ I − A) has n complex roots (counted with multiplicity). In general, some (or even all) of these roots may not be in R. Thus, for the special case A ∈ Rn×n, we can in general not conclude that A has n real eigenvalues, counted with multiplicity.
It is clear that for large n the computation of the eigenvalues is a far from trivial problem.
1.6 Other Notation: The Landau Symbol
For two functions f, g : N → R, we will write f = O(g) if there is a constant C > 0 and N ∈ N
such that
|f(n)| ≤ C|g(n)| for all n ≥ N.
The symbol O is called the Landau symbol.
For example, consider the matrix-vector multiplication Ax, where A ∈ Cn×n and x ∈ Cn. Since the ith component of Ax is given by

(Ax)i = ∑_{j=1}^{n} ai,j xj, (1.13)

the number of multiplications in (1.13) is n and the number of additions in (1.13) is n − 1, so that the number of elementary operations to compute (Ax)i is 2n − 1, that is, O(n). The total number of operations to compute a matrix-vector multiplication is therefore n(2n − 1) = 2n² − n, that is, O(n²).
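The count can be read off from a naive implementation; here is a MATLAB sketch of the matrix-vector product with an explicit operation counter (the function name matvec_count is, of course, just an illustrative choice):

function [y, ops] = matvec_count(A, x)
% naive matrix-vector product y = A*x with an elementary operation count
n = size(A,1);
y = zeros(n,1);
ops = 0;
for i = 1:n
    for j = 1:n
        y(i) = y(i) + A(i,j) * x(j);   % one multiplication, one addition
        ops = ops + 2;
    end
end
% ops equals 2*n^2, i.e. O(n^2); the exact count 2*n^2 - n from the text
% is obtained if the first term of each row sum is assigned, not added
end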
In numerical linear algebra the number of elementary operations needed to execute an algorithm
is of great interest, since it determines the runtime and efficiency of the algorithm. Usually
information about the cost or number of elementary operations is not given as an exact
figure but rather by listing it as O(n), O(n2), O(n3), . . . , as appropriate, in terms of the
dimension n of the problem. In the example above the dimension is the number of components
n of the involved vectors.
Chapter 2
Matrix Theory
In this chapter we learn various basics from matrix theory and encounter the first numerical
algorithm.
In Section 2.1, we revise some facts and results about the eigenvalues and eigenvectors of a
square matrix in Cn×n. These facts are needed throughout the course, and we will come back
to them at various stages later-on. In Section 2.2, we introduce upper triangular matrices
(and lower triangular matrices), and we learn how a linear system with an upper triangular
matrix can be easily solved by back substitution. In Section 2.3, we learn about the Schur
factorization: Given a square matrix A in Cn×n, the Schur factorization guarantees that
there exists a unitary matrix S, such that the matrix
U = S∗ A S = S−1 A S
is an upper triangular matrix. This result will be exploited at various stages later in the course.
In Section 2.4, we revise some facts about vector norms, and in Section 2.5, we introduce a
variety of matrix norms that will be used throughout the course. In Section 2.6, we define the
spectral radius of a square matrix A ∈ Cn×n which is the maximum of the absolute values of the
n complex eigenvalues of A. In formulas, let λ1, λ2, . . . , λn ∈ C be the n complex eigenvalues
(counted with multiplicity) of A ∈ Cn×n; then the spectral radius of A ∈ Cn×n is defined by
ρ(A) := max{ |λ1|, |λ2|, . . . , |λn| }.
We also learn a result about the relation between the spectral radius and matrix norms.
2.1 The Eigensystem of a Square Matrix
The material in this section on the eigensystem, that is, the eigenvalues and eigenvectors, of
a square matrix should be familiar from linear algebra. However, it is strongly recommended
that you carefully go through this section, revise the material, and solve the exercises!
In this section real square matrices A ∈ Rn×n are considered as the special case of matrices in Cn×n with real entries; so all results for matrices in Cn×n hold also for matrices in Rn×n.

Consider a square matrix A in Cn×n,

A = (ai,j) = [ a1,1 a1,2 · · · a1,n ]
             [ a2,1 a2,2 · · · a2,n ]
             [  ⋮    ⋮    ⋱     ⋮  ]
             [ an,1 an,2 · · · an,n ] .
Definition 2.1 (eigenvalues and eigenvectors)
A complex number λ ∈ C is called an eigenvalue of A = (ai,j) ∈ Cn×n if there exists a
non-zero vector x ∈ Cn such that
Ax = λx ⇔ λx − Ax = (λ I − A)x = 0, (2.1)
where I is the n × n identity matrix. A vector x satisfying (2.1) is called an eigenvector
corresponding to the eigenvalue λ.
From (2.1) it is clear that, once we know an eigenvalue λ, any corresponding eigenvector x can
be computed by solving the linear system (λ I − A)x = 0. The linear system (λ I − A)x = 0
has non-zero solutions if and only if det(λ I − A) = 0. Thus we have found a criterion for
determining whether a complex number is an eigenvalue: λ is an eigenvalue of A if and only if
det(λ I − A) = 0. From the properties of the determinant it is easily seen that
det(λ I − A) = λ^n + c_{n−1} λ^{n−1} + . . . + c_1 λ + c_0, (2.2)
with suitable complex coefficients c0, c1, . . . , cn−1, that is, det(λ I − A) is a polynomial in λ of
exact degree n. Thus any eigenvalue of A is a root of the polynomial det(λ I − A).
Theorem 2.2 (roots of the characteristic polynomial are eigenvalues)
Let A ∈ Cn×n. A complex number λ ∈ C is an eigenvalue of A if and only if it is a root
of the characteristic polynomial of A, defined by
p(A, λ) := det(λ I − A).
Counted with multiplicity, the characteristic polynomial has exactly n complex roots,
that is, A has exactly n complex eigenvalues, counted with multiplicity.
The last statement in the theorem above follows from the fundamental theorem of algebra.
Definition 2.3 (spectrum of a matrix)
Let A ∈ Cn×n. The set of all eigenvalues of A is called the spectrum of A and is denoted
by

Λ(A) := {λ ∈ C : det(λ I − A) = 0}.
Once we have found the n complex eigenvalues of A ∈ Cn×n, we can compute the corresponding
eigenvectors x to an eigenvalue λ by solving the linear system (λ I − A)x = 0.
If an eigenvalue occurs with a multiplicity k > 1, there can be at most k linearly independent eigenvectors. It is not difficult to show that the set of eigenvectors to a given eigenvalue λ forms a vector space, called the eigenspace of the eigenvalue λ. Indeed let λ be an eigenvalue of A and define the eigenspace of λ by

Eλ(A) := {x ∈ Cn : Ax = λx}.

Now consider two elements x and y from Eλ(A). Then the linear combination αx + β y is also in Eλ(A), since

A(αx + β y) = α Ax + β Ay = α λx + β λy = λ (αx + β y).
This guarantees closure under vector addition and scalar multiplication, and hence Eλ ⊂ Cn is
a vector space.
Example 2.4 (eigenvalues and eigenvectors)
Compute the eigenvalues and eigenvectors of the matrix

A = [  0  −2  2 ]
    [ −2  −3  2 ]
    [ −3  −6  5 ] .
Solution: We compute the roots of the characteristic polynomial

det(λ I − A) = det [ λ   2    −2  ]
                   [ 2  λ+3   −2  ]
                   [ 3   6   λ−5  ]

             = λ (λ + 3) (λ − 5) − 12 − 24 + 6 (λ + 3) + 12 λ − 4 (λ − 5)

             = λ³ − 2 λ² − λ + 2 = (λ − 2) (λ − 1) (λ + 1),

where the roots were determined by guessing that λ = 1 is a root and then using polynomial long division and factorizing the resulting quadratic. Thus we have found that λ1 = 2, λ2 = 1 and λ3 = −1 are the eigenvalues of A.
To compute the corresponding eigenvectors xj, we solve for each eigenvalue λj the linear system
(λj I−A)xj = 0, which we write in augmented matrix form as (λj I−A|0) and solve by Gaussian
elimination.
For λ1 = 2, we have

(2 I − A)x1 = 0 ⇔ [ 2 2 −2 ]
                  [ 2 5 −2 ] x1 = 0,
                  [ 3 6 −3 ]

which we write in augmented matrix form as

[ 2 2 −2 | 0 ]
[ 2 5 −2 | 0 ]
[ 3 6 −3 | 0 ] .
We multiply the first row of the augmented matrix by 1/2, and in the next step, we add (−2) times the new first row to the second row and (−3) times the new first row to the third row:

⇔ [ 1 1 −1 | 0 ]    ⇔ [ 1 1 −1 | 0 ]
  [ 2 5 −2 | 0 ]      [ 0 3  0 | 0 ]
  [ 3 6 −3 | 0 ]      [ 0 3  0 | 0 ] .
Now we subtract the new second row from the new third row, and subsequently we divide the new second row by 3:

⇔ [ 1 1 −1 | 0 ]    ⇔ [ 1 1 −1 | 0 ]
  [ 0 3  0 | 0 ]      [ 0 1  0 | 0 ]
  [ 0 0  0 | 0 ]      [ 0 0  0 | 0 ] .
Finally, we subtract the second row from the first row and obtain

⇔ [ 1 0 −1 | 0 ]    ⇔ [ 1 0 −1 ]
  [ 0 1  0 | 0 ]      [ 0 1  0 ] x1 = 0 ⇔ x1 = α (1, 0, 1)T,
  [ 0 0  0 | 0 ]      [ 0 0  0 ]

where α ∈ R. Thus all eigenvectors x1 corresponding to λ1 = 2 are of the form x1 = α (1, 0, 1)T, where α ∈ R \ {0}.
For λ2 = 1, we have

(I − A)x2 = 0 ⇔ [ 1 2 −2 ]
                [ 2 4 −2 ] x2 = 0,
                [ 3 6 −4 ]

which we write in augmented matrix form as

[ 1 2 −2 | 0 ]
[ 2 4 −2 | 0 ]
[ 3 6 −4 | 0 ] .
We multiply the first row by (−2) and add it to the second row, and we multiply the first row by (−3) and add it to the third row. Subsequently we subtract the new second row from the new third row. Then we divide the new second row by 2:

⇔ [ 1 2 −2 | 0 ]    ⇔ [ 1 2 −2 | 0 ]    ⇔ [ 1 2 −2 | 0 ]
  [ 0 0  2 | 0 ]      [ 0 0  2 | 0 ]      [ 0 0  1 | 0 ]
  [ 0 0  2 | 0 ]      [ 0 0  0 | 0 ]      [ 0 0  0 | 0 ] .
Finally we add 2 times the second row to the first row. Thus

⇔ [ 1 2 0 | 0 ]    ⇔ [ 1 2 0 ]
  [ 0 0 1 | 0 ]      [ 0 0 1 ] x2 = 0 ⇔ x2 = α (2, −1, 0)T,
  [ 0 0 0 | 0 ]      [ 0 0 0 ]

with α ∈ R. Thus all eigenvectors corresponding to the eigenvalue λ2 = 1 are of the form x2 = α (2, −1, 0)T with α ∈ R \ {0}.
For λ3 = −1, we have

(−I − A)x3 = 0 ⇔ [ −1 2 −2 ]
                 [  2 2 −2 ] x3 = 0,
                 [  3 6 −6 ]

which we write in augmented matrix form as

[ −1 2 −2 | 0 ]
[  2 2 −2 | 0 ]
[  3 6 −6 | 0 ] .
We multiply the first row by 2 and add it to the second row, and we multiply the first row by 3 and add it to the third row. Subsequently we multiply the new second row by (−2) and add it to the third row. Afterwards we divide the new second row by 6:

⇔ [ −1  2  −2 | 0 ]    ⇔ [ −1 2 −2 | 0 ]    ⇔ [ −1 2 −2 | 0 ]
  [  0  6  −6 | 0 ]      [  0 6 −6 | 0 ]      [  0 1 −1 | 0 ]
  [  0 12 −12 | 0 ]      [  0 0  0 | 0 ]      [  0 0  0 | 0 ] .
Finally we multiply the first row by (−1) and add 2 times the second row to it:

⇔ [ 1 0  0 | 0 ]    ⇔ [ 1 0  0 ]
  [ 0 1 −1 | 0 ]      [ 0 1 −1 ] x3 = 0 ⇔ x3 = α (0, 1, 1)T,
  [ 0 0  0 | 0 ]      [ 0 0  0 ]

where α ∈ R. Thus all eigenvectors corresponding to the eigenvalue λ3 = −1 are of the form x3 = α (0, 1, 1)T, with α ∈ R \ {0}.
We summarize what we have found so far: the spectrum of A is

Λ(A) = {−1, 1, 2}, (2.3)

and the eigenspaces Eλ(A) of the eigenvalues λ are

E2 = {α (1, 0, 1)T : α ∈ R},
E1 = {α (2, −1, 0)T : α ∈ R},
E−1 = {α (0, 1, 1)T : α ∈ R}. (2.4)

This completes the example. □
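We can quickly check the computed spectrum in MATLAB (eig returns the eigenvalues, not necessarily in any particular order):

A = [0 -2 2; -2 -3 2; -3 -6 5];
lambda = eig(A)          % expected: -1, 1, 2 (in some order)
[V, D] = eig(A);         % columns of V are (normalized) eigenvectors
norm(A*V - V*D)          % should be (numerically) zero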
Exercise 19 Compute the eigenvalues and corresponding eigenvectors of the 3 × 3 matrix

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] .
Exercise 20 Consider the matrix A ∈ R3×3 given by

A = [  1  −1  0 ]
    [ −1   2  1 ]
    [  0   1  3 ] .
(a) Compute the eigenvalues of A by hand.
(b) Compute all eigenvectors to the eigenvalue that is an integer by hand.
An important property of eigenvalues is that they are invariant under so-called basis transfor-
mations or similarity transformations.
Lemma 2.5 (eigenvalues are invariant under basis transformation)
Let A ∈ Cn×n, and let S ∈ Cn×n with det(S) ≠ 0. Then S−1 A S is called a basis transformation or similarity transformation of A, and

det(λ I − S−1 A S) = det(λ I − A),

so that A and S−1 A S have the same eigenvalues. We say the eigenvalues of A are invariant under a basis transformation or a similarity transformation.
Proof of Lemma 2.5. From det(B C) = det(B) det(C) and S−1 I S = S−1 S = I,

det(λ I − S−1 A S) = det(S−1 (λ I − A) S) = det(S−1) det(λ I − A) det(S),

and, since det(S−1) = (det(S))−1, the result follows. □
Lemma 2.5 gives us the following idea: If we are only interested in the eigenvalues of a square
matrix A ∈ Cn×n, then it would be useful to find a suitable non-singular matrix S ∈ Cn×n,
such that the eigenvalues of S−1 A S are easier to compute.
In order to execute this idea, we first need to understand the nature of eigenvectors better.
The following lemma is elementary but has far reaching consequences.
Lemma 2.6 (eigenvectors to different eigenvalues are linearly independent)
Let A ∈ Cn×n. Eigenvectors to different eigenvalues of A are linearly independent. More
precisely, let λ1, λ2, . . . , λm be m distinct eigenvalues of A, and let x1,x2, . . . ,xm be cor-
responding eigenvectors, that is, Axj = λj xj for j = 1, 2, . . . , m. Then the eigenvectors
x1,x2, . . . ,xm are linearly independent.
Proof of Lemma 2.6. The proof is given by induction on m ≥ 2.
Initial step m = 2: Consider two different eigenvalues λ1 and λ2 and let x1 and x2 be two
corresponding eigenvectors, that is, Axi = λi xi, i = 1, 2. To show the linear independence of
x1 and x2 consider
α1 x1 + α2 x2 = 0. (2.5)
If (2.5) implies α1 = α2 = 0, then we have shown that x1 and x2 are linearly independent.
Assume therefore that at least one of the coefficients α1 and α2 in (2.5) is non-zero, say α1 ≠ 0.
Then from (2.5)

x1 = − (α2/α1) x2. (2.6)
Multiplying from the left with A on both sides of (2.6) yields

Ax1 = − (α2/α1) Ax2 ⇒ λ1 x1 = − (α2/α1) λ2 x2 = λ2 x1 ⇒ (λ1 − λ2)x1 = 0, (2.7)
where we have used (2.6) in the middle equation. Since λ1 − λ2 ≠ 0 and x1 ≠ 0, we have (λ1 − λ2)x1 ≠ 0, and the last formula in (2.7) is a contradiction. We see that only α1 = α2 = 0 is possible in (2.5), and hence x1 and x2 are linearly independent.
Induction step m → m + 1: The induction step is left as an exercise. □
Exercise 21 Show the induction step in the proof of Lemma 2.6.
After these preparations we can present one of the major theorems of linear algebra.
Theorem 2.7 (basis transformation into diagonal form)
Let A ∈ Cn×n, and let λ1, λ2, . . . , λn ∈ C denote its n complex eigenvalues. If there are
n linearly independent corresponding eigenvectors x1,x2, . . . ,xn (that is, Axj = λj xj
for j = 1, 2, . . . , n), then the n eigenvectors x1,x2, . . . ,xn form a basis of Cn. Under this
assumption, let S denote the matrix that contains the eigenvectors x1,x2, . . . ,xn as column
vectors, that is,
S := (x1,x2, . . . ,xn).
Then the basis transformation (or similarity transformation) S−1 A S yields the diagonal
matrix with the eigenvalues λ1, λ2, . . . , λn along the diagonal. In formulas,
S−1 A S = [ λ1  0   0  · · ·  0 ]
          [ 0   λ2  0  · · ·  0 ]
          [ 0   0   λ3   ⋱    ⋮ ]
          [ ⋮   ⋮    ⋱    ⋱   0 ]
          [ 0   0  · · ·  0  λn ] .
The proof of this theorem is surprisingly simple and very intuitive, and greatly helps in under-
standing Theorem 2.7.
Proof of Theorem 2.7. We first consider A S. Since the jth column vector of S = (si,j) is
the eigenvector xj (that is, xj = (s1,j, s2,j, . . . , sn,j)T ) and Axj = λj xj, we have that
(A S)k,j = ∑_{i=1}^{n} ak,i si,j = λj sk,j, k = 1, 2, . . . , n. (2.8)
In other words, A S is the matrix whose jth column is given by λj xj . The (i, j)th entry in
S−1 A S = S−1 (A S) is given by

(S−1 A S)i,j = ∑_{k=1}^{n} (S−1)i,k (A S)k,j = ∑_{k=1}^{n} (S−1)i,k λj sk,j = λj ∑_{k=1}^{n} (S−1)i,k sk,j = λj δi,j,

where we have used in the last step that S−1 S = I. Thus S−1 A S is indeed the diagonal matrix with the eigenvalues λ1, λ2, . . . , λn along the diagonal. □
Example 2.8 We illustrate Theorem 2.7 for our matrix from Example 2.4,

A = [  0  −2  2 ]
    [ −2  −3  2 ]
    [ −3  −6  5 ] .
In Example 2.4, we found that the spectrum of A is (see (2.3))
Λ(A) = {−1, 1, 2},
and that the eigenspaces of the eigenvalues are (see (2.4))

E2 = {α (1, 0, 1)T : α ∈ R},
E1 = {α (2, −1, 0)T : α ∈ R},
E−1 = {α (0, 1, 1)T : α ∈ R}.
Hence we choose the matrix S to be

S = [ 1   2  0 ]
    [ 0  −1  1 ]
    [ 1   0  1 ] ,

and we expect that

S−1 A S = [ 2 0  0 ]
          [ 0 1  0 ]
          [ 0 0 −1 ] . (2.9)
To verify (2.9), we compute S−1 and then execute the matrix multiplication S−1 A S. We write the augmented linear system (S|I), and use elementary row operations to transform it into (I|S−1), and we find (details left as an exercise)

[ 1   2  0 | 1 0 0 ]    [ 1 0 0 | −1 −2  2 ]
[ 0  −1  1 | 0 1 0 ] ⇔ [ 0 1 0 |  1  1 −1 ]
[ 1   0  1 | 0 0 1 ]    [ 0 0 1 |  1  2 −1 ] .

Thus the inverse S−1 is given by

S−1 = [ −1 −2  2 ]
      [  1  1 −1 ]
      [  1  2 −1 ] ,
and executing the matrix multiplications in (2.9) shows that (2.9) is indeed true.
Note that in the definition of the matrix S the normalization of the basis vectors plays no role.
That is, if we choose instead of S

T = [ α1  2α2   0 ]
    [  0  −α2  α3 ]
    [ α1    0  α3 ] ,

with any non-zero numbers α1, α2, α3 ∈ R, then we also have

T−1 A T = [ 2 0  0 ]
          [ 0 1  0 ]
          [ 0 0 −1 ] .

A permutation of the columns of S or T results in the corresponding permutation of the eigenvalues in the diagonal matrix. □
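The verification of (2.9) can also be carried out numerically; a short MATLAB check:

A = [0 -2 2; -2 -3 2; -3 -6 5];
S = [1 2 0; 0 -1 1; 1 0 1];
inv(S) * A * S            % expected: diag(2, 1, -1)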
Exercise 22 For the matrix A from Exercise 19,

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] ,

find a matrix S such that

S−1 A S = [ λ1  0   0 ]
          [  0  λ2  0 ]
          [  0  0  λ3 ] ,

with λ1 ≥ λ2 ≥ λ3. Compute S−1 and execute the matrix multiplication S−1 A S to verify that you have chosen S correctly.
Whether a matrix A ∈ Cn×n has n linearly independent eigenvectors is a non-trivial problem. The following lemma gives a sufficient but not a necessary condition for the existence of n linearly independent eigenvectors.
Lemma 2.9 (sufficient cond. for the existence of n lin. indep. eigenvectors)
Let A ∈ Cn×n have n distinct complex eigenvalues λ1, λ2, . . . , λn (that is, the eigenvalues
λ1, λ2, . . . , λn are all different). Then A has n linearly independent eigenvectors.
Proof of Lemma 2.9. For each eigenvalue λj choose one eigenvector xj . From Lemma 2.6,
the eigenvectors x1,x2, . . . ,xn are linearly independent, since the eigenvalues λ1, λ2, . . . , λn are
distinct. □
From det(S−1 A S) = det(S−1) det(A) det(S) = (det(S))−1 det(A) det(S) = det(A) it is clear
that the determinant is invariant under a basis transformation. Thus if λ1, λ2, . . . , λn are the n
complex eigenvalues of A and if A has n linearly independent eigenvectors, then Theorem 2.7
implies that
det(A) = λ1 λ2 · · · λn. (2.10)
We mention here that the trace of the square matrix (defined as the sum of its diagonal
elements) is also invariant under basis transformations, that is,
trace(S−1 A S) = trace(A).
Hence, we see with the help of Theorem 2.7 that if A ∈ Cn×n has n linearly independent
eigenvectors, then the trace of A is given by
trace(A) = λ1 + λ2 + · · · + λn, (2.11)
where λ1, λ2, . . . , λn denote the n complex eigenvalues of A.
In fact, (2.10) and (2.11) even hold without the assumption that A has n linearly independent
eigenvectors, but the proof of this needs more advanced linear algebra.
Later on we need the following special cases of Theorem 2.7 above, where A ∈ Rn×n is symmetric or A ∈ Cn×n is Hermitian. For these results, remember the following definitions: A matrix S ∈ Cn×n is called unitary if S∗ = S̄T = S−1, and a matrix S ∈ Rn×n is called orthogonal if ST = S−1.
Theorem 2.10 (basis transformation into diagonal form for Hermitian matrices)
Let A ∈ Cn×n be Hermitian, that is, A = A∗. Then there exists a unitary matrix S ∈ Cn×n such that

S∗ A S = S−1 A S = [ λ1  0   0  · · ·  0 ]
                   [ 0   λ2  0  · · ·  0 ]
                   [ 0   0   λ3   ⋱    ⋮ ]
                   [ ⋮   ⋮    ⋱    ⋱   0 ]
                   [ 0   0  · · ·  0  λn ] .
The values λ1, λ2, . . . , λn are real and are the eigenvalues of A, and the columns of the
matrix S are an orthonormal basis of eigenvectors. More precisely, if we denote the
jth column vector of S by xj, j = 1, 2, . . . , n, then x1,x2, . . . ,xn is an orthonormal basis of
Cn and Axj = λj xj, j = 1, 2, . . . , n.
For the special case of real matrices, we have the following result.
Theorem 2.11 (basis transformation into diagonal form for symmetric matrices)
Let A ∈ Rn×n be symmetric, that is, A = AT. Then there exists an orthogonal matrix S ∈ Rn×n such that

ST A S = S−1 A S = [ λ1  0   0  · · ·  0 ]
                   [ 0   λ2  0  · · ·  0 ]
                   [ 0   0   λ3   ⋱    ⋮ ]
                   [ ⋮   ⋮    ⋱    ⋱   0 ]
                   [ 0   0  · · ·  0  λn ] .
The values λ1, λ2, . . . , λn are real and are the eigenvalues of A, and the columns of the
matrix S are an orthonormal basis of eigenvectors. More precisely, if we denote the
jth column vector of S by xj, j = 1, 2, . . . , n, then x1,x2, . . . ,xn is an orthonormal basis of
Rn and Axj = λj xj, j = 1, 2, . . . , n.
Proof of Theorem 2.10. A proof of Theorem 2.10 will be discussed in Exercise 32 with the
help of the Schur factorization. □
Exercise 23 For the matrix A from Exercises 19 and 22,

A = [ 3/2   0  1/2 ]
    [  0    3   0  ]
    [ 1/2   0  3/2 ] ,

find an orthogonal matrix S such that

S−1 A S = [ λ1  0   0 ]
          [  0  λ2  0 ]
          [  0  0  λ3 ] , (2.12)
with λ1 ≥ λ2 ≥ λ3. Verify that your matrix S is orthogonal. Verify that (2.12) is true by
executing the matrix multiplications in (2.12).
Exercise 24 Let A ∈ Cn×n, and let S ∈ Cn×n be a non-singular matrix. Show that
trace(S−1 A S) = trace(A).
Exercise 25 Consider the real 2 × 2 matrix

A = (  3  −1 )
    ( −1   3 ) .
(a) Calculate the eigenvalues λ1 and λ2 (where λ1 ≥ λ2) of A by hand.
(b) Calculate the eigenspaces corresponding to the eigenvalues from (a) by hand.

(c) Find an orthogonal 2 × 2 matrix S (that is, ST = S−1) such that

ST A S = ( λ1   0 )
         (  0  λ2 ) , where λ1 > λ2.

Execute the matrix multiplication ST A S to verify that your choice of S is correct.
2.2 Upper Triangular Matrices and Back Substitution
Two important classes of matrices are upper triangular matrices and lower triangular matrices.
Definition 2.12 (upper and lower triangular matrices)
A square matrix A = (ai,j) in Cn×n or Rn×n is said to be upper triangular (or an upper
triangular matrix) if ai,j = 0 for i > j. Thus an n × n upper triangular matrix is of the
form
A = (ai,j) = [ a1,1 a1,2 a1,3 · · · a1,n ]
             [  0   a2,2 a2,3 · · · a2,n ]
             [  0    0   a3,3 · · · a3,n ]
             [  ⋮    ⋮     ⋱    ⋱    ⋮  ]
             [  0    0   · · ·  0   an,n ] . (2.13)
Similarly, an n × n matrix A = (ai,j) is said to be lower triangular (or a lower trian-
gular matrix) if ai,j = 0 for i < j.
Example 2.13 (upper and lower triangular matrices)
Let

A = [ 1 2 3 ]       B = [ −1  0  0 ]
    [ 0 1 2 ]  and      [  4 −2  0 ]
    [ 0 0 1 ]           [  5  6 −3 ] .

Then A is a 3 × 3 upper triangular matrix, and B is a 3 × 3 lower triangular matrix. □
The following lemma establishes some important properties of lower triangular and upper tri-
angular matrices.
Lemma 2.14 (properties of upper/lower triangular matrices)
The set of upper triangular matrices in Cn×n (or Rn×n) with the usual matrix addition and usual scalar multiplication with complex (or real) numbers forms a complex (or real) vector space.

Let A, B ∈ Cn×n (or A, B ∈ Rn×n) be upper triangular matrices. Then A B is also an upper triangular matrix in Cn×n (or Rn×n, respectively). If A is non-singular, then A−1 is also an upper triangular matrix in Cn×n (or in Rn×n, respectively).

Analogous statements hold for lower triangular matrices.

Let A = (ai,j) in Cn×n or in Rn×n be an upper triangular or lower triangular matrix. Then the following holds:

(i) det(A) = a1,1 a2,2 · · · an,n.

(ii) A is non-singular/invertible if and only if aj,j ≠ 0 for all j = 1, 2, . . . , n.

(iii) The eigenvalues of A are the entries a1,1, a2,2, . . . , an,n on the diagonal of A.
Proof of Lemma 2.14. The set of all n×n matrices in Cn×n (or Rn×n) with matrix addition
and scalar multiplication with complex (or real) numbers forms a complex (or real) vector
space. Therefore, to verify that the upper triangular matrices form a vector space, it is enough
to check the closure under addition and scalar multiplication. This is easily done and is left
as an exercise. The next statements are covered in the exercises, but we prove (i) to (iii) for
upper triangular matrices.
Let A be an n × n upper triangular matrix (2.13). Computing the determinant det(A) by expansion with respect to the first column yields

det(A) = a1,1 det [ a2,2 a2,3 · · · a2,n ]
                  [  0   a3,3 · · · a3,n ]
                  [  ⋮     ⋱    ⋱    ⋮  ]
                  [  0   · · ·  0   an,n ] .
The resulting submatrix whose determinant needs to be computed is again upper triangular, and repeating the procedure finally yields

det(A) = a1,1 a2,2 · · · an,n, (2.14)

which proves (i). Since a matrix is invertible if and only if det(A) ≠ 0, (2.14) immediately implies (ii). To see statement (iii), consider the matrix λ I − A. Since A is upper triangular and λ I is diagonal, the matrix λ I − A is also upper triangular and its entries on the diagonal are λ − aj,j, j = 1, 2, . . . , n. Thus from (2.14),

p(A, λ) = det(λ I − A) = (λ − a1,1) (λ − a2,2) · · · (λ − an,n),

and we see that the eigenvalues of A are indeed a1,1, a2,2, . . . , an,n. □
For an invertible upper triangular matrix A = (ai,j) in Cn×n (or in Rn×n), the linear system Ax = b,

[ a1,1 a1,2 a1,3 · · · a1,n ] [ x1 ]   [ b1 ]
[  0   a2,2 a2,3 · · · a2,n ] [ x2 ]   [ b2 ]
[  0    0   a3,3 · · · a3,n ] [ x3 ] = [ b3 ]
[  ⋮    ⋮     ⋱    ⋱    ⋮  ] [ ⋮  ]   [ ⋮  ]
[  0    0   · · ·  0   an,n ] [ xn ]   [ bn ] ,
is easily solved by observing that the last equation gives

an,n xn = bn ⇒ xn = bn/an,n. (2.15)
Then, the second last equation reads

an−1,n−1 xn−1 + an−1,n xn = bn−1,

and it can be solved for xn−1 via

xn−1 = (1/an−1,n−1) (bn−1 − an−1,n xn),
where we now use that xn was computed via (2.15) in the previous step. Proceeding in this way, the jth equation reads

aj,j xj + aj,j+1 xj+1 + · · · + aj,n xn = aj,j xj + ∑_{k=j+1}^{n} aj,k xk = bj,

and can be solved for xj as follows:

xj = (1/aj,j) (bj − ∑_{k=j+1}^{n} aj,k xk),
where we use that xn, xn−1, . . . , xj+1 are known from previous steps. We can continue this
procedure until we have computed all xj , j = n, n − 1, . . . , 2, 1. This procedure is called back
substitution and the following theorem summarizes what we have derived just now.
Theorem 2.15 (back substitution algorithm)
Let A = (a_{i,j}) in C^{n×n} (or in R^{n×n}) be an upper triangular matrix that is also invertible.
Then the solution to Ax = b can be computed with O(n^2) elementary operations via

\[
x_j = \frac{1}{a_{j,j}} \Big( b_j - \sum_{k=j+1}^{n} a_{j,k} x_k \Big), \qquad j = n, n-1, \ldots, 2, 1. \tag{2.16}
\]
For the definition of the Landau symbol O see Section 1.6.
Proof of Theorem 2.15. Essentially we have derived the theorem above before stating
it; the only part of the statement that needs some consideration is the count of O(n^2)
elementary operations. Let us consider (2.16) for a given fixed j. The computation of x_j
involves n − j additions/subtractions and n − j + 1 multiplications/divisions, that is,
2n + 1 − 2j elementary operations. Thus the back substitution algorithm needs in total

\[
\sum_{j=1}^{n} (2n + 1 - 2j) = (2n+1)\,n - 2 \sum_{j=1}^{n} j = (2n+1)\,n - (n+1)\,n = n^2,
\]

that is, O(n^2) elementary operations. □
Example 2.16 (back substitution)
Solve the following linear system with an upper triangular matrix with back substitution:

\[
\begin{pmatrix} 1 & 0 & 2 \\ 0 & 1 & -1 \\ 0 & 0 & -3 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} 3 \\ 0 \\ 6 \end{pmatrix}.
\]
Solution: Using back substitution, we have

\[
x_3 = \frac{6}{-3} = -2, \quad
x_2 = \frac{1}{1} \big( 0 - (-1)\,x_3 \big) = x_3 = -2, \quad
x_1 = \frac{1}{1} \big( 3 - 0\,x_2 - 2\,x_3 \big) = 3 - 2\,(-2) = 7.
\]

Thus the solution is x = (7, −2, −2)^T. □
The MATLAB code for the implementation of the back substitution algorithm is:
function x = back_sub(U,b)
% executes the back substitution algorithm (2.16) for solving U x = b
% input:  U = n by n invertible upper triangular matrix
%         b = right-hand side with n entries
% output: x = 1 by n (row) vector containing the solution
n = size(U,1);
x = zeros(1,n);
x(n) = b(n) / U(n,n);
for i = n-1:-1:1
    % subtract the contributions of the already computed x(i+1),...,x(n)
    x(i) = (b(i) - U(i,i+1:n) * x(i+1:n)') / U(i,i);
end
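As a quick check, the function reproduces the hand computation of Example 2.16 (a usage
sketch, assuming the file back_sub.m above is on the MATLAB path):

U = [1 0 2; 0 1 -1; 0 0 -3];
b = [3; 0; 6];
x = back_sub(U,b)    % returns x = [7 -2 -2], matching Example 2.16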
Exercise 26 Solve the following linear system by hand with the back substitution algorithm:

\[
\begin{pmatrix} 1 & 1 & 1 \\ 0 & 2 & 2 \\ 0 & 0 & 3 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} -1 \\ 3 \\ 6 \end{pmatrix}.
\]
Exercise 27 Solve the following linear system by hand with the back substitution algorithm:

\[
\begin{pmatrix} 2 & -1 & 3 & 1 \\ 0 & 1 & 2 & -1 \\ 0 & 0 & -2 & 1 \\ 0 & 0 & 0 & 3 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} 12 \\ -3 \\ 1 \\ 9 \end{pmatrix}.
\]
Exercise 28 Show that the upper triangular matrices with diagonal elements all different from
zero, with the usual matrix multiplication, form a (multiplicative) group.
Exercise 29 Forward substitution: Consider a linear system Ax = b, where A ∈ Rn×n
is a lower triangular matrix, b ∈ Rn the given right-hand side, and x ∈ Rn the unknown
solution. Analogously to the back substitution algorithm we can formulate a forward sub-
stitution algorithm to compute xj, j = 1, 2, . . . , n − 1, n. Derive the forward substitution
algorithm.
2.3 Schur Factorization: A Triangular Canonical Form
In this section we encounter the Schur factorization, which guarantees that for any matrix
A ∈ C^{n×n} there exists a unitary matrix S such that S* A S is an upper triangular matrix. Since
the matrix S is unitary, we have S^{-1} = S* and therefore S* A S = S^{-1} A S, that is, we have
a basis transformation (or similarity transformation) with a unitary matrix that transforms A
into an upper triangular matrix. The proof of the Schur factorization is constructive in that
we will explicitly construct the matrix S with the help of so-called Householder matrices or
elementary Hermitian matrices.
Definition 2.17 (Householder matrix or elementary Hermitian matrix)
A Householder matrix or elementary Hermitian matrix is any matrix of the form

\[
H(w) := I - 2\,w w^*, \quad \text{where } w \in \mathbb{C}^n \text{ with } w^* w = \|w\|_2^2 = 1 \text{ or } w = 0. \tag{2.17}
\]
Figure 2.1 illustrates that the Householder matrix H(w) with w ≠ 0 represents a reflection
in the hyperplane

\[
S_w = \{ z \in \mathbb{C}^n : w^* z = 0 \}
\]

that is orthogonal to w. Indeed, consider any vector a ∈ C^n and decompose it into the
component in the direction of w and the orthogonal part (which lies in S_w):

\[
a = (w^* a)\,w + \big( a - (w^* a)\,w \big). \tag{2.18}
\]
If we apply H(w) to the vector a, then, from (2.18) and w^* w = 1,

\begin{align*}
H(w)\,a &= (I - 2\,w w^*)\,a = a - 2\,w w^* a \\
&= (w^* a)\,w + \big( a - (w^* a)\,w \big) - 2\,w w^* \Big( (w^* a)\,w + \big( a - (w^* a)\,w \big) \Big) \\
&= (w^* a)\,w + \big( a - (w^* a)\,w \big) - 2\,w w^* \big( (w^* a)\,w \big) \\
&= (w^* a)\,w + \big( a - (w^* a)\,w \big) - 2\,(w^* a)\,(w^* w)\,w \\
&= -(w^* a)\,w + \big( a - (w^* a)\,w \big),
\end{align*}

where we have used that a − (w^* a) w is orthogonal to w. From the last representation we see
that H(w) a is indeed the reflection of a in the hyperplane S_w.
[Figure 2.1: Householder transformation. In the picture, w̃ denotes the projection of a onto
the hyperplane S_w. Then a = (w^* a) w + w̃, and since w^* w̃ = 0 and w^* w = 1, we find
H(w) a = (I − 2 w w^*)((w^* a) w + w̃) = (w^* a) w + w̃ − 2 (w^* a) w = −(w^* a) w + w̃.]
Example 2.18 (Householder matrix)
Let w = (0, −3/5, 4/5)^T. Then ‖w‖_2 = 1, and the matrix

\[
H(w) = I - 2 \begin{pmatrix} 0 \\ -\tfrac{3}{5} \\ \tfrac{4}{5} \end{pmatrix}
\big( 0, -\tfrac{3}{5}, \tfrac{4}{5} \big)
= I - 2 \begin{pmatrix} 0 & 0 & 0 \\ 0 & \tfrac{9}{25} & -\tfrac{12}{25} \\ 0 & -\tfrac{12}{25} & \tfrac{16}{25} \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{7}{25} & \tfrac{24}{25} \\ 0 & \tfrac{24}{25} & -\tfrac{7}{25} \end{pmatrix}
\]

is a 3 × 3 Householder matrix. □
The next lemma states some properties of Householder matrices.
Lemma 2.19 (properties of Householder matrices)
A Householder matrix H(w), given by (2.17), has the following properties:
(i) H(w) is Hermitian, that is, (H(w))^* = H(w).

(ii) H(w) is invertible/non-singular.

(iii) det(H(w)) = −1 for w ≠ 0.

(iv) H(w) is unitary, that is, (H(w))^{-1} = (H(w))^*. Hence, a product of Householder
matrices is unitary.

(v) Storing H(w) only requires the n elements of w.
Proof of Lemma 2.19. From (A B)^* = B^* A^*, (A + B)^* = A^* + B^*, and (A^*)^* = A we find

\[
(H(w))^* = (I - 2\,w w^*)^* = I^* - 2\,(w w^*)^* = I - 2\,(w^*)^* w^* = I - 2\,w w^* = H(w),
\]

thus proving (i).

Next we work out det(H(w)). If w = 0, then H(w) = I and hence det(H(w)) = 1. If
w ≠ 0, then we will show that the eigenvalues of H(w) are 1, with multiplicity n − 1, and −1,
with multiplicity 1. Therefore, we have that det(H(w)) = (−1) · 1^{n−1} = −1, which proves (iii).
Consider as before the hyperplane S_w = {z ∈ C^n : w^* z = 0}, which is an (n − 1)-dimensional
subspace of C^n. For any vector a ∈ S_w, we have w^* a = 0, and hence for a ∈ S_w,

\[
H(w)\,a = (I - 2\,w w^*)\,a = a - 2\,w\,(w^* a) = a,
\]

that is, a is an eigenvector of H(w) corresponding to the eigenvalue λ = 1. Since dim(S_w) =
n − 1, the eigenvalue λ = 1 has at least n − 1 linearly independent eigenvectors and hence it
has at least multiplicity n − 1. Now consider the vector w itself. Then, since w^* w = 1,

\[
H(w)\,w = (I - 2\,w w^*)\,w = w - 2\,w\,(w^* w) = -w,
\]

that is, w is an eigenvector corresponding to the eigenvalue λ = −1. Combining these results,
we see that the eigenvalue λ = 1 has multiplicity n − 1 and the eigenvalue λ = −1 has
multiplicity 1.
From the proof so far it is clear that H(w) is invertible, since we have established that its
determinant is different from zero. Thus (ii) holds true.
To show that H(w) is unitary, we have to show that

\[
(H(w))^* H(w) = H(w)\,(H(w))^* = I.
\]

Since (H(w))^* = H(w) from (i), it is enough to show that H(w) H(w) = I. Indeed,

\begin{align*}
H(w)\,H(w) &= (I - 2\,w w^*)(I - 2\,w w^*) \\
&= I - 4\,w w^* + 4\,(w w^*)(w w^*) \\
&= I - 4\,w w^* + 4\,w\,(w^* w)\,w^* \\
&= I - 4\,w w^* + 4\,w w^* = I,
\end{align*}

where we have used the associativity of matrix multiplication and w^* w = 1.

That the product of Householder matrices is also unitary follows from the following general
statement: if A and B in C^{n×n} are unitary, then A B is unitary. Indeed, (A B)^* = B^* A^* =
B^{-1} A^{-1}, and we know that B^{-1} A^{-1} is the inverse matrix of A B.

Statement (v) is evident. □
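The properties in Lemma 2.19 can also be checked numerically; the following MATLAB
sketch (our own illustration, not part of the original notes) verifies (i), (iii), and (iv) for
the vector w from Example 2.18:

w = [0; -3/5; 4/5];
H = eye(3) - 2*(w*w');       % Householder matrix H(w), see (2.17)
norm(H - H', inf)            % (i)   Hermitian (here: real symmetric), returns 0
det(H)                       % (iii) determinant, returns -1
norm(H*H - eye(3), inf)      % (iv)  unitary, since H(w) H(w) = I, returns ~0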
Lemma 2.20 (construction of Householder matrices)
Let x and y be given vectors in Cn such that y∗ y = x∗ x and y∗ x = x∗ y. Then there exists
a Householder matrix (or an elementary Hermitian matrix) H(w), such that H(w)x = y
and H(w)y = x. If x and y are real then so is w.
Proof of Lemma 2.20. If x = y, then w = 0 and H(w) = I. If x ≠ y, we define

\[
w = \frac{x - y}{\sqrt{(x-y)^* (x-y)}} = \frac{x - y}{\|x - y\|_2}. \tag{2.19}
\]

Clearly, w^* w = 1, and from (2.19),

\[
H(w)\,x = (I - 2\,w w^*)\,x = x - (x - y)\,\frac{2\,(x-y)^* x}{(x-y)^* (x-y)}, \tag{2.20}
\]

and, using x^* x = y^* y and x^* y = y^* x,

\[
2\,(x-y)^* x = 2\,x^* x - 2\,y^* x = (x^* x + y^* y) - (y^* x + x^* y) = (x-y)^* (x-y). \tag{2.21}
\]

Substituting (2.21) into (2.20) now yields H(w) x = y. From (H(w))^{-1} = (H(w))^* = H(w)
(see (i) and (iv) in Lemma 2.19) and H(w) x = y, we have

\[
H(w)\,y = H(w)\,\big( H(w)\,x \big) = \big( H(w)\,H(w) \big)\,x = I\,x = x.
\]

If the vectors x and y are real, then, from the definition (2.19), the vector w is clearly also
real. □
Lemma 2.20 is often used to map a given vector x = (x_1, x_2, . . . , x_n)^T onto a scalar multiple
of the first unit vector e_1 = (1, 0, 0, . . . , 0)^T, that is, we want to find a Householder matrix
H(w) such that

\[
H(w)\,x = c\,e_1
\]

with a suitable complex number c. From the conditions in Lemma 2.20, we find

\[
(c\,e_1)^* (c\,e_1) = (\bar{c}\,c)\,(e_1^* e_1) = |c|^2 = x^* x = \|x\|_2^2 \quad \Rightarrow \quad |c| = \|x\|_2
\]

and

\[
(c\,e_1)^* x = \bar{c}\,x_1 = x^* (c\,e_1) = c\,\bar{x}_1 = \overline{\bar{c}\,x_1} \quad \Rightarrow \quad \bar{c}\,x_1 \in \mathbb{R},
\]

and thus c is a real multiple of x_1. Combining both properties, we see that

\[
c = \frac{x_1}{|x_1|}\,\|x\|_2,
\]

and we choose the vector w of the Householder matrix from (2.19) as

\[
w = \frac{x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1}{\sqrt{\Big( x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1 \Big)^* \Big( x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1 \Big)}}
= \frac{x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1}{\Big\| x - \frac{x_1}{|x_1|}\,\|x\|_2\,e_1 \Big\|_2}.
\]
We can now use Lemma 2.20 to prove the following Theorem.
Theorem 2.21 (Schur factorization)
Let A be a matrix in Cn×n. Then there exists a unitary matrix S ∈ Cn×n such that S∗ A S
is an upper triangular matrix. This is known as the Schur factorization of A.
Proof of Theorem 2.21. The proof is given by induction over n.
Initial step: Clearly the result holds for n = 1.
Induction step n − 1 → n: Assume now the result holds for all (n − 1)× (n − 1) matrices. We
need to show that the result also holds for all n × n matrices.
Let x be an eigenvector to some eigenvalue λ of A, that is,

\[
A\,x = \lambda\,x, \quad \text{where } x \neq 0. \tag{2.22}
\]

By Lemma 2.20 and our considerations after this lemma, there exists a Householder matrix
H(w_1) such that

\[
H(w_1)\,x = c_1\,e_1 \quad \text{and} \quad H(w_1)\,e_1 = \frac{1}{c_1}\,x, \tag{2.23}
\]

with |c_1| = ‖x‖_2 ≠ 0. Using (H(w_1))^* = H(w_1) and (2.22) and (2.23), we have

\[
(H(w_1))^* A\,H(w_1)\,e_1 = \frac{1}{c_1}\,H(w_1)\,A\,x = \frac{\lambda}{c_1}\,H(w_1)\,x
= \frac{\lambda}{c_1}\,c_1\,e_1 = \lambda\,e_1.
\]

Since (H(w_1))^* A H(w_1) e_1 is the first column of (H(w_1))^* A H(w_1), we may write the
matrix (H(w_1))^* A H(w_1) in the form

\[
(H(w_1))^* A\,H(w_1) = H(w_1)\,A\,H(w_1) =
\begin{pmatrix} \lambda & a^* \\ 0 & A^{(1)} \end{pmatrix}, \tag{2.24}
\]

for some vector a ∈ C^{n−1} and some (n − 1) × (n − 1) matrix A^{(1)}.
By the induction hypothesis there exists an (n − 1) × (n − 1) unitary matrix V such that
V^* A^{(1)} V = T, where T is an upper triangular (n − 1) × (n − 1) matrix. Consider the matrix

\[
S = H(w_1) \begin{pmatrix} 1 & 0^T \\ 0 & V \end{pmatrix}
\quad \text{with} \quad
S^* = \begin{pmatrix} 1 & 0^T \\ 0 & V^* \end{pmatrix} (H(w_1))^*.
\]

From V^* V = V V^* = I and (H(w_1))^* H(w_1) = H(w_1) (H(w_1))^* = I, we get S S^* = S^* S = I,
that is, the matrix S is unitary. We will now show that S^* A S is an upper triangular matrix.
From (2.24),

\begin{align*}
S^* A\,S &= \begin{pmatrix} 1 & 0^T \\ 0 & V^* \end{pmatrix} (H(w_1))^* A\,H(w_1) \begin{pmatrix} 1 & 0^T \\ 0 & V \end{pmatrix}
= \begin{pmatrix} 1 & 0^T \\ 0 & V^* \end{pmatrix} \begin{pmatrix} \lambda & a^* \\ 0 & A^{(1)} \end{pmatrix} \begin{pmatrix} 1 & 0^T \\ 0 & V \end{pmatrix} \\
&= \begin{pmatrix} 1 & 0^T \\ 0 & V^* \end{pmatrix} \begin{pmatrix} \lambda & a^* V \\ 0 & A^{(1)} V \end{pmatrix}
= \begin{pmatrix} \lambda & a^* V \\ 0 & V^* A^{(1)} V \end{pmatrix}
= \begin{pmatrix} \lambda & a^* V \\ 0 & T \end{pmatrix}.
\end{align*}

Hence, S^* A S is upper triangular. □
Example 2.22 (Schur factorization)
Find the Schur factorization of the matrix

\[
A = \begin{pmatrix} 3 & 0 & 1 \\ 0 & 3 & 0 \\ 1 & 0 & 3 \end{pmatrix}.
\]
Solution: To do this we follow the steps of the proof of Theorem 2.21. First we find the
eigenvalues of A:

\[
p(A, \lambda) = \det(\lambda I - A)
= \det \begin{pmatrix} \lambda-3 & 0 & -1 \\ 0 & \lambda-3 & 0 \\ -1 & 0 & \lambda-3 \end{pmatrix}
= (\lambda-3)^3 - (\lambda-3) = (\lambda-3)\big[ (\lambda-3)^2 - 1 \big] = (\lambda-3)(\lambda-2)(\lambda-4),
\]

and we see that the eigenvalues are λ_1 = 4, λ_2 = 3, and λ_3 = 2.
We choose λ_2 = 3, find a corresponding eigenvector, and determine the Householder matrix
that maps the eigenvector onto c e_1. For λ = λ_2 = 3, the eigenvector x_2 satisfies the linear
system

\[
\begin{pmatrix} 0 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{pmatrix} x_2 = 0
\quad \Rightarrow \quad
x_2 = \alpha \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad \alpha \in \mathbb{R}.
\]

For α = 1, ‖e_1‖_2 = ‖x_2‖_2 = 1 and e_1^* x_2 = x_2^* e_1 = 0. Hence, we choose the vector w_1 for
the first Householder matrix as

\[
w_1 = \frac{x_2 - e_1}{\|x_2 - e_1\|_2} = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 \\ 1 \\ 0 \end{pmatrix}.
\]
The corresponding Householder matrix is given by

\[
H(w_1) = I - 2\,w_1 w_1^*
= I - \frac{2}{(\sqrt{2})^2} \begin{pmatrix} -1 \\ 1 \\ 0 \end{pmatrix} (-1, 1, 0)
= I - \begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
= \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\]

and it is easily verified that indeed H(w_1) x_2 = e_1 and H(w_1) e_1 = x_2. Now we execute the
matrix multiplication (H(w_1))^* A H(w_1) = H(w_1) A H(w_1) and obtain

\[
H(w_1)\,A\,H(w_1)
= \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 3 & 0 & 1 \\ 0 & 3 & 0 \\ 1 & 0 & 3 \end{pmatrix}
\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 0 & 3 & 1 \\ 3 & 0 & 0 \\ 0 & 1 & 3 \end{pmatrix}
= \begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 1 \\ 0 & 1 & 3 \end{pmatrix}.
\]
Now we consider the 2 × 2 submatrix

\[
A^{(1)} = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix},
\]

and determine its eigenvalues:

\[
p(A^{(1)}, \lambda) = \det(\lambda I - A^{(1)})
= \det \begin{pmatrix} \lambda-3 & -1 \\ -1 & \lambda-3 \end{pmatrix}
= (\lambda-3)^2 - 1 = (\lambda-2)(\lambda-4).
\]
The eigenvalues of A^{(1)} are λ_1 = 4 and λ_2 = 2. We choose λ_2 = 2 and find a corresponding
eigenvector x^{(1)}_2:

\[
\begin{pmatrix} -1 & -1 \\ -1 & -1 \end{pmatrix} x^{(1)}_2 = 0
\quad \Rightarrow \quad
x^{(1)}_2 = \alpha \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad \alpha \in \mathbb{R}.
\]

For α = 1, ‖x^{(1)}_2‖_2 = ‖\sqrt{2}\,e_1‖_2, where e_1 is now the first unit vector in R^2, and we
have (x^{(1)}_2)^* (\sqrt{2}\,e_1) = (\sqrt{2}\,e_1)^* x^{(1)}_2. The vector w^{(1)}_2 of the Householder
matrix in R^2 is given by

\[
w^{(1)}_2 = \frac{x^{(1)}_2 - \sqrt{2}\,e_1}{\big\| x^{(1)}_2 - \sqrt{2}\,e_1 \big\|_2}
= \big( (1-\sqrt{2})^2 + 1 \big)^{-1/2} \begin{pmatrix} 1-\sqrt{2} \\ -1 \end{pmatrix}
= \big( 2\,(2-\sqrt{2}) \big)^{-1/2} \begin{pmatrix} 1-\sqrt{2} \\ -1 \end{pmatrix}.
\]
Thus the Householder matrix in R^2 is given by

\begin{align*}
H(w^{(1)}_2) = I - 2\,w^{(1)}_2 \big( w^{(1)}_2 \big)^*
&= I - (2-\sqrt{2})^{-1} \begin{pmatrix} 1-\sqrt{2} \\ -1 \end{pmatrix} \big( 1-\sqrt{2},\, -1 \big) \\
&= I + \big( \sqrt{2}\,(1-\sqrt{2}) \big)^{-1} \begin{pmatrix} (1-\sqrt{2})^2 & -(1-\sqrt{2}) \\ -(1-\sqrt{2}) & 1 \end{pmatrix} \\
&= I + (\sqrt{2})^{-1} \begin{pmatrix} 1-\sqrt{2} & -1 \\ -1 & (1-\sqrt{2})^{-1} \end{pmatrix} \\
&= \begin{pmatrix} (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}
= -\frac{1}{\sqrt{2}} \begin{pmatrix} -1 & 1 \\ 1 & 1 \end{pmatrix}.
\end{align*}
The corresponding unitary matrix in R^3 is then given by

\[
H_2 := \begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix},
\]

and

\begin{align*}
H_2 \big( H(w_1)\,A\,H(w_1) \big) H_2
&= \begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}
\begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 1 \\ 0 & 1 & 3 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix} \\
&= \begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}
\begin{pmatrix} 3 & 0 & 0 \\ 0 & \sqrt{2} & -2\sqrt{2} \\ 0 & -\sqrt{2} & -2\sqrt{2} \end{pmatrix}
= \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix}.
\end{align*}
Thus we have

\[
S^* A\,S = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix}
\]

with the unitary matrix

\[
S := H(w_1)\,H_2 =
\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}
=
\begin{pmatrix} 0 & (\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \\ 1 & 0 & 0 \\ 0 & -(\sqrt{2})^{-1} & -(\sqrt{2})^{-1} \end{pmatrix}.
\]

In this example the Schur factorization has brought A into diagonal form, but in general this
is not the case, and we only obtain an upper triangular matrix. □
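The factorization just derived is easy to confirm numerically; this MATLAB sketch checks
that S is unitary and that S* A S is the diagonal matrix above:

A = [3 0 1; 0 3 0; 1 0 3];
r = 1/sqrt(2);
S = [0 r -r; 1 0 0; 0 -r -r];
norm(S'*S - eye(3), inf)   % ~0, so S is unitary
S'*A*S                     % ~diag(3,2,4), as derived in Example 2.22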
We note here that the Schur factorization is mainly of 'theoretical interest' and is not used
directly as a practical algorithm. For example, you can use the Schur factorization to prove
the statements investigated in the remark and the exercise below.
Remark 2.23 (special case of Theorem 2.21 for A∗ = A)
An important consequence of Theorem 2.21 is that if A is Hermitian, that is, A∗ = A, then the
upper triangular matrix S∗ A S is also Hermitian, and thus S∗ A S = S−1 A S must be a real
diagonal matrix. Furthermore, A is Hermitian if and only if all eigenvalues are real
and there are n orthonormal eigenvectors!
Exercise 30 Construct a Householder matrix that maps the vector (2, 0, 1)^T onto the vector
(\sqrt{5}, 0, 0)^T.
Exercise 31 Let A ∈ C^{n×n} be an Hermitian matrix, that is, A^* = \bar{A}^T = A. Without
using any known results, but by just exploiting the definition of an Hermitian matrix, show
that A has only real eigenvalues.
Exercise 32 Let A be a Hermitian matrix.
(a) By using the Schur factorization, show that there exists a unitary matrix S such that
S∗ A S = U , where U is a real diagonal matrix.
(b) Use the Schur factorization to show that Cn has an orthonormal basis of eigenvectors of A.
(c) Show that A is positive definite if and only if all eigenvalues are positive.
(d) Show that if A is positive definite, then det(A) > 0.
(e) Show that A is positive definite if and only if A = Q∗ Q, with some matrix Q satisfying
det(Q) 6= 0.
2.4 Vector Norms
In this section we briefly revise some material on norms that has been covered in ‘Further
Analysis’.
Definition 2.24 (norm, normed linear space, and unit vector)
Let V be a complex (or real) vector space. A norm for V is a function ‖ · ‖ : V → R with
the following properties: For all x,y ∈ V and α ∈ C (or α ∈ R) we have
(i) ‖x‖ ≥ 0; and ‖x‖ = 0 if and only if x = 0,
(ii) ‖αx‖ = |α| ‖x‖, and
(iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).
A vector space V with a norm ‖ · ‖ is called a normed vector space or normed linear
space.
A unit vector with respect to the norm ‖ · ‖ is a vector x ∈ V such that ‖x‖ = 1.
Example 2.25 (norms on C^n and R^n)
Here is a list of the most important vector norms for C^n (or R^n): for x = (x_1, x_2, . . . , x_n)^T
in C^n or R^n, define the p-norms by

(a) 1-norm: ‖x‖_1 := \sum_{j=1}^{n} |x_j| = |x_1| + |x_2| + . . . + |x_n|,

(b) 2-norm or Euclidean norm: ‖x‖_2 := \Big( \sum_{j=1}^{n} |x_j|^2 \Big)^{1/2},

(c) p-norm: ‖x‖_p := \Big( \sum_{j=1}^{n} |x_j|^p \Big)^{1/p} = \big( |x_1|^p + |x_2|^p + . . . + |x_n|^p \big)^{1/p}, for 1 ≤ p < ∞,

(d) ∞-norm: ‖x‖_∞ := \max_{1 \le j \le n} |x_j|.

We will use these p-norms frequently in this course, in particular, for p ∈ {1, 2, ∞}. □
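In MATLAB the p-norms above are available through the built-in function norm, e.g. for a
sample vector:

x = [3; -4; 0];
norm(x,1)      % 1-norm:   |3| + |-4| + |0| = 7
norm(x,2)      % 2-norm:   sqrt(9 + 16) = 5
norm(x,inf)    % inf-norm: max{3, 4, 0} = 4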
Example 2.26 (unit balls with respect to some norms on R^n)
The unit ball in R^n with respect to the p-norm ‖ · ‖_p is defined to be the set

\[
B_p := \{ x \in \mathbb{R}^n : \|x\|_p \le 1 \}.
\]

For R^2, Figure 2.2 below shows the unit balls B_1, B_2, and B_∞ with respect to the 1-norm,
the Euclidean norm, and the ∞-norm, respectively. Only the unit ball with respect to the
Euclidean norm looks like we imagine a ball. □
The following theorem generalizes the Cauchy-Schwarz inequality.
[Figure 2.2: The unit balls B_p in R^2 with respect to the p-norms ‖ · ‖_1 (blue), ‖ · ‖_2 (black),
and ‖ · ‖_∞ (red), respectively.]
Theorem 2.27 (Hölder's inequality)
The p-norms ‖ · ‖_p, where 1 ≤ p ≤ ∞, for C^n (and R^n), defined in Example 2.25 above,
satisfy Hölder's inequality:

\[
|x^* y| \le \|x\|_p \, \|y\|_q, \quad \text{where } \frac{1}{p} + \frac{1}{q} = 1.
\]

The special case p = q = 2 is known as the Cauchy-Schwarz inequality.
The next definition has far reaching consequences.
Definition 2.28 (equivalent norms)
Two norms ‖ · ‖ and ‖| · ‖| for a (real or complex) vector space V are called equivalent if
there are two positive constants c_1 and c_2 such that

\[
c_1 \|x\| \le \||x\|| \le c_2 \|x\| \quad \text{for all } x \in V.
\]
A norm allows us to define the notion of convergence of sequences.
Definition 2.29 (convergence with respect to a norm)
Let V be a (real or complex) vector space, and let ‖ · ‖ : V → R be a norm for V . A sequence
{x_k} ⊂ V converges with respect to ‖ · ‖ to x ∈ V if

\[
\lim_{k \to \infty} \|x_k - x\| = 0.
\]
If two norms are equivalent, then they define the same notion of convergence.
Theorem 2.30 (equivalence of all norms on Cn (or Rn))
On C^n (or R^n) all norms are equivalent.
Proof of Theorem 2.30. It suffices to show that all norms are equivalent to the 1-norm
‖ · ‖_1. Let ‖ · ‖ be any norm on C^n. The representation x = \sum_{j=1}^{n} x_j e_j of any vector
x = (x_1, x_2, . . . , x_n)^T shows that

\[
\|x\| = \Big\| \sum_{j=1}^{n} x_j e_j \Big\| \le \sum_{j=1}^{n} \|x_j e_j\| = \sum_{j=1}^{n} |x_j| \, \|e_j\|
\le \Big( \max_{1 \le j \le n} \|e_j\| \Big) \|x\|_1 =: M \|x\|_1,
\]

with M := \max_{1 \le j \le n} \|e_j\|. From this we can conclude that the norm ‖ · ‖ is Lipschitz
continuous with respect to the 1-norm ‖ · ‖_1, that is,

\[
\big| \|x\| - \|y\| \big| \le \|x - y\| \le M \|x - y\|_1 \quad \text{for all } x, y \in \mathbb{C}^n.
\]

(See Exercise 35 for the first inequality.) Since the unit sphere S_1 = {x ∈ C^n : ‖x‖_1 = 1}
in C^n is closed and bounded, the unit sphere S_1 is compact. Hence the norm ‖ · ‖ attains its
minimum and maximum on S_1 (because it is a continuous function), that is, there are positive
constants c_1 and c_2 such that

\[
c_1 \le \|x\| \le c_2 \quad \text{for all } x \in \mathbb{C}^n \text{ with } \|x\|_1 = 1. \tag{2.25}
\]

For general x ∈ C^n \ {0}, we apply (2.25) to the unit vector y = x/‖x‖_1 and obtain

\[
c_1 \le \frac{1}{\|x\|_1} \|x\| \le c_2 \quad \Rightarrow \quad c_1 \|x\|_1 \le \|x\| \le c_2 \|x\|_1. \tag{2.26}
\]

The second estimate in (2.26) holds for all x ∈ C^n \ {0} and trivially also for x = 0. Thus we
have verified that ‖ · ‖ and ‖ · ‖_1 are equivalent. □
Example 2.31 (equivalent norms on R^n)
For example, we have for all x ∈ R^n:

\[
\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2, \qquad
\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty, \qquad
\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty.
\]

These estimates will be proved in Exercise 36 below. □
Exercise 33 Show that ‖ · ‖1 and ‖ · ‖∞ are norms for Cn.
Exercise 34 By considering the inequality

\[
0 \le (\alpha\,x + \beta\,y)^T (\alpha\,x + \beta\,y)
\]

and choosing α, β ∈ R appropriately, for x, y ∈ R^n, prove the Cauchy-Schwarz inequality.
Exercise 35 Prove the lower triangle inequality: If ‖ · ‖ is a norm for a vector space V , then

\[
\big| \|x\| - \|y\| \big| \le \|x - y\| \quad \text{for all } x, y \in V.
\]

Hint: Consider writing y = x + (y − x), and use the triangle inequality.
Exercise 36 Show the inequalities in Example 2.31.
2.5 Matrix Norms
When analyzing matrix algorithms we will require the use of matrix norms, since they allow
us to estimate whether a matrix is 'well-conditioned'. If the matrix is not 'well-conditioned'
(which usually means that the matrix is close to being singular), then the quality of a
numerically computed solution can be poor.
We recall that the set of all complex (or real) m × n matrices with the usual component-wise
matrix addition and scalar multiplication forms a complex (or real) vector space.
Lemma 2.32 (m × n matrices form a vector space)
The set Cm×n (or Rm×n) of all complex (or real) m × n matrices with the component-wise
addition of A = (ai,j) and B = (bi,j), defined by
(A + B)i,j := ai,j + bi,j , i = 1, 2, . . . , m; j = 1, 2, . . . , n,
and the component-wise scalar multiplication of α ∈ C (or α ∈ R) and A = (ai,j), defined
by
(α A)i,j := α ai,j , i = 1, 2, . . . , m; j = 1, 2, . . . , n,
is a complex (or real) vector space.
Proof of Lemma 2.32. The proof is straightforward and was covered in Exercise 3. □
Since C^{m×n} (and R^{m×n}) are vector spaces, we can introduce norms on them. For many
purposes it is convenient to introduce norms which are 'induced' by given vector norms on
C^m and C^n (and R^m and R^n, respectively). We will now explore the concept of an 'induced
matrix norm'.
In this section, we write ‖·‖(m) to indicate that ‖·‖(m) is a norm on an m-dimensional vector
space (usually Cm or Rm).
Let A ∈ C^{m×n}. Since the unit sphere S = {x ∈ C^n : ‖x‖_{(n)} = 1} in the finite dimensional
space C^n with norm ‖ · ‖_{(n)} is closed and bounded, and hence compact, the real-valued
function ‖Ax‖_{(m)} assumes its supremum on the unit sphere S. Thus

\[
\sup_{x \in \mathbb{C}^n,\, \|x\|_{(n)} = 1} \|A x\|_{(m)} = \|A x'\|_{(m)} =: C
\quad \text{for some } x' \in \mathbb{C}^n \text{ with } \|x'\|_{(n)} = 1, \tag{2.27}
\]

and, in particular, the supremum has a finite value C. We also see from (2.27) that

\[
\|A x\|_{(m)} \le C \text{ for all } x \in \mathbb{C}^n \text{ with } \|x\|_{(n)} = 1
\quad \Leftrightarrow \quad
\|A y\|_{(m)} \le C\,\|y\|_{(n)} \text{ for all } y \in \mathbb{C}^n,
\]

where the equivalence follows by setting x = y/‖y‖_{(n)} for y ≠ 0. This motivates the definition
of the induced matrix norm.
Definition 2.33 (induced matrix norm)
Let ‖ · ‖_{(m)} and ‖ · ‖_{(n)} be norms on C^m (or R^m) and C^n (or R^n), respectively, and let
A ∈ C^{m×n} (or R^{m×n}). The induced matrix norm of A is defined by

\[
\|A\|_{(m,n)} := \sup_{x \in \mathbb{C}^n,\, x \neq 0} \frac{\|A x\|_{(m)}}{\|x\|_{(n)}}
= \sup_{x \in \mathbb{C}^n,\, \|x\|_{(n)} = 1} \|A x\|_{(m)}.
\]

Obviously, for any x ∈ C^n (or R^n) with x ≠ 0,

\[
\frac{\|A x\|_{(m)}}{\|x\|_{(n)}} \le \|A\|_{(m,n)},
\]

and hence,

\[
\|A x\|_{(m)} \le \|A\|_{(m,n)} \, \|x\|_{(n)} \quad \text{for all } x \in \mathbb{C}^n \text{ (or } x \in \mathbb{R}^n\text{)}. \tag{2.28}
\]
Exercise 37 Let ‖ · ‖_{(m)} and ‖ · ‖_{(n)} be norms for C^m and C^n, respectively, and let
A ∈ C^{m×n}. Show that the induced matrix norm ‖ · ‖_{(m,n)} satisfies the properties of a norm.
Definition 2.34 (compatible matrix norm)
Let ‖ · ‖_{(m)} and ‖ · ‖_{(n)} be norms on C^m (or R^m) and C^n (or R^n), respectively, and let
‖ · ‖_{(m,n)} denote the induced matrix norm on C^{m×n} (or R^{m×n}). Let ‖| · ‖| denote another
matrix norm on C^{m×n} (or R^{m×n}). We say that the matrix norm ‖| · ‖| is compatible with
the induced matrix norm ‖ · ‖_{(m,n)} if

\[
\|A x\|_{(m)} \le \||A\|| \, \|x\|_{(n)} \quad \text{for all } x \in \mathbb{C}^n \text{ (or } x \in \mathbb{R}^n\text{)}.
\]

We observe that by the definition of the induced matrix norm ‖ · ‖_{(m,n)}, any compatible
matrix norm ‖| · ‖| clearly satisfies

\[
\|A\|_{(m,n)} \le \||A\|| \quad \text{for all } A \in \mathbb{C}^{m\times n} \text{ (or } A \in \mathbb{R}^{m\times n}\text{)}, \tag{2.29}
\]

and, in fact, (2.29) characterizes a compatible matrix norm.
The next two definitions introduce important matrix norms.
Definition 2.35 (Frobenius norm)
The Frobenius norm for an m × n matrix A = (a_{i,j}) (in C^{m×n} or R^{m×n}) is given by

\[
\|A\|_F = \Big( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{i,j}|^2 \Big)^{1/2}. \tag{2.30}
\]
Definition 2.36 (induced matrix p-norms)
Let A ∈ C^{m×n} (or A ∈ R^{m×n}). For 1 ≤ p ≤ ∞, equip C^m (or R^m) and C^n (or R^n) with
the corresponding p-norm, where

\[
\|x\|_p := \Big( \sum_{j=1}^{d} |x_j|^p \Big)^{1/p}, \quad x \in \mathbb{C}^d \text{ (or } x \in \mathbb{R}^d\text{)}, \quad 1 \le p < \infty,
\]

and

\[
\|x\|_\infty := \max_{1 \le j \le d} |x_j|, \quad x \in \mathbb{C}^d \text{ (or } x \in \mathbb{R}^d\text{)}.
\]

Then the induced matrix p-norms for A ∈ C^{m×n} (or A ∈ R^{m×n}) are given by

\[
\|A\|_p = \sup_{x \in \mathbb{C}^n,\, x \neq 0} \frac{\|A x\|_p}{\|x\|_p}
= \sup_{x \in \mathbb{C}^n,\, \|x\|_p = 1} \|A x\|_p.
\]
The next Theorem gives more explicit formulas for some induced matrix p-norms.
Theorem 2.37 (important induced matrix p-norms)
Let A = (a_{i,j}) be an m × n matrix in C^{m×n} (or R^{m×n}), and let σ_1, σ_2, . . . , σ_n be the
non-negative eigenvalues of the Hermitian matrix A^* A (or of the symmetric matrix A^T A,
respectively). Then the induced matrix p-norms, for p ∈ {1, 2, ∞}, are given by

\[
\|A\|_1 = \max_{1 \le j \le n} \Big( \sum_{i=1}^{m} |a_{i,j}| \Big) = \max_{1 \le j \le n} \|a_j\|_1, \tag{2.31}
\]

\[
\|A\|_2 = \sqrt{\max_{1 \le j \le n} \sigma_j}, \tag{2.32}
\]

\[
\|A\|_\infty = \max_{1 \le i \le m} \Big( \sum_{j=1}^{n} |a_{i,j}| \Big), \tag{2.33}
\]

where the vector a_j denotes the jth column vector of A.
In Theorem 2.37, note that since A^* A (or A^T A) is Hermitian (or symmetric, respectively),
due to Theorem 2.10 (and Theorem 2.11, respectively), the matrix A^* A (and A^T A) has only
real eigenvalues. Let σ be an eigenvalue of A^* A (or A^T A), and let x ≠ 0 be a corresponding
eigenvector. Then for A ∈ C^{m×n},

\[
A^* A\,x = \sigma\,x \;\Rightarrow\; x^* A^* A\,x = \sigma\,x^* x \;\Rightarrow\; \|A x\|_2^2 = \sigma\,\|x\|_2^2
\;\Rightarrow\; \sigma = \frac{\|A x\|_2^2}{\|x\|_2^2} \ge 0,
\]

and the same computation with A^T A in place of A^* A applies for A ∈ R^{m×n}. Thus we see
that the eigenvalues of A^* A (and A^T A, respectively) are indeed non-negative.
Proof of Theorem 2.37. To derive the induced matrix 1-norm of a matrix A ∈ C^{m×n},
consider x ∈ C^n with ‖x‖_1 = \sum_{j=1}^{n} |x_j| = 1. Then for such x, with
a_j = (a_{1,j}, a_{2,j}, . . . , a_{m,j})^T,

\[
\|A x\|_1 = \Big\| \sum_{j=1}^{n} x_j a_j \Big\|_1
\le \sum_{j=1}^{n} \|x_j a_j\|_1 = \sum_{j=1}^{n} |x_j| \, \|a_j\|_1
\le \Big( \max_{1 \le j \le n} \|a_j\|_1 \Big) \sum_{k=1}^{n} |x_k|
= \Big( \max_{1 \le j \le n} \|a_j\|_1 \Big) \|x\|_1 = \max_{1 \le j \le n} \|a_j\|_1, \tag{2.34}
\]

where we have used ‖x‖_1 = 1 in the last step. Furthermore, we may choose x = e_k, where
j = k maximizes ‖a_j‖_1, that is, we have ‖a_k‖_1 = \max_{1 \le j \le n} \|a_j\|_1. Then

\[
\|A e_k\|_1 = \|a_k\|_1 = \max_{1 \le j \le n} \|a_j\|_1. \tag{2.35}
\]

From (2.35) we see that the upper bound in (2.34) is attained for x = e_k, and hence from
(2.34) and (2.35),

\[
\max_{1 \le j \le n} \|a_j\|_1 = \|a_k\|_1 = \|A e_k\|_1
\le \sup_{x \in \mathbb{C}^n,\, \|x\|_1 = 1} \|A x\|_1 \le \max_{1 \le j \le n} \|a_j\|_1,
\]

which (using the sandwich theorem) verifies (2.31).
In the case of the induced matrix 2-norm of a matrix A ∈ C^{m×n}, we have

\[
\|A\|_2 = \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \|A x\|_2
= \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \sqrt{(A x)^* (A x)}
= \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \sqrt{x^* A^* A\,x}. \tag{2.36}
\]

Since A^* A is Hermitian, there are n orthonormal eigenvectors z_1, z_2, . . . , z_n corresponding
to the real non-negative eigenvalues σ_1, σ_2, . . . , σ_n. Let x = α_1 z_1 + α_2 z_2 + · · · + α_n z_n,
so that x^* x = |α_1|^2 + |α_2|^2 + · · · + |α_n|^2. Then

\begin{align*}
x^* A^* A\,x &= \Big( \sum_{j=1}^{n} \alpha_j z_j \Big)^* A^* A \Big( \sum_{k=1}^{n} \alpha_k z_k \Big)
= \Big( \sum_{j=1}^{n} \alpha_j z_j \Big)^* \Big( \sum_{k=1}^{n} \alpha_k \sigma_k z_k \Big) \\
&= \sum_{j=1}^{n} \sum_{k=1}^{n} \bar{\alpha}_j \alpha_k \sigma_k\, z_j^* z_k
= \sigma_1 |\alpha_1|^2 + \sigma_2 |\alpha_2|^2 + \ldots + \sigma_n |\alpha_n|^2 \\
&\le \Big( \max_{1 \le j \le n} \sigma_j \Big) \big( |\alpha_1|^2 + |\alpha_2|^2 + \ldots + |\alpha_n|^2 \big)
= \Big( \max_{1 \le j \le n} \sigma_j \Big) x^* x
= \Big( \max_{1 \le j \le n} \sigma_j \Big) \|x\|_2^2, \tag{2.37}
\end{align*}

where we have used A^* A z_k = σ_k z_k, k = 1, 2, . . . , n, and the orthonormality z_j^* z_k = δ_{j,k}
of the eigenvectors z_1, z_2, . . . , z_n. Hence, from (2.36) and (2.37),

\[
\|A\|_2 = \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \sqrt{x^* A^* A\,x}
\le \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \sqrt{\max_{1 \le j \le n} \sigma_j}\; \|x\|_2
= \sqrt{\max_{1 \le j \le n} \sigma_j}. \tag{2.38}
\]

Finally, let k be such that σ_k = \max_{1 \le j \le n} \sigma_j. Then choosing x = z_k gives equality,
since

\[
\|A z_k\|_2^2 = z_k^* A^* A\,z_k = z_k^* (\sigma_k z_k) = \sigma_k \|z_k\|_2^2 = \sigma_k = \max_{1 \le j \le n} \sigma_j. \tag{2.39}
\]

Thus, from (2.39), (2.38), and ‖z_k‖_2 = 1,

\[
\sqrt{\max_{1 \le j \le n} \sigma_j} = \|A z_k\|_2
\le \sup_{x \in \mathbb{C}^n,\, \|x\|_2 = 1} \|A x\|_2 = \|A\|_2
\le \sqrt{\max_{1 \le j \le n} \sigma_j}, \tag{2.40}
\]

and we see from the sandwich theorem that ‖A‖_2 = \sqrt{\max_{1 \le j \le n} \sigma_j}.
Finally, for the induced matrix ∞-norm, we get, using |x_j| ≤ ‖x‖_∞ for all j = 1, 2, . . . , n,

\[
\|A\|_\infty = \sup_{x \in \mathbb{C}^n,\, \|x\|_\infty = 1} \|A x\|_\infty
= \sup_{x \in \mathbb{C}^n,\, \|x\|_\infty = 1} \max_{1 \le i \le m} \Big| \sum_{j=1}^{n} a_{i,j} x_j \Big|
\le \sup_{x \in \mathbb{C}^n,\, \|x\|_\infty = 1} \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{i,j}| \, |x_j|
\le \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{i,j}| = \sum_{j=1}^{n} |a_{k,j}|, \tag{2.41}
\]

for some k attaining the maximum. To show that this upper bound is attained we may choose
the vector x = (x_1, x_2, . . . , x_n)^T with components

\[
x_j = \frac{\overline{a_{k,j}}}{|a_{k,j}|}, \quad j = 1, 2, \ldots, n
\]

(with x_j := 1 if a_{k,j} = 0), which satisfies ‖x‖_∞ = 1 since |x_j| = 1 for all j = 1, 2, . . . , n.
Then

\[
\|A\|_\infty \ge \|A x\|_\infty
= \max_{1 \le i \le m} \Big| \sum_{j=1}^{n} a_{i,j}\,\frac{\overline{a_{k,j}}}{|a_{k,j}|} \Big|
\ge \Big| \sum_{j=1}^{n} a_{k,j}\,\frac{\overline{a_{k,j}}}{|a_{k,j}|} \Big|
= \sum_{j=1}^{n} \frac{|a_{k,j}|^2}{|a_{k,j}|} = \sum_{j=1}^{n} |a_{k,j}|. \tag{2.42}
\]

Since the lower bound in (2.42) and the upper bound in (2.41) coincide, we see from the
sandwich theorem that (2.33) holds true. □
Example 2.38 (matrix norms)
Consider the real 3 × 3 matrix A, defined by

\[
A = \begin{pmatrix} 1 & 0 & -4 \\ 2 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix}.
\]

Then

\[
\|A\|_1 = \max_{1 \le j \le 3} \Big( \sum_{i=1}^{3} |a_{i,j}| \Big) = \max\{3, 3, 6\} = 6,
\]

and

\[
\|A\|_\infty = \max_{1 \le i \le 3} \Big( \sum_{j=1}^{3} |a_{i,j}| \Big) = \max\{5, 4, 3\} = 5.
\]

We compute A^T A and obtain

\[
A^T A = \begin{pmatrix} 1 & 2 & 0 \\ 0 & 0 & 3 \\ -4 & 2 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 & -4 \\ 2 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix}
= \begin{pmatrix} 5 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 20 \end{pmatrix}.
\]

Since the matrix A^T A is diagonal, its eigenvalues are the elements on the diagonal. Hence
A^T A has the eigenvalues σ_1 = 20, σ_2 = 9, and σ_3 = 5. Hence we have

\[
\|A\|_2 = \sqrt{\max\{20, 9, 5\}} = \sqrt{20}. \; \Box
\]
Let ‖ · ‖_{(ℓ)}, ‖ · ‖_{(m)}, and ‖ · ‖_{(n)} be norms on C^ℓ, C^m, and C^n, respectively. Let A be an
ℓ × m matrix and B be an m × n matrix. For any x ∈ C^n we have

\[
\|A B\,x\|_{(\ell)} \le \|A\|_{(\ell,m)} \, \|B x\|_{(m)} \le \|A\|_{(\ell,m)} \, \|B\|_{(m,n)} \, \|x\|_{(n)}.
\]

Hence, the induced norm for matrices in C^{ℓ×n} satisfies

\[
\|A B\|_{(\ell,n)} \le \|A\|_{(\ell,m)} \, \|B\|_{(m,n)}. \tag{2.43}
\]

Note: In general (2.43) is not an equality.
Lemma 2.39 (norms of product of matrix with unitary matrix)
For any A ∈ C^{m×n} and any unitary matrix Q ∈ C^{m×m}, we have

\[
\|Q A\|_2 = \|A\|_2 \quad \text{and} \quad \|Q A\|_F = \|A\|_F. \tag{2.44}
\]
Proof of Lemma 2.39. Since Q is unitary (that is, Q^* Q = Q Q^* = I), for any x ∈ C^n,

\[
\|Q A\,x\|_2 = \big( (Q A x)^* (Q A x) \big)^{1/2}
= \big( x^* A^* Q^* Q A\,x \big)^{1/2}
= \big( x^* A^* A\,x \big)^{1/2}
= \big( (A x)^* (A x) \big)^{1/2} = \|A x\|_2. \tag{2.45}
\]

From (2.45), the result ‖Q A‖_2 = ‖A‖_2 follows directly from the definition of an induced
matrix norm.

The second result follows from the fact that ‖B‖_F^2 = trace(B^* B) for any matrix B ∈ C^{m×n}
(see Exercise 39 below). Hence, since Q is unitary,

\[
\|Q A\|_F^2 = \mathrm{trace}\big( (Q A)^* (Q A) \big) = \mathrm{trace}\big( A^* Q^* Q\,A \big)
= \mathrm{trace}(A^* A) = \|A\|_F^2,
\]

which proves ‖Q A‖_F = ‖A‖_F. □
Exercise 38 For the matrix A from Exercises 19, 22, and 23,

\[
A = \begin{pmatrix} \tfrac{3}{2} & 0 & \tfrac{1}{2} \\ 0 & 3 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{3}{2} \end{pmatrix},
\]
compute the induced matrix p-norms for p ∈ {1, 2,∞} and the Frobenius norm.
Exercise 39 Let B = (b_{i,j}) be in C^{m×n}. Show that ‖B‖_F = \sqrt{\mathrm{trace}(B^* B)}. Conclude
that for B in R^{m×n} we have ‖B‖_F = \sqrt{\mathrm{trace}(B^T B)}.
Exercise 40 Consider the matrix A := u v^*, where u ∈ C^m and v ∈ C^n. Show that the
induced matrix 2-norm satisfies ‖A‖_2 = ‖u‖_2 ‖v‖_2.
Exercise 41 Define the matrix norm ‖ · ‖ : R^{n×n} → R by

\[
\|A\| := 7 \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{i,j}|, \quad A = (a_{i,j}) \in \mathbb{R}^{n\times n}.
\]

(a) Show that ‖ · ‖ defines a matrix norm for R^{n×n} by verifying the norm properties.

(b) Show that there exists no vector norm ‖| · ‖| for R^n that induces the matrix norm ‖ · ‖,
that is, show that there exists no vector norm ‖| · ‖| : R^n → R such that

\[
\|A\| = \sup_{x \in \mathbb{R}^n,\, \||x\|| = 1} \||A x\|| \quad \text{for all } A \in \mathbb{R}^{n\times n}.
\]
2.6 Spectral Radius of a Matrix
In this section we consider only square matrices in C^{n×n}. We introduce the spectral radius,
which gives a lower bound for any induced and any compatible matrix norm for C^{n×n}.
Definition 2.40 (spectral radius)
Let A ∈ C^{n×n} with eigenvalues λ_1, λ_2, . . . , λ_n ∈ C. The spectral radius of A is defined by

\[
\rho(A) := \max_{1 \le j \le n} |\lambda_j|.
\]
Note that, if λ is an eigenvalue of A with eigenvector x, then λ^r is an eigenvalue of A^r,
r = 1, 2, 3, . . . , with eigenvector x. (Indeed, A^r x = A^{r-1} (A x) = λ A^{r-1} x = . . . = λ^r x.)
Hence, since g(x) = x^r, r ∈ N, is strictly monotonically increasing for x ≥ 0,

\[
\rho(A^r) = \big[ \rho(A) \big]^r. \tag{2.46}
\]
Example 2.41 (spectral radius)
For the matrix A from Examples 2.4 and 2.8,

\[
A = \begin{pmatrix} 0 & -2 & 2 \\ -2 & -3 & 2 \\ -3 & -6 & 5 \end{pmatrix},
\]

we found the spectrum Λ(A) = {−1, 1, 2}. Thus the spectral radius of this matrix is

\[
\rho(A) = \max\{ |-1|, |1|, |2| \} = 2. \; \Box
\]
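Numerically, the spectral radius is just the largest modulus of an eigenvalue; a one-line
MATLAB check for the matrix of Example 2.41:

A = [0 -2 2; -2 -3 2; -3 -6 5];
rho = max(abs(eig(A)))   % spectral radius, returns 2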
Exercise 42 For the matrix A from Exercises 19, 22, 23, and 38,

\[
A = \begin{pmatrix} \tfrac{3}{2} & 0 & \tfrac{1}{2} \\ 0 & 3 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{3}{2} \end{pmatrix},
\]
determine the spectral radius.
Exercise 43 Calculate by hand the induced matrix p-norms ‖A‖_1, ‖A‖_2, ‖A‖_∞, and the
spectral radius for the matrices

\[
B := \begin{pmatrix} 0 & 0 \\ 1 & -2 \end{pmatrix}
\quad \text{and} \quad
A = \begin{pmatrix}
\tfrac{5}{4} & -\tfrac{1}{2\sqrt{2}} & -\tfrac{1}{4} \\
-\tfrac{1}{2\sqrt{2}} & \tfrac{3}{2} & \tfrac{1}{2\sqrt{2}} \\
-\tfrac{1}{4} & \tfrac{1}{2\sqrt{2}} & \tfrac{5}{4}
\end{pmatrix}.
\]
Exercise 44 Calculate by hand the induced matrix p-norms ‖A‖_1, ‖A‖_2, ‖A‖_∞, the
Frobenius norm, and the spectral radius for the matrix

\[
A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & 1 \\ 0 & 1 & 3 \end{pmatrix}.
\]
Exercise 45 For A ∈ R^{m×n} show that ‖A‖_2 ≤ \sqrt{\|A\|_\infty \, \|A\|_1}.
The next lemma expresses the induced matrix 2-norm in terms of the spectral radius.
Lemma 2.42 (relation between induced matrix 2-norm and spectral radius)
Let A ∈ C^{m×n} (or A ∈ R^{m×n}). Then the induced matrix 2-norm is given by

\[
\|A\|_2 = \sqrt{\rho(A^* A)} \quad \text{(or } \|A\|_2 = \sqrt{\rho(A^T A)} \text{, respectively)}.
\]

If A ∈ C^{n×n} (or A ∈ R^{n×n}) is Hermitian (or symmetric, respectively), then the induced
matrix 2-norm is given by ‖A‖_2 = ρ(A).
Proof of Lemma 2.42. We give the proof only for the case of complex matrices. The first
statement follows directly from Theorem 2.37 and the definition of the spectral radius. Indeed,
from Theorem 2.37, for A ∈ C^{m×n},

\[
\|A\|_2 = \sqrt{\max \{ |\sigma| : \sigma \text{ is an eigenvalue of } A^* A \}} = \sqrt{\rho(A^* A)}.
\]

If A ∈ C^{n×n} is Hermitian, then A^* = A and the eigenvalues of A are real (see Theorem 2.10).
Hence A^* A = A A = A^2. Let λ be an eigenvalue of A, and let x be a corresponding
eigenvector. Then

\[
A^* A\,x = A^2 x = A\,(\lambda x) = \lambda^2 x,
\]

that is, if λ is an eigenvalue of A, then σ = λ^2 = |λ|^2 is an eigenvalue of the square matrix
A^* A. Thus, from Theorem 2.37,

\[
\|A\|_2 = \sqrt{\max \{ \sigma : \sigma \text{ is an eigenvalue of } A^* A \}}
= \sqrt{\max \{ |\lambda|^2 : \lambda \text{ is an eigenvalue of } A \}}
= \max \{ |\lambda| : \lambda \text{ is an eigenvalue of } A \} = \rho(A),
\]

which proves the second statement. □
The next theorem establishes a connection between any induced (or compatible) matrix norm
and the spectral radius. This theorem will be very useful for establishing various theoretical
results.
Theorem 2.43 (relation between induced norms and the spectral radius)
(i) Let ‖ · ‖_{(n)} be a norm on C^n, and let ‖ · ‖ be any matrix norm compatible with the
induced matrix norm ‖ · ‖_{(n,n)}. Then

\[
\rho(A) \le \|A\| \quad \text{for all } A \in \mathbb{C}^{n\times n}.
\]

(ii) For any A ∈ C^{n×n} and any ε > 0 there exists a norm ‖ · ‖_{(n)} on C^n (depending on A
and ε) such that the corresponding induced matrix norm ‖ · ‖_{(n,n)} satisfies

\[
\rho(A) \le \|A\|_{(n,n)} \le \rho(A) + \epsilon.
\]
It is clear that an analogous statement holds for norms on Rn and real square matrices in Rn×n.
Proof of Theorem 2.43. Let ‖ · ‖_{(n)} be an arbitrary norm on C^n, and let ‖ · ‖_{(n,n)} denote
the induced matrix norm. Then we know that any compatible matrix norm ‖ · ‖ satisfies
‖A‖_{(n,n)} ≤ ‖A‖ for all A ∈ C^{n×n} (see formula (2.29)). Hence it is enough to prove

\[
\rho(A) \le \|A\|_{(n,n)} \quad \text{for all } A \in \mathbb{C}^{n\times n}.
\]

Let λ_1, λ_2, . . . , λ_n ∈ C be the eigenvalues of A, where we may assume that

\[
\rho(A) = \max_{1 \le i \le n} |\lambda_i| = |\lambda_1|,
\]

but do not assume any ordering among λ_2, . . . , λ_n. Let x be an eigenvector corresponding to
the eigenvalue λ_1. Then A x = λ_1 x, and, using (2.28),

\[
|\lambda_1| \, \|x\|_{(n)} = \|\lambda_1 x\|_{(n)} = \|A x\|_{(n)} \le \|A\|_{(n,n)} \, \|x\|_{(n)}
\quad \Rightarrow \quad |\lambda_1| \le \|A\|_{(n,n)}.
\]

Hence, ρ(A) = |λ_1| ≤ ‖A‖_{(n,n)}, which verifies (i).

From the Schur factorization (see Theorem 2.21) there exists a unitary matrix W such that
W^* A W is upper triangular, with the eigenvalues λ_1, λ_2, . . . , λ_n of A as the diagonal
elements. Here we may again assume that |λ_1| = ρ(A). Inspection of the proof of Theorem 2.21
shows that we can choose W such that λ_1 is the (1,1)th entry of W^* A W. Since we do not
assume any ordering among λ_2, . . . , λ_n, we may assume (after renumbering λ_2, . . . , λ_n as
required) that λ_j is the (j,j)th entry of W^* A W. Thus

\[
W^* A\,W = \begin{pmatrix}
\lambda_1 & r_{1,2} & \cdots & r_{1,n} \\
0 & \lambda_2 & \cdots & r_{2,n} \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & \lambda_n
\end{pmatrix}. \tag{2.47}
\]
Let D be the diagonal matrix D = diag(1, δ, δ^2, . . . , δ^{n-1}), where δ ≤ 2^{-1} min{1, ε/r}
and r = \max_{1 \le i < j \le n} |r_{i,j}|. Let S = W D; then from (2.47) and
D^{-1} = diag(1, δ^{-1}, δ^{-2}, . . . , δ^{-(n-1)}),

\[
S^{-1} A\,S = \begin{pmatrix}
\lambda_1 & \delta r_{1,2} & \cdots & \delta^{n-1} r_{1,n} \\
0 & \lambda_2 & \cdots & \delta^{n-2} r_{2,n} \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & \lambda_n
\end{pmatrix},
\]

that is, (S^{-1} A S)_{i,i+k} = δ^k r_{i,i+k} for i = 1, 2, . . . , n−1 and k = 1, 2, . . . , n−i.
Furthermore, using Theorem 2.37 and ρ(A) = |λ_1|,

\begin{align*}
\|S^{-1} A\,S\|_\infty &= \max_{1 \le i \le n} \sum_{j=1}^{n} |(S^{-1} A S)_{i,j}|
= \max_{1 \le i \le n} \big( |\lambda_i| + |\delta\,r_{i,i+1}| + |\delta^2 r_{i,i+2}| + \ldots + |\delta^{n-i} r_{i,n}| \big) \\
&= \max_{1 \le i \le n} \Big( |\lambda_i| + \delta \big( |r_{i,i+1}| + \delta\,|r_{i,i+2}| + \ldots + \delta^{n-i-1} |r_{i,n}| \big) \Big) \\
&\le \max_{1 \le i \le n} \Big( |\lambda_i| + \delta\,r \big( 1 + \delta + \ldots + \delta^{n-i-1} \big) \Big)
= |\lambda_1| + \delta\,r \big( 1 + \delta + \cdots + \delta^{n-2} \big),
\end{align*}

where the second term is a geometric progression. Hence, using δ ≤ 1/2 and δ ≤ ε/(2r) by
the definition of δ,

\[
\|S^{-1} A\,S\|_\infty \le \rho(A) + \delta\,r\,\frac{1 - \delta^{n-1}}{1 - \delta}
\le \rho(A) + \delta\,r\,\frac{1}{1 - \delta}
\le \rho(A) + \frac{\epsilon}{2r}\,r \cdot 2 = \rho(A) + \epsilon. \tag{2.48}
\]

Since ‖S^{-1} A S‖_∞ is the matrix norm induced by the vector norm ‖x‖_{S^{-1},∞} := ‖S^{-1} x‖_∞,
x ∈ C^n (see Exercise 46 below), from (i) and (2.48),

\[
\rho(A) \le \|S^{-1} A\,S\|_\infty \le \rho(A) + \epsilon,
\]

and the proof is complete. □
Exercise 46 Let S ∈ C^{n×n} be an invertible matrix. Show that ‖x‖ := ‖S^{-1} x‖_∞ defines a
norm for C^n. Show that this norm induces the matrix norm

\[
\|A\| := \|S^{-1} A\,S\|_\infty, \quad A \in \mathbb{C}^{n\times n}.
\]

The next result (see Theorem 2.45 below) considers a particular type of series of matrices, the
so-called Neumann series, and conditions for its convergence. The Neumann series can be seen
as a generalization of the geometric series. Theorem 2.45 below also gives a condition on A
under which the matrix I − A is non-singular (or invertible).
Definition 2.44 (convergence of sequences of matrices w.r.t. matrix norm)
Let {X_r} be a sequence of m × n matrices in C^{m×n} or R^{m×n}. We say that the sequence
{X_r} converges with respect to a given matrix norm ‖ · ‖, with limit X ∈ C^{m×n} or
X ∈ R^{m×n}, respectively, if for every ε > 0 there is an N = N(ε) such that

\[
\|X - X_r\| < \epsilon \quad \text{for all } r \ge N.
\]

Equivalently, {X_r} in C^{m×n} (or in R^{m×n}) converges with respect to ‖ · ‖ to X ∈ C^{m×n}
(or X ∈ R^{m×n}) if \lim_{r \to \infty} \|X_r - X\| = 0.
In ‘Further Analysis’, we have learned that convergence can be defined for any normed linear
space; the definition above is just one special case, where the linear space is the set of real or
complex m×n matrices and where the norm is the given matrix norm. From the general notion
of convergence in a normed linear space it is known that the limit of a convergent sequence is
unique. However, we can also easily verify this for the concrete case in Definition 2.44 above.
Assume the sequence {X_r} has two limits X and Y . Then, given ε > 0, there exist N = N(ε)
and M = M(ε) such that

\[
\|X - X_r\| < \frac{\epsilon}{2} \text{ for all } r \ge N,
\quad \text{and} \quad
\|Y - X_r\| < \frac{\epsilon}{2} \text{ for all } r \ge M.
\]

Choose R := max{N, M}; then from the triangle inequality, for r ≥ R,

\[
\|X - Y\| = \|(X - X_r) - (Y - X_r)\| \le \|X - X_r\| + \|Y - X_r\| < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon. \tag{2.49}
\]

Since ε > 0 was arbitrary, we see from (2.49) that X = Y and the limit is unique.

We have proved in Theorem 2.30 that all norms on C^k are equivalent. Letting k = m · n, we
see that all matrix norms on C^{m×n} are equivalent. Thus the notion of convergence
and the limit do not depend on the choice of the matrix norm, since equivalent matrix
norms induce the same notion of convergence.
Theorem 2.45 (Neumann series)
Let A ∈ C^{n×n} (or A ∈ R^{n×n}). Let {X_r} be defined by

\[
X_r := I + A + \cdots + A^r = \sum_{k=0}^{r} A^k, \quad r = 0, 1, 2, \ldots.
\]

Then {X_r} converges if and only if ρ(A) < 1. If ρ(A) < 1, then I − A is non-singular and
the limit of {X_r} is

\[
X = \sum_{k=0}^{\infty} A^k = (I - A)^{-1}. \tag{2.50}
\]

The series in (2.50) is called the Neumann series of the matrix A.
Proof of Theorem 2.45. First we show that I − A is non-singular if ρ(A) < 1.

If λ_1, λ_2, . . . , λ_n are the eigenvalues of A, then 1 − λ_1, 1 − λ_2, . . . , 1 − λ_n are the
eigenvalues of I − A. A matrix is non-singular if and only if its determinant is different from
zero, and

\[
\det(I - A) = (1 - \lambda_1)(1 - \lambda_2) \cdots (1 - \lambda_n). \tag{2.51}
\]

If ρ(A) = \max_{1 \le j \le n} |\lambda_j| < 1, then |λ_j| < 1 for all j = 1, 2, . . . , n, and (2.51)
implies det(I − A) ≠ 0. Thus I − A is non-singular if ρ(A) < 1.

Next we show that ρ(A) < 1 implies the convergence of the sequence {X_r} to (I − A)^{-1}.
Assume ρ(A) < 1, and choose ε := (1 − ρ(A))/2. From Theorem 2.43, given ε > 0, there exists
some norm for C^n such that for the corresponding induced matrix norm ‖ · ‖ we have

\[
\|A\| \le \rho(A) + \epsilon = \rho(A) + \frac{1 - \rho(A)}{2} = \frac{1 + \rho(A)}{2} =: \alpha < 1
\quad \text{(since } \rho(A) < 1\text{)}.
\]

Because ρ(A) < 1, the matrix I − A is invertible, and because ‖A‖ ≤ α < 1,

\[
\big\| (I - A)^{-1} - X_r \big\|
= \big\| (I - A)^{-1} \big( I - (I - A) X_r \big) \big\|
= \big\| (I - A)^{-1} A^{r+1} \big\|
\le \big\| (I - A)^{-1} \big\| \, \|A\|^{r+1}
\le \big\| (I - A)^{-1} \big\| \, \alpha^{r+1}. \tag{2.52}
\]

In (2.52), we have used in the second step the definition of X_r and

\[
I - (I - A) X_r = I - X_r + A X_r
= I - \sum_{k=0}^{r} A^k + \sum_{k=0}^{r} A^{k+1}
= I - I - \sum_{k=1}^{r} A^k + \sum_{\ell=1}^{r} A^\ell + A^{r+1}
= A^{r+1}.
\]

Since 0 < α < 1, the left-hand side in (2.52) tends to zero for r → ∞, and so {X_r} converges
to the limit (I − A)^{-1}.

Finally, we prove that if {X_r} converges then ρ(A) < 1, by showing the equivalent negated
statement: if ρ(A) ≥ 1, then {X_r} diverges.

Let ρ(A) ≥ 1, and let ‖ · ‖ be some induced matrix norm on C^{n×n}. Choose ε = 1: then for
every N ∈ N,

\[
\|X_{N+1} - X_N\| = \Big\| \sum_{k=0}^{N+1} A^k - \sum_{k=0}^{N} A^k \Big\|
= \big\| A^{N+1} \big\| \ge \rho(A^{N+1}) = \big[ \rho(A) \big]^{N+1} \ge 1 = \epsilon, \tag{2.53}
\]

where we have used (2.46). From (2.53), we see that {X_r} is not a Cauchy sequence, and
hence {X_r} diverges. □
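The convergence statement can be observed numerically; the following MATLAB sketch
(with a small sample matrix of our own choosing satisfying ρ(A) < 1) sums the partial sums
X_r and compares with (I − A)^{-1}:

A = [0.5 0.2; 0.1 0.3];          % sample matrix with rho(A) < 1
X = eye(2); P = eye(2);
for r = 1:50
    P = P*A;                     % P = A^r
    X = X + P;                   % X = I + A + ... + A^r
end
norm(X - inv(eye(2)-A), inf)     % ~0: the Neumann series converges to (I-A)^{-1}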
A useful consequence of Theorem 2.45 is the following estimate for the norm of (I ± A)^{-1},
where ‖A‖ < 1 in some induced (or compatible) matrix norm ‖ · ‖.

Corollary 2.46 (estimate for ‖(I ± A)^{-1}‖ if ‖A‖ < 1)
Let A be an n × n matrix in C^{n×n} or R^{n×n}, and assume that in some induced (or
compatible) matrix norm ‖ · ‖ we have ‖A‖ < 1. Then I ± A is invertible and

\[
\|(I \pm A)^{-1}\| \le \big( 1 - \|A\| \big)^{-1}. \tag{2.54}
\]
Proof of Corollary 2.46. If for some induced matrix norm ‖A‖ = ‖±A‖ < 1, then, from
Theorem 2.43, ρ(±A) ≤ ‖±A‖ < 1. Thus I ± A = I − (∓A) is invertible, and (I ± A)^{-1} is
given by the Neumann series

\[
(I \pm A)^{-1} = \big( I - (\mp A) \big)^{-1} = \sum_{k=0}^{\infty} (\mp A)^k = \sum_{k=0}^{\infty} (\mp 1)^k A^k.
\]

Thus, from the triangle inequality,

\[
\big\| (I \pm A)^{-1} \big\| = \Big\| \sum_{k=0}^{\infty} (\mp A)^k \Big\|
\le 1 + \|{\mp}A\| + \|{\mp}A\|^2 + \ldots
= 1 + \|A\| + \|A\|^2 + \ldots = \sum_{k=0}^{\infty} \|A\|^k
= (1 - \|A\|)^{-1},
\]

where the last step follows from the geometric series, since ‖A‖ < 1. □
Exercise 47 Show that the following series of matrices converges and compute its limit:

\[
\sum_{k=0}^{\infty} A^k, \quad \text{where } A = \begin{pmatrix} \tfrac{1}{2} & -2 & -1 \\ 0 & \tfrac{1}{3} & 0 \\ 0 & 0 & -\tfrac{1}{2} \end{pmatrix}.
\]
Exercise 48 Show that the following series of matrices converges and compute its limit:

\[
\sum_{k=0}^{\infty} A^k, \quad \text{where } A = \begin{pmatrix} \tfrac{1}{4} & 0 & 0 \\ 1 & -\tfrac{1}{2} & 0 \\ -1 & -2 & \tfrac{1}{3} \end{pmatrix}.
\]
Exercise 49 Let A ∈ Cn×n, and assume that the n complex eigenvalues λ1, λ2, . . . , λn of A
satisfy |λj| > 1 for all j = 1, 2, . . . , n. Is the matrix I − A invertible? Give a proof of your
answer.
Exercise 50 Let A ∈ C^{n×n} satisfy ρ(A) < 1, and let S ∈ C^{n×n} be a unitary matrix (that
is, S^* = \bar{S}^T = S^{-1}). Show that I − S^* A S is invertible and find its inverse in two
different ways.
Chapter 3
Floating Point Arithmetic and Stability
In this chapter we learn about the condition numbers of matrices, floating point arithmetic,
and the stability of numerical algorithms.
3.1 Condition Numbers of Matrices
Suppose A is a square matrix, and we want to solve the linear system Ax = b. Unfortunately,
our input data A and b have been perturbed such that we are actually given A + ∆A and
b+∆b, and we do not know A and b. We can try to model the error in the output by assuming
that x + ∆x solves
(A + ∆A) (x + ∆x) = b + ∆b.
Multiplying this out and using A x = b yields

\[
(A + \Delta A)\,\Delta x = \Delta b - (\Delta A)\,x.
\]

For the time being, let us assume that A + ∆A is invertible, which is true if A is invertible
and ∆A is only a small perturbation. Then,

\[
\Delta x = (A + \Delta A)^{-1} \big( \Delta b - (\Delta A)\,x \big),
\]

that is,

\[
\|\Delta x\| \le \|(A + \Delta A)^{-1}\| \big( \|\Delta b\| + \|\Delta A\| \, \|x\| \big), \tag{3.1}
\]

which gives a first estimate on the absolute error ‖∆x‖.
However, more relevant is the relative error ‖∆x‖/‖x‖, and, from (3.1), we find

\begin{align*}
\frac{\|\Delta x\|}{\|x\|}
&\le \|(A + \Delta A)^{-1}\| \, \|A\| \left( \frac{\|\Delta b\|}{\|A\| \, \|x\|} + \frac{\|\Delta A\|}{\|A\|} \right) \\
&\le \|(A + \Delta A)^{-1}\| \, \|A\| \left( \frac{\|\Delta b\|}{\|b\|} + \frac{\|\Delta A\|}{\|A\|} \right), \tag{3.2}
\end{align*}

where the last line follows from ‖b‖ = ‖A x‖ ≤ ‖A‖ ‖x‖. The last estimate in (3.2) shows
that the relative input error, magnified by the factor

\[
\|(A + \Delta A)^{-1}\| \, \|A\|,
\]

gives an upper bound for the relative output error. This demonstrates the problem of
conditioning. The latter expression can be modified further to incorporate the so-called
condition number of a matrix.
Definition 3.1 (condition number)
The condition number of an invertible square matrix A with respect to an induced or
compatible matrix norm ‖ · ‖ is defined as
κ(A) := ‖A‖ ‖A−1‖.
The next theorem estimates the relative error in terms of the condition number.
Theorem 3.2 (estimate of the relative error)
Let ‖ · ‖ be an induced or compatible matrix norm. Suppose that A is an invertible square
matrix, let x be the solution to Ax = b, and assume that the perturbation matrix ∆A
satisfies ‖A−1‖ ‖∆A‖ < 1. Then, there exists a unique vector ∆x such that
(A + ∆A) (x + ∆x) = b + ∆b,
and for b 6= 0, we have
‖∆x‖‖x‖ ≤ κ(A)
(1 − κ(A)
‖∆A‖‖A‖
)−1(‖∆b‖‖b‖ +
‖∆A‖‖A‖
). (3.3)
Proof of Theorem 3.2. Essentially we need to estimate the factor ‖(A + ∆A)−1‖ ‖A‖ in
(3.2). However, we first need to verify that A + ∆A is non-singular, and hence ∆x exists and
is uniquely determined by (A + ∆A) (x + ∆x) = b + ∆b.
We have

\[
A + \Delta A = A \big( I + A^{-1} \Delta A \big), \tag{3.4}
\]

and, by the assumption ‖A^{-1}‖ ‖∆A‖ < 1,

\[
\big\| -A^{-1} \Delta A \big\| = \|A^{-1} \Delta A\| \le \|A^{-1}\| \, \|\Delta A\| < 1. \tag{3.5}
\]

Thus, by Theorem 2.45 we know that I + A^{-1} ∆A = I − (−A^{-1} ∆A) is invertible. Since
I + A^{-1} ∆A and A are invertible, we see from (3.4) that A + ∆A is invertible. Furthermore,
from (3.5) and (2.54) in Corollary 2.46,

\[
\big\| (I + A^{-1} \Delta A)^{-1} \big\| \le \frac{1}{1 - \|A^{-1} \Delta A\|} \le \frac{1}{1 - \|A^{-1}\| \, \|\Delta A\|}. \tag{3.6}
\]

Since, from (3.4),

\[
(A + \Delta A)^{-1} = \big( I + A^{-1} \Delta A \big)^{-1} A^{-1},
\]

the estimate (3.6) implies

\[
\big\| (A + \Delta A)^{-1} \big\| \le \big\| (I + A^{-1} \Delta A)^{-1} \big\| \, \|A^{-1}\|
\le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \, \|\Delta A\|}. \tag{3.7}
\]

Using (3.7) to estimate ‖(A + ∆A)^{-1}‖ ‖A‖ yields

\[
\|(A + \Delta A)^{-1}\| \, \|A\| \le \frac{\|A^{-1}\| \, \|A\|}{1 - \|A^{-1}\| \, \|\Delta A\|}
= \left( 1 - \|A^{-1}\| \, \|A\| \, \frac{\|\Delta A\|}{\|A\|} \right)^{-1} \kappa(A)
= \left( 1 - \kappa(A)\,\frac{\|\Delta A\|}{\|A\|} \right)^{-1} \kappa(A).
\]

Substitution of this estimate into (3.2) yields the desired estimate of the relative error. □
Example 3.3 (condition numbers)
For the non-singular matrix A, given by

\[
A = \begin{pmatrix} 2 & 0 & -1 \\ 1 & 0 & 2 \\ 0 & 2 & 0 \end{pmatrix},
\]

compute the condition numbers with respect to the induced p-norms, for p ∈ {1, 2, ∞}.

Solution: First we need to find the inverse matrix A^{-1}. Transforming the augmented system
(A|I) with elementary row operations to the form (I|A^{-1}), we find

\[
A^{-1} = \begin{pmatrix} \tfrac{2}{5} & \tfrac{1}{5} & 0 \\ 0 & 0 & \tfrac{1}{2} \\ -\tfrac{1}{5} & \tfrac{2}{5} & 0 \end{pmatrix}.
\]

Using Theorem 2.37, we find

\[
\|A\|_1 = 3 \text{ and } \|A^{-1}\|_1 = \tfrac{3}{5}
\quad \Rightarrow \quad \kappa_1(A) = \|A\|_1 \, \|A^{-1}\|_1 = \frac{9}{5},
\]

and

\[
\|A\|_\infty = 3 \text{ and } \|A^{-1}\|_\infty = \tfrac{3}{5}
\quad \Rightarrow \quad \kappa_\infty(A) = \|A\|_\infty \, \|A^{-1}\|_\infty = \frac{9}{5}.
\]

For the condition number with respect to the induced matrix 2-norm we need to find the
eigenvalues of A^T A and (A^{-1})^T A^{-1}. We have

\[
A^T A = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 0 & 2 \\ -1 & 2 & 0 \end{pmatrix}
\begin{pmatrix} 2 & 0 & -1 \\ 1 & 0 & 2 \\ 0 & 2 & 0 \end{pmatrix}
= \begin{pmatrix} 5 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 5 \end{pmatrix},
\]

and thus A^T A has the eigenvalues λ_1 = 5 and λ_2 = 4. Hence, using Theorem 2.37,

\[
\|A\|_2 = \sqrt{\rho(A^T A)} = \sqrt{\max\{4, 5\}} = \sqrt{5}.
\]

Likewise,

\[
(A^{-1})^T A^{-1} =
\begin{pmatrix} \tfrac{2}{5} & 0 & -\tfrac{1}{5} \\ \tfrac{1}{5} & 0 & \tfrac{2}{5} \\ 0 & \tfrac{1}{2} & 0 \end{pmatrix}
\begin{pmatrix} \tfrac{2}{5} & \tfrac{1}{5} & 0 \\ 0 & 0 & \tfrac{1}{2} \\ -\tfrac{1}{5} & \tfrac{2}{5} & 0 \end{pmatrix}
= \begin{pmatrix} \tfrac{1}{5} & 0 & 0 \\ 0 & \tfrac{1}{5} & 0 \\ 0 & 0 & \tfrac{1}{4} \end{pmatrix},
\]

and thus (A^{-1})^T A^{-1} has the eigenvalues λ_1 = 1/4 and λ_2 = 1/5. Hence, from Theorem 2.37,

\[
\|A^{-1}\|_2 = \sqrt{\rho\big( (A^{-1})^T A^{-1} \big)} = \sqrt{\max\{1/4,\, 1/5\}} = \sqrt{1/4} = \frac{1}{2}.
\]

Hence, κ_2(A) = ‖A‖_2 ‖A^{-1}‖_2 = \sqrt{5}/2. □
Remark 3.4 (condition number for a positive definite symmetric matrix)
Note that if A ∈ R^{n×n} is symmetric and positive definite, all eigenvalues of A are positive
(see Theorem 1.1). If these eigenvalues are denoted by λ_1 ≥ λ_2 ≥ · · · ≥ λ_n > 0, then the
eigenvalues of A^{-1} are given by 1/λ_j, j = 1, 2, . . . , n. In the induced 2-norm we have
‖A‖_2 = \sqrt{\rho(A^T A)} (see Theorem 2.37 and Lemma 2.42) for any symmetric positive
definite matrix A ∈ R^{n×n}. Thus

\[
\|A\|_2 = \sqrt{\rho(A^T A)} = \rho(A) = \lambda_1,
\qquad
\|A^{-1}\|_2 = \sqrt{\rho\big( (A^{-1})^T A^{-1} \big)} = \rho(A^{-1}) = \lambda_n^{-1},
\]

where we have used the fact that the inverse of a positive definite symmetric matrix is also
a positive definite symmetric matrix. Thus

\[
\kappa_2(A) = \|A\|_2 \, \|A^{-1}\|_2 = \frac{\lambda_1}{\lambda_n}. \tag{3.8}
\]

We see from (3.8) that if λ_1 ≫ λ_n, then the condition number κ_2(A) is very large. This
means, from (3.3) in Theorem 3.2, that in the upper bound for the relative error the input
error is amplified by a very large factor. The problem of solving Ax = b is then poorly
conditioned.
Exercise 51 For the matrix A from Exercises 19, 22, 23, and 38,

\[
A = \begin{pmatrix} \tfrac{3}{2} & 0 & \tfrac{1}{2} \\ 0 & 3 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{3}{2} \end{pmatrix},
\]

compute the condition numbers with respect to the induced p-norms, for p ∈ {1, 2, ∞}.
Exercise 52 Compute the condition numbers with respect to the induced p-norms, where
p ∈ {1, 2,∞}, of the matrix
\[
A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & 1 \\ 0 & 1 & 3 \end{pmatrix}.
\]
Exercise 53 Let A = (a_{i,j}) in R^{n×n} be invertible and satisfy

\[
\sum_{j=1}^{n} |a_{i,j}| = 1, \quad 1 \le i \le n.
\]

Let D ∈ R^{n×n} be an arbitrary diagonal matrix with det(D) ≠ 0. Let κ_∞(A) denote the
condition number of A with respect to the induced matrix ∞-norm ‖ · ‖_∞. Show that

\[
\kappa_\infty(D A) \ge \kappa_\infty(A).
\]
3.2 Floating Point Arithmetic
Since computers use a finite number of bits to represent a real number, they can only represent
a finite subset of the real numbers. In general the range of representable numbers is large
enough; the major problem is the gaps between them.
Hence, it is time to discuss the representation of a number in a computer. We are used to
representing a number in decimal digits, e.g.,

\[
105.67 = 10^3 \cdot 0.10567
= +10^3 \big( 1 \cdot 10^{-1} + 0 \cdot 10^{-2} + 5 \cdot 10^{-3} + 6 \cdot 10^{-4} + 7 \cdot 10^{-5} \big).
\]
However, there is no need to represent numbers with respect to a base 10. A more general
representation is the B-adic representation.
Definition 3.5 (B-adic representation of floating point numbers)
A B-adic, normalized floating point number of precision m is either x = 0 or

\[
x = \pm B^e \sum_{k=-m}^{-1} x_k B^k, \quad x_{-1} \neq 0, \quad x_k \in \{0, 1, 2, \ldots, B-1\}.
\]

Here, e denotes the exponent with e_{min} ≤ e ≤ e_{max}, B ∈ N with B ≥ 2 is the base, and
\sum_{k=-m}^{-1} x_k B^k denotes the mantissa.
The corresponding floating point number system consists of all numbers that can be
represented in this form.
For a computer, the different formats for representing a number, like single and double
precision, are defined by the IEEE 754 standard. For example, it states that for a double
precision number we have B = 2 and m = 52. Hence, if one bit is used to store the sign and
if 11 bits are reserved for representing the exponent, a double precision number can be
stored in 64 bits.
As a consequence, the distribution of floating point numbers is not uniform. Furthermore, every
real number and the result of every operation has to be rounded. There are different rounding
strategies, but the default one is rounding to the nearest machine number.
Definition 3.6 (machine precision)
The machine precision, denoted by eps, is the smallest number which satisfies

\[
|x - \mathrm{rd}(x)| \le \mathrm{eps}\,|x| \tag{3.9}
\]

for all x ∈ R within the range of the floating point number system, where rd(x) denotes
rounding to the nearest machine number.
Note that (3.9) is a relative error bound, since for x ∈ R \ {0} (within the range of the
floating point number system)

\[
\frac{|x - \mathrm{rd}(x)|}{|x|} \le \mathrm{eps}.
\]
Theorem 3.7 (precision of floating point numbers)
For a floating point number system with base B and precision m we have

\[
|x - \mathrm{rd}(x)| \le |x| \, B^{1-m} \tag{3.10}
\]

for all x within the range of the floating point number system, which means eps ≤ B^{1-m}.
Proof of Theorem 3.7. For every real number x ≠ 0 in the range of the floating point
system there exists an integer e between e_{min} and e_{max} such that

\[
x = \pm B^e \sum_{k=-\infty}^{-1} x_k B^k
= \pm B^e \sum_{k=-m}^{-1} x_k B^k + \Big( \pm B^e \sum_{k=-\infty}^{-m-1} x_k B^k \Big),
\]

where x_{-1} ≠ 0. Hence, we have

\begin{align*}
|x - \mathrm{rd}(x)| &\le B^e \sum_{k=-\infty}^{-m-1} x_k B^k
\le B^e \sum_{k=-\infty}^{-m-1} (B-1)\,B^k
= B^e (B-1) \sum_{\ell=m+1}^{\infty} B^{-\ell} \\
&= B^e (B-1) \left( \frac{1}{1 - B^{-1}} - \frac{1 - B^{-(m+1)}}{1 - B^{-1}} \right)
= B^e (B-1) \left( \frac{B}{B-1} - \frac{B - B^{-m}}{B-1} \right) \\
&= B^e (B-1)\,\frac{B^{-m}}{B-1} = B^{e-m}, \tag{3.11}
\end{align*}

where we have used 0 ≤ x_k ≤ B − 1 and the formulas for the geometric sum and the geometric
series. On the other hand, we have, because x_{-1} ≠ 0 and thus x_{-1} ≥ 1,

\[
|x| \ge B^e x_{-1} B^{-1} \ge B^e B^{-1} = B^{e-1}, \tag{3.12}
\]

such that, from (3.11) and (3.12),

\[
\frac{|x - \mathrm{rd}(x)|}{|x|} \le \frac{B^{e-m}}{B^{e-1}} = B^{1-m},
\]

which gives the desired result (3.10) by multiplying with |x|. □
For the operations +,−, ·, /, we will denote the corresponding floating point operations
by ⊕,⊖,⊙,⊘.
Theorem 3.8 (accuracy of floating point operations)
Let ∗ be one of the operations +, −, ·, /, and let ⊛ be the equivalent floating point operation.
Then for all x, y from the floating point system, there exists ε with |ε| ≤ eps such that

\[
x \circledast y = (x * y)(1 + \epsilon). \tag{3.13}
\]

If x ∗ y ≠ 0 in (3.13), then (3.13) implies

\[
\frac{|(x \circledast y) - (x * y)|}{|x * y|} = |\epsilon| \le \mathrm{eps}. \tag{3.14}
\]

In words, (3.14) means that every operation in floating point arithmetic is exact up to
a relative error of size ≤ eps.
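In IEEE double precision arithmetic (B = 2, m = 52) the machine precision is available in
MATLAB as eps, and a relative error of the size described by (3.14) can be observed directly
(note that 0.1, 0.2, and 0.3 are themselves rounded to machine numbers):

eps                             % machine precision, about 2.2204e-16
abs((0.1 + 0.2) - 0.3) / 0.3    % relative error of the order of eps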
3.3 Conditioning
Consider a problem f : X → Y from a normed vector space X of data to a normed vector
space Y of solutions. For a particular point x ∈ X we say that the problem f(x) is
well-conditioned if all small perturbations of x lead only to small changes in f(x). A problem
is ill-conditioned if there exists a small perturbation that leads to a large change in f(x).

In general, one distinguishes between the absolute condition number, which is the smallest
number κ̂(x) which satisfies

\[
\|f(x + \Delta x) - f(x)\| \le \hat{\kappa}(x) \, \|\Delta x\| \quad \text{for all } \Delta x,
\]

and the more useful relative condition number, which is the smallest number κ(x) satisfying

\[
\frac{\|f(x + \Delta x) - f(x)\|}{\|f(x)\|} \le \kappa(x) \, \frac{\|\Delta x\|}{\|x\|} \quad \text{for all } \Delta x.
\]

Note that the absolute condition number κ̂(x) and the relative condition number κ(x) depend
on x. As we will see in the examples below, the condition number can vary a lot depending
on x.
We discuss some examples of well-conditioned and ill-conditioned problems.
Example 3.9 (matrix-vector multiplication)
Let A ∈ R^{m×n} or A ∈ C^{m×n}, and let ‖ · ‖_{(m)} and ‖ · ‖_{(n)} be norms on R^m and R^n, or
C^m and C^n, respectively. Let f(x) = A x be the problem of matrix-vector multiplication.
Then the relative error satisfies

\[
\frac{\|f(x + \Delta x) - f(x)\|_{(m)}}{\|f(x)\|_{(m)}}
= \frac{\|A (x + \Delta x) - A x\|_{(m)}}{\|A x\|_{(m)}}
= \frac{\|A\,\Delta x\|_{(m)}}{\|A x\|_{(m)}}
\le \frac{\|A\|_{(m,n)} \, \|\Delta x\|_{(n)}}{\|A x\|_{(m)}}
= \frac{\|A\|_{(m,n)} \, \|x\|_{(n)}}{\|A x\|_{(m)}} \cdot \frac{\|\Delta x\|_{(n)}}{\|x\|_{(n)}}
= \kappa(x)\,\frac{\|\Delta x\|_{(n)}}{\|x\|_{(n)}},
\]

with the relative condition number

\[
\kappa(x) := \frac{\|A\|_{(m,n)} \, \|x\|_{(n)}}{\|A x\|_{(m)}}.
\]

If m = n and if A is invertible, we have ‖x‖_{(n)} = ‖A^{-1} A x‖_{(n)} ≤ ‖A^{-1}‖_{(n,n)} ‖A x‖_{(n)},
such that

\[
\kappa(x) = \frac{\|A\|_{(n,n)} \, \|x\|_{(n)}}{\|A x\|_{(n)}}
\le \|A\|_{(n,n)} \, \|A^{-1}\|_{(n,n)} = \kappa(A). \; \Box
\]
In practice, we often do not apply the definitions of the absolute and relative condition
number exactly, but investigate the conditioning of a problem in a slightly less formalized
way, as illustrated in the following examples.
Example 3.10 (addition of two numbers)
Let f : R^2 → R be defined as f(x, y) = x + y. Then,

\begin{align*}
\frac{f(x + \Delta x, y + \Delta y) - f(x, y)}{f(x, y)}
&= \frac{x + \Delta x + y + \Delta y - (x + y)}{x + y}
= \frac{\Delta x + \Delta y}{x + y}
= \frac{\Delta x}{x + y} + \frac{\Delta y}{x + y} \\
&= \frac{x}{x + y} \cdot \frac{\Delta x}{x} + \frac{y}{x + y} \cdot \frac{\Delta y}{y}.
\end{align*}

Hence, the addition of two numbers is well-conditioned if both numbers have the same sign
and are sufficiently far away from zero. However, if both numbers are of the same magnitude
with different sign, that is, x ≈ −y, then adding these numbers is highly ill-conditioned
because of cancellation (that is, x + y ≈ 0). □
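Cancellation is easy to provoke in floating point arithmetic; in the following MATLAB
sketch the two summands are of the same magnitude with opposite sign, and the computed
sum loses almost all correct digits:

x = 1 + 1e-15;  y = -1;           % x is approximately -y
s = x + y;                        % exact answer would be 1e-15
abs(s - 1e-15) / 1e-15            % relative error ~1e-1, vastly larger than eps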
Example 3.11 (multiplying two numbers)
For the multiplication f : R^2 → R, f(x, y) = x y, of two real numbers x and y, we have

\[
\frac{f(x + \Delta x, y + \Delta y) - f(x, y)}{f(x, y)}
= \frac{(x + \Delta x)(y + \Delta y) - x y}{x y}
= \frac{x\,\Delta y + y\,\Delta x + \Delta x\,\Delta y}{x y}
= \frac{\Delta y}{y} + \frac{\Delta x}{x} + \frac{\Delta x}{x}\,\frac{\Delta y}{y},
\]

which shows that the multiplication of two numbers is always well-conditioned. □
Example 3.12 (function evaluation)
Let f(x) = e^{x^2}. Since f'(x) = 2x\,e^{x^2} = 2x\,f(x), we can use Taylor's formula to represent

\[
f(x + \Delta x) = f(x) + f'(x)\,\Delta x + O\big( (\Delta x)^2 \big).
\]

Substituting this expression into [f(x + ∆x) − f(x)]/f(x) yields

\[
\frac{f(x + \Delta x) - f(x)}{f(x)}
= \frac{\big[ e^{x^2} + 2x\,e^{x^2} \Delta x + O\big( (\Delta x)^2 \big) \big] - e^{x^2}}{e^{x^2}}
= 2x\,\Delta x + \frac{1}{e^{x^2}}\,O\big( (\Delta x)^2 \big)
= 2x^2\,\frac{\Delta x}{x} + O\big( (\Delta x)^2 \big).
\]

The problem is well-conditioned for small x and ill-conditioned for large x. □
3.4 Stability
Consider a problem f : X → Y from a vector space X of data to a vector space Y of solutions.
A numerical algorithm for the computation of f(x) = y is defined to be another map
f̃ : X → Y such that f̃(x) 'approximates' f(x). That is, given data x ∈ X, this data will be
rounded to floating point precision and supplied to the algorithm f̃. What results is a floating
point number f̃(x) ∈ Y . The major question is how accurate this output f̃(x) is, compared
to the true answer f(x). We will look at the relative error of the numerical solution,

\[
\mathrm{err}_{\mathrm{rel}} := \frac{\|\tilde{f}(x) - f(x)\|}{\|f(x)\|}.
\]

We would like this error to be small, of order eps.

We say that an algorithm f̃ for a problem f is stable if for each x ∈ X,

\[
\frac{\|\tilde{f}(x) - f(\tilde{x})\|}{\|f(\tilde{x})\|} = O(\mathrm{eps})
\quad \text{for some } \tilde{x} \text{ with } \frac{\|\tilde{x} - x\|}{\|x\|} = O(\mathrm{eps}).
\]

'A stable algorithm gives nearly the right answer to nearly the right question.'

A stronger and simpler numerical linear algebra condition is that of backward stability. We
say that an algorithm f̃ is backward stable if for each x ∈ X,

\[
\tilde{f}(x) = f(\tilde{x}) \quad \text{for some } \tilde{x} \text{ with } \frac{\|\tilde{x} - x\|}{\|x\|} = O(\mathrm{eps}). \tag{3.15}
\]
‘A backward stable algorithm gives exactly the right answer to nearly the right question.’
3.5 An Example of a Backward Stable Algorithm: Back Substitution
In this section we will verify that back substitution (see Theorem 2.15) is backward stable.
Recall that an n × n matrix U = (u_{i,j}) is said to be upper triangular if u_{i,j} = 0 for i > j.
For example, for n = 4, an upper triangular matrix is of the form

\[
U = \begin{pmatrix}
u_{1,1} & u_{1,2} & u_{1,3} & u_{1,4} \\
0 & u_{2,2} & u_{2,3} & u_{2,4} \\
0 & 0 & u_{3,3} & u_{3,4} \\
0 & 0 & 0 & u_{4,4}
\end{pmatrix}.
\]
Similarly, an n × n matrix L = (ℓi,j) is said to be lower triangular if ℓi,j = 0 for i < j.
Suppose we wish to solve an upper triangular linear system of equations U x = b, where U ∈ R^{n×n} is a non-singular upper triangular matrix, and x, b ∈ R^n. Then we use back substitution, as discussed in Section 2.2. The vector x = (x_1, x_2, . . . , x_n)^T is computed via
\[
x_j = \frac{1}{u_{j,j}} \Bigl( b_j - \sum_{k=j+1}^{n} u_{j,k}\, x_k \Bigr),
\qquad j = n, n-1, \ldots, 2, 1. \tag{3.16}
\]
From Theorem 2.15, the total number of elementary operations of this algorithm is O(n²).
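For concreteness, formula (3.16) can be implemented in a few lines of MATLAB. The following sketch is our own, written in the style of the codes appearing later in these notes; it assumes that U is non-singular and upper triangular:

function [x] = back_substitution(U,b)
%
% solves U*x = b by back substitution, see (3.16)
% input:  U = real non-singular n by n upper triangular matrix
%         b = real n by 1 vector
% output: x = solution of U*x = b
%
n = size(U,1);
x = zeros(n,1);
for j = n:-1:1
    x(j) = ( b(j) - U(j,j+1:n) * x(j+1:n) ) / U(j,j);
end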
Theorem 3.13 (backward stability of back substitution)
Consider the equation U x = b, where b ∈ R^n and U ∈ R^{n×n} is a non-singular upper triangular matrix. The back substitution algorithm (3.16), when executed in a floating point number system, is backward stable in the sense that the computed solution x̃ ∈ R^n satisfies
\[
(U + \Delta U)\,\tilde{\mathbf{x}} = \mathbf{b}
\]
for some upper triangular matrix ∆U ∈ R^{n×n} with
\[
\frac{\|\Delta U\|}{\|U\|} = O(\mathrm{eps}).
\]
Specifically, for each i, j,
\[
\frac{|\Delta U_{i,j}|}{|U_{i,j}|} \le n \cdot \mathrm{eps} + O(\mathrm{eps}^2).
\]
Proof of Theorem 3.13. We only show this for the cases n = 1, 2.

Case n = 1: For this case back substitution consists of the single step
\[
\tilde{x}_1 = b_1 \oslash U_{1,1}.
\]
Using the fundamental property of floating point arithmetic (see Theorem 3.8) gives
\[
\tilde{x}_1 = \frac{b_1}{U_{1,1}}\,(1 + \epsilon_1), \qquad |\epsilon_1| \le \mathrm{eps}.
\]
This is now used to express the error as if it resulted from a perturbation of U, by writing
\[
\tilde{x}_1 = \frac{b_1}{U_{1,1}\,(1 + \epsilon_1')}, \qquad |\epsilon_1'| \le \mathrm{eps} + O(\mathrm{eps}^2),
\]
where ε′₁ = −ε₁/(1 + ε₁). From this we have that x̃₁ is the correct solution to the perturbed problem
\[
(U_{1,1} + \Delta U_{1,1})\,\tilde{x}_1 = b_1, \qquad\text{with } \Delta U_{1,1} = \epsilon_1'\,U_{1,1}.
\]
Hence,
\[
\frac{|\Delta U_{1,1}|}{|U_{1,1}|} \le \mathrm{eps} + O(\mathrm{eps}^2).
\]

Case n = 2: For the case of a 2 × 2 matrix U, we have the first step as before,
\[
\tilde{x}_2 = b_2 \oslash U_{2,2} = \frac{b_2}{U_{2,2}\,(1 + \epsilon_1')}, \qquad |\epsilon_1'| \le \mathrm{eps} + O(\mathrm{eps}^2). \tag{3.17}
\]
The second step is
\[
\tilde{x}_1 = \bigl( b_1 \ominus (\tilde{x}_2 \odot U_{1,2}) \bigr) \oslash U_{1,1}.
\]
Again, using the fundamental property of floating point systems (see Theorem 3.8), we have
\[
\tilde{x}_1 = \bigl( b_1 \ominus \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2) \bigr) \oslash U_{1,1}
= \bigl( b_1 - \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2) \bigr)(1+\epsilon_3) \oslash U_{1,1}
= \frac{\bigl( b_1 - \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2) \bigr)(1+\epsilon_3)(1+\epsilon_4)}{U_{1,1}},
\]
and we have |εᵢ| ≤ eps, i = 2, 3, 4. Using the same method as in the first step we may shift the ε terms into the denominator to obtain
\[
\tilde{x}_1 = \frac{b_1 - \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2)}{(1+\epsilon_3')(1+\epsilon_4')\,U_{1,1}},
\]
with |ε′₃|, |ε′₄| ≤ eps + O(eps²). Multiplying the denominator terms out gives
\[
\tilde{x}_1 = \frac{b_1 - \tilde{x}_2\,U_{1,2}\,(1+\epsilon_2)}{(1 + 2\epsilon_5)\,U_{1,1}}, \tag{3.18}
\]
with |ε₅| ≤ eps + O(eps²).

We have shown that x̃ is the exact solution to a problem involving the perturbations (1 + 2ε₅), (1 + ε₂) and (1 + ε′₁) to the entries U_{1,1}, U_{1,2} and U_{2,2}, respectively. Rewriting (3.17) and (3.18), this may be stated in the form
\[
(U + \Delta U)\,\tilde{\mathbf{x}} = \mathbf{b},
\qquad\text{where}\qquad
\Delta U = \begin{pmatrix} 2\epsilon_5\,U_{1,1} & \epsilon_2\,U_{1,2}\\ 0 & \epsilon_1'\,U_{2,2} \end{pmatrix}.
\]
This gives ‖∆U‖/‖U‖ = O(eps). Hence, for 2 × 2 upper triangular matrices back substitution is backward stable.

The proof for n > 2 is similar. □
We want to relate the backward stability in Theorem 3.13 to the definition of backward stability given in the last section. In the terminology of Section 3.4, the problem is to solve the linear system U x = b, and, due to the floating point arithmetic, we solve instead (U + ∆U) x̃ = b. Thus, in the language of the previous section, the problem is
\[
f(\mathbf{y}) := U^{-1}\,\mathbf{y}
\]
and we solve instead the problem
\[
\tilde{f}(\mathbf{y}) := (U + \Delta U)^{-1}\,\mathbf{y}.
\]
Thus the first condition in (3.15) reads
\[
\tilde{\mathbf{x}} := \tilde{f}(\mathbf{b}) = (U + \Delta U)^{-1}\,\mathbf{b} = f(\tilde{\mathbf{b}}) = U^{-1}\,\tilde{\mathbf{b}}
\quad\Rightarrow\quad
(U + \Delta U)\,\tilde{\mathbf{x}} = \mathbf{b} \;\text{ and }\; U\,\tilde{\mathbf{x}} = \tilde{\mathbf{b}}.
\]
Thus b − b̃ = ∆U x̃ and hence, using again x̃ = (U + ∆U)⁻¹ b,
\[
\frac{\|\tilde{\mathbf{b}} - \mathbf{b}\|}{\|\mathbf{b}\|}
= \frac{\|\Delta U\,\tilde{\mathbf{x}}\|}{\|\mathbf{b}\|}
= \frac{\bigl\|\Delta U\,(U + \Delta U)^{-1}\,\mathbf{b}\bigr\|}{\|\mathbf{b}\|}
\le \frac{\bigl\|\Delta U\,(U + \Delta U)^{-1}\bigr\|\,\|\mathbf{b}\|}{\|\mathbf{b}\|}
= \bigl\|\Delta U\,(U + \Delta U)^{-1}\bigr\|
\le \frac{\|\Delta U\|}{\|U\|}\,\Bigl( \|U\|\,\bigl\|(U + \Delta U)^{-1}\bigr\| \Bigr).
\]
The second factor on the right-hand side can be considered as a constant that depends on the conditioning of U and of the perturbed matrix U + ∆U, and the first factor is the relative error ‖∆U‖/‖U‖ = O(eps). Hence ‖b̃ − b‖/‖b‖ = O(eps), which is exactly the backward stability condition (3.15) for the data b.
Chapter 4
Direct Methods for Linear Systems
In this chapter we discuss so-called direct methods for solving linear systems Ax = b. The
most basic method for solving linear systems directly is Gaussian elimination which you will
all have practiced in linear algebra when you solved small linear systems (see Section 4.1). A
formalized way of describing Gaussian elimination via left-multiplication with elementary lower
triangular matrices leads us to the LU factorization of a matrix A into A = L U , where L is
a normalized lower triangular matrix and U an upper triangular matrix (see Section 4.2). In
Section 4.3, we learn that Gaussian elimination and hence the LU factorization can be enhanced by interchanging rows (or columns) of A to stabilize the process; this is referred to as pivoting.
In Section 4.4, we prove that a Hermitian matrix A ∈ Cn×n (that is, A∗ = A) that is positive
definite has a so-called Cholesky factorization A = L L∗, where L is a lower triangular
matrix. In Section 4.5, we show that every matrix A ∈ Cn×n has a QR factorization, that is,
A = Q R where Q is a unitary matrix and R is an upper triangular matrix.
The advantage of any of these factorizations of A is that, once the matrices in the factorization are known, the linear system Ax = b can be solved more economically: Indeed, any of these factorizations of A is of the form A = B R, where R is an upper triangular matrix and B is either a lower triangular matrix or a unitary matrix. Then Ax = b becomes
\[
B\,R\,\mathbf{x} = \mathbf{b} \quad\Leftrightarrow\quad B\,\mathbf{y} = \mathbf{b} \;\text{ and }\; R\,\mathbf{x} = \mathbf{y}.
\]
Since B is either a lower triangular or a unitary matrix, B y = b can be solved inexpensively, and R x = y can subsequently be solved with back substitution. This is particularly useful if we want to solve several systems Ax = b_j, j = 1, 2, . . . , m, with the same matrix but different right-hand sides.
The methods discussed in this chapter are called direct methods because we solve (in some
clever way) the linear system directly. Direct methods are to be seen in contrast to so-called
iterative methods, where we compute a sequence of approximations of the solution. An example
of an iterative method that you will have seen in your second year classes is Banach’s fixed
point iteration. Iterative methods will be discussed in Chapter 5.
4.1 Standard Gaussian Elimination
The aim of Gaussian elimination is to reduce a linear system of equations Ax = b to an
equivalent upper triangular system U x = b′ (with an upper triangular matrix U and a new
right-hand side b′) by applying very simple linear transformations to it. Once we have the
system in the form U x = b′, where U is an upper triangular matrix, we then can use back
substitution (see Section 2.2) to solve U x = b′.
Suppose we have a linear system Ax = b, where A = (a_{i,j}) is in C^{n×n} or R^{n×n}, that is,
\[
\begin{aligned}
a_{1,1}\,x_1 + a_{1,2}\,x_2 + \ldots + a_{1,n}\,x_n &= b_1\\
a_{2,1}\,x_1 + a_{2,2}\,x_2 + \ldots + a_{2,n}\,x_n &= b_2\\
&\;\;\vdots\\
a_{n,1}\,x_1 + a_{n,2}\,x_2 + \ldots + a_{n,n}\,x_n &= b_n.
\end{aligned}
\]
Then, assuming that a_{1,1} ≠ 0, we can multiply the first row by a_{2,1}/a_{1,1} and subtract the resulting row from the second row, which cancels the coefficient of x_1 in that row. Then, we can multiply the first row by a_{3,1}/a_{1,1} and subtract it from the third row, which, again, cancels the coefficient of x_1 in that row. Continuing like this all the way down, we end up with an equivalent linear system of the form
\[
\begin{aligned}
a_{1,1}\,x_1 + a_{1,2}\,x_2 + \ldots + a_{1,n}\,x_n &= b_1\\
a^{(2)}_{2,2}\,x_2 + \ldots + a^{(2)}_{2,n}\,x_n &= b^{(2)}_2\\
&\;\;\vdots\\
a^{(2)}_{n,2}\,x_2 + \ldots + a^{(2)}_{n,n}\,x_n &= b^{(2)}_n.
\end{aligned}
\]
Assuming now that a^{(2)}_{2,2} ≠ 0, we can repeat the whole process using the second row to eliminate x_2 from rows 3 to n, and so on. Hence, after k − 1 steps the system can be written in matrix form as
\[
A^{(k)}\,\mathbf{x} =
\begin{pmatrix}
a_{1,1} & * & \cdots & * & \cdots & *\\
0 & a^{(2)}_{2,2} & \ddots & & & \vdots\\
\vdots & & \ddots & * & \cdots & *\\
0 & \cdots & 0 & a^{(k)}_{k,k} & \cdots & a^{(k)}_{k,n}\\
\vdots & & \vdots & \vdots & & \vdots\\
0 & \cdots & 0 & a^{(k)}_{n,k} & \cdots & a^{(k)}_{n,n}
\end{pmatrix}
\begin{pmatrix}
x_1\\ x_2\\ \vdots\\ x_k\\ \vdots\\ x_n
\end{pmatrix}
=
\begin{pmatrix}
b_1\\ b^{(2)}_2\\ \vdots\\ b^{(k)}_k\\ \vdots\\ b^{(k)}_n
\end{pmatrix}. \tag{4.1}
\]
If in each step a^{(k)}_{k,k} is different from zero, Gaussian elimination produces, after executing all n − 1 steps, an upper triangular matrix.
Example 4.1 (Gaussian elimination)
Use Gaussian elimination to transform the following linear system into triangular form:
\[
A\,\mathbf{x} = \mathbf{b}, \qquad\text{where}\quad
A = \begin{pmatrix} 1 & 0 & 2\\ 2 & 1 & 3\\ 1 & -1 & 0 \end{pmatrix},
\quad
\mathbf{b} = \begin{pmatrix} 3\\ 6\\ 9 \end{pmatrix}.
\]
Solution: We write the linear system as an augmented matrix, and then we multiply the first row by 2 and subtract it from the second row. Subsequently we subtract the first row from the third row:
\[
\left(\begin{array}{ccc|c} 1 & 0 & 2 & 3\\ 2 & 1 & 3 & 6\\ 1 & -1 & 0 & 9 \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{ccc|c} 1 & 0 & 2 & 3\\ 0 & 1 & -1 & 0\\ 1 & -1 & 0 & 9 \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{ccc|c} 1 & 0 & 2 & 3\\ 0 & 1 & -1 & 0\\ 0 & -1 & -2 & 6 \end{array}\right).
\]
In the second step of the Gaussian elimination, we add the new second row to the new third row, and obtain
\[
\left(\begin{array}{ccc|c} 1 & 0 & 2 & 3\\ 0 & 1 & -1 & 0\\ 0 & 0 & -3 & 6 \end{array}\right)
\;\Leftrightarrow\;
U\,\mathbf{x} = \mathbf{b}' \quad\text{with}\quad
U = \begin{pmatrix} 1 & 0 & 2\\ 0 & 1 & -1\\ 0 & 0 & -3 \end{pmatrix}
\quad\text{and}\quad
\mathbf{b}' = \begin{pmatrix} 3\\ 0\\ 6 \end{pmatrix},
\]
which is an equivalent linear system with an upper triangular matrix. This is the linear system which was solved in Example 2.16 with back substitution. □
For theoretical (not numerical) reasons, it is useful to rewrite this process using matrices.
Definition 4.2 (elementary lower triangular matrix)
For 1 ≤ i ≤ n, let x ∈ C^n be such that e_j^T x = 0 for j ≤ i, that is, x is of the form
\[
\mathbf{x} = (0, 0, \ldots, 0, x_{i+1}, \ldots, x_n)^T.
\]
The elementary lower triangular matrix L_i(x) is given by
\[
L_i(\mathbf{x}) := I - \mathbf{x}\,\mathbf{e}_i^T =
\begin{pmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 & & & \\
& & -x_{i+1} & 1 & & \\
& & \vdots & & \ddots & \\
& & -x_n & & & 1
\end{pmatrix},
\]
where the entries −x_{i+1}, . . . , −x_n stand in column i below the diagonal and all other off-diagonal entries are zero. Note that in L_i(x) the vector x has been subtracted from the ith column of the identity matrix.
74 4.1. Standard Gaussian Elimination
Example 4.3 (elementary lower triangular matrix)
For x = (0, 0, 3, −2)^T we have e_1^T x = e_2^T x = 0, and the elementary lower triangular matrices L_1(x) and L_2(x) in R^{4×4} are given by
\[
L_1(\mathbf{x}) = I - \mathbf{x}\,\mathbf{e}_1^T =
\begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ -3 & 0 & 1 & 0\\ 2 & 0 & 0 & 1 \end{pmatrix},
\qquad
L_2(\mathbf{x}) = I - \mathbf{x}\,\mathbf{e}_2^T =
\begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & -3 & 1 & 0\\ 0 & 2 & 0 & 1 \end{pmatrix}. \qquad\square
\]
You should make sure that the explicit formula for the elementary lower triangular matrix is
clear to you. You should also verify the properties of the matrix Li(x) given in the lemma
below for yourself.
Lemma 4.4 (properties of elementary lower triangular matrices)
An elementary lower triangular n × n matrix L_i(x), where x = (0, . . . , 0, x_{i+1}, . . . , x_n)^T, has the following properties:

(i) det(L_i(x)) = 1.

(ii) (L_i(x))^{-1} = L_i(−x).

(iii) Multiplying a matrix A = (a_{i,j}) in C^{n×n} (or in R^{n×n}) with L_i(x) from the left leaves the first i rows unchanged and, for each row j with j > i, subtracts the row vector x_j (a_{i,1}, a_{i,2}, . . . , a_{i,n}) from row j of A.

We note that statement (ii) in Lemma 4.4 is intuitively clear from the interpretation given in (iii): multiplication of A with L_i(x) from the left subtracts for j > i the row vector x_j (a_{i,1}, a_{i,2}, . . . , a_{i,n}) from the jth row of A, and multiplication of B with L_i(−x) from the left adds for j > i the row vector x_j (b_{i,1}, b_{i,2}, . . . , b_{i,n}) to the jth row of B; thus L_i(−x) L_i(x) A gives back the matrix A. (Note that this is not a proper proof, but it is very helpful for your understanding!)

Proof of Lemma 4.4. This proof is very straightforward and is left as an exercise. □
From statement (iii) in Lemma 4.4 it is clear that we can describe each step in the Gaussian elimination by multiplication with an elementary lower triangular matrix: Indeed, with suitable elementary lower triangular matrices L_1(m_1), L_2(m_2), . . . , L_{n−1}(m_{n−1}), the result of Gaussian elimination for the linear system Ax = b can be written as U x = b′, where
\[
U = L_{n-1}(\mathbf{m}_{n-1}) \cdots L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,A,
\qquad
\mathbf{b}' = L_{n-1}(\mathbf{m}_{n-1}) \cdots L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,\mathbf{b},
\]
and where U is now upper triangular.
Example 4.5 (Gaussian elimination with elementary lower triangular matrices)
Consider the linear system from Example 4.1,
\[
A\,\mathbf{x} = \mathbf{b}, \qquad\text{where}\quad
A = \begin{pmatrix} 1 & 0 & 2\\ 2 & 1 & 3\\ 1 & -1 & 0 \end{pmatrix},
\quad
\mathbf{b} = \begin{pmatrix} 3\\ 6\\ 9 \end{pmatrix}.
\]
For executing the steps of the Gaussian elimination in Example 4.1, we have used the elementary lower triangular matrices
\[
L_1(\mathbf{m}_1) = I - \mathbf{m}_1\,\mathbf{e}_1^T =
\begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ -1 & 0 & 1 \end{pmatrix}
\qquad\text{and}\qquad
L_2(\mathbf{m}_2) = I - \mathbf{m}_2\,\mathbf{e}_2^T =
\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1 \end{pmatrix},
\]
where m_1 = (0, 2, 1)^T and m_2 = (0, 0, −1)^T. Thus we have the new system U x = b′ with
\[
U = L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,A =
\begin{pmatrix} 1 & 0 & 2\\ 0 & 1 & -1\\ 0 & 0 & -3 \end{pmatrix}
\qquad\text{and}\qquad
\mathbf{b}' = L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,\mathbf{b} =
\begin{pmatrix} 3\\ 0\\ 6 \end{pmatrix}. \qquad\square
\]
Exercise 54 Solve the following linear system by hand with Gaussian elimination:
\[
A\,\mathbf{x} = \mathbf{b}, \qquad
A = \begin{pmatrix} 1 & 2 & 1\\ -1 & 1 & 2\\ 2 & 2 & 4 \end{pmatrix},
\quad
\mathbf{b} = \begin{pmatrix} 1\\ 8\\ 4 \end{pmatrix}. \tag{4.2}
\]
Exercise 55 Write down three 4 × 4 elementary lower triangular matrices, and explain for each matrix what left-multiplication with this matrix does. Verify the properties (i) and (ii) from Lemma 4.4 explicitly for your three elementary lower triangular matrices.

Exercise 56 Find the elementary lower triangular matrices L_1(m_1) and L_2(m_2) which describe the Gaussian elimination to bring the linear system (4.2) into the form U x = b′ with an upper triangular matrix U.
Exercise 57 Prove Lemma 4.4.
4.2 The LU Factorization
On a computer, Gaussian elimination is realized by programming the corresponding operations directly. This leads to an O(n³) complexity. In addition, we have an O(n²) complexity for solving the final linear system by back substitution.
However, there is another way to look at the process. From Lemma 4.4, we know that multiplication with a suitable elementary lower triangular matrix realizes exactly one step of the Gaussian elimination process. Therefore, we can construct a sequence of n − 1 elementary lower triangular matrices L_j = L_j(m_j), j = 1, 2, . . . , n − 1, such that
\[
L_{n-1}\,L_{n-2} \cdots L_2\,L_1\,A = U,
\]
where U is an upper triangular matrix. Since elementary lower triangular matrices are invertible, we have
\[
A = L_1^{-1}\,L_2^{-1} \cdots L_{n-2}^{-1}\,L_{n-1}^{-1}\,U.
\]
Using L_i^{-1} = (L_i(m_i))^{-1} = L_i(−m_i) (see Lemma 4.4 (ii)) and e_j^T m_k = 0 for j ≤ k, we have
\[
\begin{aligned}
A &= L_1(-\mathbf{m}_1)\,L_2(-\mathbf{m}_2) \cdots L_{n-2}(-\mathbf{m}_{n-2})\,L_{n-1}(-\mathbf{m}_{n-1})\,U\\
&= \bigl(I + \mathbf{m}_1 \mathbf{e}_1^T\bigr) \bigl(I + \mathbf{m}_2 \mathbf{e}_2^T\bigr) \cdots \bigl(I + \mathbf{m}_{n-2} \mathbf{e}_{n-2}^T\bigr) \bigl(I + \mathbf{m}_{n-1} \mathbf{e}_{n-1}^T\bigr)\,U\\
&= \bigl(I + \mathbf{m}_1 \mathbf{e}_1^T + \mathbf{m}_2 \mathbf{e}_2^T + \ldots + \mathbf{m}_{n-2} \mathbf{e}_{n-2}^T + \mathbf{m}_{n-1} \mathbf{e}_{n-1}^T\bigr)\,U\\
&= L\,U,
\end{aligned} \tag{4.3}
\]
where the matrix
\[
L := I + \mathbf{m}_1 \mathbf{e}_1^T + \mathbf{m}_2 \mathbf{e}_2^T + \ldots + \mathbf{m}_{n-2} \mathbf{e}_{n-2}^T + \mathbf{m}_{n-1} \mathbf{e}_{n-1}^T \tag{4.4}
\]
is lower triangular with all diagonal elements equal to one. In the second-last step of (4.3), we have used
\[
\bigl(I + \mathbf{m}_1 \mathbf{e}_1^T\bigr) \bigl(I + \mathbf{m}_2 \mathbf{e}_2^T\bigr)
= I + \mathbf{m}_1 \mathbf{e}_1^T + \mathbf{m}_2 \mathbf{e}_2^T + \mathbf{m}_1 \mathbf{e}_1^T \mathbf{m}_2 \mathbf{e}_2^T
= I + \mathbf{m}_1 \mathbf{e}_1^T + \mathbf{m}_2 \mathbf{e}_2^T,
\]
since e_1^T m_2 = 0 by assumption, and analogously for the remaining matrix multiplications.

Inspection of (4.4) shows that L has the following form: if m_j = (0, 0, . . . , 0, m^{(j)}_{j+1}, . . . , m^{(j)}_n)^T (note that the first j entries of m_j are zero by the definition of the elementary lower triangular matrix L_j(m_j)), then the jth column vector of L is the vector (0, 0, . . . , 0, 1, m^{(j)}_{j+1}, . . . , m^{(j)}_n)^T.
Definition 4.6 (LU factorization/decomposition)
Let A ∈ C^{n×n} (or A ∈ R^{n×n}). The LU factorization (or LU decomposition) of a matrix A is the factorization A = L U into the product of a normalized lower triangular matrix L and an upper triangular matrix U. (A normalized lower triangular matrix is a lower triangular matrix whose diagonal entries are all equal to one.)
Theorem 4.7 (LU factorization)
Let A = (a_{i,j}) be an n × n matrix in C^{n×n} (or R^{n×n}), and let A_p = (a_{i,j})_{1≤i,j≤p} be the pth upper principal submatrix. If det(A_p) ≠ 0 for p = 1, 2, . . . , n, then A has an LU factorization A = L U, with a normalized lower triangular matrix L and an invertible upper triangular matrix U.
Proof of Theorem 4.7. From the considerations above, we have seen that Gaussian elimination, if it can be successfully applied, leads to an LU factorization. Thus the proof is given by induction over the steps of the Gaussian elimination.

Initial step: For k = 1 we have det A_1 = det(a_{1,1}) = a_{1,1} ≠ 0, hence the first step of the Gaussian elimination is possible.

Induction step: For k → k + 1 assume our matrix is in the form (4.1). Then we have, with suitable elementary lower triangular matrices L_1, L_2, . . . , L_{k−1}, L_k,
\[
A^{(k)} = L_k\,L_{k-1} \cdots L_2\,L_1\,A.
\]
From the fact that multiplication with L_j from the left subtracts multiples of the jth row from rows with row number > j, and from the properties of the determinant, the following is easily seen: multiplication of A with elementary lower triangular matrices from the left does not alter the value of the determinant of the pth principal submatrix. Hence
\[
0 \ne \det(A_k) = \det\bigl((L_k\,L_{k-1} \cdots L_2\,L_1\,A)_k\bigr) = \det\bigl((A^{(k)})_k\bigr) = a_{1,1}\,a^{(2)}_{2,2} \cdots a^{(k)}_{k,k},
\]
where the last step follows from (4.1). So, in particular, a^{(k)}_{k,k} ≠ 0. Hence, the next step of the Gaussian elimination is possible as well, and after n − 1 steps we have the LU factorization of A as derived at the beginning of this section. □
Theorem 4.8 (LU factorization of invertible matrix is unique)
If an invertible matrix A has an LU factorization, then its LU factorization is unique.

Proof of Theorem 4.8. Suppose A = L_1 U_1 and A = L_2 U_2 are two LU factorizations of a non-singular matrix A. As A is non-singular, we have 0 ≠ det(A) = det(L_i) det(U_i) for i = 1, 2. Thus we know that det(U_i) ≠ 0 and det(L_i) ≠ 0 for i = 1, 2, and hence L_i and U_i, i = 1, 2, are invertible. Therefore,
\[
L_1\,U_1 = L_2\,U_2 = A \quad\Rightarrow\quad L_2^{-1}\,L_1 = U_2\,U_1^{-1}.
\]
Furthermore, L_2^{-1} L_1 is a product of normalized lower triangular matrices; hence, it is a normalized lower triangular matrix. However, U_2 U_1^{-1} is upper triangular, since U_2 and U_1^{-1} are upper triangular. The only possibility is that both L_2^{-1} L_1 and U_2 U_1^{-1} are diagonal matrices and equal to the identity matrix (since L_2^{-1} L_1 is normalized, that is, has entries one on the diagonal). Thus
\[
L_2^{-1}\,L_1 = U_2\,U_1^{-1} = I \quad\Rightarrow\quad L_1 = L_2 \;\text{ and }\; U_2 = U_1,
\]
and we see that the LU factorization is unique. □
Example 4.9 (LU factorization)
For the matrix A from Examples 4.1 and 4.5,
\[
A = \begin{pmatrix} 1 & 0 & 2\\ 2 & 1 & 3\\ 1 & -1 & 0 \end{pmatrix},
\]
we found in Example 4.5 that
\[
L_2(\mathbf{m}_2)\,L_1(\mathbf{m}_1)\,A = U \qquad\text{with}\quad
U := \begin{pmatrix} 1 & 0 & 2\\ 0 & 1 & -1\\ 0 & 0 & -3 \end{pmatrix}, \tag{4.5}
\]
with the elementary lower triangular matrices
\[
L_1(\mathbf{m}_1) = \begin{pmatrix} 1 & 0 & 0\\ -2 & 1 & 0\\ -1 & 0 & 1 \end{pmatrix}
\qquad\text{and}\qquad
L_2(\mathbf{m}_2) = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 1 \end{pmatrix},
\]
where m_1 = (0, 2, 1)^T and m_2 = (0, 0, −1)^T. From (4.5),
\[
A = \bigl(L_1(\mathbf{m}_1)\bigr)^{-1} \bigl(L_2(\mathbf{m}_2)\bigr)^{-1} U = L\,U,
\qquad\text{with}\quad
L := \bigl(L_1(\mathbf{m}_1)\bigr)^{-1} \bigl(L_2(\mathbf{m}_2)\bigr)^{-1}.
\]
We compute L as follows, with the help of Lemma 4.4:
\[
L = \bigl(L_1(\mathbf{m}_1)\bigr)^{-1} \bigl(L_2(\mathbf{m}_2)\bigr)^{-1} = L_1(-\mathbf{m}_1)\,L_2(-\mathbf{m}_2)
= \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -1 & 1 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 1 & -1 & 1 \end{pmatrix}.
\]
Thus the LU factorization of A reads
\[
\begin{pmatrix} 1 & 0 & 0\\ 2 & 1 & 0\\ 1 & -1 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 2\\ 0 & 1 & -1\\ 0 & 0 & -3 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 2\\ 2 & 1 & 3\\ 1 & -1 & 0 \end{pmatrix}. \qquad\square
\]
The LU factorization of a real square matrix via standard Gaussian elimination can be implemented with the following MATLAB code:
function [L,U] = LU_factorization_1(A)
%
% algorithm computes LU factorization (without pivoting),
% if it exists for matrix A
% input: A = real n by n matrix, that allows LU factorization without pivoting
% output: L = real n by n normalized lower triangular matrix
% U = real n by n upper triangular matrix
%
n = size(A,1);
U = A;
B = zeros(n,n);
L = eye(n,n);
for i = 1:n-1
L(i+1:n,i) = U(i+1:n,i)/U(i,i);
B = U;
U(i+1:n,i:n) = B(i+1:n,i:n) - (L(i+1:n,i)) * B(i,i:n);
end
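For instance, applied to the matrix from Examples 4.1 and 4.9, the function should reproduce the factors computed there by hand (a usage sketch; the expected output is given in the comment):

A = [1 0 2; 2 1 3; 1 -1 0];
[L,U] = LU_factorization_1(A)
% expected: L = [1 0 0; 2 1 0; 1 -1 1] and U = [1 0 2; 0 1 -1; 0 0 -3]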
Exercise 58 For the matrix A from Exercises 54 and 56,
\[
A = \begin{pmatrix} 1 & 2 & 1\\ -1 & 1 & 2\\ 2 & 2 & 4 \end{pmatrix},
\]
find the unique LU factorization of A.
Remark 4.10 (computing the determinant with LU factorization)
A numerically reasonable way of calculating the determinant of a matrix A is to first compute the LU factorization of A and then use the fact that
\[
\det(A) = \det(L\,U) = \det(L)\,\det(U) = \det(U) = u_{1,1}\,u_{2,2} \cdots u_{n,n},
\]
where we have used det(L) = 1, since L is a normalized lower triangular matrix.
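A minimal MATLAB sketch of Remark 4.10 (our own, using the function LU_factorization_1 from above and the matrix from Example 4.9):

A = [1 0 2; 2 1 3; 1 -1 0];
[L,U] = LU_factorization_1(A);
det_A = prod(diag(U))    % = 1 * 1 * (-3) = -3, the determinant of A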
A possible way of deriving an algorithm for calculating the LU factorization starts with the component-wise formulation of A = L U, that is,
\[
a_{i,j} = \sum_{k=1}^{n} \ell_{i,k}\,u_{k,j}, \qquad 1 \le i, j \le n.
\]
Since U = (u_{i,j}) is upper triangular and L = (ℓ_{i,j}) is lower triangular, the upper limit of the sum is actually given by min{i, j}. Taking the two possible cases separately gives
\[
a_{i,j} = \sum_{k=1}^{j} \ell_{i,k}\,u_{k,j}, \qquad 1 \le j \le i \le n, \tag{4.6}
\]
\[
a_{i,j} = \sum_{k=1}^{i} \ell_{i,k}\,u_{k,j}, \qquad 1 \le i \le j \le n. \tag{4.7}
\]
For our convenience it is useful to swap the indices i and j in (4.6), giving
\[
a_{j,i} = \sum_{k=1}^{i} \ell_{j,k}\,u_{k,i}, \qquad 1 \le i \le j \le n. \tag{4.8}
\]
Rearranging these equations, and using the fact that ℓ_{i,i} = 1 for all 1 ≤ i ≤ n, (4.8) implies (4.9) below and (4.7) implies (4.10) below:
\[
\ell_{j,i} = \frac{1}{u_{i,i}} \Bigl( a_{j,i} - \sum_{k=1}^{i-1} \ell_{j,k}\,u_{k,i} \Bigr), \qquad 1 \le i \le n, \; i \le j \le n, \tag{4.9}
\]
\[
u_{i,j} = a_{i,j} - \sum_{k=1}^{i-1} \ell_{i,k}\,u_{k,j}, \qquad 1 \le i \le n, \; i \le j \le n. \tag{4.10}
\]
To see how the algorithm works, and in which order the equations are solved, let us work out all 9 equations for the case of a 3 × 3 matrix. By assumption, we have ℓ_{1,1} = ℓ_{2,2} = ℓ_{3,3} = 1. From (4.10), we find for i = 1 the first row of U:
\[
u_{1,1} = a_{1,1}, \qquad u_{1,2} = a_{1,2}, \qquad u_{1,3} = a_{1,3}.
\]
For i = 1, we find from (4.9) the first column of L:
\[
\ell_{1,1} = \frac{a_{1,1}}{u_{1,1}} = 1, \qquad \ell_{2,1} = \frac{a_{2,1}}{u_{1,1}}, \qquad \ell_{3,1} = \frac{a_{3,1}}{u_{1,1}}.
\]
For i = 2, we find from (4.10) the second row of U:
\[
u_{2,2} = a_{2,2} - \ell_{2,1}\,u_{1,2}, \qquad u_{2,3} = a_{2,3} - \ell_{2,1}\,u_{1,3}.
\]
For i = 2, we find from (4.9) the second column of L:
\[
\ell_{2,2} = \frac{1}{u_{2,2}} \bigl( a_{2,2} - \ell_{2,1}\,u_{1,2} \bigr) = 1, \qquad
\ell_{3,2} = \frac{1}{u_{2,2}} \bigl( a_{3,2} - \ell_{3,1}\,u_{1,2} \bigr).
\]
For i = 3, we find from (4.10) the third row of U:
\[
u_{3,3} = a_{3,3} - \ell_{3,1}\,u_{1,3} - \ell_{3,2}\,u_{2,3}.
\]
For i = 3, we find from (4.9) the third column of L:
\[
\ell_{3,3} = \frac{1}{u_{3,3}} \bigl( a_{3,3} - \ell_{3,1}\,u_{1,3} - \ell_{3,2}\,u_{2,3} \bigr) = 1.
\]
Note that each step uses only entries of L and U that have been computed in previous steps; this shows that we have arranged the equations (4.9) and (4.10) in the right order. We have included the diagonal elements of L to show more clearly how the algorithm works, but in any implementation we would of course not compute these, since we know they have the value one.

We have seen that the algorithm computes the rows of U and the columns of L in the following order:

row 1 of U, column 1 of L, row 2 of U, column 2 of L, . . . , row n−1 of U, column n−1 of L, row n of U,

where we need not compute column n of L, since this contains only the diagonal entry ℓ_{n,n} = 1. In pseudo-algorithmic form this algorithm can be formulated as follows for real matrices:
Algorithm 1 LU Factorization
1: input: real n × n matrix A = (a_{i,j}) that has an LU factorization (without pivoting)
2: initialize L = (ℓ_{i,j}) = I ∈ R^{n×n}, U = (u_{i,j}) = 0 ∈ R^{n×n}
3: for i = 1, 2, . . . , n do
4:   for j = i, i + 1, . . . , n do
5:     u_{i,j} = a_{i,j} − Σ_{k=1}^{i−1} ℓ_{i,k} u_{k,j}
6:   end for
7:   for j = i + 1, i + 2, . . . , n do
8:     ℓ_{j,i} = (1/u_{i,i}) ( a_{j,i} − Σ_{k=1}^{i−1} ℓ_{j,k} u_{k,i} )
9:   end for
10: end for
The algorithm for computing the LU factorization for real matrices in this way can be imple-
mented with the following MATLAB code:
function [L,U] = LU_factorization_2(A)
%
% algorithm for computing the LU factorization directly from A = L*U
% input: A = real n by n matrix, that has LU factorization without pivoting
% output: L = real n by n normalized lower triangular matrix
% U = real n by n upper triangular matrix
%
n = size(A,1);
L = eye(n,n);
U = zeros(n,n);
for i=1:n
U(i,i:n) = A(i,i:n) - L(i,1:i-1) * U(1:i-1,i:n);
L(i+1:n,i) = ( A(i+1:n,i) - L(i+1:n,1:i-1) * U(1:i-1,i) ) / U(i,i);
end
The complexity of this procedure is O(n³). Indeed, consider the LU factorization in its standard form (not the version above): in the jth step we need to perform (n − j)(n − j + 1) elementary operations, where we now for simplicity consider an elementary operation to be one multiplication/division together with one addition/subtraction. Hence we have
\[
\sum_{j=1}^{n-1} (n-j)(n-j+1)
= \sum_{j=1}^{n-1} \bigl( n^2 + n - (2n+1)\,j + j^2 \bigr)
= n^2 (n-1) + n(n-1) - (2n+1)\,\frac{n(n-1)}{2} + \frac{1}{6}\,(n-1)\,n\,(2n-1)
= \frac{1}{3}\,n\,(n^2-1) = \frac{1}{3}\,n^3 - \frac{1}{3}\,n.
\]
Thus we need to execute O(n³) elementary operations, each consisting of one multiplication plus one addition.
Remark 4.11 (how to solve a linear system with the LU factorization)
If A = L U, then the linear system Ax = L U x = b can be solved in two steps: first, we solve L y = b by forward substitution, and then U x = y by back substitution. Both are possible in O(n²) time, once the LU factorization of A is known.
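A sketch of Remark 4.11 in MATLAB (our own code; it assumes that the factors L and U of A have already been computed, for example with LU_factorization_2):

function [x] = solve_with_LU(L,U,b)
%
% solves A*x = b, given the factorization A = L*U (see Remark 4.11)
% first forward substitution for L*y = b, then back substitution for U*x = y
%
n = size(L,1);
y = zeros(n,1);
for i = 1:n
    y(i) = ( b(i) - L(i,1:i-1) * y(1:i-1) ) / L(i,i);
end
x = zeros(n,1);
for j = n:-1:1
    x(j) = ( y(j) - U(j,j+1:n) * x(j+1:n) ) / U(j,j);
end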
Exercise 59 A matrix A ∈ C^{n×n} is called strictly row diagonally dominant if
\[
\sum_{k=1,\,k \ne i}^{n} |a_{i,k}| < |a_{i,i}| \qquad\text{for all } i = 1, 2, \ldots, n.
\]
Show that a strictly row diagonally dominant matrix is invertible and possesses an LU decomposition.
Exercise 60 Let A ∈ R^{n×n} be a tridiagonal matrix, that is, a matrix of the form
\[
A = \begin{pmatrix}
a_1 & c_1 & & & 0\\
b_2 & a_2 & c_2 & & \\
& \ddots & \ddots & \ddots & \\
& & \ddots & \ddots & c_{n-1}\\
0 & & & b_n & a_n
\end{pmatrix},
\]
with
\[
|a_1| > |c_1| > 0, \qquad
|a_i| \ge |b_i| + |c_i|, \quad b_i, c_i \ne 0, \quad 2 \le i \le n-1, \qquad
|a_n| \ge |b_n| > 0.
\]
Show that A is invertible and has an LU decomposition of the form
\[
A = \begin{pmatrix}
1 & & & 0\\
\ell_2 & 1 & & \\
& \ddots & \ddots & \\
0 & & \ell_n & 1
\end{pmatrix}
\begin{pmatrix}
r_1 & c_1 & & 0\\
& r_2 & \ddots & \\
& & \ddots & c_{n-1}\\
0 & & & r_n
\end{pmatrix},
\]
where the vectors ℓ = (ℓ_2, ℓ_3, . . . , ℓ_n)^T ∈ R^{n−1} and r = (r_1, r_2, . . . , r_n)^T ∈ R^n can be computed via r_1 = a_1 and ℓ_i = b_i/r_{i−1} and r_i = a_i − ℓ_i c_{i−1} for 2 ≤ i ≤ n.
Exercise 61 Use Gaussian elimination to find the LU factorization of the following matrix:
\[
A = \begin{pmatrix}
2 & -2 & 0 & 0\\
2 & -4 & 2 & 0\\
0 & -2 & 4 & -2\\
0 & 0 & 2 & -4
\end{pmatrix}.
\]
Why is the LU factorization unique?
4.3 Pivoting
Consider the matrix
\[
A = \begin{pmatrix} 0 & 1\\ 1 & 1 \end{pmatrix}.
\]
This matrix is non-singular and well-conditioned, with κ₂(A) = (1 + √5)²/4 ≈ 2.6 in the 2-norm. But Gaussian elimination fails at the first step, since a_{1,1} = 0. Note that a simple interchange of the rows gives an upper triangular matrix; furthermore, a simple interchange of the columns leads to a lower triangular matrix. The first corresponds to rearranging the equations in the linear system, the latter corresponds to rearranging the unknowns.
Furthermore, consider the slightly perturbed matrix
\[
A = \begin{pmatrix} 10^{-20} & 1\\ 1 & 1 \end{pmatrix}. \tag{4.11}
\]
Using Gaussian elimination on this matrix gives the LU factorization with the matrices
\[
L = \begin{pmatrix} 1 & 0\\ 10^{20} & 1 \end{pmatrix},
\qquad
U = \begin{pmatrix} 10^{-20} & 1\\ 0 & 1 - 10^{20} \end{pmatrix}.
\]
However, suppose we are using floating point arithmetic with eps ≈ 10⁻¹⁶. Then the number 1 − 10²⁰ will not be represented exactly, but by its nearest floating point number; let us say that this is the number −10²⁰. Using this number produces the factorization
\[
\tilde{L} = \begin{pmatrix} 1 & 0\\ 10^{20} & 1 \end{pmatrix},
\qquad
\tilde{U} = \begin{pmatrix} 10^{-20} & 1\\ 0 & -10^{20} \end{pmatrix}.
\]
The matrix Ũ is relatively close to the correct U with respect to ‖U‖. But on calculating the product we obtain
\[
\tilde{L}\,\tilde{U} = \begin{pmatrix} 10^{-20} & 1\\ 1 & 0 \end{pmatrix}.
\]
We see that this matrix is not close to A, since a_{2,2} = 1 has been replaced by (L̃ Ũ)_{2,2} = 0. Suppose we wish to solve Ax = b with A given by (4.11) and b = (1, 0)^T using floating point arithmetic. Then we would obtain x̃ = (0, 1)^T, while the true answer is approximately (−1, 1)^T. The explanation for this is that LU factorization without pivoting is not backward stable.
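This failure can be reproduced directly in MATLAB; the following sketch of the experiment above is our own (MATLAB's backslash operator, which uses partial pivoting, is shown for comparison):

% LU factorization of the matrix (4.11) without pivoting, in double precision
A = [1e-20 1; 1 1];  b = [1; 0];
L = [1 0; 1e20 1];
U = [1e-20 1; 0 1-1e20];     % the entry 1-1e20 rounds to -1e20
y = L \ b;                   % forward substitution
x_nopiv = U \ y              % gives approximately ( 0, 1)^T
x_piv   = A \ b              % gives approximately (-1, 1)^T, the true solution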
To avoid such problems we modify the Gaussian elimination such that rows and columns
of the matrix may be interchanged during the elimination process. The exchange of
rows or columns during the Gaussian elimination process is referred to as pivoting. Here is
how it works:
Assume we have performed k − 1 steps and now have a matrix of the form
\[
A^{(k)} =
\begin{pmatrix}
a_{1,1} & * & \cdots & \cdots & \cdots & *\\
0 & a^{(2)}_{2,2} & \ddots & & & \vdots\\
\vdots & \ddots & \ddots & * & \cdots & *\\
\vdots & & 0 & a^{(k)}_{k,k} & \cdots & a^{(k)}_{k,n}\\
\vdots & & \vdots & \vdots & & \vdots\\
0 & \cdots & 0 & a^{(k)}_{n,k} & \cdots & a^{(k)}_{n,n}
\end{pmatrix}. \tag{4.12}
\]
Now, if det A = det A^{(k)} ≠ 0, not all the entries in the column vector (a^{(k)}_{k,k}, a^{(k)}_{k+1,k}, . . . , a^{(k)}_{n,k})^T can be zero. Hence, we can pick one non-zero entry and swap the corresponding row with the kth row. This is usually referred to as partial (row) pivoting. For numerical reasons it is reasonable to pick a row ℓ with
\[
\bigl| a^{(k)}_{\ell,k} \bigr| = \max_{k \le i \le n} \bigl| a^{(k)}_{i,k} \bigr|.
\]
In a similar way, partial column pivoting can be defined. Finally, we could use total pivoting, that is, we could pick indices ℓ, m ∈ {k, k + 1, . . . , n} such that
\[
\bigl| a^{(k)}_{\ell,m} \bigr| = \max_{k \le i, j \le n} \bigl| a^{(k)}_{i,j} \bigr|.
\]
Usually, partial row pivoting is implemented.
To understand the process of pivoting in an abstract form, it is useful to introduce permutation
matrices.
Definition 4.12 (elementary permutation matrix)
An n × n elementary permutation matrix is any matrix of the form
\[
P_{i,j} = I - (\mathbf{e}_i - \mathbf{e}_j)(\mathbf{e}_i - \mathbf{e}_j)^T.
\]
That is, P_{i,j} is an identity matrix with rows i and j (or, equivalently, columns i and j) interchanged. A permutation matrix is a (finite) product of elementary permutation matrices.
Example 4.13 (elementary permutation matrix)
The 5 × 5 elementary permutation matrix P_{2,5} is given by
\[
P_{2,5} =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1\\
0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 1 & 0\\
0 & 1 & 0 & 0 & 0
\end{pmatrix}.
\]
The properties of elementary permutation matrices are stated in the next lemma. They also
explain the name elementary permutation matrix.
Lemma 4.14 (properties of elementary permutation matrices)
An elementary permutation matrix has the following properties:

(i) P_{i,j}^{-1} = P_{i,j} = P_{j,i} = P_{i,j}^T. In particular, P_{i,j} is an orthogonal matrix.

(ii) det(P_{i,j}) = −1 for i ≠ j, and P_{i,i} = I.

(iii) Pre-multiplication of a matrix A by P_{i,j} interchanges rows i and j. Similarly, post-multiplication interchanges columns i and j.
Proof of Lemma 4.14. The proof is fairly straightforward and is left as an exercise. □
Exercise 62 Prove Lemma 4.14.
Now we can describe Gaussian elimination with row pivoting by performing in each step first
a multiplication from the left with an elementary permutation matrix (for the pivoting) and
subsequently a multiplication from the left with an elementary lower triangular matrix (for
performing the step in the Gaussian elimination). This leads to the following theorem.
Theorem 4.15 (Gaussian elimination with row pivoting)
Let A be an n × n matrix. There exist elementary lower triangular matrices L^{(i)} = L_i(m_i) and elementary permutation matrices P^{(i)} = P_{r_i,i} with r_i ≥ i, i = 1, 2, . . . , n − 1, such that the matrix
\[
U := L^{(n-1)}\,P^{(n-1)}\,L^{(n-2)}\,P^{(n-2)} \cdots L^{(2)}\,P^{(2)}\,L^{(1)}\,P^{(1)}\,A \tag{4.13}
\]
is an upper triangular matrix.

Proof of Theorem 4.15. This follows from the previous considerations for Gaussian elimination. Note that, if the matrix is singular, we might come into a situation (4.12) where the column vector (a^{(k)}_{k,k}, . . . , a^{(k)}_{n,k})^T is the zero vector. In such a situation, we pick the corresponding elementary lower triangular matrix and permutation matrix as the identity matrix and proceed with the next column. □
The permutation matrices in (4.13) can be 'moved' to the right in the following sense. The elementary lower triangular matrix L^{(j)} is of the form L^{(j)} = I − m^{(j)} e_j^T, with m^{(j)} = (0, . . . , 0, m_{j+1}, . . . , m_n)^T, that is, the first j components of m^{(j)} are zero. If P is an elementary permutation matrix which acts only on rows (and columns) with index > j, then we have, using P² = I since P = P^{-1} = P^T,
\[
P\,L^{(j)}\,P = P \bigl( I - \mathbf{m}^{(j)} \mathbf{e}_j^T \bigr) P
= I - P\,\mathbf{m}^{(j)} (P\,\mathbf{e}_j)^T
= I - \tilde{\mathbf{m}}^{(j)} \mathbf{e}_j^T =: \tilde{L}^{(j)}, \tag{4.14}
\]
where m̃^{(j)} = P m^{(j)} and P e_j = e_j (since P acts only on components with index > j). Obviously, the first j components of m̃^{(j)} = P m^{(j)} are also zero, since m^{(j)} has this property and since the permutation matrix P acts only on components with index > j. Thus L̃^{(j)} := I − m̃^{(j)} e_j^T is a proper elementary lower triangular matrix, and from (4.14),
\[
P\,L^{(j)} = \bigl( P\,L^{(j)}\,P \bigr)\,P^{-1} = \tilde{L}^{(j)}\,P^{-1} = \tilde{L}^{(j)}\,P, \tag{4.15}
\]
since elementary permutation matrices P satisfy P = P^{-1}. Using (4.15), the equation (4.13) can be rewritten as
\[
U = L^{(n-1)}\,P^{(n-1)}\,L^{(n-2)}\,P^{(n-2)}\,L^{(n-3)} \cdots L^{(2)}\,P^{(2)}\,L^{(1)}\,P^{(1)}\,A
= L^{(n-1)}\,\tilde{L}^{(n-2)}\,P^{(n-1)}\,P^{(n-2)}\,L^{(n-3)} \cdots L^{(2)}\,P^{(2)}\,L^{(1)}\,P^{(1)}\,A.
\]
In the same way, we can move P^{(n−1)} P^{(n−2)} to the right of L^{(n−3)}. Continuing with this procedure leads to
\[
U = \bigl( L^{(n-1)}\,\tilde{L}^{(n-2)} \cdots \tilde{L}^{(1)} \bigr)\,\bigl( P^{(n-1)}\,P^{(n-2)} \cdots P^{(1)} \bigr)\,A, \tag{4.16}
\]
which establishes the following theorem.
Theorem 4.16 (LU factorization with pivoting)
For every n × n matrix A there exists an n × n permutation matrix P such that P A possesses an LU factorization, that is, there exist a normalized n × n lower triangular matrix L and an n × n upper triangular matrix U such that P A = L U.

Proof of Theorem 4.16. This follows essentially from the considerations above. From (4.16), we see that there exist an upper triangular matrix U, a permutation matrix P and a normalized lower triangular matrix L̃ such that
\[
U = \tilde{L}\,P\,A \quad\Leftrightarrow\quad \tilde{L}^{-1}\,U = P\,A.
\]
Since the inverse of a normalized lower triangular matrix is again a normalized lower triangular matrix, we have P A = L U, where U is an upper triangular matrix and L = L̃^{-1} is a normalized lower triangular matrix. □
The LU factorization with row pivoting can be implemented with the following MATLAB code:
function [L,U,P] = LU_factorisation_row_piv(A)
%
% algorithm computes LU-factorization with pivoting P*A = L*U
% input: real n by n matrix A
% output: L = n by n real normalized lower triangular matrix
% U = n by n real upper triangular matrix
% P = n by n permutation matrix
%
n = size(A,1);
U = A;
L = eye(n);
P = eye(n);
for i = 1:n-1
u = U(i,:);
l = L(i,1:i-1);
p = P(i,:);
%
[r,q] = find(abs(U(i:n,i)) == max(abs(U(i:n,i))));
r = r(1)+i-1;   % pivot row index; take the first row attaining the maximum
U(i,:) = U(r,:);
U(r,:) = u;
L(i,1:i-1) = L(r,1:i-1);
L(r,1:i-1) = l;
P(i,:) = P(r,:);
P(r,:) = p;
for j = i+1:n
L(j,i) = U(j,i)/U(i,i);
U(j,i:n) = U(j,i:n) - L(j,i)*U(i,i:n);
end
end
4.4 Cholesky Factorisation
In this section we want to exploit the LU factorization to derive a factorization for Hermitian
matrices A of the form A = L L∗ with a lower triangular matrix L. Such a factorization will
not exist for arbitrary Hermitian matrices A, but it does exist for those Hermitian matrices
that are positive definite.
Let us assume that A has a unique LU factorization A = L U and that A is Hermitian, that is, A* = A. Then we have
\[
L\,U = A = A^* = (L\,U)^* = U^*\,L^*.
\]
Since U is an upper triangular matrix, U* is a lower triangular matrix. Since L is a normalized lower triangular matrix, L* is a normalized upper triangular matrix. We would like to use the uniqueness of the LU factorization to conclude that L = U* and U = L*. Unfortunately, this is not possible, since the uniqueness requires the lower triangular matrix to be normalized, that is, to have only ones as diagonal entries, and this will in general not be the case for U*.
But if A is invertible, the LU factorization is unique and we can write
\[
U = D\,\tilde{U}
\]
with a diagonal matrix D = diag(d_{1,1}, d_{2,2}, . . . , d_{n,n}) and a normalized upper triangular matrix Ũ. Then, we can conclude that
\[
A = L\,D\,\tilde{U}
\]
and hence
\[
A^* = \bigl( L\,D\,\tilde{U} \bigr)^* = \tilde{U}^*\,D^*\,L^* = A = L\,D\,\tilde{U} = L\,U, \tag{4.17}
\]
so that we can now apply the uniqueness result to derive
\[
L = \tilde{U}^* \qquad\text{and}\qquad U = D\,\tilde{U} = D^*\,L^*, \tag{4.18}
\]
which gives, from (4.17) and (4.18),
\[
A = L\,D^*\,L^*. \tag{4.19}
\]
Applying now A* = A again in (4.19) and using that L is invertible, we find
\[
A^* = \bigl( L\,D^*\,L^* \bigr)^* = L\,D\,L^* = A = L\,D^*\,L^*
\quad\Rightarrow\quad L\,D^*\,L^* = L\,D\,L^*
\quad\Rightarrow\quad D^* = D,
\]
that is, the diagonal matrix has only real entries. Thus (4.19) can be written as
\[
A = L\,D\,L^*, \tag{4.20}
\]
where L is a normalized lower triangular matrix and D is a diagonal matrix with real entries. Let us finally make the assumption that all diagonal entries of D are positive. Then, we can define the square root of D = diag(d_{1,1}, d_{2,2}, . . . , d_{n,n}) by setting
\[
D^{1/2} := \operatorname{diag}\bigl( \sqrt{d_{1,1}}, \sqrt{d_{2,2}}, \ldots, \sqrt{d_{n,n}} \bigr),
\]
which, from (4.20) and (D^{1/2})* = D^{1/2}, leads to the decomposition
\[
A = L\,D^{1/2}\,D^{1/2}\,L^* = \bigl( L\,D^{1/2} \bigr) \bigl( L\,D^{1/2} \bigr)^* = \tilde{L}\,\tilde{L}^*,
\qquad\text{with}\quad \tilde{L} = L\,D^{1/2}. \tag{4.21}
\]
Definition 4.17 (Cholesky factorization)
A factorization of an n × n matrix A of the form A = L L∗ with an n × n lower triangular
matrix L is called a Cholesky factorization of A.
It remains to verify for what kind of matrices a Cholesky factorization exists. From (L L∗)∗ =
(L∗)∗ L∗ = L L∗, we see that only Hermitian matrices can have a Cholesky factorization.
Theorem 4.18 (positive definite matrix has Cholesky factorization)
Suppose A ∈ Cn×n is Hermitian (that is, A = A∗) and positive definite. Then, A possesses
a Cholesky factorization.
Proof of Theorem 4.18. If a matrix A is positive definite, then the determinant of every upper principal submatrix A_p is positive (see Subsection 1.2). From Theorem 4.7 and Theorem 4.8 we can therefore conclude that A has a unique LU factorization A = L U, with L a normalized lower triangular matrix and U an upper triangular matrix. From the argumentation at the beginning of this section we therefore find from (4.20) that
\[
A = L\,D\,L^* \quad\Leftrightarrow\quad L^{-1}\,A\,(L^{-1})^* = D,
\]
with a diagonal matrix D = diag(d_{1,1}, d_{2,2}, . . . , d_{n,n}), where d_{i,i} ∈ R for all i = 1, 2, . . . , n. Finally, since A is positive definite and L^{-1} is non-singular,
\[
d_{i,i} = \mathbf{e}_i^*\,D\,\mathbf{e}_i
= \mathbf{e}_i^* \bigl( L^{-1}\,A\,(L^{-1})^* \bigr) \mathbf{e}_i
= \bigl( (L^{-1})^* \mathbf{e}_i \bigr)^* A \bigl( (L^{-1})^* \mathbf{e}_i \bigr) > 0,
\]
that is, the diagonal matrix D has positive diagonal entries. As explained above, we can then take the square root D^{1/2} = diag(√d_{1,1}, √d_{2,2}, . . . , √d_{n,n}) of D, and from (4.21) we obtain the Cholesky factorization of A. □
For the numerical algorithm we restrict ourselves to the case of real-valued matrices. Then the Cholesky factorization reads A = L L^T with a lower triangular matrix L. It is again useful to write A = L L^T component-wise, taking into account that L is lower triangular. For any i ≥ j we have
\[
a_{i,j} = \sum_{k=1}^{j} \ell_{i,k}\,\ell_{j,k} = \sum_{k=1}^{j-1} \ell_{i,k}\,\ell_{j,k} + \ell_{i,j}\,\ell_{j,j}.
\]
This can first be resolved for i = j:
\[
\ell_{j,j} = \Bigl( a_{j,j} - \sum_{k=1}^{j-1} \ell_{j,k}^2 \Bigr)^{1/2}. \tag{4.22}
\]
With this, we can successively compute the other coefficients. For i > j we have
\[
\ell_{i,j} = \frac{1}{\ell_{j,j}} \Bigl( a_{i,j} - \sum_{k=1}^{j-1} \ell_{i,k}\,\ell_{j,k} \Bigr). \tag{4.23}
\]
Thus we proceed as follows. We first compute the diagonal entry ℓ_{1,1} from (4.22) with j = 1:
\[
\ell_{1,1} = (a_{1,1})^{1/2}.
\]
Then we use (4.23) with j = 1 to compute the rest of the first column of L:
\[
\ell_{2,1} = \frac{a_{2,1}}{\ell_{1,1}}, \qquad
\ell_{3,1} = \frac{a_{3,1}}{\ell_{1,1}}, \qquad \ldots, \qquad
\ell_{n,1} = \frac{a_{n,1}}{\ell_{1,1}}.
\]
Next we use (4.22) with j = 2 to compute
\[
\ell_{2,2} = \bigl( a_{2,2} - (\ell_{2,1})^2 \bigr)^{1/2},
\]
and then we use (4.23) with j = 2 to compute the rest of the second column of L:
\[
\ell_{3,2} = \frac{1}{\ell_{2,2}} \bigl( a_{3,2} - \ell_{3,1}\,\ell_{2,1} \bigr), \qquad
\ell_{4,2} = \frac{1}{\ell_{2,2}} \bigl( a_{4,2} - \ell_{4,1}\,\ell_{2,1} \bigr), \qquad \ldots, \qquad
\ell_{n,2} = \frac{1}{\ell_{2,2}} \bigl( a_{n,2} - \ell_{n,1}\,\ell_{2,1} \bigr).
\]
We continue this process to compute the third column of L, and so on, until the nth column of L has been computed.

We can thus describe the algorithm for the Cholesky factorization of a real positive definite symmetric matrix in the following pseudo-algorithmic form:
Algorithm 2 Cholesky Factorization
1: input: real positive definite symmetric n × n matrix A = (a_{i,j})
2: initialize L = (ℓ_{i,j}) as n × n zero matrix
3: for j = 1, 2, . . . , n do
4:   ℓ_{j,j} = ( a_{j,j} − Σ_{k=1}^{j−1} ℓ²_{j,k} )^{1/2}
5:   for i = j + 1, . . . , n do
6:     ℓ_{i,j} = (1/ℓ_{j,j}) ( a_{i,j} − Σ_{k=1}^{j−1} ℓ_{i,k} ℓ_{j,k} )
7:   end for
8: end for
The Cholesky factorization can be implemented with the following MATLAB code:
function [L] = cholesky(A)
%
% algorithm computes Cholesky factorization A= L L^T, if it exists
% input: A = symmetric real n by n matrix
% output: L = n by n lower triangular matrix
%
n = size(A,1);
L = zeros(n,n);
for j = 1:n
L(j,j) = sqrt( A(j,j) - L(j,1:j-1) * L(j,1:j-1)' );
L(j+1:n,j) = ( A(j+1:n,j) - L(j+1:n,1:j-1) * L(j,1:j-1)' ) / L(j,j);
end
Example 4.19 (Cholesky factorization)
Find the Cholesky factorization of the following positive definite matrix:
\[
A = \begin{pmatrix} 1 & -1 & 2\\ -1 & 5 & 0\\ 2 & 0 & 6 \end{pmatrix}.
\]
Solution: We apply Algorithm 2. Remember that we first compute the first column, then the second column, and so on. Computing the first column, we find
\[
\ell_{1,1} = \sqrt{a_{1,1}} = \sqrt{1} = 1, \qquad
\ell_{2,1} = \frac{a_{2,1}}{\ell_{1,1}} = \frac{-1}{1} = -1, \qquad
\ell_{3,1} = \frac{a_{3,1}}{\ell_{1,1}} = \frac{2}{1} = 2.
\]
Now we can compute the second column,
\[
\ell_{2,2} = \bigl( a_{2,2} - \ell_{2,1}^2 \bigr)^{1/2} = \bigl( 5 - (-1)^2 \bigr)^{1/2} = \sqrt{4} = 2, \qquad
\ell_{3,2} = \frac{1}{\ell_{2,2}} \bigl( a_{3,2} - \ell_{3,1}\,\ell_{2,1} \bigr) = \frac{1}{2} \bigl( 0 - 2\,(-1) \bigr) = 1,
\]
and finally the third column,
\[
\ell_{3,3} = \bigl( a_{3,3} - \ell_{3,1}^2 - \ell_{3,2}^2 \bigr)^{1/2} = \bigl( 6 - 2^2 - 1^2 \bigr)^{1/2} = \sqrt{1} = 1.
\]
Thus the lower triangular matrix L is given by
\[
L = \begin{pmatrix} 1 & 0 & 0\\ -1 & 2 & 0\\ 2 & 1 & 1 \end{pmatrix},
\]
and evaluation of L L^T shows easily that indeed A = L L^T. □
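This hand computation can be checked with the MATLAB function cholesky from above (a usage sketch; the expected output is given in the comment):

A = [1 -1 2; -1 5 0; 2 0 6];
L = cholesky(A)
% expected: L = [1 0 0; -1 2 0; 2 1 1], and L*L' reproduces A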
It can be shown that the computational cost of the Cholesky factorization is approximately n³/6 + O(n²) elementary operations, where we count one multiplication plus one addition as one elementary operation. This is about half the complexity of the standard Gaussian elimination process. However, here we also have to take n square roots.
Exercise 63 Compute the Cholesky factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 1 & 2 & 3\\ 2 & 5 & 8\\ 3 & 8 & 14 \end{pmatrix}.
\]
(Note that A is positive definite, so that we know that a Cholesky factorization exists. You do not have to verify that A is positive definite.)
Exercise 64 Show that the computational cost of the Cholesky factorization is approximately n³/6 + O(n²) elementary operations, where we count one multiplication plus one addition as one elementary operation.
Exercise 65 Is the Cholesky factorization of a positive definite n × n matrix unique? Give a
proof of your answer.
Exercise 66 Compute the Cholesky factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 4 & 2 & 6\\ 2 & 5 & 5\\ 6 & 5 & 14 \end{pmatrix}.
\]
Once you have computed the Cholesky factorization, use it to conclude that the matrix A is positive definite.
4.5 QR Factorization
A typical step in the Gaussian elimination process multiplies the given matrix A with a lower triangular matrix L from the left. For the condition number of the resulting matrix, this means
\[
\kappa(L\,A) = \|L\,A\|\,\|(L\,A)^{-1}\| \le \|L\|\,\|A\|\,\|A^{-1}\|\,\|L^{-1}\| = \kappa(A)\,\kappa(L),
\]
so that we may end up with a worse condition number. From this point of view it would be good to have a process which transforms the matrix A into an upper triangular matrix without altering the condition number. Such a process is given by the QR factorization.

Recall from Section 2.3 that a Householder matrix is a matrix of the form H = H(w) = I − 2ww*, where w*w = 1, and that the vector w ∈ C^n can be chosen such that H x = c e_1 for a given vector x ∈ C^n. The coefficient c ∈ C satisfies |c| = ‖x‖₂. Also remember that H(w) is Hermitian and unitary and that det(H(w)) = −1 for w ≠ 0.
Hence, if we let x be the first column of A, we find a first Householder matrix H_1 ∈ C^{n×n} such that the first column vector of A is mapped by H_1 onto α_1 e_1, that is,
\[
H_1\,A = \begin{pmatrix} \alpha_1 & * & \cdots & *\\ 0 & & & \\ \vdots & & A_1 & \\ 0 & & & \end{pmatrix},
\]
with |α_1| = ‖x‖₂ and an (n−1) × (n−1) matrix A_1. Considering the first column x̃ ∈ C^{n−1} of A_1, we can find a second Householder matrix H̃_2 ∈ C^{(n−1)×(n−1)} such that H̃_2 x̃ = α_2 e_1, where now e_1 ∈ C^{n−1} and |α_2| = ‖x̃‖₂. The matrix
\[
H_2 = \begin{pmatrix} 1 & 0 & \cdots & 0\\ 0 & & & \\ \vdots & & \tilde{H}_2 & \\ 0 & & & \end{pmatrix}
\]
is easily seen to be unitary, and we get
\[
H_2\,H_1\,A = \begin{pmatrix} \alpha_1 & * & * & \cdots & *\\ 0 & \alpha_2 & * & \cdots & *\\ 0 & 0 & & & \\ \vdots & \vdots & & A_2 & \\ 0 & 0 & & & \end{pmatrix},
\]
with an (n−2) × (n−2) matrix A_2. We can proceed in this fashion to derive
\[
H_{n-1}\,H_{n-2} \cdots H_2\,H_1\,A = R, \tag{4.24}
\]
where R is an upper triangular matrix. (Note that if we find in the jth step that the first column vector of the submatrix A_{j−1} is the zero vector, then we choose H_j = I.) Since all Householder matrices H_j are unitary, so is their product, and so are their inverses. Therefore, from (4.24),
\[
A = Q\,R, \qquad\text{with the unitary matrix}\quad Q := H_1^{-1}\,H_2^{-1} \cdots H_{n-2}^{-1}\,H_{n-1}^{-1}.
\]
From the construction of the unitary matrices H_j, it is easily seen that H_j = H_j^* = H_j^{-1}, and thus Q can be written as
\[
Q = H_1\,H_2 \cdots H_{n-2}\,H_{n-1}.
\]
Thus we have proven the following result.
Theorem 4.20 (QR factorization)
Every matrix A ∈ Cn×n possesses a QR factorization A = Q R with a unitary matrix Q
and an upper triangular matrix R.
Example 4.21 (QR factorization)
Compute the QR factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 2 & -3 & 3\\ -2 & 6 & 6\\ 1 & 0 & 3 \end{pmatrix}.
\]
Solution: Since for the first column vector (2, −2, 1)^T of A we have ‖(2, −2, 1)^T‖₂ = 3, we choose w_1 := z/‖z‖₂ with
\[
\mathbf{z} = \begin{pmatrix} 2\\ -2\\ 1 \end{pmatrix} - 3 \begin{pmatrix} 1\\ 0\\ 0 \end{pmatrix}
= \begin{pmatrix} -1\\ -2\\ 1 \end{pmatrix}, \qquad \|\mathbf{z}\|_2 = \sqrt{6}.
\]
Thus
\[
\mathbf{w}_1 = \frac{1}{\sqrt{6}} \begin{pmatrix} -1\\ -2\\ 1 \end{pmatrix},
\]
and the first Householder matrix is given by
\[
H_1 := H(\mathbf{w}_1) = I - 2\,\mathbf{w}_1\,\mathbf{w}_1^T
= I - \frac{1}{3} \begin{pmatrix} -1\\ -2\\ 1 \end{pmatrix} (-1, -2, 1)
= I - \frac{1}{3} \begin{pmatrix} 1 & 2 & -1\\ 2 & 4 & -2\\ -1 & -2 & 1 \end{pmatrix}
= \begin{pmatrix} \tfrac{2}{3} & -\tfrac{2}{3} & \tfrac{1}{3}\\ -\tfrac{2}{3} & -\tfrac{1}{3} & \tfrac{2}{3}\\ \tfrac{1}{3} & \tfrac{2}{3} & \tfrac{2}{3} \end{pmatrix}.
\]
We find that
\[
H_1\,A = \begin{pmatrix} \tfrac{2}{3} & -\tfrac{2}{3} & \tfrac{1}{3}\\ -\tfrac{2}{3} & -\tfrac{1}{3} & \tfrac{2}{3}\\ \tfrac{1}{3} & \tfrac{2}{3} & \tfrac{2}{3} \end{pmatrix}
\begin{pmatrix} 2 & -3 & 3\\ -2 & 6 & 6\\ 1 & 0 & 3 \end{pmatrix}
= \begin{pmatrix} 3 & -6 & -1\\ 0 & 0 & -2\\ 0 & 3 & 7 \end{pmatrix}.
\]
In the next step we consider the submatrix
\[
A_1 = \begin{pmatrix} 0 & -2\\ 3 & 7 \end{pmatrix},
\]
and choose for this submatrix w_2 = v/‖v‖₂ with
\[
\mathbf{v} = \begin{pmatrix} 0\\ 3 \end{pmatrix} - 3 \begin{pmatrix} 1\\ 0 \end{pmatrix}
= \begin{pmatrix} -3\\ 3 \end{pmatrix}, \qquad \|\mathbf{v}\|_2 = 3\sqrt{2}.
\]
Hence
\[
\mathbf{w}_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} -1\\ 1 \end{pmatrix},
\]
and the 2 × 2 Householder matrix is given by
\[
\tilde{H}_2 := H(\mathbf{w}_2) = I - \begin{pmatrix} -1\\ 1 \end{pmatrix} (-1, 1)
= I - \begin{pmatrix} 1 & -1\\ -1 & 1 \end{pmatrix}
= \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}.
\]
Thus the unitary matrix H_2 is given by
\[
H_2 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0 \end{pmatrix},
\]
and we have
\[
R := H_2\,H_1\,A
= \begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} 3 & -6 & -1\\ 0 & 0 & -2\\ 0 & 3 & 7 \end{pmatrix}
= \begin{pmatrix} 3 & -6 & -1\\ 0 & 3 & 7\\ 0 & 0 & -2 \end{pmatrix}.
\]
We have A = Q R with the unitary matrix Q defined by
\[
Q = H_1^{-1}\,H_2^{-1} = H_1^T\,H_2^T = H_1\,H_2
= \begin{pmatrix} \tfrac{2}{3} & \tfrac{1}{3} & -\tfrac{2}{3}\\ -\tfrac{2}{3} & \tfrac{2}{3} & -\tfrac{1}{3}\\ \tfrac{1}{3} & \tfrac{2}{3} & \tfrac{2}{3} \end{pmatrix}. \qquad\square
\]
The QR factorization of a real matrix can be implemented with the following MATLAB code:
function [Q,R] = QR_factorization(A)
%
% algorithm computes the QR factorization of A, that is, A=Q*R
% input: A = real n by n matrix
% output: Q = real orthogonal n by n matrix
% R = real n by n upper triangular matrix
%
n = size(A,1);
Q = eye(n,n);
R = A;
%
for j=1:n-1
if max(abs(R(j:n,j))) > 0   % if the column is already zero, H_j = I and nothing is done
v = R(j:n,j) - norm(R(j:n,j)) * [1 ; zeros(n-j,1)];
w = [ zeros(j-1,1) ; v/norm(v) ];
R = R - 2 * w * (w' * R);   % apply the Householder reflection to R
Q = Q - 2 * w * (w' * Q);   % accumulate the reflections in Q
end
end
Q = Q’;
Exercise 67 Compute the QR factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 1 & 0 & 3\\ 2 & -6 & 3\\ -2 & 3 & -3 \end{pmatrix}.
\]
The above-mentioned advantage of a stable condition number follows from the following facts, where we now assume that A is non-singular: From A = Q R we have R = Q^{-1} A = Q* A, since Q is unitary (that is, Q* = Q^{-1}). The induced matrix 2-norm of R = Q* A satisfies
\[
\|R\|_2 = \|Q^* A\|_2
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \|Q^* A\,\mathbf{x}\|_2
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \sqrt{(Q^* A\,\mathbf{x})^* (Q^* A\,\mathbf{x})}
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \bigl( \mathbf{x}^* A^* Q\,Q^* A\,\mathbf{x} \bigr)^{1/2}
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \bigl( \mathbf{x}^* A^* A\,\mathbf{x} \bigr)^{1/2}
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \|A\,\mathbf{x}\|_2 = \|A\|_2, \tag{4.25}
\]
where we have used that Q Q* = I since Q is unitary. Likewise, we have A^{-1} = (Q R)^{-1} = R^{-1} Q^{-1}, and thus R^{-1} = A^{-1} Q, and
\[
\|R^{-1}\|_2 = \|A^{-1} Q\|_2
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|\mathbf{x}\|_2 = 1} \|A^{-1} Q\,\mathbf{x}\|_2
= \sup_{\mathbf{x} \in \mathbb{C}^n,\, \|Q\mathbf{x}\|_2 = 1} \|A^{-1} Q\,\mathbf{x}\|_2
= \sup_{\mathbf{y} \in \mathbb{C}^n,\, \|\mathbf{y}\|_2 = 1} \|A^{-1}\,\mathbf{y}\|_2 = \|A^{-1}\|_2, \tag{4.26}
\]
where we have used that ‖Qx‖₂ = (x* Q* Q x)^{1/2} = (x* x)^{1/2} = ‖x‖₂ (since Q is unitary) and afterwards substituted y = Q x. From (4.25) and (4.26) we see that, with respect to the induced matrix 2-norm, the condition number of the upper triangular matrix R is the same as the condition number of A:
\[
\kappa_2(R) = \|R\|_2\,\|R^{-1}\|_2 = \|A\|_2\,\|A^{-1}\|_2 = \kappa_2(A).
\]
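This invariance of the condition number is easy to check numerically; a small MATLAB sketch (our own, using the built-in functions qr and cond, where cond computes the condition number with respect to the 2-norm):

A = rand(50);            % a random (almost surely non-singular) test matrix
[Q,R] = qr(A);
cond(A) - cond(R)        % should be of the order of the rounding error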
Since the multiplication of a Householder matrix with a vector can be performed in O(n) time, each step in deriving the upper triangular matrix costs about O(n²) time, so that the total complexity is again O(n³).
Application 4.22 (least squares solution)
An application of the QR factorization is the computation of least squares solutions. Assume that the matrix A is a real m × n matrix with m > n, which means that we have more equations than unknowns and which makes the problem Ax = b in general unsolvable. A possible remedy is to look for the vector x which minimizes the error norm ‖Ax − b‖₂². It is possible to show that this vector x has to satisfy the normal equations
\[
A^T A\,\mathbf{x} = A^T \mathbf{b},
\]
which are uniquely solvable if A has full rank n. Unfortunately, the normal equations are usually very ill-conditioned, so that solving the normal equations directly is not really an option.

But we can use the QR factorization to compute the solution of this minimization problem. To this end, it is important to note that even if A is not a square matrix, we can find a factorization of the form A = Q R, where Q is now an m × m orthogonal matrix and R ∈ R^{m×n} has the form
\[
R = \begin{pmatrix} R_1\\ 0 \end{pmatrix}
\]
with an upper triangular matrix R_1 ∈ R^{n×n}. With this, and using that Q^T = Q^{-1}, we can write
\[
\|A\,\mathbf{x} - \mathbf{b}\|_2^2
= \bigl\| Q\,R\,\mathbf{x} - \mathbf{b} \bigr\|_2^2
= \bigl\| Q \bigl( R\,\mathbf{x} - Q^T \mathbf{b} \bigr) \bigr\|_2^2
= \bigl( R\,\mathbf{x} - Q^T \mathbf{b} \bigr)^T Q^T Q \bigl( R\,\mathbf{x} - Q^T \mathbf{b} \bigr)
= \bigl\| R\,\mathbf{x} - Q^T \mathbf{b} \bigr\|_2^2
= \left\| \begin{pmatrix} R_1\,\mathbf{x}\\ \mathbf{0} \end{pmatrix} - \begin{pmatrix} \mathbf{c}\\ \mathbf{d} \end{pmatrix} \right\|_2^2
= \|R_1\,\mathbf{x} - \mathbf{c}\|_2^2 + \|\mathbf{d}\|_2^2,
\]
where we have split the vector Q^T b into the two components (c^T, d^T)^T, with c ∈ R^n and d ∈ R^{m−n}. From the last equation we see that there is an unavoidable error ‖d‖₂². However, we can minimize the error by choosing x as the solution of R_1 x = c. Since R_1 is an upper triangular matrix, this can be done by back substitution. (Note that R_1 is invertible, since we have assumed that A has full rank.) □
Exercise 68 Compute the QR factorization of the following matrix by hand:
\[
A = \begin{pmatrix} 2 & 5 & 3\\ 4 & 4 & -3\\ -4 & 2 & 3 \end{pmatrix}.
\]
Exercise 69 Let A ∈ R^{m×n}, with m > n, and assume that A has full rank. Let b ∈ R^m. Consider the functional
\[
f(\mathbf{x}) = \|A\,\mathbf{x} - \mathbf{b}\|_2^2, \qquad \mathbf{x} \in \mathbb{R}^n.
\]
Show that there exists a unique vector x* ∈ R^n that minimizes the functional f and that this vector x* is the unique solution of the so-called normal equations
\[
A^T A\,\mathbf{x}^* = A^T \mathbf{b}.
\]
(Hint: Use calculus to investigate the minimum of the functional f. Do not forget to explain why the solution to the normal equations is unique.)
Exercise 70 Let A ∈ R^{m×n} with m ≥ n be given. Assume that A = U Σ V^T with orthogonal matrices U ∈ R^{m×m}, V ∈ R^{n×n}, and a diagonal matrix Σ ∈ R^{m×n} that has only non-negative diagonal entries. (By a diagonal matrix Σ ∈ R^{m×n}, where m ≥ n, we mean that Σ = (s_{i,j}) and all entries of Σ except s_{1,1}, s_{2,2}, . . . , s_{n,n} are zero.) Use the decomposition A = U Σ V^T to compute the solution to
\[
\min_{\mathbf{x} \in \mathbb{R}^n} \bigl\| A\,\mathbf{x} - \mathbf{b} \bigr\|_2.
\]
Chapter 5
Iterative Methods for Linear Systems
In this chapter, we discuss iterative methods for solving linear systems. In contrast to a direct
method (as discussed in the previous chapter), an iterative method constructs a sequence of
approximate solutions {x(j)} that should ideally converge to the true solution x of Ax = b.
In Section 5.1 we explain the main idea behind iterative methods. In Section 5.2, we discuss one
of the most basic iterative methods that you have already encountered in Applied Mathematics,
namely, Banach’s fixed point iteration. In Section 5.3, we introduce the Jacobi iteration
and the Gauss-Seidel iteration, and in Section 5.4, we learn how these methods may be
improved with relaxation.
5.1 Introduction
The general idea behind iterative methods is the following. Suppose we want to solve a linear system Ax = b, where A is an invertible n × n matrix. If A is very large, say n = 10⁶, then a direct solver as discussed in the previous chapter is no longer feasible for solving the linear system Ax = b (since the number of elementary operations is O(n³)). Instead we want to use a so-called iterative method, which constructs a sequence {x^{(j)}} of vectors in R^n that approximate the true solution x of Ax = b. For j → ∞, we should have that
\[
\lim_{j \to \infty} \mathbf{x}^{(j)} = \mathbf{x},
\]
although in reality this will not always be the case, due to rounding errors and stability issues.

What do the 'approximations' x^{(j)} look like? The most common form is
\[
\mathbf{x}^{(j+1)} = \mathbf{x}^{(j)} + B^{-1} \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr), \tag{5.1}
\]
where B is an invertible matrix such that B^{-1} is an approximation of the inverse A^{-1}. Usually, B will be a 'simplified version' of A whose inverse can be easily computed.
To understand (5.1), replace B^{-1} by A^{-1}; we obtain (multiplying from the left by A in the second step)
\[
\mathbf{x}^{(j+1)} = \mathbf{x}^{(j)} + A^{-1} \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr)
\quad\Leftrightarrow\quad
A\,\mathbf{x}^{(j+1)} = A\,\mathbf{x}^{(j)} + \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr) = \mathbf{b},
\]
that is, x^{(j+1)} would then solve Ax = b. In (5.1), we have instead of A^{-1} the approximation B^{-1} of A^{-1}, and we can interpret this as follows: Our previous iteration step yielded the approximation x^{(j)} of x, and the term
\[
\mathbf{r}^{(j)} := \mathbf{b} - A\,\mathbf{x}^{(j)} = A \bigl( \mathbf{x} - \mathbf{x}^{(j)} \bigr)
\]
is the residual of this approximation (which measures the discrepancy, mapped into the range of A, between x and x^{(j)}). We solve the equation
\[
A\,\mathbf{y} = A \bigl( \mathbf{x} - \mathbf{x}^{(j)} \bigr) = \mathbf{b} - A\,\mathbf{x}^{(j)} = \mathbf{r}^{(j)}
\]
approximately with the help of the approximation B^{-1} of A^{-1}, giving
\[
\mathbf{y} = \mathbf{x} - \mathbf{x}^{(j)} \approx B^{-1} \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr),
\]
and add this approximate correction to the previous approximation; thus
\[
\mathbf{x}^{(j+1)} = \mathbf{x}^{(j)} + B^{-1} \bigl( \mathbf{b} - A\,\mathbf{x}^{(j)} \bigr),
\]
which is just (5.1).

Sometimes it is not obvious that a complicated iterative method can be interpreted in this sense, but we will see that all the iterative methods discussed in this chapter follow this general idea.
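As a first illustration, the following MATLAB sketch (our own; the matrix, the right-hand side and the choice B = diag(diag(A)) are arbitrary test data, the latter anticipating the Jacobi iteration of Section 5.3) performs a few steps of the generic iteration (5.1):

A = [4 1; 1 3];  b = [1; 2];     % a small test system
B = diag(diag(A));               % a 'simplified version' of A
x = zeros(2,1);                  % starting point
for j = 1:25
    x = x + B \ (b - A*x);       % the iteration step (5.1)
end
x                                % close to the true solution A\b = (1/11, 7/11)^T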
5.2 Fixed Point Iterations
One way of approaching the solution of a linear system Ax = b is to write it as a fixed point equation. We will see a little further below that this again leads to an iterative algorithm of the form discussed in the previous section.
Definition 5.1 (fixed point and fixed point equation)
Consider a function F : Cn → Cn (or F : Rn → Rn). A point x ∈ Cn (or x ∈ Rn) is called
a fixed point of F , if F (x) = x.
The equation F (x) = x is also called a fixed point equation.
Let us write the matrix A as A = A − B + B with an invertible matrix B, which can and will be chosen suitably. Then the equation Ax = b can be reformulated as
\[
\mathbf{b} = A\,\mathbf{x} = (A - B)\,\mathbf{x} + B\,\mathbf{x}
\quad\Leftrightarrow\quad
B\,\mathbf{x} = \mathbf{b} - (A - B)\,\mathbf{x} = (B - A)\,\mathbf{x} + \mathbf{b},
\]
and hence as the fixed point equation
\[
\mathbf{x} = B^{-1} (B - A)\,\mathbf{x} + B^{-1}\,\mathbf{b} =: C\,\mathbf{x} + \mathbf{c}, \tag{5.2}
\]
where C := B^{-1} (B − A) and c := B^{-1} b. Thus the solution x of Ax = b is a fixed point of the mapping
\[
F(\mathbf{x}) := C\,\mathbf{x} + \mathbf{c}.
\]
To calculate this fixed point, we can use the following simple iterative process. We first pick a starting point x^{(0)} and then form
\[
\mathbf{x}^{(j+1)} := F\bigl( \mathbf{x}^{(j)} \bigr), \qquad j = 0, 1, 2, \ldots. \tag{5.3}
\]
From Banach’s fixed point theorem below, we see that under certain assumptions on F the
sequence {x(j)} converges to a fixed point of F .
Definition 5.2 (contraction mapping)
A function F : C^n → C^n (or F : R^n → R^n) is called a contraction with respect to a norm ‖ · ‖ on C^n (or R^n) if there exists a constant 0 < q < 1 such that
\[
\bigl\| F(\mathbf{x}) - F(\mathbf{y}) \bigr\| \le q\,\|\mathbf{x} - \mathbf{y}\| \qquad\text{for all } \mathbf{x}, \mathbf{y} \text{ in } \mathbb{C}^n \text{ (or in } \mathbb{R}^n\text{)}.
\]
Note that a contraction mapping is Lipschitz-continuous with Lipschitz-constant q < 1.
Theorem 5.3 (Banach’s fixed point theorem)
Let C^n (or R^n) be equipped with the norm ‖·‖. Assume that F : C^n → C^n (or F : R^n → R^n) is a contraction with respect to the norm ‖ · ‖, that is, ‖F(x) − F(y)‖ ≤ q ‖x − y‖ for all x, y in C^n (or in R^n), with some constant 0 < q < 1. Then F has exactly one fixed point x. The sequence {x^{(j)}}, defined recursively by x^{(j+1)} := F(x^{(j)}), converges for every starting point x^{(0)} ∈ C^n (or x^{(0)} ∈ R^n) to the fixed point x. Furthermore, we have the error estimates
\[
\|\mathbf{x} - \mathbf{x}^{(j)}\| \le \frac{q}{1-q}\,\|\mathbf{x}^{(j)} - \mathbf{x}^{(j-1)}\| \qquad\text{(a-posteriori estimate)},
\]
\[
\|\mathbf{x} - \mathbf{x}^{(j)}\| \le \frac{q^j}{1-q}\,\|\mathbf{x}^{(1)} - \mathbf{x}^{(0)}\| \qquad\text{(a-priori estimate)}.
\]
If we try to apply this theorem to the function F(x) = C x + c, where C = B^{-1} (B − A) is the iteration matrix and c = B^{-1} b, we see that
\[
\bigl\| F(\mathbf{x}) - F(\mathbf{y}) \bigr\|
= \bigl\| C\,\mathbf{x} + \mathbf{c} - \bigl( C\,\mathbf{y} + \mathbf{c} \bigr) \bigr\|
= \bigl\| C\,(\mathbf{x} - \mathbf{y}) \bigr\|
\le \|C\|\,\|\mathbf{x} - \mathbf{y}\|. \tag{5.4}
\]
(Note that the norms in (5.4) are the vector norm ‖·‖ on C^n (or R^n) and its induced matrix norm, also denoted by ‖ · ‖.) So from Theorem 5.3, we have convergence of Banach's fixed point iteration if ‖C‖ < 1. Unfortunately, this condition depends on the chosen vector norm and the corresponding induced matrix norm. However, since all norms on C^n are equivalent (see Theorem 2.30), whether the sequence {x^{(j)}} converges or not does not depend on the choice of the norm. In other words, having ‖C‖ < 1 in some induced matrix norm is sufficient for convergence but not necessary. A sufficient and necessary condition for convergence can be stated using the spectral radius of the iteration matrix.
Theorem 5.4 (convergence of Banach’s fixed point iteration if F (x) = C x + c)
Let F(x) := C x + c with an n × n matrix C in C^{n×n} (or in R^{n×n}) and a vector c in C^n (or in R^n). Banach's fixed point iteration {x^{(j)}}, defined by
\[
\mathbf{x}^{(j+1)} := F\bigl( \mathbf{x}^{(j)} \bigr) = C\,\mathbf{x}^{(j)} + \mathbf{c},
\]
converges for every starting point x^{(0)} in C^n (or in R^n) to the same vector x if and only if ρ(C) < 1. If ρ(C) < 1, then the limit of {x^{(j)}} is the unique fixed point x of F.
Proof of Theorem 5.4. Assume first that ρ(C) < 1. Then we can pick an ε > 0 such that ρ(C) + ε < 1. By Theorem 2.43, we can find a norm ‖ · ‖ on C^n (or R^n) such that in the corresponding induced matrix norm (which we also denote by ‖ · ‖)
\[
\|C\| \le \rho(C) + \epsilon < 1.
\]
From (5.4), we then have
\[
\bigl\| F(\mathbf{x}) - F(\mathbf{y}) \bigr\| \le \|C\|\,\|\mathbf{x} - \mathbf{y}\|, \qquad\text{with } \|C\| < 1,
\]
and thus F is a contraction. From Banach's fixed point theorem (see Theorem 5.3 above), the sequence {x^{(j)}} converges for every starting point x^{(0)} to the unique fixed point x of F.

Assume now that, for every starting point x^{(0)}, the iteration {x^{(j)}} converges to the same vector x. Since F is continuous, we find
\[
\mathbf{x} = \lim_{j \to \infty} \mathbf{x}^{(j+1)} = \lim_{j \to \infty} F\bigl( \mathbf{x}^{(j)} \bigr) = F(\mathbf{x}),
\]
and we see that the limit x is a fixed point of F. If we pick the starting point x^{(0)} such that z := x^{(0)} − x is an eigenvector of C with eigenvalue λ (that is, given the eigenvector z, we choose x^{(0)} = x + z), then
\[
\mathbf{x}^{(j)} - \mathbf{x} = F\bigl( \mathbf{x}^{(j-1)} \bigr) - F(\mathbf{x})
= C \bigl( \mathbf{x}^{(j-1)} - \mathbf{x} \bigr)
= \ldots = C^j \bigl( \mathbf{x}^{(0)} - \mathbf{x} \bigr)
= \lambda^j \bigl( \mathbf{x}^{(0)} - \mathbf{x} \bigr).
\]
Since the expression on the left-hand side tends to the zero vector for j → ∞, so does the expression on the right-hand side. This, however, is only possible if |λ| < 1. Since λ was an arbitrary eigenvalue of C, this shows that all eigenvalues of C satisfy |λ| < 1, and hence ρ(C) < 1. □
Example 5.5 (Banach’s fixed point iteration)
Consider the affine linear function f : R² → R² defined by
\[
f(\mathbf{x}) = C\,\mathbf{x} + \mathbf{c}
= \begin{pmatrix} \tfrac{1}{4} & 1\\ 0 & -\tfrac{1}{2} \end{pmatrix} \mathbf{x}
+ \begin{pmatrix} -\tfrac{7}{4}\\ \tfrac{3}{2} \end{pmatrix}.
\]
(a) Does Banach's fixed point iteration converge for the given function f? Give a proof of your answer.

(b) If f has any fixed points, then find these fixed points.

(c) For the starting point x^{(0)} = 0 = (0, 0)^T, compute the first four approximations of Banach's fixed point iteration by hand.

Solution:

(a) Since the matrix C is an upper triangular matrix, its diagonal entries are its eigenvalues. Thus we see that the eigenvalues of C are λ₁ = 1/4 and λ₂ = −1/2, and the spectral radius is ρ(C) = max{1/4, |−1/2|} = 1/2. We have thus verified that Banach's fixed point iteration converges for the given function f, since ρ(C) = 1/2 < 1 (see Theorem 5.4).

(b) If x is a fixed point of f, then f(x) = C x + c = x, or equivalently (I − C) x = c. We solve the linear system (I − C) x = c:
\[
\left(\begin{array}{cc|c} \tfrac{3}{4} & -1 & -\tfrac{7}{4}\\[2pt] 0 & \tfrac{3}{2} & \tfrac{3}{2} \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{cc|c} \tfrac{3}{4} & -1 & -\tfrac{7}{4}\\[2pt] 0 & 1 & 1 \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{cc|c} \tfrac{3}{4} & 0 & -\tfrac{3}{4}\\[2pt] 0 & 1 & 1 \end{array}\right)
\;\Leftrightarrow\;
\left(\begin{array}{cc|c} 1 & 0 & -1\\ 0 & 1 & 1 \end{array}\right),
\]
and find that the unique fixed point is given by x = (−1, 1)^T.

(c) The approximations of Banach's fixed point iteration are defined recursively by x^{(j)} = f(x^{(j−1)}). We find
\[
\mathbf{x}^{(1)} = \begin{pmatrix} -\tfrac{7}{4}\\[2pt] \tfrac{3}{2} \end{pmatrix}, \qquad
\mathbf{x}^{(2)} = \begin{pmatrix} \tfrac{1}{4}\bigl(-\tfrac{7}{4}\bigr) + 1\cdot\tfrac{3}{2} - \tfrac{7}{4}\\[2pt] -\tfrac{1}{2}\cdot\tfrac{3}{2} + \tfrac{3}{2} \end{pmatrix}
= \begin{pmatrix} -\tfrac{11}{16}\\[2pt] \tfrac{3}{4} \end{pmatrix},
\]
\[
\mathbf{x}^{(3)} = \begin{pmatrix} \tfrac{1}{4}\bigl(-\tfrac{11}{16}\bigr) + 1\cdot\tfrac{3}{4} - \tfrac{7}{4}\\[2pt] -\tfrac{1}{2}\cdot\tfrac{3}{4} + \tfrac{3}{2} \end{pmatrix}
= \begin{pmatrix} -\tfrac{75}{64}\\[2pt] \tfrac{9}{8} \end{pmatrix}, \qquad
\mathbf{x}^{(4)} = \begin{pmatrix} \tfrac{1}{4}\bigl(-\tfrac{75}{64}\bigr) + 1\cdot\tfrac{9}{8} - \tfrac{7}{4}\\[2pt] -\tfrac{1}{2}\cdot\tfrac{9}{8} + \tfrac{3}{2} \end{pmatrix}
= \begin{pmatrix} -\tfrac{235}{256}\\[2pt] \tfrac{15}{16} \end{pmatrix}
\approx \begin{pmatrix} -0.91797\\ 0.9375 \end{pmatrix}.
\]
After four iteration steps we obtain the approximate solution x^{(4)} ≈ (−0.91797, 0.9375)^T. □
For real fixed point equations, Banach’s fixed point iteration can be implemented with the
following MATLAB code:
function [x] = Banach_fixed_point_iteration(C,c,z,J)
%
% computes Banach’s fixed point iteration x^{(J)} for f(x) = C*x + c
% approximations are given by x^{(j)} = f(x^{(j-1)})
%
% input: C = real n by n matrix with rho(C)<1 (that is, f is a contraction)
% c = real n by 1 vector
% z = real n by 1 vector that is starting point for iteration
% J = number of iteration steps
% output: x = approximation after J steps
%
x = z;
for j=1:J
y = C * x + c;
x = y;
end
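
For instance, with the data of Example 5.5, the function can be called as follows (a usage sketch; it assumes the function file above is saved on the MATLAB path):

C = [1/4, 1; 0, -1/2];      % matrix C from Example 5.5
c = [-7/4; 3/2];            % vector c from Example 5.5
x4 = Banach_fixed_point_iteration(C, c, [0; 0], 4)
% returns approximately (-0.91797, 0.9375)^T, as computed by hand above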
Now we come back to the linear system Ax = b which we had rewritten as the fixed point
equation (see (5.2) above)
x = B^{-1}(B − A) x + B^{-1} b =: F(x),
with some suitable invertible matrix B. We would like to solve Ax = b by computing Banach’s
fixed point iteration for the function F defined above:
x(j+1) = F(x(j)) = B^{-1}(B − A) x(j) + B^{-1} b.   (5.5)
Before we explore particular choices of the invertible matrix B, we investigate why this iteration is of the form discussed in Section 5.1. We can rewrite (5.5) as
x(j+1) = x(j) − B^{-1} A x(j) + B^{-1} b = x(j) + B^{-1}( b − A x(j) ),   (5.6)
and from (5.6) we see that x(j+1) is indeed of the form (5.1). From the motivation given in Section 5.1, it is clear that we would like to choose the invertible matrix B such that B^{-1} is 'close' to A^{-1} (and by implication B is 'close' to A). Then the matrix

C = B^{-1}(B − A) = I − B^{-1} A

should be 'close' to the zero matrix, and we may therefore expect that ρ(C) < 1 for judicious choices of B.
Exercise 71 Let C ∈ Rn×n and let c ∈ Rn. Show that the function F : Rn → Rn, defined by
F (x) := C x + c, x ∈ Rn,
is continuous on Rn.
Exercise 72 Prove Banach’s fixed point theorem (see Theorem 5.3 above).
Exercise 73 Find the fixed points (if they have any) of each of the following three functions f : R → R, g : R3 → R3, and h : R → R, defined by

(a) f(x) = x^3,   (b) g(x) = [ 3 2 1 ; 2 3 2 ; 1 2 3 ] x + ( −1, −2, −1 )^T,   (c) h(x) = exp(x) = e^x.
Exercise 74 Consider the affine linear mapping f : R3 → R3, given by

f(x) = [ −1/2 −1 0 ; 0 1/2 0 ; 1 1 1/2 ] x + ( −1, 1, 2 )^T.
(a) Does Banach’s fixed point iteration converge for the given function f? Give a proof of your
answer.
(b) Find the unique fixed point of f . Show your work.
(c) For the starting point x(0) = 0 = (0, 0, 0)^T, compute the first four approximations of Banach's fixed point iteration for the given function f by hand.
(d) Find a closed (non-recursive) formula for the approximations x(j). Prove your formula.
Exercise 75 Find the fixed points of the following functions:

(a) f(x) = sin(x),   (b) g(x) = [ 1 1 ; −1 2 ] x + ( 0, 1 )^T.
Exercise 76 Consider the affine linear function f : R2 → R2 defined by

f(x) = C x + c = [ 1/2 1/2 ; 1/2 −1/2 ] x + ( −1, 2 )^T.
(a) Does Banach’s fixed point iteration converge for the given function f? Give a proof of your
answer.
(b) Find the unique fixed point of f . Show your work.
(c) For the starting point x(0) = 0 = (0, 0)T , compute the first five approximations of Banach’s
fixed point iteration by hand.
5.3 The Jacobi and Gauss-Seidel Iterations
After this general discussion, we return to the question of how to choose B and hence the iteration matrix C in (5.2). Our initial approach yields the fixed point equation

x = B^{-1}(B − A) x + B^{-1} b =: C x + c

with the iteration matrix

C = B^{-1}(B − A) = I − B^{-1} A,   (5.7)

with an invertible matrix B, which should be sufficiently close to A but also easily invertible.
From now on, we will assume that the diagonal elements of A are all nonzero. This can
be achieved by exchanging rows and/or columns as long as the matrix A is non-singular.
Next, we decompose A in its lower-left sub-diagonal part L, its diagonal part D, and its upper-
right super-diagonal part R, that is,
A = L + D + R,
where L = (ℓi,j) with ℓi,j = ai,j for 1 ≤ j < i ≤ n and zero otherwise, D = (di,j) with di,i = ai,i
for 1 ≤ i ≤ n and zero otherwise, and R = (ri,j) with ri,j = ai,j for 1 ≤ i < j ≤ n and zero
otherwise. The simplest possible approximation to A is then given by choosing its diagonal
part D for B so that the iteration matrix CJ := C = (ci,j) becomes
CJ := I − D^{-1} A = I − D^{-1}(L + D + R) = −D^{-1}(L + R),   (5.8)

with entries

c_{i,j} = −a_{i,j}/a_{i,i} if i ≠ j,   and   c_{i,j} = 0 if i = j.   (5.9)
Hence, we obtain the fixed point iteration

x(j+1) = −D^{-1}(L + R) x(j) + D^{-1} b
       = −D^{-1}[ (L + D + R) − D ] x(j) + D^{-1} b
       = x(j) + D^{-1}( b − A x(j) ).   (5.10)

Rewriting the first line of (5.10) as

x(j+1) = D^{-1}( b − (L + R) x(j) ),

x(j+1) is componentwise given by

x_i^{(j+1)} = (1/a_{i,i}) ( b_i − Σ_{k=1, k≠i}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.
Definition 5.6 (Jacobi method)
Let A = (a_{i,j}) in Cn×n (or in Rn×n) be invertible, and assume that all diagonal elements of A are non-zero. Let b ∈ Cn (or b ∈ Rn). The iteration {x(j)} defined by

x_i^{(j+1)} = (1/a_{i,i}) ( b_i − Σ_{k=1, k≠i}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n,   (5.11)
is called the Jacobi method.
Obviously, one can only expect convergence of the Jacobi method if the original matrix A
‘resembles a diagonal matrix’.
Definition 5.7 (strictly row diagonally dominant)
A matrix A in Cn×n (or in Rn×n) is called strictly row diagonally dominant if

Σ_{k=1, k≠i}^{n} |a_{i,k}| < |a_{i,i}|   for all i = 1, 2, . . . , n.   (5.12)
Example 5.8 (strictly row diagonally dominant matrix)
The matrix

A = (a_{i,j}) = [ 1 1/2 1/3 ; −1 2 1/2 ; 1/2 1/4 5/6 ]

is strictly row diagonally dominant, since

1/2 + 1/3 = 5/6 < 1 = |a_{1,1}|,   |−1| + 1/2 = 3/2 < 2 = |a_{2,2}|,   1/2 + 1/4 = 3/4 < 5/6 = |a_{3,3}|.  2
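
Strict row diagonal dominance is also easy to check in MATLAB. The following small function is a sketch (the function name is chosen here for illustration and does not appear elsewhere in these notes); it returns 1 if the test (5.12) holds for every row and 0 otherwise:

function [flag] = is_strictly_row_diag_dominant(A)
% checks whether the square matrix A satisfies (5.12)
n = size(A,1);
flag = 1;
for i = 1:n
    offdiag = sum(abs(A(i,:))) - abs(A(i,i));   % sum of |a_{i,k}| over k ~= i
    if offdiag >= abs(A(i,i))
        flag = 0;
        return;
    end
end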
The next theorem gives a sufficient condition for the convergence of the Jacobi method.
Theorem 5.9 (convergence of the Jacobi method)
Let the assumptions be the same as in Definition 5.6. The Jacobi method (5.11) converges
for every starting point x(0) in Cn (or in Rn) if the invertible matrix A in Cn×n (or in Rn×n)
is strictly row diagonally dominant.
Proof of Theorem 5.9. We use the induced matrix ∞-norm to calculate the norm of the iteration matrix CJ = (c_{i,j}) given by (5.8) and (5.9):

‖CJ‖∞ = max_{1≤i≤n} Σ_{j=1}^{n} |c_{i,j}| = max_{1≤i≤n} Σ_{j=1, j≠i}^{n} |a_{i,j}|/|a_{i,i}| = max_{1≤i≤n} (1/|a_{i,i}|) Σ_{j=1, j≠i}^{n} |a_{i,j}| < 1,
where the last estimate follows from the assumption that A is strictly row diagonally dominant
(see (5.12)). Since ρ(CJ) ≤ ‖CJ‖∞ < 1, the convergence of the Jacobi method follows from
Theorem 5.4. 2
Example 5.10 (Jacobi iteration)
Consider the linear system

Ax = b,   where A = [ 2 1/2 1/2 ; 1 3 1 ; 2 0 3 ]   and   b = ( 3/2, −2, 2 )^T.
(a) Find the solution x to Ax = b by hand.
(b) Verify that A is strictly row diagonally dominant.
(c) For the starting point x(0) = 0 = (0, 0, 0)T , compute the first four approximations of the
Jacobi method by hand.
Solution:
(a) We solve the linear system Ax = b. We multiply the first row by 1/2 and subtract it from
the second row. Then we subtract the first row from the third row. Finally, we multiply the
first row and the new third row by 2 and multiply the new second row by 4.
[ 2 1/2 1/2 | 3/2 ; 1 3 1 | −2 ; 2 0 3 | 2 ]  ⇔  [ 2 1/2 1/2 | 3/2 ; 0 11/4 3/4 | −11/4 ; 0 −1/2 5/2 | 1/2 ]  ⇔  [ 4 1 1 | 3 ; 0 11 3 | −11 ; 0 −1 5 | 1 ].

Now we multiply the new second row by 1/11 and add it to the new third row, and in the next step we multiply the new third row by 11/58.

⇔  [ 4 1 1 | 3 ; 0 11 3 | −11 ; 0 0 58/11 | 0 ]  ⇔  [ 4 1 1 | 3 ; 0 11 3 | −11 ; 0 0 1 | 0 ].

In the next step we subtract the new third row from the first row and subtract 3 times the new third row from the new second row. Subsequently we divide the new second row by 11. After that we subtract the new second row from the new first row.

⇔  [ 4 1 0 | 3 ; 0 11 0 | −11 ; 0 0 1 | 0 ]  ⇔  [ 4 1 0 | 3 ; 0 1 0 | −1 ; 0 0 1 | 0 ]  ⇔  [ 4 0 0 | 4 ; 0 1 0 | −1 ; 0 0 1 | 0 ].

Finally we divide the new first row by 4.

⇔  [ 1 0 0 | 1 ; 0 1 0 | −1 ; 0 0 1 | 0 ]  ⇔  x = ( 1, −1, 0 )^T,
that is, the solution to Ax = b is given by x = (1,−1, 0)T .
(b) Since

1/2 + 1/2 < 2 = |a_{1,1}|,   1 + 1 < 3 = |a_{2,2}|,   and   2 + 0 < 3 = |a_{3,3}|,
the matrix A is clearly strictly row diagonally dominant.
(c) The first approximation is

x_1^{(1)} = (1/2)( 3/2 − (1/2)·0 − (1/2)·0 ) = 3/4,
x_2^{(1)} = (1/3)( −2 − 1·0 − 1·0 ) = −2/3,
x_3^{(1)} = (1/3)( 2 − 2·0 ) = 2/3,

so that x(1) = ( 3/4, −2/3, 2/3 )^T. The second approximation is

x_1^{(2)} = (1/2)( 3/2 − (1/2)·(−2/3) − (1/2)·(2/3) ) = 3/4,
x_2^{(2)} = (1/3)( −2 − 1·(3/4) − 1·(2/3) ) = −41/36,
x_3^{(2)} = (1/3)( 2 − 2·(3/4) ) = 1/6,

so that x(2) = ( 3/4, −41/36, 1/6 )^T. The third approximation is

x_1^{(3)} = (1/2)( 3/2 − (1/2)·(−41/36) − (1/2)·(1/6) ) = 143/144,
x_2^{(3)} = (1/3)( −2 − 1·(3/4) − 1·(1/6) ) = −35/36,
x_3^{(3)} = (1/3)( 2 − 2·(3/4) ) = 1/6,

so that x(3) = ( 143/144, −35/36, 1/6 )^T. The fourth approximation is

x_1^{(4)} = (1/2)( 3/2 − (1/2)·(−35/36) − (1/2)·(1/6) ) = 137/144,
x_2^{(4)} = (1/3)( −2 − 1·(143/144) − 1·(1/6) ) = −455/432,
x_3^{(4)} = (1/3)( 2 − 2·(143/144) ) = 1/216,

so that x(4) = ( 137/144, −455/432, 1/216 )^T ≈ ( 0.95139, −1.0532, 0.0046296 )^T.
After four iteration steps, we already have a reasonable approximation of the solution. 2
For a real linear system Ax = b, the Jacobi method can be implemented with the following
MATLAB code:
function [x] = jacobi_1(A,b,z,J)
%
% algorithm computes the Jacobi iteration x^{(J)}
% for approximating the solution of A*x = b
%
% input: A = real invertible n by n matrix with non-zero diagonal entries
% b = real n by 1 vector, right-hand side of linear system
% z = real n by 1 vector, starting point x^{(0)} for Jacobi iteration
% J = number of iteration steps
% output: x = x^{(J)} = n by 1 vector of the Jth approximation of the Jacobi iteration
%
n = size(A,1);
x = z;
y = zeros(n,1);
for j = 1:J
y = x;
for i = 1:n
x(i) = ( b(i) - A(i,1:i-1) * y(1:i-1) - A(i,i+1:n) * y(i+1:n) ) / A(i,i);
end
end
If we exploit the matrix-vector structures provided by MATLAB, then we get the following
more economical MATLAB code for the Jacobi method:
function [x] = jacobi_2(A,b,z,J)
%
% algorithm computes the Jacobi iteration x^{(J)}
% for approximating the solution of A*x = b
% this version exploits the matrix-vector structures of MATLAB
%
% input: A = real invertible n by n matrix with non-zero diagonal entries
% b = real n by 1 vector, right-hand side of linear system
% z = real n by 1 vector, starting point x^{(0)} for Jacobi iteration
% J = number of iteration steps
% output: x = x^{(J)} = n by 1 vector of Jth approximation of Jacobi iteration
%
n = size(A,1);
x = z;
d = diag(A);
for i=1:n
A(i,i) = 0;
end
y = zeros(n,1);
for j = 1:J
y = x;
x = ( b - A * y ) ./ d;
end
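
As an illustration (a hypothetical call, assuming jacobi_2 is on the MATLAB path), the fourth Jacobi approximation from Example 5.10 above can be reproduced by:

A = [2, 1/2, 1/2; 1, 3, 1; 2, 0, 3];
b = [3/2; -2; 2];
x4 = jacobi_2(A, b, zeros(3,1), 4)
% returns approximately (0.95139, -1.0532, 0.0046296)^T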
Exercise 77 Which of the following matrices are strictly row diagonally dominant?

A = [ 3 1 −1 ; 1 2 −1 ; 0 −1 5 ],   B = [ 5 2 −2 ; 0 3 −2 ; −1 1 3 ],   C = [ 2 1/2 −1 ; −1 5/2 1 ; 3/2 −1 1/2 ].
Exercise 78 Consider the linear system

Ax = b,   where A = [ 2 0 −1 ; 0 −2 1 ; 1 1 3 ]   and   b = ( 1, −1, 5 )^T.
(a) Compute the solution x of Ax = b by hand.
(b) Verify that the matrix A is strictly row diagonally dominant.
(c) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first four approximations x(j), for
j = 1, 2, 3, 4, of the Jacobi iteration by hand.
Exercise 79 Consider the linear system

Ax = b,   where A = [ 2 0 0 ; 0 2 −1 ; 0 −1 2 ]   and   b = ( 2, 1, 1 )^T.
(a) Compute the solution x of Ax = b by hand.
(b) Verify that the matrix A is strictly row diagonally dominant.
(c) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first four approximations x(j), for
j = 1, 2, 3, 4, of the Jacobi iteration by hand.
(d) Formulate a closed formula for x(j) for general j and prove it.
A closer inspection of the Jacobi method (5.11) shows that the computation of x_i^{(j+1)} is independent of any other x_ℓ^{(j+1)} with ℓ ≠ i. This means that, on a parallel or vector computer, all components of the new approximation x(j+1) can be computed simultaneously.

However, it also gives us the possibility to improve the process. For example, to calculate x_2^{(j+1)} we could already employ the newly computed x_1^{(j+1)}. Then, for computing x_3^{(j+1)} we could use x_1^{(j+1)} and x_2^{(j+1)}, and so on.
This leads to the following iteration scheme.
Definition 5.11 (Gauss-Seidel iteration method)
Let A = (ai,j) in Cn×n (or in Rn×n) be invertible, and assume that A has non-zero diagonal
elements. Let b ∈ Cn (or b ∈ Rn). The Gauss-Seidel method is given by the iteration
scheme {x(j)} with
x_i^{(j+1)} = (1/a_{i,i}) ( b_i − Σ_{k=1}^{i−1} a_{i,k} x_k^{(j+1)} − Σ_{k=i+1}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.   (5.13)

We note for later use that in matrix notation (5.13) reads

x(j+1) = D^{-1}( b − L x(j+1) − R x(j) )  ⇔  D x(j+1) = b − L x(j+1) − R x(j).   (5.14)
To analyze the convergence of this scheme, we have to find the iteration matrix C = B^{-1}(B − A) with a suitable invertible matrix B. To this end, we rewrite the last equation in (5.14) as

(L + D) x(j+1) = −R x(j) + b  ⇔  x(j+1) = −(L + D)^{-1} R x(j) + (L + D)^{-1} b.

Thus, the iteration matrix of the Gauss-Seidel method is given by

CGS := −(L + D)^{-1} R = (L + D)^{-1}( (L + D) − A ),   (5.15)
that is, we have chosen B = L + D as our approximation of A.
Later on, we will prove a more general version of the following theorem.
Theorem 5.12 (convergence of Gauss-Seidel method)
Let A in Cn×n or in Rn×n be invertible with non-zero diagonal elements, and let b be in Cn or in Rn, respectively. If A ∈ Cn×n is Hermitian and positive definite, then the Gauss-Seidel
method converges. If A ∈ Rn×n is symmetric and positive definite, then the Gauss-Seidel
method converges.
Example 5.13 (Gauss-Seidel method)
Consider the linear system

Ax = b,   where A = [ 2 1 0 ; 1 2 0 ; 0 0 1 ]   and   b = ( −1, 1, 2 )^T.
(a) Show that the matrix A is positive definite.
(b) Compute the solution x of Ax = b.
(c) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first three steps of the Gauss-Seidel
iteration by hand.
Solution:
(a) We note that A is symmetric. We evaluate x^T A x:

x^T A x = 2 x_1^2 + 2 x_2^2 + 2 x_1 x_2 + x_3^2 = x_1^2 + x_2^2 + x_3^2 + (x_1 + x_2)^2 > 0   for all x ∈ R3 \ {0}.
Thus A is positive definite. Alternatively we could have computed the eigenvalues of A which
yields λ1 = 3 and λ2 = λ3 = 1. Since the eigenvalues are all positive, the matrix A is positive
definite.
(b) We solve Ax = b. In the first step we subtract 1/2 times the first row from the second row. Subsequently, we divide the new second row by 3/2.

[ 2 1 0 | −1 ; 1 2 0 | 1 ; 0 0 1 | 2 ]  ⇔  [ 2 1 0 | −1 ; 0 3/2 0 | 3/2 ; 0 0 1 | 2 ]  ⇔  [ 2 1 0 | −1 ; 0 1 0 | 1 ; 0 0 1 | 2 ].

Finally, we subtract the new second row from the first row. After that we divide the new first row by 2.

⇔  [ 2 0 0 | −2 ; 0 1 0 | 1 ; 0 0 1 | 2 ]  ⇔  [ 1 0 0 | −1 ; 0 1 0 | 1 ; 0 0 1 | 2 ]  ⇔  x = ( −1, 1, 2 )^T.
The unique solution to Ax = b is x = (−1, 1, 2)T .
(c) We compute the first three approximations of the Gauss-Seidel iteration for the starting vector x(0) = 0 = (0, 0, 0)^T:

x_1^{(1)} = (1/2)( −1 − 1·0 − 0·0 ) = −1/2,
x_2^{(1)} = (1/2)( 1 − 1·(−1/2) − 0·0 ) = 3/4,
x_3^{(1)} = (1/1)( 2 − 0·(−1/2) − 0·(3/4) ) = 2,

so that x(1) = ( −1/2, 3/4, 2 )^T;

x_1^{(2)} = (1/2)( −1 − 1·(3/4) − 0·2 ) = −7/8,
x_2^{(2)} = (1/2)( 1 − 1·(−7/8) − 0·2 ) = 15/16,
x_3^{(2)} = (1/1)( 2 − 0·(−7/8) − 0·(15/16) ) = 2,

so that x(2) = ( −7/8, 15/16, 2 )^T;

x_1^{(3)} = (1/2)( −1 − 1·(15/16) − 0·2 ) = −31/32,
x_2^{(3)} = (1/2)( 1 − 1·(−31/32) − 0·2 ) = 63/64,
x_3^{(3)} = (1/1)( 2 − 0·(−31/32) − 0·(63/64) ) = 2,

so that x(3) = ( −31/32, 63/64, 2 )^T ≈ ( −0.96875, 0.98438, 2 )^T.
After three iteration steps, we obtain the approximation x(3) ≈ (−0.96875, 0.98438, 2)T . 2
For real linear systems Ax = b, the Gauss-Seidel method can be implemented with the following
MATLAB code:
function [x] = gauss_seidel(A,b,z,J)
%
% algorithm computes Gauss-Seidel iteration for a given linear system A*x = b
% x^{(j)} = jth approximation in the Gauss-Seidel method
%
% input: A = real n by n invertible matrix with non-zero diagonal elements
% b = real n by 1 vector, right-hand side of linear system
% z = real n by 1 vector,
% starting point x^{(0)} for Gauss-Seidel iteration
% J = number of iteration steps
% output: x = x^{(J)} Gauss-Seidel approximation after J iteration steps
%
n = size(A,1);
x = z;
for j=1:J
for i=1:n
x(i) = ( b(i) - A(i,1:i-1) * x(1:i-1) - A(i,i+1:n) * x(i+1:n) ) / A(i,i);
end
end
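
For instance (a usage sketch, assuming the function above is on the path), the third Gauss-Seidel approximation from Example 5.13 above is reproduced by:

A = [2, 1, 0; 1, 2, 0; 0, 0, 1];
b = [-1; 1; 2];
x3 = gauss_seidel(A, b, zeros(3,1), 3)
% returns approximately (-0.96875, 0.98438, 2)^T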
Exercise 80 Consider the linear system

Ax = b,   where A = [ 2 0 1 ; 0 1 0 ; 1 0 2 ]   and   b = ( 1, 1, −1 )^T.
(i) Show that the matrix A is positive definite.
(ii) Compute the solution x of Ax = b by hand.
(iii) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first three steps of the Gauss-
Seidel iteration by hand.
(iv) Find a closed formula for the jth approximation x(j), j ∈ N. Prove that your formula is
correct.
Exercise 81 Consider the linear system

Ax = b,   where A = [ 2 0 0 ; 0 2 −1 ; 0 −1 2 ]   and   b = ( 2, 1, 1 )^T.
(i) Show that the matrix A is positive definite.
(ii) Compute the solution x of Ax = b by hand.
(iii) For the starting vector x(0) = 0 = (0, 0, 0)T , compute the first three steps of the Gauss-
Seidel iteration by hand.
(iv) Find a closed formula for the jth approximation x(j), j ∈ N. Prove that your formula is
correct.
5.4 Relaxation
A further improvement of both the Jacobi method and the Gauss-Seidel method can be achieved
by relaxation.
We start by looking at the Jacobi method. In (5.10), we have seen that the approximations
can be written as
x(j+1) = D^{-1} b − D^{-1}(L + R) x(j)
       = x(j) + D^{-1} b − D^{-1}(L + R + D) x(j)
       = x(j) + D^{-1}( b − A x(j) ).
The last representation shows that the new approximation x(j+1) is given by the old approximation x(j) plus D^{-1} applied to the residual b − A x(j). In practice, one often notices that the correction term D^{-1}( b − A x(j) ) deviates from the 'ideal' correction term by a fixed factor. Hence, it makes sense to introduce a relaxation parameter ω ∈ R+ and to form the new approximation as

x(j+1) = x(j) + ω D^{-1}( b − A x(j) ).   (5.16)
(Note that, in principle, ω could have any positive real value.) Formula (5.16) gives the following
componentwise scheme:
Definition 5.14 (Jacobi relaxation)
Let A = (a_{i,j}) in Cn×n (or in Rn×n) be invertible with non-zero diagonal elements, and let b in Cn (or in Rn). The Jacobi relaxation with relaxation parameter ω > 0 is given by

x_i^{(j+1)} = x_i^{(j)} + (ω/a_{i,i}) ( b_i − Σ_{k=1}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.
Of course, the relaxation parameter ω should be chosen such that the convergence improves
compared to the original Jacobi method. To investigate this, we work out the iteration matrix
of the Jacobi relaxation. By replacing A = L + D + R in (5.16),
x(j+1) = x(j) + ω D^{-1} b − ω D^{-1}(L + D + R) x(j)
       = [ (1 − ω) I − ω D^{-1}(L + R) ] x(j) + ω D^{-1} b,   (5.17)
and we see that the iteration matrix is now
CJ(ω) := (1 − ω) I − ω D^{-1}(L + R) = (1 − ω) I + ω CJ,   (5.18)

where CJ = −D^{-1}(L + R) is the matrix of the classical Jacobi iteration (without relaxation),
see (5.8). As expected, for ω = 1 we find CJ(1) = CJ .
By Theorem 5.4 we know that the Jacobi relaxation converges if ρ(CJ(ω)) < 1, and an inspection of the proof of Theorem 5.4 shows that the smaller ρ(CJ(ω)) is, the faster the Jacobi relaxation converges. Hence it makes sense to determine ω such that ρ(CJ(ω)) is minimized.
Theorem 5.15 (convergence of Jacobi relaxation)
Let the assumptions be the same as in Definition 5.14. Furthermore, let L be the lower-left
sub-diagonal part of A, D the diagonal part of A, and R the upper-right super-diagonal part
of A, and assume that CJ = −D−1 (L + R) has only real eigenvalues λ1 ≤ λ2 ≤ . . . ≤ λn
in the interval (−1, 1) with corresponding eigenvectors z(1), z(2), . . . , z(n). Then, CJ(ω) =
(1 − ω) I + ω CJ has the same eigenvectors z(1), z(2), . . . , z(n), but with the corresponding
eigenvalues µj = µj(ω) = 1 − ω + ω λj, 1 ≤ j ≤ n. The spectral radius of CJ(ω) is
minimized by choosing ω to be

ω* = 2/(2 − λ1 − λn).   (5.19)

If λ1 ≠ −λn, then the Jacobi relaxation converges faster than the Jacobi method.
Figure 5.1: Determination of the relaxation parameter for the Jacobi relaxation. (The figure shows the lines fω(λ) = 1 − ω + ω λ, all passing through the point (1, 1), evaluated at λ1 and λn.)
Proof of Theorem 5.15. First note that the assumption −1 < λ1 ≤ λ2 ≤ . . . ≤ λn < 1
on the eigenvalues of the iteration matrix CJ of the classical Jacobi method guarantees that
ρ(CJ) < 1 and thus the classical Jacobi method converges. Furthermore, since CJ(1) = CJ , we
know that there exists an ω > 0 for which ρ(CJ(ω)) < 1.
For every eigenvector z(j) of CJ it follows that

CJ(ω) z(j) = ( (1 − ω) I + ω CJ ) z(j) = (1 − ω) z(j) + ω λj z(j) = ( 1 − ω + ω λj ) z(j),

that is, z(j) is an eigenvector of CJ(ω) with the eigenvalue 1 − ω + ω λj =: µj(ω). Thus, the spectral radius of CJ(ω) is given by

ρ( CJ(ω) ) = max_{1≤j≤n} |µj(ω)| = max_{1≤j≤n} |1 − ω + ω λj|.
The aim is to choose ω > 0 such that ρ(CJ(ω)) is minimized. For a fixed ω let us have a look
at the function
fω(λ) := 1 − ω + ω λ,
which, as a function of λ, is a straight line with fω(1) = 1. For different choices of ω we get in this way a collection of lines all going through (1, 1) (see Figure 5.1), and it follows
that the maximum in the definition of ρ(CJ(ω)) can only be attained for the indices j = 1 and
j = n, since these belong to the smallest and largest eigenvalues λ1 and λn of CJ , respectively.
Moreover, it follows from Figure 5.1 that ω is optimally chosen if
fω(λ1) = −fω(λn). (5.20)
Writing (5.20) explicitly and solving for ω yields
1 − ω + ω λ1 = −( 1 − ω + ω λn )  ⇔  ω = 2/(2 − λ1 − λn).
This gives (5.19). Since the spectral radius ρ(CJ(ω)) is minimized for ω = ω*, it follows (see Theorem 5.4 and its proof) that the Jacobi relaxation converges fastest if ω = ω*. Since the classical Jacobi method corresponds to ω = 1, the classical Jacobi method is already optimal exactly when ω* = 1, which is equivalent to λ1 = −λn. 2
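
If the eigenvalues of CJ are available (or can be computed), the optimal relaxation parameter (5.19) is obtained directly. A minimal MATLAB sketch (the variable names are chosen for illustration; it assumes, as in Theorem 5.15, that CJ has only real eigenvalues in (−1, 1)):

D = diag(diag(A));                 % diagonal part of A
CJ = eye(size(A)) - D \ A;         % CJ = I - D^{-1} A = -D^{-1} (L + R)
lam = sort(real(eig(CJ)));         % eigenvalues, assumed real (Theorem 5.15)
omega_star = 2 / (2 - lam(1) - lam(end));   % formula (5.19)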
An alternative interpretation of the relaxation can be derived from
x(j+1) = x(j) + ω D^{-1} b − ω D^{-1}(L + D + R) x(j)
       = (1 − ω) x(j) − ω D^{-1}(L + R) x(j) + ω D^{-1} b
       = (1 − ω) x(j) + ω CJ x(j) + ω D^{-1} b
       = (1 − ω) x(j) + ω ( CJ x(j) + D^{-1} b ),
which follows from (5.17) and CJ = −D^{-1}(L + R). Hence, if we define

z(j+1) = CJ x(j) + D^{-1} b,
which is one step of the classical Jacobi method, the next approximation of the Jacobi relaxation
method is
x(j+1) = (1 − ω)x(j) + ω z(j+1). (5.21)
Formula (5.21) can be interpreted as a linear interpolation between the old approximation
x(j) and the new classical Jacobi approximation z(j+1).
This idea can be used to introduce relaxation for the Gauss-Seidel method as well. From
(5.14) the classical Gauss-Seidel iteration is given by
z(j+1) = D^{-1}( b − L z(j+1) − R z(j) ).   (5.22)
In analogy to (5.21), we can now form a new approximation by linear interpolation between z(j+1), given by (5.22), and the previous approximation (where we now rename the approximations to be x(j) and x(j+1) as usual):
x(j+1) = (1 − ω) x(j) + ω D^{-1}( b − L x(j+1) − R x(j) ).   (5.23)
Multiplying (5.23) with D yields

D x(j+1) = (1 − ω) D x(j) + ω b − ω L x(j+1) − ω R x(j).

Hence

(D + ω L) x(j+1) = [ (1 − ω) D − ω R ] x(j) + ω b,

or equivalently (assuming D + ω L is non-singular)

x(j+1) = (D + ω L)^{-1} [ (1 − ω) D − ω R ] x(j) + ω (D + ω L)^{-1} b.
Thus, the iteration matrix of the Gauss-Seidel relaxation is given by

CGS(ω) := (D + ω L)^{-1} [ (1 − ω) D − ω R ].   (5.24)

(We see that, for ω = 1, we have CGS(1) = CGS with the iteration matrix CGS = −(D + L)^{-1} R of the classical Gauss-Seidel method, see (5.15).) Writing (5.23) equivalently as

x(j+1) = x(j) + ω D^{-1}( b − L x(j+1) − D x(j) − R x(j) )
       = x(j) + ω D^{-1}( b − L x(j+1) − (D + R) x(j) ),   (5.25)
the second line in (5.25) shows that the Gauss-Seidel relaxation can be written component-wise
as
x_i^{(j+1)} = x_i^{(j)} + (ω/a_{i,i}) ( b_i − Σ_{k=1}^{i−1} a_{i,k} x_k^{(j+1)} − Σ_{k=i}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.
Definition 5.16 (Gauss-Seidel relaxation or SOR method)
Let A = (a_{i,j}) in Cn×n (or in Rn×n) be invertible and have non-zero diagonal entries, and let b ∈ Cn (or b ∈ Rn). The Gauss-Seidel relaxation or SOR method (or successive over-relaxation method) is defined by

x_i^{(j+1)} = x_i^{(j)} + (ω/a_{i,i}) ( b_i − Σ_{k=1}^{i−1} a_{i,k} x_k^{(j+1)} − Σ_{k=i}^{n} a_{i,k} x_k^{(j)} ),   1 ≤ i ≤ n.   (5.26)
Again, we have to deal with the question of how to choose the relaxation parameter.
Theorem 5.17 (necessary condition for convergence of SOR method)
Let the assumptions be the same as in Definition 5.16. The spectral radius of the iteration
matrix

CGS(ω) = (D + ω L)^{-1} [ (1 − ω) D − ω R ]

of the Gauss-Seidel relaxation or SOR method satisfies

ρ( CGS(ω) ) ≥ |ω − 1|.
Hence, convergence is only possible if ω ∈ (0, 2).
Proof of Theorem 5.17. We rewrite the iteration matrix CGS(ω) as follows:

CGS(ω) = (D + ω L)^{-1} [ (1 − ω) D − ω R ]
       = (D + ω L)^{-1} D D^{-1} [ (1 − ω) D − ω R ]
       = ( D^{-1}(D + ω L) )^{-1} [ (1 − ω) I − ω D^{-1} R ]
       = ( I + ω D^{-1} L )^{-1} [ (1 − ω) I − ω D^{-1} R ].   (5.27)

Consider the representation of CGS(ω) in the last line of (5.27). The first matrix in this product is the inverse of a normalized lower triangular matrix (determinant 1), and the second matrix is an upper triangular matrix with diagonal entries all equal to 1 − ω. Thus the determinant of CGS(ω) is given by (where we use det(AB) = det(A) det(B))

det( CGS(ω) ) = det( (I + ω D^{-1} L)^{-1} ) det( (1 − ω) I − ω D^{-1} R ) = (1 − ω)^n.

Since the determinant of a matrix equals the product of its eigenvalues, we have the following: Denote by λ1, λ2, . . . , λn the n eigenvalues of CGS(ω); then

| det( CGS(ω) ) | = |λ1 λ2 · · · λn| = |1 − ω|^n.   (5.28)

It can easily be shown by proof by contradiction that (5.28) implies that there exists at least one eigenvalue λj of CGS(ω) with |λj| ≥ |1 − ω|. Thus, from the definition of the spectral radius,

ρ( CGS(ω) ) ≥ |1 − ω|.   (5.29)
From Theorem 5.4, we know that the method converges if and only if ρ( CGS(ω) ) < 1. Thus from (5.29) convergence is only possible if ω ∈ (0, 2). 2
We will now show that for a positive definite matrix, ω ∈ (0, 2) is also sufficient for convergence.
Since ω = 1 gives the classical Gauss-Seidel method, we also cover Theorem 5.12 as a special
case.
Theorem 5.18 (sufficient condition for convergence of SOR method)
Let A ∈ Cn×n be Hermitian and positive definite and b ∈ Cn, or let A ∈ Rn×n be symmetric and positive definite and b ∈ Rn. Then the Gauss-Seidel relaxation or SOR method converges for every relaxation parameter ω ∈ (0, 2) and every starting point x(0) ∈ Cn or x(0) ∈ Rn, respectively.
Proof of Theorem 5.18. We first observe that the Gauss-Seidel relaxation is well-defined for a positive definite matrix A = (a_{i,j}), since

a_{j,j} = e_j^* A e_j > 0   for all j = 1, 2, . . . , n,

that is, A has non-zero diagonal entries.
We have to show that ρ( CGS(ω) ) < 1. To this end, we rewrite the iteration matrix CGS(ω) in the form (see the first line in (5.27))

CGS(ω) = (D + ω L)^{-1} [ (1 − ω) D − ω R ]
       = (D + ω L)^{-1} [ D + ω L − ω (L + D + R) ]
       = I − ω (D + ω L)^{-1} A
       = I − ( ω^{-1} D + L )^{-1} A
       = I − B^{-1} A,

with B := ω^{-1} D + L. Let λ ∈ C be an eigenvalue of CGS(ω) with corresponding eigenvector x ∈ Cn, which we assume to be normalized such that ‖x‖2 = 1. Then, we have

CGS(ω) x = ( I − B^{-1} A ) x = λ x  ⇔  (1 − λ) x = B^{-1} A x  ⇔  A x = (1 − λ) B x.
Since A is positive definite, and hence invertible, we must have λ ≠ 1. (Otherwise we would have A x = 0 for the non-zero vector x, which would contradict the fact that A is invertible.) Since A is positive definite, we have

0 < x* A x = (1 − λ) x* B x,

and since λ ≠ 1,

1/(1 − λ) = (x* B x)/(x* A x).
Since A is Hermitian, that is, A∗ = A, we have
L∗ + D∗ + R∗ = L + D + R ⇒ L∗ = R, D = D∗, R∗ = L.
Thus for the matrix B = ω^{-1} D + L introduced earlier in this proof,

B + B* = ( ω^{-1} D + L ) + ( ω^{-1} D + L )* = ω^{-1} D + L + ω^{-1} D* + L*
       = ω^{-1} D + L + ω^{-1} D + R
       = 2 ω^{-1} D − D + ( L + D + R )
       = ( 2/ω − 1 ) D + A.   (5.30)
Since A is positive definite, for x ∈ Cn \ {0}, x* A x is real and positive. The real part of 1/(1 − λ) satisfies

Re( 1/(1 − λ) ) = Re( (x* B x)/(x* A x) ) = Re( x* B x )/(x* A x)
              = (1/2) (1/(x* A x)) ( x* B x + conj(x* B x) )
              = (1/2) (1/(x* A x)) ( x* B x + x* B* x )
              = (1/2) (1/(x* A x)) x* ( B + B* ) x
              = (1/2) (1/(x* A x)) x* [ (2/ω − 1) D + A ] x
              = (1/2) (1/(x* A x)) [ (2/ω − 1) x* D x + x* A x ]
              = (1/2) [ (2/ω − 1) (x* D x)/(x* A x) + 1 ].   (5.31)

In the second line of (5.31) we have used that x* B x is a scalar, so that its complex conjugate satisfies

conj( x* B x ) = ( x* B x )* = x* B* x,
and in the third line of (5.31) we have used the representation (5.30) of B + B∗. Because
ω ∈ (0, 2), the expression 2/ω − 1 is positive. Since A is positive definite, we have
0 < e_j^* A e_j = a_{j,j},   j = 1, 2, . . . , n,
and therefore the diagonal matrix D has only positive entries and hence D is also positive
definite. Therefore (x∗ D x)/(x∗ Ax) is positive for all x ∈ Cn \ {0}, because A and D are
positive definite. Thus
Re( 1/(1 − λ) ) = (1/2) [ (2/ω − 1) (x* D x)/(x* A x) + 1 ] > 1/2,

since (2/ω − 1) (x* D x)/(x* A x) > 0.
If we write λ = u + i v, then we can conclude that

1/2 < Re( 1/(1 − λ) ) = Re( 1/(1 − u − i v) ) = Re( ((1 − u) + i v)/((1 − u)^2 + v^2) ) = (1 − u)/((1 − u)^2 + v^2),
and hence

1/2 < (1 − u)/((1 − u)^2 + v^2)  ⇔  (1 − u)^2 + v^2 < 2 (1 − u)  ⇔  u^2 + v^2 < 1,

that is, |λ|^2 = u^2 + v^2 < 1. Since the eigenvalue λ of the iteration matrix CGS(ω) was arbitrary, we have shown that for ω ∈ (0, 2)

ρ( CGS(ω) ) < 1.
Thus from Theorem 5.4, the Gauss-Seidel relaxation or SOR method converges. 2
For real linear systems Ax = b, the Gauss-Seidel relaxation or SOR method can be imple-
mented with the following MATLAB code:
function [x] = SOR_method(A,b,z,w,J)
%
% algorithm executes the SOR method for solving A*x = b
% with relaxation parameter w;
% x^{(j)} is the approximation after the jth iteration step
%
% input: A = real invertible n by n matrix with non-zero diagonal elements
% b = real n by 1 vector, right-hand side of linear system
% z = real n by 1 vector, starting vector x^{(0)} for SOR iteration
% w = relaxation parameter in (0,2)
% J = number of approximations
%
n = size(A,1);
x = z;
for j = 1:J
for i = 1:n
y = x(i) ...
    + w * ( b(i) - A(i,1:i-1) * x(1:i-1) - A(i,i:n) * x(i:n) ) / A(i,i);
x(i) = y;
end
end
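
For example (a usage sketch), calling the function with w = 1 reproduces the classical Gauss-Seidel iteration from Example 5.13:

A = [2, 1, 0; 1, 2, 0; 0, 0, 1];
b = [-1; 1; 2];
x3 = SOR_method(A, b, zeros(3,1), 1, 3)
% w = 1 gives the classical Gauss-Seidel method:
% returns approximately (-0.96875, 0.98438, 2)^T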
Example 5.19 (Jacobi, Gauss-Seidel, and SOR method)
Suppose we wish to solve the real linear system Ax = b, where
A = [ 1 0 0.25 0.25 ; 0 1 0 0.25 ; 0.25 0 1 0 ; 0.25 0.25 0 1 ],   b = ( 0.25, 0.5, 0.75, 1.0 )^T.
Note that A is symmetric and positive definite and diagonally dominant. Thus the classical
Jacobi method, the classical Gauss-Seidel method, and the Gauss-Seidel relaxation can be
applied to solve Ax = b. The true solution x rounded to 4 decimal places is given by

x ≈ ( −0.1962, 0.2536, 0.7990, 0.9856 )^T.
Using the Jacobi method, with starting vector x(0) = 0 = (0, 0, 0, 0)^T, we have:

approximation        x(0)      x(1)_J     x(2)_J     x(3)_J
                     0         0.25      −0.1875    −0.125
                     0         0.5        0.25       0.2969
                     0         0.75       0.6875     0.7969
                     0         1.0        0.8125     0.9844
‖A x(j)_J − b‖_2     1.3693    0.5413     0.2182     0.0882
‖x(j)_J − x‖_2       1.3087    0.5122     0.2062     0.0833
Using the Gauss-Seidel method, with starting vector x(0) = 0 = (0, 0, 0, 0)^T, we have:

approximation        x(0)      x(1)_GS    x(2)_GS    x(3)_GS
                     0         0.25      −0.125     −0.1846
                     0         0.5        0.2969     0.2607
                     0         0.6875     0.7812     0.7961
                     0         0.8125     0.9570     0.9810
‖A x(j)_GS − b‖_2    1.3693    0.4265     0.0697     0.0114
‖x(j)_GS − x‖_2      1.3087    0.5497     0.0899     0.0147
Using the SOR method with ω = 1.05, with starting vector x(0) = 0 = (0, 0, 0, 0)^T, we have:

approximation        x(0)      x(1)_SOR   x(2)_SOR   x(3)_SOR
                     0         0.2625    −0.1606    −0.1943
                     0         0.5250     0.2774     0.2546
                     0         0.7186     0.7937     0.7988
                     0         0.8433     0.9772     0.9853
‖A x(j)_SOR − b‖_2   1.3693    0.4699     0.0394     0.0020
‖x(j)_SOR − x‖_2     1.3087    0.5575     0.0439     0.0021
We see that the SOR method with ω = 1.05 clearly converges fastest in this example. 2
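
The error columns of the three tables can be reproduced with the MATLAB functions given earlier in this chapter; a sketch of such a driver script (assuming jacobi_2, gauss_seidel, and SOR_method are on the path):

A = [1 0 0.25 0.25; 0 1 0 0.25; 0.25 0 1 0; 0.25 0.25 0 1];
b = [0.25; 0.5; 0.75; 1.0];
xs = A \ b;                                      % reference solution
for j = 1:3
    eJ   = norm(jacobi_2(A,b,zeros(4,1),j)        - xs);
    eGS  = norm(gauss_seidel(A,b,zeros(4,1),j)    - xs);
    eSOR = norm(SOR_method(A,b,zeros(4,1),1.05,j) - xs);
    fprintf('j = %d: %.4f  %.4f  %.4f\n', j, eJ, eGS, eSOR);
end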
Chapter 6
The Conjugate Gradient Method
In this chapter we discuss the conjugate gradient method (or CG method) which is an
iterative method for solving linear systems Ax = b with a positive definite symmetric matrix
A ∈ Rn×n. We will see that the conjugate gradient method converges in at most n steps to the
exact solution x of Ax = b, and so this iterative method is in some sense also a direct method.
However, in practice n is usually large and we will not let the conjugate gradient method run through all steps but let it terminate once a certain accuracy is reached.
6.1 The Generic Minimization Algorithm
We now construct an iterative method for solving linear systems Ax = b with a symmetric
and positive definite real n × n matrix A.
As a reminder, A ∈ Rn×n is positive definite if A is symmetric (that is, AT = A) and if
xT Ax > 0 for all x ∈ Rn \ {0}.
Let A ∈ Rn×n be symmetric and positive definite and b ∈ Rn. We wish to solve the linear
system
Ax = b.
Associated with this system we have a function f : Rn → R, defined by

f(x) := (1/2) x^T A x − x^T b.

We will also refer to f as the conjugate gradient functional. Writing f more explicitly as
f(x) = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j} x_i x_j − Σ_{j=1}^{n} x_j b_j,   (6.1)
we see that f is a polynomial of degree 2 in the entries x1, x2, . . . , xn of the vector x. Thus we
can calculate the first order and second order derivatives of f , and we obtain the gradient and
the Hessian (details left as an exercise)
∇f(x) = Ax − b, (6.2)
Hf(x) = A. (6.3)
Since f is a polynomial of degree 2 in the entries x1, x2, . . . , xn, all higher order derivatives
vanish. Thus we can rewrite f as its Taylor polynomial of degree 2 centred at y (using (6.2)
and (6.3))
f(x) = f(y) + (x − y)^T ∇f(y) + (1/2) (x − y)^T Hf(y) (x − y)
     = f(y) + (x − y)^T ( A y − b ) + (1/2) (x − y)^T A (x − y).   (6.4)
If x̂ minimizes f, then we know from calculus that

0 = ∇f(x̂) = A x̂ − b,

that is, any minimizer is a solution of the linear system Ax = b.

On the other hand, assume that x̂ satisfies A x̂ = b. Then, setting y = x̂ in (6.4) and using A x̂ = b, yields

f(x) = f(x̂) + (x − x̂)^T ( A x̂ − b ) + (1/2) (x − x̂)^T A (x − x̂)
     = f(x̂) + (1/2) (x − x̂)^T A (x − x̂).   (6.5)

Because A is positive definite,

(x − x̂)^T A (x − x̂) > 0   for all x ≠ x̂,

and we see from (6.5) that

f(x) > f(x̂)   for all x ∈ Rn \ {x̂},

and thus x̂ minimizes f.
Since A is positive definite, A is, in particular, non-singular. Therefore Ax = b has a unique so-
lution x and this unique solution is the unique minimizer of the function f . We summarize
this in the theorem below.
Theorem 6.1 (minimizer of CG functional)
Let A ∈ Rn×n be symmetric and positive definite. The uniquely determined solution of the
system Ax = b is the unique minimizer of the functional
f(x) := (1/2) x^T A x − x^T b.
Exercise 82 Let A = (ai,j) in Rn×n be symmetric and positive definite, and let b ∈ Rn. By
writing the functional
f(x) = (1/2) x^T A x − x^T b
explicitly in the form (6.1) and differentiating, show that ∇f(x) = Ax − b. Similarly show
that Hf(x) = A.
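
The formula (6.2) can also be checked numerically before doing the exercise: the following sketch compares A x − b with a central finite-difference approximation of ∇f at a random point (all names are chosen here for illustration):

n = 4;
M = randn(n);  A = M'*M + eye(n);     % a random symmetric positive definite matrix
b = randn(n,1);  x = randn(n,1);
f = @(y) 0.5*y'*A*y - y'*b;           % the CG functional
g = zeros(n,1);  h = 1e-6;
for i = 1:n
    e = zeros(n,1);  e(i) = 1;
    g(i) = ( f(x + h*e) - f(x - h*e) ) / (2*h);   % central difference
end
norm(g - (A*x - b))                   % should be of the order of h^2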
We will now try to find the minimum of f by a simple iterative procedure {xj} of the form
xj+1 = xj + αj pj .
Here, xj is our current position. From this position we want to move into the direction of pj.
The step-length of our move is determined by αj . Of course, it will be our goal to select the
direction pj and the step-length αj in such a way that
f(xj+1) ≤ f(xj).
(Note in this section we will use the notation xj rather than x(j), since it is shorter and there
can be no misunderstanding, since we never need the components of xj .)
Given a new direction pj, the best possible step-length in that direction can be determined by looking at the minimum of f along the line xj + α pj. Hence, if we set

ϕ(α) = f( xj + α pj ) = (1/2) ( xj + α pj )^T A ( xj + α pj ) − ( xj + α pj )^T b,

the necessary condition for a minimum yields, from the chain rule and (6.2),

0 = ϕ′(α) = pj^T ∇f( xj + α pj ) = pj^T [ A ( xj + α pj ) − b ]
  = pj^T A xj + α pj^T A pj − pj^T b = pj^T ( A xj − b ) + α pj^T A pj.   (6.6)
Since A is positive definite, we know that

ϕ″(α) = pj^T A pj > 0,

and thus at any α with ϕ′(α) = 0 we have a minimum. Solving (6.6) for α gives the new step-length αj = α as

αj = pj^T ( b − A xj ) / ( pj^T A pj ) = pj^T rj / ( pj^T A pj ),

where we have defined the residual of the jth step as

rj := b − A xj.
Thus we have the following generic algorithm.
Algorithm 3 Generic Minimisation
1: Choose x1 and p1.
2: Let r1 = b − A x1.
3: for j = 1, 2, . . . do
4:    αj = pj^T rj / ( pj^T A pj )
5:    xj+1 = xj + αj pj
6:    rj+1 = b − A xj+1
7:    Choose next direction pj+1.
8: end for
Note that we pick αj in such a way that (see the last expression in the first line of (6.6) and use xj + αj pj = xj+1)

0 = pj^T [ A ( xj + αj pj ) − b ] = pj^T ( A xj+1 − b ) = −pj^T rj+1,   (6.7)
showing that pj and rj+1 are orthogonal.
Obviously, we still have to determine how to choose the search directions. One possible way is
to pick the direction of steepest descent: It can be shown that this direction is given by
the negative gradient of the target function, that is, by

pj = −∇f(xj) = −( A xj − b ) = b − A xj = rj.
Thus in the jth step, the direction pj is chosen to be the residual rj = b − Axj from the
previous approximation xj . This gives the following algorithm.
Algorithm 4 Steepest Descent
1: Choose x1
2: Set p1 = b − A x1
3: for j = 1, 2, . . . do
4:    αj = pj^T pj / ( pj^T A pj )
5:    xj+1 = xj + αj pj
6:    pj+1 = b − A xj+1
7: end for
The relation (6.7) now becomes
pj^T pj+1 = 0,   (6.8)
which means that two successive search directions are orthogonal.
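
Algorithm 4 translates directly into MATLAB. The following is a minimal sketch (the function name, the tolerance tol, and the iteration cap maxit are chosen here for illustration):

function [x,j] = steepest_descent(A,b,z,tol,maxit)
% method of steepest descent for A*x = b with A symmetric positive definite
x = z;
p = b - A * x;                     % search direction p_j = residual r_j
j = 0;
while norm(p) > tol && j < maxit
    q = A * p;
    alpha = (p' * p) / (p' * q);   % optimal step-length along p
    x = x + alpha * p;
    p = b - A * x;                 % next direction = new residual
    j = j + 1;
end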
Unfortunately, the method of steepest descent often converges rather slowly as illustrated in
the example below.
Example 6.2 (method of steepest descent)
Let us choose the following matrix, right-hand side, and initial position:
A = [ 1 0 ; 0 9 ],   b = ( 0, 0 )^T,   x1 = ( 9, 1 )^T.
Clearly the solution of Ax = b is x = 0 = (0, 0)^T. It can be shown that the method of steepest descent produces the iteration

xj = (0.8)^{j−1} ( 9, (−1)^{j−1} )^T,   (6.9)

which is depicted in Figure 6.1. The details of the computation are left as an exercise.

Clearly (6.9) converges rather slowly and does not terminate after a finite number of steps with the correct solution. This linear system with a 2 × 2 diagonal matrix could have easily been solved with any direct method in a much more economical way.
The problem with this iteration becomes more apparent if we look at the level curves of the corresponding quadratic function

f(x) = (1/2) x^T A x − x^T b = (1/2) ( x_1^2 + 9 x_2^2 ),
which are oblong ellipses, as illustrated in Figure 6.2. Since the new search direction has to
be orthogonal to the previous one, we are always orthogonal to the level curves, and hence the
new direction does not point towards the centre of the ellipses, which is our solution vector. 2
Figure 6.1: Approximations computed by the method of steepest descent in Example 6.2.
Figure 6.2: Illustration of Example 6.2: the level curves of f are oblong ellipses, and the search direction pj at xj is orthogonal to them, so the step to xj+1 does not point towards the centre.
Exercise 83 The method of steepest descent to compute the minimiser of the functional

f(x) = (1/2) x^T A x − x^T b

is given by Algorithm 4. Consider the system Ax = b with

A = [ 1 0 ; 0 9 ]   and   b = ( 0, 0 )^T.
(i) Starting with x1 = (9, 1)T , compute x2 and x3. State all relevant computations.
(ii) Prove that the jth approximation xj is given by

xj = (0.8)^{j−1} ( 9, (−1)^{j−1} )^T.
Exercise 84 State the definition of an inner product for a real linear space.
6.2 Minimization with A-Conjugate Search Directions
In the previous example we have seen that orthogonal search directions can lead to a rather
slow convergence. Therefore we will now modify the algorithm such that it uses so-called
A-conjugate search directions.
Definition 6.3 (A-conjugate vectors)
Let A ∈ Rn×n be a symmetric and positive definite matrix. The vectors p1, p2, . . . , pm ∈ Rn \ {0} are called A-conjugate if

pj^T A pk = 0   for all 1 ≤ j, k ≤ m with j ≠ k.
To understand the meaning of A-conjugate vectors better, we discuss some of their properties.
Lemma 6.4 (properties of A-conjugate vectors)
Let A ∈ Rn×n be a symmetric and positive definite matrix, and let the vectors p1, p2, . . . , pm ∈ Rn \ {0} be A-conjugate. Then the following holds true:

(i) Since A is positive definite, we have pj^T A pj > 0 for j = 1, 2, . . . , m.

(ii) The A-conjugate vectors p1, p2, . . . , pm are linearly independent.

(iii) There can be at most n A-conjugate vectors, that is, m ≤ n.

(iv) We can introduce an A-inner product via

〈x, y〉_A := x^T A y.

With respect to this inner product, the A-conjugate vectors p1, p2, . . . , pm are orthogonal.
Proof of Lemma 6.4. Proof of (i): The statement (i) is clear from the positive definiteness
of the matrix A.
Proof of (ii): To see that the vectors p1, p2, . . . , pm are linearly independent, consider

c1 p1 + c2 p2 + . . . + ck pk + . . . + cm pm = 0,   (6.10)

and multiply from the left with pk^T A. Then

c1 pk^T A p1 + c2 pk^T A p2 + . . . + ck pk^T A pk + . . . + cm pk^T A pm = 0.

Since pk^T A pj = 0 for k ≠ j, we find, using pk^T A pk > 0,

ck pk^T A pk = 0  ⇔  ck = 0.

Since k was arbitrary, we conclude that in (6.10) all coefficients c1, c2, . . . , cm have to be zero. Thus we have verified that the vectors p1, p2, . . . , pm are linearly independent. This proves (ii).
Proof of (iii): Since we know from (ii) that the vectors p1, p2, . . . , pm ∈ Rn are linearly independent, we must have m ≤ n.

Proof of (iv): To see that 〈x, y〉_A = x^T A y is an inner product, we note that 〈x, x〉_A = x^T A x > 0 for all x ∈ Rn \ {0} and 〈0, 0〉_A = 0, since A is positive definite. The other properties of an inner product are easily checked and are left as an exercise. 2
Exercise 85 Let A ∈ Rn×n be symmetric and positive definite. Show that 〈x, y〉_A := x^T A y defines an inner product for Rn. Conclude that ‖x‖_A = √(〈x, x〉_A) = √(x^T A x) defines a vector norm for Rn.
For A-conjugate search directions it turns out that the generic minimization algorithm (see
Algorithm 3) terminates after at most n steps and gives the exact solution. So the iterative
method is in some sense a direct method. However, for large n, say n = 10^6, one would usually
not execute all n steps, but rather stop the algorithm beforehand, once a sufficient accuracy is
reached.
Theorem 6.5 (generic minimization with A-conjugate search directions)
Let A ∈ Rn×n be symmetric and positive definite, and let b ∈ Rn. Let x1 ∈ Rn be given,
and assume that the search directions p1,p2, . . . ,pn ∈ Rn \ {0} are A-conjugate. Then the
generic minimization Algorithm 3 terminates after at most n steps with the solution x of
Ax = b.
Proof of Theorem 6.5. Since the n search directions are A-conjugate and thus linearly
independent, they form a basis of Rn. Therefore, we can represent the vector x − x1 in this
basis, which means that we can find β1, β2, . . . , βn with
x − x1 = Σ_{j=1}^{n} βj pj  ⇔  x = x1 + Σ_{j=1}^{n} βj pj.   (6.11)

Furthermore, from the generic algorithm, we have xj+1 = xj + αj pj, j = 1, 2, . . . , n. Thus we can conclude that the approximations xk have the representation

xk = x1 + Σ_{j=1}^{k−1} αj pj,   1 ≤ k ≤ n + 1.   (6.12)
Comparing (6.11) and (6.12) with k = n + 1, we see that it is sufficient to show that αi = βi
for all 1 ≤ i ≤ n. Then we know that xn+1 = x.
From Algorithm 3, we have

αi = pi^T ri / ( pi^T A pi ),   1 ≤ i ≤ n.   (6.13)
To compute the coefficients βi, we first observe that, using (6.11),

b − A x1 = A x − A x1 = A ( x − x1 ) = A ( Σ_{j=1}^{n} βj pj ) = Σ_{j=1}^{n} βj A pj.   (6.14)
Hence, multiplying (6.14) from the left by pi^T,

pi^T ( b − A x1 ) = pi^T ( Σ_{j=1}^{n} βj A pj ) = Σ_{j=1}^{n} βj pi^T A pj = βi pi^T A pi,

because pi^T A pj = 0 for j ≠ i (as the directions p1, p2, . . . , pn are A-conjugate). This gives the explicit representation

βi = pi^T ( b − A x1 ) / ( pi^T A pi ) = pi^T r1 / ( pi^T A pi ),   1 ≤ i ≤ n,   (6.15)
which differs, at first sight, from αi in the numerator (see (6.13)). Fortunately, from (6.12) we can conclude that

ri = b − A xi = b − A ( x1 + Σ_{j=1}^{i−1} αj pj ) = ( b − A x1 ) − Σ_{j=1}^{i−1} αj A pj = r1 − Σ_{j=1}^{i−1} αj A pj,

and thus, employing again the fact that pi^T A pj = 0 for j ≠ i (as the directions p1, p2, . . . , pn are A-conjugate),

pi^T ri = pi^T ( r1 − Σ_{j=1}^{i−1} αj A pj ) = pi^T r1 − Σ_{j=1}^{i−1} αj pi^T A pj = pi^T r1,   1 ≤ i ≤ n.   (6.16)

Thus, we find from (6.13), (6.15), and (6.16),

βi = pi^T r1 / ( pi^T A pi ) = pi^T ri / ( pi^T A pi ) = αi,   1 ≤ i ≤ n.
Hence, from (6.11) and (6.12) with k = n + 1 and αi = βi, 1 ≤ i ≤ n, we have xn+1 = x. Thus
after at most n steps of the algorithm, we arrive at the true solution x of Ax = b. 2
It remains to answer the question of how the search directions are actually determined. Obviously, we do not want to determine the search directions a priori, but would rather like to determine them during the iteration process, hoping for our iteration to terminate in significantly fewer than n steps.
Assume we have already created search directions p1, p2, . . . , pj that are A-conjugate. In the case of A xj+1 = b we can terminate the algorithm and hence do not need to compute another direction. Otherwise, we could try to compute the next direction by setting

pj+1 = b − A xj+1 + Σ_{k=1}^{j} βj,k pk = rj+1 + Σ_{k=1}^{j} βj,k pk,   (6.17)

that is, the new search direction is the residual rj+1 plus a linear combination of the previous search directions. The coefficients βj,k are determined from the conditions that pj+1 is A-conjugate to the search directions p1, p2, . . . , pj. Thus we obtain the conditions

0 = pj+1^T A pi = rj+1^T A pi + Σ_{k=1}^{j} βj,k pk^T A pi = rj+1^T A pi + βj,i pi^T A pi,   1 ≤ i ≤ j.

Solving for βj,i yields

βj,i = −rj+1^T A pi / ( pi^T A pi ),   1 ≤ i ≤ j.   (6.18)
Surprisingly, βj,1, . . . , βj,j−1 vanish automatically, as we will see later on (see Remark 6.10 in Section 6.3), so that only the coefficient

βj+1 := βj,j = −rj+1^T A pj / ( pj^T A pj )

remains and the new direction is given by

pj+1 = rj+1 + βj+1 pj.
This choice of the search directions in Algorithm 3 gives the first version of the conjugate
gradient (CG) method, see Algorithm 5 below.
We observe that Algorithm 5 is, apart from the choice of the new search direction pj+1, identical
with Algorithm 3; only the steps 2, 7, and 8 in Algorithm 5 differ from Algorithm 3.
Algorithm 5 CG method
1: Choose x1.
2: Let p1 = r1 = b − A x1 and j = 1.
3: while rj ≠ 0 do
4:    αj = pj^T rj / ( pj^T A pj )
5:    xj+1 = xj + αj pj
6:    rj+1 = b − A xj+1
7:    βj+1 = −rj+1^T A pj / ( pj^T A pj )
8:    pj+1 = rj+1 + βj+1 pj
9:    j = j + 1
10: end while
It is now time to show that the chosen search directions are indeed A-conjugate.
Theorem 6.6 (properties of pj and rj)
Let A ∈ Rn×n be symmetric and positive definite, let b ∈ Rn, and let x1 ∈ Rn be an arbitrary
starting point. The vectors pj and rj introduced in the CG method (see Algorithm 5 above)
satisfy the following equations:
(1) pi^T A pj = 0 for 1 ≤ j ≤ i − 1,

(2) ri^T rj = 0 for 1 ≤ j ≤ i − 1,

(3) pj^T rj = rj^T rj for 1 ≤ j ≤ i.
The CG-method terminates at the latest with j = n + 1, and if it terminates with j steps
then rj+1 = b − Axj+1 = 0 and pj+1 = 0, that is, xj+1 is the solution of Ax = b. Hence,
the CG method needs at most n steps to compute the solution x of Ax = b.
Proof of Theorem 6.6. We will prove (1) to (3) by induction over i.

Initial step: For i = 1 there is nothing to show for (1) and (2); (3) follows because of p1 = r1. To get an initial step for all three equations we have to consider also i = 2. We find

p2^T A p1 = ( r2 + β2 p1 )^T A p1 = r2^T A p1 + β2 p1^T A p1 = r2^T A p1 − ( r2^T A p1 / ( p1^T A p1 ) ) p1^T A p1 = 0,

where we have used the definition of β2. This verifies (1) for i = 2. To verify (2), we use p1 = r1 and the definition of α1:

r2^T r1 = ( b − A x2 )^T r1 = ( b − A ( x1 + α1 p1 ) )^T r1 = ( r1 − α1 A p1 )^T r1
        = ( r1 − α1 A p1 )^T p1 = r1^T p1 − α1 p1^T A p1 = r1^T p1 − ( p1^T r1 / ( p1^T A p1 ) ) p1^T A p1 = 0,

which verifies (2) for i = 2. Finally

p2^T r2 = ( r2 + β2 p1 )^T r2 = r2^T r2 + β2 p1^T r2,

and we have proved (3) for i = 2 if we can show that p1^T r2 = 0. To verify this we proceed as follows:

p1^T r2 = p1^T ( b − A x2 ) = p1^T ( b − A ( x1 + α1 p1 ) ) = p1^T ( r1 − α1 A p1 )
        = p1^T r1 − α1 p1^T A p1 = p1^T r1 − ( p1^T r1 / ( p1^T A p1 ) ) p1^T A p1 = 0,

where we have used the definition of α1 in the last step. Thus we have also verified (3) for i = 2, giving a full initial step in which all three statements have been verified for i = 2.
Induction step: Let us now assume everything is satisfied for an arbitrary i with 1 ≤ i ≤ n. We first show that then (2) follows for i + 1. By definition, we have

ri+1 = b − A xi+1 = b − A ( xi + αi pi ) = b − A xi − αi A pi = ri − αi A pi.   (6.19)

Thus, by the induction hypotheses (1) and (2) and rj = pj − βj pj−1, we have for j ≤ i − 1 immediately

ri+1^T rj = ( ri − αi A pi )^T rj = ri^T rj − αi pi^T A rj = ri^T rj − αi pi^T A ( pj − βj pj−1 )
          = ri^T rj − αi pi^T A pj + αi βj pi^T A pj−1 = 0.
For j = i we can use (6.19), the definition of αi, the identity

pi^T A ri = pi^T A ( pi − βi pi−1 ) = pi^T A pi − βi pi^T A pi−1 = pi^T A pi

(where we have used the induction hypothesis (1) in the last step), and the induction hypothesis (3) to conclude that

ri+1^T ri = ri^T ri − αi pi^T A ri = ri^T ri − ( pi^T ri / ( pi^T A pi ) ) pi^T A ri = ri^T ri − pi^T ri = 0,

which finishes the induction step for (2).
We proceed to show the induction step for (1). First of all, we have

pi+1^T A pj = ( ri+1 + βi+1 pi )^T A pj = ri+1^T A pj + βi+1 pi^T A pj.

In the case of j = i this leads to (using the definition of βi+1)

pi+1^T A pi = ri+1^T A pi + βi+1 pi^T A pi = ri+1^T A pi − ( ri+1^T A pi / ( pi^T A pi ) ) pi^T A pi = 0.

In the case of j ≤ i − 1 we first observe that αj = 0 would mean pj^T rj = rj^T rj = 0 (from the definition of αj and (3)), which is equivalent to rj = 0; thus if αj = 0 the iteration would have stopped before. Hence, we may assume that αj ≠ 0, such that we can use (6.19) to gain the representation

A pj = (1/αj) ( rj − rj+1 ).

Together with induction hypothesis (1) and statement (2) for i + 1 (which we have proved already) this leads to

pi+1^T A pj = ( ri+1 + βi+1 pi )^T A pj = ri+1^T A pj + βi+1 pi^T A pj
            = ri+1^T A pj = (1/αj) ri+1^T ( rj − rj+1 ) = (1/αj) ( ri+1^T rj − ri+1^T rj+1 ) = 0,

for 1 ≤ j ≤ i − 1. This proves (1) for i + 1.
Finally, for (3) we use the fact that f(x) = (1/2) x^T A x − b^T x attains its minimum in the direction of pi at xi+1. This means that the function ϕ(t) = f( xi+1 + t pi ) satisfies ϕ′(0) = 0, which, using the chain rule, is equivalent to

0 = ϕ′(0) = pi^T ∇f(xi+1) = pi^T ( A xi+1 − b ) = −pi^T ri+1.

From this and (2) (which we have proved already), we can conclude that

pi+1^T ri+1 = ( ri+1 + βi+1 pi )^T ri+1 = ri+1^T ri+1 + βi+1 pi^T ri+1 = ri+1^T ri+1,

which finalizes our induction step by proving (3) for i + 1.
For the computation of the next iteration step, we need that pi+1 ≠ 0. If this is not the case the method terminates. Indeed, if pi+1 = 0, then 0 = pi+1^T ri+1 = ri+1^T ri+1 = ‖ri+1‖2^2, and hence 0 = ri+1 = b − A xi+1, that is, we have produced the solution in the ith step. If the method does not terminate early then, after n steps, we have created n conjugate directions and Theorem 6.5 shows that xn+1 is the true solution x. 2
On the implementation side, it is possible to improve the first version (Algorithm 5 above) of the CG method. First of all, we can reduce the number of matrix-vector multiplications as follows: In the definition of rj+1 we can use (from (6.19))

rj+1 = rj − αj A pj   (6.20)

instead of rj+1 = b − A xj+1, thus eliminating one matrix-vector multiplication, since A pj was already computed previously (for the step-length αj).
Moreover, from (6.20) it follows that A pj = αj^{-1} ( rj − rj+1 ), such that

βj+1 = −rj+1^T A pj / ( pj^T A pj ) = −(1/αj) ( rj+1^T rj − rj+1^T rj+1 ) / ( pj^T A pj )
     = ( pj^T A pj / ( pj^T rj ) ) · rj+1^T rj+1 / ( pj^T A pj ) = rj+1^T rj+1 / ( rj^T rj ),

where we have used (2) in Theorem 6.6, the definition of αj, and (3) in Theorem 6.6. This leads to a version of the CG method that contains, in each iteration step, only one matrix-vector multiplication, namely A pj, and three inner products and three scalar-vector multiplications. This version of the CG method is stated as Algorithm 6 below.
Algorithm 6 CG method – improved more economical code
1: Choose x1.
2: Set p1 = r1 = b − A x1 and j = 1.
3: while ‖rj‖2 > ε do
4:    αj = ‖rj‖2^2 / ( pj^T A pj )    % store A pj
5:    xj+1 = xj + αj pj
6:    rj+1 = rj − αj A pj
7:    βj+1 = ‖rj+1‖2^2 / ‖rj‖2^2    % store ‖rj+1‖2^2
8:    pj+1 = rj+1 + βj+1 pj
9:    j = j + 1
10: end while
Example 6.7 (conjugate gradient method)
Consider the linear system Ax = b with

A = [ 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ]   and   b = ( 1, 1, −1 )^T.
Solve the linear system by performing the CG method by hand with the starting vector x1 =
0 = (0, 0, 0)T .
Solution: We use Algorithm 6.
1st step: We have

x1 = ( 0, 0, 0 )^T   and   p1 = r1 = b − A x1 = b = ( 1, 1, −1 )^T.
Since ‖r1‖2^2 = r1^T r1 = 3 and

A p1 = [ 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ] ( 1, 1, −1 )^T = ( 1, 3, −1 )^T,   p1^T A p1 = ( 1, 1, −1 ) ( 1, 3, −1 )^T = 5,

we find

α1 = ‖r1‖2^2 / ( p1^T A p1 ) = 3/5.
Thus

x2 = x1 + α1 p1 = ( 0, 0, 0 )^T + (3/5) ( 1, 1, −1 )^T = (3/5) ( 1, 1, −1 )^T

and

r2 = r1 − α1 A p1 = ( 1, 1, −1 )^T − (3/5) ( 1, 3, −1 )^T = (1/5) ( 2, −4, −2 )^T.
We have ‖r2‖2^2 = 24/25, and thus

β2 = ‖r2‖2^2 / ‖r1‖2^2 = (24/25)/3 = 8/25,

and the new search direction is given by

p2 = r2 + β2 p1 = (1/5) ( 2, −4, −2 )^T + (8/25) ( 1, 1, −1 )^T = (1/25) ( 18, −12, −18 )^T = (6/25) ( 3, −2, −3 )^T.
2nd step: Since ‖r2‖2 ≠ 0, we perform a second step of the CG method. We compute

A p2 = (6/25) [ 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ] ( 3, −2, −3 )^T = (6/25) ( 3, −6, −3 )^T = (18/25) ( 1, −2, −1 )^T.
Since from the first iteration step ‖r2‖2^2 = 24/25, and

p2^T A p2 = (6/25)(18/25) ( 3, −2, −3 ) ( 1, −2, −1 )^T = (108/625) · 10 = 216/125,

we find

α2 = ‖r2‖2^2 / ( p2^T A p2 ) = (24/25)/(216/125) = (24/25) · (125/216) = 5/9.
Thus the new approximation is given by

x3 = x2 + α2 p2 = (3/5) ( 1, 1, −1 )^T + (5/9)(6/25) ( 3, −2, −3 )^T = (1/15) ( 9, 9, −9 )^T + (1/15) ( 6, −4, −6 )^T = ( 1, 1/3, −1 )^T.
The new residual is given by

r3 = r2 − α2 A p2 = (1/5) ( 2, −4, −2 )^T − (5/9)(18/25) ( 1, −2, −1 )^T = (1/5) ( 2, −4, −2 )^T − (2/5) ( 1, −2, −1 )^T = ( 0, 0, 0 )^T.
Thus the CG algorithm terminates after two steps and provides the correct solution x = x3 =
(1, 1/3,−1)T . 2
The CG algorithm, as given in Algorithm 6, can be implemented with the following MATLAB
code:
function [x,J] = cg_method(A,b,z)
%
% executes the conjugate gradient method (CG method) for solving A*x=b
%
% input: A = symmetric and positive definite real n by n matrix
% b = n by 1 vector, right-hand side of the linear system
% z = start n by 1 vector for the CG iterations
%
% output: x = the approximate solution for the CG-method
% J = number of iterations
%
x = z;
r = b - A * x;
rnorm = (norm(r))^2;
p = r;
J = 0;                         % iteration counter
while norm(r) > 10^(-6)
    q = A * p;                 % the only matrix-vector product per step
    alpha = rnorm / (p' * q);
    x = x + alpha * p;
    r = r - alpha * q;
    rnorm_new = (norm(r))^2;
    beta = rnorm_new / rnorm;
    rnorm = rnorm_new;
    p = r + beta * p;
    J = J + 1;
end
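
For instance (a usage sketch, assuming the function above is on the path), the hand computation of Example 6.7 is reproduced by:

A = [3/2, 0, 1/2; 0, 3, 0; 1/2, 0, 3/2];
b = [1; 1; -1];
[x, J] = cg_method(A, b, zeros(3,1))
% terminates with x = (1, 1/3, -1)^T after J = 2 iteration steps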
Exercise 86 Consider the following linear system:

Ax = b,   where A = [ 2 −1 0 ; −1 2 −1 ; 0 −1 2 ]   and   b = ( 1, 0, 1 )^T.
(a) Show that A is positive definite.
(b) Apply two iterations of the conjugate gradient method to the problem Ax = b to obtain x3
starting with x1 = 0. Do all the computations by hand and give all relevant scalars and
vectors. Calculate the residual for x3 and comment on your answer.
Exercise 87 Consider the following linear system:

Ax = b,   where A = [ 2 0 1 ; 0 2 0 ; 1 0 2 ]   and   b = ( 1, 1, −1 )^T.
(a) Show that A is positive definite.
(b) Apply two iterations of the conjugate gradient method to the problem Ax = b to obtain x3
starting with x1 = 0. Do all the computations by hand and give all relevant scalars and
vectors. Calculate the residual for x3 and comment on your answer.
6.3 Convergence of the Conjugate Gradient Method
In the following theorems we show some of the properties of the CG algorithm. We will investigate convergence in the A-norm or energy norm, defined by

‖x‖_A = √(〈x, x〉_A) = √(x^T A x).

Since 〈x, y〉_A = x^T A y is an inner product for Rn (see Lemma 6.4 (iv)), ‖ · ‖_A is clearly a norm for Rn.
Definition 6.8 (ith Krylov space)
Let A ∈ Rn×n and p ∈ Rn be given. For i ∈ N we define the ith Krylov space to A and p as

K_i(p, A) = span{ p, A p, A^2 p, . . . , A^{i−1} p }.
In general, we only have dim( K_i(p, A) ) ≤ i, since p could, for example, be from the null space of A, that is, A p = 0, and then dim( K_i(p, A) ) ≤ 1 for any i ∈ N. Or if p is an eigenvector of A with eigenvalue λ, then A^k p = λ^k p for any k ∈ N0 and hence K_i(p, A) = span{p} for all i ∈ N. Fortunately, in our situation things are much better.
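
For small examples the Krylov spaces can be inspected explicitly. The following sketch (function name hypothetical) builds a matrix whose columns span K_i(p, A) and reads off the dimension as its numerical rank:

function [K,d] = krylov_basis(A,p,i)
% columns of K span K_i(p,A) = span{p, A*p, ..., A^{i-1}*p};
% d = dim K_i(p,A), computed as the numerical rank of K
K = zeros(length(p), i);
K(:,1) = p;
for k = 2:i
    K(:,k) = A * K(:,k-1);
end
d = rank(K);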
Lemma 6.9 (Krylov spaces in CG method)
Let A ∈ Rn×n be symmetric and positive definite. Let pi and ri be the vectors created during
the CG iterations, and let m ≤ n be the number of steps of the CG method (that is, rm+1 = 0
but rm ≠ 0). Then

K_i(p1, A) = span{ p1, . . . , pi } = span{ r1, . . . , ri }   (6.21)

for 1 ≤ i ≤ m. In particular, we have dim( K_i(p1, A) ) = i for 1 ≤ i ≤ m.
Proof of Lemma 6.9. The proof is given by induction on i.
Initial step: For i = 1 the statements follow because of p1 = r1 ≠ 0.
Induction step: For general i = j + 1 the claim follows by employing the relation

pj+1 = rj+1 + βj+1 pj = ( rj − αj A pj ) + βj+1 pj = ( rj + βj+1 pj ) − αj A pj,   (6.22)

which is a consequence of (6.19), and from the induction assumption. We will now show this in detail.

Assume that the statement holds for all 1 ≤ i ≤ j < m. Then we have to show the statement for i = j + 1. From (6.21) for i = j, the first term in the last representation in (6.22) is in K_j(p1, A), that is,

rj + βj+1 pj ∈ K_j(p1, A) ⊂ K_{j+1}(p1, A).
From (6.21) for i = j, the vector pj is in Kj(p1, A) and can be written as
pj = ∑_{k=0}^{j−1} γk A^k p1,
with some coefficients γ0, γ1, . . . , γj−1. Thus the vector Apj in the second term in the last
representation in (6.22) can be written as
Apj = A ( ∑_{k=0}^{j−1} γk A^k p1 ) = ∑_{k=0}^{j−1} γk A^{k+1} p1 = ∑_{ℓ=1}^{j} γ_{ℓ−1} A^ℓ p1,
and thus Apj is in Kj+1(p1, A). Thus we see from (6.22) that pj+1 is in Kj+1(p1, A), and
therefore, using (6.21) for i = j,
span{p1,p2, . . . ,pj+1} ⊂ Kj+1(p1, A). (6.23)
By the definition of pj+1,
pj+1 = rj+1 + βj+1 pj ⇔ rj+1 = pj+1 − βj+1 pj, (6.24)
we see, from (6.24) and the assumption that (6.21) holds for i = j, that
span{p1,p2, . . . ,pj+1} = span{r1, r2, . . . , rj+1}. (6.25)
Finally, by the assumption that (6.21) holds for i = j, we have that
A^j p1 = A (A^{j−1} p1) = A ( ∑_{k=1}^{j} μk pk ) = A ( ∑_{k=1}^{j−1} μk pk ) + μj Apj, (6.26)
with some coefficients µ1, µ2, . . . , µj. From the assumption that (6.21) holds for i = j, the first
term in the last expression in (6.26) can be written as
A ( ∑_{k=1}^{j−1} μk pk ) = A ( ∑_{ℓ=0}^{j−2} δℓ A^ℓ p1 ) = ∑_{ℓ=0}^{j−2} δℓ A^{ℓ+1} p1 = ∑_{ℓ=1}^{j−1} δ_{ℓ−1} A^ℓ p1,
with some coefficients δ0, δ1, . . . , δj−2. Thus the first term in the last representation in (6.26) is
in Kj(p1, A) = span{r1, r2, . . . , rj}. Since αj ≠ 0 for j ≤ m, from (6.19),
Apj = (1/αj) (rj − rj+1), (6.27)
and we see that the second term in the last representation of (6.26) is in span{r1, r2, . . . , rj+1}. Thus A^j p1 is in span{r1, r2, . . . , rj+1}, and from (6.21) for i = j, we conclude
Kj+1(p1, A) ⊂ span{r1, r2, . . . , rj+1}. (6.28)
Combining (6.23), (6.25), and (6.28), yields that (6.21) holds also for i = j + 1.
It remains to show that dim(Ki(p1, A)) = i. Since the CG method runs through m steps and i ≤ m, we have ri ≠ 0. Thus by the definition of pi, we have pi ≠ 0, and hence the search directions p1,p2, . . . ,pi are non-zero and, from Theorem 6.6, A-conjugate. Thus, from Lemma 6.4, the search directions p1,p2, . . . ,pi are linearly independent, and hence
dim(Ki(p1, A)) = dim(span{p1,p2, . . . ,pi}) = i,
which concludes the proof. 2
Now we can show that in formula (6.18) on page 133 βj,i = 0 for 1 ≤ i ≤ j − 1.
Remark 6.10 (proof that βj,i = 0 for 1 ≤ i ≤ j − 1 in formula (6.18))
The definition of the new search direction (6.17) was
pj+1 = rj+1 + ∑_{k=1}^{j} βj,k pk,
and we determined, from the demand that the search directions p1,p2, . . . ,pj+1 are A-
conjugate,
βj,i = − ( rj+1^T Api ) / ( pi^T Api ), 1 ≤ i ≤ j.
We claimed that βj,i = 0 for 1 ≤ i ≤ j − 1, and now we are in the position to prove this
claim: From (6.19), using that αi ≠ 0 for 1 ≤ i ≤ j − 1 (since the CG method has not yet
terminated),
Api = (1/αi) (ri − ri+1),
and thus for 1 ≤ i ≤ j − 1
rj+1^T Api = (1/αi) rj+1^T (ri − ri+1) = (1/αi) ( rj+1^T ri − rj+1^T ri+1 ) = 0
from Theorem 6.6 (2).
Recall that the true solution x of Ax = b and the approximations xi of the CG method can
be written as (see (6.12))
x = x1 + ∑_{j=1}^{n} αj pj and xi = x1 + ∑_{j=1}^{i−1} αj pj, 1 ≤ i ≤ n + 1. (6.29)
Moreover, by Lemma 6.9, it is possible to represent an arbitrary element x̃ from the affine linear space x1 + Ki−1(p1, A) by
x̃ = x1 + ∑_{j=1}^{i−1} βj pj, (6.30)
and, in particular, xi ∈ x1 + Ki−1(p1, A). Since the directions p1,p2, . . . ,pn are A-conjugate,
we can conclude from (6.29) and (6.30) that
‖x − xi‖²A = ‖ ∑_{j=i}^{n} αj pj ‖²A = ( ∑_{j=i}^{n} αj pj )^T A ( ∑_{k=i}^{n} αk pk )
= ∑_{j=i}^{n} ∑_{k=i}^{n} αj αk pj^T Apk = ∑_{j=i}^{n} |αj|² pj^T Apj
≤ ∑_{j=i}^{n} |αj|² pj^T Apj + ∑_{j=1}^{i−1} |αj − βj|² pj^T Apj
= ‖ ∑_{j=i}^{n} αj pj + ∑_{j=1}^{i−1} (αj − βj) pj ‖²A
= ‖ ∑_{j=1}^{n} αj pj − ∑_{j=1}^{i−1} βj pj ‖²A
= ‖ x − x1 − ∑_{j=1}^{i−1} βj pj ‖²A = ‖x − x̃‖²A,
where x̃ is the vector given by (6.30). Since this holds for an arbitrary x̃ ∈ x1 + Ki−1(p1, A), we have proved that
‖x − xi‖²A ≤ ‖x − x̃‖²A for all x̃ ∈ x1 + Ki−1(p1, A).
We formulate this as a theorem.
Theorem 6.11 (CG method gives best approximations in affine Krylov spaces)
Let A ∈ Rn×n be symmetric and positive definite and b ∈ Rn. Let the CG method applied to solving Ax = b stop after m ≤ n steps. The approximation xi, i ∈ {1, 2, . . . , m}, from the CG method gives the best approximation to the solution x of Ax = b in the affine Krylov space x1 + Ki−1(p1, A) with respect to the A-norm ‖ · ‖A. That is,
‖x − xi‖A ≤ ‖x − x̃‖A for all x̃ ∈ x1 + Ki−1(p1, A). (6.31)
We note that (6.31) implies that
‖x − xi‖A = min_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A = inf_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A.
The same idea shows that the iteration sequence is ‘monotone’, that is, ‖x − xi‖A is strictly monotonically decreasing as i increases.
Corollary 6.12 (error of CG iterations is decreasing)
Let A ∈ Rn×n be symmetric and positive definite and b ∈ Rn. Let the CG method for the solution of Ax = b stop after m ≤ n steps. The sequence {xi} of approximations xi of the CG method is monotone in the sense that
‖x − xi+1‖A < ‖x − xi‖A for all 1 ≤ i ≤ m. (6.32)
Before we give the formal proof of Corollary 6.12, we observe that (6.32) with ≤ instead of < is
entirely natural from Theorem 6.11: Since xi is the best approximation of x in x1+Ki−1(p1, A),
and since
x1 + Ki−1(p1, A) ⊂ x1 + Ki(p1, A),
the best approximation in the larger affine space x1 +Ki(p1, A) cannot have a larger error than
the best approximation in x1 + Ki−1(p1, A).
Proof of Corollary 6.12. Using the same notation as in the proof of Theorem 6.11, we have
from (6.29)
x − xi = ∑_{j=i}^{n} αj pj = ∑_{j=i+1}^{n} αj pj + αi pi = (x − xi+1) + αi pi.
Thus, using ‖y‖²A = y^T Ay and the fact that pi is A-conjugate to x − xi+1 = ∑_{j=i+1}^{n} αj pj,
‖x − xi‖²A = ‖(x − xi+1) + αi pi‖²A
= [ (x − xi+1) + αi pi ]^T A [ (x − xi+1) + αi pi ]
= (x − xi+1)^T A (x − xi+1) + 2 αi (x − xi+1)^T Api + |αi|² pi^T Api
= ‖x − xi+1‖²A + |αi|² pi^T Api ≥ ‖x − xi+1‖²A, (6.33)
where we have used in the last step that pi^T Api ≥ 0 since A is positive definite.
Since the CG method stops after m steps and since i ≤ m, we have ri ≠ 0, and, from Theorem 6.6 (3), pi ≠ 0 and αi ≠ 0, and thus |αi|² pi^T Api > 0. Thus from (6.33) ‖x − xi‖A > ‖x − xi+1‖A. 2
Next, let us rewrite this approximation problem in the original basis of the Krylov space. In
doing so, we denote the set of all polynomials of degree less than or equal to n by πn.
For a polynomial P(t) = ∑_{j=0}^{i−1} γj t^j in πi−1 and a matrix A ∈ Rn×n we write
P(A) := ∑_{j=0}^{i−1} γj A^j and P(A)x = ∑_{j=0}^{i−1} γj A^j x.
If x is an eigenvector of A with eigenvalue λ, that is, Ax = λx, then, from A^j x = λ^j x, clearly
P(A)x = ∑_{j=0}^{i−1} γj λ^j x = ( ∑_{j=0}^{i−1} γj λ^j ) x = P(λ)x. (6.34)
Theorem 6.13 (estimate of best approximation in affine Krylov space)
Let A ∈ Rn×n be symmetric and positive definite, having the eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn > 0. Let b ∈ Rn. Let x denote the solution of Ax = b, and let xi, i = 1, 2, . . . , m + 1, denote the approximations from the CG method, where we assume that the CG method stops after m ≤ n steps. Then
‖x − xi‖A ≤ ( inf_{P ∈ πi−1, P(0)=1} max_{1≤j≤n} |P(λj)| ) ‖x − x1‖A.
Proof of Theorem 6.13. Let us express an arbitrary x̃ ∈ x1 + Ki−1(p1, A), where i ∈ {1, 2, . . . , m + 1}, as
x̃ = x1 + ∑_{j=1}^{i−1} γj A^{j−1} p1 =: x1 + Q(A)p1, (6.35)
introducing the polynomial
Q(t) = ∑_{j=1}^{i−1} γj t^{j−1}
in πi−2, where we set Q = 0 for i = 1. Using p1 = r1 = b − Ax1 = Ax − Ax1 = A (x − x1), we obtain for x̃, given by (6.35),
x − x̃ = x − (x1 + Q(A)p1) = (x − x1) − Q(A) A (x − x1) = (I − Q(A) A) (x − x1).
Therefore, using (I − Q(A) A)T = I − A Q(A) = I − Q(A) A (since AT = A and since A
commutes with Q(A)),
‖x − x̃‖²A = ‖(I − Q(A) A) (x − x1)‖²A
= (x − x1)^T (I − Q(A) A) A (I − Q(A) A) (x − x1)
=: (x − x1)^T P(A) A P(A) (x − x1), (6.36)
with the polynomial P in πi−1 defined by
P(t) = 1 − t Q(t) = 1 − ∑_{j=1}^{i−1} γj t^j.
Clearly P(0) = 1, and thus we have shown, for every x̃ ∈ x1 + Ki−1(p1, A) with i ∈ {1, 2, . . . , m + 1}, that
‖x − x̃‖²A = (x − x1)^T P(A) A P(A) (x − x1),
with some polynomial P ∈ πi−1 satisfying P(0) = 1.
with some polynomial P ∈ πi−1 satisfying P (0) = 1.
On the other hand, if P ∈ πi−1 with P(0) = 1 is given, then we can define the polynomial Q̃(t) = (1 − P(t))/t in πi−2, which leads to an element x̃ from x1 + Ki−1(p1, A), defined by x̃ = x1 + Q̃(A)p1.
For this x̃ the calculation in (6.36) above holds, with
P̃(t) = 1 − t Q̃(t) = 1 − t · (1 − P(t))/t = P(t).
Thus we see that for every P ∈ πi−1 with P(0) = 1, there exists some x̃ in x1 + Ki−1(p1, A) such that
‖x − x̃‖²A = (x − x1)^T P(A) A P(A) (x − x1).
Thus we have shown that
inf_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A = inf_{P ∈ πi−1, P(0)=1} √( (x − x1)^T P(A) A P(A) (x − x1) ). (6.37)
Theorem 6.11 yields therefore, for i = 1, 2, . . . , m + 1,
‖x − xi‖A = min_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A = inf_{x̃ ∈ x1+Ki−1(p1,A)} ‖x − x̃‖A
= inf_{P ∈ πi−1, P(0)=1} √( (x − x1)^T P(A) A P(A) (x − x1) ). (6.38)
Next, let w1,w2, . . . ,wn be an orthonormal basis of Rn consisting of eigenvectors of the positive
definite symmetric matrix A associated to the eigenvalues λ1, λ2, . . . , λn. Then, we can represent
every vector using this basis. In particular, with such a representation
x − x1 = ∑_{j=1}^{n} ρj wj,
with some coefficients ρ1, ρ2, . . . , ρn ∈ R. Thus we can conclude that
P(A) A P(A) (x − x1) = ∑_{j=1}^{n} ρj P(A) A P(A)wj = ∑_{j=1}^{n} ρj [P(λj)]² λj wj,
where we have used (6.34). This leads to (where we use wTj wk = 0 if j 6= k)
(x − x1
)TP (A) A P (A)
(x − x1
)=
n∑
j=1
n∑
k=1
ρk ρj
[P (λj)
]2λj wT
k wj
=n∑
j=1
|ρj |2 |P (λj)|2 λj
148 6.3. Convergence of the Conjugate Gradient Method
≤(
max1≤ℓ≤n
|P (λℓ)|2) n∑
j=1
|ρj |2 λj
=
(max1≤ℓ≤n
|P (λℓ)|2)( n∑
j=1
ρj wj
)T
A
(n∑
k=1
ρk wk
)
=
(max1≤ℓ≤n
|P (λℓ)|2) ∥∥x − x1
∥∥2
A.
Substituting this into (6.38) and using
max_{1≤ℓ≤n} |P(λℓ)|² = ( max_{1≤ℓ≤n} |P(λℓ)| )²
gives the desired inequality. 2
Since clearly
max_{1≤ℓ≤n} |P(λℓ)| ≤ max_{λ ∈ [λn,λ1]} |P(λ)|,
we have, from Theorem 6.13, the following upper bound for the error:
‖x − xi‖A ≤ ( inf_{P ∈ πi−1, P(0)=1} ‖P‖_{L∞([λn,λ1])} ) ‖x − x1‖A, (6.39)
with the supremum norm
‖P‖_{L∞([a,b])} = sup_{x ∈ [a,b]} |P(x)|.
Note that the smallest eigenvalue λn and the largest eigenvalue λ1 can be replaced by estimates λ̃n ≤ λn and λ̃1 ≥ λ1.
As stated in the theorem below (which we will not prove), the minimum of the term in the round parentheses in (6.39) can be computed explicitly. Its solution involves the Chebychev polynomials defined by
Tn(t) := cos( n arccos(t) ), t ∈ [−1, 1].
We observe that clearly
|Tn(t)| ≤ 1 for all t ∈ [−1, 1], (6.40)
since |cos(ϕ)| ≤ 1 for all ϕ ∈ R.
Theorem 6.14 (minimization problem)
Let λ1 > λn > 0. In the problem
inf{ ‖P‖_{L∞([λn,λ1])} : P ∈ πi−1 with P(0) = 1 }
the infimum is attained by
P*(t) = Ti−1( (2t − λ1 − λn)/(λ1 − λn) ) / Ti−1( (λ1 + λn)/(λ1 − λn) ), t ∈ [λn, λ1].
The Chebychev polynomials satisfy the following inequality:
(1/2) ( (1 + √t)/(1 − √t) )^n ≤ | Tn( (1 + t)/(1 − t) ) |, t ∈ [0, 1). (6.41)
To apply this estimate, and derive the final estimate on the convergence of the CG method, we set γ = λn/λ1 ∈ (0, 1) and use
(λ1 + λn)/(λ1 − λn) = (1 + λn/λ1)/(1 − λn/λ1) = (1 + γ)/(1 − γ). (6.42)
From (6.39), Theorem 6.14, and (6.41), we have
‖x − xi‖A ≤ ( inf_{P ∈ πi−1, P(0)=1} ‖P‖_{L∞([λn,λ1])} ) ‖x − x1‖A
= ( sup_{t ∈ [λn,λ1]} | Ti−1( (2t − λ1 − λn)/(λ1 − λn) ) | / | Ti−1( (λ1 + λn)/(λ1 − λn) ) | ) ‖x − x1‖A
≤ ( 1 / | Ti−1( (1 + γ)/(1 − γ) ) | ) ‖x − x1‖A
≤ 2 ( (1 − √γ)/(1 + √γ) )^{i−1} ‖x − x1‖A, (6.43)
where we have used (6.40) and (6.42) in the third step and the lower bound from (6.41) in the fourth step. Finally, (6.43) and the fact that (see Remark 3.4)
γ = λn/λ1 ⇒ 1/γ = λ1 · (1/λn) = ‖A‖2 ‖A−1‖2 = κ2(A)
show the following result.
Theorem 6.15 (CG method error estimate)
Let A ∈ Rn×n be symmetric and positive definite, and let λ1 ≥ λ2 ≥ . . . ≥ λn > 0 be the eigenvalues of A. Let b ∈ Rn, and let x be the solution of Ax = b. The sequence of iterations {xi} generated by the CG method satisfies the error estimate
‖x − xi‖A ≤ 2 ‖x − x1‖A ( (1 − √γ)/(1 + √γ) )^{i−1} = 2 ‖x − x1‖A ( (√κ2(A) − 1)/(√κ2(A) + 1) )^{i−1},
where γ = 1/κ2(A) = λn/λ1.
The method converges even if the matrix is almost singular (that is, if λd+1 ≈ . . . ≈ λn ≈ 0), because (√κ2(A) − 1)/(√κ2(A) + 1) < 1, but the convergence is then slow. If λd+1 ≈ . . . ≈ λn ≈ 0, then κ2(A) is very large, and one should try to use appropriate matrix manipulations to reduce the condition number of A, such as a transformation of the smallest eigenvalues to zero and of the bigger eigenvalues to an interval of the form (λ, λ(1 + ε)) ⊂ (0,∞) with ε as small as possible. In general it does not matter if, during this process, the rank of the matrix is reduced. Generally, this leads only to a small additional error but also to a significant increase in the convergence rate of the CG method.
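To get a feeling for the error estimate in Theorem 6.15, the following lines (an illustrative sketch, not part of the original course code) evaluate the reduction factor 2 ((√κ2(A) − 1)/(√κ2(A) + 1))^{i−1} for a few condition numbers:

% Evaluate the CG error bound factor from Theorem 6.15 for
% several condition numbers kappa and iteration indices i.
for kappa = [10 100 1000]
    q = (sqrt(kappa) - 1) / (sqrt(kappa) + 1);   % convergence factor
    i = [1 6 11 16 21];                          % iteration indices
    bound = 2 * q.^(i - 1);                      % bounds on the relative A-norm error
    fprintf('kappa = %5d:', kappa); fprintf(' %9.2e', bound); fprintf('\n');
end

The factor deteriorates towards 1 as κ2(A) grows, which quantifies the remark above that a large condition number slows the CG method down.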
Exercise 88 Let A ∈ Rn×n be a real positive definite symmetric n × n matrix. Let λmin denote the smallest eigenvalue of A, and let λmax denote the largest eigenvalue of A. Prove that
λmin = inf_{x ∈ Rn\{0}} (x^T Ax)/(x^T x) and λmax = sup_{x ∈ Rn\{0}} (x^T Ax)/(x^T x).
Chapter 7
Calculation of Eigenvalues
In this last chapter of the lecture notes, we are concerned with the numerical computation of
the eigenvalues and eigenvectors of a square matrix. Let A ∈ Cn×n be a square matrix. A
non-zero vector x ∈ Cn is an eigenvector of A and λ ∈ C is its corresponding eigenvalue if
Ax = λx.
The eigenvalues are the roots of the characteristic polynomial
p(A, λ) = det(λ I − A).
However, for larger matrices it is not practicable to actually compute the characteristic polyno-
mial, let alone its roots. Hence, we will look at more feasible methods to numerically determine
eigenvalues and eigenvectors.
7.1 Basic Localisation Techniques
A very rough way of getting an estimate of the location of the eigenvalues is given by the
following theorem.
Theorem 7.1 (Gershgorin disks)
The eigenvalues of a matrix A in Cn×n (or in Rn×n) are contained in the union ⋃_{j=1}^{n} Kj of the disks
Kj := { λ ∈ C : |λ − aj,j| ≤ ∑_{k=1, k≠j}^{n} |aj,k| }, 1 ≤ j ≤ n.
Proof of Theorem 7.1. For an eigenvalue λ ∈ C of A we can choose an eigenvector
x ∈ Cn \ {0} with ‖x‖∞ = max_{1≤i≤n} |xi| = 1. From Ax − λx = 0 we can conclude that
( ∑_{k=1}^{n} ai,k xk ) − λ xi = (ai,i − λ) xi + ∑_{k=1, k≠i}^{n} ai,k xk = 0, 1 ≤ i ≤ n,
and thus
(ai,i − λ) xi = − ∑_{k=1, k≠i}^{n} ai,k xk, 1 ≤ i ≤ n.
If we pick an index i = j with |xj| = ‖x‖∞ = 1 then the statement follows via
|aj,j − λ| = |(aj,j − λ) xj| = | ∑_{k=1, k≠j}^{n} aj,k xk | ≤ ∑_{k=1, k≠j}^{n} |aj,k| |xk| ≤ ( max_{1≤i≤n} |xi| ) ∑_{k=1, k≠j}^{n} |aj,k| = ∑_{k=1, k≠j}^{n} |aj,k|,
where we have used ‖x‖∞ = 1 by assumption. Therefore, we know that the eigenvalue λ is
contained in the disk Kj where j is an index for which |xj | = ‖x‖∞ = 1. Since the eigen-
value λ was arbitrary, we know that all eigenvalues are contained in the union of the disks Kj ,
j = 1, 2, . . . , n. 2
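The Gershgorin disks are cheap to compute. The following MATLAB sketch (illustrative; the function name gershgorin is not part of the original notes) returns the centres and radii of the disks K1, . . . , Kn:

function [centers,radii] = gershgorin(A)
%
% computes the Gershgorin disks of a square matrix A
%
% output: centers = diagonal entries a_jj (centres of the disks)
%         radii   = off-diagonal absolute row sums (radii of the disks)
%
centers = diag(A);
radii = sum(abs(A),2) - abs(centers);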
Exercise 89 Consider the matrix
A = ( 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ).
(a) Compute the exact eigenvalues of A by hand.
(b) Use Theorem 7.1 on the Gershgorin disks to estimate the location of the eigenvalues.
The Rayleigh quotient introduced below allows us to approximate an eigenvalue, if we are able
to approximate a corresponding eigenvector.
Definition 7.2 (Rayleigh quotient)
The Rayleigh quotient of a vector x ∈ Rn \ {0} with respect to a real matrix A ∈ Rn×n is the scalar
R(x) := (x^T Ax)/(x^T x).
The Rayleigh quotient of a vector x ∈ Cn \ {0} with respect to a complex matrix A ∈ Cn×n is the scalar
R(x) := (x∗ Ax)/(x∗ x).
Obviously, if x is an eigenvector then R(x) is the corresponding eigenvalue: Indeed, if Ax = λx,
then
R(x) = (x^T Ax)/(x^T x) = (λ x^T x)/(x^T x) = λ and R(x) = (x∗ Ax)/(x∗ x) = (λ x∗ x)/(x∗ x) = λ, (7.1)
respectively.
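In MATLAB the Rayleigh quotient is a one-line computation; the sketch below (illustrative, not from the original notes) covers both cases at once, since the operator ' is the conjugate transpose:

function R = rayleigh(A,x)
%
% Rayleigh quotient of the non-zero vector x with respect to A;
% x' is the conjugate transpose, so this covers R^n and C^n alike
%
R = (x' * A * x) / (x' * x);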
We will deal now with symmetric matrices A = AT in Rn×n. For such matrices it is known
(see Theorem 2.11) that they have n real eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn, and that there
are n corresponding orthonormal eigenvectors w1,w2, . . . ,wn ∈ Rn, that is, the eigenvectors
w1,w2, . . . ,wn satisfy
wj^T wk = δj,k, j, k = 1, 2, . . . , n.
In particular, w1,w2, . . . ,wn form an orthonormal basis of Rn. Then, every x ∈ Rn can be
represented as
x = ∑_{k=1}^{n} ck wk, (7.2)
and the coefficients can be determined via
wj^T x = ∑_{k=1}^{n} ck wj^T wk = ∑_{k=1}^{n} ck δj,k = cj ⇒ cj = wj^T x.
Moreover, the Euclidean norm of x, given by (7.2), can be computed via
‖x‖²2 = ( ∑_{j=1}^{n} cj wj )^T ( ∑_{k=1}^{n} ck wk ) = ∑_{j=1}^{n} ∑_{k=1}^{n} cj ck wj^T wk = ∑_{j=1}^{n} ∑_{k=1}^{n} cj ck δj,k = ∑_{k=1}^{n} ck².
Theorem 7.3 (convergence of the Rayleigh quotient)
Suppose A ∈ Rn×n is symmetric. Assume that we have a sequence {x(j)} of vectors in Rn
which converges to an eigenvector wJ of A with eigenvalue λJ , and assume that {x(j)} is
normalized, that is, ‖x(j)‖2 = 1 for all j. Then we have
lim_{j→∞} R(x(j)) = R(wJ) = λJ, (7.3)
and
| R(x(j)) − R(wJ) | = O( ‖x(j) − wJ‖²2 ). (7.4)
Proof of Theorem 7.3. Since the Rayleigh quotient R(x) depends continuously on x, the
first equality in (7.3) is clear and the second equality follows from (7.1).
It remains to show (7.4). For this purpose, we use that, since A is symmetric, there are n
eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn and n corresponding orthonormal eigenvectors w1,w2, . . . ,wn.
With this orthonormal basis of eigenvectors, x(j) has a representation
x(j) = ∑_{k=1}^{n} ck wk, (7.5)
where, to ease the notation, we suppress the fact that the coefficients ck = ck^{(j)} also depend on j. Then, since Awj = λj wj, we find
Ax(j) = A ( ∑_{k=1}^{n} ck wk ) = ∑_{k=1}^{n} ck Awk = ∑_{k=1}^{n} ck λk wk. (7.6)
Since wj^T wk = 0 if j ≠ k, and wj^T wj = ‖wj‖²2 = 1 for all j = 1, 2, . . . , n, and ‖x(j)‖²2 = (x(j))^T x(j) = 1 for all j, from (7.6) and (7.5),
R(x(j)) = ( (x(j))^T Ax(j) ) / ( (x(j))^T x(j) ) = (x(j))^T Ax(j)
= ( ∑_{i=1}^{n} ci wi )^T ( ∑_{k=1}^{n} ck λk wk )
= ∑_{k=1}^{n} ∑_{i=1}^{n} ci ck λk wi^T wk = ∑_{k=1}^{n} λk ck²,
where we have used wi^T wk = δi,k.
Using R(wJ) = λJ from (7.1), this gives
R(x(j)) − R(wJ) = ∑_{k=1}^{n} λk ck² − λJ = λJ (cJ² − 1) + ∑_{k=1, k≠J}^{n} λk ck².
Thus
| R(x(j)) − R(wJ) | ≤ |λJ| |cJ² − 1| + ∑_{k=1, k≠J}^{n} |λk| ck²
≤ ( max_{1≤i≤n} |λi| ) ( |cJ² − 1| + ∑_{k=1, k≠J}^{n} ck² )
= ( max_{1≤i≤n} |λi| ) ( |(2 cJ − 2) + (cJ² − 2 cJ + 1)| + ∑_{k=1, k≠J}^{n} ck² )
= ( max_{1≤i≤n} |λi| ) ( |2 (cJ − 1) + (cJ − 1)²| + ∑_{k=1, k≠J}^{n} ck² )
≤ ( max_{1≤i≤n} |λi| ) ( 2 |cJ − 1| + (cJ − 1)² + ∑_{k=1, k≠J}^{n} ck² ). (7.7)
From ‖x(j)‖2 = ‖wJ‖2 = 1 and (7.5) and wi^T wk = 0 if i ≠ k,
‖x(j) − wJ‖²2 = (x(j) − wJ)^T (x(j) − wJ) = ‖x(j)‖²2 + ‖wJ‖²2 − 2 wJ^T x(j)
= 2 − 2 wJ^T ( ∑_{k=1}^{n} ck wk ) = 2 − 2 ∑_{k=1}^{n} ck wJ^T wk = 2 − 2 cJ = 2 (1 − cJ). (7.8)
On the other hand, from (7.5),
‖x(j) − wJ‖²2 = ‖ ∑_{k=1}^{n} ck wk − wJ ‖²2 = ‖ (cJ − 1)wJ + ∑_{k=1, k≠J}^{n} ck wk ‖²2
= ( (cJ − 1)wJ + ∑_{i=1, i≠J}^{n} ci wi )^T ( (cJ − 1)wJ + ∑_{k=1, k≠J}^{n} ck wk )
= (cJ − 1)² wJ^T wJ + 2 (cJ − 1) ∑_{k=1, k≠J}^{n} ck wJ^T wk + ∑_{i=1, i≠J}^{n} ∑_{k=1, k≠J}^{n} ci ck wi^T wk
= (cJ − 1)² + ∑_{k=1, k≠J}^{n} ck², (7.9)
where we have used wJ^T wJ = 1, wJ^T wk = 0 for k ≠ J, and wi^T wk = δi,k.
Substituting (7.8) and (7.9) into the last line of (7.7) yields
| R(x(j)) − R(wJ) | ≤ ( max_{1≤i≤n} |λi| ) ( ‖x(j) − wJ‖²2 + ‖x(j) − wJ‖²2 ) = 2 ( max_{1≤i≤n} |λi| ) ‖x(j) − wJ‖²2,
which proves (7.4). 2
In analogy to Theorem 7.3, we can prove the following corresponding theorem for Hermitian
matrices in Cn×n.
Theorem 7.4 (convergence of Rayleigh quotient)
Suppose A ∈ Cn×n is Hermitian. Assume that we have a sequence {x(j)} of vectors in Cn
which converges to an eigenvector wJ of A with eigenvalue λJ , and assume that {x(j)} is
normalized, that is, ‖x(j)‖2 = 1 for all j. Then we have
lim_{j→∞} R(x(j)) = R(wJ) = λJ,
and
| R(x(j)) − R(wJ) | = O( ‖x(j) − wJ‖²2 ).
Theorems 7.3 and 7.4 tell us that the Rayleigh quotient converges with a quadratic order to λJ as {x(j)} converges to wJ.
7.2 The Power Method
In this section we no longer assume that A ∈ Rn×n is symmetric. Suppose A has a dominant eigenvalue λ1, that is, we have
|λ1| > |λ2| ≥ |λ3| ≥ · · · ≥ |λn|. (7.10)
For such matrices, it is now our goal to determine the dominant eigenvalue λ1 and a corresponding eigenvector. Also assume that there exist n real eigenvalues λ1, λ2, . . . , λn ∈ R and n linearly independent eigenvectors w1,w2, . . . ,wn ∈ Rn corresponding to the real eigenvalues λ1, λ2, . . . , λn, and satisfying ‖w1‖2 = ‖w2‖2 = . . . = ‖wn‖2 = 1. (Note that this is not always the case, since we have neither assumed that A is symmetric nor that the n eigenvalues are distinct. Since A is not assumed to be symmetric, the eigenvectors w1,w2, . . . ,wn can in general not be chosen orthogonal to each other.) Then the eigenvectors w1,w2, . . . ,wn form a basis of Rn, and we can represent every x ∈ Rn as
x = ∑_{j=1}^{n} cj wj,
with some coefficients c1, c2, . . . , cn ∈ R. Using Awj = λj wj, j = 1, 2, . . . , n, shows that
A^m x = ∑_{j=1}^{n} cj A^m wj = ∑_{j=1}^{n} cj λj^m wj = λ1^m ( c1 w1 + ∑_{j=2}^{n} cj (λj/λ1)^m wj ) =: λ1^m ( c1 w1 + Rm ), (7.11)
with the remainder term
Rm := ∑_{j=2}^{n} cj (λj/λ1)^m wj. (7.12)
The vector sequence {Rm} tends to the zero vector 0 for m → ∞, since the triangle inequality, ‖wj‖2 = 1 for all j = 2, . . . , n, and |λj/λ1| < 1 for j = 2, . . . , n (from (7.10)) imply
‖Rm‖2 = ‖ ∑_{j=2}^{n} cj (λj/λ1)^m wj ‖2 ≤ ∑_{j=2}^{n} |cj| |λj/λ1|^m ‖wj‖2 = ∑_{j=2}^{n} |cj| |λj/λ1|^m → 0 as m → ∞.
If c1 ≠ 0, this means that from (7.11)
(1/λ1^m) A^m x = c1 w1 + Rm → c1 w1 as m → ∞,
and c1 w1 is an eigenvector of A for the eigenvalue λ1. Of course, this is so far only of limited
value, since we do not know the eigenvalue λ1 and hence cannot form the quotient (A^m x)/λ1^m.
Another problem comes from the fact that the norm of Amx converges to zero if |λ1| < 1 and
to infinity if |λ1| > 1. Both problems can be resolved if we normalize appropriately.
For example, if we take the Euclidean norm of A^m x, then from the last step in (7.11) and ‖wj‖2 = 1 for j = 1, 2, . . . , n,
‖A^m x‖²2 = ‖λ1^m (c1 w1 + Rm)‖²2 = λ1^{2m} (c1 w1 + Rm)^T (c1 w1 + Rm)
= λ1^{2m} ( |c1|² w1^T w1 + 2 c1 w1^T Rm + Rm^T Rm ) = λ1^{2m} ( |c1|² + rm ), (7.13)
where rm ∈ R is defined by
rm := 2 c1 w1^T Rm + ‖Rm‖²2. (7.14)
Clearly lim_{m→∞} Rm = 0 implies lim_{m→∞} rm = 0. Thus, from (7.13), we can write
‖A^m x‖2 = |λ1|^m √(|c1|² + rm), with lim_{m→∞} rm = 0. (7.15)
From (7.15), we find
‖A^{m+1} x‖2 / ‖A^m x‖2 = |λ1| √(|c1|² + rm+1) / √(|c1|² + rm) → |λ1| for m → ∞, (7.16)
which gives the real eigenvalue λ1 up to its sign. To determine also its sign and a corresponding
eigenvector, we refine the method as follows.
Definition 7.5 (power method or von Mises iteration)
Let A ∈ Rn×n, and assume that A has n real eigenvalues λ1, λ2, . . . , λn ∈ R (counted with multiplicity) and a corresponding set of real normalized eigenvectors w1,w2, . . . ,wn ∈ Rn that form a basis of Rn. (That is, w1,w2, . . . ,wn are linearly independent, and ‖wj‖2 = 1 and Awj = λj wj, j = 1, 2, . . . , n.) Suppose that A has a dominant eigenvalue λ1, that is, |λ1| > |λj| for j = 2, 3, . . . , n. The power method or von Mises iteration is defined in the following way: First, we choose a starting vector
x(0) = ∑_{j=1}^{n} cj wj, with c1 ≠ 0,
and set y(0) = x(0)/‖x(0)‖2. Then, for m = 1, 2, . . . we compute:
1. x(m) = Ay(m−1),
2. y(m) = σm x(m)/‖x(m)‖2, where the sign σm ∈ {−1, 1} is chosen such that (y(m))^T y(m−1) ≥ 0.
The choice of the sign σm means that the angle between y(m−1) and y(m) is in [0, π/2], which means that we are avoiding a ‘jump’ when moving from y(m−1) to y(m).
The condition c1 ≠ 0 is usually satisfied simply because of numerical rounding errors and is no real restriction. We note that c1 ≠ 0 is equivalent to w1^T x(0) ≠ 0.
The power method can be implemented with the following MATLAB code:
function [lambda,v] = power_method(A,z,J)
%
% executes the power method or von Mises iteration
%
% input: A = real n by n matrix with real eigenvalues
% and n linearly independent real eigenvectors;
% it is assumed that A has a dominant eigenvalue
% z = n by 1 starting vector for the power method
% J = number of iterations
%
% output: lambda = approximation of dominant eigenvalue
% v = approximation of corresponding normalized eigenvector
%
y = z / norm(z);
for j = 1:J
x = A * y;
sigma = sign(x’ * y);
y = sigma * ( x / norm(x) );
end
lambda = sigma * norm(x);
v = y;
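As a usage illustration, the call below reproduces the computation of Example 7.7 below; the output values quoted in the comment are the ones reported there:

% Power method applied to the matrix of Example 7.7.
A = [0 -2 2; -2 -3 2; -3 -6 5];
z = [1; 1; 1];                     % starting vector
[lambda,v] = power_method(A,z,15);
% lambda is approximately 2.0002 and v approximately (0.7071, 0.0001, 0.7071)^T,
% approximating the dominant eigenvalue 2 with eigenvector (1,0,1)^T / sqrt(2).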
The theorem below gives information about the convergence of the power method.
Theorem 7.6 (convergence of power method)
Let A ∈ Rn×n, and assume that A has n real eigenvalues λ1, λ2, . . . , λn ∈ R (counted with
multiplicity) and a corresponding set of real normalized eigenvectors w1,w2, . . . ,wn ∈ Rn
that form a basis of Rn. (That is, w1,w2, . . . ,wn are linearly independent, and ‖wj‖2 = 1
and Awj = λj wj, j = 1, 2, . . . , n.) Suppose that A has a dominant eigenvalue λ1, that is,
|λ1| > |λj| for j = 2, 3, . . . , n. Then the iterations of the power method satisfy:
(i) ‖x(m)‖2 → |λ1| for m → ∞,
(ii) y(m) converges to a normalized eigenvector of A with the eigenvalue λ1,
(iii) σm → sign(λ1) for m → ∞, that is, σm = sign(λ1) for sufficiently large m.
Before we prove the theorem, we give a numerical example.
Example 7.7 (power method)
Consider the real 3 × 3 matrix
A = ( 0 −2 2 ; −2 −3 2 ; −3 −6 5 ).
From Example 2.4 we know that the eigenvalues of this matrix are λ1 = 2, λ2 = 1, and λ3 = −1 and that the eigenvectors are real. A normalized eigenvector to the dominant eigenvalue λ1 = 2 is given by w1 = (1/√2) (1, 0, 1)^T, and normalized eigenvectors to the eigenvalues λ2 = 1 and λ3 = −1 are given by w2 = (1/√5) (2, −1, 0)^T and w3 = (1/√2) (0, 1, 1)^T.
We compute the first two steps of the power method with starting vector x(0) = (1, 1, 1)^T by hand. It can be easily verified that x(0) = −√2 w1 + √5 w2 + 2√2 w3, that is, the condition c1 = −√2 ≠ 0 is satisfied.
We start with
y(0) = x(0)/‖x(0)‖2 = (1/√3) (1, 1, 1)^T.
In the first step of the power method we find
x(1) = Ay(0) = (1/√3) ( 0 −2 2 ; −2 −3 2 ; −3 −6 5 ) (1, 1, 1)^T = (1/√3) (0, −3, −4)^T.
Since
σ1 = sign( (y(0))^T x(1) ) = sign( (1/3) (1, 1, 1) (0, −3, −4)^T ) = sign( (1/3)(−7) ) = −1
and
‖x(1)‖2 = (1/√3) √((−3)² + (−4)²) = 5/√3,
we find
y(1) = σ1 x(1)/‖x(1)‖2 = −(√3/5) · (1/√3) (0, −3, −4)^T = (1/5) (0, 3, 4)^T.
In the second step of the power method we find
x(2) = Ay(1) = (1/5) ( 0 −2 2 ; −2 −3 2 ; −3 −6 5 ) (0, 3, 4)^T = (1/5) (2, −1, 2)^T.
Since
σ2 = sign( (y(1))^T x(2) ) = sign( (1/25) (0, 3, 4) (2, −1, 2)^T ) = sign(5/25) = 1
and
‖x(2)‖2 = (1/5) √(2² + (−1)² + 2²) = √9/5 = 3/5,
we find
y(2) = σ2 x(2)/‖x(2)‖2 = (5/3) · (1/5) (2, −1, 2)^T = (1/3) (2, −1, 2)^T.
Using the MATLAB code given above we obtain the following approximations to λ1 = 2 and the corresponding normalized eigenvector w1 = (1/√2) (1, 0, 1)^T ≈ (0.707107, 0, 0.707107)^T from the first six iterations of the power method:

j                   1         2         3         4         5         6
y(j), 1st comp.     0         0.6667    0.4983    0.7062    0.6602    0.7071
y(j), 2nd comp.     0.6000   −0.3333    0.2491   −0.0504    0.0660   −0.0114
y(j), 3rd comp.     0.8000    0.6667    0.8305    0.7062    0.7482    0.7071
σj ‖x(j)‖2         −2.8868    0.6000    4.0139    1.6463    2.2923    1.9296

We see that while the first few iterations of the power method give very poor results, after only six iterations we already have a reasonable approximation of the eigenvector and eigenvalue.
After 15 iterations we find
λ1 ≈ σ15 ‖x(15)‖2 = 2.0002 and w1 ≈ y(15) = (0.7071, 0.0001, 0.7071)^T,
which gives a very good approximation of the dominant eigenvalue λ1 = 2 and a corresponding normalized eigenvector. 2
Note that the assumption that ‖wj‖2 = 1, j = 1, 2, . . . , n, is only for our convenience. Once
we have n linearly independent eigenvectors w1,w2, . . . ,wn corresponding to λ1, λ2, . . . , λn we
can always normalize them so that they have length one.
Proof of Theorem 7.6. From the definition of the power method, we have for k ≥ 1
y(k) = σk x(k)/‖x(k)‖2 = σk Ay(k−1)/‖Ay(k−1)‖2. (7.17)
Applying (7.17) repeatedly for k = m, m − 1, . . . , 1 yields
y(m) = σm Ay(m−1)/‖Ay(m−1)‖2 = σm σm−1 A² y(m−2)/‖A² y(m−2)‖2 = σm σm−1 · · · σ1 A^m y(0)/‖A^m y(0)‖2 = σm σm−1 · · · σ1 A^m x(0)/‖A^m x(0)‖2, (7.18)
for m = 1, 2, . . . , where we have used y(0) = x(0)/‖x(0)‖2 in the last step. From this, we can conclude
x(m+1) = Ay(m) = σm σm−1 · · · σ1 A^{m+1} x(0)/‖A^m x(0)‖2,
such that (7.16) with x = x(0) immediately leads to
‖x(m+1)‖2 = ‖A^{m+1} x(0)‖2/‖A^m x(0)‖2 = |λ1| √(|c1|² + rm+1)/√(|c1|² + rm) → |λ1| for m → ∞.
Using in (7.18) the representations (7.11) and (7.15) yields for sufficiently large m
y(m) = σm σm−1 · · · σ1 λ1^m (c1 w1 + Rm) / ( |λ1|^m √(|c1|² + rm) )
= σm σm−1 · · · σ1 [sign(λ1)]^m sign(c1) ( |c1| / √(|c1|² + rm) ) w1 + ρm, (7.19)
where
ρm := σm σm−1 · · · σ1 [sign(λ1)]^m Rm / √(|c1|² + rm).
Since c1 ≠ 0 and Rm → 0 and rm → 0 for m → ∞, clearly ρm → 0 for m → ∞. From (7.19) and ρm → 0 for m → ∞, we can conclude that y(m) indeed converges to an eigenvector of A for the eigenvalue λ1, provided that σm = sign(λ1) for all m ≥ m0. The latter follows from (y(m−1))^T y(m) = (y(m))^T y(m−1) ≥ 0 and the first line in (7.19), using ‖w1‖²2 = 1:
0 ≤ (y(m−1))^T y(m) = σm σ²_{m−1} · · · σ²_1 λ1^{2m−1} (c1 w1 + Rm−1)^T (c1 w1 + Rm) / ( |λ1|^{2m−1} √(|c1|² + rm−1) √(|c1|² + rm) )
= σm sign(λ1) ( |c1|² + R^T_{m−1} Rm + c1 w1^T (Rm−1 + Rm) ) / ( |c1|² √( (1 + rm−1/|c1|²)(1 + rm/|c1|²) ) ). (7.20)
Because Rm → 0 and rm → 0 for m → ∞, the fraction in the second line of (7.20) converges to one for m → ∞, and hence (7.20) implies for m large enough that
0 ≤ σm sign(λ1) ⇔ σm = sign(λ1),
which completes the proof. 2
The Euclidean norm involved in the normalization process can be replaced by any other norm
without changing the convergence result. Often, the maximum norm is used, since it is cheaper
to compute.
Exercise 90 Consider the matrix
A = ( 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ).
In Exercise 89 you have computed the eigenvalues of the matrix A.
(a) Compute the eigenvectors corresponding to the dominant eigenvalue of A by hand.
(b) For the starting vector x(0) = (1, 1, 1)T , compute the first three iterations of the power
method by hand.
(c) Comment on the quality of the results from (b).
7.3 Inverse Iteration
Inverse iteration can be used to improve the approximations of eigenvalues obtained by other methods. The idea is to start with a good approximation λ of an eigenvalue λj of A with multiplicity one, and to consider the inverse matrix (A − λ I)−1. If λ is a good approximation of λj, then we expect λj − λ to be small and hence that the eigenvalue μj := (λj − λ)−1 of (A − λ I)−1 is a dominant eigenvalue of (A − λ I)−1. Inverse iteration consists of using the power method to approximate this dominant eigenvalue μj = (λj − λ)−1 and then uses the relation λj = λ + 1/μj to obtain an improved approximation of the eigenvalue λj of A.
General assumption in this section: Throughout this section we assume that the real
matrix A ∈ Rn×n has n real eigenvalues λ1, λ2, . . . , λn ∈ R and corresponding real normalized
eigenvectors w1,w2, . . . ,wn ∈ Rn that form a basis of Rn.
The major drawback of the power method, discussed in the previous section, is the convergence factor: Let |λ1| > |λ2| ≥ |λ3| ≥ . . . ≥ |λn|; if |λ2| is close to |λ1| the convergence will be very slow, since |λ2/λ1|^m tends only very slowly to zero for m → ∞.
Inverse iteration, which is discussed in this section, starts with a good approximation λ of one eigenvalue λj of multiplicity one, and then uses the power method applied to (A − λ I)−1 to compute an improved approximation of λj. We will now see in detail how this works.
Assume now that A has an eigenvalue λj of multiplicity one. If λ is a good approximation to
λj then we have
|λ − λj| < |λ − λi| for all i = 1, 2, . . . , n with i ≠ j. (7.21)
If λ is not an eigenvalue of A, then det(A − λ I) ≠ 0 and we can invert A − λ I. Since
λ1, λ2, . . . , λn are the eigenvalues of A, we have
det( µ I − (A − λ I) ) = det( (µ + λ) I − A ) = (µ + λ − λ1) (µ + λ − λ2) · · · (µ + λ − λn).
Hence A − λ I has the eigenvalues λi − λ, i = 1, 2, . . . , n, and thus (A − λ I)−1 has the eigenvalues (λi − λ)−1 =: μi, i = 1, 2, . . . , n. Rearranging, for i = j,
μj = 1/(λj − λ) ⇔ λj − λ = 1/μj ⇔ λj = λ + 1/μj,
and we see that a good approximation of μj allows us to improve our approximation of λj.
From (7.21), we have
1/|λ − λi| < 1/|λ − λj| = |μj| for all i = 1, 2, . . . , n with i ≠ j,
and we can conclude that μj is the dominant eigenvalue of (A − λ I)−1.
From our assumption that A has n real eigenvalues λ1, λ2, . . . , λn and corresponding normalized eigenvectors w1,w2, . . . ,wn, that form a basis of Rn, the real eigenvalue μi = (λi − λ)−1 of the matrix (A − λ I)−1 has the real eigenvector wi. Indeed, for i = 1, 2, . . . , n,
Awi = λi wi ⇔ (A − λ I)wi = (λi − λ)wi ⇔ (1/(λi − λ)) (A − λ I)wi = wi.
Thus, from left-multiplication with (A − λ I)−1,
(A − λ I)−1 (1/(λi − λ)) (A − λ I)wi = (A − λ I)−1 wi ⇔ (1/(λi − λ)) wi = (A − λ I)−1 wi.
Thus the real eigenvalue μi = (λi − λ)−1 of (A − λ I)−1 has the corresponding normalized eigenvector wi ∈ Rn. From the assumptions on A, we see that the real eigenvectors w1,w2, . . . ,wn of (A − λ I)−1 corresponding to the real eigenvalues μi = (λi − λ)−1, i = 1, 2, . . . , n, form a basis of Rn.
Therefore the assumptions for the power method are satisfied, and the power method can be applied to the computation of the dominant eigenvalue μj of (A − λ I)−1. However, for one iteration step we now have to form
1. x(m) = (A − λ I)−1 y(m−1),
2. y(m) = σm x(m)/‖x(m)‖2, where σm ∈ {−1, 1} is chosen such that (y(m))^T y(m−1) ≥ 0.
The computation of x(m) thus requires solving the linear system
(A − λ I)x(m) = y(m−1).
To this end, we determine, before the iteration starts, an LU factorization of A − λ I (or a
QR factorization) using partial row pivoting, that is, we compute
P (A − λ I) = L U,
with a permutation matrix P, an upper triangular matrix U and a normalized lower triangular matrix L. Then, we can use forward and back substitution to solve
L U x(m) = P y(m−1).
More precisely, we let z = U x(m) and solve first L z = P y(m−1) by forward substitution and then U x(m) = z by back substitution. This reduces the complexity necessary in each step to O(n²) elementary operations and requires only once O(n³) elementary operations to compute the LU factorization at the beginning.
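A possible implementation of inverse iteration along these lines is sketched below (an illustrative sketch, not from the original notes; lambdahat stands for the approximation λ, and at least one iteration is assumed):

function [lambda,v] = inverse_iteration(A,lambdahat,z,J)
%
% inverse iteration: power method applied to (A - lambdahat*I)^(-1)
%
% input: A = real n by n matrix satisfying the assumptions of this section
%        lambdahat = approximation of an eigenvalue of multiplicity one
%        z = n by 1 starting vector, J = number of iterations (J >= 1)
%
% output: lambda = improved eigenvalue approximation, v = eigenvector approx.
%
n = size(A,1);
[L,U,P] = lu(A - lambdahat * eye(n));   % LU factorization, computed only once
y = z / norm(z);
for m = 1:J
    x = U \ (L \ (P * y));              % forward and back substitution
    sigma = sign(x' * y);
    if sigma == 0, sigma = 1; end       % guard against orthogonal iterates
    y = sigma * (x / norm(x));
    mu = sigma * norm(x);               % approximates mu_j = (lambda_j - lambdahat)^(-1)
end
lambda = lambdahat + 1/mu;              % lambda_j = lambdahat + 1/mu_j
v = y;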
The approximation λ of λj can be given by any other method. From this point of view, inverse
iteration is particularly interesting for improving the results of other methods.
For symmetric matrices, it is possible to improve the convergence dramatically by estimating λ in each step using the Rayleigh quotient:
1. Choose x(0) ∈ Rn and set y(0) = x(0)/‖x(0)‖2.
2. For m = 0, 1, 2, . . . do
µm = (y(m))^T Ay(m),
x(m+1) = (A − µm I)−1 y(m),
y(m+1) = x(m+1)/‖x(m+1)‖2.
Of course, x(m+1) is only computed if µm is not an eigenvalue of A, otherwise the method ceases.
The same holds if ‖x(m+1)‖2 becomes too large. Although, for m → ∞, the matrices A − µm I
become singular, there are usually no problems with the computation of x(m+1), provided that
one terminates the computation of the LU decomposition (or QR-decomposition) of A − µm I
in time.
It is possible to show that the convergence is cubic under certain assumptions. This fast convergence is indeed necessary to make the method efficient, since in each step an LU decomposition has to be computed, and this additional computational complexity should be compensated by a faster convergence.
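A minimal sketch of this Rayleigh quotient iteration might look as follows (illustrative only; a production code would stop gracefully as soon as A − µm I becomes numerically singular or ‖x(m+1)‖2 becomes too large):

function [mu,y] = rayleigh_iteration(A,z,J)
%
% Rayleigh quotient iteration for a symmetric real matrix A
%
n = size(A,1);
y = z / norm(z);
for m = 1:J
    mu = y' * A * y;                % Rayleigh quotient as shift
    x = (A - mu * eye(n)) \ y;      % solve the shifted linear system
    y = x / norm(x);
end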
Exercise 91 For a given matrix A ∈ Rn×n and µ ∈ R, we may write the inverse iteration in
the following form:
Given x(0) ∈ Cn for k = 1, 2, . . . do:
(1) solve (A − µ I)w = x(k−1),
(2) normalize x(k) = σk w/‖w‖2, where σk ∈ {−1, 1} is chosen such that (x(k))∗ x(k−1) ≥ 0,
(3) set λ(k) = (x(k))∗ Ax(k).
Assume that A has a basis {z1, z2, . . . , zn} ⊂ Cn of eigenvectors of A, that is, if λ1, λ2, . . . , λn ∈ C
are the eigenvalues of A then there exist n linearly independent eigenvectors z1, z2, . . . , zn ∈ Cn
such that A zj = λj zj for j = 1, 2, . . . , n. Let α1, α2, . . . , αn ∈ C be the coefficients of x(0) with
respect to the basis {z1, z2, . . . , zn}, that is, x(0) =∑n
i=1 αi zi. Under certain conditions, that
are to be stated, do the following:
(a) Describe the eigenvalues and eigenvectors of (A − µ I)−1 in terms of µ and the eigenvalues
and eigenvectors of A.
(b) Show that
x(r) = ( σ1 σ2 · · · σr−1 σr / (K0 K1 K2 · · · Kr−1) ) ∑_{i=1}^{n} αi ( 1/(λi − µ) )^r zi,
where Kj := ‖(A − µ I)−1 x(j)‖2, j ∈ N0.
(c) For any 1 ≤ J ≤ n, show that
‖ ( (K0 K1 K2 · · · Kr−1)/(σ1 σ2 · · · σr−1 σr) ) (λJ − µ)^r x(r) − αJ zJ ‖2 ≤ | (λJ − µ)/(λI − µ) |^r ∑_{i=1, i≠J}^{n} |αi| ‖zi‖2,
where the properties of the index I are to be stated.
7.4 The Jacobi Method
C. G. J. Jacobi stated in 1845/46 a method for dealing with the eigenvalue problem for sym-
metric n × n matrices, which is still feasible if the matrices are not too big. This method is
called the Jacobi method. Since it can easily be run in parallel it can, on specific computers,
even be superior to the QR-method, which will be discussed in Section 7.6. The Jacobi method
computes all eigenvalues and if necessary also all eigenvectors. The Jacobi method is based
upon the following easy fact.
Lemma 7.8 (Frobenius norm is invariant under orthogonal basis transformation)
Let A = (ai,j) in Rn×n be any matrix. For every orthogonal matrix Q ∈ Rn×n both A and Q^T A Q have the same Frobenius norm, that is, ‖A‖F = ‖Q^T A Q‖F. If A is symmetric (that is, A^T = A) and λ1, . . . , λn are the real eigenvalues of A then
∑_{j=1}^{n} |λj|² = ∑_{j=1}^{n} ∑_{k=1}^{n} |aj,k|² = ‖A‖²F. (7.22)
As a preparation for the proof we recall that for any matrix B = (bj,k) in Rn×n
trace(B) := ∑_{j=1}^{n} bj,j.
Since the products A B and B A of A = (aj,k) and B = (bj,k) in Rn×n satisfy
(A B)i,j = ∑_{k=1}^{n} ai,k bk,j and (B A)i,j = ∑_{k=1}^{n} bi,k ak,j,
setting i = j above, we see that
trace(A B) = ∑_{j=1}^{n} ∑_{k=1}^{n} aj,k bk,j = ∑_{k=1}^{n} ∑_{j=1}^{n} bk,j aj,k = trace(B A). (7.23)
In particular, (7.23) implies for B = A^T
trace(A A^T) = trace(A^T A) = ∑_{j=1}^{n} ∑_{k=1}^{n} |aj,k|² = ‖A‖²F. (7.24)
Proof of Lemma 7.8. From (7.24), the square of the Frobenius norm of A equals
‖A‖²F = ∑_{j=1}^{n} ∑_{k=1}^{n} |aj,k|² = trace(A A^T) = trace(A^T A).
Since from (7.23), the trace of A B is the same as the trace of B A for two arbitrary n × n matrices A and B, we can conclude that (using Q Q^T = I since Q is orthogonal)
‖Q^T A Q‖²F = trace( (Q^T A Q) (Q^T A Q)^T ) = trace( (Q^T A Q) (Q^T A^T Q) )
= trace( Q^T A (Q Q^T) A^T Q ) = trace( (Q^T A) (A^T Q) )
= trace( (A^T Q) (Q^T A) ) = trace( A^T (Q Q^T) A ) = trace(A^T A) = ‖A‖²F.
For a symmetric matrix A there exists an orthogonal matrix Q, such that Q^T A Q = D is a diagonal matrix with the real eigenvalues λ1, λ2, . . . , λn of A as diagonal entries. Thus for symmetric A
‖A‖²F = ‖Q^T A Q‖²F = ‖D‖²F = trace(D D^T) = ∑_{j=1}^{n} |λj|².
This concludes the proof. 2
Definition 7.9 (outer norm)
The outer norm of a real n × n matrix A = (ai,j) is defined by
N(A) := ( ∑_{j=1}^{n} ∑_{k=1, k≠j}^{n} |aj,k|² )^{1/2}.
Note that the outer norm is not really a norm in the strict sense of Definition 2.24.
Exercise 92 Investigate which of the properties of a norm are satisfied by the outer norm and which properties of a norm are violated by the outer norm.
From now on assume for the rest of this section that A is symmetric. With the outer norm, (7.22) in Lemma 7.8 gives for symmetric A ∈ Rn×n the decomposition
‖A‖²F = ∑_{j=1}^{n} |λj|² = ∑_{j=1}^{n} ∑_{k=1}^{n} |aj,k|² = ∑_{j=1}^{n} |aj,j|² + [N(A)]². (7.25)
Since the left-hand side is invariant under orthogonal transformations (from Lemma 7.8), it is now our goal to decrease the value of [N(A)]² by choosing appropriate orthogonal transformations Q and thus to increase the value of ∑_{j=1}^{n} |aj,j|². In the limit case [N(A)]² → 0, the transformed matrix Q^T A Q tends to the diagonal matrix D = diag(λ1, λ2, . . . , λn), where λ1, λ2, . . . , λn are the eigenvalues of A (without any ordering).
To find suitable orthogonal transformations Q such that [N(Q^T A Q)]² < [N(A)]², we choose an element ai,j ≠ 0 with i ≠ j and perform a transformation in the plane spanned by ei and ej, which maps ai,j to zero.
To construct such a transformation, consider the following problem in R²: for a symmetric 2 × 2 matrix
Ā = ( ai,i ai,j ; ai,j aj,j ),
find a rotation
R = ( cos α −sin α ; sin α cos α ) (7.26)
with angle α such that the matrix B̄ := R^T Ā R is a diagonal matrix, that is,
B̄ = ( bi,i bi,j ; bi,j bj,j ) = ( cos α sin α ; −sin α cos α ) ( ai,i ai,j ; ai,j aj,j ) ( cos α −sin α ; sin α cos α ), (7.27)
with bi,j = 0. (Note that B̄ is symmetric since B̄^T = (R^T Ā R)^T = R^T Ā^T R = R^T Ā R since Ā is symmetric.) The condition bi,j = 0 reads more explicitly
0 = bi,j = ai,j [ (cos α)² − (sin α)² ] + [ aj,j − ai,i ] cos α sin α = ai,j cos(2α) + [ aj,j − ai,i ] (1/2) sin(2α).
From ai,j ≠ 0 we can conclude that the angle α ∈ [0, π/2] is given by
cot(2α) = (ai,i − aj,j)/(2 ai,j). (7.28)
For the practical computation of the entries cos α and sin α in the rotation matrix R, we do not actually determine the angle α from (7.28), but use
tan(2α) = 2 ai,j/(ai,i − aj,j) if ai,i ≠ aj,j, and α = π/4 if ai,i = aj,j, (7.29)
and the following trigonometric identities:
cos(2α) = 1/√(1 + (tan(2α))²), cos α = √( (1 + cos(2α))/2 ), sin α = √( (1 − cos(2α))/2 ). (7.30)
We note that clearly
R^T R = ( cos α sin α ; −sin α cos α ) ( cos α −sin α ; sin α cos α ) = I,
and hence the rotation R is an orthogonal matrix.
Definition 7.10 (elementary Givens rotation)
An n × n elementary Givens rotation is given by the real n × n matrix
Gi,j(α) := I + (cos α − 1)(ej ej^T + ei ei^T) + sin α (ej ei^T − ei ej^T). (7.31)
This matrix coincides with the identity matrix I, except for the entries
(Gi,j(α))(i, i) = (Gi,j(α))(j, j) = cos α, (Gi,j(α))(i, j) = −(Gi,j(α))(j, i) = −sin α,
where i < j.
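As a sketch (not part of the original notes), an elementary Givens rotation can be set up in MATLAB directly from (7.31):

function G = givens_rotation(n,i,j,alpha)
%
% elementary Givens rotation G_{i,j}(alpha) in R^(n x n) for i < j
%
G = eye(n);
G(i,i) = cos(alpha);   G(j,j) = cos(alpha);
G(i,j) = -sin(alpha);  G(j,i) = sin(alpha);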
Since rotations R in R2, given by (7.26) are orthogonal matrices, it is not surprising to find
that elementary Givens rotations are orthogonal matrices.
Lemma 7.11 (elementary Givens rotations are orthogonal)
An n × n elementary Givens rotation Gi,j(α) is an orthogonal matrix, and we have
(Gi,j(α))−1 = (Gi,j(α))^T = Gj,i(α).
Proof of Lemma 7.11. The proof follows relatively straightforwardly with the help of the rotation matrix (7.26) and is left as an exercise. 2
Exercise 93 Show Lemma 7.11.
Lemma 7.12 (basis transformation with an elementary Givens rotation)
Let A = (ak,ℓ) ∈ Rn×n be symmetric and let Gi,j(α), with α ∈ [0, π/2], be the n × n elementary Givens rotation. Let B = (bk,ℓ) := (Gi,j(α))^T A Gi,j(α). Let R be the 2 × 2 orthogonal matrix
R = ( cos α −sin α ; sin α cos α ),
and let
Ā = ( ai,i ai,j ; ai,j aj,j ) if i < j, and Ā = ( aj,j aj,i ; aj,i ai,i ) if i > j,
where the entries in Ā are those of A with indices (i, i), (i, j), (j, i), (j, j). Let
B̄ = ( b̄i,i b̄i,j ; b̄i,j b̄j,j ) := R^T Ā R if i < j, and B̄ = ( b̄j,j b̄j,i ; b̄j,i b̄i,i ) := R^T Ā R if i > j.
Then bk,ℓ = b̄k,ℓ for k, ℓ ∈ {i, j}. In words, the entries with indices (i, i), (i, j), (j, i), (j, j) of (Gi,j(α))^T A Gi,j(α) can be described by executing the basis transformation R^T Ā R on the 2 × 2 matrix Ā that contains the entries of A with indices (i, i), (i, j), (j, i), (j, j).
Exercise 94 Prove Lemma 7.12.
With the help of Lemma 7.12 we can now prove the following central result.
Proposition 7.13 (basis transformation with elementary Givens rotation)
Let A ∈ Rn×n be symmetric and ai,j ≠ 0 for a pair (i, j) of indices i ≠ j, and let Gi,j(α) be the n × n elementary Givens rotation with angle α ∈ [0, π/2] defined by cot(2α) = (ai,i − aj,j)/(2 ai,j). If
B = (Gi,j(α))^T A Gi,j(α),
then bi,j = bj,i = 0 and
[N(B)]² = [N(A)]² − 2 |ai,j|². (7.32)
We see from (7.32) that through a suitable basis transformation with an elementary Givens
rotation, the outer norm of A can be reduced, as intended.
Proof of Proposition 7.13. In this proof we will use Lemma 7.12, which allows us to switch between the pair of matrices A and B and the pair of matrices Ā and B̄ when we are only interested in the entries with indices (i, i), (i, j), (j, i), (j, j). We are using the invariance property of the Frobenius norm (see Lemma 7.8) twice: On the one hand, from Lemma 7.8, we have
‖A‖F = ‖(Gi,j(α))^T A Gi,j(α)‖F = ‖B‖F,
since Gi,j(α) is an orthogonal matrix. On the other hand, since R given by (7.26) is also orthogonal, we have from Lemma 7.8 the equality of the Frobenius norms for the small matrices in (7.27), which means
|ai,i|² + |aj,j|² + 2 |ai,j|² = |bi,i|² + |bj,j|², (7.33)
since bi,j = 0. From this, we can conclude that
[N(B)]² = ∑_{j=1}^{n} ∑_{k=1, k≠j}^{n} |bj,k|² = ‖B‖²F − ∑_{k=1}^{n} |bk,k|² = ‖A‖²F − ∑_{k=1}^{n} |bk,k|²
= [N(A)]² + ∑_{k=1}^{n} ( |ak,k|² − |bk,k|² ) = [N(A)]² − 2 |ai,j|²,
because ak,k = bk,k for all k ≠ i, j and
|ai,i|² + |aj,j|² − ( |bi,i|² + |bj,j|² ) = −2 |ai,j|²
from (7.33). 2
Iteration of this process gives the classical Jacobi method for computing the eigenvalues of a
symmetric matrix.
Definition 7.14 (classical Jacobi method for eigenvalue computation)
Let A ∈ Rn×n be symmetric. The classical Jacobi method (for eigenvalue computation) defines A(1) = A and then proceeds for m = 1, 2, . . . , as follows:
1. For A(m) = (a^{(m)}_{ℓ,k}) determine a^{(m)}_{i,j} with i ≠ j such that
|a^{(m)}_{i,j}| = max_{1≤ℓ,k≤n, ℓ≠k} |a^{(m)}_{ℓ,k}|
and set G(m) = Gi,j(α), where α ∈ [0, π/2] is such that cot(2α) = (a^{(m)}_{i,i} − a^{(m)}_{j,j})/(2 a^{(m)}_{i,j}).
2. Set A(m+1) = (G(m))^T A(m) G(m).
For practical computations we will again avoid matrix multiplication and code the transformations in step 2 directly. Note that a non-diagonal element, which has been mapped to zero in an earlier step, might be changed again in later steps.
Theorem 7.15 (linear convergence of classical Jacobi method)
The classical Jacobi method converges linearly in the outer norm. More precisely, if A in Rn×n is the symmetric original matrix and A(m+1) = (a^{(m+1)}_{i,j}) denotes the new n × n matrix computed in the mth step of the classical Jacobi method, then for any m ∈ N
| ‖A‖²F − ∑_{j=1}^{n} |a^{(m+1)}_{j,j}|² | = [N(A(m+1))]² ≤ ( 1 − 2/(n(n − 1)) )^m [N(A)]². (7.34)
The sequence {A(m)} converges towards a diagonal matrix with the eigenvalues of A as diagonal elements.
We note that for large n the number
0 < η := 1 − 2/(n(n − 1)) < 1
in (7.34) is close to one and therefore the convergence is rather slow.
Proof of Theorem 7.15. We consider one step in the classical Jacobi method: A(m+1) = (a^{(m+1)}_{ℓ,k}) = (G(m))^T A(m) G(m) with the elementary Givens rotation G(m) = Gi,j(α). Since the Frobenius norm of the symmetric matrix A is invariant under orthogonal basis transformations, we have from Lemma 7.8 that
‖A(m+1)‖F = ‖A(m)‖F = ‖A‖F
and therefore, from the definition of the outer norm,
| ‖A‖²F − ∑_{k=1}^{n} |a^{(m+1)}_{k,k}|² | = | ‖A(m+1)‖²F − ∑_{k=1}^{n} |a^{(m+1)}_{k,k}|² | = [N(A(m+1))]². (7.35)
From (7.32), with B = A(m+1) = (G(m))^T A(m) G(m) and A = A(m),
[N(A(m+1))]² = [N(A(m))]² − 2 |a^{(m)}_{i,j}|². (7.36)
Since a^{(m)}_{i,j} was chosen such that |a^{(m)}_{i,j}| ≥ |a^{(m)}_{ℓ,k}| for all ℓ ≠ k, we have
[N(A(m))]² = ∑_{k=1}^{n} ∑_{ℓ=1, ℓ≠k}^{n} |a^{(m)}_{ℓ,k}|² ≤ n(n − 1) |a^{(m)}_{i,j}|² ⇔ |a^{(m)}_{i,j}|² ≥ (1/(n(n − 1))) [N(A(m))]². (7.37)
Thus, applying (7.37) in (7.36), we can bound the outer norm by
[N(A(m+1))]² = [N(A(m))]² − 2 |a^{(m)}_{i,j}|² ≤ [N(A(m))]² − (2/(n(n − 1))) [N(A(m))]² = ( 1 − 2/(n(n − 1)) ) [N(A(m))]². (7.38)
Applying (7.38) successively with m replaced by m − 1, m − 2, . . . , 1, and combining with (7.35), we find (using A(1) = A)
| ‖A‖²F − ∑_{k=1}^{n} |a^{(m+1)}_{k,k}|² | = [N(A(m+1))]² ≤ ( 1 − 2/(n(n − 1)) )^m [N(A)]²,
which proves (7.34) and means that we have linear convergence.
From (7.34), we see that for m → ∞
[N(A(m))]² = ∑_{k=1}^{n} ∑_{ℓ=1, ℓ≠k}^{n} |a^{(m)}_{ℓ,k}|² → 0 and ∑_{k=1}^{n} |a^{(m)}_{k,k}|² → ‖A‖²F = ∑_{k=1}^{n} |λk|²,
where λ1, λ2, . . . , λn denote the eigenvalues of A. (We have used (7.22) in the second limit.) To see that the sequences of diagonal elements {a^{(m)}_{k,k}} converge to the eigenvalues of A (although we cannot predict for a given k to which eigenvalue λℓ the sequence {a^{(m)}_{k,k}} converges), we use Theorem 7.1: Since
A(m) = (G(m−1))^T A(m−1) G(m−1) = . . . = [G(1) G(2) · · · G(m−1)]^T A [G(1) G(2) · · · G(m−1)],
A(m) is obtained from A with the basis transformation Q^{−1}_{m−1} A Qm−1 with the orthogonal matrix Qm−1 := G(1) G(2) · · · G(m−1). Thus A(m) has the same eigenvalues as A. From Theorem 7.1, we know that each eigenvalue λ of A(m) (and of A) satisfies for at least one k
| λ − a^{(m)}_{k,k} | ≤ ∑_{ℓ=1, ℓ≠k}^{n} |a^{(m)}_{k,ℓ}| ≤ √(n(n − 1)) ( ∑_{ℓ=1, ℓ≠k}^{n} |a^{(m)}_{k,ℓ}|² )^{1/2} ≤ √(n(n − 1)) N(A(m)), (7.39)
where we have used in the second step the first estimate from Example 2.31. Since N(A(m)) → 0 as m → ∞, it is clear from (7.39) that
lim_{m→∞} a^{(m)}_{k,k} = λ
for some index k. This completes the proof. 2
Example 7.16 (Jacobi method for the computation of eigenvalues)
Consider the symmetric 3 × 3 matrix
A = ( 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ),
which has the eigenvalues λ1 = 3, λ2 = 2, and λ3 = 1 (see homework). We compute the first step of the Jacobi method for eigenvalue computation by hand. Above the diagonal of A(1) = A, we have only one non-zero entry, a^{(1)}_{1,3} = 1/2. Thus we find from (7.29) α = π/4 since a^{(1)}_{1,1} = a^{(1)}_{3,3} = 3/2. Thus our elementary Givens rotation is given by
G(1) = G1,3(π/4) = ( cos(π/4) 0 −sin(π/4) ; 0 1 0 ; sin(π/4) 0 cos(π/4) ) = ( 1/√2 0 −1/√2 ; 0 1 0 ; 1/√2 0 1/√2 ).
Thus the first step of the Jacobi method is given by
A(2) = (G(1))^T A(1) G(1)
= ( 1/√2 0 1/√2 ; 0 1 0 ; −1/√2 0 1/√2 ) ( 3/2 0 1/2 ; 0 3 0 ; 1/2 0 3/2 ) ( 1/√2 0 −1/√2 ; 0 1 0 ; 1/√2 0 1/√2 )
= ( 1/√2 0 1/√2 ; 0 1 0 ; −1/√2 0 1/√2 ) ( √2 0 −1/√2 ; 0 3 0 ; √2 0 1/√2 )
= ( 2 0 0 ; 0 3 0 ; 0 0 1 ).
In this example we find the exact eigenvalues after only one iteration step. 2
Exercise 95 Consider the symmetric matrix
A = ( 2 0 0 ; 0 2 −1 ; 0 −1 2 ).
(a) Apply one step of the Jacobi method for eigenvalue computation.
(b) Comment on the result.
The Jacobi method for the computation of eigenvalues can be implemented with the MATLAB
code below, where we compute cos α and sin α with the help of (7.29) and the trigonometric
identities (7.30).
function [d,K] = jacobi_eigenvalue(A,J)
%
% Jacobi method for the computation of eigenvalues
%
% input: A = symmetric real n by n matrix
% J = upper bound for number of iterations
%
% output: d = diagonal of symmetric n by n matrix whose diagonal entries
% are approximations of the eigenvalues of A
% K = number of iterations used until required accuracy was reached
%
n = size(A,2);
%
for j = 1:J
C = A;
for i = 1:n;
C(i:n,i) = 0;
end
if max(max(abs(C))) < 10^(-14)
break
else
[R,Q] = find(abs(C) == max(max(abs(C))));
r = R(1,1);
q = Q(1,1);
if A(r,r) == A(q,q)
alpha = pi/4;
cosalpha = cos(alpha);
sinalpha = sin(alpha);
else
tan2alpha = 2*A(r,q) / (A(r,r) - A(q,q));
cos2alpha = 1 / sqrt(1 +(tan2alpha)^2);
cosalpha = sqrt((1+cos2alpha)/2);
sinalpha = sqrt((1-cos2alpha)/2);
end
B = A;
B(1:n,r) = A(1:n,r) * cosalpha + A(1:n,q) * sinalpha;
B(1:n,q) = - A(1:n,r) * sinalpha + A(1:n,q) * cosalpha;
A = B;
A(r,1:n) = cosalpha * B(r,1:n) + sinalpha * B(q,1:n);
A(q,1:n) = - sinalpha * B(r,1:n) + cosalpha * B(q,1:n);
end
end
K = j-1;
d = diag(A);
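As a usage illustration, the computation in Example 7.17 below can be run with a call of the following form (the numbers quoted in the comment are the ones reported in the example):

% Jacobi method applied to the 4 by 4 matrix of Example 7.17,
% with an upper bound of J = 10 iterations.
A = [1 0 0.25 0.25; 0 1 0 0.25; 0.25 0 1 0; 0.25 0.25 0 1];
[d,K] = jacobi_eigenvalue(A,10);
% d approximates the eigenvalues 1.4045, 1.1545, 0.8455, 0.5955 of A
% (in some order), and K = 8 iterations are used.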
We compute the eigenvalues of the symmetric 4 × 4 matrix from Example 5.19 by executing
the algorithm above in MATLAB.
Example 7.17 (Jacobi method for eigenvalue computation)
Consider the symmetric matrix from Example 5.19, given by
A = ( 1 0 0.25 0.25 ; 0 1 0 0.25 ; 0.25 0 1 0 ; 0.25 0.25 0 1 ).
First we find its eigenvalues directly by determining the roots of the characteristic polynomial,
and then we execute the MATLAB code above to find the eigenvalues numerically.
We compute the characteristic polynomial by expansion with respect to the first row:
p(A, λ) = det ( λ−1 0 −1/4 −1/4 ; 0 λ−1 0 −1/4 ; −1/4 0 λ−1 0 ; −1/4 −1/4 0 λ−1 )
= (λ − 1) det ( λ−1 0 −1/4 ; 0 λ−1 0 ; −1/4 0 λ−1 ) − (1/4) det ( 0 λ−1 −1/4 ; −1/4 0 0 ; −1/4 −1/4 λ−1 ) + (1/4) det ( 0 λ−1 0 ; −1/4 0 λ−1 ; −1/4 −1/4 0 )
= (λ − 1) [ (λ − 1)³ − (1/4)² (λ − 1) ] − (1/4) [ (−1/4)³ + (1/4) (λ − 1)² ] + (1/4) [ −(1/4) (λ − 1)² ]
= (λ − 1)⁴ − (3/16) (λ − 1)² + 1/256
= ( (λ − 1)² − 3/32 )² − 9/1024 + 1/256
= ( (λ − 1)² − 3/32 )² − 5/1024
= ( (λ − 1)² − (3 + √5)/32 ) ( (λ − 1)² − (3 − √5)/32 )
= ( λ − 1 − √((3 + √5)/32) ) ( λ − 1 + √((3 + √5)/32) ) ( λ − 1 − √((3 − √5)/32) ) ( λ − 1 + √((3 − √5)/32) ),
where we have used the binomial formulas (a+ b)2 = a2 +2 a b+ b2 and a2− b2 = (a− b) (a+ b).
Thus we find that the eigenvalues are
λ1 = 1 + √((3 + √5)/32) ≈ 1.4045, λ2 = 1 + √((3 − √5)/32) ≈ 1.1545,
λ3 = 1 − √((3 − √5)/32) ≈ 0.8455, λ4 = 1 − √((3 + √5)/32) ≈ 0.5955.
Executing the Jacobi method with the MATLAB code given above with a maximum of J = 10 iterations, we find that the algorithm breaks off after K = 8 iterations, and that the approximations of the eigenvalues in the iterations m = 1, 2, . . . , 8 are given by the following diagonal entries of the matrix A(m+1) = (a^{(m+1)}_{i,k}) listed in the table below:

m                   1       2       3       4       5       6       7       8
a^{(m+1)}_{1,1}   1.2500  1.2500  1.2795  1.2795  1.2795  1.2795  1.1545  1.1545
a^{(m+1)}_{2,2}   1.0000  1.2500  1.2500  1.1677  0.9217  0.7205  0.7205  0.5955
a^{(m+1)}_{3,3}   0.7500  0.7500  0.7500  0.8323  1.0783  1.2795  1.4045  1.4045
a^{(m+1)}_{4,4}   1.0000  0.7500  0.7205  0.7205  0.7205  0.7205  0.7205  0.8455
After only 8 iterations the eigenvalues have been pretty well approximated. 2
Since in every step of the classical Jacobi method we have to search n(n − 1)/2 = O(n²) elements, there are cheaper versions. For example, the cyclic Jacobi method visits all non-diagonal entries regardless of their sizes in a cyclic way, that is, the index pair (i, j) becomes (1, 2), (1, 3), . . . , (1, n), (2, 3), . . . , (2, n), (3, 4), . . . and then the process starts again. A transformation only takes place if ai,j ≠ 0. The cyclic version is convergent.
For small values of ai,j it is not efficient to perform a transformation. Thus, we can restrict ourselves to elements ai,j having a square above a certain threshold, for example N(A)/(2n²).
Finally, let us have a look at the eigenvectors of A. From Theorem 7.15, we know that the sequence of matrices {A(m)} converges towards a diagonal matrix with the eigenvalues of A on the diagonal. Thus let
D = diag(λ1, λ2, . . . , λn),
where λk is the limit of the diagonal element sequence {a^{(m)}_{k,k}} from the matrices A(m) in the classical Jacobi method. (Note that here we cannot make any assumption on the ordering of the eigenvalues.) A(m+1) tends to D for m → ∞, and
A(m+1) = (G(m))^T A(m) G(m) = (G(m))^T · · · (G(2))^T (G(1))^T A G(1) G(2) · · · G(m) = Qm^T A Qm
with the orthogonal matrix
Qm := G(1) · · · G(m).
For sufficiently large m, we have
A(m+1) = Qm^T A Qm ≈ D ⇒ A Qm ≈ Qm D. (7.40)
If uℓ is the ℓth column vector of Qm, then (7.40) implies that
Auℓ ≈ λℓ uℓ,
and we see that the columns of Qm define approximate eigenvectors of A.
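In code, the matrix Qm can be accumulated alongside the iteration. A sketch of the necessary modification to the function jacobi_eigenvalue above (illustrative; the variable name V is chosen to avoid a clash with the index matrix Q used by find in that code):

% Before the loop over j in jacobi_eigenvalue:
V = eye(n);                          % will accumulate V = G^(1) G^(2) ... G^(m)
% Inside the loop, once cosalpha and sinalpha are known:
G = eye(n);
G(r,r) = cosalpha;   G(q,q) = cosalpha;
G(r,q) = -sinalpha;  G(q,r) = sinalpha;
V = V * G;                           % columns of V approximate eigenvectors of A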
7.5 Householder Reduction to Hessenberg Form
In this section we learn how a square matrix A can be brought into so-called upper Hessenberg form. While this is not directly useful for determining the eigenvalues of A, we will see that it serves as a starting point for the QR algorithm discussed in the subsequent section.
Definition 7.18 (upper Hessenberg form)
We say that a matrix A = (ai,j) is an upper Hessenberg matrix (or is upper Hessenberg) if all entries below the first lower sub-diagonal are zero, that is, ai,j = 0 for all i > j + 1:
A = ( ∗ ∗ · · · ∗ ∗ ∗
      ∗ ∗ · · · ∗ ∗ ∗
      0 ∗ · · · ∗ ∗ ∗
      0 0 · · · ∗ ∗ ∗
      .         .  .
      0 0 · · · 0 ∗ ∗ ).
From Theorem 2.21 we know that every square matrix A ∈ Cn×n has a Schur factorisation, that is, there exists a unitary matrix S (that is, S−1 = S∗) such that
S∗ A S = U ⇔ A = S U S∗,
where U is an upper triangular matrix. Since A and U are related via a basis transformation or similarity transformation, the eigenvalues of A and the eigenvalues of U are the same. Since U = (ui,j) is upper triangular, we have that
det(λ I − U) = (λ − u1,1) (λ − u2,2) . . . (λ − un,n),
and hence the eigenvalues of A and U are just the diagonal entries of U. One way to calculate the eigenvalues of A is to bring A into upper triangular form with the help of a basis transformation or similarity transformation. This reduction usually requires two stages.
The first step is to reduce A to upper Hessenberg form. In the second step we apply an iteration that generates a sequence of upper Hessenberg matrices that converge to a matrix in upper triangular form. This is schematically illustrated below for 5 × 5 matrices: step 1 maps a full matrix to upper Hessenberg form, and step 2 maps the upper Hessenberg form to upper triangular form.
( ∗ ∗ ∗ ∗ ∗            ( ∗ ∗ ∗ ∗ ∗            ( ∗ ∗ ∗ ∗ ∗
  ∗ ∗ ∗ ∗ ∗   step 1     ∗ ∗ ∗ ∗ ∗   step 2     0 ∗ ∗ ∗ ∗
  ∗ ∗ ∗ ∗ ∗   −−−→       0 ∗ ∗ ∗ ∗   −−−→       0 0 ∗ ∗ ∗
  ∗ ∗ ∗ ∗ ∗              0 0 ∗ ∗ ∗              0 0 0 ∗ ∗
  ∗ ∗ ∗ ∗ ∗ )            0 0 0 ∗ ∗ )            0 0 0 0 ∗ )
In this section we discuss the Householder reduction, which performs step 1 above and
reduces the original matrix to a matrix in upper Hessenberg form with the help of a sequence
of basis transformations (or similarity transformations) with Householder matrices H(w) =
I − 2 w w^*, where ‖w‖_2 = 1. Remember that Householder matrices are unitary and were
introduced in Section 2.3 in the context of the Schur factorization.
Theorem 7.19 (Householder reduction to upper Hessenberg form)
Let A be an n × n matrix. Then there exist Householder matrices H(w_1), H(w_2), . . . ,
H(w_{n−2}), such that S^* A S, with S := H(w_1) H(w_2) · · · H(w_{n−2}), is in upper Hessenberg
form.
Proof of Theorem 7.19. The proof is given by induction. Let A^{(0)} = A and A^{(k)} =
H(w_k) A^{(k−1)} H(w_k) for k ≥ 1. Suppose that

A^{(k−1)} = [ A_k^{(k−1)} | B^{(k−1)} ]
            [ C^{(k−1)}   | D^{(k−1)} ] ,        (7.41)

where the principal sub-matrix A_k^{(k−1)} ∈ C^{k×k} of order k is in upper Hessenberg form and
C^{(k−1)} = (0, 0, . . . , 0, c_k) is an (n − k) × k matrix. (The matrices B^{(k−1)} and D^{(k−1)} are an
arbitrary k × (n − k) and an arbitrary (n − k) × (n − k) matrix, respectively.) These assumptions
obviously hold for k = 1.
Let H_k = H(u_k) be an (n − k) × (n − k) Householder matrix such that H_k c_k = c_k e_1, where
the scalar c_k = ±‖c_k‖_2 (e_1^* c_k)/|e_1^* c_k| and e_1 is the first Euclidean unit vector in R^{n−k}. Define the vector
w_k := (0^T, u_k^T)^T in C^n, where 0 is the zero vector in C^k. Then it is easily seen that

H(w_k) = I − 2 w_k w_k^* = [ I | 0   ]
                           [ 0 | H_k ] .

Forming the basis transformation with the Householder matrix H(w_k) yields

(H(w_k))^* A^{(k−1)} H(w_k) = H(w_k) A^{(k−1)} H(w_k) = [ A_k^{(k−1)}    | B^{(k−1)} H_k     ]
                                                        [ H_k C^{(k−1)} | H_k D^{(k−1)} H_k ] =: A^{(k)}.

From the choice of the (n − k) × (n − k) Householder matrix H_k, we have H_k C^{(k−1)} =
(0, 0, . . . , 0, c_k e_1). Decomposing A^{(k)} in the form (7.41) with k − 1 replaced by k, we find
that A_{k+1}^{(k)} is an upper Hessenberg matrix and C^{(k)} is an (n − k − 1) × (k + 1) matrix of the
form (0, 0, . . . , 0, c_{k+1}). This completes the proof. □
We can easily construct the algorithm from the above proof. Consider step k of the Householder
reduction to upper Hessenberg form. From the above proof we require a vector u_k ∈ C^{n−k} such
that H(u_k) is an (n − k) × (n − k) Householder matrix with H(u_k) c_k = c_k e_1, where the scalar c_k :=
‖c_k‖_2 (e_1^* c_k)/|e_1^* c_k|. We note that ‖c_k‖_2 = ‖c_k e_1‖_2 and c_k^* (c_k e_1) = (c_k e_1)^* c_k, which
means that the assumptions of Lemma 2.20 are satisfied. Consulting the proof of Lemma 2.20
in Section 2.2, we define u_k ∈ R^{n−k} via

v_k := c_k − c_k e_1   and   u_k := v_k / ‖v_k‖_2.

This gives u_k^T u_k = 1, and from the proof of Lemma 2.20 with x = c_k and y = c_k e_1 we can
conclude that H(u_k) c_k = c_k e_1 and H(u_k) e_1 = c_k^{−1} c_k.
Example 7.20 (Householder reduction to upper Hessenberg form)
Apply the Householder reduction to the matrix

A = [ 1 0 4  0 ]
    [ 0 3 3  4 ]
    [ 4 3 3  4 ]
    [ 0 4 4 −3 ]

to bring it into upper Hessenberg form.
Column 1:

c_1 = (0, 4, 0)^T   and the scalar   c_1 = ‖c_1‖_2 = 4,

v_1 = c_1 − c_1 e_1 = (0, 4, 0)^T − 4 (1, 0, 0)^T = (−4, 4, 0)^T   and   ‖v_1‖_2 = 4 √2,

u_1 = v_1 / ‖v_1‖_2 = (1/√2) (−1, 1, 0)^T.

Thus the 3 × 3 Householder matrix is given by (note that 2 u_1 u_1^T = (−1, 1, 0)^T (−1, 1, 0),
since 2 (1/√2)² = 1)

H_1(u_1) = I − 2 u_1 u_1^T = [ 1 0 0 ]   [ −1 ]                [ 0 1 0 ]
                             [ 0 1 0 ] − [  1 ] (−1, 1, 0)  =  [ 1 0 0 ] ,
                             [ 0 0 1 ]   [  0 ]                [ 0 0 1 ]

and we have indeed

H_1(u_1) c_1 = [ 0 1 0 ] [ 0 ]   [ 4 ]
               [ 1 0 0 ] [ 4 ] = [ 0 ] .
               [ 0 0 1 ] [ 0 ]   [ 0 ]
The corresponding 4 × 4 Householder matrix is then given by

H(w_1) = [ 1 0 0 0 ]
         [ 0 0 1 0 ]
         [ 0 1 0 0 ]
         [ 0 0 0 1 ] ,

and we find that

H(w_1)^T A H(w_1) = [ 1 0 0 0 ] [ 1 0 4  0 ] [ 1 0 0 0 ]
                    [ 0 0 1 0 ] [ 0 3 3  4 ] [ 0 0 1 0 ]
                    [ 0 1 0 0 ] [ 4 3 3  4 ] [ 0 1 0 0 ]
                    [ 0 0 0 1 ] [ 0 4 4 −3 ] [ 0 0 0 1 ]

                  = [ 1 0 4  0 ] [ 1 0 0 0 ]      [ 1 4 0  0 ]
                    [ 4 3 3  4 ] [ 0 0 1 0 ]      [ 4 3 3  4 ]
                    [ 0 3 3  4 ] [ 0 1 0 0 ]  =   [ 0 3 3  4 ] .
                    [ 0 4 4 −3 ] [ 0 0 0 1 ]      [ 0 4 4 −3 ]
Column 2: This is left as an exercise. □
Exercise 96 Perform the second step of the Householder reduction in Example 7.20.
The algorithm for the Householder reduction of a real n × n matrix to upper Hessenberg form
is given by the following MATLAB code:
function [B] = householder_hessenberg(A)
%
% algorithm executes the Householder reduction of any matrix
% to upper Hessenberg form
% input: A = real n by n matrix
%
% output: B = corresponding "reduced" real n by n matrix in
% upper Hessenberg form
%
n = size(A,2);
for k = 1:n-2
% column to be reduced
x = A(k+1:n,k);
% e is unit vector e_1
e = zeros(n-k,1);
e(1) = 1;
v = x - norm(x) * e;
% if x is already a positive multiple of e_1, no reflection is needed
if norm(v) == 0, continue, end
% u_k for Householder matrix H(u_k)
u = v / norm(v);
% matrix multiplication H(u_k)*A
B = A;
A(k+1:n,k:n) = B(k+1:n,k:n) - 2 * u * (u’ * B(k+1:n,k:n));
% matrix multiplication (H(u_k)*A)*H(u_k)
B = A;
A(1:n,k+1:n) = B(1:n,k+1:n) - 2 * (B(1:n,k+1:n) * u) * u’;
end
B = A;
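A minimal test of this function on the matrix from Example 7.20 might look as follows (our
own script; the variable names are arbitrary). The first column of the result must be
(1, 4, 0, 0)^T, as computed by hand above, and the eigenvalues must be unchanged, since only
similarity transformations are applied:

A = [1 0 4 0; 0 3 3 4; 4 3 3 4; 0 4 4 -3];
B = householder_hessenberg(A)
% entries below the first sub-diagonal should vanish (up to rounding)
disp(max(max(abs(tril(B,-2)))))
% the eigenvalues are preserved by the similarity transformations
disp(sort(eig(A)) - sort(eig(B)))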
Note that, since we do not require the product of the Householder matrices, we do not explicitly
form the Householder matrices, thereby reducing the storage requirements significantly.

In the kth step of this method the main work lies in the two update steps (each a matrix-vector
multiplication followed by a rank-one correction; see the MATLAB code above), and each of
these requires O(n(n − k)) floating point operations. This is done for k = 1, 2, . . . , n − 2.
Therefore, the overall number of floating point operations is O(n³).
7.6 QR Algorithm
This method aims at computing all eigenvalues of a given matrix A ∈ R^{n×n} simultaneously.
It benefits from the Hessenberg form of a matrix.
Definition 7.21 (QR method for the computation of eigenvalues)
Let A ∈ R^{n×n} and define A_0 := A. For m = 0, 1, 2, . . . decompose A_m in the form A_m =
Q_m R_m with an orthogonal matrix Q_m and an upper triangular matrix R_m (that is, we use
the QR factorization of A_m). Then form the swapped product

A_{m+1} := R_m Q_m.        (7.42)
Since Q_m is an orthogonal matrix, from A_m = Q_m R_m we have Q_m^T A_m = R_m. Obviously, from
(7.42),

A_{m+1} = Q_m^T A_m Q_m = Q_m^{−1} A_m Q_m.        (7.43)

The representation (7.43) shows that all matrices A_m in the QR method have the same eigen-
values as the initial matrix A_0 = A.
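With MATLAB's built-in qr function, one step of this iteration reads as follows (a two-line
sketch for illustration; the full implementation used in these notes, with an explicit QR
factorization routine, is given at the end of this section):

[Q,R] = qr(A);   % QR factorization A_m = Q_m * R_m
A = R * Q;       % swapped product A_{m+1} = R_m * Q_m, see (7.42)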
Theorem 7.22 (properties of the QR transformation)
Let A_m be the matrices computed in the QR method. The QR transformation A_m ↦ A_{m+1}
respects the upper Hessenberg form of a matrix, that is, if A_m is an upper Hessenberg matrix,
then A_{m+1} is also an upper Hessenberg matrix. In particular, a given symmetric tridiagonal
matrix A remains a symmetric tridiagonal matrix under the QR transformation. If A ∈
R^{n×n} is in upper Hessenberg form, then its QR factorization can be computed in O(n²)
operations.
Proof of Theorem 7.22. From (7.43) we can immediately conclude that if A_m is symmetric,
so is A_{m+1}. Indeed, if A_m = A_m^T, then

A_{m+1}^T = (Q_m^T A_m Q_m)^T = Q_m^T A_m^T (Q_m^T)^T = Q_m^T A_m Q_m = A_{m+1}.

If A_m is in upper Hessenberg form then we can compute the QR factorization of A_m in n − 1
steps. In each step, a transformation is performed by multiplication from the left with a
Householder matrix, where in the jth step the Householder matrix maps the entry with index
(j + 1, j) to zero. If we denote the Householder matrix in the jth step by H_{j+1,j} then we have
that

H_{n,n−1} · · · H_{3,2} H_{2,1} A_m = R_m

with an upper triangular matrix R_m. Since the Householder matrices are orthogonal matrices,

A_m = (H_{2,1}^T H_{3,2}^T · · · H_{n,n−1}^T) R_m = Q_m R_m   with   Q_m = H_{2,1}^T H_{3,2}^T · · · H_{n,n−1}^T.
Thus the matrix A_{m+1} in the QR method is given by

A_{m+1} = R_m Q_m = R_m (H_{2,1}^T H_{3,2}^T · · · H_{n,n−1}^T).        (7.44)

From the properties of the Householder matrices H_{j+1,j}, right-multiplication with the transposes
of the Householder matrices only modifies entries of R_m with indices (i, j), where i ≤ j + 1.
Thus A_{m+1} is also an upper Hessenberg matrix.

If A is a symmetric tridiagonal matrix, then we know from the beginning of this proof that A_m
is also symmetric. In particular, a symmetric tridiagonal matrix is in upper Hessenberg form,
and thus A_m is in upper Hessenberg form. Since A_m is also symmetric, we see that A_m is a
symmetric tridiagonal matrix. □
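The O(n²) count in Theorem 7.22 can be made concrete. The following sketch (our own
illustration, not part of the notes) performs one QR step for an upper Hessenberg matrix using
plane rotations in place of the Householder matrices H_{j+1,j} from the proof; both annihilate
the same entries (j + 1, j), and the matrix Q_m is never formed explicitly:

function A = qr_step_hessenberg(A)
% one step A_m -> A_{m+1} = R_m * Q_m of the QR method for an
% upper Hessenberg matrix A_m, in O(n^2) operations: in the jth
% step a rotation in the (j,j+1) plane maps the entry (j+1,j)
% to zero, playing the role of H_{j+1,j} in Theorem 7.22
n = size(A,1);
c = zeros(n-1,1); s = zeros(n-1,1);
for j = 1:n-1
    r = hypot(A(j,j), A(j+1,j));
    if r == 0
        c(j) = 1; s(j) = 0;        % nothing to annihilate
    else
        c(j) = A(j,j)/r; s(j) = A(j+1,j)/r;
    end
    G = [c(j) s(j); -s(j) c(j)];
    A(j:j+1,j:n) = G * A(j:j+1,j:n);      % A becomes R_m
end
for j = 1:n-1
    G = [c(j) s(j); -s(j) c(j)];
    A(1:j+1,j:j+1) = A(1:j+1,j:j+1) * G'; % forms R_m * Q_m
end

Each rotation touches only two rows (or two columns), so both loops cost O(n) per step and
O(n²) in total, and the result is again upper Hessenberg, in accordance with the theorem.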
For the analysis of the convergence of the QR method, we need two auxiliary results. The first
one concerns the uniqueness of the QR factorization of a matrix A.
Lemma 7.23 (uniqueness of QR factorization)
Let A ∈ R^{n×n} be non-singular. If A = Q R with an orthogonal matrix Q and an upper
triangular matrix R, then Q and R are unique up to the signs of the diagonal entries of R
(and the corresponding signs of the columns of Q).
Proof of Lemma 7.23. Assume that we have two QR decompositions A = Q_1 R_1 and
A = Q_2 R_2. This gives

Q_1 R_1 = Q_2 R_2  ⇔  R_1 R_2^{−1} = Q_1^T Q_2 =: S,

which shows that S is an orthogonal matrix and an upper triangular matrix. (Note that the
inverse of an upper triangular matrix is also upper triangular and that the product of two upper
triangular matrices is also an upper triangular matrix.) The inverse S^{−1} of the upper triangular
matrix S is also upper triangular. Since S^{−1} = S^T as S is an orthogonal matrix, we also have
that S^T is an upper triangular matrix. Hence we have that S is also a lower triangular matrix.
Thus S has to be a diagonal matrix. An orthogonal diagonal matrix can only have diagonal
entries ±1. If we fix the signs of the diagonal entries of R_1 to be the same as those of R_2, it
follows that S = (s_{i,j}) can only be the identity (since s_{i,i} = (R_1 R_2^{−1})_{i,i} = (R_1)_{i,i} (R_2^{−1})_{i,i} because
R_1 and R_2 are upper triangular, and sign((R_2^{−1})_{i,i}) = sign((R_2)_{i,i})). Then

R_1 R_2^{−1} = Q_1^T Q_2 = I  ⇒  R_1 = R_2 and Q_1 = Q_2,

and we see that, apart from the signs of the diagonal entries, the matrices Q and R of the QR
factorization are uniquely determined. □
The second lemma is technical and will be needed in the proof of the convergence of the QR
method.
Lemma 7.24 (technical lemma for QR method)
Let D := diag(d_1, d_2, . . . , d_n) ∈ R^{n×n} be a diagonal matrix with

|d_1| > |d_2| > · · · > |d_j| > |d_{j+1}| > · · · > |d_n| > 0,

and let L = (ℓ_{i,j}) ∈ R^{n×n} be a normalized lower triangular matrix. Let L_m denote the lower
triangular matrix with entries ℓ_{i,j} (d_i/d_j)^m for i ≥ j. Then we have

L_m D^m = D^m L  ⇔  L_m = D^m L (D^{−1})^m,   m ∈ N_0.        (7.45)

Furthermore, L_m converges linearly to the identity matrix for m → ∞.
Proof of Lemma 7.24. First we observe that multiplication of any matrix B from the left
or right with a diagonal matrix D leaves any zero entries in B invariant. Therefore,
since D^m and (D^{−1})^m are diagonal matrices and L is a normalized lower triangular matrix,
the matrix D^m L (D^{−1})^m is also a lower triangular matrix. Next we compute its entries: since
D^m = diag(d_1^m, d_2^m, . . . , d_n^m) and (D^{−1})^m = diag(d_1^{−m}, d_2^{−m}, . . . , d_n^{−m}), we obtain from matrix
multiplication

(D^m L (D^{−1})^m)_{i,j} = d_i^m ℓ_{i,j} d_j^{−m} = ℓ_{i,j} (d_i/d_j)^m,   1 ≤ j ≤ i ≤ n,

and we see that indeed D^m L (D^{−1})^m = L_m. We have (L_m)_{i,i} = ℓ_{i,i} = 1, since L is a normalized
lower triangular matrix, and, since |d_i| < |d_j| if j < i,

|(L_m)_{i,j}| = |ℓ_{i,j}| |d_i/d_j|^m → 0 for m → ∞,   1 ≤ j < i ≤ n,

which shows that L_m converges linearly to the identity matrix. □
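A quick numerical illustration of Lemma 7.24 (our own sketch; the concrete matrices D and
L are arbitrary choices satisfying the assumptions): the matrices L_m = D^m L (D^{−1})^m approach
the identity at a linear rate governed by the ratios |d_i/d_j|:

% illustration of Lemma 7.24
D = diag([4 2 1]);                  % |d_1| > |d_2| > |d_3| > 0
L = [1 0 0; 0.5 1 0; -0.3 0.7 1];   % normalized lower triangular
for m = 1:5
    Lm = D^m * L / D^m;             % L_m = D^m * L * (D^{-1})^m
    fprintf('m = %d, ||L_m - I|| = %.3e\n', m, norm(Lm - eye(3)));
end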
Theorem 7.25 (convergence of QR method)
Let A ∈ R^{n×n} be non-singular, and assume that A has n real eigenvalues λ_1, λ_2, . . . , λ_n ∈
R and n linearly independent corresponding real eigenvectors w_1, w_2, . . . , w_n ∈ R^n. Let
T ∈ R^{n×n} be the matrix of corresponding eigenvectors of A, that is, T = (w_1, w_2, . . . , w_n),
and assume that T^{−1} possesses an LU factorization without pivoting. Then the matrices
A_m = (a^{(m)}_{i,j}) created by the QR method have the following properties:

(i) The sub-diagonal elements converge to zero, that is, a^{(m)}_{i,j} → 0 for m → ∞ for all i > j.

(ii) The sequences {A_{2m}} and {A_{2m+1}} each converge to an upper triangular matrix.

(iii) The diagonal elements converge to the eigenvalues of A, that is, a^{(m)}_{i,i} → λ_{π(i)} for
m → ∞ for all 1 ≤ i ≤ n, where π : {1, 2, . . . , n} → {1, 2, . . . , n} is some permutation
of the numbers 1, 2, . . . , n.

Furthermore, the sequence of the matrices Q_m, created by the QR method, converges to
an orthogonal diagonal matrix, that is, to a diagonal matrix having only 1 or −1 on the
diagonal.
Proof of Theorem 7.25. From (7.43), we already know that all generated matrices A_m have
the same eigenvalues as A. Let Q_m and R_m be the matrices generated in the mth step of the
QR method (that is, A_m = Q_m R_m and A_{m+1} = R_m Q_m), and introduce the notation

R̂_m := R_m R_{m−1} · · · R_1 R_0   and   Q̂_m := Q_0 Q_1 · · · Q_{m−1} Q_m.

From the definition of Q̂_{m−1} as a product of unitary matrices we see that Q̂_{m−1} is a unitary
matrix, and from the definition of R̂_{m−1} as a product of upper triangular matrices, we see that
the matrix R̂_{m−1} is an upper triangular matrix.

From repeated application of (7.43) we have

A_m = Q_{m−1}^T A_{m−1} Q_{m−1} = Q_{m−1}^T Q_{m−2}^T A_{m−2} Q_{m−2} Q_{m−1} = · · · = Q̂_{m−1}^T A Q̂_{m−1},

that is,

A_m = Q̂_{m−1}^T A Q̂_{m−1}.        (7.46)

By induction we can also show that the powers of A satisfy

A^m = Q̂_{m−1} R̂_{m−1}.        (7.47)

Indeed, from the QR factorization A = Q_0 R_0 = Q̂_0 R̂_0, (7.47) holds true for m = 1.
Assume that (7.47) holds true for m. Then for m + 1

A^{m+1} = A A^m = A Q̂_{m−1} R̂_{m−1} = Q̂_{m−1} A_m R̂_{m−1} = Q̂_{m−1} (Q_m R_m) R̂_{m−1} = Q̂_m R̂_m,

where we have used Q̂_{m−1} A_m = A Q̂_{m−1} (from (7.46)) in the third step and the QR factorization
A_m = Q_m R_m in the fourth step.

Since Q̂_{m−1} is unitary and R̂_{m−1} is upper triangular, (7.47) is a QR factorization A^m =
Q̂_{m−1} R̂_{m−1} of A^m. Since A is non-singular, the mth power A^m is also non-singular, and
by Lemma 7.23 the QR factorization of A^m is unique, if we assume that all upper triangular
matrices R̂_i have positive diagonal elements. Thus (7.47) is the unique QR factorization of A^m,
where the upper triangular matrix R̂_{m−1} has positive diagonal entries.

If we define D = diag(λ_1, λ_2, . . . , λ_n), then we have the relation

A T = T D  ⇔  A = T D T^{−1},

which leads to

A^m = T D^m T^{−1}.

By assumption, we have an LU factorization of T^{−1}, that is, T^{−1} = L U, where L is a normalized
lower triangular matrix and U is an upper triangular matrix. This leads to

A^m = T D^m L U.        (7.48)
By Lemma 7.24 there is a sequence {L_m} of lower triangular matrices, defined by L_m :=
D^m L (D^{−1})^m, which converges to I and which satisfies

D^m L = L_m D^m.        (7.49)

Substitution of (7.49) into (7.48) leads to

A^m = T L_m D^m U.        (7.50)

Using the QR factorization

T = Q R        (7.51)

of T with positive diagonal elements in R, we derive from substituting (7.51) into (7.50)

A^m = Q R L_m D^m U.        (7.52)

Since the matrices L_m converge to I as m → ∞, the matrices R L_m have to converge to R as
m → ∞. If we now compute a QR factorization of R L_m as

R L_m = Q̃_m R̃_m,        (7.53)

again with positive diagonal elements in R̃_m, then we can conclude from the uniqueness of
the QR factorization that Q̃_m has to converge to the identity matrix. Substituting (7.53) into
(7.52), the equation (7.52) can hence be rewritten as

A^m = Q Q̃_m R̃_m D^m U.        (7.54)

Next, let us introduce the diagonal and orthogonal matrices

∆_m := diag(s_1^m, . . . , s_n^m)   with   s_i^m := sign(λ_i^m u_{i,i}),

where u_{i,i} are the diagonal elements of U. Then, because of ∆_m² = I, we can rewrite the
representation of A^m in (7.54) as

A^m = (Q Q̃_m ∆_m) (∆_m R̃_m D^m U).        (7.55)

Since the signs of the diagonal elements of the upper triangular matrix R̃_m D^m U coincide with
those of D^m U and hence with those of ∆_m, we see that (7.55) is a QR factorization of A^m with
positive diagonal elements in the upper triangular matrix.

If we compare this QR factorization with the QR factorization (see (7.47))

A^m = Q̂_{m−1} R̂_{m−1},

we note that in both cases the upper triangular matrix in the QR factorization has positive
diagonal entries. We can thus conclude from the uniqueness of the QR factorization that

Q̂_{m−1} = Q Q̃_m ∆_m.        (7.56)
Since Q̃_m converges to the identity matrix, the matrices Q̂_m converge to Q apart from the signs
of the columns. From (7.46), we have

A_m = Q̂_{m−1}^T A Q̂_{m−1},

and from this and (7.56) we can derive (using that Q̃_m^T = Q̃_m^{−1}, since Q̃_m is orthogonal, and
∆_m^{−1} = ∆_m)

A_m = Q̂_{m−1}^{−1} A Q̂_{m−1} = (Q Q̃_m ∆_m)^{−1} A (Q Q̃_m ∆_m)
    = ∆_m Q̃_m^{−1} Q^{−1} A Q Q̃_m ∆_m = ∆_m Q̃_m^{−1} (R D R^{−1}) Q̃_m ∆_m,

with Q̃_m^{−1} → I as m → ∞, where we have used in the last step that Q = T R^{−1} from (7.51)
and hence Q^{−1} A Q = R (T^{−1} A T) R^{−1} = R D R^{−1}. If m tends to infinity then,
because of ∆_{2m} = ∆_0 and ∆_{2m+1} = ∆_1, we can conclude the convergence of {A_{2m}} and {A_{2m+1}}
to ∆_0 R D R^{−1} ∆_0 and ∆_1 R D R^{−1} ∆_1, respectively. Both limit matrices are upper triangular
and have the same elements on the diagonal, which proves (ii). Thus we have that, if i > j,
a^{(m)}_{i,j} → 0 for m → ∞, which proves statement (i). Moreover, we can conclude that for the
diagonal entries of A_m

lim_{m→∞} a^{(m)}_{i,i} = (∆_0 R D R^{−1} ∆_0)_{i,i} = (∆_1 R D R^{−1} ∆_1)_{i,i},   1 ≤ i ≤ n.        (7.57)

Since ∆_0 R D R^{−1} ∆_0 = (∆_0 R) D (∆_0 R)^{−1}, this matrix has the same eigenvalues as D =
diag(λ_1, λ_2, . . . , λ_n). Since ∆_0 R D R^{−1} ∆_0 is an upper triangular matrix, we know that its
diagonal entries are the eigenvalues of A, and thus

(∆_0 R D R^{−1} ∆_0)_{i,i} = (∆_0 (Q^{−1} T) D (Q^{−1} T)^{−1} ∆_0)_{i,i} = (∆_0 Q^T A Q ∆_0)_{i,i},   i = 1, 2, . . . , n,        (7.58)

are the eigenvalues of A. We cannot conclude that (∆_0 R D R^{−1} ∆_0)_{i,i} is the eigenvalue λ_i, since
the unitary matrix Q in the last representation in (7.58) could provide a basis transformation
Q^T A Q that yields a permutation of the eigenvalues. Combining (7.58) and (7.57) finally yields
statement (iii).

Finally, from the definition of Q̂_m and from (7.56),

Q_m = Q̂_{m−1}^{−1} Q̂_m = ∆_m^{−1} Q̃_m^{−1} Q^{−1} Q Q̃_{m+1} ∆_{m+1} = ∆_m Q̃_m^{−1} Q̃_{m+1} ∆_{m+1}.

Using again the convergence of Q̃_m to the identity matrix I, we see that Q_m converges to
∆ := ∆_0 ∆_1 = diag(sign(λ_1), sign(λ_2), . . . , sign(λ_n)). This concludes the proof. □
The QR method for eigenvalue computation can be implemented with the following MATLAB
code:
function [lambda,K] = QR_method(A,J)
%
% QR method for the computation of the eigenvalues of a
% real n by n matrix with real eigenvalues and a
% corresponding set of n linearly independent eigenvectors:
% starting matrix A_1 = A
% m-th step: compute QR factorization A_m = Q_m * R_m,
%            then form A_{m+1} = R_m * Q_m
%
% input: A = real n by n matrix with n real eigenvalues and
%            a corresponding set of n linearly independent
%            eigenvectors
%        J = maximal number of iterations
%
% output: lambda = n by 1 vector with approximate eigenvalues of A
%         K = number of iterations
%
n = size(A,2);
for m = 1:J
    % C = strictly lower triangular part of A_m
    C = A;
    for i = 1:n
        C(i,i:n) = 0;
    end
    % stop once all entries below the diagonal are small
    if max(max(abs(C))) < 10^(-4)
        break
    else
        [Q,R] = QR_factorization(A);
        A = R * Q;
    end
end
lambda = diag(A);
K = m;
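Assuming the QR factorization routine listed next is available on the MATLAB path, the
method can be called as follows (our own driver; compare Example 7.26 below):

A = [1 0 0.25 0.25; 0 1 0 0.25; 0.25 0 1 0; 0.25 0.25 0 1];
[lambda,K] = QR_method(A,50)
% lambda approximates the eigenvalues computed by hand in Example 7.17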
The function QR_method uses the QR factorization implemented with the following MATLAB code:
function [Q,R] = QR_factorization(A)
%
% algorithm computes the QR factorization of A, that is, A = Q*R
% input:  A = real n by n matrix
% output: Q = real orthogonal n by n matrix
%         R = real n by n upper triangular matrix
%
n = size(A,1);
Q = eye(n,n);
R = A;
%
for j = 1:n-1
    if max(abs(R(j:n,j))) > 0
        % Householder vector for column j
        v = R(j:n,j) - norm(R(j:n,j)) * [1 ; zeros(n-j,1)];
        if norm(v) > 0   % if v = 0, column j is already reduced
            w = [ zeros(j-1,1) ; v/norm(v) ];
            % apply H(w) = I - 2*w*w' from the left to R and Q
            R = R - 2 * w * (w' * R);
            Q = Q - 2 * w * (w' * Q);
        end
    end
end
Q = Q';
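As a quick sanity check (our own snippet, not part of the notes), one can verify the factorization
and the orthogonality of Q on a random matrix:

A = rand(5);
[Q,R] = QR_factorization(A);
disp(norm(Q*R - A))        % should be of the order of machine precision
disp(norm(Q'*Q - eye(5)))  % Q is orthogonal
disp(norm(tril(R,-1)))     % R is upper triangular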
We give a numerical example.
Example 7.26 (eigenvalue computation with QR method)
Consider the symmetric matrix from Example 5.19 and Example 7.17, given by

A = [ 1    0    0.25 0.25 ]
    [ 0    1    0    0.25 ]
    [ 0.25 0    1    0    ]
    [ 0.25 0.25 0    1    ] .
In Example 7.17, we computed the eigenvalues by hand and found:

λ_1 = 1 + √((3 + √5)/32) ≈ 1.4045,
λ_2 = 1 + √((3 − √5)/32) ≈ 1.1545,
λ_3 = 1 − √((3 − √5)/32) ≈ 0.8455,
λ_4 = 1 − √((3 + √5)/32) ≈ 0.5955.
We compute the first 8 iterations of the QR method with the MATLAB code listed above
and show the diagonal entries of A_{m+1} found in each iteration step in the table below.
m                  1       2       3       4       5       6       7       8
(A_{m+1})_{1,1}    1.2222  1.3267  1.3657  1.3820  1.3904  1.3953  1.3983  1.4004
(A_{m+1})_{2,2}    1.1056  1.1380  1.1474  1.1525  1.1554  1.1566  1.1568  1.1566
(A_{m+1})_{3,3}    0.9069  0.8837  0.8722  0.8624  0.8555  0.8512  0.8487  0.8472
(A_{m+1})_{4,4}    0.7652  0.6516  0.6147  0.6030  0.5988  0.5970  0.5962  0.5958
After only 8 iterative steps the eigenvalues are accurately approximated up to the second
digit after the decimal point. □