196
Numerical Linear Algebra And Its Applications Xiao-Qing JIN 1 Yi-Min WEI 2 August 29, 2008 1 Department of Mathematics, University of Macau, Macau, P. R. China. 2 Department of Mathematics, Fudan University, Shanghai, P.R. China

Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

Embed Size (px)

Citation preview

Page 1: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

Numerical Linear AlgebraAnd Its Applications

Xiao-Qing JIN 1 Yi-Min WEI 2

August 29, 2008

1Department of Mathematics, University of Macau, Macau, P. R. China.2Department of Mathematics, Fudan University, Shanghai, P.R. China

Page 2: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

2

Page 3: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

i

To Our Families

Page 4: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

ii

CONTENTS

page

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Basic symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Basic problems in NLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Why shall we study numerical methods? . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Matrix factorizations (decompositions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.5 Perturbation and error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.6 Operation cost and convergence rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Chapter 2 Direct Methods for Linear Systems . . . . . . . . . . . . . . . . . 9

2.1 Triangular linear systems and LU factorization . . . . . . . . . . . . . . . . . . . . . 9

2.2 LU factorization with pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3 Cholesky factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

Chapter 3 Perturbation and Error Analysis . . . . . . . . . . . . . . . . . . . 25

3.1 Vector and matrix norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2 Perturbation analysis for linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3 Error analysis on floating point arithmetic . . . . . . . . . . . . . . . . . . . . . . . . 37

3.4 Error analysis on partial pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Chapter 4 Least Squares Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.1 Least squares problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2 Orthogonal transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Page 5: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

iii

4.3 QR decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Chapter 5 Classical Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.1 Jacobi and Gauss-Seidel method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.2 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.3 Convergence rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.4 SOR method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

Chapter 6 Krylov Subspace Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.1 Steepest descent method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.2 Conjugate gradient method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6.3 Practical CG method and convergence analysis . . . . . . . . . . . . . . . . . . . . 92

6.4 Preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.5 GMRES method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

Chapter 7 Nonsymmetric Eigenvalue Problems . . . . . . . . . . . . . . . 111

7.1 Basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7.2 Power method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

7.3 Inverse power method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

7.4 QR method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

7.5 Real version of QR algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

Chapter 8 Symmetric Eigenvalue Problems . . . . . . . . . . . . . . . . . . . 131

8.1 Basic spectral properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

8.2 Symmetric QR method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

8.3 Jacobi method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

Page 6: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

iv

8.4 Bisection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

8.5 Divide-and-conquer method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

Chapter 9 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

9.2 Background of BVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

9.3 Strang-type preconditioner for ODEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

9.4 Strang-type preconditioner for DDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

9.5 Strang-type preconditioner for NDDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

9.6 Strang-type preconditioner for SPDDEs . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

Page 7: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

v

Preface

Numerical linear algebra, also called matrix computation, has been a center of sci-entific and engineering computing since 1946, the first modern computer was born.Most of problems in science and engineering finally become problems in matrix compu-tation. Therefore, it is important for us to study numerical linear algebra. This bookgives an elementary introduction to matrix computation and it also includes some newresults obtained in recent years. In the beginning of this book, we first give an outlineof numerical linear algebra in Chapter 1.

In Chapter 2, we introduce Gaussian elimination, a basic direct method, for solvinggeneral linear systems. Usually, Gaussian elimination is used for solving a dense linearsystem with median size and no special structure. The operation cost of Gaussianelimination is O(n3) where n is the size of the system. The pivoting technique is alsostudied.

In Chapter 3, in order to discuss effects of perturbation and error on numericalsolutions, we introduce vector and matrix norms and study their properties. The erroranalysis on floating point operations and on partial pivoting technique is also given.

In Chapter 4, linear least squares problems are studied. We will concentrate onthe problem of finding the least squares solution of an overdetermined linear systemAx = b where A has more rows than columns. Some orthogonal transformations andthe QR decomposition are used to design efficient algorithms for solving least squaresproblems.

We study classical iterative methods for the solution of Ax = b in Chapter 5.Iterative methods are quite different from direct methods such as Gaussian elimination.Direct methods based on an LU factorization of the matrix A are prohibitive in termsof computing time and computer storage if A is quite large. Usually, in most largeproblems, the matrices are sparse. The sparsity may be lost during the LU factorizationprocedure and then at the end of LU factorization, the storage becomes a crucial issue.For such kind of problem, we can use a class of methods called iterative methods. Weonly consider some classical iterative methods in this chapter.

In Chapter 6, we introduce another class of iterative methods called Krylov sub-space methods proposed recently. We will only study two versions among those Krylovsubspace methods: the conjugate gradient (CG) method and the generalized mini-mum residual (GMRES) method. The CG method proposed in 1952 is one of the bestknown iterative method for solving symmetric positive definite linear systems. TheGMRES method was proposed in 1986 for solving nonsymmetric linear systems. Thepreconditioning technique is also studied.

Eigenvalue problems are particularly interesting in scientific computing. In Chapter

Page 8: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

vi

7, nonsymmetric eigenvalue problems are studied. We introduce some well-knownmethods such as the power method, the inverse power method and the QR method.

The symmetric eigenvalue problem with its nice properties and rich mathematicaltheory is one of the most interesting topics in numerical linear algebra. In Chapter 8,we will study this topic. The symmetric QR iteration method, the Jacobi method, thebisection method and a divide-and-conquer technique will be discussed in this chapter.

In Chapter 9, we will briefly survey some of the latest developments in using bound-ary value methods for solving systems of ordinary differential equations with initialvalues. These methods require the solutions of one or more nonsymmetric, large andsparse linear systems. Therefore, we will use the GMRES method in Chapter 6 withsome preconditioners for solving these linear systems. One of the main results is thatif an Aν1,ν2-stable boundary value method is used for an m-by-m system of ODEs,then the preconditioned matrix can be decomposed as I + L where I is the identitymatrix and the rank of L is at most 2m(ν1 + ν2). It follows that when the GMRESmethod is applied to the preconditioned system, the method will converge in at most2m(ν1 +ν2)+1 iterations. Applications to different delay differential equations are alsogiven.

“ If any other mathematical topic is as fundamental to the mathematicalsciences as calculus and differential equations, it is numerical linear algebra.” — L. Trefethen and D. Bau III

Acknowledgments: We would like to thank Professor Raymond H. F. Chan ofthe Department of Mathematics, Chinese University of Hong Kong, for his constantencouragement, long-standing friendship, financial support; Professor Z. H. Cao of theDepartment of Mathematics, Fudan University, for his many helpful discussions anduseful suggestions. We also would like to thank our friend Professor Z. C. Shi forhis encouraging support and valuable comments. Of course, special appreciation goesto two important institutions in the authors’ life: University of Macau and FudanUniversity for providing a wonderful intellectual atmosphere for writing this book.Most of the writing was done during evenings, weekends and holidays. Finally, thanksare also due to our families for their endless love, understanding, encouragement andsupport essential to the completion of this book. The most heartfelt thanks to all ofthem!

The publication of the book is supported in part by the research grants No.RG024/01-02S/JXQ/FST, No.RG031/02-03S/JXQ/FST and No.RG064/03-04S/JXQ/FST fromUniversity of Macau; the research grant No.10471027 from the National Natural ScienceFoundation of China and some financial support from Shanghai Education Committeeand Fudan University.

Page 9: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

vii

Authors’ words on the corrected and revised second printing: In its secondprinting, we corrected some minor mathematical and typographical mistakes in thefirst printing of the book. We would like to thank all those people who pointed theseout to us. Additional comments and some revision have been made in Chapter 7.The references have been updated. More exercises are also to be found in the book.The second printing of the book is supported by the research grant No.RG081/04-05S/JXQ/FST.

Page 10: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

viii

Page 11: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

Chapter 1

Introduction

Numerical linear algebra (NLA) is also called matrix computation. It has been a centerof scientific and engineering computing since the first modern computer came to thisworld around 1946. Most of problems in science and engineering are finally transferredinto problems in NLA. Thus, it is very important for us to study NLA. This book givesan elementary introduction to NLA and it also includes some new results obtained inrecent years.

1.1 Basic symbols

We will use the following symbols throughout this book.

• Let R denote the set of real numbers, C denote the set of complex numbers andi ≡ √−1.

• Let Rn denote the set of real n-vectors and Cn denote the set of complex n-vectors.Vectors will almost always be column vectors.

• Let Rm×n denote the linear vector space of m-by-n real matrices and Cm×n denotethe linear vector space of m-by-n complex matrices.

• We will use the upper case letters such as A, B, C, ∆ and Λ, etc, to denotematrices and use the lower case letters such as x, y, z, etc, to denote vectors.

• The symbol aij will denote the ij-th entry in a matrix A.

• The symbol AT will denote the transpose of the matrix A and A∗ will denote theconjugate transpose of the matrix A.

• Let a1, · · · , am ∈ Rn (or Cn). We will use span{a1, · · · , am} to denote the linearvector space of all the linear combinations of a1, · · · , am.

1

Page 12: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

2 CHAPTER 1. INTRODUCTION

• Let rank(A) denote the rank of the matrix A.

• Let dim(S) denote the dimension of the vector space S.

• We will use det(A) to denote the determinant of the matrix A and use diag(a11, · · · , ann)to denote the n-by-n diagonal matrix:

diag(a11, · · · , ann) =

a11 0 · · · 0

0 a22. . .

......

. . . . . . 00 · · · 0 ann

.

• For matrix A = [aij ], the symbol |A| will denote the matrix with entries (|A|)ij =|aij |.

• The symbol I will denote the identity matrix, i.e.,

I =

1 0 · · · 0

0 1. . .

......

. . . . . . 00 · · · 0 1

,

and ei will denote the i-th unit vector, i.e., the i-th column vector of I.

• We will use ‖ · ‖ to denote a norm of matrix or vector. The symbols ‖ · ‖1, ‖ · ‖2

and ‖ · ‖∞ will denote the p-norm with p = 1, 2,∞, respectively.

• As in MATLAB, in algorithms, A(i, j) will denote the (i, j)-th entry of matrix A;A(i, :) and A(:, j) will denote the i-th row and the j-th column of A, respectively;A(i1 : i2, k) will express the column vector constructed by using entries from thei1-th entry to the i2-th entry in the k-th column of A; A(k, j1 : j2) will express therow vector constructed by using entries from the j1-th entry to the j2-th entryin the k-th row of A; A(k : l, p : q) will denote the (l − k + 1)-by-(q − p + 1)submatrix constructed by using the rows from the k-th row to the l-th row andthe columns from the p-th column to the q-th column.

1.2 Basic problems in NLA

NLA includes the following three main important problems which will be studied inthis book:

Page 13: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

1.3. WHY SHALL WE STUDY NUMERICAL METHODS? 3

(1) Find the solution of linear systems

Ax = b

where A is an n-by-n nonsingular matrix and b is an n-vector.

(2) Linear least squares problems: For any m-by-n matrix A and an m-vector b, findan n-vector x such that

‖Ax− b‖2 = miny∈Rn

‖Ay − b‖2.

(3) Eigenvalues problems: For any n-by-n matrix A, find a part (or all) of its eigen-values and corresponding eigenvectors. We remark here that a complex numberλ is called an eigenvalue of A if there exists a nonzero vector x ∈ Cn such that

Ax = λx,

where x is called the eigenvector of A associated with λ.

Besides these main problems, there are many other fundamental problems in NLA,for instance, total least squares problems, matrix equations, generalized inverses, in-verse problems of eigenvalues, and singular value problems, etc.

1.3 Why shall we study numerical methods?

To answer this question, let us consider the following linear system,

Ax = b

where A is an n-by-n nonsingular matrix and x = (x1, x2, · · · , xn)T . If we use thewell-known Cramer rule, then we have the following solution:

x1 =det(A1)det(A)

, x2 =det(A2)det(A)

, · · · , xn =det(An)det(A)

,

where Ai, for i = 1, 2, · · · , n, are matrices with the i-th column replaced by the vectorb. Then we should compute n + 1 determinants det(Ai), i = 1, 2, · · · , n, and det(A).There are

[n!(n− 1)](n + 1) = (n− 1)(n + 1)!

multiplications. When n = 25, by using a computer with 10 billion operations/sec., weneed

24× 26!1010 × 3600× 24× 365

≈ 30.6 billion years.

Page 14: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

4 CHAPTER 1. INTRODUCTION

If one uses Gaussian elimination, it requires

n∑

i=1

(i− 1)(i + 1) =n∑

i=1

i2 − n =16n(n + 1)(2n + 1)− n = O(n3)

multiplications. Then less than 1 second, we could solve 25-by-25 linear systems byusing the same computer. From above discussions, we note that for solving the sameproblem by using different numerical methods, the results are much different. There-fore, it is essential for us to study the properties of numerical methods.

1.4 Matrix factorizations (decompositions)

For any linear system Ax = b, if we can factorize (decompose) A as A = LU where Lis a lower triangular matrix and U is an upper triangular matrix, then we have

Ly = b

Ux = y.(1.1)

By substituting, we can easily solve (1.1) and then Ax = b. Therefore, matrix factor-izations (decompositions) are very important tools in NLA. The following theorem isbasic and useful in linear algebra, see [17].

Theorem 1.1 (Jordan Decomposition Theorem) If A ∈ Cn×n, then there existsa nonsingular matrix X ∈ Cn×n such that

X−1AX = J ≡ diag(J1, J2, · · · , Jp),

or A = XJX−1, where J is called the Jordan canonical form of A and

Ji =

λi 1 0 · · · 0

0 λi 1. . .

...... 0

. . . . . . 0...

. . . . . . 10 · · · · · · 0 λi

∈ Cni×ni ,

for i = 1, 2, · · · , p, are called Jordan blocks with n1+· · ·+np = n. The Jordan canonicalform of A is unique up to the permutation of diagonal Jordan blocks. If A ∈ Rn×n withonly real eigenvalues, then the matrix X can be taken to be real.

Page 15: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

1.5. PERTURBATION AND ERROR ANALYSIS 5

1.5 Perturbation and error analysis

The solutions provided by numerical algorithms are seldom absolutely correct. Usu-ally, there are two kinds of errors. First, errors appear in input data caused by priorcomputations or measurements. Second, there may be errors caused by algorithmsthemselves because of approximations made within algorithms. Thus, we need to carryout a perturbation and error analysis.

(1) Perturbation.

For a given x, we want to compute the value of function f(x). Suppose thereis a perturbation δx of x and |δx|/|x| is very small. We want to find a positivenumber c(x) as small as possible such that

|f(x + δx)− f(x)||f(x)| ≤ c(x)

|δx||x| .

Then c(x) is called the condition number of f(x) at x. If c(x) is large, we say thatthe function f is ill-conditioned at x; if c(x) is small, we say that the function fis well-conditioned at x.

Remark: A computational problem being ill-conditioned or not has no relationwith numerical methods that we used.

(2) Error.

By using some numerical methods, we calculate the value of a function f at apoint x and we obtain y. Because of the rounding error (or chopping error),usually

y 6= f(x).

If there exists δx such that

y = f(x + δx), |δx| ≤ ε|x|,where ε is a positive constant having a closed relation with numerical methodsand computers used, then we say that the method is stable if ε is small; themethod is unstable if ε is large.

Remark: A numerical method being stable or not has no relation with computa-tional problems that we faced.

With the perturbation and error analysis, we obtain

|y − f(x)||f(x)| =

|f(x + δx)− f(x)||f(x)| ≤ c(x)

|δx||x| ≤ εc(x).

Therefore, whether a numerical result is accurate depends on both the stability of thenumerical method and the condition number of the computational problem.

Page 16: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

6 CHAPTER 1. INTRODUCTION

1.6 Operation cost and convergence rate

Usually, numerical algorithms are divided into two classes:

(i) direct methods;

(ii) iterative methods.

By using direct methods, one can obtain an accurate solution of computational prob-lems within finite steps in exact arithmetic. By using iterative methods, one can onlyobtain an approximate solution of computational problems within finite steps.

The operation cost is an important measurement of algorithms. The operationcost of an algorithm is the total operations of “+, −, ×, ÷” used in the algorithm.We remark that the speed of algorithms is only partially depending on the operationcost. In modern computers, the speed of operations is much faster than that of datatransfer. Therefore, sometimes, the speed of an algorithm is mainly depending on thetotal amount of data transfers.

For direct methods, usually, we use the operation cost as a main measurement ofthe speed of algorithms. For iterative methods, we need to consider

(i) operation cost in each iteration;

(ii) convergence rate of the method.

For a sequence {xk} provided by an iterative algorithm, if {xk} −→ x, the exactsolution, and if {xk} satisfies

‖xk − x‖ ≤ c‖xk−1 − x‖, k = 1, 2, · · · ,

where 0 < c < 1 and ‖ · ‖ is any vector norm (see Chapter 3 for a detail), then we saythat the convergence rate is linear. If it satisfies

‖xk − x‖ ≤ c‖xk−1 − x‖p, k = 1, 2, · · · ,

where 0 < c < 1 and p > 1, then we say that the convergence rate is superlinear.

Exercises:

1. A matrix is strictly upper triangular if it is upper triangular with zero diagonal elements.Show that if A is a strictly upper triangular matrix of order n, then An = 0.

2. Let A ∈ Cn×m and B ∈ Cm×l. Prove that

rank(AB) ≥ rank(A) + rank(B)−m.

Page 17: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

1.6. OPERATION COST AND CONVERGENCE RATE 7

3. Let

A =[

A11 A12

A21 A22

],

where Aij , for i, j = 1, 2, are square matrices with det(A11) 6= 0, and satisfy

A11A21 = A21A11.

Thendet(A) = det(A11A22 −A21A12).

4. Show that det(I − uv∗) = 1− v∗u where u, v ∈ Cm are column vectors.

5. Prove Hadamard’s inequality for A ∈ Cn×n:

|det(A)| ≤n∏

j=1

‖aj‖2,

where aj = A(:, j) and ‖aj‖2 =(

n∑i=1

|A(i, j)|2) 1

2

. When does the equality hold?

6. Let B be nilpotent, i.e., there exists an integer k > 0 such that Bk = 0. Show that ifAB = BA, then

det(A + B) = det(A).

7. Let A be an m-by-n matrix and B be an n-by-m matrix. Show that the matrices[

AB 0B 0

]and

[0 0B BA

]

are similar. Conclude that the nonzero eigenvalues of AB are the same as those of BA,and

det(Im + AB) = det(In + BA).

8. A matrix M ∈ Cn×n is Hermitian positive definite if it satisfies

M = M∗, x∗Mx > 0,

for all x 6= 0 ∈ Cn. Let A and B be Hermitian positive definite matrices.

(1) Show that the matrix product AB has positive eigenvalues.

(2) Show that AB is Hermitian if and only if A and B commute.

9. Show that any matrix A ∈ Cn×n can be written uniquely in the form

A = B + iC,

where B and C are Hermitian.

10. Show that if A is skew-Hermitian, i.e., A∗ = −A, then all its eigenvalues lie on theimaginary axis.

Page 18: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

8 CHAPTER 1. INTRODUCTION

11. Let

A =[

A11 A12

A21 A22

].

Assume that A11, A22 are square, and A11, A22 −A21A−111 A12 are nonsingular. Let

B =[

B11 B12

B21 B22

]

be the inverse of A. Show that

B22 = (A22 −A21A−111 A12)−1, B12 = −A−1

11 A12B22,

B21 = −B22A21A−111 , B11 = A−1

11 −B12A21A−111 .

12. Suppose that A and B are Hermitian with A being positive definite. Show that A + Bis positive definite if and only if all the eigenvalues of A−1B are greater than −1.

13. Let A be idempotent, i.e., A2 = A. Show that each eigenvalue of A is either 0 or 1.

14. Let A be a matrix with all entries equal to one. Show that A can be written as A =eeT , where eT = (1, 1, · · · , 1), and A is positive semi-definite. Find the eigenvalues andeigenvectors of A.

15. Prove that any matrix A ∈ Cn×n has a polar decomposition A = HQ, where H isHermitian positive semi-definite and Q is unitary. We recall that M ∈ Cn×n is a unitarymatrix if M−1 = M∗. Moreover, if A is nonsingular, then H is Hermitian positive definiteand the polar decomposition of A is unique.

Page 19: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

Chapter 2

Direct Methods for LinearSystems

The problem of solving linear systems is central in NLA. For solving linear systems, ingeneral, we have two classes of methods. One is called the direct method and the otheris called the iterative method. By using direct methods, within finite steps, one canobtain an accurate solution of computational problems in exact arithmetic. By usingiterative methods, within finite steps, one can only obtain an approximate solution ofcomputational problems.

In this chapter, we will introduce a basic direct method called Gaussian eliminationfor solving general linear systems. Usually, Gaussian elimination is used for solving adense linear system with median size and no special structure.

2.1 Triangular linear systems and LU factorization

We first study triangular linear systems.

2.1.1 Triangular linear systems

We consider the following nonsingular lower triangular linear system

Ly = b (2.1)

9

Page 20: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

10 CHAPTER 2. DIRECT METHODS FOR LINEAR SYSTEMS

where b = (b1, b2, · · · , bn)T ∈ Rn is a known vector, y = (y1, y2, · · · , yn)T is an unknownvector, and L = [lij ] ∈ Rn×n is given by

L =

l11 0 · · · · · · 0

l21 l22 0 · · · ...

l31 l32 l33. . .

......

......

. . . 0ln1 ln2 ln3 · · · lnn

with lii 6= 0, i = 1, 2, · · · , n. By the first equation in (2.1), we have

l11y1 = b1,

and theny1 =

b1

l11.

Similarly, by the second equation in (2.1), we have

y2 =1l22

(b2 − l21y1).

In general, if we have already obtained y1, y2, · · · , yi−1, then by using the i-th equationin (2.1), we have

yi =1lii

(bi −

i−1∑

j=1

lijyj

).

This algorithm is called the forward substitution method which needs O(n2) operations.Now, we consider the following nonsingular upper triangular linear system

Ux = y (2.2)

where x = (x1, x2, · · · , xn)T is an unknown vector, and U ∈ Rn×n is given by

U =

u11 u12 u13 · · · u1n

0 u22 u23 · · · ...

0 0 u33. . .

......

.... . . . . .

...0 · · · · · · 0 unn

with uii 6= 0, i = 1, 2, · · · , n. Beginning from the last equation of (2.2), we can obtainxn, xn−1, · · · , x1 step by step. The xn = yn/unn and xi is given by

xi =1uii

(yi −

n∑

j=i+1

uijxj

)

Page 21: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

2.1. TRIANGULAR LINEAR SYSTEMS AND LU FACTORIZATION 11

for i = n− 1, · · · , 1. This algorithm is called the backward substitution method whichalso needs O(n2) operations.

For general linear systemsAx = b (2.3)

where A ∈ Rn×n and b ∈ Rn are known. If we can factorize the matrix A into

A = LU

where L is a lower triangular matrix and U is an upper triangular matrix, then wecould find the solution of (2.3) by the following two steps:

(1) By using the forward substitution method to find solution y of Ly = b.

(2) By using the backward substitution method to find solution x of Ux = y.

Now the problem that we are facing is how to factorize the matrix A into A = LU . Wetherefore introduce Gaussian transform matrices.

2.1.2 Gaussian transform matrix

LetLk = I − lke

Tk

where I ∈ Rn×n is the identity matrix, lk = (0, · · · , 0, lk+1,k, · · · , lnk)T ∈ Rn and ek ∈ Rn

is the k-th unit vector. Then for any k,

Lk =

1 0 · · · · · · · · · 0

0. . .

...... 1

...... −lk+1,k 1

......

.... . . 0

0 · · · −lnk · · · 0 1

is called the Gaussian transform matrix. Such a matrix is a unit lower triangularmatrix. We remark that a unit triangular matrix is a triangular matrix with ones onits diagonal. For any given vector

x = (x1, x2, · · · , xn)T ∈ Rn,

we haveLkx = (x1, · · · , xk, xk+1 − xklk+1,k, · · · , xn − xklnk)T

= (x1, · · · , xk, 0, · · · , 0)T

Page 22: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

12 CHAPTER 2. DIRECT METHODS FOR LINEAR SYSTEMS

if we takelik =

xi

xk, i = k + 1, · · · , n

with xk 6= 0. It is easy to check that

L−1k = I + lke

Tk

by noting that eTk lk = 0.

For a given matrix A ∈ Rn×n, we have

LkA = (I − lkeTk )A = A− lk(eT

k A)

andrank(lk(eT

k A)) = 1.

Therefore, LkA is a rank-one modification of the matrix A.

2.1.3 Computation of LU factorization

We consider the following simple example. Let

A =

1 5 92 4 73 3 10

.

By using the Gaussian transform matrix

L1 =

1 0 0−2 1 0−3 0 1

,

we have

L1A =

1 5 90 −6 −110 −12 −17

.

Followed by using the Gaussian transform matrix

L2 =

1 0 00 1 00 −2 1

,

we have

L2(L1A) ≡ U =

1 5 90 −6 −110 0 5

.

Page 23: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

2.1. TRIANGULAR LINEAR SYSTEMS AND LU FACTORIZATION 13

Therefore, we finally haveA = LU

where

L ≡ (L2L1)−1 = L−11 L−1

2 =

1 0 02 1 03 2 1

.

For general n-by-n matrix A, we can use n − 1 Gaussian transform matrices L1,L2, · · · , Ln−1 such that Ln−1 · · ·L1A is an upper triangular matrix. In fact, let A(0) ≡ Aand assume that we have already found k−1 Gaussian transform matrices L1, · · · , Lk−1 ∈Rn×n such that

A(k−1) = Lk−1 · · ·L1A =

[A

(k−1)11 A

(k−1)12

0 A(k−1)22

]

where A(k−1)11 is a (k − 1)-by-(k − 1) upper triangular matrix and

A(k−1)22 =

a(k−1)kk · · · a

(k−1)kn

.... . .

...a

(k−1)nk · · · a

(k−1)nn

.

If a(k−1)kk 6= 0, then we can use the Gaussian transform matrix

Lk = I − lkeTk

wherelk = (0, · · · , 0, lk+1,k, · · · , lnk)T

with

lik =a

(k−1)ik

a(k−1)kk

, i = k + 1, · · · , n,

such that the last n − k entries in the k-th column of LkA(k−1) become zeros. We

therefore have

A(k) ≡ LkA(k−1) =

[A

(k)11 A

(k)12

0 A(k)22

]

where A(k)11 is a k-by-k upper triangular matrix. After n − 1 steps, we obtain A(n−1)

which is an upper triangular matrix that we need. Let

L = (Ln−1 · · ·L1)−1, U = A(n−1),

Page 24: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

14 CHAPTER 2. DIRECT METHODS FOR LINEAR SYSTEMS

then A = LU . Now we want to show that L is a unit lower triangular matrix. Bynoting that eT

j li = 0 for j < i, we have

L = L−11 · · ·L−1

n−1

= (I + l1eT1 )(I + l2e

T2 ) · · · (I + ln−1e

Tn−1)

= I + l1eT1 + · · ·+ ln−1e

Tn−1

= I + [l1, l2, · · · , ln−1,0]

=

1 0 · · · 0 0l21 1 0 · · · 0

l31 l32 1. . .

......

......

. . . 0ln1 ln2 ln3 · · · 1

.

This computational process of the LU factorization is called Gaussian elimination.Thus, we have the following algorithm.

Algorithm 2.1 (Gaussian elimination)

for k = 1 : n− 1A(k + 1 : n, k) = A(k + 1 : n, k)/A(k, k)A(k + 1 : n, k + 1 : n) = A(k + 1 : n, k + 1 : n)

−A(k + 1 : n, k)A(k, k + 1 : n)end

The operation cost of Gaussian elimination is

n−1∑

k=1

((n− k) + 2(n− k)2

)=

n(n− 1)2

+n(n− 1)(2n− 1)

3

=23n3 + O(n2) = O(n3).

We remark that in Gaussian elimination, a(k−1)kk , k = 1, · · · , n − 1, are required to

be nonzero. We have the following theorem.

Theorem 2.1 The entries a(i−1)ii 6= 0, i = 1, · · · , k, if and only if all the leading

principal submatrices Ai of A, i = 1, · · · , k, are nonsingular.

Page 25: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

2.2. LU FACTORIZATION WITH PIVOTING 15

Proof: By induction, for k = 1, it is obviously true. Assume that the statement istrue until k − 1. We want to show that if A1, · · · , Ak−1 are nonsingular, then

“ Ak is nonsingular ⇐⇒ a(k−1)kk 6= 0 ”.

By assumption, we know that

a(i−1)ii 6= 0, i = 1, · · · , k − 1.

By using k − 1 Gaussian transform matrices L1, · · · , Lk−1, we obtain

A(k−1) = Lk−1 · · ·L1A =

[A

(k−1)11 A

(k−1)12

0 A(k−1)22

](2.4)

where A(k−1)11 is an upper triangular matrix with nonzero diagonal entries a

(i−1)ii , i =

1, · · · , k−1. Therefore, the k-th leading principal submatrix of A(k−1) has the followingform [

A(k−1)11 ∗0 a

(k−1)kk

].

Let (L1)k, · · · , (Lk−1)k denote the k-th leading principal submatrices of L1, · · · , Lk−1,respectively. By using (2.4), we obtain

(Lk−1)k · · · (L1)kAk =

[A

(k−1)11 ∗0 a

(k−1)kk

].

By noting that Li, i = 1, · · · , k− 1, are unit lower triangular matrices, we immediatelyknow that

det(Ak) = a(k−1)kk det(A(k−1)

11 ) 6= 0

if and only if a(k−1)kk 6= 0.

Thus, we have

Theorem 2.2 If all the leading principal submatrices Ai of a matrix A ∈ Rn×n arenonsingular for i = 1, · · · , n− 1, then there exists a unique LU factorization of A.

2.2 LU factorization with pivoting

Before we study pivoting techniques, we first consider the following simple example:[

0.3× 10−11 11 1

] [x1

x2

]=

[0.70.9

].

Page 26: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

16 CHAPTER 2. DIRECT METHODS FOR LINEAR SYSTEMS

If we using Gaussian elimination with the 10-decimal-digit floating point arithmetic,we have

L =[

1 00.3333333333× 1012 1

]

and

U =[

0.3× 10−11 10 −0.3333333333× 1012

].

Then the computational solution is

x = (0.0000000000, 0.7000000000)T

which is not good comparing with the accurate solution

x = (0.2000000000006 · · · , 0.6999999999994 · · ·)T .

If we just interchange the first equation and the second equation, we have[

1 10.3× 10−11 1

] [x1

x2

]=

[0.90.7

].

By using Gaussian elimination with the 10-decimal-digit floating point arithmetic again,we have

L =[

1 00.3× 10−11 1

], U =

[1 10 1

].

Then the computational solution is

x = (0.2000000000, 0.7000000000)T

which is very good. So we need to introduce permutations into Gaussian elimination.We first define a permutation matrix.

Definition 2.1 A permutation matrix P is an identity matrix with permuted rows.

The important properties of the permutation matrix are included in the followinglemma. Its proof is straightforward.

Lemma 2.1 Let P , P1, P2 ∈ Rn×n be permutation matrices and X ∈ Rn×n. Then

(i) PX is the same as X with its rows permuted. XP is the same as X with itscolumns permuted.

(ii) P−1 = P T .

Page 27: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

2.2. LU FACTORIZATION WITH PIVOTING 17

(iii) det(P ) = ±1.

(iv) P1 · P2 is also a permutation matrix.

Now we introduce the main theorem of this section.

Theorem 2.3 If A is nonsingular, then there exist permutation matrices P1 and P2,a unit lower triangular matrix L, and a nonsingular upper triangular matrix U suchthat

P1AP2 = LU.

Only one of P1 and P2 is necessary.

Proof: We use induction on the dimension n. For n = 1, it is obviously true. Assumethat the statement is true for n − 1. If A is nonsingular, then it has a nonzero entry.Choose permutation matrices P

′1 and P

′2 such that the (1, 1)-th position of P

′1AP

′2 is

nonzero. Now we write a desired factorization and solve for unknown components:

P′1AP

′2 =

[a11 A12

A21 A22

]=

[1 0

L21 I

]·[

u11 U12

0 A22

]

=[

u11 U12

L21u11 L21U12 + A22

],

(2.5)

where A22, A22 are (n− 1)-by-(n− 1) matrices, and L21, UT12 are (n− 1)-by-1 matrices.

Solving for the components of this 2-by-2 block factorization, we get

u11 = a11 6= 0, U12 = A12,

andL21u11 = A21, A22 = L21U12 + A22.

Therefore, we obtain

L21 =A21

a11, A22 = A22 − L21U12.

We want to apply induction to A22, but to do so we need to check that

det(A22) 6= 0.

Sincedet(P

′1AP

′2) = ±det(A) 6= 0

Page 28: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

18 CHAPTER 2. DIRECT METHODS FOR LINEAR SYSTEMS

and also

det(P′1AP

′2) = det

[1 0

L21 I

]· det

[u11 U12

0 A22

]

= u11 · det(A22),

we know thatdet(A22) 6= 0.

Therefore, by the assumption of induction, there exist permutation matrices P1 andP2 such that

P1A22P2 = LU , (2.6)

where L is a unit lower triangular matrix and U is a nonsingular upper triangularmatrix. Substituting (2.6) into (2.5) yields

P′1AP

′2 =

[1 0

L21 I

] [u11 U12

0 P T1 LU P T

2

]

=[

1 0L21 I

] [1 00 P T

1 L

] [u11 U12

0 U P T2

]

=[

1 0L21 P T

1 L

][u11 U12P2

0 U

] [1 00 P T

2

]

=[

1 00 P T

1

] [1 0

P1L21 L

][u11 U12P2

0 U

] [1 00 P T

2

],

so we get a desired factorization of A:

P1AP2 =([

1 00 P1

]P′1

)A

(P′2

[1 00 P2

])

=[

1 0P1L21 L

][u11 U12P2

0 U

].

This row-column interchange strategy is called complete pivoting. We thereforehave the following algorithm.

Page 29: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

2.2. LU FACTORIZATION WITH PIVOTING 19

Algorithm 2.2 (Gaussian elimination with complete pivoting)

for k = 1 : n− 1choose p, q, (k ≤ p, q ≤ n) such that

|A(p, q)| = max {|A(i, j)| : i = k : n, j = k : n}A(k, 1 : n) ←→ A(p, 1 : n)A(1 : n, k) ←→ A(1 : n, q)if A(k, k) 6= 0

A(k + 1 : n, k) = A(k + 1 : n, k)/A(k, k)A(k + 1 : n, k + 1 : n) = A(k + 1 : n, k + 1 : n)

−A(k + 1 : n, k)A(k, k + 1 : n)else

stopend

end

We remark that although the LU factorization with complete pivoting can overcomesome shortcomings of the LU factorization without pivoting, the cost of completepivoting is very high. Usually, it requires O(n3) operations in comparison with entriesof the matrix for pivoting.

In order to reduce the operation cost of pivoting, the LU factorization with partialpivoting is proposed. In partial pivoting, at the k-th step, we choose a

(k−1)pk from the

submatrix A(k−1)22 which satisfies

|a(k−1)pk | = max

{|a(k−1)

ik | : k ≤ i ≤ n}

.

When A is nonsingular, the LU factorization with partial pivoting can be carried outuntil we finally obtain

PA = LU.

In this algorithm, the operation cost in comparison with entries of the matrix forpivoting is O(n2). We have

Page 30: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

20 CHAPTER 2. DIRECT METHODS FOR LINEAR SYSTEMS

Algorithm 2.3 (Gaussian elimination with partial pivoting)

for k = 1 : n− 1choose p, (k ≤ p ≤ n) such that

|A(p, k)| = max {|A(i, k)| : i = k : n}A(k, 1 : n) ←→ A(p, 1 : n)if A(k, k) 6= 0

A(k + 1 : n, k) = A(k + 1 : n, k)/A(k, k)A(k + 1 : n, k + 1 : n) = A(k + 1 : n, k + 1 : n)

−A(k + 1 : n, k)A(k, k + 1 : n)else

stopend

end

2.3 Cholesky factorization

Let A ∈ Rn×n be symmetric positive definite, i.e., it satisfies

A = AT , xT Ax > 0,

for all x 6= 0 ∈ Rn. We have

Theorem 2.4 Let A ∈ Rn×n be symmetric positive definite. Then there exists a lowertriangular matrix L ∈ Rn×n with positive diagonal entries such that

A = LLT .

This factorization is called the Cholesky factorization.

Proof: Since A is positive definite, all the principal submatrices of A should be positivedefinite. By Theorem 2.2, there exist a unit lower triangular matrix L and an uppertriangular matrix U such that

A = LU.

LetD = diag(u11, · · · , unn), U = D−1U,

where uii > 0, for i = 1, · · · , n. Then we have

UT DLT = AT = A = LDU.

Therefore,LT U−1 = D−1U−T LD.

Page 31: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

2.3. CHOLESKY FACTORIZATION 21

We note that LT U−1 is a unit upper triangular matrix and D−1U−T LD is a lowertriangular matrix. Hence

LT U−1 = I = D−1U−T LD

which implies U = LT . ThusA = LDLT .

LetL = Ldiag(

√u11, · · · ,√unn).

We finally haveA = LLT .

Thus, when a matrix A is symmetric positive definite, we could find the solution ofthe system Ax = b by the following three steps:

(1) Find the Cholesky factorization of A: A = LLT .

(2) Find solution y of Ly = b.

(3) Find solution x of LT x = y.

From Theorem 2.4, we know that we do not need a pivoting in Cholesky factor-ization. Also we could calculate L directly through a comparison in the correspondingentries between two sides of A = LLT . We have the following algorithm.

Algorithm 2.4 (Cholesky factorization)

for k = 1 : n

A(k, k) =√

A(k, k)A(k + 1 : n, k) = A(k + 1 : n, k)/A(k, k)for j = k + 1 : n

A(j : n, j) = A(j : n, j)−A(j : n, k)A(j, k)end

end

The operation cost of Cholesky factorization is n3/3.

Page 32: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

22 CHAPTER 2. DIRECT METHODS FOR LINEAR SYSTEMS

Exercises:

1. Let S, T ∈ Rn×n be upper triangular matrices such that

(ST − λI)x = b

is a nonsingular system. Find an algorithm of O(n2) operations for computing x.

2. Show that the LDLT factorization of a symmetric positive definite matrix A is unique.

3. Let A ∈ Rn×n be symmetric positive definite. Find an algorithm for computing an uppertriangular matrix U ∈ Rn×n such that A = UUT .

4. Let A = [aij ] ∈ Rn×n be strictly diagonally dominant matrix, i.e.,

|akk| >n∑

j=1j 6=k

|akj |, k = 1, 2, · · · , n.

Prove that a strictly diagonally dominant matrix is nonsingular, and a strictly diagonallydominant symmetric matrix with positive diagonal entries is positive definite.

5. Let

A =[

A11 A12

A21 A22

]

with A11 being a k-by-k nonsingular matrix. Then

S = A22 −A21A−111 A12

is called the Schur complement of A11 in A. Show that after k steps of Gaussian elimi-nation without pivoting, A

(k−1)22 = S.

6. Let A be a symmetric positive definite matrix. At the end of the first step of Gaussianelimination, we have [

a11 aT1

0 A22

].

Prove that A22 is also symmetric positive definite.

7. Let A = [aij ] ∈ Rn×n be a strictly diagonally dominant matrix. After one step ofGaussian elimination, we have [

a11 aT1

0 A22

].

Show that A22 is also strictly diagonally dominant.

8. Show that if PAQ = LU is obtained via Gaussian elimination with pivoting, then |uii| ≥|uij |, for j = i + 1, · · · , n.

9. Let H = A + iB be a Hermitian positive definite matrix, where A,B ∈ Rn×n.

Page 33: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

2.3. CHOLESKY FACTORIZATION 23

(1) Prove the matrix

C =[

A −BB A

]

is symmetric positive definite.

(2) How to solve(A + iB)(x + iy) = b + ic, x, y, b, c ∈ Rn

by real number computation only?

10. Develop an algorithm to solve a tridiagonal system by using Gaussian elimination withpartial pivoting.

11. Show that if a singular matrix A ∈ Rn×n has a unique LU factorization, then Ak isnonsingular for k = 1, 2, · · · , n− 1.

Page 34: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

24 CHAPTER 2. DIRECT METHODS FOR LINEAR SYSTEMS

Page 35: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

Chapter 3

Perturbation and Error Analysis

In this chapter, we will discuss effects of perturbation and error on numerical solutions.The error analysis on floating point operations and on partial pivoting technique is alsogiven. It is well-known that the essential notions of distance and size in linear vectorspaces are captured by norms. We therefore need to introduce vector and matrix normsand study their properties before we develop our perturbation and error analysis.

3.1 Vector and matrix norms

We first introduce vector norms.

3.1.1 Vector norms

Letx = (x1, x2, · · · , xn)T ∈ Rn.

Definition 3.1 A vector norm on Rn is a function that assigns to each x ∈ Rn a realnumber ‖x‖, called the norm of x, such that the following three properties are satisfiedfor all x, y ∈ Rn and all α ∈ R:

(i) ‖x‖ > 0 if x 6= 0, and ‖x‖ = 0 if and only if x = 0;

(ii) ‖αx‖ = |α| · ‖x‖;(iii) ‖x + y‖ ≤ ‖x‖+ ‖y‖.

A useful class of vector norms is the p-norm defined by

‖x‖p ≡(

n∑

i=1

|xi|p) 1

p

25

Page 36: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

26 CHAPTER 3. PERTURBATION AND ERROR ANALYSIS

where 1 ≤ p. The following p-norms are the most commonly used norms in practice:

‖x‖1 =n∑

i=1

|xi|, ‖x‖2 =

(n∑

i=1

|xi|2) 1

2

, ‖x‖∞ = max1≤i≤n

|xi|.

The Cauchy-Schwarz inequality concerning ‖ · ‖2 is given as follows,

|xT y| ≤ ‖x‖2‖y‖2

for x, y ∈ Rn, which is a special case of the Holder inequality given as follows,

|xT y| ≤ ‖x‖p‖y‖q

where 1/p + 1/q = 1.A very important property of vector norms on Rn is that all the vector norms on

Rn are equivalent as the following theorem said, see [35].

Theorem 3.1 If ‖ · ‖α and ‖ · ‖β are two norms on Rn, then there exist two positiveconstants c1 and c2 such that

c1‖x‖α ≤ ‖x‖β ≤ c2‖x‖α

for all x ∈ Rn.

For example, if x ∈ Rn, then we have

‖x‖2 ≤ ‖x‖1 ≤√

n‖x‖2,

‖x‖∞ ≤ ‖x‖2 ≤√

n‖x‖∞and

‖x‖∞ ≤ ‖x‖1 ≤ n‖x‖∞.

We remark that for any sequence of vectors {xk} where xk = (x(k)1 , · · · , x(k)

n )T ∈ Rn,and x = (x1, · · · , xn)T ∈ Rn, by Theorem 3.1, one can prove that

limk→∞

‖xk − x‖ = 0 ⇐⇒ limk→∞

|x(k)i − xi| = 0,

for i = 1, · · · , n.

Page 37: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

3.1. VECTOR AND MATRIX NORMS 27

3.1.2 Matrix norms

LetA = [aij ]ni,j=1 ∈ Rn×n.

We now turn our attention to matrix norms.

Definition 3.2 A matrix norm is a function that assigns to each A ∈ Rn×n a realnumber ‖A‖, called the norm of A, such that the following four properties are satisfiedfor all A,B ∈ Rn×n and all α ∈ R:

(i) ‖A‖ > 0 if A 6= 0, and ‖A‖ = 0 if and only if A = 0;

(ii) ‖αA‖ = |α| · ‖A‖;(iii) ‖A + B‖ ≤ ‖A‖+ ‖B‖;(iv) ‖AB‖ ≤ ‖A‖ · ‖B‖.

An important property of matrix norms on Rn×n is that all the matrix norms onRn×n are equivalent. For the relation between a vector norm and a matrix norm, wehave

Definition 3.3 If a matrix norm ‖ · ‖M and a vector norm ‖ · ‖v satisfy

‖Ax‖v ≤ ‖A‖M‖x‖v,

for A ∈ Rn×n and x ∈ Rn, then these norms are called mutually consistent.

For any vector norm ‖ · ‖v, we can define a matrix norm in the following naturalway:

‖A‖M ≡ maxx 6=0

‖Ax‖v

‖x‖v= max‖x‖v=1

‖Ax‖v.

The most important matrix norms are the matrix p-norms induced by the vector p-norms for p = 1, 2,∞. We have the following theorem.

Theorem 3.2 LetA = [aij ]ni,j=1 ∈ Rn×n.

Then we have

(i) ‖A‖1 = max1≤j≤n

n∑i=1

|aij |.

Page 38: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

28 CHAPTER 3. PERTURBATION AND ERROR ANALYSIS

(ii) ‖A‖∞ = max1≤i≤n

n∑j=1

|aij |.

(iii) ‖A‖2 =√

λmax(AT A), where λmax(AT A) is the largest eigenvalue of AT A.

Proof: We only give the proof of (i) and (iii). In the following, we always assume thatA 6= 0.

For (i), we partition the matrix A by columns:

A = [a1, · · · , an].

Letδ = ‖aj0‖1 = max

1≤j≤n‖aj‖1.

Then for any vector x ∈ Rn which satisfies ‖x‖1 =n∑

i=1|xi| = 1, we have

‖Ax‖1 =∥∥∥

n∑j=1

xjaj

∥∥∥1≤

n∑j=1

|xj | · ‖aj‖1

≤ (n∑

j=1|xj |) · max

1≤j≤n‖aj‖1

= ‖aj0‖1 = δ.

Let ej0 denote the j0-th unit vector and then

‖Aej0‖1 = ‖aj0‖1 = δ.

Therefore

‖A‖1 = max‖x‖1=1

‖Ax‖1 = δ = max1≤j≤n

‖aj‖1 = max1≤j≤n

n∑

i=1

|aij |.

For (iii), we have

‖A‖2 = max‖x‖2=1

‖Ax‖2 = max‖x‖2=1

[(Ax)T (Ax)]1/2

= max‖x‖2=1

[xT (AT A)x]1/2.

Since AT A is positive semi-definite, its eigenvalues can be assumed to be in the followingorder:

λ1 ≥ λ2 ≥ · · · ≥ λn ≥ 0.

Page 39: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

3.1. VECTOR AND MATRIX NORMS 29

Letv1, v2, · · · , vn ∈ Rn

denote the orthonormal eigenvectors corresponding to λ1, λ2, · · · , λn, respectively. Thenfor any vector x ∈ Rn with ‖x‖2 = 1, we have

x =n∑

i=1

αivi,

n∑

i=1

α2i = 1.

Therefore,

xT AT Ax =n∑

i=1

λiα2i ≤ λ1.

On the other hand, let x = v1, we have

xT AT Ax = vT1 AT Av1 = vT

1 λ1v1 = λ1.

Thus‖A‖2 = max

‖x‖2=1‖Ax‖2 =

√λ1 =

√λmax(AT A).

We have the following theorem for the norm ‖ · ‖2.

Theorem 3.3 Let A ∈ Rn×n. Then we have

(i) ‖A‖2 = max‖x‖2=1

max‖y‖2=1

|y∗Ax|, where x, y ∈ Cn.

(ii) ‖AT ‖2 = ‖A‖2 =√‖AT A‖2.

(iii) ‖A‖2 = ‖QAZ‖2, for any orthogonal matrices Q and Z. We recall that a matrixM ∈ Rn×n is called orthogonal if M−1 = MT .

Proof: We only prove (i). We first introduce the dual norm ‖ · ‖D of a vector norm‖ · ‖ defined as follows,

‖y‖D = max‖x‖=1

|y∗x|.

For ‖ · ‖2, we have by the Cauchy-Schwarz inequality,

|y∗x| ≤ ‖y‖2‖x‖2

with equality when x = 1‖y‖2 y. Therefore, the dual norm of ‖ · ‖2 is given by

‖y‖D2 = max

‖x‖2=1|y∗x| = max

‖x‖2=1‖y‖2‖x‖2 = ‖y‖2.

Page 40: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

30 CHAPTER 3. PERTURBATION AND ERROR ANALYSIS

So, ‖ · ‖2 is its own dual. Now, we consider

‖A‖2 = max‖x‖2=1

‖Ax‖2 = max‖x‖2=1

‖Ax‖D2

= max‖x‖2=1

max‖y‖2=1

|(Ax)∗y|

= max‖x‖2=1

max‖y‖2=1

|y∗Ax|.

Another useful norm is the Frobenius norm which is defined by

‖A‖F ≡

n∑

j=1

n∑

i=1

|aij |2

12

.

One of the most important properties of ‖ · ‖F is that for any orthogonal matrices Qand Z,

‖A‖F = ‖QAZ‖F .

In the following, we will extend our discussion on norms to the field of C. We remarkthat from the viewpoint of norms, there is no essential difference between matrices orvectors in the field of R and matrices or vectors in the field of C.

Definition 3.4 Let A ∈ Cn×n. Then the set of all the eigenvalues of A is called thespectrum of A and

ρ(A) = max{|λ| : λ belongs to the spectrum of A}

is called the spectral radius of A.

For the relation between the spectral radius and matrix norms, we have

Theorem 3.4 Let A ∈ Cn×n. Then

(i) For any matrix norm, we have

ρ(A) ≤ ‖A‖.

(ii) For any ε > 0, there exists a norm defined on Cn×n such that

‖A‖ ≤ ρ(A) + ε.

Page 41: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

3.1. VECTOR AND MATRIX NORMS 31

Proof: For (i), let x ∈ Cn satisfy

x 6= 0, Ax = λx, |λ| = ρ(A).

Then we haveρ(A)‖xeT

1 ‖ = ‖λxeT1 ‖ = ‖AxeT

1 ‖ ≤ ‖A‖ · ‖xeT1 ‖.

Henceρ(A) ≤ ‖A‖.

For (ii), by using Theorem 1.1 (Jordan Decomposition Theorem), we know thatthere is a nonsingular matrix X ∈ Cn×n such that

X−1AX =

λ1 δ1

λ2 δ2

. . . . . .λn−1 δn−1

λn

where δi = 1 or 0. For any given ε > 0, let

Dε = diag(1, ε, ε2, · · · , εn−1),

then

D−1ε X−1AXDε =

λ1 εδ1

λ2 εδ2

. . . . . .λn−1 εδn−1

λn

.

Now, define

‖G‖ε = ‖Dε−1X−1GXDε‖∞,  G ∈ Cn×n.

It is easy to see that this matrix norm ‖ · ‖ε actually is induced by the vector norm defined as follows:

‖x‖XDε = ‖(XDε)−1x‖∞, x ∈ Cn.

Therefore,

‖A‖ε = ‖Dε−1X−1AXDε‖∞ = max_{1≤i≤n} (|λi| + |εδi|) ≤ ρ(A) + ε,

where δn = 0.

We remark that for any sequence of matrices {A(k)} where A(k) = [aij(k)] ∈ Rn×n, and A = [aij] ∈ Rn×n,

lim_{k→∞} ‖A(k) − A‖ = 0 ⇐⇒ lim_{k→∞} aij(k) = aij,


for i, j = 1, · · · , n.

Theorem 3.5 Let A ∈ Cn×n. Then

lim_{k→∞} Ak = 0 ⇐⇒ ρ(A) < 1.

Proof: We first assume that lim_{k→∞} Ak = 0.

Let λ be an eigenvalue of A such that ρ(A) = |λ|. Then λk is an eigenvalue of Ak for any k. By Theorem 3.4 (i), we know that for any k,

ρ(A)^k = |λ|^k = |λk| ≤ ρ(Ak) ≤ ‖Ak‖.

Therefore,

lim_{k→∞} ρ(A)^k = 0,

which implies ρ(A) < 1.

Conversely, assume that ρ(A) < 1. By Theorem 3.4 (ii), there exists a matrix norm ‖ · ‖ such that ‖A‖ < 1. Therefore, we have

0 ≤ ‖Ak‖ ≤ ‖A‖^k −→ 0, k −→ ∞,

i.e., lim_{k→∞} Ak = 0.

By using Theorem 3.5, one can easily prove the following important theorem.

Theorem 3.6 Let A ∈ Cn×n. Then

(i) ∑_{k=0}^{∞} Ak is convergent if and only if ρ(A) < 1.

(ii) When ∑_{k=0}^{∞} Ak converges, we have

∑_{k=0}^{∞} Ak = (I − A)−1.

Moreover, there exists a norm defined on Cn×n such that for any m,

‖(I − A)−1 − ∑_{k=0}^{m} Ak‖ ≤ ‖A‖^{m+1} / (1 − ‖A‖).


Corollary 3.1 Let ‖ · ‖ be a norm defined on Cn×n with ‖I‖ = 1 and A ∈ Cn×n satisfy ‖A‖ < 1. Then I − A is nonsingular and satisfies

‖(I − A)−1‖ ≤ 1/(1 − ‖A‖).

Proof: Just note that

‖(I − A)−1‖ = ‖∑_{k=0}^{∞} Ak‖ ≤ 1 + ∑_{k=1}^{∞} ‖A‖^k = 1/(1 − ‖A‖).
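The following short NumPy sketch (ours, not part of the text; the matrix and the number of terms are arbitrary choices for illustration) checks Theorem 3.6 and Corollary 3.1 numerically: when ρ(A) < 1, the partial sums of the Neumann series approach (I − A)−1.

import numpy as np

# A small matrix with spectral radius < 1 (chosen arbitrarily for illustration).
A = np.array([[0.2, 0.5],
              [0.1, 0.3]])
print("spectral radius:", max(abs(np.linalg.eigvals(A))))   # < 1, so the series converges

S = np.zeros_like(A)          # partial sum  sum_{k=0}^{m} A^k
term = np.eye(2)
for k in range(50):
    S = S + term
    term = term @ A

exact = np.linalg.inv(np.eye(2) - A)
print(np.allclose(S, exact))  # True: sum_k A^k = (I - A)^(-1)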

3.2 Perturbation analysis for linear systems

We first consider the following simple example Ax = b given by:

[ 2.0001  1.9999 ; 1.9999  2.0001 ] [ x1 ; x2 ] = [ 4 ; 4 ].

The solution of this linear system is x = (1, 1)T. If there is a small perturbation on b, say,

β = (1 × 10−4, −1 × 10−4)T,

the system becomes

[ 2.0001  1.9999 ; 1.9999  2.0001 ] [ x1 ; x2 ] = [ 4.0001 ; 3.9999 ].

The solution of this perturbed system is x̂ = (1.5, 0.5)T. Therefore, we have

‖x̂ − x‖∞/‖x‖∞ = 1/2,   ‖β‖∞/‖b‖∞ = 1/40000,

i.e., the relative error of the solution is 20000 times that of the perturbation on b.

Thus, when we solve a linear system Ax = b, a good measurement, which can tell us how sensitive the computed solution is to small perturbations of the input, is needed. The condition number of matrices is then defined. It relates perturbations of x to perturbations of A and b.
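As a quick numerical illustration (a sketch of ours, not from the text), the example above can be reproduced with NumPy and the amplification factor compared with the condition number in the ∞-norm:

import numpy as np

A = np.array([[2.0001, 1.9999],
              [1.9999, 2.0001]])
b = np.array([4.0, 4.0])
beta = np.array([1e-4, -1e-4])            # perturbation of b

x  = np.linalg.solve(A, b)                # exact solution (1, 1)
xp = np.linalg.solve(A, b + beta)         # perturbed solution (1.5, 0.5)

rel_x = np.linalg.norm(xp - x, np.inf) / np.linalg.norm(x, np.inf)
rel_b = np.linalg.norm(beta,    np.inf) / np.linalg.norm(b, np.inf)
print(rel_x, rel_b, rel_x / rel_b)        # 0.5, 2.5e-5, 20000
print(np.linalg.cond(A, np.inf))          # kappa_inf(A) = 20000, matching the amplification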

Definition 3.5 Let ‖ · ‖ be any matrix norm and A be a nonsingular matrix. The condition number of A is defined as follows,

κ(A) ≡ ‖A‖ · ‖A−1‖. (3.1)


Obviously, the condition number depends on the matrix norm used. When κ(A) is small, A is said to be well-conditioned, whereas if κ(A) is large, then A is said to be ill-conditioned. Note that for any p-norm, we have

1 = ‖I‖ = ‖A · A−1‖ ≤ ‖A‖ · ‖A−1‖ = κ(A).

Let x̂ be an approximation of the exact solution x of Ax = b. The error vector is defined as follows,

e = x̂ − x,

i.e.,

x̂ = x + e. (3.2)

The absolute error is given by

‖e‖ = ‖x̂ − x‖

for any vector norm. If x ≠ 0, then the relative error is defined by

‖e‖/‖x‖ = ‖x̂ − x‖/‖x‖.

We have by substituting (3.2) into Ax = b,

A x̂ = A(x + e) = Ax + Ae = b + Ae.

Therefore,

A x̂ = b + Ae ≡ b̂.

The x̂ is the exact solution of Ax = b̂ where b̂ is a perturbed vector of b. Since x = A−1b and x̂ = A−1b̂, we have

‖x̂ − x‖ = ‖A−1(b̂ − b)‖ ≤ ‖A−1‖ · ‖b̂ − b‖. (3.3)

Similarly,

‖b‖ = ‖Ax‖ ≤ ‖A‖ · ‖x‖,

i.e.,

1/‖x‖ ≤ ‖A‖/‖b‖. (3.4)

Combining (3.3), (3.4) and (3.1), we obtain the following theorem which gives the effect of perturbations of the vector b on the solution of Ax = b in terms of the condition number.


Theorem 3.7 Let x̂ be an approximate solution of the exact solution x of Ax = b. Then

‖x̂ − x‖/‖x‖ ≤ κ(A) ‖b̂ − b‖/‖b‖.

The next theorem includes the effect of perturbations of the coefficient matrix A on the solution of Ax = b in terms of the condition number.

Theorem 3.8 Let A be a nonsingular matrix and Â be a perturbed matrix of A such that

‖A − Â‖ · ‖A−1‖ < 1.

If Ax = b and Âx̂ = b̂ where b̂ is a perturbed vector of b, then

‖x̂ − x‖/‖x‖ ≤ [κ(A) / (1 − κ(A)‖A − Â‖/‖A‖)] (‖A − Â‖/‖A‖ + ‖b̂ − b‖/‖b‖).

Proof: Let

E = Â − A and β = b̂ − b.

By subtracting Ax = b from Âx̂ = b̂, we have

A(x̂ − x) = −Ex̂ + β.

Furthermore, we get

‖x̂ − x‖/‖x‖ ≤ ‖A−1E‖ ‖x̂‖/‖x‖ + ‖A−1‖ (‖Ax‖/‖x‖)(‖β‖/‖b‖).

By using

‖x̂‖ ≤ ‖x̂ − x‖ + ‖x‖ and ‖Ax‖ ≤ ‖A‖ · ‖x‖,

we then have

‖x̂ − x‖/‖x‖ ≤ ‖A−1E‖ ‖x̂ − x‖/‖x‖ + ‖A−1E‖ + ‖A−1‖‖A‖ ‖β‖/‖b‖,

i.e.,

(1 − ‖A−1E‖) ‖x̂ − x‖/‖x‖ ≤ ‖A−1E‖ + κ(A) ‖β‖/‖b‖.

Since

‖A−1E‖ ≤ ‖A−1‖ · ‖E‖ = ‖A−1‖ · ‖A − Â‖ < 1,


we get

‖x̂ − x‖/‖x‖ ≤ (1 − ‖A−1E‖)−1 (‖A−1E‖ + κ(A) ‖β‖/‖b‖).

By using

‖A−1E‖ ≤ ‖A−1‖ · ‖E‖ = κ(A) ‖E‖/‖A‖,

we finally have

‖x̂ − x‖/‖x‖ ≤ [κ(A) / (1 − κ(A)‖E‖/‖A‖)] (‖E‖/‖A‖ + ‖β‖/‖b‖).

Theorems 3.7 and 3.8 give upper bounds for the relative error of x̂ in terms of the condition number of A. From Theorems 3.7 and 3.8, we know that if A is well-conditioned, i.e., κ(A) is small, the relative error in x̂ will be small if the relative errors in both A and b are small.

Corollary 3.2 Let ‖ · ‖ be any matrix norm with ‖I‖ = 1 and A be a nonsingular matrix with A + δA being a perturbed matrix of A such that

‖A−1δA‖ < 1.

Then A + δA is nonsingular and

‖(A + δA)−1 − A−1‖/‖A−1‖ ≤ [κ(A) / (1 − κ(A)‖δA‖/‖A‖)] · ‖δA‖/‖A‖.

Proof: We first prove that

‖(A + δA)−1 − A−1‖ ≤ ‖δA‖ · ‖A−1‖² / (1 − ‖A−1δA‖).

Note that

A + δA = A(I + A−1δA) = A[I − (−A−1δA)].

Let F = −A−1δA and r = ‖A−1δA‖. Now,

(A + δA)−1 = (I − F)−1A−1.

Therefore, by noting that ‖F‖ = r < 1 and Corollary 3.1, we have

‖(A + δA)−1‖ = ‖(I − F)−1A−1‖ ≤ ‖(I − F)−1‖ · ‖A−1‖ ≤ ‖A−1‖/(1 − ‖F‖) = ‖A−1‖/(1 − r).


By using the identity

B−1 = A−1 − B−1(B − A)A−1,

we have

(A + δA)−1 − A−1 = −(A + δA)−1δA A−1.

Then

‖(A + δA)−1 − A−1‖ ≤ ‖A−1‖ · ‖δA‖ · ‖(A + δA)−1‖ ≤ ‖A−1‖²‖δA‖/(1 − r).

Finally, we obtain

‖(A + δA)−1 − A−1‖/‖A−1‖ ≤ ‖A−1‖ · ‖δA‖/(1 − r) ≤ ‖A−1‖ · ‖δA‖/(1 − ‖A−1‖ · ‖δA‖) ≤ [κ(A) / (1 − κ(A)‖δA‖/‖A‖)] · ‖δA‖/‖A‖.

3.3 Error analysis on floating point arithmetic

In computers, the floating point numbers f are expressed as

f = ±ω × β^J, L ≤ J ≤ U,

where β is the base, J is the order (exponent), and ω is the fraction. Usually, ω has the following form:

ω = 0.d1d2 · · · dt

where t is the length (precision) of ω, d1 ≠ 0, and 0 ≤ di < β, for i = 2, · · · , t. Let

F = {0} ∪ {f : f = ±ω × β^J, 0 ≤ di < β, d1 ≠ 0, L ≤ J ≤ U}.

Then F contains

2(β − 1)β^{t−1}(U − L + 1) + 1

floating point numbers. These numbers are symmetrically distributed in the intervals [m, M] and [−M, −m], where

m = β^{L−1}, M = β^U(1 − β^{−t}). (3.5)

We remark that F is only a finite set which cannot contain all the real numbers in these two intervals.

Let fl(x) denote the floating point number of any real number x. Then

fl(x) = 0, for x = 0.


If m ≤ |x| ≤ M, then by rounding, fl(x) is the element of F satisfying

|fl(x) − x| = min_{f∈F} |f − x|.

By chopping, fl(x) is the element of F satisfying

|fl(x) − x| = min_{f∈F, |f|≤|x|} |f − x|.

For example, let β = 10, t = 3, L = 0 and U = 2. We consider the floating point expression of x = 5.45627. By rounding, we have fl(x) = 0.546 × 10. By chopping, we have fl(x) = 0.545 × 10. The following theorem gives an estimate of the relative error of floating point expressions.
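A small sketch (our own illustration, with β = 10 and t = 3 as in the example above) that simulates rounding and chopping to t significant decimal digits and checks the relative-error bounds of Theorem 3.9:

import math

def fl(x, t=3, mode="round"):
    """Keep t significant decimal digits of x by rounding or chopping."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1        # order alpha with 10^(e-1) <= |x| < 10^e
    scale = 10.0 ** (t - e)
    if mode == "round":
        return round(x * scale) / scale
    return math.trunc(x * scale) / scale           # chopping

x = 5.45627
print(fl(x, mode="round"))                         # 5.46
print(fl(x, mode="chop"))                          # 5.45

u_round, u_chop = 0.5 * 10 ** (1 - 3), 10 ** (1 - 3)
print(abs(fl(x, mode="round") - x) / x <= u_round) # True
print(abs(fl(x, mode="chop")  - x) / x <= u_chop)  # True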

Theorem 3.9 Let m ≤ |x| ≤ M , where m and M are defined by (3.5). Then

fl(x) = x(1 + δ), |δ| ≤ u,

where u is the machine precision, i.e.,

u = (1/2)β^{1−t} by rounding, and u = β^{1−t} by chopping.

Proof: In the following, we assume that x ≠ 0; without loss of generality, let x > 0. Let α be an integer that satisfies

β^{α−1} ≤ x < β^α. (3.6)

Since the order of the floating point numbers in [β^{α−1}, β^α) is α, all the numbers

0.d1d2 · · · dt × β^α

are distributed in this interval with spacing β^{α−t}. For the rounding error, by (3.6), we have

|fl(x) − x| ≤ (1/2)β^{α−t} = (1/2)β^{α−1}β^{1−t} ≤ (1/2)xβ^{1−t},

i.e.,

|fl(x) − x|/x ≤ (1/2)β^{1−t}.

For the chopping error, we have

|fl(x) − x| ≤ β^{α−t} = β^{α−1}β^{1−t} ≤ xβ^{1−t},

i.e.,

|fl(x) − x|/x ≤ β^{1−t}.


The proof is complete.

Let us now consider the rounding error of elementary operations. Let a, b ∈ F and "◦" represent any of the elementary operations "+, −, ×, ÷". By Theorem 3.9, we immediately have

Theorem 3.10 We have

fl(a ◦ b) = (a ◦ b)(1 + δ), |δ| ≤ u.

Theorem 3.11 If |δi| ≤ u and nu ≤ 0.01, then

1 − 1.01nu ≤ ∏_{i=1}^{n} (1 + δi) ≤ 1 + 1.01nu.

Proof: Since |δi| ≤ u, we have

(1 − u)^n ≤ ∏_{i=1}^{n} (1 + δi) ≤ (1 + u)^n. (3.7)

For the lower bound of (1 − u)^n, by using the Taylor expansion of (1 − x)^n, i.e.,

(1 − x)^n = 1 − nx + [n(n − 1)/2](1 − θx)^{n−2}x², 0 < θ < 1,

we have

1 − nx ≤ (1 − x)^n.

Therefore,

1 − 1.01nu ≤ 1 − nu ≤ (1 − u)^n. (3.8)

Now, we estimate the upper bound of (1 + u)^n. By using the Taylor expansion of e^x, we have

e^x = 1 + x + x²/2! + x³/3! + · · · = 1 + x + (x/2) · x · (1 + x/3 + · · ·).

Therefore, when 0 ≤ x ≤ 0.01, we know that by using e^{0.01} < 2,

1 + x ≤ e^x ≤ 1 + x + (0.01/2) x e^x ≤ 1 + 1.01x. (3.9)


Let x = u. By the left inequality of (3.9), we have

(1 + u)^n ≤ e^{nu}. (3.10)

Let x = nu. By the right inequality of (3.9), we have

e^{nu} ≤ 1 + 1.01nu. (3.11)

Combining (3.10) and (3.11), we have

(1 + u)^n ≤ 1 + 1.01nu. (3.12)

By (3.7), (3.8) and (3.12), the proof is complete.

We consider the following example.

Example 3.1. For given x, y ∈ Rn, estimate the upper bound of

|fl(xT y) − xT y|.

Let

Sk = fl( ∑_{i=1}^{k} xi yi ).

By Theorem 3.10, we have

S1 = x1y1(1 + γ1), |γ1| ≤ u,

and

Sk = fl(Sk−1 + fl(xk yk)) = [Sk−1 + xk yk(1 + γk)](1 + δk), |δk|, |γk| ≤ u.

Therefore,

fl(xT y) = Sn = ∑_{i=1}^{n} xi yi (1 + γi) ∏_{j=i}^{n} (1 + δj) = ∑_{i=1}^{n} (1 + εi) xi yi,

where

1 + εi = (1 + γi) ∏_{j=i}^{n} (1 + δj)

with δ1 = 0. Thus, if nu ≤ 0.01, we then have by Theorem 3.11,

|fl(xT y) − xT y| ≤ ∑_{i=1}^{n} |εi| · |xi yi| ≤ 1.01nu ∑_{i=1}^{n} |xi yi|.


Before we finish this section, let us briefly discuss the floating point analysis of elementary matrix operations. We first introduce the following notations:

|E| = [|eij |],

where E = [eij ] ∈ Rn×n and

|E| ≤ |F | ⇐⇒ |eij | ≤ |fij |

for i, j = 1, 2, · · · , n. Let A, B ∈ Rn×n be matrices with entries in F, and α ∈ F. By Theorem 3.10, we have

fl(αA) = αA + E, |E| ≤ u|αA|,

and

fl(A + B) = (A + B) + E, |E| ≤ u|A + B|.

From Example 3.1, we also have

fl(AB) = AB + E, |E| ≤ 1.01nu|A| · |B|.

Note that |A| · |B| may be much larger than |AB|. Therefore the relative error of AB may not be small.
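The following NumPy sketch (an illustration of ours, not from the text) shows a case where the entries of |A| · |B| are much larger than those of |AB|, so the componentwise bound on fl(AB) does not translate into a small relative error for AB:

import numpy as np

# Heavy cancellation: AB is tiny although |A| and |B| have entries of order one.
A = np.array([[1.0,  1.0]])
B = np.array([[ 1.0],
              [-1.0 + 1e-12]])

AB      = A @ B                  # almost complete cancellation, ~1e-12
absprod = np.abs(A) @ np.abs(B)  # |A| * |B|, no cancellation, ~2

print(AB, absprod)
# The error bound 1.01*n*u*|A||B| is therefore enormous relative to |AB|.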

3.4 Error analysis on partial pivoting

We will show that if Gaussian elimination with partial pivoting is used to solve Ax = b, then the computed solution x̂ satisfies

(A + E)x̂ = b,

where E is an error matrix. An upper bound of E is also given. We first study the rounding error of the LU factorization of A.

Lemma 3.1 Let A ∈ Rn×n with floating point entries. Assume that A has an LU factorization and 6nu ≤ 1 where u is the machine precision. Then by using Gaussian elimination, we have

LU = A + E

where

|E| ≤ 3nu(|A| + |L| · |U|).


Proof: We use induction on n. Obviously, Lemma 3.1 is true for n = 1. Assume that the lemma holds for n − 1. Now, we consider a matrix A ∈ Rn×n:

A = [ α   wT ; v   A1 ],

where A1 ∈ R(n−1)×(n−1). At the first step of Gaussian elimination, we compute the vector l1 = fl(v/α) and modify the matrix A1 as

Â1 = fl(A1 − fl(l1wT)).

By Theorem 3.10, we have

l1 = v/α + f, |f| ≤ (u/|α|)|v|, (3.13)

and

Â1 = A1 − l1wT + F, |F| ≤ (2 + u)u(|A1| + |l1| · |w|T). (3.14)

For Â1, by using the induction assumption, we obtain an LU factorization with a unit lower triangular matrix L1 and an upper triangular matrix U1 such that

L1U1 = Â1 + E1

where

|E1| ≤ 3(n − 1)u(|Â1| + |L1| · |U1|).

Thus, we have

LU = [ 1  0 ; l1  L1 ] [ α  wT ; 0  U1 ] = A + E,

where

E = [ 0  0 ; αf  E1 + F ].

By using (3.14), we obtain

|Â1| ≤ (1 + 2u + u²)(|A1| + |l1| · |w|T).


Therefore, by using the condition 6nu ≤ 1, we have

|E1 + F| ≤ |E1| + |F|
         ≤ 3(n − 1)u(|Â1| + |L1| · |U1|) + (2 + u)u(|A1| + |l1| · |w|T)
         ≤ 3(n − 1)u((1 + 2u + u²)(|A1| + |l1| · |w|T) + |L1| · |U1|) + (2 + u)u(|A1| + |l1| · |w|T)
         ≤ u(3n − 1 + [6n + 3(n − 1)u − 5]u)(|A1| + |l1| · |w|T) + 3(n − 1)u(|L1| · |U1|)
         ≤ 3nu(|A1| + |l1| · |w|T + |L1| · |U1|).

Combining with (3.13), we obtain

|E| = [ 0  0 ; |α||f|  |E1 + F| ]
    ≤ 3nu [ 0  0 ; |v|  |A1| + |l1| · |w|T + |L1| · |U1| ]
    ≤ 3nu ( [ |α|  |w|T ; |v|  |A1| ] + [ 1  0 ; |l1|  |L1| ] [ |α|  |w|T ; 0  |U1| ] )
    = 3nu(|A| + |L| · |U|).

The proof is complete.

Corollary 3.3 Let A ∈ Rn×n be nonsingular with floating point entries and 6nu ≤ 1. Assume that by using Gaussian elimination with partial pivoting, we obtain

LU = PA + E

where L = [lij] is a unit lower triangular matrix with |lij| ≤ 1, U is an upper triangular matrix and P is a permutation matrix. Then E satisfies the following inequality:

|E| ≤ 3nu(|PA|+ |L| · |U |).


After we obtain the LU factorization of A, the problem of solving Ax = b becomes the problem of solving the following two triangular systems:

Ly = Pb, Ux = y.

Therefore, we need to estimate the rounding error of solving triangular systems.

Lemma 3.2 Let S ∈ Rn×n be a nonsingular triangular matrix with floating point entries and 1.01nu ≤ 0.01. By using the method proposed in Section 2.1.1 to solve Sx = b, we obtain a computed solution x̂ which satisfies

(S + H)x̂ = b,

where

|H| ≤ 1.01nu|S|.

Proof: We use induction on n. Without loss of generality, let S = L be a lower triangular matrix. Obviously, Lemma 3.2 is true for n = 1. Assume that the lemma is true for n − 1. Now, we consider a lower triangular matrix L ∈ Rn×n. Let x̂ be the computed solution of Lx = b and partition L, b and x̂ as follows:

L = [ l11  0 ; l1  L1 ], b = [ b1 ; c ], x̂ = [ x̂1 ; y ],

where c, y ∈ Rn−1 and L1 ∈ R(n−1)×(n−1). By Theorem 3.10, we have

x̂1 = fl(b1/l11) = (b1/l11)(1 + δ1), |δ1| ≤ u. (3.15)

Note that y is the computed solution of the (n − 1)-by-(n − 1) system

L1y = fl(c − x̂1l1).

By the induction assumption, we have

(L1 + H1)y = fl(c − x̂1l1)

where

|H1| ≤ 1.01(n − 1)u|L1|. (3.16)

By Theorem 3.10 again, we obtain

fl(c − x̂1l1) = fl(c − fl(x̂1l1)) = (I + Dγ)−1(c − x̂1l1 − x̂1Dδl1),

where

Dγ = diag(γ2, · · · , γn), Dδ = diag(δ2, · · · , δn)


with

|γi|, |δi| ≤ u, i = 2, · · · , n.

Therefore,

x̂1l1 + x̂1Dδl1 + (I + Dγ)(L1 + H1)y = c,

and then

(L + H)x̂ = b,

where

H = [ δ1l11  0 ; Dδl1  H1 + Dγ(L1 + H1) ].

By using (3.15), (3.16) and the condition 1.01nu ≤ 0.01, we have

|H| ≤ [ |δ1| · |l11|  0 ; |Dδ| · |l1|  |H1| + |Dγ|(|L1| + |H1|) ]
    ≤ [ u|l11|  0 ; u|l1|  |H1| + u(|L1| + |H1|) ]
    ≤ u [ |l11|  0 ; |l1|  [1.01(n − 1) + 1 + 1.01(n − 1)u]|L1| ]
    ≤ 1.01nu|L|.

We then have the main theorem of this section.

Theorem 3.12 Let A ∈ Rn×n be a nonsingular matrix with floating point entries and 1.01nu ≤ 0.01. If Gaussian elimination with partial pivoting is used to solve Ax = b, we then obtain a computed solution x̂ which satisfies

(A + δA)x̂ = b,

where

‖δA‖∞ ≤ u(3n + 5.04n³ρ)‖A‖∞ (3.17)

with the growth factor

ρ ≡ (1/‖A‖∞) max_{i,j,k} |aij(k)|.

Proof: By using Gaussian elimination with partial pivoting, we have the following two triangular systems:

Ly = Pb, Ux = y.


By using Lemma 3.2, the computed solution x̂ satisfies

(L + F)(U + G)x̂ = Pb,

i.e.,

(LU + FU + LG + FG)x̂ = Pb, (3.18)

where

|F| ≤ 1.01nu|L|, |G| ≤ 1.01nu|U|. (3.19)

Substituting LU = PA + E into (3.18), we have

(A + δA)x̂ = b,

where

δA = PT(E + FU + LG + FG).

By using (3.19), Corollary 3.3 and the condition 1.01nu ≤ 0.01, we have

|δA| ≤ PT(3nu|PA| + (3n + 2.04n)u|L| · |U|) = nuPT(3|PA| + 5.04|L| · |U|). (3.20)

By Corollary 3.3 again, the absolute values of the entries of L are less than or equal to 1. Therefore, we have

‖L‖∞ ≤ n. (3.21)

We now define

ρ ≡ (1/‖A‖∞) max_{i,j,k} |aij(k)|

and then we have

‖U‖∞ ≤ nρ‖A‖∞. (3.22)

Substituting (3.21) and (3.22) into (3.20), we have (3.17). The proof is complete.

We remark that ‖δA‖∞ is usually very small compared with the initial error from the given data. Thus, Gaussian elimination with partial pivoting is numerically stable.
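As an illustration of this remark (a sketch of ours; NumPy's dense solver uses LU with partial pivoting internally), the normwise backward error of a computed solution is typically a small multiple of the unit roundoff:

import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

x_hat = np.linalg.solve(A, b)            # Gaussian elimination with partial pivoting

r = b - A @ x_hat
backward_err = np.linalg.norm(r, np.inf) / (
    np.linalg.norm(A, np.inf) * np.linalg.norm(x_hat, np.inf) + np.linalg.norm(b, np.inf))
print(backward_err, np.finfo(float).eps)  # backward error is of the order of eps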

Exercises:

1. Let

A = [ 1  0.999999 ; 0.999999  1 ].

Compute A−1, det(A) and the condition number of A.


2. Prove that ‖AB‖F ≤ ‖A‖2‖B‖F and ‖AB‖F ≤ ‖A‖F‖B‖2.

3. Prove that ‖A‖2² ≤ ‖A‖1‖A‖∞ for any square matrix A.

4. Show that

‖ [ A11  0 ; 0  A22 ] ‖2 ≤ ‖ [ A11  A12 ; A21  A22 ] ‖2.

5. Let A be nonsingular. Show that

‖A−1‖2^{−1} = min_{‖x‖2=1} ‖Ax‖2.

6. Show that if S is real and S = −ST , then I − S is nonsingular and the matrix

(I − S)−1(I + S)

is orthogonal. This is known as the Cayley transform of S.

7. Prove that if both A and A + E are nonsingular, then

‖(A + E)−1 − A−1‖/‖A−1‖ ≤ ‖(A + E)−1‖ · ‖E‖.

8. Let A ∈ Rn×n be nonsingular and let x, y, z ∈ Rn such that Ax = b and Ay = b + z. Show that

‖z‖2/‖A‖2 ≤ ‖x − y‖2 ≤ ‖A−1‖2‖z‖2.

9. Let A = [aij] be an m-by-n matrix. Define

|||A|||l = max_{i,j} |aij|.

Is ||| · |||l a matrix norm? Give a reason for your answer.

10. Show that if X ∈ Cn×n is nonsingular, then ‖A‖X = ‖X−1AX‖2 defines a matrix norm.

11. Let A = LDLT ∈ Rn×n be a symmetric positive definite matrix and

D = diag(d11, · · · , dnn).

Show that

κ2(A) ≥ max_i{dii} / min_i{dii}.

12. Verify that

‖xy∗‖F = ‖xy∗‖2 = ‖x‖2‖y‖2,

for any x, y ∈ Cn.

13. Show that if 0 ≠ v ∈ Rn and E ∈ Rn×n, then

‖E(I − vvT/(vTv))‖F² = ‖E‖F² − ‖Ev‖2²/(vTv).


14. Let A ∈ Rm×n and x ∈ Rn. Show that

|A| · |x| ≤ τ‖A‖∞|x|

where τ = max_i |xi| / min_i |xi|.

15. Prove the Sherman-Morrison-Woodbury formula. Let U, V be n-by-k rectangular matrices with k ≤ n and A be a nonsingular n-by-n matrix. Then

T = I + VTA−1U

is nonsingular if and only if A + UVT is nonsingular. In this case, we have

(A + UVT)−1 = A−1 − A−1UT−1VTA−1.


Chapter 4

Least Squares Problems

In this chapter, we study linear least squares problems:

min_{y∈Rn} ‖Ay − b‖2

where the data matrix A ∈ Rm×n with m ≥ n and the observation vector b ∈ Rm are given. We introduce some well-known orthogonal transformations and the QR decomposition for constructing efficient algorithms for these problems. For literature on least squares problems, we refer to [15, 21, 42, 44, 45, 48].

4.1 Least squares problems

In practice, if we are given m points t1, t2, · · · , tm with data y1, y2, · · · , ym at these points, and functions φ1(t), φ2(t), · · · , φn(t) defined on these points, we then try to find f(x, t) defined by

f(x, t) ≡ ∑_{j=1}^{n} xj φj(t)

such that the residuals defined by

ri(x) ≡ yi − f(x, ti) = yi − ∑_{j=1}^{n} xj φj(ti), i = 1, 2, · · · , m,

can be as small as possible. In matrix form, we have

r(x) = b−Ax

where

A = [ φ1(t1)  · · ·  φn(t1)
      ...             ...
      φ1(tm)  · · ·  φn(tm) ],

and

b = (y1, · · · , ym)T, x = (x1, · · · , xn)T, r(x) = (r1(x), · · · , rm(x))T.

When m = n, we can require that r(x) = 0 and x can be found by solving the system Ax = b. When m > n, we require that r(x) reach its minimum under the norm ‖ · ‖2. We therefore introduce the following definition of the least squares problem.

Definition 4.1 Let A ∈ Rm×n and b ∈ Rm. Find x ∈ Rn such that

‖b − Ax‖2 = ‖r(x)‖2 = min_{y∈Rn} ‖r(y)‖2 = min_{y∈Rn} ‖b − Ay‖2. (4.1)

It is called the least squares (LS) problem and r(x) is called the residual.

In the following, we only consider the case of

rank(A) = n < m.

We first study the solution x of the following equation

Ax = b, A ∈ Rm×n. (4.2)

The range of matrix A is defined by

R(A) ≡ {y ∈ Rm : y = Ax, x ∈ Rn}.

It is easy to see that R(A) = span{a1, · · · , an}

where ai, i = 1, · · · , n, are column vectors of A. The nullspace of A is defined by

N (A) ≡ {x ∈ Rn : Ax = 0}.

The dimension of N(A) is denoted by null(A). The orthogonal complement of a subspace S ⊂ Rn is defined by

S⊥ ≡ {y ∈ Rn : yT x = 0, for all x ∈ S}.

We have the following theorems for (4.2).

Theorem 4.1 The equation (4.2) has solutions ⇐⇒ rank(A) = rank([A, b]).

Theorem 4.2 Let x be a particular solution of (4.2). Then the solution set of (4.2) is given by x + N(A).


Corollary 4.1 Assume that the equation (4.2) has some solution. The solution is unique ⇐⇒ null(A) = 0.

We have the following essential theorem for the solution of (4.1).

Theorem 4.3 The LS problem (4.1) always has solutions. The solution is unique if and only if null(A) = 0.

Proof: Since

Rm = R(A) ⊕ R(A)⊥,

the vector b can be expressed uniquely by

b = b1 + b2

where b1 ∈ R(A) and b2 ∈ R(A)⊥. For any x ∈ Rn, since b1 − Ax ∈ R(A) and is orthogonal to b2, we therefore have

‖r(x)‖2² = ‖b − Ax‖2² = ‖(b1 − Ax) + b2‖2² = ‖b1 − Ax‖2² + ‖b2‖2².

Note that ‖r(x)‖2² reaches its minimum if and only if ‖b1 − Ax‖2² reaches its minimum. Since b1 ∈ R(A), ‖r(x)‖2² reaches its minimum if and only if

Ax = b1,

i.e.,

‖b1 − Ax‖2² = 0.

Thus, by Corollary 4.1, we know that the solution of Ax = b1 is unique, i.e., the solution of (4.1) is unique, if and only if

null(A) = 0.

Let

X = {x ∈ Rn : x is a solution of (4.1)}.

We have

Theorem 4.4 A vector x ∈ X if and only if

AT Ax = AT b. (4.3)


Proof: Let x ∈ X . By Theorem 4.3, we know that Ax = b1 where b1 ∈ R(A) and

r(x) = b−Ax = b− b1 = b2 ∈ R(A)⊥.

Therefore

AT r(x) = AT b2 = 0.

Substituting r(x) = b − Ax into AT r(x) = 0, we obtain (4.3).

Conversely, let x ∈ Rn satisfy

AT Ax = AT b;

then for any y ∈ Rn, we have

‖b − A(x + y)‖2² = ‖b − Ax‖2² − 2yTAT(b − Ax) + ‖Ay‖2² = ‖b − Ax‖2² + ‖Ay‖2² ≥ ‖b − Ax‖2².

Thus, x ∈ X .

We therefore have the following algorithm for LS problems:

(1) Compute C = AT A and d = AT b.

(2) Find the Cholesky factorization of C = LLT .

(3) Solve triangular linear systems: Ly = d and LT x = y.

We remark that the computation of AT A usually costs O(n²m) operations, and some information of the matrix A could be lost. For example, we consider

A = [ 1  1  1
      ε  0  0
      0  ε  0
      0  0  ε ].

We have

AT A = [ 1 + ε²  1       1
         1       1 + ε²  1
         1       1       1 + ε² ].

Assume that ε = 10⁻³ and a 6-digit decimal floating point system is used. Then 1 + ε² = 1 + 10⁻⁶ is rounded off to 1, which means that the computed AT A is singular!
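The same loss of information can be seen numerically (a sketch of ours; we use IEEE single precision, which carries about 7 decimal digits, instead of the 6-digit decimal system of the text):

import numpy as np

eps = 1e-4                                   # eps**2 = 1e-8 is below single precision
A = np.array([[1, 1, 1],
              [eps, 0, 0],
              [0, eps, 0],
              [0, 0, eps]], dtype=np.float32)

C = A.T @ A                                  # 1 + eps**2 rounds to 1 in float32
print(C)
print(np.linalg.matrix_rank(C.astype(np.float64)))   # 1: the computed A^T A is singular
print(np.linalg.matrix_rank(A.astype(np.float64)))   # 3: A itself has full column rank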


We note that the solution x of (4.3) can be expressed as

x = (AT A)−1AT b.

If we let

A† = (AT A)−1AT,

then the LS solution x could be written as

x = A†b.

Actually, the n-by-m matrix A† is the Moore-Penrose generalized inverse of A, which is unique, see [14, 17, 42]. In general, we have

Definition 4.2 Let X ∈ Rn×m. If it satisfies the following conditions:

AXA = A, XAX = X, (AX)T = AX, (XA)T = XA,

then X is called the Moore-Penrose generalized inverse of A and denoted by A†.

Now we develop the perturbation analysis of LS problems. Assume that there is a perturbation δb on b and let x, x + δx denote the solutions of the following LS problems, respectively,

min_x ‖b − Ax‖2,   min_x ‖(b + δb) − Ax‖2.

Then

x = A†b,

and

x + δx = A†(b + δb) = A†b̂

where b̂ = b + δb. We have

Theorem 4.5 Let b1 and b̂1 denote the orthogonal projections of b and b̂ on R(A), respectively. If b1 ≠ 0, then

‖δx‖2/‖x‖2 ≤ κ2(A) ‖δb1‖2/‖b1‖2

where κ2(A) = ‖A‖2‖A†‖2 and b̂1 = b1 + δb1.

Proof: Let b2 denote the orthogonal projection of b on R(A)⊥. Then b = b1 + b2 and AT b2 = 0. Note that

A†b = A†b1 + A†b2 = A†b1 + (AT A)−1AT b2 = A†b1.


Similarly, A†b̂ = A†b̂1. Therefore,

‖δx‖2 = ‖A†b̂ − A†b‖2 = ‖A†(b̂1 − b1)‖2 ≤ ‖A†‖2‖b̂1 − b1‖2 = ‖A†‖2‖δb1‖2. (4.4)

Since Ax = b1, we have

‖b1‖2 ≤ ‖A‖2‖x‖2. (4.5)

By combining (4.4) and (4.5), the proof is complete.

We remark that the condition number κ2(A) is important for LS problems. When κ2(A) is large, we say that the LS problem is ill-conditioned. When κ2(A) is small, we say that the LS problem is well-conditioned.

Theorem 4.6 Suppose that column vectors of A are linearly independent. Then

κ2(A)2 = κ2(AT A).

Proof: By Theorem 3.3 (ii) and the given condition, we have

‖A‖2² = ‖AT A‖2,  ‖A†‖2² = ‖A†(A†)T‖2 = ‖(AT A)−1‖2.

Therefore,

κ2(A)² = ‖A‖2² ‖A†‖2² = ‖AT A‖2 ‖(AT A)−1‖2 = κ2(AT A).

4.2 Orthogonal transformations

In order to construct efficient algorithms for solving LS problems, we introduce some well-known orthogonal transformations.

4.2.1 Householder transformation

We first introduce the following definition of Householder transformation.

Definition 4.3 Let ω ∈ Rn with ‖ω‖2 = 1. Define H ∈ Rn×n as follows:

H = I − 2ωωT . (4.6)

The matrix H is called the Householder transformation.


Theorem 4.7 Let H be defined as in (4.6). Then H has the following properties:

(i) H is a symmetric orthogonal matrix.

(ii) H2 = I.

(iii) H is called a reflection because Hx is the reflection of x ∈ Rn in the plane through 0 perpendicular to ω.

Proof: We only prove (iii). Note that for any vector x ∈ Rn, it can be expressed as:

x = u + αω

where u ∈ span{ω}⊥ and α ∈ R. By using uT ω = 0 and ωT ω = 1, we have

Hx = (I − 2ωωT )(u + αω) = u + αω − 2ωωT u− 2αωωT ω = u− αω.

Theorem 4.8 For any 0 ≠ x ∈ Rn, we can construct a unit vector ω such that the Householder transformation defined as in (4.6) satisfies

Hx = αe1

where α = ±‖x‖2.

Proof: Note that Hx = (I − 2ωωT )x = x− 2(ωT x)ω. Let

ω = (x − αe1)/‖x − αe1‖2.

We then have

Hx = x − 2(ωTx)ω
   = x − [2/‖x − αe1‖2²][(xT − αe1T)x](x − αe1)
   = x − [2(‖x‖2² − αx1)/‖x − αe1‖2²] x + [2(‖x‖2² − αx1)α/‖x − αe1‖2²] e1
   = [1 − 2(‖x‖2² − αx1)/‖x − αe1‖2²] x + [2(‖x‖2² − αx1)α/‖x − αe1‖2²] e1, (4.7)


where x1 is the first component of the vector x. Setting the coefficient of x to zero, we obtain the following equation:

1 − 2(‖x‖2² − αx1)/‖x − αe1‖2² = 0.

Solving this equation for α, we have α = ±‖x‖2. Substituting it into (4.7), we therefore have

Hx = ∓‖x‖2e1.

We remark that for any vector 0 ≠ x ∈ Rn, by Theorem 4.8, one can construct a Householder matrix H such that the last n − 1 components of Hx are zeros. We can use the following two steps to construct the unit vector ω of H:

(1) compute v = x± ‖x‖2e1;

(2) compute ω = v/‖v‖2.

Now a natural question is: how to choose the sign in front of ‖x‖2? Usually, we choose

v = x + sign(x1)‖x‖2e1,

where x1 ≠ 0 is the first component of the vector x, see [38]. Since

H = I − 2ωωT = I − (2/(vTv))vvT = I − βvvT

where β = 2/(vTv), we only need to compute β and v instead of forming ω. Thus, we have the following algorithm.

Algorithm 4.1 (Householder transformation)

function: [v, β] = house(x)
  n = length(x)
  σ = x(2 : n)T x(2 : n)
  v(1) = x(1) + sign(x(1)) √(x(1)² + σ)
  v(2 : n) = x(2 : n)
  if σ = 0
    β = 0
  else
    β = 2/(v(1)² + σ)
  end


4.2.2 Givens rotation

A Givens rotation is defined as follows:

G(i, k, θ) = I + s(ei ekT − ek eiT) + (c − 1)(ei eiT + ek ekT),

i.e., G(i, k, θ) agrees with the identity matrix except in rows and columns i and k, where its entries are c at positions (i, i) and (k, k), s at position (i, k), and −s at position (k, i). Here c = cos θ and s = sin θ. It is easy to prove that G(i, k, θ) is an orthogonal matrix. Let x ∈ Rn and y = G(i, k, θ)x. We then have

yi = cxi + sxk, yk = −sxi + cxk, yj = xj, j ≠ i, k.

If we want to make yk = 0, then we only need to take

c = xi/√(xi² + xk²),  s = xk/√(xi² + xk²).

Therefore,

yi = √(xi² + xk²),  yk = 0.

We remark that for any vector 0 ≠ x ∈ Rn, one can construct a Givens rotation G(i, k, θ) acting on x to make any chosen nonzero component of x zero.
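A minimal NumPy sketch (ours, not from the text) that builds G(i, k, θ) from a vector x and zeroes out its k-th component:

import numpy as np

def givens(x, i, k):
    """Return the Givens rotation G(i, k, theta) that zeroes component k of x."""
    c, s = 1.0, 0.0
    r = np.hypot(x[i], x[k])
    if r != 0.0:
        c, s = x[i] / r, x[k] / r
    G = np.eye(len(x))
    G[i, i] = c;  G[i, k] = s
    G[k, i] = -s; G[k, k] = c
    return G

x = np.array([3.0, 1.0, 4.0])
G = givens(x, 0, 2)
print(G @ x)   # [5., 1., 0.]: component 2 is annihilated, component 0 becomes sqrt(x0^2 + x2^2)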

4.3 QR decomposition

Let A ∈ Rm×n and b ∈ Rm. By Theorem 3.3 (iii), for any orthogonal matrix Q, we have

‖Ax − b‖2 = ‖QT(Ax − b)‖2.

Therefore, the LS problem

min_x ‖QTAx − QTb‖2

is equivalent to (4.1). We wish to find a suitable orthogonal matrix Q such that the original LS problem becomes an LS problem that is easier to solve. We have


Theorem 4.9 (QR decomposition) Let A ∈ Rm×n (m ≥ n). Then A has a QR decomposition:

A = Q [ R ; 0 ], (4.8)

where Q ∈ Rm×m is an orthogonal matrix and R ∈ Rn×n is an upper triangular matrix with nonnegative diagonal entries. The decomposition is unique when m = n and A is nonsingular.

Proof: We use induction. When n = 1, the result is true by Theorem 4.8. Now, we assume that the theorem is true for all matrices in Rp×(n−1) with p ≥ n − 1. Let the first column of A ∈ Rm×n be a1. By Theorem 4.8 again, there exists an orthogonal matrix Q1 ∈ Rm×m such that

QT1 a1 = ‖a1‖2e1.

Therefore, we have

Q1TA = [ ‖a1‖2  vT ; 0  A1 ].

For the matrix A1 ∈ R(m−1)×(n−1), we obtain by assumption,

A1 = Q2 [ R2 ; 0 ],

where Q2 ∈ R(m−1)×(m−1) is an orthogonal matrix and R2 is an upper triangular matrix with nonnegative diagonal entries. Thus, let

Q = Q1 [ 1  0 ; 0  Q2 ],   [ R ; 0 ] = [ ‖a1‖2  vT ; 0  R2 ; 0  0 ].

Then Q and R are the matrices satisfying the conditions of the theorem.

When A ∈ Rm×m is nonsingular, we want to show that the QR decomposition is unique. Let

A = QR = Q̃R̃

where Q, Q̃ ∈ Rm×m are orthogonal matrices, and R, R̃ ∈ Rm×m are upper triangular matrices with nonnegative diagonal entries. Since A is nonsingular, we know that the diagonal entries of R and R̃ are positive. Therefore, the matrices

Q̃TQ = R̃R−1

are both orthogonal and upper triangular with positive diagonal entries. Thus

Q̃TQ = R̃R−1 = I,


i.e., Q = Q̃ and R = R̃.

A complex version of the QR decomposition is needed later on.

Corollary 4.2 Let A ∈ Cm×n (m ≥ n). Then A has a QR decomposition:

A = Q [ R ; 0 ],

where Q ∈ Cm×m is a unitary matrix and R ∈ Cn×n is an upper triangular matrix with nonnegative diagonal entries. The decomposition is unique when m = n and A is nonsingular.

Now we use the QR decomposition to solve the LS problem (4.1). Suppose that A ∈ Rm×n (m ≥ n) has linearly independent columns, b ∈ Rm, and A has a QR decomposition (4.8). Let Q be partitioned as

Q = [Q1 Q2] ,

and

QTb = [ Q1T ; Q2T ] b = [ c1 ; c2 ].

Then

‖Ax − b‖2² = ‖QTAx − QTb‖2² = ‖Rx − c1‖2² + ‖c2‖2².

The x is the solution of the LS problem (4.1) if and only if it is the solution of Rx = c1. Note that it is much easier to get the solution of (4.1) by solving Rx = c1 since R is an upper triangular matrix. We have the following algorithm for LS problems:

(1) Compute a QR decomposition of A.

(2) Compute c1 = QT1 b.

(3) Solve the upper triangular system Rx = c1.
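A compact NumPy sketch of these three steps (ours; numpy.linalg.qr with mode="reduced" returns Q1 and R directly, and a general solver is used in place of a dedicated back substitution):

import numpy as np
from numpy.linalg import qr, solve

def ls_by_qr(A, b):
    """Solve min ||Ax - b||_2 for A with full column rank via a QR decomposition."""
    Q1, R = qr(A, mode="reduced")    # A = Q1 R with Q1 of size m-by-n
    c1 = Q1.T @ b                    # c1 = Q1^T b
    return solve(R, c1)              # solve the triangular system R x = c1

# Fit a straight line y = x0 + x1*t to four points (an arbitrary illustration).
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.9])
A = np.column_stack([np.ones_like(t), t])
print(ls_by_qr(A, y))                # agrees with numpy.linalg.lstsq(A, y)[0]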

Finally, we discuss how to use Householder transformations to compute the QR decomposition of A. Let m = 7 and n = 5. Assume that we have already found Householder transformations H1 and H2 such that

H2H1A = [ × × × × ×
          0 × × × ×
          0 0 + × ×
          0 0 + × ×
          0 0 + × ×
          0 0 + × ×
          0 0 + × × ].


Now we construct a Householder transformation H̃3 ∈ R5×5 such that

H̃3 [ + ; + ; + ; + ; + ] = [ × ; 0 ; 0 ; 0 ; 0 ].

Let H3 = diag(I2, H̃3). We obtain

H3H2H1A = [ × × × × ×
            0 × × × ×
            0 0 × × ×
            0 0 0 × ×
            0 0 0 × ×
            0 0 0 × ×
            0 0 0 × × ].

In general, after n such steps, we can reduce the matrix A into the following form,

Hn Hn−1 · · · H1 A = [ R ; 0 ],

where R is an upper triangular matrix with nonnegative diagonal entries. By setting Q = H1 · · · Hn, we obtain

A = Q [ R ; 0 ].

Thus, we have the following algorithm.

Algorithm 4.2 (QR decomposition: Householder transformation)

for j = 1 : n
  [v, β] = house(A(j : m, j))
  A(j : m, j : n) = (Im−j+1 − βvvT)A(j : m, j : n)
  if j < m
    A(j + 1 : m, j) = v(2 : m − j + 1)
  end
end

We remark that the QR decomposition is not only a basic tool for solving LS problems but also an important tool for solving some other fundamental problems in NLA.

Exercises:


1. Let A ∈ Rm×n have full column rank. Prove that A + E also has full column rank if E satisfies ‖E‖2 < 1/‖A†‖2, where A† = (ATA)−1AT.

2. Let U = [uij] be a nonsingular upper triangular matrix. Show that

κ∞(U) ≥ max_i |uii| / min_i |uii|,

where κ∞(U) = ‖U‖∞‖U−1‖∞.

3. Let A ∈ Rm×n with m ≥ n and have full column rank. Show that

[ I  A ; AT  0 ] [ r ; x ] = [ b ; 0 ]

has a solution where x minimizes ‖Ax − b‖2.

4. Let x ∈ Rn and P be a Householder transformation such that

Px = ±‖x‖2e1.

Let G12, G23, · · · , Gn−1,n be Givens rotations, and let

Q = G12G23 · · ·Gn−1,n.

Suppose Qx = ±‖x‖2e1. Is P equal to Q? Give a proof or a counterexample.

5. Let A ∈ Rm×n. Show that X = A† minimizes ‖AX − I‖F over all X ∈ Rn×m. What is the minimum?

6. Let x = [ x1 ; x2 ] ∈ C2. Find an algorithm to compute a unitary matrix of the form

Q = [ c  s ; −s̄  c ], c ∈ R, c² + |s|² = 1,

such that Qx = [ ∗ ; 0 ].

7. Suppose an m-by-n matrix A has the form

A = [ A1 ; A2 ],

where A1 is an n-by-n nonsingular matrix and A2 is an (m − n)-by-n arbitrary matrix. Prove that ‖A†‖2 ≤ ‖A1−1‖2.

8. Consider the following well-known ill-conditioned matrix

A = [ 1  1  1
      ε  0  0
      0  ε  0
      0  0  ε ],  |ε| ≪ 1.


(a) Choose a small ε such that rank(A) = 3. Then compute κ2(A) to show that A is ill-conditioned.

(b) Find the LS solution with A given as above and b = (3, ε, ε, ε)T by using

(i) the normal equation method; (ii) the QR method.

9. Let A = BC where B ∈ Cm×r and C ∈ Cr×n with

r = rank(A) = rank(B) = rank(C).

Show that

A† = C∗(CC∗)−1(B∗B)−1B∗.

10. Let A = UΣV∗ ∈ Cm×n, where U ∈ Cm×n satisfies U∗U = I, V ∈ Cn×n satisfies V∗V = I and Σ is an n-by-n diagonal matrix. Show that

A† = VΣ†U∗.

11. Prove that

A† = lim_{ε→0}(A∗A + εI)−1A∗ = lim_{ε→0} A∗(AA∗ + εI)−1.

12. Show that

R(A) ∩ N(A∗) = {0}.

13. Let A = [aij] ∈ Cn×n be idempotent. Then

R(A) ⊕ N(A) = Cn,  rank(A) = ∑_{i=1}^{n} aii.

14. Let A ∈ Cm×n. Prove that

R(AA†) = R(AA∗) = R(A),

R(A†A) = R(A∗A) = R(A†) = R(A∗),

N (AA†) = N (AA∗) = N (A†) = N (A∗),

N (A†A) = N (A∗A) = N (A).

Therefore AA† and A†A are orthogonal projectors.

15. Prove Corollary 4.2.


Chapter 5

Classical Iterative Methods

We study classical iterative methods for the solution of Ax = b. Iterative methods, originally proposed by Gauss in 1823, Liouville in 1837, and Jacobi in 1845, are quite different from direct methods such as Gaussian elimination, see [2].

Direct methods based on an LU factorization of A become prohibitive in terms of computing time and computer storage if the matrix A is quite large. In some practical situations, such as the discretization of partial differential equations, the matrix size can be as large as several hundreds of thousands. For such problems, direct methods become impractical. Furthermore, most large problems are sparse, and usually the sparsity is lost during LU factorizations. Therefore, we have to face a very large matrix with many nonzero entries at the end of the LU factorization, and then the storage becomes a crucial issue. For such kinds of problems, we can use a class of methods called iterative methods. In this chapter, we only consider some classical iterative methods.

We remark that the disadvantage of classical iterative methods is that the convergence rate may be slow or they may even diverge, and a stopping criterion needs to be chosen.

5.1 Jacobi and Gauss-Seidel method

5.1.1 Jacobi method

Consider the following linear system

Ax = b

where A = [aij ] ∈ Rn×n. We can write the matrix A in the following form

A = D − L− U,

where

D = diag(a11, a22, · · · , ann),


L = [ 0
      −a21  0
      −a31  −a32  0
      ...   ...   . . .  . . .
      −an1  −an2  · · ·  −an,n−1  0 ],

and

U = [ 0  −a12  −a13  · · ·  −a1n
         0     −a23  · · ·  −a2n
               . . .  . . .   ...
                      0     −an−1,n
                            0 ].

Then it is easy to see that

x = BJx + g,

where

BJ = D−1(L + U), g = D−1b.

The matrix BJ is called the Jacobi iteration matrix. The corresponding iteration

xk = BJxk−1 + g, k = 1, 2, · · · , (5.1)

is known as the Jacobi method if an initial vector x0 = (x1(0), x2(0), · · · , xn(0))T is given.
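A minimal NumPy sketch of the Jacobi iteration (5.1) (ours, not from the text; the stopping test, tolerance and test matrix are arbitrary choices):

import numpy as np

def jacobi(A, b, x0, tol=1e-10, maxit=1000):
    """Jacobi iteration x_k = B_J x_{k-1} + g with B_J = D^{-1}(L+U), g = D^{-1} b."""
    D = np.diag(np.diag(A))
    B = np.linalg.solve(D, D - A)          # D^{-1}(L + U) = I - D^{-1} A
    g = np.linalg.solve(D, b)
    x = x0
    for _ in range(maxit):
        x_new = B @ x + g
        if np.linalg.norm(x_new - x, np.inf) < tol:
            return x_new
        x = x_new
    return x

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])            # strictly diagonally dominant
b = np.array([6.0, 8.0, 9.0])
print(jacobi(A, b, np.zeros(3)))           # approximates numpy.linalg.solve(A, b)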

5.1.2 Gauss-Seidel method

In the Jacobi method, to compute the components of the vector

xk+1 = (x1(k+1), x2(k+1), · · · , xn(k+1))T,

only the components of the vector xk are used. However, note that to compute xi(k+1), we could use x1(k+1), x2(k+1), · · · , xi−1(k+1), which are already available to us. Thus a natural modification of the Jacobi method is to rewrite the Jacobi iteration (5.1) in the following form

xk = (D − L)−1Uxk−1 + (D − L)−1b, k = 1, 2, · · · . (5.2)

The idea is to use each new component as soon as it is available in the computation of the next component. The iteration (5.2) is known as the Gauss-Seidel method.

Note that the matrix D − L is a lower triangular matrix with a11, · · · , ann on the diagonal. Because these entries are assumed to be nonzero, the matrix D − L is nonsingular. The matrix

BGS = (D − L)−1U

is called the Gauss-Seidel iteration matrix.


5.2 Convergence analysis

5.2.1 Convergence theorems

It is often hard to make a good initial approximation x0. Thus, it is useful to have conditions that guarantee the convergence of the Jacobi and Gauss-Seidel methods for an arbitrary choice of the initial approximation.

Both the Jacobi iteration and the Gauss-Seidel iteration can be expressed by

xk+1 = Bxk + g, k = 0, 1, · · · . (5.3)

For the Jacobi iteration, we have

BJ = D−1(L + U), g = D−1b;

and for the Gauss-Seidel iteration, we have

BGS = (D − L)−1U, g = (D − L)−1b.

The iteration (5.3) is called a linear stationary iteration, where B ∈ Rn×n is called the iteration matrix, g ∈ Rn the constant term, and x0 ∈ Rn the initial vector. In the following, we give a convergence theorem.

Theorem 5.1 The iteration (5.3) converges for an arbitrary initial guess x0 if and only if Bk → 0 as k → ∞.

Proof: From x = Bx + g and xk+1 = Bxk + g, we have

x− xk+1 = B(x− xk). (5.4)

Because it is true for any value of k, we can write

x− xk = B(x− xk−1). (5.5)

Substituting (5.5) into (5.4), we have

x− xk+1 = B2(x− xk−1).

Continuing this process k times, we can write

x− xk+1 = Bk+1(x− x0).

This shows that {xk} converges to the solution x for any choice of x0 if and only if Bk → 0 as k → ∞.


Recall that Bk → 0 as k → ∞ if and only if the spectral radius ρ(B) < 1. Since |λi| ≤ ‖B‖, a good way to see whether ρ(B) < 1 is to check whether ‖B‖ < 1 by computing ‖B‖ with a row-sum or column-sum norm; note, however, that the converse is not true. Combining the result of Theorem 5.1 with the above observation, we have the following theorem.

Theorem 5.2 The iteration (5.3) converges for any choice of x0 if and only if ρ(B) < 1. Moreover, if ‖B‖ < 1 for some matrix norm, then the iteration (5.3) converges.

Let us consider the examples:

A1 = [ 1  2  −2
       1  1   1
       2  2   1 ],   A2 = [ 2  −1   1
                            1   1   1
                            1   1  −2 ].

It is easy to verify that for A1 the Jacobi method converges even though the Gauss-Seidel method does not, while for A2 the Jacobi method diverges and the Gauss-Seidel method converges.
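These claims can be checked directly from Theorem 5.2 by computing the spectral radii of the two iteration matrices (a NumPy sketch of ours, not from the text):

import numpy as np

def iteration_radii(A):
    """Return (rho(B_J), rho(B_GS)) for the splitting A = D - L - U."""
    D = np.diag(np.diag(A))
    L = -np.tril(A, -1)
    U = -np.triu(A, 1)
    BJ  = np.linalg.solve(D, L + U)
    BGS = np.linalg.solve(D - L, U)
    rho = lambda M: max(abs(np.linalg.eigvals(M)))
    return rho(BJ), rho(BGS)

A1 = np.array([[1.0, 2.0, -2.0], [1.0, 1.0, 1.0], [2.0, 2.0, 1.0]])
A2 = np.array([[2.0, -1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, -2.0]])
print(iteration_radii(A1))   # rho(B_J) < 1 but rho(B_GS) > 1
print(iteration_radii(A2))   # rho(B_J) > 1 but rho(B_GS) < 1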

5.2.2 Sufficient conditions for convergence

We now apply Theorem 5.2 to give a sequence of criteria that guarantee the convergence of the Jacobi and/or Gauss-Seidel methods for any choice of initial approximation x0.

Definition 5.1 Let A = [aij ] ∈ Rn×n. If

|aii| > ∑_{j=1, j≠i}^{n} |aij|, i = 1, 2, · · · , n,

then the matrix A is called strictly diagonally dominant. The matrix A is called weakly diagonally dominant if for i = 1, 2, · · · , n,

|aii| ≥ ∑_{j=1, j≠i}^{n} |aij|

with at least one strict inequality.

Theorem 5.3 If A is strictly diagonally dominant, then the Jacobi method converges for any initial approximation x0.


Proof: Because A = [aij ] is strictly diagonally dominant, we have

|aii| > ∑_{j=1, j≠i}^{n} |aij|, i = 1, 2, · · · , n.

Recall that the Jacobi iteration matrix

BJ = D−1(L + U)

is given by

BJ = [ 0          −a12/a11  · · ·    · · ·    −a1n/a11
       −a21/a22   0         −a23/a22 · · ·    −a2n/a22
       ...        . . .     . . .    . . .    ...
       ...                  . . .    . . .    −an−1,n/an−1,n−1
       −an1/ann   · · ·     · · ·    −an,n−1/ann   0 ].

We know that the absolute row sum of each row is less than 1, which means

‖BJ‖∞ < 1.

Thus by Theorem 5.2, the Jacobi method converges.

Theorem 5.4 If A is strictly diagonally dominant, then the Gauss-Seidel method converges for any initial approximation x0.

Proof: The Gauss-Seidel iteration matrix is given by

BGS = (D − L)−1U.

Let λ be an eigenvalue of this matrix and x = (x1, x2, · · · , xn)T be a corresponding eigenvector, normalized so that its largest component has magnitude 1. Then from

BGSx = λx,

we have

Ux = λ(D − L)x,

i.e.,

− ∑_{j=i+1}^{n} aij xj = λ ∑_{j=1}^{i} aij xj, 1 ≤ i ≤ n,


which can be rewritten as

λ aii xi = −λ ∑_{j=1}^{i−1} aij xj − ∑_{j=i+1}^{n} aij xj, 1 ≤ i ≤ n. (5.6)

Let xk be a component of x with magnitude 1. Then by (5.6), we have

|λ| · |akk| ≤ |λ| ∑_{j=1}^{k−1} |akj| + ∑_{j=k+1}^{n} |akj|,

i.e.,

|λ| (|akk| − ∑_{j=1}^{k−1} |akj|) ≤ ∑_{j=k+1}^{n} |akj|,

or

|λ| ≤ ∑_{j=k+1}^{n} |akj| / (|akk| − ∑_{j=1}^{k−1} |akj|). (5.7)

Since A is strictly diagonally dominant, we have

|akk| − ∑_{j=1}^{k−1} |akj| > ∑_{j=k+1}^{n} |akj|.

Thus from (5.7), we conclude that |λ| < 1, i.e., ρ(BGS) < 1. By Theorem 5.2, the Gauss-Seidel method converges.

We now discuss the convergence of the Jacobi and Gauss-Seidel methods for symmetric positive definite matrices.

Theorem 5.5 Let A be symmetric with diagonal entries aii > 0, i = 1, 2, · · · , n. Then the Jacobi method converges if and only if both A and 2D − A are positive definite.

Proof: Since

BJ = D−1(L + U) = D−1(D − A) = I − D−1A,

and

D = diag(a11, a22, · · · , ann)

with aii > 0, i = 1, 2, · · · , n, then

BJ = I −D−1A = D−1/2(I −D−1/2AD−1/2)D1/2.


It is easy to see that I − D−1/2AD−1/2 is symmetric and similar to BJ. Then the eigenvalues of BJ are real.

Now, we first suppose that the Jacobi method converges; then ρ(BJ) < 1 by Theorem 5.2. The absolute values of the eigenvalues of

I − D−1/2AD−1/2

are less than 1, i.e., the eigenvalues of D−1/2AD−1/2 lie in (0, 2). Thus A is positive definite. On the other hand, the eigenvalues of 2I − D−1/2AD−1/2 are positive, so the matrix

2I −D−1/2AD−1/2

is positive definite. Since

D−1/2(2D −A)D−1/2 = 2I −D−1/2AD−1/2,

we know that 2D − A is positive definite too.

Conversely, since

D1/2(I − BJ)D−1/2 = D−1/2AD−1/2

and A is positive definite, it follows that the eigenvalues of I − BJ are positive, i.e., the eigenvalues of BJ are less than 1. Because 2D − A is positive definite and

D−1/2(2D − A)D−1/2 = D1/2(I + BJ)D−1/2,

we can deduce that the eigenvalues of I + BJ are positive, i.e., the eigenvalues of BJ are greater than −1. Thus ρ(BJ) < 1. By Theorem 5.2 again, the Jacobi method converges.

Theorem 5.6 Let A be a symmetric positive definite matrix. Then the Gauss-Seidel method converges for any initial approximation x0.

Proof: Let λ be an eigenvalue of the iteration matrix BGS and u a corresponding eigenvector. Then

(D − L)−1Uu = λu.

Since A is symmetric positive definite, we have U = LT and

λ(D − L)u = LT u.

Therefore,

λu∗(D − L)u = u∗LTu.


Let

u∗Du = δ, u∗Lu = α + iβ.

We then have

u∗LTu = (Lu)∗u = (u∗Lu)∗ = α − iβ.

Thus

λ[δ − (α + iβ)] = α − iβ.

Taking the modulus of both sides, we have

|λ|² = (α² + β²)/((δ − α)² + β²).

On the other hand,

0 < u∗Au = u∗(D − L− LT )u = δ − 2α.

Hence,

(δ − α)² + β² = δ² + α² + β² − 2δα = δ(δ − 2α) + α² + β² > α² + β².

So we get |λ| < 1. By Theorem 5.2, the Gauss-Seidel method converges.

Definition 5.2 A matrix A is called irreducible if there is no permutation matrix P such that

PAPT = [ A11  A12 ; 0  A22 ],

where A11 and A22 are square matrices.

We have the following lemma, see [41].

Lemma 5.1 If a matrix A is strictly diagonally dominant, or irreducible and weakly diagonally dominant, then A is nonsingular.

Furthermore,

Theorem 5.7 We have


(i) If A is strictly diagonally dominant, then both the Jacobi and the Gauss-Seidel methods converge. In fact,

‖BGS‖∞ ≤ ‖BJ‖∞ < 1.

(ii) If A is irreducible and weakly diagonally dominant, then both the Jacobi and the Gauss-Seidel methods converge. Moreover,

ρ(BGS) < ρ(BJ) < 1.

For the proof of this theorem, we refer to [14, 41].

5.3 Convergence rate

We consider a stationary linear iteration method,

xk+1 = Mxk + g, k = 0, 1, · · · ,

where M is an n-by-n matrix. If I − M is nonsingular, then there exists a unique solution of

(I −M)x∗ = g.

The error vectors yk are defined as yk = xk − x∗, for k = 0, 1, · · ·, and then

yk = Myk−1 = · · · = Mky0.

Using matrix and vector norms, we have

‖yk‖ ≤ ‖Mk‖ ‖y0‖

with equality possible for each k for some vector y0. Thus, if y0 is not the zero vector, then ‖Mk‖ gives us a sharp upper-bound estimate for the ratio ‖yk‖/‖y0‖. Since the initial vector y0 is unknown in practical problems, ‖Mk‖ serves as a measure for comparing different iterative methods.

Definition 5.3 Let M be an n-by-n iteration matrix. If ‖Mk‖ < 1 for some positive integer k, then

Rk(M) ≡ − ln[(‖Mk‖)^{1/k}] = −(ln ‖Mk‖)/k


In terms of actual computations, the significance of the average rate of convergence Rk(M) is given as follows. Clearly, the quantity

σ = (‖yk‖/‖y0‖)^{1/k}

is the average reduction factor per iteration for the norm of the error. If ‖Mk‖ < 1, then by Definition 5.3, we have

σ ≤ (‖Mk‖)^{1/k} = e^{−Rk(M)}

where e is the base of the natural logarithm.

If M is symmetric (or Hermitian, or normal, i.e., MM∗ = M∗M), by using the spectral radius of the iteration matrix M, we then have

‖Mk‖2 = [ρ(M)]k,

and thus,

Rk(M) = − ln ρ(M),

which is independent of k.

Next we will consider the asymptotic convergence rate

R∞(M) ≡ lim_{k→∞} Rk(M).

Theorem 5.8 We have

R∞(M) = − ln ρ(M).

Proof: We only need to prove that

lim_{k→∞} ‖Mk‖^{1/k} = ρ(M).

Since

[ρ(M)]^k = ρ(Mk) ≤ ‖Mk‖,

we have

ρ(M) ≤ ‖Mk‖^{1/k}.

On the other hand, for any ε > 0, consider the matrix

Bε = M/(ρ(M) + ε).

It is obvious that ρ(Bε) < 1 and then lim_{k→∞} Bε^k = 0. Hence, there exists a natural number K such that for k ≥ K, we have

‖Bε^k‖ ≤ 1,


i.e.,

‖Mk‖ ≤ [ρ(M) + ε]^k.

Thus,

ρ(M) ≤ ‖Mk‖^{1/k} ≤ ρ(M) + ε,

which means

lim_{k→∞} ‖Mk‖^{1/k} = ρ(M).

5.4 SOR method

The Gauss-Seidel method is very slow when ρ(BGS) is close to unity. However, the convergence rate of the Gauss-Seidel iteration can, in certain cases, be improved by introducing a parameter ω, known as the relaxation parameter. This method is called the successive overrelaxation (SOR) method.

5.4.1 Iterative form

Let A = [aij] ∈ Rn×n and A = D − L − U be defined as in Section 5.1.1. Consider the solution of Ax = b again. The motivation of the SOR method is to improve the Gauss-Seidel iteration by taking an appropriately weighted average of xi(k) and xi(k+1), yielding the following algorithm:

xi(k+1) = (1 − ω)xi(k) + ω ( ∑_{j=1}^{i−1} cij xj(k+1) + ∑_{j=i+1}^{n} cij xj(k) + gi ). (5.8)

Here D−1(L + U) = [cij], xk = (x1(k), x2(k), · · · , xn(k))T, and g = D−1b = (g1, g2, · · · , gn)T.

In matrix form, we have

xk+1 = Lωxk + ω(D − ωL)−1b

where

Lω = (D − ωL)−1[(1 − ω)D + ωU]

is called the iteration matrix of the SOR method and ω is the relaxation parameter. We have three cases depending on the value of ω:

(1) if ω = 1, (5.8) is equivalent to the Gauss-Seidel method;

(2) if ω < 1, (5.8) is called underrelaxation;

(3) if ω > 1, (5.8) is called overrelaxation.
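A NumPy sketch of the componentwise SOR sweep (5.8) (ours, not from the text; the tolerance, iteration limit and test matrix are arbitrary). Taking ω = 1 reproduces the Gauss-Seidel method:

import numpy as np

def sor(A, b, omega, x0, tol=1e-10, maxit=10_000):
    """SOR sweeps for Ax = b; omega = 1 gives the Gauss-Seidel method."""
    n = len(b)
    x = x0.astype(float).copy()
    for _ in range(maxit):
        x_old = x.copy()
        for i in range(n):
            # sum over already-updated components (j < i) and old components (j > i)
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x_old[i + 1:]
            x[i] = (1 - omega) * x_old[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < tol:
            break
    return x

A = np.array([[4.0, 1.0, 1.0], [1.0, 5.0, 2.0], [1.0, 2.0, 6.0]])
b = np.array([6.0, 8.0, 9.0])
print(sor(A, b, omega=1.2, x0=np.zeros(3)))   # approximates numpy.linalg.solve(A, b)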


5.4.2 Convergence criteria

It is natural to ask for which range of ω the SOR iteration converges. To this end, we first prove the following important result, see [14, 41].

Theorem 5.9 The SOR iteration cannot converge for every initial approximation if ω lies outside the interval (0, 2).

Proof: Recall that the SOR iteration matrix Lω is given by

Lω = (D − ωL)−1[(1− ω)D + ωU ],

where A = [aij] = D − L − U.

The matrix (D − ωL)−1 is a lower triangular matrix with 1/aii, i = 1, 2, · · · , n, as diagonal entries, and the matrix

(1 − ω)D + ωU

is an upper triangular matrix with (1 − ω)aii, i = 1, 2, · · · , n, as diagonal entries. Therefore,

det(Lω) = (1 − ω)^n.

Since the determinant of a matrix is equal to the product of its eigenvalues, we conclude that

ρ(Lω) ≥ |1 − ω|,

where ρ(Lω) is the spectral radius of Lω. By Theorem 5.2, the spectral radius of the iteration matrix should be less than 1 for convergence. We then conclude that 0 < ω < 2 is required for the convergence of the SOR method.

Theorem 5.10 If A is symmetric positive definite, then

ρ(Lω) < 1

for all 0 < ω < 2.

Proof: Let λ be any eigenvalue of the SOR iteration matrix

Lω = (D − ωL)−1[(1− ω)D + ωLT ]

and x be the corresponding eigenvector. We have

[(1− ω)D + ωLT ]x = λ(D − ωL)x


or

x∗[(1 − ω)D + ωLT]x = λx∗(D − ωL)x.

Let

x∗Dx = δ, x∗Lx = α + iβ.

Therefore, x∗LT x = α− iβ and then

(1− ω)δ + ω(α− iβ) = λ[δ − ω(α + iβ)].

Taking modulus of both sides, we obtain

|λ|² = ([(1 − ω)δ + ωα]² + ω²β²) / ((δ − ωα)² + ω²β²). (5.9)

Note that

[(1 − ω)δ + ωα]² + ω²β² − (δ − ωα)² − ω²β² = [δ − ω(δ − α)]² − (δ − ωα)² = ωδ(δ − 2α)(ω − 2).

Since A is symmetric positive definite, we have

δ > 0, δ − 2α > 0.

Therefore, if 0 < ω < 2, we have

[(1 − ω)δ + ωα]^2 + ω^2β^2 < (δ − ωα)^2 + ω^2β^2.    (5.10)

Thus, for 0 < ω < 2, we obtain by (5.9) and (5.10),

|λ|^2 < 1,

i.e., the SOR method converges.

5.4.3 Optimal ω in SOR iteration

For further comparison of the Jacobi, Gauss-Seidel and SOR methods, we impose another condition on matrices. This condition allows us to compute ρ(BGS) and ρ(Lω) explicitly in terms of ρ(BJ).

Definition 5.4 A matrix M has property A if there exists a permutation matrix P such that

PMP^T = [ D11  M12
          M21  D22 ]

where D11 and D22 are diagonal matrices.


If M has property A, then we can write

PMP^T = D − L − U

where

D = [ D11   0
      0     D22 ],

L = [ 0     0
      −M21  0 ],

U = [ 0   −M12
      0    0   ].

Let

BJ(α) ≡ αD^{−1}L + (1/α)D^{−1}U.

We have

Theorem 5.11 The eigenvalues of BJ(α) are independent of α.

Proof: Just note that the matrix

BJ(α) = − [ 0,  (1/α) D11^{−1} M12 ;  α D22^{−1} M21,  0 ]

is similar to the matrix

[ I  0 ; 0  αI ]^{−1} BJ(α) [ I  0 ; 0  αI ] = − [ 0,  D11^{−1} M12 ;  D22^{−1} M21,  0 ] = BJ(1).

Definition 5.5 Let M = D − L − U and

BJ(α) = αD^{−1}L + (1/α)D^{−1}U.

If the eigenvalues of BJ(α) are independent of α, then M is said to be consistently ordered (or to have consistent ordering).

Note that BJ(1) = BJ is the Jacobi iteration matrix. From Theorem 5.11, we know that if M has property A, then PMP^T is consistently ordered, where P is a permutation matrix such that

PMP^T = [ D11  M12
          M21  D22 ]

with D11 and D22 being diagonal. The converse is not true: consistent ordering does not imply property A.


Example 5.1. Any block tridiagonal matrix

[ D1    A1
  B1    ⋱     ⋱
        ⋱     ⋱     A_{n−1}
              B_{n−1}  Dn     ]

is consistently ordered when the Di are diagonal.

The following theorem gives a relation between the eigenvalues of BJ and the eigenvalues of Lω, see [14, 41].

Theorem 5.12 If A is consistently ordered and ω ≠ 0, then

(i) The eigenvalues of BJ appear in ± pairs.

(ii) If µ is an eigenvalue of BJ and λ satisfies

(λ + ω − 1)^2 = λ ω^2 µ^2,    (5.11)

then λ is an eigenvalue of Lω.

(iii) If λ ≠ 0 is an eigenvalue of Lω and µ satisfies (5.11), then µ is an eigenvalue of BJ.

Corollary 5.1 If A is consistently ordered, then

ρ(BGS) = [ρ(BJ)]^2.

This means that the asymptotic convergence rate of the Gauss-Seidel method is twice that of the Jacobi method.

Proof: The choice ω = 1 corresponds to the Gauss-Seidel method. Therefore, by (5.11), we have λ^2 = λµ^2 and then, for λ ≠ 0,

λ = µ^2.

To get the most benefit from overrelaxation, we would like to find an optimal ω, denoted by ωopt, minimizing ρ(Lω). We have the following theorem, see [14, 41].


Theorem 5.13 Suppose that A is consistently ordered and BJ has real eigenvalues with µ = ρ(BJ) < 1. Then

ωopt = 2 / (1 + √(1 − µ^2)),

ρ(Lωopt) = µ^2 / (1 + √(1 − µ^2))^2,

and

ρ(Lω) = { ω − 1,                                                  ωopt ≤ ω < 2,
        { 1 − ω + (1/2)ω^2µ^2 + ωµ√(1 − ω + (1/4)ω^2µ^2),        0 < ω ≤ ωopt.
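As a quick numerical illustration of Theorem 5.13 (the helper below is ours, not from the text), one can evaluate ωopt and the resulting spectral radius directly:

```python
import numpy as np

def sor_optimal(mu):
    """Given mu = rho(B_J) < 1 for a consistently ordered A, return (omega_opt, rho(L_omega_opt))."""
    omega_opt = 2.0 / (1.0 + np.sqrt(1.0 - mu**2))
    rho_opt = mu**2 / (1.0 + np.sqrt(1.0 - mu**2))**2   # equals omega_opt - 1
    return omega_opt, rho_opt

# For mu = 0.99: omega_opt ≈ 1.7527 and rho(L_omega_opt) ≈ 0.7527,
# compared with rho(B_GS) = mu^2 ≈ 0.9801 for Gauss-Seidel (Corollary 5.1).
```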

Exercises:

1. Judge the convergence of the Jacobi method and the Gauss-Seidel method for the following examples:

A1 = [ 1  2  −2
       1  1   1
       2  2   1 ],      A2 = [ 2  −1   1
                               1   1   1
                               1   1  −2 ].

2. Show that the Jacobi method converges for 2-by-2 symmetric positive definite systems.

3. Show that if A = M − N is singular, then we can never have ρ(M^{−1}N) < 1 even if M is nonsingular.

4. Consider Ax = b where

A = [ 1  0  α
      0  1  0
      α  0  1 ].

(1) For which α is A positive definite?

(2) For which α does the Jacobi method converge?

(3) For which α does the Gauss-Seidel method converge?

5. Let A ∈ R^{n×n} be nonsingular. Show that there exists a permutation matrix P such that the diagonal entries of PA are nonzero.

6. Prove that

A = [  4  −1  −1   0
      −1   4   0  −1
      −1   0   4  −1
       0  −1  −1   4 ]

is consistently ordered.


7. Let A ∈ R^{n×n}. Show that ρ(A) < 1 if and only if I − A is nonsingular and each eigenvalue of (I − A)^{−1}(I + A) has a positive real part.

8. Let B ∈ R^{n×n} satisfy ρ(B) = 0. Show that for any g, x0 ∈ R^n, the iterative formula

x_{k+1} = B x_k + g,   k = 0, 1, · · ·

converges to the exact solution of x = Bx + g in at most n iterations.

9. Let A = [a_ij] ∈ R^{n×n} be irreducible with a_ij ≥ 0 and let B = [b_ij] ∈ R^{n×n} with b_ij ≥ 0. Show that A + B is irreducible.

10. Let

A = [ 0  a  0
      0  0  b
      c  0  0 ].

Is A^k irreducible (k = 1, 2, 3)?

11. Let

A = [ 2  −1
      1   0 ].

Show that

(1) A^k = k [ 1 + 1/k   −1
              1          1/k − 1 ],   k = 1, 2, · · · .

(2) lim_{k→∞} ‖A^k‖_p / k = 2,   p = 1, 2, ∞.

(3) ρ(A^k) = 1.

12. Let

B_k = B_{k−1} + B_{k−1}(I − A B_{k−1}),   k = 1, 2, · · · .

Show that if ‖I − A B_0‖ = c < 1, then

lim_{k→∞} B_k = A^{−1}

and

‖A^{−1} − B_k‖ ≤ ( c^{2^k} / (1 − c) ) ‖B_0‖.

13. Prove Theorem 5.12.

14. Prove Theorem 5.13.


Chapter 6

Krylov Subspace Methods

In this chapter, we will introduce a class of iterative methods called Krylov subspace methods. Among the Krylov subspace methods developed for large sparse problems, we will mainly study two: the conjugate gradient (CG) method and the generalized minimum residual (GMRES) method. The CG method, proposed by Hestenes and Stiefel in 1952, is one of the best known iterative methods for solving symmetric positive definite linear systems, see [16]. The GMRES method was proposed by Saad and Schultz in 1986 for solving nonsymmetric linear systems, see [34]. As usual, let us begin our discussion with the steepest descent method.

6.1 Steepest descent method

We consider the linear system

Ax = b,

where A ∈ R^{n×n} is a symmetric positive definite matrix and b ∈ R^n is a known vector. We define the following quadratic function:

φ(x) ≡ x^T A x − 2 b^T x.    (6.1)

Theorem 6.1 Let A ∈ R^{n×n} be a symmetric positive definite matrix. Then finding the solution of Ax = b is equivalent to finding the minimum of the function (6.1).

Proof: Note that

∂φ/∂x_i = 2(a_{i1} x_1 + · · · + a_{in} x_n) − 2 b_i,   i = 1, 2, · · · , n.

We therefore have

grad φ(x) = 2(Ax − b) = −2r,    (6.2)


where grad φ(x) denotes the gradient of φ(x) and r = b − Ax. If φ(x) reaches its minimum at a point x∗, then

grad φ(x∗) = 0,

i.e., Ax∗ = b, which means that x∗ is the solution of the system. Conversely, if x∗ is the solution of the system, then for any vector y, we have

φ(x∗ + y) = (x∗ + y)^T A(x∗ + y) − 2 b^T(x∗ + y)
          = x∗^T A x∗ − 2 b^T x∗ + 2 y^T(Ax∗ − b) + y^T A y = φ(x∗) + y^T A y,

where the cross term vanishes because Ax∗ = b. Since A is symmetric positive definite, we have y^T A y ≥ 0. Hence,

φ(x∗ + y) ≥ φ(x∗),

i.e., φ(x) reaches its minimum at the point x∗.

How do we find the minimum of (6.1)? Usually, for any given initial vector x0, we choose a direction p0, and then we try to find a point

x1 = x0 + α0 p0

on the line x = x0 + αp0 such that

φ(x1) = φ(x0 + α0 p0) ≤ φ(x0 + αp0).

It means that along this line, φ(x) reaches its minimum at the point x1. Afterwards, starting from x1, we choose another direction p1, and then we try to find a point

x2 = x1 + α1 p1

on the line x = x1 + αp1 such that

φ(x2) = φ(x1 + α1 p1) ≤ φ(x1 + αp1),

i.e., along this line, φ(x) reaches its minimum at the point x2. Step by step, we have

p0, p1, · · · ,   and   α0, α1, · · · ,

where {pk} are line search directions and {αk} are step sizes. In general, at a point xk we choose a direction pk and then determine a step size αk along the line x = xk + αpk such that

φ(xk + αk pk) ≤ φ(xk + αpk).

We then obtain x_{k+1} = xk + αk pk. We remark that different ways of choosing line search directions and step sizes give different algorithms for solving (6.1).


In the following, we first consider how to determine a step size αk. Starting from a point xk along a direction pk, we want to find a step size αk on the line x = xk + αpk such that

φ(xk + αk pk) ≤ φ(xk + αpk).

Let

f(α) = φ(xk + αpk)
     = (xk + αpk)^T A(xk + αpk) − 2 b^T(xk + αpk)
     = α^2 pk^T A pk − 2α rk^T pk + φ(xk),

where rk = b − A xk. We have

df/dα = 2α pk^T A pk − 2 rk^T pk = 0.

Therefore,

αk = rk^T pk / pk^T A pk.    (6.3)

Once we get αk, we can compute

x_{k+1} = xk + αk pk.

Is φ(x_{k+1}) ≤ φ(xk)? We consider

φ(x_{k+1}) − φ(xk) = φ(xk + αk pk) − φ(xk)
                   = αk^2 pk^T A pk − 2αk rk^T pk
                   = −(rk^T pk)^2 / pk^T A pk ≤ 0.

If rk^T pk ≠ 0, then φ(x_{k+1}) < φ(xk).

Now we consider how to choose a direction pk. It is well known that the steepest descent direction of φ(x) is the negative direction of the gradient, i.e., pk = rk by (6.2). Thus we have the steepest descent method. In order to discuss the convergence rate of the method, we introduce the following lemma first.
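A minimal NumPy sketch of the steepest descent iteration with pk = rk and the step size (6.3) (the function name and stopping rule are illustrative assumptions, not from the text):

```python
import numpy as np

def steepest_descent(A, b, x0=None, tol=1e-10, max_iter=10_000):
    """Steepest descent for symmetric positive definite A: p_k = r_k, alpha_k = r_k^T r_k / r_k^T A r_k."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)   # step size (6.3) with p_k = r_k
        x = x + alpha * r
        r = r - alpha * Ar           # r_{k+1} = b - A x_{k+1}
    return x
```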

Lemma 6.1 Let 0 < λ1 ≤ · · · ≤ λn be the eigenvalues of a symmetric positive definite matrix A and let P(t) be a real polynomial in t. Then

‖P(A)x‖_A ≤ max_{1≤j≤n} |P(λj)| · ‖x‖_A,   x ∈ R^n,

where ‖x‖_A ≡ √(x^T A x).


Proof: Let y1, y2, · · · , yn be the eigenvectors of A corresponding to the eigenvalues λ1, λ2, · · · , λn, respectively, chosen so that y1, y2, · · · , yn form an orthonormal basis of R^n. Therefore, for any x ∈ R^n, we have x = Σ_{i=1}^{n} βi yi and furthermore,

x^T P(A) A P(A) x = ( Σ_{i=1}^{n} βi P(λi) yi )^T A ( Σ_{i=1}^{n} βi P(λi) yi )
                  = Σ_{i=1}^{n} λi βi^2 P^2(λi) ≤ max_{1≤j≤n} P^2(λj) Σ_{i=1}^{n} λi βi^2
                  = max_{1≤j≤n} P^2(λj) x^T A x.

Then

‖P(A)x‖_A ≤ max_{1≤j≤n} |P(λj)| · ‖x‖_A.

For the steepest descent method, we have the following convergence theorem.

Theorem 6.2 Let 0 < λ1 ≤ · · · ≤ λn be the eigenvalues of a symmetric positive definite matrix A. Then the sequence {xk} produced by the steepest descent method satisfies

‖xk − x∗‖_A ≤ ( (λn − λ1)/(λn + λ1) )^k ‖x0 − x∗‖_A,

where x∗ is the exact solution of Ax = b.

Proof: The xk produced by the steepest descent method satisfies

φ(xk) ≤ φ(x_{k−1} + α r_{k−1}),   α ∈ R.

By noting that

φ(x) + x∗^T A x∗ = (x − x∗)^T A(x − x∗),

we have

(xk − x∗)^T A(xk − x∗) ≤ (x_{k−1} + α r_{k−1} − x∗)^T A(x_{k−1} + α r_{k−1} − x∗)
                        = [(I − αA)(x_{k−1} − x∗)]^T A [(I − αA)(x_{k−1} − x∗)],    (6.4)

for any α ∈ R. Let Pα(t) = 1 − αt. By using Lemma 6.1, we have from (6.4),

‖xk − x∗‖_A ≤ ‖Pα(A)(x_{k−1} − x∗)‖_A ≤ max_{1≤j≤n} |Pα(λj)| · ‖x_{k−1} − x∗‖_A,    (6.5)


for any α ∈ R. By using properties of the Chebyshev approximation, see [33], we have

min_α max_{λ1≤t≤λn} |1 − αt| = (λn − λ1)/(λn + λ1).    (6.6)

Substituting (6.6) into (6.5), we obtain

‖xk − x∗‖_A ≤ ( (λn − λ1)/(λn + λ1) ) ‖x_{k−1} − x∗‖_A.

Thus,

‖xk − x∗‖_A ≤ ( (λn − λ1)/(λn + λ1) )^k ‖x0 − x∗‖_A.

6.2 Conjugate gradient method

In this section, we introduce the conjugate gradient (CG) method, which is one of the most important Krylov subspace methods.

6.2.1 Conjugate gradient method

The basic idea of the CG method is as follows. For a given initial vector x0, in the first step we still choose the direction of the negative gradient, i.e., p0 = r0. Then we have

α0 = r0^T p0 / p0^T A p0,   x1 = x0 + α0 p0,   r1 = b − A x1.

Afterwards, at the (k + 1)-th step (k ≥ 1), we want to choose a direction pk on the plane

π2 = {x = xk + ξ rk + η p_{k−1} : ξ, η ∈ R}

such that φ decreases most rapidly. Consider φ on π2:

ψ(ξ, η) = φ(xk + ξ rk + η p_{k−1})
        = (xk + ξ rk + η p_{k−1})^T A(xk + ξ rk + η p_{k−1}) − 2 b^T(xk + ξ rk + η p_{k−1}).

By directly computing, we have

∂ψ/∂ξ = 2( ξ rk^T A rk + η rk^T A p_{k−1} − rk^T rk ),

∂ψ/∂η = 2( ξ rk^T A p_{k−1} + η p_{k−1}^T A p_{k−1} ),


where we use rk^T p_{k−1} = 0 (see Theorem 6.3 later). Setting

∂ψ/∂ξ = ∂ψ/∂η = 0,

we find a unique minimum point

x = xk + ξ0 rk + η0 p_{k−1}

of φ on the plane π2, where ξ0 and η0 satisfy

ξ0 rk^T A rk + η0 rk^T A p_{k−1} = rk^T rk,
ξ0 rk^T A p_{k−1} + η0 p_{k−1}^T A p_{k−1} = 0.    (6.7)

Note from (6.7) that if rk ≠ 0 then ξ0 ≠ 0. We can therefore choose

pk = (1/ξ0)(x − xk) = rk + (η0/ξ0) p_{k−1}

as a new direction, which is the optimal direction for minimizing φ on the plane π2. Let β_{k−1} = η0/ξ0. Then, by using the second equation in (6.7), we have

β_{k−1} = − rk^T A p_{k−1} / p_{k−1}^T A p_{k−1}.

Note that pk satisfies pk^T A p_{k−1} = 0 (see Theorem 6.3 later), i.e., pk and p_{k−1} are mutually A-conjugate. Once we get pk, we can determine αk by using (6.3) and then compute

x_{k+1} = xk + αk pk.

In conclusion, we have the following formulas:

αk = rk^T pk / pk^T A pk,

x_{k+1} = xk + αk pk,

r_{k+1} = b − A x_{k+1},

βk = − r_{k+1}^T A pk / pk^T A pk,

p_{k+1} = r_{k+1} + βk pk.


After a few elementary computations, we obtain

αk = rk^T rk / pk^T A pk,   βk = r_{k+1}^T r_{k+1} / rk^T rk.

Thus, the scheme of the CG method, one of the most popular and successful iterative methods for solving symmetric positive definite systems Ax = b, is given as follows. At the initialization step, for k = 0, we choose x0 and then calculate

r0 = b − A x0.

While rk ≠ 0, the iteration steps are

    k = k + 1
    if k = 1
        p0 = r0
    else
        β_{k−2} = r_{k−1}^T r_{k−1} / r_{k−2}^T r_{k−2}
        p_{k−1} = r_{k−1} + β_{k−2} p_{k−2}
    end
    α_{k−1} = r_{k−1}^T r_{k−1} / p_{k−1}^T A p_{k−1}
    xk = x_{k−1} + α_{k−1} p_{k−1}
    rk = r_{k−1} − α_{k−1} A p_{k−1}

where rk, pk are vectors and αk, βk are scalars, k = 0, 1, · · ·. Here xk is the approximation to the exact solution after the k-th iteration. When rk = 0, the solution is x = xk.

6.2.2 Basic properties

Theorem 6.3 The vectors {ri} and {pi} satisfy the following properties:

(1) p_i^T r_j = 0, 0 ≤ i < j ≤ k;

(2) r_i^T r_j = 0, i ≠ j, 0 ≤ i, j ≤ k;

(3) p_i^T A p_j = 0, i ≠ j, 0 ≤ i, j ≤ k;

(4) span{r0, · · · , rk} = span{p0, · · · , pk} = K(A, r0, k + 1), where

K(A, r0, k + 1) ≡ span{r0, A r0, · · · , A^k r0}

is called the Krylov subspace.


Proof: We use induction. For k = 1, we have

p0 = r0,   r1 = r0 − α0 A p0,   p1 = r1 + β0 p0.

Then

p0^T r1 = r0^T r1 = r0^T (r0 − α0 A p0) = r0^T r0 − α0 p0^T A p0 = 0,

provided α0 = r0^T r0 / p0^T A p0, and

p1^T A p0 = (r1 + β0 r0)^T A r0 = r1^T A r0 − ( r1^T A r0 / r0^T A r0 ) r0^T A r0 = 0.

Now, we assume that the theorem is true for k and prove that it also holds for k + 1.

For (1), by using the induction assumption and r_{k+1} = rk − αk A pk, we have

p_i^T r_{k+1} = p_i^T rk − αk p_i^T A pk = 0,   0 ≤ i ≤ k − 1,

and also

p_k^T r_{k+1} = p_k^T rk − ( p_k^T rk / p_k^T A pk ) p_k^T A pk = 0.

Thus, (1) is true for k + 1.

For (2), we have by the induction assumption,

span{r0, · · · , rk} = span{p0, · · · , pk}.

By (1), we know that r_{k+1} is orthogonal to this subspace. Therefore, (2) is true for k + 1.

For (3), by using the induction assumption, (2), and

p_{k+1} = r_{k+1} + βk pk,   r_{i+1} = r_i − α_i A p_i,

we have

p_{k+1}^T A p_i = (1/α_i) r_{k+1}^T (r_i − r_{i+1}) + βk p_k^T A p_i = 0,   i = 0, 1, · · · , k − 1.

By the definition of βk, we have

p_{k+1}^T A pk = (r_{k+1} + βk pk)^T A pk = r_{k+1}^T A pk − ( r_{k+1}^T A pk / p_k^T A pk ) p_k^T A pk = 0.

Then, (3) holds for k + 1.

For (4), we know by the induction assumption that

rk, pk ∈ K(A, r0, k + 1) = span{r0, A r0, · · · , A^k r0}.


Therefore

r_{k+1} = rk − αk A pk ∈ K(A, r0, k + 2) = span{r0, A r0, · · · , A^{k+1} r0},

and

p_{k+1} = r_{k+1} + βk pk ∈ K(A, r0, k + 2) = span{r0, A r0, · · · , A^{k+1} r0}.

By (2) and (3), we note that the vectors r0, · · · , r_{k+1} and p0, · · · , p_{k+1} are linearly independent. Thus, (4) is true for k + 1.

We remark that, by Theorem 6.3, the CG method obtains the exact solution of an n-by-n system in at most n steps. Therefore, from a theoretical viewpoint, the CG method is a direct method.

Theorem 6.4 The xk obtained from the CG method satisfies

φ(xk) = min{φ(x) : x ∈ x0 + K(A, r0, k)}    (6.8)

or, equivalently,

‖xk − x∗‖_A = min{‖x − x∗‖_A : x ∈ x0 + K(A, r0, k)},    (6.9)

where ‖x‖_A = √(x^T A x) and x∗ is the exact solution of Ax = b.

Proof: Since (6.8) and (6.9) are equivalent, we only need to prove (6.9). Suppose that rl = 0 at the l-th step of the CG method; then we have

x∗ = xl = x_{l−1} + α_{l−1} p_{l−1}
        = x_{l−2} + α_{l−2} p_{l−2} + α_{l−1} p_{l−1}
        = · · ·
        = x0 + α0 p0 + · · · + α_{l−1} p_{l−1}.

For k < l, we have

xk = x0 + α0 p0 + · · · + α_{k−1} p_{k−1} ∈ x0 + K(A, r0, k).

Let x be any vector in x0 + K(A, r0, k). Then by Theorem 6.3 (4), we have

x = x0 + γ0 p0 + · · · + γ_{k−1} p_{k−1}.

Moreover,

x∗ − x = (α0 − γ0) p0 + · · · + (α_{k−1} − γ_{k−1}) p_{k−1} + αk pk + · · · + α_{l−1} p_{l−1}.


Since

x∗ − xk = αk pk + · · · + α_{l−1} p_{l−1},

we have, by using Theorem 6.3 (3),

‖x∗ − x‖_A^2 = ‖(α0 − γ0) p0 + · · · + (α_{k−1} − γ_{k−1}) p_{k−1}‖_A^2 + ‖αk pk + · · · + α_{l−1} p_{l−1}‖_A^2
             ≥ ‖αk pk + · · · + α_{l−1} p_{l−1}‖_A^2 = ‖x∗ − xk‖_A^2.

6.3 Practical CG method and convergence analysis

In this section, we give a practical algorithm for the CG method and analyze its convergence rate.

6.3.1 Practical CG method

From Theorem 6.3, we know that the CG method obtains the exact solution after n steps in exact arithmetic, where n is the size of the system. In other words, the CG method can be regarded as a direct method rather than an iterative method. When n is very large, in practice we use the CG method as an iterative method and stop the iterations when

(i) ‖rk‖ is less than ε, where rk = b − A xk and ε is a given error bound; or

(ii) the number of iterations reaches kmax, the largest number of iterations allowed, where kmax ≪ n.

We then have the following practical algorithm for solving symmetric positive definite systems Ax = b. At the initialization step k = 0, we choose an initial vector x and calculate

r = b − Ax,   ρ = r^T r.


While √ρ > ε‖b‖2 and k < kmax, the iteration steps are

    k = k + 1
    if k = 1
        p = r
    else
        β = ρ/ρ̃;  p = r + βp
    end
    w = Ap;  α = ρ/p^T w;  x = x + αp
    r = r − αw;  ρ̃ = ρ;  ρ = r^T r

where r, p, w are vectors and ρ, ρ̃, α, β are scalars (ρ̃ holds the value of ρ from the previous iteration). A NumPy sketch of this loop is given after the remarks below. We remark that:

(1) The algorithm only requires matrix-vector multiplications. If the matrix is sparse or has a special structure, then these multiplications can be done efficiently by using sparse or fast solvers.

(2) Unlike the SOR method, we do not need to estimate any parameter in the algorithm.

(3) In each iteration, the vector operations can be carried out with parallel algorithms.
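A minimal NumPy sketch of the practical CG loop above (the function name and defaults are illustrative assumptions, not from the text):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, eps=1e-10, kmax=None):
    """Practical CG for symmetric positive definite A, stopping when sqrt(rho) <= eps * ||b||_2."""
    n = len(b)
    kmax = n if kmax is None else kmax
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x
    rho = r @ r
    p = r.copy()
    for k in range(1, kmax + 1):
        if np.sqrt(rho) <= eps * np.linalg.norm(b):
            break
        if k > 1:
            beta = rho / rho_old
            p = r + beta * p
        w = A @ p
        alpha = rho / (p @ w)
        x = x + alpha * p
        r = r - alpha * w
        rho_old, rho = rho, r @ r
    return x
```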

Now, we briefly discuss how to use the CG method to solve general linear systems Ax = b. Since we cannot apply the CG method directly to such a system, instead of solving Ax = b we can use the CG method to solve the normalized system (the normal equations)

A^T A x = A^T b.

When the system is well-conditioned, the normalized CG method is suitable. But if the system is ill-conditioned, then the condition number of the normalized system can become very large, because κ2(A^T A) = (κ2(A))^2. Hence the normalized CG method is not suitable for ill-conditioned systems.

6.3.2 Convergence analysis

We have the following theorem for the convergence rate of the CG method.

Theorem 6.5 If A = I + B, where I is the identity matrix and rank(B) = p, then the CG method obtains the exact solution of Ax = b in at most p + 1 iterations.

Proof: Since A = I + B, it is easy to show that

span{r0, A r0, · · · , A^k r0} = span{r0, B r0, · · · , B^k r0}.


Note that rank(B) = p and then

dim( span{r0, B r0, · · · , B^k r0} ) ≤ p + 1.

By Theorem 6.3, we know that

dim( span{r0, · · · , rp, r_{p+1}} ) ≤ p + 1

and also

r_i^T r_{p+1} = 0,   i = 0, 1, · · · , p.

Therefore r_{p+1} = 0, i.e., A x_{p+1} = b.

Theorem 6.6 Let A ∈ R^{n×n} be symmetric positive definite and x∗ be the exact solution of Ax = b. We then have

‖x∗ − xk‖_A ≤ 2 ( (√κ2 − 1)/(√κ2 + 1) )^k ‖x∗ − x0‖_A,

where xk is produced by the CG method and

κ2 = κ2(A) = ‖A‖2 ‖A^{−1}‖2.

Proof: By Theorem 6.3, we know that for any x ∈ x0 + K(A, r0, k),

x∗ − x = x∗ − x0 + a_{k1} r0 + a_{k2} A r0 + · · · + a_{kk} A^{k−1} r0
       = A^{−1}( r0 + a_{k1} A r0 + a_{k2} A^2 r0 + · · · + a_{kk} A^k r0 )
       = A^{−1} Pk(A) r0,

where Pk(λ) = Σ_{j=0}^{k} a_{kj} λ^j with Pk(0) = 1. Let Pk denote the set of all polynomials Pk of degree less than or equal to k with Pk(0) = 1. By Theorem 6.4 and Lemma 6.1, we have

‖xk − x∗‖_A = min{‖x − x∗‖_A : x ∈ x0 + K(A, r0, k)}
            = min_{Pk∈Pk} ‖A^{−1} Pk(A) r0‖_A = min_{Pk∈Pk} ‖Pk(A) A^{−1} r0‖_A
            ≤ min_{Pk∈Pk} max_{1≤i≤n} |Pk(λi)| · ‖A^{−1} r0‖_A
            ≤ min_{Pk∈Pk} max_{a1≤λ≤a2} |Pk(λ)| · ‖x∗ − x0‖_A,


where

0 < a1 = λ1 ≤ · · · ≤ λn = a2

are the eigenvalues of A. By the well-known approximation theorem for Chebyshev polynomials, see [33], we know that there exists a unique solution of the optimization problem

min_{Pk∈Pk} max_{a1≤λ≤a2} |Pk(λ)|,

given by

Pk(λ) = Tk( (a2 + a1 − 2λ)/(a2 − a1) ) / Tk( (a2 + a1)/(a2 − a1) ).

Here Tk(z) is the k-th Chebyshev polynomial, defined recursively by

Tk(z) = 2z T_{k−1}(z) − T_{k−2}(z)

with T0(z) = 1 and T1(z) = z. By the properties of Chebyshev polynomials, see [33] again, we know that

max_{a1≤λ≤a2} |Pk(λ)| = 1 / Tk( (a2 + a1)/(a2 − a1) ) ≤ 2 ( (√κ2 − 1)/(√κ2 + 1) )^k,

where κ2 = κ2(A) = ‖A‖2 ‖A^{−1}‖2. Therefore,

‖x∗ − xk‖_A ≤ 2 ( (√κ2 − 1)/(√κ2 + 1) )^k ‖x∗ − x0‖_A.

Furthermore, we have the following theorem, see [2].

Theorem 6.7 If the eigenvalues λj of a symmetric positive definite matrix A are ordered such that

0 < λ1 ≤ ... ≤ λp ≤ b1 ≤ λ_{p+1} ≤ ... ≤ λ_{n−q} ≤ b2 ≤ λ_{n−q+1} ≤ ... ≤ λn,

where b1 and b2 are two constants, then

‖x∗ − xk‖_A / ‖x∗ − x0‖_A ≤ 2 ( (α − 1)/(α + 1) )^{k−p−q} · max_{λ∈[b1,b2]} Π_{j=1}^{p} ( (λ − λj)/λj ),

where α ≡ (b2/b1)^{1/2} ≥ 1.


From Theorem 6.7, we note that as n increases, if p and q are constants independent of n and λ1 is uniformly bounded away from zero, then the convergence rate is linear, i.e., the number of iterations is independent of n. We also notice that the more clustered the eigenvalues are, the faster the convergence rate will be.

Corollary 6.1 If the eigenvalues λj of a symmetric positive definite matrix A are ordered such that

0 < δ < λ1 ≤ ... ≤ λp ≤ 1 − ε ≤ λ_{p+1} ≤ ... ≤ λ_{n−q} ≤ 1 + ε ≤ λ_{n−q+1} ≤ ... ≤ λn,

where 0 < ε < 1, then

‖x∗ − xk‖_A / ‖x∗ − x0‖_A ≤ 2 ( (1 + ε)/δ )^p ε^{k−p−q},

where k ≥ p + q.

Proof: For the α given in Theorem 6.7, we have

α ≡ (b2/b1)^{1/2} = ( (1 + ε)/(1 − ε) )^{1/2}.

Therefore,

(α − 1)/(α + 1) = ( 1 − √(1 − ε^2) ) / ε < ε.

For 1 ≤ j ≤ p and λ ∈ [1 − ε, 1 + ε], we have

0 ≤ (λ − λj)/λj ≤ (1 + ε)/δ.

Thus, by using Theorem 6.7, we obtain

‖x∗ − xk‖_A / ‖x∗ − x0‖_A ≤ 2 ( (α − 1)/(α + 1) )^{k−p−q} · max_{λ∈[b1,b2]} Π_{j=1}^{p} ( (λ − λj)/λj )
                          ≤ 2 ( (1 + ε)/δ )^p ε^{k−p−q}.


6.4 Preconditioning

From Section 6.3, we know that if the matrix A of the system

Ax = b

is well-conditioned or its spectrum is clustered, then the convergence rate of the CG method will be fast. Therefore, in order to speed up the convergence, we usually precondition the system, i.e., instead of solving the original system, we solve the following preconditioned system:

Ãx̃ = b̃,    (6.10)

where

Ã = C^{−1}AC^{−1},   x̃ = Cx,   b̃ = C^{−1}b,

and C is symmetric positive definite. We hope that the preconditioned matrix Ã has better spectral properties than those of A.

By applying the CG method to (6.10), we have

αk = r̃k^T r̃k / p̃k^T Ãp̃k,

x̃_{k+1} = x̃k + αk p̃k,

r̃_{k+1} = r̃k − αk Ãp̃k,

βk = r̃_{k+1}^T r̃_{k+1} / r̃k^T r̃k,

p̃_{k+1} = r̃_{k+1} + βk p̃k,    (6.11)

where x̃0 is any given initial vector, r̃0 = b̃ − Ãx̃0 and p̃0 = r̃0. Let

x̃k = C xk,   r̃k = C^{−1} rk,   p̃k = C pk,   M = C^2.

Substituting them into (6.11), we actually have

wk = A pk,   αk = ρk / pk^T wk,
x_{k+1} = xk + αk pk,   r_{k+1} = rk − αk wk,
z_{k+1} = M^{−1} r_{k+1},   ρ_{k+1} = r_{k+1}^T z_{k+1},
βk = ρ_{k+1}/ρk,   p_{k+1} = z_{k+1} + βk pk,


where x0 is any given initial vector, r0 = b − A x0, z0 = M^{−1} r0, ρ0 = r0^T z0 and p0 = z0.

We then have the following preconditioned algorithm. At the initialization step k = 0, we choose an initial vector x and calculate r = b − Ax. While √(r^T r) > ε‖b‖2 and k < kmax, the iteration steps are

    k = k + 1
    Solve Mz = r for z
    if k = 1
        p = z;  ρ = r^T z
    else
        ρ̃ = ρ;  ρ = r^T z;  β = ρ/ρ̃;  p = z + βp
    end
    w = Ap;  α = ρ/p^T w;  x = x + αp
    r = r − αw

where z, r, p, w are vectors and ρ, ρ̃, α, β are scalars. This algorithm is called the preconditioned conjugate gradient (PCG) method. Note that the PCG method has the following properties (a NumPy sketch of the loop follows the list below):

(i) r_i^T M^{−1} r_j = 0, for i ≠ j.

(ii) p_i^T A p_j = 0, for i ≠ j.

(iii) The approximation xk satisfies

‖x∗ − xk‖_A ≤ 2 ( (√κ − 1)/(√κ + 1) )^k ‖x∗ − x0‖_A,

where κ = λn/λ1, with λn and λ1 being the largest and smallest eigenvalues of M^{−1}A, respectively.
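A minimal NumPy sketch of the PCG loop with a user-supplied solve for Mz = r (illustrated below with a simple diagonal preconditioner; the function names are our own, not from the text):

```python
import numpy as np

def pcg(A, b, solve_M, x0=None, eps=1e-10, kmax=None):
    """Preconditioned CG: solve_M(r) should return z with M z = r, M symmetric positive definite."""
    n = len(b)
    kmax = n if kmax is None else kmax
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x
    rho = None
    for k in range(1, kmax + 1):
        if np.linalg.norm(r) <= eps * np.linalg.norm(b):
            break
        z = solve_M(r)
        rho_old, rho = rho, r @ z
        p = z if k == 1 else z + (rho / rho_old) * p
        w = A @ p
        alpha = rho / (p @ w)
        x = x + alpha * p
        r = r - alpha * w
    return x

# Example: Jacobi (diagonal) preconditioner M = diag(A):
#   x = pcg(A, b, solve_M=lambda r: r / np.diag(A))
```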

A good preconditioner M = C^2 is chosen with two criteria in mind, see [2, 19, 22]:

(1) Mz = d is easy to solve.

(2) The spectrum of M^{−1}A is clustered and/or M^{−1}A is well-conditioned compared to A.

Usually, it is not easy to choose a preconditioner which satisfies both of these criteria. Now we briefly discuss the following three classes of preconditioners.

1. Diagonal preconditioners. If the diagonal entries of the coefficient matrix A differ greatly in magnitude, one could use the matrix

M = diag(a11, · · · , ann)


as a preconditioner to (hopefully) speed up the convergence rate of the CG method. For a block matrix

A = [ A11  · · ·  A1k
      ...   ⋱    ...
      Ak1  · · ·  Akk ],

if the A_ii^{−1} are easily obtained, then one could use the block diagonal matrix

M = diag(A11, · · · , Akk)

as a preconditioner.

2. Preconditioners based on incomplete Cholesky factorization. If one first computes the incomplete Cholesky factorization

A = LL^T + R,

then one can use the matrix M = LL^T as a preconditioner. We could require that the matrix L has the same sparsity structure as the matrix A and also that LL^T ≈ A.

3. Optimal (circulant) preconditioners. This class of preconditioners was proposed relatively recently, see [9, 22, 37]. A circulant matrix is defined as follows:

Cn = [ c0        c_{−1}    · · ·     c_{2−n}   c_{1−n}
       c1        c0        c_{−1}    · · ·     c_{2−n}
       ...       c1        c0        ⋱         ...
       c_{n−2}   · · ·     ⋱         ⋱         c_{−1}
       c_{n−1}   c_{n−2}   · · ·     c1        c0      ]

where c_{−k} = c_{n−k} for 1 ≤ k ≤ n − 1. It is well known that circulant matrices can be diagonalized by the Fourier matrix Fn, see [13], i.e.,

Cn = Fn^* Λn Fn,    (6.12)

where the entries of Fn are given by

(Fn)_{j,k} = (1/√n) e^{2πi(j−1)(k−1)/n},   i ≡ √−1,

for 1 ≤ j, k ≤ n, and Λn is a diagonal matrix holding the eigenvalues of Cn. We note that Λn can be obtained in O(n log n) operations by taking the fast Fourier transform (FFT) of the first column of Cn. Once Λn is obtained, the products Cn y and Cn^{−1} y for any vector y can be computed by FFTs in O(n log n) operations. For the Fourier matrix Fn, when there is no ambiguity, we shall simply write F.

Now, we study a kind of preconditioner called the optimal preconditioner, see [9, 11, 22]. Given any unitary matrix U ∈ C^{n×n}, let MU be the set of all matrices simultaneously diagonalized by U, i.e.,

MU = {U^* Λ U | Λ is an n-by-n diagonal matrix}.    (6.13)

We note that in (6.13), when U = F, the Fourier matrix, MF is the set of all circulant matrices. Let δ(A) denote the diagonal matrix whose diagonal is equal to the diagonal of the matrix A. We have the following lemma, see [22, 27, 39].

Lemma 6.2 For any A = [a_pq] ∈ C^{n×n}, let cU(A) be the minimizer of ‖W − A‖_F over all W ∈ MU. Then

(i) cU(A) is uniquely determined by A and is given by

cU(A) = U^* δ(U A U^*) U.

(ii) If A is Hermitian, then cU(A) is also Hermitian. Furthermore, if λmin(·) and λmax(·) denote the smallest and largest eigenvalues, respectively, then we have

λmin(A) ≤ λmin(cU(A)) ≤ λmax(cU(A)) ≤ λmax(A).

In particular, if A is positive definite, then so is cU(A).

(iii) If A is normal and stable, i.e., A^*A = AA^* and the real parts of all the eigenvalues of A are negative, then cU(A) is also stable.

(iv) When U is the Fourier matrix F, we have

cF(A) = Σ_{j=0}^{n−1} [ (1/n) Σ_{p−q ≡ j (mod n)} a_pq ] Q^j,

where Q is an n-by-n circulant matrix given by

Q ≡ [ 0                        1
      1    0
           1    0
                ⋱    ⋱
      0    · · ·     0    1    0 ].

The matrix cU(A) is called the optimal preconditioner of A and the matrix cF(A) is called the optimal circulant preconditioner of A. We remark that cF(A) is a good preconditioner for solving a large class of structured linear systems Ax = b, for instance, Toeplitz systems, Hankel systems, etc., see [9, 11, 12, 22].
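A small NumPy sketch (our own illustration, not from the text) of building the first column of the optimal circulant preconditioner cF(A) from Lemma 6.2 (iv) and applying cF(A)^{−1} to a vector with FFTs, as one would inside a PCG-type iteration:

```python
import numpy as np

def optimal_circulant_first_column(A):
    """First column c of c_F(A): c_j = (1/n) * sum of entries a_pq with p - q ≡ j (mod n)."""
    n = A.shape[0]
    c = np.zeros(n)
    for p in range(n):
        for q in range(n):
            c[(p - q) % n] += A[p, q] / n
    return c

def circulant_solve(c, y):
    """Solve C z = y, C circulant with first column c, via the FFT diagonalization (6.12)."""
    eigs = np.fft.fft(c)                 # eigenvalues of C: FFT of the first column
    return np.real(np.fft.ifft(np.fft.fft(y) / eigs))
```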


6.5 GMRES method

In this section, we introduce the generalized minimum residual (GMRES) method for solving general systems

Ax = b,

where A ∈ R^{n×n} is nonsingular. The GMRES method, proposed by Saad and Schultz in 1986, is one of the most important Krylov subspace methods for nonsymmetric systems, see [33, 34].

6.5.1 Basic properties of GMRES method

For the GMRES method, in the k-th iteration we find a solution xk of the LS problem

min_{x ∈ x0+K(A,r0,k)} ‖b − Ax‖2,

where

K(A, r0, k) ≡ span{r0, A r0, · · · , A^{k−1} r0}

with r0 = b − A x0. Let x ∈ x0 + K(A, r0, k). We have

x = x0 + Σ_{j=0}^{k−1} γj A^j r0

and then

r = b − Ax = b − A x0 − Σ_{j=0}^{k−1} γj A^{j+1} r0 = r0 − Σ_{j=1}^{k} γ_{j−1} A^j r0.

Hence

r = Pk(A) r0,

where Pk ∈ Pk, with Pk being the set of all polynomials Pk of degree less than or equal to k with Pk(0) = 1. We therefore have the following theorem.

Theorem 6.8 Let xk be the solution after the k-th GMRES iteration. Then we have

‖rk‖2 = min_{P∈Pk} ‖P(A) r0‖2 ≤ ‖Pk(A) r0‖2.

Furthermore,

‖rk‖2 / ‖r0‖2 ≤ ‖Pk(A)‖2.

Moreover, we have


Theorem 6.9 The GMRES method obtains the exact solution of Ax = b within n iterations, where A ∈ R^{n×n}.

Proof: The characteristic polynomial of A is given by

pA(z) = det(zI − A),

where the degree of pA(z) is n and pA(0) = (−1)^n det(A) ≠ 0. Then

Pn(z) = pA(z)/pA(0) ∈ Pn.

By the Hamilton-Cayley Theorem, see [17], we know that

Pn(A) = pA(A)/pA(0) = 0.

By Theorem 6.8, we have

rn = b − A xn = 0.

Thus xn is the exact solution of Ax = b.

If A is diagonalizable, i.e., A = V Λ V^{−1} where Λ is a diagonal matrix, then we have

P(A) = V P(Λ) V^{−1}.

If V is orthogonal, then A is a normal matrix.

Theorem 6.10 Let A = V Λ V^{−1}. We have

‖rk‖2 / ‖r0‖2 ≤ κ2(V) min_{P∈Pk} max_{λi} |P(λi)|,

where the λi are the eigenvalues of the matrix A.

Proof: By Theorem 6.8, we have

‖rk‖2 = min_{P∈Pk} ‖P(A) r0‖2 = min_{P∈Pk} ‖V P(Λ) V^{−1} r0‖2
      ≤ min_{P∈Pk} ‖V‖2 ‖V^{−1}‖2 ‖P(Λ)‖2 ‖r0‖2
      = κ2(V) min_{P∈Pk} ‖P(Λ)‖2 ‖r0‖2
      = κ2(V) min_{P∈Pk} max_{λi} |P(λi)| · ‖r0‖2.


We remark that when A is normal, κ2(V) = 1. We therefore have

‖rk‖2 / ‖r0‖2 ≤ min_{P∈Pk} max_{λi} |P(λi)|.

Theorem 6.11 If A is diagonalizable and has exactly k distinct eigenvalues, then the GMRES method will terminate in at most k iterations.

Proof: We construct a polynomial as follows:

P(z) = Π_{i=1}^{k} (λi − z)/λi ∈ Pk,

where the λi are the distinct eigenvalues of A. By Theorem 6.10, we know that rk = 0, i.e., A xk = b.

We should emphasize that, in general, the behavior of the GMRES method cannot be determined by eigenvalues alone. In fact, it is shown in [20] that any nonincreasing convergence curve is possible for the GMRES method applied to some problem with a nonnormal matrix. Moreover, that problem can have any desired distribution of eigenvalues. Thus, for instance, eigenvalues tightly clustered around 1 are not necessarily good for nonnormal matrices, as they are for normal ones. However, we have the following two theorems.

Theorem 6.12 If b is a linear combination of k eigenvectors of A, say

b = Σ_{l=1}^{k} γl u_{il},

then the GMRES method will terminate in at most k iterations.

Proof: We first extend the set of eigenvectors {u_{il}} to a basis of R^n, i.e., the vectors

u_{i1}, u_{i2}, · · · , u_{ik}, v1, · · · , v_{n−k}

form a basis of R^n. Let x∗ be the exact solution of Ax = b. Then

x∗ = Σ_{l=1}^{k} αl u_{il} + Σ_{j=1}^{n−k} βj vj.


Moreover, we have

A x∗ = Σ_{l=1}^{k} αl A u_{il} + Σ_{j=1}^{n−k} βj A vj
     = Σ_{l=1}^{k} αl λ_{il} u_{il} + Σ_{j=1}^{n−k} βj A vj
     = b = Σ_{l=1}^{k} γl u_{il}.

Hence,

βj = 0,   j = 1, 2, · · · , n − k;
αl = γl / λ_{il},   l = 1, 2, · · · , k.

We therefore have

x∗ = Σ_{l=1}^{k} (γl / λ_{il}) u_{il}.

Let

Pk(z) = Π_{l=1}^{k} (λ_{il} − z)/λ_{il} ∈ Pk.

Note that Pk(λ_{il}) = 0 for 1 ≤ l ≤ k, and

Pk(A) x∗ = Σ_{l=1}^{k} Pk(λ_{il}) (γl / λ_{il}) u_{il} = 0.

We thus have

‖rk‖2 ≤ ‖Pk(A) r0‖2 = ‖Pk(A) b‖2 = ‖Pk(A) A x∗‖2 = ‖A Pk(A) x∗‖2 = 0,

where we choose the initial vector x0 = 0.

Theorem 6.13 When the GMRES method is applied to solve a linear system Ax = b where A = I + L, the method will converge in at most rank(L) + 1 iterations.

Proof: We first recall that the minimal polynomial of r0 with respect to A is the nonzero monic polynomial p of lowest degree such that p(A) r0 = 0, see [33]. By Theorem 6.8, the GMRES method must converge within ν iterations, where ν is the


degree of the minimal polynomial of the residual r0 with respect to A = I + L. Let µ be the degree of the minimal polynomial of r0 with respect to L. Since

Σ_{i=0}^{k} αi (I + L)^i r0 = 0

implies

Σ_{i=0}^{k} βi L^i r0 = 0

for some constants βi, and vice versa, we have ν = µ. Moreover, from the definition of µ, the set

{r0, L r0, · · · , L^{µ−1} r0}

is linearly independent. Let B be the column space of L. Then the dimension of B is equal to the rank of L. Since L^i r0 ∈ B for i ≥ 1, we have

{L r0, · · · , L^{µ−1} r0} ⊆ B.

Thus, µ − 1 ≤ rank(L), i.e., µ ≤ rank(L) + 1. Hence, the GMRES method converges within

ν = µ ≤ rank(L) + 1

iterations.

6.5.2 Implementation of GMRES method

Recall that the LS problem in the k-th iteration of the GMRES method is

min_{x ∈ x0+K(A,r0,k)} ‖b − Ax‖2.

Suppose that we have a matrix

Vk = [v_1^k, v_2^k, · · · , v_k^k]

whose columns form an orthonormal basis of K(A, r0, k). Then any z ∈ K(A, r0, k) can be written as

z = Σ_{l=1}^{k} ul v_l^k = Vk u,

where u = (u1, u2, · · · , uk)^T ∈ R^k. Thus, once we find Vk, we can convert the original LS problem in the Krylov subspace into an LS problem in R^k as follows. Let xk be the solution after the k-th iteration. We then have

xk = x0 + Vk yk,


where the vector yk minimizes

min_{y∈R^k} ‖b − A(x0 + Vk y)‖2 = min_{y∈R^k} ‖r0 − A Vk y‖2.

This is a standard linear LS problem that can be solved by a QR decomposition. One could use the Gram-Schmidt orthogonalization to find an orthonormal basis of K(A, r0, k). The algorithm is given as follows:

(1) Define r0 = b − A x0 and v1 = r0/‖r0‖2.

(2) Compute

v_{i+1} = ( A vi − Σ_{j=1}^{i} ((A vi)^T vj) vj ) / ‖ A vi − Σ_{j=1}^{i} ((A vi)^T vj) vj ‖2

for i = 1, 2, · · · , k − 1.

This algorithm produces the columns of the matrix Vk, which form an orthonormal basis for K(A, r0, k). We note that a breakdown happens when a division by zero occurs. We have the following theorem about when a breakdown happens.

Theorem 6.14 Let A be nonsingular, let the vectors vj be generated by the above algorithm, and let i be the smallest integer for which

A vi − Σ_{j=1}^{i} ((A vi)^T vj) vj = 0.    (6.14)

Then x = A^{−1}b ∈ x0 + K(A, r0, i).

Proof: Since, by (6.14),

A vi = Σ_{j=1}^{i} ((A vi)^T vj) vj ∈ K(A, r0, i),

we know that

A K(A, r0, i) ⊂ K(A, r0, i).

Note that the columns of Vi = [v1, v2, · · · , vi] form an orthonormal basis for K(A, r0, i), i.e.,

K(A, r0, i) = span{v1, v2, · · · , vi},


and then

A Vi = Vi H,    (6.15)

where H ∈ R^{i×i} is nonsingular since A is nonsingular. There exists a vector y ∈ R^i such that xi − x0 = Vi y, because xi − x0 ∈ K(A, r0, i). We therefore have

‖ri‖2 = ‖b − A xi‖2 = ‖r0 − A(xi − x0)‖2 = ‖r0 − A Vi y‖2.    (6.16)

Let β = ‖r0‖2 and e1 = (1, 0, · · · , 0)^T ∈ R^i. Then r0 = β Vi e1. Since Vi is a matrix with orthonormal columns, we have by (6.15) and (6.16),

‖ri‖2 = ‖Vi(β e1 − H y)‖2 = ‖β e1 − H y‖2.

Setting y = β H^{−1} e1 gives ri = 0, i.e.,

xi = A^{−1}b ∈ x0 + K(A, r0, i).

If the Gram-Schmidt process does not break down, we can use it to carry out the GMRES method in the following efficient way. Let hij = (A vj)^T vi. By the Gram-Schmidt algorithm, we have a (k + 1)-by-k matrix Hk which is upper Hessenberg, i.e., its entries hij satisfy hij = 0 if i > j + 1. This process produces a sequence of matrices {Vk} with orthonormal columns such that

A Vk = V_{k+1} Hk.

Therefore, we have

rk = b − A xk = r0 − A(xk − x0) = β V_{k+1} e1 − A Vk yk = V_{k+1}(β e1 − Hk yk),

where yk is the solution of

min_{y∈R^k} ‖β e1 − Hk y‖2.

Hence

xk = x0 + Vk yk.

Let the GMRES iterations be stopped when one finds a vector x such that, for a given ε,

‖r‖2 = ‖b − Ax‖2 ≤ ε‖b‖2.

We then have the following GMRES algorithm for solving

Ax = b,


where A ∈ R^{n×n} and b ∈ R^n are known, see [33]. At the initialization step, let

r0 = b − A x0,   β = ‖r0‖2,   v1 = r0/β.

In the iteration steps, we have

    for j = 1 : m
        wj = A vj
        for i = 1 : j
            hij = wj^T vi
            wj = wj − hij vi
        end
        h_{j+1,j} = ‖wj‖2;  if h_{j+1,j} = 0, set m = j and go to (∗)
        v_{j+1} = wj / h_{j+1,j}
    end
    (∗) compute ym, the minimizer of ‖β e1 − Hm y‖2
    xm = x0 + Vm ym

where Hm = (hij), 1 ≤ i ≤ m + 1, 1 ≤ j ≤ m. The matrix Vm = [v1, v2, · · · , vm] ∈ R^{n×m}, with m ≤ n, is a matrix with orthonormal columns.
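A compact NumPy sketch of this GMRES algorithm (Arnoldi process plus a small dense least-squares solve; the function name and the use of numpy.linalg.lstsq are our own illustrative choices):

```python
import numpy as np

def gmres(A, b, x0=None, m=None):
    """GMRES via Arnoldi: build V_m, H_m with A V_m = V_{m+1} H_m, then minimize ||beta*e1 - H_m y||_2."""
    n = len(b)
    m = n if m is None else m
    x0 = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = r0 / beta
    k = m
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):                 # Gram-Schmidt against v_1, ..., v_j
            H[i, j] = w @ V[:, i]
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:                 # breakdown: the exact solution lies in the Krylov subspace
            k = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(k + 1); e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)
    return x0 + V[:, :k] @ y
```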

Exercises:

1. Suppose that xk is generated by the steepest descent method. Prove that

φ(xk) ≤ (1 − 1/κ2(A)) φ(x_{k−1}),

where κ2(A) = ‖A‖2 ‖A^{−1}‖2.

2. Let A be a symmetric positive definite matrix. Define ‖x‖_A ≡ √(x^T A x). Show that it is a vector norm.

3. Let A ∈ R^{n×n} be symmetric positive definite, and let p1, p2, · · · , pk ∈ R^n be mutually A-conjugate, i.e., p_i^T A p_j = 0 for i ≠ j. Prove that {p1, p2, · · · , pk} is linearly independent.

4. Let xk be produced by the CG method. Show that

‖xk − x∗‖2 ≤ 2√κ2 ( (√κ2 − 1)/(√κ2 + 1) )^k ‖x0 − x∗‖2,

where κ2 = κ2(A) = ‖A‖2 ‖A^{−1}‖2.

5. Suppose zk and rk are generated by the PCG method. Show that if rk ≠ 0, then zk^T rk > 0.

6. Find an efficient algorithm for solving A^T A x = A^T b by the CG method.


7. Show that if A ∈ R^{n×n} is symmetric positive definite and has exactly k distinct eigenvalues, then the CG method will terminate in at most k iterations.

8. Let the initial vector x0 = 0. When the GMRES method is used to solve the linear system Ax = b where

A = [ 0  0  0  0  1
      1  0  0  0  0
      0  1  0  0  0
      0  0  1  0  0
      0  0  0  1  0 ]

and b = (1, 0, 0, 0, 0)^T, what is the convergence rate?

9. Let

A = [ I  Y
      0  I ].

When the GMRES method is used to solve Ax = b, what is the maximum number of iterations required to converge?

10. Prove that the coefficient matrix of the LS problem in the GMRES method has full column rank.

11. Show that cU(A) = U^* δ(U A U^*) U.


Chapter 7

Nonsymmetric Eigenvalue Problems

Eigenvalue problems are particularly interesting in NLA. In this chapter, we study nonsymmetric eigenvalue problems. Some well-known methods, such as the power method, the inverse power method and the QR method, are discussed.

7.1 Basic properties

Let A ∈ C^{n×n}. We recall that a complex number λ is called an eigenvalue of A if there exists a nonzero vector x ∈ C^n such that

Ax = λx.

Here x is called an eigenvector of A associated with λ. It is well known that λ is an eigenvalue of A if and only if

det(λI − A) = 0.

Let

pA(λ) = det(λI − A)

be the characteristic polynomial of A. By the Fundamental Theorem of Algebra, we know that pA(λ) has n roots in C (counting multiplicities), i.e., A has n eigenvalues.

Now suppose that pA(λ) has the following factorization:

pA(λ) = (λ − λ1)^{n1} (λ − λ2)^{n2} · · · (λ − λp)^{np},

where n1 + n2 + · · · + np = n and λi ≠ λj for i ≠ j. The number ni is called the algebraic multiplicity of λi, and the number

mi = n − rank(λi I − A)


is called the geometric multiplicity of λi. Actually, mi is the dimension of the eigenspace of λi. We remark that the eigenspace of λi is the solution space of (λi I − A)x = 0. It is easy to see that mi ≤ ni for i = 1, 2, · · · , p. If ni = 1, then λi is called a simple eigenvalue. If mi < ni for some eigenvalue λi, then A is called defective. If the geometric multiplicity is equal to the algebraic multiplicity for each eigenvalue of A, then A is said to be nondefective. Note that A is diagonalizable if and only if A is nondefective.

Let A, B ∈ C^{n×n}. We recall that A and B are called similar matrices if there is a nonsingular matrix X ∈ C^{n×n} such that

B = X A X^{−1}.

The transformation

A → B = X A X^{−1}

is called a similarity transformation by the similarity matrix X. Similar matrices have the same eigenvalues. If x is an eigenvector of A, then y = Xx is an eigenvector of B. By the Jordan Decomposition Theorem (Theorem 1.1), we know that any n-by-n matrix is similar to its Jordan canonical form. If the similarity matrix is required to be a unitary matrix, we then have the following theorem, perhaps the most fundamentally useful in NLA, see [17].

Theorem 7.1 (Schur Decomposition Theorem) Let A ∈ C^{n×n} with eigenvalues λ1, · · · , λn in any prescribed order. Then there exists a unitary matrix U ∈ C^{n×n} such that

U^* A U = T = [tij],

where T is an upper triangular matrix with diagonal entries tii = λi, i = 1, · · · , n. Furthermore, if A ∈ R^{n×n} and all the eigenvalues of A are real, then U may be chosen to be real and orthogonal.

Proof: Let x1 ∈ C^n be a normalized eigenvector of A associated with the eigenvalue λ1. The vector x1 can be extended to a basis of C^n:

x1, y2, · · · , yn.

By applying the Gram-Schmidt orthonormalization procedure to this basis, we therefore have an orthonormal basis of C^n:

x1, z2, · · · , zn.

Let

U1 = [x1, z2, · · · , zn]


be a unitary matrix. By a simple calculation, we obtain

U1^* A U1 = [ λ1  ∗
              0   A1 ].

The matrix A1 ∈ C^{(n−1)×(n−1)} has the eigenvalues λ2, · · · , λn. Let x2 ∈ C^{n−1} be a normalized eigenvector of A1 associated with λ2, and repeat the process: determine a unitary matrix V2 ∈ C^{(n−1)×(n−1)} such that

V2^* A1 V2 = [ λ2  ∗
               0   A2 ].

Let

U2 = [ 1  0
       0  V2 ].

Then the matrices U2 and U1U2 are unitary, and

U2^* U1^* A U1 U2 = [ λ1  ∗   ∗
                      0   λ2  ∗
                      0   0   A2 ].

Continuing this process, we can produce unitary matrices U1, U2, · · · , U_{n−1} such that the matrix

U = U1 U2 · · · U_{n−1}

is unitary and U^* A U has the desired form.

Theorem 7.2 (Real Schur Decomposition Theorem [17]) Let A ∈ R^{n×n}. Then there exists an orthogonal matrix Q ∈ R^{n×n} such that

Q^T A Q = [ R11  R12  · · ·  R1m
                 R22  · · ·  R2m
                       ⋱     ...
            0                Rmm ],

where each Rii is either a real number or a 2-by-2 matrix having a pair of complex conjugate eigenvalues.

In general, one cannot hope to reduce a real matrix to upper triangular form by a real orthogonal similarity transformation, because the diagonal entries would then be the eigenvalues, which need not be real.


7.2 Power method

The power method is an iterative algorithm for computing the eigenvalue with the largest absolute value and its corresponding eigenvector. Now, we introduce the basic idea of this method. For simplicity, we first suppose that the matrix A ∈ C^{n×n} is diagonalizable, i.e., A has the following Jordan decomposition:

A = X Λ X^{−1},

where Λ = diag(λ1, · · · , λn) with the ordering

|λ1| > |λ2| ≥ · · · ≥ |λn|,

and X = [x1, · · · , xn] ∈ C^{n×n}. Let u0 ∈ C^n be any vector. Since the column vectors x1, · · · , xn form a basis of C^n, we have

u0 = Σ_{j=1}^{n} αj xj,

where αj ∈ C. We therefore have

A^k u0 = Σ_{j=1}^{n} αj A^k xj = Σ_{j=1}^{n} αj λj^k xj = λ1^k ( α1 x1 + Σ_{j=2}^{n} αj (λj/λ1)^k xj ),

and then

lim_{k→∞} A^k u0 / λ1^k = α1 x1.

When α1 ≠ 0 and k is sufficiently large, we know that the vector

uk = A^k u0 / λ1^k    (7.1)

is a good approximate eigenvector of A.

In practice, we cannot use (7.1) directly to compute an approximate eigenvector, since we do not know the eigenvalue λ1 in advance and the operation cost of forming A^k is very large when k is large. We therefore propose the following iterative algorithm:

yk = A u_{k−1},
µk = ζ_j^{(k)},
uk = yk/µk,    (7.2)


where u0 ∈ C^n is any given initial vector, usually with ‖u0‖∞ = 1, and ζ_j^{(k)} is the component of yk with the largest absolute value. This iterative algorithm is called the power method. We have the following theorem for the convergence of the power method.
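A small NumPy sketch of the power method (7.2), normalizing by the component of largest absolute value (the function name, iteration count and tolerance are illustrative assumptions):

```python
import numpy as np

def power_method(A, u0, max_iter=1000, tol=1e-12):
    """Power method (7.2): returns (mu, u), approximations to the dominant eigenvalue and eigenvector."""
    u = np.array(u0, dtype=complex)
    u = u / u[np.argmax(np.abs(u))]          # normalize so the largest-modulus component is 1
    mu = 0.0
    for _ in range(max_iter):
        y = A @ u
        mu_new = y[np.argmax(np.abs(y))]     # component of y with the largest absolute value
        u = y / mu_new
        if abs(mu_new - mu) <= tol * abs(mu_new):
            mu = mu_new
            break
        mu = mu_new
    return mu, u
```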

Theorem 7.3 Let A ∈ C^{n×n} have p distinct eigenvalues which satisfy

|λ1| > |λ2| ≥ · · · ≥ |λp|,

and suppose the geometric multiplicity of λ1 is equal to its algebraic multiplicity. If the projection of the initial vector u0 on the eigenspace of λ1 is nonzero, then {uk} produced by (7.2) converges to an eigenvector x1 associated with λ1. Also, {µk} produced by (7.2) converges to λ1.

Proof: We note that A has the following Jordan decomposition:

A = X diag(J1, · · · , Jp) X^{−1},    (7.3)

where X ∈ C^{n×n}, Ji ∈ C^{ni×ni} is the Jordan block associated with λi (i = 1, · · · , p), and

n1 + n2 + · · · + np = n.

Since the geometric multiplicity of λ1 is the same as its algebraic multiplicity, we have

J1 = λ1 I_{n1},

where I_{n1} ∈ R^{n1×n1} is the identity matrix. Let y = X^{−1} u0 and decompose y and X as follows:

y = (y1^T, y2^T, · · · , yp^T)^T,   X = [X1, X2, · · · , Xp],

where yi ∈ C^{ni} and Xi ∈ C^{n×ni}, for i = 1, · · · , p. By using (7.3), we have

A^k u0 = X diag(J1^k, · · · , Jp^k) X^{−1} u0
       = X1 J1^k y1 + X2 J2^k y2 + · · · + Xp Jp^k yp
       = λ1^k X1 y1 + X2 J2^k y2 + · · · + Xp Jp^k yp
       = λ1^k ( X1 y1 + X2 (λ1^{−1} J2)^k y2 + · · · + Xp (λ1^{−1} Jp)^k yp ).

Note that the spectral radius of λ1^{−1} Ji satisfies

ρ(λ1^{−1} Ji) = |λi|/|λ1| < 1,


for i = 2, 3, · · · , p. Therefore,

lim_{k→∞} (1/λ1^k) A^k u0 = X1 y1.    (7.4)

Since the projection of the initial vector u0 on the eigenspace of λ1 is nonzero, we have X1 y1 ≠ 0. Let

x1 = ζ^{−1} X1 y1,

where ζ is the component of X1 y1 of largest absolute value. Obviously, x1 is an eigenvector of A associated with λ1. Let ζk be the component of A^k u0 of largest absolute value. Then the component of λ1^{−k} A^k u0 of largest absolute value is ζk λ1^{−k}. By (7.2), we have

uk = A u_{k−1} / µk = A^k u0 / (µk µ_{k−1} · · · µ1) = A^k u0 / ζk = (A^k u0 / λ1^k) / (ζk / λ1^k).

By using (7.4), we know that {uk} is convergent and

lim_{k→∞} uk = lim_{k→∞}(A^k u0/λ1^k) / lim_{k→∞}(ζk/λ1^k) = X1 y1 / ζ = x1.

By using A u_{k−1} = µk uk and the fact that {uk} converges to an eigenvector associated with λ1 whose component of largest absolute value equals 1, we immediately conclude that {µk} converges to λ1.

We remark that, from the proof of Theorem 7.3, the convergence rate of the power method is determined by the value of |λ2|/|λ1|. Under the conditions of the theorem, we know that

|λ2|/|λ1| < 1.

The smaller |λ2|/|λ1| is, the faster the convergence will be. When |λ2|/|λ1| is close to 1, the convergence will be very slow. In order to speed up the convergence of the power method, we could apply the method to A − µI, where µ is called a shift. The shift µ can be chosen such that the separation between the eigenvalue with the largest absolute value and the other eigenvalues becomes relatively larger. Therefore, the convergence rate of the power method can be increased.

7.3 Inverse power method

If one uses the power method on A^{−1} to obtain an eigenvalue of A with the smallest absolute value and its corresponding eigenvector, then this algorithm is called the


inverse power method. Its basic iterative scheme is given as follows:

A yk = z_{k−1},
µk = ζ_j^{(k)},
zk = yk/µk,

where z0 ∈ C^n is any given initial vector and ζ_j^{(k)} is the component of yk of largest absolute value. From Theorem 7.3, we know that if the eigenvalues of A satisfy

|λn| < |λ_{n−1}| ≤ · · · ≤ |λ1|,

then {zk} converges to an eigenvector associated with λn and {µk} converges to λn^{−1}.

The convergence rate of the inverse power method is determined by |λn|/|λ_{n−1}|. In practice, the inverse power method is usually applied to the matrix

A − µI

to compute an approximate eigenvector, once an approximation µ to a distinct eigenvalue λi of A has been obtained in advance. Therefore, we have the following inverse power method with a shift µ:

(A − µI) vk = z_{k−1},
zk = vk / ‖vk‖2.    (7.5)

From (7.5), we know that each iteration of the inverse power method requires the solution of a linear system. Hence, its operation cost is much larger than that of the power method. However, one could compute an LU factorization with partial pivoting in advance; then, in each subsequent iteration, one only needs to solve two triangular systems.
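A small SciPy/NumPy sketch of the shifted inverse power iteration (7.5), factorizing A − µI once and reusing the factors for the triangular solves (the function name, stopping rule and the Rayleigh-quotient eigenvalue estimate are our own illustrative additions):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_power_shift(A, mu, z0, max_iter=100, tol=1e-12):
    """Shifted inverse power iteration (7.5): one LU factorization of A - mu*I, then triangular solves."""
    n = A.shape[0]
    lu_piv = lu_factor(A - mu * np.eye(n))      # LU with partial pivoting, done once
    z = np.array(z0, dtype=float)
    z = z / np.linalg.norm(z)
    lam = mu
    for _ in range(max_iter):
        v = lu_solve(lu_piv, z)                 # solve (A - mu*I) v = z via two triangular solves
        z = v / np.linalg.norm(v)
        lam_new = z @ (A @ z)                   # Rayleigh quotient as the eigenvalue estimate
        if abs(lam_new - lam) <= tol * abs(lam_new):
            lam = lam_new
            break
        lam = lam_new
    return lam, z
```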

Suppose that the eigenvalues of A − µI are ordered as follows:

0 < |λ1 − µ| < |λ2 − µ| ≤ |λ3 − µ| ≤ · · · ≤ |λn − µ|.

From Theorem 7.3 again, we know that the sequence {zk} produced by (7.5) converges to an eigenvector associated with λ1. The convergence rate is determined by the value of |λ1 − µ|/|λ2 − µ|. The closer µ is to λ1, the faster the convergence will be. But if µ is close to an eigenvalue of A, then A − µI is close to a singular matrix. Therefore, one needs to solve an ill-conditioned linear system in each iteration of the inverse power method. However, practical computations show that this ill-conditioning has no effect on the convergence of the method. Usually, only one iteration produces a good approximate eigenvector of A if µ is close to an eigenvalue of A.


7.4 QR method

In this section, we introduce the well-known QR method, which is one of the most important developments in matrix computations. For any given A0 = A ∈ C^{n×n}, the basic iterative scheme of the QR algorithm is given as follows:

A_{m−1} = Qm Rm,
Am = Rm Qm,    (7.6)

for m = 1, 2, · · ·, where Qm is a unitary matrix and Rm is an upper triangular matrix. For simplicity in the later analysis, we require that the diagonal entries of Rm are nonnegative. By (7.6), one can easily obtain

Am = Qm^* A_{m−1} Qm,    (7.7)

i.e., each matrix in the sequence {Am} is similar to the matrix A. By using (7.7) again and again, we have

Am = Q̃m^* A Q̃m,    (7.8)

where Q̃m = Q1 Q2 · · · Qm. Substituting Am = Q_{m+1} R_{m+1} into (7.8), we obtain

Q̃m Q_{m+1} R_{m+1} = A Q̃m.

Therefore,

Q̃m Q_{m+1} R_{m+1} Rm · · · R1 = A Q̃m Rm · · · R1,

i.e.,

Q̃_{m+1} R̃_{m+1} = A Q̃m R̃m,

where R̃k = Rk R_{k−1} · · · R1, for k = m, m + 1. Moreover, we have

A^m = Q̃m R̃m.    (7.9)

Theorem 7.4 Suppose that the eigenvalues of A satisfy

|λ1| > |λ2| > · · · > |λn| > 0.

Let Y be a matrix whose i-th row y_i^T satisfies

y_i^T A = λi y_i^T.

If Y has an LU factorization, then the entries below the diagonal of the matrix

Am = [α_{ij}^{(m)}]

produced by (7.6) tend to zero as m → ∞. At the same time,

α_{ii}^{(m)} → λi,

for i = 1, 2, · · · , n.


Proof: LetX = Y −1, Λ = diag(λ1, · · · , λn).

Then A = XΛY . By assumption, Y has an LU factorization

Y = LU

where L is a unit lower triangular matrix and U is an upper triangular matrix. Hence

Am = XΛmY = XΛmLU = X(ΛmLΛ−m)ΛmU

= X(I + Em)ΛmU,(7.10)

where I + Em = ΛmLΛ−m. Since L is a unit lower triangular matrix and |λi| < |λj |for i > j, we have

limm→∞Em = 0. (7.11)

Let X = QR where Q is a unitary matrix and R is an upper triangular matrix. Since X is nonsingular, we can require that the diagonal entries of R are positive. Substituting X = QR into (7.10), we obtain

    A^m = QR(I + E_m)Λ^mU = Q(I + RE_mR^{-1})RΛ^mU.        (7.12)

When m is sufficiently large, I + RE_mR^{-1} is nonsingular and has the following QR decomposition,

    I + RE_mR^{-1} = Q_mR_m,        (7.13)

where the diagonal entries of R_m are positive. By using (7.11) and (7.13), it is easy to show that

    lim_{m→∞} Q_m = lim_{m→∞} R_m = I.        (7.14)

Substituting (7.13) into (7.12), we have

    A^m = (QQ_m)(R_mRΛ^mU),

which is a QR decomposition of A^m. In order to guarantee that all the diagonal entries of the upper triangular factor are positive, we define

    D_1 = diag(λ_1/|λ_1|, · · · , λ_n/|λ_n|)

and

    D_2 = diag(u_{11}/|u_{11}|, · · · , u_{nn}/|u_{nn}|),


where u_{ii} is the i-th diagonal entry of U. Hence, we have

    A^m = (QQ_mD_1^mD_2)(D_2^{-1}D_1^{-m}R_mRΛ^mU).

Comparing with (7.9) and noting that the QR decomposition is unique, we obtain

    Q_m = QQ_mD_1^mD_2,    R_m = D_2^{-1}D_1^{-m}R_mRΛ^mU.

Substituting them into (7.8), we have

    A_m = D_2^*(D_1^*)^mQ_m^*Q^*AQQ_mD_1^mD_2.

Note that

    A = XΛY = XΛX^{-1} = QRΛR^{-1}Q^*.

We finally have

    A_m = D_2^*(D_1^*)^mQ_m^*RΛR^{-1}Q_mD_1^mD_2.

When m → ∞, we know by (7.14) that the entries under the diagonal of the matrix A_m produced by (7.6) tend to zero. At the same time,

    α_{ii}^{(m)} → λ_i,    for i = 1, 2, · · · , n.

From Theorem 7.4, we know that the sequence {Am} produced by (7.6) converges to the Schur decomposition of A.
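For illustration, a minimal Python sketch of the basic QR iteration (7.6) follows; it simply repeats the factor-and-remultiply step and is not a practical implementation (no Hessenberg reduction, no shifts). The function name and the example matrix are our own.

    import numpy as np

    def basic_qr_iteration(A, num_iter=200):
        """Basic QR iteration (7.6): A_{m-1} = Q_m R_m, A_m = R_m Q_m.
        Under the hypotheses of Theorem 7.4, A_m tends to upper triangular
        form with the eigenvalues appearing on the diagonal."""
        Am = np.array(A, dtype=float)
        for _ in range(num_iter):
            Q, R = np.linalg.qr(Am)   # A_{m-1} = Q_m R_m
            Am = R @ Q                # A_m = R_m Q_m  (similar to A)
        return Am

    # Example: the diagonal of the returned matrix approximates the eigenvalues.
    A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
    print(np.diag(basic_qr_iteration(A)))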

7.5 Real version of QR algorithm

If A ∈ Rn×n, we would like a fast and effective QR algorithm that uses real arithmetic only. We concentrate on developing the real analog of (7.6) as follows: let A1 = A and construct the iterative algorithm

    Am = QmRm,    Am+1 = RmQm,        (7.15)

for m = 1, 2, · · ·, where Qm ∈ Rn×n is an orthogonal matrix and Rm ∈ Rn×n is an upper triangular matrix. A difficulty associated with (7.15) is that Am can never converge to upper triangular form when A has complex eigenvalues. By Theorem 7.2, we can expect that (7.15) converges to the real Schur decomposition of A.

We note that in practice (7.15) is not a good iterative method, for the following two reasons:


(i) the operation cost in each iteration is too large;

(ii) the convergence rate is too slow.

We therefore need to reduce the operation cost in each iteration and increase theconvergence rate of the method. Thus, the upper Hessenberg reduction and the shiftstrategy are introduced.

7.5.1 Upper Hessenberg reduction

We construct an orthogonal matrix Q0 such that

QT0 AQ0 = H (7.16)

has some special structure (Hessenberg) with many zero entries. Afterwards, we canapply the QR algorithm (7.15) to the matrix H in (7.16). Then the operation cost periteration can be dramatically reduced.

For A = [αij] ∈ Rn×n, at the first step, we can choose a Householder transformation H1 such that the first column of H1A has many zero entries (at most n − 1 zero entries). In order to keep a similarity transformation, we need to add one more column transformation:

    H1AH1.

Hence H1 could have the following form

    H1 = [ 1   0
           0   H1 ]        (7.17)

to keep the zero entries in the first column unchanged. By using H1 defined as in (7.17), we have

    H1AH1 = [ α11     a2^T H1
              H1a1    H1A22H1 ]        (7.18)

where a1^T = (α21, α31, · · · , αn1), a2^T = (α12, α13, · · · , α1n), and A22 is the (n−1)-by-(n−1) principal submatrix in the lower right corner of A. We know by (7.18) that the best choice of the Householder transformation H1 should be

H1a1 = pe1 (7.19)

where p ∈ R and e1 ∈ Rn−1 is the first unit vector. Therefore, the first column of thematrix in (7.18) has n− 2 zeros by using H1 defined by (7.17) and (7.19).

Afterwards, for A22 = H1A22H1, we could find a Householder transformation

    H2 = [ 1   0
           0   H2 ]


such that

    (H2A22H2)e1 = (∗, ∗, 0, · · · , 0)^T.

Let

    H2 = [ 1   0
           0   H2 ].

We therefore have

    H2H1AH1H2 = [ h11   h12   · · ·   ∗
                  h21   h22   · · ·   ∗
                  0     h32   · · ·   ∗
                  0     0     · · ·   ∗
                  ⋮     ⋮             ⋮
                  0     0     · · ·   ∗ ].

After n − 2 steps, we have found n − 2 Householder transformations H1, H2, · · · , Hn−2 such that

    Hn−2 · · ·H2H1AH1H2 · · ·Hn−2 = H,

where

    H = [ h11   h12   h13   · · ·   h1,n−1   h1n
          h21   h22   h23   · · ·   h2,n−1   h2n
          0     h32   h33   · · ·   h3,n−1   h3n
          ⋮     0     h43   ⋱       ⋮        ⋮
          ⋮           ⋱     ⋱       ⋮        ⋮
          0     0     0     · · ·   hn,n−1   hnn ]

with hij = 0 for i > j + 1; such a matrix is called an upper Hessenberg matrix. Let

    Q0 = H1H2 · · ·Hn−2;

therefore

    Q0^T AQ0 = H,

which is called the upper Hessenberg decomposition of A. We have the following algorithm, which uses Householder transformations.

Algorithm 7.1 (Upper Hessenberg decomposition)

for k = 1 : n − 2
    [v, β] = house(A(k + 1 : n, k))
    A(k + 1 : n, k : n) = (I − βvv^T)A(k + 1 : n, k : n)
    A(1 : n, k + 1 : n) = A(1 : n, k + 1 : n)(I − βvv^T)

end
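For concreteness, here is a small Python sketch of the same reduction; the helper names house and hessenberg_reduce are ours, and dense matrices with explicit Householder vectors are used for clarity rather than efficiency.

    import numpy as np

    def house(x):
        """Householder vector v and scalar beta with (I - beta v v^T) x = alpha e1,
        where alpha = -sign(x1) ||x||_2 (sign chosen for stability)."""
        x = np.asarray(x, dtype=float)
        nrm = np.linalg.norm(x)
        alpha = -np.copysign(nrm, x[0]) if nrm > 0 else 0.0
        v = x.copy()
        v[0] -= alpha
        vtv = v @ v
        beta = 0.0 if vtv == 0.0 else 2.0 / vtv
        return v, beta

    def hessenberg_reduce(A):
        """Reduce A to upper Hessenberg form H = Q0^T A Q0 (Algorithm 7.1 in spirit)."""
        H = np.array(A, dtype=float)
        n = H.shape[0]
        for k in range(n - 2):
            v, beta = house(H[k + 1:, k])
            H[k + 1:, k:] -= beta * np.outer(v, v @ H[k + 1:, k:])   # left transformation
            H[:, k + 1:] -= beta * np.outer(H[:, k + 1:] @ v, v)     # right transformation
        return H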


The operation cost of the Hessenberg decomposition by Householder transforma-tions is O(n3). Of course, the Hessenberg decomposition can also be obtained by usingGivens rotations but the operation cost will be double of that by using Householdertransformations. Although the Hessenberg decomposition of a matrix is not unique,we have the following theorem.

Theorem 7.5 Suppose that A ∈ Rn×n has the following two upper Hessenberg decom-positions:

UT AU = H, V T AV = G, (7.20)

whereU = [u1, u2, · · · , un], V = [v1, v2, · · · , vn]

are n-by-n orthogonal matrices, and H = [hij ], G = [gij ] are upper Hessenberg matrices.If u1 = v1 and all the entries hi+1,i are not zero, then there exists a diagonal matrix Dwith diagonal entries being either 1 or −1 such that

U = V D, H = DGD.

Proof: Assume that we have proved for some m, 1 ≤ m < n, that

uj = εjvj , j = 1, 2, · · · ,m, (7.21)

where ε1 = 1, εj = 1 or −1. Now, we want to show that there exists εm+1 = 1 or −1such that

um+1 = εm+1vm+1.

From (7.20), we haveAU = UH, AV = V G.

Comparing with the m-th column of above matrix equalities respectively, we obtain

Aum = h1mu1 + · · ·+ hmmum + hm+1,mum+1, (7.22)

andAvm = g1mv1 + · · ·+ gmmvm + gm+1,mvm+1. (7.23)

Multiplying (7.22) and (7.23) by u_i^T and v_i^T, respectively, we have

    h_{im} = u_i^T Au_m,    g_{im} = v_i^T Av_m,    i = 1, 2, · · · , m.

Therefore by (7.21),him = εiεmgim, i = 1, 2, · · · ,m. (7.24)


Substituting (7.24) into (7.22), and using (7.21) and (7.23), we obtain

    h_{m+1,m}u_{m+1} = ε_m(Av_m − ε_1²g_{1m}v_1 − · · · − ε_m²g_{mm}v_m)
                     = ε_m(Av_m − g_{1m}v_1 − · · · − g_{mm}v_m)
                     = ε_m g_{m+1,m}v_{m+1}.        (7.25)

Hence,|hm+1,m| = |gm+1,m|.

Since hm+1,m 6= 0, from (7.25), we know that

um+1 = εm+1vm+1,

where εm+1 = 1 or −1. By induction, the proof is complete.

We note that an upper Hessenberg matrix H = [hij] is called irreducible if hi+1,i ≠ 0 for i = 1, 2, · · · , n − 1. Theorem 7.5 says that if Q^T AQ = H is an irreducible upper Hessenberg matrix, where Q is orthogonal, then Q and H are determined completely by the first column of Q (up to signs ±1).

Now suppose that H ∈ Rn×n is an upper Hessenberg matrix and we want to apply a QR iteration to H. First, we need to construct a QR decomposition of H. Since H has a special structure, we can use n − 1 Givens rotations to obtain the QR decomposition. For simplicity, in the following we consider the case n = 5. Suppose that we have already found two Givens rotations P12 and P23 such that

    P23P12H = [ ×   ×   ×     ×   ×
                0   ×   ×     ×   ×
                0   0   h33   ×   ×
                0   0   h43   ×   ×
                0   0   0     ×   × ].

Then we construct a Givens rotation P34 = G(3, 4, θ3) such that θ3 satisfies

    [  cos θ3   sin θ3 ] [ h33 ]   [ × ]
    [ −sin θ3   cos θ3 ] [ h43 ] = [ 0 ].

Hence,

    P34P23P12H = [ ×   ×   ×   ×   ×
                   0   ×   ×   ×   ×
                   0   0   ×   ×   ×
                   0   0   0   ×   ×
                   0   0   0   ×   × ].


Therefore, it is easy to see that for an n-by-n upper Hessenberg matrix H, we can construct n − 1 Givens rotations P12, P23, · · · , Pn−1,n such that

    Pn−1,nPn−2,n−1 · · ·P12H = R

is an upper triangular matrix. Let

    Q = (Pn−1,nPn−2,n−1 · · ·P12)^T.

Then we have H = QR. In order to complete a QR iteration, we have to compute

    H = RQ = RP12^T P23^T · · ·Pn−1,n^T.

Note that RP12^T is different from R only in the first two columns. Since R is an upper triangular matrix, RP12^T has the following form (for n = 5),

    RP12^T = [ ×   ×   ×   ×   ×
               ×   ×   ×   ×   ×
               0   0   ×   ×   ×
               0   0   0   ×   ×
               0   0   0   0   × ].

Similarly, RP12^T P23^T is different from RP12^T only in the second and third columns. Hence,

    RP12^T P23^T = [ ×   ×   ×   ×   ×
                     ×   ×   ×   ×   ×
                     0   ×   ×   ×   ×
                     0   0   0   ×   ×
                     0   0   0   0   × ].

Continuing in this way, we finally obtain the matrix H, which is again an upper Hessenberg matrix. It is easy to see that the operation cost of a QR iteration for an upper Hessenberg matrix is O(n²), whereas the operation cost of a QR iteration for a general matrix is O(n³).
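A short Python sketch of one such Hessenberg QR step via Givens rotations follows; the function names and the dense-matrix implementation are our own choices, made for clarity rather than for the O(n²) operation count discussed above.

    import numpy as np

    def givens(a, b):
        """Return c, s with [[c, s], [-s, c]] applied to (a, b) giving (r, 0)."""
        if b == 0.0:
            return 1.0, 0.0
        r = np.hypot(a, b)
        return a / r, b / r

    def hessenberg_qr_step(H):
        """One QR step H -> R Q for an upper Hessenberg H using n-1 Givens
        rotations; the result is again upper Hessenberg."""
        H = np.array(H, dtype=float)
        n = H.shape[0]
        rots = []
        for k in range(n - 1):                      # reduce H to triangular R
            c, s = givens(H[k, k], H[k + 1, k])
            G = np.array([[c, s], [-s, c]])
            H[k:k + 2, k:] = G @ H[k:k + 2, k:]
            rots.append((c, s))
        for k, (c, s) in enumerate(rots):           # form R Q = R P12^T P23^T ...
            G = np.array([[c, s], [-s, c]])
            H[:k + 2, k:k + 2] = H[:k + 2, k:k + 2] @ G.T
        return H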

7.5.2 QR iteration with single shift

Now we only need to discuss the Hessenberg form. From Theorem 7.4, we know that the convergence rate of the basic QR algorithm is linear and depends on the distances between the eigenvalues. In order to speed up the convergence, we introduce the shift strategy. The QR iteration with a single shift is given as follows:

    Hm − µmI = QmRm,
    Hm+1 = RmQm + µmI,        (7.26)


for m = 1, 2, · · ·, where H1 = H ∈ Rn×n is a given upper Hessenberg matrix satisfying the conditions of Theorem 7.4 and µm ∈ R. We consider how to choose a shift when all the eigenvalues of H are assumed to be real. Since Hm is upper Hessenberg, there are only two nonzero entries h_{n,n−1}^{(m)} and h_{nn}^{(m)} in the last row (for n = 5):

    Hm = [ ×   ×   ×   ×   ×
           ×   ×   ×   ×   ×
           0   ×   ×   ×   ×
           0   0   ×   ×   ×
           0   0   0   h_{n,n−1}^{(m)}   h_{nn}^{(m)} ].

When the QR algorithm converges, h_{n,n−1}^{(m)} will be very small and h_{nn}^{(m)} will approach an eigenvalue of H. Therefore, we can choose the shift µm = h_{nn}^{(m)}. In fact, we can prove that if h_{n,n−1}^{(m)} = ε is very small, then after one iteration of the QR algorithm we have

    h_{n,n−1}^{(m+1)} = O(ε²).        (7.27)

From the discussion above, we know that n − 1 steps are needed to reduce Hm − h_{nn}^{(m)}I to an upper triangular matrix. Assume that the first n − 2 steps are completed:

    H = [ ×   ×   ×   ×   ×
          0   ×   ×   ×   ×
          0   0   ×   ×   ×
          0   0   0   α   β
          0   0   0   ε   0 ].

Actually, we only need to study the 2-by-2 submatrix

    Hm2 = [ α   β
            ε   0 ]

in the lower right corner of the matrix H. In the (n − 1)-th step of the reduction, we want to determine c = cos θ and s = sin θ such that

    [  c   s ] [ α ]   [ σ ]
    [ −s   c ] [ ε ] = [ 0 ].

It is easy to see that

    c = α/σ,    s = ε/σ,    σ = √(α² + ε²).

Thus, after a few simple computations on the 2-by-2 submatrix in the lower right corner of the matrix Hm+1 = RmQm + h_{nn}^{(m)}I, we have

    h_{n,n−1}^{(m+1)} = −s²β = −βε²/σ² = O(ε²),


i.e., (7.27) holds. With a shift, the convergence rate of the QR iteration is thus expected to be quadratic. If H has complex eigenvalues, then a double shift strategy can be used to speed up the convergence rate of the QR iteration.
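For illustration, a bare-bones Python sketch of the single-shift iteration (7.26) is given below; it uses a dense QR factorization instead of the Givens-based O(n²) step and performs no deflation, so it is only meant to show the shape of the iteration.

    import numpy as np

    def qr_single_shift(H, num_iter=50):
        """QR iteration with single shift mu_m = H[n-1, n-1], cf. (7.26)."""
        H = np.array(H, dtype=float)
        n = H.shape[0]
        I = np.eye(n)
        for _ in range(num_iter):
            mu = H[n - 1, n - 1]                  # shift = last diagonal entry
            Q, R = np.linalg.qr(H - mu * I)       # H_m - mu_m I = Q_m R_m
            H = R @ Q + mu * I                    # H_{m+1} = R_m Q_m + mu_m I
        return H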

7.5.3 QR iteration with double shift

Note that difficulties with (7.26) can be expected if the submatrix

    G = [ h_{pp}^{(m)}   h_{pn}^{(m)}
          h_{np}^{(m)}   h_{nn}^{(m)} ],    p = n − 1,

in the lower right corner of Hm has a pair of complex conjugate eigenvalues µ1 and µ2. We cannot expect that h_{nn}^{(m)} tends to an eigenvalue of A. A way around this difficulty is to perform the following QR algorithm with double shifts:

H − µ1I = Q1R1,

H1 = R1Q1 + µ1I,

H1 − µ2I = Q2R2,

H2 = R2Q2 + µ2I,

(7.28)

where H = Hm. Let

    M ≡ (H − µ1I)(H − µ2I).        (7.29)

By a few simple computations, we have

    M = QR,        (7.30)

and

    H2 = Q^T HQ,        (7.31)

where Q = Q1Q2 and R = R2R1. By (7.29), we obtain

    M = H² − sH + tI,

where

    s = µ1 + µ2 = h_{pp}^{(m)} + h_{nn}^{(m)} ∈ R,    and    t = µ1µ2 = det(G) ∈ R.


Hence M is also real. If µ1, µ2 are not eigenvalues of H and the diagonal entries of R1, R2 are chosen to be positive in each iteration, then by (7.30), Q is also real. By (7.31), it then follows that H2 is real. Therefore, in the absence of rounding errors, H2 computed by (7.28) is still a real upper Hessenberg matrix. In practice, because of rounding errors, H2 is usually not real. In order to keep H2 real, by using (7.30) and (7.31), we propose the following process to compute H2:

(1) Compute M = H² − sH + tI.

(2) Compute the QR decomposition of M : M = QR.

(3) Compute H2 = QT HQ.

Note that forming the matrix M in (1) costs O(n³) operations. We remark that, in practice, one does not need to form the matrix M explicitly; see [14]. For a practical implementation of the QR algorithm with the shift strategy, we refer to [14, 19, 47].
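As an illustration only, the following Python sketch carries out one explicit double-shift step following (1)-(3) above, forming M explicitly; as just noted, a practical code would use the implicit (Francis) version instead. The function name is ours.

    import numpy as np

    def qr_double_shift_step(H):
        """One explicit double-shift QR step (7.28)-(7.31) for a real upper
        Hessenberg H; forms M = H^2 - s H + t I explicitly (O(n^3))."""
        H = np.array(H, dtype=float)
        n = H.shape[0]
        G = H[n - 2:, n - 2:]                          # trailing 2-by-2 block
        s = G[0, 0] + G[1, 1]                          # s = mu1 + mu2 = trace(G)
        t = G[0, 0] * G[1, 1] - G[0, 1] * G[1, 0]      # t = mu1*mu2 = det(G)
        M = H @ H - s * H + t * np.eye(n)              # step (1)
        Q, R = np.linalg.qr(M)                         # step (2): M = QR
        return Q.T @ H @ Q                             # step (3): H2 = Q^T H Q (real)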

Exercises:

1. Show that if T ∈ Cn×n is upper triangular and normal, then T is diagonal.

2. Let A, B ∈ Cn×n. Prove that the spectrum of AB is equal to the spectrum of BA.

3. Let A ∈ Cn×n, x ∈ Cn and X = [x,Ax, · · · , An−1x]. Show that if X is nonsingular, thenX−1AX is an upper Hessenberg matrix.

4. Suppose that A ∈ Cn×n has distinct eigenvalues. Show that if Q∗AQ = T is the Schurdecomposition and AB = BA, then Q∗BQ is upper triangular.

5. Suppose that A ∈ Rn×n and z ∈ Rn. Find a detailed algorithm for computing anorthogonal matrix Q such that QT AQ is upper Hessenberg and QT z is a multiple of e1

where e1 is the first unit vector.

6. Suppose that W , Y ∈ Rn×n and define matrices C, B by

    C = W + iY,    B = [ W   −Y
                         Y    W ].

Show that if λ ∈ R is an eigenvalue of C, then λ is also an eigenvalue of B. What is therelation between two corresponding eigenvectors?

7. Suppose that

    A = [ w   x
          y   z ] ∈ R2×2

has eigenvalues λ ± iµ, where µ ≠ 0. Find an algorithm that determines c = cos θ and s = sin θ stably such that

    [  c   s ]^T [ w   x ] [  c   s ]   [ λ   β ]
    [ −s   c ]   [ y   z ] [ −s   c ] = [ α   λ ],


where αβ = −µ2.

8. Find a 2-by-2 diagonal matrix D that minimizes ‖D^{-1}AD‖_F, where

    A = [ w   x
          y   z ].

9. Let H = H1 be a given matrix. We generate matrices Hk via

Hk − µkI = QkRk, Hk+1 = RkQk + µkI.

Show that(Q1 · · ·Qj)(Rj · · ·R1) = (H − µ1I) · · · (H − µjI).

10. Show that if

    Y = [ I   Z
          0   I ],

then

    κ2(Y) ≡ ‖Y‖2‖Y^{-1}‖2 = (1/2)(2 + σ² + √(4σ² + σ⁴)),

where σ = ‖Z‖2.

11. Let A be a matrix with real diagonal entries. Show that

    |Im(λ)| ≤ max_i Σ_{j≠i} |a_{ij}|,

where λ is any eigenvalue of A and Im(·) denotes the imaginary part of a complex number.

12. Show that if (1/2)(A + A^T) is positive definite, then Re(λ) > 0, where λ is any eigenvalue of A and Re(·) denotes the real part of a complex number.

13. Let B be a matrix with ‖B‖2 < 1. Show that I − B is nonsingular and the eigenvaluesof I + 2(B − I)−1 have negative real parts.


Chapter 8

Symmetric Eigenvalue Problems

The symmetric eigenvalue problem with its nice properties and rich mathematical the-ory is one of the most pleasing topics in NLA. In this chapter, we will study symmetriceigenvalue problems. We begin by introducing some basic spectral properties of sym-metric matrices. Then the symmetric QR method, the Jacobi method and the bisectionmethod are discussed. Finally, a divide-and-conquer algorithm is described.

8.1 Basic spectral properties

We first introduce some basic properties of eigenvalues and eigenvectors of any sym-metric matrix A ∈ Rn×n. It is well known that the eigenvalues of any symmetric matrixA are real and there is an orthonormal basis of Rn formed by the eigenvectors of A.

Theorem 8.1 (Spectral Decomposition Theorem [17]) If A ∈ Rn×n is symmet-ric, then there exists an orthogonal matrix Q ∈ Rn×n such that

QT AQ = diag(λ1, · · · , λn).

The eigenvalues of a symmetric matrix have minimax properties based on the values of the Rayleigh quotient

    x^T Ax / x^T x.

We have the following theorem and its proof can be found in [19].

Theorem 8.2 (Courant-Fischer Minimax Theorem) Let A ∈ Rn×n be symmetricand its eigenvalues be ordered as

λ1 ≥ · · · ≥ λn.


Then

    λ_i = max_{dim(S)=i} min_{0≠u∈S} (u^T Au)/(u^T u) = min_{dim(S)=n−i+1} max_{0≠u∈S} (u^T Au)/(u^T u),

where S is any subspace of Rn.

The next theorem [46] shows the sensitivity of eigenvalues of symmetric matrices.

Theorem 8.3 (Weyl, Wielandt-Hoffman Theorem) If A, B ∈ Rn×n are symmetric matrices and the eigenvalues of A, B are ordered respectively as follows,

    λ1(A) ≥ · · · ≥ λn(A),    µ1(B) ≥ · · · ≥ µn(B),

then

    |λi(A) − µi(B)| ≤ ‖A − B‖2,    i = 1, 2, · · · , n,

and

    Σ_{i=1}^{n} (λi(A) − µi(B))² ≤ ‖A − B‖²_F.

Theorem 8.3 tells us that the eigenvalues of any symmetric matrix are well conditioned,i.e., small perturbations on the entries of A cause only small changes in the eigenvaluesof A.

Theorem 8.4 (Cauchy Interlace Theorem [17, 46]) If A ∈ Rn×n is symmetricand Ar denotes the r-by-r leading principal submatrix of A, then

λr+1(Ar+1) ≤ λr(Ar) ≤ λr(Ar+1) ≤ · · · ≤ λ2(Ar+1) ≤ λ1(Ar) ≤ λ1(Ar+1),

for r = 1, 2, · · · , n− 1.

As for the sensitivity of eigenvectors, we have the following theorem, see [46].

Theorem 8.5 Suppose A, A + E ∈ Rn×n are symmetric matrices and

    Q = [q1, Q2] ∈ Rn×n

is an orthogonal matrix where q1 is a unit eigenvector of A. Partition the matrices Q^T AQ and Q^T EQ as follows:

    Q^T AQ = [ λ    0
               0    D22 ],

    Q^T EQ = [ ε    e^T
               e    E22 ],


where D22, E22 ∈ R(n−1)×(n−1). If

    d = min_µ |λ − µ| > 0,    ‖E‖2 ≤ d/4,

where µ is any eigenvalue of D22, then there exists a unit eigenvector q̂1 such that

    sin θ = √(1 − |q1^T q̂1|²) ≤ (4/d)‖e‖2 ≤ (4/d)‖E‖2,

where θ = arccos |q1^T q̂1|.

Thus θ is a good measure of the distance between q1 and q̂1. We can see that the sensitivity of a single eigenvector to perturbations depends on the separation of its corresponding eigenvalue from the rest of the eigenvalues.

The eigenvalues of any symmetric matrix are closely related with the singular valuesof the matrix. The singular value decomposition [19] is essential in NLA.

Theorem 8.6 (Singular Value Decomposition Theorem) Let A ∈ Rm×n with rank(A) = r. Then there exist orthogonal matrices

    U = [u1, · · · , um] ∈ Rm×m,    V = [v1, · · · , vn] ∈ Rn×n

such that

    U^T AV = [ Σr   0
               0    0 ],

where Σr = diag(σ1, · · · , σr) with σ1 ≥ σ2 ≥ · · · ≥ σr > 0.

The σi are called the singular values of A. The vectors ui and vi are called the i-thleft singular vector and the i-th right singular vector respectively. The next corollaryis the Weyl, Wielandt-Hoffman Theorem for the singular values.

Corollary 8.1 Let A, B ∈ Rn×n and their singular values be ordered respectively as follows:

    σ1(A) ≥ · · · ≥ σn(A),    τ1(B) ≥ · · · ≥ τn(B).

Then

    |σi(A) − τi(B)| ≤ ‖A − B‖2,    i = 1, 2, · · · , n,

and

    Σ_{i=1}^{n} (σi(A) − τi(B))² ≤ ‖A − B‖²_F.

The corollary shows that the singular values of any real matrix are also well conditioned,i.e., small perturbations on the entries of A cause only small changes in the singularvalues of A.


8.2 Symmetric QR method

The symmetric QR method is a QR iteration for solving symmetric eigenvalue problem.It applies the QR algorithm to any symmetric matrix A by using its symmetry. Inorder to construct an efficient QR method, we first reduce the symmetric matrix A toa tridiagonal matrix T and then apply the QR iteration to the matrix T .

8.2.1 Tridiagonal QR iteration

Let A be symmetric and suppose that A can be decomposed as

QT AQ = T,

where Q is an orthogonal matrix and T is an upper Hessenberg matrix. Then, T shouldbe symmetric tridiagonal. In this case we only need to handle with an eigenvalueproblem of a symmetric tridiagonal matrix T .

Partition A as follows,

    A = [ α1   v0^T
          v0   A0 ],

where A0 ∈ R(n−1)×(n−1). By using Householder transformations, we can reduce thematrix A into a symmetric tridiagonal matrix as follows: at the k-th step,

(1) Compute a Householder transformation Hk ∈ R(n−k)×(n−k) such that

        Hkvk−1 = βke1,    βk ∈ R.

(2) Compute

        [ αk+1   vk^T
          vk     Ak ] = HkAk−1Hk,

    where Ak ∈ R(n−k−1)×(n−k−1).

If we use the αk, βk and Hk generated by the reduction above to define

    T = [ α1   β1
          β1   α2   ⋱
               ⋱    ⋱        β_{n−1}
                    β_{n−1}  αn ],

    Hk = diag(Ik, Hk) ∈ Rn×n,    Q = H1H2 · · ·Hn−2,

where

    [ α_{n−1}   β_{n−1}
      β_{n−1}   αn ] = Hn−2An−3Hn−2,


then we have Q^T AQ = T.

From the reduction above, it is easy to see that the main operation cost of the k-thstep is to compute HkAk−1Hk. Let

Hk = I − βvvT , v ∈ Rn−k.

By using the symmetry of Ak−1, we then have

    HkAk−1Hk = Ak−1 − vw^T − wv^T,

where

    w = u − (1/2)β(v^T u)v,    u = βAk−1v.

Since only the upper triangular portion of this matrix needs to be computed, we seethat the transition from Ak−1 to Ak can be computed in 4(n−k)2 operations only. Givena symmetric matrix A ∈ Rn×n, the following algorithm overwrites A with T = QT AQ,where T is a tridiagonal matrix and Q is a product of Householder transformations.

Algorithm 8.1 (Householder tridiagonalization)

for k = 1 : n − 2
    [v, β] = house(A(k + 1 : n, k))
    u = βA(k + 1 : n, k + 1 : n)v
    w = u − (βu^T v/2)v
    A(k + 1, k) = ‖A(k + 1 : n, k)‖2
    A(k, k + 1) = A(k + 1, k)
    A(k + 1 : n, k + 1 : n) = A(k + 1 : n, k + 1 : n) − vw^T − wv^T

end

This algorithm requires 4n³/3 operations. If Q is explicitly required, then it can be formed with an additional 4n³/3 operations.
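As an illustration, a short Python sketch of the tridiagonalization follows; the helper names house and householder_tridiag are ours, and the code favours clarity over the operation count quoted above.

    import numpy as np

    def house(x):
        """Householder vector v and scalar beta with (I - beta v v^T) x = alpha e1."""
        x = np.asarray(x, dtype=float)
        nrm = np.linalg.norm(x)
        alpha = -np.copysign(nrm, x[0]) if nrm > 0 else 0.0
        v = x.copy()
        v[0] -= alpha
        vtv = v @ v
        beta = 0.0 if vtv == 0.0 else 2.0 / vtv
        return v, beta, alpha

    def householder_tridiag(A):
        """Reduce a symmetric matrix to tridiagonal form (Algorithm 8.1 in spirit)."""
        T = np.array(A, dtype=float)
        n = T.shape[0]
        for k in range(n - 2):
            v, beta, alpha = house(T[k + 1:, k])
            T[k + 1, k] = alpha                      # new off-diagonal entry
            T[k, k + 1] = alpha
            T[k + 2:, k] = 0.0
            T[k, k + 2:] = 0.0
            u = beta * (T[k + 1:, k + 1:] @ v)       # rank-two update of the trailing block:
            w = u - (beta * (v @ u) / 2.0) * v       # H A22 H = A22 - v w^T - w v^T
            T[k + 1:, k + 1:] -= np.outer(v, w) + np.outer(w, v)
        return T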

After a symmetric matrix A has been reduced to a symmetric tridiagonal matrixT , our aim turns to choose a suitable shift for the QR iteration. Consider the followingQR iteration with a single shift µk:

    Tk − µkI = QkRk,
    Tk+1 = RkQk + µkI,        (8.1)

for k = 1, 2, · · ·, where T1 = T is a symmetric tridiagonal matrix and so is each matrixTk in (8.1). Just as the QR algorithm for nonsymmetric case, Tk is assumed to be


irreducible, i.e., sub-diagonal entries are nonzero. Let us discuss how to choose theshift µk. We can take µk = Tk(n, n), the (n, n)-th entry at each iteration as a shift.However, a better way is to select

    µk = αn + δ − sign(δ)√(δ² + β²_{n−1}),

where δ = (αn−1 − αn)/2. This is the well-known Wilkinson shift, see [46]. Note thatµk is just the eigenvalue of the matrix

    Tk(n−1 : n, n−1 : n) = [ α_{n−1}   β_{n−1}
                             β_{n−1}   αn ]

which is closer to αn.

8.2.2 Implicit symmetric QR iteration

For the explicit QR algorithm (8.1), it is possible to execute the transition from

T − µI = QR

to

    T̃ = RQ + µI

without explicitly forming the matrix T − µI. The essence of (8.1) is to transform T into T̃ by an orthogonal similarity transformation. It follows from Theorem 7.5 that T̃ is determined completely by the first column of Q. From the process of the QR decomposition of T − µI by Givens rotations, we know that

Qe1 = G1e1,

where G1 = G(1, 2, θ1) is a rotation which makes the second entry in the first columnof T − µI to be zero. The θ1 can be computed from

    [  cos θ1   sin θ1 ] [ α1 − µ ]   [ ∗ ]
    [ −sin θ1   cos θ1 ] [ β1     ] = [ 0 ].

Let B = G1TG1^T.

Then B has the following form (n = 4),

    B = [ ∗   ∗   +   0
          ∗   ∗   ∗   0
          +   ∗   ∗   ∗
          0   0   ∗   ∗ ].


Let Gi = G(i, i+1, θi), i = 2, 3. Then Gi of this form can be used to chase the unwantednonzero entry “+” out of the matrix B as follows:

    B
      --G2-->  [ ∗   ∗   0   0
                ∗   ∗   ∗   +
                0   ∗   ∗   ∗
                0   +   ∗   ∗ ]
      --G3-->  [ ∗   ∗   0   0
                ∗   ∗   ∗   0
                0   ∗   ∗   ∗
                0   0   ∗   ∗ ].

In general, if Z = G1G2 · · ·Gn−1, then

Ze1 = G1e1 = Qe1

and ZTZ^T is tridiagonal. Thus, from Theorem 7.5, the tridiagonal matrix ZTZ^T produced by this zero-chasing technique is essentially the same as the tridiagonal matrix T̃ obtained by the explicit method (8.1). Overall, we obtain

Algorithm 8.2 (Implicit symmetric QR iteration with Wilkinson shift)

d = (T(n−1, n−1) − T(n, n))/2
µ = T(n, n) − T(n, n−1)²/(d + sign(d)√(d² + T(n, n−1)²))
x = T(1, 1) − µ;  z = T(2, 1)
for k = 1 : n − 1
    [c, s] = givens(x, z)
    T = GkTGk^T, where Gk = G(k, k + 1, θk)
    if k < n − 1
        x = T(k + 1, k);  z = T(k + 2, k)
    end

end

This algorithm requires about 30n operations and n square roots. Of course, the tridiagonal matrix T would be stored in a pair of n-vectors in any practical implementation.
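The sketch below shows one tridiagonal QR step with the Wilkinson shift in Python; for brevity it is carried out explicitly with a dense QR factorization, whereas a production code would use the implicit zero-chasing of Algorithm 8.2. The handling of d = 0 in the shift is a common convention, assumed here.

    import numpy as np

    def wilkinson_shift(T):
        """Eigenvalue of the trailing 2x2 block of T closest to T[n-1, n-1]."""
        d = (T[-2, -2] - T[-1, -1]) / 2.0
        b = T[-1, -2]
        if d == 0.0:
            return T[-1, -1] - abs(b)
        return T[-1, -1] - b * b / (d + np.sign(d) * np.hypot(d, b))

    def tridiag_qr_step(T):
        """One shifted QR step (8.1) on a symmetric tridiagonal T, done explicitly."""
        n = T.shape[0]
        mu = wilkinson_shift(T)
        Q, R = np.linalg.qr(T - mu * np.eye(n))
        return R @ Q + mu * np.eye(n)      # again symmetric tridiagonal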

8.2.3 Implicit symmetric QR algorithm

Algorithm 8.2 is a base of the symmetric QR algorithm – the standard means for com-puting the spectral decomposition of any symmetric matrix. By applying Algorithm8.2, we develop the following algorithm.

Algorithm 8.3 (Implicit symmetric QR algorithm)

(1) Input A (real symmetric matrix).


(2) Tridiagonalization: Compute the tridiagonalization of A by Algorithm 8.1,

T = UT0 AU0.

Set Q = U0.

(3) Criterion for convergence:

(i) For i = 1, · · · , n− 1, let ti+1,i and ti,i+1 be zero if

|ti+1,i| = |ti,i+1| ≤ (|ti,i|+ |ti+1,i+1|)u,

where u is the machine precision.

(ii) Find the largest integer m ≥ 0 and the smallest integer l ≥ 0 such that

        T = [ T11   0     0
              0     T22   0
              0     0     T33 ],

     where T11 ∈ Rl×l, T22 ∈ R(n−l−m)×(n−l−m) is an irreducible tridiagonal matrix, and T33 ∈ Rm×m is a diagonal matrix.

(iii) If m = n, then output; otherwise

(4) QR iteration: Apply Algorithm 8.2 to T22:

T22 = GT22GT , G = G1G2 · · ·Gn−l−m−1.

(5) Set Q = Qdiag(Il, G, Im), then go to (3).

If we only need to compute the eigenvalues, this algorithm requires about 4n³/3 operations. If we need both the eigenvalues and the eigenvectors, it requires about 9n³ operations. It can be shown [46] that the computed eigenvalues λ̂i, i = 1, 2, · · · , n, obtained by Algorithm 8.3 satisfy

    Q̂^T(A + E)Q̂ = diag(λ̂1, · · · , λ̂n),

where Q̂ ∈ Rn×n is orthogonal and ‖E‖2 ≈ ‖A‖2u with u the machine precision. Using Theorem 8.3, we have

    |λ̂i − λi| ≈ ‖A‖2u,    i = 1, 2, · · · , n,

where {λi} are the eigenvalues of A. The absolute error in each λ̂i is small, and the error relative to ‖A‖2 is of the order of the machine precision u. If Q̂ = [q̂1, · · · , q̂n] is the matrix of computed orthonormal eigenvectors, then the accuracy of each q̂i depends on the separation of λi from the rest of the eigenvalues.


8.3 Jacobi method

The Jacobi method is one of the earliest methods for the symmetric eigenvalue problem. It was developed by Jacobi in 1846, see [19]. It is well known that a real symmetric matrix can be reduced to a diagonal matrix by orthogonal similarity transformations. The Jacobi method exploits the symmetry of the matrix and chooses suitable rotations to reduce a symmetric matrix to diagonal form. The Jacobi method is usually much slower than the symmetric QR algorithm. However, it remains of interest because it is easy to program and, as has been recognized in recent years, inherently parallel.

8.3.1 Basic idea

Let A = [aij] ∈ Rn×n be a symmetric matrix and

    off(A) ≡ ( ‖A‖²_F − Σ_{i=1}^{n} a²_{ii} )^{1/2} = ( Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} a²_{ij} )^{1/2}.

The idea of Jacobi method is to systematically reduce the off(A) to be zero. The basictools for doing this are called Jacobi rotations defined as follows,

    J(p, q, θ) = I + sin θ (e_p e_q^T − e_q e_p^T) + (cos θ − 1)(e_p e_p^T + e_q e_q^T),

where p < q and ek denotes the k-th unit vector. Note that Jacobi rotations are nodifferent from Givens rotations, see Section 4.2.2. We change the name in this sectionto honour the inventor. The basic step in a Jacobi procedure involves:

(1) Choose p and q for a rotation with 1 ≤ p < q ≤ n.

(2) Compute a rotation angle θ such that

        [ bpp   bpq ]   [  c   s ]^T [ app   apq ] [  c   s ]
        [ bqp   bqq ] = [ −s   c ]   [ aqp   aqq ] [ −s   c ]        (8.2)

    is diagonal, i.e., bpq = bqp = 0, where c = cos θ and s = sin θ.

(3) Overwrite A with B = [bij ] = JT AJ , where J = J(p, q, θ).


Note that the matrix B agrees with the matrix A except the p-th row (column) andthe q-th row (column). The relations are:

    b_{ip} = b_{pi} = c a_{ip} − s a_{iq},    i ≠ p, q,
    b_{iq} = b_{qi} = s a_{ip} + c a_{iq},    i ≠ p, q,
    b_{pp} = c² a_{pp} − 2sc a_{pq} + s² a_{qq},
    b_{qq} = s² a_{pp} + 2sc a_{pq} + c² a_{qq},
    b_{pq} = b_{qp} = (c² − s²) a_{pq} + sc (a_{pp} − a_{qq}).

Let us first consider the actual computation of s = sin θ and c = cos θ such that bpq = bqp = 0 in (8.2). This is equivalent to

    a_{pq}(c² − s²) + (a_{pp} − a_{qq})cs = 0.        (8.3)

If a_{pq} = 0, then we just set c = 1 and s = 0. Otherwise define

    τ = (a_{qq} − a_{pp})/(2a_{pq}),    t = tan θ = s/c,

and conclude from (8.3) that t solves the quadratic equation

    t² + 2τt − 1 = 0.

Then t = −τ ± √(1 + τ²).

We select t to be the root of smaller magnitude, which ensures that |θ| ≤ π/4 and has the effect of minimizing ‖B − A‖_F, because

    ‖B − A‖²_F = 4(1 − c) Σ_{i=1, i≠p,q}^{n} (a²_{ip} + a²_{iq}) + 2a²_{pq}/c².

After t is determined, we can obtain c and s from the formulas

    c = 1/√(1 + t²),    s = tc.

We summarize the computation of Jacobi rotation J(p, q, θ) as follows. Given asymmetric matrix A ∈ Rn×n and indices p, q with 1 ≤ p < q ≤ n, the followingalgorithm computes a cosine-sine pair such that bpq = bqp = 0 where bjk is the (j, k)-thentry of the matrix B = J(p, q, θ)T AJ(p, q, θ).


Algorithm 8.4

function: [c, s] = sym(A, p, q)
if A(p, q) ≠ 0
    τ = (A(q, q) − A(p, p))/(2A(p, q))
    if τ ≥ 0
        t = 1/(τ + √(1 + τ²))
    else
        t = −1/(−τ + √(1 + τ²))
    end
    c = 1/√(1 + t²)
    s = tc
else
    c = 1
    s = 0

end

Once J(p, q, θ) is determined, the update

    A ← J(p, q, θ)^T A J(p, q, θ)

can be computed in 6n operations.

How can we choose the integers p and q? Since the Frobenius norm is preserved by

the orthogonal transformation, we have ‖B‖F = ‖A‖F . Note that

    a²_{pp} + a²_{qq} + 2a²_{pq} = b²_{pp} + b²_{qq} + 2b²_{pq} = b²_{pp} + b²_{qq},

and then

    off(B)² = ‖B‖²_F − Σ_{i=1}^{n} b²_{ii}
            = ‖A‖²_F − Σ_{i=1}^{n} a²_{ii} + (a²_{pp} + a²_{qq} − b²_{pp} − b²_{qq})
            = off(A)² − 2a²_{pq}.        (8.4)

Since our goal is to minimize off(B), the best choice of p, q is such that

    |a_{pq}| = max_{1≤i<j≤n} |a_{ij}|.

This is the basic idea of the classical Jacobi method.


8.3.2 Classical Jacobi method

Given a symmetric matrix A ∈ Rn×n and a tolerance ε > 0, the following algorithmoverwrites A with UT AU , where U is orthogonal and off(UT AU) ≤ ε‖A‖F .

Algorithm 8.5 (Classical Jacobi method)

U = In;  eps = ε‖A‖F
while off(A) > eps
    choose p, q such that |a_{pq}| = max_{i≠j} |a_{ij}|
    [c, s] = sym(A, p, q)
    A = J(p, q, θ)^T A J(p, q, θ)
    U = U J(p, q, θ)

end
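A compact Python sketch of the classical Jacobi method follows; the function names, the stopping rule, and the sweep limit are our own choices, and the rotation angle is computed as in Algorithm 8.4.

    import numpy as np

    def off(A):
        """Frobenius norm of the off-diagonal part of A."""
        return np.sqrt(np.sum(A ** 2) - np.sum(np.diag(A) ** 2))

    def classical_jacobi(A, eps=1e-12, max_rotations=10000):
        """Classical Jacobi method: repeatedly annihilate the largest
        off-diagonal entry with a Jacobi rotation (Algorithm 8.5 in spirit)."""
        A = np.array(A, dtype=float)
        n = A.shape[0]
        U = np.eye(n)
        tol = eps * np.linalg.norm(A, 'fro')
        for _ in range(max_rotations):
            if off(A) <= tol:
                break
            B = np.abs(A - np.diag(np.diag(A)))
            p, q = np.unravel_index(np.argmax(B), B.shape)   # largest |a_pq|
            if p > q:
                p, q = q, p
            if A[p, q] != 0.0:                               # rotation angle, cf. Algorithm 8.4
                tau = (A[q, q] - A[p, p]) / (2.0 * A[p, q])
                t = 1.0 / (tau + np.sqrt(1 + tau * tau)) if tau >= 0 \
                    else -1.0 / (-tau + np.sqrt(1 + tau * tau))
                c = 1.0 / np.sqrt(1 + t * t)
                s = t * c
            else:
                c, s = 1.0, 0.0
            J = np.eye(n)
            J[p, p] = J[q, q] = c
            J[p, q] = s
            J[q, p] = -s
            A = J.T @ A @ J
            U = U @ J
        return np.diag(A), U    # approximate eigenvalues and eigenvectors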

Since |a_{pq}| is the largest off-diagonal entry, we have

    off(A)² ≤ N(a²_{pq} + a²_{qp}),

where N = n(n − 1)/2. It follows from (8.4) that

    off(B)² ≤ (1 − 1/N) off(A)².

If Ak denotes the matrix A after the k-th Jacobi iteration and A0 = A, then by induction,

    off(Ak)² ≤ (1 − 1/N)^k off(A0)².

This implies that the classical Jacobi method converges at least linearly. In fact, the asymptotic convergence rate of the Jacobi method is quadratic: one can prove that for k large enough, there is a constant c such that

    off(A_{k+N}) ≤ c · off(Ak)²,

see [19] and the references therein. Therefore, the off-diagonal "norm" approaches zero at a quadratic rate after a sufficient number of iterations.

Another advantage of the Jacobi method is that the eigenvectors are easy to compute. If the iteration stops after the k-th rotation, we then have

    Ak = Jk^T J_{k−1}^T · · · J1^T A J1 J2 · · · Jk.

Denote Qk = J1J2 · · ·Jk.


Thus AQk = QkAk. Since the off-diagonal entries of Ak are tiny, the diagonal entries of Ak are good approximations to the eigenvalues of A, and the identity above shows that the columns of Qk are good approximations to the corresponding eigenvectors of A; moreover, all the approximate eigenvectors are orthonormal. The matrix Qk of approximate eigenvectors can be accumulated during the Jacobi iteration.

8.3.3 Parallel Jacobi method

The Jacobi method for the symmetric eigenvalue problem is inherently parallel. To illustrate this, let A ∈ R8×8 be symmetric. If one has a parallel computer with 4 processors, then one can group the 28 subproblems into 7 groups of rotations as follows:

    group (1): (1, 2), (3, 4), (5, 6), (7, 8);
    group (2): (1, 3), (2, 4), (5, 7), (6, 8);
    group (3): (1, 4), (2, 3), (5, 8), (6, 7);
    group (4): (1, 5), (2, 6), (3, 7), (4, 8);
    group (5): (1, 6), (2, 5), (3, 8), (4, 7);
    group (6): (1, 7), (2, 8), (3, 5), (4, 6);
    group (7): (1, 8), (2, 7), (3, 6), (4, 5).

Note that all 4 rotations within each group are nonconflicting. For instance, thesubproblems J(2i − 1, 2i, θi), i = 1, 2, 3, 4, in the first group can be carried out inparallel. When we compute J(1, 2, θ1)T AJ(1, 2, θ1), it has no effect on the rotations(3, 4), (5, 6) and (7, 8). Then the computation of

A = AJ(1, 2, θ1), A = AJ(3, 4, θ2),

A = AJ(5, 6, θ3), A = AJ(7, 8, θ4),

can be executed in parallel by 4 processors. Similarly, the computation of

A = J(1, 2, θ1)T A, A = J(3, 4, θ2)T A,

A = J(5, 6, θ3)T A, A = J(7, 8, θ4)T A,

can also be carried out in parallel by 4 processors. For the example above, it only needs1/4 computing time of a computer with a single processor. A parallel Jacobi algorithmcan be found in [19].


8.4 Bisection method

In this section we present the bisection method for symmetric tridiagonal eigenvalueproblems. Combining the bisection method with the tridiagonal skill, we can obtaina numerical method for a specified eigenvalue and its corresponding eigenvector of asymmetric matrix. For a given tridiagonal matrix

    T = [ a1   b1
          b1   a2   ⋱
               ⋱    ⋱        b_{n−1}
                    b_{n−1}  an ] ∈ Rn×n,        (8.5)

we consider the computation of the eigenvalues of T. Without loss of generality, we assume that bi ≠ 0, i = 1, 2, · · · , n − 1, i.e., T is an irreducible symmetric tridiagonal matrix. Otherwise, T can be divided into several smaller irreducible symmetric tridiagonal matrices.

Let pi(λ) be the characteristic polynomial of the i-by-i leading principal submatrix Ti of T, i = 1, 2, · · · , n. Then these polynomials satisfy a three-term recurrence:

    p0(λ) ≡ 1,    p1(λ) = a1 − λ,
    pi(λ) = (ai − λ)p_{i−1}(λ) − b²_{i−1}p_{i−2}(λ),    i = 2, 3, · · · , n.        (8.6)

Since T is symmetric, the roots of polynomial pi(λ) (i = 1, 2 · · · , n) are real. Thefollowing interlacing property is very important.

Theorem 8.7 (Sturm Sequence Property) Let the symmetric tridiagonal matrix T in (8.5) be irreducible. Then the eigenvalues of T_{i−1} strictly separate the eigenvalues of Ti:

    λi(Ti) < λ_{i−1}(T_{i−1}) < λ_{i−1}(Ti) < · · · < λ2(Ti) < λ1(T_{i−1}) < λ1(Ti).

Moreover, if sn(λ) denotes the number of sign changes in the sequence

    {p0(λ), p1(λ), · · · , pn(λ)},

then sn(λ) is equal to the number of eigenvalues of T that are less than λ, where the pi(λ) are defined by (8.6). If pi(λ) = 0, then p_{i−1}(λ)p_{i+1}(λ) < 0.

Proof: It follows from Theorem 8.4 that the eigenvalues of T_{i−1} weakly separate those of Ti. Next we show that the separation must be strict. Assume that pi(µ) = p_{i−1}(µ) = 0 for some i and µ. Since T is irreducible, we deduce from (8.6) that

    p0(µ) = p1(µ) = · · · = pi(µ) = 0,


which is a contradiction with p0(λ) ≡ 1. Thus we have a strict separation. The assertionabout sn(λ) is developed in [46].

The bisection method for computing a specified eigenvalue of T can be stated asfollows:

Algorithm 8.6 (Bisection algorithm) Let λ1 < λ2 < · · · < λn be the eigenvalues of T, i.e., the roots of pn(λ), and let ε be a tolerance. Suppose that the desired eigenvalue is λm for a given m ≤ n. Then

(1) Find an interval [l0, u0] including λm. Since |λi| ≤ ρ(T) ≤ ‖T‖∞, we can take l0 = −‖T‖∞ and u0 = ‖T‖∞.

(2) Compute r1 = (l0 + u0)/2 and sn(r1).

(3) If sn(r1) ≥ m, then λm ∈ [l0, r1]; set l1 = l0 and u1 = r1. Otherwise λm ∈ [r1, u0]; set l1 = r1 and u1 = u0.

(4) If |l1 − u1| < ε, take r2 = (l1 + u1)/2 as an approximate value of λm. Otherwise go to (2).

From the algorithm above, we can see that the main operation cost is to compute sn(µ). In practice, however, sn(µ) should not be obtained by evaluating the pi(µ) directly, because it is difficult to evaluate polynomials of high degree. In order to avoid this problem, we define

    qi(λ) = pi(λ)/p_{i−1}(λ),    i = 1, 2, · · · , n.

From (8.6), we have

    q1(λ) = p1(λ) = a1 − λ,    qi(λ) = ai − λ − b²_{i−1}/q_{i−1}(λ),    i = 2, 3, · · · , n.

It is easy to check that sn(µ) is exactly the number of negative values in the sequenceof q1(µ), · · · , qn(µ). The following is a practical algorithm for computing sn(µ).


Algorithm 8.7 (Compute sign changes)

x = [a1, a2, · · · , an]
y = [0, b1, · · · , b_{n−1}]
s = 0;  q = x(1) − µ
for k = 1 : n
    if q < 0
        s = s + 1
    end
    if k < n
        if q = 0
            q = |y(k + 1)|u
        end
        q = x(k + 1) − µ − y(k + 1)²/q
    end
end

where u is the machine precision.

When qi(µ) = 0, we treat qi as a small positive number; therefore, we can use |bi|u instead of qi(µ) in Algorithm 8.7. If the b²_i are stored in advance, then Algorithm 8.7 needs 3n operations. If an eigenvalue requires m bisection steps on average, then the operation cost is 3nm. Thus, it is efficient to compute the eigenvalues of any symmetric tridiagonal matrix by the bisection method. Moreover, rounding error analysis shows that the bisection method is numerically stable, see [46].
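A small Python sketch of the bisection method, with the sign-count computed as in Algorithm 8.7, is given below; the function names, the diagonal/off-diagonal input format, and the tolerance are our own choices.

    import numpy as np

    def sign_count(a, b, mu, u=np.finfo(float).eps):
        """Number s_n(mu) of eigenvalues of the tridiagonal T (diagonal a,
        off-diagonal b) that are less than mu, via the q_i of Algorithm 8.7."""
        n = len(a)
        s = 0
        q = a[0] - mu
        for k in range(n):
            if q < 0:
                s += 1
            if k < n - 1:
                if q == 0.0:
                    q = abs(b[k]) * u        # replace a zero q_i by a tiny positive number
                q = a[k + 1] - mu - b[k] ** 2 / q
        return s

    def bisection_eigenvalue(a, b, m, tol=1e-12):
        """Bisection (Algorithm 8.6): approximate the m-th smallest eigenvalue
        (m = 1, ..., n) of an irreducible symmetric tridiagonal matrix."""
        T_inf = max(abs(a[i]) + (abs(b[i - 1]) if i > 0 else 0.0) +
                    (abs(b[i]) if i < len(a) - 1 else 0.0) for i in range(len(a)))
        lo, hi = -T_inf, T_inf               # |lambda_i| <= ||T||_inf
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if sign_count(a, b, mid) >= m:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2.0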

8.5 Divide-and-conquer method

The divide-and-conquer method is a numerical method developed by Dongarra and Sorensen in 1987 for computing all the eigenvalues and eigenvectors of symmetric tridiagonal matrices, see [19]. The basic idea is to tear the original symmetric tridiagonal matrix into 2^k smaller symmetric tridiagonal matrices and then compute the spectral decomposition of each smaller symmetric tridiagonal matrix. Once these smaller spectral decompositions have been obtained, we combine them to form a spectral decomposition of the original matrix. Thus, this method is well suited to parallel computing.


8.5.1 Tearing

Let T ∈ Rn×n be given as follows,

    T = [ a1   b1
          b1   a2   ⋱
               ⋱    ⋱        b_{n−1}
                    b_{n−1}  an ].

Without loss of generality, assume n = 2m. Let

    v = (0, · · · , 0, 1, θ, 0, · · · , 0)^T ∈ Rn,

with m − 1 zeros before the entry 1 and m − 1 zeros after the entry θ.

Consider the matrix

    T̃ = T − ρvv^T,

where θ, ρ ∈ R are to be determined. It is easy to see that T̃ is identical to T except in its 4 middle entries:

    [ am − ρ      bm − ρθ
      bm − ρθ     a_{m+1} − ρθ² ].

If we set ρθ = bm, then

    T = [ T1   0
          0    T2 ] + ρvv^T,

where

    T1 = [ a1   b1
           b1   a2   b2
                ⋱    ⋱        ⋱
                     b_{m−2}  a_{m−1}  b_{m−1}
                              b_{m−1}  ãm ],

    T2 = [ ã_{m+1}  b_{m+1}
           b_{m+1}  a_{m+2}  b_{m+2}
                    ⋱        ⋱        ⋱
                             b_{n−2}  a_{n−1}  b_{n−1}
                                      b_{n−1}  an ],

with ãm = am − ρ and ã_{m+1} = a_{m+1} − ρθ².

Therefore, T is divided into the sum of a block-diagonal matrix and a rank-one matrix. If we divide T1 and T2 repeatedly, then we can finally divide T into 2^k blocks.
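A minimal Python sketch of one such tearing step follows; the function name, the input format (diagonal a, off-diagonal b), and the choice θ = 1 are our own assumptions.

    import numpy as np

    def tear(a, b, theta=1.0):
        """Tear a symmetric tridiagonal T (diagonal a, off-diagonal b) at its middle:
        T = diag(T1, T2) + rho * v v^T with rho * theta = b_m."""
        n = len(a)
        m = n // 2
        rho = b[m - 1] / theta
        a1 = np.array(a[:m], dtype=float)
        a2 = np.array(a[m:], dtype=float)
        a1[-1] -= rho                  # last diagonal entry of T1 becomes a_m - rho
        a2[0] -= rho * theta ** 2      # first diagonal entry of T2 becomes a_{m+1} - rho*theta^2
        v = np.zeros(n)
        v[m - 1] = 1.0
        v[m] = theta
        return (a1, np.array(b[:m - 1])), (a2, np.array(b[m:])), rho, v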


8.5.2 Combining

Once we have obtained the spectral decompositions of T1 and T2:

    Q1^T T1 Q1 = D1,    Q2^T T2 Q2 = D2,

where Q1, Q2 ∈ Rm×m are orthogonal matrices and D1, D2 ∈ Rm×m are diagonal matrices, our aim now is to compute the spectral decomposition of T from those of T1 and T2, i.e., to find an orthogonal matrix V such that

    V^T T V = diag(λ1, · · · , λn).

Let

    U = [ Q1   0
          0    Q2 ].

Then

    U^T T U = U^T ( [ T1   0
                      0    T2 ] + ρvv^T ) U = D + ρzz^T,

where D = diag(D1, D2) and z = U^T v.

Now the problem of finding the spectral decomposition of T is reduced to the problemof computing the spectral decomposition of D+ρzzT . We will consider how to computethe spectral decomposition of D + ρzzT quickly and stably.

Lemma 8.1 Let D = diag(d1, · · · , dn) ∈ Rn×n with d1 > d2 > · · · > dn. Assume that 0 ≠ ρ ∈ R and z = (z1, z2, · · · , zn)^T ∈ Rn with zi ≠ 0 for all i. Let u ∈ Rn and λ ∈ R satisfy

    (D + ρzz^T)u = λu,    u ≠ 0.

Then z^T u ≠ 0 and D − λI is nonsingular.

Proof: If zT u = 0, then Du = λu with u 6= 0, i.e., λ is the eigenvalue of D and u isthe eigenvector associated with λ. Since D is a diagonal matrix with distinct entries,there must exist some i such that di = λ and u = αei with α 6= 0, where ei is the i-thunit vector. Thus

0 = zT u = αzT ei = αzi,

which implies zi = 0, a contradiction. Therefore, zT u 6= 0.


On the other hand, if D − λI is singular, then there exists some i such that e_i^T(D − λI) = 0, and then

    0 = e_i^T(D − λI)u = −ρ z^T u e_i^T z.

Since ρz^T u ≠ 0, we have e_i^T z = zi = 0, a contradiction. Thus, D − λI is nonsingular.

Theorem 8.8 Let D = diag(d1, · · · , dn) ∈ Rn×n with d1 > d2 > · · · > dn. Assume that 0 ≠ ρ ∈ R and z = (z1, z2, · · · , zn)^T ∈ Rn with zi ≠ 0 for all i. Suppose that the spectral decomposition of D + ρzz^T is

    V^T(D + ρzz^T)V = diag(λ1, · · · , λn),

where V = [v1, · · · , vn] is an orthogonal matrix and λ1 ≥ · · · ≥ λn. Then

(i) λ1, · · · , λn are the n roots of the function

        f(λ) = 1 + ρz^T(D − λI)^{-1}z.

(ii) If ρ > 0, then λ1 > d1 > λ2 > · · · > λn > dn; if ρ < 0, then d1 > λ1 > d2 > · · · > dn > λn.

(iii) There exist constants αi ≠ 0 such that

        vi = αi(D − λiI)^{-1}z,    i = 1, 2, · · · , n.

Proof: From the assumption, we have

    (D + ρzz^T)vi = λivi,    ‖vi‖2 = 1.

It follows from Lemma 8.1 that D − λiI is nonsingular. Thus,

    vi = −ρ(z^T vi)(D − λiI)^{-1}z,    i = 1, 2, · · · , n,        (8.7)

thereby establishing (iii). Note that D + ρzz^T has distinct eigenvalues: otherwise, if λi = λj, then by (8.7) vi and vj would be linearly dependent, which contradicts the orthogonality of vi and vj.

Multiplying both sides of (8.7) by z^T and noting that z^T vi ≠ 0, we have

    1 = −ρz^T(D − λiI)^{-1}z,


i.e.,f(λi) = 0, i = 1, 2, · · · , n.

Thus, λi, i = 1, 2, · · · , n, are the roots of f(λ). Next we prove that f(λ) has exactly nzeros. Note that

    f(λ) = 1 + ρ( z1²/(d1 − λ) + · · · + zn²/(dn − λ) ),

and moreover,

    f′(λ) = ρ( z1²/(d1 − λ)² + · · · + zn²/(dn − λ)² ).

Thus, f(λ) is strictly monotone between consecutive poles d_{i+1} and d_i: if ρ > 0, f(λ) is strictly increasing there; if ρ < 0, f(λ) is strictly decreasing. Therefore, it is easy to see that f(λ) has exactly n roots, one in each of the intervals

    (dn, d_{n−1}), · · · , (d2, d1), (d1, ∞),

if ρ > 0, and one in each of the intervals

    (−∞, dn), (dn, d_{n−1}), · · · , (d2, d1),

if ρ < 0. Thus, (i) and (ii) are established.

By Theorem 8.8, we can compute the spectral decomposition of D + ρzz^T efficiently in the following two steps (a short sketch follows the list):

(1) Find the roots λ1, λ2, · · · , λn of f(λ). There is a unique root of f(λ) in each of the intervals (d_{i+1}, d_i) and f(λ) is strictly monotone in each interval. Thus, this step can be implemented quickly and stably by using a Newton-like method [14].

(2) Compute

        vi = (D − λiI)^{-1}z / ‖(D − λiI)^{-1}z‖2,    i = 1, 2, · · · , n.
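The Python sketch below carries out these two steps under the hypotheses of Theorem 8.8; for simplicity it uses bisection on the intervals of part (ii) instead of the Newton-like method mentioned in step (1), and all names and tolerances are our own choices.

    import numpy as np

    def secular_eig(d, rho, z, iters=200):
        """Spectral decomposition of D + rho z z^T (d strictly decreasing, z_i != 0):
        roots of f(lam) = 1 + rho z^T (D - lam I)^{-1} z and the eigenvectors of (iii)."""
        d = np.asarray(d, dtype=float)
        z = np.asarray(z, dtype=float)
        n = len(d)
        f = lambda lam: 1.0 + rho * np.sum(z ** 2 / (d - lam))
        gap = abs(rho) * (z @ z) + 1.0
        lams = np.empty(n)
        for i in range(n):
            # bracket from Theorem 8.8 (ii); shrink slightly to avoid the poles
            if rho > 0:
                lo, hi = d[i], (d[i - 1] if i > 0 else d[0] + gap)
            else:
                lo, hi = (d[i + 1] if i < n - 1 else d[-1] - gap), d[i]
            eps = 1e-14 * (abs(lo) + abs(hi) + 1.0)
            lo, hi = lo + eps, hi - eps
            for _ in range(iters):            # bisection: f is monotone on (lo, hi)
                mid = 0.5 * (lo + hi)
                if (f(mid) > 0) == (rho > 0):
                    hi = mid
                else:
                    lo = mid
            lams[i] = 0.5 * (lo + hi)
        # eigenvectors: v_i proportional to (D - lam_i I)^{-1} z
        V = np.column_stack([(z / (d - lam)) / np.linalg.norm(z / (d - lam))
                             for lam in lams])
        return lams, V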

The spectral decomposition of a general D + ρzzT can be turned into the case asin Theorem 8.8. To this end, we can prove the following theorem constructively.

Theorem 8.9 Let D = diag(d1, · · · , dn) ∈ Rn×n and z ∈ Rn. Then there exists anorthogonal matrix V and a permutation π of {1, 2, · · · , n} such that

(i) V T z = (η1, · · · , ηr, 0, · · · , 0)T where ηi 6= 0, for i = 1, 2, · · · , r.

(ii) V T DV = diag(dπ(1), · · · , dπ(n)) where dπ(1) > dπ(2) > · · · > dπ(r).


Proof: Suppose that two indices i < j satisfy di = dj. Then we can choose a rotation Pij = G(i, j, θ) such that the j-th component of Pijz is zero. It is easy to show that Pij^T D Pij = D. After several such steps, we can find an orthogonal matrix V1, a product of rotations, such that

    V1^T D V1 = D,    V1^T z = (ξ1, · · · , ξn)^T,

with the property that if ξiξj ≠ 0 (i ≠ j), then di ≠ dj. If ξi = 0 and ξj ≠ 0 for some i < j, then we can use a permutation matrix to interchange columns i and j. In this way, we can find a permutation matrix P1 which permutes all the nonzero ξj to the front, i.e.,

    P1^T V1^T z = (ξ_{π1(1)}, · · · , ξ_{π1(n)})^T

with

    ξ_{π1(i)} ≠ 0, i = 1, 2, · · · , r,    and    ξ_{π1(i)} = 0, i = r + 1, · · · , n.

Here π1 is a permutation of {1, 2, · · · , n}. It follows from the construction of P1 that

    P1^T V1^T D V1 P1 = P1^T D P1 = diag(d_{π1(1)}, · · · , d_{π1(n)}),

where the first r diagonal entries d_{π1(1)}, · · · , d_{π1(r)} are distinct. Finally, we can find another permutation matrix P2 of order r such that

    P2^T diag(d_{π1(1)}, · · · , d_{π1(r)}) P2 = diag(µ1, · · · , µr),

where µ1 > µ2 > · · · > µr. Let

    V = V1P1 diag(P2, I_{n−r})

and let π be the permutation of {1, 2, · · · , n} determined by P1 and P2. Then

    V^T z = (ξ_{π(1)}, · · · , ξ_{π(r)}, 0, · · · , 0)^T = (η1, · · · , ηr, 0, · · · , 0)^T,

where ηi ≠ 0 for i = 1, 2, · · · , r, and

    V^T D V = diag(d_{π(1)}, · · · , d_{π(n)})

with d_{π(1)} > d_{π(2)} > · · · > d_{π(r)}.

The proof is complete.


For any D = diag(d1, · · · , dn) ∈ Rn×n and z ∈ Rn, by Theorem 8.9 we can construct an orthogonal matrix V such that

    V^T(D + ρzz^T)V = [ D1 + ρωω^T   0
                        0            D2 ],

where

    D1 = diag(d_{π(1)}, · · · , d_{π(r)}) ∈ Rr×r,    d_{π(1)} > · · · > d_{π(r)},
    D2 = diag(d_{π(r+1)}, · · · , d_{π(n)}) ∈ R(n−r)×(n−r),

and

    ω = (η1, · · · , ηr)^T,    ηi ≠ 0,    i = 1, 2, · · · , r.

Then we only need to compute the spectral decomposition of D1 + ρωω^T instead of D + ρzz^T.

Finally, we briefly introduce the parallel computation of the divide-and-conquermethod. For simplicity, we use a parallel computer with 4 processors to compute thespectral decomposition of a 4N -by-4N symmetric tridiagonal matrix T . It can besummarized by the following 4 steps:

(1) Tear

    T = [ T1   0
          0    T2 ] + ρvv^T,    T1 ∈ R2N×2N,    v ∈ R4N;

and

    Ti = [ Ti1   0
           0     Ti2 ] + ρi ωi ωi^T,

where Tij ∈ RN×N and ωi ∈ R2N, for i = 1, 2.

(2) Compute the spectral decompositions of T11, T12, T21 and T22 by 4 processors inparallel.

(3) Combine the spectral decompositions of T11, T12 to form a spectral decompositionof T1, and combine the spectral decompositions of T21, T22 to form a spectraldecomposition of T2. These can be implemented by 4 processors at the sametime.

(4) Combine the spectral decompositions of T1 and T2 to form a spectral decomposition of T. This can be done by 4 processors in parallel.

From discussions above, we know that the divide-and-conquer method can be usedfor computing all the eigenvalues and eigenvectors of any large symmetric tridiagonalmatrix in parallel.


Exercises:

1. Compute the Schur decomposition of

    A = [ 1   2
          2   3 ].

2. Show that if X ∈ Rn×r with r ≤ n, and ‖XT X − I‖2 = τ < 1, then

σmin(X) ≥ 1− τ,

where σmin denotes the smallest singular value.

3. Show that A = B + iC is Hermitian if and only if

    M = [ B   −C
          C    B ]

is symmetric. Relate the eigenvalues and eigenvectors of A to those of M .

4. Relate the singular values and the singular vectors of A = B + iC to those of

    [ B   −C
      C    B ],

where B, C ∈ Rm×n.

5. Use the singular value decomposition to show that if A ∈ Rm×n with m ≥ n, then thereexist a matrix Q ∈ Rm×n with QT Q = I and a positive semi-definite matrix P ∈ Rn×n

such that A = QP .

6. Let

    A = [ I    B
          B∗   I ]

with ‖B‖2 < 1. Show that

    ‖A‖2‖A^{-1}‖2 = (1 + ‖B‖2)/(1 − ‖B‖2).

7. Let

    A = [ a1   b1
          c1   a2   b2
               ⋱    ⋱        ⋱
                    ⋱        ⋱        b_{n−1}
                             c_{n−1}  an ],

where bici > 0. Show that there exists a diagonal matrix D such that D^{-1}AD is a symmetric tridiagonal matrix.


8. Let

    T = [  2   −1
          −1    2   −1
               −1    2   −1
                    −1    2 ].

(1) Is T positive definite?

(2) How many eigenvalues of T lie in the interval [0, 2]?

9. Let A, E ∈ Rn×n be two symmetric matrices. Show that if A is positive definite and‖A−1‖2‖E‖2 < 1, then A + E is also positive definite.

10. Let A ∈ Rm×n with m ≥ n, and assume that the singular values of A are ordered as

    σ1 ≥ σ2 ≥ · · · ≥ σn.

Show that

    σi = max_{dim(S)=i} min_{0≠u∈S} ‖Au‖2/‖u‖2 = min_{dim(S)=n−i+1} max_{0≠u∈S} ‖Au‖2/‖u‖2,

where S is any subspace of Rn.

11. Let A = [aij] ∈ Rn×n be symmetric and satisfy

    (1) aii > 0, i = 1, 2, · · · , n,
    (2) aij ≤ 0, i ≠ j,
    (3) Σ_{i=1}^{n} ai1 > 0,
    (4) Σ_{i=1}^{n} aij = 0, j = 2, 3, · · · , n.

Prove that the eigenvalues of A are nonnegative.

12. Let A be symmetric and have bandwidth p. Show that if we perform the shifted QR iteration A − µI = QR and Ã = RQ + µI, then Ã also has bandwidth p.


Chapter 9

Applications

In this chapter, we will briefly survey some of the latest developments in using boundary value methods (BVMs) for solving initial value problems for systems of ordinary differential equations (ODEs). These methods require the solution of one or more nonsymmetric, large and sparse linear systems. Therefore, we will use the GMRES method studied in Chapter 6 with some preconditioners for solving these linear systems. One of the main results is that if an Aν1,ν2-stable BVM is used for an n-by-n system of ODEs, then the preconditioned matrix can be decomposed as I + L, where I is the identity matrix and the rank of L is at most 2n(ν1 + ν2). When the GMRES method is applied to the preconditioned systems, the method will converge in at most 2n(ν1 + ν2) + 1 iterations. Applications to different kinds of delay differential equations (DDEs) are also given. For the literature on BVMs for ODEs and DDEs, we refer to [3, 4, 5, 8, 10, 23, 24, 25, 26, 29, 30].

9.1 Introduction

Let us begin with the initial value problem:

    y′(t) = Jny(t) + g(t),    t ∈ (t0, T],
    y(t0) = z,        (9.1)

where y(t), g(t) : R→ Rn, z ∈ Rn, and Jn ∈ Rn×n. The initial value methods (IVMs),such as the Runge-Kutta methods, are well-known methods for solving (9.1), see [40].Recently, another class of methods called the boundary value methods (BVMs) hasbeen proposed in [5]. Using BVMs to discretize (9.1), we obtain a linear system

Mu = b.

The advantage of using BVMs is that the methods are more stable and the resultinglinear system Mu = b is hence more well-conditioned. However, this system is in


general large and sparse (with band-structure), and solving it is a major problem inthe application of BVMs. The GMRES method studied in Chapter 6 will be used forsolving Mu = b. In order to speed up the convergence of the GMRES iterations, apreconditioner S called the Strang-type block-circulant preconditioner [10] is used toprecondition the discrete system. The advantage of the Strang-type preconditioner isthat if an Aν1,ν2-stable BVM is used for solving (9.1), then S is invertible and thepreconditioned matrix can be decomposed as

S−1M = I + L,

where the rank of L is at most 2n(ν1 + ν2) which is independent of the integration stepsize. It follows that the GMRES method applied to the preconditioned system willconverge in at most 2n(ν1 + ν2) + 1 iterations in exact arithmetic.

The outline of this chapter is as follows. In Section 9.2, we will give some backgroundknowledge about the linear multistep formulas (LMFs) and BVMs. Then, we willinvestigate the properties of the Strang-type block-circulant preconditioner for ODEsin Section 9.3. The convergence and cost analysis of the method will also be givenwith a numerical example. Finally, we discuss the applications of the Strang-typepreconditioner with BVMs for solving different kinds of delay differential equations(DDEs) in Sections 9.4–9.6.

9.2 Background of BVMs

We give some background knowledge on LMFs and BVMs in this section.

9.2.1 Linear multistep formulas

Consider an initial value problem

y′ = f(t, y), t ∈ (t0, T ],

y(t0) = y0,

where y(t) : R → R and f(t, y) : R2 → R. The µ-step linear multistep formula (LMF) over a uniform mesh with step size h is defined as follows:

∑_{j=0}^{µ} α_j y_{m+j} = h ∑_{j=0}^{µ} β_j f_{m+j},   m = 0, 1, · · · ,    (9.2)

where ym is the discrete approximation to y(tm) and fm denotes f(tm, ym). To get the solution of (9.2), we need µ initial conditions

y0, y1, · · · , yµ−1.


Since only y0 is provided from the original problem, we have to find additional conditions for the remaining values

y1, y2, · · · , yµ−1.

The equation (9.2) with µ − 1 additional conditions is called an initial value method (IVM). An IVM is called implicit if βµ ≠ 0 and explicit if βµ = 0. If an IVM is applied to an initial value problem on the interval [t0, tN+µ−1], we have the following discrete problem

AN y = hBN f + ( ∑_{i=0}^{µ−1}(αi yi − hβi fi), · · · , α0 yµ−1 − hβ0 fµ−1, 0, · · · , 0 )T,    (9.3)

where y = (yµ, yµ+1, · · · , yN+µ−1)T, f = (fµ, fµ+1, · · · , fN+µ−1)T,

and AN, BN ∈ RN×N are the lower triangular band Toeplitz matrices (with lower bandwidth µ) whose first columns are (αµ, αµ−1, · · · , α0, 0, · · · , 0)T and (βµ, βµ−1, · · · , β0, 0, · · · , 0)T, respectively. We recall that a matrix is said to be Toeplitz if its entries are constant along its diagonals. Moreover, the linear system (9.3) can be solved easily by forward recursion. A classical example of an IVM is the second order backward differentiation formula (BDF),

3ym+2 − 4ym+1 + ym = 2hfm+1,

which is a two-step method with α0 = 1, α1 = −4, α2 = 3 and β1 = 2.

Instead of using an IVM with µ initial conditions for solving (9.1), we can also use

the so-called boundary value methods (BVMs). Given ν1, ν2 ≥ 0 such that ν1 + ν2 = µ, the corresponding BVM requires ν1 initial additional conditions

y0, y1, · · · , yν1−1,

and ν2 final additional conditions

yN , yN+1, · · · , yN+ν2−1,


which are called (ν1, ν2)-boundary conditions. Note that the class of BVMs contains the class of IVMs (i.e., ν1 = µ, ν2 = 0).

The discrete problem generated by a µ-step BVM with (ν1, ν2)-boundary conditions can be written in the following matrix form

Ay = hBf + ( ∑_{i=0}^{ν1−1}(αi yi − hβi fi), · · · , α0 yν1−1 − hβ0 fν1−1, 0, · · · , 0, αµ yN − hβµ fN, · · · , ∑_{i=1}^{ν2}(αν1+i yN−1+i − hβν1+i fN−1+i) )T,

where y = (yν1, yν1+1, · · · , yN−1)T, f = (fν1, fν1+1, · · · , fN−1)T,

A and B ∈ R(N−ν1)×(N−ν1) are defined as follows,

A is the band Toeplitz matrix whose first column is (αν1, αν1−1, · · · , α0, 0, · · · , 0)T and whose first row is (αν1, αν1+1, · · · , αµ, 0, · · · , 0); B is built in the same way from β0, · · · , βµ.    (9.4)

Note that the coefficient matrices are band Toeplitz with lower bandwidth ν1 and upper bandwidth ν2. An example of BVMs is the third order generalized backward differentiation formula (GBDF),

2ym+1 + 3ym − 6ym−1 + ym−2 = 6hfm,

which is a three-step method with (2, 1)-boundary conditions where

α0 = 1, α1 = −6, α2 = 3, α3 = 2, β2 = 6.

Although IVMs are more efficient than BVMs (which cannot be solved by forward recursion), the advantage in using BVMs over IVMs comes from their stability properties. For example, the usual BDF are not A-stable for µ > 2 but the GBDF are Aν1,ν2-stable for any µ ≥ 1, see for instance [1] and [5, p. 79 and Figures 5.1–5.3].


9.2.2 Block-BVMs and their matrix forms

Let µ = ν1 + ν2. By using the µ-step block-BVM based on LMF over a uniform mesh h = (T − t0)/s for solving (9.1), we have:

∑_{i=−ν1}^{ν2} α_{i+ν1} y_{m+i} = h ∑_{i=−ν1}^{ν2} β_{i+ν1} f_{m+i},   m = ν1, . . . , s − ν2.    (9.5)

Here, ym is the discrete approximation to y(tm),

fm = Jnym + gm, gm = g(tm).

Also, (9.5) requires ν1 initial conditions and ν2 final conditions which are provided by the following µ − 1 additional equations:

∑_{i=0}^{µ} α_i^{(j)} y_i = h ∑_{i=0}^{µ} β_i^{(j)} f_i,   j = 1, . . . , ν1 − 1,    (9.6)

and

∑_{i=0}^{µ} α_{µ−i}^{(j)} y_{s−i} = h ∑_{i=0}^{µ} β_{µ−i}^{(j)} f_{s−i},   j = s − ν2 + 1, . . . , s.    (9.7)

The coefficients {α(j)}, {β(j)} in (9.6) and (9.7) should be chosen such that the truncation errors for these initial and final conditions are of the same order as that in (9.5). By combining (9.5), (9.6), (9.7) and the initial condition y(t0) = y0 = z, the discrete system of (9.1) is given by the following block form

My ≡ (A⊗ In − hB ⊗ Jn)y = e1 ⊗ z + h(B ⊗ In)g. (9.8)

Here

e1 = (1, 0, · · · , 0)T ∈ Rs+1,  y = (y0^T, · · · , ys^T)T ∈ R(s+1)n,  g = (g0^T, · · · , gs^T)T ∈ R(s+1)n,


and A, B ∈ R(s+1)×(s+1) have the following structure: the first row of A is (1, 0, · · · , 0), corresponding to the initial condition; for j = 1, · · · , ν1 − 1, the (j + 1)-st row of A begins with the coefficients α(j)0, · · · , α(j)µ of (9.6), followed by zeros; the middle rows form the band Toeplitz part built from α0, · · · , αµ, each row shifted one column to the right, as in (9.4); and for j = s − ν2 + 1, · · · , s, the (j + 1)-st row ends with the coefficients α(j)0, · · · , α(j)µ of (9.7). The matrix B has exactly the same structure with the α's replaced by the β's and with a zero first row,

and “⊗” is the tensor product. We recall that the tensor product of A = (aij) ∈ Rm×n and B ∈ Rp×q is defined as

follows:

A ⊗ B ≡
[ a11B  a12B  · · ·  a1nB ]
[ a21B  a22B  · · ·  a2nB ]
[  ...    ...           ... ]
[ am1B  am2B  · · ·  amnB ]

which is an mp-by-nq matrix. The basic properties of the tensor product can be found in [19, 22].
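As a quick illustration (a sketch in Python/NumPy, not taken from the text), the tensor product defined above is exactly the Kronecker product computed by np.kron:

```python
import numpy as np

# Tensor (Kronecker) product of a 2-by-2 and a 2-by-3 matrix: the result is 4-by-6,
# with block (i, j) equal to a[i, j] * B, as in the definition above.
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1, 2],
              [3, 4, 5]])
K = np.kron(A, B)
print(K.shape)                               # (4, 6)
print(np.allclose(K[:2, :3], A[0, 0] * B))   # upper-left block is a11 * B
```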


We remark that usually the linear system (9.8) is large and sparse (with band-structure), and solving it is a major problem in the application of the BVMs. We will use the GMRES method in Chapter 6 for solving (9.8). In order to speed up the convergence rate of the GMRES iterations, we will use a preconditioner S called the Strang-type block-circulant preconditioner.

9.3 Strang-type preconditioner for ODEs

Now, we construct the Strang-type block-circulant preconditioner S for solving (9.8). We will show that the main advantages of the Strang-type preconditioner are:

(1) S is invertible if an Aν1,ν2-stable BVM is used.

(2) The spectrum of the preconditioned system is clustered.

(3) The operation cost for each iteration of the preconditioned GMRES method is smaller than that of direct solvers.

9.3.1 Construction of preconditioner

We first recall the definition of Strang’s circulant preconditioner for Toeplitz matrices. Given any Toeplitz matrix

Tl = [t_{i−j}]_{i,j=1}^{l} = [t_q],

Strang’s preconditioner s(Tl) is a circulant matrix with diagonals given by

[s(Tl)]_q =
  t_q,            0 ≤ q ≤ ⌊l/2⌋,
  t_{q−l},        ⌊l/2⌋ < q < l,
  [s(Tl)]_{l+q},  0 < −q < l,

see [9, 22, 37].

Neglecting the perturbations of A and B, we propose the following preconditioner S ∈ R(s+1)n×(s+1)n, called the Strang-type block-circulant preconditioner for M given in (9.8):

S = s(A)⊗ In − hs(B)⊗ Jn, (9.9)


where

s(A) is the (s + 1)-by-(s + 1) circulant matrix whose first column is (αν1, αν1−1, · · · , α0, 0, · · · , 0, αµ, αµ−1, · · · , αν1+1)T (equivalently, its first row is (αν1, αν1+1, · · · , αµ, 0, · · · , 0, α0, · · · , αν1−1)),

and s(B) is defined similarly by using {βi}_{i=0}^{µ} instead of {αi}_{i=0}^{µ} in s(A). The {αi}_{i=0}^{µ} and {βi}_{i=0}^{µ} here are the coefficients given in (9.5). We remark that s(A) and s(B) are just Strang’s circulant preconditioners for the Toeplitz matrices A and B given by (9.4).

We will show that the preconditioner S is invertible provided that the given BVM is Aν1,ν2-stable and the eigenvalues of Jn are in

C− ≡ {q ∈ C : Re(q) < 0}

where Re(·) denotes the real part of a complex number. The stability of a BVM is closely related to two characteristic polynomials of degree µ = ν1 + ν2, defined as follows:

ρ(z) ≡ ∑_{j=−ν1}^{ν2} α_{j+ν1} z^{j+ν1}   and   σ(z) ≡ ∑_{j=−ν1}^{ν2} β_{j+ν1} z^{j+ν1}.    (9.10)

The Aν1,ν2-stability polynomial is defined by

π(z, q) ≡ ρ(z)− qσ(z) (9.11)

where z, q ∈ C.

Consider now the equation π(z, q) = 0. It defines a mapping between the complex z-plane and the complex q-plane. For every z ∈ C which is a root of π(z, q), (9.11) provides

q = q(z) = ρ(z)/σ(z).

Let

Γ ≡ { q ∈ C : q = ρ(e^{iθ})/σ(e^{iθ}), 0 ≤ θ < 2π }.    (9.12)


The set Γ corresponds to the roots on the unit circumference and is called the boundary locus. We have the following definition and lemma, see [5].

Definition 9.1 Consider a BVM with the Aν1,ν2-stability polynomial π(z, q) defined by (9.11). The region

Dν1,ν2 = {q ∈ C : π(z, q) has ν1 zeros inside |z| = 1 and ν2 zeros outside |z| = 1} is called the region of Aν1,ν2-stability of the given BVM. Moreover, the BVM is said to be Aν1,ν2-stable if

C− ⊆ Dν1,ν2 .

Lemma 9.1 If a BVM is Aν1,ν2-stable and Γ is defined by (9.12), then Re(q) ≥ 0 for all q ∈ Γ.
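As a numerical illustration of Lemma 9.1 (a sketch assuming the third order GBDF of Section 9.2.1, for which β2 = 6 is the only nonzero β), one can sample the boundary locus (9.12) and observe that Re(q) ≥ 0 on Γ:

```python
import numpy as np

# Third order GBDF: 2*y_{m+1} + 3*y_m - 6*y_{m-1} + y_{m-2} = 6*h*f_m,
# i.e. alpha = (1, -6, 3, 2), beta = (0, 0, 6, 0), with nu1 = 2, nu2 = 1.
alpha = np.array([1.0, -6.0, 3.0, 2.0])
beta = np.array([0.0, 0.0, 6.0, 0.0])

theta = np.linspace(0.0, 2.0 * np.pi, 2001)
z = np.exp(1j * theta)
rho = np.polyval(alpha[::-1], z)       # rho(z)   = sum_j alpha_j z^j
sigma = np.polyval(beta[::-1], z)      # sigma(z) = sum_j beta_j z^j
q = rho / sigma                        # boundary locus Gamma of (9.12)
print(q.real.min())                    # >= 0 up to rounding, as Lemma 9.1 asserts
```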

Now, we want to show that the preconditioner S is invertible under the stability condition.

Theorem 9.1 If the BVM for (9.1) is Aν1,ν2-stable and hλk(Jn) ∈ Dν1,ν2 where λk(Jn), k = 1, · · · , n, are the eigenvalues of Jn, then the preconditioner S defined by (9.9) is invertible.

Proof: Since s(A) and s(B) are circulant matrices, their eigenvalues are given by

gA(z) ≡ αµ z^{ν2} + · · · + αν1 + αν1−1 z^{−1} + · · · + α0 z^{−ν1} = ρ(z)/z^{ν1}

and

gB(z) ≡ βµ z^{ν2} + · · · + βν1 + βν1−1 z^{−1} + · · · + β0 z^{−ν1} = σ(z)/z^{ν1},

evaluated at ωj = e^{2πij/(s+1)} where i ≡ √−1, for j = 0, · · · , s, see [9, 13]. The eigenvalues

λjk(S) of S are therefore given by

λjk(S) = gA(ωj)− hλk(Jn)gB(ωj), j = 0, · · · , s, k = 1, · · · , n.

Since the BVM is Aν1,ν2-stable, the µ-degree polynomial

π[z, hλk(Jn)] = ρ(z)− hλk(Jn)σ(z)

has no roots on the unit circle |z| = 1 if hλk(Jn) ∈ Dν1,ν2. Thus for all k = 1, · · · , n, and any z with |z| = 1, we have

gA(z) − hλk(Jn) gB(z) = (1/z^{ν1}) π[z, hλk(Jn)] ≠ 0.


It follows that λjk(S) ≠ 0, j = 0, · · · , s, k = 1, · · · , n.

Thus S is invertible.

In particular, we have

Corollary 9.1 If the BVM is Aν1,ν2-stable and λk(Jn) ∈ C−, then the preconditioner S is invertible.
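The eigenvalue formula used in the proof is easy to verify numerically: the FFT of the first column of a circulant matrix yields its eigenvalues, so for s(A) it should reproduce gA(ωj) = ρ(ωj)/ωj^{ν1}. A sketch (Python/NumPy, third order GBDF coefficients as sample data):

```python
import numpy as np

# First column of s(A): (a_{nu1}, ..., a_0, 0, ..., 0, a_mu, ..., a_{nu1+1}).
alpha = np.array([1.0, -6.0, 3.0, 2.0])     # third order GBDF, nu1 = 2, nu2 = 1
nu1, nu2, l = 2, 1, 10

c = np.zeros(l)
c[:nu1 + 1] = alpha[nu1::-1]
c[l - nu2:] = alpha[:nu1:-1]
lam_fft = np.fft.fft(c)                               # eigenvalues of s(A)

w = np.exp(2j * np.pi * np.arange(l) / l)             # w_j = exp(2*pi*i*j/(s+1))
lam_formula = np.polyval(alpha[::-1], w) / w ** nu1   # rho(w_j) / w_j^{nu1}
print(np.allclose(lam_fft, lam_formula))              # True
```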

9.3.2 Convergence rate and operation cost

We have the following theorem for the convergence rate.

Theorem 9.2 We have S−1M = In(s+1) + L, where rank(L) ≤ 2nµ.

Proof: Let E = M − S. We have by (9.8) and (9.9),

E = (A − s(A)) ⊗ In − h(B − s(B)) ⊗ Jn = LA ⊗ In − hLB ⊗ Jn.

It is easy to check that LA and LB are (s + 1)-by-(s + 1) matrices with nonzero entries only in the following four corners: a ν1-by-(µ + 1) block in the upper left; a ν1-by-ν1 block in the upper right; a ν2-by-(µ + 1) block in the lower right; and a ν2-by-ν2 block in the lower left. By noting that µ = ν1 + ν2, we then have

rank(LA) ≤ µ, rank(LB) ≤ µ.

Therefore, rank(LA ⊗ In) = rank(LA) · n ≤ nµ

and rank(LB ⊗ Jn) = rank(LB) · rank(Jn) ≤ nµ.

Thus, S−1M = In(s+1) + S−1E = In(s+1) + L,

where the rank of L is at most 2nµ.

Therefore, when the GMRES method is applied to

S−1My = S−1b,


by Theorems 6.13 and 9.2, we know that the method will converge in at most 2nµ + 1 iterations in exact arithmetic.
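The corner structure behind Theorem 9.2 is easy to check numerically. The sketch below (Python/SciPy) looks only at the Toeplitz part of A from (9.4) — the boundary rows of (9.8) are ignored — and uses the third order GBDF coefficients as sample data to confirm that A − s(A) has rank at most µ:

```python
import numpy as np
from scipy.linalg import toeplitz, circulant

alpha = np.array([1.0, -6.0, 3.0, 2.0])      # third order GBDF, nu1 = 2, nu2 = 1
nu1, nu2, l = 2, 1, 12
mu = nu1 + nu2

col = np.zeros(l); col[:nu1 + 1] = alpha[nu1::-1]    # first column of A
row = np.zeros(l); row[:nu2 + 1] = alpha[nu1:]       # first row of A
A = toeplitz(col, row)

c = np.zeros(l)                                      # first column of s(A)
c[:l // 2 + 1] = col[:l // 2 + 1]
c[l // 2 + 1:] = row[1:l - l // 2][::-1]             # wrap the upper diagonals around
sA = circulant(c)

L_A = A - sA                                         # nonzero only in the corners
print(np.linalg.matrix_rank(L_A), "<=", mu)          # rank(L_A) <= mu
```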

Regarding the cost per iteration, the main work in each iteration for the GMRES method is the matrix-vector multiplication

S−1Mz = (s(A)⊗ In − hs(B)⊗ Jn)−1 (A⊗ In − hB ⊗ Jn)z,

see Section 6.5. Since A, B are band matrices and Jn is assumed to be sparse, the matrix-vector multiplication

Mz = (A⊗ In − hB ⊗ Jn)z

can be done very fast.

Now we compute S−1(Mz). Note that any circulant matrix can be diagonalized by

the Fourier matrix F, see Section 6.4 and [13]. Since s(A) and s(B) are circulant, we have the following decompositions by (6.12),

s(A) = FΛAF ∗, s(B) = FΛBF ∗

where ΛA, ΛB are diagonal matrices containing the eigenvalues of s(A), s(B) respectively. It follows that

S−1(Mz) = (F ∗ ⊗ In)(ΛA ⊗ In − hΛB ⊗ Jn)−1(F ⊗ In)(Mz).

This product can be obtained by using FFTs and solving s + 1 linear systems of order n. Since Jn is sparse, the matrix

ΛA ⊗ In − hΛB ⊗ Jn

will also be sparse. Thus S−1(Mz) can be obtained by solving s + 1 sparse linear systems of order n. It follows that the total number of operations per iteration is

γ1n(s + 1) log(s + 1) + γ2(s + 1)nq,

where q is the number of nonzeros of Jn, and γ1 and γ2 are some positive constants. For a comparison of the computational cost of the method with direct solvers for the linear system (9.8), we refer to [10].
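A minimal sketch of this FFT-based solve (Python/NumPy; the function and variable names are ours, and a dense solve per block stands in for the sparse solver a real implementation would use):

```python
import numpy as np
from scipy.linalg import circulant

def solve_block_circulant(cA, cB, h, Jn, v):
    """Solve (circ(cA) x I_n - h * circ(cB) x Jn) y = v by FFT diagonalization.
    cA, cB are the first columns of the circulant blocks s(A), s(B); v is stored
    block by block, i.e. v = (v_0^T, ..., v_s^T)^T with blocks of length n."""
    s1, n = len(cA), Jn.shape[0]
    lamA, lamB = np.fft.fft(cA), np.fft.fft(cB)   # eigenvalues of the circulants
    W = np.fft.fft(v.reshape(s1, n), axis=0)      # unnormalized DFT along the block index
    X = np.empty((s1, n), dtype=complex)
    for j in range(s1):                           # s + 1 small systems of order n
        X[j] = np.linalg.solve(lamA[j] * np.eye(n) - h * lamB[j] * Jn, W[j])
    return np.fft.ifft(X, axis=0).real.reshape(-1)   # real since S is real

# Quick check against a dense solve on small, well-conditioned sample data.
s1, n, h = 8, 3, 0.1
cA = np.zeros(s1); cA[0], cA[1] = 4.0, 1.0
cB = np.zeros(s1); cB[0], cB[1] = 1.0, 0.5
Jn = np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
S = np.kron(circulant(cA), np.eye(n)) - h * np.kron(circulant(cB), Jn)
v = np.arange(1.0, s1 * n + 1.0)
print(np.allclose(solve_block_circulant(cA, cB, h, Jn, v), np.linalg.solve(S, v)))
```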

9.3.3 Numerical result

Now we give an example to illustrate the efficiency of the preconditioner S by solving a test problem given in [4]. The experiments were performed in MATLAB. We used the MATLAB-provided M-file “gmres” to solve the preconditioned systems. We should


emphasize that in all of our tests in this chapter, the zero vector is the initial guess and the stopping criterion is

‖rq‖2 / ‖r0‖2 < 10^{−6},

where rq is the residual after q iterations. The BVM we used is the third order generalized Adams method (GAM). Its formula and the initial and final additional conditions can be found in [5].

Example 9.1. Heat equation:

∂u/∂t = ∂²u/∂x²,
u(0, t) = (∂u/∂x)(π, t) = 0,  t ∈ [0, 2π],
u(x, 0) = x,  x ∈ [0, π].

We discretize the partial differential operator ∂²/∂x² with central differences and step size equal to π/(n + 1). The system of ODEs obtained is:

y′(t) = Tny(t), t ∈ [0, 2π]

y(0) = (x1, x2, · · · , xn)T ,

where Tn is the scaled discrete Laplacian matrix: Tn equals (n + 1)²/π² times the n-by-n tridiagonal matrix with −2 on the diagonal and 1 on the sub- and superdiagonals, except that the last diagonal entry is −1.
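For reference, a short sketch (Python/NumPy, not the book's MATLAB script) that assembles Tn and confirms that its eigenvalues lie in C−, so that Corollary 9.1 applies to this test problem:

```python
import numpy as np

n = 24
scale = (n + 1) ** 2 / np.pi ** 2
Tn = np.zeros((n, n))
for i in range(n):
    Tn[i, i] = -2.0
    if i > 0:
        Tn[i, i - 1] = 1.0
    if i < n - 1:
        Tn[i, i + 1] = 1.0
Tn[n - 1, n - 1] = -1.0           # last diagonal entry is -1, as in the text
Tn *= scale

eigs = np.linalg.eigvals(Tn)
print(eigs.real.max() < 0)        # True: all eigenvalues have negative real part
```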

Table 9.1 lists the number of iterations required for convergence of the GMRES method for different n and s. In the table, I means no preconditioner is used and S denotes the Strang-type block-circulant preconditioner defined by (9.9). We see that the number of iterations required for convergence, when S is used, is much less than that when no preconditioner is used. The numbers under the column S stay almost constant for increasing s and n.

9.4 Strang-type preconditioner for DDEs

Now, we study delay differential equations (DDEs).


Table 9.1: Number of iterations for convergence.

n    s    I      S
24   6    19     4
24   12   70     4
24   24   152    4
24   48   227    3
24   96   314    3
48   6    47     4
48   12   167    4
48   24   359    4
48   48   >400   3
48   96   >400   3

9.4.1 Differential equations with multi-delays

Consider the solution of differential equations with multi-delays:

y′(t) = Jn y(t) + Dn^{(1)} y(t − τ1) + · · · + Dn^{(s)} y(t − τs) + f(t),  t ≥ t0,
y(t) = φ(t),  t ≤ t0,    (9.13)

where y(t), f(t), φ(t) : R → Rn; Jn, Dn^{(1)}, · · · , Dn^{(s)} ∈ Rn×n; and τ1, · · · , τs > 0 are some rational numbers.

In order to find a reasonable numerical solution, we require that the solution of (9.13) is asymptotically stable. We have the following lemma, see [31, 43].

Lemma 9.2 For any s ≥ 1, if η(Jn) ≡ (1/2) λmax(Jn + Jn^T) < 0 and

η(Jn) + ∑_{j=1}^{s} ‖Dn^{(j)}‖2 < 0,    (9.14)

then the solution of (9.13) is asymptotically stable.
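A small sketch of this sufficient condition (Python/NumPy; the matrices below are illustrative sample data only, not those of the later examples):

```python
import numpy as np

def is_asymptotically_stable(Jn, delay_mats):
    """Sufficient condition of Lemma 9.2: eta(Jn) + sum_j ||D_n^(j)||_2 < 0,
    where eta(Jn) = 0.5 * lambda_max(Jn + Jn^T)."""
    eta = 0.5 * np.linalg.eigvalsh(Jn + Jn.T).max()
    return eta + sum(np.linalg.norm(D, 2) for D in delay_mats) < 0

# Illustrative data: a strongly dissipative Jn and two small delay matrices.
n = 5
Jn = -4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1)
D1 = 0.5 * np.eye(n)
D2 = np.diag(0.3 * np.ones(n - 1), -1)
print(is_asymptotically_stable(Jn, [D1, D2]))   # True
```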

In the following, for simplicity, we only consider the case of s = 2 in (9.13). The generalization to any arbitrary s is straightforward. Let

h = τ1/m1 = τ2/m2

be the step size where m1 and m2 are positive integers with m2 > m1 (τ2 > τ1). For (9.13), by using a BVM with (ν1, ν2)-boundary conditions over a uniform mesh

tj = t0 + jh, j = 0, · · · , r1,


on the interval [t0, t0 + r1h], we have

∑_{i=0}^{µ} αi y_{p+i−ν1} = h ∑_{i=0}^{µ} βi (Jn y_{p+i−ν1} + Dn^{(1)} y_{p+i−ν1−m1} + Dn^{(2)} y_{p+i−ν1−m2} + f_{p+i−ν1}),    (9.15)

for p = ν1, · · · , r1 − 1, where µ = ν1 + ν2. By providing the values

y−m2 , · · · ,y−m1 , · · · ,y0, y1, · · · ,yν1−1, yr1 , · · · ,yr1+ν2−1, (9.16)

(9.15) can be written in a matrix form as

Ry = b

where

R ≡ A ⊗ In − hB ⊗ Jn − hC(1) ⊗ Dn^{(1)} − hC(2) ⊗ Dn^{(2)},    (9.17)

y = (yν1^T, yν1+1^T, · · · , yr1−1^T)T ∈ Rn(r1−ν1),

b ∈ Rn(r1−ν1) depends on f, the boundary values and the coefficients of the method. The matrices A, B ∈ R(r1−ν1)×(r1−ν1) are defined as in (9.4) and C(1), C(2) ∈ R(r1−ν1)×(r1−ν1)

are defined as follows:

C(1) and C(2) are the (r1 − ν1)-by-(r1 − ν1) lower triangular Toeplitz matrices built from the coefficients βµ, · · · , β0, with the band shifted down by m1 and m2 rows respectively.

We remark that the first column of C(1) is given by

(0, · · · , 0, βµ, · · · , β0, 0, · · · , 0)T, with m1 − ν2 leading zeros and r1 − m1 − 2ν1 − 1 trailing zeros,

and the first column of C(2) is given by

(0, · · · , 0, βµ, · · · , β0, 0, · · · , 0)T, with m2 − ν2 leading zeros and r1 − m2 − 2ν1 − 1 trailing zeros.


9.4.2 Construction of preconditioner

The Strang-type block-circulant preconditioner for (9.17) is defined as follows:

S ≡ s(A) ⊗ In − hs(B) ⊗ Jn − hs(C(1)) ⊗ Dn^{(1)} − hs(C(2)) ⊗ Dn^{(2)}    (9.18)

where s(E) is Strang’s circulant preconditioner of Toeplitz matrix E, for E = A, B,C(1), C(2) respectively.

Now we discuss the invertibility of the Strang-type preconditioner S defined by (9.18). Since any circulant matrix can be diagonalized by the Fourier matrix F, we have by (6.12),

s(E) = F ∗ΛEF,

where ΛE is the diagonal matrix holding the eigenvalues of s(E), for E = A, B, C(1),C(2) respectively. Therefore, we obtain

S = (F∗ ⊗ In)(ΛA ⊗ In − hΛB ⊗ Jn − hΛC(1) ⊗ Dn^{(1)} − hΛC(2) ⊗ Dn^{(2)})(F ⊗ In).

Note that the j-th block of

ΛA ⊗ In − hΛB ⊗ Jn − hΛC(1) ⊗ Dn^{(1)} − hΛC(2) ⊗ Dn^{(2)}

is given by

Sj = [ΛA]jj In − h[ΛB]jj Jn − h[ΛC(1)]jj Dn^{(1)} − h[ΛC(2)]jj Dn^{(2)},

for j = 1, 2, · · · , r1 − ν1. Let

wj = e^{2πij/(r1−ν1)}.

We have

[ΛA]jj = ρ(wj)/wj^{ν1},   [ΛB]jj = σ(wj)/wj^{ν1},
[ΛC(1)]jj = βµ wj^{−m1+ν2} + · · · + β0 wj^{−m1−ν1} = σ(wj)/wj^{m1+ν1},

and

[ΛC(2)]jj = βµ wj^{−m2+ν2} + · · · + β0 wj^{−m2−ν1} = σ(wj)/wj^{m2+ν1},

where ρ(z) and σ(z) are defined as in (9.10). Therefore,

Sj = (1/wj^{m2+ν1}) [ wj^{m2} (ρ(wj) In − hσ(wj) Jn − h wj^{−m1} σ(wj) Dn^{(1)}) − hσ(wj) Dn^{(2)} ].

We therefore only need to prove that Sj, j = 1, 2, · · · , r1 − ν1, are invertible. We have the following theorem.


Theorem 9.3 If the BVM with (ν1, ν2)-boundary conditions is Aν1,ν2-stable and (9.14) holds, then for any θ ∈ R, the matrix

e^{im2θ} (ρ(e^{iθ}) In − hσ(e^{iθ}) Jn − he^{−im1θ} σ(e^{iθ}) Dn^{(1)}) − hσ(e^{iθ}) Dn^{(2)}

is invertible. It follows that the Strang-type preconditioner S defined by (9.18) is also invertible.

Proof: Suppose that there exist x ∈ Cn with ‖x‖2 = 1 and θ ∈ R such that

[ e^{im2θ} (ρ(e^{iθ}) In − hσ(e^{iθ}) Jn − he^{−im1θ} σ(e^{iθ}) Dn^{(1)}) − hσ(e^{iθ}) Dn^{(2)} ] x = 0.

Then

x∗[ ρ(e^{iθ}) In − hσ(e^{iθ}) Jn − he^{−im1θ} σ(e^{iθ}) Dn^{(1)} − he^{−im2θ} σ(e^{iθ}) Dn^{(2)} ] x = 0,

i.e.,

ρ(e^{iθ}) − hσ(e^{iθ}) x∗Jn x − he^{−im1θ} σ(e^{iθ}) x∗Dn^{(1)} x − he^{−im2θ} σ(e^{iθ}) x∗Dn^{(2)} x = 0.

We therefore have

ρ(e^{iθ}) − (hx∗Jn x + he^{−im1θ} x∗Dn^{(1)} x + he^{−im2θ} x∗Dn^{(2)} x) σ(e^{iθ})
= π(e^{iθ}, h(x∗Jn x + e^{−im1θ} x∗Dn^{(1)} x + e^{−im2θ} x∗Dn^{(2)} x)) = 0

where π(z, q) is given by (9.11). Thus,

h(x∗Jn x + e^{−im1θ} x∗Dn^{(1)} x + e^{−im2θ} x∗Dn^{(2)} x) ∈ Γ,

where Γ is the boundary locus defined by (9.12). Since the BVM is Aν1,ν2-stable, from Lemma 9.1, we know that

Re(x∗Jn x + e^{−im1θ} x∗Dn^{(1)} x + e^{−im2θ} x∗Dn^{(2)} x) ≥ 0.

By the Cauchy-Schwarz inequality, we have

Re(e^{−im1θ} x∗Dn^{(1)} x) ≤ |x∗Dn^{(1)} x| ≤ ‖x‖2 ‖Dn^{(1)} x‖2 ≤ ‖Dn^{(1)}‖2 ‖x‖2 = ‖Dn^{(1)}‖2,

and similarly, Re(e^{−im2θ} x∗Dn^{(2)} x) ≤ ‖Dn^{(2)}‖2. Note that

η(Jn) = max_{‖x‖2=1} Re(x∗Jn x) ≥ Re(x∗Jn x).


Thus we have

η(Jn) + ‖Dn^{(1)}‖2 + ‖Dn^{(2)}‖2 ≥ 0,

which is a contradiction to (9.14). Therefore, the matrix

e^{im2θ} (ρ(e^{iθ}) In − hσ(e^{iθ}) Jn − he^{−im1θ} σ(e^{iθ}) Dn^{(1)}) − hσ(e^{iθ}) Dn^{(2)}

is invertible and it follows that the Strang-type preconditioner S is also invertible.

9.4.3 Convergence rate

Now, we discuss the convergence rate of the preconditioned GMRES method with the Strang-type block-circulant preconditioner. We have the following result for the spectra of preconditioned matrices, see [23].

Theorem 9.4 Let R be given by (9.17) and S be given by (9.18). Then we have

S−1R = In(r1−ν1) + L

where

rank(L) ≤ (2µ + m1 + m2 + 2ν1 + 2)n.

By Theorems 6.13 and 9.4, when the GMRES method is applied to

S−1Ry = S−1b,

the method will converge in at most (2µ + m1 + m2 + 2ν1 + 2)n + 1 iterations in exact arithmetic.

We know from Theorem 9.4 that if the step size h = τ1/m1 = τ2/m2 is fixed, the number of iterations for convergence of the GMRES method, when applied to the preconditioned system

S−1Ry = S−1b,

will be independent of r1 and therefore is independent of the length of the interval that we considered. We should emphasize that the numerical example in Section 9.4.4 shows a much faster convergence rate than that predicted by the estimate provided by Theorem 9.4. For the operation cost of our algorithm, we refer to [23, 30].


9.4.4 Numerical result

We illustrate the efficiency of our preconditioner by solving the following example. The BVM we used is the third order GBDF for t ∈ [0, 4].

Example 9.2. Consider

y′(t) = Jn y(t) + Dn^{(1)} y(t − 0.5) + Dn^{(2)} y(t − 1),  t ≥ 0,
y(t) = (sin t, 1, · · · , 1)T,  t ≤ 0,

where

Jn is the n-by-n symmetric five-diagonal matrix with −10 on the diagonal, 2 on the first sub- and superdiagonals and 1 on the second sub- and superdiagonals, and Dn^{(1)} is 1/n times the tridiagonal matrix with 2 on the diagonal and −1 on the sub- and superdiagonals,

and

Dn^{(2)} is 1/n times the tridiagonal matrix with 2 on the diagonal and 1 on the sub- and superdiagonals.

In practice, we do not have the boundary values

y1, · · · ,yν1−1, yr1 , · · · ,yr1+ν2−1,

provided by (9.16). Instead of giving the above values, as in Section 9.2, ν1 − 1 initial additional equations and ν2 final additional equations are given. We remark that after introducing the additional equations, the matrices A, B, C(1) and C(2) in (9.17) are Toeplitz matrices with small rank perturbations. Neglecting the small rank perturbations, we can also construct the Strang-type preconditioner (9.18).

Table 9.2 shows the number of iterations required for convergence of the GMRES method with different combinations of matrix size n and step size h. In the table, I means no preconditioner is used and S denotes the Strang-type block-circulant preconditioner defined by (9.18). We see that the numbers of iterations required for convergence increase slowly for increasing n and decreasing h under the column S.

9.5 Strang-type preconditioner for NDDEs

In this section, we study neutral delay differential equations (NDDEs).


Table 9.2: Number of iterations for convergence.

n    h      I     S
24   1/10   52    9
24   1/20   97    11
24   1/40   185   15
24   1/80   367   19
48   1/10   53    12
48   1/20   98    14
48   1/40   189   14
48   1/80   378   17

9.5.1 Neutral delay differential equations

We consider the solution of NDDE:

y′(t) = Lny′(t− τ) + Mny(t) + Nny(t− τ), t ≥ t0,

y(t) = φ(t), t ≤ t0,    (9.19)

where y(t), φ(t) : R → Rn; Ln, Mn, Nn ∈ Rn×n, and τ > 0 is a constant.

As in Section 9.4.1, we want to find an asymptotically stable solution for (9.19).

We have the following lemma, see [18, 28].

Lemma 9.3 Let Ln, Mn and Nn be any matrices with ‖Ln‖2 < 1. Then the solution of (9.19) is asymptotically stable if Re(λi) < 0 where λi, i = 1, · · · , n, are the eigenvalues of the matrix

(In − ηLn)−1(Mn + ηNn)

with |η| ≤ 1.
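Lemma 9.3 can be probed numerically by sampling η over the closed unit disk; this gives only a heuristic check, not a proof. A sketch (Python/NumPy, with illustrative sample matrices, not those of Example 9.3):

```python
import numpy as np

def spectral_abscissa_bound(Ln, Mn, Nn, n_samples=200, seed=0):
    """Sample |eta| <= 1 and return the largest real part of the eigenvalues of
    (I - eta*Ln)^{-1} (Mn + eta*Nn).  Lemma 9.3 asks this to be negative for
    every |eta| <= 1 (with ||Ln||_2 < 1); sampling only gives a numerical hint."""
    n = Ln.shape[0]
    rng = np.random.default_rng(seed)
    radii = np.sqrt(rng.uniform(0.0, 1.0, n_samples))     # roughly uniform over the disk
    angles = rng.uniform(0.0, 2.0 * np.pi, n_samples)
    worst = -np.inf
    for eta in radii * np.exp(1j * angles):
        G = np.linalg.solve(np.eye(n) - eta * Ln, Mn + eta * Nn)
        worst = max(worst, np.linalg.eigvals(G).real.max())
    return worst

# Illustrative data: ||Ln||_2 < 1 and a strongly dissipative Mn.
n = 4
Ln = 0.2 * np.eye(n)
Mn = -3.0 * np.eye(n) + np.diag(np.ones(n - 1), 1)
Nn = 0.4 * np.eye(n)
print(spectral_abscissa_bound(Ln, Mn, Nn) < 0)    # True for this data
```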

Let h = τ/k1 be the step size where k1 is a positive integer. For (9.19), by using a BVM with (ν1, ν2)-boundary conditions over a uniform mesh

tj = t0 + jh, j = 0, · · · , r2,

on the interval [t0, t0 + r2h], we have

∑_{i=0}^{µ} αi y_{p+i−ν1} = ∑_{i=0}^{µ} αi Ln y_{p+i−ν1−k1} + h ∑_{i=0}^{µ} βi (Mn y_{p+i−ν1} + Nn y_{p+i−ν1−k1}),    (9.20)

for p = ν1, · · · , r2 − 1, where µ = ν1 + ν2. By providing the values

y−k1 , · · · ,y0, y1, · · · ,yν1−1, yr2 , · · · ,yr2+ν2−1, (9.21)


(9.20) can be written in a matrix form as

Hy = b

where H ≡ A ⊗ In − A(1) ⊗ Ln − hB ⊗ Mn − hB(1) ⊗ Nn,    (9.22)

y = (yν1^T, yν1+1^T, · · · , yr2−1^T)T ∈ Rn(r2−ν1),

b ∈ Rn(r2−ν1) depends on the boundary values and the coefficients of the method. In (9.22), the matrices A, B ∈ R(r2−ν1)×(r2−ν1) are defined as in (9.4), and A(1), B(1) ∈ R(r2−ν1)×(r2−ν1) are given as follows:

A(1) and B(1) are the (r2 − ν1)-by-(r2 − ν1) lower triangular Toeplitz matrices built from the coefficients αµ, · · · , α0 and βµ, · · · , β0 respectively, with the band shifted down by k1 rows,

see [3]. We remark that the first column of A(1) is given by:

(0, · · · , 0, αµ, · · · , α0, 0, · · · , 0)T, with k1 − ν2 leading zeros and r2 − k1 − 2ν1 − 1 trailing zeros,

and the first column of B(1) is given by

(0, · · · , 0, βµ, · · · , β0, 0, · · · , 0)T, with k1 − ν2 leading zeros and r2 − k1 − 2ν1 − 1 trailing zeros.

9.5.2 Construction of preconditioner

The Strang-type block-circulant preconditioner for (9.22) is defined as follows:

S ≡ s(A)⊗ In − s(A(1))⊗ Ln − hs(B)⊗Mn − hs(B(1))⊗Nn (9.23)

where s(E) is Strang’s circulant preconditioner of Toeplitz matrix E, for E = A, B,A(1), B(1) respectively. By using (6.12) again, we have

S = (F ∗ ⊗ In) (ΛA ⊗ In − ΛA(1) ⊗ Ln − hΛB ⊗Mn − hΛB(1) ⊗Nn) (F ⊗ In),

where ΛE is the diagonal matrix holding the eigenvalues of s(E), for E = A, B, A(1),B(1) respectively.


Now we discuss the invertibility of the Strang-type preconditioner S. Let

wj = e^{2πij/(r2−ν1)}.

We have

[ΛA]jj = ρ(wj)/wj^{ν1},   [ΛB]jj = σ(wj)/wj^{ν1},
[ΛA(1)]jj = αµ wj^{−k1+ν2} + · · · + α0 wj^{−k1−ν1} = ρ(wj)/wj^{k1+ν1},

and

[ΛB(1)]jj = βµ wj^{−k1+ν2} + · · · + β0 wj^{−k1−ν1} = σ(wj)/wj^{k1+ν1},

where ρ(z) and σ(z) are defined as in (9.10). Thus the j-th block of

ΛA ⊗ In − ΛA(1) ⊗ Ln − hΛB ⊗Mn − hΛB(1) ⊗Nn

in S is given by

Sj = [ΛA]jj In − [ΛA(1)]jj Ln − h[ΛB]jj Mn − h[ΛB(1)]jj Nn
   = (1/wj^{k1+ν1}) [ wj^{k1} (ρ(wj) In − hσ(wj) Mn) − ρ(wj) Ln − hσ(wj) Nn ],

for j = 1, 2, · · · , r2 − ν1. In order to prove that S is invertible, we only need to show that Sj, j = 1, 2, · · · , r2 − ν1, are invertible. Let

∆ ≡ e^{ik1θ} (ρ(e^{iθ}) In − hσ(e^{iθ}) Mn) − ρ(e^{iθ}) Ln − hσ(e^{iθ}) Nn = e^{ik1θ} (In − e^{−ik1θ} Ln) D

where D ≡ ρ(e^{iθ}) In − h(In − e^{−ik1θ} Ln)−1(Mn + e^{−ik1θ} Nn) σ(e^{iθ}).    (9.24)

Hence, we are required to show that ∆ is invertible for any θ ∈ R in order to prove that Sj is invertible. Assuming that ‖Ln‖2 < 1, In − e^{−ik1θ}Ln is nonsingular for any θ ∈ R. Therefore, we only need to show that D is invertible for any θ ∈ R. We have the following theorem, see [3].

Theorem 9.5 If the BVM with (ν1, ν2)-boundary conditions is Aν1,ν2-stable and Re(λi) < 0 where λi, i = 1, · · · , n, are the eigenvalues of the matrix

(In − ηLn)−1(Mn + ηNn)

with |η| ≤ 1, then for any θ ∈ R, the matrix D defined by (9.24) is invertible. It follows that the Strang-type preconditioner S defined as in (9.23) is also invertible.


Proof: Let U ≡ (In − e^{−ik1θ} Ln)−1(Mn + e^{−ik1θ} Nn).

Then D can be written as D = ρ(z)In − hUσ(z), with z = e^{iθ}.

Note that the eigenvalues of D are given by

λi(D) = ρ(z)− hλi(U)σ(z), i = 1, · · · , n,

where λi(U), i = 1, · · · , n, denote the eigenvalues of U . Since we know that

Re[λi(U)] < 0, i = 1, · · · , n,

it follows that hλi(U) ∈ C−. Since the BVM is Aν1,ν2-stable, we have

hλi(U) ∈ C− ⊆ Dν1,ν2 .

Therefore, the Aν1,ν2-stability polynomial defined by (9.11)

π[z, hλi(U)] ≡ ρ(z)− hλi(U)σ(z)

has no roots on the unit circle |z| = 1. Thus, for any |z| = 1, we have

λi(D) = ρ(z) − hλi(U)σ(z) = π[z, hλi(U)] ≠ 0, i = 1, · · · , n.

It follows that D is invertible. Therefore, the Strang-type preconditioner S defined as in (9.23) is also invertible.

9.5.3 Convergence rate

We have the following result for the spectra of preconditioned matrices, see [3].

Theorem 9.6 Let H be given by (9.22) and S be given by (9.23). Then we have

S−1H = In(r2−ν1) + L

where rank(L) ≤ 2(µ + k1 + ν1 + 1)n.

By Theorems 6.13 and 9.6 , when the GMRES method is applied to

S−1Hy = S−1b,

the method will converge in at most 2(µ + k1 + ν1 + 1)n + 1 iterations in exact arithmetic.

We observe from Theorem 9.6 that if the step size h = τ/k1 is fixed, the number of

iterations for convergence of the GMRES method, when applied to

S−1Hy = S−1b,

is independent of r2, i.e., the length of the interval that we considered.


9.5.4 Numerical result

We illustrate the efficiency of our preconditioner by solving the following example.

Example 9.3. Consider

y′(t) = Lny′(t− 1) + Mny(t) + Nny(t− 1), t ≥ 0,

y(t) = (1, 1, · · · , 1)T , t ≤ 0,

where

Ln is 1/n times the tridiagonal matrix with 2 on the diagonal and 1 on the sub- and superdiagonals, and Mn is the n-by-n symmetric five-diagonal matrix with −8 on the diagonal, 2 on the first sub- and superdiagonals and 1 on the second sub- and superdiagonals,

and

Nn is 1/n times the tridiagonal matrix with 2 on the diagonal and −1 on the sub- and superdiagonals.

Example 9.3 is solved by using the fifth order GAM for t ∈ [0, 4]. In practice, we do not have the boundary values

y1, · · · ,yν1−1, yr2 , · · · ,yr2+ν2−1,

provided by (9.21). Again as in Section 9.2, instead of giving the above values, ν1 − 1 initial additional equations and ν2 final additional equations are given. After introducing the additional equations, the matrices A, A(1), B and B(1) in (9.22) are Toeplitz matrices with small rank perturbations. We can also construct the Strang-type preconditioner (9.23) by neglecting the small rank perturbations.

Table 9.3 lists the number of iterations required for convergence of the GMRES method for different n and k1. In the table, I means no preconditioner is used and S denotes the Strang-type block-circulant preconditioner defined by (9.23). We see that the number of iterations required for convergence, when S is used, is much less than that when no preconditioner is used. We should emphasize that our numerical example shows a much faster convergence rate than that predicted by the estimate provided by Theorem 9.6.


Table 9.3: Number of iterations for convergence (‘∗’ means out of memory).

n    k1   I     S
24   10   43    7
24   20   83    7
24   40   161   7
24   80   *     7
48   10   44    6
48   20   83    6
48   40   163   6
48   80   *     6

9.6 Strang-type preconditioner for SPDDEs

In this section, we study the solution of singular perturbation delay differential equations (SPDDEs).

9.6.1 Singular perturbation delay differential equations

We consider the solution of SPDDE:

x′(t) = V (1)x(t) + V (2)x(t− τ) + C(1)y(t) + C(2)y(t− τ), t ≥ t0,

εy′(t) = F (1)x(t) + F (2)x(t− τ) + G(1)y(t) + G(2)y(t− τ), t ≥ t0,

x(t) = φ(t), t ≤ t0,

y(t) = ψ(t), t ≤ t0,

(9.25)

where x(t), φ(t) : R → Rm; y(t), ψ(t) : R → Rn;

V (1), V (2) ∈ Rm×m; C(1), C(2) ∈ Rm×n;

F (1), F (2) ∈ Rn×m; G(1), G(2) ∈ Rn×n;

and τ > 0, 0 < ε ≪ 1 are constants. We can rewrite the SPDDE (9.25) as the following initial value problem:

z′(t) = Pz(t) + Qz(t− τ), t ≥ t0,

z(t) = (φ(t)^T, ψ(t)^T)^T, t ≤ t0,    (9.26)


where P and Q ∈ R(m+n)×(m+n) are defined as follows,

P = [ V(1)       C(1)    ]        Q = [ V(2)       C(2)    ]
    [ ε−1F(1)    ε−1G(1) ],           [ ε−1F(2)    ε−1G(2) ],

see [24]. Let h = τ/k2

be the step size where k2 is a positive integer. For (9.26), by using a BVM with (ν1, ν2)-boundary conditions over a uniform mesh

tj = t0 + jh, j = 0, · · · , v,

on the interval [t0, t0 + vh], we have

∑_{i=0}^{µ} αi z_{p+i−ν1} = h ∑_{i=0}^{µ} βi (P z_{p+i−ν1} + Q z_{p+i−ν1−k2}),    (9.27)

for p = ν1, · · · , v − 1, where µ = ν1 + ν2. By providing the values

z−k2 , · · · , z0, z1, · · · , zν1−1, zv, · · · , zv+ν2−1, (9.28)

(9.27) can be written in a matrix form as

Kv = b, (9.29)

where K ≡ A ⊗ Im+n − hB ⊗ P − hU ⊗ Q.    (9.30)

The vector v in (9.29) is defined by

v = (zν1^T, zν1+1^T, · · · , zv−1^T)T ∈ R(m+n)(v−ν1).

The right-hand side b ∈ R(m+n)(v−ν1) of (9.29) depends on the boundary values and the coefficients of the method. The matrices A, B ∈ R(v−ν1)×(v−ν1) in (9.30) are defined as in (9.4), and U ∈ R(v−ν1)×(v−ν1) in (9.30) is defined as the matrix C(1) in (9.17).

9.6.2 Construction of preconditioner

The Strang-type block-circulant preconditioner can be constructed for solving (9.29):

S ≡ s(A)⊗ Im+n − hs(B)⊗ P − hs(U)⊗Q, (9.31)

where s(E) is Strang’s circulant preconditioner of matrix E, for E = A, B, U respectively. We have the following theorem for the invertibility of our preconditioner S and for the convergence rate of our method, see [24].


Theorem 9.7 If the BVM with (ν1, ν2)-boundary conditions is Aν1,ν2-stable,

η(P) ≡ (1/2) λmax(P + P^T) < 0,   η(P) + ‖Q‖2 < 0,

then the Strang-type preconditioner S defined by (9.31) is invertible. Moreover, when the GMRES method is applied to

S−1Kv = S−1b,

the method will converge in at most O(m + n) iterations in exact arithmetic.
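The hypotheses of Theorem 9.7 are straightforward to check once P and Q have been assembled as in (9.26). A sketch (Python/NumPy, with illustrative data rather than the matrices of Example 9.4):

```python
import numpy as np

# Assemble the block matrices P and Q of (9.26) from the SPDDE data (9.25) and
# check the hypotheses of Theorem 9.7: eta(P) < 0 and eta(P) + ||Q||_2 < 0.
m, n, eps = 3, 2, 1e-3

V1 = -5.0 * np.eye(m);       V2 = 0.1 * np.eye(m)
C1 = 0.1 * np.ones((m, n));  C2 = 0.05 * np.ones((m, n))
F1 = 1e-4 * np.ones((n, m)); F2 = 1e-4 * np.ones((n, m))
G1 = -4e-3 * np.eye(n);      G2 = 1e-4 * np.eye(n)

P = np.block([[V1, C1], [F1 / eps, G1 / eps]])
Q = np.block([[V2, C2], [F2 / eps, G2 / eps]])

eta_P = 0.5 * np.linalg.eigvalsh(P + P.T).max()
print(eta_P < 0, eta_P + np.linalg.norm(Q, 2) < 0)   # True True for this data
```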

9.6.3 Numerical result

We test the following SPDDE.

Example 9.4. Consider

x′(t) = Ax(t) + Bx(t− 1) + Cy(t) + Dy(t− 1), t ≥ 0,

εy′(t) = Ex(t) + Fx(t− 1) + Gy(t) + Hy(t− 1), t ≥ 0, ε = 0.001,

x(t) = (1, 1, . . . , 1)T , t ≤ 0,

y(t) = (2, 2, . . . , 2)T , t ≤ 0,

where

A is the m-by-m five-diagonal symmetric matrix with −10 on the diagonal, 2 on the first sub- and superdiagonals and 1 on the second sub- and superdiagonals; B is the m-by-m tridiagonal matrix with −2 on the diagonal and 1 on the sub- and superdiagonals; C is the m-by-n matrix with ones on its main diagonal and zeros elsewhere; D = 3C; E is the n-by-m matrix with 2 on the main diagonal, 1 on the subdiagonal and −1 on the superdiagonal; F is the n-by-m matrix with 1 on the main diagonal and −1 on the superdiagonal; G is the n-by-n matrix with −2 on the diagonal and 1 on the subdiagonal;


and

H is the n-by-n tridiagonal matrix with −5 on the diagonal, 2 on the subdiagonal and −1 on the superdiagonal.

Example 9.4 is solved by using the third order GAM for t ∈ [0, 4]. Table 9.4 lists the number of iterations required for convergence of the GMRES method for different m, n and k2. In the table, S denotes the Strang-type block-circulant preconditioner defined by (9.31).

Table 9.4: Number of iterations for convergence.

k2   m    n   I     S
24   8    2   55    29
24   16   4   83    57
24   32   8   109   90
48   8    2   100   34
48   16   4   131   63
48   32   8   177   90


Bibliography

[1] P. Amodio, F. Mazzia and D. Trigiante, Stability of Some Boundary Value Methods for the Solution of Initial Value Problems, BIT, vol. 33 (1993), pp. 434–451.

[2] O. Axelsson, Iterative Solution Methods, Cambridge University Press, Cambridge, 1996.

[3] Z. Bai, X. Jin and L. Song, Strang-type Preconditioners for Solving Linear Systems from Neutral Delay Differential Equations, Calcolo, vol. 40 (2003), pp. 21–31.

[4] D. Bertaccini, A Circulant Preconditioner for the Systems of LMF-Based ODE Codes, SIAM J. Sci. Comput., vol. 22 (2000), pp. 767–786.

[5] L. Brugnano and D. Trigiante, Solving Differential Problems by Multistep Initial and Boundary Value Methods, Gordon and Breach Science Publishers, Amsterdam, 1998.

[6] Z. Cao, Numerical Linear Algebra (in Chinese), Fudan University Press, Shanghai, 1996.

[7] R. Chan and X. Jin, A Family of Block Preconditioners for Block Systems, SIAM J. Sci. Statist. Comput., vol. 13 (1992), pp. 1218–1235.

[8] R. Chan, X. Jin and Y. Tam, Strang-type Preconditioners for Solving System of ODEs by Boundary Value Methods, Electron. J. Math. Phys. Sci., vol. 1 (2002), pp. 14–46.

[9] R. Chan and M. Ng, Conjugate Gradient Methods for Toeplitz Systems, SIAM Review, vol. 38 (1996), pp. 427–482.

[10] R. Chan, M. Ng and X. Jin, Strang-type Preconditioners for Systems of LMF-Based ODE Codes, IMA J. Numer. Anal., vol. 21 (2001), pp. 451–462.

[11] T. Chan, An Optimal Circulant Preconditioner for Toeplitz Systems, SIAM J. Sci. Statist. Comput., vol. 9 (1988), pp. 766–771.

181

Page 192: Numerical Linear Algebra And Its Applications - …umir.umac.mo/jspui/bitstream/123456789/14052/1/3027_0_textbook.pdf · Numerical Linear Algebra ... Introduction Numerical linear

182 BIBLIOGRAPHY

[12] W. Ching, Iterative Methods for Queuing and Manufacturing Systems, Springer-Verlag, London, 2001.

[13] P. Davis, Circulant Matrices, 2nd edition, AMS Chelsea Publishing, Rhode Island, 1994.

[14] J. Demmel, Applied Numerical Linear Algebra, SIAM Press, Philadelphia, 1997.

[15] H. Diao, Y. Wei and S. Qiao, Displacement Rank of the Drazin Inverse, J. Comput. Appl. Math., vol. 167 (2004), pp. 147–161.

[16] M. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, J. Res. Nat. Bur. Stand., vol. 49 (1952), pp. 409–436.

[17] R. Horn and C. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, 1985.

[18] G. Hu and T. Mitsui, Stability Analysis of Numerical Methods for Systems of Neutral Delay-Differential Equations, BIT, vol. 35 (1995), pp. 504–515.

[19] G. Golub and C. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins University Press, Baltimore, 1996.

[20] A. Greenbaum, Iterative Methods for Solving Linear Systems, SIAM Press, Philadelphia, 1997.

[21] M. Gulliksson, X. Jin and Y. Wei, Perturbation Bounds for Constrained and Weighted Least Squares Problems, Linear Algebra Appl., vol. 349 (2002), pp. 221–232.

[22] X. Jin, Developments and Applications of Block Toeplitz Iterative Solvers, Kluwer Academic Publishers, Dordrecht; and Science Press, Beijing, 2002.

[23] X. Jin, S. Lei and Y. Wei, Circulant Preconditioners for Solving Differential Equations with Multi-Delays, Comput. Math. Appl., vol. 47 (2004), pp. 1429–1436.

[24] X. Jin, S. Lei and Y. Wei, Circulant Preconditioners for Solving Singular Perturbation Delay Differential Equations, Numer. Linear Algebra Appl., vol. 12 (2005), pp. 327–336.

[25] X. Jin, V. Sin and L. Song, Circulant Preconditioned WR-BVM Methods for ODE Systems, J. Comput. Appl. Math., vol. 162 (2004), pp. 201–211.

[26] X. Jin, V. Sin and L. Song, Preconditioned WR-LMF-Based Method for ODE Systems, J. Comput. Appl. Math., vol. 162 (2004), pp. 431–444.


[27] X. Jin, Y. Wei and W. Xu, A Stability Property of T. Chan’s Preconditioner, SIAM J. Matrix Anal. Appl., vol. 25 (2003), pp. 627–629.

[28] J. Kuang, J. Xiang and H. Tian, The Asymptotic Stability of One Parameter Methods for Neutral Differential Equations, BIT, vol. 34 (1994), pp. 400–408.

[29] S. Lei and X. Jin, BCCB Preconditioners for Systems of BVM-Based Numerical Integrators, Numer. Linear Algebra Appl., vol. 11 (2004), pp. 25–40.

[30] F. Lin, X. Jin and S. Lei, Strang-type Preconditioners for Solving Linear Systems from Delay Differential Equations, BIT, vol. 43 (2003), pp. 136–149.

[31] T. Mori, N. Fukuma and M. Kuwahara, Simple Stability Criteria for Single and Composite Linear Systems with Time Delays, Int. J. Control, vol. 34 (1981), pp. 1175–1184.

[32] T. Mori, E. Noldus and M. Kuwahara, A Way to Stabilize Linear Systems with Delayed State, Automatica, vol. 19 (1983), pp. 571–573.

[33] Y. Saad, Iterative Methods for Sparse Linear Systems, PWS Publishing Company, Boston, 1996.

[34] Y. Saad and M. Schultz, GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems, SIAM J. Sci. Stat. Comput., vol. 7 (1986), pp. 856–869.

[35] G. Stewart and J. Sun, Matrix Perturbation Theory, Academic Press, San Diego, 1990.

[36] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, Springer-Verlag, New York, 1992.

[37] G. Strang, A Proposal for Toeplitz Matrix Calculations, Stud. Appl. Math., vol. 74 (1986), pp. 171–176.

[38] L. Trefethen and D. Bau III, Numerical Linear Algebra, SIAM Press, Philadelphia, 1997.

[39] E. Tyrtyshnikov, Optimal and Super-Optimal Circulant Preconditioners, SIAM J. Matrix Anal. Appl., vol. 13 (1992), pp. 459–473.

[40] S. Vandewalle and R. Piessens, On Dynamic Iteration Methods for Solving Time-Periodic Differential Equations, SIAM J. Num. Anal., vol. 30 (1993), pp. 286–303.

[41] R. Varga, Matrix Iterative Analysis, 2nd edition, Springer-Verlag, Berlin, 2000.


[42] G. Wang, Y. Wei and S. Qiao, Generalized Inverses: Theory and Computations, Science Press, Beijing, 2004.

[43] S. Wang, Further Results on Stability of Ẋ(t) = AX(t) + BX(t − τ), Syst. Cont. Letters, vol. 19 (1992), pp. 165–168.

[44] Y. Wei, J. Cai and M. Ng, Computing Moore-Penrose Inverses of Toeplitz Matrices by Newton’s Iteration, Math. Comput. Modelling, vol. 40 (2004), pp. 181–191.

[45] Y. Wei and N. Zhang, Condition Number Related with Generalized Inverse A^{(2)}_{T,S} and Constrained Linear Systems, J. Comput. Appl. Math., vol. 157 (2003), pp. 57–72.

[46] J. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.

[47] S. Xu, L. Gao and P. Zhang, Numerical Linear Algebra (in Chinese), Peking University Press, Beijing, 2000.

[48] N. Zhang and Y. Wei, Solving EP Singular Linear Systems, Int. J. Computer Mathematics, vol. 81 (2004), pp. 1395–1405.


Index

backward substitution, 11
bisection method, 144
BVM, 155

Cauchy Interlace Theorem, 132
Cauchy-Schwartz inequality, 29, 170
characteristic polynomial, 111
Cholesky factorization, 20, 52, 99
chopping error, 5, 38
circulant matrix, 99, 160
classical Jacobi method, 142
condition number, 5, 34
conjugate gradient method, 87
convergence rate, 73, 175
Courant-Fischer Minimax Theorem, 131

DDE, 155
defective, 112
determinant, 2
diagonally dominant, 22, 69
diagonal matrix, 2
divide-and-conquer method, 146
double shift, 127

eigenvalue, 3, 28, 76, 114
eigenvector, 3, 29, 76, 114
error, 5, 25

factorization, 4, 20
fast Fourier transform (FFT), 99, 165
final additional conditions, 157
forward substitution, 10
Fourier matrix, 99, 164

GAM, 165
Gauss-Seidel, 66
Gauss elimination, 4, 14, 41
GBDF, 158
Givens rotation, 57
GMRES method, 83, 155
Gram-Schmidt, 106
growth factor, 45

Hermitian positive definite matrix, 7
Hessenberg decomposition, 122
Hessenberg matrix, 122
Householder transformation, 57

idempotent, 8
ill-conditioned, 5
implicit symmetric QR, 136
initial additional conditions, 157
inverse power method, 116
invertibility, 168
IVM, 155

Jacobi method, 65, 139
Jordan Decomposition Theorem, 4, 31

Krylov subspace method, 83

LDLT, 22
least squares problem, 49
LMF, 156
LS, 50
LU factorization, 9, 41, 65, 119

matrix norm, 25
MATLAB, 2, 165
Moore-Penrose inverse, 53


nilpotent, 7
NLA, 1, 9, 111, 131
norm,
  Frobenius norm ‖ · ‖F, 30
  ‖ · ‖1, 2, 27
  ‖ · ‖2, 2, 28
  ‖ · ‖∞, 2, 27
nonsingular, 9, 43, 112
normal matrix, 102
nullspace, 50

ODE, 155
orthogonal matrix, 29, 54, 120

parallel computing, 146
PCG method, 98
permutation matrix, 16, 43
perturbation, 5
pivoting, 15, 41
power method, 114
preconditioner,
  block-circulant preconditioner, 156
  optimal preconditioner, 99
  Strang’s circulant preconditioner, 161

QR algorithm, 120

range, 50
rank-one matrix, 147
rank-one modification, 12
rounding error, 5, 38

Schur decomposition, 112
shift, 125
similarity transformation, 112
single shift, 125
singular value, 3, 133
singular value decomposition, 133
singular vector, 133
Spectral Decomposition Theorem, 131
stability, 162
Sturm Sequence Property, 144

successive overrelaxation, 75
superlinear, 6
symmetric positive definite matrix, 22

tensor product, 158
Toeplitz matrix, 157
triangular system, 9
tridiagonal matrix, 134
tridiagonalization, 135

unitary matrix, 8

vector norm, 25

well-conditioned, 5
Weyl, Wielandt-Hoffman theorem, 132
Wilkinson shift, 137