
Scientific Computing for Computed Tomography

K. Joost Batenburg, Per Christian Hansen, Jakob Sauer Jørgensen, William R. B. Lionheart

December 29, 2018


Contents

Preface

List of Symbols

1 Forward Modeling: The Radon Transform
  1.1 X-Ray CT: Imaging From Projections
    1.1.1 Transmission and Absorption
    1.1.2 Data Statistics
  1.2 The Radon Transform for Parallel-Beam Geometry
  1.3 The Parameterized Form of the Radon Transform

2 Filtered Back Projection
  2.1 The Back Projection and Fourier Transforms
  2.2 The Fourier Slice Theorem: The Key to Reconstruction
  2.3 The Filtered Back Projection Method
    2.3.1 Derivation
    2.3.2 Implementation
    2.3.3 Reconstruction for Fan-Beam and Cone-Beam
  2.4 Corrections

3 Limited-Data Artifacts

4 Singular Values and Functions
  4.1 Signal Restoration and Deconvolution
  4.2 Singular Values and Functions
  4.3 Let's Reconstruct

5 Discretization
  5.1 Image Coordinate Systems
  5.2 Pixels and Discrete Images
  5.3 Projection Models
    5.3.1 The Line Model
    5.3.2 The Strip Model
    5.3.3 The Interpolation Model (a.k.a. the Joseph Model)
  5.4 The System Matrix
    5.4.1 Projection Geometries in 2D
    5.4.2 From Geometries to Matrices
    5.4.3 Image Discretization Issues
  5.5 Back Projection Computations
  5.6 The Rows and Columns of the System Matrix
  5.7 The System Matrix: Storage and Computation
    5.7.1 Memory Size
    5.7.2 Computation

6 SVD Analysis of Tomography Problems
  6.1 The Singular Value Decomposition
  6.2 Ill-Conditioned Problems
  6.3 Spectral Filtering
  6.4 SVD Analysis of Discretized CT Problems
  6.5 Additional Topics

7 AIR Methods
  7.1 A Motivating Example
  7.2 Linear Systems of Equations
    7.2.1 Notation
    7.2.2 Rank, Consistency, and Null Space
  7.3 Linear Least Squares Problems
  7.4 Iterative Solvers
    7.4.1 Kaczmarz's Method
    7.4.2 Cimmino's Method
    7.4.3 The Optimization Viewpoint
    7.4.4 A Column-Action Method
  7.5 More About the Null Space
    7.5.1 The Role of the Null Space in Tomography
    7.5.2 Computations Related to the Null Space
  7.6 Block Iterative Methods
    7.6.1 The Algorithmic Perspective
    7.6.2 Formal Derivations and the Optimization Perspective
  7.7 The Story So Far

8 AIR Methods with Noisy Data
  8.1 Semi-Convergence
    8.1.1 Analysis of Landweber's Method
    8.1.2 Analysis of Landweber's Method with Projection
    8.1.3 Analysis of Kaczmarz's Method
  8.2 Stopping Rules
    8.2.1 Fitting to the Noise Level
    8.2.2 Minimization of the Prediction Error – UPRE and GCV
    8.2.3 Stopping When All Information is Extracted – NCP
    8.2.4 Estimation of the Trace Term and the Noise Level
  8.3 Choosing a Good Relaxation Parameter


Preface

These lecture notes are still in a preliminary form, and we apologize for some missing references. The notes are being developed for the PhD course "02946 Scientific Computing for X-Ray Computed Tomography (CT)" at the Technical University of Denmark.

The main goal of the course, and these notes, is to give an introduction to some important computational methods for CT. The emphasis is on the numerical algorithms for computing a CT reconstruction, but we also describe several tools that allow us to analyze the CT problem. Our hope is that this will give CT practitioners some insight into the underlying mathematical tools and algorithms, and provide a starting point for further algorithm development.


List of Symbols

Symbol                Explanation
a(θ, s, πi)           Line-pixel coefficient
aij                   The element of A in row i and column j
A                     The m × n system matrix
A#k                   Matrix that defines the kth iteration vector x(k)
b                     Vector of length m that represents the sinogram
B                     Back projection
c(Lθ,s, π)            Intersection length of line Lθ,s in pixel π
cj                    The jth column of the system matrix A
c(·)                  NCP vector
δ(·)                  Dirac delta function
e                     Vector of length m consisting of noise in the data
E                     Expected value
f(x, y)               Function that represents the object
fπ(x, y)              Pixel indicator function for pixel π
f(ω)                  1D Fourier transform of f(x)
F(u, v)               2D Fourier transform of f(x, y)
F1                    The 1D Fourier transform
F2                    The 2D Fourier transform
G                     Projection geometry
g(s, θ)               Sinogram, the Radon transform of f(x, y)
Gk                    GCV function for the kth iterate
I                     Out-going intensity of X-ray (behind object)
I0                    Incident intensity of X-ray (before object)
I                     Identity matrix
k                     Number of iterations
L, Lθ,s               Line that represents an X-ray
mθ                    Number of projection angles
m, n                  Number of rows and columns in the system matrix A
M, N                  Size of image in pixels
M                     Diagonal "weight" matrix in an iterative reconstruction method
N                     Normal (Gaussian) distribution
pθ(s)                 The Radon transform as a function of θ and s
PC                    Orthogonal projection on the convex set C
P                     Poisson distribution
ri                    The ith row of the system matrix A
R                     Radon transform
Rπ                    Pixel region
s                     Coordinate that characterizes the location of an X-ray
tk                    Trace of a certain matrix
u, v                  Frequencies in the 2D Fourier transform
ui, vi                Left and right singular vectors in the SVD of A
umk(s, θ), vmk(x, y)  Left and right singular functions of the Radon transform
Uk                    UPRE function for the kth iterate
U, V                  Left and right singular matrices in the SVD of A
V                     View, a set of lines
x                     Vector of length n that represents an image
x̄                     The ground truth, i.e., an exact image
xk                    Truncated SVD (TSVD) solution
x(k)                  The kth iteration vector in an iterative reconstruction method
xLS, xLS,M            Least squares and weighted least squares solutions
xoLS, xoLS,M          Minimum-norm least squares and weighted least squares solutions
x, y                  Image coordinates
µ                     Attenuation coefficient
µmk                   Singular values of the Radon transform
σi                    Singular value in the SVD of A
Σ                     Diagonal matrix with singular values in the SVD of A
θ                     Angle that characterizes the direction of an X-ray
ω                     Frequency in the 1D Fourier transform
ϕLP(ω)                Low-pass filter in the frequency domain
π, πi                 Pixel
πx1, πx2, πy1, πy2    Pixel coordinates
T                     The transpose of a vector or matrix
x · y                 Inner product between two vectors
〈·, ·〉                Inner product between two functions
‖x‖2                  2-norm of vector x, defined as ‖x‖2² = x1² + · · · + xn²
φi, φ(k)i             Filter factor in SVD expansion
Φ(k)                  Diagonal matrix of filter factors for the kth iteration vector
ϱ(k)                  Residual vector for iterate x(k)
ω, ωk                 Relaxation parameter


Chapter 1

Forward Modeling: The Radon Transform

The goal of this chapter is to give a brief introduction to the forward modeling of computed tomography, with just as many details as needed in this book.

First of all – what is tomography? The word comes from Greek: tomos, a section or slice, and graphos, to describe. In tomography we produce images of slices of an object – without actually slicing it! To see the inside, we need information obtained from the outside.

These days not just slices but 3D images can be obtained; we restrict this presentation to slices because that is sufficient to present the underlying principles. There are many applications of CT:

• Medical imaging. Here CT can be used to study anatomy, look for tumors, inspect broken teeth, etc.

• Non-destructive inspection and testing. This has applications in production, security, metrology, etc.

• Materials science. Development of advanced materials requires understanding their properties at the micro and nano scale, e.g., in order to maximize the strength of glass fibres for wind turbine blades.

We will first describe the mapping from an object to the measured data, and then describe how this process can be reversed.

1.1 X-Ray Computed Tomography (CT): Imaging From Projections

A projection is a picture obtained when sending X-rays through an object. We record such projections all around the object, and the goal is to reconstruct the object from these projections. The simplest case is 2D parallel-beam geometry, which we focus on here.


Figure 1.1: The principle in 2D parallel-beam geometry. The damping of parallel X-rays going through a 2D object is measured on a 1D detector, here illustrated for two different angles of the X-rays.

Figure 1.2: Illustration of CT with fan-beam and cone-beam geometry [from where?].


Figure 1.3: Wilhelm Conrad Röntgen and the first X-ray image ever taken, showing his wife's hand (1895).

Another important case is the cone-beam geometry which is used in medical CT scanners, lab-based micro-CT scanners, etc. When the cone beam is restricted to a central slice we obtain fan-beam geometry. When we move the source far away from the object then, in the limit, we obtain the parallel-beam geometry – which is the case in large-scale synchrotron facilities.

1.1.1 Transmission and Absorption

The contrast mechanism that allows CT reconstruction is the attenuation of the X-rays as they pass through the object. "Heavier" matter attenuates the X-rays more (air – tissue – bone – metal). The attenuation is quantified by the so-called linear attenuation coefficient µ, which varies with the location inside the object.

The physics underlying X-ray CT is known as the Lambert-Beer law of attenuation. Let the incident X-ray intensity be denoted by I0. For an X-ray passing through a homogeneous block of length D and with constant attenuation coefficient µ0, the out-going intensity is

    I = I0 exp(−µ0 D).    (1.1)

In a more interesting material, the attenuation coefficient µ(x) is a function of the position x along the X-ray L, and the out-going intensity is then

    I = I0 exp(−∫_L µ(x) dx).    (1.2)

We rearrange this equation into line integral form:

    −log(I/I0) = ∫_L µ(x) dx.    (1.3)


The intensity I in (1.2) is called the transmission, while the corresponding quantity

b = − log(I/I0) (1.4)

is called the absorption.

Example 1. Here we illustrate the transmission I and the corresponding absorption b = −log(I/I0) as the attenuation increases, for incident intensity I0 = 10000:

    I        I/I0      −log(I/I0)
    10000    1.0000    0.0
     5000    0.5000    0.7
     2500    0.2500    1.4
     1250    0.1250    2.1
      625    0.0625    2.8
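As a quick check of Eq. (1.4), the table can be reproduced with a few lines of Python/NumPy (a minimal sketch; the numbers are the table's own):

    import numpy as np

    # Reproduce the table above: absorption b = -log(I/I0), Eq. (1.4).
    I0 = 10000.0
    I = np.array([10000.0, 5000.0, 2500.0, 1250.0, 625.0])
    for Ii, b in zip(I, -np.log(I / I0)):
        print(f"I = {Ii:7.0f}   I/I0 = {Ii / I0:6.4f}   b = {b:3.1f}")
    # prints b = 0.0, 0.7, 1.4, 2.1, 2.8, matching the rounded table values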

1.1.2 Data Statistics

The incident and out-going intensities are integers since the X-ray consists of a discrete number of photons. Moreover, the out-going intensity is statistical in nature, as we shall briefly describe here, following the presentation in §2.6 and §9.8.1 in [6].

The measured transmission in a single detector element is

    I = I0 d  with  d = exp(−∫_L µ(x) dx).    (1.5)

This is a photon count which follows a Poisson distribution P, i.e.,

    I ∼ P(I0 d),  where I0 d = expected value = variance.    (1.6)

For large values of I this can be approximated by a Gaussian distribution,

    I ∼ N(I0 d, I0 d)  or  I = I0 d + √(I0 d) Z,  Z ∼ N(0, 1).    (1.7)

Hence the absorption b = −log(I/I0) = −(log d + log(I/(I0 d))) is given by

    b = −log d − log(1 + Z/√(I0 d))    (1.8)
      ≈ −log d − Z/√(I0 d)  ∼  N(∫_L µ(x) dx, 1/(I0 d)).    (1.9)
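A small simulation makes the Gaussian approximation (1.9) concrete. The following Python/NumPy sketch (the values of I0 and d are arbitrary assumptions for this illustration) draws Poisson photon counts as in (1.6) and compares the empirical mean and standard deviation of the absorption with −log d and 1/√(I0 d):

    import numpy as np

    # Simulate the photon statistics of Eqs. (1.5)-(1.9) for one detector element.
    rng = np.random.default_rng(0)
    I0 = 10000                               # incident photon count (assumed)
    d = 0.25                                 # transmission fraction exp(-integral)
    I = rng.poisson(I0 * d, size=100_000)    # measured counts, Eq. (1.6)
    b = -np.log(I / I0)                      # absorption, Eq. (1.4)
    print(b.mean(), -np.log(d))              # both approximately 1.386
    print(b.std(), 1 / np.sqrt(I0 * d))      # both approximately 0.02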


Figure 1.4: The geometry of a line Lθ,s in normal form.

We see that we can approximate the error of the absorption in each detector element by a Gaussian distribution with standard deviation 1/√(I0 d).

Other Error Sources. In addition to the Poisson noise discussed above, data can be affected by numerous other issues which we will not cover here, such as:

• Detector noise, e.g., coming from the electronic system that converts the measured photons to a signal.

• Scatter within the object – some X-rays do not follow a straight line.

• The X-rays are not monochromatic, but have a full spectrum.

• The attenuation coefficient depends on energy (when ignored this leads to so-called beam hardening).

• Bad detectors, e.g., void measurements.

• Too dense features in the object, e.g., metal parts blocking the rays completely.

• The object changes during the acquisition, e.g., due to motion.

1.2 The Radon Transform for Parallel-Beam Geometry

The origin of tomographic reconstruction is usually attributed to the paper "Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten" by Johann Radon from 1917. There it is proved that an object can be reconstructed perfectly from a full set of line integrals over all angles.

To characterize these line integrals we must consider how to parameterize lines in the plane. One way (which we learned in school) is by the line's slope and intersection


Figure 1.5: Illustration of the projection pθ(s) of a simple object, within the unit disk D of radius 1, for a single angle θ.

with the y-axis; in this representation vertical lines are excluded. A better alternative, for us, is the normal form

    Lθ,s = {(x, y) | x cos θ + y sin θ = s},    (1.10)

where s is the signed orthogonal distance of the line to the origin, and θ is the angle between the x-axis and the unit normal vector to Lθ,s, cf. Figure 1.4.

We assume here that the object f(x, y) is contained in a unit disk D of radius 1. We define the projection that expresses all line integrals at angle θ:

    pθ(s) = ∫_{Lθ,s} f(x, y) dℓ  for s ∈ [−1, 1].    (1.11)

Figures 1.5 and 1.6 show simple examples of such projections. The Radon transform of f is then defined as

    [Rf](θ, s) = pθ(s) = ∫_{Lθ,s} f(x, y) dℓ    (1.12)

for θ ∈ [0, 360[ and s ∈ [−1, 1].

Example 2. Radon Transform of a Disk. Given an image with a small disk of radius r < 1 centered at the origin,

    f(x, y) = 1 for x² + y² ≤ r²,  0 otherwise,


Figure 1.6: An example with three projections.

Figure 1.7: The Radon transform of a disk.


Figure 1.8: The geometry for the ellipse case.

the corresponding Radon transform is

    [Rf](θ, s) = 2√(r² − s²) for |s| ≤ r,  0 otherwise.

This is illustrated in Figure 1.7.
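This formula is easy to verify numerically. The sketch below (Python/NumPy; the value of r is arbitrary) compares the analytic expression with a brute-force line integral, using the fact that by rotational symmetry the projection is the same for every θ, so vertical lines x = s suffice:

    import numpy as np

    # Check Example 2: [Rf](theta, s) = 2*sqrt(r^2 - s^2) for |s| <= r.
    r = 0.5
    s = np.linspace(-1.0, 1.0, 9)
    analytic = np.where(np.abs(s) <= r,
                        2 * np.sqrt(np.maximum(r**2 - s**2, 0.0)), 0.0)

    # Numerical line integrals along the vertical lines x = s.
    t = np.linspace(-1.0, 1.0, 20001)
    dt = t[1] - t[0]
    numeric = np.array([np.sum(si**2 + t**2 <= r**2) * dt for si in s])
    print(np.max(np.abs(analytic - numeric)))   # small discretization error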

Example 3. Radon Transform of an Ellipse (following §3.1 in [24]); see Figure 1.8. An ellipse with semi-axes A and B centered at the origin,

    f(x, y) = c for x²/A² + y²/B² ≤ 1 (inside the ellipse),
              0 otherwise (outside the ellipse),

has the Radon transform

    p⁰θ(s) = (2cAB/a(θ)²) √(a(θ)² − s²) for |s| ≤ a(θ),
             0 otherwise,

where a(θ)² = A² cos²θ + B² sin²θ. If the ellipse is centered at (xc, yc) and rotated by the angle α, then the Radon transform is given from p⁰θ(s) above as

    pθ(s) = p⁰θ−α(s − zc cos(γc − θ)),

with zc = √(xc² + yc²) and γc = tan⁻¹(yc/xc).


1.3 The Parameterized Form of the Radon Transform

The Radon transform (1.12) can be written explicitly using a parametrization of the line Lθ,s as

    pθ(s) = ∫_{−∞}^{∞} f(x(t), y(t)) dt,    (1.13)

where, for fixed θ and s,

    x(t) = s cos θ − t sin θ,
    y(t) = s sin θ + t cos θ.

The line is traced as the parameter t runs from −∞ to ∞.

A useful alternative expression for the Radon transform, which we will need later, is given by

    pθ(s) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) δ(x cos θ + y sin θ − s) dx dy.    (1.14)

The interpretation is that the line Lθ,s consists of exactly those (x, y) at which x cos θ + y sin θ − s = 0, which is the argument of the Dirac delta function δ (defined below). Thus, the integrand is restricted to function values of f(x, y) on Lθ,s, which amounts to the corresponding line integral.

The Dirac delta function is a generalized function – or distribution – with the heuristic definition:

    δ(t) = +∞ for t = 0,  δ(t) = 0 for t ≠ 0,  and  ∫_{−∞}^{∞} δ(t) dt = 1.    (1.15)

An important property of the Dirac delta function is that

    ∫_{−∞}^{∞} f(t) δ(t − T) dt = f(T).    (1.16)

This is called the sifting property: the Dirac delta function acts as a sieve and "sifts out" (or "samples") the value of f at t = T.
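The parameterized form (1.13) also gives a direct way to compute projections numerically. As an illustration, the following Python/NumPy sketch (the values of A, B, c, θ and s are arbitrary choices) traces a line through the centered ellipse of Example 3 and compares the numerical line integral with the analytic formula:

    import numpy as np

    # A numerical sanity check of the parameterized form (1.13), applied to the
    # centered ellipse of Example 3.
    A, B, c = 0.8, 0.4, 1.0
    theta, s = np.deg2rad(30.0), 0.3

    a2 = A**2 * np.cos(theta)**2 + B**2 * np.sin(theta)**2
    analytic = 2 * c * A * B / a2 * np.sqrt(max(a2 - s**2, 0.0))

    t = np.linspace(-2.0, 2.0, 200001)            # parameter along the line
    x = s * np.cos(theta) - t * np.sin(theta)     # x(t) from Section 1.3
    y = s * np.sin(theta) + t * np.cos(theta)     # y(t) from Section 1.3
    numeric = np.trapz(c * ((x / A)**2 + (y / B)**2 <= 1.0), t)
    print(analytic, numeric)                      # the two values should agree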

We can now establish a connection between the Radon transform

    pθ(s) = ∫_{Lθ,s} f(x, y) dℓ

and the Lambert-Beer law introduced earlier, along the same line Lθ,s:

    Iθ,s = I0 exp(−∫_{Lθ,s} µ(x, y) dℓ).


Figure 1.9: A simple image and the corresponding sinogram.

These two expressions are obviously connected through the identifications

    f(x, y) = µ(x, y),    (1.17)
    pθ(s) = −log(Iθ,s / I0).    (1.18)

The Radon transform describes the forward problem of how (ideal) X-ray projection data arise in a parallel-beam geometry. The output of the Radon transform, for all angles θ, is called the sinogram. This is illustrated in Figure 1.9 for a simple image. Note that [0, 180] captures all necessary projections of the object. The angular range [180, 360] gives a "mirror image."


Chapter 2

Filtered Back Projection

We will now describe how we can invert the Radon transform, such that we can – in principle – reconstruct the image of the object from the measured data.

2.1 The Back Projection and Fourier Transforms

To set the stage, we need the back projection which is expressed via an integration over θ,

    B[pθ(s)](x, y) = ∫_0^π pθ(x cos θ + y sin θ) dθ.    (2.1)

For fixed (x, y) this corresponds to integration along a sinusoidal curve in the sinogram, cf. Figure 2.1.

The complete process is sometimes referred to as "smearing and summation." Each point (x, y) and each angle θ define a unique location (θ, s) in the sinogram, with s = x cos θ + y sin θ. For a given θ, in the back projection (2.1) the image point (x, y) is assigned the sinogram value at s, i.e., the value pθ(s). This is "smearing." Back projection then sums all contributions, at each (x, y), by integrating over θ. This is "summation." The combined process is illustrated in Figure 2.2. Unfortunately this simple procedure does not invert the projection data.

Figure 2.1: A back projection for fixed (x, y) corresponds to integration along a sinusoidal curve in the sinogram.
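A minimal discrete sketch of (2.1) in Python/NumPy, assuming a sinogram stored as p[angle, detector], angles thetas in radians, detector coordinates s_grid, and an N×N pixel grid covering [−1, 1]²:

    import numpy as np

    # Minimal discrete sketch of the back projection (2.1); the sinogram
    # layout and the image grid are assumptions made for this illustration.
    def back_projection(p, thetas, s_grid, N=128):
        x = np.linspace(-1.0, 1.0, N)
        X, Y = np.meshgrid(x, x)
        img = np.zeros((N, N))
        for p_th, th in zip(p, thetas):
            s = X * np.cos(th) + Y * np.sin(th)   # sinogram location per pixel
            img += np.interp(s, s_grid, p_th, left=0.0, right=0.0)  # "smearing"
        return img * np.pi / len(thetas)          # "summation" over theta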


Figure 2.2: A simple illustration of back projection – smearing and summation – for a problem with just 3 projections.

Figure 2.3: Illustration of the 1D and 2D Fourier transform pairs.


Before we can derive a closed-form expression for the inverse Radon transform, we briefly summarize the definitions of the Fourier transform pair, with j = √−1. See also Figure 2.3.

• The 1D Fourier transform and its inverse:

    f(ω) = F1[f(t)](ω) = ∫_{−∞}^{∞} f(t) e^{−j2πωt} dt,    (2.2)

    f(t) = F1⁻¹[f(ω)](t) = ∫_{−∞}^{∞} f(ω) e^{+j2πωt} dω.    (2.3)

• The 2D Fourier transform and its inverse:

    F(u, v) = F2[f(x, y)](u, v) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) e^{−j2π(ux+vy)} dx dy,    (2.4)

    f(x, y) = F2⁻¹[F(u, v)](x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} F(u, v) e^{+j2π(ux+vy)} du dv.    (2.5)

2.2 The Fourier Slice Theorem: The Key to Reconstruction

One way to derive a closed-form expression for the inverse Radon transform is via the Fourier Slice Theorem. To derive this result, we manipulate the 1D Fourier-transformed projection into a slice through the 2D Fourier-transformed image.

pθ(ω) = ∫_{−∞}^{∞} pθ(s) e^{−j2πωs} ds

      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) δ(x cos θ + y sin θ − s) e^{−j2πωs} dx dy ds

      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) ∫_{−∞}^{∞} δ((x cos θ + y sin θ) − s) e^{−j2πωs} ds dx dy

      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) ∫_{−∞}^{∞} δ(s − (x cos θ + y sin θ)) e^{−j2πωs} ds dx dy

      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) e^{−j2πω(x cos θ + y sin θ)} dx dy.

Here we used the definition of the 1D Fourier transform, the Dirac delta expression of pθ(s), reordering, the fact that δ(−t) = δ(t), and the sifting property. Now continue by reordering, and recognizing the result as a 2D Fourier transform:

pθ(ω) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) e^{−j2π(xω cos θ + yω sin θ)} dx dy

      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) e^{−j2π(xu + yv)} dx dy  with u = ω cos θ, v = ω sin θ

      = F(u, v)  with u = ω cos θ, v = ω sin θ.


All 1D Fourier-transformed projections form the 2D Fourier transform of the image.

Figure 2.4: Top, left to right: a projection pθ(s) at angle θ, its 1D Fourier transform pθ(ω), and its location in 2D Fourier space at angle θ. Bottom: the Fourier Slice Theorem says that the 2D Fourier transform of the image consists of all the 1D Fourier transforms of the projections.

This yields the Fourier Slice Theorem:

    pθ(ω) = F(ω cos θ, ω sin θ)    (2.6)

The interpretation is that (u, v) = (ω cos θ, ω sin θ) for θ ∈ [0, π) and ω ∈ (−∞, ∞) specifies a line in 2D Fourier space rotated by θ relative to the positive u axis. This corresponds to the s-axis in image space. Thus, the 1D Fourier transform of a projection is equivalent to the corresponding slice/line through the 2D Fourier transform. See the illustration in Figure 2.4. This leads to the Fourier Reconstruction Method as specified in Figure 2.5.

This method requires a 2D inverse Fourier transform that needs all the data at once. Also, interpolation from a polar to a Cartesian grid, known as regridding, is required, cf. Figure 2.6; but accurate interpolation in the Fourier domain is difficult. As a result, the Fourier method is rarely used in practice. The alternative is the Filtered Back Projection (FBP) algorithm discussed in the next section.
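The theorem itself is easy to check in a discrete setting. At θ = 0 the projection is just a sum over y, and its 1D DFT must equal the v = 0 row of the 2D DFT of the image (a Python/NumPy sketch with a random test image):

    import numpy as np

    # Discrete sanity check of the Fourier Slice Theorem at theta = 0.
    rng = np.random.default_rng(1)
    f = rng.random((64, 64))                # f[y, x]: arbitrary test image
    p0 = f.sum(axis=0)                      # projection at theta = 0
    print(np.allclose(np.fft.fft(p0), np.fft.fft2(f)[0, :]))   # True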


Figure 2.5: Specification of the Fourier Reconstruction Method. Each projection undergoes a 1D Fourier transform, and all the Fourier transforms are then pieced together to form the Fourier transform of the image. An inverse 2D Fourier transform then produces the image.

Figure 2.6: Illustration of regridding in the Fourier reconstruction method.


2.3 The Filtered Back Projection Method

2.3.1 Derivation

Our strategy to obtain a more practical algorithm is to rewrite the inverse 2D Fourier transform as follows:

f(x, y) = F2⁻¹[F2 f]

Use the 2D Fourier transform definitions:

    = ∫_{−∞}^{∞} ∫_{−∞}^{∞} F(u, v) e^{j2π(ux+vy)} du dv

Change to polar coordinates, including the Jacobian ω:

    = ∫_0^{2π} ∫_0^{∞} F(ω cos θ, ω sin θ) e^{j2πω(x cos θ + y sin θ)} ω dω dθ

Split the integral over 2π in two:

    = ∫_0^{π} ∫_0^{∞} F(ω cos θ, ω sin θ) e^{j2πω(x cos θ + y sin θ)} ω dω dθ
    + ∫_0^{π} ∫_0^{∞} F(ω cos(θ+π), ω sin(θ+π)) e^{j2πω(x cos(θ+π) + y sin(θ+π))} ω dω dθ

Use sin(θ + π) = −sin θ and cos(θ + π) = −cos θ:

    = ∫_0^{π} ∫_0^{∞} F(ω cos θ, ω sin θ) e^{j2πω(x cos θ + y sin θ)} ω dω dθ
    + ∫_0^{π} ∫_0^{∞} F(−ω cos θ, −ω sin θ) e^{j2π(−ω)(x cos θ + y sin θ)} ω dω dθ

Change sign/bounds in the second integral:

    = ∫_0^{π} ∫_0^{∞} F(ω cos θ, ω sin θ) e^{j2πω(x cos θ + y sin θ)} ω dω dθ
    + ∫_0^{π} ∫_{−∞}^{0} F(ω cos θ, ω sin θ) e^{j2πω(x cos θ + y sin θ)} (−ω) dω dθ

Recollect using the absolute value:

    = ∫_0^{π} ∫_{−∞}^{∞} F(ω cos θ, ω sin θ) e^{j2πω(x cos θ + y sin θ)} |ω| dω dθ

Apply the Fourier Slice Theorem:

    = ∫_0^{π} ∫_{−∞}^{∞} pθ(ω) e^{j2πω(x cos θ + y sin θ)} |ω| dω dθ

Use that s = x cos θ + y sin θ is constant with respect to the ω-integration:

    = ∫_0^{π} [ ∫_{−∞}^{∞} pθ(ω) e^{j2πωs} |ω| dω ]_{s = x cos θ + y sin θ} dθ.


Figure 2.7: A simple illustration of the Filtered Back Projection (FBP) algorithm for a case with just three projections. Note that the projections are filtered by the ramp filter before the back projection. The result is not so convincing – but the next picture shows that it works well with many projections.

Recognizing the inner integral as the 1D inverse Fourier transform, we define the filtered projection:

    qθ(s) = ∫_{−∞}^{∞} pθ(ω) e^{j2πωs} |ω| dω    (2.7)
          = F1⁻¹[pθ(ω) |ω|](s) = F1⁻¹[F1[pθ](ω) |ω|](s).

Note that the projections are filtered by multiplying with the so-called ramp filter |ω| (also known as the Ram-Lak¹ filter) in the Fourier domain. Then we can write

    f(x, y) = ∫_0^π qθ(x cos θ + y sin θ) dθ = B[qθ](x, y)    (2.8)

by recognizing the back-projection operation. This is the Filtered Back Projection (FBP) inversion formula for the Radon transform.

The interpretation of this formula goes as follows: At each point (x, y) in the image f to be reconstructed, each angle θ defines a sinogram location s = x cos θ + y sin θ. Through back projection, the point (x, y) is assigned the sinogram's value at s via the filtered projection, and contributions at all angles θ are summed up. See Figure 2.7 for a simple example.
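A direct transcription of (2.7) for a uniformly sampled projection looks as follows (a naive Python/NumPy sketch with a function name of our own; Section 2.3.2 explains why this simple version is biased on real, sampled data):

    import numpy as np

    # Naive transcription of Eq. (2.7): multiply the DFT of a uniformly
    # sampled projection by |omega| and transform back.
    def ramp_filter_naive(p, ds):
        omega = np.fft.fftfreq(len(p), d=ds)   # discrete frequencies
        return np.fft.ifft(np.fft.fft(p) * np.abs(omega)).real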

2.3.2 Implementation

When the FBP algorithm is used in practice, there are two implementation concerns: how to perform the filtering with the ramp filter in a suitable way for the measured data, and how to perform the back projection. The latter will be discussed in Section 5.5

1Ram-Lak is an abbreviation for Ramachandran and Lakshminarayanan


Projections must be filtered with a "ramp" filter before back projection; in the Fourier domain the filter is |ω|.

Figure 2.8: The ramp filter |ω|, and reconstructions by unfiltered and filtered back projection for NP = 1, 3, 21, and 500 projections, where NP denotes the number of projections. With many projections, the FBP algorithm works well – while simple back projection produces a blurred image.


while we will address the former here, following the presentation in [24] but omitting many signal-processing details.

A key point is that in practice we work with discrete data measured in a finite number of elements on the detector – while the projection pθ(s) in the above formulation is a function defined for all s. Moreover, to obtain a fast and stable implementation of the filter as a convolution operation, we want to use the discrete Fourier transform as implemented in the Fast Fourier Transform (FFT) algorithm, which assumes a periodic discrete signal. Going from the function pθ(s) to working with the FFT on samples of this function has two implications:

1. The periodization of the discrete projection data, which is implicit in the use of the FFT, produces boundary artifacts (called "interperiod interference" in the CT community).

2. Replacing the continuous function |ω| with discrete values, say, |i∆ω|, i = 0, ±1, ±2, ... for some ∆ω introduces a bias, because the filter value "0" for i = 0 actually represents a small range of frequencies around ω = 0. This can be seen, e.g., from the interpretation of the discrete Fourier transform as an instance of the trapezoidal quadrature rule, cf. §9.2 in [5].

The remedy is to notice that the discrete projection data, sampled with a distance ∆s between the detector elements, represent a band-limited function with a highest frequency of ωmax = 1/(2∆s). Hence, in (2.7) we can replace |ω| with the function

    H(ω) = |ω| for |ω| ≤ ωmax,  0 else,    (2.9)

which corresponds to the discrete impulse response

    hi = ωmax²  for i = 0,
    hi = 0  for i even and nonzero,
    hi = −(2ωmax/(iπ))²  for i odd.    (2.10)

This eliminates the bias as well as the periodization artifacts. For more details, see §17.7 in [14] and §3.3.3 in [24].
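A sketch of the resulting discrete filter in Python/NumPy, together with a filtering routine that performs a zero-padded linear (not circular) convolution, which avoids the interperiod interference; the function names are our own:

    import numpy as np

    def ramlak_impulse_response(n, ds):
        # h_i of Eq. (2.10) for i = -n, ..., n, with detector spacing ds.
        i = np.arange(-n, n + 1)
        w_max = 1.0 / (2.0 * ds)
        h = np.zeros(2 * n + 1)
        h[n] = w_max**2                                  # i = 0
        odd = i % 2 != 0
        h[odd] = -(2.0 * w_max / (np.pi * i[odd]))**2    # i odd; even i stay 0
        return h

    def filter_projection(p, h, ds):
        # q = ds * (h conv p): linear convolution via zero-padded FFTs,
        # cropped to the central len(p) samples.
        m = len(h)
        L = len(p) + m - 1
        q = np.fft.irfft(np.fft.rfft(p, L) * np.fft.rfft(h, L), L)
        return ds * q[(m - 1) // 2 : (m - 1) // 2 + len(p)]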

A practical implementation of the FBP algorithm also needs to take into account that real data always contains some noise, due to the statistical nature of the photon counts as well as amplifier noise etc. The ramp filter is a necessary part of the inversion formula, and it is a high-pass filter, which is problematic in practice when noise is present, because it amplifies high-frequency noise, leading to noisy/grainy reconstructions. Hence, additional filtering is necessary in the form of a low-pass filter ϕLP(ω) which is multiplied with |ω| in the frequency domain; see Figure 2.9. More details can be found in §7.2.1 of [6].

The Filtered Back Projection (FBP) algorithm step-by-step.

1. For each projection:


Figure 2.9: Different low-pass filters ϕLP(ω) and the resulting filters when multiplied with the ramp filter |ω|.

(a) Fourier-transform the projection via a 1D Fourier transform.

(b) Apply the ramp filter as explained above.

(c) Optionally, apply an additional low-pass filter ϕLP(ω) to handle the noise.

(d) Inverse Fourier-transform to obtain the filtered projection.

(e) Back project the filtered projection.

2. Sum the back-projected filtered projections to obtain the reconstruction.

Steps (a) and (d) are implemented using the FFT algorithm. In practice, the FBP is often the default reconstruction method in a commercial scanner/instrument. Many implementations of the FBP are available (MATLAB: iradon). The algorithm is easy to use – the main user input is the choice of low-pass filter.
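Assembling the steps with the earlier sketches (back_projection from Section 2.1 and ramlak_impulse_response / filter_projection from above, all hypothetical helper names of our own) gives a compact FBP, without the optional low-pass filter of step (c):

    import numpy as np

    def fbp(p, thetas, s_grid, N=128):
        # p[angle, detector]: sinogram; thetas in radians; s_grid uniform.
        ds = s_grid[1] - s_grid[0]
        h = ramlak_impulse_response(len(s_grid), ds)
        q = np.array([filter_projection(pi, h, ds) for pi in p])  # steps (a)-(d)
        return back_projection(q, thetas, s_grid, N)              # steps (e) and 2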

The FBP algorithm is very popular due to the following strengths:

• It is fast, since it is based on FFT computations followed by a single back projection, and the memory requirements are low.

• It is conceptually easy to understand and implement.

• The algorithm's behavior and reconstruction capabilities are well understood, and there are only a few parameters to adjust.

• It typically works very well for complete and good data.

But FBP also has some weaknesses:

• A large number of projections and a full angular range are required.

• Only a modest amount of noise in data can be tolerated.

• It is tailored to fixed scan geometries – other geometries require their own inversion formulas.

• It cannot incorporate constraints and make use of prior knowledge, such as non-negativity.


Figure 2.10: From left to right: parallel-beam, fan-beam and cone-beam scanner geometry.

2.3.3 Reconstruction for Fan-Beam and Cone-Beam

The FBP algorithm cannot immediately be used for fan-beam and cone-beam geometries, because it relies on the Radon transform which assumes parallel X-rays, cf. Figure 2.10. Recall the principal form of the FBP algorithm:

    Projections → Filter → Back project → Reconstruction

For fan-beam geometry there are two strategies: a dedicated reconstruction algorithm, or rebinning of the data followed by the FBP algorithm. For cone-beam geometry there are also two options: a dedicated reconstruction algorithm, or the Feldkamp-Davis-Kress (FDK) algorithm (which is the standard); the latter is an approximate algorithm. The FDK algorithm has the following principal form:

    Projections → Weight → Filter → Weight → Back project → Reconstruction

We will not dwell further on the FDK algorithm here.

2.4 Corrections

Flat and Dark-Field Correction. Typical data acquired:

• I: Measured transmission images (sample in, source on).

• I0: Measure of the actual flux, called the flat-field (sample out, source on).

• ID: Background: the dark field (sample out, source off).

FBP needs line integrals; recall the Radon transform and the conversion to a linear problem in the attenuation coefficient f:

    ∫_L f(x, y) dℓ = −log(I/I0).

To obtain corrected projections we compute

    Z = −log Y,  Y = (I − ID)/(I0 − ID)  (pixelwise division).    (2.11)
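In code, the correction is one line per step (a Python/NumPy sketch; I, I0 and ID are 2D arrays of matching shape):

    import numpy as np

    def flat_dark_correct(I, I0, ID):
        # Eq. (2.11): pixelwise normalization, then the log transform.
        Y = (I - ID) / (I0 - ID)
        return -np.log(Y)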


Figure 2.11: Figure from Wang & Yu, Med. Phys., 36 (2009), 3575–3581.

Center-of-Rotation (COR) Correction

• Standard FBP implementations, such as MATLAB's iradon, assume a perfectly centered object.

• In other words, the center of rotation should be mapped to the central detector element.

• In practice only approximate centering is physically possible.

• Naive reconstruction yields artifacts, so we need to perform a center-of-rotation correction.

• This can be done by "shifting" the projections by padding the sinogram with sufficiently many artificial detector element values; a sketch follows below.
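A hedged sketch of the padding idea (Python/NumPy; shift is a hypothetical, externally estimated offset in detector elements, and the sign convention is only illustrative):

    import numpy as np

    def cor_pad(sinogram, shift):
        # Widen each row so the rotation center moves to the central column.
        pad = 2 * abs(int(round(shift)))
        if shift > 0:
            return np.pad(sinogram, ((0, 0), (pad, 0)))   # pad on the left
        return np.pad(sinogram, ((0, 0), (0, pad)))       # pad on the right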

Region-of-Interest (ROI) Correction. In some cases the object to be scanned is too large to fit in the field of view, or we want to focus on some small region-of-interest (ROI). In both cases, the projections are truncated – they do not cover the entire object or ROI.

Can we still reconstruct the object? Or just the ROI? The answer is no. The interior Radon data are "contaminated" by exterior Radon data, and an interior reconstruction will also be "contaminated." See Figures 2.12 and 2.13 for illustrations.


Figure 2.12: With a detector that is smaller than the object, we still also capture data from the outer part of the object. The figure is from Bilgot et al., IEEE Nuc. Sci. Symp. Conf. Rec. (2009), pp. 4080–4085.

Figure 2.13: ROI correction in practice: The boundaries are correct, but large artifacts are apparent. Padding of the sinogram yields a large improvement, but the intensities are still off.


Chapter 3

Limited-Data Artifacts

Write a short chapter that summarizes the central results from microlocal analysis and gives some computed examples.


Chapter 4

Singular Values and Functions of the Radon Transform

We have studied an efficient algorithm – the filtered back projection (FBP) algorithm – for computing the CT reconstruction. And we have also seen that the reconstruction is somewhat sensitive to noise in the data. Some natural questions that now arise are:

• How can we further study this sensitivity to noise?

• How can we possibly reduce the influence of the noise?

• What consequence does that have for the reconstruction?

We need a mathematical tool that lets us perform a detailed study of these aspects, which happens to be the singular values and functions of the Radon transform. But before going into these details, we will start with a simple example from signal processing, to explain the basic ideas. We need the notation from Table 4.1, and throughout we will use the Shepp-Logan test image from [31] shown in Figure 4.1 (it is available in MATLAB through the function phantom).

Table 4.1: Notation for vectors and functions; a bar denotes complex conjugation.

    Norm (vectors):              ‖x‖2² = |x1|² + · · · + |xn|² = x · x
    Norm (functions):            ‖f‖2² = ∫_a^b |f(t)|² dt = 〈f, f〉
    Inner product (vectors):     x · y = x̄1 y1 + · · · + x̄n yn = x̄ᵀy
    Inner product (functions):   〈f, g〉 = ∫_a^b f̄(t) g(t) dt
    Orthonormality (vectors):    vi · vj = δij
    Orthonormality (functions):  〈vi, vj〉 = δij
    Expansion (vectors):         x = Σ_{i=1}^n ci vi = [v1 . . . vn] c
    Expansion (functions):       f(t) = Σ_{i=0}^∞ ci vi(t), meaning that
                                 Σ_{i=0}^n ci vi(t) → f(t) for n → ∞


Figure 4.1: The Shepp-Logan test image used in this chapter.

4.1 Signal Restoration and Deconvolution

Input ⇒ System ⇒ Output

Assume that we know the characteristics of a system, and that we have measured the noisy output signal g(t). Now we want to reconstruct the input signal f(t). The mathematical (forward) model, assuming 2π-periodic signals, takes the form

    g(t) = ∫_{−π}^{π} h(τ − t) f(τ) dτ  or  g = h ∗ f  (convolution).

Here, the function h(t) (called the "impulse response") defines the system.

Consider an example where the input f(t) is white noise, and the output g(t) is filtered noise. The deconvolution problem is then, given h, to reconstruct the input f from the output g = h ∗ f. See Figure 4.2 for a simple example of f, g and h.

Our analysis and solution tool is the Fourier series of a 2π-periodic function f:

    f(t) = Σ_{n=−∞}^{∞} cn e^{jnt},  j = √−1,    (4.1)

with the Fourier coefficients

    cn = (1/2π) ∫_{−π}^{π} e^{−jnt} f(t) dt = (1/2π) 〈ψn, f〉,  ψn = e^{jnt}.    (4.2)

We can think of the functions ψn as a convenient basis for analysing the behavior of periodic functions. Similarly:

    g(t) = Σ_{n=−∞}^{∞} dn ψn,  dn = (1/2π) 〈ψn, g〉.    (4.3)


Figure 4.2: Top: a simple impulse response h. Bottom: the Fourier transforms of a white-noise input signal f and the corresponding output signal g = h ∗ f. In this example, the convolution acts as a low-pass filter, so the higher the frequency, the more g is damped.

Due to the linearity of the convolution problem, we have

    g = h ∗ f = h ∗ (Σ_{n=−∞}^{∞} cn ψn) = Σ_{n=−∞}^{∞} cn (h ∗ ψn).    (4.4)

Hence, all we need to know is the system's response to each basis function ψn. For the periodic systems we consider here, the convolution of h with ψn produces a scaled version of ψn,

    h ∗ ψn = µn ψn  for all n,    (4.5)

where µn = 〈ψn, h〉 (we do not prove this). Hence,

    g = Σ_{n=−∞}^{∞} cn µn ψn = Σ_{n=−∞}^{∞} dn ψn  ⇔  f = Σ_{n=−∞}^{∞} (dn/µn) ψn.    (4.6)

We see that the deconvolution is transformed to a simple algebraic operation, division, in the frequency domain.

Example 4. To illustrate how noise in the data g influences the reconstruction of f by means of (4.6), we add a small amount of white noise to g and then compute the corresponding "naive" reconstruction fnaive. This is illustrated in Figure 4.3 for an example where the input f is a piecewise constant function while the data g = h ∗ f + noise is a blurred and noisy version of f. We make the following remarks to the figure:

• Top left: the input f(t) and the noisy output g(t) (the noise is so small that it is not visible here).


Figure 4.3: Analysis of the convolution test problem with noisy data.

Figure 4.4: Reconstruction by truncation of the noisy deconvolution problem.

• Bottom left: the corresponding Fourier coefficients; note the "noise floor" in the coefficients dn for the noisy g.

• Bottom right: the reconstructed Fourier coefficients dn/µn are dominated by the noise for n > 100, and a naive reconstruction is useless.

• Top right: the naive and useless reconstruction fnaive(t).

A very simple remedy for the amplification of the noise, by the division with the small coefficients µn, is to simply "chop off" the noisy components. We can do this by keeping only the first ±k terms in the sum for the reconstruction, i.e.,

    ftrunc = Σ_{n=−k}^{k} (dn/µn) ψn.    (4.7)

Example 5. Continuing with the previous example, we compute the truncated reconstruction ftrunc for k = 100. The result is shown in Figure 4.4, and we make the following remarks:


• Left (top and bottom): same as in the previous figure.

• Bottom right: we keep the first ±100 coefficients only.

• Top right: comparison of f(t) and the truncated reconstruction ftrunc(t) using ±100 terms in the Fourier expansion. It captures the general shape of f(t).

What we have learned from this simple example can be summarized as follows:

1. With the right choice of basis functions, we can turn a complicated problem into a simpler one.

   Here, the basis functions are the complex exponentials ψn = e^{jnt}, and deconvolution becomes division in the Fourier domain.

2. Inspection of the expansion coefficients reveals how and when the noise enters in the reconstruction.

   Here, the noise dominates the output's Fourier coefficients 〈ψn, g〉 for higher frequencies, while the low-frequency coefficients are ok.

3. We can avoid most of the noise (but not all) by means of filtering, at the cost of losing some details.

   Here, we simply truncate the Fourier expansion for the reconstruction.
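The whole experiment fits in a short Python/NumPy sketch, our own discrete analogue of Eqs. (4.4)-(4.7); the impulse response and the noise level are arbitrary choices:

    import numpy as np

    # Discrete analogue: periodic convolution is multiplication by the DFT
    # coefficients mu_n; naive deconvolution divides by them; truncation
    # keeps only the frequencies |n| <= k.
    rng = np.random.default_rng(0)
    n = 512
    t = np.linspace(-np.pi, np.pi, n, endpoint=False)
    f = (np.abs(t) < 1.0).astype(float)              # piecewise-constant input
    h = np.exp(-5.0 * np.abs(t)); h /= h.sum()       # smoothing impulse response
    mu = np.fft.fft(np.fft.ifftshift(h))             # the coefficients mu_n
    g = np.fft.ifft(mu * np.fft.fft(f)).real         # output g = h * f
    g += 1e-4 * rng.standard_normal(n)               # a little white noise

    d = np.fft.fft(g)                                # coefficients d_n of g
    f_naive = np.fft.ifft(d / mu).real               # useless: noise blows up
    k = 100
    keep = np.minimum(np.arange(n), n - np.arange(n)) <= k
    f_trunc = np.fft.ifft(np.where(keep, d / mu, 0)).real  # truncated, Eq. (4.7)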

We will now apply the same ideas to parallel-beam CT reconstruction, to obtain more insight as well as an alternative to the FBP algorithm.

4.2 Singular Values and Functions

This section is based on material in [4], from where we reproduce the figure above that defines the geometry. We use the following definitions:


• The Radon transform is now written as g = Rf. In previous chapters g was called pθ(s), but the new simple notation is more convenient here.

• The image is f(x) with x = (x, y) ∈ D, the unit disk with radius 1.

• The sinogram (the data) is g(s, θ) with s ∈ [−1, 1] and θ ∈ [0, 360].

For the Radon transform R, it can be shown that there exist scalars µmk and functions umk(s, θ) and vmk(x) such that

    R vmk = µmk umk,  m = 0, 1, 2, 3, ...,  k = 0, 1, 2, ..., m.    (4.8)

The scalars are called the singular values and they are given by

    µmk = 2√(π/(m+1))  with multiplicity m+1.    (4.9)

The corresponding left and right singular functions are given by

    umk = constmk · √(1−s²) · Um(s) · e^{−i(m−k)θ},
    vmk = constmk · r^{|m−2k|} · P_ν^{0,|m−2k|}(2r²−1) · e^{−i(m−k)θ},

where r = ‖x‖2, Um is the Chebyshev polynomial of the second kind, P_ν^{0,|m−2k|} is the Jacobi polynomial, and ν = ½(m+|m−2k|). It is unclear why the term "singular" was originally associated with these scalars and functions. The following properties of the singular functions are important:

originally associated with these scalars and functions. The following properties of thesingular functions are important:

• The functions umk are an orthonormal basis for [−1, 1]× [0, 360].

• The functions vmk are an orthonormal basis for the unit disk D.

• Both functions are complex – similar to the Fourier basis.

The expansions of f and g take the form

f(x) =∑m,k

〈vmk, f〉 vmk(x), g(s, θ) =∑m,k

〈umk, g〉umk(s, θ). (4.10)

in which the expansion coefficients are the inner products

〈vmk, f〉 =

∫D

vmk(x) f(x) dx, (4.11)

〈umk, g〉 =

∫ 1

−1

∫ 360

0umk(s, θ) g(s, θ) dθ ds. (4.12)

Figure 4.5 shows plots of the real and imaginary parts of the singular functions, whilethe singular values µmk are plotted in the top part of Figure 4.6. We note that:

• All singular values are positive (there are no zeros), and they decay rather slowly.


• If µmk = µj with j = ½m(m+1) + k + 1, then µj ∝ 1/√j for large j.

• Singular functions with higher index j have higher frequencies.

• The higher the frequency, the more the damping in R vmk = µmk umk.

• Hence the Radon transform g = Rf is a "smoothing" operation

• . . . and the reverse operation f = R⁻¹g amplifies higher frequencies!

To conclude this analysis, the bottom part of Figure 4.6 shows the coefficients 〈umk, g〉 for the sinogram g(s, θ) corresponding to the Shepp-Logan phantom – with the same ordering as the singular values. The coefficients 〈umk, g〉 decay, as expected. The specific behavior for k = 0, . . . , m is due to the symmetry of the phantom.

The inverse Radon transform is unbounded. From linear algebra we know that if we have a linear system of equations b = Ax ⇔ x = A⁻¹b then

    ‖b‖2 ≤ ‖A‖ ‖x‖2  and  ‖x‖2 ≤ ‖A⁻¹‖ ‖b‖2,  with ‖A‖² = Σ_{i,j} aij².

Hence, a perturbation ∆b of b produces a reconstruction perturbation ∆x = A⁻¹∆b with norm ‖∆x‖2 ≤ ‖A⁻¹‖ ‖∆b‖2.

For the Radon transform g = Rf ⇔ f = R⁻¹g we similarly have

    ‖g‖2 ≤ ‖R‖ ‖f‖2  with  ‖R‖² = Σ_{m,k} µmk²

and

    ‖f‖2 ≤ ‖R⁻¹‖ ‖g‖2  with  ‖R⁻¹‖² = Σ_{m,k} 1/µmk².

But the trouble here is that ‖R⁻¹‖ = ∞, because the singular values do not decay fast enough to ensure that the sum is finite. The consequence is that tiny perturbations of g can lead to extremely large – or unbounded – perturbations of the reconstruction f = R⁻¹g.

The left singular functions and the range condition. Due to the factor √(1−s²) in the left singular functions umk, all these functions satisfy

    umk(s, θ) → 0 for s → ±1.    (4.13)

See Figure 4.7 for a few examples. This reflects the fact that rays through the disk D that almost graze the edge of the disk contribute very little to the sinogram. This puts a so-called range restriction on sinograms g(s, θ) that admit a reconstruction:

• The sinogram g = Rf is a sum of the singular functions umk.

• Hence, the sinogram inherits the property g(s, θ) → 0 for s → ±1.

• A perturbation ∆g of g that does not have this property may not produce a bounded perturbation R⁻¹∆g of f.


Figure 4.5: Real and imaginary parts of some left and right singular functions for m = 0, 1, 2, 3, 4.


Figure 4.6: Top: some singular values (they are real). Bottom: the corresponding coefficients |〈umk, g〉| for the sinogram g associated with the Shepp-Logan test image. Both are plotted according to increasing index m.

Figure 4.7: The left singular functions satisfy umk(s, θ) → 0 (marked with red) for s → ±1.

Figure 4.8: Top: illustration of artifacts that appear in the corners of the reconstructed image when the noise fails to satisfy the range condition, for four different noise realizations. Bottom row: reconstructions where the artifacts are removed.


Example 6. In a practical and finite-dimensional CT reconstruction problem, if the noise violates the "range condition" then artifacts will appear in the reconstruction. This is illustrated in Figure 4.8 where artifacts appear in the corners of the reconstructed image. We note that this phenomenon is not restricted to FBP and disk domains; it is a general difficulty in CT reconstructions. The problem can be heuristically fixed by adding damping to the sinogram data near s = ±1.

4.3 Let’s Reconstruct

In terms of the singular values and functions, reconstruction via the inverse Radon transform takes the form

    f(x) = Σ_{m,k} (〈umk, g〉 / µmk) vmk(x).    (4.14)

Since the image f(x) has finite norm (finite energy), we conclude that the magnitude of the coefficients 〈umk, g〉/µmk must decay "sufficiently fast." Specifically, the coefficients 〈umk, g〉 for g(s, θ) must decay sufficiently faster than the singular values µmk, such that the Picard condition is satisfied:

    Σ_{m,k} |〈umk, g〉 / µmk|² < ∞.    (4.15)

When noise is present in the sinogram g(s, θ), then this condition is not satisfied for large m (cf. the signal deconvolution example from before).

A simple remedy for the noise magnification, due to the division with µmk, is to introduce filtering:

    f(x) = Σ_{m,k} ϕmk (〈umk, g〉 / µmk) vmk(x).    (4.16)

The filter factors ϕmk must decay fast enough that they, for large m, can counteract the factor 1/µmk. More on this later in the course.

We can think of the filter factors as modifiers of the expansion coefficients 〈umk, g〉 for the sinogram. In other words, they ensure that the filtered coefficients ϕmk 〈umk, g〉 decay fast enough to satisfy the Picard condition from Eq. (4.15). The filtering inevitably dampens the higher frequencies associated with the small µmk, and hence some details and edges in the reconstructed image are lost.

Connection to filtered back projection. Recall the filtered back projection (FBP) algorithm:

1. For fixed θ compute the Fourier transform g(ω, θ) = F(g(s, θ)).

2. For the same fixed θ apply the ramp filter |ω| and compute the inverse Fourier transform gfilt(s, θ) = F⁻¹(|ω| g(ω, θ)).


3. Do the above for all θ ∈ [0°, 360°].

4. Then compute

    f(x) = \int_0^{360} g_{filt}(x \cos θ + y \sin θ, θ) \, dθ .

It is the ramp filter |ω| in step 2 that magnifies the higher frequencies in the sinogram g(s, θ). This amplification is equivalent to the division by the singular values µ_{mk} in the above analysis.

Now recall how the filtered back projection (FBP) algorithm is really implemented, with an additional low-pass filter:

1. Choose a low-pass filter ϕ_LP(ω).

2. For every θ compute the Fourier transform ĝ(ω, θ) = F(g(s, θ)).

3. Apply the combined ramp and low-pass filter, and compute the inverse Fourier transform g̃_filt(s, θ) = F^{-1}(|ω| ϕ_LP(ω) ĝ(ω, θ)).

4. Then compute

    f_rec(x) = \int_0^{360} g̃_filt(x \cos θ + y \sin θ, θ) \, dθ .

The low-pass filter ϕ_LP(ω) counteracts the ramp filter |ω| for large ω. It is equivalent to the filter factors ϕ_{mk} introduced in Eq. (4.16).
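To make step 3 concrete, here is a minimal MATLAB sketch of the combined ramp and low-pass filtering, applied column-wise to a discrete sinogram via the FFT. The array sizes, the random stand-in data, and the Hann-type choice of ϕ_LP are our own assumptions for illustration (the snippet relies on MATLAB's implicit expansion, R2016b or later):

    % Combined ramp and low-pass filtering of a sinogram (a sketch).
    Ns = 128; Ntheta = 180;                % assumed sinogram dimensions
    g  = rand(Ns, Ntheta);                 % stand-in for measured data
    omega = [0:Ns/2, Ns/2-1:-1:1]' / Ns;   % frequencies in fft ordering
    ramp  = abs(omega);                    % the ramp filter |omega|
    phiLP = 0.5*(1 + cos(pi*omega/max(omega)));  % a Hann-type low-pass filter
    G      = fft(g, [], 1);                % transform each projection
    g_filt = real(ifft(G .* (ramp .* phiLP), [], 1));  % filtered sinogram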


Chapter 5

Discretization

In this chapter, we discuss how the Radon transform can be discretized and sampled, such that we can use it to perform numerical computations on a computer. We focus the presentation on the problem of reconstructing a 2D image from its Radon transform, but the presented ideas extend to 3D problems in a straightforward way.

In the analytical model of Chapter 1, a two-dimensional image is represented by a continuous real-valued function f, defined on the plane R²:

    f : R² → R₊ .

This function f assigns a grey level f(x, y) to each point (x, y) in the 2D coordinate system. In tomography the function f corresponds to the object being scanned, and the value of this function at a certain point can be thought of as the attenuation coefficient of the object at that location. The plane R² extends infinitely in the horizontal and vertical directions, but objects in the real world are always bounded. Therefore, we assume that the function f has bounded support, meaning that one can draw a closed region Ω in the plane, outside of which the function f is zero.

The representation of an image as a function, and of the measurements as an integral operator applied to this function, is convenient for mathematical analysis. It allows us to work with standard analysis concepts such as derivatives, integrals, and smoothness. However, if we want to compute with a tomography image, we need to discretize it, representing both the image and the measured projection data as finite arrays of scalar values. For the measured data, such a discretization is natural as the data is usually measured by a digital detector, consisting of an array of detector elements, which already imposes a discretization. For the image of the actual object being scanned, the discretization is a practical necessity, but the particular discretization scheme can be selected depending on the computational context and assumptions. Two-dimensional images are usually represented in discretized form as a two-dimensional array of small squares, known as pixels, where each pixel is assigned a single grey level. Such pixelized images are easy to display and also allow for efficient access and storage in computer memory. When representing a three-dimensional volume in such a way, the 3D volume is typically composed of small cubes of constant grey value, known as voxels.


Figure 5.1: Three image coordinate systems. From left to right: pixel coordinates, Euclidean coordinates, and matrix coordinates.

5.1 Image Coordinate Systems

Depending on the context and historical conventions that come with a specific coordinate system, the various notations differ in what they describe (discrete elements vs. geometrical coordinates), their ordering (top to bottom or vice versa) and their values (starting at 0, starting at 1, or unbounded). It is important to be aware of these differences and to define clearly how discrete images can be converted to a planar representation that fits the analytical framework of tomography.

Pixel Coordinates

Figure 5.1 shows a pixel grid that can be used to represent a digital grayscale image. A real-valued grey value is assigned to each of the pixels, which represents the value in the entire region of that pixel. Each square pixel has two coordinates, its row (the vertical coordinate) and column (the horizontal coordinate). For such pixel images, the typical convention is to refer to a pixel by a pair of nonnegative integers (v, w), where v is the column index and w is the row index. Note that the column index counts from left to right and the row index counts from top to bottom. The ordering from top to bottom makes sense when drawing an image on a computer screen, as the first row of the image (row 0) is typically drawn as the top row, then proceeding downwards. Using a 0-based representation (i.e., starting the indexes at 0) is also convenient, as it allows efficient translation between pixel coordinates and the memory addresses that are used to store an image in a block of computer memory.

In computer memory, a pixel image is usually stored as a contiguous array a of pixel values, which is indexed by a single index variable. In memory, the first row of pixels is followed by the second row, then the third row, etc. To obtain the value of a pixel in row w and column v, we can use the notation a[wN + v], where N denotes the number of pixels in each row.
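As a small illustration (our own, plain arithmetic written in MATLAB), the 0-based row-major address calculation and its inverse:

    % Row-major addressing a[w*N + v] and its inverse (a sketch).
    N = 5;                    % assumed number of pixels per row
    v = 3; w = 2;             % column and row index, both 0-based
    idx = w*N + v;            % linear memory address of pixel (v, w)
    v_back = mod(idx, N);     % recover the column index
    w_back = floor(idx / N);  % recover the row index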


Euclidean Coordinates

To translate a pixel representation of an image to the analytical setting, where we consider an image as a function R² → R₊, we also need to equip each pixel with geometrical coordinates that define the size and position of this pixel in the plane R². For this step, the Euclidean coordinate system depicted in Figure 5.1 is more suited, where each 2D point is denoted by a pair of real-valued coordinates (x, y), also allowing negative coordinate values. Conflicting with the convention used in pixel coordinates, the convention for Euclidean coordinates is that the x-coordinate increases when going from left to right, while the y-coordinate increases when going up. To convert pixel coordinates (v, w) to square regions of R² we need to specify where the corner points of that pixel reside in 2D space. For the vertical coordinate, this conversion typically involves a negation to compensate for the mismatch in ordering conventions.

Matrix Indices

Finally, when working with pixel images in mathematical software, such as MATLAB, it is often convenient to represent an image as a matrix, see Figure 5.1. Matrices come with their own conventions on the coordinates (also called indexes) of their elements. For a matrix X, the element in row i and column j is referred to as x_{ij}. The row index i starts from 1 and increases from the top row down to the bottom row. The column index j starts from 1 and increases from left to right.
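A small sketch (our own) collecting the three conventions for a single pixel, anticipating the unit-pixel standard grid defined in Section 5.2:

    % Converting pixel coordinates (v, w) to matrix indices and to the
    % Euclidean center point on an M-by-N grid of unit pixels (a sketch).
    M = 4; N = 6;            % grid size: M rows, N columns
    v = 2; w = 1;            % pixel coordinates (column, row), 0-based
    i = w + 1;  j = v + 1;   % matrix indices: row i, column j, 1-based
    xc = -N/2 + v + 1/2;     % Euclidean x-coordinate of the pixel center
    yc =  M/2 - w - 1/2;     % Euclidean y (note the top-to-bottom sign flip)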

5.2 Pixels and Discrete Images

We will now turn these intuitive concepts into a more formal model, which defines the relation between a pixelized image and its "functional" counterpart: a real-valued function defined on Ω ⊂ R². Although a natural approach is to start with the image f and then define a discretization of f using some kind of sampling operator, we will proceed in the other direction. We will first define a discretized image, which consists of an assembly of non-overlapping squares, each having a constant grey level.

Definition 1. A pixel π is a quadruple (π_{x1}, π_{y1}, π_{x2}, π_{y2}) of real values such that π_{x1} < π_{x2} and π_{y1} < π_{y2}. With a pixel π, we associate the set

    R_π = { (x, y) ∈ R² : π_{x1} ≤ x < π_{x2} and π_{y1} ≤ y < π_{y2} } ,    (5.1)

which we call the pixel region. We also associate with π its pixel indicator function f_π : R² → {0, 1} defined by

    f_π(x, y) = \begin{cases} 1 & if (x, y) ∈ R_π \\ 0 & otherwise \end{cases}    (5.2)

We now turn to the definition of lines. The lines represent rays going through the scanned object from a source to a detector. For (θ, s) ∈ [0, 2π) × R, recall our definition of the line

    L_{θ,s} = { (x, y) ∈ R² : x \cos θ + y \sin θ = s } ,


which was introduced for the Radon transform in Eq. (1.10).

For any pixel π and line L_{θ,s}, the intersection L_{θ,s} ∩ R_π is a (possibly empty) line segment, which we call the intersection segment of L_{θ,s} and π. We denote the length of this segment by c(L_{θ,s}, π), called the intersection length for line L_{θ,s} and pixel π. We then have the following identity:

    c(L_{θ,s}, π) = \int f_π(s \cos θ − t \sin θ, s \sin θ + t \cos θ) \, dt = R f_π(θ, s) .    (5.3)

So, by computing the intersection length between a pixel and a line, we actually obtain the value of the Radon transform applied to the pixel indicator function f_π of (5.2), sampled at the point (θ, s) in the sinogram.
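Concretely, one way to evaluate c(L_{θ,s}, π) is to clip the line parameter t in (5.3) against the four edges of the pixel region (the standard slab method). The following MATLAB function is a minimal sketch of this idea; the function name, interface, and tolerance are our own choices for illustration:

    function c = intersect_length(theta, s, px1, py1, px2, py2)
    % Sketch: length of the intersection of the line x*cos(theta) +
    % y*sin(theta) = s with the pixel region [px1,px2) x [py1,py2).
      x0 = s*cos(theta);  y0 = s*sin(theta);   % a point on the line
      dx = -sin(theta);   dy = cos(theta);     % unit direction of the line
      tmin = -Inf;  tmax = Inf;
      if abs(dx) > 1e-12                       % clip against the x-slab
        t1 = (px1 - x0)/dx;  t2 = (px2 - x0)/dx;
        tmin = max(tmin, min(t1,t2));  tmax = min(tmax, max(t1,t2));
      elseif x0 < px1 || x0 >= px2
        c = 0;  return                         % line parallel to slab, outside it
      end
      if abs(dy) > 1e-12                       % clip against the y-slab
        t1 = (py1 - y0)/dy;  t2 = (py2 - y0)/dy;
        tmin = max(tmin, min(t1,t2));  tmax = min(tmax, max(t1,t2));
      elseif y0 < py1 || y0 >= py2
        c = 0;  return
      end
      c = max(tmax - tmin, 0);                 % direction has unit length
    end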

Definition 2. A discrete grid Π is an ordered set {π_1, . . . , π_n} of pixels such that any two distinct pixels π_i, π_j ∈ Π have R_{π_i} ∩ R_{π_j} = ∅. Hence, a discrete grid is a set of pixels that have non-overlapping pixel regions.

A common choice for a discrete grid is to center the grid around the origin and to take a rectangular tiling of unit squares as the set of pixels. This corresponds to the standard representation of M×N bitmap images in a computer (as a 2D array of pixels). To represent the standard rectangular grid, we number the pixels by two pixel coordinates (v, w), where v = 0, 1, 2, . . . , N − 1 denotes the column and w = 0, 1, 2, . . . , M − 1 denotes the row. The pixel region for π = π_{(v,w)} is then given by

    π_{x1} = −N/2 + v ,
    π_{x2} = −N/2 + v + 1 ,
    π_{y1} = M/2 − w − 1 ,
    π_{y2} = M/2 − w .    (5.4)

We refer to this configuration of pixel regions as the standard grid of size M×N .
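As a one-line helper (our own), Eq. (5.4) translates directly into code:

    % Pixel region (pi_x1, pi_y1, pi_x2, pi_y2) of pixel (v, w) on the
    % standard M-by-N grid, cf. Eq. (5.4) (a sketch).
    standard_pixel = @(v, w, M, N) [-N/2 + v, M/2 - w - 1, -N/2 + v + 1, M/2 - w];
    region = standard_pixel(0, 0, 4, 4);   % upper-left pixel: [-2, 1, -1, 2]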

Definition 3. A discrete image I is a pair (Π, x), where Π = {π_1, . . . , π_n} is a discrete grid and the vector

    x = (x_1, . . . , x_n)^T ∈ R^n  (with n = M·N for the standard grid)

describes the grey levels of the image. A discrete image I induces an associated image function f_I, defined by

    f_I(x, y) = \sum_{i=1}^n x_i \, f_{π_i}(x, y) .    (5.5)

This is a piecewise constant function on the pixel grid.

Figure 5.2: Three projection models. Left: the line model; middle: the strip model; right: the interpolation (or Joseph) model.

For a given image function f and discrete grid Π, there exists an associated discrete image I on Π such that f = f_I if and only if the function f is constant on each pixel region and 0 outside the pixel regions. Therefore, the full set of such images forms a subset of all image functions, containing those images that can be represented exactly on the discrete grid.

For a given pixel grid Π, image functions that are constant within the boundaries of each pixel and zero outside the pixel grid can be represented perfectly on the grid Π. For such image functions, analytical operations involving the Radon transform can therefore be formulated as a finite series of multiplications and additions on the discrete representation. This links the analytical model of the Radon transform with the field of linear algebra.

5.3 Projection Models

Although the Radon transform is the standard model for tomography used in mathematics, the Radon transform itself is only an approximate model of the image formation in an X-ray CT scanner. Other models than the Radon transform for describing how the measurements depend on the image function are also commonly used in practice, and may be more physically realistic or more computationally convenient in some cases. Here, we focus on a general linear model where the measurement M f_I(θ, s) is a linear combination of the pixel values x_i,

    M f_I(θ, s) = \sum_{i=1}^n a(θ, s, π_i) \, x_i .    (5.6)

The value a(θ, s, π_i) is commonly called the "pixel weight" within the field of tomography, but here we choose to call it the line-pixel coefficient for pixel π_i and line (θ, s). In this way, it is clear that the coefficient depends on both a pixel and a line, and is not just a property of the pixel.

There are various schemes in use for computing line-pixel coefficients, leading to different expressions for a(θ, s, π_i). We refer to these various schemes by the generic term projection models. In the early days of tomography, it was common to use binary line-pixel coefficients, which simply indicate whether or not a pixel intersects a line. As this model does not take the actual intersection length into account, it is a rather coarse model that does not correspond well with actual physical measurements. Three common choices of projection model in use today are depicted in Figure 5.2: the line model, the strip model, and the interpolation model. These and other models are surveyed in [18].

It is important to realize that each projection model represents a different approximation to the real imaging process that underlies the acquisition of tomographic data. In reality, beams of X-rays, neutrons or electrons neither behave as linear rays, nor do they behave as strips. Hence the analytical Radon transform, too, is only an approximation to the true imaging process. Each of the projection models has somewhat different numerical and computational properties.

5.3.1 The Line Model

The Radon transform of the pixel indicator function (5.2) can be computed by (5.3), and we obtain the so-called line model given by

    R f_I(θ, s) = \sum_{i=1}^n x_i \, R f_{π_i}(θ, s) = \sum_{i=1}^n c(L_{θ,s}, π_i) \, x_i .    (5.7)

In this model, the line-pixel coefficient a(θ, s, π_i) is identical to the intersection length c(L_{θ,s}, π_i) defined in (5.3). The geometry is shown in Figure 5.3.

By computing a sum over all pixels of I, weighted by their intersection lengths, we obtain the value of the Radon transform of f_I, sampled at (θ, s). This process is often referred to as Siddon's method (although Siddon's contribution [32] was a clever way to arrange the computations for a 3D grid). The line model gives exact data for images that are piecewise constant, i.e., when f(x, y) = f_I(x, y).
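As an illustration, here is a brute-force sketch of one row of the line-model system matrix, built with the intersect_length helper sketched in Section 5.2 and the standard grid of Eq. (5.4); note that Siddon's method arranges this computation far more efficiently by visiting only the pixels the ray actually crosses:

    % One row of the line-model matrix for the line L_{theta,s} on the
    % standard N-by-N grid (a sketch; the parameters are our own choices).
    N = 16;  theta = pi/7;  s = 0.3;
    a_row = zeros(1, N*N);
    for w = 0:N-1
      for v = 0:N-1
        a_row(w*N + v + 1) = intersect_length(theta, s, ...
            -N/2 + v, N/2 - w - 1, -N/2 + v + 1, N/2 - w);
      end
    end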

5.3.2 The Strip Model

The strip model is based on the fact that the detector, which measures the X-ray intensity in the scanner, consists of a finite number of detector elements. Each detector element simultaneously measures the intensity for all lines that intersect with it. Suppose that a set of parallel lines {L_{θ,s} : s_1 ≤ s ≤ s_2} intersects with a particular detector element. In this case the line-pixel coefficient for image pixel π_i is given by

    a(θ, s, π_i) = \int_{s_1}^{s_2} c(L_{θ,s'}, π_i) \, ds' ,    (5.8)

which equals the area of intersection between pixel π_i and the area covered by the set of parallel lines surrounding L_{θ,s}, i.e., the strip shown in Figure 5.2. This can be considered a somewhat "better" model of the underlying physics in the CT scanner.

Like the line model, the strip model gives exact data for images that are piecewise constant, i.e., when f(x, y) = f_I(x, y).

Figure 5.4 shows a comparison between the properties of the line model and the strip model, for θ = 0° and θ = 45°.


Figure 5.3: The geometry of the line model: ∫_ray f(x, y) dℓ ≈ Σ_k h_k f_k, where h_k is the length of the line segment in pixel (i(k), j(k)) and f(x, y) is assumed constant equal to f_k in that pixel; the pixel centers are x_i = i − 1/2, y_j = j − 1/2.


Figure 5.4: Examples of the pixel footprints for the line and strip models, which show the line-pixel weight a(θ, s, π_i) for a given angle θ and a given pixel π_i, as a function of the line parameter s.

The four plots show the line-pixel coefficient a(θ, s, π_i) as a function of s, keeping the angle θ and the pixel π_i fixed within each plot. We call these plots the pixel footprints of the pixel on the detector. As can be seen in Figure 5.4, the pixel footprint for the line model is not always continuous. If a line L_{θ,s} is just at the boundary between two pixels, not taking this discontinuity properly into account can lead to discretization errors. In contrast, the pixel footprint for the strip model is always continuous.

5.3.3 The Interpolation Model (a.k.a. the Joseph Model)

As a third projection model we discuss the interpolation model, also known as Joseph's model. Contrary to the line and strip models, which can be used for any pixel grid (possibly containing pixels of different sizes, not arranged in rows and columns), the interpolation model is defined specifically for the standard M×N pixel grid. It is based on an image representation where the value of the image function is specified in the center point of each pixel, while the values at other points are derived by linear interpolation. This contrasts again with the line and strip models, where the image function is constant within each pixel.

The key idea in the interpolation model can be described as putting an artificial pixel over the line L_{θ,s} with a constant intensity, in such a way that we can use the line model within this artificial pixel, see Figure 5.5. The intensity value associated with the artificial pixel is found by linear interpolation between two neighbouring pixels, either in the same row or the same column. The choice is dictated by the angle θ.


Figure 5.5: The location of the artificial pixel (shown with a red border) in our interpretation of the interpolation (or Joseph) model.

If the line L_{θ,s} makes an angle with the x-axis within [−90°, 90°], then two neighbouring pixels in the same column are used; otherwise two neighbouring pixels in the same row are used.

We now specifically consider an example with 45° < θ < 90°, i.e., we are in the latter case; see also Figure 5.6. In this case we need to perform interpolation between the pixel values at the centers of two pixels in the same row w of the standard grid.

According to (5.4), all pixels π in this row have π_{y1} = M/2 − w − 1 and π_{y2} = M/2 − w. The pixel midpoints of row w lie on the horizontal center line y = M/2 − w − 1/2.

To determine the intersection between this horizontal line and the line L_{θ,s}, we insert y into the line's equation x cos θ + y sin θ = s, yielding

    x = \frac{s − (M/2 − w − 1/2) \sin θ}{\cos θ} .    (5.9)

This determines the point (x, y) where L_{θ,s} intersects with the horizontal center line of row w.

What remains now is to perform the linear interpolation between the pixel values at the centers of the two pixels in row w that are immediately to the left and right of the point (x, y). The column coordinate v_left of the pixel whose center point is immediately to the left of (x, y) satisfies

    −N/2 + v_left + 1/2 ≤ x < −N/2 + v_left + 3/2  ⟹  v_left = ⌊x + N/2 − 1/2⌋ .

Obviously, the column coordinate of the pixel immediately to the right is v_right = v_left + 1. Recalling that the pixels have unit size, the two interpolation weights are the distances from the intersection point to the two pixel centers; writing x̃ = x + N/2 − 1/2 for the column coordinate of the intersection point, the interpolated contribution from L_{θ,s} at row w is

    (x̃ − v_left) f(w, v_right) + (v_right − x̃) f(w, v_left) .

Moreover, the relevant length of the line segment of L_{θ,s} (in the artificial pixel) is 1/|cos θ|. Hence, the associated line-pixel coefficients for the two pixels are (x̃ − v_left)/|cos θ| and (v_right − x̃)/|cos θ|, respectively.

The interpolation model gives exact data only for the very special case where the image is linear in x and y, i.e., f(x, y) = αx + βy + γ where α, β and γ are constants.
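For concreteness, the following MATLAB function is a minimal sketch (our own, assuming unit pixels and 45° < θ < 90°) of the two Joseph weights in one row; the function name and interface are illustrative assumptions:

    function [vleft, wl, wr] = joseph_weights(theta, s, w, M, N)
    % Sketch: Joseph weights of line L_{theta,s} in row w of the standard grid.
      y  = M/2 - w - 1/2;                    % center line of row w
      x  = (s - y*sin(theta)) / cos(theta);  % Eq. (5.9): intersection point
      xt = x + N/2 - 1/2;                    % column coordinate of that point
      vleft = floor(xt);                     % pixel center immediately left
      lam = xt - vleft;                      % interpolation parameter in [0,1)
      wl  = (1 - lam) / abs(cos(theta));     % coefficient of pixel (vleft,   w)
      wr  = lam       / abs(cos(theta));     % coefficient of pixel (vleft+1, w)
    end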


Figure 5.6: The geometry of the interpolation model, also called Joseph's model. The figure annotates the relations ∫_ray f(x, y) dℓ = (1/|cos θ|) ∫_ray f(x(y), y) dy ≈ h Σ_j f(x(y_j), y_j), where h = 1/|cos θ| is the length of the line segment (> 1), and f(x(y_j), y_j) is linearly interpolated from f(x_{i−1}, y_j) and f(x_i, y_j).


5.4 The System Matrix

So far, we have covered the discretization of the image function and how different projection models relate this discretization to the measurement along a line. This section discusses how we arrive at a linear system of equations.

5.4.1 Projection Geometries in 2D

In tomography, measurements are carried out for many sets of lines, intersecting the scanned object from a range of angles. At each projection angle, a set of rays from a single source (possibly a parallel set of rays) traverses the object and the result is measured by a detector at the other side of the object. This is referred to as a projection or a view. The total set of measurements is an assembly of projection views, each acquired for a different pair of source and detector positions.

Definition 4. A view V is a finite set of lines. A projection geometry G is a collection of d views V_1, . . . , V_d.

For the parallel-beam geometry (cf. Figure 1.1), each view V_j is associated with an angle θ_j such that all lines L_{θ_j,s} ∈ V_j. For each view, the lines are usually spaced regularly because all detector elements have the same size. Assume that the width of the detector is an integer W and that each detector element has size 1. Let

    S = { −W/2 + 1/2, −W/2 + 3/2, . . . , W/2 − 1/2 } .

Then the view V_j is the set of lines { L_{θ_j,s} : s ∈ S }.

For the fan-beam geometry (cf. Figure 1.2), each view V_j is also associated with an angle θ_j, but the lines for a view are divergent, all intersecting in a single point called the source position. Suppose that the source is located at (R cos θ_j, R sin θ_j). We now introduce a second angle γ, which denotes the angle between these two lines: a line originating from the source and going to a detector element, and the line from the source to the origin. The equation of a line at angle γ is then given by

    x \cos(θ_j + γ) + y \sin(θ_j + γ) = R \sin γ .

Suppose that a set Γ = {γ_1, . . . , γ_W} contains the angles for which the measurements are made, with the same detector as above. Then the view V_j is the set of lines { L_{θ_j+γ, R sin γ} : γ ∈ Γ }.

5.4.2 From Geometries to Matrices

We have now introduced the concepts and notation needed to describe a tomography problem as a system of linear equations. The corresponding matrix, which we call the system matrix, contains one row for each measurement and one column for each pixel in the image. Each view corresponds to a block of rows, containing the lines for that view. The entries of the system matrix are formed by the line-pixel coefficients a(θ, s, π) that are determined by the chosen projection model.

Let Π = (π_1, . . . , π_n) be a pixel grid and G = {V_1, . . . , V_d} a projection geometry. We further assume that all views contain the same number k of lines, such that the total number of lines in all views is given by m = dk. For a view V ∈ G and a line L_{θ,s}, we define the following row vector:

    a_{θ,s} = ( a(θ, s, π_1)  a(θ, s, π_2)  · · ·  a(θ, s, π_n) ) .    (5.10)

The vector a_{θ,s} contains the line-pixel coefficients of the line L_{θ,s} for all pixels in Π. This allows us to rewrite Eq. (5.6) as an inner product between two vectors:

    M(f_I)(θ, s) = a_{θ,s} · x .    (5.11)

For the view V_i associated with (θ_1, s_1), . . . , (θ_k, s_k), we now collect all the row vectors a_{θ_1,s_1}, . . . , a_{θ_k,s_k} in the k×n matrix A_{V_i}, called the view matrix:

    A_{V_i} = \begin{pmatrix} a_{θ_1,s_1} \\ \vdots \\ a_{θ_k,s_k} \end{pmatrix}
    ⟹
    A_{V_i} x = \begin{pmatrix} a_{θ_1,s_1} · x \\ \vdots \\ a_{θ_k,s_k} · x \end{pmatrix} .

Computing the matrix-vector product A_{V_i} x yields the measurements for the view V_i according to the chosen projection model. One can therefore also consider this operation as a simulation of the tomographic scanning process for a single view.

Finally, we assemble all the view matrices for the different views in G = {V_1, . . . , V_d} by stacking them in a single matrix, the system matrix A, given by

    A = \begin{pmatrix} A_{V_1} \\ \vdots \\ A_{V_d} \end{pmatrix} .

The number of rows in A equals the total number m of lines in all views for which measurements are obtained. The number of columns in A equals the number n of pixels in the image.

Suppose that measurements have been obtained in a tomographic scanner according to the scanning geometry G. Each row i of the system matrix then corresponds to a measured value, which we denote by b_i, and which corresponds to a single detector element at a given angle. The measurements jointly form the vector b ∈ R^m, which is a stacked version of all the elements in the sinogram. The measurements b and the discretized image x are then related by a system of linear equations, described by the system matrix A:

    b = Ax .

The CT reconstruction problem then amounts to solving this system of equations, which is extensively discussed in the following chapters.

5.4.3 Image Discretization Issues

Any discretization of the unknown function f(x, y) obviously introduces a discretization error, and the discretized problem may behave in a different way than the underlying physical problem in the CT scanner. This is because the discretization itself imposes a very strong prior on the problem, namely, that the image f(x, y) is a piecewise constant function f_I(x, y) on the pixels.


Figure 5.7: The very simple CT problem in Example 7 that leads to a nonsingular 4×4 system matrix A. The numbers 1–4 are the pixel numbers.

Example 7. To illustrate this point with a very small example, consider a 2×2 image, a detector with 4 elements, and a single projection angle θ = 30° as shown in Figure 5.7. The pixels have unit size, and discretization by the line model leads to the system matrix

    A = \begin{pmatrix} 0.845 & 0 & 0 & 0 \\ 0.770 & 1.155 & 0.385 & 0 \\ 0 & 0.385 & 1.155 & 0.770 \\ 0 & 0 & 0 & 0.845 \end{pmatrix} ,

which has full rank, and hence we can reconstruct the image from a single projection. While this seems to contradict Radon's fundamental result, such a single-projection reconstruction is possible due to the assumption f = f_I.
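A quick numerical check of this claim, using the matrix printed above (the test image below is our own illustrative choice):

    % Verify that the Example 7 matrix is nonsingular (a sketch).
    A = [0.845 0     0     0
         0.770 1.155 0.385 0
         0     0.385 1.155 0.770
         0     0     0     0.845];
    rank(A)                 % returns 4, so A is nonsingular
    b = A * [1; 0; 0; 1];   % data from a hypothetical 2x2 image
    x = A \ b;              % recovers [1; 0; 0; 1] exactly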

While the discretization of the detector is defined by the physical array of detector elements, the discretization of the image to be reconstructed can be freely chosen. However, it is common practice to choose the pixel size of the image roughly equal to the width of a detector element, to avoid certain degenerate behavior in the system of equations. Take the line model, for example:

• If the width of an image pixel is chosen much smaller than the width of a detector element, some pixels will not be touched by any line L_{θ,s} for some angles θ. As a result, the line model is no longer a useful discretization of the physical reality, where each point in the image actually contributes to the measurement at the detector for all projection angles.

• If the width of an image pixel is chosen much larger than the size of a detector element, the pixel intersects with multiple lines for each direction. As each line results in an additional equation in the linear system, choosing the pixel size large enough even results in a number of equations per projection angle that is larger than the number of pixels in the image to be reconstructed.


Figure 5.8: An example that admits a single-projection reconstruction (see the text for details).

Example 8. To elaborate on the second point: for a carefully chosen combination of a single projection angle θ, detector size, and number of detector elements, the system matrix A is square and nonsingular – and, in principle, we can thus compute the reconstruction from a single projection. As an example, we use N = 16, projection angle θ = 7°, N² pixel elements, and a detector size equal to the image size. Figure 5.8 shows the image f (the Shepp-Logan phantom) in very high resolution, the discretized version f_I on a 16×16 grid, and the detector data in the single projection. The resulting system matrix A has full rank and we can thus reconstruct the discretized image f_I from a single projection – this may seem sensational, but f_I is a very poor representation of the actual image f.

It is therefore imperative to be aware that any discretization of the reconstruction problem leads to discretization errors and artifacts that cause differences between the solution of the discretized problem and its continuous, non-discretized counterpart. When modelling a tomography problem using a linear algebra formulation, one should therefore always try to make sensible choices for the image discretization and projection model that avoid this degenerate behavior.

We conclude with a few remarks about the physical size of the image grid. In general, choosing this size much larger than the object will just reduce the image quality in the relevant domain of the reconstruction. On the other hand, choosing the image size smaller than the support of the object will lead to reconstruction artifacts, because the actual object that gave rise to the measured data cannot be represented on the chosen pixel grid. An alternative way to say this is that the data vector b is not in the range of the system matrix A. As a rule of thumb, it is recommended to choose the size of the pixel grid as tight as possible around the object.


Figure 5.9: Three arbitrary columns c_i of the system matrix A shown as images. The nonzero elements form sinusoidal curves in the sinogram.

5.5 Back Projection Computations

5.6 The Rows and Columns of the System Matrix

We finish this chapter with a brief discussion of the roles played by the rows r_i and columns c_j of the system matrix, which we write as

    A = \begin{pmatrix} — r_1 — \\ \vdots \\ — r_m — \end{pmatrix}
      = \begin{pmatrix} | & | & & | \\ c_1 & c_2 & \cdots & c_n \\ | & | & & | \end{pmatrix} .    (5.12)

With this notation, the matrix A maps the discretized absorption coefficients (the vector x) to the data in the detector pixels (the elements of the vector b) via

    b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}
      = Ax
      = \underbrace{x_1 c_1 + x_2 c_2 + \cdots + x_n c_n}_{\text{linear combination of columns}}
      = \begin{pmatrix} r_1 · x \\ r_2 · x \\ \vdots \\ r_m · x \end{pmatrix} .    (5.13)

If the image consists of a single white pixel on a black background, then the corresponding vector x is all zeros except for a single element x_j = 1 in position j. According to (5.13) the corresponding right-hand side is

    b = 0 c_1 + · · · + 0 c_{j−1} + 1 c_j + 0 c_{j+1} + · · · + 0 c_n = c_j .

This means that the jth column of A is the image of a single pixel; cf. Figure 5.9.

Now consider the ith row of A, which maps the image x to the ith detector element,

    b_i = r_i · x = \sum_{j=1}^n a_{ij} x_j ,  i = 1, 2, . . . , m .

This inner product approximates the line integral along ray i in the Radon transform. It means that the nonzeros of r_i correspond to those image pixels that are intersected by the ith X-ray. Hence, if we reshape r_i and plot it as a 2D image then we get a picture of the ith ray's path through the object; see Figure 5.10.


Figure 5.10: Three arbitrary rows r_i of the system matrix A shown as images, showing the ith X-ray's path.

Example 9. This example illustrates the interpretation of a row r_i in the system matrix A. In this small example, with a_{ij} = length of ray i in pixel j,

    r_i = ( a_{i1}  a_{i2}  0  a_{i4}  0  0  a_{i7}  0  0 ) ,
    b_i = r_i · x = a_{i1} x_1 + a_{i2} x_2 + a_{i4} x_4 + a_{i7} x_7 .

Finally, let us consider the back projection (2.1) and its relation to the matrix transpose A^T. Recall the "smearing and summation" interpretation of the back projection from §2.1, that we integrate the sinogram g along a sinusoidal curve, cf. Figure 2.1, and the fact that each column c_j of A indeed represents such a sinusoidal curve, cf. Figure 5.9. Multiplication with the matrix transpose therefore performs this operation:

    x = A^T b
      = \begin{pmatrix} | & & | \\ c_1 & \cdots & c_n \\ | & & | \end{pmatrix}^T b
      = \begin{pmatrix} — c_1^T — \\ \vdots \\ — c_n^T — \end{pmatrix} b
      = \begin{pmatrix} c_1 · b \\ \vdots \\ c_n · b \end{pmatrix} ,

where each inner product c_j · b corresponds to the above integration.
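In MATLAB these interpretations are easy to inspect. The following sketch assumes that a system matrix A, a data vector b, and the sizes N, Ns, Ntheta are already in the workspace (the same reshaping convention is used in Chapter 6); the row and column indices are arbitrary illustrative choices:

    j = 200;  i = 700;   % an arbitrary column and row of A
    imagesc( reshape(full(A(:,j)), Ns, Ntheta) ), axis image  % a sinusoid, cf. Fig. 5.9
    figure
    imagesc( reshape(full(A(i,:)), N, N) ), axis image        % a ray path, cf. Fig. 5.10
    x_bp = A' * b;       % back projection via the transpose
    figure
    imagesc( reshape(x_bp, N, N) ), axis image                % the smeared-back image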

5.7 The System Matrix: Storage and Computation

We have just seen how the tomography problem can be cast into a system of linear equations. Such systems are well understood, and a broad range of efficient algorithms are available for solving them. One may therefore wonder why the tomography problem is still so interesting from a mathematical and computational perspective. Can't we just solve it already? Indeed, well-known generic algorithms from numerical linear algebra can be used to compute tomographic images. However, there are many tomography-specific details and complicating factors that must be taken into account. Here we will discuss a few of these aspects.


5.7.1 Memory Size

As a mental exercise, it is instructive to consider the size of the system matrix. If we have m equations (measured values) and n unknowns (pixel values), the system matrix has mn elements. Even for a relatively small example of a 2D image of size 100×100, where 100 views have been recorded, each containing 100 measured values, this already leads to a matrix of 10⁸ (one hundred million) elements. Although such a matrix still fits into the memory of a modern computer, it already becomes challenging to work with. For 3D images both the number of voxels and the number of measurements are far greater than in this simple case, resulting in a system matrix that cannot be stored in computer memory directly.

A first observation that can be used to reduce the memory size of the system matrix is that the matrix is sparse, meaning that most of its elements are 0. When reconstructing an image on the square standard grid of size N×N, each line through the image either intersects each row in at most two pixels, or each column in at most two pixels, depending on the orientation of the line. As each line intersects with at most 2N pixels, storing just the nonzero values and their indices requires just a few megabytes in the example given above. A different way to arrive at these numbers is to argue that each pixel has a nonzero line-pixel coefficient for approximately two lines in each view.

When scaling up to 3D images, we see that even when storing the system matrix in a sparse format, this can still be problematic if we deal with large 3D volumes. If we collect 1000 views for a volume of size 1000×1000×1000, the number of nonzeros in the system matrix is on the order of 10¹².

As an alternative to storing a sparse representation of the system matrix, which may still be too large for available computer memory, one can choose not to store the system matrix in memory at all. As we will see in later chapters of this book, a range of algorithms exist for solving the system Ax = b that do not store the matrix A, but only rely on computation of the product Ax for a series of vectors x, and on computation of the product A^T y for a series of vectors y. Computing the product Ax is commonly referred to as computing the forward projection of x, while the product A^T y is referred to as the back projection of y.
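Conceptually, the matrix-free approach replaces the stored matrix by two operators. In MATLAB one might pass function handles to a solver, as in the following sketch; the routine names forward_proj and back_proj and the geometry argument are placeholders for illustration, not functions from any particular toolbox:

    % Matrix-free forward and back projection as function handles (a sketch).
    fp = @(x) forward_proj(x, geometry);   % computes A*x  on the fly
    bp = @(y) back_proj(y, geometry);      % computes A'*y on the fly
    % An iterative solver then only ever calls fp(x) and bp(y); the matrix
    % elements are regenerated on each call and never stored.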

5.7.2 Computation

When computing the forward projection and back projection on the fly, the elements of the system matrix can be generated in different orderings. Let us first consider the forward projection. If we generate the elements of A column-by-column, we deal with one image pixel at a time. This means that we need to determine which measurements (i.e., lines between source and detector) are influenced by the value of this pixel, such that it has a nonzero line-pixel coefficient. We refer to this approach as a pixel-driven (or, in 3D, voxel-driven) computation. For each pixel, the computation consists of two components: for each angle, the point where a line from the source through this pixel hits the detector is computed; next, the corresponding line-pixel coefficient is determined.

Figure 5.11: Two different ways of ordering the computation of Ax. Left: pixel-driven computation order (column-by-column); right: ray-driven computation (row-by-row).

If we generate the elements of A row-by-row, we deal with one ray at a time. This means that we need to generate the line-pixel coefficients for all pixels on the line that corresponds to a particular detector element. We call this a ray-driven computation, as it involves generating a list of all pixels on a line. A ray-driven computation consists of two main steps for each line: a list is generated of all pixels on the line, and for each pixel the line-pixel coefficient is computed. Figure 5.11 illustrates the computation ordering for the ray-driven and pixel-driven approaches.

Chapter 6

SVD Analysis of Tomography Problems

Now that we are able to set up a matrix problem for the CT reconstruction problem, we will introduce important tools from numerical linear algebra that are equivalent to the singular values and functions discussed in Chapter 4.

Note that in the discretized problem Ax = b, the image and the sinogram are represented by the vectors x and b, respectively. While this is a convenient notation when we need the language of linear algebra, they are really 2D arrays and they should be visualized as such. In MATLAB notation:

    imagesc( reshape(x, N, N) ),       axis image
    imagesc( reshape(b, Ns, Ntheta) ), axis image

where Ns = number of detector elements, and Ntheta = number of projections. Going from an image X to a vector x is simple: just write x = X(:).

6.1 The Singular Value Decomposition

Assume that the matrix A is m × n and, for simplicity, also that m ≥ n. Then the Singular Value Decomposition (SVD) takes the form

    A = U Σ V^T = \sum_{i=1}^n u_i σ_i v_i^T .    (6.1)

Here, Σ is a diagonal matrix with the singular values σ_i, satisfying

    Σ = diag(σ_1, . . . , σ_n) ,  σ_1 ≥ σ_2 ≥ · · · ≥ σ_n ≥ 0 .    (6.2)

The matrices U and V consist of the singular vectors

    U = (u_1, . . . , u_n) ,  V = (v_1, . . . , v_n) ,    (6.3)


and both matrices have orthonormal columns: U^T U = V^T V = I_n. Then ‖A‖_2 = σ_1, ‖A^{-1}‖_2 = ‖V Σ^{-1} U^T‖_2 = ‖Σ^{-1}‖_2 = σ_n^{-1}, and the condition number is given by

    cond(A) = ‖A‖_2 ‖A^{-1}‖_2 = σ_1/σ_n .    (6.4)

The condition number plays a crucial role in the analysis of linear systems of equations, e.g., for studies of sensitivity and for convergence analysis of iterative methods, cf. Section 6.2 and Chapter 7.

The matrix A^T A is symmetric and hence its eigenvalues are real, its eigenvectors are real and orthonormal (standard linear algebra stuff), and they satisfy

    A^T A v_i = σ_i² v_i ,  i = 1, 2, . . . , n .    (6.5)

That is, the right singular vectors v_i are the eigenvectors of A^T A and the squared singular values are the corresponding eigenvalues. We emphasize that this is not how the SVD should be computed – use only good numerical software. In MATLAB:

• Use [U,S,V] = svd(A) or [U,S,V] = svd(A,0) to compute the full or “economy-size” SVD.

• Use [U,S,V] = svds(A) to efficiently compute a partial SVD (the largest singular values and corresponding singular vectors).

Relations similar to the analysis of the Radon transform have the form

    A v_i = σ_i u_i ,  ‖A v_i‖_2 = σ_i ,  i = 1, . . . , n .    (6.6)

In particular, if a singular value is zero, σ_i = 0, then the corresponding right singular vector satisfies A v_i = 0 and we say that it is a null vector of the matrix. Also, if m = n and A is nonsingular, then

    A^{-1} u_i = σ_i^{-1} v_i ,  ‖A^{-1} u_i‖_2 = σ_i^{-1} ,  i = 1, . . . , n .    (6.7)

These equations are related to the solution:

    x = \sum_{i=1}^n (v_i · x) \, v_i ,    (6.8)

    Ax = \sum_{i=1}^n σ_i (v_i · x) \, u_i ,   b = \sum_{i=1}^n (u_i · b) \, u_i ,    (6.9)

    A^{-1} b = \sum_{i=1}^n \frac{u_i · b}{σ_i} \, v_i .    (6.10)

Example 10. What the SVD Looks Like. Figures 6.1 and 6.2 illustrate the singular values and vectors for simple 1D model problems.

The Picard plot is a plot of the singular values σ_i, the coefficients |u_i · b| for the right-hand side, and the solution's coefficients |u_i · b|/σ_i. Figure 6.3 shows two cases – without and with noise in the right-hand side b.
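A Picard plot is easy to produce for a modest-size matrix; the following sketch assumes that A and b are available in the workspace:

    % A Picard plot (a sketch): singular values, right-hand side
    % coefficients, and solution coefficients on a logarithmic axis.
    [U, S, ~] = svd(A);
    sigma = diag(S);
    beta  = abs(U(:, 1:numel(sigma))' * b);      % |u_i . b|
    semilogy(sigma, 'o'), hold on
    semilogy(beta,  'x')
    semilogy(beta ./ sigma, '*'), hold off
    legend('\sigma_i', '|u_i . b|', '|u_i . b| / \sigma_i')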


Figure 6.1: The singular values decay to zero, with no gap in the spectrum. The decay rate determines how difficult the problem is to solve. Discretization of the Radon transform gives a mildly ill-posed problem, cf. Figure 4.6.

Figure 6.2: The singular vectors (here shown for a 1D problem) have increasing frequency as i increases.


Figure 6.3: Left: no noise. The singular values σ_i and the right-hand side coefficients |u_i · b| both level off at the machine precision. Right: with noise. Now the right-hand side coefficients |u_i · b| level off at the noise level, and only ≈ 18 SVD components are reliable.

6.2 Ill-Conditioned Problems

We recall that the coefficient matrix A is allowed to have more rows than columns, i.e.,

    A ∈ R^{m×n} with m ≥ n .

For m > n it is natural to consider the least squares problem

    \min_x ‖Ax − b‖_2 .    (6.11)

When we say "the naive solution" x_naive we either mean the solution A^{-1} b (when m = n) or the least squares solution (when m > n). We emphasize the convenient fact that the naive solution has precisely the same SVD expansion in both cases:

    x_naive = \sum_{i=1}^n \frac{u_i · b}{σ_i} \, v_i .    (6.12)

Discretizations of inverse problems are characterized by having coefficient matrices with a large condition number. The solution is very sensitive to errors in the data. Specifically, assume that the exact and perturbed solutions x and x̃ satisfy

    A x = b ,  A x̃ = b̃ = b + e ,    (6.13)

where e denotes the perturbation (the errors and noise). Then classical perturbation theory leads to the bound

    \frac{‖x − x̃‖_2}{‖x‖_2} ≤ cond(A) \, \frac{‖e‖_2}{‖b‖_2} .    (6.14)

Since cond(A) = σ_1/σ_n is large, this implies that the naive solution x_naive = A^{-1} b̃ can be very far from x. See Figure 6.4.


Figure 6.4: The geometry of ill-conditioned problems. In the solution space R^n = span{v_1, . . . , v_n} sit the exact solution x and the naive solution x_naive; in the data space R^m = span{u_1, . . . , u_m} sit the exact data b = Ax and the perturbed data b̃ = b + e.

The Need for Filtering. Recall that the (least squares) solution is given by

    x = \sum_{i=1}^n \frac{u_i · b}{σ_i} \, v_i .

When noise is present in the data b̃ = b + e, then

    u_i · b̃ = u_i · b + u_i · e ≈ \begin{cases} u_i · b , & |u_i · b| > |u_i · e| \\ u_i · e , & |u_i · b| < |u_i · e| . \end{cases}    (6.15)

We note that due to the Picard condition, the noise-free coefficients |u_i · b| decay. The "noisy" components |u_i · b̃| are those for which |u_i · b| is small, and they correspond to the smaller singular values σ_i.

6.3 Spectral Filtering

Many of the noise-reducing methods treated in this course produce solutions which can be expressed as a filtered SVD expansion of the form

    x_filt = \sum_{i=1}^n ϕ_i \, \frac{u_i · b}{σ_i} \, v_i ,    (6.16)

where ϕ_i are the filter factors associated with the method. These methods are called spectral filtering methods because the SVD basis can be considered as a spectral basis.

Truncated SVD. A simple approach is to discard the SVD coefficients corresponding to the smallest singular values:

    ϕ_i^{TSVD} = \begin{cases} 1 , & i = 1, 2, . . . , k \\ 0 , & else. \end{cases}    (6.17)


Figure 6.5: TSVD solutions to a simple 1D problem. As we increase the truncation parameter k in the TSVD solution x_k, we include more SVD components and also more noise. At some point the noise becomes visible and then starts to dominate x_k.

More sophisticated methods will be discussed later. We then define the Truncated SVD (TSVD) solution as

    x_k = \sum_{i=1}^n ϕ_i^{TSVD} \, \frac{u_i · b}{σ_i} \, v_i = \sum_{i=1}^k \frac{u_i · b}{σ_i} \, v_i ,  k < n .    (6.18)

Such solutions are illustrated in Figure 6.5. We can show that if Cov(b) = η² I_m then

    Cov(x_k) = η² \sum_{i=1}^k \frac{1}{σ_i²} \, v_i v_i^T    (6.19)

and therefore

    ‖x_k‖_2 ≪ ‖x_naive‖_2 and ‖Cov(x_k)‖_2 ≪ ‖Cov(x_naive)‖_2 .    (6.20)

The price we pay for the smaller covariance is bias: E(x_k) ≠ E(x_naive).

Theorem. Let b̃ = b + e and let x_k and x̃_k denote the TSVD solutions computed with the same k. Then

    \frac{‖x_k − x̃_k‖_2}{‖x_k‖_2} ≤ \frac{σ_1}{σ_k} \, \frac{‖e‖_2}{‖A x_k‖_2} .    (6.21)

We see that the condition number for the TSVD solution is

    κ_k = σ_1/σ_k    (6.22)


Figure 6.6: Where the TSVD method fits in the picture. The TSVD solution x_k is a better approximation to the exact solution x than the naive solution.

and it can be much smaller than cond(A) = σ1/σn.

The Truncation Parameter. Note that the truncation parameter k in

    x_k = \sum_{i=1}^k \frac{u_i · b}{σ_i} \, v_i    (6.23)

is dictated by the coefficients u_i · b, not by the singular values. Basically, we should choose k as the index i where the |u_i^T b| start to "level off" due to the noise. The TSVD solution and residual norms vary monotonically with k:

    ‖x_k‖_2² = \sum_{i=1}^k \left( \frac{u_i · b}{σ_i} \right)^2 ≤ ‖x_{k+1}‖_2² ,    (6.24)

    ‖A x_k − b‖_2² = \sum_{i=k+1}^n (u_i · b)² ≥ ‖A x_{k+1} − b‖_2² .    (6.25)

The use of the SVD analysis for CT is further elaborated in the exercises.

6.4 SVD Analysis of Discretized CT Problems

This is currently left as exercises in the course.

6.5 Additional Topics

1. Random rays.


2. Limited angle CT.

3. Few projections.

4. Laminar CT (laminography?).

Chapter 7

Algebraic Iterative Reconstruction (AIR) Methods

The linear algebra approach to solving reconstruction problems – using the discretization techniques from the previous chapter – can sometimes provide a favorable alternative to methods based on analytical expressions such as filtered back projection (FBP). FBP is fast because its computer implementation is built on the fast Fourier transform (FFT) algorithm and it uses very little auxiliary memory, but it is also somewhat limited because a good reconstruction requires many projections with low noise, and it is preferable to have regularly spaced projection angles. Methods based on a matrix formulation are computationally more expensive, but they are also much more flexible because there are no underlying requirements about the geometry of the projections and because we have more freedom to incorporate priors about the object being reconstructed.

Solving linear systems of equations is among the most common computational problems in science and industry, and everyone has at some point solved such a problem. So why do we need a whole chapter on this topic? The answer is that there are many aspects of such systems that are really important for analysing and solving tomography problems, and we find it useful to present these aspects in a common notation and within the framework of this book.

7.1 A Motivating Example

Before we delve into the world of linear algebra we want to motivate the use of the algebraic methods that we will present in this chapter. Specifically, we will compare the reconstructions computed by means of filtered back projection (FBP) with those of the Landweber method to be discussed in §7.4.3, including box constraints, meaning that the pixel values are constrained to be between 0 and 1.

We consider a 200 × 200 phantom and parallel-beam geometry with 283 detector elements. The width of the detector equals the diagonal of the image. We consider two different scenarios for the projection angles θ.

• 60 equidistantly spaced angles: 3°, 6°, 9°, . . . , 180°.


Figure 7.1: Two small tomography problems that compare FBP reconstructions with those computed with Landweber's method (to be discussed in §7.4.3). The phantom is 200 × 200, and we consider parallel-beam geometry with 283 detector elements. In the top example there are 60 equidistantly distributed projection angles, and in the bottom example there are 40 irregularly spaced angles; see the text for details. In both scenarios the Landweber method clearly outperforms FBP and produces sharper reconstructions with fewer artifacts.


• 40 irregularly spaced angles: 2°, 4°, . . . , 30°, 50°, 55°, . . . , 100°, 145°, 148°, . . . , 180°.

The reconstructions are shown in Figure 7.1 together with the phantom and the data shown as sinograms (in the latter example note the "missing" parts).

Clearly there are severe artifacts in the FBP reconstructions, demonstrating that this method is not well suited for such scenarios with a small number of projection angles, and that irregular spacing of the angles leads to further artifacts in the reconstruction. In the Landweber method we incorporated the prior knowledge (the solution constraints) that all pixels should lie between 0 and 1, which is the gray level range of the phantom. These constraints, combined with the fact that Landweber's method works for any distribution of angles, explain why this method gives much sharper reconstructions with fewer artifacts.
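To fix ideas, here is a minimal sketch of a projected Landweber iteration with box constraints; the step size rule, iteration count, and starting vector are our own illustrative assumptions (the method itself is treated in §7.4.3):

    % Projected Landweber iteration with box constraints [0, 1] (a sketch);
    % A and b are assumed available. Convergence of the basic iteration
    % requires a step size 0 < omega < 2/norm(A)^2.
    omega = 1.9 / normest(A)^2;
    x = zeros(size(A, 2), 1);
    for iter = 1:200
      x = x + omega * (A' * (b - A*x));   % Landweber step
      x = min(max(x, 0), 1);              % project onto the box [0, 1]
    end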

7.2 Linear Systems of Equations

We have already seen in the previous chapter that when we discretize a tomography problem we arrive at a linear system of equations Ax = b, where A is the system matrix, b is the right-hand side, and the vector x is the solution, which represents the reconstruction we want to compute.

7.2.1 Notation

Let us start by introducing a bit of notation, where we adopt the convention that a vector variable has no fixed form, i.e., it can be either a row or a column as needed.

If the matrix A has dimensions m × n (i.e., m rows and n columns) then we write this matrix as

    A = \begin{pmatrix} | & | & & | \\ c_1 & c_2 & \cdots & c_n \\ | & | & & | \end{pmatrix}
      = \begin{pmatrix} — r_1 — \\ \vdots \\ — r_m — \end{pmatrix} ,    (7.1)

where the vector c_i (of length m) is the ith column of A and the vector r_j (of length n) is the jth row of A. At this point it is instructive to consider two different interpretations of the matrix-vector product Ax, namely:

    b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}
      = Ax
      = \underbrace{x_1 c_1 + x_2 c_2 + \cdots + x_n c_n}_{\text{linear combination of columns}}
      = \begin{pmatrix} r_1 · x \\ r_2 · x \\ \vdots \\ r_m · x \end{pmatrix} .    (7.2)

The interpretation involving the rows is very natural in connection with tomography problems, because it says that the ith element of the data b = Ax is given by the inner product of the ith row r_i and the image x,

    b_i = r_i · x = \sum_{j=1}^n (r_i)_j x_j = \sum_{j=1}^n a_{ij} x_j ,    (7.3)


which is precisely how we define the approximation to the integral along the ith ray that defines the ith measurement, cf. Eq. (5.11).

The alternative interpretation based on the columns is also interesting, because it reveals how each pixel of the image is mapped to the data. Consider a black image with a single nonzero pixel of intensity x_j, whose vector representation is

    x = ( 0 · · · 0  x_j  0 · · · 0 ) ,  with j − 1 zeros before x_j and n − j zeros after it.

The data corresponding to this simple image is

    Ax = 0 c_1 + · · · + 0 c_{j−1} + x_j c_j + 0 c_{j+1} + · · · + 0 c_n = x_j c_j .

From this relation we immediately identify the jth column of the matrix A as the data due to the jth pixel, with a unit pixel intensity. Also, we see from (7.2) that the complete data is the weighted sum of the partial data from every pixel in the image.

Example 11. This example shows the sum-of-columns interpretation of the matrix-vector product Ax. The 32 × 32 image x has four nonzero pixels with intensities 1, 0.8, 0.6, and 0.4 as shown in Figure 7.2; in the vector representation x these four pixels correspond to entries 468, 618, 206, and 793. Hence the sinogram, represented as a vector b, takes the form

    b = 0.6 c_206 + 1.0 c_468 + 0.8 c_618 + 0.4 c_793 .

Figure 7.2: The image is a sum of single-pixel images; similarly, the sinogram is a weighted sum of simple sinusoidal images corresponding to the columns of A.

7.2.2 Rank, Consistency, and Null Space

The first thing to notice is that we should be a bit careful, because we may not always be able to find an x such that the equation Ax = b holds. This depends on the size and the rank of the matrix A. To analyze this, we need some notation and a few definitions.

Definition 5. The rank r of an m×n matrix A is the number of linearly independent rows of the matrix (it is also equal to the number of linearly independent columns), and for a nonzero matrix the rank satisfies 1 ≤ r ≤ min(m, n).


Definition 6. The range, or column space, R(A) of an m × n matrix A is the linear subspace spanned by the columns of the matrix:

$$
\mathcal{R}(A) \equiv \{\, u \in \mathbb{R}^m \mid u = \alpha_1 c_1 + \alpha_2 c_2 + \cdots + \alpha_n c_n,\ \text{arbitrary } \alpha_j \,\}.
\tag{7.4}
$$

The null space, or kernel, N(A) is the linear subspace of all vectors mapped to zero:

$$
\mathcal{N}(A) \equiv \{\, v \in \mathbb{R}^n \mid Av = 0 \,\}.
\tag{7.5}
$$

The dimensions of the two subspaces are r and n− r, respectively.

Example 12. Consider the 3 × 3 matrix

$$
A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}.
$$

This matrix has rank r = 2 since the middle row is the average of the first and third rows, which are linearly independent. The range R(A) and null space N(A) consist of all vectors of the forms

$$
\alpha_1 \begin{pmatrix} 1 \\ 4 \\ 7 \end{pmatrix}
+ \alpha_2 \begin{pmatrix} 3 \\ 6 \\ 9 \end{pmatrix}
= \begin{pmatrix} \alpha_1 + 3\alpha_2 \\ 4\alpha_1 + 6\alpha_2 \\ 7\alpha_1 + 9\alpha_2 \end{pmatrix}
\qquad\text{and}\qquad
\alpha_3 \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix},
$$

respectively, for arbitrary α1, α2, and α3.

We can use the SVD of A to quantify these subspaces. The rank r of A is equal to the number of nonzero singular values, i.e., σ_r > 0 and σ_{r+1} = 0. Then the range of A is spanned by the first r left singular vectors, while the null space is spanned by the last n − r right singular vectors:

$$
\mathcal{R}(A) = \mathrm{span}\{u_1, \ldots, u_r\},
\qquad
\mathcal{N}(A) = \mathrm{span}\{v_{r+1}, \ldots, v_n\}.
\tag{7.6}
$$

With the above notation and definitions in place, let us consider when the linear system Ax = b has a solution. The following example shows that this is not always the case.

Example 13. Consider two linear systems with the matrix A from the previous example:

$$
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} x
= \begin{pmatrix} 14 \\ 20 \\ 50 \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} x
= \begin{pmatrix} 6 \\ 15 \\ 24 \end{pmatrix}.
$$

The first system has no solution because the right-hand side does not belong to the range R(A); no matter which linear combination of the columns of A we create, we can never form this right-hand side. The second system, on the other hand, has infinitely many solutions; any vector of the form

$$
x = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + \alpha \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix},
\qquad \alpha \ \text{arbitrary}
$$

satisfies this equation because the last, arbitrary, term lies in the null space N(A).


Table 7.1: Overview of linear systems; A is m × n and r = rank(A). There is a unique solution only if r = n and, in the case m > n, the system is consistent.

                        Full rank                      Rank deficient

  m < n                 r = m                          r < m
  Underdetermined       Always consistent.             Can be inconsistent.
                        Always infinitely              No solution or infinitely
                        many solutions.                many solutions.

  m = n                 r = m = n                      r < m = n
  Square                Always consistent.             Can be inconsistent.
                        Always a unique solution.      No solution or infinitely
                                                       many solutions.

  m > n                 r = n                          r < n
  Overdetermined        Can be inconsistent.           Can be inconsistent.
                        No solution or                 No solution or infinitely
                        a unique solution.             many solutions.

At this stage, one might argue that since the data b always comes from the forward projection of an image x, the system Ax = b will always have a solution. However, the right-hand side that we encounter when computing a tomographic reconstruction from measured data is contaminated with noise, so it is indeed relevant to study the solvability of the system Ax = b.

The key issue is, of course, whether the system is consistent, i.e., whether we can find a vector x that satisfies the equation – in other words, whether the right-hand side b lies in the range R(A). This question has a nice geometric interpretation, as shown in the following example. Table 7.1 gives an overview of the different possibilities, depending on the rank and size of the matrix A. Notice that a matrix with m < n (some call such matrices "short and fat" or "obese") always has a non-trivial null space.

Example 14. The example shown in Figure 7.3 illustrates an inconsistent system with a 3 × 3 rank deficient matrix A = (c_1, c_2, c_3). The range R(A) is the 2-dimensional subspace spanned by the three columns c_1, c_2, and c_3 of A. The right-hand side b does not lie entirely in R(A); it has one component in the range and another component outside the range (in the figure, these two components are orthogonal) and hence there is no x such that Ax = b.


Figure 7.3: Illustration of an inconsistent system with a rank deficient A. The infinite subspace R(A) is visualized as a plane sheet, and the dashed line is the orthogonal projection of b on this subspace.

7.3 Linear Least Squares Problems

The linear least squares problem is probably the most well-known problem formulation that seeks to handle inconsistency of a linear problem, i.e., to define a meaningful "solution" to an inconsistent system. The key idea is to find an x such that Ax approximates the right-hand side b in some optimal way – here by minimizing the 2-norm of the residual vector b − Ax.

The philosophy underlying the least squares problem is as follows. Assume that the right-hand side has the form

$$
b = Ax + e,
$$

where x is the exact, ground-truth solution and e is a random vector with zero mean and whose elements are all from the same Gaussian distribution (i.e., the covariance matrix is a scaled identity). Then the optimal estimate of x (given that it should be linear and unbiased) is obtained by solving the least squares problem

$$
\min_x \tfrac{1}{2}\, \|b - Ax\|_2^2
\tag{7.7}
$$

and we refer to the solution x_LS as the least squares solution. In statistics it is referred to as the best linear unbiased estimator of x. Geometrically, cf. Figure 7.3, this corresponds to finding a least squares solution x_LS such that Ax_LS (the dashed line) is orthogonal to the residual vector b − Ax_LS (the dotted line).

The above definition ensures the existence of a least squares solution x_LS, no matter the rank and the dimensions of the matrix A. Regarding uniqueness of the solution, the situation is quite simple: the least squares solution is unique if and only if the null space of A is trivial, i.e., it merely consists of the vector 0 of all zeros. This is the case if and only if the rank of A is equal to the number of columns, i.e., r = n, which requires that the matrix is either square or "tall and skinny": m ≥ n. In all other cases


the least squares solution has an arbitrary, undetermined component in the null space N(A).

Perhaps the simplest way to define a procedure for computing the least squares solution for the case r = n ≤ m is via the so-called normal equations

$$
A^T A\, x = A^T b
\quad\Rightarrow\quad
x_{\rm LS} = (A^T A)^{-1} A^T b,
\tag{7.8}
$$

and the condition that the rank satisfies r = n ensures that the cross-product matrix A^T A has full rank and is invertible. While this formulation is simple and compact, from a computational point of view this is not always the best way to compute x_LS due to the influence of rounding errors; a better approach is to use a QR factorization (or SVD) of A; see, e.g., [20].

Mathematicians and statisticians have found it convenient to come up with a definition of a unique solution x^0_LS for all instances of the linear least squares problem, namely, the least squares solution that has no component in N(A). This happens to be identical to the unique least squares solution with minimum 2-norm – and hence it is often referred to as the minimum-norm least squares solution:

$$
x^0_{\rm LS} = \operatorname*{argmin}_x \|x\|_2
\quad\text{subject to}\quad
A^T A\, x = A^T b.
\tag{7.9}
$$

In connection with tomographic reconstruction problems, as well as other inverse problems, its significance lies in the fact that all components of this solution can, ideally, be attributed to the given data.

Example 15. Let us return to the first problem from Example 13. This problem is inconsistent, and since the matrix has rank r = 2 < n = 3 the least squares solution is not unique. It is straightforward to show that all least squares solutions have the form

$$
x_{\rm LS} = \begin{pmatrix} 3 \\ 2 \\ 1 \end{pmatrix}
+ \alpha \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix},
\qquad \alpha \ \text{arbitrary}.
$$

The minimum-norm least squares solution x^0_LS is obtained by setting α = 0 in the above expression.

The minimum-norm least squares solution depends linearly on the right-hand side b and it can be written as x^0_LS = A†b, where the matrix A† is the unique pseudoinverse of A. If A has full row or column rank then we have the expressions

$$
A^\dagger =
\begin{cases}
A^T (A A^T)^{-1}, & \text{if } r = m, \\
(A^T A)^{-1} A^T, & \text{if } r = n
\end{cases}
\tag{7.10}
$$

(we give a precise definition and a general expression for A† in Chapter 6). In particular, the pseudoinverse of a row vector r (we need this result later) is its scaled transpose and hence a column vector: r† = r^T / ‖r‖₂².

The pseudoinverse has many other names in different literatures, such as "generalized inverse," "Moore–Penrose inverse," and "Lanczos inverse." There is a rich


literature on this matrix, but in this book the pseudoinverse will not receive much attention since it plays no important role in connection with tomographic reconstruction problems.

The idea of defining and computing a solution that fits the given noisy data is a very sound principle. The least squares solution x_LS is a good way to deal with this situation in the presence of white Gaussian noise in the data. But one should be aware that for other types of noise, x_LS may not be the right choice. In particular, if the data includes "outliers" – data with unusually large errors – then the least squares solution should be avoided because it is very sensitive to these "outliers." For such problems, statisticians advocate minimizing the 1-norm of the residual (instead of the 2-norm) since this makes the solution much more robust (less sensitive) to the "outliers":

$$
\min_x \|Ax - b\|_1,
\qquad
\|Ax - b\|_1 = \sum_{i=1}^{m} |\, r_i \cdot x - b_i \,|.
\tag{7.11}
$$

We emphasize that the computational problem underlying this formulation is more complex than the least squares problem, so we advise using the 1-norm only when it is needed.

Example 16. Consider the two overdetermined problems with the same matrix:

$$
A = \begin{pmatrix}
1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \\ 1 & 6 & 36
\end{pmatrix},
\quad
Ax = \begin{pmatrix} 6 \\ 17 \\ 34 \\ 57 \\ 86 \\ 121 \end{pmatrix},
\quad
b = \begin{pmatrix} 6.0001 \\ 17.0285 \\ 33.9971 \\ 57.0061 \\ 85.9965 \\ 120.9958 \end{pmatrix},
\quad
b^{\rm o} = \begin{pmatrix} 6.0001 \\ 17.2850 \\ 33.9971 \\ 57.0061 \\ 85.9965 \\ 120.9958 \end{pmatrix}.
$$

The two right-hand sides were generated by adding random errors to the same vector Ax with the ground truth x = (1, 2, 3); the second element of b^o contains an "outlier" whose error is 10 times that of b. Let x_LS and x^o_LS denote the least squares solutions to the two systems with right-hand sides b and b^o, respectively, and let x_1 and x^o_1 denote the corresponding 1-norm solutions. The solutions are

$$
x_{\rm LS} = \begin{pmatrix} 1.0041 \\ 2.0051 \\ 2.9989 \end{pmatrix},
\quad
x^{\rm o}_{\rm LS} = \begin{pmatrix} 1.0811 \\ 2.0151 \\ 2.9943 \end{pmatrix},
\quad
x_1 = \begin{pmatrix} 0.9932 \\ 2.0087 \\ 2.9986 \end{pmatrix},
\quad
x^{\rm o}_1 = \begin{pmatrix} 0.9932 \\ 2.0088 \\ 2.9986 \end{pmatrix}.
$$

This confirms the robustness of the 1-norm solution: it is much less sensitive to the outlier in the second element of the right-hand side – in fact, the two solutions x_1 and x^o_1 are almost the same, while x_LS and x^o_LS differ in the second decimal place.
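For readers who want to experiment, the 1-norm fit in (7.11) can be computed by rewriting it as a linear program. The following sketch is our own formulation (not taken from the book), using SciPy's linprog on the matrix and right-hand side b from this example; the computed solutions should behave like x_LS and x_1 above, although the exact digits depend on the noise realization.

```python
import numpy as np
from scipy.optimize import linprog

# The 6 x 3 matrix and the right-hand side b from Example 16.
t = np.arange(1.0, 7.0)
A = np.column_stack([np.ones(6), t, t**2])
b = np.array([6.0001, 17.0285, 33.9971, 57.0061, 85.9965, 120.9958])

# 2-norm solution.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

# 1-norm solution: min sum(s) subject to -s <= A x - b <= s, variables (x, s).
m, n = A.shape
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([b, -b])
bounds = [(None, None)] * n + [(0, None)] * m
x_l1 = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[:n]

print(x_ls)  # close to (1, 2, 3)
print(x_l1)  # also close to (1, 2, 3), and insensitive to outliers in b
```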

7.4 Iterative Solvers

So far we have not discussed the issue of how to actually solve the linear system of equations Ax = b. Thinking back on the introductory mathematics courses, the techniques


of "Gaussian elimination" and "LU factorization" might come to mind. However, these methods are seldom used for solving tomographic reconstruction problems – they are designed for full-rank square matrices, while the matrices we encounter are typically rectangular, rank deficient, and very large. For this reason, other methods have become popular in the tomographic reconstruction communities.

The focus has been, and still is, on iterative methods for solving the system Ax = b. From a starting guess x^(0) (often the zero vector), these methods produce a sequence of iteration vectors x^(1), x^(2), x^(3), ... that converge to the (least squares) solution. The main ingredients of these methods are matrix-vector multiplications with the matrix A and its transpose A^T; the matrix is not altered, as is the case in, say, Gaussian elimination. In fact, we do not need to store the matrix A – all that we need are computational procedures that implement matrix-vector multiplication with A and A^T; this is a big advantage when working with very large tomography problems where storing and accessing the matrix A is computationally demanding.

We emphasize that the formulations of the iterative methods are the same, no matter if the matrix A is explicitly available or not. Hence we can always formulate the algorithms in terms of A.
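As an illustration of the matrix-free viewpoint, the following sketch (our own, with unit ray weights) wraps the two-projection geometry from Section 7.5.1 – where Ax stacks the N row sums and N column sums of the N × N image x – as a SciPy LinearOperator, so that iterative solvers can be applied without ever storing A:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

N = 16  # image is N x N; only the image size is stored, never the matrix A

def forward(x):   # computes A @ x: the N row sums and N column sums
    img = x.reshape(N, N)
    return np.concatenate([img.sum(axis=1), img.sum(axis=0)])

def backward(y):  # computes A.T @ y: pixel (i, j) receives y_i + y_{N+j}
    return (y[:N, None] + y[None, N:]).ravel()

A_op = LinearOperator((2 * N, N * N), matvec=forward, rmatvec=backward)

# Any solver that only needs matrix-vector products can now be used, e.g. LSQR:
b = forward(np.ones(N * N))
x = lsqr(A_op, b)[0]
```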

There is a rich literature on iterative methods for linear systems of equations (see, e.g., [3] and [29]), and state-of-the-art methods, such as CGLS, LSQR and GMRES, are often based on so-called Krylov subspaces. In spite of their fast convergence properties, these methods have received little attention in the tomographic communities because they cannot be augmented with simple constraints on the solution, such as nonnegativity (which is a natural constraint in image reconstruction). Here we focus on other methods that easily allow for inclusion of such constraints.

The iterative methods for solving Ax = b that have become most popular for tomographic reconstruction problems are the so-called row-action methods which – as the name implies – involve computations with one or several rows of A at a time. In the early years of computed tomography such methods were well suited for the computers of that age because of the low demand for memory, and the methods are still relevant due to their fast convergence for tomographic problems.

Throughout the book we assume that all rows r_i of A are nonzero (i.e., at least one element of each row is nonzero); otherwise such rows should be purged from A.

Underlying these methods is a simple geometric interpretation of the linear system of equations Ax = b:

$$
\begin{aligned}
r_1 \cdot x &= a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n = b_1 \\
r_2 \cdot x &= a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n = b_2 \\
&\ \ \vdots \\
r_m \cdot x &= a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n = b_m.
\end{aligned}
$$

Then each equation r_i · x = b_i for i = 1, 2, ..., m defines an affine hyperplane in R^n, as illustrated in Figure 7.4. Assuming that the system is consistent and has a unique solution, this solution x is the unique point in R^n where all the affine hyperplanes intersect.



Figure 7.4: Illustration of affine hyperplanes for n = 2 (a line in R²) and n = 3 (a plane in R³).

Figure 7.5: Illustration of Kaczmarz’s method for m = n = 2.

7.4.1 Kaczmarz’s Method

Kaczmarz's method is a simple, intuitive, and surprisingly efficient iterative method based on the above geometric interpretation. In each iteration, and in a cyclic fashion, we compute the new iteration vector x such that one of the equations is satisfied. This is achieved by projecting the current iteration vector on one of the hyperplanes r_i · x = b_i, and the rows are often accessed in a cyclic scheme: i = 1, 2, ..., m, 1, 2, ..., m, 1, 2, ..., as illustrated in Figure 7.5 for the case m = n = 2. In the tomographic community this method is known as the Algebraic Reconstruction Technique (ART); while this is certainly a great acronym, we find the term "algebraic reconstruction" somewhat ambiguous and therefore we prefer to use the name "Kaczmarz's method."

It is instructive to derive the method algebraically. From the current iterate x we want to take a step Δx such that x + Δx satisfies one of the equations, say, equation i. This leads to the condition that the ith component of the residual vector b − A(x + Δx) should be zero:

$$
b_i - r_i \cdot (x + \Delta x) = 0
\quad\Leftrightarrow\quad
r_i \cdot \Delta x = b_i - r_i \cdot x.
$$

This is a very underdetermined system for ∆x, and the solution with the smallest


2-norm – giving the shortest step – is given by

$$
\Delta x = (r_i)^\dagger\, (b_i - r_i \cdot x) = \frac{r_i}{\|r_i\|_2^2}\, (b_i - r_i \cdot x).
$$

Hence, the new iterate x + Δx is the orthogonal projection of the current iterate x on the affine hyperplane defined by r_i · x = b_i.

An alternative derivation directly involves the orthogonal projection P_i(x) of the current iterate x on this hyperplane:

$$
P_i(x) = x + \frac{b_i - r_i \cdot x}{\|r_i\|_2^2}\, r_i.
\tag{7.12}
$$

In words, we scale the row vector r_i by (b_i − r_i · x)/‖r_i‖₂² and add it to x. In both approaches we obtain the following algebraic formulation of the basic version of Kaczmarz's method:

Basic Kaczmarz algorithm

    x^(0) = initial vector
    for k = 0, 1, 2, ...
        i = k (mod m)
        x^(k+1) = x^(k) + ((b_i − r_i · x^(k)) / ‖r_i‖₂²) r_i
    end
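The basic algorithm is only a few lines of NumPy; the following sketch (our own, using a dense A for simplicity) performs a fixed number of sweeps:

```python
import numpy as np

def kaczmarz(A, b, n_sweeps=10, x0=None):
    """Basic Kaczmarz: cyclic orthogonal projections onto r_i . x = b_i."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    row_norms2 = np.einsum('ij,ij->i', A, A)  # ||r_i||_2^2 for all rows
    for _ in range(n_sweeps):
        for i in range(m):                    # one sweep = m iterations
            x += (b[i] - A[i] @ x) / row_norms2[i] * A[i]
    return x
```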

Each time we have performed m iterations of this algorithm, we have performed one sweep over the rows of A. The convergence properties of the basic Kaczmarz method can be summarized as follows (see [REF] for details):

• If r = n ≤ m and the system Ax = b is consistent, then there is a unique solution and the basic method converges to this solution for any starting vector.

• If r < n and the system is consistent (independently of the size of m), then the basic method converges, for any starting vector, to a solution x that satisfies Ax = b. If the starting vector is the zero vector then the basic method converges to the minimum-norm solution x^0_LS.

• If the system is not consistent, then the basic method does not converge for any starting vector; see Figure 7.8 in the next section for an illustration of this.

The original publication by Kaczmarz [23] used precisely the formulation above. A similar formulation was used by Gordon, Bender and Herman [17], who are credited with rediscovering the method; they did not use the geometric/projection framework presented above, nor did they use matrix notation (in our notation they worked with a matrix whose elements are 0 and 1 only).

The convergence of Kaczmarz's method is treated, e.g., in [25, Exercise 5.13.19]. If all the rows of A are orthogonal then the basic Kaczmarz method will converge in a single sweep over the rows, independently of the ordering. This is easy to see from


Figure 7.6: Illustration of the influence of the row ordering in Kaczmarz's method for m = 4 and n = 2. The four colored lines represent the four hyperplanes and the numbers indicate the ordering.

the method's geometric interpretation, cf. Figure 7.5. In all other situations, the speed of convergence for this method is a somewhat complicated matter, as it depends on the sequence in which we access the rows of A; this is illustrated in the example below. The choice of a good ordering of the rows is important, but it is outside the scope of this presentation.

Example 17. To illustrate the influence of the row ordering on the convergence speed of Kaczmarz's method, consider the consistent full-rank system

$$
\begin{pmatrix}
1.0 & 1.0 \\
1.0 & 1.1 \\
1.0 & 3.0 \\
1.0 & 3.7
\end{pmatrix} x
=
\begin{pmatrix} 2.0 \\ 2.1 \\ 4.0 \\ 4.7 \end{pmatrix}
$$

whose solution is (1, 1). We applied Kaczmarz's method to this problem with two different row orderings, namely, 1,2,3,4 and 1,3,2,4. The resulting iterations for three row sweeps are shown in the left and right parts of Figure 7.6. Clearly, the convergence of the latter ordering is almost twice as fast as that of the former.

In order to quantify an "average performance" of Kaczmarz's method that is independent of the ordering of the rows, let us consider a situation where A is square and has full rank (r = n = m) and where we have scaled all the rows such that they have unit 2-norm, i.e., ‖r_i‖₂ = 1 for i = 1, ..., m. Then it has been suggested to consider a version where the ith row is selected randomly, in which case it is possible to derive the following expression for the expected value E(·) of the squared error norm:

$$
E\!\left( \|x^{(k)} - x\|_2^2 \right)
\le
\left( 1 - \frac{1}{n\,\kappa^2} \right)^{\!k} \|x^{(0)} - x\|_2^2,
\tag{7.13}
$$

in which κ = ‖A‖₂ ‖A⁻¹‖₂ is the condition number of A as defined in Eq. (6.4). This is linear convergence, and we emphasize that the upper bound can be somewhat pessimistic (after all, it is a bound) and it does not reflect the fast initial convergence that we usually observe for this method.



Figure 7.7: The geometric interpretation (for m = n = 2) of one iteration of Cimmino's method, from x^(k) to x^(k+1). The two hyperplanes H_1 and H_2 represent the two equations r_1 · x = b_1 and r_2 · x = b_2.

If the condition number κ is large, which is usually the case in tomography problems, then using a Taylor expansion we obtain the approximate upper bound (1 − k/(nκ²)) ‖x^(0) − x‖₂². Hence, after n steps – equivalent to one sweep over the n rows of the matrix – the approximate upper bound is reduced by the factor 1 − 1/κ². Note that this is a result about the average convergence rate; there are orderings for which the convergence is faster!

7.4.2 Cimmino’s Method

As an alternative to iterative methods that involve one row of A at a time, let us consider methods that involve all the rows simultaneously. This idea has led to a whole class of methods sometimes referred to as "simultaneous iterative reconstruction techniques" (e.g., in [21]), but one should note that this name is also used for a specific iterative reconstruction method to be introduced in Eqs. (7.15) and (7.17) below.

Cimmino's method also uses orthogonal projections onto the affine hyperplanes, and the key idea is to obtain the next iteration vector as the average of all the projections of the previous iteration vector, hence

$$
x^{(k+1)} = \frac{1}{m} \sum_{i=1}^{m} P_i\!\left( x^{(k)} \right)
= \frac{1}{m} \sum_{i=1}^{m} \left( x^{(k)} + \frac{b_i - r_i \cdot x^{(k)}}{\|r_i\|_2^2}\, r_i \right)
= x^{(k)} + \frac{1}{m} \sum_{i=1}^{m} \frac{b_i - r_i \cdot x^{(k)}}{\|r_i\|_2^2}\, r_i.
$$

This is illustrated in Figure 7.7. We can turn this expression into our matrix-vector


formalism as follows:

$$
\begin{aligned}
x^{(k+1)} &= x^{(k)} + \frac{1}{m}
\begin{pmatrix} \dfrac{r_1}{\|r_1\|_2^2} & \cdots & \dfrac{r_m}{\|r_m\|_2^2} \end{pmatrix}
\begin{pmatrix} b_1 - r_1 \cdot x^{(k)} \\ \vdots \\ b_m - r_m \cdot x^{(k)} \end{pmatrix} \\[1ex]
&= x^{(k)} + \frac{1}{m}
\begin{pmatrix} r_1 \\ \vdots \\ r_m \end{pmatrix}^{\!T}
\begin{pmatrix} \|r_1\|_2^{-2} & & \\ & \ddots & \\ & & \|r_m\|_2^{-2} \end{pmatrix}
\left( b - \begin{pmatrix} r_1 \\ \vdots \\ r_m \end{pmatrix} x^{(k)} \right) \\[1ex]
&= x^{(k)} + A^T M^{-1} \left( b - A x^{(k)} \right),
\end{aligned}
$$

where we have defined the diagonal matrix M = diag(m ‖r_i‖₂²). Hence, we can write Cimmino's method as

Basic Cimmino algorithm

    x^(0) = initial vector
    for k = 0, 1, 2, ...
        x^(k+1) = x^(k) + A^T M^{-1} (b − A x^(k))
    end
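A matching NumPy sketch of the basic Cimmino iteration (again our own dense-matrix illustration) stores only the diagonal of M⁻¹:

```python
import numpy as np

def cimmino(A, b, n_iter=100, x0=None):
    """Basic Cimmino: x <- x + A^T M^{-1} (b - A x), M = diag(m ||r_i||_2^2)."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    Minv = 1.0 / (m * np.einsum('ij,ij->i', A, A))  # diagonal of M^{-1}
    for _ in range(n_iter):
        x = x + A.T @ (Minv * (b - A @ x))
    return x
```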

Note that one iteration here involves all the rows of A, while one iteration in Kaczmarz's method involves a single row. Therefore, the computational work in one Cimmino iteration is equivalent to m iterations (a sweep over all the rows) of the basic Kaczmarz algorithm. The issue of finding a good row ordering is, of course, absent from Cimmino's method.

To be precise, the Cimmino method presented above is not strictly identical to the method described in Cimmino's original paper [7]. Instead of projections on the hyperplanes he used reflections in the hyperplanes, and he allowed a weighted average of these reflections. In this book we will still refer to the above method as "Cimmino's method."

With a bit of matrix analysis we can study the convergence of Cimmino's method. In particular, if x^(0) = 0 (the zero vector) and if I denotes the n × n identity matrix, then it is possible to derive the identities:

$$
\begin{aligned}
x^{(k+1)} &= \sum_{j=0}^{k} \left( I - A^T M^{-1} A \right)^{j} A^T M^{-1} b \\
&= \left( I - \left( I - A^T M^{-1} A \right)^{k+1} \right) \left( A^T M^{-1} A \right)^{-1} A^T M^{-1} b.
\end{aligned}
$$

Here, provided that the rank of A satisfies r = n ≤ m, the vector

$$
x_{{\rm LS},M} = (A^T M^{-1} A)^{-1} A^T M^{-1} b
$$

is the solution to the weighted least squares problem:

$$
\min_x \tfrac{1}{2}\, \| M^{-1/2} (Ax - b) \|_2^2
\quad\Leftrightarrow\quad
(A^T M^{-1} A)\, x = A^T M^{-1} b.
\tag{7.14}
$$


Figure 7.8: Illustration of the convergence of Kaczmarz's and Cimmino's methods for an inconsistent system with r = n = 2 and m = 3. The basic Kaczmarz method does not converge, but ends up in an infinite cycle over the same three points. Cimmino's method converges to the weighted least squares solution indicated by the red circle.

With the particular choice of M in Cimmino's method the largest eigenvalue of the symmetric matrix I − A^T M^{-1} A is strictly smaller than one, and hence we have that

$$
\left( I - \left( I - A^T M^{-1} A \right)^{k+1} \right) \to I
\quad\text{for}\quad k \to \infty.
$$

This shows that the iterates x^(k) of Cimmino's method converge to the weighted least squares solution x_{LS,M}. In particular, if A is square and has full rank (r = n = m) then x^(k) converges to the solution A⁻¹b. We have more to say about the convergence of this method in Chapter XXX. When A is rank deficient, Cimmino's method converges to the minimum-norm weighted least squares solution (confirmed by Matlab).

Example 18. Figure 7.8 compares the convergence of Kaczmarz's and Cimmino's methods for an inconsistent system. Cimmino's method is more versatile than Kaczmarz's method in the sense that we can use Cimmino's method to solve least squares problems – provided that all the rows of A are normalized to have unit 2-norm.

As already mentioned there is, in fact, a large class of algebraic iterative methods that simultaneously use all the rows of the matrix. These methods are characterized by having an updating step of the general form

$$
x^{(k+1)} = x^{(k)} + \omega_k\, D^{-1} A^T M^{-1} \left( b - A x^{(k)} \right),
\tag{7.15}
$$

where ω_k is a scalar that can change over the iterations. Different choices of the diagonal matrices D ∈ R^{n×n} and M ∈ R^{m×m} lead to methods with different names; see [21] for more details. For example, the simple choice D = I and M = I leads to Landweber's method

$$
x^{(k+1)} = x^{(k)} + \omega_k A^T \left( b - A x^{(k)} \right)
\tag{7.16}
$$

(we return to this method in the next section). Another choice of the diagonal matrices is

$$
D = \mathrm{diag}\!\left( \|c_j\|_1 \right),
\qquad
M = \mathrm{diag}\!\left( \|r_i\|_1 \right),
\tag{7.17}
$$


which is proposed in [2]. Since the system matrix A has nonnegative elements, the diagonal elements are simply the column and row sums of A, and they are easy to compute simply by multiplying vectors of all ones with A and A^T. Unfortunately the naming conventions in different communities are not consistent, and this particular method is often referred to as "Simultaneous Iterative Reconstruction Technique" (SIRT) in the tomography community [REF?] while it is called "Simultaneous Algebraic Reconstruction Technique" (SART) in the linear algebra community, following the original paper [2]. This method is implemented in the software package ASTRA [37] where it is called SIRT.
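The remark about multiplying with vectors of all ones translates directly into code. Here is a sketch (our own) of one step of the update (7.15) with the choices (7.17), assuming a nonnegative A whose row and column sums are all nonzero:

```python
import numpy as np

def sart_step(A, b, x, omega=1.0):
    """One step of (7.15) with D, M from (7.17), for a nonnegative matrix A."""
    # In practice these two sums would be computed once, outside the loop.
    row_sums = A @ np.ones(A.shape[1])    # diag(M): ||r_i||_1 since A >= 0
    col_sums = A.T @ np.ones(A.shape[0])  # diag(D): ||c_j||_1 since A >= 0
    return x + omega * (A.T @ ((b - A @ x) / row_sums)) / col_sums
```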

Results for the speed of convergence can be derived from results in [Nesterov 2004]; again we have linear convergence and we obtain an upper bound that is somewhat similar to that of Kaczmarz's method, cf. Eq. (7.13). For example, for Landweber's method with a fixed ω_k = ‖A‖₂⁻² and assuming that r = n = m, we have:

$$
\|x^{(k)} - x\|_2^2 \le \left( 1 - \frac{2}{1 + \kappa^2} \right)^{\!k} \|x^{(0)} - x\|_2^2,
\tag{7.18}
$$

where again κ = ‖A‖₂ ‖A⁻¹‖₂ is the condition number of the matrix A, cf. Eq. (6.4). When the condition number is large we have the approximate upper bound (1 − 2/κ²)^k ‖x^(0) − x‖₂², showing that in each iteration the error is reduced by the factor 1 − 2/κ², which is almost the same factor as in one sweep through the rows of A in Kaczmarz's method.

7.4.3 The Optimization Viewpoint

There are several ways to modify the basic versions of Kaczmarz's and Cimmino's methods introduced above in order to improve their behavior. The two most important are:

• We can introduce a relaxation parameter – or step-length parameter – in the algorithm which controls the "size" of the updating, such as in Eq. (7.15).

• We can also, in each updating step, incorporate a projection P_C on a suitably chosen convex set C, such as the positive orthant R^n_+ (giving nonnegative solutions) or the n-dimensional box [0, 1]^n (giving solutions with elements between 0 and 1).

When incorporating either of these modifications – which is very often done in practice – the method is really no longer a linear equation solver. In order to motivate and explain these modifications, it is therefore helpful to leave the purely algebraic viewpoint of the previous sections, and instead consider the methods from an optimization viewpoint.

It is instructive to start this discussion by returning to the least squares problem in (7.7). We introduce the objective function and its gradient

$$
\mathcal{F}(x) = \tfrac{1}{2}\, \|b - Ax\|_2^2,
\qquad
\nabla \mathcal{F}(x) = -A^T (b - Ax).
\tag{7.19}
$$


Then the least squares problem takes the form of an unconstrained convex optimization problem min_x F(x), and one of the simplest ways to solve this problem is the method of steepest descent which, from a starting vector x^(0), performs the updates

$$
x^{(k+1)} = x^{(k)} - \omega_k \nabla \mathcal{F}\!\left( x^{(k)} \right)
= x^{(k)} + \omega_k A^T \left( b - A x^{(k)} \right),
\tag{7.20}
$$

where ω_k is a step-length parameter that can be chosen to give the maximum reduction in the objective function in each iteration (this is called line search). This method is known in the image reconstruction communities as Landweber's method, which we already introduced above, as it falls within the framework (7.15) with D = I and M = I.

Several iterative reconstruction methods arise from the above formalism by incorporating a weighting matrix in the least squares problem (7.14), and the corresponding objective function takes the form

$$
\mathcal{F}_M(x) = \tfrac{1}{2}\, \|M^{-1/2}(b - Ax)\|_2^2
\quad\text{with}\quad
\nabla \mathcal{F}_M(x) = -A^T M^{-1} (b - Ax).
\tag{7.21}
$$

We have already seen Cimmino's method with M = diag(m ‖r_i‖₂²), and another method is the Component AVeraging method (CAV) in which the ith diagonal element of M is defined as

$$
m_{ii} = \sum_{j=1}^{n} s_j\, a_{ij}^2,
\qquad
s_j = \mathrm{nnz}(c_j) = \text{number of nonzeros in column } c_j.
$$

If the matrix A has no zero entries then s_j = m and CAV is identical to Cimmino's method.

An important way to extend the (weighted) least squares problem (7.7) is to include constraints on the elements of the reconstructed image. Assume that we can write the constraint as x ∈ C, where C is a convex set; this includes two very common special cases:

Non-negativity constraints. The set C = R^n_+ (the positive orthant) corresponds to

x_i ≥ 0,  i = 1, 2, ..., n.

Box constraints. The set C = [0, 1]^n (an n-dimensional box) corresponds to

0 ≤ x_i ≤ 1,  i = 1, 2, ..., n.

The corresponding (weighted) constrained convex optimization problem now takes the form

$$
\min_x \tfrac{1}{2}\, \|M^{-1/2}(b - Ax)\|_2^2
\quad\text{subject to}\quad
x \in C,
\tag{7.22}
$$

where the choice M = I gives the unweighted problem. A simple method for solving this problem is the projected gradient method which incorporates the orthogonal projection P_C on the set C in each step of the steepest descent method:


Projected gradient algorithm

    x^(0) = initial vector
    for k = 0, 1, 2, ...
        x^(k+1) = P_C( x^(k) + ω_k A^T M^{-1} (b − A x^(k)) )
    end
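A minimal sketch (our own) of the unweighted case (M = I) with box constraints, using NumPy's clip as the projection P_C and the fixed step length ω = 1/‖A‖₂², which is one standard safe choice for this objective:

```python
import numpy as np

def projected_gradient(A, b, n_iter=200, lower=0.0, upper=1.0):
    """Projected gradient for min 1/2 ||b - A x||_2^2  s.t.  x in [lower, upper]^n."""
    x = np.zeros(A.shape[1])
    omega = 1.0 / np.linalg.norm(A, 2) ** 2  # ||A||_2 = largest singular value
    for _ in range(n_iter):
        x = np.clip(x + omega * (A.T @ (b - A @ x)), lower, upper)
    return x
```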

Several strategies have been proposed for choosing the relaxation parameter ω_k during the iterations, including the choice of using a fixed parameter ω; see Section 8.3 and [21] for details. We postpone the discussion of stopping criteria to Section 8.2.

The projected gradient method belongs to the class of first-order optimization methods developed for solving large-scale convex optimization problems (see [34] and the references therein), and there is a rich theory for their convergence.

While this may not be immediately apparent from its definition in Section 7.4.1, Kaczmarz's method also has an interpretation within the optimization framework. This allows a thorough analysis of this method, and several extensions of the method present themselves during such an analysis [1]. To set the stage, we write the objective for the weighted least squares problem (7.14) in the form

$$
\mathcal{F}_M(x) = \sum_{i=1}^{m} f_i(x),
$$

where, for i = 1, 2, ..., m,

$$
f_i(x) = \frac{1}{2}\, \frac{(b_i - r_i \cdot x)^2}{m\, \|r_i\|_2^2}
\quad\Rightarrow\quad
\nabla f_i(x) = -\frac{b_i - r_i \cdot x}{m\, \|r_i\|_2^2}\, r_i.
$$

Incremental gradient methods use only the gradient of a single term f_i(x) of the objective function in each iteration. In this case, assuming cyclic row access and absorbing the factor m⁻¹ into the relaxation parameter, the update becomes:

$$
x^{(k+1)} = x^{(k)} + \omega_k\, \frac{b_i - r_i \cdot x^{(k)}}{\|r_i\|_2^2}\, r_i,
\qquad i = k \ (\mathrm{mod}\ m).
$$

Obviously, we can also incorporate the projection P_C to solve the constrained weighted least squares problem, and the resulting algorithm takes the form

Projected incremental gradient (Kaczmarz) algorithm

    x^(0) = initial vector
    for k = 0, 1, 2, ...
        i = k (mod m)
        x^(k+1) = P_C( x^(k) + ω_k ((b_i − r_i · x^(k)) / ‖r_i‖₂²) r_i )
    end

In this way, we have arrived at a general formulation of Kaczmarz's method as it is most often used, and which resembles the general formulation of Cimmino's method, within a solid optimization framework.


Example 19. The introduction of an iteration-dependent relaxation parameter ω_k has an advantage when using Kaczmarz's method to solve an inconsistent system (in which case we saw in Figure 7.8 that ω_k = 1 gives a cyclic and non-convergent behavior).

Figure 7.9: Illustration of the convergence of Kaczmarz's method, applied to the inconsistent problem from Example 18, for a fixed relaxation parameter ω_k = 0.8 and a diminishing parameter ω_k = 1/√k. We show the iteration vectors for k = 1, 2, ..., 75, corresponding to 25 sweeps over the rows of A. The red circle denotes the weighted least squares solution x_{LS,M}.

Consider the same example with two different choices of the relaxation parameter:

ω_k = 0.8 (independent of k)  and  ω_k = 1/√k,  k = 1, 2, ...

The corresponding iteration histories are shown in Figure 7.9 for k = 1, 2, ..., 75, corresponding to 25 sweeps over the rows of A. The rightmost plot is a "zoom" of the middle plot. With the fixed parameter we still have a cyclic, non-convergent behavior; with the so-called diminishing relaxation parameter ω_k = 1/√k → 0 as k → ∞ the iterates converge slowly to the weighted least squares solution x_{LS,M}.

Example 20. To illustrate the importance of incorporating constraints in the least squares problem formulation, consider a parallel-beam setup with 45 projection angles 4, 8, 12, ..., 180 and 91 detector elements. The reconstructed image x is 64 × 64. This leads to a system Ax = b with a 4,095 × 4,096 system matrix A. We added 0.1% noise to the data b and used Kaczmarz's method with and without box constraints to solve this problem.

The results are shown in Figure 7.10, where we use the same color scale [0, 1] in all three images. In this particular problem the ground truth image x is binary, i.e., its pixel values are 0 or 1. This means that it lies on the boundary of the set C = [0, 1]^n (the box), and even a small perturbation of the data Ax will have the effect that the perturbed solution will have many pixels outside the box. Our example clearly illustrates the advantage of incorporating the box constraints into the problem formulation (rather than applying them after an unconstrained solution has been computed).

7.4.4 A Column-Action Method

There is an alternative version of Kaczmarz's method which has received less attention in the literature. This version operates on the columns c_j of A, instead of the rows, and


Figure 7.10: The importance of box constraints when solving a problem with noisy data. Left: 64 × 64 ground truth image. Right: unconstrained solution; pixel values outside [0, 1] are shown as 0 or 1. Middle: solution with box constraints, where all pixel values are constrained to the interval [0, 1].

it has the advantage that it always – even with a fixed relaxation parameter – converges to a least squares solution; if m ≥ n it converges to the (minimum-norm) least squares solution (confirmed by Matlab). Moreover, in some applications the column-action strategy may also have an advantage from an implementation point of view.

The column-action method takes its basis in the simple coordinate descent optimization algorithm, in which each step is performed cyclically in the direction of the unit vectors

$$
e_j = (\, \underbrace{0\ 0\ \cdots\ 0}_{j-1}\ \ 1\ \ \underbrace{0\ 0\ \cdots\ 0}_{n-j} \,),
\qquad j = 1, 2, \ldots, n.
$$

Hence, at iteration k we consider the update x^(k) + α_k e_j with j = k (mod n), and the goal is to find the step length α_k that gives maximum reduction in the objective function:

$$
\begin{aligned}
\alpha_k &= \operatorname*{argmin}_\alpha \tfrac{1}{2}\, \| A (x^{(k)} + \alpha\, e_j) - b \|_2^2 \\
&= \operatorname*{argmin}_\alpha \tfrac{1}{2}\, \| \alpha\, (A e_j) - (b - A x^{(k)}) \|_2^2 \\
&= \operatorname*{argmin}_\alpha \tfrac{1}{2}\, \| \alpha\, c_j - (b - A x^{(k)}) \|_2^2.
\end{aligned}
$$

The minimizer is

$$
\alpha_k = \frac{c_j \cdot (b - A x^{(k)})}{\|c_j\|_2^2}
$$

and hence we obtain the following overall algorithm (where again we have introduced a relaxation parameter and a projection):

Coordinate descent (column iteration) method

    x^(0) = initial vector
    for k = 0, 1, 2, ...
        j = k (mod n)
        x^(k+1) = P_C( x^(k) + ω_k ((c_j · (b − A x^(k))) / ‖c_j‖₂²) e_j )
    end



Figure 7.11: Illustration of a 2D N × N grid and the numbering of the pixels when stored in the vector x. The red "blob" symbolizes an object inside the domain.

Note that the operation in the inner loop simply overwrites the jth element of the iteration vector with an updated value. Similar to the row-action method (i.e., Kaczmarz's algorithm), the column-iteration method has linear convergence [10].
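Since each column step changes only one entry of x, the residual b − Ax can be updated incrementally, so that one column step costs O(m). A sketch (our own) of the unconstrained cyclic version with relaxation; the projection P_C from the algorithm above would simply be applied to the updated entry x[j]:

```python
import numpy as np

def coordinate_descent(A, b, n_sweeps=10, omega=1.0):
    """Column-action method: cyclic steps along the unit vectors e_j."""
    m, n = A.shape
    x = np.zeros(n)
    r = b.astype(float).copy()                # residual b - A x, kept up to date
    col_norms2 = np.einsum('ij,ij->j', A, A)  # ||c_j||_2^2 for all columns
    for _ in range(n_sweeps):
        for j in range(n):
            step = omega * (A[:, j] @ r) / col_norms2[j]
            x[j] += step
            r -= step * A[:, j]               # incremental residual update
    return x
```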

7.5 More About the Null Space

In Section 7.2.2 we introduced the null space N(A) and discussed its role in solving linear systems of equations. Specifically, when we are faced with a least squares problem whose coefficient matrix A has a non-trivial null space (e.g., if A is rank deficient), then any vector x can be uniquely decomposed into two components:

$$
x = x^0_{\rm LS} + x_{\mathcal N}.
$$

The first component is the minimum-norm least squares solution while the second component x_N, which is orthogonal to x^0_LS, lies in N(A). It is impossible to determine the latter component from the data because A x_N = 0 and hence Ax = A x^0_LS. In this section we discuss the consequence of this in connection with tomographic reconstruction problems.

7.5.1 The Role of the Null Space in Tomography

We start with a small artificial example with only two projections. Consider a 2D domain divided into an array of N × N pixels as shown in Figure 7.11, and assume that we send a total of 2N X-rays through this domain: N horizontal rays and N vertical rays, each ray going through the middle of the pixels involved. The resulting system matrix A is thus 2N × N² and for N = 16 it has this structure:

[Spy plot of the 2N × N² matrix A for N = 16.]

A blue dot denotes a nonzero element and all nonzero elements have the same value, namely, the width of a pixel.

Figure 7.12: Artificial example with only two projections; all plots show the image representation of the corresponding vectors. The three left plots show three different vectors in N(A). The two right plots show an exact image x with a nonzero component in N(A) and the corresponding minimum-norm least squares solution x^0_LS.

Of course, we cannot expect to do a good reconstruction with only two projection angles, and this is reflected in the fact that the null space N(A) of this matrix is very large. From the size of the matrix we immediately see that the dimension of the null space must be at least N² − 2N. It turns out that, independently of N, one row of A is always a linear combination of the others, and hence the dimension of N(A) is N² − 2N + 1. In other words, there is a total of N² − 2N + 1 linearly independent vectors (all of them with N² components) that produce the zero vector when multiplied with A. The basis vectors of N(A), when represented as images, consist of all translates of the 2 × 2 subimage

$$
\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.
$$

Example 21. Consider the 2-projection example for N = 16. Three examples of vectors in the null space, plotted as N × N images, are shown in Figure 7.12. This structure is not surprising considering that the rows and columns of these images must sum to zero.
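The claim about the 2 × 2 pattern is easy to verify numerically; here is a small check (our own, with unit ray weights so that Ax simply stacks the row and column sums of the image):

```python
import numpy as np

N = 16
v = np.zeros((N, N))
v[3:5, 7:9] = [[1, -1], [-1, 1]]   # one translate of the 2 x 2 basis pattern

# Apply the two-projection operator: N row sums followed by N column sums.
Av = np.concatenate([v.sum(axis=1), v.sum(axis=0)])
print(np.linalg.norm(Av))          # 0.0 -- the pattern lies in the null space
```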

The existence of a nontrivial null space means that certain components in an object cannot be reconstructed, and this can also lead to artifacts in a reconstruction. For example, it is obvious that any checkerboard structure in the object cannot be reconstructed. To illustrate the appearance of an unexpected artifact, consider the artificial object (represented by the vector x) shown in the fourth plot in Figure 7.12, and its reconstruction computed as x^0_LS = A†b, with b = Ax being the projected data. Note that one pixel at the intersection of the vertical and horizontal strips has a large and incorrect value, while the values of the horizontal and vertical "strips" are slightly too low.

Let us now turn to a limited-angle example. We use a parallel-beam geometry with N rays per projection and N projection angles in the interval [50, 130]. The corresponding system matrix A is square of size N² × N². For our numerical example we choose N = 29, and the corresponding 841 × 841 matrix has rank r = 817, so the null space has dimension 24.

From the discussion in Chapter 3 we know that it is difficult to reconstruct geometrical features aligned with the "missing angles" of the measurements. In particular, for this example we expect that the most difficult structures to reconstruct are those


Figure 7.13: These 24 images correspond to 24 linearly independent vectors that span N(A) for the limited-angle problem with N = 29.

associated with the angle 0, corresponding to thin vertical structures. Our numerical example confirms this. Figure 7.13 shows the images that correspond to the 24 linearly independent vectors which span the null space N(A) (in fact, these vectors are the last n − r right singular vectors). These images are clearly dominated by vertical structures, telling us that these structures cannot be reconstructed.

To further illustrate this point, we constructed the two different test images shown in the left part of Figure 7.14, with a purely horizontal and a purely vertical structure, respectively. The middle part of the figure shows the corresponding sinograms, and we see that the intensity of the sinogram corresponding to the second test image with the vertical structure is much lower than that of the other sinogram. The right part of Figure 7.14 shows the reconstructions; both are imperfect due to the fact that A is rank deficient, but the vertical structure of the second test image is almost completely lost in the reconstruction. For other examples, see [41].

7.5.2 Computations Related to the Null Space

In Chapter 6 we explained how we can use the SVD to reliably compute the rank and null space of a matrix. This approach is very powerful and useful for small problems, but for large problems it becomes infeasible to use this method. Hence we find it appropriate here to discuss how to efficiently compute one or a few vectors in the null space, since this may still provide useful insight into the reconstruction problem.

Perhaps the simplest approach is to apply an algebraic iterative method to the system Ax = 0 using a random starting vector x^(0). If we, for simplicity of the discussion, use Landweber's method (7.16) with ω_k = ω, then this approach is just the "power method" applied to the matrix I − ωA^T A, whose eigenvalues are 1 − ωσ_i², and so the iterations leave all components of x^(0) in the null space unchanged and damp the other components. While this method is simple and easy to use, its convergence can be very slow.


Figure 7.14: The two test images with purely horizontal and purely vertical structure (left), the corresponding sinograms (middle), and the reconstructions (right) for the limited-angle problem with N = 29.

An approach that often gives much faster convergence takes its basis in the optimization framework. Assume that we are given a random vector z ∈ R^n; such a vector will with probability 1 have a component v in N(A), and the goal is to compute this component. (In the language of linear algebra we want to compute the orthogonal projection of z on N(A).) We can formulate this task as the following optimization problem:

$$
\min_v \tfrac{1}{2}\, \|v - z\|_2^2
\quad\text{subject to}\quad
Av = 0.
\tag{7.23}
$$

The so-called Lagrange function for this constrained optimization problem is

$$
\mathcal{L}(v, \alpha) = \tfrac{1}{2}\, (v - z)^T (v - z) + \alpha^T A v,
\qquad \alpha \in \mathbb{R}^m,
$$

where α is a vector of Lagrange multipliers associated with the m equality constraints in Av = 0. The corresponding optimality conditions, obtained by setting the gradient of L(v, α) to zero, take the form of a linear system of equations with a symmetric indefinite matrix:

$$
\begin{pmatrix} I & A^T \\ A & 0 \end{pmatrix}
\begin{pmatrix} v \\ \alpha \end{pmatrix}
=
\begin{pmatrix} z \\ 0 \end{pmatrix}.
\tag{7.24}
$$

For large problems this system can be solved by means of the iterative method MINRES [3]. Once an approximate solution v has been computed, one can check its "quality" by computing the norm ‖Av‖₂ – the smaller this norm, the better.
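A sketch (our own) of this computation with SciPy: we assemble the symmetric indefinite system (7.24) in sparse form and hand it to MINRES. When A is rank deficient the multiplier block α is not unique, but the v block is; for the rank-deficient 3 × 3 matrix from Example 12, the computed v should be proportional to (1, −2, 1).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import minres

def nullspace_component(A, z):
    """Orthogonal projection of z onto N(A) via MINRES on the system (7.24)."""
    m, n = A.shape
    K = sp.bmat([[sp.eye(n), A.T], [A, None]], format='csr')  # symmetric indefinite
    sol, info = minres(K, np.concatenate([z, np.zeros(m)]))
    v = sol[:n]                       # first block of the solution vector
    return v, np.linalg.norm(A @ v)   # second output: the quality check ||A v||_2

A = sp.csr_matrix(np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]]))
z = np.random.default_rng(1).standard_normal(3)
v, quality = nullspace_component(A, z)
print(v / np.linalg.norm(v), quality)  # ~ +-(1, -2, 1)/sqrt(6), small ||A v||
```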

Example 22. We generate a parallel-beam problem with N = 16, 25 projection angles in the interval [0, 120] and 11 rays per angle. The resulting system matrix is 275 × 256 and it has a null space of dimension 1; hence we can easily test the convergence of the iterative methods towards this unique vector v, which is shown in the corner of Figure 7.15.


Figure 7.15: Error histories, ‖v − x^(k)‖₂ versus k, for the three iterative methods in Example 22 applied to a problem with a unique vector v in the null space – shown in the bottom right corner.

We compare three methods: Landweber's and Kaczmarz's methods applied to the system Ax = 0 with a random starting vector z, and MINRES applied to the system (7.24) with the same z. We note that the amount of computational work per iteration is essentially the same in each method; in each iteration the work is dominated by one multiplication with A and one multiplication with A^T. Figure 7.15 shows the error histories ‖v − x^(k)‖₂ for all three methods; clearly the approach using MINRES is much faster (it converges after 4800 iterations) and thus preferable to the other two methods. As a check of the "quality" of the computed vectors we obtain the following results:

    Landweber:  ‖A x^(50000)‖₂ = 0.0192,
    Kaczmarz:   ‖A x^(50000)‖₂ = 0.00457,
    MINRES:     ‖A x^(4800)‖₂  = 0.000252.

7.6 Block Iterative Methods

For the large-scale problems we encounter in 3D tomographic reconstruction, a careful management of the data transfer in the computer memory hierarchy – as well as between processors – is crucial for the computational speed. We finish the chapter with an introduction to variants of the iterative algebraic methods that take these aspects into account. We do not try to be complete; rather we want to state the overall ideas and techniques. The motivations for the block methods are:

• The best utilization of the computer is achieved by working on blocks of data, such as blocks of the system matrix A or blocks of the solution x.


• Iterative methods that involve matrix operations (such as Landweber, Cimmino and SIRT) can easily be organized in terms of block operations, leading to fast execution – but the intrinsic convergence of these methods is slow.

• Kaczmarz's method, which involves a single row at a time, has faster intrinsic convergence – but this method and its variants work with small amounts of data, and hence they do not lend themselves well to the computer memory hierarchies, so their execution is slow.

Clearly, we must incorporate a "best of both worlds" strategy to obtain iterative methods that combine fast convergence with good utilization of the computer memory.

7.6.1 The Algorithmic Perspective

The "best of both worlds" can be achieved by designing block iterative methods which are built on the above iterative methods. Unfortunately the term "block method" is not well defined in the literature, and there are many ways in which we can define such methods. Our description here, which follows [HHS+PCH], is based on a block partitioning of the rows of the system matrix and the right-hand side:

$$
A = \begin{pmatrix} R_1 \\ R_2 \\ \vdots \\ R_p \end{pmatrix},
\qquad
b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix},
\tag{7.25}
$$

where p denotes the number of blocks. (Similar methods can be derived for block column partitionings.) There are many ways to select the blocks – for example, we can associate a block with a projection such that b_i corresponds to the ith projection image. In the literature the discussion of how to choose the blocks is very often associated with the term ordered subsets, which means that we seek an initial ordering of the rows of A that gives fast convergence and good performance.

Within each iteration of such block methods we must treat all the blocks of A, and this can be done either sequentially (similar to how we treat the rows of A in Kaczmarz's method) or simultaneously (similar to the operations in the matrix-based methods). At the block level, there are different ways to treat each block system R_ℓ and b_ℓ; for example, we can perform one iteration of, say, the Kaczmarz or Cimmino method on the block system, or we can compute the minimum-norm least squares solution x^0_LS (7.9) associated with the block system. Table 7.2 gives an overview of many of the block methods that have been proposed in the literature, and below we discuss some of these methods in more detail.

For the methods that employ simultaneous treatment of the blocks it is natural to apply one sweep of Kaczmarz's method to each block, and we must then define how the partial results for each block are combined. For example, we can simply compute the mean of these results, which leads to the block simultaneous algorithm stated after Table 7.2.


Table 7.2: Overview of some block iterative methods (the refs. refer to [33]). We give the original names which, unfortunately, do not reveal the placement of the method within this table.

                           Sequential treatment     Simultaneous treatment    Pseudoinverse R_ℓ†
                           of the rows of R_ℓ       of the rows of R_ℓ

  Sequential treatment     The overall algorithm    X? [8], block Cimmino     Gauss-Seidel type [14],
  of the blocks            is identical to          [1], block iteration      generalized Kaczmarz
                           Kaczmarz's method.       [17], Landweber-          [13].
                                                    Kaczmarz [?], block
                                                    sequential [?].

  Simultaneous treatment   CARP [23], string-       The overall algorithm     Jacobi-type [14].
  of the blocks            averaging projections    is essentially identical
                           [11].                    to the method used at
                                                    the block level.

Block simultaneous algorithm

    Initialization: choose an arbitrary x^(0) ∈ R^n
    Iteration: for k = 0, 1, 2, ...
        for ℓ = 1, ..., p execute in parallel
            y^(ℓ) = Kaczmarz-sweep(R_ℓ, b_ℓ, x^(k))
        end
        x^(k+1) = (1/p) Σ_{ℓ=1}^{p} y^(ℓ)

This method is sometimes referred to as the method of string-averaging projections [11]. The partial results y^(ℓ) can also be combined in more sophisticated ways, cf. the algorithm CARP [23].
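A serial sketch (our own) of this scheme, where the loop over the blocks stands in for the parallel execution; row_blocks is a list of row-index arrays defining the partitioning (7.25):

```python
import numpy as np

def kaczmarz_sweep(R, b_blk, x):
    """One Kaczmarz sweep over the rows of the block system (R, b_blk)."""
    y = x.copy()
    for i in range(R.shape[0]):
        y += (b_blk[i] - R[i] @ y) / (R[i] @ R[i]) * R[i]
    return y

def block_simultaneous(A, b, row_blocks, n_iter=50):
    """String-averaging-type iteration: independent block sweeps, then averaging."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # These p sweeps are independent and could run in parallel.
        ys = [kaczmarz_sweep(A[idx], b[idx], x) for idx in row_blocks]
        x = sum(ys) / len(row_blocks)
    return x
```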

The methods that treat the blocks sequentially achieve their individual "personalities" from the choice of the iterative method applied to the blocks. Using the generic notation from (7.15), these methods take the following form:

Block sequential algorithm

    Initialization: choose an arbitrary x^(0) ∈ R^n
    Iteration: for k = 0, 1, 2, ...
        z^(0) = x^(k)
        for ℓ = 1, 2, ..., p
            z^(ℓ) = P_C( z^(ℓ−1) + ω_k D⁻¹ R_ℓ^T M_ℓ⁻¹ (b_ℓ − R_ℓ z^(ℓ−1)) )
        end
        x^(k+1) = z^(p)

This method is sometimes referred to as "block iteration" [11].

The Jacobi and Gauss-Seidel type methods are obtained from the above two methods by substituting the statement in the inner loop by, respectively,

$$
y^{(\ell)} = x^{(k)} + \omega_k\, R_\ell^\dagger \left( b_\ell - R_\ell\, x^{(k)} \right)
$$

and

$$
z^{(\ell)} = z^{(\ell-1)} + \omega_k\, R_\ell^\dagger \left( b_\ell - R_\ell\, z^{(\ell-1)} \right).
$$


It may seem computationally infeasible to work with the pseudoinverse R_ℓ† of each block, but due to the sparsity of the blocks – especially if the blocks are chosen carefully – this is often feasible. See, e.g., [SøHa] for details.

Along this line, we note that if the rows of a block are orthogonal then one sweep of Kaczmarz's method applied to this block is mathematically identical to one iteration of Cimmino's method applied to this block. To see this, assume for ease of exposition that there is no projection P_C and that A consists of a single block with orthogonal rows, i.e.,

$$
r_i \cdot r_j =
\begin{cases}
\|r_i\|_2^2, & i = j, \\
0, & i \ne j.
\end{cases}
\tag{7.26}
$$

Then the first step of Cimmino’s algorithm takes the form

$$
x^{(1)} = x^{(0)} + \omega_k A^T M^{-1} \left( b - A x^{(0)} \right)
= x^{(0)} + \omega_k \sum_{i=1}^{m} \frac{b_i - r_i \cdot x^{(0)}}{m\, \|r_i\|_2^2}\, r_i
= x^{(0)} + \sum_{i=1}^{m} \gamma_i\, r_i,
$$

where we introduced γ_i = ω_k m⁻¹ (b_i − r_i · x^(0)) / ‖r_i‖₂², i = 1, ..., m. Now consider the first sweep of Kaczmarz's method with starting vector x^(0) and relaxation parameter $\bar\omega_k = \omega_k/m$, so that the same γ_i appear; using (7.26) several times we have:

$$
\begin{aligned}
x^{(1)} &= x^{(0)} + \bar\omega_k\, \frac{b_1 - r_1 \cdot x^{(0)}}{\|r_1\|_2^2}\, r_1
\;=\; x^{(0)} + \gamma_1 r_1 \\
x^{(2)} &= x^{(1)} + \bar\omega_k\, \frac{b_2 - r_2 \cdot (x^{(0)} + \gamma_1 r_1)}{\|r_2\|_2^2}\, r_2
\;=\; x^{(1)} + \bar\omega_k\, \frac{b_2 - r_2 \cdot x^{(0)}}{\|r_2\|_2^2}\, r_2
\;=\; x^{(0)} + \gamma_1 r_1 + \gamma_2 r_2 \\
x^{(3)} &= x^{(2)} + \bar\omega_k\, \frac{b_3 - r_3 \cdot (x^{(0)} + \gamma_1 r_1 + \gamma_2 r_2)}{\|r_3\|_2^2}\, r_3
\;=\; x^{(2)} + \bar\omega_k\, \frac{b_3 - r_3 \cdot x^{(0)}}{\|r_3\|_2^2}\, r_3
\;=\; x^{(0)} + \gamma_1 r_1 + \gamma_2 r_2 + \gamma_3 r_3 \\
&\ \ \vdots \\
x^{(m)} &= x^{(0)} + \gamma_1 r_1 + \gamma_2 r_2 + \cdots + \gamma_m r_m
\;=\; x^{(0)} + \sum_{i=1}^{m} \gamma_i\, r_i.
\end{aligned}
$$

Clearly, the result after the first sweep of Kaczmarz's method with relaxation parameter $\bar\omega_k$ is identical to the result after the first step of Cimmino's method with ω_k = m $\bar\omega_k$.

With the same assumptions plus the assumption that the block has full row rank we have

$$
A^\dagger = A^T (A A^T)^{-1} = A^T\, \mathrm{diag}\!\left( \|r_i\|_2^2 \right)^{-1} = m\, A^T M^{-1},
\tag{7.27}
$$

and hence

$$
x^{(1)} = x^{(0)} + \omega_k A^\dagger \left( b - A x^{(0)} \right)
= x^{(0)} + \omega_k\, m\, A^T M^{-1} \left( b - A x^{(0)} \right),
$$


showing that if we absorb the factor m into the relaxation parameter ωk then a singleblock step of the Jacobi and Gauss-Seidel type methods is identical to a single blockstep of Cimmino’s and Kaczmarz’s methods. It is interesting to note that if A doesnot have orthogonal columns, then the matrix AM−1 in Cimmino’s method (apartfrom a scaling) is an approximation to the pseudoinverse A†; this can be considered amotivation for using Cimmino’s method over Landweber’s method.

The above results motivate block partitionings of A that produce blocks with orthogonal rows; ordering algorithms that achieve this are described in [REFS], and the resulting method is sometimes referred to as “Parallel ART” (PART) [22].

(It is an open question whether these results also hold when the projection P_C is included.)

7.6.2 Formal Derivations and the Optimization Perspective

In the previous section the row-oriented block methods were motivated and derived from an algorithmic point of view, where the simple iterative methods were combined in various ways. It is instructive to also derive the block iterative methods in a more formalized fashion, based on the linear-algebra and optimization perspectives.

Recall that we based our derivation of Kaczmarz's method on an orthogonal projection designed such that precisely one equation of the system Ax = b is satisfied. Of course, from the current iterate x we might as well try to take a step Δx such that we satisfy all the equations associated with the ℓth block row R_ℓ. In other words, all the associated elements of the residual vector b − A(x + Δx) should be zero:

b_ℓ − R_ℓ(x + Δx) = 0  ⟺  R_ℓ Δx = b_ℓ − R_ℓ x.

Similar to the scalar case, the shortest step – the one with minimum 2-norm – is given by

Δx = R_ℓ^†(b_ℓ − R_ℓ x).

This immediately leads to the Gauss-Seidel variant of the block sequential (row-oriented) algorithm.

Alternatively, we can follow the derivation of Kaczmarz's method that takes its basis in the incremental gradient methods. To arrive at a block method we write the objective function for the weighted least squares problem as

F_M(x) = ∑_{ℓ=1}^p f_ℓ(x)

where, for ℓ = 1, ..., p,

f_ℓ(x) = ½ ‖M_ℓ^{−1/2}(b_ℓ − R_ℓ x)‖₂²   ⇒   ∇f_ℓ(x) = −R_ℓ^T M_ℓ^{−1}(b_ℓ − R_ℓ x),

with M_ℓ = diag(m ‖r_i‖₂²) and r_i being the specific rows associated with block ℓ. Then the block update takes the form

x^(k+1) = x^(k) + ω_k R_ℓ^T M_ℓ^{−1}(b_ℓ − R_ℓ x^(k)),   ℓ = k (mod p).


This is, of course, the block sequential (row-oriented) method with Cimmino block steps.

In a similar fashion we can derive block versions of the sequential column-oriented methods, based on the block column partitioning

A = ( C_1 , C_2 , ... , C_q ).

Following the considerations in Section 7.4.4, assume that we want to update a block of elements in the iteration vector x^(k), corresponding to the coordinates associated with the ℓth block column C_ℓ. Hence we want to perform the update x^(k) + E_ℓ α_k, where E_ℓ is a matrix of zeros and ones defined such that it extracts the ℓth column block from the system matrix: C_ℓ = A E_ℓ. The vector α_k is then computed as

α_k = argmin_α ½ ‖A(x^(k) + E_ℓ α) − b‖₂²
    = argmin_α ½ ‖A E_ℓ α − (b − A x^(k))‖₂²
    = argmin_α ½ ‖C_ℓ α − (b − A x^(k))‖₂²
    = C_ℓ^†(b − A x^(k)),

and the update step takes the form

x^(k+1) = x^(k) + E_ℓ C_ℓ^†(b − A x^(k)).

Of course, only the elements of x^(k) associated with the ℓth block are affected by this update. This works well if we can solve the least squares problem with C_ℓ efficiently.
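As a concrete illustration, one such column-block step can be implemented with a least squares solve; the following Python/NumPy sketch uses our own naming and is not tied to any particular package.

    import numpy as np

    def column_block_step(A, b, x, J):
        """One column-block update x <- x + E_ell C_ell^+ (b - A x).

        J holds the column indices of the ell-th block, so only x[J] changes."""
        C = A[:, J]                                   # C_ell = A E_ell
        r = b - A @ x                                 # current residual
        alpha = np.linalg.lstsq(C, r, rcond=None)[0]  # alpha_k = C_ell^+ r
        x_new = x.copy()
        x_new[J] = x_new[J] + alpha
        return x_new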

There is an alternative way to perform the update: we can choose to compute a block update which is just the mean of the updates for each component of x^(k) associated with the ℓth block. Specifically, if J_ℓ denotes the column indices associated with the ℓth block and if n_ℓ is the number of columns in this block, then the update takes the form

x^(k+1) = x^(k) + (1/n_ℓ) ∑_{j∈J_ℓ} ( c_j · (b − A x^(k)) / ‖c_j‖₂² ) e_j = x^(k) + E_ℓ N_ℓ^{−1} C_ℓ^T (b − A x^(k)),

where we have defined the diagonal matrix N_ℓ = diag(n_ℓ ‖c_j‖₂²). Here, we can think of the matrix diag(‖c_j‖₂^{−2}) C_ℓ^T as a rough approximation of C_ℓ^† (it is identical to C_ℓ^† if the columns of C_ℓ are orthogonal).

(It is not clear that the block simultaneous row- and column-methods can be derived in the same way.)


7.7 The Story So Far

What the unconstrained methods converge to, with starting vector x^(0) = 0:

• Kac: Kaczmarz's method with a fixed relaxation parameter.

• K–d: Kaczmarz's method with a diminishing relaxation parameter.

• Cit: Column iteration (coordinate descent algorithm).

• Cim: Cimmino's method with a fixed relaxation parameter.

                            r < min(m,n)                     r = min(m,n)
                    b ∈ R(A)        b ∉ R(A)         b ∈ R(A)         b ∉ R(A)
    m < n    x0_LS = x0_LS,M                                             —
    m = n    x0_LS = x0_LS,M                          A^{−1}b            —
    m > n    x0_LS = x0_LS,M    Kac: cyclic       x_LS = x_LS,M      Kac: cyclic
                                K–d: x0_LS,M                         K–d: x_LS,M
                                Cim: x0_LS,M                         Cim: x_LS,M


Chapter 8

AIR Methods with Noisy Data

It is perhaps surprising that the algebraic iterative reconstruction (AIR) methods from Chapter 7 – which are designed to compute naive and un-regularized solutions – are so popular in some tomographic reconstruction communities. The explanation is that the AIR methods, when applied to noisy data, have a built-in regularization property where the number of iterations acts as a regularization parameter. If we stop the iterations before they converge to the naive solution, then we can regard the approximate solution obtained in this way as a regularized solution.

Figure 8.1 illustrates this behavior. We generated a parallel-beam test problem with a system matrix A of dimensions m × n = 15 286 × 40 000 and with 1% noise in the data. Then we applied Cimmino's and Kaczmarz's methods to this noisy underdetermined problem, both the basic and the projected versions with non-negativity and box constraints. All the error histories – the plots of the relative error ‖x − x^(k)‖₂/‖x‖₂ versus the number of iterations k – have a distinct minimum, and the location and the size of the error at this minimum depend on the iterative method and the constraints.

In this chapter we take a closer look at this phenomenon and give theoretical insight that explains the observed behavior. We also discuss stopping rules that can be used to terminate the iterations at the right time, when a suitable regularized solution has been obtained (ideally when the error has a minimum). Along with this, we also briefly discuss a few strategies for selecting the relaxation parameter within our framework.

To set the stage for this chapter, recall that we assume a scenario where the given data, in the form of the right-hand side b, is a sum of “clean” noise-free data Ax from the ground-truth image plus a noise component e:

b = Ax + e,   x = ground truth,   e = noise.

Also recall that the naive solution is undesired because it has a large component coming from the noise in the data. For example, if the system matrix A is invertible then the naive solution is

A^{−1}b = A^{−1}(Ax + e) = x + A^{−1}e,

and the component A^{−1}e typically dominates over x, because A is an ill-conditioned matrix.


Figure 8.1: The convergence histories for Cimmino's and Kaczmarz's methods – basic and projected versions – applied to an underdetermined tomography problem with noisy data. Note the logarithmic k-axis! For all six methods the error decreases until it reaches a minimum, shown by the circles, after which the error starts to increase again. The figure is from [9].


Figure 8.2: Illustration of semi-convergence of Kaczmarz's method applied to a noisy test problem. Top: the error history, i.e., the norm ‖x − x^(k)‖₂ as a function of the number of iterations k. Bottom: selected iterations with inserts that zoom in on a small region.

8.1 Semi-Convergence

The phenomenon underlying these AIR methods is often referred to as semi-convergence [27, p. 89], which exhibits itself as follows:

• During the initial iterations, the iteration vector x^(k) tends to approach the desired but unobtainable solution x of the noise-free problem.

• During later iterations, x^(k) converges to the undesired naive solution associated with the particular AIR method (e.g., A^{−1}b if the system matrix is invertible).

• If we can stop the iterations just when the convergence behavior changes from the former to the latter, then we achieve a regularized solution – an approximation to the noise-free solution which is not too perturbed by the noise in the data.

The following example illustrates this behavior.

Example 23. We applied Kaczmarz's method to a small parallel-beam test problem with noisy data. Figure 8.2 shows the error history, ‖x − x^(k)‖₂ versus the number of iterations k, together with selected iterates x^(k) shown as images. The error history has the characteristic form associated with semi-convergence: during the first 16 iterations the iterates x^(k) approach x and the reconstruction error decreases, after which the iterates start to be more influenced by the noise, leading to a more grainy reconstruction.

8.1.1 Analysis of Landweber’s Method

Let us first consider the simultaneous AIR methods (Landweber, Cimmino, SIRT/SART, etc.) that employ matrix-vector multiplications with the full system matrix A and its transpose. The analysis of these methods is uncomplicated because it can be performed in terms of the SVD; see [38] for an early reference to such an analysis.

While the Cimmino and SIRT/SART methods seem to be the preferred methods in this category, we will focus on the Landweber method with a fixed relaxation parameter ω_k = ω because this makes the SVD analysis straightforward (similar SVD analysis for the other methods can be found in [12]). Given the starting vector x^(0), the kth Landweber iterate is given by:

x^(k) = x^(k−1) + ω A^T(b − A x^(k−1))
      = (I − ω A^T A) x^(k−1) + ω A^T b
      = (I − ω A^T A)[(I − ω A^T A) x^(k−2) + ω A^T b] + ω A^T b
      = (I − ω A^T A)² x^(k−2) + ((I − ω A^T A) + I) ω A^T b
      = (I − ω A^T A)³ x^(k−3) + ((I − ω A^T A)² + (I − ω A^T A) + I) ω A^T b
      = ···
      = (I − ω A^T A)^k x^(0) + [(I − ω A^T A)^{k−1} + (I − ω A^T A)^{k−2} + ··· + I] ω A^T b
      = (I − ω A^T A)^k x^(0) + ∑_{j=0}^{k−1} (I − ω A^T A)^j ω A^T b.    (8.1)

We note that for Landweber's method to converge we must require ω < 2/‖A^T A‖₂ – see Eq. (8.8) – which ensures that (I − ω A^T A)^k → 0 as k → ∞.

For simplicity we will now assume that we start with the zero vector, x^(0) = 0, in which case (8.1) takes the simpler form

x^(k) = ∑_{j=0}^{k−1} (I − ω A^T A)^j ω A^T b.    (8.2)
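In code, this iteration is only a few lines; the following Python/NumPy sketch (ours, with one safe default choice of ω) implements the unprojected method.

    import numpy as np

    def landweber(A, b, iters, omega=None):
        """Landweber iteration with x^(0) = 0 and a fixed relaxation parameter."""
        if omega is None:
            omega = 1.0 / np.linalg.norm(A, 2)**2   # one safe choice, < 2/||A^T A||_2
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            x = x + omega * (A.T @ (b - A @ x))     # x = x + omega A^T (b - A x)
        return x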

When we insert the SVD of the system matrix, A = U Σ V^T, we obtain (using that I = V V^T):

x^(k) = V ∑_{j=0}^{k−1} (I − ω Σ²)^j ω Σ U^T b = V Φ^(k) Σ^{−1} U^T b,


where we have introduced the n × n diagonal matrix

Φ^(k) = ∑_{j=0}^{k−1} (I − ω Σ²)^j ω Σ² = ω Σ² ∑_{j=0}^{k−1} (I − ω Σ²)^j = diag(φ_1^(k), φ_2^(k), ..., φ_n^(k))

with diagonal elements

φ_i^(k) = ω σ_i² ∑_{j=0}^{k−1} (1 − ω σ_i²)^j,   i = 1, 2, ..., n.    (8.3)

The sum in this expression is a geometric series for which

∑_{j=0}^{k−1} z^j = (1 − z^k)/(1 − z),

and thus for i = 1, 2, ..., n we have:

φ_i^(k) = ω σ_i² ∑_{j=0}^{k−1} (1 − ω σ_i²)^j = ω σ_i² (1 − (1 − ω σ_i²)^k) / (1 − (1 − ω σ_i²)) = 1 − (1 − ω σ_i²)^k,

leading to a simple expression for the Landweber filter matrix

Φ^(k) = diag(1 − (1 − ω σ_i²)^k).    (8.4)

This analysis shows that after k iterations we have obtained a regularized solution

x^(k) = V Φ^(k) Σ^{−1} U^T b = ∑_{i=1}^n φ_i^(k) (u_i^T b / σ_i) v_i    (8.5)

which is, in fact, a filtered SVD solution with the filter factors

φ_i^(k) = 1 − (1 − ω σ_i²)^k,   i = 1, 2, ..., n.    (8.6)

Notice that these filter factors depend on the number of iterations, which therefore acts as the regularization parameter. It is easy to verify that

φ_i^(k) ≈ 1  for σ_i ≫ 1/√(k ω),    φ_i^(k) ≈ k ω σ_i²  for σ_i ≪ 1/√(k ω),    (8.7)

showing that the SVD components corresponding to large singular values are essentially unfiltered, while those corresponding to small singular values are damped by a factor proportional to σ_i² (similar to Tikhonov regularization). The breakpoint at which the filter factors start to decay is approximately σ_i ≈ 1/√(k ω), and we conclude that the breakpoint decreases as k increases, i.e., more SVD components are included as we perform more iterations.
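This filter-factor interpretation is easy to check numerically; the following self-contained Python/NumPy sketch (synthetic matrices, our own names) verifies that k Landweber iterations with x^(0) = 0 match the filtered SVD solution (8.5)–(8.6).

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((30, 20))
    b = rng.standard_normal(30)
    omega = 1.0 / np.linalg.norm(A, 2)**2
    k = 25

    x = np.zeros(20)                        # Landweber with x^(0) = 0
    for _ in range(k):
        x = x + omega * (A.T @ (b - A @ x))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    phi = 1.0 - (1.0 - omega * s**2) ** k   # filter factors, Eq. (8.6)
    x_svd = Vt.T @ (phi * (U.T @ b) / s)    # filtered SVD solution, Eq. (8.5)
    print(np.linalg.norm(x - x_svd))        # agreement to machine precision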


Figure 8.3: The Landweber filter factors, shown as continuous functions 1 − (1 − σ_i²)^k of the singular values σ_i, for the case ω = 1. The black dots correspond to the “breakpoint” values σ = 1/√(k ω).

Example 24. To illustrate the behavior of the filter factors for Landweber's method, Figure 8.3 shows plots of 1 − (1 − σ²)^k as a function of the continuous variable σ (corresponding to the choice ω = 1), for different values of k. We see that as more iterations are performed, more SVD components are included in the reconstruction. The black dots on the curves correspond to the values σ = 1/√(k ω). Note that every time we double the number of iterations k, the breakpoint is reduced by a factor √2.

The above analysis also provides an asymptotic convergence analysis for Landweber's method as k → ∞. For the geometric series in (8.3) to converge we must require that the relaxation parameter ω is chosen such that |1 − ω σ_i²| < 1 for all i, which implies that we must have

ω < 2/σ_1² = 2/‖A^T A‖₂.    (8.8)

When this condition is satisfied, φ_i^(k) → 1 for all i and thus Φ^(k) → I for k → ∞, and consequently x^(k) converges to the naive and noisy solution V Σ^{−1} U^T b, which equals A^{−1}b when A is invertible.

To obtain more insight into the mechanism underlying the semi-convergence of Landweber's method, we note that the filter factors φ_i^(k) are independent of the right-hand side. Therefore we can always split the reconstruction error for the iteration vector x^(k) into two components:

x − x^(k) = (x − x̄^(k)) + (x̄^(k) − x^(k)),

where the “clean” iteration vector x̄^(k) is defined as the iteration vector that we obtain when we apply k steps of Landweber's method to the noise-free data Ax.


• The first component x − x̄^(k) is the iteration error, which is an approximation error caused by the finite number of iterations, and which is independent of the noise in the data.

• The second component x̄^(k) − x^(k) is the noise error, which is due to the presence of the data errors, causing the actual iteration vector x^(k) to differ from the “clean” iteration vector x̄^(k).

We can write both these errors in terms of the SVD. To do so, note that we have the relations

x = ∑_{i=1}^n (v_i^T x) v_i,

u_i^T b = u_i^T(Ax + e) = u_i^T(Ax) + u_i^T e,

u_i^T(Ax) = u_i^T ∑_{j=1}^n u_j σ_j (v_j^T x) = σ_i (v_i^T x).

The last identity follows from the fact that the singular vectors are orthonormal, i.e., u_i^T u_j = 1 when i = j and zero otherwise. Then it follows that

x − x̄^(k) = ∑_{i=1}^n (v_i^T x) v_i − ∑_{i=1}^n φ_i^(k) (u_i^T(Ax)/σ_i) v_i = ∑_{i=1}^n (1 − φ_i^(k)) (v_i^T x) v_i,

x^(k) − x̄^(k) = ∑_{i=1}^n φ_i^(k) (u_i^T b/σ_i) v_i − ∑_{i=1}^n φ_i^(k) (u_i^T(Ax)/σ_i) v_i = ∑_{i=1}^n φ_i^(k) (u_i^T e/σ_i) v_i.

Obviously, as k increases and more filter factors φ_i^(k) approach 1, the iteration error tends to zero, while the noise error increases because more noise components are included. The optimal number of iterations is obtained when these two errors balance each other (more about this in Section 8.2).
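Both error components are cheap to evaluate once the SVD is available; the sketch below (Python/NumPy, synthetic data, our own names) tabulates the two norms for increasing k, exhibiting the trade-off just described.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((40, 25))
    x_true = rng.standard_normal(25)
    e = 0.05 * rng.standard_normal(40)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    omega = 1.0 / s[0]**2
    for k in (1, 10, 100, 1000):
        phi = 1.0 - (1.0 - omega * s**2) ** k
        iter_err = np.linalg.norm((1.0 - phi) * (Vt @ x_true))  # ||x - xbar^(k)||_2
        noise_err = np.linalg.norm(phi * (U.T @ e) / s)         # ||x^(k) - xbar^(k)||_2
        print(k, iter_err, noise_err)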

Example 25. Let us consider the “clean” iteration vectors x̄^(k) associated with the noise-free data, whose deviation from the ground truth x defines the iteration error. Figure 8.4 shows six selected iterations for Landweber's method. Since the right singular vectors v_i of the system matrix are all smooth – but with more oscillations as i increases – it is no surprise that the initial iteration vectors correspond to very smooth images (and bad reconstructions). As we take more iterations we include more SVD components with higher frequencies, and we obtain somewhat sharper (but not perfect) edges in the images.

For other simultaneous iterative reconstruction methods, such as Cimmino's method and SIRT/SART with a symmetric and positive definite matrix M (cf. Section 7.4.2), a very similar analysis can still be carried out. The key idea is to replace A and b in the above analysis with M^{−1/2}A and M^{−1/2}b and build the analysis upon the SVD of M^{−1/2}A. We shall not perform this analysis here; see instead [12].


Figure 8.4: Selected “clean” iteration vectors x̄^(k) – shown as images – for Landweber's method applied to noise-free data Ax.

Example 26. This example illustrates the semi-convergence behavior of Cimmino's method when we increase the noise level in the data. We use a test problem from the Matlab package AIR Tools II [21] with the following characteristics, which make the problem under-determined by a factor of four:

    Geometry:                parallel beam, generated with paralleltomo
    Phantom:                 smooth from phantomgallery
    Image size:              N × N with N = 256
    Projection angles:       θ = 4, 8, 12, ..., 180, hence 45 in total
    No. of detector pixels:  ⌊√2 N⌋ = 362
    Relative noise level:    ρ = ‖e‖₂/‖Ax‖₂ = 0.01, 0.02, and 0.04
    Size of A:               m × n = (45 · 362) × 256² = 16 290 × 65 536

Figure 8.5 shows the phantom (corresponding to the ground truth vector x) together with results for three different values of the relative noise level ρ = ‖e‖₂/‖Ax‖₂ = 0.01, 0.02 and 0.04. The leftmost plots show the norms of the reconstruction error x − x^(k), the iteration error, and the noise error. The iteration error is independent of the noise level, while the noise error increases with increasing noise – and hence the break-even point between the two error components changes: as the noise increases, the optimal number of iterations decreases.

The center plots show the best reconstructions x^(k) at the three noise levels, for k = 37, 31 and 25 iterations, respectively. The reconstruction deteriorates with the noise level, for two reasons: clearly, the amount of noise in the reconstruction increases with the noise level, and the reconstruction error also increases because we include fewer SVD components.


Figure 8.5: Illustration of the semi-convergence of Cimmino's method for the phantom shown to the right, at three different noise levels ρ = ‖e‖₂/‖Ax‖₂ = 0.01, 0.02, and 0.04. The left subplots show the norms ‖x − x^(k)‖₂ (reconstruction error), ‖x − x̄^(k)‖₂ (iteration error) and ‖x^(k) − x̄^(k)‖₂ (noise error). The middle subplots show the best reconstructions, obtained at the point of semi-convergence.


8.1.2 Analysis of Landweber’s Method with Projection

A main advantage of the algebraic iterative reconstruction methods is that it is easy to incorporate constraints via a projection on a convex set; in particular, non-negativity and box constraints are often to be recommended. Unfortunately, the SVD analysis presented above does not apply to the projected algorithms, and we must accept a cruder analysis in which we can only bound the norms of the iteration errors and the noise errors. Such an error analysis was carried out in [8], and we summarize the main results here – again for the case of Landweber's method.

Assume that the system matrix A has full column rank, i.e., r = n, and let σ_n denote the smallest singular value (which is nonzero). Also assume that we start with a zero initial vector x^(0) = 0. For the unconstrained problem we already gave a bound for the iteration error in (7.18); with constraints the norm of the iteration error is bounded as

‖x − x̄^(k)‖₂ ≤ (1 − ω σ_n²)^k ‖x‖₂,    (8.9)

and we see that we are guaranteed a reduction of the error by a factor 1 − ω σ_n² in each iteration. In practice the error often decreases faster.

To derive a bound for the noise error x^(k) − x̄^(k) we recall that the iterates of the projected algorithm are given by

x^(k) = P_C( x^(k−1) + ω A^T (b − A x^(k−1)) ).

Using b = b̄ + e, it follows immediately that

x^(k) − x̄^(k) = P_C( x^(k−1) + ω A^T (b̄ + e − A x^(k−1)) ) − P_C( x̄^(k−1) + ω A^T (b̄ − A x̄^(k−1)) ).

At this stage, we use the fact that projection on a convex set is non-expansive, i.e., for x, y ∈ R^n we have ‖P_C(x) − P_C(y)‖₂ ≤ ‖x − y‖₂. Hence

‖x^(k) − x̄^(k)‖₂ ≤ ‖(I − ω A^T A)(x^(k−1) − x̄^(k−1)) − ω A^T e‖₂
               ≤ ‖I − ω A^T A‖₂ ‖x^(k−1) − x̄^(k−1)‖₂ + ω ‖A^T e‖₂.

For convenience we introduce ν = ‖I − ω A^T A‖₂, and by repeatedly using the above relation we obtain:

‖x^(k) − x̄^(k)‖₂ ≤ ν ‖x^(k−1) − x̄^(k−1)‖₂ + ω ‖A^T e‖₂
               ≤ ν( ν ‖x^(k−2) − x̄^(k−2)‖₂ + ω ‖A^T e‖₂ ) + ω ‖A^T e‖₂
               = ν² ‖x^(k−2) − x̄^(k−2)‖₂ + (1 + ν) ω ‖A^T e‖₂
               ≤ ν²( ν ‖x^(k−3) − x̄^(k−3)‖₂ + ω ‖A^T e‖₂ ) + (1 + ν) ω ‖A^T e‖₂
               = ν³ ‖x^(k−3) − x̄^(k−3)‖₂ + (1 + ν + ν²) ω ‖A^T e‖₂
               ≤ ···
               ≤ ν^k ‖x^(0) − x̄^(0)‖₂ + ( ∑_{j=0}^{k−1} ν^j ) ω ‖A^T e‖₂.


With a zero starting vector the first term vanishes, and using ‖A^T e‖₂ ≤ ‖A‖₂ ‖e‖₂ we arrive at the upper bound

‖x^(k) − x̄^(k)‖₂ ≤ ( ∑_{j=0}^{k−1} ‖I − ω A^T A‖₂^j ) ω ‖A‖₂ ‖e‖₂.

The challenge is then to bound the sum of matrix powers, and we refer to the analysis in [8] where it is shown that the noise error is bounded above as

‖x^(k) − x̄^(k)‖₂ ≤ ( (1 − (1 − ω σ_n²)^k) / σ_n² ) ‖A‖₂ ‖e‖₂.    (8.10)

A non-trivial lower bound has not yet been derived for the noise error.

Write the relaxation parameter as ω = ω̄/σ_1², where the factor ω̄ ≈ 1 (it must be less than 2 to ensure convergence of the iteration error). Then we can write ω σ_n² = ω̄ σ_n²/σ_1² = ω̄ κ^{−2}, where κ = σ_1/σ_n is the condition number of A, cf. Eq. (6.4). Recalling that this condition number is typically quite large for tomographic reconstruction problems, we can expect ω σ_n² to be quite small. From the expansion (1 − ε)^k = 1 − kε + ½ k(k−1) ε² − ··· it then follows that the factor in (8.10) can be approximated as

(1 − (1 − ω σ_n²)^k) / σ_n² = (1 − (1 − k ω σ_n² + O(σ_n⁴))) / σ_n² = k ω + O(σ_n²),

and thus we obtain the approximate bound

‖x^(k) − x̄^(k)‖₂ ≲ k ω ‖A‖₂ ‖e‖₂.    (8.11)

It is possible to derive an alternative non-approximate upper bound where the factor k ω is replaced by √(k ω) κ, cf. [8], but unfortunately the presence of the condition number κ makes this bound very pessimistic.

We emphasize that the exact and approximate upper bounds in (8.10) and (8.11) do not express the actual growth of the noise error with the number k of iterations (a non-trivial lower bound is needed to do that). But all experience shows that this error increases with k, and the above bounds give us an idea of its growth. See [8] for more details and for extensions of these bounds to other simultaneous AIR methods.

The approximate upper bounds (8.11) and (8.14) for the noise error – for Landweber's and Kaczmarz's methods, respectively – have a common factor k.

Example 27. We finish with an example that illustrates the overall performance of Cimmino's method with and without box constraints. We use the same under-determined test problem as in Example 26, together with an additional test image threephases from phantomgallery.

For both test problems we observe semi-convergence, as expected, and we also see that the use of box constraints improves the reconstruction compared to the basic unconstrained method. For these under-determined and noisy problems, Cimmino's method produces reconstructions that are superior to those produced by the FBP method.

The phantom smooth is designed to be a very smooth image with no edges, and hence it is easily approximated by a fairly small number of SVD components. In accordance with this, fewer than 40 iterations are needed to reach the point of semi-convergence.


Figure 8.6: The basic un-constrained and the box-constrained Cimmino method, as well as FBP, applied to two test problems with a smooth and a non-smooth phantom (top half and bottom half, respectively). The numbers above the images are the relative reconstruction errors ‖x − x_FBP‖₂/‖x‖₂ and ‖x − x^(k)‖₂/‖x‖₂.


The phantom threephases is designed to have many sharp edges, and we therefore expect that a large number of SVD components are needed to produce a good approximation. Indeed, 464 box-constrained iterations are needed to reach the point of semi-convergence for this phantom.

8.1.3 Analysis of Kaczmarz’s Method

Example 28. We illustrate the performance of Kaczmarz's method with the same two noisy and underdetermined test problems as in Example 27. The overall behavior is the same as before: the iterative methods exhibit semi-convergence and produce results that are much better than those produced by filtered back projection (FBP). The best Kaczmarz reconstructions are obtained with far fewer iterations than with Cimmino's method, cf. Fig. 8.6, but we note that the best Kaczmarz reconstructions of the smooth phantom are not as good as those produced by Cimmino's method.

As illustrated in Fig. 8.1 and the above example, Kaczmarz's method often converges significantly faster than the simultaneous iterative methods, and it is therefore also relevant to perform an analysis of the iteration error for Kaczmarz's method. An upper bound for the iteration error (for the unconstrained problem) was given in Eq. (7.13).

The derivation of an upper bound for the noise error essentially follows the same route as above, where again we assume that we start with a zero initial vector. A key result for this analysis is that Kaczmarz's method can be written in a form that resembles Cimmino's method.

First we define the following “splitting” of the symmetric m × m matrix A A^T:

A A^T = L + D + L^T,

where D is a diagonal matrix consisting of the diagonal elements of A A^T, while L is a strictly lower triangular matrix (it has zeros on the diagonal) consisting of the elements of A A^T below the diagonal. Now define the lower triangular matrix

L̃ = (D + ω L)^{−1}.    (8.12)

Then one iteration of Kaczmarz's method can be written as

x^(k) = P_C( x^(k−1) + ω A^T L̃ (b − A x^(k−1)) )

(see [11]). We emphasize that this result is for purely theoretical use; it should not be used for actual computations.
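Although the formula is for theoretical use, it is easy to verify numerically; the following Python/NumPy sketch (synthetic data, our own names) checks that one unprojected Kaczmarz sweep coincides with the matrix form.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((15, 10))
    b = rng.standard_normal(15)
    x = rng.standard_normal(10)
    omega = 0.8

    # one sweep of Kaczmarz's method, row by row
    x_sweep = x.copy()
    for i in range(A.shape[0]):
        r = A[i, :]
        x_sweep = x_sweep + omega * (b[i] - r @ x_sweep) / (r @ r) * r

    # the equivalent matrix form, with A A^T = L + D + L^T
    AAT = A @ A.T
    D = np.diag(np.diag(AAT))
    L = np.tril(AAT, -1)
    x_mat = x + omega * A.T @ np.linalg.solve(D + omega * L, b - A @ x)

    print(np.linalg.norm(x_sweep - x_mat))   # agreement to machine precision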

For the unconstrained problem it follows that

x^(k) − x̄^(k) = ∑_{j=0}^{k−1} (I − ω A^T L̃ A)^j ω A^T L̃ e,

and the analysis in [9] then shows that for both the constrained and the unconstrained problem, the norm of the noise error is bounded as

‖x^(k) − x̄^(k)‖₂ ≤ ( (1 − (1 − ω ς²)^k) / ς² ) ‖A^T L̃ e‖₂ + O(ς²),    (8.13)


Figure 8.7: The basic un-constrained and box-constrained Kaczmarz method, as well as FBP, applied to the same two test problems as in Figure 8.6. The numbers above the images are the relative reconstruction errors ‖x − x_FBP‖₂/‖x‖₂ and ‖x − x^(k)‖₂/‖x‖₂.


                                      short detector           long detector
    projection angles        ‖e‖₂   ‖A^T L̃ e‖₂     β       ‖A^T L̃ e‖₂     β
    θ = 1, 2, 3, ..., 180      49       7.6       1251         28.2     6.4 · 10⁷
    θ = 3, 6, 9, ..., 180      70       7.4        403         22.6     2.1 · 10⁷
    θ = 6, 12, 18, ..., 180   128       6.1        181          7.1       2232

Table 8.1: Comparison of the norm ‖A^T L̃ e‖₂ in Eqs. (8.13) and (8.14) with an upper bound β which is seen to be very pessimistic. See the text in Example 29 for details.

where ς is the smallest nonzero singular value of the matrix D^{1/2} L̃ A. Moreover, following the previous arguments, we again obtain an approximate upper bound of the form

‖x^(k) − x̄^(k)‖₂ ≲ k ω ‖A^T L̃ e‖₂.    (8.14)

Similar to Landweber's method, there is an alternative non-approximate upper bound where the factor k ω is replaced by √(k ω) κ, and again this bound is very pessimistic.

Example 29. It is tempting to replace the norm ‖A^T L̃ e‖₂ in (8.13) and (8.14) with the upper bound β = ‖A‖₂ ‖L̃‖₂ ‖e‖₂, but unfortunately this bound can be surprisingly crude. To illustrate this we consider a small 2D parallel-beam test problem with an N × N = 64 × 64 image, 90 detector pixels, and a varying number of projections. Also, we consider two different sizes of the detector, namely, a long and a short detector of length √2 ℓ and ℓ, respectively, where ℓ is the object's side length.

Table 8.1 compares the norm and the upper bound for an example with unbiased white Gaussian noise. Clearly, the upper bound is very pessimistic – especially for the long detector, where the pixels in the corners of the square object are penetrated by very few rays, leading to a matrix D with some small diagonal elements and an ill-conditioned matrix L̃.

8.2 Stopping Rules

In order to successfully use semi-convergent algebraic iterative reconstruction methods for noisy data, we obviously need an automatic method – a stopping rule – for terminating the iterations at, or near, the point of semi-convergence where the reconstruction error x − x^(k) is as small as possible. Moreover, we must be able to do so without knowing the ground truth x – the decision must be made from available information such as the kth iterate x^(k) and/or its corresponding residual ϱ^(k) = b − A x^(k).

While such stopping rules have been studied extensively by mathematicians, they do not seem to have received the same amount of attention in the majority of the tomographic reconstruction communities. There are certainly many explanations for this:

• The influence of the noise – characterized by the noise error x^(k) − x̄^(k) – often grows quite slowly with the number of iterations k (we gave two different bounds that suggest proportionality to both k and √k). Hence the error history exhibits a flat minimum, and it is not crucial to stop at a very specific number of iterations.


• Along this line, there are many applications for which the users have built a very good intuition of approximately how many iterations are needed to obtain a satisfactory reconstruction.

• In those situations where very many iterations are needed and the minimum is very flat, the iterations are often terminated when one's patience runs out – and hence one may not observe the semi-convergence effect and the eventual growth of the reconstruction error.

Developing a robust stopping rule that works on many types of problems and for many kinds of data is a difficult task that ideally requires a lot of insight into the tomography problem, the noise, and the reconstruction algorithm. Hence, we cannot expect to develop a robust multi-purpose stopping rule for a broad range of problems. The best we can do in this section is to describe some of the successful stopping rules that have been proposed. Then the user can try these methods and see how well they work on the given problems.

To make precise statements in this section, we need to introduce a small amount of statistical framework and notation. For the exact noise-free data that correspond to the ground truth image, we write

b̄ = Ax.

The elements of the noise vector e are random variables, i.e., their values depend on a set of well-defined random events. The vector of expected values E(e) and the covariance matrix V(e) are defined as

E(e) = ( E(e_1), E(e_2), ... )^T,    V(e) = E( (e − E(e)) (e − E(e))^T ).

Throughout this section we will restrict our analysis to white Gaussian noise with zero mean:

E(e) = 0,   V(e) = η² I,   E(‖e‖₂²) = m η²,    (8.15)

where η is the standard deviation of the noise and m is the number of elements in e. We emphasize that we make this choice merely to simplify our discussion; noise in tomographic problems is rarely strictly Gaussian, cf. Section 1.1.2, but sometimes this is a reasonable assumption.

8.2.1 Fitting to the Noise Level

A very common approach to fitting mathematical models to noisy data is to choose the model's parameters (or the complexity of the model) such that the output from the model fits the data “to the noise level.” For the tomography problems and the iterative methods considered here, this translates into a stopping rule where we choose the number of iterations k such that the residual ϱ^(k) = b − A x^(k) is “of the same size” as the noise vector e.


To make this more precise, one can choose k such that the norm of the residual approximates the norm of the noise – or, rather, the expected value of the latter: ‖ϱ^(k)‖₂ ≈ √m η. In the literature this is referred to as the “discrepancy principle” [13], [26]. Of course, the residual norm ‖ϱ^(k)‖₂ takes discrete values for k = 1, 2, 3, ... and therefore we cannot expect to find a k such that the above holds with equality. But ‖ϱ^(k)‖₂ often decreases monotonically, and hence one can choose the smallest k such that ‖ϱ^(k)‖₂ ≤ √m η. In practice the standard deviation η of the noise is not known; but it can sometimes be estimated from the data, cf. §8.2.4.

Some authors advocate the inclusion of a constant τ ≥ 1 such that the above condition takes the form ‖ϱ^(k)‖₂ ≤ τ √m η. This constant can be useful when we have only a rough estimate of η and there is a risk that we take too many or too few iterations. The constant also plays a role when studying the behavior of the stopping rule as the noise tends to zero – a purely theoretical matter, of course.

At any rate, it may be risky to use the discrepancy criterion in practice. Here we give a statistical analysis that reveals its shortcomings and also leads to a better formulation.

To motivate our analysis, we consider regularized solutions x_k computed by means of the truncated SVD (TSVD) method from Section 6.3 with truncation parameter k. Recall that the TSVD solution x_k is the unique vector such that A x_k captures the first k SVD components of b,

x_k = ∑_{i=1}^k (u_i^T b / σ_i) v_i.

The corresponding TSVD residual takes the form

ϱ_k = b − A x_k = ∑_{i=k+1}^m (u_i^T b) u_i = P_k b = P_k b̄ + P_k e.

Here we introduced the m × m orthogonal projection matrix P_k = ∑_{i=k+1}^m u_i u_i^T, which projects onto the subspace spanned by u_{k+1}, ..., u_m, and we used that b = b̄ + e. The two components of ϱ_k take the form

P_k b̄ = ∑_{i=k+1}^m (u_i^T b̄) u_i   and   P_k e = ∑_{i=k+1}^m (u_i^T e) u_i.

Recall the underlying property of tomographic reconstruction problems (and inverse problems in general) that the noise-free right-hand side's SVD components u_i^T b̄ – on average – have decaying magnitude as i increases. Hence the norm ‖P_k b̄‖₂, which always decreases with k, will decrease quite fast because its largest SVD components are extracted first (for the small values of k). See Fig. 8.8 for an example.

We also need to characterize the noise component P_k e, whose norm is given by ‖P_k e‖₂² = ∑_{i=k+1}^m (u_i^T e)². Since e is zero-mean white Gaussian noise, so is U^T e, and it follows that the quantities u_i^T e also follow a Gaussian distribution with zero mean and standard deviation η. Hence we obtain the following essential result:

E(‖P_k e‖₂²) = E( ∑_{i=k+1}^m (u_i^T e)² ) = ∑_{i=k+1}^m E( (u_i^T e)² ) = (m − k) η².


Figure 8.8: This small example illustrates the typical behavior of the norms of the TSVD residual components P_k b̄ and P_k e.

The factor m − k reflects the fact that the vector P_k e lies in a subspace of that dimension and thus has m − k degrees of freedom. The norm ‖P_k e‖₂ also decays with k, and compared to ‖P_k b̄‖₂ it decays rather slowly; see the example in Fig. 8.8.

Returning to the TSVD residual ϱ_k = P_k b̄ + P_k e, the following observations for increasing k-values are important:

• When k is too small we have not captured enough SVD components; hence A x_k is not a good approximation of the exact data b̄, ϱ_k is dominated by P_k b̄, and ‖P_k b̄‖₂ is larger than ‖P_k e‖₂.

• When k is “just about right” then A x_k approximates b̄ as well as possible; the norm ‖P_k b̄‖₂ has now become smaller and it is of the same size as the norm ‖P_k e‖₂.

• When k is too large the residual ϱ_k is dominated by the noise component P_k e, and hence ‖P_k e‖₂ dominates the residual norm.

According to these observations we should therefore choose the TSVD truncation parameter such that ‖P_k b̄‖₂ ≈ ‖P_k e‖₂. Since both are unknown, in practice we should choose k such that

‖ϱ_k‖₂ ≈ √(m − k) η.

The above – somewhat heuristic – reasoning has been formalized by several authors [19], [35], [36] for the case of unconstrained problems. We will not give the details of their analysis, which falls outside the scope of this work, but instead summarize the main results formulated for the class of simultaneous AIR methods analyzed in §8.1.1. It follows from Eq. (8.5) that we can write the kth iterate as

x^(k) = A_k^# b   with   A_k^# = V Φ^(k) Σ^{−1} U^T.    (8.16)

The corresponding predicted data (i.e., the data predicted by the reconstruction x^(k)) is given by the vector A x^(k) = A A_k^# b. The matrix A A_k^# that transforms the given noisy data into this prediction is called the influence matrix.


Then – still under the assumption (8.15) of white Gaussian noise – it can be shown that at the optimal k we have

E(‖ϱ^(k)‖₂²) = η² (m − t_k),    (8.17)

in which

t_k = trace(A A_k^#) = ∑_{i=1}^n φ_i^(k),    (8.18)

where φ_i^(k) are the filter factors (8.6). The real number m − t_k is sometimes referred to as the effective (or equivalent) degrees of freedom [40] in the residual. For the TSVD solution discussed above, where the filter factors are 0's and 1's, we simply have t_k = k. An exact computation of t_k is cumbersome for most methods, but it can be approximated quite efficiently as described in §8.2.4. We have thus arrived at the following stopping rule:

Stop Rule: Fit to Noise Level

Stop at the smallest k for which ‖ϱ^(k)‖₂ ≤ η √(m − t_k).
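When the SVD of A is available, the rule can be evaluated directly; the following self-contained Python/NumPy sketch (synthetic data and our own variable names; in practice t_k is estimated as in §8.2.4) applies the rule to Landweber's method.

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((40, 60))            # an underdetermined test problem
    x_true = rng.standard_normal(60)
    eta = 0.1                                    # noise standard deviation (known here)
    b = A @ x_true + eta * rng.standard_normal(40)
    m = A.shape[0]

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    out2 = max(np.linalg.norm(b)**2 - np.linalg.norm(beta)**2, 0.0)  # part outside range(A)
    omega = 1.0 / s[0]**2

    for k in range(1, 20001):
        phi = 1.0 - (1.0 - omega * s**2) ** k            # filter factors, Eq. (8.6)
        res2 = np.sum(((1.0 - phi) * beta) ** 2) + out2  # ||rho^(k)||_2^2 via the SVD
        tk = np.sum(phi)                                 # t_k, Eq. (8.18)
        if res2 <= eta**2 * (m - tk):                    # the fit-to-noise-level rule
            print("stop at k =", k)
            break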

For the iterative methods we consider here, this stopping rule is particularly convenient because the residual norm ‖ϱ^(k)‖₂ decreases monotonically with the number of iterations k. To see this, we write the residual vector in terms of the SVD:

ϱ^(k) = b − A x^(k) = U (I − Φ^(k)) U^T b = U [ (1 − φ_1^(k)) u_1^T b,  (1 − φ_2^(k)) u_2^T b,  ... ]^T.

Hence, for Landweber's method,

‖ϱ^(k)‖₂² = ∑_{i=1}^m (1 − φ_i^(k))² (u_i^T b)² = ∑_{i=1}^m (1 − ω σ_i²)^{2k} (u_i^T b)².

Since ω is always chosen such that |1 − ω σ_i²| < 1, the factors (1 − ω σ_i²)^{2k} – and hence the squared residual norm – decrease monotonically with k.

Example 30. We illustrate the “fit-to-noise-level” stopping rule with two small parallel-beam tomographic problems with image size 64 × 64 and 91 detector pixels. The projection angles are, respectively, 3, 6, 9, ..., 180 (giving an overdetermined system) and 8, 16, 24, ..., 180 (giving an underdetermined system).

We used Landweber's method to solve these two problems. Figure 8.9 shows the reconstruction errors ‖x − x^(k)‖₂ and the residual norms ‖ϱ^(k)‖₂ versus k, together with the threshold η √m and the function η √(m − t_k). The graphs confirm the monotonic decrease of the residual norm. For both problems, the “fit-to-noise-level” stopping rule terminates the iterations close to the optimal number of iterations. A stopping rule involving η √m, on the other hand, would terminate the iterations much too early.


Figure 8.9: Illustration of the “fit-to-noise-level” stopping rule for Landweber's method, with two parallel-beam tomographic problems. The smallest reconstruction error is marked with the black dot, and the residual norm that satisfies the stopping rule is marked with the red circle; for both problems the “fit-to-noise-level” rule works well. On the other hand, stopping at the smallest k for which ‖ϱ^(k)‖₂ ≤ η √m terminates the iterations much too early.


8.2.2 Minimization of the Prediction Error – UPRE and GCV

Instead of fitting to the noise level as described above, we can find the number of iterations that minimizes the prediction error, i.e., the difference between the noise-free data b̄ = Ax and the predicted data A x^(k). Statisticians refer to various measures of this difference as the predictive risk, and the resulting method for choosing k is often called the unbiased predictive risk estimation (UPRE) method.

Again we present the results specifically in the framework of iterative reconstruction methods. Following [39, §7.1], where all the details can be found, the expected squared norm of the prediction error (the risk) is

E(‖b̄ − A x^(k)‖₂²) = ‖(I − A A_k^#) b̄‖₂² + η² trace((A A_k^#)²),

while the expected squared norm of the residual can be written as

E(‖b − A x^(k)‖₂²) = ‖(I − A A_k^#) b̄‖₂² + η² trace((A A_k^#)²) − 2 η² trace(A A_k^#) + η² m.

Combining these two equations we can eliminate one of the trace terms and arrive at the following expression for the risk:

E(‖b̄ − A x^(k)‖₂²) = E(‖b − A x^(k)‖₂²) + 2 η² trace(A A_k^#) − η² m.

Substituting the actual squared residual norm ‖ϱ^(k)‖₂² = ‖b − A x^(k)‖₂² for its expected value, we thus define the UPRE risk as a function of k:

U_k = ‖ϱ^(k)‖₂² + 2 η² t_k − η² m    (8.19)

with t_k = trace(A A_k^#) given in (8.18). A minimizer of U_k will then give an approximation to a minimizer of the prediction error. We note that U_k may not have a unique minimizer, and we therefore choose the smallest k at which U_k has a local minimum. Thus we arrive at the following stopping rule:

Stop Rule: UPRE

Find the smallest k that minimizes U_k = ‖ϱ^(k)‖₂² + 2 η² t_k − η² m.

This UPRE stopping rule, as well as the fit-to-noise-level rule, depends on an estimate of the standard deviation η of the noise – which may or may not be a problem in practice. We shall now describe an alternative method for minimization of the prediction error, derived by Wahba [40], that does not depend on knowledge of η.

The outset for this method is the principle of cross validation. Assume that we remove the ith element b_i from the right-hand side (the noisy data), compute a reconstruction x^(k)_[i], and then use this reconstruction to compute a prediction b̂_i = r_i · x^(k)_[i] of the missing data b_i. The goal would then be to choose the number of iterations k that minimizes the following measure of all the prediction errors:

G_k = (1/m) ∑_{i=1}^m (b_i − b̂_i)² = (1/m) ∑_{i=1}^m (b_i − r_i · x^(k)_[i])².


Then it is proved in [40, Thm. 4.2.1] that we can avoid the vectors x^(k)_[i] and write G_k directly in terms of x^(k):

G_k = (1/m) ∑_{i=1}^m ( (b_i − r_i · x^(k)) / (1 − α_i^(k)) )²,

where α_i^(k) is the ith diagonal element of the influence matrix A A_k^# associated with x^(k).

At this stage, recall that the 2-norm is invariant under an orthogonal transformation, of which a permutation is a special case. Specifically, if Q is an orthogonal matrix then ‖Q(Ax − b)‖₂ = ‖Ax − b‖₂, which means that the reconstruction x^(k) is invariant to such a transformation. Unfortunately it can be proved [40] that the minimizer of G_k is not invariant to an orthogonal transformation of the data. In particular, it is inconvenient that a stopping rule based on G_k would produce a k that depends on the particular ordering of the data.

The generalized cross validation (GCV) method circumvents this problem by replacing all α_i^(k) with their average

μ^(k) = (1/m) ∑_{i=1}^m α_i^(k) = (1/m) trace(A A_k^#) = t_k/m,

leading to the modified measure

G_k = (1/m) (1/(1 − μ^(k))²) ∑_{i=1}^m (b_i − r_i · x^(k))² = ‖b − A x^(k)‖₂² / (m (1 − t_k/m)²) = m ‖ϱ^(k)‖₂² / (m − t_k)².

The minimizer of this measure is, of course, independent of the factor m, and hence we choose to define the GCV risk as a function of k as

G_k = ‖ϱ^(k)‖₂² / (m − t_k)².    (8.20)

We have thus arrived at the following η-free stopping rule where again, in practice, we need to estimate the quantity t_k:

Stop Rule: GCV

Find the k that minimizes G_k = ‖ϱ^(k)‖₂² / (m − t_k)².
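As an illustration, the following Python/NumPy sketch (synthetic data, our own names) evaluates both the UPRE and GCV risks for Landweber's method via the SVD and reports their minimizers.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((80, 50))
    eta = 0.1
    b = A @ rng.standard_normal(50) + eta * rng.standard_normal(80)
    m = A.shape[0]

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    out2 = max(np.linalg.norm(b)**2 - np.linalg.norm(beta)**2, 0.0)
    omega = 1.0 / s[0]**2

    ks = np.arange(1, 2001)
    Uk = np.empty(ks.size)                      # UPRE risk, Eq. (8.19)
    Gk = np.empty(ks.size)                      # GCV risk, Eq. (8.20)
    for j, k in enumerate(ks):
        phi = 1.0 - (1.0 - omega * s**2) ** k
        res2 = np.sum(((1.0 - phi) * beta) ** 2) + out2
        tk = np.sum(phi)
        Uk[j] = res2 + 2.0 * eta**2 * tk - eta**2 * m
        Gk[j] = res2 / (m - tk) ** 2
    print("UPRE chooses k =", ks[np.argmin(Uk)])
    print("GCV  chooses k =", ks[np.argmin(Gk)])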

(The above presentation follows [40, §4.2–3]. A different derivation of the GCV method was presented in [GoHW79]; there the coordinate system for R^m is rotated such that the corresponding influence matrix becomes a circulant matrix with identical elements along all its diagonals. This approach leads to the same GCV risk G_k as above.)

Perhaps the most important property of the GCV stopping rule is that the value of k which minimizes G_k in (8.20) is also an estimate of the value that minimizes the prediction error. Specifically, if k_GCV minimizes the GCV risk G_k and k_PE minimizes the prediction error ‖b̄ − A x^(k)‖₂², then it is shown in [40, §4.4] that

E(‖b̄ − A x^(k_GCV)‖₂²) → E(‖b̄ − A x^(k_PE)‖₂²)   for m → ∞.


Figure 8.10: Illustration of the UPRE and GCV stopping rules for Landweber's method applied to the two parallel-beam tomographic problems from Example 30.

Recall that the “fit-to-noise-level” stopping rule from §8.2.1 terminates the iterations as soon as the condition ‖ϱ^(k)‖₂ ≤ η √(m − t_k) is satisfied. The UPRE and GCV stopping rules have the slight inconvenience that we need to take at least one iteration too many in order to detect a minimum of U_k and G_k, respectively. In practice, this is not really a problem. For tomography problems the iteration vector x^(k) does not change very much from one iteration to the next, and hence the minimum of the error history ‖x − x^(k)‖₂ is usually very flat. Hence it hardly makes any difference if we implement the UPRE and GCV stopping rules such that we terminate the algorithm one iteration (or a few iterations) after the actual minimum of U_k or G_k.

Example 31. We return to the two test problems from Example 30, this time to illustrate the UPRE and GCV stopping rules applied to Landweber's method. Figure 8.10 shows U_k (8.19) and G_k (8.20) versus k, together with the error histories. The two stopping rules terminate the iterations at approximately the same number of iterations – not too far from the minimum of the error history. Note how flat the error history is: in practice it makes no difference whether we terminate the iterations exactly at the minimum of U_k and G_k or a few iterations later.

8.2.3 Stopping When All Information is Extracted — NCP

All three stopping rules presented so far include the trace term t_k (8.18). This term can be estimated at additional cost as discussed in §8.2.4 below, but it is also worthwhile to consider a stopping rule that needs neither the trace term t_k nor the standard deviation η of the noise. The so-called NCP criterion from [22], [28] is one such method.

The considerations that underlie this method are as follows: 1) due to the noise, the data only contain partial information about the reconstruction; 2) in each iteration we extract additional information from the data; and 3) eventually we have extracted all the available information in the noisy data. Therefore we want to monitor the properties of the residual vector. During the initial iterations we have not yet extracted all information present in the data and the residual still resembles a meaningful signal, while at some stage – when all information is extracted – the residual starts to appear like noise. When we iterate beyond this point, we solely extract noise from the data (we “fit the noise”) and the residual vector will appear as filtered noise where some of the noise's spectral components are removed.

To formalize this approach, in the white-noise setting of this presentation, we need a computational approach to answering the question: when does the residual vector look the most like white noise? To answer this question, statisticians introduced the so-called normalized cumulative periodogram.

In the terminology of signal processing, a periodogram is identical to a discrete power spectrum, defined as the squared absolute values of the discrete Fourier coefficients. Hence the periodogram for an arbitrary vector v ∈ R^m is given by

p_i = |v̂_i|²,   i = 1, 2, ..., q+1,   where v̂ = DFT(v).    (8.21)

Here, DFT denotes the discrete Fourier transform (computed by means of the FFT algorithm) and q = ⌊m/2⌋ denotes the largest integer such that q ≤ m/2. The reason for including only about half of the Fourier coefficients in the periodogram/power spectrum is that the DFT of a real vector is symmetric about its midpoint. We then define the corresponding normalized cumulative periodogram (NCP) for the vector v as the vector c(v) of length q with elements

c_j(v) = (p_2 + ··· + p_{j+1}) / (p_2 + ··· + p_{q+1}) = ‖v̂_{2:j+1}‖₂² / ‖v̂_{2:q+1}‖₂²,   j = 1, 2, ..., q.    (8.22)

White noise is characterized by having a flat power spectrum (similar to white light having equal amounts of all colors), and thus the expected value of its power spectrum components is a constant independent of i. Consequently, the expected value of the NCP for a white-noise vector v_white is

E(c(v_white)) = c_white = (1/q, 2/q, ..., 1).

How much a given vector v deviates from being white noise can be measured by the deviation of the corresponding c(v) from c_white – e.g., as measured by the norm ‖c(v) − c_white‖₂.
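In code, the NCP and the deviation measure amount to a few lines; the sketch below (Python/NumPy, our own function names) implements (8.21)–(8.22).

    import numpy as np

    def ncp(v):
        """Normalized cumulative periodogram, Eq. (8.22), of a real vector v."""
        q = len(v) // 2
        p = np.abs(np.fft.fft(v)) ** 2                  # periodogram, Eq. (8.21)
        return np.cumsum(p[1:q+1]) / np.sum(p[1:q+1])   # skip the zero-frequency term p_1

    def ncp_deviation(v):
        """Distance of c(v) from the expected white-noise NCP c_white."""
        q = len(v) // 2
        c_white = np.arange(1, q + 1) / q
        return np.linalg.norm(ncp(v) - c_white)

    rng = np.random.default_rng(5)
    print(ncp_deviation(rng.standard_normal(256)))              # small: white noise
    print(ncp_deviation(np.sin(np.linspace(0, 4*np.pi, 256))))  # large: low-frequency signal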

Example 32. Figure 8.11 illustrates the appearance of NCP vectors c(v) for vectors v of length m = 256 with different spectra. The completely flat spectrum for white noise corresponds to a straight line from (0, 0) to (q, 1) with q = ⌊256/2⌋ = 128. The left plot shows NCPs for 10 random realizations of white noise, and they are all close to the ideal white-noise NCP c_white. The middle and right plots show NCPs for random vectors that are dominated by low-frequency and high-frequency components, respectively; their deviation from c_white is obvious.

Figure 8.11: Illustration of NCP vectors c(v) for vectors v ∈ R^256 that are white noise (left), dominated by low-frequency components (middle), and dominated by high-frequency components (right).

To utilize the NCP framework in the algebraic iterative methods for tomographic reconstruction, a first idea might be to terminate the iterations when ‖c(ϱ^(k)) − c_white‖₂ exhibits a minimum. But this would be a bit naive, since the residual vector does not really correspond to a single signal of length m. Rather, the right-hand side b consists of a number of “projections,” one for each angle of the measurements – and the residual vector inherits this organization. Hence, a better approach is to apply an NCP analysis to each projection's residual, and then combine this information into a simple measure.

Depending on the CT scanner, each projection is either a 1D or a 2D image – when we perform 2D and 3D reconstructions, respectively. To simplify our presentation, we now assume that our data consist of m_θ 1D projections, one for each of the projection angles θ_1, θ_2, ..., θ_{m_θ}. We also assume that the data are organized such that we can partition the right-hand side b and the residual vector into m_θ sub-vectors,

b = [ b_1 ; b_2 ; ... ; b_{m_θ} ],    ϱ^(k) = [ ϱ_1^(k) ; ϱ_2^(k) ; ... ; ϱ_{m_θ}^(k) ],    (8.23)

with each sub-vector corresponding to a single 1D projection. Define the corresponding quantities

ν_ℓ^(k) = ‖c(ϱ_ℓ^(k)) − c_white‖₂,   ℓ = 1, 2, ..., m_θ,    (8.24)

that measure the deviation of each residual sub-vector from being white noise. Then we propose to measure the kth residual's deviation from being white noise by averaging the above quantities, i.e., by means of the NCP-number

N_k = (1/m_θ) ∑_{ℓ=1}^{m_θ} ν_ℓ^(k).    (8.25)

This multi-1D approach leads to the following stopping rule:


Stop Rule: NCP

Find the k that minimizes N_k = (1/m_θ) ∑_{ℓ=1}^{m_θ} ‖c(ϱ_ℓ^(k)) − c_white‖₂.

In the case of 3D reconstructions, where the data consist of a collection of 2D images, the computation of ν_ℓ^(k) should take this into consideration. In particular, we need to define the NCP vector c(ϱ_ℓ^(k)) when the residual sub-vector ϱ_ℓ^(k) represents an image; how to do this is explained in [22].

Similar to the previous stopping rules, in practice it is more convenient to implement the NCP stopping rule such that we terminate the iterations at the first iteration k for which N_k increases. There is no theory to guarantee that N_k behaves smoothly, and we occasionally see that N_k exhibits a minor zig-zag behavior. Hence it may be necessary to apply the NCP stopping rule to a smoothed version of the NCP-numbers, obtained by applying a “local” low-pass filter to the N_k-sequence.

Example 33. We illustrate the NCP stopping rule with a parallel-beam test problem with image size N × N = 256 × 256, 362 detector pixels, and 180 projection angles 1, 2, ..., 180. The performance is shown in Fig. 8.12 together with surface plots of the matrix [c(ϱ_1^(k)), c(ϱ_2^(k)), ..., c(ϱ_{m_θ}^(k))] for selected iterations k. We clearly see the changing shape of the NCP vectors c(ϱ_ℓ^(k)) as k increases. The minimum of N_k is obtained at k_NCP = 179. This is somewhat early, considering that the minimum reconstruction error is obtained at k = 497 iterations – but on the other hand, the reconstruction and the error hardly change between iterations 179 and 700.

8.2.4 Estimation of the Trace Term and the Noise Level

The fit-to-noise-level, UPRE and GCV stopping rules from §§8.2.1–8.2.2 include the term t_k = trace(A A_k^#) (8.18). To make these methods practical to use, we need to be able to estimate this trace term efficiently – without having to compute the SVD of the system matrix A or form the influence matrix A A_k^#. The most common way to compute this estimate is via a Monte Carlo approach.

Underlying this approach is the following result from [15]. If $w \in \mathbb{R}^m$ is a random vector with elements $w_i \sim \mathcal{N}(0,1)$, and if $S \in \mathbb{R}^{m \times m}$ is a symmetric matrix, then $w^T S w$ is an unbiased estimate of $\operatorname{trace}(S)$. Therefore $t^{\mathrm{est}}_k = w^T A A^{\#}_k w$ is an unbiased estimator of $t_k = \operatorname{trace}\bigl(A A^{\#}_k\bigr)$. To compute this estimate we need to compute the matrix-vector product $A^{\#}_k w$ efficiently. Recalling the definition of $A^{\#}_k$ in Eq. (8.16), this can be done simply by applying the algebraic iterative method to the system $A \xi = w$ which, after $k$ iterations, produces the vector $\xi^{(k)} = A^{\#}_k w$. The resulting estimate
$$
t^{\mathrm{est}}_k = w^T A\, \xi^{(k)} = (A^T w)^T \xi^{(k)}
\qquad (8.26)
$$
is the standard Monte Carlo trace estimate from [15]. In an efficient implementation of (8.26) the vector $A^T w$ is pre-computed and stored.
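As an illustration – with hypothetical names and a dense NumPy matrix, for Landweber's method with fixed $\omega$ – the estimates (8.26) for $k = 1, \ldots, k_{\max}$ can be computed alongside the main iteration as follows.

```python
import numpy as np

def mc_trace_estimates(A, omega, k_max, rng=None):
    """Monte Carlo estimates (8.26) of t_k = trace(A A_k^#), k = 1..k_max:
    run Landweber on A xi = w from xi^(0) = 0, so that xi^(k) = A_k^# w,
    and record t_k^est = (A^T w)^T xi^(k)."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    w = rng.standard_normal(m)                     # w_i ~ N(0,1)
    ATw = A.T @ w                                  # pre-computed once and stored
    xi = np.zeros(n)
    t_est = np.empty(k_max)
    for k in range(k_max):
        xi = xi + omega * (ATw - A.T @ (A @ xi))   # Landweber step for A xi = w
        t_est[k] = ATw @ xi                        # Eq. (8.26)
    return t_est
```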


Figure 8.12: Illustration of the NCP stopping rule for Landweber's method applied to a parallel-beam test problem. We also show surface plots of the matrix $\bigl[\, c\bigl(\varrho^{(k)}_1\bigr), c\bigl(\varrho^{(k)}_2\bigr), \ldots, c\bigl(\varrho^{(k)}_{m_\theta}\bigr) \bigr]$ for selected iterations $k$. This stopping rule leads to a somewhat premature termination of the iterations at $k_{\mathrm{NCP}} = 179$ (the minimum error occurs for $k = 497$ iterations), but it should be noted that the error does not change much between iterations 179 and 700.


An alternative approach for stationary iterative methods, where the relaxation parameter $\omega$ is independent of $k$, was presented in [30]. This approach also applies to methods with a nonsymmetric influence matrix. Specifically, the method applies to unprojected iterative methods of the general form

$$
x^{(k+1)} = x^{(k)} + \omega A^T B\, \bigl(b - A x^{(k)}\bigr),
$$

where $B$ is a general $m \times m$ matrix (it is not required to be symmetric). This formulation includes Landweber's and Cimmino's methods as well as Kaczmarz's method, cf. the formulation in §8.1.3. When we apply such a method with an arbitrary nonzero starting vector $\xi^{(0)}$ to the system $A \xi = 0$, then it follows from Eq. (8.1) that the iterates are $\xi^{(k)} = \bigl(I - \omega A^T B A\bigr)^k\, \xi^{(0)}$.
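A minimal sketch of this general iteration (illustrative names; we again assume a dense NumPy matrix):

```python
import numpy as np

def stationary_iteration(A, b, omega, B=None, k_max=100):
    """Unprojected stationary method x^(k+1) = x^(k) + omega A^T B (b - A x^(k)).
    B = I (the default) gives Landweber's method, while a diagonal B built
    from inverse row norms of A gives a Cimmino-type weighting."""
    m, n = A.shape
    B = np.eye(m) if B is None else B
    x = np.zeros(n)
    for _ in range(k_max):
        x = x + omega * (A.T @ (B @ (b - A @ x)))
    return x
```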

It is then shown in [30] that if we use a random starting vector $\widetilde{w} \in \mathbb{R}^n$ with elements $\widetilde{w}_i \sim \mathcal{N}(0,1)$, and if $\xi^{(k)}$ denotes the corresponding iterate for the system $A \xi = 0$, then $\widetilde{w}^T \xi^{(k)}$ is an unbiased estimator of $n - \operatorname{trace}\bigl(A A^{\#}_k\bigr)$. This leads to the alternative trace estimate
$$
\widetilde{t}^{\,\mathrm{est}}_k = n - \widetilde{w}^T \xi^{(k)}.
\qquad (8.27)
$$
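For comparison with (8.26), here is a sketch of the estimate (8.27) for Landweber's method ($B = I$), with the same illustrative conventions as before:

```python
import numpy as np

def mc_trace_estimates_alt(A, omega, k_max, rng=None):
    """Alternative estimates (8.27): iterate on A xi = 0 from a random
    xi^(0) = w_tilde with entries ~ N(0,1), so that (for B = I)
    xi^(k) = (I - omega A^T A)^k w_tilde, and record n - w_tilde^T xi^(k)."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    w_tilde = rng.standard_normal(n)            # starting vector in R^n
    xi = w_tilde.copy()
    t_est = np.empty(k_max)
    for k in range(k_max):
        xi = xi - omega * (A.T @ (A @ xi))      # Landweber step for A xi = 0
        t_est[k] = n - w_tilde @ xi             # Eq. (8.27)
    return t_est
```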

Extend the above method to the case in Eq. (7.15).

In order to use either of these trace estimates instead of the exact $t_k$, we must simultaneously apply the iterative method to two right-hand sides, which essentially doubles the amount of work. If we are willing to increase the overhead further, we can compute a more robust estimate of $t_k$ by applying the above idea to several random vectors and computing the mean or median of the resulting estimates.
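For instance, with the (hypothetical) sketch function above, such a robust estimate could be computed as:

```python
import numpy as np

# Median over several random vectors (here 10) of the estimates from
# mc_trace_estimates; rows are realizations, columns are iterations k.
t_med = np.median(
    np.stack([mc_trace_estimates(A, omega, k_max) for _ in range(10)]), axis=0
)
```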

Example 34. We illustrate the two trace estimates $t^{\mathrm{est}}_k$ and $\widetilde{t}^{\,\mathrm{est}}_k$ for Landweber's method applied to the overdetermined test problem from Example 30. Figure 8.13 shows the trace estimates for 10 different realizations of the random vectors $w$ and $\widetilde{w}$, together with the exact trace $t_k$. We see that the computationally less expensive estimate $\widetilde{t}^{\,\mathrm{est}}_k$, shown in the bottom plot, has the smallest variance. We are not aware of theory that supports this observation.

Example 35. Continuing from the previous example, Fig. 8.14 illustrates the use of the trace estimate $\widetilde{t}^{\,\mathrm{est}}_k$ in the fit-to-noise-level stopping rule. To show the variability of the stopping rule we used 10 different random vectors $\widetilde{w}$, leading to 10 different realizations of $\eta^2 \bigl(m - \widetilde{t}^{\,\mathrm{est}}_k\bigr)$. Their intersections with $\|\varrho^{(k)}\|_2^2$ are shown by the red circles, corresponding to stopping the iterations at
$$
k = 2406,\ 2460,\ 2662,\ 2685,\ 2698,\ 2731,\ 3324,\ 3553,\ 3756,\ 4194.
$$
The black dot marks the intersection of $\|\varrho^{(k)}\|_2^2$ with the exact $\eta^2 (m - t_k)$, corresponding to iteration $k = 3314$.

When we use the trace estimate $t^{\mathrm{est}}_k$ in the GCV stopping rule, then we actually seek a minimum of the approximate GCV risk given by
$$
G^{\mathrm{est}}_k = \|\varrho^{(k)}\|_2^2 \,/\, \bigl(m - t^{\mathrm{est}}_k\bigr)^2 .
$$
Similarly we can define the function
$$
V^{\mathrm{est}}_k = \|\varrho^{(k)}\|_2^2 \,/\, \bigl(m - t^{\mathrm{est}}_k\bigr) = G^{\mathrm{est}}_k \bigl(m - t^{\mathrm{est}}_k\bigr).
$$


Figure 8.13: Comparison of the two trace estimates $t^{\mathrm{est}}_k$ and $\widetilde{t}^{\,\mathrm{est}}_k$ for Landweber's method applied to the overdetermined test problem from Example 30. The thick red line is the exact trace $t_k$, and the thin black lines are the trace estimates for 10 different random vectors $w$ and $\widetilde{w}$.

Figure 8.14: Application of the trace estimate $\widetilde{t}^{\,\mathrm{est}}_k$ in the fit-to-noise-level stopping rule for Landweber's method applied to the overdetermined test problem from Example 30. We used 10 different random vectors $\widetilde{w}$ in (8.27), and the corresponding 10 intersections between $\|\varrho^{(k)}\|_2^2$ (thick red line) and $\eta^2 \bigl(m - \widetilde{t}^{\,\mathrm{est}}_k\bigr)$ (thin blue lines) are shown by the red circles. The black dot shows the intersection with the exact $\eta^2 (m - t_k)$.


Figure 8.15: Illustration of the approximate GCV risk $G^{\mathrm{est}}_k$ – whose minimum occurs at $k = 2410$ – and the corresponding $V^{\mathrm{est}}_k$. The solid black line represents the noise variance $\eta^2$, while the circles represent $G^{\mathrm{est}}_{\hat{k}}$ and $V^{\mathrm{est}}_{\hat{k}}$, the latter being a good estimate of $\eta^2$.

According to the fit-to-noise-level stopping rule it follows from (8.17) that when we stop the iterations, the ratio $\|\varrho^{(k)}\|_2^2 / (m - t_k)$ is approximately equal to the noise variance $\eta^2$. Consequently, if we terminate at the iteration $k = \hat{k}$ for which $G^{\mathrm{est}}_k$ is minimal, then the corresponding value $V^{\mathrm{est}}_{\hat{k}}$ is an inexpensive estimate of $\eta^2$.
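In code, assuming the residual norms $\|\varrho^{(k)}\|_2^2$ and the trace estimates have been recorded as arrays for $k = 1, \ldots, k_{\max}$ (hypothetical names again), this recipe is only a few lines:

```python
import numpy as np

def gcv_stop_and_noise(res_norms_sq, t_est, m):
    """Given ||rho^(k)||_2^2 and t_k^est for k = 1..k_max, return the GCV
    stopping index k_hat (the minimizer of G_k^est) and the noise-variance
    estimate V_{k_hat}^est = G_{k_hat}^est (m - t_{k_hat}^est)."""
    G = res_norms_sq / (m - t_est) ** 2     # approximate GCV risk G_k^est
    k_hat = int(np.argmin(G)) + 1           # 1-based iteration index
    V = res_norms_sq / (m - t_est)          # V_k^est
    return k_hat, V[k_hat - 1]              # V at k_hat estimates eta^2
```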

Example 36. Continuing again from the previous example, Fig. 8.15 illustrates the use of the approximate GCV risk $G^{\mathrm{est}}_k$ as the stopping rule, and the corresponding $V^{\mathrm{est}}_k$ for estimation of the noise variance. The exact GCV risk $G_k$ has a minimum at $k = 3314$ while the minimum of $G^{\mathrm{est}}_k$ occurs (prematurely) for $k = 2410$. At this iteration we have $V^{\mathrm{est}}_{\hat{k}} = 5.69 \cdot 10^{-3}$, which is a very good estimate of $\eta^2 = 5.59 \cdot 10^{-3}$.

8.3 Choosing a Good Relaxation Parameter

This section needs to be written!


Bibliography

[1] M. S. Andersen and P. C. Hansen, Generalized row-action methods for tomographic imaging, Numer. Algor., 67 (2014), pp. 121–144.
[2] A. H. Andersen and A. C. Kak, Simultaneous algebraic reconstruction technique (SART): a superior implementation of the ART algorithm, Ultrason. Imaging, 6 (1984), pp. 81–94.
[3] R. Barrett et al., Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, PA, 1994.
[4] M. Bertero and P. Boccacci, Introduction to Inverse Problems in Imaging, CRC Press, 1998.
[5] W. L. Briggs and V. E. Henson, The DFT – An Owner's Manual for the Discrete Fourier Transform, SIAM, PA, 1995.
[6] T. M. Buzug, Computed Tomography – From Photon Statistics to Modern Cone-Beam CT, Springer, Berlin, 2008.
[7] G. Cimmino, Calcolo approssimato per le soluzioni dei sistemi di equazioni lineari, La Ricerca Scientifica, XVI, Series II, Anno IX, 1 (1938), pp. 326–333.
[8] T. Elfving, P. C. Hansen, and T. Nikazad, Semi-convergence and relaxation parameters for projected SIRT algorithms, SIAM J. Sci. Comput., 34 (2012), pp. A2000–A2017, doi: 10.1137/110834640.
[9] T. Elfving, P. C. Hansen, and T. Nikazad, Semi-convergence properties of Kaczmarz's method, Inverse Problems, 30 (2014), doi: 10.1088/0266-5611/30/5/055007.
[10] T. Elfving, P. C. Hansen, and T. Nikazad, Convergence analysis for column-action methods in image reconstruction, Numer. Algor., 74 (2016), pp. 905–924. Erratum (Fig. 3 was incorrect), p. 925.
[11] T. Elfving and T. Nikazad, Properties of a class of block-iterative methods, Inverse Problems, 25 (2009), doi: 10.1088/0266-5611/25/11/115011.
[12] T. Elfving, T. Nikazad, and P. C. Hansen, Semi-convergence and relaxation parameters for a class of SIRT algorithms, Electronic Trans. on Numerical Analysis, 37 (2010), pp. 321–336.


[13] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Springer, Berlin, 2000.
[14] L. D. Fosdick, E. R. Jessup, C. J. C. Schauble, and G. Domik, An Introduction to High-Performance Scientific Computing, MIT Press, MA, 1996.
[15] D. A. Girard, A fast 'Monte Carlo cross-validation' procedure for large least squares problems with noisy data, Numer. Math., 56 (1989), pp. 1–23.
[16] G. H. Golub, M. T. Heath, and G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics, 21 (1979), pp. 215–223.
[17] R. Gordon, R. Bender, and G. T. Herman, Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and X-ray photography, J. Theor. Biol., 29 (1970), pp. 471–481.
[18] K. Hahn, H. Schöndube, K. Stierstorfer, J. Hornegger, and F. Noo, A comparison of linear interpolation models for iterative CT reconstruction, Med. Phys., 43 (2016), pp. 6455–6473.
[19] P. Hall and D. M. Titterington, Common structure of techniques for choosing smoothing parameters in regression problems, J. R. Statist. Soc. B, 49 (1987), pp. 184–198.
[20] P. C. Hansen, V. Pereyra, and G. Scherer, Least Squares Data Fitting with Applications, Johns Hopkins University Press, 2012.
[21] P. C. Hansen and J. S. Jørgensen, AIR Tools II: algebraic iterative reconstruction methods, improved implementation, Numer. Algor., 79 (2018), pp. 107–137.
[22] P. C. Hansen, M. E. Kilmer, and R. H. Kjeldsen, Exploiting residual information in the parameter choice for discrete ill-posed problems, BIT, 46 (2006), pp. 41–59.
[23] S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bulletin de l'Académie Polonaise des Sciences et Lettres, A35 (1937), pp. 355–357.
[24] A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging, reprinted by SIAM, PA, 2001.
[25] C. D. Meyer, Matrix Analysis and Applied Linear Algebra, SIAM, PA, 2001.
[26] V. Morozov, Methods of Solving Incorrectly Posed Problems, Springer Verlag, New York, 1984.
[27] F. Natterer, The Mathematics of Computerized Tomography, John Wiley, New York, 1986.
[28] B. Rust and D. P. O'Leary, Residual periodograms for choosing regularization parameters for ill-posed problems, Inverse Problems, 24 (2008), doi: 10.1088/0266-5611/24/3/034005.


[29] Y. Saad and H. A. van der Vorst, Iterative solution of linear systems in the 20th century, J. Comput. Appl. Math., 123 (2000), pp. 1–33.
[30] R. J. Santos and A. R. De Pierro, A cheaper way to compute generalized cross-validation as a stopping rule for linear stationary methods, J. Comput. Graph. Statist., 12 (2003), pp. 417–433.
[31] L. A. Shepp and B. F. Logan, The Fourier reconstruction of a head section, IEEE Trans. Nuclear Science, 21 (1974), pp. 21–43.
[32] R. L. Siddon, Fast calculation of the exact radiological path for a three-dimensional CT array, Med. Phys., 12 (1985), pp. 252–255.
[33] H. H. B. Sørensen and P. C. Hansen, Multicore performance of block algebraic iterative reconstruction methods, SIAM J. Sci. Comput., 36 (2014), pp. C524–C546.
[34] M. Teboulle, A simplified view of first order methods for optimization, Math. Program., Ser. B, 170 (2018), pp. 67–96.
[35] A. M. Thompson, J. C. Brown, J. W. Kay, and D. M. Titterington, A study of methods of choosing the smoothing parameter in image restoration by regularization, IEEE Trans. Pattern Anal. Machine Intell., 13 (1991), pp. 326–339.
[36] V. F. Turchin, Solution of Fredholm equations of the first kind in a statistical ensemble of smooth functions, USSR Comput. Math. and Math. Phys., 7 (1967), pp. 79–101.
[37] W. van Aarle, W. J. Palenstijn, J. Cant, E. Janssens, F. Bleichrodt, A. Dabravolski, J. De Beenhouwer, K. J. Batenburg, and J. Sijbers, Fast and flexible X-ray tomography using the ASTRA Toolbox, Optics Express, 24 (2016), pp. 25129–25147.
[38] A. van der Sluis and H. A. van der Vorst, SIRT- and CG-type methods for the iterative solution of sparse linear least-squares problems, Linear Algebra Appl., 130 (1990), pp. 257–303.
[39] C. R. Vogel, Computational Methods for Inverse Problems, SIAM, PA, 2002, doi: 10.1137/1.9780898717570.
[40] G. Wahba, Spline Models for Observational Data, SIAM, PA, 1990, doi: 10.1137/1.9781611970128.
[41] G. L. Zeng and G. T. Gullberg, Null-space function estimation for the three-dimensional interior problem, 11th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine, July 11–15, 2011, Potsdam, Germany: Proceedings, pp. 241–245.