
MATRIX STRUCTURES AND PARALLEL ALGORITHMS FOR IMAGE SUPERRESOLUTION RECONSTRUCTION

QIANG ZHANG∗, RICHARD T. GUY†, AND ROBERT J. PLEMMONS‡

Abstract. Computational resolution enhancement (superresolution) is generally regarded as a memory-intensive process due to the large matrix-vector calculations involved. In this paper, a detailed study of the structure of the $n^2 \times n^2$ superresolution matrix is used to decompose the matrix into nine matrices of size $l^2 \times l^2$, where $l$ is the upsampling factor. As a result, previously large matrix-vector products can be broken into many small, parallelizable products. An algorithm is presented that utilizes the structural results to perform superresolution on compact, highly parallel architectures such as Field-Programmable Gate Arrays.

Key words. image superresolution, FPGA, parallel computation, structured matrices

AMS subject classifications. 65R32, 65F10, 65F50, 94A08

1. Introduction. Computational methods for resolution improvement (superresolution) have attracted much attention lately, due in part to their ability to overcome the optical limitations of inexpensive, lower resolution sensors; see, for instance, [6, 14, 16]. Superresolution (SR) is based on the idea that slight variations in the information encoded in a series of low resolution (LR) images can be used to recover a high resolution (HR) image.

The basic superresolution problem can be posed as an inverse problem [1, 6],

\[ \min_f \|DH_iS_if - g_i\|_2^2, \qquad i = 1, \dots, l^2, \tag{1.1} \]

where $f$ is the vectorized true high resolution image, $g_i$ is a vectorized lower resolution image, $D$ is the decimation matrix, $H_i$ is a blurring matrix, $S_i$ is a shift matrix, and $l$ is the upsampling factor. In the models that follow, the decimation matrix $D$ is a local averaging matrix that aggregates values of non-intersecting small neighborhoods of HR pixels to produce LR pixel values. The shift matrix $S_i$, also called the interpolation matrix, assigns weights according to a bilinear interpolation of HR pixel values to perform a rigid translation of the original image. The blurring matrix $H_i$ is generated from a point spread function (PSF) and represents distortion from atmospheric and other sources. As will be explained in Section 2, the $l^2$ matrices $DH_iS_i$ are usually stacked to create one large least squares problem

\[ \min_f \|Af - g\|_2^2, \tag{1.2} \]

where, using MATLAB notation, $A = [DH_1S_1; \dots; DH_{l^2}S_{l^2}]$, $g = [g_1; \dots; g_{l^2}]$, and $A \in \mathbb{R}^{n^2 \times n^2}$, where $n \times n$ is the dimension of the true high resolution image $f$.

The dimensionality of the problem is usually quite large. Given a moderate HR image size of $256 \times 256$ with $l = 4$, the naïve way to construct $A$ would require the $2l^2$ matrices $H_i$ and $S_i$, for $i = 1, \dots, l^2$, each $65536 \times 65536$, plus one smaller matrix $D$ of size $4096 \times 65536$. The system matrix $A$ is sparse but is of size $65536 \times 65536$.
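To make the contrast concrete, the following back-of-envelope sketch (ours, in Python; not from the original text) compares a dense storage of $A$ for this example with the nine $l^2 \times l^2$ generator blocks derived in Section 2.

```python
# Storage comparison for the example above (n = 256, l = 4): a dense
# n^2 x n^2 system matrix versus the nine l^2 x l^2 blocks of Section 2.
n, l = 256, 4
bytes_per_double = 8

dense_A = (n**2) ** 2 * bytes_per_double          # one n^2 x n^2 matrix
structured = 9 * (l**2) ** 2 * bytes_per_double   # nine l^2 x l^2 blocks

print(f"dense A:    {dense_A / 2**30:.1f} GiB")   # ~32 GiB
print(f"structured: {structured / 2**10:.1f} KiB") # ~18 KiB
```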

∗Department of Biostatistical Sciences, Wake Forest University Health Sciences, Medical Center Boulevard, Winston-Salem, NC 27157 ([email protected]).
†Department of Mathematics, Wake Forest University, Winston-Salem, NC 27106.
‡Departments of Mathematics and Computer Science, Wake Forest University, Winston-Salem, NC 27106.


This motivates a search for efficient SR algorithms, which has prompted various studies [5, 12]. To give one example, Nguyen et al. [12] proposed efficient block circulant preconditioners to accelerate the convergence of a conjugate gradient algorithm, whose complexity is $O(\sqrt{k}\, n^2)$, where $k$ is the condition number. Conjugate gradient algorithms and their variations are popular for this problem due to their strength in solving sparse systems.

Only recently have studies appeared that address implementation of SR algorithms with on-board hardware (System-on-Chips) [13, 15]. In those implementations, the SR model is simplified or expensive post-processing steps are included. This paper presents an algorithm that makes use of a detailed examination of the matrices $D$, $H_i$ and $S_i$ to replace large scale computations involving sparse matrices with a series of smaller operations which are readily parallelizable. The result is a Gauss-Seidel type algorithm optimized for use on highly parallel, compact architectures such as Field-Programmable Gate Arrays (FPGAs). In particular, the algorithm is suitable for use on hardware that can be integrated with a camera.

The paper proceeds as follows. In Section 2, we examine the block structures of permutations of the matrices $D$, $H_i$ and $S_i$. The results motivate a simple algorithm based on the Block Gauss-Seidel algorithm, which is introduced and analyzed in Section 3. In Section 4 we present numerical evidence that the new algorithm produces results comparable to the popular conjugate gradient algorithm despite very modest computational and memory demands. Finally, in Section 5 we discuss the use of matrix structures for general matrix-vector products on small scale hardware. All proofs appear in the Appendix.

2. Matrix Structures. Consider the image superresolution reconstruction problem defined in (1.1). We can concatenate all the product matrices $DH_iS_i$, each of size $n^2/l^2 \times n^2$, to form a larger matrix $A$ of size $n^2 \times n^2$, and similarly we can concatenate all $g_i$ to form one $n^2 \times 1$ vector $g$. Thus we treat the original problem as the least squares problem given in (1.2).

The matrix $A$ is often ill-conditioned, and a Tikhonov regularization term is added [2]:

\[ \min_f \|Af - g\|_2^2 + \alpha\|f\|_2^2. \tag{2.1} \]

Well developed algorithms exist to solve (2.1) using iterative methods or by considering the normal equations

\[ (A^TA + \alpha I)f = A^Tg \tag{2.2} \]

(see [8]).
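As a point of reference, here is a minimal sketch of solving the regularized problem with a generic sparse iterative solver; this is our illustration, not the paper's algorithm. SciPy's `lsqr` minimizes $\|Af - g\|_2^2 + \mathrm{damp}^2\|f\|_2^2$, so setting `damp` $= \sqrt{\alpha}$ reproduces (2.1).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def tikhonov_solve(A: sp.spmatrix, g: np.ndarray, alpha: float) -> np.ndarray:
    """Solve min_f ||A f - g||_2^2 + alpha * ||f||_2^2, i.e. problem (2.1)."""
    # lsqr adds the damping term damp^2 * ||f||^2, so damp = sqrt(alpha).
    return lsqr(A, g, damp=np.sqrt(alpha))[0]
```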

It is readily apparent that all three component matrices of $A$ involve only local operations, and thus we should expect $A$ and possibly $A^TA$ to possess a sparse form. For instance, if one assumes that the interpolation matrix $S_i$ represents spatially invariant translational shifts $(\delta_{x_i}, \delta_{y_i})$, then the entire $n^2 \times n^2$ matrix is generated by only two scalar quantities. The decimation and blurring matrices (subject to conditions discussed later) also have a simply defined structure, and it is possible to permute the matrix $A$ to bring all of the non-zero elements into a tridiagonal structure. The next four theorems make this notion more precise. For simplicity, we will first ignore the blurring matrices $H_i$. Again, all proofs appear in the Appendix.

Theorem 2.1. Let $A = (DS_1; \dots; DS_{l^2})$ be a no-blurring superresolution system matrix, with $S_i$ representing 2D rigid translational subpixel shifts $\delta_{x_i}, \delta_{y_i} \in (-1, 1)$, and $D$ representing a weighted average decimation matrix. Then there exist permutations $Q$ and $P$ such that $Q^TAP$ has a tridiagonal block Toeplitz structure with tridiagonal block Toeplitz blocks, represented as

\[
Q^TAP =
\begin{pmatrix}
B_0 & A_1 & & & \\
A_{-1} & A_0 & A_1 & & \\
& A_{-1} & A_0 & A_1 & \\
& & \ddots & \ddots & \ddots \\
& & & A_{-1} & A_0
\end{pmatrix}, \tag{2.3}
\]

with

\[
A_i =
\begin{pmatrix}
B_0^{(i)} & A_1^{(i)} & & & \\
A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & & \\
& A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & \\
& & \ddots & \ddots & \ddots \\
& & & A_{-1}^{(i)} & A_0^{(i)}
\end{pmatrix}, \tag{2.4}
\]

where $B_0, A_i \in \mathbb{R}^{nl \times nl}$, $i = -1, 0, 1$, and $B_0^{(i)}, A_j^{(i)} \in \mathbb{R}^{l^2 \times l^2}$, $j = -1, 0, 1$. Both $Q^TAP$ and the $A_i$ have $n/l \times n/l$ blocks.

Note that using a circulant boundary condition, as assumed later, we would have $B_0 = A_0$ and $B_0^{(i)} = A_0^{(i)}$, but if other boundary conditions are assumed, $B_0$ and $B_0^{(i)}$ differ from $A_0$ and $A_0^{(i)}$.

Before rigorously defining the permutation matrices $P$ and $Q$ in the Appendix, we briefly explain here that $P$ is equivalent to an alternate indexing method for vectorizing a matrix. The typical way to vectorize a matrix follows a column-wise ordering, as illustrated in (2.5) for a $256 \times 256$ matrix $M$: each number in the array represents the position of that element in the vectorized matrix. For example, the element on the first row and the second column of the original image will be the 257th entry of the vectorized matrix.

\[
\begin{pmatrix}
1 & 257 & 513 & 769 & 1025 & 1281 & 1537 & 1793 & \dots \\
2 & 258 & 514 & 770 & 1026 & 1282 & 1538 & 1794 & \dots \\
\vdots & & & & & & & & \\
256 & 512 & 768 & 1024 & 1280 & 1536 & 1792 & 2048 & \dots
\end{pmatrix} \tag{2.5}
\]

If we assume an upsampling factor of 4, the following indexing method (2.6) represents the action of vectorizing the matrix $P^TMP$.

\[
\begin{pmatrix}
1 & 2 & 3 & 4 & 1025 & 1026 & 1027 & 1028 & \dots \\
5 & 6 & 7 & 8 & 1029 & 1030 & 1031 & 1032 & \dots \\
\vdots & & & & & & & & \\
1021 & 1022 & 1023 & 1024 & 2045 & 2046 & 2047 & 2048 & \dots
\end{pmatrix} \tag{2.6}
\]

Now the element on the first row and the second column of the matrix $P^TMP$ is the 2nd entry. We may name this vectorization method "l-length block vectorization". The motivation comes from the fact that the matrix $A$ in (1.2) is essentially a spatially local operator, which operates on spatially close pixels of the HR image $f$. The traditional column-wise ordering would leave pixels in the next column of the image $n$ pixels away. The l-length block ordering maintains more spatial information by

3

Page 4: MATRIX STRUCTURES AND PARALLEL ALGORITHMS FORusers.wfu.edu/plemmons/papers/siam_maa3.pdf · xi 0, QTAP is an upper bidiagonal block Toeplitz matrix. 2. If all xi 0, QTAP is a lower

leaving proximate pixels nearby in the vectorized $f$. The result is a more compact structure for spatially local operators like $A$.
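The reordering that $P$ induces on a vectorized image can be written down directly. The following sketch (ours) implements the index map of (A.16) from the Appendix, written 0-based here, and checks it against the example ordering (2.6).

```python
# A sketch (ours) of the "l-length block vectorization": return an n x n
# image as a vector in the ordering of (2.6).  Uses the index formula
# p_ij = (i-1)l + floor((j-1)/l)*nl + ((j-1) mod l) + 1 of (A.16).
import numpy as np

def block_vectorize(image: np.ndarray, l: int) -> np.ndarray:
    n = image.shape[0]
    vec = np.empty(n * n, dtype=image.dtype)
    for i in range(n):
        for j in range(n):
            p = i * l + (j // l) * n * l + (j % l)   # 0-based position
            vec[p] = image[i, j]
    return vec

# Sanity check against (2.6) for n = 256, l = 4: pixel (row 1, col 2)
# lands at position 2 (1-based), pixel (row 2, col 1) at position 5,
# and pixel (row 1, col 5) at position 1025.
M = np.arange(256 * 256).reshape(256, 256)
v = block_vectorize(M, 4)
assert v[1] == M[0, 1] and v[4] == M[1, 0] and v[1024] == M[0, 4]
```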

The left permutation matrix $Q$ is the product $P\bar{Q}$, where $\bar{Q}$ maps element $k$ of the $i$th vectorized LR image $g_i$ to element $(k-1)l^2 + i$ of the stacked $g$ in (1.2), for $k = 1, \dots, (n/l)^2$. Intuitively, $\bar{Q}$ performs a perfect shuffle [7] on the $l^2$ blocks of size $(n/l)^2 \times n$. It is not necessary to explicitly construct and store the matrix $P$ in the computations, as the vectors $Pf$ and $Pg$ can be constructed from block-wise reorderings.

In the process of proving the theorem above, we note that $A$ can be simplified even further by putting constraints on $\delta_x$ or $\delta_y$.

Corollary 2.2. The following conditions hold:
1. If all $\delta_{x_i} \geq 0$, $Q^TAP$ is an upper bidiagonal block Toeplitz matrix.
2. If all $\delta_{x_i} \leq 0$, $Q^TAP$ is a lower bidiagonal block Toeplitz matrix.
3. If all $\delta_{y_i} \geq 0$, $A_i$ is an upper bidiagonal block Toeplitz matrix.
4. If all $\delta_{y_i} \leq 0$, $A_i$ is a lower bidiagonal block Toeplitz matrix.

It is always possible to satisfy condition 1 or 3 of Corollary 2.2 by choosing the leftmost or uppermost LR image as the reference. It is not possible in general to satisfy both constraints unless the imaging system is designed such that the leftmost and uppermost images are the same.

In general, if the matrix $B$ is a tridiagonal block Toeplitz matrix, then $B^TB$ will be a pentadiagonal block Toeplitz matrix with a rank-2 correction. However, as the next result states, the outermost blocks in the pentadiagonal $A^TA$ are identically zero and the correction is not necessary. This makes it easier to find a direct solution to the normal equations (2.2). We formalize this notion in the following lemma and theorem.

Lemma 2.3. $A_1^TA_{-1} = 0$ and $(A_1^{(i)})^TA_{-1}^{(j)} = 0$, where $i, j = -1, 0, 1$.

Theorem 2.4. The matrix $P^TA^TAP$ has the same structure as the matrix $A$ in Theorem 2.1.

The two-level tridiagonal block Toeplitz structure in $Q^TAP$ and $P^TA^TAP$ introduces algorithmic efficiency by reducing the storage requirement to only nine $l^2 \times l^2$ matrices. In practice, Corollary 2.2 makes it possible to store the permuted system matrix in six $l^2 \times l^2$ matrices. We also see a simplification of the matrix-vector products used in Conjugate Gradient Least Squares (CGLS) [11] and other iterative algorithms for sparse matrices. Similarly, the permuted normal equations satisfy a tridiagonal block structure that allows an efficient solution to (2.2).
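As an illustration of this storage reduction, the sketch below (ours, not the paper's implementation) applies $Q^TAP$ to a vector using only the nine $l^2 \times l^2$ generator blocks of Theorem 2.1, assuming the circulant boundary condition so that $B_0 = A_0$; the input dictionary `blocks[(i, j)]` holding $A_j^{(i)}$ is a hypothetical interface of ours.

```python
# A sketch (ours) of y = (Q^T A P) x using nine l^2 x l^2 blocks.  x is
# the permuted HR image of length n^2, viewed as an (n/l) x (n/l) grid
# of l^2-vectors; every small product is independent and could be issued
# to a separate processing element (e.g. an FPGA MVM core).
import numpy as np

def structured_matvec(blocks: dict, x: np.ndarray, n: int, l: int) -> np.ndarray:
    m = n // l                      # number of blocks at each level
    xg = x.reshape(m, m, l * l)     # (outer block, inner block, l^2 entries)
    yg = np.zeros(xg.shape)
    for i in (-1, 0, 1):            # outer (first-level) diagonal index
        for j in (-1, 0, 1):        # inner (second-level) diagonal index
            Aij = blocks[(i, j)]    # the l^2 x l^2 block A_j^{(i)}
            # periodic wrap implements the circulant boundary condition
            shifted = np.roll(np.roll(xg, -i, axis=0), -j, axis=1)
            yg += shifted @ Aij.T
    return yg.reshape(-1)
```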

In general, $H_i$ can have many nonzero entries and the matrix $A$ can suffer from a more complicated structure. However, in many applications the nonzero elements of the PSF are concentrated in a small circle around the center. By imposing a moderate limit on the size of the diameter containing nonzero entries, it is possible to retain many of the patterns previously introduced.

Theorem 2.5. Let $A = (DH_1S_1; \dots; DH_{l^2}S_{l^2})$, where $H_i$ represents a PSF of diameter less than or equal to $2l+1$. Then $Q^TAP$ has a two-level penta-diagonal block Toeplitz structure represented as

\[
Q^TAP =
\begin{pmatrix}
B_0 & A_1 & A_2 & & \\
A_{-1} & A_0 & A_1 & A_2 & \\
A_{-2} & A_{-1} & A_0 & A_1 & A_2 \\
& \ddots & \ddots & \ddots & \ddots \\
& & A_{-2} & A_{-1} & A_0
\end{pmatrix}, \tag{2.7}
\]

with

\[
A_i =
\begin{pmatrix}
B_0^{(i)} & A_1^{(i)} & A_2^{(i)} & & \\
A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & A_2^{(i)} & \\
A_{-2}^{(i)} & A_{-1}^{(i)} & A_0^{(i)} & A_1^{(i)} & A_2^{(i)} \\
& \ddots & \ddots & \ddots & \ddots \\
& & A_{-2}^{(i)} & A_{-1}^{(i)} & A_0^{(i)}
\end{pmatrix}, \tag{2.8}
\]

where $A_j \in \mathbb{R}^{nl \times nl}$ and $A_j^{(i)} \in \mathbb{R}^{l^2 \times l^2}$, $j = -2, -1, 0, 1, 2$.

As in the case without blurring, the matrix $P^TA^TAP$ has the same structure as $Q^TAP$.

Theorem 2.6. With $A$ defined as in Theorem 2.5, $P^TA^TAP$ has the same structure as $Q^TAP$.

In some cases, it is advantageous to consider the structure of the submatrices $A_j^{(i)}$. To that end, we have the following theorem.

Theorem 2.7. Under the hypotheses of Theorem 2.1, the following conditions summarize the tertiary (third level) structure of the permuted system matrix $Q^TAP$:
1. If $\{\delta_x\} = \{\delta_y\} = \{0, 1/z, \dots, (l-1)/z\}$ for some integer $z \in [1, \infty)$, sorted in $y$ then $x$, then $A_j^{(i)}$ is block Hankel with $l \times l$ Hankel blocks. When $z = 1$, all nonzero blocks have constant value $1/l^2$. Furthermore, if $j = 0$ then all blocks $H_{ij}$ such that $i < j$ are nonzero, and if $j = -1$ then all blocks with $i \geq j$ are nonzero, where $i, j = 1, \dots, l$.
2. For all $\delta_x, \delta_y$, the following sum holds for the $A_j^{(i)}$:

\[
\sum_{i=-1}^{1} \sum_{j=-1}^{1} A_j^{(i)} = \left(\frac{1}{l^2}\right)_{l^2 \times l^2}, \tag{2.9}
\]

where the right hand side is a constant matrix with each entry equal to $1/l^2$.

At this point, the structures of $Q^TAP$ and $P^TA^TAP$ are sufficiently simplified to suggest efficient structured matrix algorithms to solve (2.2). In the next section, to avoid treating $B_0$ as a different diagonal block matrix, we assume a periodic boundary condition to effectively change it to $A_0$.

3. Algorithms. In this section we first present an algorithm to solve the normal equations (2.2) with a chosen $\alpha$ that takes advantage of the specific matrix structure introduced in the last section. The algorithm presented is a Block Gauss-Seidel approach with an inner Cyclic Reduction (no blurring) or Gauss-Seidel (blurring present) iteration. The algorithm is first presented for the matrix in Theorem 2.1, then extended to the matrix in Theorem 2.5.

3.1. Without blurring. The Cyclic Reduction (CR) method [3, 8] is a direct method for solving a linear system in which the matrix has a tridiagonal block Toeplitz structure. After the first CR step, i.e. the even-odd permutations in both rows and columns, the matrix $P^TA^TAP$ in (2.3) becomes

\[
P^TA^TAP \;\Rightarrow\;
\left(
\begin{array}{cccc|cccc}
A_0 & & & & A_1 & & & \\
& A_0 & & & A_{-1} & A_1 & & \\
& & \ddots & & & \ddots & \ddots & \\
& & & A_0 & & & A_{-1} & A_1 \\
\hline
A_{-1} & A_1 & & & A_0 & & & \\
& A_{-1} & A_1 & & & A_0 & & \\
& & \ddots & \ddots & & & \ddots & \\
& & & A_{-1} & & & & A_0
\end{array}
\right). \tag{3.1}
\]

We could follow the CR steps by inverting $A_0$ and multiplying it with $A_1$ and $A_{-1}$. However, given that the size of the $A_i$ is $nl \times nl$, this is still computationally intensive. Notice that $A_0$ has the same first order structure as $A$, but with each block of smaller size $l^2 \times l^2$. Thus, we can use CR to solve a subproblem $A_0x = b$. To set up the $n/2l$ subproblems, we introduce an outer block Gauss-Seidel iteration. That is, we first break $f$ into $n/l$ segments $f_i \in \mathbb{R}^{nl}$, and at step $k$ we use the $f_i^{(k)}$, $i = n/2l+1, \dots, n/l$, to solve for each $f_i^{(k+1)}$, $i = 1, \dots, n/2l$. The updating formula for the first half of $f$ at step $k$ is

\[
\begin{aligned}
A_0 f_1^{(k+1)} &= g_1 - A_1 f_{n/2l+1}^{(k)}, \\
A_0 f_i^{(k+1)} &= g_i - A_{-1} f_{i+n/2l-1}^{(k)} - A_1 f_{i+n/2l}^{(k)}, \qquad i = 2, \dots, n/2l.
\end{aligned} \tag{3.2}
\]

The tridiagonal block Toeplitz structure within $A_1$ and $A_{-1}$ allows us to perform the matrix-vector multiplications on the $l^2 \times l^2$ blocks. Each $f_i^{(k+1)}$ can be solved independently using CR, since $A_0$ is a tridiagonal block Toeplitz matrix. Next we use the updated first half of $f^{(k+1)}$ to solve for the second half of $f^{(k+1)}$ using CR. The updating formula is

\[
\begin{aligned}
A_0 f_i^{(k+1)} &= g_i - A_1 f_{i-n/2l+1}^{(k+1)} - A_{-1} f_{i-n/2l}^{(k+1)}, \qquad i = n/2l+1, \dots, n/l-1, \\
A_0 f_{n/l}^{(k+1)} &= g_{n/l} - A_{-1} f_{n/2l}^{(k+1)}.
\end{aligned} \tag{3.3}
\]

For simplicity, we assume that the size of $A_0$ is a power of 2; however, there is an extension to other sizes that adds a slight computational penalty (see [8, Sec. 4.5.4]).
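A sketch (ours) of one outer sweep implementing (3.2) and (3.3); the callables `solve_A0`, `apply_A1` and `apply_Am1` stand in for the inner Cyclic Reduction solve with $A_0$ and the structured products with $A_1$ and $A_{-1}$, and are assumed to be supplied by the reader.

```python
# One outer Block Gauss-Seidel sweep, eqs. (3.2)-(3.3).  f and g are
# lists of the n/l segments f_i, g_i in R^{nl} after the even-odd
# permutation, so the first half holds the odd-indexed unknowns.
def bgs_sweep(f, g, solve_A0, apply_A1, apply_Am1):
    h = len(f) // 2                                  # h = n/(2l)
    # first half (odd-indexed unknowns), eq. (3.2)
    f[0] = solve_A0(g[0] - apply_A1(f[h]))
    for i in range(1, h):
        f[i] = solve_A0(g[i] - apply_Am1(f[i + h - 1]) - apply_A1(f[i + h]))
    # second half (even-indexed unknowns), eq. (3.3)
    for i in range(h, 2 * h - 1):
        f[i] = solve_A0(g[i] - apply_A1(f[i - h + 1]) - apply_Am1(f[i - h]))
    f[2 * h - 1] = solve_A0(g[2 * h - 1] - apply_Am1(f[h - 1]))
    return f
```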

One important feature of this approach is extensive parallelism. Each step reduces to $n/2l$ smaller scale subproblems that can be solved independently by $n/2l$ processors, which greatly enhances throughput. Furthermore, implementation of superresolution on systems like FPGAs is not only feasible but desirable due to the FPGA's strong performance in parallel applications. Note as well that the algorithm only requires matrix-vector multiplication and that all multiplications are on the scale of $l^2$. As a result, implementation is vastly simplified.

It is well known that the Gauss-Seidel iteration is absolutely convergent on $Sx = y$ provided the matrix $S$ is symmetric and positive definite (see the proof in [8, Sec. 10.1]). The regularized normal equations matrix $A^TA + \alpha I$ meets these criteria, and the proof can easily be extended to the block Gauss-Seidel method.

Theorem 3.1. The Block Gauss-Seidel algorithm described above converges for (2.2) from any $f^{(0)}$.


3.2. With blurring. The main difference when the blurring matrix is included is that the first and second level structures are penta-diagonal rather than tridiagonal, forcing us to abandon Cyclic Reduction for the inner problem. However, one can utilize Gauss-Seidel on both the outer and inner iterations. We first group the block rows and columns of $P^TA^TAP$ into the sets $\{3k+1\}$, $\{3k+2\}$ and $\{3k+3\}$ for $k = 0, \dots, n/3l - 1$, as shown in (3.4).

\[
P^TA^TAP \;\Rightarrow\;
\begin{pmatrix}
\tilde{A}_0 & \tilde{A}_1 & \tilde{A}_2 \\
\tilde{A}_{-1} & \tilde{A}_0 & \tilde{A}_1 \\
\tilde{A}_{-2} & \tilde{A}_{-1} & \tilde{A}_0
\end{pmatrix}, \tag{3.4}
\]

where $\tilde{A}_0 = \mathrm{diag}(A_0, \dots, A_0)$; $\tilde{A}_1$ and $\tilde{A}_2$ are lower bidiagonal block Toeplitz, with $A_1$ and $A_2$ on their diagonals and $A_{-2}$ and $A_{-1}$ on their subdiagonals, respectively; and $\tilde{A}_{-1}$ and $\tilde{A}_{-2}$ are upper bidiagonal block Toeplitz, with $A_{-1}$ and $A_{-2}$ on their diagonals and $A_2$ and $A_1$ on their superdiagonals, respectively.

There are $n/3l \times n/3l$ inner blocks in each of the $3 \times 3$ outer blocks. For convenience, we assume $n$ is divisible by $3l$; otherwise, it is always possible to add an extra one or two zero columns and rows to the LR images to make $n$ divisible by $3l$. In (3.4), each $\tilde{A}_i$ is penta-diagonal block Toeplitz, which we can permute in the same way to create a second level $3 \times 3$ block form identical to (3.4) above. The block Gauss-Seidel iterations occur at two levels corresponding to the two-level matrix structure. We first group the entries of $f$ in the sets $\{3k+1\}$, $\{3k+2\}$ and $\{3k+3\}$, denoted $f_1$, $f_2$ and $f_3$, which are updated iteratively at the first level as

\[
\begin{aligned}
\tilde{A}_0 f_1^{(k+1)} &= g_1 - \tilde{A}_1 f_2^{(k)} - \tilde{A}_2 f_3^{(k)}, \\
\tilde{A}_0 f_2^{(k+1)} &= g_2 - \tilde{A}_{-1} f_1^{(k+1)} - \tilde{A}_1 f_3^{(k)}, \\
\tilde{A}_0 f_3^{(k+1)} &= g_3 - \tilde{A}_{-1} f_2^{(k+1)} - \tilde{A}_{-2} f_1^{(k+1)}.
\end{aligned} \tag{3.5}
\]

For each sub-problem in (3.5), we use the same update rules on the second level block matrices, which are structurally identical. Each sub-problem requires a matrix inversion on the order of $l^2 \times l^2$, which can be performed once and stored. Absolute convergence is provable for each of the two levels of Gauss-Seidel iterations in (3.5), leading to a proof of absolute convergence for the combined two-level iteration.
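For concreteness, here is a sketch (ours) of one first-level sweep of (3.5); `solve_A0` stands for the solve with $\tilde{A}_0$, itself realized by the same sweep applied one level down, and `apply_A[i]` applies $\tilde{A}_i$ for $i \in \{-2, -1, 1, 2\}$; both are assumed given.

```python
# One first-level sweep of the two-level Block Gauss-Seidel
# iteration (3.5); f1, f2, f3 are the three index groups of f.
def bgs2_sweep(f1, f2, f3, g1, g2, g3, solve_A0, apply_A):
    f1 = solve_A0(g1 - apply_A[1](f2) - apply_A[2](f3))
    f2 = solve_A0(g2 - apply_A[-1](f1) - apply_A[1](f3))
    f3 = solve_A0(g3 - apply_A[-1](f2) - apply_A[-2](f1))
    return f1, f2, f3
```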

Theorem 3.2. Using the algorithm described above for Problem (2.2), the iteration converges from any $f^{(0)}$.

4. Numerical Experiments. The algorithms in the last section were applied to both a simulated satellite image and real images of an Air Force resolution target taken by a lenslet array camera [10]. The algorithm in Section 3.1 was also implemented on an FPGA development board [4]. In the next three subsections, we present details of all three experiments.


Fig. 4.1. (a) Original satellite image. (b) Low resolution satellite image.

4.1. Simulated images. One original HR satellite image [11] of size $256 \times 256$, shown in Figure 4.1(a), is downsampled, interpolated using known $(\delta_x, \delta_y)$, and degraded with additive Gaussian noise to create 16 LR images of size $64 \times 64$, one of which appears in Figure 4.1(b). A sub-pixel registration algorithm [17] was then applied to the 16 LR images to estimate a set of $(\delta_x, \delta_y)$. The LR images combined with the offsets are used to reconstruct a $256 \times 256$ HR image.

We compare our algorithm with the CGLS method [9] in light of the general popularity of CG for solving sparse systems. A naïve CG implementation requires the construction of 16 shift matrices $S_i$ of size $65536 \times 65536$ plus a decimation matrix $D$ of size $4096 \times 65536$. Matrix-vector products occur with the entire $65536 \times 1$ vector. Figures 4.2(a) and 4.2(b) show the results of CG and the Block Gauss-Seidel with Cyclic Reduction (BGS-CR). It is clear that the results are nearly identical. The relative difference in Frobenius norm between the two results is .0218, and the mean square errors when compared to the true image are .0119 for CG and .0118 for BGS-CR. However, the BGS-CR algorithm takes 3.2 seconds on a 3.0GHz Pentium IV processor whereas CG takes 8.7 seconds. Both algorithms stopped after 5 iterations when no significant improvement was observed, i.e. when the mean difference between iterations was less than $10^{-4}$.

Much of the work in the CG algorithm goes into fully constructing the large matrices $D$ and $S_i$ at the scale of $n^2$, while our algorithm only needs to construct small inner blocks of $D$ and $S_i$ at the scale of $l^2$. Using the matrix structures presented in this paper to avoid explicit construction of the system matrix leads to a much faster CG implementation. In fact, the results in Section 2 can be used to create a matrix-vector multiplication function for use in any reconstruction method that only requires a matrix-vector multiplication. Such a function will have the advantage of reduced memory and computational complexity.

The same test was performed with the addition of a Gaussian blur to a noisy HR image to create the blurred and noisy HR image seen in Figure 4.3(a) before downsampling and interpolating. Reconstruction was performed using the two-level Block Gauss-Seidel (BGS) algorithm described in Section 3.2. Figure 4.3(b) shows the recovered image, which needed a smaller regularization parameter due to the fact that the smoothing effect of the blur operator removed some of the original noise in the noisy HR image. Once again, results were comparable with the CG method, but BGS is much faster. The CG method without using the block structure took 24.5 seconds, while the two-level BGS algorithm took 15 seconds.


Fig. 4.2. (a) Conjugate gradient method using the tridiagonal block Toeplitz structure, $\alpha = .1$ and MSE = .0118. (b) Block Gauss-Seidel coupled with Cyclic Reduction method, $\alpha = .1$ and MSE = .0119.

Fig. 4.3. (a) Blurred and noisy image. (b) Recovered image, $\alpha = .03$.

We see a reduced speedup factor here because of the sequential processing of many more small matrix-vector multiplications at the scale of $l^2$. An even more dramatic improvement can be expected if the many small products are performed in parallel on a platform such as an FPGA.

4.2. Real images from a lenslet array camera. A lenslet array camera [10] was used to capture images of an Air Force resolution target in a lab setting. A 10 mega-pixel raw image is segmented into 16 subimages of size $128 \times 128$, which are then registered with the subpixel registration algorithm [17] and used to reconstruct a $512 \times 512$ HR image. Figure 4.4 compares the resolutions of the reconstructed HR image on the left and a blown-up LR image. Clearly, we observe resolution enhancement.

4.3. FPGA implementation. Last, the algorithm in Section 3.1 was ported to a Xilinx Virtex 5 SX50T development board, with 52K logic cells, 594KB BlockRAM, 288 DSPs and 256MB DDR2 SDRAM. A custom 32-bit pipeline was designed, based on matrix-vector multiplication. A raw image from a lenslet array camera was segmented into 16 LR images of size $128 \times 128$, which were then registered with the subpixel registration algorithm [17] and used to reconstruct a $512 \times 512$ HR image. The LR images were read into the onboard memory through an Ethernet port, and the reconstructed HR image was retrieved through a USB port and displayed on a 7 inch LCD. The current maximum processing capability is 2 frames per second (fps).


Fig. 4.4. Resolution enhancement after the reconstruction using real images taken by a lenslet array camera.

Without the memory interface bottleneck, the processing capability can move up to 5 fps. The system is highly scalable, because the core co-processing element is an $l^2$-scale matrix-vector multiplication (MVM), which can be easily replicated for larger SR problems and saved on larger FPGAs. Thus the speedup should be near linear until memory bandwidth is exhausted. The current system has only one MVM core, while we project that a scaled system on a Virtex 5 LX330T could hold 6 MVM cores and thus deliver an estimated performance of 30 fps. This translates into a speedup factor of approximately 383 when compared to a desktop computer running the same algorithm in MATLAB, and a speedup factor of 1,043 when compared to a desktop computer running the CGLS algorithm. However, regarding implementing the CGLS algorithm on an FPGA, it is important to note that it is not currently possible to store the system matrix $A$ in onboard memory, even in a sparse format. Furthermore, very few tools for large scale matrix-vector products exist for FPGAs [4, 13].

5. Conclusions. Traditional models of superresolution lead to large, sparse matrices that are difficult to implement with on-board electronics. This paper presents algorithms that make use of the structure of the sparse matrices to replace large scale matrix-vector products with small, parallelizable products. The algorithms introduced here cause no degradation in quality relative to current reconstruction methods, yet they offer a much faster reconstruction with limited memory requirements. As a result, a problem once thought difficult to implement with special purpose digital computers can be made to take full advantage of the small scale, highly parallel capabilities of such systems. In addition, the structural results are suitable for constructing matrix-vector products for use in any algorithm in which they are required.

Acknowledgments. The authors thank Dr. James Nagy, Dr. Paul Pauca, Dr. Sudhakar Prasad, Dr. Todd Torgersen and other researchers in the PERIODIC research group for their critiques [10]. The research described in this paper was supported in part by the Intelligence Advanced Research Projects Agency (IARPA) through the Defense Microelectronics Activity (DMEA) under cooperative agreement number H94003-08-2-0802, and by the Air Force Office of Scientific Research (AFOSR) under award number FA9550-08-1-0151.

REFERENCES


[1] R. Barnard, P. Pauca, T. Torgersen, R. Plemmons, S. Prasad, J. van der Gracht, J. Nagy, J. Chung, G. Behrmann, S. Matthews, and M. Mirotznik, High-resolution iris image reconstruction from low-resolution imagery, in Proc. SPIE, Advanced Signal Processing Algorithms, Architectures, and Implementations, vol. 6313, San Diego, CA, Aug. 2006, pp. 63130D1–63130D13.
[2] M. Bertero and P. Boccacci, Introduction to Inverse Problems in Imaging, Institute of Physics Publishing, 1998.
[3] D. Bini and B. Meini, Solving block banded block Toeplitz systems with structured blocks: new algorithms and open problems, Large-scale Scientific Computations of Engineering and Environmental Problems II, Notes on Numerical Fluid Mechanics, 13 (2000), pp. 15–24.
[4] S.D. Brown, R.J. Francis, J. Rose, and Z.G. Vranesic, Field-Programmable Gate Arrays, Springer, 1992.
[5] J. Chung, E. Haber, and J. Nagy, Numerical methods for coupled super-resolution, Inverse Problems, 22 (2006), pp. 1261–1272.
[6] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, Advances and challenges in super-resolution, International Journal of Imaging Systems and Technology, 14 (2004), pp. 47–57.
[7] S.W. Golomb, Permutations by cutting and shuffling, SIAM Review, 3 (1961), pp. 293–297.
[8] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 3rd ed., 1996.
[9] M.R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards, 49 (1952), pp. 409–436.
[10] M. Mirotznik, S. Mathews, R. Plemmons, P. Pauca, T. Torgersen, R. Barnard, T. Guy, B. Gray, Q. Zhang, J. van der Gracht, C. Petersen, M. Bodnar, and S. Prasad, A practical enhanced-resolution integrated optical-digital imaging camera (PERIODIC), in Proc. SPIE, vol. 7348, Orlando, FL, April 2009, Conference on Defense, Security and Sensing.
[11] J. Nagy, R.J. Plemmons, and T.C. Torgersen, Iterative image restoration using approximate inverse preconditioning, IEEE Transactions on Image Processing, 5 (1996), pp. 1151–1162.
[12] N. Nguyen, P. Milanfar, and G. Golub, A computationally efficient image superresolution algorithm, IEEE Transactions on Image Processing, 10 (2001), pp. 573–583.
[13] F.E. Ortiz, E.J. Kelmelis, J.P. Durbano, and D.W. Prather, FPGA acceleration of superresolution algorithms for embedded processing in millimeter-wave sensors, in Proc. SPIE, vol. 6548, May 2007, pp. 6548–0K.
[14] S.C. Park, M.K. Park, and M.G. Kang, Super-resolution image reconstruction: A technical overview, IEEE Signal Processing Magazine, 20 (2003), pp. 21–36.
[15] D. Reddy, Z. Yue, and P. Topiwala, An efficient real time superresolution ASIC system, in Proc. SPIE, vol. 6957, 2008, pp. 6957–09.
[16] R.S. Wagner, D.E. Waagen, and M.L. Cassabaum, Image super-resolution for improved automatic target recognition, in Proc. SPIE, vol. 5426, Orlando, FL, April 2004, pp. 188–196.
[17] Q. Zhang, Analytical approximations of translational subpixel shifts in signal and image registrations, in Proc. SPIE, vol. 7074, San Diego, CA, August 2008, pp. 70740E1–70740E7.

Appendix A. Proofs.

The following section contains proofs of Theorems 2.1 to 2.7. The proofs revolve around the definitions of the three component matrices $D$, $S_i$, and $H_i$, plus the permutation matrices $P$ and $Q$. Structures of the matrices in question are revealed through the row and column indices of nonzero entries. In particular, if $A$ is a matrix such that all nonzero entries on the $i$th row are within the range $[i - l^2, i + l^2]$, then $A$ is a tridiagonal block matrix having a block size of $l^2 \times l^2$.

A.1. Decimation matrix. We start with the decimation matrix, which can also be regarded as a local mean matrix.

Definition A.1. We define a decimation operation, $D \in \mathbb{R}^{n^2/l^2 \times n^2}$, on a vectorized image, $f \in \mathbb{R}^{n^2 \times 1}$, as $g = Df$, such that the entries of $D$ are determined by the following averaging equation:

\[ g_{ij} = \frac{1}{l^2} \sum_{\bar{i}=0}^{l-1} \sum_{\bar{j}=0}^{l-1} f_{li-\bar{i},\, lj-\bar{j}}, \tag{A.1} \]

where $i, j$ are the row and column indices of the decimated image. The vectorization follows the typical column ordering.

As an example, the $(1, 1)$ pixel of $g$ is an average of the pixels in the square from $(1, 1)$ to $(l, l)$ in $f$.
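In code, the action $g = Df$ of Definition A.1 never requires forming $D$ explicitly; a minimal numpy sketch (ours):

```python
# Decimation as in Definition A.1: each LR pixel is the unweighted mean
# of a non-intersecting l x l square of HR pixels.
import numpy as np

def decimate(f: np.ndarray, l: int) -> np.ndarray:
    n = f.shape[0]
    # split the image into (n/l) x (n/l) tiles of size l x l and average
    return f.reshape(n // l, l, n // l, l).mean(axis=(1, 3))
```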

The structure of $D$ is given below.

Proposition A.2. The matrix $D$ defined above has a block diagonal structure given by

\[
D =
\begin{pmatrix}
D_1 & & & \\
& D_1 & & \\
& & \ddots & \\
& & & D_1
\end{pmatrix}_{n^2/l^2 \times n^2}, \tag{A.2}
\]

where $D$ has $n/l \times n/l$ blocks and $D_1 = [D_2\, D_2 \dots D_2] \in \mathbb{R}^{n/l \times nl}$, where $D_2$ also has a diagonal structure,

\[
D_2 =
\begin{pmatrix}
v & & & \\
& v & & \\
& & \ddots & \\
& & & v
\end{pmatrix}_{n/l \times n}, \tag{A.3}
\]

where $D_2$ also has $n/l \times n/l$ blocks, and $v = (1/l^2, 1/l^2, \dots, 1/l^2) \in \mathbb{R}^{1 \times l}$ is a constant row vector.

Proof. The structure follows from the definition of a traditional matrix vectorization. Furthermore, all values are equal to $1/l^2$ because $D$ takes an unweighted average.

The nonzero entries in each row of the matrix $D$ are separated by a distance of $n$ entries, so $d_i = (0, \dots, 0, v, 0, \dots, 0, v, 0, \dots, 0, v, 0, \dots, 0)$, where $d_i$ is the $i$th row of $D$. However, we can group these entries into one continuous block of size $1 \times l^2$ by moving the nonzero columns together through a permutation matrix $P$. The permuted $D$ is a diagonal block Toeplitz matrix.

Definition A.3. Define the permutation matrix $P = (p_i)$ by the row vectors $p_i = e_{nli_1 + ni_2 + i_3}$, where, for $i = 1, \dots, n^2$, $i_1 = \lfloor i/nl \rfloor$, $\bar{i} = (i-1) \bmod nl$, $i_2 = \bar{i} \bmod n$, and $i_3 = \lceil (\bar{i}+1)/n \rceil$. Here $e_j \in \mathbb{R}^{1 \times n^2}$ is the identity vector with the entry 1 at position $j$ and zeros at the other positions.

While it is not used until later, we include the definition of the permutation matrix $Q$ here for convenience.

Definition A.4. Define the permutation matrix $\bar{Q}$ such that column $i$ is the unit vector $e_{\lfloor (i-1)/(n^2/l^2) \rfloor + l^2((i-1) \bmod n^2/l^2) + 1}$. The permutation matrix $Q$ is defined by the product $Q = P\bar{Q}$.

Q = PQ.Proposition A.5. Under the permutation matrix P defined above, matrix D is

a block diagonal matrix

PTDP =

v

v...

v

n2/l2×n2

, (A.4)

12

Page 13: MATRIX STRUCTURES AND PARALLEL ALGORITHMS FORusers.wfu.edu/plemmons/papers/siam_maa3.pdf · xi 0, QTAP is an upper bidiagonal block Toeplitz matrix. 2. If all xi 0, QTAP is a lower

and v = (1/l2, 1/l2, . . . , 1/l2) ∈ R1×l2 .Proof. The proof is by comparison between entries in (2.5) and the corresponding

entry in (2.6).Proposition A.6. Matrix (DP )TDP = PTDTDP is also a block diagonal

matrix, while DP (DP )T is a diagonal matrix given by

(DP )TDP =

D2

D2

...

D2

n2×n2

, (A.5)

where D2 = vT v ∈ Rl2×l2 is a constant matrix (1/l2)l2×l2 .Proof. Note that PTDTDP = PTDTPPTDP = (PTDP )TPTDP. Products

involving non-zero elements only occur on the block diagonal and thus vT v is thediagonal block of (DP )TDP .

A.2. Shift matrix. The support of the shift matrix $S_i$ in (1.1) depends on the 2D translational shifts. To avoid a notation conflict in the use of the index $i$, throughout this section we use $S$ to represent the $S_i$ in (1.1) for any $i = 1, \dots, l^2$, except in the proofs of Theorem 2.1, Corollary 2.2 and Theorem 2.4.

Definition A.7. We define a shift operation, $S \in \mathbb{R}^{n^2 \times n^2}$, on a vectorized image, $f \in \mathbb{R}^{n^2 \times 1}$, as $\bar{f} = Sf$, such that the entries of $S$ are determined by a 2D translational shift $(\delta_x, \delta_y)$, and the relationship between $\bar{f}_{ij}$ and $f_{ij}$ is given by

\[
\bar{f}_{ij} = w_{11} f_{i+\lfloor\delta_y\rfloor,\, j+\lfloor\delta_x\rfloor} + w_{12} f_{i+\lfloor\delta_y\rfloor+1,\, j+\lfloor\delta_x\rfloor} + w_{21} f_{i+\lfloor\delta_y\rfloor,\, j+\lfloor\delta_x\rfloor+1} + w_{22} f_{i+\lfloor\delta_y\rfloor+1,\, j+\lfloor\delta_x\rfloor+1}, \tag{A.6}
\]

where $w_{11}$ to $w_{22}$ are the bilinear interpolation weights $(1-\bar{\delta}_x)(1-\bar{\delta}_y)$, $(1-\bar{\delta}_y)\bar{\delta}_x$, $(1-\bar{\delta}_x)\bar{\delta}_y$ and $\bar{\delta}_x\bar{\delta}_y$, respectively, with fractional parts $\bar{\delta}_x = \delta_x - \lfloor\delta_x\rfloor$ and $\bar{\delta}_y = \delta_y - \lfloor\delta_y\rfloor$.

Figure A.1 shows the grid of a $10 \times 10$ original image (in circles) and the shifted image (in squares), where $(\delta_x, \delta_y) = (1.5, -2.5)$. Here we use the image intensity values at the circles to linearly interpolate the intensity values at the square points.
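A minimal sketch (ours) of the shift operation of Definition A.7, pairing each bilinear weight with its sample in the standard way and wrapping out-of-range samples periodically (one possible boundary choice; the paper treats boundaries separately):

```python
# Rigid translation by (dx, dy) via bilinear interpolation, eq. (A.6).
import numpy as np

def shift_image(f: np.ndarray, dx: float, dy: float) -> np.ndarray:
    fy, fx = int(np.floor(dy)), int(np.floor(dx))   # integer parts
    ry, rx = dy - fy, dx - fx                       # fractional parts
    def tr(ay, ax):                                 # whole-pixel shift
        return np.roll(f, shift=(-ay, -ax), axis=(0, 1))
    return ((1 - rx) * (1 - ry) * tr(fy, fx)
            + (1 - rx) * ry * tr(fy + 1, fx)
            + rx * (1 - ry) * tr(fy, fx + 1)
            + rx * ry * tr(fy + 1, fx + 1))
```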

It is clear that the non-zero entries of $S$ can only be one of $w_{11}$ to $w_{22}$, and that they possess a regular pattern. In fact, using the column-wise ordering in the vectorization of $f$ and $\bar{f}$, we can pinpoint the four nonzero entries on the $(i + (j-1)n)$th row as

\[
i+\lfloor\delta_y\rfloor+(j+\lfloor\delta_x\rfloor-1)n, \quad i+\lfloor\delta_y\rfloor+1+(j+\lfloor\delta_x\rfloor-1)n, \quad i+\lfloor\delta_y\rfloor+(j+\lfloor\delta_x\rfloor)n, \quad i+\lfloor\delta_y\rfloor+1+(j+\lfloor\delta_x\rfloor)n. \tag{A.7}
\]

This corresponds to the matrix structure specified in the following proposition.

Proposition A.8. Given a 2D rigid translational shift $(\delta_x, \delta_y)$, where $\delta_x \in (-l, l)$ and $\delta_y \in (-l, l)$, the shift matrix $S$ defined in Definition A.7 has a block Toeplitz form that can be determined in the following way. If $\delta_x > 0$,

\[
S =
\begin{pmatrix}
0 & \dots & 0 & S_1 & S_2 & & \\
0 & \dots & 0 & & S_1 & S_2 & \\
& & & & & \ddots & \ddots \\
& & & & & S_1 & S_2 \\
0 & & \dots & & & & 0 \\
\vdots & & & & & & \vdots \\
0 & & \dots & & & & 0
\end{pmatrix}_{n^2 \times n^2}, \tag{A.8}
\]

where $S_1, S_2 \in \mathbb{R}^{n \times n}$, the number of columns of zero blocks on the left is $\lfloor\delta_x\rfloor$, and the number of rows of zero blocks at the bottom is $\lfloor\delta_x\rfloor + 1$. If $\delta_x < 0$,

\[
S =
\begin{pmatrix}
0 & \dots & & & & & 0 \\
\vdots & & & & & & \vdots \\
0 & \dots & & & & & 0 \\
S_1 & S_2 & & & 0 & \dots & 0 \\
& S_1 & S_2 & & 0 & \dots & 0 \\
& & \ddots & \ddots & & & \\
& & & S_1 & S_2 & 0 \dots & 0
\end{pmatrix}_{n^2 \times n^2}, \tag{A.9}
\]

where the number of columns of zero blocks on the right is $|\lfloor\delta_x\rfloor|$ and the number of rows of zero blocks at the top is $|\lfloor\delta_x\rfloor| + 1$.

If $\delta_y > 0$,

\[
S_i =
\begin{pmatrix}
0 & \dots & 0 & w_{i1} & w_{i2} & & \\
0 & \dots & 0 & & w_{i1} & w_{i2} & \\
& & & & & \ddots & \ddots \\
& & & & & w_{i1} & w_{i2} \\
0 & & \dots & & & & 0 \\
\vdots & & & & & & \vdots \\
0 & & \dots & & & & 0
\end{pmatrix}_{n \times n}, \qquad i = 1, 2, \tag{A.10}
\]

where the number of columns of zeros on the left is $\lfloor\delta_y\rfloor$ and the number of rows of zeros at the bottom is $\lfloor\delta_y\rfloor + 1$. If $\delta_y < 0$,

\[
S_i =
\begin{pmatrix}
0 & \dots & & & & & 0 \\
\vdots & & & & & & \vdots \\
0 & \dots & & & & & 0 \\
w_{i1} & w_{i2} & & & 0 & \dots & 0 \\
& w_{i1} & w_{i2} & & 0 & \dots & 0 \\
& & \ddots & \ddots & & & \\
& & & w_{i1} & w_{i2} & 0 \dots & 0
\end{pmatrix}_{n \times n}, \qquad i = 1, 2, \tag{A.11}
\]

where the number of columns of zeros on the right is $|\lfloor\delta_y\rfloor|$ and the number of rows of zeros at the top is $|\lfloor\delta_y\rfloor| + 1$.

Fig. A.1. An illustration of the shift matrix $S_i$.


Proof. The structure described by (A.8) to (A.11) is simply the matrix representation of (A.7). The details can be verified by the interested reader.

Next we permute $S$ to gain a better structure.

Proposition A.9. The permuted matrix $P^TSP$ has a two-level bidiagonal block Toeplitz structure. If $\delta_x > 0$,

\[
P^TSP =
\begin{pmatrix}
\bar{S}_1 & \bar{S}_2 & & \\
& \bar{S}_1 & \bar{S}_2 & \\
& & \ddots & \ddots \\
& & & \bar{S}_1
\end{pmatrix}_{n^2 \times n^2}, \tag{A.12}
\]

and if $\delta_x < 0$,

\[
P^TSP =
\begin{pmatrix}
\bar{S}_1 & & & \\
\bar{S}_2 & \bar{S}_1 & & \\
& \ddots & \ddots & \\
& & \bar{S}_2 & \bar{S}_1
\end{pmatrix}_{n^2 \times n^2}, \tag{A.13}
\]

where $\bar{S}_i \in \mathbb{R}^{nl \times nl}$. If $\delta_y > 0$,

\[
\bar{S}_i =
\begin{pmatrix}
\bar{S}_{i1} & \bar{S}_{i2} & & \\
& \bar{S}_{i1} & \bar{S}_{i2} & \\
& & \ddots & \ddots \\
& & & \bar{S}_{i1}
\end{pmatrix}_{nl \times nl}, \tag{A.14}
\]

and if $\delta_y < 0$,

\[
\bar{S}_i =
\begin{pmatrix}
\bar{S}_{i1} & & & \\
\bar{S}_{i2} & \bar{S}_{i1} & & \\
& \ddots & \ddots & \\
& & \bar{S}_{i2} & \bar{S}_{i1}
\end{pmatrix}_{nl \times nl}, \tag{A.15}
\]

where $\bar{S}_{ij} \in \mathbb{R}^{l^2 \times l^2}$.

Proof. We again rely on (A.7) to identify the nonzero entries after applying both row and column permutations using $P$.

After the vectorization using the new indexing method, the entry $(i, j)$ of $f$ will be at the position

\[ p_{ij} = (i-1)l + \bar{j}nl + \tilde{j}, \tag{A.16} \]

where $\bar{j} = \lfloor (j-1)/l \rfloor$ and $\tilde{j} = (j-1) \bmod l + 1$.

The following inequalities define a two-level bidiagonal structure and make use of the restriction that $\delta_x, \delta_y \in [-l+1, l-1]$. The diagonal block of size $l^2 \times l^2$ is given by

\[
\begin{aligned}
p_{ij} - l^2 &\leq p_{i+\lfloor\delta_y\rfloor,\, j+\lfloor\delta_x\rfloor} \leq p_{ij} + l^2, & \text{(A.17)}\\
p_{ij} - l^2 &\leq p_{i+\lfloor\delta_y\rfloor+1,\, j+\lfloor\delta_x\rfloor} \leq p_{ij} + l^2, & \text{(A.18)}\\
p_{ij} - l^2 &\leq p_{i+\lfloor\delta_y\rfloor,\, j+\lfloor\delta_x\rfloor+1} \leq p_{ij} + l^2, & \text{(A.19)}\\
p_{ij} - l^2 &\leq p_{i+\lfloor\delta_y\rfloor+1,\, j+\lfloor\delta_x\rfloor+1} \leq p_{ij} + l^2. & \text{(A.20)}
\end{aligned}
\]

The upper block diagonal when $\delta_x > 0$ is given by

\[
\begin{aligned}
p_{ij} + nl - l^2 &\leq p_{i+\lfloor\delta_y\rfloor,\, j+\lfloor\delta_x\rfloor+1} \leq p_{ij} + nl + l^2, & \text{(A.21)}\\
p_{ij} + nl - l^2 &\leq p_{i+\lfloor\delta_y\rfloor+1,\, j+\lfloor\delta_x\rfloor+1} \leq p_{ij} + nl + l^2, & \text{(A.22)}
\end{aligned}
\]

while the lower block diagonal when $\delta_x < 0$ is given by

\[
\begin{aligned}
p_{ij} - nl - l^2 &\leq p_{i+\lfloor\delta_y\rfloor,\, j+\lfloor\delta_x\rfloor+1} \leq p_{ij} - nl + l^2, & \text{(A.23)}\\
p_{ij} - nl - l^2 &\leq p_{i+\lfloor\delta_y\rfloor+1,\, j+\lfloor\delta_x\rfloor+1} \leq p_{ij} - nl + l^2. & \text{(A.24)}
\end{aligned}
\]

To verify the Toeplitz structure, we only need to prove that the permuted $\bar{S} = P^TSP$ satisfies

\[ \bar{s}_{IJ} = \bar{s}_{I+l^2,\, J+l^2}. \tag{A.25} \]

It is not difficult to verify that

\[
p_{ij} + l^2 =
\begin{cases}
p_{i+l,\, j} & \text{if } i \leq n-l, \\
p_{i+l-n,\, j+l} & \text{if } i > n-l,
\end{cases} \tag{A.26}
\]

which is the definition of an $l^2 \times l^2$ block Toeplitz structure. As an example, we note that

\[ p_{i+l+\lfloor\delta_y\rfloor,\, j+\lfloor\delta_x\rfloor} = p_{i+\lfloor\delta_y\rfloor,\, j+\lfloor\delta_x\rfloor} + l^2, \]

and

\[ p_{i+l-n+\lfloor\delta_y\rfloor,\, j+l+\lfloor\delta_x\rfloor} = p_{i+\lfloor\delta_y\rfloor,\, j+\lfloor\delta_x\rfloor} + l^2. \]

The three remaining nonzero entries on row $i$ can be verified in a similar manner.

Proposition A.10. The matrix $P^T(DS)P$ has a two-level bidiagonal block Toeplitz structure similar to that in (A.12) and (A.13), with second level blocks consisting of the $1 \times l^2$ row vectors $v\bar{S}_{ij}$ instead of $\bar{S}_{ij}$, for $i, j = 1, 2$.

Proof. Note that $P^T(DS)P = P^TDP(P^TSP)$. By Proposition A.5, $P^TDP$ has a block diagonal structure with each block of size $1 \times l^2$. By Proposition A.9, $P^TSP$ has a two-level bidiagonal block Toeplitz structure with $l^2 \times l^2$ blocks. It follows that the product $P^TDSP$ is two-level bidiagonal block Toeplitz with $1 \times l^2$ blocks.

A.2.1. Proof of Theorem 2.1. Each $P^TDS_iP$ is a block bidiagonal matrix with $1 \times l^2$ blocks, but it is important to note that not all of these matrices have the same non-zero diagonals: the particular non-zero diagonal depends on the signs of $\delta_x$ and $\delta_y$. However, if we stack them to form $A = (P^TDS_1P; \dots; P^TDS_{l^2}P)$ and then premultiply by $\bar{Q}$ to form the shuffle of each $1 \times l^2$ block, we form a tridiagonal block Toeplitz matrix with $l^2 \times l^2$ second level blocks. This completes the proof of Theorem 2.1.

A.2.2. Proof of Corollary 2.2. The proof of this corollary is a natural extension of the two-level bidiagonal structure of $P^TS_iP$, proven in Proposition A.9. If all $\delta_x$ or all $\delta_y$ have the same sign, then all $P^TDS_iP$ are bidiagonal with the same non-zero diagonals. It follows that one of the three diagonals in the tridiagonal block Toeplitz matrix obtained in Section A.2.1 is identically zero.

A.2.3. Proof of Lemma 2.3. Note that

\[ P^T(DS)^TDSP = P^TS^TP(P^TD^TDP)P^TSP, \]

and $P^TD^TDP = (DP)^TDP$ has a block diagonal structure with each block having a size of $l^2 \times l^2$ (see Proposition A.6). By Proposition A.9, $P^TSP$ has a two-level bidiagonal block Toeplitz structure, so $P^TS^TP$ also has a two-level bidiagonal block Toeplitz structure, except that the upper diagonal blocks are transposed to the lower diagonal and vice versa. Hence the product of these three matrices has a two-level tridiagonal block Toeplitz structure.

A.2.4. Proof of Theorem 2.4. Note that $A^TA = \sum_{i=1}^{l^2}(DS_i)^TDS_i$ and thus $P^TA^TAP = \sum_{i=1}^{l^2} P^T(DS_i)^TDS_iP$.

A.3. Blurring matrix. The blurring matrix we consider here is the regular block Toeplitz with Toeplitz block blurring matrix generated by a spatially invariant $n \times n$ PSF matrix and zero boundary conditions. Furthermore, we assume the blur is radially symmetric with a small diameter. A large diameter corresponds to more off-diagonal entries in each Toeplitz block of $H$, and thus a more complex structure. Here we impose a limit on the diameter to gain a simpler structure of $H$ while still accepting a large enough PSF for real applications. In particular, we have the following proposition. Again, to avoid the notation conflict in the use of the index $i$, throughout this section we use $H$ to represent the $H_i$ in (1.1) for any $i = 1, \dots, l^2$, except in the proofs of Theorems 2.5 and 2.6.

Proposition A.11. If the diameter of a spatially invariant PSF is not greater than $2l+1$, then $P^THP$ has a two-level tridiagonal block Toeplitz structure.

Proof. We can write the blurring operation in matrix form as

\[ \bar{f} = Hf, \]

where $f$ and $\bar{f}$ are two vectorized images. Written explicitly, the entries of $\bar{f}$ are given by

\[ \bar{f}_{ij} = \sum_{\bar{i}=-l}^{l} \sum_{\bar{j}=-l}^{l} h_{\bar{i}\bar{j}}\, f_{i-\bar{i},\, j-\bar{j}}. \tag{A.27} \]

Note that we have applied the diameter limit of $2l+1$ to the two summation indices, $\bar{i}$ and $\bar{j}$. The $(2l+1)^2$ nonzero entries on each row, except the first and last several rows, lie at entries $i - \bar{i} + (j - \bar{j})n$, which represents a block Toeplitz with Toeplitz block structure.

After permutation, the matrix representation is

\[ P^T\bar{f} = P^THP(P^Tf), \tag{A.28} \]

and the nonzero entries of $P^THP$ are at $p_{i-\bar{i},\, j-\bar{j}}$, where $p$ is defined in (A.16). The proof of the two-level tridiagonal structure is equivalent to proving inequalities similar to those in Proposition A.9. For the diagonal block, we need

\[ p_{ij} - l^2 \leq p_{i-\bar{i},\, j-\bar{j}} \leq p_{ij} + l^2. \tag{A.29} \]

For the block off diagonal to the right, we need

\[ p_{ij} + nl - l^2 \leq p_{i-\bar{i},\, j-\bar{j}} \leq p_{ij} + nl + l^2. \tag{A.30} \]

For the block off diagonal to the left, we need

\[ p_{ij} - nl - l^2 \leq p_{i-\bar{i},\, j-\bar{j}} \leq p_{ij} - nl + l^2. \tag{A.31} \]

The inequalities follow from the domains of $\bar{i}$ and $\bar{j}$. To verify the Toeplitz structure, we demonstrate that the permuted $\bar{H} = P^THP$ satisfies

\[ \bar{h}_{IJ} = \bar{h}_{I+l^2,\, J+l^2}. \]

Note that $I$ and $J$ are the row and column indices of the matrix $\bar{H}$, while $i$ and $j$ are the indices of the original image $f$, and their relationships are $I = p_{ij}$ and $J = p_{i-\bar{i},\, j-\bar{j}}$. Now it becomes clear that

\[ p_{i-\bar{i},\, j-\bar{j}} + l^2 = (i - \bar{i} + l - 1)l + \overline{(j-\bar{j})}\, nl + \widetilde{(j-\bar{j})}, \]

where $\overline{(j-\bar{j})} = \lfloor (j - \bar{j} - 1)/l \rfloor$ and $\widetilde{(j-\bar{j})} = (j - \bar{j} - 1) \bmod l + 1$. Thus $\bar{i}$ and $\bar{j}$ remain unchanged when $p_{i-\bar{i},\, j-\bar{j}}$ increases by $l^2$, which in turn means $\bar{h}_{I+l^2,\, J+l^2} = h_{\bar{i},\bar{j}} = \bar{h}_{I,J}$.

The next proposition, about the structure of $P^THSP$, gives the key piece of the proofs of Theorems 2.5 and 2.6.

Proposition A.12. With $H$ defined as in Proposition A.11 and $S$ defined as in Definition A.7, $P^THSP$ also has a two-level tridiagonal block Toeplitz structure.

Proof. Once again, the proof utilizes the definitions of $H$ and $S$ in subscript form. We have that $\bar{f} = HSf$ is equivalent to

\[
\begin{aligned}
\bar{f}_{ij} &= \sum_{\bar{i}=-l}^{l} \sum_{\bar{j}=-l}^{l} h_{\bar{i},\bar{j}}\, \tilde{f}_{i-\bar{i},\, j-\bar{j}} \\
&= \sum_{\bar{i}=-l}^{l} \sum_{\bar{j}=-l}^{l} h_{\bar{i},\bar{j}} \big( w_{11} f_{i-\bar{i}+\lfloor\delta_y\rfloor,\, j-\bar{j}+\lfloor\delta_x\rfloor} + w_{12} f_{i-\bar{i}+\lfloor\delta_y\rfloor+1,\, j-\bar{j}+\lfloor\delta_x\rfloor} \\
&\qquad\qquad + w_{21} f_{i-\bar{i}+\lfloor\delta_y\rfloor,\, j-\bar{j}+\lfloor\delta_x\rfloor+1} + w_{22} f_{i-\bar{i}+\lfloor\delta_y\rfloor+1,\, j-\bar{j}+\lfloor\delta_x\rfloor+1} \big),
\end{aligned} \tag{A.32}
\]

where $\tilde{f} = Sf$. As an example, one of the nonzero entries on the row $p_{ij}$ is given by $p_{i-\bar{i}+\lfloor\delta_y\rfloor,\, j-\bar{j}+\lfloor\delta_x\rfloor}$. By the definition of $p_{ij}$, it is easy to see that the positions of the nonzero entries are effectively shifted by a fixed amount determined by $\delta_x$ on the first level and by $\delta_y$ on the second level. Due to the limitation $|\delta_{(x,y)}| \leq l$, we have a shifted two-level tri-diagonal structure.

This proposition comes as somewhat of a surprise, because we would normally expect that a two-level tri-diagonal block Toeplitz structure ($P^THP$) multiplied by a two-level bi-diagonal block Toeplitz structure ($P^TSP$) would result in a two-level quadra-diagonal block Toeplitz structure. However, in our case it remains tri-diagonal.


A.3.1. Proof of Theorem 2.5. The proof of this theorem is similar to that of Theorem 2.1, because each block of $P^TDH_iS_iP$ is also of size $1 \times l^2$. Variations in the particular set of offsets $\delta_x$ and $\delta_y$ corresponding to $S_i$ imply that some $P^TDH_iS_iP$ have a tri-diagonal structure shifted to the right while others are shifted to the left. Their concatenation into $Q^TAP$ has a two-level penta-diagonal block Toeplitz structure.

A.3.2. Proof of Theorem 2.6. Theorem 2.6 follows from Proposition A.12. Note that

\[ P^T(DHS)^TDHSP = (P^T(HS)^TP)(P^TD^TDP)(P^THSP). \]

Since $P^TD^TDP$ is a block diagonal matrix, $(P^T(HS)^TP)(P^TD^TDP)(P^T(HS)P)$ has the same structure as $P^T(HS)^TPP^T(HS)P$. It follows that the product of two two-level tridiagonal block Toeplitz matrices is a two-level penta-diagonal block Toeplitz matrix. Since $A^TA = \sum_{i=1}^{l^2}(DH_iS_i)^TDH_iS_i$ and $P^TA^TAP = \sum_{i=1}^{l^2} P^T(DH_iS_i)^TDH_iS_iP$, the matrix $A^TA$ has the same two-level structure.

A.3.3. Proof of Theorem 2.7. Note that $S_i$ is sparse with non-zero diagonals whose weights correspond to a bilinear interpolation defined by $\delta_x^{(i)}, \delta_y^{(i)}$. This allows us to restrict our attention to the $nl \times n^2$ submatrix

\[ [A_{-1}\; A_0\; A_1\; 0 \dots 0] \tag{A.33} \]

and the $l^2 \times nl$ submatrix

\[ [A_{-1}^{(i)}\; A_0^{(i)}\; A_1^{(i)}\; 0 \dots 0], \tag{A.34} \]

and to use the first and second order Toeplitz structure of $A$.

Furthermore, we have the following lemma, whose proof follows by direct calculation.

Lemma A.13. The following structural descriptions hold:
1. Only the entries $s_{i,j}$ of $S$ with $nl < i \leq 2nl$ and $l < (j \bmod n) \leq 2l$ contribute to $A_j^{(i)}$ under the product $Q^TDSP$.
2. Row $k$ of $A_j^{(i)}$ contains information from only $DS_k$.

Proof. The proof involves a partition of the rows of the matrices $DS_i$ so that their permuted form can be investigated. We first label the rows of $S_i$ so that row $\alpha \equiv k \bmod n$ has label $c_k$. The rows are then labeled

\[ [c_1 c_2 \dots c_n c_1 \dots c_n \dots c_1 \dots c_n]^T \]

with $n$ repetitions of $c_1, \dots, c_n$. Under the permutation $P^TS_iP$ the labels are reordered as

\[ [c_1 \dots c_1 c_2 \dots c_2 \dots c_n \dots c_n]^T \tag{A.35} \]

such that the $l$ rows labeled $c_k$ are clustered together. The label pattern repeats after $nl$ rows because the permutation $P$ is closed on sets of indices of length $nl$. The operation $DP$ averages $l^2$ consecutive rows into one row of $DS_iP$. Finally, the permutation $\bar{Q}$ shuffles the rows of $DS_iP$ into rows $j \equiv i \bmod l^2$. Excluding the first repetition of row labels in (A.35), we have that the submatrix $[A_{-1}\; A_0\; A_1\; 0 \dots 0]$ is comprised of one set of $l^2$ rows of $DS_iP$ for each $i \leq l^2$, proving part (2).

Part (1) of the lemma follows from the preceding analysis and Definition A.7.


Using Lemma A.13, it is possible to completely characterize $A_j^{(i)}$ from only $3l^2$ elements of each $S_k$ for $1 \leq k \leq l^2$.

Next, we examine the effect of multiplying $P^TSP$ by $DP$. Note that $DP$ averages columns in row blocks of size $l^2$. Thus, the proof reduces to an investigation of the support of rows $n + l^2 + 1, \dots, n + 2l^2$ of each $P^TS_iP$ (the first few blocks form $B$ and the analysis follows similarly). Partition these rows into $l^2 \times l^2$ blocks $H_i$, which further divide into $l \times l$ blocks $J_{\alpha,\beta}$. Within each $H_i$, the permutation $P^TS_iP$ shuffles the $J_{\alpha,\beta}$ such that the first $l \times l$ block of the permuted block $H_i$ contains the $(1,1)$ element from each $J_{(\alpha,\beta)}$. The first block has the format

\[
\begin{pmatrix}
J_{(1,1),1,1} & J_{(1,2),1,1} & \dots & J_{(1,l),1,1} \\
J_{(2,1),1,1} & & \dots & J_{(2,l),1,1} \\
\vdots & & & \vdots \\
J_{(l,1),1,1} & J_{(l,2),1,1} & \dots & J_{(l,l),1,1}
\end{pmatrix}, \tag{A.36}
\]

where the left indices identify a block $J$ and the right indices identify an element in the block.

Part (1) of Theorem 2.7 follows directly. Part (2) follows because the sum of the 4 weights in each bilinear interpolation is 1.

A.4. Proof of Theorem 3.1. The matrices $A^TA + \alpha I$ and $A_0$ are positive definite, so we can refer to Theorem 10.1.2 in [8]. However, Theorem 10.1.2 concerns the Gauss-Seidel iteration, not the block Gauss-Seidel iteration introduced here. Most of the proof extends naturally, but we clarify one less obvious point. Using the notation in [8], we define $G = -(D+L)^{-1}L^T$, where $D = \mathrm{diag}(A_0, A_0, \dots, A_0)$ and $L$ is a strictly lower triangular matrix. We need to prove that

\[ G_1 \equiv D^{1/2}GD^{-1/2} = -(I + L_1)^{-1}L_1^T, \tag{A.37} \]

where $L_1 = D^{-1/2}LD^{-1/2}$, or equivalently,

\[ D^{1/2}(D+L)^{-1}D^{1/2} = (I + L_1)^{-1}. \tag{A.38} \]

When $D$ is only a diagonal matrix, it is easy to verify (A.38), but in this case $D$ is block diagonal. This proves not to be a problem. Notice that $P^TA^TAP + \alpha I$ has a $2 \times 2$ block form, and thus we can explicitly write its inverse:

\[
(D+L)^{-1} =
\begin{pmatrix}
D_0 & 0 \\
L & D_0
\end{pmatrix}^{-1}
=
\begin{pmatrix}
D_0^{-1} & 0 \\
-D_0^{-1}LD_0^{-1} & D_0^{-1}
\end{pmatrix}, \tag{A.39}
\]

where $D_0$ is the upper left (equivalently, the lower right) block and $L$ is the lower left block of $P^TA^TAP + \alpha I$ in (3.1). Then we multiply by $D^{1/2}$ on both sides to get

\[
D^{1/2}(D+L)^{-1}D^{1/2} = D^{1/2}
\begin{pmatrix}
D_0^{-1} & 0 \\
-D_0^{-1}LD_0^{-1} & D_0^{-1}
\end{pmatrix}
D^{1/2}
=
\begin{pmatrix}
I & 0 \\
-D_0^{-1/2}LD_0^{-1/2} & I
\end{pmatrix}. \tag{A.40}
\]

It is easy to verify that the right side of the equation above is $(I + L_1)^{-1}$.


A.5. Proof of Theorem 3.2. Again, $A^TA + \alpha I$ and $\tilde{A}_0$ are positive definite, and we can refer to Theorem 10.1.2 in [8]. We need to verify that

\[ G_1 \equiv D^{1/2}GD^{-1/2} = -(I + D^{-1/2}LD^{-1/2})^{-1}(D^{-1/2}LD^{-1/2})^T, \tag{A.41} \]

or equivalently,

\[ D^{1/2}(D+L)^{-1}D^{1/2} = (I + D^{-1/2}LD^{-1/2})^{-1}. \tag{A.42} \]

Notice that $A^TA + \alpha I$ has a $3 \times 3$ block form, and we can explicitly write out its inverse:

\[
(D+L)^{-1} =
\begin{pmatrix}
\tilde{A}_0 & 0 & 0 \\
\tilde{A}_{-1} & \tilde{A}_0 & 0 \\
\tilde{A}_{-2} & \tilde{A}_{-1} & \tilde{A}_0
\end{pmatrix}^{-1}
=
\begin{pmatrix}
\tilde{A}_0^{-1} & 0 & 0 \\
-\tilde{A}_0^{-1}\tilde{A}_{-1}\tilde{A}_0^{-1} & \tilde{A}_0^{-1} & 0 \\
\tilde{A}_0^{-1}(\tilde{A}_{-1}\tilde{A}_0^{-1}\tilde{A}_{-1} - \tilde{A}_{-2})\tilde{A}_0^{-1} & -\tilde{A}_0^{-1}\tilde{A}_{-1}\tilde{A}_0^{-1} & \tilde{A}_0^{-1}
\end{pmatrix},
\]

where $\tilde{A}_0$ is the diagonal block, which is positive definite. Hence,

\[
D^{1/2}(D+L)^{-1}D^{1/2} =
\begin{pmatrix}
I & 0 & 0 \\
-\tilde{A}_0^{-1/2}\tilde{A}_{-1}\tilde{A}_0^{-1/2} & I & 0 \\
\tilde{A}_0^{-1/2}(\tilde{A}_{-1}\tilde{A}_0^{-1}\tilde{A}_{-1} - \tilde{A}_{-2})\tilde{A}_0^{-1/2} & -\tilde{A}_0^{-1/2}\tilde{A}_{-1}\tilde{A}_0^{-1/2} & I
\end{pmatrix}. \tag{A.43}
\]

It is easy to verify that the right side of (A.43) is $(I + D^{-1/2}LD^{-1/2})^{-1}$.
