
Families of Algorithms for Reducing a Matrix to Condensed Form

G. JOSEPH ELIZONDO

The University of Texas at Austin

and

GREGORIO QUINTANA-ORTÍ

Universidad Jaume I

and

ROBERT A. VAN DE GEIJN

The University of Texas at Austin

and

FIELD G. VAN ZEE

The University of Texas at Austin

In a recent paper it was shown how memory traffic can be diminished by reformulating the classic algorithm for reducing a matrix to bidiagonal form, a preprocess when computing the singular values of a dense matrix. Key is a reordering of the computation so that the most compute and memory intensive operations can be "fused". In this paper, we show that other operations that reduce matrices to condensed form (reduction to upper Hessenberg and tridiagonal form) can be similarly reorganized, yielding different sets of operations that can be fused. By developing the algorithms with a common framework and notation, we facilitate the comparing and contrasting of the different algorithms and opportunities for optimization. In the end, we argue that there is only very limited benefit from the fusing of operations, since other algorithms that do not require it are almost as efficient or more efficient. We discuss the algorithms and showcase the performance improvements that result.

Categories and Subject Descriptors: G.4 [Mathematical Software]: —Efficiency

General Terms: Algorithms; Performance

Additional Key Words and Phrases: linear algebra, libraries, high-performance

Authors' addresses: G. Joseph Elizondo, Robert A. van de Geijn, Field G. Van Zee, Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, {elizondo,rvdg,field}@cs.utexas.edu. Gregorio Quintana-Ortí, Departamento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, Campus Riu Sec, 12.071, Castellón, Spain, [email protected].
Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 20YY ACM 0098-3500/20YY/1200-0001 $5.00
ACM Transactions on Mathematical Software, Vol. V, No. N, Month 20YY, Pages 1–30.


1. INTRODUCTION

For many dense linear algebra operations there exist algorithms that cast most computation in terms of matrix-matrix operations that overcome the memory bandwidth bottleneck in current processors [Dongarra et al. 1991; Dongarra et al. 1990; Anderson et al. 1999]. Reduction to condensed form operations are important exceptions. For these operations, reducing the number of times data must be brought in from memory is the key to optimizing performance, since inherently O(n³) reads and writes from memory are encountered as O(n³) floating point operations are performed on an n × n matrix.

The Basic Linear Algebra Subprograms (BLAS) [Lawson et al. 1979; Dongarra et al. 1988; Dongarra et al. 1990] provide an interface to commonly used computational kernels in terms of which linear algebra routines can be written. The idea is that if these kernels are optimized, then implementations of algorithms for computing more complex operations benefit in a portable fashion. As we will see, the problem is that the interface itself is limiting and can stand in the way of minimizing memory traffic. In response, as part of the BLAST Forum [BLA 2002], additional, more complex operations were suggested for inclusion in the BLAS. In [Howell et al. 2008], it was shown how one of the reduction to condensed form operations, reduction to bidiagonal form, benefits from this new functionality in the BLAS.

The present paper makes a number of new contributions to this area:

—We show how the techniques used to reduce memory traffic in the reduction to bidiagonal form algorithm can be modified to similarly reduce such traffic when computing a reduction to upper Hessenberg or tridiagonal form.

—We present these algorithms, as well as the reduction to bidiagonal form already reported in [Howell et al. 2008], using a notation developed as part of the FLAME project [Quintana et al. 2001; Gunnels et al. 2001; Bientinesi et al. 2005; van de Geijn and Quintana-Ortí 2008]. We believe our explanations are easier to follow by non-experts.

—We identify five sets of operations that can be fused and that can potentially reduce the cost due to memory traffic of the three algorithms for reduction to condensed form.

—We show how these five fused operations can be optimized by building upon high-performance implementations of the dot product and "axpy" operations, which are part of the level-1 BLAS.

—We report what we believe is the best performance attained on sequential architectures for reduction to upper Hessenberg and tridiagonal form.

Together, these contributions advance the state-of-the-art of the field.

This paper is structured as follows: In Section 2 we discuss the Householder transform, including some of its properties we will use later in the paper. Various algorithms for reducing a matrix to upper Hessenberg form are developed in Section 3, including a discussion on how to fuse key matrix-vector operations to reduce memory traffic. Section 4 briefly discusses reduction to tridiagonal form and how it is similar to its upper Hessenberg counterpart. The third operation, reduction to bidiagonal form, is discussed in Section 5. In Section 6 we introduce a complex Householder transform and give examples of how generalizing to the complex domain affects the various reduction algorithms. Performance is discussed in Section 7 and concluding remarks can be found in Section 8.

2. HOUSEHOLDER TRANSFORMATIONS

We start by reviewing a few basic properties of Householder transformations (reflectors).

2.1 Computing Householder vectors and transformations

Definition 1. Let $u \in \mathbb{R}^n$, $\tau \in \mathbb{R}$. Then $H = H(u) = I - \tau^{-1}uu^T$, where $\tau = \frac{1}{2}u^Tu$, is said to be a reflector or Householder transformation.

We observe:

—Let z be any vector that is perpendicular to u. Applying a Householder transform H(u) to z leaves the vector unchanged: H(u)z = z.

—Let any vector x be written as $x = z + (u^Tx)\,u$, where z is perpendicular to u and $(u^Tx)\,u$ is the component of x in the direction of u. Then $H(u)x = z - (u^Tx)\,u$.

This can be interpreted as follows: the space perpendicular to u acts as a "mirror": any vector in that space (along the mirror) is not reflected, while any other vector has the component that is orthogonal to the space (the component outside and orthogonal to the mirror) reversed in direction. Notice that a reflection preserves the length of the vector. Also, it is easy to verify that:

(1) $HH = I$ (reflecting the reflection of a vector results in the original vector);

(2) $H = H^T$, and so $H^TH = HH^T = I$ (a reflection is an orthogonal matrix and thus preserves the norm); and

(3) if $H_0, \cdots, H_{k-1}$ are Householder transformations and $Q = H_0H_1\cdots H_{k-1}$, then $Q^TQ = QQ^T = I$ (an accumulation of reflectors is an orthogonal matrix).

As part of the reduction to condensed form operations, given a vector x we will wish to find a Householder transformation, H(u), such that H(u)x equals a vector with zeroes below the first element: $H(u)x = \mp\|x\|_2 e_0$, where $e_0$ equals the first column of the identity matrix. It can be easily checked that choosing $u = x \pm \|x\|_2 e_0$ yields the desired H(u). Notice that any nonzero scaling of u has the same property, and the convention is to scale u so that the first element equals one. Let us define $[u, \tau, y] = \mathrm{Housev}(x)$ to be the function that returns u with first element equal to one, $\tau = \frac{1}{2}u^Tu$, and $y = H(u)x$.
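For concreteness, the following is a minimal NumPy sketch of Housev. It is our illustration, not the library implementation; it assumes x is a nonzero real vector and uses the sign convention $\alpha = -\mathrm{sign}(\chi_1)\|x\|_2$ introduced in Section 2.2.

```python
import numpy as np

def housev(x):
    # Sketch of Housev(x): return u (first element one), tau = u'u/2,
    # and y = H(u) x, which equals alpha * e0. Assumes x != 0.
    normx = np.linalg.norm(x)
    sign = 1.0 if x[0] >= 0 else -1.0
    alpha = -sign * normx      # choose the sign that avoids cancellation
    v = x.copy()
    v[0] -= alpha              # v = x - alpha e0
    u = v / v[0]               # scale so that the first element equals one
    tau = (u @ u) / 2
    y = np.zeros_like(x)
    y[0] = alpha               # H(u) x has zeroes below the first element
    return u, tau, y
```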

2.2 Computing Au from Ax

Later, we will see that, given a matrix A, we will need to form Au where u is computed by Housev(x), but we will do so by first computing Ax. Let
\[
x \rightarrow \begin{pmatrix} \chi_1 \\ x_2 \end{pmatrix}, \quad
v \rightarrow \begin{pmatrix} \nu_1 \\ v_2 \end{pmatrix}, \quad
u \rightarrow \begin{pmatrix} \upsilon_1 \\ u_2 \end{pmatrix},
\]
$v = x - \alpha e_0$ and $u = v/\nu_1$, with $\alpha = -\mathrm{sign}(\chi_1)\|x\|_2$. Then
\begin{align}
\|x\|_2 &= \left\| \begin{pmatrix} \chi_1 \\ \|x_2\|_2 \end{pmatrix} \right\|_2 \tag{1}\\
\|v\|_2 &= \left\| \begin{pmatrix} \chi_1 - \alpha \\ \|x_2\|_2 \end{pmatrix} \right\|_2 \tag{2}\\
\|u\|_2 &= \frac{\|v\|_2}{|\chi_1 - \alpha|} \tag{3}\\
\tau &= \frac{u^Tu}{2} = \frac{\|u\|_2^2}{2} = \frac{\|v\|_2^2}{2(\chi_1 - \alpha)^2} \tag{4}\\
w &= Ax \tag{5}\\
Au &= \frac{A(x - \alpha e_0)}{\chi_1 - \alpha} = \frac{w - \alpha A e_0}{\chi_1 - \alpha}. \tag{6}
\end{align}
We note that $Ae_0$ simply equals the first column of A. We will assume that the various results in (1–4) are computed by the function HousevS(x), where $[\chi_1 - \alpha, \tau, \alpha] = \mathrm{HousevS}(x)$. Here, HousevS stands for "Householder scalars". Then, given that $w = Ax$, the desired vector Au can be computed via (6).
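The following NumPy sketch illustrates HousevS and the recovery of Au from w = Ax via (6). The function names are hypothetical; the formulas are those of Eqs. (1)–(6).

```python
import numpy as np

def housev_s(x):
    # Sketch of HousevS ("Householder scalars"): return
    # [chi1 - alpha, tau, alpha] per Eqs. (1)-(4), never forming u.
    chi1 = x[0]
    norm_x2 = np.linalg.norm(x[1:])
    norm_x = np.hypot(chi1, norm_x2)             # Eq. (1)
    alpha = -np.copysign(norm_x, chi1)           # alpha = -sign(chi1) ||x||_2
    norm_v = np.hypot(chi1 - alpha, norm_x2)     # Eq. (2)
    tau = norm_v**2 / (2 * (chi1 - alpha)**2)    # Eq. (4)
    return chi1 - alpha, tau, alpha

def au_from_ax(A, w, chi1_minus_alpha, alpha):
    # Eq. (6): Au = (w - alpha * A e0) / (chi1 - alpha); A e0 is just
    # the first column of A, and w = A x was already computed.
    return (w - alpha * A[:, 0]) / chi1_minus_alpha
```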

2.3 Accumulating transformations

Consider the transformation formed by multiplying b Householder transformations $(I - \tau_j^{-1}u_ju_j^T)$, $j = \{0, \ldots, b-1\}$. In [Joffrain et al. 2006] it was shown that if $U = \begin{pmatrix} u_0 & u_1 & \cdots & u_{b-1} \end{pmatrix}$, then
\[
\left(I - \tau_0^{-1}u_0u_0^T\right)\left(I - \tau_1^{-1}u_1u_1^T\right)\cdots\left(I - \tau_{b-1}^{-1}u_{b-1}u_{b-1}^T\right) = I - UT^{-1}U^T.
\]
Here $T = \frac{1}{2}D + S$ where D and S equal the diagonal and strictly upper triangular parts of $U^TU\ (= S^T + D + S)$. Later we will use the fact that if
\[
U = \begin{pmatrix} U_0 & u_1 \end{pmatrix} \quad \text{and} \quad T = \begin{pmatrix} T_{00} & t_{01} \\ 0 & \tau_{11} \end{pmatrix}
\]
then
\[
t_{01} = U_0^Tu_1, \qquad \tau_{11} = \frac{u_1^Tu_1}{2}, \qquad \text{and} \qquad
\begin{pmatrix} T_{00} & t_{01} \\ 0 & \tau_{11} \end{pmatrix}^{-1}
= \begin{pmatrix} T_{00}^{-1} & -\tau_{11}^{-1}T_{00}^{-1}t_{01} \\ 0 & \tau_{11}^{-1} \end{pmatrix}.
\]
For further details, see [Joffrain et al. 2006].
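The factor T can be formed directly from U. A minimal NumPy sketch of this (our illustration, assuming each column of U has unit first element):

```python
import numpy as np

def form_T(U):
    # With U = [u_0 | ... | u_{b-1}], the accumulated transform is
    # I - U T^{-1} U^T, where T = D/2 + S and U^T U = S^T + D + S.
    UtU = U.T @ U
    D = np.diag(np.diag(UtU))   # diagonal part of U^T U
    S = np.triu(UtU, 1)         # strictly upper triangular part of U^T U
    return D / 2 + S
```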

3. REDUCTION TO UPPER HESSENBERG FORM

In the first step towards computing the Schur decomposition of a matrix A, the matrix is reduced to upper Hessenberg form: $A \rightarrow QBQ^T$, where B is an upper Hessenberg matrix (zeroes below the first subdiagonal) and Q is orthogonal.

3.1 Unblocked algorithm

The basic algorithm for reducing the matrix to upper Hessenberg form, overwriting the original matrix with the result, can be explained as follows.


Algorithm: [A] := HessRed_unb(b, A)

Partition A → (ATL ATR; ABL ABR), u → (uT; uB), y → (yT; yB), z → (zT; zB),
where ATL is 0 × 0 and uT, yT, and zT have 0 rows
while m(ATL) < m(A) do
  Repartition (ATL ATR; ABL ABR) → (A00 a01 A02; a10^T α11 a12^T; A20 a21 A22),
  (uT; uB) → (u01; υ11; u21), (yT; yB) → (y01; ψ11; y21), (zT; zB) → (z01; ζ11; z21),
  where α11, υ11, ψ11, ζ11 are scalars

  Basic unblocked:
    [u21, τ, a21] := Housev(a21)
    y21 := A22^T u21
    z21 := A22 u21
    β := u21^T z21 / 2
    y21 := (y21 − β u21/τ)/τ
    z21 := (z21 − β u21/τ)/τ
    a12^T := a12^T − a12^T u21 u21^T / τ
    A02 := A02 − A02 u21 u21^T / τ
    A22 := A22 − u21 y21^T − z21 u21^T

  Rearranged unblocked:
    α11 := α11 − υ11 ψ11 − ζ11 υ11          (⋆)
    a12^T := a12^T − υ11 y21^T − ζ11 u21^T  (⋆)
    a21 := a21 − u21 ψ11 − z21 υ11          (⋆)
    [x21, τ, a21] := Housev(a21)
    A22 := A22 − u21 y21^T − z21 u21^T      (⋆)
    v21 := A22^T x21
    w21 := A22 x21
    u21 := x21; y21 := v21; z21 := w21
    β := u21^T z21 / 2
    y21 := (y21 − β u21/τ)/τ
    z21 := (z21 − β u21/τ)/τ
    a12^T := a12^T − a12^T u21 u21^T / τ
    A02 := A02 − A02 u21 u21^T / τ

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 1. Unblocked algorithms for reduction to upper Hessenberg form. Left: basic algorithm. Right: rearranged algorithm so that operations can be fused. Operations marked with (⋆) are not executed during the first iteration.

—Partition $A \rightarrow \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}$.

—Let $[u_{21}, \tau] = \mathrm{Housev}(a_{21})$.

—Update
\[
\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix} :=
\begin{pmatrix} 1 & 0 \\ 0 & H \end{pmatrix}
\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & H \end{pmatrix}
= \begin{pmatrix} \alpha_{11} & a_{12}^T H \\ H a_{21} & H A_{22} H \end{pmatrix},
\]
where $H = H(u_{21})$.


—Continue this process with the updated A22.

Let us look at updates of A in more detail.

—The simplest update is that of a21. This vector is to be overwritten by Ha21, which is given to us by the Housev function.¹

—Next, we update $a_{12}^T$ by applying H from the right:
\[
a_{12}^T := a_{12}^T H = a_{12}^T(I - \tau^{-1}u_{21}u_{21}^T) = a_{12}^T - a_{12}^Tu_{21}u_{21}^T/\tau.
\]
If any matrix rows exist above $a_{12}^T$, we must update this partition as well. Let us call this partition $A_{02}$:
\[
A_{02} := A_{02}H = A_{02}(I - \tau^{-1}u_{21}u_{21}^T) = A_{02} - A_{02}u_{21}u_{21}^T/\tau.
\]

—Let us now focus on the update of $A_{22}$ to $HA_{22}H$:
\begin{align*}
A_{22} &:= HA_{22}H = (I - u_{21}u_{21}^T/\tau)\,A_{22}\,(I - u_{21}u_{21}^T/\tau)\\
&= A_{22} - u_{21}(\underbrace{A_{22}^Tu_{21}}_{v_{21}})^T/\tau - (\underbrace{A_{22}u_{21}}_{w_{21}})u_{21}^T/\tau + (u_{21}^T\underbrace{A_{22}u_{21}}_{w_{21}})u_{21}u_{21}^T/\tau^2\\
&= A_{22} - u_{21}v_{21}^T/\tau - w_{21}u_{21}^T/\tau + \underbrace{u_{21}^Tw_{21}}_{2\beta}\,u_{21}u_{21}^T/\tau^2\\
&= A_{22} - u_{21}(v_{21}^T - \beta u_{21}^T/\tau)/\tau - ((w_{21} - \beta u_{21}/\tau)/\tau)u_{21}^T\\
&= A_{22} - u_{21}\underbrace{((v_{21} - \beta u_{21}/\tau)/\tau)^T}_{y_{21}^T} - \underbrace{((w_{21} - \beta u_{21}/\tau)/\tau)}_{z_{21}}\,u_{21}^T\\
&= A_{22} - (u_{21}y_{21}^T + z_{21}u_{21}^T).
\end{align*}

This motivates the algorithm in Figure 1 (left).

The problem with the algorithm in Figure 1 (left) is that the computation of v21 and w21 (which are stored in y21 and z21, respectively) is formulated as two separate calls to the matrix-vector multiply BLAS routine (gemv), which requires matrix A22 to be read from memory twice. Similarly, the update A22 := A22 − (u21 y21^T + z21 u21^T) requires two calls to the general rank-1 update BLAS routine (ger), forcing A22 to be read from and written to memory two more times. Thus, as formulated, A22 is read four times and written twice for each iteration. In addition, A02 has to be read twice and written once: to form x01 := A02 u21 and to update A02 := A02 − x01 u21^T/τ. In Section 3.4 we will show that memory operations related to the computation with A02 can be reduced via a so-called blocked algorithm. What we will show next is that, by delaying the update A22 := A22 − (u21 y21^T + z21 u21^T) until the next iteration, we can reformulate the algorithm so that A22 needs only be read and written once per iteration.

¹In practice, the update of a21 is implicit: the implementation overwrites the first element of a21 with the first element of Ha21, while overwriting the remaining elements with the corresponding non-unit elements of the vector u21 associated with H.
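To make the memory-traffic count concrete, here is a NumPy sketch of the loop body of the basic algorithm (our illustration, not the library code); each commented call corresponds to one gemv or ger sweep over A22 or A02.

```python
import numpy as np

def hess_step_basic(A22, a12, A02, u21, tau):
    # Sketch of one iteration of Figure 1 (left), after Housev(a21).
    y21 = A22.T @ u21                      # gemv: first sweep over A22
    z21 = A22 @ u21                        # gemv: second sweep over A22
    beta = (u21 @ z21) / 2
    y21 = (y21 - beta * u21 / tau) / tau
    z21 = (z21 - beta * u21 / tau) / tau
    a12 -= (a12 @ u21) * u21 / tau         # a12^T := a12^T (I - u u^T / tau)
    A02 -= np.outer(A02 @ u21, u21) / tau  # A02 := A02 (I - u u^T / tau)
    A22 -= np.outer(u21, y21) + np.outer(z21, u21)  # ger x2: two more sweeps
    return A22, a12, A02
```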

Let us focus on the update $A_{22} := A_{22} - (u_{21}y_{21}^T + z_{21}u_{21}^T)$. Partition
\[
A_{22} \rightarrow \begin{pmatrix} \alpha_{11}^+ & a_{12}^{+T} \\ a_{21}^+ & A_{22}^+ \end{pmatrix}, \quad
u_{21} \rightarrow \begin{pmatrix} \upsilon_1^+ \\ u_{21}^+ \end{pmatrix}, \quad
y_{21} \rightarrow \begin{pmatrix} \psi_1^+ \\ y_{21}^+ \end{pmatrix}, \quad
z_{21} \rightarrow \begin{pmatrix} \zeta_1^+ \\ z_{21}^+ \end{pmatrix},
\]
where + indicates the partitioning in the next iteration. Then $A_{22} := A_{22} - (u_{21}y_{21}^T + z_{21}u_{21}^T)$ translates to
\[
\begin{pmatrix} \alpha_{11}^+ & a_{12}^{+T} \\ a_{21}^+ & A_{22}^+ \end{pmatrix} :=
\begin{pmatrix} \alpha_{11}^+ & a_{12}^{+T} \\ a_{21}^+ & A_{22}^+ \end{pmatrix} -
\left( \begin{pmatrix} \upsilon_1^+ \\ u_{21}^+ \end{pmatrix}\begin{pmatrix} \psi_1^+ \\ y_{21}^+ \end{pmatrix}^T +
\begin{pmatrix} \zeta_1^+ \\ z_{21}^+ \end{pmatrix}\begin{pmatrix} \upsilon_1^+ \\ u_{21}^+ \end{pmatrix}^T \right)
= \begin{pmatrix} \alpha_{11}^+ - (\upsilon_1^+\psi_1^+ + \zeta_1^+\upsilon_1^+) & a_{12}^{+T} - (\upsilon_1^+y_{21}^{+T} + \zeta_1^+u_{21}^{+T}) \\ a_{21}^+ - (u_{21}^+\psi_1^+ + z_{21}^+\upsilon_1^+) & A_{22}^+ - (u_{21}^+y_{21}^{+T} + z_{21}^+u_{21}^{+T}) \end{pmatrix},
\]
which shows what computation would need to be performed if the update of A22 is delayed until the next iteration. Now, before v21 = A22^T u21 and w21 = A22 u21 can be computed in the next iteration, Housev(a21) has to be computed, which requires a21 to be updated. But what is important is that A22 can be updated by the two rank-1 updates from the previous iteration just before v21 = A22^T u21 and w21 = A22 u21 are computed, which allows them to be "fused" into one operation that reads and writes A22 to and from memory only once. The algorithm in Figure 1 (right) takes advantage of these insights.

3.2 Lazy algorithm

We now show how the reduction to upper Hessenberg form can be restructured so that the update A22 := A22 − (u21 y21^T + z21 u21^T) during each step can be avoided. This algorithm in and by itself is not practical, since it requires too much temporary space and eventually new matrix-vector multiplies that incur memory reads start dominating the operation. But it will become an integral part of the blocked algorithm discussed in Section 3.4.

The rather curious choice of subscripts for u21, y21, and z21 now becomes apparent: by passing matrices U, Y, and Z into the algorithm in Figure 1, and partitioning them just as we do A in that algorithm, we can accumulate the subvectors u21, y21, and z21 into those matrices. Now, let us assume that at the top of the loop ABR has not yet been updated. Then α11, a21, a12^T, and A22 have not yet been updated, which means we cannot perform many of the computations in the current iteration. However, if we let the hatted quantities $\hat\alpha_{11}$, $\hat a_{21}$, $\hat a_{12}^T$, and $\hat A_{22}$ denote the original values of A in those locations, then the desired α11, a21, and a12^T are given by
\begin{align*}
\alpha_{11} &= \hat\alpha_{11} - u_{10}^Ty_{10} - z_{10}^Tu_{10}\\
a_{21} &= \hat a_{21} - U_{20}y_{10} - Z_{20}u_{10}\\
a_{12}^T &= \hat a_{12}^T - u_{10}^TY_{20}^T - z_{10}^TU_{20}^T\\
A_{22} &= \hat A_{22} - U_{20}Y_{20}^T - Z_{20}U_{20}^T.
\end{align*}
Thus, we start the iteration by updating these parts of A in this fashion.

Algorithm: [A, U, Y, Z] := HessRed_lazy_unb(A, U, Y, Z)

Partition X → (XTL XTR; XBL XBR) for X ∈ {A, U, Y, Z}, where XTL is 0 × 0
while n(UTL) < n(U) do
  Repartition (XTL XTR; XBL XBR) → (X00 x01 X02; x10^T χ11 x12^T; X20 x21 X22)
  for (X, x, χ) ∈ {(A, a, α), (U, u, υ), (Y, y, ψ), (Z, z, ζ)}, where χ11 is a scalar

  α11 := α11 − u10^T y10 − z10^T u10
  a21 := a21 − U20 y10 − Z20 u10
  a12^T := a12^T − u10^T Y20^T − z10^T U20^T
  [u21, τ, a21] := Housev(a21)
  y21 := A22^T u21
  z21 := A22 u21
  y21 := y21 − Y20 (U20^T u21) − U20 (Z20^T u21)
  z21 := z21 − U20 (Y20^T u21) − Z20 (U20^T u21)
  β := u21^T z21 / 2
  y21 := (y21 − β u21/τ)/τ
  z21 := (z21 − β u21/τ)/τ
  a12^T := a12^T − a12^T u21 u21^T / τ
  A02 := A02 − A02 u21 u21^T / τ

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 2. Lazy unblocked algorithm for reduction to upper Hessenberg form.

Next, we observe that the updated A22 itself is not actually needed in updated form: we need to be able to compute $A_{22}^Tu_{21}$ and $A_{22}u_{21}$. But this can be done via the alternative computations
\begin{align*}
y_{21} := A_{22}^Tu_{21} &= \hat A_{22}^Tu_{21} - Y_{20}(U_{20}^Tu_{21}) - U_{20}(Z_{20}^Tu_{21})\\
z_{21} := A_{22}u_{21} &= \hat A_{22}u_{21} - U_{20}(Y_{20}^Tu_{21}) - Z_{20}(U_{20}^Tu_{21}),
\end{align*}
which require only matrix-vector multiplications. This inspires the algorithm in Figure 2.
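In NumPy terms, a sketch of these two multiplications with the never-formed updated matrix (our illustration; A22 holds the original, not-yet-updated values):

```python
import numpy as np

def lazy_matvecs(A22, U20, Y20, Z20, u21):
    # Form y21 = (A22 - U20 Y20^T - Z20 U20^T)^T u21 and
    #      z21 = (A22 - U20 Y20^T - Z20 U20^T)   u21
    # without explicitly forming the updated matrix.
    Ut_u = U20.T @ u21   # small k-vector intermediates
    Yt_u = Y20.T @ u21
    Zt_u = Z20.T @ u21
    y21 = A22.T @ u21 - Y20 @ Ut_u - U20 @ Zt_u
    z21 = A22 @ u21 - U20 @ Yt_u - Z20 @ Ut_u
    return y21, z21
```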

3.3 GQvdG unblocked algorithm

The lazy algorithm discussed above requires at each step a matrix-vector and a transposed matrix-vector multiply, which can be fused so that the matrix only needs to be brought into memory once. In this section, we show how the bulk of computation (and associated memory traffic) can be cast in terms of a single matrix multiplication. This has the benefit that it does not require a fused operation.

Algorithm: [A, U, Z, T] := HessRed_GQvdG_unb(b, A, U, Z, T)

Partition X → (XTL XTR; XBL XBR) for X ∈ {A, U, Z, T}, where XTL is 0 × 0
while n(UTL) < b do
  Repartition (XTL XTR; XBL XBR) → (X00 x01 X02; x10^T χ11 x12^T; X20 x21 X22)
  for (X, x, χ) ∈ {(A, a, α), (U, u, υ), (Z, z, ζ), (T, t, τ)}, where χ11 is a scalar

  (a01; α11; a21) := (a01; α11; a21) − (Z00; z10^T; Z20) T00^{-1} u10
  (a01; α11; a21) := (I − (U00; u10^T; U20) T00^{-1} (U00; u10^T; U20)^T)^T (a01; α11; a21)
  [u21, τ11, a21] := Housev(a21)
  (z01; ζ11; z21) := (A02; a12^T; A22) u21
  t01 := U20^T u21

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 3. GQvdG unblocked algorithm for the reduction to upper Hessenberg form.

The underlying idea builds upon how Householder transformations can be accumulated: the first b updates can be accumulated into a lower trapezoidal matrix U and triangular matrix T so that
\[
\left(I - \tau_0^{-1}u_0u_0^T\right)\left(I - \tau_1^{-1}u_1u_1^T\right)\cdots\left(I - \tau_{b-1}^{-1}u_{b-1}u_{b-1}^T\right) = I - UT^{-1}U^T.
\]
After b iterations the basic unblocked algorithm overwrites matrix A with
\begin{align*}
A^{(b)} &= H(u_{b-1})\cdots H(u_0)\,\hat A\,H(u_0)\cdots H(u_{b-1})\\
&= \left(I - \tau_{b-1}^{-1}u_{b-1}u_{b-1}^T\right)\cdots\left(I - \tau_0^{-1}u_0u_0^T\right)\hat A\left(I - \tau_0^{-1}u_0u_0^T\right)\cdots\left(I - \tau_{b-1}^{-1}u_{b-1}u_{b-1}^T\right)\\
&= (I - UT^{-1}U^T)^T \hat A (I - UT^{-1}U^T) = (I - UT^{-1}U^T)^T(\hat A - \underbrace{\hat A U}_{Z}T^{-1}U^T)\\
&= (I - UT^{-1}U^T)^T(\hat A - ZT^{-1}U^T),
\end{align*}
where $\hat A$ denotes the original contents of A.

Let us assume that this process has proceeded for k iterations. Partition
\[
X \rightarrow \begin{pmatrix} X_{TL} & X_{TR} \\ X_{BL} & X_{BR} \end{pmatrix} \quad \text{for } X \in \{A, \hat A, U, Z, T\},
\]
where $X_{TL}$ is $k \times k$. Then
\[
A^{(k)} = \begin{pmatrix} A_{TL}^{(k)} & A_{TR}^{(k)} \\ A_{BL}^{(k)} & A_{BR}^{(k)} \end{pmatrix}
= \left(I - \begin{pmatrix} U_{TL} \\ U_{BL} \end{pmatrix} T_{TL}^{-1} \begin{pmatrix} U_{TL} \\ U_{BL} \end{pmatrix}^T\right)^T
\left( \begin{pmatrix} \hat A_{TL} & \hat A_{TR} \\ \hat A_{BL} & \hat A_{BR} \end{pmatrix}
- \begin{pmatrix} Z_{TL} \\ Z_{BL} \end{pmatrix} T_{TL}^{-1} \begin{pmatrix} U_{TL} \\ U_{BL} \end{pmatrix}^T \right).
\]
Now, assume that after the first k iterations our algorithm leaves our variables in the following states:

—$A = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}$ contains $\begin{pmatrix} A_{TL}^{(k)} & \hat A_{TR} \\ A_{BL}^{(k)} & \hat A_{BR} \end{pmatrix}$. (In other words, the first k columns have been updated and the rest of the columns are untouched.)

—Only $\begin{pmatrix} U_{TL} \\ U_{BL} \end{pmatrix}$, $T_{TL}$, and $\begin{pmatrix} Z_{TL} \\ Z_{BL} \end{pmatrix}$ have been updated.

The question is how to advance the computation. Now, at the top of the loop, we expose
\[
\begin{pmatrix} X_{TL} & X_{TR} \\ X_{BL} & X_{BR} \end{pmatrix} \rightarrow
\begin{pmatrix} X_{00} & x_{01} & X_{02} \\ x_{10}^T & \chi_{11} & x_{12}^T \\ X_{20} & x_{21} & X_{22} \end{pmatrix}
\]
for $(X, x, \chi) \in \{(A, a, \alpha), (\hat A, \hat a, \hat\alpha), (U, u, \upsilon), (Z, z, \zeta), (T, t, \tau)\}$. In order to compute the next Householder transformation, the next column of A must be updated according to prior computation:
\[
\begin{pmatrix} a_{01} \\ \alpha_{11} \\ a_{21} \end{pmatrix} =
\left(I - \begin{pmatrix} U_{00} \\ u_{10}^T \\ U_{20} \end{pmatrix} T_{00}^{-1} \begin{pmatrix} U_{00} \\ u_{10}^T \\ U_{20} \end{pmatrix}^T\right)^T
\left( \begin{pmatrix} \hat a_{01} \\ \hat\alpha_{11} \\ \hat a_{21} \end{pmatrix} -
\underbrace{\begin{pmatrix} Z_{00} \\ z_{10}^T \\ Z_{20} \end{pmatrix} T_{00}^{-1} u_{10}}_{\text{column } k \text{ of } Z_kT_k^{-1}U_k^T} \right),
\]
which means first updating
\[
\begin{pmatrix} a_{01} \\ \alpha_{11} \\ a_{21} \end{pmatrix} :=
\begin{pmatrix} \hat a_{01} - Z_{00}w_{10} \\ \hat\alpha_{11} - z_{10}^Tw_{10} \\ \hat a_{21} - Z_{20}w_{10} \end{pmatrix},
\quad \text{where } w_{10} = T_{00}^{-1}u_{10}.
\]

Next, we need to perform the update
\begin{align*}
\begin{pmatrix} a_{01} \\ \alpha_{11} \\ a_{21} \end{pmatrix} &:=
\left(I - \begin{pmatrix} U_{00} \\ u_{10}^T \\ U_{20} \end{pmatrix} T_{00}^{-1} \begin{pmatrix} U_{00} \\ u_{10}^T \\ U_{20} \end{pmatrix}^T\right)^T
\begin{pmatrix} a_{01} \\ \alpha_{11} \\ a_{21} \end{pmatrix}
= \begin{pmatrix} a_{01} \\ \alpha_{11} \\ a_{21} \end{pmatrix} -
\begin{pmatrix} U_{00} \\ u_{10}^T \\ U_{20} \end{pmatrix} T_{00}^{-T}
\begin{pmatrix} U_{00} \\ u_{10}^T \\ U_{20} \end{pmatrix}^T
\begin{pmatrix} a_{01} \\ \alpha_{11} \\ a_{21} \end{pmatrix}\\
&= \begin{pmatrix} a_{01} - U_{00}y_{10} \\ \alpha_{11} - u_{10}^Ty_{10} \\ a_{21} - U_{20}y_{10} \end{pmatrix},
\quad \text{where } y_{10} = T_{00}^{-T}(U_{00}^Ta_{01} + u_{10}\alpha_{11} + U_{20}^Ta_{21}).
\end{align*}
After these computations we can compute the next Householder transform from a21, updating a21:

—[u21, τ, a21] := Housev(a21).

The next column of Z is computed by
\[
\begin{pmatrix} z_{01} \\ \zeta_{11} \\ z_{21} \end{pmatrix} =
\begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} 0 \\ 0 \\ u_{21} \end{pmatrix}
= \begin{pmatrix} A_{02}u_{21} \\ a_{12}^Tu_{21} \\ A_{22}u_{21} \end{pmatrix}.
\]
We finish by computing the next column of T:
\[
\begin{pmatrix} T_{00} & t_{01} & T_{02} \\ 0 & \tau_{11} & t_{12}^T \\ 0 & 0 & T_{22} \end{pmatrix} :=
\begin{pmatrix} T_{00} & U_{20}^Tu_{21} & T_{02} \\ 0 & \frac{1}{2}u_{21}^Tu_{21} & t_{12}^T \\ 0 & 0 & T_{22} \end{pmatrix}.
\]
We note that the τ computed by Housev(a21) equals $\frac{1}{2}u_{21}^Tu_{21}$.

These steps are summarized in Figure 3.
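For illustration, a NumPy/SciPy sketch of the two-stage column update above (our hypothetical helper, not the library routine; Uk and Zk hold the first k accumulated vectors as full-height columns, T00 is k × k upper triangular, and u10 is row k of U):

```python
import numpy as np
from scipy.linalg import solve_triangular

def gqvdg_update_column(a_hat, Uk, Zk, T00, u10):
    # Stage 1: subtract column k of Z_k T_k^{-1} U_k^T.
    w10 = solve_triangular(T00, u10, lower=False)
    a = a_hat - Zk @ w10
    # Stage 2: apply (I - U_k T_00^{-1} U_k^T)^T from the left.
    y10 = solve_triangular(T00, Uk.T @ a, trans='T', lower=False)
    a = a - Uk @ y10
    return a   # the column from which the next Housev is computed
```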

3.4 Blocked algorithms

We now discuss how much of the computation can be cast in terms of matrix-matrix multiplication.

In Figure 4 we give four possible blocked algorithms, which differ by how computation is accumulated in the body of the loop:

—Two correspond to using the unblocked algorithms in Figure 1.

—A third results from using the lazy algorithm in Figure 2. For this variant, we introduce matrices U, Y, and Z of width b in which vectors computed by the lazy unblocked algorithm are accumulated.

—The fourth results from using the algorithm in Figure 3. It returns matrices U, Z, and T.

Let us consider having progressed through the matrix so that it is in the state
\[
A = \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}, \quad
U = \begin{pmatrix} U_T \\ U_B \end{pmatrix}, \quad
Y = \begin{pmatrix} Y_T \\ Y_B \end{pmatrix}, \quad
Z = \begin{pmatrix} Z_T \\ Z_B \end{pmatrix},
\]

Algorithm: [A] := HessRed_blk(A, T)

Partition A → (ATL ATR; ABL ABR), X → (XT; XB) for X ∈ {T, U, Y, Z},
where ATL is 0 × 0 and TT, UT, YT, and ZT have 0 rows
while m(ATL) < m(A) do
  Determine block size b
  Repartition (ATL ATR; ABL ABR) → (A00 A01 A02; A10 A11 A12; A20 A21 A22),
  (XT; XB) → (X0; X1; X2) for X ∈ {T, U, Y, Z},
  where A11 is b × b and T1, U1, Y1, and Z1 have b rows

  Algorithm 1, 2:
    [ABR, UB] := HessRed_unb(b, ABR)
    T1 = ½D + S where UB^T UB = S^T + D + S
    ATR := ATR (I − UB T1^{-1} UB^T)

  Algorithm 3:
    [ABR, UB, YB, ZB] := HessRed_lazy_unb(b, ABR, UB, YB, ZB)
    T1 = ½D + S where UB^T UB = S^T + D + S
    ATR := ATR (I − UB T1^{-1} UB^T)
    A22 := A22 − U2 Y2^T − Z2 U2^T

  Algorithm 4 (HessRed_GQvdG_blk):
    [ABR, UB, ZB, T1] := HessRed_GQvdG_unb(b, ABR, UB, ZB, T1)
    ATR := ATR (I − UB T1^{-1} UB^T)
    (A12; A22) := (I − (U1; U2) T1^{-1} (U1; U2)^T)^T ((A12; A22) − (Z1; Z2) T1^{-1} U2^T)
               = (A12; A22) − (U1; U2) W2^T,
    where W2^T = T1^{-T} (U1^T A12 + U2^T A22 − (U1^T Z1 + U2^T Z2) T1^{-1} U2^T)

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 4. Blocked reduction to Hessenberg form based on original or rearranged algorithm. The call to HessRed_unb performs the first b iterations of one of the unblocked algorithms in Figures 1 or 2. In the case of the algorithms in Figure 1, UB accumulates and returns the vectors u21 encountered in the computation and YB and ZB are not used.

where ATL is b × b. Let us assume we have progressed to the point where the factorization has completed with ATL and ABL (meaning that ATL is upper Hessenberg and ABL is zero except for its top-right-most element), and ATR and ABR have been updated so that only an upper Hessenberg factorization of ABR has to be completed, updating the ATR submatrix correspondingly. In the next iteration of the blocked algorithm, we perform the following steps:

—Perform the first b iterations of the lazy algorithm with matrix ABR, accumulating the appropriate vectors in UB, YB, and ZB.

—Apply the resulting Householder transformations from the right to ATR. In Section 2.3 we discussed that this requires the computation of $U^TU = S^T + D + S$, where D and S equal the diagonal and strictly upper triangular parts of $U^TU$, after which $A_{TR} := A_{TR}(I - UT^{-1}U^T) = A_{TR} - A_{TR}UT^{-1}U^T$ with $T = \frac{1}{2}D + S$.

—Repartition
\[
\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow
\begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}, \quad
\begin{pmatrix} U_T \\ U_B \end{pmatrix} \rightarrow
\begin{pmatrix} U_0 \\ U_1 \\ U_2 \end{pmatrix}, \; \ldots
\]

—Update $A_{22} := A_{22} - U_2Y_2^T - Z_2U_2^T$.

—Move the thick line (which denotes how far the factorization has proceeded) forward by the block size:
\[
\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow
\begin{pmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}, \quad
\begin{pmatrix} U_T \\ U_B \end{pmatrix} \leftarrow
\begin{pmatrix} U_0 \\ U_1 \\ U_2 \end{pmatrix}, \; \ldots
\]

Proceeding like this block by block computes the reduction to upper Hessenberg form while limiting the size of the matrices U, Y, and Z, casting some of the computation in terms of matrix-matrix multiplications that are known to achieve high performance.

When one of the unblocked algorithms in Figure 1 is used instead, A22 is already updated upon return from HessRed_unb.

3.5 Fusing operations

We now discuss how three sets of operations encountered in the various algorithms can be fused to reduce memory traffic.

In the lazy algorithm, delaying the update of A22 yields the following three operations that can be fused (here we drop the subscripts):
\[
\left.\begin{aligned}
A &:= A - (uy^T + zu^T)\\
v &:= A^Tx\\
w &:= Ax
\end{aligned}\right\}\ \text{Fuse}
\]
Partition
\[
A \rightarrow \begin{pmatrix} a_0 & \cdots & a_{n-1} \end{pmatrix}, \;
u \rightarrow \begin{pmatrix} \upsilon_0 \\ \vdots \\ \upsilon_{n-1} \end{pmatrix}, \;
v \rightarrow \begin{pmatrix} \nu_0 \\ \vdots \\ \nu_{n-1} \end{pmatrix}, \;
x \rightarrow \begin{pmatrix} \chi_0 \\ \vdots \\ \chi_{n-1} \end{pmatrix}, \;
y \rightarrow \begin{pmatrix} \psi_0 \\ \vdots \\ \psi_{n-1} \end{pmatrix}.
\]
Then the following steps, $0 \le i < n$, compute the desired result (provided initially w = 0):
\[
a_i := a_i - \psi_i u - \upsilon_i z; \qquad \nu_i := a_i^Tx; \qquad w := w + \chi_i a_i.
\]
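A column-at-a-time NumPy sketch of this fused loop (our illustration): the two rank-1 updates are applied to column i, and v and w are then computed from the freshly updated column, so A streams through memory once.

```python
import numpy as np

def fused_update_matvecs(A, u, y, z, x):
    # Fused: A := A - (u y^T + z u^T); v := A^T x; w := A x.
    n = A.shape[1]
    v = np.empty(n)
    w = np.zeros(A.shape[0])
    for i in range(n):
        A[:, i] -= y[i] * u + u[i] * z   # column i of A - (u y^T + z u^T)
        v[i] = A[:, i] @ x               # element i of v = A^T x
        w += x[i] * A[:, i]              # accumulate w = A x
    return v, w
```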

Similarly,
\[
\left.\begin{aligned}
v &:= A^Tx\\
w &:= Ax
\end{aligned}\right\}\ \text{Fuse}
\]
can be computed via $\nu_i := a_i^Tx;\ w := w + \chi_i a_i$, for $0 \le i < n$.

Algorithm: [A] := TriRed_unb(b, A)

Partition A → (ATL ATR; ABL ABR), x → (xT; xB) for x ∈ {u, y},
where ATL is 0 × 0 and uT, yT have 0 rows
while m(ATL) < m(A) do
  Repartition (ATL ATR; ABL ABR) → (A00 a01 A02; a10^T α11 a12^T; A20 a21 A22),
  (xT; xB) → (x01; χ11; x21) for (x, χ) ∈ {(u, υ), (y, ψ)},
  where α11, υ11, and ψ11 are scalars

  Basic unblocked:
    [u21, τ, a21] := Housev(a21)
    y21 := A22 u21
    β := u21^T y21 / 2
    y21 := (y21 − β u21/τ)/τ
    A22 := A22 − u21 y21^T − y21 u21^T

  Rearranged unblocked:
    α11 := α11 − 2 υ11 ψ11                (⋆)
    a21 := a21 − (u21 ψ11 + y21 υ11)      (⋆)
    [x21, τ, a21] := Housev(a21)
    A22 := A22 − u21 y21^T − y21 u21^T    (⋆)
    v21 := A22 x21
    u21 := x21; y21 := v21
    β := u21^T y21 / 2
    y21 := (y21 − β u21/τ)/τ

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 5. Unblocked algorithms for reduction to tridiagonal form. Left: basic algorithm. Right: rearranged to allow fusing of operations. Operations marked with (⋆) are not executed during the first iteration.


Finally,
\[
\left.\begin{aligned}
y &:= y - Y(U^Tu) - U(Z^Tu)\\
z &:= z - U(Y^Tu) - Z(U^Tu)
\end{aligned}\right\}\ \text{Fuse}
\]
can be computed by partitioning
\[
U \rightarrow \begin{pmatrix} u_0 & \cdots & u_{k-1} \end{pmatrix}, \quad
Y \rightarrow \begin{pmatrix} y_0 & \cdots & y_{k-1} \end{pmatrix}, \quad
Z \rightarrow \begin{pmatrix} z_0 & \cdots & z_{k-1} \end{pmatrix},
\]
and computing
\[
\alpha := u_i^Tu; \quad \beta := z_i^Tu; \quad \gamma := y_i^Tu; \quad
y := y - \alpha y_i - \beta u_i; \quad z := z - \alpha z_i - \gamma u_i,
\]
for $0 \le i < k$.
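A NumPy sketch of this third fused operation (our illustration), in which each column of U, Y, and Z is read once per pass:

```python
import numpy as np

def fused_lazy_correction(y, z, U, Y, Z, u):
    # Fused: y := y - Y (U^T u) - U (Z^T u)
    #        z := z - U (Y^T u) - Z (U^T u), one column at a time.
    k = U.shape[1]
    for i in range(k):
        alpha = U[:, i] @ u
        beta = Z[:, i] @ u
        gamma = Y[:, i] @ u
        y -= alpha * Y[:, i] + beta * U[:, i]
        z -= alpha * Z[:, i] + gamma * U[:, i]
    return y, z
```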

Algorithm: [A, U, Y] := TriRed_lazy_unb(A, U, Y)

Partition X → (XTL XTR; XBL XBR) for X ∈ {A, U, Y}, where XTL is 0 × 0
while n(UTL) < n(U) do
  Repartition (XTL XTR; XBL XBR) → (X00 x01 X02; x10^T χ11 x12^T; X20 x21 X22)
  for (X, x, χ) ∈ {(A, a, α), (U, u, υ), (Y, y, ψ)}, where χ11 is a scalar

  α11 := α11 − u10^T y10 − y10^T u10
  a21 := a21 − U20 y10 − Y20 u10
  [u21, τ, a21] := Housev(a21)
  y21 := A22 u21
  y21 := y21 − Y20 (U20^T u21) − U20 (Y20^T u21)
  β := u21^T y21 / 2
  y21 := (y21 − β u21/τ)/τ

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 6. Lazy unblocked reduction to tridiagonal form.

4. REDUCTION TO TRIDIAGONAL FORM

The first step towards computing the eigenvalue decomposition (EVD) of a symmetric matrix is to reduce the matrix to tridiagonal form.

Let $A \in \mathbb{R}^{n\times n}$ be symmetric. If $A \rightarrow QBQ^T$ where B is upper Hessenberg and Q is orthogonal, then B is symmetric and therefore tridiagonal. In this section we show how to take advantage of symmetry, assuming that A is stored only in the lower triangular part of the array and that only the lower triangular part is overwritten with B.

When matrix A is symmetric, and only the lower triangular part is stored and updated, the unblocked algorithms for reducing A to upper Hessenberg form can be changed by noting that v21 = w21 and y21 = z21. This motivates the algorithms in Figures 5–7.

In the rearranged algorithm, delaying the update of A22 allows the highlighted operations in Figure 5 (right) to be fused via the algorithm in Figure 8. We leave it as an exercise to the reader to fuse the highlighted operations in Figure 6.

5. REDUCTION TO BIDIAGONAL FORM

The previous sections were inspired by the paper [Howell et al. 2008], which discusses similar techniques for the reduction to bidiagonal form. The purpose of this section is to present the basic and rearranged unblocked algorithms for this operation in our notation, to facilitate comparing and contrasting the reduction to upper Hessenberg and tridiagonal form algorithms with those for the reduction to bidiagonal form.

Algorithm: [A, U, Y] := TriRed_blk(A, U, Y)

Partition A → (ATL ATR; ABL ABR), X → (XT; XB) for X ∈ {U, Y},
where ATL is 0 × 0 and UT, YT have 0 rows
while m(ATL) < m(A) do
  Determine block size b
  Repartition (ATL ATR; ABL ABR) → (A00 A01 A02; A10 A11 A12; A20 A21 A22),
  (XT; XB) → (X0; X1; X2) for X ∈ {U, Y},
  where A11 is b × b and U1 and Y1 have b rows

  [ABR, UB, YB] := TriRed_lazy_unb(b, ABR, UB, YB)
  A22 := A22 − U2 Y2^T − Y2 U2^T

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 7. Blocked reduction to tridiagonal form based on original or rearranged algorithm. TriRed_lazy_unb performs the first b iterations of the lazy unblocked algorithm in Figure 6.

The first step towards computing the Singular Value Decomposition (SVD) of $A \in \mathbb{R}^{m\times n}$ is to reduce the matrix to bidiagonal form: $A \rightarrow Q_UBQ_V^T$, where B is a bidiagonal matrix (nonzero diagonal and superdiagonal) and $Q_U$ and $Q_V$ are again square and orthogonal.

For simplicity, we explain the algorithms for the case where A is square.

5.1 Basic algorithm

The basic algorithm for this operation, overwriting A with the result B, can be explained as follows:

—Partition $A \rightarrow \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}$.

—Let $\left[\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}, \tau_L, \begin{pmatrix} \alpha_{11} \\ 0 \end{pmatrix}\right] := \mathrm{Housev}\left(\begin{pmatrix} \alpha_{11} \\ a_{21} \end{pmatrix}\right)$.²

²Note that the semantics here indicate that $\alpha_{11}$ is overwritten by the first element of $\begin{pmatrix} \alpha_{11} \\ 0 \end{pmatrix}$.

Algorithm: [A] := Fused_Syr2_Symv(A, x, y, u, v)

Partition A → (ATL ATR; ABL ABR), z → (zT; zB) for z ∈ {x, y, u, v},
where ATL is 0 × 0 and xT, yT, uT, vT have 0 elements
v = 0
while m(ATL) < m(A) do
  Repartition (ATL ATR; ABL ABR) → (A00 a01 A02; a10^T α11 a12^T; A20 a21 A22),
  (zT; zB) → (z0; ζ1; z2) for (z, ζ) ∈ {(x, χ), (y, ψ), (u, υ), (v, ν)},
  where α11, χ1, ψ1, υ1, ν1 are scalars

  α11 := α11 + 2 ψ1 υ1                       (toward A := A + (u y^T + y u^T))
  a21 := a21 + ψ1 u2 + υ1 y2   (axpy ×2)     (toward A := A + (u y^T + y u^T))
  ν1 := ν1 + α11 χ1 + a21^T x2   (dot)       (toward v := A x)
  v2 := v2 + χ1 a21   (axpy)                 (toward v := A x)

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 8. Algorithm that fuses a symmetric rank-2 update and a symmetric matrix-vector multiply: A := A + (u y^T + y u^T); v := Ax, where A is symmetric and stored in the lower triangular part of A.

—Update
\[
\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix} :=
\left(I - \tau_L^{-1}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}^T\right)
\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}
= \begin{pmatrix} \psi_{11} & a_{12}^T - y_{21}^T/\tau_L \\ 0 & A_{22} - u_{21}y_{21}^T/\tau_L \end{pmatrix},
\quad \text{where }
\begin{pmatrix} \psi_{11} \\ y_{21} \end{pmatrix} =
\begin{pmatrix} \alpha_{11} + a_{21}^Tu_{21} \\ a_{12} + A_{22}^Tu_{21} \end{pmatrix}.
\]
Note that $\alpha_{11} := \psi_{11}$ need not be executed since this update was performed by the instance of Housev above.

—Let $[v_{21}, \tau_R, a_{12}] := \mathrm{Housev}(a_{12})$.

—Update $A_{22} := A_{22}(I - \tau_R^{-1}v_{21}v_{21}^T) = A_{22} - z_{21}v_{21}^T/\tau_R$, where $z_{21} = A_{22}v_{21}$.

—Continue this process with the updated A22.

The resulting algorithm, slightly rearranged, is given in Figure 9 (left).

5.2 Rearranged algorithm

We now show how, again, the loop can be restructured so that multiple updates of, and multiplications with, A22 can be fused. Focus on the update $A_{22} := A_{22} - (u_{21}y_{21}^T + z_{21}v_{21}^T)$. Partition
\[
A_{22} \rightarrow \begin{pmatrix} \alpha_{11}^+ & a_{12}^{+T} \\ a_{21}^+ & A_{22}^+ \end{pmatrix}, \;
u_{21} \rightarrow \begin{pmatrix} \upsilon_{11}^+ \\ u_{21}^+ \end{pmatrix}, \;
y_{21} \rightarrow \begin{pmatrix} \psi_{11}^+ \\ y_{21}^+ \end{pmatrix}, \;
z_{21} \rightarrow \begin{pmatrix} \zeta_{11}^+ \\ z_{21}^+ \end{pmatrix}, \;
v_{21} \rightarrow \begin{pmatrix} \nu_{11}^+ \\ v_{21}^+ \end{pmatrix},
\]

Algorithm: [A] := BiRed_unb(A)

Partition A → (ATL ATR; ABL ABR), x → (xT; xB) for x ∈ {u, v, y, z},
where ATL is 0 × 0 and uT, vT, yT, zT have 0 elements
while m(ATL) < m(A) do
  Repartition (ATL ATR; ABL ABR) → (A00 a01 A02; a10^T α11 a12^T; A20 a21 A22),
  (xT; xB) → (x01; χ11; x21) for (x, χ) ∈ {(u, υ), (v, ν), (y, ψ), (z, ζ)},
  where α11, υ11, ν11, ψ11, and ζ11 are scalars

  Basic unblocked:
    [(1; u21), τL, (α11; 0)] := Housev((α11; a21))
    y21 := a12 + A22^T u21
    a12^T := a12^T − y21^T/τL
    [v21, τR, a12] := Housev(a12)
    β := y21^T v21
    y21 := y21/τL
    z21 := (A22 v21 − β u21/τL)/τR
    A22 := A22 − u21 y21^T − z21 v21^T

  Rearranged unblocked:
    α11 := α11 − υ11 ψ11 − ζ11 ν11          (⋆)
    a21 := a21 − u21 ψ11 − z21 ν11          (⋆)
    a12^T := a12^T − υ11 y21^T − ζ11 v21^T  (⋆)
    [(1; u21⁺), τL, (α11; 0)] := Housev((α11; a21))
    a12⁺ := a12 − a12/τL
    A22 := A22 − u21 y21^T − z21 v21^T      (⋆)
    y21 := A22^T u21⁺
    a12⁺ := a12⁺ − y21/τL
    w21 := A22 a12⁺
    y21 := y21 + a12
    [ψ11 − α12, τR, α12] := HousevS(a12⁺)
    v21 := (a12⁺ − α12 e0)/(ψ11 − α12)
    a12^T := α12 e0^T
    u21 := u21⁺
    β := y21^T v21
    y21 := y21/τL
    z21 := (w21 − α12 A22 e0)/(ψ11 − α12)
    z21 := z21 − β u21/τL
    z21 := z21/τR

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 9. Unblocked algorithms for reduction to bidiagonal form. Left: basic algorithm. Right: rearranged to allow fusing of operations. Operations marked with (⋆) are not executed during the first iteration.


where + indicates the partitioning in the next iteration. Then
\[
\begin{pmatrix} \alpha_{11}^+ & a_{12}^{+T} \\ a_{21}^+ & A_{22}^+ \end{pmatrix} :=
\begin{pmatrix} \alpha_{11}^+ & a_{12}^{+T} \\ a_{21}^+ & A_{22}^+ \end{pmatrix}
- \begin{pmatrix} \upsilon_{11}^+ \\ u_{21}^+ \end{pmatrix}\begin{pmatrix} \psi_{11}^+ \\ y_{21}^+ \end{pmatrix}^T
- \begin{pmatrix} \zeta_{11}^+ \\ z_{21}^+ \end{pmatrix}\begin{pmatrix} \nu_{11}^+ \\ v_{21}^+ \end{pmatrix}^T
= \begin{pmatrix} \alpha_{11}^+ - \upsilon_{11}^+\psi_{11}^+ - \zeta_{11}^+\nu_{11}^+ & a_{12}^{+T} - \upsilon_{11}^+y_{21}^{+T} - \zeta_{11}^+v_{21}^{+T} \\ a_{21}^+ - u_{21}^+\psi_{11}^+ - z_{21}^+\nu_{11}^+ & A_{22}^+ - u_{21}^+y_{21}^{+T} - z_{21}^+v_{21}^{+T} \end{pmatrix},
\]
which shows how the update of A22 can be delayed until the next iteration. If u21 = y21 = z21 = v21 = 0 during the first iteration, the body of the loop may be changed to

  α11 := α11 − υ11 ψ11 − ζ11 ν11
  a21 := a21 − u21 ψ11 − z21 ν11
  a12^T := a12^T − υ11 y21^T − ζ11 v21^T
  [(1; u21⁺), τL, (α11; 0)] := Housev((α11; a21))
  A22 := A22 − u21 y21^T − z21 v21^T
  y21 := a12 + A22^T u21⁺
  a12^T := a12^T − y21^T/τL
  [v21, τR, a12] := Housev(a12)
  β := y21^T v21
  y21 := y21/τL
  z21 := (A22 v21 − β u21⁺/τL)/τR

Now, the goal becomes to bring the three highlighted updates (the three that involve A22) together. The problem is that the last update, which requires v21, cannot commence until after the second call to Housev completes. This dependency can be circumvented by observing that one can perform a matrix-vector multiply of A22 with the vector $a_{12}^{+T} = a_{12}^T - y_{21}^T/\tau_L$ instead of with v21, after which the result can be updated as if the multiplication had used the output of Housev, as indicated by (6) in Section 2. These observations justify the rearrangement of the computations indicated in Figure 9 (right).
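The legality of the resulting fusion rests on element i of a12⁺ depending only on element i of y21. A NumPy sketch (our illustration) of the fused trio y21 := A22^T u; a12⁺ := a12⁺ − y21/τL; w21 := A22 a12⁺, processed one column of A22 at a time so that A22 streams through memory once:

```python
import numpy as np

def fused_bired_matvecs(A22, u, a12p, tauL):
    n = A22.shape[1]
    y21 = np.empty(n)
    w21 = np.zeros(A22.shape[0])
    for i in range(n):
        y21[i] = A22[:, i] @ u      # element i of A22^T u
        a12p[i] -= y21[i] / tauL    # update element i of a12+
        w21 += a12p[i] * A22[:, i]  # accumulate A22 a12+
    return y21, a12p, w21
```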

5.3 Lazy algorithm

A lazy algorithm can be derived by not updating A22 at all, accumulating the updates in matrices U, V, Y, and Z, much as was done for the other reduction to condensed form operations.

We start from the rearranged algorithm to make sure that
\[
y_{21} := A_{22}^Tu_{21}^+; \qquad a_{12}^+ := a_{12}^+ - y_{21}/\tau_L; \qquad w_{21} := A_{22}a_{12}^+
\]
can still be fused. Next, the key is to realize that what was previously a multiplication by A22 must now be replaced by a multiplication by $A_{22} - U_{20}Y_{20}^T - Z_{20}V_{20}^T$. This yields the algorithm in Figure 10.

5.4 Blocked algorithm

Finally, a blocked algorithm is given in Figure 11.

Algorithm: [A, U, V, Y, Z] := BiRed_lazy_unb(A, U, V, Y, Z)

Partition X → (XTL XTR; XBL XBR) for X ∈ {A, U, V, Y, Z}, where XTL is 0 × 0
while n(UTL) < n(U) do
  Repartition (XTL XTR; XBL XBR) → (X00 x01 X02; x10^T χ11 x12^T; X20 x21 X22)
  for (X, x, χ) ∈ {(A, a, α), (U, u, υ), (Y, y, ψ), (Z, z, ζ)}, where χ11 is a scalar

  α11 := α11 − u10^T y10 − z10^T v10
  a21 := a21 − U20 y10 − Z20 v10
  a12^T := a12^T − u10^T Y20^T − z10^T V20^T
  [(1; u21⁺), τL, (α11; 0)] := Housev((α11; a21))
  a12⁺ := a12 − a12/τL
  y21 := −Y20 U20^T u21⁺ − V20 Z20^T u21⁺
  y21 := y21 + A22^T u21⁺
  a12⁺ := a12⁺ − y21/τL
  w21 := A22 a12⁺
  w21 := w21 − U20 Y20^T a12⁺ − Z20 V20^T a12⁺
  a22l := A22 e0 − U20 Y20^T e0 − Z20 V20^T e0
  y21 := a12 + y21
  [ψ11 − α12, τR, α12] := HousevS(a12⁺)
  v21 := (a12⁺ − α12 e0)/(ψ11 − α12)
  a12^T := α12 e0^T
  u21 := u21⁺
  β := y21^T v21
  y21 := y21/τL
  z21 := (w21 − α12 a22l)/(ψ11 − α12)
  z21 := z21 − β u21/τL
  z21 := z21/τR

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 10. Lazy algorithm for reduction to bidiagonal form. Upon entry matrix A is n × n and matrices U, V, Y, and Z are n × b. Note that the multiplications A22 e0, Y20^T e0, and U20^T e0 do not require computation: they simply extract the first column or row of the given matrix.

6. COMPUTING IN THE COMPLEX DOMAIN

For simplicity and clarity, the algorithms given thus far have assumed computation on real matrices. In this section, we briefly discuss how to formulate a few of these algorithms for complex matrices.

In order to capture more generalized algorithms that work in both the real and complex domains, we must first introduce a complex Householder transform.

Algorithm: [A] := BiRed_blk(A, U, V, Y, Z)

Partition A → (ATL ATR; ABL ABR), X → (XT; XB) for X ∈ {U, V, Y, Z},
where ATL is 0 × 0 and UT, VT, YT, ZT have 0 rows
while m(ATL) < m(A) do
  Determine block size b
  Repartition (ATL ATR; ABL ABR) → (A00 A01 A02; A10 A11 A12; A20 A21 A22),
  (XT; XB) → (X0; X1; X2) for X ∈ {U, V, Y, Z},
  where A11 is b × b and U1, V1, Y1, and Z1 have b rows

  [ABR, UB, VB, YB, ZB] := BiRed_lazy_unb(b, ABR, UB, VB, YB, ZB)
  A22 := A22 − U2 Y2^T − Z2 V2^T

  Continue with the repartitioned quadrants moved forward
endwhile

Fig. 11. Blocked algorithm for reduction to bidiagonal form. For simplicity, it is assumed that A is n × n where n is an integer multiple of b. Matrices U, V, Y, and Z are all n × b.

Definition 2. Let $u \in \mathbb{C}^n$, $\tau \in \mathbb{R}$. Then $H = H(u) = I - \tau^{-1}uu^H$, where $\tau = \frac{1}{2}u^Hu$, is a complex Householder transformation.

The complex Householder transform has properties similar to those of the real instantiation, namely: (1) $HH = I$; (2) $H = H^H$, and so $H^HH = HH^H = I$; and (3) if $H_0, \cdots, H_{k-1}$ are complex Householder transformations and $Q = H_0H_1\cdots H_{k-1}$, then $Q^HQ = QQ^H = I$.

Let $x, v, u \in \mathbb{C}^n$,
\[
x \rightarrow \begin{pmatrix} \chi_1 \\ x_2 \end{pmatrix}, \quad
v \rightarrow \begin{pmatrix} \nu_1 \\ v_2 \end{pmatrix}, \quad
u \rightarrow \begin{pmatrix} \upsilon_1 \\ u_2 \end{pmatrix},
\]
$v = x - \alpha e_0$, and $u = v/\nu_1$. We can re-express the complex Householder transform H as
\[
H = I - \tau^{-1}\begin{pmatrix} 1 \\ u_2 \end{pmatrix}\begin{pmatrix} 1 \\ u_2 \end{pmatrix}^H.
\]
It can be shown that the application of H(u) to a vector x,
\[
H\begin{pmatrix} \chi_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} \alpha \\ 0 \end{pmatrix}, \tag{7}
\]
is satisfied for
\[
\alpha = -\|x\|_2\,\frac{\chi_1}{|\chi_1|}.
\]

Notice that for $x, v, u \in \mathbb{R}^n$ this definition of α is equivalent to the definition given for real Householder transformations in Section 2.2, since $\chi_1/|\chi_1| = \mathrm{sign}(\chi_1)$. By re-defining α this way, we allow τ to remain real, which allows the complex Householder transform to retain the property of being a reflector. Other instances of the Householder transform, such as those found in LAPACK, restrict α to the real domain [LAPACK Users' Guide; Lehoucq 1994]. In these situations, Eq. (7) is only satisfiable if $\tau \in \mathbb{C}$, which results in $H \neq H^H$ and $HH \neq I$. We prefer our Householder transforms to remain reflectors in both the real and complex domains, and so we choose to define α as above.
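A NumPy sketch of a complex Housev that follows this convention and keeps τ real (our illustration; assumes x is a nonzero complex vector):

```python
import numpy as np

def housev_complex(x):
    # alpha = -||x||_2 * chi1/|chi1|, per Eq. (7); tau stays real,
    # so H = I - tau^{-1} u u^H remains a reflector.
    chi1 = x[0]
    normx = np.linalg.norm(x)
    phase = chi1 / abs(chi1) if chi1 != 0 else 1.0  # complex "sign"
    alpha = -normx * phase
    v = x.copy()
    v[0] -= alpha                   # no cancellation with this sign choice
    u = v / v[0]                    # first element equals one
    tau = (u.conj() @ u).real / 2   # tau = u^H u / 2 is real
    return u, tau, alpha
```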

Recall that Figures 1–11 illustrate algorithms for computing on real matrices. We will now review a few of the algorithms, as expressed in terms of the complex Householder transform.

6.1 Reduction to upper Hessenberg form

Since the complex Householder transform H is a reflector, the basic unblocked algorithm for reducing a complex matrix to upper Hessenberg form is, at a high level, identical to the algorithm for real matrices:

—Partition $A \rightarrow \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}$.

—Let $[u_{21}, \tau] = \mathrm{Housev}(a_{21})$.

—Update
\[
\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix} :=
\begin{pmatrix} 1 & 0 \\ 0 & H \end{pmatrix}
\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & H \end{pmatrix}
= \begin{pmatrix} \alpha_{11} & a_{12}^TH \\ Ha_{21} & HA_{22}H \end{pmatrix},
\]
where $H = H(u_{21})$.

—Continue this process with the updated A22.

As before, Ha21 is computed by Housev. The real and complex algorithms begin to differ with the updates of $a_{12}^T$ and $A_{02}$:
\begin{align*}
a_{12}^T &:= a_{12}^TH = a_{12}^T - a_{12}^Tu_{21}u_{21}^H/\tau\\
A_{02} &:= A_{02}H = A_{02} - A_{02}u_{21}u_{21}^H/\tau.
\end{align*}
Specifically, we can see that u21 is conjugate-transposed instead of simply transposed.

Algorithm: [A] := ComplexHessRed_unb(b, A)

Partition and repartition A, u, y, and z exactly as in Figure 1.

  Basic unblocked:
    [u21, τ, a21] := Housev(a21)
    y21 := A22^H u21
    z21 := A22 u21
    β := u21^H z21 / 2
    y21 := (y21 − β u21/τ)/τ
    z21 := (z21 − β u21/τ)/τ
    a12^T := a12^T − a12^T u21 u21^H / τ
    A02 := A02 − A02 u21 u21^H / τ
    A22 := A22 − u21 y21^H − z21 u21^H

  Rearranged unblocked:
    α11 := α11 − υ11 ψ11 − ζ11 υ11          (⋆)
    a12^T := a12^T − υ11 y21^H − ζ11 u21^H  (⋆)
    a21 := a21 − u21 ψ11 − z21 υ11          (⋆)
    [x21, τ, a21] := Housev(a21)
    A22 := A22 − u21 y21^H − z21 u21^H      (⋆)
    v21 := A22^H x21
    w21 := A22 x21
    u21 := x21; y21 := v21; z21 := w21
    β := u21^H z21 / 2
    y21 := (y21 − β u21/τ)/τ
    z21 := (z21 − β u21/τ)/τ
    a12^T := a12^T − a12^T u21 u21^H / τ
    A02 := A02 − A02 u21 u21^H / τ

Fig. 12. Unblocked reduction to upper Hessenberg form using a complex Householder transform. Left: basic algorithm. Right: rearranged algorithm so that operations can be fused. Operations marked with (⋆) are not executed during the first iteration.

The remaining differences can be seen by inspecting the update of A22:
\begin{align*}
A_{22} &:= HA_{22}H = (I - u_{21}u_{21}^H/\tau)\,A_{22}\,(I - u_{21}u_{21}^H/\tau)\\
&= A_{22} - u_{21}(\underbrace{A_{22}^Hu_{21}}_{v_{21}})^H/\tau - (\underbrace{A_{22}u_{21}}_{w_{21}})u_{21}^H/\tau + (u_{21}^H\underbrace{A_{22}u_{21}}_{w_{21}})u_{21}u_{21}^H/\tau^2\\
&= A_{22} - u_{21}v_{21}^H/\tau - w_{21}u_{21}^H/\tau + \underbrace{u_{21}^Hw_{21}}_{2\beta}\,u_{21}u_{21}^H/\tau^2\\
&= A_{22} - u_{21}(v_{21}^H - \beta u_{21}^H/\tau)/\tau - ((w_{21} - \beta u_{21}/\tau)/\tau)u_{21}^H\\
&= A_{22} - u_{21}\underbrace{((v_{21} - \beta u_{21}/\tau)/\tau)^H}_{y_{21}^H} - \underbrace{((w_{21} - \beta u_{21}/\tau)/\tau)}_{z_{21}}\,u_{21}^H\\
&= A_{22} - (u_{21}y_{21}^H + z_{21}u_{21}^H).
\end{align*}
This leads towards the basic and rearranged unblocked algorithms in Figure 12.

Algorithm: [A] := ComplexTriRed_unb(b, A)

Partition and repartition A, u, and y exactly as in Figure 5.

  Basic unblocked:
    [u21, τ, a21] := Housev(a21)
    y21 := A22 u21
    β := u21^H y21 / 2
    y21 := (y21 − β u21/τ)/τ
    A22 := A22 − u21 y21^H − y21 u21^H

  Rearranged unblocked:
    α11 := α11 − υ11 ψ11 − ψ11 υ11        (⋆)
    a21 := a21 − (u21 ψ11 + y21 υ11)      (⋆)
    [x21, τ, a21] := Housev(a21)
    A22 := A22 − u21 y21^H − y21 u21^H    (⋆)
    v21 := A22 x21
    u21 := x21; y21 := v21
    β := u21^H y21 / 2
    y21 := (y21 − β u21/τ)/τ

Fig. 13. Unblocked reduction to tridiagonal form using a complex Householder transformation. Left: basic algorithm. Right: rearranged to allow fusing of operations. Operations marked with (⋆) are not executed during the first iteration.


6.2 Reduction to tridiagonal form

Let $A \in \mathbb{C}^{n\times n}$ be Hermitian. If $A \rightarrow QBQ^H$ where B is upper Hessenberg and Q is unitary, then B is Hermitian and therefore tridiagonal. We may take advantage of the Hermitian structure of A just as we did with symmetry in Section 4. Let us assume that only the lower triangular part of A is stored and read, and that only the lower triangular part is overwritten by B.

When matrix A is Hermitian, and only the lower triangular part is referenced, the unblocked algorithms for reducing A to upper Hessenberg form can be changed by noting that v21 = w21 and y21 = z21. This results in the basic and rearranged unblocked algorithms shown in Figure 13.

6.3 Reduction to bidiagonal form

The basic algorithm for reducing a complex matrix to bidiagonal form can be explained as follows:

Algorithm: [A] := ComplexBiRed_unb(A)

Partition and repartition A, u, v, y, and z exactly as in Figure 9.

  Basic unblocked:
    [(1; u21), τL, (α11; 0)] := Housev((α11; a21))
    y21 := a12 + A22^T u21
    a12^T := a12^T − y21^T/τL
    [v21, τR, a12] := Housev(a12)
    β := y21^T v21
    y21 := y21/τL
    z21 := (A22 v21 − β u21/τL)/τR
    A22 := A22 − u21 y21^T − z21 v21^T

  Rearranged unblocked:
    α11 := α11 − υ11 ψ11 − ζ11 ν11          (⋆)
    a21 := a21 − u21 ψ11 − z21 ν11          (⋆)
    a12^T := a12^T − υ11 y21^T − ζ11 v21^T  (⋆)
    [(1; u21⁺), τL, (α11; 0)] := Housev((α11; a21))
    a12⁺ := a12 − a12/τL
    A22 := A22 − u21 y21^T − z21 v21^T      (⋆)
    y21 := A22^T u21⁺
    a12⁺ := a12⁺ − y21/τL
    w21 := A22 a12⁺
    y21 := y21 + a12
    [ψ11 − α12, τR, α12] := HousevS(a12⁺)
    v21 := (a12⁺ − α12 e0)/(ψ11 − α12)
    a12^T := α12 e0^T
    u21 := u21⁺
    β := y21^T v21
    y21 := y21/τL
    z21 := (w21 − α12 A22 e0)/(ψ11 − α12)
    z21 := z21 − β u21/τL
    z21 := z21/τR

Fig. 14. Unblocked reduction to bidiagonal form using a complex Householder transformation. Left: basic algorithm. Right: rearranged to allow fusing of operations. Operations marked with (⋆) are not executed during the first iteration.


—Partition A → [ α11  a12^T ]
               [ a21  A22   ].

—Let [ (1; u21), τL, (α11^+; 0) ] := Housev( (α11; a21) ), where (x; y) denotes the column vector that stacks x above y.

—Update
    [ α11  a12^T ]     ( I − (1/τL) (1; u21)(1; u21)^H ) [ α11  a12^T ]
    [ a21  A22   ] :=                                    [ a21  A22   ]

                     = [ α11^+   a12^T − y21^T/τL   ]
                       [ 0       A22 − u21 y21^T/τL ],

    where y21^T = a12^T + u21^H A22.

—Let [v21, τR, a12] := Housev(a12).

—Update A22 := A22 (I − (1/τR) v21 v21^T) = A22 − z21 v21^T/τR, where z21 = A22 v21.

—Continue this process with the updated A22.

The resulting unblocked algorithm and a rearranged variant that allows fusing are given in Figure 14.
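Restated in level-2 BLAS terms, the update steps of the basic algorithm look as follows in real arithmetic. This is a sketch of ours: the routine name, the argument conventions, and the helper housev (standing in for the paper's Housev, whose implementation is omitted here) are all assumptions, not the paper's code.

#include <cblas.h>

/* Hypothetical helper in the spirit of the paper's Housev: computes
   a Householder transformation from x, overwrites x with the
   transformed vector, and returns the Householder vector v and the
   scalar tau. Implementation omitted. */
void housev(int n, double *x, double *v, double *tau);

/* Sketch of one iteration of the basic bidiagonal update (Figure 14,
   left), real case. A22 is m1 x n1, column-major with leading
   dimension lda; u21 (length m1) and tauL are assumed to have been
   computed from (alpha11; a21); y21 and z21 are workspace. */
void bidiag_step_basic(int m1, int n1, double *A22, int lda,
                       const double *u21, double tauL,
                       double *a12, double *v21,
                       double *y21, double *z21)
{
    double tauR, beta;

    /* y21 := a12 + A22^T u21 */
    cblas_dcopy(n1, a12, 1, y21, 1);
    cblas_dgemv(CblasColMajor, CblasTrans, m1, n1, 1.0, A22, lda,
                u21, 1, 1.0, y21, 1);

    /* a12 := a12 - y21 / tauL */
    cblas_daxpy(n1, -1.0 / tauL, y21, 1, a12, 1);

    /* [v21, tauR, a12] := Housev(a12) */
    housev(n1, a12, v21, &tauR);

    /* beta := y21^T v21;  y21 := y21 / tauL */
    beta = cblas_ddot(n1, y21, 1, v21, 1);
    cblas_dscal(n1, 1.0 / tauL, y21, 1);

    /* z21 := (A22 v21 - beta u21 / tauL) / tauR */
    cblas_dcopy(m1, u21, 1, z21, 1);
    cblas_dscal(m1, -beta / tauL, z21, 1);
    cblas_dgemv(CblasColMajor, CblasNoTrans, m1, n1, 1.0, A22, lda,
                v21, 1, 1.0, z21, 1);
    cblas_dscal(m1, 1.0 / tauR, z21, 1);

    /* A22 := A22 - u21 y21^T - z21 v21^T (two rank-1 updates) */
    cblas_dger(CblasColMajor, m1, n1, -1.0, u21, 1, y21, 1, A22, lda);
    cblas_dger(CblasColMajor, m1, n1, -1.0, z21, 1, v21, 1, A22, lda);
}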

6.4 Blocked algorithms

Blocked algorithms may be constructed for reduction to upper Hessenberg form by making the following minor changes to the algorithms shown in Figure 4:

—For Algorithms 1–4, update ATR by applying the complex block Householder transform (I − UB T^{-1} UB^H) instead of (I − UB T^{-1} UB^T).

—For Algorithm 3, update A22 as A22 = A22 − U2 Y2^H − Z2 U2^H.

—Compute T as T = (1/2) D + S, where UB^H UB = S^H + D + S.

Blocked algorithms for reduction to tridiagonal form and bidiagonal form can be constructed in a similar fashion.
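The last of the three changes above amounts to one symmetric (Hermitian) rank-k update followed by a diagonal scaling. A minimal sketch for the real case, with form_T as our own name, is:

#include <cblas.h>

/* Sketch (real case) of forming the b x b triangular factor
   T = (1/2) D + S, where UB^T UB = S^T + D + S with D diagonal and
   S strictly upper triangular. UB is n x b, column-major; the upper
   triangle of T is produced. In the complex case dsyrk would be
   replaced by its Hermitian counterpart. */
void form_T(int n, int b, const double *UB, int ldu,
            double *T, int ldt)
{
    /* Upper triangle of T := UB^T UB, i.e., D + S */
    cblas_dsyrk(CblasColMajor, CblasUpper, CblasTrans, b, n,
                1.0, UB, ldu, 0.0, T, ldt);

    /* Halve the diagonal so that T = (1/2) D + S */
    for (int j = 0; j < b; j++)
        T[j + j * ldt] *= 0.5;
}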

7. IMPACT ON PERFORMANCE

In this section we report performance for implementations of various unblocked algorithms discussed in this paper and then compare the results to our expectations.

7.1 Platform details

All experiments were performed on a single core of a Dell PowerEdge R900 server consisting of four Intel “Dunnington” six-core processors. Each core provides a peak performance of 10.64 GFLOPS, for a combined machine peak of roughly 255 × 10^9 floating-point operations per second, and has access to 96 GBytes of shared main memory. Performance experiments were run under the GNU/Linux 2.6.18 operating system. Source code was compiled by the Intel C/C++ Compiler, version 11.1. All experiments were performed in double precision floating-point arithmetic on real-valued matrices.

7.2 Fused operation implementations

Fused operations were coded at a high level, in terms of level-1 BLAS. For example, the fused operation illustrated in Figure 8, which fuses the symmetric rank-2 update A22 := A22 − (u21 y21^T + y21 u21^T) and the symmetric matrix-vector multiply v21 := A22 x21, was implemented with a loop that invokes each operation in sequence, as shown in Figure 15.

    settozero(v)
    for i = 0 : n − 1
        aii := aii + 2 yi ui              (α11 := α11 + 2 ψ1 υ1)
        axpy(yi, ui+1, ai+1,i)
        axpy(ui, yi+1, ai+1,i)            (a21 := a21 + ψ1 u2 + υ1 y2)
        dot(ai+1,i, xi+1, vi)
        vi := vi + aii xi                 (ν1 := ν1 + α11 χ1 + a21^T x2)
        axpy(xi, ai+1,i, vi+1)            (v2 := v2 + χ1 a21)
    endfor

Fig. 15. Algorithm from Figure 8 with explicit indexing and calls to level-1 BLAS. Each line is annotated, in parentheses, with the operation from Figure 8 that it helps implement.

Recall that the purpose of fusing operations was to minimize memory accesses. However, notice that some of the data in Figure 15 is read redundantly; for example, vector a21 is accessed by multiple BLAS calls. It is easy to see that this kind of implementation is suboptimal for the same reasons that the basic unblocked variants in Figures 1, 5, and 9 are suboptimal. So why was this style of implementation used instead of one that minimizes memory accesses? In order to move data in and out of memory in a controlled and efficient manner, one would need to code at a lower level, and thus would need expertise in both assembly programming and the instruction set present in the hardware. The authors simply do not have this expertise. We do, however, have access to optimized level-1 BLAS, which do utilize the hardware efficiently, so we accept this as the next best thing to custom software kernels.

Regardless of the performance observed using these fused operations, which were coded in terms of the BLAS, we suspect that even higher performance is attainable provided that the fused operations are carefully coded by an expert.

7.3 Analysis

To be filled in.

8. CONCLUSION

To be filled in.

Acknowledgments. This research was partially sponsored by NSF grant CCF–0702714 and a grant from Microsoft.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).


[Two plots: GFLOPS versus problem size, 0–1000.]

Fig. 16. Top: Performance of various implementations for reduction to upper Hessenberg form (unblocked variants 1–4, unblocked variants 1–3 with fusing, blocked variant 4, and netlib dgehrd). Bottom: Performance of various implementations for reduction to tridiagonal form (unblocked variants 1–3, unblocked variants 2 and 3 with fusing, blocked variant 3, and netlib dsytrd).


[Plot: GFLOPS versus problem size, 0–1000.]

Fig. 17. Performance of various implementations for reduction to bidiagonal form (unblocked variants 1–3, unblocked variants 1–3 with fusing, blocked variant 3, and netlib dgebrd).

Received Month Year; revised Month Year; accepted Month Year